ABSTRACT
BAYESIAN VARIABLE SELECTION IN LINEAR AND NON-LINEAR MODELS
Arnab Kumar Maity, Ph.D.
Division of Statistics
Northern Illinois University, 2016
Sanjib Basu, Director
Appropriate feature selection is a fundamental problem in the field of statistics. Models with a large number of features or variables require special attention due to the computational complexity of the huge model space. This is generally known as the variable or model selection problem in statistics, whereas in machine learning and other literatures it is also known as feature selection, attribute selection, or variable subset selection. Variable selection is the process of efficiently selecting an optimal subset of relevant variables for use in model construction. The central assumption in this methodology is that the data contain many redundant variables: those which do not provide significant additional information beyond the optimally selected subset of variables. Variable selection is widely used in all application areas of data analytics, ranging from optimal selection of genes in large-scale microarray studies, to optimal selection of biomarkers for targeted therapy in cancer genomics, to selection of optimal predictors in business analytics. Under the Bayesian approach, the formal way to perform this optimal selection is to select the model with the highest posterior probability. The problem may thus be viewed as an optimization problem over the model space, where the objective function is the posterior probability of a model and the maximization is taken over the models. We propose an efficient method for implementing this optimization and illustrate its feasibility in high-dimensional problems. By means of various simulation studies, this new approach is shown to be efficient and to outperform other statistical feature selection methods, namely the median probability model and a sampling method with frequency-based estimators. Theoretical justifications are provided. Applications to logistic regression and survival regression are discussed.
NORTHERN ILLINOIS UNIVERSITY
DE KALB, ILLINOIS
AUGUST 2016
BAYESIAN VARIABLE SELECTION IN LINEAR AND NON-LINEAR
MODELS
BY
ARNAB KUMAR MAITY
© 2016 Arnab Kumar Maity
A DISSERTATION SUBMITTED TO THE GRADUATE SCHOOL
IN PARTIAL FULFILLMENT OF THE REQUIREMENTS
FOR THE DEGREE
DOCTOR OF PHILOSOPHY
DIVISION OF STATISTICS
Dissertation Director: Sanjib Basu
ACKNOWLEDGEMENTS
I express my sincere and profound gratitude to my advisor, Dr. Sanjib Basu, for his constant
support, mentoring, advice, and personal care during my entire NIU life.
I am greatly indebted to the guidance from Dr. Rama Lingam, Dr. Balakrishna Hosmane,
and Dr. Alan Polansky. I acknowledge the help and enthusiasm from Dr. Shuva Gupta, Dr.
Zhuan Ye, Dr. Nader Ebrahimi, Dr. Duchwan Ryu, and Dr. Michelle Xia.
At this moment of achievement, I am deeply grateful to Dr. Isha Dewan and other
faculty members in ISI; I am immensely grateful to Sri Palas Pal, Sri Subhadeep Banerjee,
Sri Tulsidas Mukhopadhyay, Dr. Dilip Kumar Sahoo, and Sri Parthsarathi Chakrabarti, and
other staff at Narendrapur.
Now that the dissertation is in its present form, I would like to convey my gratitude to my uncle
Sri Subrata Maiti, my uncle Dr. Tapabrata Maiti, and Dr. Tathagata Bandyopadhyay for
their unselfish support.
In addition, I continue to acknowledge the guidance of Mondal-Sir, Sri Nandadulal Jana,
Sri Dhaniram Tudu, Badal-da, Sri Rajendra Nath Giri, Santi-Babu, Sri Swapan Bhaumik, Sri
Bhabesh Barman, Samya-Babu, Bhaskar-Babu, Abhijit-Babu, Hari-babu, Surjendyu-Babu,
Sukumar-Babu, Pintu-da, Kaushik-da, Panda-Miss, Jana-Miss, Gayen-Miss, Gobindo-Sir
and other school teachers.
I would like to thank John Winans and the Department of Computer Science for letting me
use their supercomputer to execute my time-consuming computations.
Most importantly, I wish to express my heartfelt thanks to beloved Dr. Bilal Khan, Dr.
Amartya Chakrabarti, Smt. Sreya Chakraborty, Dr. Santu Ghosh, Dr. Ujjwal das, Dr.
Arpita Chatterjee, Mr. Alan Hurt Jr., Sri Rajendranath Maiti, Sri Joydeep Das, Sri Suman
Sarkar, Sri Biswajit Pal, Sri Soumen Achar, Sri Suman Bhunia, Sourav-da, Sri Anirban
Roy Chowdhuri, Sri Nirmal Kandar, Sri Subhasis Samanta, Chandan, Dr. Raju Maiti, Sri
Dines Pal, Sri Shibshankar Banerjee, Sri Kshaunis Misra, Sri Anupam Mondal, Sri Kaushik
Jana, Sri Bappa Mondal, Sri Avishek Guha, Sri Maitreya Samanta, Dr. Himel Mallick, Dr.
Abhra Sarkar, Sri Sayantan Jana, Dr. Bipul Saurabh, Sri Sayan Dasgupta, Smt. Susmita
Bose, Smt. Tuhina Biswas, Smt. Upama Roy, Sri Sesha Sai Ram, Sri Koustuv Lahiri,
Sri Suresh Venkat, Sri Kushdesh Prasad, Sri Bappaditya Mondal, Sri Pradeep Sadhu, Sri
Pratyush Chandra Sinha, Smt. Pragya Patwari, Sri Vijay Anand, Sri Sougata Dhar, Sri
Saptarshi Chatterjee, Smt. Erina Paul, Smt. Gunisha Arora, Dr. Tsehaye Eyassu, Md.
Rafi Hossain, Dr. Priyanka Grover, Sri Narendra Chopra, Sri Paramahansa Pramanik, Mr.
Jacob Holzman and many other friends and philosophers for their various inspirational roles
in different stages of my life.
I must acknowledge the significant amount of resources I drew from the Northern Illinois
University (NIU) libraries, https://www.google.co.in/, https://scholar.google.com/, and
https://www.wikipedia.org/.
Last but not least, I would like to thank my other half, Smt. Puja Saha, for always
keeping faith in me.
DEDICATION
To my father Sri Arjun Kumar Maiti, my mother Smt. Rina Maiti, my sister Smt.
Sudeshna Maity, and to my soul-partner Smt. Puja Saha.
TABLE OF CONTENTS
Page
LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix
LIST OF FIGURES. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi
Chapter
1 INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
2 VARIABLE SELECTION PROBLEM: PAST AND PRESENT . . . . . . . . . . . . . 8
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.2 Classical Measures of Goodness of Fit . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.3 A Selective Review of Penalized Regression Methods . . . . . . . . . . . . . . . . . 12
2.3.1 Ridge Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.3.2 Bridge Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.3.3 Non-negative Garrote . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.3.4 Least Absolute Shrinkage and Selection Operator (LASSO) . . . . . . . 14
2.3.5 Smoothly Clipped Absolute Deviation Penalty . . . . . . . . . . . . . . . . 14
2.3.6 Fused Lasso . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.3.7 Elastic Net. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.3.8 Adaptive Lasso. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.3.9 Sure Independent Screening . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.3.10 Minimax Concave Penalty . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.3.11 Reciprocal Lasso. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.4 A Selective Review of Bayesian Model Selection Criteria. . . . . . . . . . . . . . . 19
2.4.1 Highest Posterior Model (HPM). . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.4.2 Marginal Likelihood . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.4.3 Log Pseudo Marginal Likelihood . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.4.4 L-Measure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.4.5 Deviance Information Criterion . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.4.6 Median Probability Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.5 Development of Prior Distributions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.5.1 Bayesian Connection to Penalized Regressions. . . . . . . . . . . . . . . . . 26
2.5.2 Shrinkage Prior . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.6 Sampling over Model Space . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3 COMPARISON AMONG LPML, DIC, AND HPM . . . . . . . . . . . . . . . . . . . . . . 30
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.2 Inconsistency of LPML . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.3 Simulation Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.3.1 Linear Model in Presence of Multicollinearity . . . . . . . . . . . . . . . . . 43
3.3.2 A Non-conjugate Setting: Logistic Regression . . . . . . . . . . . . . . . . . 44
3.3.3 Nodal Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.3.4 Melanoma data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
4 MEDIAN PROBABILITY MODEL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
4.2 Comparison of MPM and HPM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
5 SIGNIFICANCE OF HIGHEST POSTERIOR MODEL . . . . . . . . . . . . . . . . . . 52
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
5.2 Consistency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
6 VARIABLE SELECTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
6.2 Requirement of Sampling or Stochastic Search . . . . . . . . . . . . . . . . . . . . . . 54
6.3 Limitation of Bayesian Lasso and its Extensions . . . . . . . . . . . . . . . . . . . . 55
6.4 Maximization in the Model Space . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
6.5 Simulated Annealing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
6.6 Our Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
6.7 Convergence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
7 EMPIRICAL STUDY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
8 LOGISTIC REGRESSION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
8.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
8.2 Power Posterior. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
8.3 Simulation Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
9 SURVIVAL MODELS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
9.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
9.2 Weibull Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
9.3 Mixture of Weibull Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
9.4 Censoring . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
10 DISCUSSION. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
10.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
10.2 Comparison of Methods for Linear Models. . . . . . . . . . . . . . . . . . . . . . . . . 94
10.3 Non-Linear Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
10.4 Survival Settings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
10.5 Future Extensions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
10.5.1 Large p . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
10.5.2 Prior . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
10.5.3 Different Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
10.5.4 Posterior Sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
10.5.5 Computation Time . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
11 CONCLUSION. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
REFERENCES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
LIST OF TABLES
Table Page
3.1 Probabilities of selecting data generating models for Gunst and Mason data. 33
3.2 Probabilities of selecting the data-generating Model . . . . . . . . . . . . . . . . . . 43
3.3 Probabilities of selecting the data-generating models for logistic regression simulation example . . . . . . . . . . . . 48
3.4 Probabilities of selecting the data-generating models for nodal data. . . . . . . 48
3.5 Probabilities of selecting the data-generating models for melanoma data . . . 49
4.1 Comparison between MPM and HPM for linear models in presence of collinearity: Result presents the number of times the data generating model is recovered using 1000 simulations . . . . . . . . . . . . 51
7.1 Number of times the correct model is obtained based on 100 repetitions: Model with Uncorrelated Predictors . . . . . . . . . . . . 67
7.2 Number of times the correct model is obtained based on 100 repetitions: Linear regression in presence of collinearity . . . . . . . . . . . . 68
7.3 Number of times the correct model is obtained based on 100 repetitions: Lasso example . . . . . . . . . . . . 69
7.4 Number of times the correct model is obtained based on 100 repetitions: Adaptive lasso example (lasso fails) . . . . . . . . . . . . 70
7.5 Number of times the correct model is obtained based on 100 repetitions: p = 30 . . . . . . . . . . . . 71
7.6 Number of times the correct model is obtained based on 100 repetitions: Elastic net example. p = 40 . . . . . . . . . . . . 72
7.7 Number of times the correct model is obtained based on 100 repetitions: p > n, p = 200, n = 100 . . . . . . . . . . . . 73
7.8 Number of times the correct model is obtained based on 100 repetitions: p > n, p = 200, n = 100 . . . . . . . . . . . . 74
7.9 Number of times the correct model is obtained based on 100 repetitions: p > n, p = 1000, n = 200 . . . . . . . . . . . . 75
7.10 Number of times the correct model is obtained based on 100 repetitions: p > n, p = 1000, n = 200 and Presence of Correlation . . . . . . . . . . . . 76
7.11 Description of the Ozone dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
7.12 The HPM and MPM for Ozone-35 data. The last 2 columns provide the Bayes factor against the null model and its log, respectively . . . . . . . . . . . . 76
8.1 Number of times the correct model is obtained based on 100 repetitions: Logistic regression in presence of collinearity . . . . . . . . . . . . 82
8.2 Number of times the correct model is obtained based on 100 repetitions: Logistic regression for large p. p = 20 . . . . . . . . . . . . 83
9.1 Number of times the correct model is obtained based on 100 repetitions: Weibull regression . . . . . . . . . . . . 85
9.2 Number of times the correct model is obtained based on 100 repetitions: Data generating model – Weibull regression. Analysis model – Mixture of Weibull regression . . . . . . . . . . . . 88
9.3 Number of times the correct model is obtained based on 100 repetitions: Data generating model – Gamma regression. Analysis model – Mixture of Weibull regression . . . . . . . . . . . . 88
9.4 Number of times the correct model is obtained based on 100 repetitions: Data generating model – Log Normal regression. Analysis model – Mixture of Weibull regression . . . . . . . . . . . . 89
9.5 Number of times the correct model is obtained based on 100 repetitions: Presence of Censoring . . . . . . . . . . . . 91
9.6 Models obtained using different methods for Prot data . . . . . . . . . . . . . . . . 92
LIST OF FIGURES
Figure Page
7.1 Solution paths for different starting models for the SA-HPM method. The points are the log Bayes factors of models against the null model. The log(Bayes factor) of the data generating model (1, 2, 5) is 20.909 . . . . . . . . . . . . 77
9.1 (a) The path for dimensions of models obtained in each step of SA-HPM. (b) The path for log(marginal likelihood) of models obtained in each step of SA-HPM . . . . . . . . . . . . 93
CHAPTER 1
INTRODUCTION
A regression model typically specifies a relationship between an n × 1 response variable
y = (y1, . . . , yn)′ and p regressors (predictors, or covariates) x1, . . . , xp using X =
[x1, . . . , xp], an n × p design matrix, and β = (β1, . . . , βp)′, a p × 1 vector of coefficients
(parameters). The simplest and most commonly used regression model is the linear regression model,
given by

E(y) = Xβ
Assuming that there exists a true model or data-generating model, we fit, for analysis,
a collection of several nested models which are sub-models or super-models of the true
model, in order to identify the most useful model. Thus, in the analysis models, some of the βj
are zero or nonzero according as the corresponding xj, j = 1, . . . , p, are absent or present in the
model. It is therefore convenient to write the analysis models as

E(y) = Xγβγ (1.1)

where γ = (γ1, . . . , γp)′ is a p × 1 indicator vector with γj = 1 if xj is present and γj = 0 if xj is absent.
Therefore the model space M consists of 2^p models, where each model Mγ is indexed by the
binary vector γ according to (1.1). The first column of X is 1 (the vector of ones), unless
otherwise stated, and the corresponding intercept term is assumed to be always present in
the model.
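The sub-model construction in (1.1) is easy to sketch numerically; the data, coefficients, and indicator vector below are hypothetical, with numpy assumed:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 10, 5
# First column of X is the intercept column of ones, as assumed in the text.
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta = np.array([1.0, 2.0, 0.0, 0.0, -1.5])     # two coefficients are exactly zero

gamma = np.array([1, 1, 0, 0, 1], dtype=bool)   # indicator vector gamma
X_gamma = X[:, gamma]                           # design matrix of model M_gamma
beta_gamma = beta[gamma]                        # coefficients of model M_gamma
mean_y = X_gamma @ beta_gamma                   # E(y) under (1.1)

print(X_gamma.shape)   # (10, 3)
```

Because the excluded coefficients are exactly zero here, the sub-model mean equals the full-model mean Xβ.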
The objectives of variable selection can be broadly described as:
1. Providing efficient and effective predictors.
2. Providing a better understanding of the underlying process.
3. Improving the prediction performance of the predictors.
Variable selection attempts to select an “optimal subset of predictors,” as the goal is
to explain the data in the simplest possible way by removing redundant predictors.
The principle of Occam’s Razor states that among several plausible explanations for a phe-
nomenon, the simplest is best. Furthermore, redundant predictors may add noise to the
statistical inferential process.
As an example of the statistical variable selection problem, an essential biomedical question
in assessing the effect of biomarkers in diseases such as cancer is which of the factors
under investigation have a significant association with disease outcomes such as
cancer recurrence and mortality. This is often handled informally via some sort of stepwise
approach (such as by sequentially considering the p-values for the relevant predictors), but
from a formal viewpoint, this is a variable selection problem. Variable or feature selection is
a crucial part of any statistical analysis.
The vast collection of criteria for model selection includes multiple R2, the Akaike Information
Criterion (AIC) (Akaike, 1974), the Bayesian Information Criterion (BIC), Mallow's Cp,
the Widely Applicable Akaike Information Criterion (WAIC) (Watanabe, 2010), the Widely Applicable
Bayesian Information Criterion (WBIC) (Watanabe, 2013), and many others. Different methods of selecting a model include forward selection, backward selection, stepwise
selection, stagewise selection, ridge regression, the non-negative garrote (Breiman, 1995), bridge
regression (Frank & Friedman, 1993), and other methods. However, in large-scale variable
selection problems, the performance of these methods can be far from satisfactory; see, for
example, Breiman (1995).
Bayesian variable selection now has an extensive literature; see Kass & Raftery (1995),
O’Hara & Sillanpaa (2009) for an overall review and Ibrahim et al. (2004) (chapter 6) for
model comparison in time-to-event data. Criteria for Bayesian model selection include Bayes
factors, measures of model complexity and information, and goodness of fit measures.
The Bayes factor (Kass & Raftery, 1995; Basu & Chib, 2003) for model M1 against
model M2 is the ratio of their marginal likelihoods:

BF12 = Pr(y|M1) / Pr(y|M2) (1.2)
In the log scale,

log BF12 = log Pr(y|M1) − log Pr(y|M2),

where Pr(y|M) = ∫ Pr(y|M, θ) π(θ|M) dθ, y denotes the observed data, θ denotes
all unobservables, Pr(y|M, θ) is the likelihood for model M, and π(θ|M) is the prior density
of θ.
Kass & Raftery (1995) stated that the Bayes factor is a summary of evidence for model
M1 against model M2 and provided a table of cutoffs for interpreting log BF12. In general,
the model with the higher log-marginal likelihood is preferred under this model selection criterion.
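As a toy illustration of (1.2), consider a single observation under two normal models whose marginal likelihoods are available in closed form (the observation value below is hypothetical):

```python
from math import log, pi

def log_normal_pdf(y, mean, var):
    # log density of N(mean, var) evaluated at y
    return -0.5 * (log(2 * pi * var) + (y - mean) ** 2 / var)

y = 1.3  # a single (hypothetical) observed data point

# M1: y | theta ~ N(theta, 1) with prior theta ~ N(0, 1).
# Integrating theta out gives the marginal y ~ N(0, 2).
log_m1 = log_normal_pdf(y, 0.0, 2.0)

# M2: y ~ N(0, 1) with no free parameter; its marginal likelihood is the likelihood itself.
log_m2 = log_normal_pdf(y, 0.0, 1.0)

log_bf12 = log_m1 - log_m2   # log Bayes factor of M1 against M2, as in (1.2)
print(log_bf12)
```

A positive value favors M1; here the evidence is weak because a single observation carries little information.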
In the modern era, Bayesian inference is typically carried out by Markov chain sampling. The
computation of the Bayes factor from Markov chain sampling, however, is typically difficult
since Markov chain methods avoid the computation of the normalizing constant of the
posterior, and it is precisely this constant (or these constants) that is needed for the marginal likelihood.
Several methods for estimating the marginal likelihood (or integrated likelihood, or normalizing
constant) have been suggested; see Chen et al. (2001). Kass & Raftery (1995) used the
Laplace approximation. Chib (1995) proposed a method of estimating the marginal likelihood
from Gibbs sampling (Casella & George, 1992) output, while Chib & Jeliazkov (2001) calculated
the same from Metropolis–Hastings sampling (Hastings, 1970) output. Basu et al.
(2003) compared different parametric models for masked competing risks using Bayes factors,
whereas Basu & Ebrahimi (2003) used Bayes factors to compare their martingale process
model with other competing models. Basu & Chib (2003) presented a general method for
comparing semiparametric Bayesian models, constructed under the Dirichlet Process Mixture
(DPM) framework, with alternative semiparametric or parametric Bayesian models. A distinctive
feature of their method is that it can be applied to semiparametric models containing
covariates and hierarchical prior structures. The method proposes two separate
computation schemes for estimating the likelihood and posterior ordinates of the DPM
model at a single high-density point, which are then combined via the basic marginal identity
to obtain an estimate of the marginal likelihood. For computing the marginal likelihood
in the mixture cure rate competing risks model, Basu & Tiwari (2010) used a combination
of the volume-corrected harmonic mean estimator proposed by DiCiccio et al. (1997), which
restricts the Monte Carlo averaging to a high posterior density region, with the estimate
proposed by Chib (1995). This combination was used because the conditional posterior of
the cure fraction c was available in closed form (due to conditional conjugacy) whereas the other
conditionals were not. The estimate of the marginal likelihood was
finally obtained by the basic marginal identity (Chib, 1995; Basu & Chib, 2003). Neal
(2001) used importance sampling to calculate the marginal likelihood, while Gelman & Meng
(1998) used path sampling to calculate the integrated likelihood. Friel & Pettitt (2008)
extended the theory of path sampling to develop the power posterior method for calculating the
normalizing constant. Raftery et al. (2007) used a harmonic-mean-type estimator to estimate the
integrated likelihood.
The marginal likelihood can serve as a measure of model fit as well. The log pseudo marginal
likelihood (LPML) (Ibrahim et al., 2004; Yin & Ibrahim, 2005; Ghosh et al., 2009)
is a "leave-one-out" cross-validated criterion based on the conditional predictive ordinate
(CPO) (Gelfand et al., 1992). CPOs can be estimated from Markov chain samples. The
posterior predictive L-measure (Ibrahim et al., 2004; Laud & Ibrahim, 1995) is another
model selection criterion, combining a goodness-of-fit term with a penalty term (analogous to
variance and bias^2).
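A standard Monte Carlo estimator for the CPO uses the identity CPO_i^{-1} = (1/S) Σ_s 1/Pr(y_i | θ^(s)) over S posterior draws, with LPML = Σ_i log CPO_i. The sketch below applies this to a toy normal-mean model; the "posterior draws" are simulated stand-ins for actual MCMC output, not the dissertation's implementation:

```python
import numpy as np

def lpml(y, theta_draws, sigma=1.0):
    """LPML from harmonic-mean CPO estimates for the model y_i ~ N(theta, sigma^2)."""
    y = np.asarray(y, dtype=float)[:, None]                 # shape (n, 1)
    theta = np.asarray(theta_draws, dtype=float)[None, :]   # shape (1, S)
    # per-observation likelihood evaluated at each draw
    lik = np.exp(-0.5 * ((y - theta) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    cpo = 1.0 / np.mean(1.0 / lik, axis=1)                  # harmonic mean over the S draws
    return float(np.sum(np.log(cpo)))

rng = np.random.default_rng(1)
y_obs = rng.normal(0.0, 1.0, size=20)
theta_draws = rng.normal(y_obs.mean(), 0.2, size=500)  # stand-in for MCMC output
print(lpml(y_obs, theta_draws))
```

Larger LPML values indicate better predictive fit; the harmonic-mean form can be numerically unstable in the tails, which is why practical implementations often work on the log scale.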
The Deviance Information Criterion (DIC), proposed by Spiegelhalter et al. (2002), is
another Bayesian measure of model fit penalized for increased model complexity. The
DIC, in its original form, may not be appropriate in models with missing or latent variables
(mixture models or models involving random effects), and Celeux et al. (2006) developed
various modifications of the DIC. Ghosh et al. (2009) used a modified DIC and the LPML to
compare joinpoint models for modeling cancer rates over the years.
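In its original form, DIC = D(θ̄) + 2pD, where D(θ) = −2 log Pr(y|θ) is the deviance and pD = D̄ − D(θ̄) is the effective number of parameters. A minimal sketch on a toy normal-mean model (simulated draws stand in for MCMC output; not the modified versions discussed above):

```python
import numpy as np

def dic(log_lik, theta_draws):
    """DIC = D(theta_bar) + 2 * pD, with D(theta) = -2 log p(y | theta)
    and pD = (mean deviance) - (deviance at the posterior mean)."""
    devs = np.array([-2.0 * log_lik(t) for t in theta_draws])
    dev_at_mean = -2.0 * log_lik(np.mean(theta_draws))
    p_d = devs.mean() - dev_at_mean      # effective number of parameters
    return dev_at_mean + 2.0 * p_d

# toy normal-mean model with known unit variance; draws simulate MCMC output
rng = np.random.default_rng(2)
y = rng.normal(0.5, 1.0, size=30)
log_lik = lambda th: float(np.sum(-0.5 * np.log(2 * np.pi) - 0.5 * (y - th) ** 2))
draws = rng.normal(y.mean(), 1.0 / np.sqrt(len(y)), size=1000)
print(dic(log_lik, draws))
```

Smaller DIC values are preferred; for this one-parameter model, pD should come out close to 1.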
In frequentist model selection, the lasso (Tibshirani, 1996) has become a widely popular
procedure for regularized variable selection in least-squares-type regression problems. The
lasso can be fitted by the LARS algorithm (Efron et al., 2004) or the coordinate descent
algorithm (Friedman et al., 2010). Penalties other than L1 have also been explored. Recent
work on regularized variable selection methods includes the Smoothly Clipped Absolute Deviation
penalty (SCAD) (Fan & Li, 2001), the Adaptive Lasso (Zou, 2006), the Elastic Net (Zou
& Hastie, 2005), the Fused Lasso (Tibshirani et al., 2005), the Grouped Lasso (Yuan & Lin, 2006),
the Bootstrapped Lasso (Hall et al., 2009), and the Weighted Grouped Lasso. The oracle property (discussed
later) has been established for SCAD, the Adaptive Lasso, and the Bootstrapped Lasso.
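At its core, the lasso solves min_b (1/2n)||y − Xb||² + λ||b||₁, and cyclic coordinate descent reduces each coordinate update to a soft-thresholding step. A minimal sketch on simulated data (a bare-bones illustration, not the LARS or glmnet implementations cited above):

```python
import numpy as np

def lasso_cd(X, y, lam, n_sweeps=200):
    """Cyclic coordinate descent for min_b (1/2n)||y - Xb||^2 + lam * ||b||_1."""
    n, p = X.shape
    b = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    for _ in range(n_sweeps):
        for j in range(p):
            r_j = y - X @ b + X[:, j] * b[j]   # partial residual excluding x_j
            rho = X[:, j] @ r_j / n
            # soft-thresholding: coordinates with weak signal are set exactly to zero
            b[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
    return b

rng = np.random.default_rng(3)
n, p = 100, 8
X = rng.normal(size=(n, p))
beta_true = np.array([3.0, -2.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
y = X @ beta_true + rng.normal(scale=0.5, size=n)
b_hat = lasso_cd(X, y, lam=0.2)
print(np.round(b_hat, 3))   # most coefficients are shrunk exactly to zero
```

The exact zeros produced by the L1 penalty are what make the lasso a variable selection method rather than just a shrinkage method.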
Bayesian regularization literature is rich as well. Bayesian lasso via the double-exponential
prior has been explored in Park & Casella (2008) and Hans (2009). Bayesian Adaptive Lasso
has been developed and used by Leng et al. (2014) and Sun et al. (2009). The Bayesian Elastic
Net has been developed by Li & Lin (2010) and Kyung et al. (2010). In addition, Kyung et
al. (2010) also developed Bayesian Group Lasso and Bayesian Fused Lasso. Recently, Polson
et al. (2014) considered Bayesian Bridge.
Note that Bayesian regularization methods implement the regularization through the choice
of an appropriate prior on the regression coefficients β; such priors can be referred to as shrinkage
priors. In addition to these penalty priors, other shrinkage priors have also been proposed.
Carvalho et al. (2010) considered the horseshoe prior. Polson & Scott (2012) proposed joint
priors generated by a Lévy process.
The spike-and-slab prior is another type of prior in the Bayesian shrinkage area; it was
originally proposed by Mitchell & Beauchamp (1988), was greatly improved by George &
McCulloch (1993), George & McCulloch (1997), and Kuo & Mallick (1998), and was further
developed by Ishwaran & Rao (2005) and Dey et al. (2008). García-Donato & Martínez-Beneito
(2013) used a posterior summary based on Stochastic Search Variable Selection (SSVS)
(George & McCulloch, 1997) for variable selection in high-dimensional data. They illustrated
the advantage of a Bayesian model selection method based on visit frequency over other
methods when the parameter space is finite, both by theoretical justification and by simulation.
A competing Bayesian variable selection proposal is based on the median probability
model (MPM). Barbieri & Berger (2004) theoretically established that, in the normal linear
model under a predictive loss, the optimal predictive model is the median probability model.
However, their theoretical optimality result is based on the assumption of linear models and
an orthogonal design matrix. The performance of the median probability model under a correlated
structure remains an active area of research.
The marginal-likelihood-based highest posterior model (HPM) approach, on the other
hand, chooses the model which has the highest posterior probability, or the highest marginal
likelihood. This is a formal approach grounded in the basic probabilistic development
of the Bayesian statistical paradigm. However, this approach is often computationally difficult
to implement, as one needs to calculate the posterior probabilities of all the models
in the model space. For example, even in the presence of just 30 variables or features, the
model space contains more than one billion models (2^30), and any method which explores one
model at a time may take years to explore this model space.
This dissertation proposes an efficient method for variable selection using the highest posterior
probability approach.
CHAPTER 2
VARIABLE SELECTION PROBLEM: PAST AND PRESENT
2.1 Introduction
A statistical model specifies a relationship between an n × 1 response variable or dependent
variable y = (y1, . . . , yn)′ and p many n × 1 regressors, also called covariates, predictors, explanatory
variables, or independent variables, x1, x2, . . . , xp. This is typically done by specifying
the stochastic law of the response y, which involves X = [x1, . . . , xp], the n × p design matrix
whose columns are the covariates, and β = (β1, . . . , βp)′, the parameter vector.
1. In the linear model,
E(y) = Xβ
2. For the logistic regression model the response has only two outcomes, denoted by 0 and 1.
The following then provides the relation between y and X:

y ∼ Bernoulli(π) (2.1)

where

π = P(y = 1|X, β) = exp(Xβ) / (1 + exp(Xβ)) (2.2)
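Expression (2.2) is the inverse-logit (sigmoid) transform of the linear predictor. A small, numerically stable sketch (the helper name and the example values are illustrative):

```python
import numpy as np

def inverse_logit(eta):
    """pi = exp(eta) / (1 + exp(eta)), computed stably for large |eta|."""
    eta = np.asarray(eta, dtype=float)
    out = np.empty_like(eta)
    pos = eta >= 0
    out[pos] = 1.0 / (1.0 + np.exp(-eta[pos]))               # avoids overflow of exp(eta)
    out[~pos] = np.exp(eta[~pos]) / (1.0 + np.exp(eta[~pos]))
    return out

eta = np.array([-2.0, 0.0, 2.0])   # hypothetical values of the linear predictor X @ beta
print(inverse_logit(eta))          # P(y = 1 | X, beta) for each value
```

The two-branch form gives the same result as the direct formula but never exponentiates a large positive number.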
3. In survival settings the response is a time to event. Let S(y) be the survival function of
y. Then the stochastic relation between y and X is given by

S(y) = [S0(y)]^{exp(Xβ)}

where S0(y) is the baseline survival function.
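Under this proportional form, covariates raise the baseline survival function to the power exp(Xβ). A small numerical sketch with a Weibull baseline (the shape, scale, and covariate values are hypothetical):

```python
import numpy as np

def ph_survival(t, x, beta, shape=1.5, scale=2.0):
    """S(t | x) = S0(t) ** exp(x @ beta), with a Weibull baseline S0."""
    s0 = np.exp(-(t / scale) ** shape)   # baseline survival at time t
    return s0 ** np.exp(x @ beta)        # covariate effect enters through the exponent

t = 1.0
x = np.array([1.0, 0.5])
beta = np.array([0.4, -0.2])
print(ph_survival(t, x, beta))   # survival probability at time t for covariates x
```

Since here x @ beta = 0.3 > 0, the covariates lower survival relative to the baseline, consistent with a proportional increase in the hazard.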
When there is a large number of predictors, some of them are more significant than others
in explaining the response. One of the main aims of variable selection is to select an important
subset of the explanatory variables. Define a p × 1 indicator vector γ = (γ1, . . . , γp)′ with
γj = 1 if xj is present and γj = 0 if xj is absent.
The stochastic law of y then depends on (x1, . . . , xp) via Xγβγ, where γ acts as a subscript
of X (and of β) such that xj (βj) is present in the model whenever γj = 1,
j = 1, 2, . . . , p. For example, suppose x1 and x2 are present in the model. Then

γ = (1, 1, 0, . . . , 0)′, with p − 2 trailing zeros,

Xγ = [x1, x2], and βγ = (β1, β2)′.
Note that γj can take only two possible values, namely 0 and 1, and there are p
elements in γ. Thus, altogether there are 2^p possible combinations, which implies that there
are 2^p models in the model space Ω (say). We denote an individual model in the model space
by Mγ, indexed by the binary vector γ.
10
When x1 and x2 are present in the model space the corresponding model can alternatively
denoted by (1, 2). We shall use these notations in this dissertation frequently. The null model
which has no independent variable in the model is denoted by M0.
One notable feature of the model space is that it grows extremely fast even for moderate p. For instance, when there are 30 independent variables the model space contains 2^30 = 1,073,741,824 models.
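To make the combinatorial explosion concrete, the following sketch (illustrative code, assuming Python with only the standard library; the helper name is ours) enumerates the inclusion vectors γ for small p and shows why exhaustive enumeration quickly becomes hopeless:

```python
from itertools import product

def model_space(p):
    """All 2^p inclusion vectors gamma = (gamma_1, ..., gamma_p) over p predictors."""
    return list(product([0, 1], repeat=p))

# For small p the model space can still be enumerated exhaustively.
print(len(model_space(4)))

# For p = 30 the count alone shows why enumeration is infeasible.
print(2 ** 30)
```

Each tuple returned by `model_space` corresponds to one model M_γ; doubling p squares the size of Ω.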
We can divide the collection of variable selection methods into two broad groups. Methods in the first group enumerate the whole model space to choose a good model, whereas methods in the second group search the model space without visiting every model. Naturally, when p is large, methods from the first group can be difficult or impossible to implement. We review methods from both groups below, primarily in the setting of linear models unless otherwise stated, though these methods are applicable to more complex models as well.
2.2 Classical Measures of Goodness of Fit
Classical criteria for the goodness of fit of a model include the error sum of squares (SSE), R², Mallows' C_p, the Akaike Information Criterion (AIC) (Akaike, 1974), and the Bayesian Information Criterion (BIC) (Schwarz, 1978).
For nested models, SSE always decreases as the number of predictors increases. Similarly, the multiple R² always increases as the number of predictors increases. Therefore neither can serve alone as a good measure of goodness of fit. This is precisely why the adjusted R² has been suggested; however, the adjusted R² is often criticized for lacking an easy interpretation. AIC is derived from the Kullback-Leibler distance, but it suffers from overfitting as the sample size grows (Yan & Su, 2009). The Bayesian counterpart of this type of information criterion is BIC, which puts a larger penalty on model size; in doing so, BIC faces the opposite issue of underfitting (Yan & Su, 2009).
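As a sketch of how these two criteria trade fit against complexity, the code below (an illustrative implementation, not tied to any particular software package) computes AIC and BIC for a Gaussian linear model from the SSE, dropping additive constants common to all candidate models:

```python
import numpy as np

def aic_bic(y, X):
    """AIC and BIC for a Gaussian linear model, up to additive constants."""
    n, k = X.shape
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    sse = np.sum((y - X @ beta) ** 2)
    aic = n * np.log(sse / n) + 2 * k
    bic = n * np.log(sse / n) + k * np.log(n)  # heavier penalty once log(n) > 2
    return aic, bic

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = 2.0 * X[:, 0] + rng.normal(size=100)  # only the first predictor matters

print(aic_bic(y, X))         # full model with two spurious predictors
print(aic_bic(y, X[:, :1]))  # data generating submodel
```

Since log(100) > 2, BIC penalizes each extra parameter more heavily than AIC here, which is exactly the overfitting/underfitting trade-off noted above.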
Forward selection is a process for selecting a model rather than assessing the goodness of fit of a given model. It adds one explanatory variable at a time, and the added variable is kept by comparing some goodness-of-fit criterion (for example, that all of the included variables are significant) between the current model and the model without that variable. In backward selection we fit the full model first and then remove one covariate at a time according to a goodness-of-fit measure until we reach the desired fit.
Stepwise selection is a combination of forward and backward selection in which each step considers both the addition and the removal of variables, one at a time. This method has been subjected to several criticisms in the literature; for a nice summary of its limitations see Flom & Cassell (2007). Essentially, stepwise regression produces unstable results: a slight change in the data may produce a different set of regressors (Breiman, 1995).
A model is often built on a subset of the data, and validation of the fitted model is done using the remaining data. The data used to build the model are called the training set, and the remaining data are referred to as the test set or validation set. Leave-k-out cross validation keeps k observations in the test set and repeats the process of building the model over many different choices of training set, and hence test set. The most common special case keeps a single observation in the test set. However, Shao (1993) proved that the probability of recovering the data generating model using leave-one-out cross validation based prediction error never tends to one as the sample size increases.
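The mechanics of leave-one-out cross validation for least squares can be sketched as follows; a well-known closed form says the held-out residual equals e_i/(1 − h_i), so in this case no refitting is actually needed (illustrative code on simulated data):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 30, 3
X = rng.normal(size=(n, p))
y = X @ np.array([1.0, 0.0, -2.0]) + rng.normal(size=n)

# Brute force: refit with observation i deleted, then predict y_i.
press = 0.0
for i in range(n):
    keep = np.arange(n) != i
    beta_i = np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]
    press += (y[i] - X[i] @ beta_i) ** 2

# Shortcut: the held-out residual equals e_i / (1 - h_i); no refitting needed.
beta = np.linalg.lstsq(X, y, rcond=None)[0]
e = y - X @ beta
h = np.einsum('ij,jk,ik->i', X, np.linalg.inv(X.T @ X), X)  # diagonal of hat matrix
press_fast = np.sum((e / (1 - h)) ** 2)
```

The two computations agree exactly; the same hat-matrix identities reappear in the proof of Theorem 3.2 later in this dissertation.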
2.3 A Selective Review of Penalized Regression Methods
2.3.1 Ridge Regression
Ridge regression can provide efficient parameter estimates in the presence of multicollinearity among the predictors (Seber & Lee, 2003). Furthermore, when a large number of covariates is present, ridge regression shrinks the coefficients toward zero. The ridge estimate is defined as
\hat{\beta} = \arg\min_{\beta} \sum_{i=1}^{n} \Big( y_i - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 \quad \text{subject to} \quad \sum_{j=1}^{p} \beta_j^2 \le t,

or equivalently,

\hat{\beta} = \arg\min_{\beta} \sum_{i=1}^{n} \Big( y_i - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 + \lambda \sum_{j=1}^{p} \beta_j^2.
A drawback of ridge regression is that the ridge estimates are not scale invariant.
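The penalized form has the closed-form minimizer (X^T X + λI)^{-1} X^T y, which the following sketch computes (illustrative helper on simulated data):

```python
import numpy as np

def ridge(X, y, lam):
    """Closed-form ridge estimate: solve (X'X + lam I) beta = X'y."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 4))
y = X @ np.array([3.0, 0.0, 0.0, -1.0]) + rng.normal(size=50)

b_ols = ridge(X, y, 0.0)     # lam = 0 recovers ordinary least squares
b_ridge = ridge(X, y, 10.0)  # a positive lam shrinks the solution toward zero
```

The coefficients shrink toward zero as λ grows but, unlike the lasso discussed below, are never set exactly to zero.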
2.3.2 Bridge Regression
Frank & Friedman (1993) introduced bridge regression. The Bridge estimate is defined
as
\hat{\beta} = \arg\min_{\beta} \sum_{i=1}^{n} \Big( y_i - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 \quad \text{subject to} \quad \sum_{j=1}^{p} |\beta_j|^q \le t,

or equivalently,

\hat{\beta} = \arg\min_{\beta} \sum_{i=1}^{n} \Big( y_i - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 + \lambda \sum_{j=1}^{p} |\beta_j|^q,

where q > 0.
Oracle properties of an estimator have been discussed by Fan & Li (2001) and Zou (2006). Let β^{(n)} be an estimator of β based on a sample of size n. Define B = {j : β_j ≠ 0}, the set of non-zero components of β, and B_n = {j : β^{(n)}_j ≠ 0}. The estimator β^{(n)} is said to have the oracle property if

1. \lim_{n\to\infty} P(B_n = B) = 1;

2. \sqrt{n}\,(\beta^{(n)}_B - \beta_B) \xrightarrow{D} N(0, \Sigma^*), where Σ* is the asymptotic covariance matrix of β^{(n)}_B.
For 0 < q ≤ 1, bridge regression produces sparse estimates, which is useful when there is a large number of covariates. Huang et al. (2008) proved oracle properties in the sense of Fan & Li (2001) for this type of bridge estimator. When q > 1, however, the method tends to keep all of the covariates. Moreover, it is not clear how to set the values of λ and q.
2.3.3 Non-negative Garrote
The nonnegative garrote estimate (Breiman, 1995) rescales an initial estimate \hat\beta_j (for example, the least squares estimate) by nonnegative factors c_j chosen as

\hat{c} = \arg\min_{c} \sum_{i=1}^{n} \Big( y_i - \sum_{j=1}^{p} c_j \hat\beta_j x_{ij} \Big)^2 \quad \text{subject to} \quad c_j \ge 0 \ \text{and} \ \sum_{j=1}^{p} c_j \le t,

or equivalently,

\hat{c} = \arg\min_{c} \sum_{i=1}^{n} \Big( y_i - \sum_{j=1}^{p} c_j \hat\beta_j x_{ij} \Big)^2 + \lambda \sum_{j=1}^{p} c_j, \qquad c_j \ge 0.
Breiman (1995) illustrated that the garrote performs better than stepwise selection and ridge regression.
2.3.4 Least Absolute Shrinkage and Selection Operator (LASSO)
Tibshirani (1996) introduced the lasso estimate which is defined as,
\hat{\beta} = \arg\min_{\beta} \sum_{i=1}^{n} \Big( y_i - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 \quad \text{subject to} \quad \sum_{j=1}^{p} |\beta_j| \le t,

or equivalently,

\hat{\beta} = \arg\min_{\beta} \sum_{i=1}^{n} \Big( y_i - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 + \lambda \sum_{j=1}^{p} |\beta_j|.
Lasso can be viewed as a special case of bridge regression with q = 1, and has the
ability to shrink the coefficients exactly to zero. Tibshirani (1996) compared performance of
lasso in selecting the data generating model with ridge regression and non-negative garrote.
Lasso is very popular and lasso estimates are easily interpretable. However, there are some
limitations of lasso.
• The oracle property (Fan & Li, 2001) does not hold for lasso estimators.
• When p > n, lasso cannot select more than n covariates.
• In presence of multicollinearity, lasso may perform poorly.
Many modifications to lasso have been developed to address these limitations.
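The exact-zero behavior of the lasso comes from the soft-thresholding operator that solves each coordinate subproblem. A minimal coordinate descent sketch (illustrative code on simulated data, not a production solver; the penalty value is arbitrary) makes this concrete:

```python
import numpy as np

def soft_threshold(z, lam):
    """Shrink z toward zero by lam, clipping exactly to zero inside [-lam, lam]."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def lasso_cd(X, y, lam, n_sweeps=200):
    """Coordinate descent for (1/2)||y - X b||^2 + lam * sum_j |b_j|."""
    b = np.zeros(X.shape[1])
    for _ in range(n_sweeps):
        for j in range(X.shape[1]):
            r_j = y - X @ b + X[:, j] * b[j]  # partial residual excluding x_j
            b[j] = soft_threshold(X[:, j] @ r_j, lam) / (X[:, j] @ X[:, j])
    return b

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 5))
y = X @ np.array([4.0, 0.0, 0.0, -3.0, 0.0]) + rng.normal(size=100)
b = lasso_cd(X, y, lam=50.0)
# With a large enough penalty, some coefficients land exactly at zero,
# while the strong signals survive (shrunken toward zero).
```

Running with smaller `lam` reproduces the familiar lasso path: coefficients enter the model one by one as the penalty relaxes.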
2.3.5 Smoothly Clipped Absolute Deviation Penalty
Fan & Li (2001) noted that any penalty should satisfy the following desired properties:
1. Unbiasedness: The resulting estimator is nearly unbiased when the true unknown
parameter is large to avoid unnecessary modeling bias.
2. Sparsity: The resulting estimator is a thresholding rule, which automatically sets
small estimated coefficients to zero to reduce model complexity.
3. Continuity: The resulting estimator is continuous in data in appropriate metrics to
avoid instability in model prediction.
According to Fan & Lv (2010a), “The convex Lq penalty with q > 1 does not satisfy the
sparsity condition, whereas the convex L1 penalty does not satisfy the unbiasedness condi-
tion, and the concave Lq penalty with 0 < q < 1 does not satisfy the continuity condition.
In other words, none of the Lq penalties satisfies all three conditions simultaneously.” In
order to achieve all three properties, the following penalty function was proposed.
The Smoothly Clipped Absolute Deviation (SCAD) penalty (Fan & Li, 2001) estimate
is defined as,
\hat{\beta} = \arg\min_{\beta} \sum_{i=1}^{n} \Big( y_i - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 + \sum_{j=1}^{p} p_\lambda(\beta_j),

where the derivative of the penalty p_\lambda is given by

p'_\lambda(\theta) = \lambda \Big\{ I(\theta \le \lambda) + \frac{(a\lambda - \theta)_{+}}{(a-1)\lambda} \, I(\theta > \lambda) \Big\}

for a > 2 and θ > 0.
The penalty function satisfies the mathematical conditions for unbiasedness, sparsity and
continuity. Furthermore, Fan & Li (2001) showed that the estimate has oracle property.
2.3.6 Fused Lasso
The fused lasso (Tibshirani et al., 2005) estimate is defined as,
\hat{\beta} = \arg\min_{\beta} \sum_{i=1}^{n} \Big( y_i - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 + \lambda_1 \sum_{j=1}^{p} |\beta_j| + \lambda_2 \sum_{j=2}^{p} |\beta_j - \beta_{j-1}|.
This is particularly useful when there exists a natural ordering in the covariates.
2.3.7 Elastic Net
The elastic net estimate (Zou & Hastie, 2005) is defined as,
\hat{\beta} = \arg\min_{\beta} \sum_{i=1}^{n} \Big( y_i - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 + \lambda_1 \sum_{j=1}^{p} |\beta_j| + \lambda_2 \sum_{j=1}^{p} \beta_j^2.
The elastic net estimator can select more than n predictors, and the additional square
penalty term helps to address the issue of collinear predictors.
2.3.8 Adaptive Lasso
The adaptive lasso estimate (Zou, 2006) is defined as,
\hat{\beta} = \arg\min_{\beta} \sum_{i=1}^{n} \Big( y_i - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 + \sum_{j=1}^{p} \lambda_j |\beta_j|.
Zou (2006) established that this estimate has oracle property, and can select more than
n predictors.
2.3.9 Sure Independence Screening
Consider the componentwise regression

\omega = X^T y.

Sure Independence Screening (SIS) (Fan & Lv, 2008) proposes to select the model M_γ where, for a given γ ∈ (0, 1),

\mathcal{M}_\gamma = \big\{ 1 \le i \le p : |\omega_i| \text{ is among the first } [\gamma n] \text{ largest of all} \big\},

with [γn] denoting the integer part of γn. Fan & Lv (2008) showed that, under some regularity assumptions,

\Pr(\mathcal{M}_* \subset \mathcal{M}_\gamma) \to 1,

where M_* is the data-generating model.
By definition, SIS selects fewer than n covariates, but we note that the screening property holds even if we select more than n covariates. Iterative Sure Independence Screening (ISIS) (Fan & Lv, 2008) is an iterative version of SIS; because of its iterative nature, ISIS can often recover an important predictor that is left out by SIS. After we select the model M_γ using SIS or ISIS, the final model is obtained by applying usual penalized procedures such as the lasso, SCAD, or MCP (defined in the following section).
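The screening step itself is a one-line computation. A sketch on simulated p ≫ n data (illustrative code; the cutoff d = n − 1 is one common choice, not prescribed by the method):

```python
import numpy as np

def sis(X, y, d):
    """Keep the d covariates with largest |x_j' y| (componentwise regression)."""
    omega = X.T @ y
    return set(np.argsort(-np.abs(omega))[:d])

rng = np.random.default_rng(4)
n, p = 60, 500  # p >> n: screen first, then apply a penalized method
X = rng.normal(size=(n, p))
y = 5.0 * X[:, 0] - 5.0 * X[:, 1] + rng.normal(size=n)

selected = sis(X, y, d=n - 1)
# The two truly active covariates (indices 0 and 1) survive the screen,
# after which lasso/SCAD/MCP can be run on the reduced design.
```

The sure screening guarantee above says exactly this: with probability tending to one, the data-generating variables are retained.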
2.3.10 Minimax Concave Penalty
Zhang (2010) proposed the Minimax Concave Penalty (MCP) estimate as
\hat{\beta} = \arg\min_{\beta} \sum_{i=1}^{n} \Big( y_i - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 + \sum_{j=1}^{p} p_\lambda(\beta_j),

where the penalty p_\lambda is given by

p_\lambda(\beta_j) = \lambda \int_0^{|\beta_j|} \Big( 1 - \frac{x}{\gamma\lambda} \Big)_{+} dx.
Zhang (2010) noted that the MCP estimate obeys the oracle property and satisfies the mathematical conditions for unbiasedness, sparsity, and continuity in the sense of Fan & Li (2001).
2.3.11 Reciprocal Lasso
The reciprocal lasso estimate (Song & Liang, 2015) is defined as,
\hat{\beta} = \arg\min_{\beta} \sum_{i=1}^{n} \Big( y_i - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 + \sum_{j=1}^{p} \frac{\lambda}{|\beta_j|} \, I(\beta_j \ne 0).

This estimate also has the oracle property.
Other important references on penalized regression include the grouped lasso (Yuan & Lin, 2006), the Dantzig selector (Candes & Tao, 2007), the bootstrapped lasso (Hall et al., 2009), and the adaptive elastic net (Zou & Zhang, 2009).
2.4 A Selective Review of Bayesian Model Selection Criteria
2.4.1 Highest Posterior Model (HPM)
In the Bayesian paradigm, uncertainty about parameters is expressed by a prior probability distribution. Uncertainty about models can further be expressed by putting a prior distribution on the model space. In model selection, a Bayesian model can thus be defined as
y \sim f(y \mid \theta, \mathcal{M}), \qquad \theta \sim \pi(\theta \mid \mathcal{M}), \qquad \mathcal{M} \sim \Pr(\mathcal{M}),

where π(θ | M) is the prior distribution of the parameter θ under model M and Pr(M) is the prior on the model. The posterior distribution of θ under model M is then given by

\pi(\theta \mid y, \mathcal{M}) = \frac{f(y \mid \theta, \mathcal{M}) \, \pi(\theta \mid \mathcal{M})}{\int f(y \mid \theta, \mathcal{M}) \, \pi(\theta \mid \mathcal{M}) \, d\theta},

and the posterior probability of model M is given by

\Pr(\mathcal{M} \mid y) = \frac{\Pr(y \mid \mathcal{M}) \Pr(\mathcal{M})}{\sum_{\mathcal{M}_\gamma \in \Omega} \Pr(y \mid \mathcal{M}_\gamma) \Pr(\mathcal{M}_\gamma)}, \quad (2.3)

where

\Pr(y \mid \mathcal{M}) = \int \Pr(y \mid \theta, \mathcal{M}) \, \pi(\theta \mid \mathcal{M}) \, d\theta \quad (2.4)
is called the marginal likelihood or integrated likelihood of data y under model M.
The highest posterior model (HPM) is defined as the model having the highest posterior probability among all models in the model space, that is,

\text{HPM} = \arg\max_{\gamma} \Pr(\mathcal{M}_\gamma \mid y) = \arg\max_{\gamma} \Pr(y \mid \mathcal{M}_\gamma) \Pr(\mathcal{M}_\gamma),

or equivalently, the model M is said to be the highest posterior model if

\Pr(\mathcal{M} \mid y) = \max_{\gamma} \Pr(\mathcal{M}_\gamma \mid y).
Kass & Raftery (1995) pointed out that the highest posterior model approach has a solid theoretical foundation. In addition, Raftery (1999) wrote, “The hypothesis testing procedure defined by choosing the model with the higher posterior probability minimizes the total error rate, that is, the sum of type I and type II error rates. Note that, frequentist statisticians sometimes recommend reducing the significance level in tests when the sample size is large, the Bayes factor (and hence highest posterior model) does this automatically!!” Most importantly, model uncertainty can be captured when using the posterior probability approach.
2.4.2 Marginal Likelihood
Finding the HPM requires the following:

1. the enumeration of the whole model space;

2. the computation of the marginal likelihood in (2.4).

When a large number of covariates is present, the enumeration of the whole model space can be challenging even with modern computing capabilities. As for the second point, an extensive amount of research has been done on estimating the marginal likelihood in complex non-conjugate settings. On the contrary, the calculation of the marginal likelihood is straightforward when we have a conjugate prior.
Consider the conjugate setting of a normal linear model with normal priors, where

y \mid \beta \sim N(X\beta, \tau_y^{-1}), \quad (2.5)

\beta \sim N(\beta_0, \tau_0^{-1}). \quad (2.6)

It follows that the posterior distribution of β is given by

\beta \mid y \sim N\big( (X^T \tau_y X + \tau_0)^{-1} (X^T \tau_y y + \tau_0 \beta_0), \; (X^T \tau_y X + \tau_0)^{-1} \big),

and the marginal distribution of y is given by

y \sim N\big( X\beta_0, \; \tau_y^{-1} + X \tau_0^{-1} X^T \big), \quad (2.7)
which provides the value of marginal likelihood directly.
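For instance, the log marginal likelihood can be read off by evaluating the normal density in (2.7). A sketch (illustrative helper names, with the precisions taken as matrices); the test below checks the result against the identity Pr(y) = f(y|β)π(β)/π(β|y), which holds at any β:

```python
import numpy as np

def mvn_logpdf(x, mean, cov):
    """Log density of N(mean, cov), via slogdet and a linear solve."""
    d = len(x)
    _, logdet = np.linalg.slogdet(cov)
    r = x - mean
    return -0.5 * (d * np.log(2 * np.pi) + logdet + r @ np.linalg.solve(cov, r))

def log_marginal(y, X, beta0, tau_y, tau0):
    """log Pr(y) from (2.7): y ~ N(X beta0, tau_y^{-1} + X tau0^{-1} X')."""
    cov = np.linalg.inv(tau_y) + X @ np.linalg.inv(tau0) @ X.T
    return mvn_logpdf(y, X @ beta0, cov)

rng = np.random.default_rng(8)
n, p = 12, 2
X = rng.normal(size=(n, p))
beta0 = np.zeros(p)
tau_y = 2.0 * np.eye(n)  # error precision
tau0 = 0.5 * np.eye(p)   # prior precision
y = rng.normal(size=n)

lm = log_marginal(y, X, beta0, tau_y, tau0)
```

Comparing `lm` across candidate designs X_γ is exactly the conjugate-case marginal likelihood comparison used to find the HPM.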
The g-prior (Zellner, 1986) for β in the setting of the linear model is given by

\beta \mid \sigma, \mathcal{M} \sim N\big( 0, \; g\sigma^2 (X^T X)^{-1} \big). \quad (2.8)

Consider the g-prior given by (2.8). Also assume a constant improper prior on the intercept α and the noninformative prior π(σ) ∝ 1/σ on the standard deviation σ, where σ² = τ_y^{-1}. The prior is then given by (García-Donato & Martínez-Beneito, 2013)

\pi(\alpha, \beta, \sigma \mid g) = \sigma^{-1} \, N\Big( \beta \,\Big|\, 0, \; g\sigma^2 \big( X^T (I - \tfrac{1}{n}\mathbf{1}\mathbf{1}^T) X \big)^{-1} \Big)

for γ ≠ 0, and π_0(α, σ) = 1/σ for the null model M_0.
Then the Bayes factor of any model M_γ against the null model M_0 is given by (García-Donato & Martínez-Beneito, 2013)

BF_{\gamma 0} = \frac{(1+g)^{(n-k_\gamma-1)/2}}{\big( 1 + g \, \mathrm{SSE}_\gamma / \mathrm{SSE}_0 \big)^{(n-1)/2}}, \quad (2.9)

where SSE_γ is the residual sum of squares of M_γ, SSE_0 is the residual sum of squares of M_0, and k_γ is the number of explanatory variables present in M_γ.
Note that the posterior probability of M_γ can be written as

\Pr(\mathcal{M}_\gamma \mid y) = \frac{BF_{\gamma 0} \Pr(\mathcal{M}_\gamma)}{\sum_{i} BF_{i0} \Pr(\mathcal{M}_i)} \propto BF_{\gamma 0} \Pr(\mathcal{M}_\gamma).

Therefore, under a constant prior probability on the models, the posterior probability of model M_γ is proportional to the Bayes factor of M_γ against the null model M_0, after using the g-prior for β.
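With (2.9), finding the HPM under equal model priors reduces to maximizing the Bayes factor over an enumeration of the model space. A sketch (illustrative helper names; g = n is one common choice; data simulated):

```python
import numpy as np
from itertools import combinations

def sse(y, Xc):
    """Residual sum of squares after regressing y on [1, Xc]."""
    Z = np.column_stack([np.ones(len(y)), Xc])
    beta = np.linalg.lstsq(Z, y, rcond=None)[0]
    return np.sum((y - Z @ beta) ** 2)

def log_bf(y, X, cols, g):
    """log BF of the model with covariates `cols` vs. the intercept-only null, as in (2.9)."""
    n, k = len(y), len(cols)
    sse0 = np.sum((y - y.mean()) ** 2)
    ratio = sse(y, X[:, cols]) / sse0
    return 0.5 * (n - k - 1) * np.log1p(g) - 0.5 * (n - 1) * np.log1p(g * ratio)

rng = np.random.default_rng(5)
n, p = 80, 4
X = rng.normal(size=(n, p))
y = 1.0 + 3.0 * X[:, 0] + rng.normal(size=n)

# Enumerate all non-null models; under equal model priors the HPM maximizes the BF.
models = [c for r in range(1, p + 1) for c in combinations(range(p), r)]
hpm = max(models, key=lambda c: log_bf(y, X, list(c), g=n))
```

For larger p this exhaustive `max` is exactly the enumeration bottleneck motivating the stochastic search methods of Section 2.6.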
When we have non-conjugate priors and rely on Markov chain Monte Carlo (MCMC) for Bayesian analysis, estimation of the integrated likelihood is challenging because MCMC bypasses the computation of normalizing constants, whereas the marginal likelihood is precisely the normalizing constant of the posterior distribution. Chib (1995) developed a method for estimating the marginal likelihood from Gibbs sampling output, which was extended by Chib & Jeliazkov (2001) to Metropolis-Hastings sampling output. Gelman & Meng (1998) used path sampling for calculating the integrated likelihood, which motivated the power posterior method of Friel & Pettitt (2008). Raftery et al. (2007) used a harmonic mean type estimator to estimate the marginal likelihood.
2.4.3 Log Pseudo Marginal Likelihood
The conditional predictive ordinate (Ibrahim et al., 2004) for the i-th observation is defined as

CPO_i = f(y_i \mid y_{-i}) = \int f(y_i \mid \theta) \, \pi(\theta \mid y_{-i}) \, d\theta, \quad i = 1, \ldots, n, \quad (2.10)

where y_{-i} = (y_1, \ldots, y_{i-1}, y_{i+1}, \ldots, y_n)' denotes the set of all observed data points excluding y_i, and π(θ | y_{-i}) is the posterior distribution of θ given y_{-i}.
The CPO is thus the conditional density of y_i given the remaining observations, which can alternatively be expressed as

CPO_i = \int f(y_i \mid \theta) \, \pi(\theta \mid y_{-i}) \, d\theta
      = \int f(y_i \mid \theta) \, \frac{\prod_{j \ne i} f(y_j \mid \theta) \, \pi(\theta)}{\int \prod_{j \ne i} f(y_j \mid \theta) \, \pi(\theta) \, d\theta} \, d\theta
      = \frac{f(y)}{f(y_{-i})}.
Geisser & Eddy (1979) defined the log pseudo marginal likelihood (LPML) as

LPML = \frac{1}{n} \sum_{i=1}^{n} \log CPO_i

and proposed it as a model comparison criterion. Notice that LPML is based on the notion of leave-one-out cross validation. Models with higher values of LPML are preferred under this criterion.
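In practice CPO_i is estimated from MCMC output via the identity CPO_i^{-1} = E_{π(θ|y)}[1/f(y_i | θ)], i.e., a harmonic mean of the observation-wise likelihoods over posterior draws. A sketch for a normal mean model with known variance, where the posterior can be sampled directly (illustrative code; uses this chapter's 1/n-averaged definition of LPML):

```python
import numpy as np

rng = np.random.default_rng(6)
y = rng.normal(loc=2.0, scale=1.0, size=40)  # data from N(theta, 1)

# Posterior of theta under a flat prior is N(ybar, 1/n); draw MCMC-style samples.
M = 20000
theta = rng.normal(y.mean(), 1.0 / np.sqrt(len(y)), size=M)

# CPO_i^{-1} = E_post[ 1 / f(y_i | theta) ], estimated by a posterior-sample average.
lik = np.exp(-0.5 * (y[:, None] - theta[None, :]) ** 2) / np.sqrt(2 * np.pi)
cpo = 1.0 / np.mean(1.0 / lik, axis=1)
lpml = np.mean(np.log(cpo))  # 1/n-averaged definition used in this chapter
```

For this conjugate model the exact CPO_i is the N(ȳ_{−i}, 1 + 1/(n−1)) density at y_i, so the estimator can be checked directly; in non-conjugate settings only the harmonic-mean estimate is available.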
2.4.4 L-Measure
Laud & Ibrahim (1995) introduced the L measure as a model comparison criterion that attempts to balance predictive loss and the variability of the predictions. The L measure is defined as

L_m^2 = \sum_{i=1}^{n} \big[ \nu \, (E(z_i) - y_i)^2 + Var(z_i) \big].

For ν = 1 we get

L_m^2 = \sum_{i=1}^{n} \big[ (E(z_i) - y_i)^2 + Var(z_i) \big],

where z = (z_1, \ldots, z_n)' are new observations to be predicted. In the setting of the normal linear model (2.5, 2.6),

z \mid \beta \sim N(X\beta, \tau_y^{-1}), \qquad p(z \mid y) = \int p(z \mid \beta) \, p(\beta \mid y) \, d\beta.

It follows that

z \mid y \sim N\big( X (X^T \tau_y X + \tau_0)^{-1} (X^T \tau_y y + \tau_0 \beta_0), \; X (X^T \tau_y X + \tau_0)^{-1} X^T + \tau_y^{-1} \big).

Models with a smaller L measure are preferred under this criterion.
2.4.5 Deviance Information Criterion
Spiegelhalter et al. (2002) introduced the Deviance Information Criterion (DIC) as,
DIC = E(D(\theta \mid y)) + p_D = E(D(\theta \mid y)) + \big[ E(D(\theta \mid y)) - D(E(\theta \mid y)) \big],

where D(θ | y) = −2 log f(y | θ) and p_D is the effective number of parameters in the model.
The DIC is available as a goodness of fit measure incorporated in the popular Gibbs
sampling software OpenBUGS (Thomas et al., 2006). In the linear model setting, under
(2.5) and (2.6) we get,
E(D(\beta \mid y)) = -\log|\tau_y| + n \log 2\pi + \mathrm{tr}\big( X^T \tau_y X (X^T \tau_y X + \tau_0)^{-1} \big) + (X\bar\beta - y)^T \tau_y (X\bar\beta - y)

and

D(E(\beta \mid y)) = -\log|\tau_y| + n \log 2\pi + (y - X\bar\beta)^T \tau_y (y - X\bar\beta),

where \bar\beta = (X^T \tau_y X + \tau_0)^{-1} (X^T \tau_y y + \tau_0 \beta_0) denotes the posterior mean of β.
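Because both terms of DIC are posterior expectations, DIC is easily estimated from MCMC draws: average the deviance and subtract the deviance at the posterior mean. A sketch with σ² = 1 known and a flat prior, so the posterior is available in closed form (illustrative code, not OpenBUGS output):

```python
import numpy as np

rng = np.random.default_rng(7)
n, p = 50, 3
X = rng.normal(size=(n, p))
y = X @ np.array([1.0, -1.0, 0.5]) + rng.normal(size=n)

# Posterior of beta with sigma^2 = 1 and a flat prior: N(beta_hat, (X'X)^{-1}).
XtX = X.T @ X
beta_hat = np.linalg.solve(XtX, X.T @ y)
draws = rng.multivariate_normal(beta_hat, np.linalg.inv(XtX), size=50000)

def deviance(b):
    """D(beta) = -2 log f(y | beta) for the N(X beta, I) likelihood."""
    r = y - X @ b
    return n * np.log(2.0 * np.pi) + r @ r

d_bar = np.mean([deviance(b) for b in draws])   # E(D(beta | y))
d_hat = deviance(draws.mean(axis=0))            # D(E(beta | y))
p_d = d_bar - d_hat                             # effective number of parameters
dic = d_bar + p_d
```

With a flat prior, p_D should come out close to the actual parameter count p = 3.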
2.4.6 Median Probability Model
Barbieri & Berger (2004) introduced the concept of the median probability model (MPM), which is a single model in the model space; this is thus not a criterion based method. The MPM utilizes the posterior inclusion probability of a given explanatory variable x_l, defined as

q_l = \sum_{\gamma : \gamma_l = 1} \Pr(\mathcal{M}_\gamma \mid y).

The median probability model is then defined as the model that includes the variables {x_l : q_l > 0.5}; that is, the median probability model contains those variables that have at least 50% posterior inclusion probability over all models. When the goal of model selection is to choose a model for future prediction, Barbieri & Berger (2004) established that, in the setting of normal linear models with an orthogonal design matrix, the MPM is the optimal model under a suitably chosen predictive loss function.
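Given posterior model probabilities, computing the MPM is a simple weighted sum over inclusion indicators. A toy sketch (the posterior probabilities are hypothetical numbers chosen for illustration):

```python
import numpy as np

def median_probability_model(models, post_probs, p):
    """Inclusion probability q_l for each variable; keep those with q_l > 0.5."""
    q = np.zeros(p)
    for gamma, prob in zip(models, post_probs):
        q += prob * np.asarray(gamma, dtype=float)
    return q, [l for l in range(p) if q[l] > 0.5]

# Toy posterior over the 4 models on p = 2 candidate variables (sums to 1).
models = [(0, 0), (1, 0), (0, 1), (1, 1)]
post = [0.1, 0.5, 0.1, 0.3]
q, mpm = median_probability_model(models, post, p=2)
# q = [0.8, 0.4]: variable 0 is in the MPM, variable 1 is not; here the
# MPM coincides with the HPM (1, 0), but in general the two can differ.
```

Note that the MPM aggregates evidence across all models, so it need not equal the single most probable model.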
2.5 Development of Prior Distributions
2.5.1 Bayesian Connection to Penalized Regressions
A penalized regression estimate is typically obtained by minimizing the residual sum of squares plus a penalty term p_λ(β). For the normal linear model, it then follows that, under the prior π(β) ∝ exp(−p_λ(β)), the posterior mode is the corresponding penalized estimate, after suitably adjusting the location and scale parameters of the prior. Indeed, placing the independent normal prior

\pi(\beta) \propto \exp\Big( -\frac{\lambda}{2} \|\beta\|^2 \Big)

on the coefficients makes the log posterior, up to an additive constant, the negative of the ridge regression objective function.
As noted in Tibshirani (1996), lasso estimates can be derived as the Bayes posterior mode under independent double exponential priors on the β_j,

f(\beta_j) = \frac{\lambda}{2} \exp(-\lambda |\beta_j|).
Park & Casella (2008) expressed the Laplace or double exponential distribution as a scale mixture of normals with an exponential mixing density (Andrews & Mallows, 1974),

\frac{a}{2} \exp(-a|z|) = \int_0^\infty \frac{1}{\sqrt{2\pi s}} \exp\Big( -\frac{z^2}{2s} \Big) \, \frac{a^2}{2} \exp\Big( -\frac{a^2 s}{2} \Big) \, ds,

and described the model in the following hierarchical structure:

y \mid \beta, \sigma^2 \sim N(X\beta, \sigma^2 I_n)

\beta \mid \sigma^2, \tau_1^2, \ldots, \tau_p^2 \sim N(0, \sigma^2 D_\tau), \qquad D_\tau = \mathrm{diag}(\tau_1^2, \ldots, \tau_p^2)

\sigma^2, \tau_1^2, \ldots, \tau_p^2 \sim \pi(\sigma^2) \, d\sigma^2 \prod_{j=1}^{p} \frac{\lambda^2}{2} \exp\Big( -\frac{\lambda^2 \tau_j^2}{2} \Big) d\tau_j^2

\pi(\sigma^2) \propto \frac{1}{\sigma^2}, \qquad \sigma^2, \tau_1^2, \ldots, \tau_p^2 > 0.
Other important references in this theme include the Bayesian elastic net (Li & Lin, 2010), the Bayesian adaptive lasso (Leng et al., 2014; Sun et al., 2009), and Bayesian bridge regression (Polson et al., 2014), among others.
2.5.2 Shrinkage Prior
Besides developing Bayesian lasso type priors, researchers have also considered other priors for efficient model selection. George & McCulloch (1993) introduced Stochastic Search Variable Selection (SSVS). They placed independent Bernoulli type priors on each γ_j, j = 1, . . . , p, and collected the sequence γ^{(1)}, . . . , γ^{(m)} after running the Gibbs sampler (Casella & George, 1992) for m iterations, with the hope of exploring the model space easily and efficiently: important models would have higher posterior probability and would appear in this sequence more frequently. In this way they proposed to perform variable selection while avoiding the enumeration of the whole model space. George & McCulloch (1997) noted, however, that when p is large it is infeasible to take m = 2^p, and hence impossible to visit all the models even once.
On the other hand, Dey et al. (2008) proved variable selection consistency of the HPM when a spike and slab prior (Mitchell & Beauchamp, 1988) is placed on the coefficients. This type of consistency is similar to the oracle property discussed by Fan & Li (2001).

Zellner's g-prior (Zellner, 1986) is another popular prior, mainly because of the computational ease it affords for the posterior probabilities of the models; its conjugacy, discussed in Section 2.4.2, makes the computation of the marginal likelihood straightforward. Fernandez et al. (2001) proved posterior model consistency for the g-prior with g = n, and Liang et al. (2008) proved model selection consistency for mixtures of g-priors and hyper-g priors.

For shrinkage purposes, Carvalho et al. (2010) considered placing a Beta(1/2, 1/2) prior, whose density looks like a horseshoe, and hence referred to it as the horseshoe prior; they also justified the performance of this prior theoretically. Recently, Polson & Scott (2012) suggested using joint priors generated by a Lévy process.
2.6 Sampling over Model Space
Since it is impossible to enumerate the whole model space for large p, many researchers have proposed to sample from the model space, or to search it stochastically. Early work in this area was done by Berger & Molina (2005), who employed a stochastic search algorithm with a path-based pairwise prior. They proposed sampling without replacement with probabilities proportional to the posterior probabilities of the models visited in the MCMC run. Casella & Moreno (2006) proposed to search the model space stochastically using intrinsic Bayes factors as estimates of the posterior model probabilities. Posterior probabilities were also used by Hans et al. (2007), who took advantage of parallel computing and named their stochastic search process “Shotgun Stochastic Search (SSS)”. Bottolo & Richardson (2010) devised another stochastic search process, which they called evolutionary stochastic search (ESS). On the other hand, Clyde et al. (2011) discussed the feasibility and advantages of Bayesian adaptive sampling (BAS), a variant of without-replacement sampling of the variables according to their marginal inclusion probabilities (the theory of the MPM is based on these marginal inclusion probabilities), which are estimated adaptively during the search process.
CHAPTER 3
COMPARISON AMONG LPML, DIC, AND HPM
3.1 Introduction
Variable selection remains an active area of research in both Bayesian and frequentist
statistics. There are many discussions available in this area. For example, see O’Hara
& Sillanpaa (2009), Hahn & Carvalho (2015), and references therein for existing Bayesian
variable selection methods, and for frequentist variable selection methods, see Shao (1993),
Fan & Lv (2010b), Tibshirani (2011), and the references therein.
Classical Bayesian model selection criteria depend on marginal likelihoods and, thereby, on the highest posterior model. However, researchers face difficulty when the data and the priors on the parameters do not form a conjugate pair. For instance, with time-to-event data, computation of the marginal likelihood of the models is itself a non-trivial task. Placing noninformative priors on the parameters adds one more level of difficulty, because in such scenarios the Bayes factor often becomes undefined and hence the highest posterior model cannot be computed. A preferred Bayesian criterion is then the Log Pseudo Marginal Likelihood (LPML) of Gelfand et al. (1992), which relies upon a predictive likelihood approach. It is derived from predictive considerations and leads to pseudo Bayes factors for choosing among models. This approach has seen increased popularity due in part to the relative ease with which LPML is stably estimated from MCMC output.
The theory of LPML rests on the idea of the famous cross validation technique. Cross validation is a widely accepted predictive model selection criterion. As the name suggests, cross validation relies on splitting the data into a training set and a test set, building the model using the training set, and examining the accuracy of the fit using the test set. In this process, the model is validated across many possible combinations of training and test sets. The number of training/test combinations depends on the size of the data n and the size of the test set. The most popular cross validation method keeps only one observation in the test (or validation) set, and is referred to as one deleted cross validation. Suppose there are n observations; we fit the model to n − 1 of them and predict the remaining observation using the fitted model. The assessment of the fitted model is done by considering every observation in turn as the test set and taking the average squared error of the fitted and actual observed values.
However, despite the simplicity of one deleted cross validation, the technique has been criticized by researchers. For instance, fitting the model on only part of the available observations violates the sufficiency principle of statistics (see Picard & Cook (1984)). Perhaps a more worrisome fact is the model selection inconsistency of one deleted cross validation (see Shao (1993) and references therein). This chapter revisits this issue and establishes that LPML, which utilizes one deleted cross validation in a Bayesian setting, suffers from the same model selection inconsistency. We extend our study beyond linear models to logistic regression and time to event regression models.
Spiegelhalter et al. (2002) introduced the Deviance Information Criterion (DIC), a Bayesian information criterion. Since its introduction, DIC has received widespread attention due to its straightforward computation from a Markov chain Monte Carlo (MCMC) sample. However, if a model has random effects or latent variables, then the calculation of DIC is not well defined, and Celeux et al. (2006) suggested various modifications. Ghosh et al. (2009) used a modified DIC and LPML to compare joinpoint models for cancer rates over the years. We extend our study of model selection accuracy to the DIC and illustrate that, for model selection, the DIC suffers from a similar inaccuracy problem as the LPML. In contrast, we illustrate that the HPM approach to model selection performs strongly.
Example 3.1. Gunst and Mason data.
We consider the Gunst and Mason data (Table 1, Shao (1993)) to assess the performance of LPML, DIC, and HPM using simulation. There are n = 40 observations and p = 4 explanatory variables in the Gunst and Mason data. Without loss of generality, we treat the intercept as a covariate that is always included in the analysis models. We keep the covariates fixed. The data y are simulated according to

y \sim N(X\beta, \sigma^2 I) \quad (3.1)

with some of the β_j, j = 2, . . . , 5, set to 0 and σ² set to 1. This is repeated 1000 times, i.e., the number of simulations is set to 1000.

In the Bayesian analysis, we place a conjugate flat (essentially noninformative) prior on β. As in Shao (1993), if we know whether each component of β is 0 or not, then the analysis models M_γ can be classified into two categories (β is the parameter vector of the data generating model M and β_γ is the parameter vector of the analysis model M_γ):

• Category I: at least one nonzero component of β is not in β_γ.

• Category II: β_γ contains all nonzero components of β.
For each simulation we fit all possible 2^5 − 1 = 31 models. For the DIC criterion the model with the lowest DIC is chosen, while for the LPML and HPM criteria, the models with the highest LPML and the highest marginal likelihood, respectively, are selected. Table 3.1 reports the sample probabilities of selecting each model based on the 1000 simulations. For one deleted cross validation, the model with the minimum average squared prediction error, as in Shao (1993), is chosen; we report its selection probabilities (CV(1)) mainly for comparison. We denote a model by the subset of {1, . . . , 5} containing the indices of the covariates in the model.
Table 3.1: Probabilities of selecting data generating models for Gunst and Mason data
Model            Category          CV(1)   LPML    DIC     HPM

β = (2, 0, 0, 4, 0)′
(1, 4)           Data-generating   0.501   0.589   0.626   0.999
(1, 2, 4)        II                0.141   0.109   0.098   0.000
(1, 3, 4)        II                0.110   0.111   0.110   0.001
(1, 4, 5)        II                0.138   0.126   0.107   0.000
(1, 2, 3, 4)     II                0.036   0.024   0.021   0.000
(1, 2, 4, 5)     II                0.027   0.017   0.013   0.000
(1, 3, 4, 5)     II                0.034   0.018   0.018   0.000
(1, 2, 3, 4, 5)  II                0.013   0.006   0.007   0.000
(1, 3, 5)        I                 0.002   0.000   0.000   0.000

β = (2, 0, 0, 4, 8)′
(1, 4, 5)        Data-generating   0.650   0.721   0.719   0.996
(1, 2, 4, 5)     II                0.158   0.119   0.131   0.002
(1, 3, 4, 5)     II                0.119   0.110   0.120   0.002
(1, 2, 3, 4, 5)  II                0.073   0.050   0.030   0.000
(1, 2, 5)        I                 0.000   0.000   0.000   0.000

β = (2, 9, 0, 4, 8)′
(1, 4, 5)        I                 0.000   0.000   0.000   0.000
(1, 2, 3, 5)     I                 0.000   0.000   0.000   0.000
(1, 2, 4, 5)     Data-generating   0.796   0.838   0.841   0.999
(1, 3, 4, 5)     I                 0.003   0.000   0.000   0.001
(1, 2, 3, 4, 5)  II                0.201   0.162   0.159   0.000

β = (2, 9, 6, 4, 8)′
(1, 2, 3, 5)     I                 0.000   0.000   0.000   0.000
(1, 2, 4, 5)     I                 0.000   0.000   0.000   0.000
(1, 3, 4, 5)     I                 0.001   0.000   0.000   0.002
(1, 2, 3, 4, 5)  Data-generating   0.999   1.000   1.000   0.998
The following conclusions are evident from Table 3.1:

1. The performances of LPML and DIC are poor and similar to the CV(1) method.

2. HPM performs strongly and consistently in selecting the data generating model.

3. In particular, when the data generating model is sparse, as in the top panel (β = (2, 0, 0, 4, 0)′) of Table 3.1, LPML and DIC tend to select unnecessary covariates and, as a result, cannot distinguish among the category II models.

4. The probabilities of selecting category I models remain very small irrespective of the model selection criterion.
3.2 Inconsistency of LPML
In this section, we present a theorem that justifies the poor performance of LPML for linear models. As before, M is the data-generating model and the M_γ are the analysis models. Furthermore, we assume lim_{n→∞} max_{1≤i≤n} h_{iγ} = 0, where h_{iγ} is the i-th diagonal element of the hat matrix for model M_γ.

Theorem 3.2. Consider the noninformative prior π(β) ∝ 1, and assume that σ² is known and fixed. If M_γ belongs to category II, then

\mathrm{plim}_{n\to\infty} \, LPML(\mathcal{M}_\gamma) = -\sigma^2.

In particular, all models in category II have the same asymptotic value; that is, LPML cannot asymptotically distinguish among the category II models.
Proof. For simplicity, we write X ≡ X_γ and β ≡ β_γ. Let \hat\beta and \hat\beta(i) denote the least squares estimates of β with and without the i-th observation included in the data, respectively. Furthermore, let y_{−i} and X(i) denote the response vector and the regression matrix, respectively, with the i-th row deleted.
Then

\Pr(y \mid \beta) = \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\Big\{ -\frac{1}{2\sigma^2} (y - X\beta)^T (y - X\beta) \Big\},

and the marginal likelihood is

\Pr(y) = \int f(y \mid \beta) \, d\beta

= \frac{1}{(2\pi\sigma^2)^{n/2}} \int \exp\Big\{ -\frac{1}{2\sigma^2} (y - X\beta)^T (y - X\beta) \Big\} d\beta

= \frac{1}{(2\pi\sigma^2)^{n/2}} \int \exp\Big\{ -\frac{1}{2\sigma^2} (y - X\hat\beta + X\hat\beta - X\beta)^T (y - X\hat\beta + X\hat\beta - X\beta) \Big\} d\beta

= \frac{\exp\big\{ -\frac{1}{2\sigma^2} (y - X\hat\beta)^T (y - X\hat\beta) \big\}}{(2\pi\sigma^2)^{n/2}} \int \exp\Big\{ -\frac{1}{2\sigma^2} (\beta - \hat\beta)^T (X^T X) (\beta - \hat\beta) \Big\} d\beta

= \frac{|(X^T X)^{-1}|^{1/2}}{\sqrt{(2\pi\sigma^2)^{\,n-p}}} \exp\Big\{ -\frac{1}{2\sigma^2} (y - X\hat\beta)^T (y - X\hat\beta) \Big\}. \quad (3.2)
Then the predictive density of y_i given the other observations is

\Pr(y_i \mid y_1, \ldots, y_{i-1}, y_{i+1}, \ldots, y_n) = \Pr(y_i \mid y_{-i}) = \frac{f(y_1, \ldots, y_n)}{\Pr(y_{-i})}

= \frac{ |(X^T X)^{-1}|^{1/2} \exp\big\{ -\frac{1}{2\sigma^2} (y - X\hat\beta)^T (y - X\hat\beta) \big\} }{ |(X(i)^T X(i))^{-1}|^{1/2} \exp\big\{ -\frac{1}{2\sigma^2} (y_{-i} - X(i)\hat\beta(i))^T (y_{-i} - X(i)\hat\beta(i)) \big\} } \cdot \frac{1}{\sqrt{2\pi\sigma^2}}

= \frac{|X(i)^T X(i)|^{1/2}}{|X^T X|^{1/2} \sqrt{2\pi\sigma^2}} \exp\Big\{ -\frac{1}{2\sigma^2} (Q_1 - Q_2) \Big\},
36
where
Q1 −Q2 =
(y −Xβ
)T(y −Xβ
)−(y−i −X(i)β(i)
)T(y−i −X(i)β(i)
)=yTy − 2yTXβ + βTXTXβ − yT
−iy−i + 2yT−iX(i)β(i)− β′(i)XT(i)X(i)β(i)
We know that (Seber & Lee, 2003),
$$
\begin{aligned}
X_{(i)}^TX_{(i)} &= X^TX - x_ix_i^T &(3.3)\\
y_{-i}^TX_{(i)} &= y^TX - y_ix_i^T &(3.4)\\
\hat\beta - \hat\beta_{(i)} &= \frac{(X^TX)^{-1}x_ie_i}{1-h_i} &(3.5)\\
e_i &= y_i - x_i^T\hat\beta &(3.6)
\end{aligned}
$$
where x_i is the i-th row of X and h_i = x_i^T(X^TX)^{-1}x_i is the i-th diagonal element of the hat matrix.
So,
$$Q_1 - Q_2 = y_i^2 - 2\left(y^TX\hat\beta - y_{-i}^TX_{(i)}\hat\beta_{(i)}\right) + \left(\hat\beta^TX^TX\hat\beta - \hat\beta_{(i)}^TX_{(i)}^TX_{(i)}\hat\beta_{(i)}\right)$$
Now,
$$
\begin{aligned}
y^TX\hat\beta - y_{-i}^TX_{(i)}\hat\beta_{(i)}
&= y^TX\hat\beta - y_{-i}^TX_{(i)}\left[\hat\beta - \frac{(X^TX)^{-1}x_ie_i}{1-h_i}\right] &&\text{[using (3.5)]}\\
&= \left(y^TX - y_{-i}^TX_{(i)}\right)\hat\beta + \frac{y_{-i}^TX_{(i)}(X^TX)^{-1}x_ie_i}{1-h_i}\\
&= y_ix_i^T\hat\beta + \frac{(y^TX - y_ix_i^T)(X^TX)^{-1}x_ie_i}{1-h_i} &&\text{[using (3.4)]}\\
&= y_ix_i^T\hat\beta + \frac{y^TX(X^TX)^{-1}x_ie_i - y_ix_i^T(X^TX)^{-1}x_ie_i}{1-h_i}\\
&= y_ix_i^T\hat\beta + \frac{\hat\beta^Tx_ie_i - y_ih_ie_i}{1-h_i}\\
&= y_ix_i^T\hat\beta + \frac{\hat\beta^Tx_i(y_i - x_i^T\hat\beta) - y_ih_i(y_i - x_i^T\hat\beta)}{1-h_i} &&\text{[using (3.6)]}\\
&= y_ix_i^T\hat\beta + \frac{y_i\hat\beta^Tx_i - \hat\beta^Tx_ix_i^T\hat\beta - h_iy_i^2 + h_iy_ix_i^T\hat\beta}{1-h_i}\\
&= \frac{1}{1-h_i}\left[2y_ix_i^T\hat\beta - \hat\beta^Tx_ix_i^T\hat\beta - h_iy_i^2\right]
\end{aligned}
$$
Now,
$$
\begin{aligned}
&\hat\beta^TX^TX\hat\beta - \hat\beta_{(i)}^TX_{(i)}^TX_{(i)}\hat\beta_{(i)}\\
&= \hat\beta^TX^TX\hat\beta - \left(\hat\beta - \frac{(X^TX)^{-1}x_ie_i}{1-h_i}\right)^TX_{(i)}^TX_{(i)}\left(\hat\beta - \frac{(X^TX)^{-1}x_ie_i}{1-h_i}\right) \qquad\text{[using (3.5)]}\\
&= \hat\beta^TX^TX\hat\beta - \hat\beta^TX_{(i)}^TX_{(i)}\hat\beta + \frac{2\hat\beta^TX_{(i)}^TX_{(i)}(X^TX)^{-1}x_ie_i}{1-h_i} - \frac{e_i^2\,x_i^T(X^TX)^{-1}X_{(i)}^TX_{(i)}(X^TX)^{-1}x_i}{(1-h_i)^2}\\
&= \hat\beta^Tx_ix_i^T\hat\beta + \frac{2}{1-h_i}\hat\beta^T(X^TX - x_ix_i^T)(X^TX)^{-1}x_ie_i - \frac{e_i^2}{(1-h_i)^2}x_i^T(X^TX)^{-1}(X^TX - x_ix_i^T)(X^TX)^{-1}x_i\\
&= \hat\beta^Tx_ix_i^T\hat\beta + \frac{2}{1-h_i}\left(\hat\beta^Tx_ie_i - h_i\hat\beta^Tx_ie_i\right) - \frac{e_i^2}{(1-h_i)^2}\left(h_i - h_i^2\right)\\
&= \hat\beta^Tx_ix_i^T\hat\beta + 2\hat\beta^Tx_i(y_i - x_i^T\hat\beta) - \frac{h_ie_i^2}{1-h_i}\\
&= \hat\beta^Tx_ix_i^T\hat\beta + 2\hat\beta^Tx_iy_i - 2\hat\beta^Tx_ix_i^T\hat\beta - \frac{h_ie_i^2}{1-h_i}
\end{aligned}
$$
It follows that
$$
\begin{aligned}
Q_1 - Q_2 &= y_i^2 - \frac{2}{1-h_i}\left[2y_ix_i^T\hat\beta - \hat\beta^Tx_ix_i^T\hat\beta - h_iy_i^2\right] + 2y_ix_i^T\hat\beta - \hat\beta^Tx_ix_i^T\hat\beta - \frac{h_ie_i^2}{1-h_i}\\
&= y_i^2\left(1 + \frac{2h_i}{1-h_i}\right) - y_ix_i^T\hat\beta\left(\frac{4}{1-h_i} - 2\right) + \hat\beta^Tx_ix_i^T\hat\beta\left(\frac{2}{1-h_i} - 1\right) - \frac{h_ie_i^2}{1-h_i}\\
&= \frac{1+h_i}{1-h_i}\left(y_i^2 - 2y_ix_i^T\hat\beta + \hat\beta^Tx_ix_i^T\hat\beta\right) - \frac{h_ie_i^2}{1-h_i}\\
&= \frac{1+h_i}{1-h_i}(y_i - x_i^T\hat\beta)^2 - \frac{h_ie_i^2}{1-h_i}
= \frac{1+h_i}{1-h_i}e_i^2 - \frac{h_i}{1-h_i}e_i^2
= \frac{e_i^2}{1-h_i} = \frac{(y_i - x_i^T\hat\beta)^2}{1-h_i}
\end{aligned}
$$
Therefore,
$$\Pr(y_i\mid y_1,\ldots,y_{i-1},y_{i+1},\ldots,y_n) = \frac{|X_{(i)}^TX_{(i)}|^{1/2}}{|X^TX|^{1/2}\sqrt{2\pi\sigma^2}}\exp\left\{-\frac{(y_i - x_i^T\hat\beta)^2}{2\sigma^2(1-h_i)}\right\}$$
Thus, the LPML for model Mγ becomes
$$
\begin{aligned}
\mathrm{LPML}(M_\gamma) &= \frac{1}{n}\sum_{i=1}^n \log \Pr(y_i\mid y_{-i}, M_\gamma)\\
&= -\frac{1}{n}\sum_{i=1}^n\left[\frac{1}{2}\log(2\pi\sigma^2) - \frac{1}{2}\log\frac{|X_{\gamma(i)}^TX_{\gamma(i)}|}{|X_\gamma^TX_\gamma|} + \frac{1}{2\sigma^2}\,\frac{(y_i - x_{i\gamma}^T\hat\beta_\gamma)^2}{1-h_{i\gamma}}\right]\\
&= -\frac{1}{2n}\sum_{i=1}^n\left[\log(2\pi\sigma^2) - \log\frac{|X_{\gamma(i)}^TX_{\gamma(i)}|}{|X_\gamma^TX_\gamma|} + \frac{1}{\sigma^2}(y_i - x_{i\gamma}^T\hat\beta_\gamma)^2(1-h_{i\gamma})^{-1}\right]
\end{aligned}
$$
Now,
$$
\begin{aligned}
\frac{1}{n}\sum_{i=1}^n (y_i - x_{i\gamma}^T\hat\beta_\gamma)^2(1-h_{i\gamma})^{-1}
&= \frac{1}{n}\sum_{i=1}^n (y_i - x_{i\gamma}^T\hat\beta_\gamma)^2\left(1 + h_{i\gamma} + O(h_{i\gamma}^2)\right)\\
&= \frac{1}{n}\sum_{i=1}^n (y_i - x_{i\gamma}^T\hat\beta_\gamma)^2 + \frac{1}{n}\sum_{i=1}^n (y_i - x_{i\gamma}^T\hat\beta_\gamma)^2\left(h_{i\gamma} + O(h_{i\gamma}^2)\right)
\end{aligned}
$$
Now, recall that model Mγ is y = X_γβ_γ + ε, and the projection matrix is P_γ = X_γ(X_γ^TX_γ)^{-1}X_γ^T. Hence,
$$\frac{1}{n}\sum_{i=1}^n (y_i - x_{i\gamma}^T\hat\beta_\gamma)^2 = \frac{1}{n}y^T(I-P_\gamma)y = \frac{1}{n}\left[\varepsilon^T(I-P_\gamma)\varepsilon + \beta_\gamma^TX_\gamma^T(I-P_\gamma)X_\gamma\beta_\gamma + 2\varepsilon^T(I-P_\gamma)X_\gamma\beta_\gamma\right]$$
If model Mγ belongs to category II, the last two terms vanish and
$$\frac{1}{n}\sum_{i=1}^n (y_i - x_{i\gamma}^T\hat\beta_\gamma)^2 = \frac{\varepsilon^T\varepsilon}{n} - \frac{\varepsilon^TP_\gamma\varepsilon}{n}$$
Now,
$$
\begin{aligned}
\Pr\left[\frac{|\varepsilon^TP_\gamma\varepsilon|}{n} > \delta\right] &\le \frac{1}{\delta n}E\left(|\varepsilon^TP_\gamma\varepsilon|\right)\\
&= \frac{1}{\delta n}E\left(\varepsilon^TP_\gamma\varepsilon\right), \quad\text{since } \varepsilon^TP_\gamma\varepsilon \ge 0\\
&= \frac{k_\gamma\sigma^2}{\delta n} \to 0 \text{ as } n\to\infty,
\end{aligned}
$$
where k_γ is the number of covariates in model Mγ. This implies that ε^T P_γ ε / n → 0 in probability, and we know that ε^T ε / n → σ² in probability. Recall the assumption lim_{n→∞} max_{1≤i≤n} h_{iγ} = 0.
Now,
$$\frac{1}{n}\sum_{i=1}^n \left(h_{i\gamma} + O(h_{i\gamma}^2)\right)(y_i - x_{i\gamma}^T\hat\beta_\gamma)^2 = \frac{1}{n}\sum_{i=1}^n O(h_{i\gamma})(y_i - x_{i\gamma}^T\hat\beta_\gamma)^2$$
Again,
$$0 \le \frac{1}{n}\sum_{i=1}^n O(h_{i\gamma})(y_i - x_{i\gamma}^T\hat\beta_\gamma)^2 \le O\!\left(\max_{1\le i\le n} h_{i\gamma}\right)\frac{1}{n}\sum_{i=1}^n (y_i - x_{i\gamma}^T\hat\beta_\gamma)^2 = O\!\left(\max_{1\le i\le n} h_{i\gamma}\right)\left[\frac{\varepsilon^T\varepsilon}{n} - \frac{\varepsilon^TP_\gamma\varepsilon}{n}\right] \xrightarrow{P} 0$$
by the assumption.
Finally,
$$
\begin{aligned}
\log\frac{|X_{\gamma(i)}^TX_{\gamma(i)}|}{|X_\gamma^TX_\gamma|} &= \log\frac{|X^TX - x_ix_i^T|}{|X^TX|} \qquad\text{[using (3.3)]}\\
&= \log\frac{|X^TX|\left(1 - x_i^T(X^TX)^{-1}x_i\right)}{|X^TX|}\\
&= \log(1-h_i) \to 0 \text{ as } n\to\infty
\end{aligned}
$$
Combining the three limits, the bracketed average converges in probability to log(2πσ²) − 0 + 1, so that
$$\operatorname*{plim}_{n\to\infty}\mathrm{LPML}(M_\gamma) = -\frac{1}{2}\log(2\pi\sigma^2) - \frac{1}{2},$$
the same value for every category II model. This completes the proof.
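The algebraic core of the proof, Q1 − Q2 = e_i²/(1 − h_i), is easy to check numerically. The following sketch (Python/NumPy, not part of the dissertation; the data and seed are arbitrary) fits a small linear model with and without one observation and compares the two residual sums of squares.

```python
import numpy as np

# Numerical check of the identity Q1 - Q2 = e_i^2 / (1 - h_i), where Q1 and
# Q2 are the residual sums of squares of the full and leave-one-out fits.
rng = np.random.default_rng(0)
n, p = 30, 3
X = rng.standard_normal((n, p))
y = X @ np.array([2.0, 0.0, 4.0]) + rng.standard_normal(n)

beta = np.linalg.lstsq(X, y, rcond=None)[0]           # beta-hat
H = X @ np.linalg.inv(X.T @ X) @ X.T                  # hat matrix
Q1 = np.sum((y - X @ beta) ** 2)

i = 7
Xi, yi = np.delete(X, i, axis=0), np.delete(y, i)
beta_i = np.linalg.lstsq(Xi, yi, rcond=None)[0]       # beta-hat_(i)
Q2 = np.sum((yi - Xi @ beta_i) ** 2)

e_i, h_i = y[i] - X[i] @ beta, H[i, i]
assert np.isclose(Q1 - Q2, e_i ** 2 / (1.0 - h_i))
```

The same identity underlies the PRESS statistic in the regression literature, which is why the leave-one-out predictive density admits a closed form here.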
3.3 Simulation Study
Theorem 3.2 establishes that the LPML criterion cannot distinguish among category II
models when σ2 is known. In the following we present a series of simulation studies to
investigate the performance of LPML in other settings.
3.3.1 Linear Model in the Presence of Multicollinearity
Our goal is to assess the performance of LPML when multicollinearity is present in the data. We consider the linear model again. We set n = 50 and p = 10; that is, there are 50 observations and 10 covariates x1, . . . , x10. The intercept term is included in the analysis model as in the previous example. The covariates x1, x2, x3 are generated from a multivariate normal distribution such that correlation(xi, xj) = 0.95 for 1 ≤ i < j ≤ 3, while x4, . . . , x10 are generated independently from the standard normal distribution. The covariates are kept fixed throughout the simulations. The response vector is sampled according to equation (3.1), with σ2 set to 1. We place a standard normal prior on β for the Bayesian analysis.
Table 3.2: Probabilities of selecting the data-generating model

  Data-generating model                                           LPML-based selection
  All uncorrelated: 4, 5, 6, 7, 8, 9, 10                          0.943
  All correlated: 1, 2, 3                                         0.573
  All: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10                              0.720
  One correlated and all uncorrelated: 3, 4, 5, 6, 7, 8, 9, 10    0.575
  One correlated and 4 uncorrelated: 3, 4, 5, 6, 7                0.522
We report the sample probabilities of selecting the data-generating models in Table 3.2. We note that, when all the uncorrelated variables are present in the data-generating model, LPML appears to be successful in recovering the data-generating model. However, LPML performs poorly in all the other cases.
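For readers who wish to replicate the flavor of this study, the sketch below (hypothetical Python/NumPy code, not the dissertation's implementation; the seed and β are illustrative) generates the correlated design just described and scores the data-generating model and the full model — both category II — with the flat-prior, known-σ² closed form of the leave-one-out log predictive derived in Section 3.2, rather than the standard normal prior used in the actual study.

```python
import numpy as np

# Illustrative version of the Section 3.3.1 design: x1-x3 pairwise
# correlated at 0.95, x4-x10 independent standard normal, n = 50.
rng = np.random.default_rng(1)
n = 50
R = np.full((3, 3), 0.95); np.fill_diagonal(R, 1.0)
Xcorr = rng.multivariate_normal(np.zeros(3), R, size=n)
Xind = rng.standard_normal((n, 7))
X = np.hstack([Xcorr, Xind])
beta_true = np.array([0, 0, 1, 1, 1, 1, 1, 0, 0, 0], dtype=float)  # assumed beta
y = X @ beta_true + rng.standard_normal(n)          # sigma^2 = 1

def lpml_flat(Xg, y, sigma2=1.0):
    """(1/n) sum_i log p(y_i | y_-i) under pi(beta) ∝ 1, sigma^2 known."""
    H = Xg @ np.linalg.inv(Xg.T @ Xg) @ Xg.T
    h = np.diag(H)
    e = y - H @ y                                   # full-fit residuals
    # |X(i)'X(i)| / |X'X| = 1 - h_i, from identity (3.3)
    return np.mean(0.5 * np.log(1 - h) - 0.5 * np.log(2 * np.pi * sigma2)
                   - e ** 2 / (2 * sigma2 * (1 - h)))

m_true = lpml_flat(X[:, 2:7], y)   # data-generating model {3,4,5,6,7}
m_full = lpml_flat(X, y)           # full model, also category II
```

Theorem 3.2 says both values converge to the same constant, so the two scores give LPML little power to separate these models.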
3.3.2 A Non-conjugate Setting: Logistic Regression
Logistic regression is a widely used binary regression model and an example of a nonlinear model. No conjugate prior exists for the logistic model, so we must rely on MCMC for sampling from the posterior densities. We consider the following regression model,
$$y_i \sim \mathrm{Bernoulli}(\pi_i) \qquad (3.7)$$
where
$$\pi_i = \frac{\exp(x_{i\gamma}^T\beta_\gamma)}{1 + \exp(x_{i\gamma}^T\beta_\gamma)}$$
We set n = 100 and p = 5. We generate x1, . . . , x5 from the standard normal distribution and keep them fixed. The response vector is generated according to equation (3.7). This is a non-conjugate Bayesian model, and we use Markov chain sampling for the Bayesian analysis via the MCMClogit function of the R package MCMCpack (Martin et al., 2011). We use Laplace approximations to compute the marginal likelihoods of the models; this is easily done by setting marginal.likelihood = "Laplace".
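To convey what the Laplace route computes, here is a self-contained sketch (Python, not the R/MCMCpack code used in the study; the seed and β are illustrative) that generates data from (3.7) and Laplace-approximates the log marginal likelihood of a candidate model at the posterior mode under independent N(0, 1) priors.

```python
import numpy as np

# Generate data from the logistic model (3.7) with an illustrative beta.
rng = np.random.default_rng(2)
n = 100
X = rng.standard_normal((n, 5))
beta_true = np.array([1.0, 0.0, 0.0, 1.0, 0.0])
p_i = 1.0 / (1.0 + np.exp(-(X @ beta_true)))
y = rng.binomial(1, p_i)

def log_marginal_laplace(Xg, y):
    """Laplace approximation to log m(y), logistic likelihood, N(0,1) priors."""
    k = Xg.shape[1]
    b = np.zeros(k)
    for _ in range(50):                        # Newton iterations to the MAP
        mu = 1.0 / (1.0 + np.exp(-(Xg @ b)))
        grad = Xg.T @ (y - mu) - b             # gradient of log posterior
        W = mu * (1.0 - mu)
        Hess = -(Xg.T * W) @ Xg - np.eye(k)    # Hessian (negative definite)
        b = b - np.linalg.solve(Hess, grad)
    mu = 1.0 / (1.0 + np.exp(-(Xg @ b)))
    loglik = np.sum(y * np.log(mu) + (1 - y) * np.log1p(-mu))
    logprior = -0.5 * k * np.log(2 * np.pi) - 0.5 * b @ b
    _, logdet = np.linalg.slogdet(-Hess)
    return loglik + logprior + 0.5 * k * np.log(2 * np.pi) - 0.5 * logdet

full = log_marginal_laplace(X, y)              # full model {1,...,5}
gen = log_marginal_laplace(X[:, [0, 3]], y)    # data-generating model {1, 4}
```

Comparing such approximate log marginal likelihoods across models is exactly what HPM selection requires in this non-conjugate setting.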
We have run 100 simulations. Table 3.3 shows that the poor performance of LPML and DIC continues beyond linear models to logistic models, whereas HPM remains consistent in selecting the data-generating models. In particular, all of the methods show performance similar to that observed for the linear models.
3.3.3 Nodal Data
Chib (1995) and many others have considered the nodal data for Bayesian variable selection. There were 53 patients. Following is an overview of the variables:
• y: 1 or 0 according as the cancer has spread to the surrounding lymph nodes or not.
• x1: age of the patient.
• x2: level of serum acid phosphate.
• x3: 1 or 0 according as the result of an X-ray examination is positive or negative.
• x4: 1 or 0 according as the size of the tumor is large or small.
• x5: 1 or 0 according as the pathological grade of the tumor is more serious or less
serious.
Chib (1995) reported the model Mγ = (x2, x3, x4) as the highest posterior model (HPM) for the observed data. We note that Chib (1995) employed a probit regression model for the nodal data. A probit model is given by
$$y_i \sim \mathrm{Bernoulli}(\pi_i) \qquad (3.8)$$
where π_i = Φ(x_{iγ}^T β_γ) and Φ is the cumulative distribution function of the standard normal distribution. We consider a simulation study where the response y is generated according to (3.8) after keeping the covariates fixed and setting different values of β as given in Table 3.4. We adopt a normal distribution with mean 1 and variance 1 as the prior on β for the Bayesian analysis. The intercept term is included in all the analysis models as usual. We run a total of 100 simulations and use the MCMCprobit function for the MCMC sampling.
The important feature of this experiment is that LPML and DIC fail to produce convincing results even when the data-generating model is the full model.
3.3.4 Melanoma Data
Survival models do not typically provide a conditionally conjugate setup for Markov chain sampling. Due to this complexity, LPML-based model selection has been used extensively for model comparison in the setting of complex survival models. We consider the Bounded Cumulative Hazard (BCH) cure rate model for survival data, which models the survival function S(t) as
$$S(t) = \exp\left(-\theta G(t)\right)$$
where G(t) is a proper cdf (with pdf g(t)) with G(0) = 0 and lim_{t→∞} G(t) = 1 (see Tsodikov et al. (2003), Chen et al. (1999)). The cure fraction is given by
$$c = \lim_{t\to\infty} S(t) = \exp\left(-\theta \lim_{t\to\infty} G(t)\right) = \exp(-\theta)$$
Suppose an individual has N latent factors (carcinogenic cells) with activation times T_1, . . . , T_N, which are assumed to be i.i.d. with common survival function S_0(t). If N = 0 (no carcinogenic cells), the subject is cured. The first activation scheme assumes that the time to failure of the subject is determined by the first activation T_{(1)} = min(T_1, . . . , T_N), and we have
$$S(t) = \Pr(N = 0) + \sum_{n=1}^{\infty} \Pr(N = n)\,S_0(t)^n$$
When N ∼ Poisson(θ), this gives the BCH model S(t) = exp(−θ(1 − S_0(t))), that is, G(t) = 1 − S_0(t). We consider a regression model where the cure rate parameter θ depends on the covariates through the relationship θ = exp(Xβ). We assume a Weibull density for g(t),
$$g(t) = \eta t^{\eta-1}\exp\left(\lambda - t^\eta \exp(\lambda)\right)$$
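The pieces above assemble directly; a minimal sketch (Python, illustrative parameter values, not the E1684 analysis) evaluates the BCH survival curve with the Weibull choice of g(t) and confirms the cure fraction exp(−θ).

```python
import numpy as np

# BCH cure rate model: S(t) = exp(-theta * G(t)) with G(t) the Weibull cdf
# G(t) = 1 - exp(-t^eta * exp(lam)) implied by the density g(t) above.
def bch_survival(t, theta, eta, lam):
    G = 1.0 - np.exp(-(t ** eta) * np.exp(lam))
    return np.exp(-theta * G)

theta, eta, lam = 1.5, 1.2, 0.3     # illustrative values only
t = np.linspace(0.0, 50.0, 501)
S = bch_survival(t, theta, eta, lam)

cure_fraction = np.exp(-theta)      # c = lim_{t->inf} S(t) = exp(-theta)
assert S[0] == 1.0                  # S(0) = 1 since G(0) = 0
assert np.isclose(S[-1], cure_fraction, atol=1e-6)
```

A plateau of the survival curve at exp(−θ) rather than at zero is the defining feature of a cure rate model.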
We consider data from a phase III melanoma clinical trial conducted by the Eastern
Cooperative Oncology Group (ECOG) (Chen et al. (1999)). The study, denoted as E1684,
was a two-arm clinical trial involving patients randomized to one of two treatment arms:
high-dose interferon (IFN) or observation. There are four covariates – treatment, age, gender
and performance status (binary), and 255 observations.
Several authors (Chen et al. (1999), Chen et al. (2002), Cooner et al. (2007)) have proposed complex models for these data and used LPML for model validation. Our goal is to examine the performance of LPML as a model validation criterion in this setting.
For the data-generating model, as usual, the covariates are fixed and β is set to different values as shown in Table 3.5. The censored times are generated according to the setup described above. To carry out the analysis, we adopt the following priors: η ∼ Gamma(shape = 1, rate = 0.1) and λ, βj ∼ N(0, 10000). This is a complex non-conjugate model. The number of simulations is 100.
We present the results in Table 3.5, which clearly support the previous conclusions. We estimated the marginal likelihoods using Laplace approximations. We find that HPM was 100% successful in detecting the data-generating model in this simulation study.
3.4 Conclusion
LPML is extremely popular, in particular, in censored model. DIC is readily available
in OpenBUGS. The performance of the one-deleted LPML criterion based model selection
and DIC based model selection are questioned in this work. Highest posterior model (HPM)
criterion or marginal likelihood based model selection is found to be superior in our and
many other recent studies. There are many recent and ongoing advances on computation of
marginal likelihood and intelligent searches over the model space.
Table 3.3: Probabilities of selecting the data-generating models for the logistic regression simulation example

  β = (1, 0, 0, 1, 0)
    Model            Category           LPML   DIC    HPM
    1                I                  0.00   0.00   0.01
    1, 4             Data-generating    0.67   0.64   0.88
    1, 2, 4          II                 0.05   0.08   0.03
    1, 3, 4          II                 0.13   0.13   0.03
    1, 4, 5          II                 0.09   0.08   0.02
    3, 4, 5          II                 0.00   0.00   0.01
    1, 2, 3, 4       II                 0.01   0.01   0.00
    1, 2, 4, 5       II                 0.03   0.03   0.01
    1, 3, 4, 5       II                 0.02   0.03   0.01

  β = (1, 0, 0, 1, 1)
    1, 4             I                  0.02   0.02   0.02
    1, 5             I                  0.01   0.00   0.01
    4, 5             I                  0.01   0.01   0.02
    1, 2, 4          I                  0.01   0.01   0.00
    1, 3, 5          I                  0.00   0.01   0.00
    1, 4, 5          Data-generating    0.62   0.60   0.86
    3, 4, 5          I                  0.01   0.01   0.00
    1, 2, 4, 5       II                 0.17   0.18   0.04
    1, 3, 4, 5       II                 0.12   0.12   0.04
    1, 2, 3, 4, 5    II                 0.03   0.04   0.01

  β = (1, 1, 0, 1, 1)
    1, 2             I                  0.00   0.00   0.01
    1, 4             I                  0.00   0.00   0.01
    1, 2, 4          I                  0.02   0.01   0.04
    1, 4, 5          I                  0.00   0.00   0.02
    2, 4, 5          I                  0.00   0.00   0.02
    1, 2, 4, 5       Data-generating    0.82   0.81   0.82
    1, 2, 3, 4, 5    II                 0.16   0.18   0.04

  β = (1, 1, -1, 1, 1)
    1, 2, 3, 4       I                  0.02   0.02   0.03
    2, 3, 4, 5       I                  0.03   0.02   0.04
    1, 2, 3, 4, 5    Data-generating    0.95   0.96   0.93
Table 3.4: Probabilities of selecting the data-generating models for the nodal data

  β                         Data-generating model    LPML   DIC    HPM
  (-2, 0, 2, 0, 2, 0)       2, 4                     0.44   0.34   0.71
  (-2, 0, 2, 0, 2, 2)       2, 4, 5                  0.47   0.53   0.85
  (-2, 0, 2, 2, 2, 2)       2, 3, 4, 5               0.58   0.46   0.91
  (-2, -2, 2, 2, 2, 2)      1, 2, 3, 4, 5            0.57   0.48   0.86
Table 3.5: Probabilities of selecting the data-generating models for the melanoma data

  β = (0, -1, 0, -1)
    Model          Category           LPML   DIC    HPM
    2, 4           Data-generating    0.72   0.74   1.00
    1, 2, 4        II                 0.13   0.12   0.00
    2, 3, 4        II                 0.13   0.12   0.00
    1, 2, 3, 4     II                 0.02   0.02   0.00

  β = (1, -1, 0, 1)
    1, 2, 4        Data-generating    0.76   0.77   1.00
    1, 2, 3, 4     II                 0.24   0.23   0.00

  β = (1, -1, 1, -1)
    1, 2, 3, 4     Data-generating    1.00   1.00   1.00
CHAPTER 4
MEDIAN PROBABILITY MODEL
4.1 Introduction
In this section we revisit the properties of MPM.
Theorem 4.1. (Barbieri & Berger, 2004) In the setting of normal linear model with an
orthogonal design matrix, the MPM is the optimal model under a predictive loss function.
Moreover, Barbieri & Berger (2004) provided a condition on the prior probabilities of the models under which the MPM and the HPM coincide; one such scenario is considered in Dey et al. (2008). The MPM depends only on the marginal inclusion probability of each predictor xj, which is often easier to estimate, for example from Markov chain sampling, than the joint inclusion probabilities on which other criteria depend.
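The rule is simple to apply to Markov chain output: estimate each marginal inclusion probability by the proportion of draws with γ_j = 1, and include x_j when that proportion is at least 1/2. The sketch below (Python; the draws are synthetic, purely to illustrate the computation) implements it.

```python
import numpy as np

# Synthetic inclusion draws standing in for MCMC output over gamma.
rng = np.random.default_rng(3)
incl_prob_true = np.array([0.9, 0.1, 0.7, 0.4, 0.55])
gamma_draws = rng.binomial(1, incl_prob_true, size=(5000, 5))

marginal_incl = gamma_draws.mean(axis=0)       # estimated Pr(gamma_j = 1 | y)
mpm = (marginal_incl >= 0.5).astype(int)       # median probability model
```

With these synthetic probabilities the MPM includes variables 1, 3, and 5, the ones whose marginal inclusion probabilities exceed one half.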
4.2 Comparison of MPM and HPM
Hahn & Carvalho (2015), however, noted that theoretical results on the performance of the median probability model under a non-orthogonal design matrix have not been established. In the following, we present a simulation study comparing MPM and HPM in the presence of multicollinearity.
We consider a linear model. We generate ten covariates from a normal distribution such that the first three are pairwise correlated with correlation around 0.95 and the rest are uncorrelated. We set n = 50. The response is generated according to (3.1) with β = (0, 0, 1, 1, 1, 1, 1, 0, 0, 0)′. Thus the data-generating model is Mγ = (3, 4, 5, 6, 7), where γ = (0, 0, 1, 1, 1, 1, 1, 0, 0, 0)′.

Table 4.1: Comparison between MPM and HPM for linear models in the presence of collinearity. Entries are the number of times the data-generating model is recovered over 1000 simulations.

  Prior                 Model   Result
  independent normal    MPM     27
                        HPM     603
  g prior               MPM     457
                        HPM     513
We consider an independent normal prior and the g prior with g = n for carrying out the Bayesian analyses. The results reported in Table 4.1 are based on 1000 replicated data simulations. The posterior probabilities of the models are computed using (2.7) and (2.9), respectively. We summarize the findings below.
• When the independent normal prior is placed on the parameters, MPM selects the data-generating model only 2.7% of the time, whereas HPM recovers it 60.3% of the time.
• When we use the g prior, MPM selects the data-generating model 45.7% of the time, while HPM recovers it 51.3% of the time, out of 1000 simulations.
This study clearly indicates that HPM outperforms MPM when collinearity is present in the data.
CHAPTER 5
SIGNIFICANCE OF HIGHEST POSTERIOR MODEL
5.1 Introduction
Empirically, as in the previous examples in Chapter 3 and in Chapter 4, highest posterior
model tends to select the data generating model more frequently.
The purpose of hypothesis testing is to evaluate the evidence in favor of a scientific theory, and Bayes factors offer a way of including other information when evaluating the evidence in favor of a null hypothesis. As Raftery (1999) described, "The hypothesis testing procedure defined by choosing the model with the higher posterior probability minimizes the total error rate, that is, the sum of Type I and Type II error rates. Note that frequentist statisticians sometimes recommend reducing the significance level in tests when the sample size is large; the Bayes factor does this automatically." Therefore, as Kass & Raftery (1995) pointed out, the essential strength of Bayes factors, and hence of highest posterior models, is their solid logical foundation. In particular, inference can be drawn even when competing models are non-nested. Most importantly, model uncertainty can be captured.
Model uncertainty can be examined by exploring the model space. George & McCulloch (1993) observed that even with a moderate number of Markov chain iterations, the non-visited models constitute merely a fraction of the total posterior probability of the model space. García-Donato & Martínez-Beneito (2013) explored this in detail: they conducted an extensive and time-consuming study and showed that stochastic search over the model space is often useful in recovering a large portion of the posterior probability of the model space in the presence of a finite number of parameters.
Recently, Hahn & Carvalho (2015) developed a method for variable selection in high-dimensional models which utilizes a posterior summary of the model space. However, their method includes a tuning parameter similar to that of the lasso, and hence suffers from the same problem of selecting that tuning parameter.
5.2 Consistency
There is an extensive literature on Bayesian model selection consistency for linear models (Casella et al., 2009). The notion of model selection consistency is defined with respect to the posterior probabilities of the models. If Mγ∗ is the true model, then a selection procedure is consistent if the posterior probability of Mγ∗ tends to 1 and that of every other model tends to 0 as the sample size increases; that is,
$$\operatorname*{plim}_{n\to\infty} P(M_{\gamma^*}\mid y) = 1 \qquad (5.1)$$
$$\operatorname*{plim}_{n\to\infty} P(M_{\gamma}\mid y) = 0 \ \text{ for any } \gamma \ne \gamma^*, \qquad (5.2)$$
where the probability limit is taken with respect to the true sampling distribution (3.1). For the g prior, Fernandez et al. (2001) proved (5.1) and (5.2) for g = n and other choices of g.
CHAPTER 6
VARIABLE SELECTION
6.1 Introduction
Despite of its straightforwardness, the problem of variable selections demands a detail
and careful treatment because of many issues which includes the fact that the number of
models 2p gets easily extremely large as p increases (Garcıa-Donato & Martınez-Beneito,
2013). The most difficult barrier of variable selection lies its infeasibility of visiting all the
models. With increasing p, enumerating the whole model space tends to an impossible
and infeasible assignment, even if we use ultra modern machineries. To give an essence of
difficulty of tackling the problem for large p we refer to Garcıa-Donato & Martınez-Beneito
(2013) where the authors noted that merely a binary representation of a model space with
p = 40 would occupy 5 terabytes of memory.
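The arithmetic behind this memory figure is easy to verify: there are 2^40 models, and each can be stored as a 40-bit (5-byte) inclusion vector.

```python
# Memory needed to store a binary representation of every model when p = 40.
p = 40
n_models = 2 ** p                    # about 1.1 trillion models
bytes_needed = n_models * (p // 8)   # 5 bytes (40 bits) per model
terabytes = bytes_needed / 2 ** 40   # binary terabytes
assert terabytes == 5.0
```

Doubling p to 80 would multiply this by roughly 2^40 again, which is why enumeration is hopeless and stochastic search is needed.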
6.2 Requirement of Sampling or Stochastic Search
Early work on exploring the model space includes Stochastic Search Variable Selection (SSVS) (George & McCulloch, 1993). George & McCulloch (1993) placed spike-and-slab (Ishwaran & Rao, 2005) type priors on the parameters and ran a Gibbs sampler on the model space, with the hope that the MCMC sampler would visit the models having higher posterior probability more frequently, while models with lower posterior probability would be visited with negligible frequency or not at all. After a complete MCMC run, one can identify good models from the frequencies with which the sampler visited them. This is called the SSVS method (George & McCulloch, 1993). Note that a sampler run for fewer iterations than the total number of models cannot visit every model in the model space, and the non-visited models are hoped to accumulate a negligible amount of the posterior model probability. Therefore the goal is to find at least some of the more probable models (Hahn & Carvalho, 2015).
Other search processes developed in the literature include the stochastic search of Berger & Molina (2005) (SSBM), the stochastic search of Casella & Moreno (2006), Shotgun Stochastic Search (Hans et al., 2007), Evolutionary Stochastic Search (Bottolo & Richardson, 2010), and Particle Stochastic Search (Shi & Dunson, 2011). More recently, Clyde et al. (2011) developed Bayesian adaptive sampling (BAS), a variant of without-replacement sampling according to adaptively updated marginal inclusion probabilities of the variables.
6.3 Limitation of Bayesian Lasso and its Extensions
On the other hand, applying shrinkage priors to the coefficients avoids these computational burdens altogether. For instance, the Bayesian lasso of Park & Casella (2008) uses a scale mixture of normal distributions (Andrews & Mallows, 1974) to obtain posterior estimates from the MCMC sampler, with the aim of achieving variable selection as its frequentist counterpart does. However, these procedures by themselves only provide estimates and do not perform variable selection; post-processing of the Bayesian output is needed to decide on the inclusion or exclusion of variables. Given these mechanisms, the best approach is probably to examine a posterior summary (for example, the highest posterior model) after placing the shrinkage priors.
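The scale-mixture representation behind the Bayesian lasso can be sketched directly: mixing a normal variance over an exponential distribution yields a Laplace (double-exponential) marginal (Andrews & Mallows, 1974; Park & Casella, 2008). The Monte Carlo check below (Python, with an illustrative λ) is not from the dissertation.

```python
import numpy as np

# beta | tau^2 ~ N(0, tau^2), tau^2 ~ Exponential(rate = lambda^2 / 2)
# marginally gives beta ~ Laplace(0, 1/lambda).
rng = np.random.default_rng(4)
lam = 2.0
tau2 = rng.exponential(scale=2.0 / lam ** 2, size=200_000)
beta = rng.normal(0.0, np.sqrt(tau2))

# Laplace(0, 1/lambda) has E|beta| = 1/lambda and Var(beta) = 2/lambda^2.
assert abs(np.mean(np.abs(beta)) - 1.0 / lam) < 0.01
assert abs(np.var(beta) - 2.0 / lam ** 2) < 0.02
```

This is the same data-augmentation device that makes the Bayesian lasso Gibbs sampler conditionally conjugate.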
6.4 Maximization in the Model Space
We revisit the definition of HPM here.

Definition 6.1. Highest Posterior Model. The highest posterior model (HPM) is the model having the highest posterior probability among all models in the model space; that is,
$$\mathrm{HPM} = \arg\max_{\gamma\in\mathcal{M}} \Pr(M_\gamma\mid y)$$

According to Definition 6.1, the HPM is merely the model with the highest posterior probability among the 2^p models that make up the model space. Therefore, one can find the HPM by optimizing over the model space with the posterior probability of a model as the objective function, and we can thus draw on the vast optimization literature.
However, in the variable selection setting, a model is represented by Mγ with γ ∈ {0, 1}^p, so the representation of each model is binary. If we want to maximize an objective function over the models, the solution must belong to the set of binary representations of the models; in this sense the problem resembles the integer programs of the simplex-method literature. The model space can thus be represented as the set of all 2^p possible p-dimensional binary vectors γ ∈ {0, 1}^p. This unique structure of the model space severely limits the choice of optimization methods.
6.5 Simulated Annealing
Because of the infeasibility of enumerating the whole model space and because of the features of the maximization problem, we propose to conduct the maximization stochastically. Simulated annealing (SA) is a widely known stochastic optimization routine, and we therefore explore its feasibility and tenacity for variable selection.
In this section we provide an introduction to the SA algorithm; for details see Bertsimas & Tsitsiklis (1993). Suppose there exists a finite set S and a real-valued function J on S; J is the objective function we want to minimize over S. Let S∗ ⊂ S be the set of global minima of J, assumed to be a proper subset of S. For each i ∈ S, there exists a set S(i) ⊂ S − {i}, called the set of neighbors of i. In addition, for every i there exists a collection of positive coefficients q_{ij}, j ∈ S(i), such that Σ_{j∈S(i)} q_{ij} = 1; the q_{ij} form a transition matrix whose elements give the probabilities of proposing a move from i to j. It is assumed that j ∈ S(i) if and only if i ∈ S(j). We define a nonincreasing function T : N → (0, ∞), called the cooling schedule, where N is the set of positive integers and T(t) is the temperature at time t.
Let x(t) be a discrete-time inhomogeneous Markov chain. The search process starts at an initial state x(0) ∈ S. The steps of the algorithm are as follows.
1. Fix i = current state of x(t).
2. Choose a neighbor j of i at random according to probability q_{ij}.
3. Once j is chosen, the next state x(t + 1) is determined as follows:
   - If J(j) ≤ J(i), then x(t + 1) = j.
   - If J(j) > J(i), then x(t + 1) = j with probability exp[−(J(j) − J(i))/T(t)], and x(t + 1) = i otherwise.
   - If j ≠ i and j ∉ S(i), then Pr[x(t + 1) = j | x(t) = i] = 0.
4. Repeat the above steps until convergence.
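The four steps above can be sketched on a toy model space (Python; the objective, proposal, and cooling choices are simplifications for illustration — uniform single-bit-flip proposals rather than the posterior-weighted q_{ij} introduced later in Section 6.6).

```python
import math
import random

# Toy problem: gamma in {0,1}^p, J counts mismatches with a known target,
# so the global minimum J = 0 is attained exactly at the target vector.
random.seed(5)
p = 12
target = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1]
J = lambda g: sum(a != b for a, b in zip(g, target))

x = [0] * p
best = x[:]
for t in range(1, 5001):
    T = 1.0 / math.log(t + 1)          # logarithmic cooling schedule
    j = x[:]
    k = random.randrange(p)
    j[k] = 1 - j[k]                    # step 2: propose a neighbor (bit flip)
    dJ = J(j) - J(x)
    if dJ <= 0 or random.random() < math.exp(-dJ / T):
        x = j                          # step 3: accept downhill, or uphill w.p.
    if J(x) < J(best):
        best = x[:]                    # track the best model visited
assert J(best) == 0                    # the target is found
```

Early on, the high temperature lets the chain accept uphill moves and escape local modes; as T(t) decreases, the chain concentrates near the global minimum.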
According to this algorithm, x(t) converges to the optimal set S∗. Mathematically, for k = 0, 1, 2, . . . and all j ∈ S,
$$\lim_{n\to\infty}\Pr\left(x(n+k)\in S^*\mid x(k)=j\right) = \lim_{n\to\infty}\Pr\left(J\big(x(n+k)\big)=J^*\mid x(k)=j\right) = 1 \qquad (6.1)$$
where J∗ = min_{j∈S} J(j) and S∗ = {i : i ∈ S, J(i) = J∗}. Note that the acceptance probability function defined in the third step is similar to the one in the Metropolis–Hastings sampler (Hastings, 1970); in this spirit, the algorithm is a stochastic search.
6.6 Our Setup
Now let us frame the variable selection problem in terms of the elements of SA. The main SA routine is formulated for minimization, but it converts easily to maximization. Let x(t) = γ(t), where the j-th element of γ is 1 or 0 according as the j-th covariate is present or absent in the model, and set J equal to the posterior probability of the model Mγ. We want to maximize the posterior probabilities over the model space by applying the simulated annealing algorithm. At the end of a run of this algorithm we expect to obtain the maximum posterior probability and hence the corresponding highest posterior model, i.e.,
$$\hat\gamma = \arg\max_{\gamma} \Pr(M_\gamma\mid y)$$
There are a couple of issues to deal with before we actually perform the maximization:
• It is crucial to set a good cooling schedule; an appropriately chosen schedule accelerates convergence, whereas when T is very small the time it takes for the Markov chain x(t) to reach equilibrium can be excessive. The main role of the cooling schedule is that, early in the search, it helps the algorithm escape local modes, and once the search is actually in the neighborhood of the global optimum, the decreasing temperature focuses the search in that region, thereby locating the optimum. A number of functional forms for the cooling schedule have been suggested in the literature. We set the temperature at time t to
$$T(t) = \frac{p\left(J(j) - J(i)\right)}{\log(t+1)}$$
• Let the transition matrix be denoted by Q. The (i, j)-th element of Q is taken as
$$q_{ij} = \frac{\text{posterior probability of the } j\text{-th model}}{\text{sum of the posterior probabilities of the neighbors of the } i\text{-th model}}$$
where the j-th model belongs to the neighborhood of the i-th model.
• Next we define the neighborhood of a model. Note that we could specify the whole model space as the neighborhood of a model, but that would be tantamount to enumerating all the models, which we want to avoid. On the other hand, if we specify very few models in the neighborhood, a large number of steps is likely to be required to reach the region of high posterior probability. This forces us to balance the number of models in the neighborhood region.
For any given model Mγ we define the collection {Mγ, Mγ′, Mγ′′} as the neighborhood, where
  1. A deletion or an addition: γ′ is such that |γ′ − γ| = 1; that is, the model Mγ′ can be obtained from model Mγ by either adding or deleting one predictor.
  2. A swap: γ′′ satisfies γ′′ᵀ1 = γᵀ1 and |γ′′ − γ| = 2; that is, model Mγ′′ can be obtained from model Mγ by swapping one included predictor for one excluded predictor.
For example, when p = 6, suppose that at time t the model Mγ = {2, 3, 4} is selected. Then the neighborhood is given by
  Mγ = {2, 3, 4},
  Mγ′ ∈ { {1, 2, 3, 4}, {3, 4}, {2, 4}, {2, 3}, {2, 3, 4, 5}, {2, 3, 4, 6} },
  Mγ′′ ∈ { {1, 2, 3}, {2, 3, 5}, {2, 3, 6}, {1, 2, 4}, {2, 4, 5}, {2, 4, 6}, {1, 3, 4}, {3, 4, 5}, {3, 4, 6} }.
Note that our selection provides the advantage of reaching a different neighborhood region at every step and thus eliminates the possibility of keeping old models in the search region, as happens in Berger & Molina (2005) and Hans et al. (2007). In this way our approach is different in that the search procedure requires neither more than one processor nor a complicated and long Markov chain to converge. To our knowledge, no such search process has been developed before. Also notice that, in the example above, the number of models in the neighborhood is 16, which is significantly smaller than the total number of models, 2^6 = 64.
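The neighborhood is straightforward to enumerate; the hypothetical helper below (Python, not the dissertation's code) reproduces the count of 16 for the p = 6 example.

```python
# Neighborhood of Section 6.6: the current model, every one-variable
# addition or deletion, and every swap of one included for one excluded
# variable.  Models are 0/1 inclusion tuples.
def neighborhood(gamma):
    p = len(gamma)
    nbd = [tuple(gamma)]
    for k in range(p):                          # additions and deletions
        g = list(gamma); g[k] = 1 - g[k]
        nbd.append(tuple(g))
    ins = [k for k in range(p) if gamma[k] == 1]
    outs = [k for k in range(p) if gamma[k] == 0]
    for a in ins:                               # swaps keep the model size fixed
        for b in outs:
            g = list(gamma); g[a], g[b] = 0, 1
            nbd.append(tuple(g))
    return nbd

# p = 6, current model {2, 3, 4} (1-indexed), as in the example above:
gamma = (0, 1, 1, 1, 0, 0)
nbd = neighborhood(gamma)
assert len(nbd) == 1 + 6 + 3 * 3                # 16 models, versus 2^6 = 64
```

In general the neighborhood has size 1 + p + k(p − k) for a model with k included variables, which grows only polynomially in p.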
• The widely accepted acceptance probability function for the SA algorithm is given by
$$a_{T(t)}(j\mid i) = \min\left[1,\ \exp\left\{-\frac{J(j)-J(i)}{T(t)}\right\}\right] \qquad (6.2)$$
This is known as the Gibbs acceptance function. As we show later, use of this function meets the criteria for convergence.
Definition 6.2. We name the above setup of SA as SA-HPM.
6.7 Convergence
Cruz & Dorea (1998) provided simple, easily verified conditions for the convergence of the simulated annealing algorithm. We discuss their conditions and compare our setup with that of Cruz & Dorea (1998) below.
Condition 6.3. The transition matrix Q is irreducible with q_{ii} > 0 for all i ∈ S.
It follows that, when S is finite, there exists an n0 ≥ 0 such that
$$\min\left\{q^{(n_0)}_{ij} : i, j \in S\right\} > 0$$
Condition 6.4. For T(t) ↓ 0, let a_t(j|i) = a_{T(t)}(j|i) > 0. Then lim_{t→∞} a_t(j|i) exists, with
$$a_t(j\mid i) \downarrow 0 \ \text{ if } J(j) > J(i), \qquad a_t(j\mid i) \uparrow 1 \ \text{ if } J(j) < J(i).$$
Moreover, if J(j′) > J(j) > J(i), then
$$\frac{a_t(j'\mid i)}{a_t(j\mid i)} \to 0 \quad\text{and}\quad \frac{a_t(j'\mid i)}{a_t(j'\mid j)} \to 0.$$
Theorem 6.5. (Cruz & Dorea, 1998) Under the acceptance probability function a_{T(t)} given by (6.2) and under Conditions 6.3 and 6.4, the simulated annealing algorithm of Section 6.5 converges in the sense of (6.1).
Lemma 6.6. The model space M is finite.
Proof. Suppose there are p predictors. Then the model space has 2^p elements, which is always finite.
Lemma 6.7. q_{ii} > 0.
Proof. Suppose at time t the i-th model is selected. Since the i-th model itself belongs to the neighborhood of the i-th model under our neighborhood construction, we have
$$q_{ii} = \frac{\text{posterior probability of the } i\text{-th model}}{\text{sum of the posterior probabilities of the neighbors of the } i\text{-th model}} = \frac{P(M_i)P(y\mid M_i)}{\sum_{\gamma\in\mathrm{nbd}(i)} P(M_\gamma)P(y\mid M_\gamma)}, \quad\text{using (2.3)}.$$
By definition, the prior probability of any model is strictly greater than 0, and posterior probabilities are never 0 in practice for any given proper prior. This completes the proof.
Lemma 6.8. Q is irreducible.
Proof. Note that each model has its own binary representation through γ. Given two binary sequences, one can be reached from the other through a finite chain of neighbors by transforming one coordinate at a time (0 → 1 or 1 → 0). Hence P(Mj | Mi) > 0 for all i, j ∈ M, trivially completing the proof.
Lemma 6.9. The probability of moving to the j-th model from the i-th model in p steps is positive, i.e.,
$$q^{(p)}_{ij} > 0$$
Proof. Consider the null model M0 and the full model M1 (say); these are the two extreme models. One can move from M0 to M1 via the steps (0, 0, . . . , 0, 0) → (1, 0, . . . , 0, 0) → · · · → (1, 1, . . . , 1, 0) → (1, 1, . . . , 1, 1). Clearly, there are p steps. Since we have considered the two extreme models, any other model can be reached in p0 (p0 ≤ p) steps. This completes the proof.
Theorem 6.10. The convergence of the simulated annealing (SA) algorithm with acceptance function (6.2) holds in the sense of (6.1).
Proof. The conditions of Theorem 1 of Cruz & Dorea (1998) are satisfied by Lemmas 6.6–6.9 and the Gibbs acceptance probability function, as in Example 1 of Cruz & Dorea (1998). Therefore the SA algorithm converges according to Theorem 1 of Cruz & Dorea (1998), completing the proof.
In practice, to make the computation stable, we suggest calculating the log posterior probabilities instead of the actual posterior probabilities and using these as the estimates for the proposal distribution. Of course, one has to transform these quantities back to the exponential scale to obtain a move from the current model to its neighborhood. In this process we also need to guard against the exponential terms overflowing, by subtracting a constant from the exponents; for instance, we subtract the maximum of the log posterior probabilities of all the neighboring models.
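The stabilization described above is the standard log-sum-exp trick. A minimal Python sketch (illustrative only; the dissertation's computations are in R):

```python
import math

def proposal_probs(log_posteriors):
    """Normalize log posterior probabilities of the neighborhood models into
    a proposal distribution.  Subtracting the maximum before exponentiating
    keeps the exponentials from overflowing (the log-sum-exp trick)."""
    m = max(log_posteriors)
    weights = [math.exp(lp - m) for lp in log_posteriors]
    total = sum(weights)
    return [w / total for w in weights]

# Log posteriors of this magnitude would underflow a naive exp().
probs = proposal_probs([-1e4, -1e4 + math.log(2.0), -1e4])
```

The resulting probabilities depend only on the differences between log posteriors, so the shift by the maximum does not change the proposal distribution.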
CHAPTER 7
EMPIRICAL STUDY
In the following we describe a series of examples to assess the performance of the proposed SA-HPM method. We compare the proposed method with HPM, SSVS (George & McCulloch, 1993), MPM (Barbieri & Berger, 2004), lasso (Tibshirani, 1996), SCAD (Fan & Li, 2001), adaptive lasso (Zou, 2006), and elastic net (Zou & Hastie, 2005) (whenever feasible). We compare the various methods with respect to how many times each method recovers the data generating model; HPM is computed by complete enumeration.
Example 7.1. Linear Model.
The simulation setup is similar to that considered in George & McCulloch (1993). We set n = 60, p = 5, β = (0, 0, 0, 1, 1.2), so the data generating model is (4, 5). Covariates are generated independently from a standard normal distribution. The response is generated according to (3.1) with σ² = 1. The value σ² = 1 is kept the same for all other examples considered below. In this way, 100 datasets are generated keeping the predictors the same. The results are provided in Table 7.1. We record how many times the data generating model is selected by the various methods, as well as the time taken by each method. In particular, since Bayesian procedures are known to be more time-consuming due to analysis via Markov chain sampling, we show how our method addresses this issue.
To carry out SSVS and MPM, all the continuous parameters are analytically integrated out utilizing the conjugate structure in the model. For the derivation of the posterior distribution of γ see García-Donato & Martínez-Beneito (2013). The calculations for SSVS and MPM were done using 4 processors in parallel, on an Intel(R) Core(TM) i5-4210U CPU @ 1.70 GHz machine. On the other hand, since we have used the g prior with g = n, an analytical expression of the posterior probability is available, and thus the computation times of HPM and the proposed SA-HPM method are comparable to those of the frequentist methods. The time reported here is the total time taken for 100 simulations. All code is written in R (R Core Team, 2015). In addition, the pmomMarginalK function of the R package mombf (Rossell et al., 2014) has been used to obtain the Laplace approximation to the marginal likelihood for the SA-HPM method when the pMOM prior was used. We refer to this method as SA-HPM-pMOM. When the g prior has been used, the proposed method is simply called SA-HPM.
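Since the g-prior marginal likelihood is available in closed form, the objective function can be evaluated cheaply. One standard closed form of the g-prior Bayes factor of a model Mγ (with pγ predictors) against the intercept-only null model is BF(Mγ : M0) = (1 + g)^{(n−1−pγ)/2} / (1 + g(1 − Rγ²))^{(n−1)/2}, which depends on the data only through Rγ². A Python sketch, for illustration only (the dissertation's computations are in R, and the data below merely mimic the Example 7.1 settings):

```python
import math
import random

def r_squared(y, X):
    """R^2 of the OLS regression of y on the columns of X plus an intercept,
    computed from the normal equations via Gaussian elimination."""
    n = len(y)
    Z = [[1.0] + list(row) for row in X]       # prepend the intercept column
    k = len(Z[0])
    A = [[sum(Z[i][a] * Z[i][b] for i in range(n)) for b in range(k)] for a in range(k)]
    c = [sum(Z[i][a] * y[i] for i in range(n)) for a in range(k)]
    for col in range(k):                       # elimination with partial pivoting
        piv = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        c[col], c[piv] = c[piv], c[col]
        for r in range(col + 1, k):
            f = A[r][col] / A[col][col]
            for j in range(col, k):
                A[r][j] -= f * A[col][j]
            c[r] -= f * c[col]
    beta = [0.0] * k
    for r in range(k - 1, -1, -1):             # back substitution
        beta[r] = (c[r] - sum(A[r][j] * beta[j] for j in range(r + 1, k))) / A[r][r]
    ybar = sum(y) / n
    ss_res = sum((y[i] - sum(Z[i][a] * beta[a] for a in range(k))) ** 2 for i in range(n))
    ss_tot = sum((v - ybar) ** 2 for v in y)
    return 1.0 - ss_res / ss_tot

def log_bf_gprior(y, X, g=None):
    """log Bayes factor of the model (columns of X) against the null model
    under Zellner's g prior, with g = n by default as in the text."""
    n, p = len(y), len(X[0])
    g = float(n) if g is None else g
    r2 = r_squared(y, X)
    return 0.5 * (n - 1 - p) * math.log1p(g) - 0.5 * (n - 1) * math.log1p(g * (1.0 - r2))

# Illustrative data in the spirit of Example 7.1: n = 60, p = 5,
# beta = (0, 0, 0, 1, 1.2), standard normal covariates, sigma^2 = 1.
random.seed(11)
n, beta = 60, [0.0, 0.0, 0.0, 1.0, 1.2]
covs = [[random.gauss(0, 1) for _ in range(5)] for _ in range(n)]
y = [sum(b * xi for b, xi in zip(beta, row)) + random.gauss(0, 1) for row in covs]

bf_true = log_bf_gprior(y, [[row[3], row[4]] for row in covs])   # model (4, 5)
bf_noise = log_bf_gprior(y, [[row[0], row[1]] for row in covs])  # model (1, 2)
```

For the null model the formula reduces to a log Bayes factor of 0, and the data generating model receives a much larger log Bayes factor than a noise-only model, which is exactly what the SA search exploits.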
The number of iterations for the Gibbs sampler in the SSVS and MPM methods is set to 10,000; the first 2,000 samples are discarded as burn-in.
Lasso and elastic net have been fitted using the R package glmnet (Friedman et al., 2010), which implements the coordinate descent algorithm. SCAD and MCP are fitted using the ncvreg package (Breheny & Huang, 2011). The adalasso function of the package parcor (Kraemer et al., 2009) is used for fitting the adaptive lasso. SIS and related methods have been computed using the SIS function of the SIS package (Fan et al., 2015). The model size is bounded by the number of regressors by setting the argument nsis appropriately.
The results clearly indicate that all four Bayesian methods perform similarly. On the other hand, all frequentist methods perform similarly in terms of selecting the data generating model. As far as time is concerned, since the SSVS and MPM procedures are implemented using Markov chain sampling, they are more time-consuming.
Example 7.2. Role of Correlation.
This data generating scheme is based on an example in George & McCulloch (1993). The setup is the same as in Example 7.1 except that here we set β = (0, 0, 0, 1, 1.2), x1, x2, and x4 are generated independently from a standard normal distribution, and corr(x3, x5) = 0.95. So x3 can be thought of as a substantial proxy for x5. Therefore we expect (3, 4) instead of (4, 5) to often show up as the final model. The results are reported in Table 7.2.
Table 7.1: Number of times the correct model is obtained based on 100 repetitions: Model with Uncorrelated Predictors.

Method          Correct  FDR    FNR  Time
SSVS            91%      0.032  0    1.7h
MPM             83%      0.062  0    1.6h
HPM             91%      0.033  0    8.5s
Lasso           77%      0.083  0    10.9s
SCAD            78%      0.100  0    5.6s
Elastic Net     55%      0.174  0    12.1s
Adaptive Lasso  80%      0.087  0    1.2m
MCP             76%      0.108  0    5.6s
SIS-SCAD        79%      0.091  0    31.1s
SIS-MCP         83%      0.072  0    9.1s
ISIS-SCAD       78%      0.092  0    10.6s
ISIS-MCP        82%      0.073  0    10.5s
SA-HPM          90%      0.037  0    8.3s
All Bayesian methods perform poorly in this example, with MPM having the worst performance. Among the frequentist methods, SIS-SCAD has the best performance.
Example 7.3. Lasso example.
This example was considered by Tibshirani (1996). Set n = 20 and p = 8. Covariates are generated from a standard normal distribution with pairwise correlation between x_i and x_j equal to 0.5^{|i−j|}. β = (3, 1.5, 0, 0, 2, 0, 0, 0)′, so that the data generating model is (1, 2, 5). The results are reported in Table 7.3.
Again the performance of the Bayesian methods is superior. However, SSVS and MPM took around 2 hours using 4 processors, while SA-HPM takes only 12.6 seconds. On the other hand, ISIS-MCP remains the best among the frequentist methods considered.
Example 7.4. Adaptive Lasso Example (lasso fails).
This setup was considered by Zou (2006), who used this example to show that the lasso is not consistent for variable selection. Zou (2006) extended the lasso to estimate
Table 7.2: Number of times the correct model is obtained based on 100 repetitions: Linear regression in presence of collinearity.

Method          Correct  FDR    FNR    Time
SSVS            58%      0.182  0.078  2.0h
MPM             66%      0.148  0.067  1.9h
HPM             66%      0.158  0.085  15.7s
Lasso           47%      0.221  0.067  10.5s
SCAD            49%      0.267  0.136  5.0s
Elastic Net     0%       0.393  0      19.3s
Adaptive Lasso  53%      0.214  0.065  1.3m
MCP             30%      0.365  0.178  5.1s
SIS-SCAD        63%      0.183  0.082  10.0s
SIS-MCP         35%      0.338  0.198  10.1s
ISIS-SCAD       62%      0.192  0.099  11.6s
ISIS-MCP        34%      0.335  0.196  11.6s
SA-HPM          65%      0.171  0.105  12.1s
parameters adaptively and established that the adaptive lasso is a consistent variable selection procedure that also possesses the oracle property.
For this example, n = 60 and p = 4. Predictors are generated from a standard normal distribution with corr(x_j, x_k) = −0.39 for j < k < 4 and corr(x_j, x_4) = 0.23 for j < 4. We set β = (5.6, 5.6, 5.6, 0)′, so the data generating model is (1, 2, 3). The results are reported in Table 7.4.
As expected, all Bayesian methods consistently recover the true model. The time taken by SSVS and MPM is around 1.2 hours using 4 processors. The proposed SA-HPM selects the data generating model successfully 97% of the time and takes only 6.1 seconds. On the other hand, the lasso performs poorly, as illustrated by Zou (2006), and the adaptive lasso is the best performing frequentist method. Surprisingly, the elastic net was never able to find the data generating model in 100 simulations.
Example 7.5. p = 30 and Presence of Correlation.
Table 7.3: Number of times the correct model is obtained based on 100 repetitions: Lasso example.

Method          Correct  FDR    FNR    Time
SSVS            95%      0.01   0.002  2.3h
MPM             94%      0.016  0.002  2.0h
HPM             87%      0.029  0.007  41.5s
Lasso           36%      0.209  0      10.9s
SCAD            53%      0.163  0.002  7.8s
Elastic Net     7%       0.384  0      11.2s
Adaptive Lasso  58%      0.148  0.002  58.0
MCP             57%      0.162  0.009  4.6s
SIS-SCAD        43%      0.206  0.002  10.1s
SIS-MCP         58%      0.152  0.010  10.5s
ISIS-SCAD       48%      0.185  0.005  11.0s
ISIS-MCP        61%      0.144  0.006  11.9s
SA-HPM          81%      0.044  0.019  13.0s
We consider this example from Kuo & Mallick (1998) to assess the performance of SA-HPM along with the other methods when p is large. We set n = 100 and p = 30. Covariates are generated from a standard normal distribution with pairwise correlation 0.5.
β = (0, . . . , 0, 1, . . . , 1, 0, . . . , 0)′, where each block has length 10. So the data generating model is (11, . . . , 20). The results are presented in Table 7.5.
Notice that the calculation of HPM is omitted due to the infeasibility of enumerating 2^30 models. Otherwise, the performances of the Bayesian methods are similar. However, SSVS and MPM take a significant amount of time compared to that taken by the proposed SA-HPM. The reported time is that required by the methods for 100 repetitions using 4 processors in parallel; since each SSVS and MPM repetition takes around 40 to 50 minutes, their calculation is done using 100 processors in parallel.

Interestingly, almost all the frequentist regularized methods except the adaptive lasso and ISIS-MCP often fail to select the data generating model.
Example 7.6. p = 40 and Presence of Correlation.
Table 7.4: Number of times the correct model is obtained based on 100 repetitions: Adaptive lasso example (lasso fails).

Method          Correct  FDR    FNR    Time
SSVS            97%      0.008  0      1.2h
MPM             95%      0.012  0      1.2h
HPM             97%      0.01   0      4.7s
Lasso           1%       0.01   0.248  12.0s
SCAD            90%      0.025  0      4.7s
Elastic Net     0%       0.25   0      13.4s
Adaptive Lasso  94%      0.343  0      1.4m
MCP             82%      0.045  0      4.6s
SIS-SCAD        78%      0.055  0      9.8s
SIS-MCP         78%      0.055  0      11.7s
ISIS-SCAD       79%      0.052  0      11.9s
ISIS-MCP        79%      0.052  0      11.9s
SA-HPM          97%      0.008  0      6.1s
This example is taken from Zou & Hastie (2005) and illustrates the performance of the proposed SA-HPM method. Set n = 100 and p = 40. Covariates are generated from a standard normal distribution with pairwise correlation 0.5.
β = (0, . . . , 0, 2, . . . , 2, 0, . . . , 0, 2, . . . , 2)′, where each block has length 10, so that the data generating model (true model) is (11, . . . , 20, 31, . . . , 40). σ² is set to 1 as in the previous examples. The results are presented in Table 7.6.
Here we omit results for the SSVS, MPM, and HPM methods because they are computationally expensive. However, SA-HPM outperforms the frequentist methods.
Example 7.7. p > n, p = 200, n = 100.
Motivated by Song & Liang (2015), we consider this experiment where p > n. We set n = 100, p = 200, and β = (1, . . . , 1, 0, . . . , 0)′ with 8 ones followed by 192 zeros. Each row of the design matrix X was independently drawn from a multivariate normal distribution with mean 0 and identity covariance matrix. Table 7.7 reports the results.
Table 7.5: Number of times the correct model is obtained based on 100 repetitions: p = 30.

Method          Correct  FDR    FNR  Time
SSVS            89%      0.01   0    20.8h
MPM             96%      0.004  0    21.1h
HPM             -        -      -    -
Lasso           0%       0.319  0    24.4s
SCAD            77%      0.052  0    11.0s
Elastic Net     0%       0.414  0    12.5s
Adaptive Lasso  85%      0.018  0    1.3m
MCP             77%      0.037  0    11.3s
SIS-SCAD        66%      0.067  0    16.5s
SIS-MCP         75%      0.044  0    19.0s
ISIS-SCAD       76%      0.048  0    21.2s
ISIS-MCP        82%      0.033  0    24.4s
SA-HPM          97%      0.003  0    2.7m
Note that the g prior defined by (2.8) is not defined in p > n settings, so we apply nonlocal priors.
The function pmomLM of the package mombf was used for the MCMC method of Johnson & Rossell (2012) with the pMOM prior. We refer to this method as JR12. A Beta-Binomial prior with B(1, 20) was used on the models. A noninformative inverse gamma prior, IG(0.001, 0.001), was placed on σ². The dispersion parameter is set to τ = 2.85 for the pMOM prior on β, as suggested by Johnson (2013).
Here we omit results for the SSVS, MPM, and HPM methods because they are computationally expensive. SA-HPM-pMOM is the most successful method in recovering the data generating model and outperforms all other methods.
Example 7.8. p > n, p = 200, n = 100 and Presence of Correlation.
Motivated by Song & Liang (2015), we consider this experiment where p > n and multicollinearity is present. We set n = 100, p = 200, and β = (1, . . . , 1, 0, . . . , 0)′ with 8 ones followed by 192 zeros. Each row of the
design matrix X was independently drawn from a multivariate normal distribution having
Table 7.6: Number of times the correct model is obtained based on 100 repetitions: Elastic net example. p = 40.

Method          Correct  FDR    FNR    Time
SSVS            -        -      -      -
MPM             -        -      -      -
HPM             -        -      -      -
Lasso           1%       0.180  0      13.0s
SCAD            84%      0.022  0      24.7
Elastic Net     0%       0.963  0.155  12.8s
Adaptive Lasso  100%     0      0      2.1m
MCP             83%      0.012  0      13.1s
SIS-SCAD        76%      0.025  0      24.7s
SIS-MCP         81%      0.017  0      23.1s
ISIS-SCAD       78%      0.022  0      34.1s
ISIS-MCP        85%      0.015  0      30.8s
SA-HPM          100%     0      0      2.8m
mean 0 and covariance matrix Σ with diagonal entries equal to 1 and off-diagonal entries
equal to 0.5. Table 7.8 reports the results.
SA-HPM-pMOM again appears to be the most successful method in picking up the data generating model. These examples demonstrate that the proposed SA-HPM with pMOM priors is a potentially better alternative to the SIS, ISIS, and JR12 methods when p > n.
Example 7.9. p > n, p = 1000, n = 200.
Motivated by Song & Liang (2015), we consider this experiment where p > n. We set n = 200, p = 1000, and β = (1, . . . , 1, 0, . . . , 0)′ with 8 ones followed by 992 zeros. Each row of the design matrix X was independently drawn from a multivariate normal distribution with mean 0 and identity covariance matrix. Table 7.9 reports the results.
Example 7.10. p > n, p = 1000, n = 200 and Presence of Correlation.
Motivated by Song & Liang (2015), we consider this experiment where p > n and multicollinearity is present. We set n = 200, p = 1000, and β = (1, . . . , 1, 0, . . . , 0)′ with 8 ones followed by 992 zeros. Each row
Table 7.7: Number of times the correct model is obtained based on 100 repetitions: p > n, p = 200, n = 100.

Method          Correct  FDR    FNR    Time
SSVS            -        -      -      -
MPM             -        -      -      -
HPM             -        -      -      -
JR12            91%      0      0.003  55.2s
Lasso           0%       0.526  0      14.8s
SCAD            4%       0.440  0      22.3s
Elastic Net     0%       0.664  0      15.0s
Adaptive Lasso  24%      0.216  0      1.7m
MCP             29%      0.202  0      20.6s
SIS-SCAD        15%      0.351  0.002  18.2s
SIS-MCP         29%      0.239  0.002  17.7s
ISIS-SCAD       0%       0.765  0      48.7s
ISIS-MCP        1%       0.757  0      51.1s
SA-HPM          100%     0      0      5.3m
of the design matrix X was independently drawn from a multivariate normal distribution
having mean 0 and covariance matrix Σ with diagonal entries equal to 1 and off-diagonal
entries equal to 0.5. Table 7.10 reports the results.
Example 7.11. Real Data Example: Ozone-35.
The Ozone-35 data has been considered by García-Donato & Martínez-Beneito (2013). It has n = 178 observations and p = 35 covariates. For a detailed description of the original Ozone dataset, we refer to Table 7.11, which also appears in Casella & Moreno (2006). We have considered the covariates x3 − x7, along with the squared terms and interaction terms, as in García-Donato & Martínez-Beneito (2013). The dataset is readily available in the R package BayesVarSel (Garcia-Donato & Forte, 2015). García-Donato & Martínez-Beneito (2013) illustrated that the posterior probability of the median probability model is 23 times lower than that of the highest posterior model. The model space is huge, which makes complete enumeration infeasible. The g-prior was used so that the Bayes factors could be computed easily, and continuous parameters were integrated out as before.
Table 7.8: Number of times the correct model is obtained based on 100 repetitions: p > n, p = 200, n = 100 and Presence of Correlation.

Method          Correct  FDR    FNR     Time
SSVS            -        -      -       -
MPM             -        -      -       -
HPM             -        -      -       -
JR12            72%      0      0.002   1.2m
Lasso           0%       0.665  0       14.2s
SCAD            38%      0.125  0       14.8s
Elastic Net     0%       0.743  0       15.3s
Adaptive Lasso  33%      0.116  0       1.5m
MCP             48%      0.080  0.0001  11.0s
SIS-SCAD        0%       0.318  0.010   19.7s
SIS-MCP         0%       0.193  0.010   28.1s
ISIS-SCAD       1%       0.757  0       1.3m
ISIS-MCP        4%       0.721  0       1.2m
SA-HPM          83%      0.005  0.001   6.1m
Since this is a real dataset, the data generating model is not known. As discussed before, the HPM is perceived to be a good model. Using a hugely time-consuming study, García-Donato & Martínez-Beneito (2013) calculated the HPM. The Bayes factors of the HPM and MPM against M0 are reported in Table 7.12. The main point is that the proposed SA-HPM method was able to find the HPM 100 times out of 100 repetitions. In particular, the SA-HPM method is extremely useful even for large model spaces.
Example 7.12. Independence of the Starting Model

The theory of the simulated annealing method holds for any starting point in the state space. In the examples considered above, the starting model was the null model M0. In this example we show that, in our setup, convergence indeed does not depend on the starting model. We revisit the Tibshirani (1996) example, where we generate the response only once using the model (1, 2, 5) with β = (3, 1.5, 0, 0, 2, 0, 0, 0). We plot the log(Bayes factor) of the models against the null model M0, with the number of steps to reach the HPM
Table 7.9: Number of times the correct model is obtained based on 100 repetitions: p > n, p = 1000, n = 200.

Method          Correct  FDR    FNR  Time
SSVS            -        -      -    -
MPM             -        -      -    -
HPM             -        -      -    -
JR12            100%     0      0    8.0m
Lasso           2%       0.484  0    41.6s
SCAD            28%      0.278  0    1.2m
Elastic Net     0%       0.672  0    43.9s
Adaptive Lasso  60%      0.085  0    13.0m
MCP             58%      0.142  0    1.2m
SIS-SCAD        28%      0.276  0    24.5s
ISIS-SCAD       0%       0.789  0    3.4m
SIS-MCP         56%      0.168  0    22.9s
ISIS-MCP        0%       0.789  0    3.6m
SA-HPM          100%     0      0    5.5h
model on the horizontal axis, for different starting models, in Figure 7.1. The HPM here is the model (1, 2, 5), with log(Bayes factor) = 20.909 against M0.
Table 7.10: Number of times the correct model is obtained based on 100 repetitions: p > n, p = 1000, n = 200 and Presence of Correlation.

Method          Correct  FDR    FNR    Time
SSVS            -        -      -      -
MPM             -        -      -      -
HPM             -        -      -      -
JR12            100%     0      0      8.2m
Lasso           0%       0.765  0      49.5s
SCAD            73%      0.041  0      59.0s
Elastic Net     0%       0.832  0      51.9s
Adaptive Lasso  74%      0.031  0      9.3m
MCP             87%      0.015  0      46.8s
SIS-SCAD        0%       0.726  0.003  1.0m
ISIS-SCAD       1%       0.782  0      6.2m
SIS-MCP         0%       0.652  0.003  39.6s
ISIS-MCP        0%       0.789  0      7.5m
SA-HPM          100%     0      0      5.5h
Table 7.11: Description of the Ozone dataset.

Variable  Description
y         Response = daily maximum 1-hour-average ozone reading (ppm) at Upland, CA
x1        Month: 1 = January, . . . , 12 = December
x2        Day of month
x3        Day of week: 1 = Monday, . . . , 7 = Sunday
x4        500-millibar pressure height (m) measured at Vandenberg AFB
x5        Wind speed (mph) at Los Angeles International Airport (LAX)
x6        Humidity (%) at LAX
x7        Temperature (F) measured at Sandburg, CA
x8        Inversion base height (feet) at LAX
x9        Pressure gradient (mm Hg) from LAX to Daggett, CA
x10       Visibility (miles) measured at LAX
Table 7.12: The HPM and MPM for Ozone-35 data. The last two columns provide the Bayes factor against the null model and its log, respectively.

Serial No  Model           Bayes Factor  log(Bayes Factor)
HPM        7 10 23 26 29   1.02E+47      108.2364944
MPM        21 22 23 29     4.34E+45      105.0834851
Figure 7.1: Solution path for different starting models for the SA-HPM method. The points are the log Bayes factors of the visited models against the null model. The log(Bayes factor) of the data generating model (1, 2, 5) is 20.909.
CHAPTER 8
LOGISTIC REGRESSION
8.1 Introduction
Computation of the marginal likelihood of the models in the model space is one of the main keys to computing the HPM. In linear models with conjugate priors (examples include the normal prior and the g prior) we are able to obtain an analytical expression for the marginal likelihoods, which constitutes the objective function for the optimization needed to find the HPM. However, if we place a non-conjugate prior on the coefficients, then we fail to derive the marginal likelihood analytically and the evaluation of the objective function becomes a non-trivial task.

In logistic regression, primarily due to the nonlinearity of the link function, there exists no prior distribution for which an analytical expression of the marginal likelihood is available. Therefore a rigorous approach needs to be developed for this purpose. Early work on estimation of the marginal likelihood includes how to compute it from Gibbs sampler output (Chib, 1995); Chib (1995) elaborated this procedure using an example in logistic regression. This idea was extended to output obtained from a Metropolis-Hastings sampler (Chib & Jeliazkov, 2001). Raftery et al. (2007) used a harmonic-mean-type estimator to estimate the marginal likelihood. However, in our experience, it is a very difficult task to estimate the marginal likelihood using this type of estimator.
8.2 Power Posterior
Motivated by the work of Gelman & Meng (1998), Friel & Pettitt (2008) derived a simple method to estimate the marginal likelihood, which they named the power posterior method. They illustrated the computational ease of this technique using several examples. In this section we provide an introduction to the power posterior method.
Define

z(y | t) = ∫ L(y | θ_γ)^t p(θ_γ) dθ_γ,

where θ_γ stands for the parameters of model M_γ. Then the log marginal likelihood is given by

log p(y) = log [ z(y | t = 1) / z(y | t = 0) ] = ∫_0^1 E_{θ_γ | y, t} [ log L(y | θ_γ) ] dt.

Applying the trapezoidal rule,

log p(y) ≈ Σ_{i=1}^{n−1} (1/2) (t_{i+1} − t_i) [ E_{θ_γ | y, t_{i+1}} log L(y | θ_γ) + E_{θ_γ | y, t_i} log L(y | θ_γ) ].    (8.1)
Following the suggestions in Friel & Pettitt (2008), we set n = 10 and t_i = a_i^5, i = 1, . . . , n, where the a_i are equally spaced points in the [0, 1] interval. Friel & Pettitt (2008) noted that if collecting samples from the posterior distribution of the parameters is possible, then theoretically it is also possible to collect power posterior samples. Once we have power posterior samples, we calculate the right-hand side of (8.1) and hence obtain an estimate of the marginal likelihood.
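To make the estimator concrete, the following sketch applies (8.1) to a toy conjugate normal-mean model where the power posterior at each temperature is itself Gaussian and can be sampled directly, so the exact marginal likelihood is available for comparison. The sketch is in Python and all data values are illustrative, not from the dissertation; in the nonlinear models below one would run a Markov chain at each temperature instead.

```python
import math
import random

random.seed(1)

# Toy conjugate model: y_i ~ N(mu, 1) with prior mu ~ N(0, 1).  The power
# posterior at temperature t is N(t*S/(t*n+1), 1/(t*n+1)).
y = [1.3, 0.7, 2.1, 1.6, 0.9]
n_obs = len(y)
S = sum(y)
SS = sum(v * v for v in y)

def log_lik(mu):
    return -0.5 * n_obs * math.log(2 * math.pi) - 0.5 * sum((v - mu) ** 2 for v in y)

# Temperature ladder t_i = a_i^5 with the a_i equally spaced in [0, 1].
ts = [(i / 10) ** 5 for i in range(11)]

# Estimate E_{mu | y, t}[log L(y | mu)] at each temperature by sampling.
rung_means = []
for t in ts:
    prec = t * n_obs + 1.0
    m, sd = t * S / prec, 1.0 / math.sqrt(prec)
    draws = [log_lik(random.gauss(m, sd)) for _ in range(4000)]
    rung_means.append(sum(draws) / len(draws))

# Trapezoidal rule (8.1).
log_ml = sum(0.5 * (ts[i + 1] - ts[i]) * (rung_means[i] + rung_means[i + 1])
             for i in range(len(ts) - 1))

# Exact log marginal likelihood of this toy model, for comparison:
# marginally y ~ N_n(0, I + 11'), with det(I + 11') = n + 1.
exact = (-0.5 * n_obs * math.log(2 * math.pi) - 0.5 * math.log(n_obs + 1.0)
         - 0.5 * (SS - S * S / (n_obs + 1.0)))
```

The power schedule concentrates the rungs near t = 0, where the integrand changes most rapidly, which is why the estimate stays close to the exact value with only 10 intervals.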
In order to apply the power posterior method to logistic regression, however, some issues need to be addressed.
1. First, note that since we apply the trapezoidal rule and wish to obtain power posterior samples for each temperature, the number of iterations should be kept as small as possible for computational efficiency. A single marginal likelihood calculation actually requires parallel Markov chain sampling from 10 power posteriors.
2. Second, prior specification plays an important role as usual. We place a normal prior in all of our computations. We did not experience a significant difference between the normal prior and the double exponential prior when p is small enough.
3. The usual way to obtain posterior samples for logistic regression is to assume a probit regression (Chib, 1995) and develop a Gibbs sampler. But we wanted to avoid that, to assess our proposed method as accurately as possible. So we sought automatic Bayesian updating software that keeps the logistic link function as it is. OpenBUGS (Thomas et al., 2006) is a widely accepted software for this purpose. However, for a simulation study it is very inconvenient to repeat the same job in OpenBUGS again and again. This led us to use the R2OpenBUGS package (Sturtz et al., 2005) in R (R Core Team, 2015). The R2OpenBUGS package provides an interface between R and OpenBUGS by virtue of which we can set up any experiment in R, carry out the Bayesian updating in OpenBUGS, have OpenBUGS send the posterior samples to R, and finally complete the required calculations using those posterior samples in R. This integration of R and OpenBUGS allows us to compute SSVS and MPM efficiently. However, the computation of the power posterior requires sampling from the power posterior distribution of the parameters, which is not readily available in OpenBUGS. One could use the zeros trick for this purpose; when we did this in practice, the results were not satisfactory. This made us search for a ready-made sampler. We first tried a Metropolis-Hastings (Hastings, 1970) sampler, for which a ready routine is available in the MHadaptive R package (Chivers, 2012). However, the Metropolis-Hastings sampler requires a tuning parameter that has to be changed adaptively, as well as a variance-covariance matrix for the proposal distribution; lack of proper information often yields biased estimates of the marginal likelihood. On the other hand, the diversitree package (FitzJohn, 2012) in R implements the slice sampler (Neal, 2003), which requires only one tuning parameter, the width of the proposal step. In general, the slice algorithm is insensitive to the value of this tuning parameter; in our experience, setting it to 0.1 leads to good mixing of the Markov chain.
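The sampler described above can be sketched in a few lines. The following is a minimal univariate version of the step-out slice sampler of Neal (2003), written in Python for illustration (the dissertation uses the implementation in the R package diversitree); the width w = 0.1 follows the text, everything else here is our own.

```python
import math
import random

def slice_sample(log_f, x0, n_draws, w=0.1, max_steps=200):
    """Univariate slice sampler (Neal, 2003) with step-out; the only tuning
    parameter is the proposal-step width w."""
    draws, x = [], x0
    for _ in range(n_draws):
        # Auxiliary height: uniform under the density at the current point.
        log_y = log_f(x) + math.log(random.random())
        # Step out to find an interval containing the slice.
        left = x - w * random.random()
        right = left + w
        steps = 0
        while log_f(left) > log_y and steps < max_steps:
            left -= w
            steps += 1
        steps = 0
        while log_f(right) > log_y and steps < max_steps:
            right += w
            steps += 1
        # Shrink the interval until a point inside the slice is drawn.
        while True:
            x1 = left + random.random() * (right - left)
            if log_f(x1) > log_y:
                x = x1
                break
            if x1 < x:
                left = x1
            else:
                right = x1
        draws.append(x)
    return draws

random.seed(7)
samples = slice_sample(lambda v: -0.5 * v * v, 0.0, 5000)  # standard normal target
```

Because the interval is found adaptively, a small width such as 0.1 costs only extra step-out iterations rather than poor mixing, which matches the insensitivity noted above.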
Essentially, the power posterior method is simple and easy to implement even for complex nonlinear models. Our contribution is to use the simplicity of this method to estimate the marginal likelihood at each evaluation of the objective function for simulated annealing. Recall that in the simulated annealing algorithm the proposal distributions are based on the posterior probabilities of the neighborhood models. We replace these posterior probabilities with the marginal likelihoods obtained via the power posterior method.
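The resulting search can be sketched as follows. This Python toy is illustrative only: the "log posterior" below is an artificial stand-in, whereas in practice each log_post evaluation would be a power posterior estimate of the log marginal likelihood; the function names and all settings are ours.

```python
import math
import random

def sa_hpm_search(log_post, p, n_iters=200, t0=1.0, cooling=0.99):
    """Simulated-annealing search for the highest-posterior model.  At each
    step the proposal distribution over the current model's neighborhood
    (all single-coordinate flips, plus the model itself) is proportional to
    exp(log posterior / temperature)."""
    random.seed(0)
    current = tuple([0] * p)                  # start at the null model
    best, best_score, temp = current, log_post(current), t0
    for _ in range(n_iters):
        nbhd = [current] + [
            current[:i] + (1 - current[i],) + current[i + 1:] for i in range(p)
        ]
        scores = [log_post(m) / temp for m in nbhd]
        mx = max(scores)                      # log-sum-exp stabilization
        wts = [math.exp(s - mx) for s in scores]
        r, acc = random.random() * sum(wts), 0.0
        for m, w in zip(nbhd, wts):
            acc += w
            if r <= acc:
                current = m
                break
        sc = log_post(current)
        if sc > best_score:
            best, best_score = current, sc
        temp *= cooling                       # cool down
    return best

# Toy "log posterior", sharply peaked at the model (1, 1, 0, 0, 1).
target = (1, 1, 0, 0, 1)
toy_log_post = lambda m: -5.0 * sum(a != b for a, b in zip(m, target))
found = sa_hpm_search(toy_log_post, p=5)
```

The cooling schedule makes the proposal increasingly greedy, so the chain settles on the highest-scoring model while the early high-temperature steps allow escapes from local modes.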
8.3 Simulation Study
In this study we generate the response 100 times according to (2.1) and (2.2). We set n = 100 and p = 5. The data generating model is (2, 4), with β = (0, 2, 0, 2, 0). The predictors are generated from a standard normal distribution such that the first 2 covariates have correlation around 0.95 and the others are independent. The results are shown in Table 8.1.
The SSVS and MPM are estimated using the R2OpenBUGS package (Sturtz et al., 2005) in R
(R Core Team, 2015). We also investigate the performance of LPML here. The posterior sampling for the LPML method is done using the slice sampler (Neal, 2003) available in the R package diversitree (FitzJohn, 2012).
Table 8.1: Number of times the correct model is obtained based on 100 repetitions: Logistic regression in presence of collinearity.

Method       Correct  FDR    FNR
SSVS         60%      0.149  0.015
MPM          53%      0.158  0.007
LPML         35%      0.3    0.123
HPM          71%      0.100  0.039
Lasso        36%      0.246  0.014
Elastic Net  0%       0.424  0
SIS-SCAD     48%      0.241  0.063
SIS-MCP      36%      0.300  0.095
ISIS-SCAD    46%      0.243  0.056
ISIS-MCP     35%      0.306  0.114
SA-HPM       65%      0.13   0.008
The results show that HPM is the most successful method in recovering the data generating model, followed by the proposed SA-HPM method. The MPM and LPML perform poorly. When there is a large number of explanatory variables, the HPM calculation is not feasible, and hence the proposed SA-HPM appears to be the method of choice.
Example 8.1. We set n = 200 and p = 20. The data generating model is (2, 4, 7, 9), with β2 = β4 = β7 = β9 = 2 and all other coefficients zero. The data are generated such that corr(x1, x2) = corr(x5, x6) = corr(x11, x12) = corr(x15, x16) = 0.95. All other regressors are generated independently from a standard normal distribution. Finally, the response is generated 100 times according to (2.1) and (2.2). The results are reported in Table 8.2.
Clearly, the SA-HPM method outperforms the other Bayesian methods, and continues to outperform the frequentist methods.
Table 8.2: Number of times the correct model is obtained based on 100 repetitions: Logistic regression for large p. p = 20.

Method       Correct  FDR    FNR
SSVS         47%      0.146  0.015
MPM          39%      0.17   0.007
LPML         -        -      -
HPM          -        -      -
Lasso        24%      0.277  0
Elastic Net  0%       0.605  -
SIS-SCAD     23%      0.255  0.001
SIS-MCP      58%      0.126  0
ISIS-SCAD    25%      0.248  0.001
ISIS-MCP     61%      0.117  0
SA-HPM       65%      0.087  0.04
Example 8.2. Nodal data.
The nodal data, considered by Collett (1991) and Chib (1995), contain cancer information on fifty-three patients. Researchers tried to explain whether or not a patient has the cancer with the help of the age of the patient in years at diagnosis, the level of serum acid phosphate, the result of an X-ray examination, the size of the tumor, and the pathological grade of the tumor. Collett (1991) concluded, using a frequentist deviance approach, that the second, third, and fourth variables are significant in explaining the cancer status of a patient, while Chib (1995) obtained the same significant variables using Bayesian marginal likelihood computation. Chib (1995) fitted a probit regression model and obtained the highest marginal likelihood for this model. Hence, this model can be regarded as the HPM.
We applied SA-HPM with a probit regression model. We placed independent normal priors on β with mean 0.75 and variance 25, as suggested by Chib (1995). SA-HPM was 100% successful in recovering the HPM based on 100 repetitions.
CHAPTER 9
SURVIVAL MODELS
9.1 Introduction
Here we present a comparison of different Bayesian methods for survival data. LPML is a widely used criterion for model selection in Bayesian survival analysis. In our study we show that the HPM, and hence the proposed SA-HPM method, often outperforms LPML. SSVS and MPM are also included in the comparisons as before.
9.2 Weibull Distribution
In our simulation study, the number of independent variables p is set to 5, since with increasing p the calculation of HPM becomes increasingly time-consuming. The covariates are generated from a normal distribution such that x1, x2, and x3 are highly correlated, with correlation around 0.9; x4 and x5 are independently distributed according to the standard normal distribution. We set n = 100. The data generating model is (2, 4), with β = (0, 1, 0, −1, 0). The rest of the examples assume this setup, unless otherwise mentioned.
The SSVS and MPM are estimated using the R2OpenBUGS package (Sturtz et al., 2005) in R (R Core Team, 2015). The posterior sampling for the LPML method is done using the slice sampler (Neal, 2003) available in the R package diversitree (FitzJohn, 2012). The posterior probabilities required for the computation of HPM and SA-HPM are estimated using the power posterior method discussed in Section 8.2.
Example 9.1. Weibull Regression
We assume that the response y follows a Weibull distribution given by

f(y) = η ν y^{η−1} e^{−ν y^η},  y > 0, η > 0, ν > 0,    (9.1)

where

ν = exp(X_γ β_γ).    (9.2)

The response y is generated according to (9.1) with η = 1.75, and ν is given by (9.2). We assume no censoring. We fit a Weibull regression when estimating SSVS, MPM, LPML, HPM, and SA-HPM. During the analysis a Uniform prior with parameters (0.5, 3) is placed on the shape parameter η. The results are given in Table 9.1.
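Responses from (9.1) can be generated by inversion of the survival function S(y) = exp(−ν y^η): if U is uniform on (0, 1), then y = (−log U / ν)^{1/η} has the desired distribution. A Python sketch, where the covariate row is illustrative rather than taken from the dissertation:

```python
import math
import random

random.seed(42)

def rweibull_reg(x_row, beta, eta):
    """Draw one survival time from (9.1) with nu = exp(x'beta), via the
    inverse-CDF method: S(y) = exp(-nu * y^eta) gives y = (-log U / nu)^(1/eta)."""
    nu = math.exp(sum(xi * bi for xi, bi in zip(x_row, beta)))
    u = random.random()
    return (-math.log(u) / nu) ** (1.0 / eta)

# Example 9.1 settings: eta = 1.75, beta = (0, 1, 0, -1, 0); the covariate
# row below is only an illustration.
eta, beta = 1.75, (0.0, 1.0, 0.0, -1.0, 0.0)
x = (0.2, -0.5, 1.0, 0.3, -0.1)
times = [rweibull_reg(x, beta, eta) for _ in range(20000)]
```

A quick check of the draws is that their sample mean should approach the theoretical mean ν^{−1/η} Γ(1 + 1/η) of this Weibull distribution.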
Table 9.1: Number of times the correct model is obtained based on 100 repetitions: Weibull regression.

Method  Correct  FDR    FNR
SSVS    53%      0.180  0.097
MPM     36%      0.247  0.077
LPML    35%      0.300  0.123
HPM     81%      0.102  0.028
SA-HPM  80%      0.130  0.071
From Table 9.1, we can conclude that SA-HPM outperforms SSVS, MPM, and LPML
in selecting the data generating model. Note that LPML, which is widely used in survival
settings, has the worst performance.
9.3 Mixture of Weibull Distribution
The Weibull distribution is a popular choice among parametric survival models; it satisfies the proportional hazards assumption and hence often fits well when the data come from a proportional hazards mechanism. We propose a Bayesian model where the baseline is modeled by a mixture of Weibull distributions. In simulation examples we show that this model fits reasonably well even when data are generated from distributions other than Weibull.
Thus we propose the following hierarchical representation for a regression model with a mixture of Weibull distributions.
f(y_i | λ_i, η, β_I, β) = η λ_i exp(x_i′β) y_i^{η−1} exp(−λ_i exp(x_i′β) y_i^η)

π(λ_i) ∼ G(α_I, β_I),  i = 1, . . . , n

π(β_I) ∼ G(α_0, β_0)

π(η) ∼ U(α_s, β_s)

π(β) ∼ N(μ_β, Σ_β)
If V ∼ G(φ1, φ2) then the pdf of V is defined as

f(v) = (φ2^{φ1} / Γ(φ1)) v^{φ1−1} exp(−φ2 v),  v > 0, φ1, φ2 > 0.    (9.3)

If W ∼ U(ν1, ν2) then the pdf of W is defined as

f(w) = 1 / (ν2 − ν1),  ν1 ≤ w ≤ ν2, ν2 > ν1 > 0.
We have imposed a prior on the rate parameter β_I of the distribution of the intercept terms λ_i so that information can be shared between the observations. This leads to the following posterior computation.
f(y_i | λ_i, η, β_I, β) π(λ_i | β_I) π(β_I) π(η) π(β)
  = λ_i η exp(x_i′β) y_i^{η−1} exp(−λ_i exp(x_i′β) y_i^η) · (β_I^{α_I} / Γ(α_I)) λ_i^{α_I−1} exp(−λ_i β_I)
    · (β_0^{α_0} / Γ(α_0)) β_I^{α_0−1} exp(−β_I β_0) π(η) π(β).

We analytically integrate the likelihood over λ_i:

π(β_I, η, β | y_i) ∝ η exp(x_i′β) y_i^{η−1} (β_I^{α_I} / Γ(α_I)) (β_0^{α_0} / Γ(α_0)) β_I^{α_0−1} exp(−β_0 β_I) π(η) π(β)
    · ∫_0^∞ λ_i exp(−λ_i exp(x_i′β) y_i^η) λ_i^{α_I−1} exp(−λ_i β_I) dλ_i
  = η exp(x_i′β) y_i^{η−1} (β_I^{α_I} / Γ(α_I)) (β_0^{α_0} / Γ(α_0)) β_I^{α_0−1} exp(−β_0 β_I)
    · [ Γ(α_I + 1) / (β_I + exp(x_i′β) y_i^η)^{α_I + 1} ] π(η) π(β),   i = 1, . . . , n.
Hence this likelihood may be used to compute the power posterior estimate. Note that, once the full conditional likelihood is integrated with respect to the \lambda_i, we are left with n fewer parameters and thereby achieve faster mixing in the MCMC.
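The key step in the marginalization above is the gamma integral \int_0^\infty \lambda^{\alpha} e^{-b\lambda}\, d\lambda = \Gamma(\alpha+1)/b^{\alpha+1}. A quick numerical sanity check in Python (the values of \alpha_I and the rate are hypothetical, standing in for \beta_I + \exp(x_i'\beta) y_i^{\eta}):

```python
import math

def gamma_integral_numeric(alpha, b, upper=60.0, n=200000):
    """Trapezoidal approximation of the integral of lam**alpha * exp(-b*lam)
    over lam in (0, upper); the tail beyond `upper` is negligible here."""
    h = upper / n
    total = 0.0
    for k in range(n + 1):
        lam = k * h
        w = 0.5 if k in (0, n) else 1.0
        total += w * (lam ** alpha) * math.exp(-b * lam)
    return total * h

# The integral that removes lambda_i from the likelihood equals
# Gamma(alpha_I + 1) / (beta_I + exp(x'beta) * y**eta) ** (alpha_I + 1).
alpha_I, rate = 2.0, 1.5          # hypothetical illustrative values
closed = math.gamma(alpha_I + 1) / rate ** (alpha_I + 1)
numeric = gamma_integral_numeric(alpha_I, rate)
print(closed, numeric)            # the two should agree closely
```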
Example 9.2. Mixture of Weibull Regression
88
Table 9.2: Number of times the correct model is obtained based on 100 repetitions. Data-generating model – Weibull regression. Analysis model – Mixture of Weibull regression.

Method    Correct   FDR     FNR
SSVS      51%       0.238   0.122
MPM       32%       0.263   0.032
LPML      30%       0.321   0.135
HPM       74%       0.125   0.042
SA-HPM    72%       0.138   0.063
Table 9.3: Number of times the correct model is obtained based on 100 repetitions. Data-generating model – Gamma regression. Analysis model – Mixture of Weibull regression.

Method    Correct   FDR     FNR
SSVS      43%       0.237   0.053
MPM       32%       0.263   0.028
LPML      41%       0.263   0.055
HPM       70%       0.102   0.029
SA-HPM    66%       0.190   0.047
The response y is generated according to (9.1) with \eta = 1.75 and \nu given by (9.2). No censoring is assumed. We fit a mixture of Weibull regression; Table 9.2 reports the results. Notice that HPM and SA-HPM are the best performers, whereas both LPML and MPM perform poorly.
Example 9.3. Response Follows Gamma Distribution
For this example, the response y is generated according to a Gamma distribution (9.3) with \phi_1 = 1.75 and \phi_2 = \phi_1 / \exp(X\beta). The purpose here is twofold. First, we wish to investigate the performance of fitting a mixture of Weibull distributions when the data are not from a Weibull distribution. Second, we wish to check the performance of the proposed SA-HPM method. Table 9.3 provides the results.
First, we compare these results with those of Table 9.2 and notice that the results are similar. Second, as before, HPM and hence SA-HPM tend to find the data-generating
Table 9.4: Number of times the correct model is obtained based on 100 repetitions. Data-generating model – Log Normal regression. Analysis model – Mixture of Weibull regression.

Method    Correct   FDR     FNR
SSVS      58%       0.215   0.072
MPM       45%       0.190   0.113
LPML      22%       0.392   0.202
HPM       60%       0.176   0.088
SA-HPM    54%       0.265   0.157
model more often than the other Bayesian methods. MPM, followed by LPML, are the two worst-performing methods.
Example 9.4. Response Follows Log Normal Distribution.
Following the spirit of the previous example, here we draw the response from a Log Normal distribution. Recall that if U \sim LN(\mu_0, \sigma_0^2) then \log(U) \sim N(\mu_0, \sigma_0^2). We set \mu_0 = \exp(X\beta) and \sigma_0 = \sqrt{1/1.75}. We report the results in Table 9.4 and notice that the conclusions are similar to those in Example 9.3.
9.4 Censoring
In the presence of censoring, the likelihood for the i-th observation is given by

l(\theta \mid y_i) = f(y_i \mid \theta)^{\delta_i}\, S(y_i \mid \theta)^{1-\delta_i},

where \delta_i = 1 if y_i is an event time and \delta_i = 0 if y_i is right censored, i = 1, \ldots, n; f(\cdot \mid \theta) is the pdf and S(\cdot \mid \theta) is the survival function. When f(\cdot \mid \cdot) is the pdf of a Weibull distribution (9.1), the survival function is

S(y \mid \eta, \nu) = \exp(-\nu y^{\eta}), \quad y > 0,\; \nu > 0,\; \eta > 0.
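A per-observation censored log-likelihood of this form can be sketched in Python (illustrative parameter values; this is not the dissertation's code):

```python
import math

def weibull_loglik_i(y, delta, eta, nu):
    """Per-observation censored log-likelihood:
    log l = delta * log f(y) + (1 - delta) * log S(y), with
    f(y) = eta * nu * y**(eta - 1) * exp(-nu * y**eta) and S(y) = exp(-nu * y**eta)."""
    log_S = -nu * y ** eta
    if delta == 1:                       # observed event time: density contribution
        return math.log(eta * nu) + (eta - 1.0) * math.log(y) + log_S
    return log_S                         # right-censored: survival contribution only

# Contributions of an event versus a censoring observed at the same time y = 2
# (eta and nu are hypothetical values):
print(weibull_loglik_i(2.0, 1, eta=1.75, nu=0.3),
      weibull_loglik_i(2.0, 0, eta=1.75, nu=0.3))
```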
The contribution of the censored data can be added to the likelihood using the following.
\pi(\beta_I, \eta, \beta \mid y_i) \propto \frac{\beta_I^{\alpha_I}}{\Gamma(\alpha_I)}\, \frac{\beta_0^{\alpha_0}}{\Gamma(\alpha_0)}\, \beta_I^{\alpha_0-1} \exp(-\beta_I \beta_0)\, \pi(\eta)\, \pi(\beta) \int_0^{\infty} \exp\{-\lambda_i \exp(x_i'\beta) y_i^{\eta}\}\, \lambda_i^{\alpha_I-1} \exp(-\lambda_i \beta_I)\, d\lambda_i

= \frac{\beta_I^{\alpha_I}}{\Gamma(\alpha_I)}\, \frac{\beta_0^{\alpha_0}}{\Gamma(\alpha_0)}\, \beta_I^{\alpha_0-1} \exp(-\beta_0 \beta_I)\, \frac{\Gamma(\alpha_I)}{(\beta_I + \exp(x_i'\beta) y_i^{\eta})^{\alpha_I}}\, \pi(\eta)\, \pi(\beta), \qquad i = 1, \ldots, n.
Example 9.5. Responses are Right Censored.
We keep a similar setup for generating the covariates. The response is generated using (9.1) with \eta = 1.75 and \nu given by (9.2). The censoring distribution is taken to be (9.1) with \eta = 1.75 and \nu = 1/3. This results in 25% censoring in the simulated data. Recall that information is typically lost through censoring; keeping this in mind, the number of observations is increased to n = 200.
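For intuition, the censoring fraction in such a design can be checked by simulation. Assuming, for illustration, that both the event time and the censoring time are Weibull with a common shape \eta and rates \nu_e and \nu_c, the censoring probability is \nu_c / (\nu_e + \nu_c); the rates below are hypothetical and not the exact simulation design of this chapter.

```python
import math
import random

def rweibull(nu, eta):
    """Draw from S(t) = exp(-nu * t**eta) by inverse CDF."""
    return (-math.log(random.random()) / nu) ** (1.0 / eta)

random.seed(7)
eta, nu_event, nu_cens = 1.75, 1.0, 1.0 / 3.0   # hypothetical rates
n = 100000
censored = 0
for _ in range(n):
    t_event, t_cens = rweibull(nu_event, eta), rweibull(nu_cens, eta)
    censored += t_cens < t_event                 # we observe min(t_event, t_cens)

# With a common shape, P(censored) = nu_cens / (nu_event + nu_cens) = 0.25 here.
print(censored / n)
```

The identity follows because t -> t**eta maps both Weibull times to exponentials with the same rates, and the probability that one exponential precedes another is the ratio of rates.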
Here we also compare with the frequentist regularized regression methods lasso (Tibshirani, 1996) and elastic net (Zou & Hastie, 2005). These are fitted using the R package glmnet (Friedman et al., 2010; Simon et al., 2011). We report the simulation results in Table 9.5.
Table 9.5: Number of times the correct model is obtained based on 100 repetitions. Presence of censoring.

Method       Correct   FDR     FNR
SSVS         47%       0.258   0.148
MPM          42%       0.263   0.116
LPML         11%       0.439   0.115
HPM          76%       0.094   0.050
Lasso        51%       0.179   0.025
Elastic Net  0%        0.463   0.458
SIS-lasso    12%       0.379   0.010
ISIS-lasso   11%       0.375   0.010
SA-HPM       72%       0.123   0.084
The results suggest that SA-HPM clearly outperforms the other Bayesian and frequentist methods in selecting the data-generating model. In particular, LPML is able to find the data-generating model only 11 times out of 100 repetitions.
Example 9.6. Proteomics data.
We apply the methodologies discussed above to a proteomics dataset. This is a right-censored dataset with 110 subjects, 8 of whom are censored. The data involve two types of treatment, which can be represented by a binary variable. In addition, there are 60 biomarkers. Moreover, we created interaction terms between treatment and the biomarkers. This leads us to deal with 121 regressors. We report the results in Table 9.6.
Note that, since the model space is extremely large, the runtime for MPM and SSVS was around 5 hours; we set 25000 iterations and 5000 burn-in. Notice that the model obtained using SA-HPM has the highest log(marginal likelihood); in this sense, SA-HPM outperforms the other methods. We also include the dimensions of the models on the SA-HPM path and their
Table 9.6: Models obtained using different methods for the proteomics data.

Method       Model                                            log(marginal likelihood)
MPM          14, 16, 20, 23, 24, 30, 50, 61, 65, 71, 72,      -359.6159
             75, 77, 84, 85
SSVS         1, 2, 4, 6, 14, 15, 16, 17, 19, 20, 22, 23,      -390.3568
             24, 30, 31, 32, 35, 47, 49, 51, 54, 57, 61,
             62, 63, 65, 70, 71, 77, 79, 84, 85, 88, 92,
             93, 96, 100, 101, 102, 107, 108, 112, 119
Lasso        10, 23                                           -359.5315
Elastic Net  10, 23                                           -359.5315
SIS-lasso    6, 7, 10, 17, 20, 23, 27, 44, 68, 70, 96         -366.4602
ISIS-lasso   4, 7, 10, 11, 14, 20, 23, 24, 28, 30, 32, 41,    -371.4623
             42, 49, 50, 55, 70, 74, 78, 81, 82, 96, 99,
             102, 105, 117
SA-HPM       10, 11, 20, 24, 30, 50, 75                       -352.8674
log(marginal likelihoods) at each step of SA-HPM in Figure 9.1. The solution paths correspond to different models used as starting points (start models) for SA-HPM.
Figure 9.1: (a) The path of model dimensions obtained at each step of SA-HPM. (b) The path of log(marginal likelihood) of the models obtained at each step of SA-HPM.
CHAPTER 10
DISCUSSION
10.1 Introduction
The posterior probability approach is one of the oldest Bayesian model selection criteria. It is widely accepted in the Bayesian community and became popular because of its solid theoretical and logical foundation. However, like some other Bayesian criteria, this approach requires enumerating the whole model space. The work presented in this article tries to find the highest posterior model while avoiding enumeration of the whole model space.
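A generic sketch of this strategy in Python (not the dissertation's SA-HPM code): simulated annealing over the space of inclusion vectors, flipping one coordinate at a time and accepting moves by a cooling Metropolis rule. The toy additive log-score below stands in for the log posterior model probability.

```python
import math
import random

def sa_search(score, p, n_iter=5000, t0=1.0, cooling=0.999):
    """Simulated annealing over {0,1}^p to maximize `score` (a log-posterior surrogate)."""
    random.seed(0)
    gamma = [0] * p                       # start from the null model
    best, best_score = gamma[:], score(gamma)
    current = best_score
    t = t0
    for _ in range(n_iter):
        j = random.randrange(p)           # propose flipping one inclusion indicator
        gamma[j] ^= 1
        new = score(gamma)
        if new >= current or random.random() < math.exp((new - current) / t):
            current = new                 # accept the move
            if new > best_score:
                best, best_score = gamma[:], new
        else:
            gamma[j] ^= 1                 # reject: undo the flip
        t *= cooling                      # cool the temperature
    return best

# Toy log-score: reward including variables 0, 2, 5; penalize any other inclusion.
truth = {0, 2, 5}
score = lambda g: sum(3.0 if j in truth else -3.0 for j in range(len(g)) if g[j])
print(sa_search(score, p=10))
```

In SA-HPM the score would be the (log) posterior model probability, which for non-conjugate models is itself estimated, e.g., by the power posterior method discussed later.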
10.2 Comparison of Methods for Linear Models
When comparing the various Bayesian methods for linear models, the main issue with SSVS and MPM is that both methods take a significant amount of time. We have followed the Gibbs sampling of García-Donato & Martínez-Beneito (2013), in which the continuous parameters are integrated out and we are left with a finite parameter space. To capture this, we wrote the code entirely in R (R Core Team, 2015), neither taking advantage of the MCMC machinery of OpenBUGS (Thomas et al., 2006) nor using efficient code in C, C++, or Fortran. This may explain the large computation time of these methods. But then, we wrote the code for the proposed SA-HPM in R as well, which, in our opinion, makes the comparison fair. The only advantage we have taken is that a closed-form expression for the posterior probabilities is available when we use the g prior. Unfortunately, SSVS and MPM are not blessed by this fact. The MPM could have been calculated using the posterior probabilities as well; however, such a practice is not tenable for large p. On the other hand, the performance of MPM is not good. Thus, either way, the proposed SA-HPM is preferable.
10.3 Non-Linear Models
In general, the non-conjugacy of any prior for non-linear models demands a significant amount of computation. We have used the power posterior method (Friel & Pettitt, 2008) for this purpose. The advantage of the power posterior is that it is easy to understand and simple to employ. To our knowledge, there has been no prior effort to use the computational simplicity of the power posterior for variable selection; our work is thus the first attempt in this direction.
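For reference, the identity from Friel & Pettitt (2008) on which the power posterior method rests can be written as:

```latex
% Power posterior at temperature t in [0, 1]:
\pi_t(\theta \mid y) \propto f(y \mid \theta)^{t}\, \pi(\theta),

% Thermodynamic-integration identity for the marginal likelihood:
\log f(y) = \int_0^1 \mathbb{E}_{\theta \mid y, t}\!\left[\log f(y \mid \theta)\right] dt.
```

Here the expectation is taken under \pi_t; in practice the integral is approximated over a discrete ladder of temperatures, with MCMC draws from each \pi_t.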
10.4 Survival Settings
Again, to our knowledge, the Bayesian variable selection literature is very limited in survival statistics. One of the main issues is the lack of a good or conjugate prior. On top of that, the form of a survival function is often complicated to deal with (for example, the Cox proportional hazards model or the Weibull regression model). The presence of censoring adds another level of difficulty. Bayesian methods depend on the given data and try to recover the true situation from them; in the presence of censoring a significant amount of information is lost, and it becomes challenging to recover the relevant variables. In our method we employed a mixture of Weibull distributions and have shown its good performance under different scenarios. Furthermore, LPML, perhaps the best-known goodness-of-fit statistic in survival models, has been shown to work poorly. In addition, simulations show that DIC suffers from the same problem as LPML.
10.5 Future Extensions
10.5.1 Large p
Although we have run the experiments for non-linear models under small p, the work can easily be extended to large p, and from the convergence theory of simulated annealing, the proposed SA-HPM has the potential to do well there.
10.5.2 Prior
We have also used the double exponential prior on β and obtained similar results for small p; those results are not presented here for brevity. For linear models, when using the double exponential prior, the closed-form expression of the posterior probabilities is not available, and hence the power posterior comes to the rescue, as discussed before. Thus one may be tempted to use any shrinkage prior and assess the performance.
10.5.3 Different Models
The immediate extension of this work would be to use SA-HPM for other models such as proportional odds regression, Poisson regression, and spatial regression. Currently, the performance of SA-HPM is being examined for the proportional odds model.
10.5.4 Posterior Sampling
When a Gibbs sampler is not available (for example, for non-linear models), we have relied on other R (R Core Team, 2015) packages (diversitree, MHadaptive) due to the inefficiency of OpenBUGS in calculating the power posterior using the zeros trick for posterior sampling. Developing an efficient posterior sampler would certainly make the computation of SA-HPM easier and reduce the computation time.
10.5.5 Computation Time
In addition, we believe that the computation time can be reduced by efficient programming in lower-level languages such as C, C++, or Fortran. In that case, our goal will be to deliver an R package that implements our idea with functions written in one of these lower-level languages.
CHAPTER 11
CONCLUSION
In conclusion, first note that we have not tried to find the data-generating model using any direct approach, such as a shrinkage prior or sampling from the model space. Rather, our effort focuses on finding the highest posterior model, which is often perceived to have good properties. If the highest posterior model does not possess sufficient information on the data, our proposed SA-HPM method might fail. Fortunately, no study to date shows the failure of the posterior probability approach. An exception is the MPM of Barbieri & Berger (2004), but the optimality of MPM depends on the assumption of an orthogonal design matrix, which is rare in practice; in the presence of collinearity, the proposed SA-HPM has been shown to outperform MPM.
Furthermore, as García-Donato & Martínez-Beneito (2013) observed, SSVS estimators have the potential to find a good model, but when a large number of predictors is available, it might become infeasible to check their performance.
One important feature of our method is that the computation time is comparable to that taken by various frequentist regularized regressions. Comparison of Bayesian and frequentist methods is always delicate; even keeping this in mind, the proposed SA-HPM has been shown to outperform its frequentist counterparts.
Most regularized regressions require one to choose a tuning parameter, and this is typically done by cross validation. The entire path of the tuning parameter is believed to contain the data-generating solution; examination of that entire path is beyond the scope of this article.
Most of the stochastic search processes available in the literature depend on the posterior probabilities of the models by virtue of their design. We use this idea of working with the posterior probability directly, with the help of the simulated annealing algorithm and the power posterior method. In summary, our research strengthens the classical idea of assessing a model by its posterior probability.
We agree with García-Donato & Martínez-Beneito (2013) that a large volume of near-future research in the Bayesian variable selection literature will involve sampling and stochastic search. We also agree with Hahn & Carvalho (2015) that good models can be obtained by exploring posterior summaries of the models. The highest posterior model, one such posterior summary, is widely known to have excellent properties. Our research thus provides a simple, efficient, quick, and feasible way forward in this direction of variable selection.
REFERENCES
Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions
on Automatic Control , 19 (6), 716 - 723.
Andrews, D. F., & Mallows, C. L. (1974). Scale mixtures of normal distributions. Journal
of the Royal Statistical Society: Series B (Statistical Methodology), 36 (1), 99 – 102.
Barbieri, M. M., & Berger, J. O. (2004). Optimal predictive model selection. The Annals
of Statistics , 32 (3), 870 – 897.
Basu, S., & Chib, S. (2003, March). Marginal likelihood and bayes factor for dirichlet process
mixture models. Journal of the American Statistical Association, 98 (461), 224 – 235.
Basu, S., & Ebrahimi, N. (2003, May). Bayesian software reliability models based on
martingale processes. Technometrics , 45 (2), 150 – 158.
Basu, S., Sen, A., & Banerjee, M. (2003). Bayesian analysis of competing risks with par-
tially masked cause-of-failure. Journal of the Royal Statistical Society: Series C (Applied
Statistics), 52 (1), 77 – 93.
Basu, S., & Tiwari, R. C. (2010, April). Breast cancer survival, competing risks and
mixture cure model: a bayesian analysis. Journal of the Royal Statistical Society: Series
A (Statistics in Society), 173 (2), 307 – 329.
Berger, J. O., & Molina, G. (2005). Posterior model probabilities via path-based pairwise
priors. Statistica Neerlandica, 59 , 3 – 15.
Bertsimas, D., & Tsitsiklis, J. (1993). Simulated annealing. Statistical Science, 8 (1), 10 –
15.
Bottolo, L., & Richardson, S. (2010). Evolutionary stochastic search for bayesian model
exploration. Bayesian Analysis , 5 (3), 583 – 618.
Breheny, P., & Huang, J. (2011). Coordinate descent algorithms for nonconvex penalized
regression, with applications to biological feature selection. Annals of Applied Statistics ,
5 (1), 232–253.
Breiman, L. (1995, November). Better subset selection using the nonnegative garrote.
Technometrics , 37 (4), 373 – 384.
Candes, E., & Tao, T. (2007). The dantzig selector: Statistical estimation when p is much
larger than n. The Annals of Statistics , 35 (6), 2313 – 2351.
Carvalho, C., Polson, N. G., & Scott, J. G. (2010). The horseshoe estimator for sparse
signals. Biometrika, 97 , 465 – 480.
Casella, G., & George, E. I. (1992, August). Explaining the gibbs sampler. The American
Statistician, 46 (3), 167 – 174.
Casella, G., Girón, F. J., Martínez, M. L., & Moreno, E. (2009). Consistency of bayesian
procedures. The Annals of Statistics , 37 (3), 1207 – 1228.
Casella, G., & Moreno, E. (2006, March). Objective bayesian variable selection. Journal of
the American Statistical Association, 101 (473).
Celeux, G., Forbes, F., Robert, C. P., & Titterington, D. M. (2006). Deviance information
criteria for missing data models. Bayesian Analysis , 1 , 651 – 674.
Chen, M. H., Harrington, D. P., & Ibrahim, J. G. (2002). Bayesian cure rate models for
malignant melanoma: A case-study of eastern cooperative oncology group trial e1690.
Journal of the Royal Statistical Society. Series C (Applied Statistics), 51 (2), 135 – 150.
Chen, M. H., Ibrahim, J. G., & Sinha, D. (1999). A new bayesian model for survival data
with a surviving fraction. Journal of the American Statistical Association, 94 (447), 909 –
919.
Chen, M. H., Shao, Q. M., & Ibrahim, J. G. (2001). Monte carlo methods in bayesian
computation. New York, New York: Springer-Verlag.
Chib, S. (1995, December). Marginal likelihood from the gibbs output. Journal of the
American Statistical Association, 90 (432), 1313 – 1321.
Chib, S., & Jeliazkov, I. (2001, March). Marginal likelihood from the metropolis-hastings
output. Journal of the American Statistical Association, 96 (453), 270 – 281.
Chivers, C. (2012). Mhadaptive: General markov chain monte carlo for bayesian inference
using adaptive metropolis-hastings sampling [Computer software manual]. Retrieved from
http://CRAN.R-project.org/package=MHadaptive (R package version 1.1-8)
Clyde, M. A., Ghosh, J., & Littman, M. (2011). Bayesian adaptive sampling for variable
selection and model averaging. Journal of Computational and Graphical Statistics , 20 , 80
– 101.
Collett, D. (1991). Modelling binary data. London: Chapman and Hill.
Cooner, F., Banerjee, S., Carlin, B. P., & Sinha, D. (2007, June). Flexible cure rate
modeling under latent activation schemes. Journal of the American Statistical Association,
102 (478), 560 – 572.
Cruz, J. R., & Dorea, C. C. Y. (1998, December). Simple conditions for the convergence of
simulated annealing type algorithms. Journal of Applied Probability , 35 (4), 885 – 889.
Dey, T., Ishwaran, H., & Rao, J. S. (2008). An in-depth look at highest posterior model
selection. Econometric Theory , 24 , 377 – 403.
DiCiccio, T. J., Kass, R. E., Raftery, A., & Wasserman, L. (1997, September). Computing
bayes factors by combining simulation and asymptotic approximations. Journal of the
American Statistical Association, 92 (439), 903 – 915.
Efron, B., Hastie, T., Johnstone, I., & Tibshirani, R. (2004). Least angle regression. Annals
of Statistics , 32 (2), 407 – 499.
Fan, J., Feng, Y., Saldana, D. F., Samworth, R., & Wu, Y. (2015).
Sis: Sure independence screening [Computer software manual]. Retrieved from
http://CRAN.R-project.org/package=SIS (R package version 0.7-5)
Fan, J., & Li, R. (2001, December). Variable selection via nonconcave penalized likelihood
and its oracle properties. Journal of the American Statistical Association, 96 (456), 1348
– 1360.
Fan, J., & Lv, J. (2008). Sure independence screening for ultrahigh dimensional feature space.
Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70 (5), 849 –
911.
Fan, J., & Lv, J. (2010a). A selective overview of variable selection in high dimensional
feature space. Statistica Sinica, 20 , 101 – 148.
Fan, J., & Lv, J. (2010b). A selective overview of variable selection in high dimensional
feature space. Statistica Sinica, 20 , 101 – 148.
Fernandez, C., Ley, E., & Steel, M. F. J. (2001). Benchmark priors for bayesian model
averaging. Journal of Econometrics , 100 , 381 – 427.
FitzJohn, R. G. (2012). Diversitree: Comparative phylogenetic analyses of diversification in
r. Methods in Ecology and Evolution, 3 , 1084 – 1092.
Flom, P. L., & Cassell, D. L. (2007). Stopping stepwise: Why stepwise and similar selection
methods are bad, and what you should use. NESUG .
Frank, I. E., & Friedman, J. H. (1993, May). A statistical view of some chemometrics
regression tools. Technometrics , 35 (2), 109 – 135.
Friedman, J., Hastie, T., & Tibshirani, R. (2010). Regularization paths for generalized linear
models via coordinate descent. Journal of Statistical Software, 33 (1), 1 – 22.
Friel, N., & Pettitt, A. N. (2008, July). Marginal likelihood estimation via power posteriors.
Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70 (3), 589 – 607.
Garcia-Donato, G., & Forte, A. (2015). Bayesvarsel: Bayes factors, model choice
and variable selection in linear models [Computer software manual]. Retrieved from
http://CRAN.R-project.org/package=BayesVarSel (R package version 1.6.1)
García-Donato, G., & Martínez-Beneito, M. A. (2013, March 15). On sampling strategies
in bayesian variable selection problems with large model spaces. Journal of the American
Statistical Association, 108 (501), 340 – 352.
Geisser, S., & Eddy, W. F. (1979). A predictive approach to model selection. Journal of the
American Statistical Association, 74 (365), 153–160.
Gelfand, A. E., Dey, D. K., & Chang, H. (1992). Model determination using predictive
distributions with implementation via sampling-based methods. Technical Report for the
Office of Naval Research, 462 (1992).
Gelman, A., & Meng, X.-L. (1998). Simulating normalizing constants: From importance
sampling to bridge sampling to path sampling. Statistical Science, 13 (2), 163 – 185.
George, E. I., & McCulloch, R. E. (1993). Variable selection via gibbs sampling. Journal of
the American Statistical Association, 88 (423), 881 – 889.
George, E. I., & McCulloch, R. E. (1997). Approaches for bayesian variable selection.
Statistica Sinica, 7 , 339 – 373.
Ghosh, P., Basu, S., & Tiwari, R. C. (2009). Bayesian analysis of cancer rates from seer
program using parametric and semiparametric joinpoint regression models. Journal of the
Royal Statistical Society: Series A (Statistics in Society), 104 (486), 439 – 452.
Hahn, P. R., & Carvalho, C. M. (2015). Decoupling shrinkage and selection in bayesian
linear models: A posterior summary perspective. Journal of the American Statistical
Association, 110 (509), 435 – 448.
Hall, P., Lee, E. R., & Park, B. U. (2009). Bootstrap based penalty choice for the lasso,
achieving oracle performance. Statistica Sinica, 19 , 449 – 471.
Hans, C. (2009). Bayesian lasso regression. Biometrika, 96 , 835 – 845.
Hans, C., Dobra, A., & West, M. (2007). Shotgun stochastic search for large p regression.
Journal of the American Statistical Association, 102 (478), 507 – 516.
Hastings, W. K. (1970, April). Monte carlo sampling methods using markov chains and
their applications. Biometrika, 57 (1), 97 – 109.
Huang, J., Horowitz, J. L., & Ma, S. (2008). Asymptotic properties of bridge estimators in
sparse high-dimensional regression models. The Annals of Statistics , 36 (2), 587 – 613.
Ibrahim, J. G., Chen, M.-H., & Sinha, D. (2004). Bayesian survival analysis. New York,
New York: Springer Series in Statistics.
Ishwaran, H., & Rao, J. S. (2005). Spike and slab variable selection: Frequentist and bayesian
strategies. The Annals of Statistics , 33 (2), 730 – 773.
Johnson, V. E. (2013). On numerical aspects of bayesian model selection in high and
ultrahigh-dimensional settings. Bayesian Analysis , 8 (4), 741 – 758.
Johnson, V. E., & Rossell, D. (2012, July 24). Bayesian model selection in high-dimensional
settings. Journal of the American Statistical Association, 107 (498), 649 – 660.
Kass, R. E., & Raftery, A. E. (1995, June). Bayes factors. Journal of the American Statistical
Association, 90 (430), 773 – 795.
Kraemer, N., Schaefer, J., & Boulesteix, A.-L. (2009). Regularized estimation of large-scale
gene regulatory networks using gaussian graphical models. BMC Bioinformatics , 10 (384).
Kuo, L., & Mallick, B. (1998). Variable selection for regression models. Sankhya: The Indian
Journal of Statistics. Special Issue on Bayesian Analysis , 60 (1), 65 – 81.
Kyung, M., Gill, J., Ghosh, M., & Casella, G. (2010). Penalized regression, standard errors,
and bayesian lassos. Bayesian Analysis , 5 (2), 369 – 412.
Laud, P. W., & Ibrahim, J. G. (1995). Predictive model selection. Journal of the Royal
Statistical Society: Series B (Statistical Methodology), 57 (1), 247 – 262.
Leng, C., Tran, M.-N., & Nott, D. (2014, April). Bayesian adaptive lasso. Annals of the
Institute of Statistical Mathematics , 66 (2), 221 – 244.
Li, Q., & Lin, N. (2010). The bayesian elastic net. Bayesian Analysis , 5 (1), 151 – 170.
Liang, F., Paulo, R., Molina, G., Clyde, M. A., & Berger, J. O. (2008, March). Mixtures
of g priors for bayesian variable selection. Journal of the American Statistical Association,
103 (481), 410 – 423.
Martin, A. D., Quinn, K. M., & Park, J. H. (2011). MCMCpack: Markov chain
monte carlo in R. Journal of Statistical Software, 42 (9), 22. Retrieved from
http://www.jstatsoft.org/v42/i09/
Mitchell, T. J., & Beauchamp, J. J. (1988, December). Bayesian variable selection in linear
regression. Journal of the American Statistical Association, 83 (404), 1023 – 1032.
Neal, R. M. (2001). Annealed importance sampling. Statistics and Computing , 11 , 125 –
139.
Neal, R. M. (2003). Slice sampling. The Annals of Statistics , 31 (3), 705 – 767.
O’Hara, R. B., & Sillanpaa, M. J. (2009). A review of bayesian variable selection methods:
What, how and which. Bayesian Analysis , 4 (1), 85 – 118.
Park, T., & Casella, G. (2008, June). The bayesian lasso. Journal of the American Statistical
Association, 103 (482).
Picard, R. R., & Cook, R. D. (1984). Cross-validation of regression models. Journal of the
American Statistical Association, 79 (387), 575 – 583.
Polson, N. G., & Scott, J. G. (2012). Local shrinkage rules, levy processes, and regularized
regression. Journal of the Royal Statistical Society: Series B (Statistical Methodology),
74 (2), 287 – 311.
Polson, N. G., Scott, J. G., & Windle, J. (2014, September). The bayesian bridge. Journal
of the Royal Statistical Society: Series B (Statistical Methodology), 76 (4), 713 – 733.
R Core Team. (2015). R: A language and environment for statistical computing [Computer
software manual]. Vienna, Austria. Retrieved from http://www.R-project.org/
Raftery, A. E. (1999). Bayes factors and bic. comment on ”a critique of the bayesian
information criterion for model selection”. Sociol. Methods Res., 27 , 411 – 427.
Raftery, A. E., Newton, M. A., Satagopan, J. M., & Krivitsky, P. N. (2007). Estimating
the integrated likelihood via posterior simulation using the harmonic mean identity. In
Bayesian statistics (Vol. 8, pp. 1 – 45).
Rossell, D., Cook, J. D., Telesca, D., & Roebuck, P. (2014). mombf: Moment
and inverse moment bayes factors [Computer software manual]. Retrieved from
http://CRAN.R-project.org/package=mombf (R package version 1.5.9)
Schwarz, G. E. (1978). Estimating the dimension of a model. Annals of Statistics , 6 (2), 461
- 464.
Seber, G. A. F., & Lee, A. J. (2003). Linear regression analysis. Hoboken, New Jersey: A
John Wiley & Sons Publication.
Shao, J. (1993, June). Linear model selection by cross validation. Journal of the American
Statistical Association, 88 (422), 486 – 494.
Shi, M., & Dunson, D. B. (2011, February 1). Bayesian variable selection via particle
stochastic search. Statistics and Probability Letters , 81 (2), 283 – 291.
Simon, N., Friedman, J., Hastie, T., & Tibshirani, R. (2011). Regularization paths for cox’s
proportional hazards model via coordinate descent. Journal of Statistical Software, 39 (5),
1 – 13. Retrieved from http://www.jstatsoft.org/v39/i05/
Song, Q., & Liang, F. (2015). High dimensional variable selection with reciprocal l1 regu-
larization. Journal of the American Statistical Association.
Spiegelhalter, D. J., Best, N. G., Carlin, B. P., & van der Linde, A. (2002). Bayesian
measures of model complexity and fit. Journal of Royal Statistical Society: Series B
(Statistical Methodology), 64 (4), 583 – 639.
Sturtz, S., Ligges, U., & Gelman, A. (2005). R2winbugs: A package for running
winbugs from r. Journal of Statistical Software, 12 (3), 1 – 16. Retrieved from
http://www.jstatsoft.org
Sun, W., Ibrahim, J. G., & Zou, F. (2009). Variable selection by bayesian adaptive lasso
and iterative adaptive lasso, with application for genome-wide multiple loci mapping.
Biostatistics Technical Report Series , 10 .
Thomas, A., O’Hara, B., Ligges, U., & Sturtz, S. (2006, March). Making bugs open. R
News , 6 (1), 12 – 17.
Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal
Statistical Society: Series B (Statistical Methodology), 58 (1), 267 – 288.
Tibshirani, R. (2011). Regression shrinkage and selection via the lasso: a retrospective.
Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73 (3), 273 – 282.
Tibshirani, R., Saunders, M., Rosset, S., Zhu, J., & Knight, K. (2005). Sparsity and
smoothness via the fused lasso. Journal of Royal Statistical Society: Series B (Statistical
Methodology), 67 (1), 91 – 108.
Tsodikov, A. D., Ibrahim, J. G., & Yakovlev, A. Y. (2003). Estimating cure rates from
survival data: An alternative to two-component mixture models. Journal of the American
Statistical Association, 98 (464), 1063 – 1078.
Watanabe, S. (2010). Asymptotic equivalence of bayes cross validation and widely applicable
information criterion in singular learning theory. Journal of Machine Learning Research,
11 (2010), 3571 – 3594.
Watanabe, S. (2013). A widely applicable bayesian information criterion. Journal of Machine
Learning Research, 14 (2013), 867 – 897.
Yan, X., & Su, X. G. (2009). Linear regression analysis: theory and computing. World
Scientific.
Yin, G., & Ibrahim, J. G. (2005, December). Cure rate models: A unified approach.
Canadian Journal of Statistics , 33 (4), 559 – 570.
Yuan, M., & Lin, Y. (2006). Model selection and estimation in regression with grouped
variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology),
68 (1), 49 – 67.
Zellner, A. (1986). On assessing prior distributions and bayesian regression analysis with
g-prior distributions. In Bayesian Inference and Decision Techniques: Essays in Honor of
Bruno de Finetti , 233 – 243.
Zhang, C.-H. (2010). Nearly unbiased variable selection under minimax concave penalty.
The Annals of Statistics , 38 (2), 894 – 942.
Zou, H. (2006, December). The adaptive lasso and its oracle properties. Journal of the
American Statistical Association, 101 (476), 1418 – 1429.
Zou, H., & Hastie, T. (2005). Regularization and variable selection via the elastic net.
Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67 (2), 301 –
320.
Zou, H., & Zhang, H. H. (2009). On the adaptive elastic-net with a diverging number of
parameters. The Annals of Statistics , 37 (4), 1733 – 1751.