ABSTRACT
BAYESIAN VARIABLE SELECTION IN LINEAR AND NON-LINEAR MODELS
Arnab Kumar Maity, Ph.D.
Division of Statistics
Northern Illinois University, 2016
Sanjib Basu, Director
Appropriate feature selection is a fundamental problem in the field of statistics. Models with a large number of features or variables require special attention due to the computational complexity of the huge model space. This is generally known as the variable or model selection problem in statistics, whereas in machine learning and other literatures it is also known as feature selection, attribute selection, or variable subset selection. Variable selection is the process of efficiently selecting an optimal subset of relevant variables for use in model construction. The central assumption in this methodology is that the data contain many redundant variables: those which do not provide significant additional information beyond the optimally selected subset of variables. Variable selection is widely used in all application areas of data analytics, ranging from optimal selection of genes in large-scale microarray studies, to optimal selection of biomarkers for targeted therapy in cancer genomics, to selection of optimal predictors in business analytics. Under the Bayesian approach, the formal way to perform this optimal selection is to select the model with the highest posterior probability. The problem may thus be viewed as an optimization problem over the model space, where the objective function is the posterior probability of a model and the maximization is taken over the models. We propose an efficient method for implementing this optimization and illustrate its feasibility in high-dimensional problems. By means of various simulation studies, this new approach is shown to be efficient and to outperform other statistical feature selection methods, namely the median probability model and a sampling method with frequency-based estimators. Theoretical justifications are provided. Applications to logistic regression and survival regression are discussed.
NORTHERN ILLINOIS UNIVERSITY
DE KALB, ILLINOIS
AUGUST 2016
BAYESIAN VARIABLE SELECTION IN LINEAR AND NON-LINEAR
MODELS
BY
ARNAB KUMAR MAITY
© 2016 Arnab Kumar Maity
A DISSERTATION SUBMITTED TO THE GRADUATE SCHOOL
IN PARTIAL FULFILLMENT OF THE REQUIREMENTS
FOR THE DEGREE
DOCTOR OF PHILOSOPHY
DIVISION OF STATISTICS
Dissertation Director: Sanjib Basu
ACKNOWLEDGEMENTS
I express my sincere and profound gratitude to my advisor, Dr. Sanjib Basu, for his constant
support, mentoring, advice, and personal care during my entire NIU life.
I am greatly indebted to the guidance from Dr. Rama Lingam, Dr. Balakrishna Hosmane,
and Dr. Alan Polansky. I acknowledge the help and enthusiasm from Dr. Shuva Gupta, Dr.
Zhuan Ye, Dr. Nader Ebrahimi, Dr. Duchwan Ryu, and Dr. Michelle Xia.
At this moment of achievement, I am deeply grateful to Dr. Isha Dewan and other
faculty members in ISI; I am immensely grateful to Sri Palas Pal, Sri Subhadeep Banerjee,
Sri Tulsidas Mukhopadhyay, Dr. Dilip Kumar Sahoo, and Sri Parthsarathi Chakrabarti, and
other staff at Narendrapur.
Now that the dissertation is in its present form, I would like to convey my gratitude to my uncle
Sri Subrata Maiti, my uncle Dr. Tapabrata Maiti, and Dr. Tathagata Bandyopadhyay for
their unselfish support.
In addition, I continue to acknowledge the guidance of Mondal-Sir, Sri Nandadulal Jana,
Sri Dhaniram Tudu, Badal-da, Sri Rajendra Nath Giri, Santi-Babu, Sri Swapan Bhaumik, Sri
Bhabesh Barman, Samya-Babu, Bhaskar-Babu, Abhijit-Babu, Hari-babu, Surjendyu-Babu,
Sukumar-Babu, Pintu-da, Kaushik-da, Panda-Miss, Jana-Miss, Gayen-Miss, Gobindo-Sir
and other school teachers.
I would like to thank John Winans and the Department of Computer Science for letting me
use their supercomputer to execute my time-consuming computations.
Most importantly, I wish to express my heartfelt thanks to beloved Dr. Bilal Khan, Dr.
Amartya Chakrabarti, Smt. Sreya Chakraborty, Dr. Santu Ghosh, Dr. Ujjwal das, Dr.
Arpita Chatterjee, Mr. Alan Hurt Jr., Sri Rajendranath Maiti, Sri Joydeep Das, Sri Suman
Sarkar, Sri Biswajit Pal, Sri Soumen Achar, Sri Suman Bhunia, Sourav-da, Sri Anirban
Roy Chowdhuri, Sri Nirmal Kandar, Sri Subhasis Samanta, Chandan, Dr. Raju Maiti, Sri
Dines Pal, Sri Shibshankar Banerjee, Sri Kshaunis Misra, Sri Anupam Mondal, Sri Kaushik
Jana, Sri Bappa Mondal, Sri Avishek Guha, Sri Maitreya Samanta, Dr. Himel Mallick, Dr.
Abhra Sarkar, Sri Sayantan Jana, Dr. Bipul Saurabh, Sri Sayan Dasgupta, Smt. Susmita
Bose, Smt. Tuhina Biswas, Smt. Upama Roy, Sri Sesha Sai Ram, Sri Koustuv Lahiri,
Sri Suresh Venkat, Sri Kushdesh Prasad, Sri Bappaditya Mondal, Sri Pradeep Sadhu, Sri
Pratyush Chandra Sinha, Smt. Pragya Patwari, Sri Vijay Anand, Sri Sougata Dhar, Sri
Saptarshi Chatterjee, Smt. Erina Paul, Smt. Gunisha Arora, Dr. Tsehaye Eyassu, Md.
Rafi Hossain, Dr. Priyanka Grover, Sri Narendra Chopra, Sri Paramahansa Pramanik, Mr.
Jacob Holzman and many other friends and philosophers for their various inspirational roles
in different stages of my life.
I must acknowledge the significant amount of resources I drew from the Northern Illinois
University (NIU) libraries, https://www.google.co.in/, https://scholar.google.com/, and
https://www.wikipedia.org/.
Last but not least, I would like to thank my other half, Smt. Puja Saha, for always
keeping faith in me.
DEDICATION
To my father Sri Arjun Kumar Maiti, my mother Smt. Rina Maiti, my sister Smt.
Sudeshna Maity, and to my soul-partner Smt. Puja Saha.
TABLE OF CONTENTS
Page
LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix
LIST OF FIGURES. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi
Chapter
1 INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
2 VARIABLE SELECTION PROBLEM: PAST AND PRESENT . . . . . . . . . . . . . 8
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.2 Classical Measures of Goodness of Fit . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.3 A Selective Review of Penalized Regression Methods . . . . . . . . . . . . . . . . . 12
2.3.1 Ridge Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.3.2 Bridge Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.3.3 Non-negative Garrote . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.3.4 Least Absolute Shrinkage and Selection Operator (LASSO) . . . . . . . 14
2.3.5 Smoothly Clipped Absolute Deviation Penalty . . . . . . . . . . . . . . . . 14
2.3.6 Fused Lasso . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.3.7 Elastic Net. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.3.8 Adaptive Lasso. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.3.9 Sure Independent Screening . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.3.10 Minimax Concave Penalty . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.3.11 Reciprocal Lasso. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.4 A Selective Review of Bayesian Model Selection Criteria. . . . . . . . . . . . . . . 19
2.4.1 Highest Posterior Model (HPM). . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.4.2 Marginal Likelihood . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.4.3 Log Pseudo Marginal Likelihood . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.4.4 L-Measure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.4.5 Deviance Information Criterion . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.4.6 Median Probability Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.5 Development of Prior Distributions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.5.1 Bayesian Connection to Penalized Regressions. . . . . . . . . . . . . . . . . 26
2.5.2 Shrinkage Prior . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.6 Sampling over Model Space . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3 COMPARISON AMONG LPML, DIC, AND HPM . . . . . . . . . . . . . . . . . . . . . . 30
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.2 Inconsistency of LPML . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.3 Simulation Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.3.1 Linear Model in Presence of Multicollinearity . . . . . . . . . . . . . . . . . 43
3.3.2 A Non-conjugate Setting: Logistic Regression . . . . . . . . . . . . . . . . . 44
3.3.3 Nodal Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.3.4 Melanoma data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
4 MEDIAN PROBABILITY MODEL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
4.2 Comparison of MPM and HPM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
5 SIGNIFICANCE OF HIGHEST POSTERIOR MODEL . . . . . . . . . . . . . . . . . . 52
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
5.2 Consistency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
6 VARIABLE SELECTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
6.2 Requirement of Sampling or Stochastic Search . . . . . . . . . . . . . . . . . . . . . . 54
6.3 Limitation of Bayesian Lasso and its Extensions . . . . . . . . . . . . . . . . . . . . 55
6.4 Maximization in the Model Space . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
6.5 Simulated Annealing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
6.6 Our Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
6.7 Convergence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
7 EMPIRICAL STUDY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
8 LOGISTIC REGRESSION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
8.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
8.2 Power Posterior. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
8.3 Simulation Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
9 SURVIVAL MODELS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
9.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
9.2 Weibull Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
9.3 Mixture of Weibull Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
9.4 Censoring . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
10 DISCUSSION. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
10.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
10.2 Comparison of Methods for Linear Models. . . . . . . . . . . . . . . . . . . . . . . . . 94
10.3 Non-Linear Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
10.4 Survival Settings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
10.5 Future Extensions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
10.5.1 Large p . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
10.5.2 Prior . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
10.5.3 Different Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
10.5.4 Posterior Sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
10.5.5 Computation Time . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
11 CONCLUSION. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
REFERENCES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
LIST OF TABLES
Table Page
3.1 Probabilities of selecting data generating models for Gunst and Mason data. 33
3.2 Probabilities of selecting the data-generating Model . . . . . . . . . . . . . . . . . . 43
3.3 Probabilities of selecting the data-generating models for logistic regression simulation example . . . . . . . . . . . . 48
3.4 Probabilities of selecting the data-generating models for nodal data. . . . . . . 48
3.5 Probabilities of selecting the data-generating models for melanoma data . . . 49
4.1 Comparison between MPM and HPM for linear models in presence of collinearity: Result presents the number of times the data generating model is recovered using 1000 simulations . . . . . . . . . . . . 51
7.1 Number of times the correct model is obtained based on 100 repetitions: Model with Uncorrelated Predictors . . . . . . . . . . . . 67
7.2 Number of times the correct model is obtained based on 100 repetitions: Linear regression in presence of collinearity . . . . . . . . . . . . 68
7.3 Number of times the correct model is obtained based on 100 repetitions: Lasso example . . . . . . . . . . . . 69
7.4 Number of times the correct model is obtained based on 100 repetitions: Adaptive lasso example (lasso fails) . . . . . . . . . . . . 70
7.5 Number of times the correct model is obtained based on 100 repetitions: p = 30 . . . . . . . . . . . . 71
7.6 Number of times the correct model is obtained based on 100 repetitions: Elastic net example. p = 40 . . . . . . . . . . . . 72
7.7 Number of times the correct model is obtained based on 100 repetitions: p > n, p = 200, n = 100 . . . . . . . . . . . . 73
7.8 Number of times the correct model is obtained based on 100 repetitions: p > n, p = 200, n = 100 . . . . . . . . . . . . 74
7.9 Number of times the correct model is obtained based on 100 repetitions: p > n, p = 1000, n = 200 . . . . . . . . . . . . 75
7.10 Number of times the correct model is obtained based on 100 repetitions: p > n, p = 1000, n = 200 and Presence of Correlation . . . . . . . . . . . . 76
7.11 Description of the Ozone dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
7.12 The HPM and MPM for Ozone-35 data. The last 2 columns provide the Bayes factor against the null model and its log, respectively . . . . . . . . . . . . 76
8.1 Number of times the correct model is obtained based on 100 repetitions: Logistic regression in presence of collinearity . . . . . . . . . . . . 82
8.2 Number of times the correct model is obtained based on 100 repetitions: Logistic regression for large p. p = 20 . . . . . . . . . . . . 83
9.1 Number of times the correct model is obtained based on 100 repetitions: Weibull regression . . . . . . . . . . . . 85
9.2 Number of times the correct model is obtained based on 100 repetitions: Data generating model – Weibull regression. Analysis model – Mixture of Weibull regression . . . . . . . . . . . . 88
9.3 Number of times the correct model is obtained based on 100 repetitions: Data generating model – Gamma regression. Analysis model – Mixture of Weibull regression . . . . . . . . . . . . 88
9.4 Number of times the correct model is obtained based on 100 repetitions: Data generating model – Log Normal regression. Analysis model – Mixture of Weibull regression . . . . . . . . . . . . 89
9.5 Number of times the correct model is obtained based on 100 repetitions: Presence of Censoring . . . . . . . . . . . . 91
9.6 Models obtained using different methods for Prot data . . . . . . . . . . . . . . . . 92
LIST OF FIGURES
Figure Page
7.1 Solution paths for different starting models for the SA-HPM method. The points are the log Bayes factors of models against the null model. The log(Bayes factor) of the data generating model (1, 2, 5) is 20.909 . . . . . . . . . . . . 77
9.1 (a) The path for dimensions of models obtained in each step of SA-HPM. (b) The path for log(marginal likelihood) of models obtained in each step of SA-HPM . . . . . . . . . . . . 93
CHAPTER 1
INTRODUCTION
A regression model typically specifies a relationship between an n × 1 response variable
y = (y1, . . . , yn)′ and p regressors (predictors, or covariates) x1, . . . , xp using X =
[x1, . . . , xp], an n × p design matrix, and β = (β1, . . . , βp)′, a p × 1 vector of coefficients
(parameters). The simplest and most commonly used regression model is the linear regression model,
given by

E(y) = Xβ
Assuming that there exists a true model or data-generating model, we fit, for analysis,
a collection of several nested models which are sub-models or super-models of the true
model, in order to identify the most useful model. Thus, in the analysis models, some of the βj
are zero or nonzero according as the corresponding xj, j = 1, . . . , p, are absent or present in the
model. It is therefore convenient to write the analysis models as

E(y) = Xγβγ (1.1)

where γ = (γ1, . . . , γp)′ is a p × 1 indicator vector with γj = 1 if xj is present and γj = 0 if xj is absent.
Therefore the model space M consists of 2^p models, where each model Mγ is indexed by the
binary vector γ according to (1.1). The first column of X is 1 (the vector of ones), unless
otherwise stated, and the corresponding intercept term is assumed to be always present in
the model.
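The sub-model construction in (1.1) is easy to sketch numerically; the data, coefficients, and indicator vector below are hypothetical, with numpy assumed:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 10, 5
# First column of X is the intercept column of ones, as assumed in the text.
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta = np.array([1.0, 2.0, 0.0, 0.0, -1.5])     # two coefficients are exactly zero

gamma = np.array([1, 1, 0, 0, 1], dtype=bool)   # indicator vector gamma
X_gamma = X[:, gamma]                           # design matrix of model M_gamma
beta_gamma = beta[gamma]                        # coefficients of model M_gamma
mean_y = X_gamma @ beta_gamma                   # E(y) under (1.1)

print(X_gamma.shape)   # (10, 3)
```

Because the excluded coefficients are exactly zero here, the sub-model mean equals the full-model mean Xβ.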
The objectives of variable selection can be broadly described as:
1. Providing efficient and effective predictors.
2. Providing a better understanding of the underlying process.
3. Improving the prediction performance of the predictors.
Variable selection attempts to select an “optimal subset of predictors,” as the goal is
to explain the data in the simplest possible way by removing redundant predictors.
The principle of Occam’s Razor states that among several plausible explanations for a phe-
nomenon, the simplest is best. Furthermore, redundant predictors may add noise to the
statistical inferential process.
As an example of the statistical variable selection problem, an essential biomedical question
in assessing the effect of biomarkers in diseases such as cancer is which of the factors
under investigation have a significant association with disease outcomes such as
cancer recurrence and mortality. This is often handled informally via some sort of stepwise
approach (such as by sequentially considering the p-values for the relevant predictors), but
from a formal viewpoint, this is a variable selection problem. Variable or feature selection is
a crucial part of any statistical analysis.
The vast collection of criteria for model selection includes multiple R2, the Akaike Information
Criterion (AIC) (Akaike, 1974), the Bayesian Information Criterion (BIC), Mallow's Cp,
the Widely Applicable Akaike Information Criterion (WAIC) (Watanabe, 2010), the Widely Applicable
Bayesian Information Criterion (WBIC) (Watanabe, 2013), and many others. Different methods of selecting a model include forward selection, backward selection, stepwise
selection, stagewise selection, ridge regression, the non-negative garrote (Breiman, 1995), bridge
regression (Frank & Friedman, 1993), and other methods. However, in large-scale variable
selection problems, the performance of these methods can be far from satisfactory; see, for
example, Breiman (1995).
Bayesian variable selection now has an extensive literature; see Kass & Raftery (1995),
O’Hara & Sillanpaa (2009) for an overall review and Ibrahim et al. (2004) (chapter 6) for
model comparison in time-to-event data. Criteria for Bayesian model selection include Bayes
factors, measures of model complexity and information, and goodness of fit measures.
The Bayes factor (Kass & Raftery, 1995; Basu & Chib, 2003) for model M1 against
model M2 is the ratio of their marginal likelihoods:

BF12 = Pr(y|M1) / Pr(y|M2) (1.2)
In the log scale,

log BF12 = log Pr(y|M1) − log Pr(y|M2),

where Pr(y|M) = ∫ Pr(y|M, θ) π(θ|M) dθ, y denotes the observed data, θ denotes
all unobservables, Pr(y|M, θ) is the likelihood for model M, and π(θ|M) is the prior density
of θ.
Kass & Raftery (1995) stated that the Bayes factor is a summary of evidence for model
M1 against model M2 and provided a table of cutoffs for interpreting log BF12. In general,
the model with the higher log-marginal likelihood is preferred under this model selection criterion.
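As a toy illustration of (1.2), consider a single observation under two normal models whose marginal likelihoods are available in closed form (the observation value below is hypothetical):

```python
from math import log, pi

def log_normal_pdf(y, mean, var):
    # log density of N(mean, var) evaluated at y
    return -0.5 * (log(2 * pi * var) + (y - mean) ** 2 / var)

y = 1.3  # a single (hypothetical) observed data point

# M1: y | theta ~ N(theta, 1) with prior theta ~ N(0, 1).
# Integrating theta out gives the marginal y ~ N(0, 2).
log_m1 = log_normal_pdf(y, 0.0, 2.0)

# M2: y ~ N(0, 1) with no free parameter; its marginal likelihood is the likelihood itself.
log_m2 = log_normal_pdf(y, 0.0, 1.0)

log_bf12 = log_m1 - log_m2   # log Bayes factor of M1 against M2, as in (1.2)
print(log_bf12)
```

A positive value favors M1; here the evidence is weak because a single observation carries little information.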
In the modern era, Bayesian inference is typically carried out by Markov chain sampling. The
computation of the Bayes factor from Markov chain sampling, however, is typically difficult
since Markov chain methods avoid the computation of the normalizing constant of the
posterior, and it is precisely this constant (or these constants) that is needed for the marginal likelihood.
Several methods for estimating the marginal likelihood (or integrated likelihood, or normalizing
constant) have been suggested; see Chen et al. (2001). Kass & Raftery (1995) used the
Laplace approximation. Chib (1995) proposed a method of estimating the marginal likelihood
from Gibbs sampling (Casella & George, 1992) output, while Chib & Jeliazkov (2001) calculated
the same from Metropolis–Hastings sampling (Hastings, 1970) output. Basu et al.
(2003) compared different parametric models for masked competing risks using Bayes factors,
whereas Basu & Ebrahimi (2003) used Bayes factors to compare their martingale process
model with other competing models. Basu & Chib (2003) presented a general method for
comparing semiparametric Bayesian models, constructed under the Dirichlet Process Mixture
(DPM) framework, with alternative semiparametric or parametric Bayesian models. A distinctive
feature of their method is that it can be applied to semiparametric models containing
covariates and hierarchical prior structures. The method proposes two separate
computation schemes for estimating the likelihood and posterior ordinates of the DPM
model at a single high-density point, which are then combined via the basic marginal identity
to obtain an estimate of the marginal likelihood. For computing the marginal likelihood
in the mixture cure rate competing risks model, Basu & Tiwari (2010) used a combination
of the volume-corrected harmonic mean estimator proposed by DiCiccio et al. (1997), which
restricts the Monte Carlo averaging to a high posterior density region, with the estimate
proposed by Chib (1995). This combination was used because the conditional posterior of
the cure fraction c was available in closed form (due to conditional conjugacy) whereas the other
conditionals were not. The estimate of the marginal likelihood was
finally obtained by the basic marginal identity (Chib, 1995; Basu & Chib, 2003). Neal
(2001) used importance sampling to calculate the marginal likelihood, while Gelman & Meng
(1998) used path sampling to calculate the integrated likelihood. Friel & Pettitt (2008)
extended the theory of path sampling to develop the power posterior method for calculating the
normalizing constant. Raftery et al. (2007) used a harmonic-mean-type estimator to estimate the
integrated likelihood.
The marginal likelihood can serve as a measure of model fit as well. The log pseudo marginal
likelihood (LPML) (Ibrahim et al., 2004; Yin & Ibrahim, 2005; Ghosh et al., 2009)
is a "leave-one-out" cross-validated criterion based on the conditional predictive ordinate
(CPO) (Gelfand et al., 1992). CPOs can be estimated from Markov chain samples. The
posterior predictive L-measure (Ibrahim et al., 2004; Laud & Ibrahim, 1995) is another
model selection criterion, combining a goodness-of-fit term with a penalty term (analogous to
variance and bias^2).
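A standard Monte Carlo estimator for the CPO uses the identity CPO_i^{-1} = (1/S) Σ_s 1/Pr(y_i | θ^(s)) over S posterior draws, with LPML = Σ_i log CPO_i. The sketch below applies this to a toy normal-mean model; the "posterior draws" are simulated stand-ins for actual MCMC output, not the dissertation's implementation:

```python
import numpy as np

def lpml(y, theta_draws, sigma=1.0):
    """LPML from harmonic-mean CPO estimates for the model y_i ~ N(theta, sigma^2)."""
    y = np.asarray(y, dtype=float)[:, None]                 # shape (n, 1)
    theta = np.asarray(theta_draws, dtype=float)[None, :]   # shape (1, S)
    # per-observation likelihood evaluated at each draw
    lik = np.exp(-0.5 * ((y - theta) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    cpo = 1.0 / np.mean(1.0 / lik, axis=1)                  # harmonic mean over the S draws
    return float(np.sum(np.log(cpo)))

rng = np.random.default_rng(1)
y_obs = rng.normal(0.0, 1.0, size=20)
theta_draws = rng.normal(y_obs.mean(), 0.2, size=500)  # stand-in for MCMC output
print(lpml(y_obs, theta_draws))
```

Larger LPML values indicate better predictive fit; the harmonic-mean form can be numerically unstable in the tails, which is why practical implementations often work on the log scale.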
The Deviance Information Criterion (DIC), proposed by Spiegelhalter et al. (2002), is
another Bayesian measure of model fit penalized for increased model complexity. The
DIC, in its original form, may not be appropriate in models with missing or latent variables
(mixture models or models involving random effects), and Celeux et al. (2006) developed
various modifications of the DIC. Ghosh et al. (2009) used a modified DIC and the LPML to
compare joinpoint models for modeling cancer rates over the years.
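In its original form, DIC = D(θ̄) + 2pD, where D(θ) = −2 log Pr(y|θ) is the deviance and pD = D̄ − D(θ̄) is the effective number of parameters. A minimal sketch on a toy normal-mean model (simulated draws stand in for MCMC output; not the modified versions discussed above):

```python
import numpy as np

def dic(log_lik, theta_draws):
    """DIC = D(theta_bar) + 2 * pD, with D(theta) = -2 log p(y | theta)
    and pD = (mean deviance) - (deviance at the posterior mean)."""
    devs = np.array([-2.0 * log_lik(t) for t in theta_draws])
    dev_at_mean = -2.0 * log_lik(np.mean(theta_draws))
    p_d = devs.mean() - dev_at_mean      # effective number of parameters
    return dev_at_mean + 2.0 * p_d

# toy normal-mean model with known unit variance; draws simulate MCMC output
rng = np.random.default_rng(2)
y = rng.normal(0.5, 1.0, size=30)
log_lik = lambda th: float(np.sum(-0.5 * np.log(2 * np.pi) - 0.5 * (y - th) ** 2))
draws = rng.normal(y.mean(), 1.0 / np.sqrt(len(y)), size=1000)
print(dic(log_lik, draws))
```

Smaller DIC values are preferred; for this one-parameter model, pD should come out close to 1.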
In frequentist model selection, the lasso (Tibshirani, 1996) has become a widely popular
procedure for regularized variable selection in least-squares-type regression problems. The
lasso can be fitted by the LARS algorithm (Efron et al., 2004) or the coordinate descent
algorithm (Friedman et al., 2010). Penalties other than L1 have also been explored. Recent
work on regularized variable selection methods includes the Smoothly Clipped Absolute Deviation
penalty (SCAD) (Fan & Li, 2001), the Adaptive Lasso (Zou, 2006), the Elastic Net (Zou
& Hastie, 2005), the Fused Lasso (Tibshirani et al., 2005), the Grouped Lasso (Yuan & Lin, 2006),
the Bootstrapped Lasso (Hall et al., 2009), and the Weighted Grouped Lasso. The oracle property (discussed
later) has been established for SCAD, the Adaptive Lasso, and the Bootstrapped Lasso.
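At its core, the lasso solves min_b (1/2n)||y − Xb||² + λ||b||₁, and cyclic coordinate descent reduces each coordinate update to a soft-thresholding step. A minimal sketch on simulated data (a bare-bones illustration, not the LARS or glmnet implementations cited above):

```python
import numpy as np

def lasso_cd(X, y, lam, n_sweeps=200):
    """Cyclic coordinate descent for min_b (1/2n)||y - Xb||^2 + lam * ||b||_1."""
    n, p = X.shape
    b = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    for _ in range(n_sweeps):
        for j in range(p):
            r_j = y - X @ b + X[:, j] * b[j]   # partial residual excluding x_j
            rho = X[:, j] @ r_j / n
            # soft-thresholding: coordinates with weak signal are set exactly to zero
            b[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
    return b

rng = np.random.default_rng(3)
n, p = 100, 8
X = rng.normal(size=(n, p))
beta_true = np.array([3.0, -2.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
y = X @ beta_true + rng.normal(scale=0.5, size=n)
b_hat = lasso_cd(X, y, lam=0.2)
print(np.round(b_hat, 3))   # most coefficients are shrunk exactly to zero
```

The exact zeros produced by the L1 penalty are what make the lasso a variable selection method rather than just a shrinkage method.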
Bayesian regularization literature is rich as well. Bayesian lasso via the double-exponential
prior has been explored in Park & Casella (2008) and Hans (2009). Bayesian Adaptive Lasso
has been developed and used by Leng et al. (2014) and Sun et al. (2009). The Bayesian Elastic
Net has been developed by Li & Lin (2010) and Kyung et al. (2010). In addition, Kyung et
al. (2010) also developed Bayesian Group Lasso and Bayesian Fused Lasso. Recently, Polson
et al. (2014) considered Bayesian Bridge.
Note that Bayesian regularization methods implement the regularization through the choice
of an appropriate prior on the regression coefficients β; such priors can be referred to as shrinkage
priors. In addition to these penalty priors, other shrinkage priors have also been proposed.
Carvalho et al. (2010) considered the horseshoe prior. Polson & Scott (2012) proposed joint
priors generated by a Lévy process.
The spike-and-slab prior is another type of prior in the Bayesian shrinkage area; it was
originally proposed by Mitchell & Beauchamp (1988), was greatly improved by George &
McCulloch (1993), George & McCulloch (1997), and Kuo & Mallick (1998), and was further
developed by Ishwaran & Rao (2005) and Dey et al. (2008). García-Donato & Martínez-Beneito
(2013) used a posterior summary based on Stochastic Search Variable Selection (SSVS)
(George & McCulloch, 1997) for variable selection in high-dimensional data. They illustrated
the advantage of a Bayesian model selection method based on visit frequency over other
methods when the parameter space is finite, both by theoretical justification and by simulation.
A competing Bayesian variable selection proposal is based on the median probability
model (MPM). Barbieri & Berger (2004) theoretically established that, in the normal linear
model under a predictive loss, the optimal predictive model is the median probability model.
However, their theoretical optimality result is based on the assumption of linear models and
an orthogonal design matrix. The performance of the median probability model under a correlated
structure remains an active area of research.
The marginal-likelihood-based highest posterior model (HPM) approach, on the other
hand, chooses the model which has the highest posterior probability, or the highest marginal
likelihood. This is a formal approach grounded in the basic probabilistic development
of the Bayesian statistical paradigm. However, this approach is often computationally difficult
to implement, as one needs to calculate the posterior probabilities of all the models
in the model space. For example, even in the presence of just 30 variables or features, the
model space contains more than one billion models (2^30), and any method which explores one
model at a time may take years to explore this model space.
This dissertation proposes an efficient method for variable selection using the highest posterior
probability approach.
CHAPTER 2
VARIABLE SELECTION PROBLEM: PAST AND PRESENT
2.1 Introduction
A statistical model specifies a relationship between an n × 1 response variable or dependent
variable y = (y1, . . . , yn)′ and p many n × 1 regressors, also called covariates, predictors, explanatory
variables, or independent variables, x1, x2, . . . , xp. This is typically done by specifying
the stochastic law of the response y, which involves X = [x1, . . . , xp], the n × p design matrix
whose columns are the covariates, and β = (β1, . . . , βp)′, the parameter vector.
1. In the linear model,
E(y) = Xβ
2. For the logistic regression model the response has only two outcomes, denoted by 0 and 1.
The following then provides the relation between y and X:

y ∼ Bernoulli(π) (2.1)

where

π = P(y = 1|X, β) = exp(Xβ) / (1 + exp(Xβ)) (2.2)
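Expression (2.2) is the inverse-logit (sigmoid) transform of the linear predictor. A small, numerically stable sketch (the helper name and the example values are illustrative):

```python
import numpy as np

def inverse_logit(eta):
    """pi = exp(eta) / (1 + exp(eta)), computed stably for large |eta|."""
    eta = np.asarray(eta, dtype=float)
    out = np.empty_like(eta)
    pos = eta >= 0
    out[pos] = 1.0 / (1.0 + np.exp(-eta[pos]))               # avoids overflow of exp(eta)
    out[~pos] = np.exp(eta[~pos]) / (1.0 + np.exp(eta[~pos]))
    return out

eta = np.array([-2.0, 0.0, 2.0])   # hypothetical values of the linear predictor X @ beta
print(inverse_logit(eta))          # P(y = 1 | X, beta) for each value
```

The two-branch form gives the same result as the direct formula but never exponentiates a large positive number.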
3. In survival settings the response is a time to event. Let S(y) be the survival function of
y. Then the stochastic relation between y and X is given by

S(y) = [S0(y)]^{exp(Xβ)}

where S0(y) is the baseline survival function.
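Under this proportional form, covariates raise the baseline survival function to the power exp(Xβ). A small numerical sketch with a Weibull baseline (the shape, scale, and covariate values are hypothetical):

```python
import numpy as np

def ph_survival(t, x, beta, shape=1.5, scale=2.0):
    """S(t | x) = S0(t) ** exp(x @ beta), with a Weibull baseline S0."""
    s0 = np.exp(-(t / scale) ** shape)   # baseline survival at time t
    return s0 ** np.exp(x @ beta)        # covariate effect enters through the exponent

t = 1.0
x = np.array([1.0, 0.5])
beta = np.array([0.4, -0.2])
print(ph_survival(t, x, beta))   # survival probability at time t for covariates x
```

Since here x @ beta = 0.3 > 0, the covariates lower survival relative to the baseline, consistent with a proportional increase in the hazard.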
When there is a large number of predictors, some of them are more significant than others
in explaining the response. One of the main aims of variable selection is to select an important
subset of the explanatory variables. Define a p × 1 indicator vector γ = (γ1, . . . , γp)′ with
γj = 1 if xj is present and γj = 0 if xj is absent.
The stochastic law of y then depends on (x1, . . . , xp) via Xγβγ, where γ acts as a subscript
of X (and of β) such that xj (βj) is present in the model whenever γj = 1,
j = 1, 2, . . . , p. For example, suppose x1 and x2 are present in the model. Then

γ = (1, 1, 0, . . . , 0)′, with p − 2 trailing zeros,

Xγ = [x1, x2], and βγ = (β1, β2)′.
Note that γj can take only two possible values, namely 0 and 1, and there are p
elements in γ. Thus, altogether there are 2^p possible combinations, which implies that there
are 2^p models in the model space Ω (say). We denote an individual model in the model space
by Mγ, indexed by the binary vector γ.
10
When x1 and x2 are present in the model space the corresponding model can alternatively
denoted by (1, 2). We shall use these notations in this dissertation frequently. The null model
which has no independent variable in the model is denoted by M0.
One notable feature of the model space is that it grows extremely fast even for moderate p. For instance, when there are 30 independent variables the model space contains 2^30 = 1,073,741,824 models.
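To make the combinatorial explosion concrete, the following sketch (illustrative code, assuming Python with only the standard library; the helper name is ours) enumerates the inclusion vectors γ for small p and shows why exhaustive enumeration quickly becomes hopeless:

```python
from itertools import product

def model_space(p):
    """All 2^p inclusion vectors gamma = (gamma_1, ..., gamma_p) over p predictors."""
    return list(product([0, 1], repeat=p))

# For small p the model space can still be enumerated exhaustively.
print(len(model_space(4)))

# For p = 30 the count alone shows why enumeration is infeasible.
print(2 ** 30)
```

Each tuple returned by `model_space` corresponds to one model M_γ; doubling p squares the size of Ω.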
We can divide the collection of variable selection methods into two broad groups. Methods in the first group enumerate the whole model space to choose a good model, whereas methods in the second group search the model space without visiting every model. Naturally, when p is large, methods from the first group can be difficult or impossible to implement. We review methods from both groups below, primarily in the setting of linear models unless otherwise stated, though these methods are applicable to more complex models as well.
2.2 Classical Measures of Goodness of Fit
Classical criteria for the goodness of fit of a model include the error sum of squares (SSE), R², Mallows' C_p, the Akaike Information Criterion (AIC) (Akaike, 1974), and the Bayesian Information Criterion (BIC) (Schwarz, 1978).
For nested models, SSE always decreases as the number of predictors increases. Similarly, the multiple R² always increases as the number of predictors increases. Therefore neither can serve alone as a good measure of goodness of fit. This is precisely why the adjusted R² has been suggested; however, the adjusted R² is often criticized for lacking an easy interpretation. AIC is derived from the Kullback-Leibler distance, but it suffers from overfitting as the sample size grows (Yan & Su, 2009). The Bayesian counterpart of this type of information criterion is BIC, which puts a larger penalty on model size; in doing so, BIC faces the opposite issue of underfitting (Yan & Su, 2009).
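As a sketch of how these two criteria trade fit against complexity, the code below (an illustrative implementation, not tied to any particular software package) computes AIC and BIC for a Gaussian linear model from the SSE, dropping additive constants common to all candidate models:

```python
import numpy as np

def aic_bic(y, X):
    """AIC and BIC for a Gaussian linear model, up to additive constants."""
    n, k = X.shape
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    sse = np.sum((y - X @ beta) ** 2)
    aic = n * np.log(sse / n) + 2 * k
    bic = n * np.log(sse / n) + k * np.log(n)  # heavier penalty once log(n) > 2
    return aic, bic

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = 2.0 * X[:, 0] + rng.normal(size=100)  # only the first predictor matters

print(aic_bic(y, X))         # full model with two spurious predictors
print(aic_bic(y, X[:, :1]))  # data generating submodel
```

Since log(100) > 2, BIC penalizes each extra parameter more heavily than AIC here, which is exactly the overfitting/underfitting trade-off noted above.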
Forward selection is a process for selecting a model rather than assessing the goodness of fit of a given model. It adds one explanatory variable at a time, and the added variable is kept by comparing some goodness-of-fit criterion (for example, that all of the included variables are significant) between the current model and the model without that variable. In backward selection we fit the full model first and then remove one covariate at a time according to a goodness-of-fit measure until we reach the desired fit.
Stepwise selection is a combination of forward and backward selection in which each step considers both the addition and the removal of variables, one at a time. This method has been subjected to several criticisms in the literature; for a nice summary of its limitations see Flom & Cassell (2007). Essentially, stepwise regression produces unstable results: a slight change in the data may produce a different set of regressors (Breiman, 1995).
A model is often built on a subset of the data, and validation of the fitted model is done using the remaining data. The data used to build the model are called the training set, and the remaining data are referred to as the test set or validation set. Leave-k-out cross validation keeps k observations in the test set and repeats the process of building the model over many different choices of training set, and hence test set. The most common special case keeps a single observation in the test set. However, Shao (1993) proved that the probability of recovering the data generating model using leave-one-out cross validation based prediction error never tends to one as the sample size increases.
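The mechanics of leave-one-out cross validation for least squares can be sketched as follows; a well-known closed form says the held-out residual equals e_i/(1 − h_i), so in this case no refitting is actually needed (illustrative code on simulated data):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 30, 3
X = rng.normal(size=(n, p))
y = X @ np.array([1.0, 0.0, -2.0]) + rng.normal(size=n)

# Brute force: refit with observation i deleted, then predict y_i.
press = 0.0
for i in range(n):
    keep = np.arange(n) != i
    beta_i = np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]
    press += (y[i] - X[i] @ beta_i) ** 2

# Shortcut: the held-out residual equals e_i / (1 - h_i); no refitting needed.
beta = np.linalg.lstsq(X, y, rcond=None)[0]
e = y - X @ beta
h = np.einsum('ij,jk,ik->i', X, np.linalg.inv(X.T @ X), X)  # diagonal of hat matrix
press_fast = np.sum((e / (1 - h)) ** 2)
```

The two computations agree exactly; the same hat-matrix identities reappear in the proof of Theorem 3.2 later in this dissertation.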
2.3 A Selective Review of Penalized Regression Methods
2.3.1 Ridge Regression
Ridge regression can provide efficient parameter estimates in the presence of multicollinearity among the predictors (Seber & Lee, 2003). Furthermore, when a large number of covariates is present, ridge regression shrinks the coefficients toward zero. The ridge estimate is defined as
\hat{\beta} = \arg\min_{\beta} \sum_{i=1}^{n} \Big( y_i - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 \quad \text{subject to} \quad \sum_{j=1}^{p} \beta_j^2 \le t,

or equivalently,

\hat{\beta} = \arg\min_{\beta} \sum_{i=1}^{n} \Big( y_i - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 + \lambda \sum_{j=1}^{p} \beta_j^2.
A drawback of ridge regression is that the ridge estimates are not scale invariant.
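The penalized form has the closed-form minimizer (X^T X + λI)^{-1} X^T y, which the following sketch computes (illustrative helper on simulated data):

```python
import numpy as np

def ridge(X, y, lam):
    """Closed-form ridge estimate: solve (X'X + lam I) beta = X'y."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 4))
y = X @ np.array([3.0, 0.0, 0.0, -1.0]) + rng.normal(size=50)

b_ols = ridge(X, y, 0.0)     # lam = 0 recovers ordinary least squares
b_ridge = ridge(X, y, 10.0)  # a positive lam shrinks the solution toward zero
```

The coefficients shrink toward zero as λ grows but, unlike the lasso discussed below, are never set exactly to zero.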
2.3.2 Bridge Regression
Frank & Friedman (1993) introduced bridge regression. The Bridge estimate is defined
as
\hat{\beta} = \arg\min_{\beta} \sum_{i=1}^{n} \Big( y_i - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 \quad \text{subject to} \quad \sum_{j=1}^{p} |\beta_j|^q \le t,

or equivalently,

\hat{\beta} = \arg\min_{\beta} \sum_{i=1}^{n} \Big( y_i - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 + \lambda \sum_{j=1}^{p} |\beta_j|^q,

where q > 0.
Oracle properties of an estimator have been discussed by Fan & Li (2001) and Zou (2006). Let β^{(n)} be an estimator of β based on a sample of size n. Define B = {j : β_j ≠ 0}, the set of non-zero components of β, and B_n = {j : β^{(n)}_j ≠ 0}. The estimator β^{(n)} is said to have the oracle property if

1. \lim_{n\to\infty} P(B_n = B) = 1;

2. \sqrt{n}\,(\beta^{(n)}_B - \beta_B) \xrightarrow{D} N(0, \Sigma^*), where Σ* is the asymptotic covariance matrix of β^{(n)}_B.
For 0 < q ≤ 1, bridge regression produces sparse estimates, which is useful when there is a large number of covariates. Huang et al. (2008) proved oracle properties in the sense of Fan & Li (2001) for this type of bridge estimator. When q > 1, however, the method tends to keep all of the covariates. Moreover, it is not clear how to set the values of λ and q.
2.3.3 Non-negative Garrote
The nonnegative garrote estimate (Breiman, 1995) rescales an initial estimate \hat\beta_j (for example, the least squares estimate) by nonnegative factors c_j chosen as

\hat{c} = \arg\min_{c} \sum_{i=1}^{n} \Big( y_i - \sum_{j=1}^{p} c_j \hat\beta_j x_{ij} \Big)^2 \quad \text{subject to} \quad c_j \ge 0 \ \text{and} \ \sum_{j=1}^{p} c_j \le t,

or equivalently,

\hat{c} = \arg\min_{c} \sum_{i=1}^{n} \Big( y_i - \sum_{j=1}^{p} c_j \hat\beta_j x_{ij} \Big)^2 + \lambda \sum_{j=1}^{p} c_j, \qquad c_j \ge 0.
Breiman (1995) illustrated that the garrote performs better than stepwise selection and ridge regression.
2.3.4 Least Absolute Shrinkage and Selection Operator (LASSO)
Tibshirani (1996) introduced the lasso estimate which is defined as,
\hat{\beta} = \arg\min_{\beta} \sum_{i=1}^{n} \Big( y_i - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 \quad \text{subject to} \quad \sum_{j=1}^{p} |\beta_j| \le t,

or equivalently,

\hat{\beta} = \arg\min_{\beta} \sum_{i=1}^{n} \Big( y_i - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 + \lambda \sum_{j=1}^{p} |\beta_j|.
Lasso can be viewed as a special case of bridge regression with q = 1, and has the
ability to shrink the coefficients exactly to zero. Tibshirani (1996) compared performance of
lasso in selecting the data generating model with ridge regression and non-negative garrote.
Lasso is very popular and lasso estimates are easily interpretable. However, there are some
limitations of lasso.
• The oracle property (Fan & Li, 2001) does not hold for lasso estimators.
• When p > n, lasso cannot select more than n covariates.
• In presence of multicollinearity, lasso may perform poorly.
Many modifications to lasso have been developed to address these limitations.
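The exact-zero behavior of the lasso comes from the soft-thresholding operator that solves each coordinate subproblem. A minimal coordinate descent sketch (illustrative code on simulated data, not a production solver; the penalty value is arbitrary) makes this concrete:

```python
import numpy as np

def soft_threshold(z, lam):
    """Shrink z toward zero by lam, clipping exactly to zero inside [-lam, lam]."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def lasso_cd(X, y, lam, n_sweeps=200):
    """Coordinate descent for (1/2)||y - X b||^2 + lam * sum_j |b_j|."""
    b = np.zeros(X.shape[1])
    for _ in range(n_sweeps):
        for j in range(X.shape[1]):
            r_j = y - X @ b + X[:, j] * b[j]  # partial residual excluding x_j
            b[j] = soft_threshold(X[:, j] @ r_j, lam) / (X[:, j] @ X[:, j])
    return b

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 5))
y = X @ np.array([4.0, 0.0, 0.0, -3.0, 0.0]) + rng.normal(size=100)
b = lasso_cd(X, y, lam=50.0)
# With a large enough penalty, some coefficients land exactly at zero,
# while the strong signals survive (shrunken toward zero).
```

Running with smaller `lam` reproduces the familiar lasso path: coefficients enter the model one by one as the penalty relaxes.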
2.3.5 Smoothly Clipped Absolute Deviation Penalty
Fan & Li (2001) noted that any penalty should satisfy the following desired properties:
1. Unbiasedness: The resulting estimator is nearly unbiased when the true unknown
parameter is large to avoid unnecessary modeling bias.
2. Sparsity: The resulting estimator is a thresholding rule, which automatically sets
small estimated coefficients to zero to reduce model complexity.
3. Continuity: The resulting estimator is continuous in data in appropriate metrics to
avoid instability in model prediction.
According to Fan & Lv (2010a), “The convex Lq penalty with q > 1 does not satisfy the
sparsity condition, whereas the convex L1 penalty does not satisfy the unbiasedness condi-
tion, and the concave Lq penalty with 0 < q < 1 does not satisfy the continuity condition.
In other words, none of the Lq penalties satisfies all three conditions simultaneously.” In
order to achieve all three properties, the following penalty function was proposed.
The Smoothly Clipped Absolute Deviation (SCAD) penalty (Fan & Li, 2001) estimate
is defined as,
\hat{\beta} = \arg\min_{\beta} \sum_{i=1}^{n} \Big( y_i - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 + \sum_{j=1}^{p} p_\lambda(\beta_j),

where the derivative of the penalty p_\lambda is given by

p'_\lambda(\theta) = \lambda \Big\{ I(\theta \le \lambda) + \frac{(a\lambda - \theta)_{+}}{(a-1)\lambda} \, I(\theta > \lambda) \Big\}

for a > 2 and θ > 0.
The penalty function satisfies the mathematical conditions for unbiasedness, sparsity and
continuity. Furthermore, Fan & Li (2001) showed that the estimate has oracle property.
2.3.6 Fused Lasso
The fused lasso (Tibshirani et al., 2005) estimate is defined as,
\hat{\beta} = \arg\min_{\beta} \sum_{i=1}^{n} \Big( y_i - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 + \lambda_1 \sum_{j=1}^{p} |\beta_j| + \lambda_2 \sum_{j=2}^{p} |\beta_j - \beta_{j-1}|.
This is particularly useful when there exists a natural ordering in the covariates.
2.3.7 Elastic Net
The elastic net estimate (Zou & Hastie, 2005) is defined as,
\hat{\beta} = \arg\min_{\beta} \sum_{i=1}^{n} \Big( y_i - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 + \lambda_1 \sum_{j=1}^{p} |\beta_j| + \lambda_2 \sum_{j=1}^{p} \beta_j^2.
The elastic net estimator can select more than n predictors, and the additional square
penalty term helps to address the issue of collinear predictors.
2.3.8 Adaptive Lasso
The adaptive lasso estimate (Zou, 2006) is defined as,
\hat{\beta} = \arg\min_{\beta} \sum_{i=1}^{n} \Big( y_i - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 + \sum_{j=1}^{p} \lambda_j |\beta_j|.
Zou (2006) established that this estimate has oracle property, and can select more than
n predictors.
2.3.9 Sure Independence Screening
Consider the componentwise regression

\omega = X^T y.

Sure Independence Screening (SIS) (Fan & Lv, 2008) proposes to select the model M_γ where, for a given γ ∈ (0, 1),

\mathcal{M}_\gamma = \big\{ 1 \le i \le p : |\omega_i| \text{ is among the first } [\gamma n] \text{ largest of all} \big\},

with [γn] denoting the integer part of γn. Fan & Lv (2008) showed that, under some regularity assumptions,

\Pr(\mathcal{M}_* \subset \mathcal{M}_\gamma) \to 1,

where M_* is the data-generating model.
By definition, SIS selects fewer than n covariates, but we note that the screening property holds even if we select more than n covariates. Iterative Sure Independence Screening (ISIS) (Fan & Lv, 2008) is an iterative version of SIS; because of its iterative nature, ISIS can often recover an important predictor that is left out by SIS. After we select the model M_γ using SIS or ISIS, the final model is obtained by applying usual penalized procedures such as the lasso, SCAD, or MCP (defined in the following section).
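The screening step itself is a one-line computation. A sketch on simulated p ≫ n data (illustrative code; the cutoff d = n − 1 is one common choice, not prescribed by the method):

```python
import numpy as np

def sis(X, y, d):
    """Keep the d covariates with largest |x_j' y| (componentwise regression)."""
    omega = X.T @ y
    return set(np.argsort(-np.abs(omega))[:d])

rng = np.random.default_rng(4)
n, p = 60, 500  # p >> n: screen first, then apply a penalized method
X = rng.normal(size=(n, p))
y = 5.0 * X[:, 0] - 5.0 * X[:, 1] + rng.normal(size=n)

selected = sis(X, y, d=n - 1)
# The two truly active covariates (indices 0 and 1) survive the screen,
# after which lasso/SCAD/MCP can be run on the reduced design.
```

The sure screening guarantee above says exactly this: with probability tending to one, the data-generating variables are retained.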
2.3.10 Minimax Concave Penalty
Zhang (2010) proposed the Minimax Concave Penalty (MCP) estimate as
\hat{\beta} = \arg\min_{\beta} \sum_{i=1}^{n} \Big( y_i - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 + \sum_{j=1}^{p} p_\lambda(\beta_j),

where the penalty p_\lambda is given by

p_\lambda(\beta_j) = \lambda \int_0^{|\beta_j|} \Big( 1 - \frac{x}{\gamma\lambda} \Big)_{+} dx.
Zhang (2010) noted that the MCP estimate obeys the oracle property and satisfies the mathematical conditions for unbiasedness, sparsity, and continuity in the sense of Fan & Li (2001).
2.3.11 Reciprocal Lasso
The reciprocal lasso estimate (Song & Liang, 2015) is defined as,
\hat{\beta} = \arg\min_{\beta} \sum_{i=1}^{n} \Big( y_i - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 + \sum_{j=1}^{p} \frac{\lambda}{|\beta_j|} \, I(\beta_j \ne 0).

This estimate also has the oracle property.
Other important references on penalized regression include the grouped lasso (Yuan & Lin, 2006), the Dantzig selector (Candes & Tao, 2007), the bootstrapped lasso (Hall et al., 2009), and the adaptive elastic net (Zou & Zhang, 2009).
2.4 A Selective Review of Bayesian Model Selection Criteria
2.4.1 Highest Posterior Model (HPM)
In the Bayesian paradigm, uncertainty about parameters is expressed by a prior probability distribution. Uncertainty about models can further be expressed by putting a prior distribution on the model space. In model selection, a Bayesian model can thus be defined as
y \sim f(y \mid \theta, \mathcal{M}), \qquad \theta \sim \pi(\theta \mid \mathcal{M}), \qquad \mathcal{M} \sim \Pr(\mathcal{M}),

where π(θ | M) is the prior distribution of the parameter θ under model M and Pr(M) is the prior on the model. The posterior distribution of θ under model M is then given by

\pi(\theta \mid y, \mathcal{M}) = \frac{f(y \mid \theta, \mathcal{M}) \, \pi(\theta \mid \mathcal{M})}{\int f(y \mid \theta, \mathcal{M}) \, \pi(\theta \mid \mathcal{M}) \, d\theta},

and the posterior probability of model M is given by

\Pr(\mathcal{M} \mid y) = \frac{\Pr(y \mid \mathcal{M}) \Pr(\mathcal{M})}{\sum_{\mathcal{M}_\gamma \in \Omega} \Pr(y \mid \mathcal{M}_\gamma) \Pr(\mathcal{M}_\gamma)}, \quad (2.3)

where

\Pr(y \mid \mathcal{M}) = \int \Pr(y \mid \theta, \mathcal{M}) \, \pi(\theta \mid \mathcal{M}) \, d\theta \quad (2.4)
is called the marginal likelihood or integrated likelihood of data y under model M.
The highest posterior model (HPM) is defined as the model having the highest posterior probability among all models in the model space, that is,

\text{HPM} = \arg\max_{\gamma} \Pr(\mathcal{M}_\gamma \mid y) = \arg\max_{\gamma} \Pr(y \mid \mathcal{M}_\gamma) \Pr(\mathcal{M}_\gamma),

or equivalently, the model M is said to be the highest posterior model if

\Pr(\mathcal{M} \mid y) = \max_{\gamma} \Pr(\mathcal{M}_\gamma \mid y).
Kass & Raftery (1995) pointed out that the highest posterior model approach has a solid theoretical foundation. In addition, Raftery (1999) wrote, “The hypothesis testing procedure defined by choosing the model with the higher posterior probability minimizes the total error rate, that is, the sum of type I and type II error rates. Note that, frequentist statisticians sometimes recommend reducing the significance level in tests when the sample size is large, the Bayes factor (and hence highest posterior model) does this automatically!!” Most importantly, model uncertainty can be captured when using the posterior probability approach.
2.4.2 Marginal Likelihood
Finding the HPM requires the following:

1. the enumeration of the whole model space;

2. the computation of the marginal likelihood in (2.4).

When a large number of covariates is present, the enumeration of the whole model space can be challenging even with modern computing capabilities. As for the second point, an extensive amount of research has been done on estimating the marginal likelihood in complex non-conjugate settings. On the contrary, the calculation of the marginal likelihood is straightforward when we have a conjugate prior.
Consider the conjugate setting of a normal linear model with normal priors, where

y \mid \beta \sim N(X\beta, \tau_y^{-1}), \quad (2.5)

\beta \sim N(\beta_0, \tau_0^{-1}). \quad (2.6)

It follows that the posterior distribution of β is given by

\beta \mid y \sim N\big( (X^T \tau_y X + \tau_0)^{-1} (X^T \tau_y y + \tau_0 \beta_0), \; (X^T \tau_y X + \tau_0)^{-1} \big),

and the marginal distribution of y is given by

y \sim N\big( X\beta_0, \; \tau_y^{-1} + X \tau_0^{-1} X^T \big), \quad (2.7)
which provides the value of marginal likelihood directly.
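For instance, the log marginal likelihood can be read off by evaluating the normal density in (2.7). A sketch (illustrative helper names, with the precisions taken as matrices); the test below checks the result against the identity Pr(y) = f(y|β)π(β)/π(β|y), which holds at any β:

```python
import numpy as np

def mvn_logpdf(x, mean, cov):
    """Log density of N(mean, cov), via slogdet and a linear solve."""
    d = len(x)
    _, logdet = np.linalg.slogdet(cov)
    r = x - mean
    return -0.5 * (d * np.log(2 * np.pi) + logdet + r @ np.linalg.solve(cov, r))

def log_marginal(y, X, beta0, tau_y, tau0):
    """log Pr(y) from (2.7): y ~ N(X beta0, tau_y^{-1} + X tau0^{-1} X')."""
    cov = np.linalg.inv(tau_y) + X @ np.linalg.inv(tau0) @ X.T
    return mvn_logpdf(y, X @ beta0, cov)

rng = np.random.default_rng(8)
n, p = 12, 2
X = rng.normal(size=(n, p))
beta0 = np.zeros(p)
tau_y = 2.0 * np.eye(n)  # error precision
tau0 = 0.5 * np.eye(p)   # prior precision
y = rng.normal(size=n)

lm = log_marginal(y, X, beta0, tau_y, tau0)
```

Comparing `lm` across candidate designs X_γ is exactly the conjugate-case marginal likelihood comparison used to find the HPM.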
The g-prior (Zellner, 1986) for β in the setting of the linear model is given by

\beta \mid \sigma, \mathcal{M} \sim N\big( 0, \; g\sigma^2 (X^T X)^{-1} \big). \quad (2.8)

Consider the g-prior given by (2.8). Also assume a constant improper prior on the intercept α and the noninformative prior π(σ) ∝ 1/σ on the standard deviation σ, where σ² = τ_y^{-1}. The prior is then given by (García-Donato & Martínez-Beneito, 2013)

\pi(\alpha, \beta, \sigma \mid g) = \sigma^{-1} \, N\Big( \beta \,\Big|\, 0, \; g\sigma^2 \big( X^T (I - \tfrac{1}{n}\mathbf{1}\mathbf{1}^T) X \big)^{-1} \Big)

for γ ≠ 0, and π_0(α, σ) = 1/σ for the null model M_0.
Then the Bayes factor of any model M_γ against the null model M_0 is given by (García-Donato & Martínez-Beneito, 2013)

BF_{\gamma 0} = \frac{(1+g)^{(n-k_\gamma-1)/2}}{\big( 1 + g \, \mathrm{SSE}_\gamma / \mathrm{SSE}_0 \big)^{(n-1)/2}}, \quad (2.9)

where SSE_γ is the residual sum of squares of M_γ, SSE_0 is the residual sum of squares of M_0, and k_γ is the number of explanatory variables present in M_γ.
Note that the posterior probability of M_γ can be written as

\Pr(\mathcal{M}_\gamma \mid y) = \frac{BF_{\gamma 0} \Pr(\mathcal{M}_\gamma)}{\sum_{i} BF_{i0} \Pr(\mathcal{M}_i)} \propto BF_{\gamma 0} \Pr(\mathcal{M}_\gamma).

Therefore, under a constant prior probability on the models, the posterior probability of model M_γ is proportional to the Bayes factor of M_γ against the null model M_0, after using the g-prior for β.
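With (2.9), finding the HPM under equal model priors reduces to maximizing the Bayes factor over an enumeration of the model space. A sketch (illustrative helper names; g = n is one common choice; data simulated):

```python
import numpy as np
from itertools import combinations

def sse(y, Xc):
    """Residual sum of squares after regressing y on [1, Xc]."""
    Z = np.column_stack([np.ones(len(y)), Xc])
    beta = np.linalg.lstsq(Z, y, rcond=None)[0]
    return np.sum((y - Z @ beta) ** 2)

def log_bf(y, X, cols, g):
    """log BF of the model with covariates `cols` vs. the intercept-only null, as in (2.9)."""
    n, k = len(y), len(cols)
    sse0 = np.sum((y - y.mean()) ** 2)
    ratio = sse(y, X[:, cols]) / sse0
    return 0.5 * (n - k - 1) * np.log1p(g) - 0.5 * (n - 1) * np.log1p(g * ratio)

rng = np.random.default_rng(5)
n, p = 80, 4
X = rng.normal(size=(n, p))
y = 1.0 + 3.0 * X[:, 0] + rng.normal(size=n)

# Enumerate all non-null models; under equal model priors the HPM maximizes the BF.
models = [c for r in range(1, p + 1) for c in combinations(range(p), r)]
hpm = max(models, key=lambda c: log_bf(y, X, list(c), g=n))
```

For larger p this exhaustive `max` is exactly the enumeration bottleneck motivating the stochastic search methods of Section 2.6.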
When we have non-conjugate priors and rely on Markov chain Monte Carlo (MCMC) for Bayesian analysis, estimation of the integrated likelihood is challenging because MCMC bypasses the computation of normalizing constants, whereas the marginal likelihood is precisely the normalizing constant of the posterior distribution. Chib (1995) developed a method for estimating the marginal likelihood from Gibbs sampling output, which was extended by Chib & Jeliazkov (2001) to Metropolis-Hastings sampling output. Gelman & Meng (1998) used path sampling for calculating the integrated likelihood, which motivated the power posterior method of Friel & Pettitt (2008). Raftery et al. (2007) used a harmonic mean type estimator to estimate the marginal likelihood.
2.4.3 Log Pseudo Marginal Likelihood
The conditional predictive ordinate (Ibrahim et al., 2004) for the i-th observation is defined as

CPO_i = f(y_i \mid y_{-i}) = \int f(y_i \mid \theta) \, \pi(\theta \mid y_{-i}) \, d\theta, \quad i = 1, \ldots, n, \quad (2.10)

where y_{-i} = (y_1, \ldots, y_{i-1}, y_{i+1}, \ldots, y_n)' denotes the set of all observed data points excluding y_i, and π(θ | y_{-i}) is the posterior distribution of θ given y_{-i}.
The CPO is thus the conditional density of y_i given the remaining observations, which can alternatively be expressed as

CPO_i = \int f(y_i \mid \theta) \, \pi(\theta \mid y_{-i}) \, d\theta
      = \int f(y_i \mid \theta) \, \frac{\prod_{j \ne i} f(y_j \mid \theta) \, \pi(\theta)}{\int \prod_{j \ne i} f(y_j \mid \theta) \, \pi(\theta) \, d\theta} \, d\theta
      = \frac{f(y)}{f(y_{-i})}.
Geisser & Eddy (1979) defined the log pseudo marginal likelihood (LPML) as

LPML = \frac{1}{n} \sum_{i=1}^{n} \log CPO_i

and proposed it as a model comparison criterion. Notice that LPML is based on the notion of leave-one-out cross validation. Models with higher values of LPML are preferred under this criterion.
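In practice CPO_i is estimated from MCMC output via the identity CPO_i^{-1} = E_{π(θ|y)}[1/f(y_i | θ)], i.e., a harmonic mean of the observation-wise likelihoods over posterior draws. A sketch for a normal mean model with known variance, where the posterior can be sampled directly (illustrative code; uses this chapter's 1/n-averaged definition of LPML):

```python
import numpy as np

rng = np.random.default_rng(6)
y = rng.normal(loc=2.0, scale=1.0, size=40)  # data from N(theta, 1)

# Posterior of theta under a flat prior is N(ybar, 1/n); draw MCMC-style samples.
M = 20000
theta = rng.normal(y.mean(), 1.0 / np.sqrt(len(y)), size=M)

# CPO_i^{-1} = E_post[ 1 / f(y_i | theta) ], estimated by a posterior-sample average.
lik = np.exp(-0.5 * (y[:, None] - theta[None, :]) ** 2) / np.sqrt(2 * np.pi)
cpo = 1.0 / np.mean(1.0 / lik, axis=1)
lpml = np.mean(np.log(cpo))  # 1/n-averaged definition used in this chapter
```

For this conjugate model the exact CPO_i is the N(ȳ_{−i}, 1 + 1/(n−1)) density at y_i, so the estimator can be checked directly; in non-conjugate settings only the harmonic-mean estimate is available.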
2.4.4 L-Measure
Laud & Ibrahim (1995) introduced the L measure as a model comparison criterion that attempts to balance predictive loss and the variability of the predictions. The L measure is defined as

L_m^2 = \sum_{i=1}^{n} \big[ \nu \, (E(z_i) - y_i)^2 + Var(z_i) \big].

For ν = 1 we get

L_m^2 = \sum_{i=1}^{n} \big[ (E(z_i) - y_i)^2 + Var(z_i) \big],

where z = (z_1, \ldots, z_n)' are new observations to be predicted. In the setting of the normal linear model (2.5, 2.6),

z \mid \beta \sim N(X\beta, \tau_y^{-1}), \qquad p(z \mid y) = \int p(z \mid \beta) \, p(\beta \mid y) \, d\beta.

It follows that

z \mid y \sim N\big( X (X^T \tau_y X + \tau_0)^{-1} (X^T \tau_y y + \tau_0 \beta_0), \; X (X^T \tau_y X + \tau_0)^{-1} X^T + \tau_y^{-1} \big).

Models with a smaller L measure are preferred under this criterion.
2.4.5 Deviance Information Criterion
Spiegelhalter et al. (2002) introduced the Deviance Information Criterion (DIC) as,
DIC = E(D(\theta \mid y)) + p_D = E(D(\theta \mid y)) + \big[ E(D(\theta \mid y)) - D(E(\theta \mid y)) \big],

where D(θ | y) = −2 log f(y | θ) and p_D is the effective number of parameters in the model.
The DIC is available as a goodness of fit measure incorporated in the popular Gibbs
sampling software OpenBUGS (Thomas et al., 2006). In the linear model setting, under
(2.5) and (2.6) we get,
E(D(\beta \mid y)) = -\log|\tau_y| + n \log 2\pi + \mathrm{tr}\big( X^T \tau_y X (X^T \tau_y X + \tau_0)^{-1} \big) + (X\bar\beta - y)^T \tau_y (X\bar\beta - y)

and

D(E(\beta \mid y)) = -\log|\tau_y| + n \log 2\pi + (y - X\bar\beta)^T \tau_y (y - X\bar\beta),

where \bar\beta = (X^T \tau_y X + \tau_0)^{-1} (X^T \tau_y y + \tau_0 \beta_0) denotes the posterior mean of β.
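Because both terms of DIC are posterior expectations, DIC is easily estimated from MCMC draws: average the deviance and subtract the deviance at the posterior mean. A sketch with σ² = 1 known and a flat prior, so the posterior is available in closed form (illustrative code, not OpenBUGS output):

```python
import numpy as np

rng = np.random.default_rng(7)
n, p = 50, 3
X = rng.normal(size=(n, p))
y = X @ np.array([1.0, -1.0, 0.5]) + rng.normal(size=n)

# Posterior of beta with sigma^2 = 1 and a flat prior: N(beta_hat, (X'X)^{-1}).
XtX = X.T @ X
beta_hat = np.linalg.solve(XtX, X.T @ y)
draws = rng.multivariate_normal(beta_hat, np.linalg.inv(XtX), size=50000)

def deviance(b):
    """D(beta) = -2 log f(y | beta) for the N(X beta, I) likelihood."""
    r = y - X @ b
    return n * np.log(2.0 * np.pi) + r @ r

d_bar = np.mean([deviance(b) for b in draws])   # E(D(beta | y))
d_hat = deviance(draws.mean(axis=0))            # D(E(beta | y))
p_d = d_bar - d_hat                             # effective number of parameters
dic = d_bar + p_d
```

With a flat prior, p_D should come out close to the actual parameter count p = 3.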
2.4.6 Median Probability Model
Barbieri & Berger (2004) introduced the concept of the median probability model (MPM), which is a single model in the model space; this is thus not a criterion based method. The MPM utilizes the posterior inclusion probability of a given explanatory variable x_l, defined as

q_l = \sum_{\gamma : \gamma_l = 1} \Pr(\mathcal{M}_\gamma \mid y).

The median probability model is then defined as the model that includes the variables {x_l : q_l > 0.5}; that is, the median probability model contains those variables that have at least 50% posterior inclusion probability over all models. When the goal of model selection is to choose a model for future prediction, Barbieri & Berger (2004) established that, in the setting of normal linear models with an orthogonal design matrix, the MPM is the optimal model under a suitably chosen predictive loss function.
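Given posterior model probabilities, computing the MPM is a simple weighted sum over inclusion indicators. A toy sketch (the posterior probabilities are hypothetical numbers chosen for illustration):

```python
import numpy as np

def median_probability_model(models, post_probs, p):
    """Inclusion probability q_l for each variable; keep those with q_l > 0.5."""
    q = np.zeros(p)
    for gamma, prob in zip(models, post_probs):
        q += prob * np.asarray(gamma, dtype=float)
    return q, [l for l in range(p) if q[l] > 0.5]

# Toy posterior over the 4 models on p = 2 candidate variables (sums to 1).
models = [(0, 0), (1, 0), (0, 1), (1, 1)]
post = [0.1, 0.5, 0.1, 0.3]
q, mpm = median_probability_model(models, post, p=2)
# q = [0.8, 0.4]: variable 0 is in the MPM, variable 1 is not; here the
# MPM coincides with the HPM (1, 0), but in general the two can differ.
```

Note that the MPM aggregates evidence across all models, so it need not equal the single most probable model.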
2.5 Development of Prior Distributions
2.5.1 Bayesian Connection to Penalized Regressions
A penalized regression estimate is typically obtained by minimizing the residual sum of squares plus a penalty term p_λ(β). For the normal linear model, it then follows that, under the prior π(β) ∝ exp(−p_λ(β)), the posterior mode is the corresponding penalized estimate, after suitably adjusting the location and scale parameters of the prior. Indeed, placing the independent normal prior

\pi(\beta) \propto \exp\Big( -\frac{\lambda}{2} \|\beta\|^2 \Big)

on the coefficients makes the log posterior, up to an additive constant, the negative of the ridge regression objective function.
As noted in Tibshirani (1996), lasso estimates can be derived as the Bayes posterior mode under independent double exponential priors on the β_j,

f(\beta_j) = \frac{\lambda}{2} \exp(-\lambda |\beta_j|).
Park & Casella (2008) expressed the Laplace or double exponential distribution as a scale mixture of normals with an exponential mixing density (Andrews & Mallows, 1974),

\frac{a}{2} \exp(-a|z|) = \int_0^\infty \frac{1}{\sqrt{2\pi s}} \exp\Big( -\frac{z^2}{2s} \Big) \, \frac{a^2}{2} \exp\Big( -\frac{a^2 s}{2} \Big) \, ds,

and described the model in the following hierarchical structure:

y \mid \beta, \sigma^2 \sim N(X\beta, \sigma^2 I_n)

\beta \mid \sigma^2, \tau_1^2, \ldots, \tau_p^2 \sim N(0, \sigma^2 D_\tau), \qquad D_\tau = \mathrm{diag}(\tau_1^2, \ldots, \tau_p^2)

\sigma^2, \tau_1^2, \ldots, \tau_p^2 \sim \pi(\sigma^2) \, d\sigma^2 \prod_{j=1}^{p} \frac{\lambda^2}{2} \exp\Big( -\frac{\lambda^2 \tau_j^2}{2} \Big) d\tau_j^2

\pi(\sigma^2) \propto \frac{1}{\sigma^2}, \qquad \sigma^2, \tau_1^2, \ldots, \tau_p^2 > 0.
Other important references in this theme include the Bayesian elastic net (Li & Lin, 2010), the Bayesian adaptive lasso (Leng et al., 2014; Sun et al., 2009), and Bayesian bridge regression (Polson et al., 2014), among others.
2.5.2 Shrinkage Prior
Besides developing Bayesian lasso type priors, researchers have also considered other priors for efficient model selection. George & McCulloch (1993) introduced Stochastic Search Variable Selection (SSVS). They placed independent Bernoulli type priors on each γ_j, j = 1, . . . , p, and collected the sequence γ^{(1)}, . . . , γ^{(m)} after running the Gibbs sampler (Casella & George, 1992) for m iterations, with the hope of exploring the model space easily and efficiently: important models would have higher posterior probability and would appear in this sequence more frequently. In this way they proposed to perform variable selection while avoiding the enumeration of the whole model space. George & McCulloch (1997) noted, however, that when p is large it is infeasible to take m = 2^p, and hence impossible to visit all the models even once.
On the other hand, Dey et al. (2008) proved variable selection consistency of the HPM when a spike and slab prior (Mitchell & Beauchamp, 1988) is placed on the coefficients. This type of consistency is similar to the oracle property discussed by Fan & Li (2001).

Zellner's g-prior (Zellner, 1986) is another popular prior, mainly because of the computational ease it affords for the posterior probabilities of the models; its conjugacy, discussed in Section 2.4.2, makes the computation of the marginal likelihood straightforward. Fernandez et al. (2001) proved posterior model consistency for the g-prior with g = n, and Liang et al. (2008) proved model selection consistency for mixtures of g-priors and hyper-g priors.

For shrinkage purposes, Carvalho et al. (2010) considered placing a Beta(1/2, 1/2) prior, whose density looks like a horseshoe, and hence referred to it as the horseshoe prior; they also justified the performance of this prior theoretically. Recently, Polson & Scott (2012) suggested using joint priors generated by a Lévy process.
2.6 Sampling over Model Space
Since it is impossible to enumerate the whole model space for large p, many researchers have proposed to sample from the model space, or to search it stochastically. Early work in this area was done by Berger & Molina (2005), who employed a stochastic search algorithm with a path-based pairwise prior. They proposed sampling without replacement with probabilities proportional to the posterior probabilities of the models visited in the MCMC run. Casella & Moreno (2006) proposed to search the model space stochastically using intrinsic Bayes factors as estimates of the posterior model probabilities. Posterior probabilities were also used by Hans et al. (2007), who took advantage of parallel computing and named their stochastic search process “Shotgun Stochastic Search (SSS)”. Bottolo & Richardson (2010) devised another stochastic search process, which they called evolutionary stochastic search (ESS). On the other hand, Clyde et al. (2011) discussed the feasibility and advantages of Bayesian adaptive sampling (BAS), a variant of without-replacement sampling of the variables according to their marginal inclusion probabilities (the theory of the MPM is based on these marginal inclusion probabilities), which are estimated adaptively during the search process.
CHAPTER 3
COMPARISON AMONG LPML, DIC, AND HPM
3.1 Introduction
Variable selection remains an active area of research in both Bayesian and frequentist
statistics. There are many discussions available in this area. For example, see O’Hara
& Sillanpaa (2009), Hahn & Carvalho (2015), and references therein for existing Bayesian
variable selection methods, and for frequentist variable selection methods, see Shao (1993),
Fan & Lv (2010b), Tibshirani (2011), and the references therein.
Classical Bayesian model selection criteria depend on marginal likelihoods and, thereby, on the highest posterior model. However, researchers face difficulty when the data and the priors on the parameters do not form a conjugate pair. For instance, with time-to-event data, computation of the marginal likelihood of the models is itself a non-trivial task. Placing noninformative priors on the parameters adds one more level of difficulty, because in such scenarios the Bayes factor often becomes undefined and hence the highest posterior model cannot be computed. A preferred Bayesian criterion is then the Log Pseudo Marginal Likelihood (LPML) of Gelfand et al. (1992), which relies upon a predictive likelihood approach. It is derived from predictive considerations and leads to pseudo Bayes factors for choosing among models. This approach has seen increased popularity due in part to the relative ease with which LPML is stably estimated from MCMC output.
The theory of LPML rests on the idea of the famous cross validation technique. Cross validation is a widely accepted predictive model selection criterion. As the name suggests, cross validation relies on splitting the data into a training set and a test set, building the model using the training set, and examining the accuracy of the fit using the test set. In this process, the model is validated across many possible combinations of training and test sets. The number of training/test combinations depends on the size of the data n and the size of the test set. The most popular cross validation method keeps only one observation in the test (or validation) set, and is referred to as one deleted cross validation. Suppose there are n observations; we fit the model to n − 1 of them and predict the remaining observation using the fitted model. The assessment of the fitted model is done by considering every observation in turn as the test set and taking the average squared error of the fitted and actual observed values.
However, despite the simplicity of one deleted cross validation, the technique has been criticized by researchers. For instance, fitting the model on only part of the available observations violates the sufficiency principle of statistics (see Picard & Cook (1984)). Perhaps a more worrisome fact is the model selection inconsistency of one deleted cross validation (see Shao (1993) and references therein). This chapter revisits this issue and establishes that LPML, which utilizes one deleted cross validation in a Bayesian setting, suffers from the same model selection inconsistency. We extend our study beyond linear models to logistic regression and time to event regression models.
Spiegelhalter et al. (2002) introduced the Deviance Information Criterion (DIC), a Bayesian information criterion. Since its introduction, DIC has received widespread attention due to its straightforward computation from a Markov chain Monte Carlo (MCMC) sample. However, if a model has random effects or latent variables, then the calculation of DIC is not well defined, and Celeux et al. (2006) suggested various modifications. Ghosh et al. (2009) used a modified DIC and LPML to compare joinpoint models for cancer rates over the years. We extend our study of model selection accuracy to the DIC and illustrate that, for model selection, the DIC suffers from a similar inaccuracy problem as the LPML. In contrast, we illustrate that the HPM approach to model selection performs strongly.
Example 3.1. Gunst and Mason data.
We consider the Gunst and Mason data (Table 1, Shao (1993)) to assess the performance of LPML, DIC, and HPM using simulation. There are n = 40 observations and p = 4 explanatory variables in the Gunst and Mason data. Without loss of generality, we treat the intercept as a covariate that is always included in the analysis models. We keep the covariates fixed. The data y are simulated according to

y \sim N(X\beta, \sigma^2 I) \quad (3.1)

with some of the β_j, j = 2, . . . , 5, set to 0 and σ² set to 1. This is repeated 1000 times, i.e., the number of simulations is set to 1000.

In the Bayesian analysis, we place a conjugate flat (essentially noninformative) prior on β. As in Shao (1993), if we know whether each component of β is 0 or not, then the analysis models M_γ can be classified into two categories (β is the parameter vector of the data generating model M and β_γ is the parameter vector of the analysis model M_γ):

• Category I: at least one nonzero component of β is not in β_γ.

• Category II: β_γ contains all nonzero components of β.
For each simulation we fit all possible 2^5 − 1 = 31 models. For the DIC criterion the model with the lowest DIC is chosen, while for the LPML and HPM criteria, the models with the highest LPML and the highest marginal likelihood, respectively, are selected. Table 3.1 reports the sample probabilities of selecting each model based on the 1000 simulations. For one deleted cross validation, the model with the minimum average squared prediction error, as in Shao (1993), is chosen; we report its selection probabilities (CV(1)) mainly for comparison. We denote a model by the subset of {1, . . . , 5} containing the indices of the covariates in the model.
Table 3.1: Probabilities of selecting data generating models for Gunst and Mason data
Model            Category          CV(1)   LPML    DIC     HPM

β = (2, 0, 0, 4, 0)′
(1, 4)           Data-generating   0.501   0.589   0.626   0.999
(1, 2, 4)        II                0.141   0.109   0.098   0.000
(1, 3, 4)        II                0.110   0.111   0.110   0.001
(1, 4, 5)        II                0.138   0.126   0.107   0.000
(1, 2, 3, 4)     II                0.036   0.024   0.021   0.000
(1, 2, 4, 5)     II                0.027   0.017   0.013   0.000
(1, 3, 4, 5)     II                0.034   0.018   0.018   0.000
(1, 2, 3, 4, 5)  II                0.013   0.006   0.007   0.000
(1, 3, 5)        I                 0.002   0.000   0.000   0.000

β = (2, 0, 0, 4, 8)′
(1, 4, 5)        Data-generating   0.650   0.721   0.719   0.996
(1, 2, 4, 5)     II                0.158   0.119   0.131   0.002
(1, 3, 4, 5)     II                0.119   0.110   0.120   0.002
(1, 2, 3, 4, 5)  II                0.073   0.050   0.030   0.000
(1, 2, 5)        I                 0.000   0.000   0.000   0.000

β = (2, 9, 0, 4, 8)′
(1, 4, 5)        I                 0.000   0.000   0.000   0.000
(1, 2, 3, 5)     I                 0.000   0.000   0.000   0.000
(1, 2, 4, 5)     Data-generating   0.796   0.838   0.841   0.999
(1, 3, 4, 5)     I                 0.003   0.000   0.000   0.001
(1, 2, 3, 4, 5)  II                0.201   0.162   0.159   0.000

β = (2, 9, 6, 4, 8)′
(1, 2, 3, 5)     I                 0.000   0.000   0.000   0.000
(1, 2, 4, 5)     I                 0.000   0.000   0.000   0.000
(1, 3, 4, 5)     I                 0.001   0.000   0.000   0.002
(1, 2, 3, 4, 5)  Data-generating   0.999   1.000   1.000   0.998
The following conclusions are evident from Table 3.1:

1. The performances of LPML and DIC are poor and similar to the CV(1) method.

2. HPM performs strongly and consistently in selecting the data generating model.

3. In particular, when the data generating model is sparse, as in the top panel (β = (2, 0, 0, 4, 0)′) of Table 3.1, LPML and DIC tend to select unnecessary covariates and, as a result, cannot distinguish among the category II models.

4. The probabilities of selecting category I models remain very small irrespective of the model selection criterion.
3.2 Inconsistency of LPML
In this section, we present a theorem that justifies the poor performance of LPML for linear models. As before, M is the data-generating model and the M_γ are the analysis models. Furthermore, we assume lim_{n→∞} max_{1≤i≤n} h_{iγ} = 0, where h_{iγ} is the i-th diagonal element of the hat matrix for model M_γ.

Theorem 3.2. Consider the noninformative prior π(β) ∝ 1, and assume that σ² is known and fixed. If M_γ belongs to category II, then

\mathrm{plim}_{n\to\infty} \, LPML(\mathcal{M}_\gamma) = -\sigma^2.

In particular, all models in category II have the same asymptotic value; that is, LPML cannot asymptotically distinguish among the category II models.
Proof. For simplicity, we write X ≡ X_γ and β ≡ β_γ. Let \hat\beta and \hat\beta(i) denote the least squares estimates of β with and without the i-th observation included in the data, respectively. Furthermore, let y_{−i} and X(i) denote the response vector and the regression matrix, respectively, with the i-th row deleted.
Then

\Pr(y \mid \beta) = \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\Big\{ -\frac{1}{2\sigma^2} (y - X\beta)^T (y - X\beta) \Big\},

and the marginal likelihood is

\Pr(y) = \int f(y \mid \beta) \, d\beta

= \frac{1}{(2\pi\sigma^2)^{n/2}} \int \exp\Big\{ -\frac{1}{2\sigma^2} (y - X\beta)^T (y - X\beta) \Big\} d\beta

= \frac{1}{(2\pi\sigma^2)^{n/2}} \int \exp\Big\{ -\frac{1}{2\sigma^2} (y - X\hat\beta + X\hat\beta - X\beta)^T (y - X\hat\beta + X\hat\beta - X\beta) \Big\} d\beta

= \frac{\exp\big\{ -\frac{1}{2\sigma^2} (y - X\hat\beta)^T (y - X\hat\beta) \big\}}{(2\pi\sigma^2)^{n/2}} \int \exp\Big\{ -\frac{1}{2\sigma^2} (\beta - \hat\beta)^T (X^T X) (\beta - \hat\beta) \Big\} d\beta

= \frac{|(X^T X)^{-1}|^{1/2}}{\sqrt{(2\pi\sigma^2)^{\,n-p}}} \exp\Big\{ -\frac{1}{2\sigma^2} (y - X\hat\beta)^T (y - X\hat\beta) \Big\}. \quad (3.2)
Then the predictive density of y_i given the other observations is

\Pr(y_i \mid y_1, \ldots, y_{i-1}, y_{i+1}, \ldots, y_n) = \Pr(y_i \mid y_{-i}) = \frac{f(y_1, \ldots, y_n)}{\Pr(y_{-i})}

= \frac{ |(X^T X)^{-1}|^{1/2} \exp\big\{ -\frac{1}{2\sigma^2} (y - X\hat\beta)^T (y - X\hat\beta) \big\} }{ |(X(i)^T X(i))^{-1}|^{1/2} \exp\big\{ -\frac{1}{2\sigma^2} (y_{-i} - X(i)\hat\beta(i))^T (y_{-i} - X(i)\hat\beta(i)) \big\} } \cdot \frac{1}{\sqrt{2\pi\sigma^2}}

= \frac{|X(i)^T X(i)|^{1/2}}{|X^T X|^{1/2} \sqrt{2\pi\sigma^2}} \exp\Big\{ -\frac{1}{2\sigma^2} (Q_1 - Q_2) \Big\},
36
where
Q1 −Q2 =
(y −Xβ
)T(y −Xβ
)−(y−i −X(i)β(i)
)T(y−i −X(i)β(i)
)=yTy − 2yTXβ + βTXTXβ − yT
−iy−i + 2yT−iX(i)β(i)− β′(i)XT(i)X(i)β(i)
We know that (Seber & Lee, 2003),
$$
\begin{aligned}
X_{(i)}^TX_{(i)} &= X^TX - x_ix_i^T &(3.3)\\
y_{-i}^TX_{(i)} &= y^TX - y_ix_i^T &(3.4)\\
\hat\beta - \hat\beta_{(i)} &= \frac{(X^TX)^{-1}x_ie_i}{1-h_i} &(3.5)\\
e_i &= y_i - x_i^T\hat\beta &(3.6)
\end{aligned}
$$
where x_i is the i-th row of X and h_i = x_i^T(X^TX)^{-1}x_i is the i-th diagonal element of the hat matrix.
So,
$$Q_1 - Q_2 = y_i^2 - 2\left(y^TX\hat\beta - y_{-i}^TX_{(i)}\hat\beta_{(i)}\right) + \left(\hat\beta^TX^TX\hat\beta - \hat\beta_{(i)}^TX_{(i)}^TX_{(i)}\hat\beta_{(i)}\right)$$
Now,
$$
\begin{aligned}
y^TX\hat\beta - y_{-i}^TX_{(i)}\hat\beta_{(i)}
&= y^TX\hat\beta - y_{-i}^TX_{(i)}\left[\hat\beta - \frac{(X^TX)^{-1}x_ie_i}{1-h_i}\right] &&\text{[using (3.5)]}\\
&= \left(y^TX - y_{-i}^TX_{(i)}\right)\hat\beta + \frac{y_{-i}^TX_{(i)}(X^TX)^{-1}x_ie_i}{1-h_i}\\
&= y_ix_i^T\hat\beta + \frac{(y^TX - y_ix_i^T)(X^TX)^{-1}x_ie_i}{1-h_i} &&\text{[using (3.4)]}\\
&= y_ix_i^T\hat\beta + \frac{y^TX(X^TX)^{-1}x_ie_i - y_ix_i^T(X^TX)^{-1}x_ie_i}{1-h_i}\\
&= y_ix_i^T\hat\beta + \frac{\hat\beta^Tx_ie_i - y_ih_ie_i}{1-h_i}\\
&= y_ix_i^T\hat\beta + \frac{\hat\beta^Tx_i(y_i - x_i^T\hat\beta) - y_ih_i(y_i - x_i^T\hat\beta)}{1-h_i} &&\text{[using (3.6)]}\\
&= y_ix_i^T\hat\beta + \frac{y_i\hat\beta^Tx_i - \hat\beta^Tx_ix_i^T\hat\beta - h_iy_i^2 + h_iy_ix_i^T\hat\beta}{1-h_i}\\
&= \frac{1}{1-h_i}\left[2y_ix_i^T\hat\beta - \hat\beta^Tx_ix_i^T\hat\beta - h_iy_i^2\right]
\end{aligned}
$$
Now,
$$
\begin{aligned}
&\hat\beta^TX^TX\hat\beta - \hat\beta_{(i)}^TX_{(i)}^TX_{(i)}\hat\beta_{(i)}\\
&= \hat\beta^TX^TX\hat\beta - \left(\hat\beta - \frac{(X^TX)^{-1}x_ie_i}{1-h_i}\right)^TX_{(i)}^TX_{(i)}\left(\hat\beta - \frac{(X^TX)^{-1}x_ie_i}{1-h_i}\right) \qquad\text{[using (3.5)]}\\
&= \hat\beta^TX^TX\hat\beta - \hat\beta^TX_{(i)}^TX_{(i)}\hat\beta + \frac{2\hat\beta^TX_{(i)}^TX_{(i)}(X^TX)^{-1}x_ie_i}{1-h_i} - \frac{e_i^2\,x_i^T(X^TX)^{-1}X_{(i)}^TX_{(i)}(X^TX)^{-1}x_i}{(1-h_i)^2}\\
&= \hat\beta^Tx_ix_i^T\hat\beta + \frac{2}{1-h_i}\hat\beta^T(X^TX - x_ix_i^T)(X^TX)^{-1}x_ie_i - \frac{e_i^2}{(1-h_i)^2}x_i^T(X^TX)^{-1}(X^TX - x_ix_i^T)(X^TX)^{-1}x_i\\
&= \hat\beta^Tx_ix_i^T\hat\beta + \frac{2}{1-h_i}\left(\hat\beta^Tx_ie_i - h_i\hat\beta^Tx_ie_i\right) - \frac{e_i^2}{(1-h_i)^2}\left(h_i - h_i^2\right)\\
&= \hat\beta^Tx_ix_i^T\hat\beta + 2\hat\beta^Tx_i(y_i - x_i^T\hat\beta) - \frac{h_ie_i^2}{1-h_i}\\
&= \hat\beta^Tx_ix_i^T\hat\beta + 2\hat\beta^Tx_iy_i - 2\hat\beta^Tx_ix_i^T\hat\beta - \frac{h_ie_i^2}{1-h_i}
\end{aligned}
$$
It follows that
$$
\begin{aligned}
Q_1 - Q_2 &= y_i^2 - \frac{2}{1-h_i}\left[2y_ix_i^T\hat\beta - \hat\beta^Tx_ix_i^T\hat\beta - h_iy_i^2\right] + 2y_ix_i^T\hat\beta - \hat\beta^Tx_ix_i^T\hat\beta - \frac{h_ie_i^2}{1-h_i}\\
&= y_i^2\left(1 + \frac{2h_i}{1-h_i}\right) - y_ix_i^T\hat\beta\left(\frac{4}{1-h_i} - 2\right) + \hat\beta^Tx_ix_i^T\hat\beta\left(\frac{2}{1-h_i} - 1\right) - \frac{h_ie_i^2}{1-h_i}\\
&= \frac{1+h_i}{1-h_i}\left(y_i^2 - 2y_ix_i^T\hat\beta + \hat\beta^Tx_ix_i^T\hat\beta\right) - \frac{h_ie_i^2}{1-h_i}\\
&= \frac{1+h_i}{1-h_i}(y_i - x_i^T\hat\beta)^2 - \frac{h_ie_i^2}{1-h_i}
= \frac{1+h_i}{1-h_i}e_i^2 - \frac{h_i}{1-h_i}e_i^2
= \frac{e_i^2}{1-h_i} = \frac{(y_i - x_i^T\hat\beta)^2}{1-h_i}
\end{aligned}
$$
Therefore,
$$\Pr(y_i\mid y_1,\ldots,y_{i-1},y_{i+1},\ldots,y_n) = \frac{|X_{(i)}^TX_{(i)}|^{1/2}}{|X^TX|^{1/2}\sqrt{2\pi\sigma^2}}\exp\left\{-\frac{(y_i - x_i^T\hat\beta)^2}{2\sigma^2(1-h_i)}\right\}$$
Thus, the LPML for model Mγ becomes
$$
\begin{aligned}
\mathrm{LPML}(M_\gamma) &= \frac{1}{n}\sum_{i=1}^n \log \Pr(y_i\mid y_{-i}, M_\gamma)\\
&= -\frac{1}{n}\sum_{i=1}^n\left[\frac{1}{2}\log(2\pi\sigma^2) - \frac{1}{2}\log\frac{|X_{\gamma(i)}^TX_{\gamma(i)}|}{|X_\gamma^TX_\gamma|} + \frac{1}{2\sigma^2}\,\frac{(y_i - x_{i\gamma}^T\hat\beta_\gamma)^2}{1-h_{i\gamma}}\right]\\
&= -\frac{1}{2n}\sum_{i=1}^n\left[\log(2\pi\sigma^2) - \log\frac{|X_{\gamma(i)}^TX_{\gamma(i)}|}{|X_\gamma^TX_\gamma|} + \frac{1}{\sigma^2}(y_i - x_{i\gamma}^T\hat\beta_\gamma)^2(1-h_{i\gamma})^{-1}\right]
\end{aligned}
$$
Now,
$$
\begin{aligned}
\frac{1}{n}\sum_{i=1}^n (y_i - x_{i\gamma}^T\hat\beta_\gamma)^2(1-h_{i\gamma})^{-1}
&= \frac{1}{n}\sum_{i=1}^n (y_i - x_{i\gamma}^T\hat\beta_\gamma)^2\left(1 + h_{i\gamma} + O(h_{i\gamma}^2)\right)\\
&= \frac{1}{n}\sum_{i=1}^n (y_i - x_{i\gamma}^T\hat\beta_\gamma)^2 + \frac{1}{n}\sum_{i=1}^n (y_i - x_{i\gamma}^T\hat\beta_\gamma)^2\left(h_{i\gamma} + O(h_{i\gamma}^2)\right)
\end{aligned}
$$
Now, recall that model Mγ is y = X_γβ_γ + ε, and the projection matrix is P_γ = X_γ(X_γ^TX_γ)^{-1}X_γ^T. Hence,
$$\frac{1}{n}\sum_{i=1}^n (y_i - x_{i\gamma}^T\hat\beta_\gamma)^2 = \frac{1}{n}y^T(I-P_\gamma)y = \frac{1}{n}\left[\varepsilon^T(I-P_\gamma)\varepsilon + \beta_\gamma^TX_\gamma^T(I-P_\gamma)X_\gamma\beta_\gamma + 2\varepsilon^T(I-P_\gamma)X_\gamma\beta_\gamma\right]$$
If model Mγ belongs to category II, the last two terms vanish and
$$\frac{1}{n}\sum_{i=1}^n (y_i - x_{i\gamma}^T\hat\beta_\gamma)^2 = \frac{\varepsilon^T\varepsilon}{n} - \frac{\varepsilon^TP_\gamma\varepsilon}{n}$$
Now,
$$
\begin{aligned}
\Pr\left[\frac{|\varepsilon^TP_\gamma\varepsilon|}{n} > \delta\right] &\le \frac{1}{\delta n}E\left(|\varepsilon^TP_\gamma\varepsilon|\right)\\
&= \frac{1}{\delta n}E\left(\varepsilon^TP_\gamma\varepsilon\right), \quad\text{since } \varepsilon^TP_\gamma\varepsilon \ge 0\\
&= \frac{k_\gamma\sigma^2}{\delta n} \to 0 \text{ as } n\to\infty,
\end{aligned}
$$
where k_γ is the number of covariates in model Mγ. This implies that ε^T P_γ ε / n → 0 in probability, and we know that ε^T ε / n → σ² in probability. Recall the assumption lim_{n→∞} max_{1≤i≤n} h_{iγ} = 0.
Now,
$$\frac{1}{n}\sum_{i=1}^n \left(h_{i\gamma} + O(h_{i\gamma}^2)\right)(y_i - x_{i\gamma}^T\hat\beta_\gamma)^2 = \frac{1}{n}\sum_{i=1}^n O(h_{i\gamma})(y_i - x_{i\gamma}^T\hat\beta_\gamma)^2$$
Again,
$$0 \le \frac{1}{n}\sum_{i=1}^n O(h_{i\gamma})(y_i - x_{i\gamma}^T\hat\beta_\gamma)^2 \le O\!\left(\max_{1\le i\le n} h_{i\gamma}\right)\frac{1}{n}\sum_{i=1}^n (y_i - x_{i\gamma}^T\hat\beta_\gamma)^2 = O\!\left(\max_{1\le i\le n} h_{i\gamma}\right)\left[\frac{\varepsilon^T\varepsilon}{n} - \frac{\varepsilon^TP_\gamma\varepsilon}{n}\right] \xrightarrow{P} 0$$
by the assumption.
Finally,
$$
\begin{aligned}
\log\frac{|X_{\gamma(i)}^TX_{\gamma(i)}|}{|X_\gamma^TX_\gamma|} &= \log\frac{|X^TX - x_ix_i^T|}{|X^TX|} \qquad\text{[using (3.3)]}\\
&= \log\frac{|X^TX|\left(1 - x_i^T(X^TX)^{-1}x_i\right)}{|X^TX|}\\
&= \log(1-h_i) \to 0 \text{ as } n\to\infty
\end{aligned}
$$
Combining the three limits, the bracketed average converges in probability to log(2πσ²) − 0 + 1, so that
$$\operatorname*{plim}_{n\to\infty}\mathrm{LPML}(M_\gamma) = -\frac{1}{2}\log(2\pi\sigma^2) - \frac{1}{2},$$
the same value for every category II model. This completes the proof.
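The algebraic core of the proof, Q1 − Q2 = e_i²/(1 − h_i), is easy to check numerically. The following sketch (Python/NumPy, not part of the dissertation; the data and seed are arbitrary) fits a small linear model with and without one observation and compares the two residual sums of squares.

```python
import numpy as np

# Numerical check of the identity Q1 - Q2 = e_i^2 / (1 - h_i), where Q1 and
# Q2 are the residual sums of squares of the full and leave-one-out fits.
rng = np.random.default_rng(0)
n, p = 30, 3
X = rng.standard_normal((n, p))
y = X @ np.array([2.0, 0.0, 4.0]) + rng.standard_normal(n)

beta = np.linalg.lstsq(X, y, rcond=None)[0]           # beta-hat
H = X @ np.linalg.inv(X.T @ X) @ X.T                  # hat matrix
Q1 = np.sum((y - X @ beta) ** 2)

i = 7
Xi, yi = np.delete(X, i, axis=0), np.delete(y, i)
beta_i = np.linalg.lstsq(Xi, yi, rcond=None)[0]       # beta-hat_(i)
Q2 = np.sum((yi - Xi @ beta_i) ** 2)

e_i, h_i = y[i] - X[i] @ beta, H[i, i]
assert np.isclose(Q1 - Q2, e_i ** 2 / (1.0 - h_i))
```

The same identity underlies the PRESS statistic in the regression literature, which is why the leave-one-out predictive density admits a closed form here.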
3.3 Simulation Study
Theorem 3.2 establishes that the LPML criterion cannot distinguish among category II
models when σ2 is known. In the following we present a series of simulation studies to
investigate the performance of LPML in other settings.
3.3.1 Linear Model in the Presence of Multicollinearity
Our goal is to assess the performance of LPML when multicollinearity is present in the data. We consider the linear model again. We set n = 50 and p = 10; that is, there are 50 observations and 10 covariates x1, . . . , x10. The intercept term is included in the analysis model as in the previous example. The covariates x1, x2, x3 are generated from a multivariate normal distribution such that correlation(xi, xj) = 0.95 for 1 ≤ i < j ≤ 3, while x4, . . . , x10 are generated independently from the standard normal distribution. The covariates are kept fixed throughout the simulations. The response vector is sampled according to equation (3.1), with σ2 set to 1. We place a standard normal prior on β for the Bayesian analysis.
Table 3.2: Probabilities of selecting the data-generating model

  Data-generating model                                           LPML-based selection
  All uncorrelated: 4, 5, 6, 7, 8, 9, 10                          0.943
  All correlated: 1, 2, 3                                         0.573
  All: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10                              0.720
  One correlated and all uncorrelated: 3, 4, 5, 6, 7, 8, 9, 10    0.575
  One correlated and 4 uncorrelated: 3, 4, 5, 6, 7                0.522
We report the sample probabilities of selecting the data-generating models in Table 3.2. We note that, when all the uncorrelated variables are present in the data-generating model, LPML appears to be successful in recovering the data-generating model. However, LPML performs poorly in all the other cases.
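For readers who wish to replicate the flavor of this study, the sketch below (hypothetical Python/NumPy code, not the dissertation's implementation; the seed and β are illustrative) generates the correlated design just described and scores the data-generating model and the full model — both category II — with the flat-prior, known-σ² closed form of the leave-one-out log predictive derived in Section 3.2, rather than the standard normal prior used in the actual study.

```python
import numpy as np

# Illustrative version of the Section 3.3.1 design: x1-x3 pairwise
# correlated at 0.95, x4-x10 independent standard normal, n = 50.
rng = np.random.default_rng(1)
n = 50
R = np.full((3, 3), 0.95); np.fill_diagonal(R, 1.0)
Xcorr = rng.multivariate_normal(np.zeros(3), R, size=n)
Xind = rng.standard_normal((n, 7))
X = np.hstack([Xcorr, Xind])
beta_true = np.array([0, 0, 1, 1, 1, 1, 1, 0, 0, 0], dtype=float)  # assumed beta
y = X @ beta_true + rng.standard_normal(n)          # sigma^2 = 1

def lpml_flat(Xg, y, sigma2=1.0):
    """(1/n) sum_i log p(y_i | y_-i) under pi(beta) ∝ 1, sigma^2 known."""
    H = Xg @ np.linalg.inv(Xg.T @ Xg) @ Xg.T
    h = np.diag(H)
    e = y - H @ y                                   # full-fit residuals
    # |X(i)'X(i)| / |X'X| = 1 - h_i, from identity (3.3)
    return np.mean(0.5 * np.log(1 - h) - 0.5 * np.log(2 * np.pi * sigma2)
                   - e ** 2 / (2 * sigma2 * (1 - h)))

m_true = lpml_flat(X[:, 2:7], y)   # data-generating model {3,4,5,6,7}
m_full = lpml_flat(X, y)           # full model, also category II
```

Theorem 3.2 says both values converge to the same constant, so the two scores give LPML little power to separate these models.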
3.3.2 A Non-conjugate Setting: Logistic Regression
Logistic regression is a widely used binary regression model and an example of a nonlinear model. No conjugate prior exists for the logistic model, so we must rely on MCMC for sampling from the posterior densities. We consider the following regression model,
$$y_i \sim \mathrm{Bernoulli}(\pi_i) \qquad (3.7)$$
where
$$\pi_i = \frac{\exp(x_{i\gamma}^T\beta_\gamma)}{1 + \exp(x_{i\gamma}^T\beta_\gamma)}$$
We set n = 100 and p = 5. We generate x1, . . . , x5 from the standard normal distribution and keep them fixed. The response vector is generated according to equation (3.7). This is a non-conjugate Bayesian model, and we use Markov chain sampling for the Bayesian analysis via the MCMClogit function of the R package MCMCpack (Martin et al., 2011). We use Laplace approximations to compute the marginal likelihoods of the models; this is easily done by setting marginal.likelihood = "Laplace".
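To convey what the Laplace route computes, here is a self-contained sketch (Python, not the R/MCMCpack code used in the study; the seed and β are illustrative) that generates data from (3.7) and Laplace-approximates the log marginal likelihood of a candidate model at the posterior mode under independent N(0, 1) priors.

```python
import numpy as np

# Generate data from the logistic model (3.7) with an illustrative beta.
rng = np.random.default_rng(2)
n = 100
X = rng.standard_normal((n, 5))
beta_true = np.array([1.0, 0.0, 0.0, 1.0, 0.0])
p_i = 1.0 / (1.0 + np.exp(-(X @ beta_true)))
y = rng.binomial(1, p_i)

def log_marginal_laplace(Xg, y):
    """Laplace approximation to log m(y), logistic likelihood, N(0,1) priors."""
    k = Xg.shape[1]
    b = np.zeros(k)
    for _ in range(50):                        # Newton iterations to the MAP
        mu = 1.0 / (1.0 + np.exp(-(Xg @ b)))
        grad = Xg.T @ (y - mu) - b             # gradient of log posterior
        W = mu * (1.0 - mu)
        Hess = -(Xg.T * W) @ Xg - np.eye(k)    # Hessian (negative definite)
        b = b - np.linalg.solve(Hess, grad)
    mu = 1.0 / (1.0 + np.exp(-(Xg @ b)))
    loglik = np.sum(y * np.log(mu) + (1 - y) * np.log1p(-mu))
    logprior = -0.5 * k * np.log(2 * np.pi) - 0.5 * b @ b
    _, logdet = np.linalg.slogdet(-Hess)
    return loglik + logprior + 0.5 * k * np.log(2 * np.pi) - 0.5 * logdet

full = log_marginal_laplace(X, y)              # full model {1,...,5}
gen = log_marginal_laplace(X[:, [0, 3]], y)    # data-generating model {1, 4}
```

Comparing such approximate log marginal likelihoods across models is exactly what HPM selection requires in this non-conjugate setting.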
We have run 100 simulations. Table 3.3 shows that the poor performance of LPML and DIC continues beyond linear models to logistic models, whereas HPM remains consistent in selecting the data-generating models. In particular, all of the methods show performance similar to that observed for the linear models.
3.3.3 Nodal Data
Chib (1995) and many others have considered the nodal data for Bayesian variable selection. There were 53 patients. Following is an overview of the variables:
• y: 1 or 0 according as the cancer has spread to the surrounding lymph nodes or not.
• x1: age of the patient.
• x2: level of serum acid phosphate.
• x3: 1 or 0 according as the result of an X-ray examination is positive or negative.
• x4: 1 or 0 according as the size of the tumor is large or small.
• x5: 1 or 0 according as the pathological grade of the tumor is more serious or less
serious.
Chib (1995) reported the model Mγ = (x2, x3, x4) as the highest posterior model (HPM) for the observed data. We note that Chib (1995) employed a probit regression model for the nodal data. A probit model is given by
$$y_i \sim \mathrm{Bernoulli}(\pi_i) \qquad (3.8)$$
where π_i = Φ(x_{iγ}^T β_γ) and Φ is the cumulative distribution function of the standard normal distribution. We consider a simulation study where the response y is generated according to (3.8) after keeping the covariates fixed and setting different values of β as given in Table 3.4. We adopt a normal distribution with mean 1 and variance 1 as the prior on β for the Bayesian analysis. The intercept term is included in all the analysis models as usual. We run a total of 100 simulations and use the MCMCprobit function for the MCMC sampling.
The important feature of this experiment is that LPML and DIC fail to produce convincing results even when the data-generating model is the full model.
3.3.4 Melanoma Data
Survival models do not typically provide a conditionally conjugate setup for Markov chain sampling. Due to this complexity, LPML-based model selection has been used extensively for model comparison in the setting of complex survival models. We consider the Bounded Cumulative Hazard (BCH) cure rate model for survival data, which models the survival function S(t) as
$$S(t) = \exp\left(-\theta G(t)\right)$$
where G(t) is a proper cdf (with pdf g(t)) with G(0) = 0 and lim_{t→∞} G(t) = 1 (see Tsodikov et al. (2003), Chen et al. (1999)). The cure fraction is given by
$$c = \lim_{t\to\infty} S(t) = \exp\left(-\theta \lim_{t\to\infty} G(t)\right) = \exp(-\theta)$$
Suppose an individual has N latent factors (carcinogenic cells) with activation times T_1, . . . , T_N, which are assumed to be i.i.d. with common survival function S_0(t). If N = 0 (no carcinogenic cells), the subject is cured. The first activation scheme assumes that the time to failure of the subject is determined by the first activation T_{(1)} = min(T_1, . . . , T_N), and we have
$$S(t) = \Pr(N = 0) + \sum_{n=1}^{\infty} \Pr(N = n)\,S_0(t)^n$$
When N ∼ Poisson(θ), this gives the BCH model S(t) = exp(−θ(1 − S_0(t))), that is, G(t) = 1 − S_0(t). We consider a regression model where the cure rate parameter θ depends on the covariates through the relationship θ = exp(Xβ). We assume a Weibull density for g(t),
$$g(t) = \eta t^{\eta-1}\exp\left(\lambda - t^\eta \exp(\lambda)\right)$$
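The pieces above assemble directly; a minimal sketch (Python, illustrative parameter values, not the E1684 analysis) evaluates the BCH survival curve with the Weibull choice of g(t) and confirms the cure fraction exp(−θ).

```python
import numpy as np

# BCH cure rate model: S(t) = exp(-theta * G(t)) with G(t) the Weibull cdf
# G(t) = 1 - exp(-t^eta * exp(lam)) implied by the density g(t) above.
def bch_survival(t, theta, eta, lam):
    G = 1.0 - np.exp(-(t ** eta) * np.exp(lam))
    return np.exp(-theta * G)

theta, eta, lam = 1.5, 1.2, 0.3     # illustrative values only
t = np.linspace(0.0, 50.0, 501)
S = bch_survival(t, theta, eta, lam)

cure_fraction = np.exp(-theta)      # c = lim_{t->inf} S(t) = exp(-theta)
assert S[0] == 1.0                  # S(0) = 1 since G(0) = 0
assert np.isclose(S[-1], cure_fraction, atol=1e-6)
```

A plateau of the survival curve at exp(−θ) rather than at zero is the defining feature of a cure rate model.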
We consider data from a phase III melanoma clinical trial conducted by the Eastern
Cooperative Oncology Group (ECOG) (Chen et al. (1999)). The study, denoted as E1684,
was a two-arm clinical trial involving patients randomized to one of two treatment arms:
high-dose interferon (IFN) or observation. There are four covariates – treatment, age, gender
and performance status (binary), and 255 observations.
Several authors (Chen et al. (1999), Chen et al. (2002), Cooner et al. (2007)) have proposed complex models for these data and used LPML for model validation. Our goal is to examine the performance of LPML as a model validation criterion in this setting.
For the data-generating model, as usual, the covariates are fixed and β is set to different values as shown in Table 3.5. The censored times are generated according to the setup described above. To carry out the analysis, we adopt the following priors: η ∼ Gamma(shape = 1, rate = 0.1) and λ, βj ∼ N(0, 10000). This is a complex non-conjugate model. The number of simulations is 100.
We present the results in Table 3.5, which clearly support the previous conclusions. We estimated the marginal likelihoods using Laplace approximations. We find that HPM was 100% successful in detecting the data-generating model in this simulation study.
3.4 Conclusion
LPML is extremely popular, in particular, in censored model. DIC is readily available
in OpenBUGS. The performance of the one-deleted LPML criterion based model selection
and DIC based model selection are questioned in this work. Highest posterior model (HPM)
criterion or marginal likelihood based model selection is found to be superior in our and
many other recent studies. There are many recent and ongoing advances on computation of
marginal likelihood and intelligent searches over the model space.
Table 3.3: Probabilities of selecting the data-generating models for the logistic regression simulation example

  β = (1, 0, 0, 1, 0)
    Model            Category           LPML   DIC    HPM
    1                I                  0.00   0.00   0.01
    1, 4             Data-generating    0.67   0.64   0.88
    1, 2, 4          II                 0.05   0.08   0.03
    1, 3, 4          II                 0.13   0.13   0.03
    1, 4, 5          II                 0.09   0.08   0.02
    3, 4, 5          II                 0.00   0.00   0.01
    1, 2, 3, 4       II                 0.01   0.01   0.00
    1, 2, 4, 5       II                 0.03   0.03   0.01
    1, 3, 4, 5       II                 0.02   0.03   0.01

  β = (1, 0, 0, 1, 1)
    1, 4             I                  0.02   0.02   0.02
    1, 5             I                  0.01   0.00   0.01
    4, 5             I                  0.01   0.01   0.02
    1, 2, 4          I                  0.01   0.01   0.00
    1, 3, 5          I                  0.00   0.01   0.00
    1, 4, 5          Data-generating    0.62   0.60   0.86
    3, 4, 5          I                  0.01   0.01   0.00
    1, 2, 4, 5       II                 0.17   0.18   0.04
    1, 3, 4, 5       II                 0.12   0.12   0.04
    1, 2, 3, 4, 5    II                 0.03   0.04   0.01

  β = (1, 1, 0, 1, 1)
    1, 2             I                  0.00   0.00   0.01
    1, 4             I                  0.00   0.00   0.01
    1, 2, 4          I                  0.02   0.01   0.04
    1, 4, 5          I                  0.00   0.00   0.02
    2, 4, 5          I                  0.00   0.00   0.02
    1, 2, 4, 5       Data-generating    0.82   0.81   0.82
    1, 2, 3, 4, 5    II                 0.16   0.18   0.04

  β = (1, 1, -1, 1, 1)
    1, 2, 3, 4       I                  0.02   0.02   0.03
    2, 3, 4, 5       I                  0.03   0.02   0.04
    1, 2, 3, 4, 5    Data-generating    0.95   0.96   0.93
Table 3.4: Probabilities of selecting the data-generating models for the nodal data

  β                         Data-generating model    LPML   DIC    HPM
  (-2, 0, 2, 0, 2, 0)       2, 4                     0.44   0.34   0.71
  (-2, 0, 2, 0, 2, 2)       2, 4, 5                  0.47   0.53   0.85
  (-2, 0, 2, 2, 2, 2)       2, 3, 4, 5               0.58   0.46   0.91
  (-2, -2, 2, 2, 2, 2)      1, 2, 3, 4, 5            0.57   0.48   0.86
Table 3.5: Probabilities of selecting the data-generating models for the melanoma data

  β = (0, -1, 0, -1)
    Model          Category           LPML   DIC    HPM
    2, 4           Data-generating    0.72   0.74   1.00
    1, 2, 4        II                 0.13   0.12   0.00
    2, 3, 4        II                 0.13   0.12   0.00
    1, 2, 3, 4     II                 0.02   0.02   0.00

  β = (1, -1, 0, 1)
    1, 2, 4        Data-generating    0.76   0.77   1.00
    1, 2, 3, 4     II                 0.24   0.23   0.00

  β = (1, -1, 1, -1)
    1, 2, 3, 4     Data-generating    1.00   1.00   1.00
CHAPTER 4
MEDIAN PROBABILITY MODEL
4.1 Introduction
In this section we revisit the properties of MPM.
Theorem 4.1. (Barbieri & Berger, 2004) In the setting of normal linear model with an
orthogonal design matrix, the MPM is the optimal model under a predictive loss function.
Moreover, Barbieri & Berger (2004) provided a condition on the prior probabilities of the models under which the MPM and the HPM coincide; one such scenario is considered in Dey et al. (2008). The MPM depends only on the marginal inclusion probability of each predictor xj, which is often easier to estimate, for example from Markov chain sampling, than the joint inclusion probabilities on which other criteria depend.
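The rule is simple to apply to Markov chain output: estimate each marginal inclusion probability by the proportion of draws with γ_j = 1, and include x_j when that proportion is at least 1/2. The sketch below (Python; the draws are synthetic, purely to illustrate the computation) implements it.

```python
import numpy as np

# Synthetic inclusion draws standing in for MCMC output over gamma.
rng = np.random.default_rng(3)
incl_prob_true = np.array([0.9, 0.1, 0.7, 0.4, 0.55])
gamma_draws = rng.binomial(1, incl_prob_true, size=(5000, 5))

marginal_incl = gamma_draws.mean(axis=0)       # estimated Pr(gamma_j = 1 | y)
mpm = (marginal_incl >= 0.5).astype(int)       # median probability model
```

With these synthetic probabilities the MPM includes variables 1, 3, and 5, the ones whose marginal inclusion probabilities exceed one half.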
4.2 Comparison of MPM and HPM
Hahn & Carvalho (2015), however, noted that theoretical results on the performance of the median probability model under a non-orthogonal design matrix have not been established. In the following, we present a simulation study comparing MPM and HPM in the presence of multicollinearity.
We consider a linear model. We generate ten covariates from a normal distribution such that the first three are pairwise correlated with correlation around 0.95 and the rest are uncorrelated. We set n = 50. The response is generated according to (3.1) with β = (0, 0, 1, 1, 1, 1, 1, 0, 0, 0)′. Thus the data-generating model is Mγ = (3, 4, 5, 6, 7), where γ = (0, 0, 1, 1, 1, 1, 1, 0, 0, 0)′.

Table 4.1: Comparison between MPM and HPM for linear models in the presence of collinearity. Entries are the number of times the data-generating model is recovered over 1000 simulations.

  Prior                 Model   Result
  independent normal    MPM     27
                        HPM     603
  g prior               MPM     457
                        HPM     513
We consider an independent normal prior and the g prior with g = n for carrying out the Bayesian analyses. The results reported in Table 4.1 are based on 1000 replicated data simulations. The posterior probabilities of the models are computed using (2.7) and (2.9), respectively. We summarize the findings below.
• When the independent normal prior is placed on the parameters, MPM selects the data-generating model only 2.7% of the time, whereas HPM recovers it 60.3% of the time.
• When we use the g prior, MPM selects the data-generating model 45.7% of the time, while HPM recovers it 51.3% of the time, out of 1000 simulations.
This study clearly indicates that HPM outperforms MPM when collinearity is present in the data.
CHAPTER 5
SIGNIFICANCE OF HIGHEST POSTERIOR MODEL
5.1 Introduction
Empirically, as in the previous examples in Chapter 3 and in Chapter 4, highest posterior
model tends to select the data generating model more frequently.
The purpose of hypothesis testing is to evaluate the evidence in favor of a scientific theory, and Bayes factors offer a way of including other information when evaluating the evidence in favor of a null hypothesis. As Raftery (1999) described, "The hypothesis testing procedure defined by choosing the model with the higher posterior probability minimizes the total error rate, that is, the sum of Type I and Type II error rates. Note that frequentist statisticians sometimes recommend reducing the significance level in tests when the sample size is large; the Bayes factor does this automatically." Therefore, as Kass & Raftery (1995) pointed out, the essential strength of Bayes factors, and hence of highest posterior models, is their solid logical foundation. In particular, inference can be drawn even when competing models are non-nested. Most importantly, model uncertainty can be captured.
Model uncertainty can be examined by exploring the model space. George & McCulloch (1993) observed that even with a moderate number of Markov chain iterations, the non-visited models constitute merely a fraction of the total posterior probability of the model space. García-Donato & Martínez-Beneito (2013) explored this in detail: they conducted an extensive and time-consuming study and showed that stochastic search over the model space is often useful in recovering a large portion of the posterior probability of the model space in the presence of a finite number of parameters.
Recently, Hahn & Carvalho (2015) developed a method for variable selection in high-dimensional models which utilizes a posterior summary of the model space. However, their method includes a tuning parameter similar to that of the lasso, and hence suffers from the same problem of selecting that tuning parameter.
5.2 Consistency
There is an extensive literature on Bayesian model selection consistency for linear models (Casella et al., 2009). The notion of model selection consistency is defined with respect to the posterior probabilities of the models. If Mγ∗ is the true model, then a selection procedure is consistent if the posterior probability of Mγ∗ tends to 1 and that of every other model tends to 0 as the sample size increases; that is,
$$\operatorname*{plim}_{n\to\infty} P(M_{\gamma^*}\mid y) = 1 \qquad (5.1)$$
$$\operatorname*{plim}_{n\to\infty} P(M_{\gamma}\mid y) = 0 \ \text{ for any } \gamma \ne \gamma^*, \qquad (5.2)$$
where the probability limit is taken with respect to the true sampling distribution (3.1). For the g prior, Fernandez et al. (2001) proved (5.1) and (5.2) for g = n and other choices of g.
CHAPTER 6
VARIABLE SELECTION
6.1 Introduction
Despite of its straightforwardness, the problem of variable selections demands a detail
and careful treatment because of many issues which includes the fact that the number of
models 2p gets easily extremely large as p increases (Garcıa-Donato & Martınez-Beneito,
2013). The most difficult barrier of variable selection lies its infeasibility of visiting all the
models. With increasing p, enumerating the whole model space tends to an impossible
and infeasible assignment, even if we use ultra modern machineries. To give an essence of
difficulty of tackling the problem for large p we refer to Garcıa-Donato & Martınez-Beneito
(2013) where the authors noted that merely a binary representation of a model space with
p = 40 would occupy 5 terabytes of memory.
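The arithmetic behind this memory figure is easy to verify: there are 2^40 models, and each can be stored as a 40-bit (5-byte) inclusion vector.

```python
# Memory needed to store a binary representation of every model when p = 40.
p = 40
n_models = 2 ** p                    # about 1.1 trillion models
bytes_needed = n_models * (p // 8)   # 5 bytes (40 bits) per model
terabytes = bytes_needed / 2 ** 40   # binary terabytes
assert terabytes == 5.0
```

Doubling p to 80 would multiply this by roughly 2^40 again, which is why enumeration is hopeless and stochastic search is needed.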
6.2 Requirement of Sampling or Stochastic Search
Early work on exploring the model space includes Stochastic Search Variable Selection (SSVS) (George & McCulloch, 1993). George & McCulloch (1993) placed spike-and-slab (Ishwaran & Rao, 2005) type priors on the parameters and ran a Gibbs sampler on the model space, with the hope that the MCMC sampler would visit the models having higher posterior probability more frequently, while models with lower posterior probability would be visited with negligible frequency or not at all. After a complete MCMC run, one can identify good models from the frequencies with which the sampler visited them. This is called the SSVS method (George & McCulloch, 1993). Note that a sampler run for fewer iterations than the total number of models cannot visit every model in the model space, and the non-visited models are hoped to accumulate a negligible amount of the posterior model probability. Therefore the goal is to find at least some of the more probable models (Hahn & Carvalho, 2015).
Other search processes developed in the literature include the stochastic search of Berger & Molina (2005) (SSBM), the stochastic search of Casella & Moreno (2006), Shotgun Stochastic Search (Hans et al., 2007), Evolutionary Stochastic Search (Bottolo & Richardson, 2010), and Particle Stochastic Search (Shi & Dunson, 2011). More recently, Clyde et al. (2011) developed Bayesian adaptive sampling (BAS), a variant of without-replacement sampling according to adaptively updated marginal inclusion probabilities of the variables.
6.3 Limitation of Bayesian Lasso and its Extensions
On the other hand, applying shrinkage priors to the coefficients avoids these computational burdens altogether. For instance, the Bayesian lasso of Park & Casella (2008) uses a scale mixture of normal distributions (Andrews & Mallows, 1974) to obtain posterior estimates from the MCMC sampler, with the aim of achieving variable selection as its frequentist counterpart does. However, these procedures by themselves only provide estimates and do not perform variable selection; post-processing of the Bayesian output is needed to decide on the inclusion or exclusion of variables. Given these mechanisms, the best approach is probably to examine a posterior summary (for example, the highest posterior model) after placing the shrinkage priors.
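The scale-mixture representation behind the Bayesian lasso can be sketched directly: mixing a normal variance over an exponential distribution yields a Laplace (double-exponential) marginal (Andrews & Mallows, 1974; Park & Casella, 2008). The Monte Carlo check below (Python, with an illustrative λ) is not from the dissertation.

```python
import numpy as np

# beta | tau^2 ~ N(0, tau^2), tau^2 ~ Exponential(rate = lambda^2 / 2)
# marginally gives beta ~ Laplace(0, 1/lambda).
rng = np.random.default_rng(4)
lam = 2.0
tau2 = rng.exponential(scale=2.0 / lam ** 2, size=200_000)
beta = rng.normal(0.0, np.sqrt(tau2))

# Laplace(0, 1/lambda) has E|beta| = 1/lambda and Var(beta) = 2/lambda^2.
assert abs(np.mean(np.abs(beta)) - 1.0 / lam) < 0.01
assert abs(np.var(beta) - 2.0 / lam ** 2) < 0.02
```

This is the same data-augmentation device that makes the Bayesian lasso Gibbs sampler conditionally conjugate.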
6.4 Maximization in the Model Space
We revisit the definition of HPM here.

Definition 6.1. Highest Posterior Model. The highest posterior model (HPM) is the model having the highest posterior probability among all models in the model space; that is,
$$\mathrm{HPM} = \arg\max_{\gamma\in\mathcal{M}} \Pr(M_\gamma\mid y)$$

According to Definition 6.1, the HPM is merely the model with the highest posterior probability among the 2^p models that make up the model space. Therefore, one can find the HPM by optimizing over the model space with the posterior probability of a model as the objective function, and we can thus draw on the vast optimization literature.
However, in the variable selection setting, a model is represented by Mγ with γ ∈ {0, 1}^p, so the representation of each model is binary. If we want to maximize an objective function over the models, the solution must belong to the set of binary representations of the models; in this sense the problem resembles the integer programs of the simplex-method literature. The model space can thus be represented as the set of all 2^p possible p-dimensional binary vectors γ ∈ {0, 1}^p. This unique structure of the model space severely limits the choice of optimization methods.
6.5 Simulated Annealing
Because of the infeasibility of enumerating the whole model space and because of the features of the maximization problem, we propose to conduct the maximization stochastically. Simulated annealing (SA) is a widely known stochastic optimization routine, and we therefore explore its feasibility and tenacity for variable selection.
In this section we provide an introduction to the SA algorithm; for details see Bertsimas & Tsitsiklis (1993). Suppose there exists a finite set S and a real-valued function J on S; J is the objective function we want to minimize over S. Let S∗ ⊂ S be the set of global minima of J, assumed to be a proper subset of S. For each i ∈ S, there exists a set S(i) ⊂ S − {i}, called the set of neighbors of i. In addition, for every i there exists a collection of positive coefficients q_{ij}, j ∈ S(i), such that Σ_{j∈S(i)} q_{ij} = 1; the q_{ij} form a transition matrix whose elements give the probabilities of proposing a move from i to j. It is assumed that j ∈ S(i) if and only if i ∈ S(j). We define a nonincreasing function T : N → (0, ∞), called the cooling schedule, where N is the set of positive integers and T(t) is the temperature at time t.
Let x(t) be a discrete-time inhomogeneous Markov chain. The search process starts at an initial state x(0) ∈ S. The steps of the algorithm are as follows.
1. Fix i = current state of x(t).
2. Choose a neighbor j of i at random according to probability q_{ij}.
3. Once j is chosen, the next state x(t + 1) is determined as follows:
   - If J(j) ≤ J(i), then x(t + 1) = j.
   - If J(j) > J(i), then x(t + 1) = j with probability exp[−(J(j) − J(i))/T(t)], and x(t + 1) = i otherwise.
   - If j ≠ i and j ∉ S(i), then Pr[x(t + 1) = j | x(t) = i] = 0.
4. Repeat the above steps until convergence.
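The four steps above can be sketched on a toy model space (Python; the objective, proposal, and cooling choices are simplifications for illustration — uniform single-bit-flip proposals rather than the posterior-weighted q_{ij} introduced later in Section 6.6).

```python
import math
import random

# Toy problem: gamma in {0,1}^p, J counts mismatches with a known target,
# so the global minimum J = 0 is attained exactly at the target vector.
random.seed(5)
p = 12
target = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1]
J = lambda g: sum(a != b for a, b in zip(g, target))

x = [0] * p
best = x[:]
for t in range(1, 5001):
    T = 1.0 / math.log(t + 1)          # logarithmic cooling schedule
    j = x[:]
    k = random.randrange(p)
    j[k] = 1 - j[k]                    # step 2: propose a neighbor (bit flip)
    dJ = J(j) - J(x)
    if dJ <= 0 or random.random() < math.exp(-dJ / T):
        x = j                          # step 3: accept downhill, or uphill w.p.
    if J(x) < J(best):
        best = x[:]                    # track the best model visited
assert J(best) == 0                    # the target is found
```

Early on, the high temperature lets the chain accept uphill moves and escape local modes; as T(t) decreases, the chain concentrates near the global minimum.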
According to this algorithm, x(t) converges to the optimal set S∗. Mathematically, for k = 0, 1, 2, . . . and all j ∈ S,
$$\lim_{n\to\infty}\Pr\left(x(n+k)\in S^*\mid x(k)=j\right) = \lim_{n\to\infty}\Pr\left(J\big(x(n+k)\big)=J^*\mid x(k)=j\right) = 1 \qquad (6.1)$$
where J∗ = min_{j∈S} J(j) and S∗ = {i : i ∈ S, J(i) = J∗}. Note that the acceptance probability function defined in the third step is similar to the one in the Metropolis–Hastings sampler (Hastings, 1970); in this spirit, the algorithm is a stochastic search.
6.6 Our Setup
Now let us frame the variable selection problem in terms of the elements of SA. The main SA routine is formulated for minimization, but it converts easily to maximization. Let x(t) = γ(t), where the j-th element of γ is 1 or 0 according as the j-th covariate is present or absent in the model, and set J equal to the posterior probability of the model Mγ. We want to maximize the posterior probabilities over the model space by applying the simulated annealing algorithm. At the end of a run of this algorithm we expect to obtain the maximum posterior probability and hence the corresponding highest posterior model, i.e.,
$$\hat\gamma = \arg\max_{\gamma} \Pr(M_\gamma\mid y)$$
There are a couple of issues to deal with before we actually perform the maximization:
• It is crucial to set a good cooling schedule; an appropriately chosen schedule accelerates convergence, whereas when T is very small the time it takes for the Markov chain x(t) to reach equilibrium can be excessive. The main role of the cooling schedule is that, early in the search, it helps the algorithm escape local modes, and once the search is actually in the neighborhood of the global optimum, the decreasing temperature focuses the search in that region, thereby locating the optimum. A number of functional forms for the cooling schedule have been suggested in the literature. We set the temperature at time t to
$$T(t) = \frac{p\left(J(j) - J(i)\right)}{\log(t+1)}$$
• Let the transition matrix be denoted by Q. The (i, j)-th element of Q is taken as
$$q_{ij} = \frac{\text{posterior probability of the } j\text{-th model}}{\text{sum of the posterior probabilities of the neighbors of the } i\text{-th model}}$$
where the j-th model belongs to the neighborhood of the i-th model.
• Next we define the neighborhood of a model. Note that we could specify the whole model space as the neighborhood of a model, but that would be tantamount to enumerating all the models, which we want to avoid. On the other hand, if we specify very few models in the neighborhood, a large number of steps is likely to be required to reach the region of high posterior probability. This forces us to balance the number of models in the neighborhood region.
For any given model Mγ we define the collection {Mγ, Mγ′, Mγ′′} as the neighborhood, where
  1. A deletion or an addition: γ′ is such that |γ′ − γ| = 1; that is, the model Mγ′ can be obtained from model Mγ by either adding or deleting one predictor.
  2. A swap: γ′′ satisfies γ′′ᵀ1 = γᵀ1 and |γ′′ − γ| = 2; that is, model Mγ′′ can be obtained from model Mγ by swapping one included predictor for one excluded predictor.
For example, when p = 6, suppose that at time t the model Mγ = {2, 3, 4} is selected. Then the neighborhood is given by
  Mγ = {2, 3, 4},
  Mγ′ ∈ { {1, 2, 3, 4}, {3, 4}, {2, 4}, {2, 3}, {2, 3, 4, 5}, {2, 3, 4, 6} },
  Mγ′′ ∈ { {1, 2, 3}, {2, 3, 5}, {2, 3, 6}, {1, 2, 4}, {2, 4, 5}, {2, 4, 6}, {1, 3, 4}, {3, 4, 5}, {3, 4, 6} }.
Note that our selection provides the advantage of reaching a different neighborhood region at every step and thus eliminates the possibility of keeping old models in the search region, as happens in Berger & Molina (2005) and Hans et al. (2007). In this way our approach is different in that the search procedure requires neither more than one processor nor a complicated and long Markov chain to converge. To our knowledge, no such search process has been developed before. Also notice that, in the example above, the number of models in the neighborhood is 16, which is significantly smaller than the total number of models, 2^6 = 64.
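The neighborhood is straightforward to enumerate; the hypothetical helper below (Python, not the dissertation's code) reproduces the count of 16 for the p = 6 example.

```python
# Neighborhood of Section 6.6: the current model, every one-variable
# addition or deletion, and every swap of one included for one excluded
# variable.  Models are 0/1 inclusion tuples.
def neighborhood(gamma):
    p = len(gamma)
    nbd = [tuple(gamma)]
    for k in range(p):                          # additions and deletions
        g = list(gamma); g[k] = 1 - g[k]
        nbd.append(tuple(g))
    ins = [k for k in range(p) if gamma[k] == 1]
    outs = [k for k in range(p) if gamma[k] == 0]
    for a in ins:                               # swaps keep the model size fixed
        for b in outs:
            g = list(gamma); g[a], g[b] = 0, 1
            nbd.append(tuple(g))
    return nbd

# p = 6, current model {2, 3, 4} (1-indexed), as in the example above:
gamma = (0, 1, 1, 1, 0, 0)
nbd = neighborhood(gamma)
assert len(nbd) == 1 + 6 + 3 * 3                # 16 models, versus 2^6 = 64
```

In general the neighborhood has size 1 + p + k(p − k) for a model with k included variables, which grows only polynomially in p.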
• The widely accepted acceptance probability function for the SA algorithm is given by
$$a_{T(t)}(j\mid i) = \min\left[1,\ \exp\left\{-\frac{J(j)-J(i)}{T(t)}\right\}\right] \qquad (6.2)$$
This is known as the Gibbs acceptance function. As we show later, use of this function meets the criteria for convergence.
Definition 6.2. We name the above setup of SA as SA-HPM.
6.7 Convergence
Cruz & Dorea (1998) provided simple, easily verified conditions for the convergence of the simulated annealing algorithm. We discuss their conditions and compare our setup with that of Cruz & Dorea (1998) below.
Condition 6.3. The transition matrix Q is irreducible with q_{ii} > 0 for all i ∈ S.
It follows that, when S is finite, there exists an n0 ≥ 0 such that
$$\min\left\{q^{(n_0)}_{ij} : i, j \in S\right\} > 0$$
Condition 6.4. For T(t) ↓ 0, let a_t(j|i) = a_{T(t)}(j|i) > 0. Then lim_{t→∞} a_t(j|i) exists, with
$$a_t(j\mid i) \downarrow 0 \ \text{ if } J(j) > J(i), \qquad a_t(j\mid i) \uparrow 1 \ \text{ if } J(j) < J(i).$$
Moreover, if J(j′) > J(j) > J(i), then
$$\frac{a_t(j'\mid i)}{a_t(j\mid i)} \to 0 \quad\text{and}\quad \frac{a_t(j'\mid i)}{a_t(j'\mid j)} \to 0.$$
Theorem 6.5. (Cruz & Dorea, 1998) Under the acceptance probability function a_{T(t)} given by (6.2) and under Conditions 6.3 and 6.4, the simulated annealing algorithm of Section 6.5 converges in the sense of (6.1).
Lemma 6.6. The model space M is finite.
Proof. Suppose there are p predictors. Then the model space has 2^p elements, which is always finite.
Lemma 6.7. q_{ii} > 0.
Proof. Suppose at time t the i-th model is selected. Since the i-th model itself belongs to the neighborhood of the i-th model under our neighborhood construction, we have
$$q_{ii} = \frac{\text{posterior probability of the } i\text{-th model}}{\text{sum of the posterior probabilities of the neighbors of the } i\text{-th model}} = \frac{P(M_i)P(y\mid M_i)}{\sum_{\gamma\in\mathrm{nbd}(i)} P(M_\gamma)P(y\mid M_\gamma)}, \quad\text{using (2.3)}.$$
By definition, the prior probability of any model is strictly greater than 0, and posterior probabilities are never 0 in practice for any given proper prior. This completes the proof.
Lemma 6.8. Q is irreducible.
Proof. Note that each model has its own binary representation through γ. Given two binary sequences, one can be reached from the other through a finite chain of neighbors by transforming one coordinate at a time (0 → 1 or 1 → 0). Hence P(Mj | Mi) > 0 for all i, j ∈ M, trivially completing the proof.
Lemma 6.9. The probability of moving to the j-th model from the i-th model in p steps is positive, i.e.,
$$q^{(p)}_{ij} > 0$$
Proof. Consider the null model M0 and the full model M1 (say); these are the two extreme models. One can move from M0 to M1 via the steps (0, 0, . . . , 0, 0) → (1, 0, . . . , 0, 0) → · · · → (1, 1, . . . , 1, 0) → (1, 1, . . . , 1, 1). Clearly, there are p steps. Since we have considered the two extreme models, any other model can be reached in p0 (p0 ≤ p) steps. This completes the proof.
Theorem 6.10. The convergence of the simulated annealing (SA) algorithm with acceptance function (6.2) holds in the sense of (6.1).
Proof. The conditions of Theorem 1 of Cruz & Dorea (1998) are satisfied by Lemmas 6.6–6.9 and the Gibbs acceptance probability function, as in Example 1 of Cruz & Dorea (1998). Therefore the SA algorithm converges according to Theorem 1 of Cruz & Dorea (1998), completing the proof.
In practice, to make the computation stable, we suggest calculating the log posterior probabilities instead of the actual posterior probabilities and using these as the estimates for the proposal distribution. Of course, one has to transform these quantities back to the exponential scale to obtain a move from the current model to its neighborhood. In this process we also need to guard against the exponential terms overflowing, by subtracting a constant from the exponents; for instance, we subtract the maximum of the log posterior probabilities of all the neighboring models.
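The stabilization described above is the standard log-sum-exp trick. A minimal Python sketch (illustrative only; the dissertation's computations are in R):

```python
import math

def proposal_probs(log_posteriors):
    """Normalize log posterior probabilities of the neighborhood models into
    a proposal distribution.  Subtracting the maximum before exponentiating
    keeps the exponentials from overflowing (the log-sum-exp trick)."""
    m = max(log_posteriors)
    weights = [math.exp(lp - m) for lp in log_posteriors]
    total = sum(weights)
    return [w / total for w in weights]

# Log posteriors of this magnitude would underflow a naive exp().
probs = proposal_probs([-1e4, -1e4 + math.log(2.0), -1e4])
```

The resulting probabilities depend only on the differences between log posteriors, so the shift by the maximum does not change the proposal distribution.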
CHAPTER 7
EMPIRICAL STUDY
In the following we describe a series of examples to assess the performance of the proposed SA-HPM method. We compare the proposed method with HPM, SSVS (George & McCulloch, 1993), MPM (Barbieri & Berger, 2004), lasso (Tibshirani, 1996), SCAD (Fan & Li, 2001), adaptive lasso (Zou, 2006), and elastic net (Zou & Hastie, 2005) (whenever feasible). We compare the various methods with respect to how many times each method recovers the data generating model; HPM is computed by complete enumeration.
Example 7.1. Linear Model.
The simulation setup is similar to that considered in George & McCulloch (1993). We set n = 60, p = 5, β = (0, 0, 0, 1, 1.2), so the data generating model is (4, 5). Covariates are generated independently from a standard normal distribution. The response is generated according to (3.1) with σ² = 1. The value σ² = 1 is kept the same for all other examples considered below. In this way, 100 datasets are generated keeping the predictors the same. The results are provided in Table 7.1. We record how many times the data generating model is selected by the various methods, as well as the time taken by each method. In particular, since Bayesian procedures are known to be more time-consuming due to analysis via Markov chain sampling, we show how our method addresses this issue.
To carry out SSVS and MPM, all the continuous parameters are analytically integrated out utilizing the conjugate structure in the model. For the derivation of the posterior distribution of γ see García-Donato & Martínez-Beneito (2013). The calculations for SSVS and MPM were done using 4 processors in parallel, on an Intel(R) Core(TM) i5-4210U CPU @ 1.70 GHz machine. On the other hand, since we have used the g prior with g = n, an analytical expression of the posterior probability is available, and thus the computation times of HPM and the proposed SA-HPM method are comparable to those of the frequentist methods. The time reported here is the total time taken for 100 simulations. All code is written in R (R Core Team, 2015). In addition, the pmomMarginalK function of the R package mombf (Rossell et al., 2014) has been used to obtain the Laplace approximation to the marginal likelihood for the SA-HPM method when the pMOM prior was used. We refer to this method as SA-HPM-pMOM. When the g prior has been used, the proposed method is simply called SA-HPM.
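Since the g-prior marginal likelihood is available in closed form, the objective function can be evaluated cheaply. One standard closed form of the g-prior Bayes factor of a model Mγ (with pγ predictors) against the intercept-only null model is BF(Mγ : M0) = (1 + g)^{(n−1−pγ)/2} / (1 + g(1 − Rγ²))^{(n−1)/2}, which depends on the data only through Rγ². A Python sketch, for illustration only (the dissertation's computations are in R, and the data below merely mimic the Example 7.1 settings):

```python
import math
import random

def r_squared(y, X):
    """R^2 of the OLS regression of y on the columns of X plus an intercept,
    computed from the normal equations via Gaussian elimination."""
    n = len(y)
    Z = [[1.0] + list(row) for row in X]       # prepend the intercept column
    k = len(Z[0])
    A = [[sum(Z[i][a] * Z[i][b] for i in range(n)) for b in range(k)] for a in range(k)]
    c = [sum(Z[i][a] * y[i] for i in range(n)) for a in range(k)]
    for col in range(k):                       # elimination with partial pivoting
        piv = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        c[col], c[piv] = c[piv], c[col]
        for r in range(col + 1, k):
            f = A[r][col] / A[col][col]
            for j in range(col, k):
                A[r][j] -= f * A[col][j]
            c[r] -= f * c[col]
    beta = [0.0] * k
    for r in range(k - 1, -1, -1):             # back substitution
        beta[r] = (c[r] - sum(A[r][j] * beta[j] for j in range(r + 1, k))) / A[r][r]
    ybar = sum(y) / n
    ss_res = sum((y[i] - sum(Z[i][a] * beta[a] for a in range(k))) ** 2 for i in range(n))
    ss_tot = sum((v - ybar) ** 2 for v in y)
    return 1.0 - ss_res / ss_tot

def log_bf_gprior(y, X, g=None):
    """log Bayes factor of the model (columns of X) against the null model
    under Zellner's g prior, with g = n by default as in the text."""
    n, p = len(y), len(X[0])
    g = float(n) if g is None else g
    r2 = r_squared(y, X)
    return 0.5 * (n - 1 - p) * math.log1p(g) - 0.5 * (n - 1) * math.log1p(g * (1.0 - r2))

# Illustrative data in the spirit of Example 7.1: n = 60, p = 5,
# beta = (0, 0, 0, 1, 1.2), standard normal covariates, sigma^2 = 1.
random.seed(11)
n, beta = 60, [0.0, 0.0, 0.0, 1.0, 1.2]
covs = [[random.gauss(0, 1) for _ in range(5)] for _ in range(n)]
y = [sum(b * xi for b, xi in zip(beta, row)) + random.gauss(0, 1) for row in covs]

bf_true = log_bf_gprior(y, [[row[3], row[4]] for row in covs])   # model (4, 5)
bf_noise = log_bf_gprior(y, [[row[0], row[1]] for row in covs])  # model (1, 2)
```

For the null model the formula reduces to a log Bayes factor of 0, and the data generating model receives a much larger log Bayes factor than a noise-only model, which is exactly what the SA search exploits.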
The number of iterations for the Gibbs sampler in the SSVS and MPM methods is set to 10,000; the first 2,000 samples are discarded as burn-in.
Lasso and elastic net have been fitted using the R package glmnet (Friedman et al., 2010), which implements the coordinate descent algorithm. SCAD and MCP are fitted using the ncvreg package (Breheny & Huang, 2011). The adalasso function of the package parcor (Kraemer et al., 2009) is used for fitting the adaptive lasso. SIS and related methods have been computed using the SIS function of the SIS package (Fan et al., 2015). The model size is bounded by the number of regressors by setting the argument nsis appropriately.
The results clearly indicate that all four Bayesian methods perform similarly. On the other hand, all frequentist methods perform similarly in terms of selecting the data generating model. As far as time is concerned, since the SSVS and MPM procedures are implemented using Markov chain sampling, they are more time-consuming.
Example 7.2. Role of Correlation.
This data generating scheme is based on an example in George & McCulloch (1993). The setup is the same as in Example 7.1 except that here we set β = (0, 0, 0, 1, 1.2), x1, x2, and x4 are generated independently from a standard normal distribution, and corr(x3, x5) = 0.95. So x3 can be thought of as a substantial proxy for x5. Therefore we expect (3, 4) instead of (4, 5) to often show up as the final model. The results are reported in Table 7.2.
Table 7.1: Number of times the correct model is obtained based on 100 repetitions: Model with Uncorrelated Predictors.

Method          Correct  FDR    FNR  Time
SSVS            91%      0.032  0    1.7h
MPM             83%      0.062  0    1.6h
HPM             91%      0.033  0    8.5s
Lasso           77%      0.083  0    10.9s
SCAD            78%      0.100  0    5.6s
Elastic Net     55%      0.174  0    12.1s
Adaptive Lasso  80%      0.087  0    1.2m
MCP             76%      0.108  0    5.6s
SIS-SCAD        79%      0.091  0    31.1s
SIS-MCP         83%      0.072  0    9.1s
ISIS-SCAD       78%      0.092  0    10.6s
ISIS-MCP        82%      0.073  0    10.5s
SA-HPM          90%      0.037  0    8.3s
All Bayesian methods perform poorly in this example, with MPM having the worst performance. Among the frequentist methods, SIS-SCAD has the best performance.
Example 7.3. Lasso example.
This example was considered by Tibshirani (1996). Set n = 20 and p = 8. Covariates are generated from a standard normal distribution with pairwise correlation between x_i and x_j equal to 0.5^{|i−j|}. β = (3, 1.5, 0, 0, 2, 0, 0, 0)′, so that the data generating model is (1, 2, 5). The results are reported in Table 7.3.
Again the performance of the Bayesian methods is superior. However, SSVS and MPM took around 2 hours using 4 processors, while SA-HPM takes only 12.6 seconds. On the other hand, ISIS-MCP remains the best among the frequentist methods considered.
Example 7.4. Adaptive Lasso Example (lasso fails).
This setup was considered by Zou (2006), who used this example to show that the lasso is not consistent for variable selection. Zou (2006) extended the lasso to estimate
Table 7.2: Number of times the correct model is obtained based on 100 repetitions: Linear regression in presence of collinearity.

Method          Correct  FDR    FNR    Time
SSVS            58%      0.182  0.078  2.0h
MPM             66%      0.148  0.067  1.9h
HPM             66%      0.158  0.085  15.7s
Lasso           47%      0.221  0.067  10.5s
SCAD            49%      0.267  0.136  5.0s
Elastic Net     0%       0.393  0      19.3s
Adaptive Lasso  53%      0.214  0.065  1.3m
MCP             30%      0.365  0.178  5.1s
SIS-SCAD        63%      0.183  0.082  10.0s
SIS-MCP         35%      0.338  0.198  10.1s
ISIS-SCAD       62%      0.192  0.099  11.6s
ISIS-MCP        34%      0.335  0.196  11.6s
SA-HPM          65%      0.171  0.105  12.1s
parameters adaptively and established that the adaptive lasso is a consistent variable selection procedure that also possesses the oracle property.
For this example, n = 60 and p = 4. Predictors are generated from a standard normal distribution with corr(x_j, x_k) = −0.39 for j < k < 4 and corr(x_j, x_4) = 0.23 for j < 4. We set β = (5.6, 5.6, 5.6, 0)′, so the data generating model is (1, 2, 3). The results are reported in Table 7.4.
As expected, all Bayesian methods consistently recover the true model. The time taken by SSVS and MPM is around 1.2 hours using 4 processors. The proposed SA-HPM selects the data generating model successfully 97% of the time and takes only 6.1 seconds. On the other hand, the lasso performs poorly, as illustrated by Zou (2006), and the adaptive lasso is the best performing frequentist method. Surprisingly, the elastic net was never able to find the data generating model in 100 simulations.
Example 7.5. p = 30 and Presence of Correlation.
Table 7.3: Number of times the correct model is obtained based on 100 repetitions: Lasso example.

Method          Correct  FDR    FNR    Time
SSVS            95%      0.01   0.002  2.3h
MPM             94%      0.016  0.002  2.0h
HPM             87%      0.029  0.007  41.5s
Lasso           36%      0.209  0      10.9s
SCAD            53%      0.163  0.002  7.8s
Elastic Net     7%       0.384  0      11.2s
Adaptive Lasso  58%      0.148  0.002  58.0
MCP             57%      0.162  0.009  4.6s
SIS-SCAD        43%      0.206  0.002  10.1s
SIS-MCP         58%      0.152  0.010  10.5s
ISIS-SCAD       48%      0.185  0.005  11.0s
ISIS-MCP        61%      0.144  0.006  11.9s
SA-HPM          81%      0.044  0.019  13.0s
We consider this example from Kuo & Mallick (1998) to assess the performance of SA-HPM along with the other methods when p is large. We set n = 100 and p = 30. Covariates are generated from a standard normal distribution with pairwise correlation 0.5.
β = (0, . . . , 0, 1, . . . , 1, 0, . . . , 0)′, where each block has length 10. So the data generating model is (11, . . . , 20). The results are presented in Table 7.5.
Notice that the calculation of HPM is omitted due to the infeasibility of enumerating 2^30 models. Otherwise, the performances of the Bayesian methods are similar. However, SSVS and MPM take a significant amount of time compared to that taken by the proposed SA-HPM. The reported time is that required by the methods for 100 repetitions using 4 processors in parallel; since each SSVS and MPM repetition takes around 40 to 50 minutes, their calculation is done using 100 processors in parallel.

Interestingly, almost all the frequentist regularized methods except the adaptive lasso and ISIS-MCP often fail to select the data generating model.
Example 7.6. p = 40 and Presence of Correlation.
Table 7.4: Number of times the correct model is obtained based on 100 repetitions: Adaptive lasso example (lasso fails).

Method          Correct  FDR    FNR    Time
SSVS            97%      0.008  0      1.2h
MPM             95%      0.012  0      1.2h
HPM             97%      0.01   0      4.7s
Lasso           1%       0.01   0.248  12.0s
SCAD            90%      0.025  0      4.7s
Elastic Net     0%       0.25   0      13.4s
Adaptive Lasso  94%      0.343  0      1.4m
MCP             82%      0.045  0      4.6s
SIS-SCAD        78%      0.055  0      9.8s
SIS-MCP         78%      0.055  0      11.7s
ISIS-SCAD       79%      0.052  0      11.9s
ISIS-MCP        79%      0.052  0      11.9s
SA-HPM          97%      0.008  0      6.1s
This example is taken from Zou & Hastie (2005) and illustrates the performance of the proposed SA-HPM method. Set n = 100 and p = 40. Covariates are generated from a standard normal distribution with pairwise correlation 0.5.
β = (0, . . . , 0, 2, . . . , 2, 0, . . . , 0, 2, . . . , 2)′, where each block has length 10, so that the data generating model (true model) is (11, . . . , 20, 31, . . . , 40). σ² is set to 1 as in the previous examples. The results are presented in Table 7.6.
Here we omit results for the SSVS, MPM, and HPM methods because they are computationally expensive. However, SA-HPM outperforms the frequentist methods.
Example 7.7. p > n, p = 200, n = 100.
Motivated by Song & Liang (2015), we consider this experiment where p > n. We set n = 100, p = 200, and β = (1, . . . , 1, 0, . . . , 0)′ with 8 ones followed by 192 zeros. Each row of the design matrix X was independently drawn from a multivariate normal distribution with mean 0 and identity covariance matrix. Table 7.7 reports the results.
Table 7.5: Number of times the correct model is obtained based on 100 repetitions: p = 30.

Method          Correct  FDR    FNR  Time
SSVS            89%      0.01   0    20.8h
MPM             96%      0.004  0    21.1h
HPM             -        -      -    -
Lasso           0%       0.319  0    24.4s
SCAD            77%      0.052  0    11.0s
Elastic Net     0%       0.414  0    12.5s
Adaptive Lasso  85%      0.018  0    1.3m
MCP             77%      0.037  0    11.3s
SIS-SCAD        66%      0.067  0    16.5s
SIS-MCP         75%      0.044  0    19.0s
ISIS-SCAD       76%      0.048  0    21.2s
ISIS-MCP        82%      0.033  0    24.4s
SA-HPM          97%      0.003  0    2.7m
Note that the g prior defined by (2.8) is not defined in p > n settings, so we apply nonlocal priors.
The function pmomLM of the package mombf was used for the MCMC method of Johnson & Rossell (2012) with the pMOM prior. We refer to this method as JR12. A Beta-Binomial prior with B(1, 20) was used on the models. A noninformative inverse gamma prior, IG(0.001, 0.001), was placed on σ². The dispersion parameter is set to τ = 2.85 for the pMOM prior on β, as suggested by Johnson (2013).
Here we omit results for the SSVS, MPM, and HPM methods because they are computationally expensive. SA-HPM-pMOM is the most successful method in recovering the data generating model and outperforms all other methods.
Example 7.8. p > n, p = 200, n = 100 and Presence of Correlation.
Motivated by Song & Liang (2015), we consider this experiment where p > n and multicollinearity is present. We set n = 100, p = 200, and β = (1, . . . , 1, 0, . . . , 0)′ with 8 ones followed by 192 zeros. Each row of the
design matrix X was independently drawn from a multivariate normal distribution having
Table 7.6: Number of times the correct model is obtained based on 100 repetitions: Elastic net example. p = 40.

Method          Correct  FDR    FNR    Time
SSVS            -        -      -      -
MPM             -        -      -      -
HPM             -        -      -      -
Lasso           1%       0.180  0      13.0s
SCAD            84%      0.022  0      24.7
Elastic Net     0%       0.963  0.155  12.8s
Adaptive Lasso  100%     0      0      2.1m
MCP             83%      0.012  0      13.1s
SIS-SCAD        76%      0.025  0      24.7s
SIS-MCP         81%      0.017  0      23.1s
ISIS-SCAD       78%      0.022  0      34.1s
ISIS-MCP        85%      0.015  0      30.8s
SA-HPM          100%     0      0      2.8m
mean 0 and covariance matrix Σ with diagonal entries equal to 1 and off-diagonal entries
equal to 0.5. Table 7.8 reports the results.
SA-HPM-pMOM again appears to be the most successful method in picking up the data generating model. These examples demonstrate that the proposed SA-HPM with pMOM priors is a potentially better alternative to the SIS, ISIS, and JR12 methods when p > n.
Example 7.9. p > n, p = 1000, n = 200.
Motivated by Song & Liang (2015), we consider this experiment where p > n. We set n = 200, p = 1000, and β = (1, . . . , 1, 0, . . . , 0)′ with 8 ones followed by 992 zeros. Each row of the design matrix X was independently drawn from a multivariate normal distribution with mean 0 and identity covariance matrix. Table 7.9 reports the results.
Example 7.10. p > n, p = 1000, n = 200 and Presence of Correlation.
Motivated by Song & Liang (2015), we consider this experiment where p > n and multicollinearity is present. We set n = 200, p = 1000, and β = (1, . . . , 1, 0, . . . , 0)′ with 8 ones followed by 992 zeros. Each row
Table 7.7: Number of times the correct model is obtained based on 100 repetitions: p > n, p = 200, n = 100.

Method          Correct  FDR    FNR    Time
SSVS            -        -      -      -
MPM             -        -      -      -
HPM             -        -      -      -
JR12            91%      0      0.003  55.2s
Lasso           0%       0.526  0      14.8s
SCAD            4%       0.440  0      22.3s
Elastic Net     0%       0.664  0      15.0s
Adaptive Lasso  24%      0.216  0      1.7m
MCP             29%      0.202  0      20.6s
SIS-SCAD        15%      0.351  0.002  18.2s
SIS-MCP         29%      0.239  0.002  17.7s
ISIS-SCAD       0%       0.765  0      48.7s
ISIS-MCP        1%       0.757  0      51.1s
SA-HPM          100%     0      0      5.3m
of the design matrix X was independently drawn from a multivariate normal distribution
having mean 0 and covariance matrix Σ with diagonal entries equal to 1 and off-diagonal
entries equal to 0.5. Table 7.10 reports the results.
Example 7.11. Real Data Example: Ozone-35.
The Ozone-35 data has been considered by García-Donato & Martínez-Beneito (2013). It has n = 178 observations and p = 35 covariates. For a detailed description of the original Ozone dataset, we refer to Table 7.11, which also appears in Casella & Moreno (2006). We have considered the covariates x3 − x7, along with the squared terms and interaction terms, as in García-Donato & Martínez-Beneito (2013). The dataset is readily available in the R package BayesVarSel (Garcia-Donato & Forte, 2015). García-Donato & Martínez-Beneito (2013) illustrated that the posterior probability of the median probability model is 23 times lower than that of the highest posterior model. The model space is huge, which makes complete enumeration infeasible. The g-prior was used so that the Bayes factors could be computed easily, and continuous parameters were integrated out as before.
Table 7.8: Number of times the correct model is obtained based on 100 repetitions: p > n, p = 200, n = 100 and Presence of Correlation.

Method          Correct  FDR    FNR     Time
SSVS            -        -      -       -
MPM             -        -      -       -
HPM             -        -      -       -
JR12            72%      0      0.002   1.2m
Lasso           0%       0.665  0       14.2s
SCAD            38%      0.125  0       14.8s
Elastic Net     0%       0.743  0       15.3s
Adaptive Lasso  33%      0.116  0       1.5m
MCP             48%      0.080  0.0001  11.0s
SIS-SCAD        0%       0.318  0.010   19.7s
SIS-MCP         0%       0.193  0.010   28.1s
ISIS-SCAD       1%       0.757  0       1.3m
ISIS-MCP        4%       0.721  0       1.2m
SA-HPM          83%      0.005  0.001   6.1m
Since this is a real dataset, the data generating model is not known. As discussed before, the HPM is perceived to be a good model. Using a hugely time-consuming study, García-Donato & Martínez-Beneito (2013) calculated the HPM. The Bayes factors of the HPM and MPM against M0 are reported in Table 7.12. The main point is that the proposed SA-HPM method was able to find the HPM 100 times out of 100 repetitions. In particular, the SA-HPM method is extremely useful even for large model spaces.
Example 7.12. Independence of the Starting Model

The theory of the simulated annealing method holds for any starting point in the state space. In the examples considered above, the starting model was the null model M0. In this example we show that, in our setup, convergence indeed does not depend on the starting model. We revisit the Tibshirani (1996) example, where we generate the response only once using the model (1, 2, 5) with β = (3, 1.5, 0, 0, 2, 0, 0, 0). We plot the log(Bayes factor) of the models against the null model M0, with the number of steps to reach the HPM
Table 7.9: Number of times the correct model is obtained based on 100 repetitions: p > n, p = 1000, n = 200.

Method          Correct  FDR    FNR  Time
SSVS            -        -      -    -
MPM             -        -      -    -
HPM             -        -      -    -
JR12            100%     0      0    8.0m
Lasso           2%       0.484  0    41.6s
SCAD            28%      0.278  0    1.2m
Elastic Net     0%       0.672  0    43.9s
Adaptive Lasso  60%      0.085  0    13.0m
MCP             58%      0.142  0    1.2m
SIS-SCAD        28%      0.276  0    24.5s
ISIS-SCAD       0%       0.789  0    3.4m
SIS-MCP         56%      0.168  0    22.9s
ISIS-MCP        0%       0.789  0    3.6m
SA-HPM          100%     0      0    5.5h
model on the horizontal axis, for different starting models, in Figure 7.1. The HPM here is the model (1, 2, 5), with log(Bayes factor) = 20.909 against M0.
Table 7.10: Number of times the correct model is obtained based on 100 repetitions: p > n, p = 1000, n = 200 and Presence of Correlation.

Method          Correct  FDR    FNR    Time
SSVS            -        -      -      -
MPM             -        -      -      -
HPM             -        -      -      -
JR12            100%     0      0      8.2m
Lasso           0%       0.765  0      49.5s
SCAD            73%      0.041  0      59.0s
Elastic Net     0%       0.832  0      51.9s
Adaptive Lasso  74%      0.031  0      9.3m
MCP             87%      0.015  0      46.8s
SIS-SCAD        0%       0.726  0.003  1.0m
ISIS-SCAD       1%       0.782  0      6.2m
SIS-MCP         0%       0.652  0.003  39.6s
ISIS-MCP        0%       0.789  0      7.5m
SA-HPM          100%     0      0      5.5h
Table 7.11: Description of the Ozone dataset.

Variable  Description
y         Response = daily maximum 1-hour-average ozone reading (ppm) at Upland, CA
x1        Month: 1 = January, . . . , 12 = December
x2        Day of month
x3        Day of week: 1 = Monday, . . . , 7 = Sunday
x4        500-millibar pressure height (m) measured at Vandenberg AFB
x5        Wind speed (mph) at Los Angeles International Airport (LAX)
x6        Humidity (%) at LAX
x7        Temperature (F) measured at Sandburg, CA
x8        Inversion base height (feet) at LAX
x9        Pressure gradient (mm Hg) from LAX to Daggett, CA
x10       Visibility (miles) measured at LAX
Table 7.12: The HPM and MPM for Ozone-35 data. The last two columns provide the Bayes factor against the null model and its log, respectively.

Serial No  Model           Bayes Factor  log(Bayes Factor)
HPM        7 10 23 26 29   1.02E+47      108.2364944
MPM        21 22 23 29     4.34E+45      105.0834851
Figure 7.1: Solution path for different starting models for the SA-HPM method. The points are the log Bayes factors of the visited models against the null model. The log(Bayes factor) of the data generating model (1, 2, 5) is 20.909.
CHAPTER 8
LOGISTIC REGRESSION
8.1 Introduction
Computation of the marginal likelihood of the models in the model space is one of the main keys to computing the HPM. In linear models with conjugate priors (examples include the normal prior and the g prior) we are able to obtain an analytical expression for the marginal likelihoods, which constitutes the objective function for the optimization needed to find the HPM. However, if we place a non-conjugate prior on the coefficients, then we fail to derive the marginal likelihood analytically and the evaluation of the objective function becomes a non-trivial task.

In logistic regression, primarily due to the nonlinearity of the link function, there exists no prior distribution for which an analytical expression of the marginal likelihood is available. Therefore a rigorous approach needs to be developed for this purpose. Early work on estimation of the marginal likelihood includes how to compute it from Gibbs sampler output (Chib, 1995); Chib (1995) elaborated this procedure using an example in logistic regression. This idea was extended to output obtained from a Metropolis-Hastings sampler (Chib & Jeliazkov, 2001). Raftery et al. (2007) used a harmonic-mean-type estimator to estimate the marginal likelihood. However, in our experience, it is a very difficult task to estimate the marginal likelihood using this type of estimator.
8.2 Power Posterior
Motivated by the work of Gelman & Meng (1998), Friel & Pettitt (2008) derived a simple method to estimate the marginal likelihood, which they named the power posterior method. They illustrated the computational ease of this technique using several examples. In this section we provide an introduction to the power posterior method.
Define

z(y | t) = ∫ L(y | θ_γ)^t p(θ_γ) dθ_γ,

where θ_γ stands for the parameters of model M_γ. Then the log marginal likelihood is given by

log p(y) = log [ z(y | t = 1) / z(y | t = 0) ] = ∫_0^1 E_{θ_γ | y, t} [ log L(y | θ_γ) ] dt.

Applying the trapezoidal rule,

log p(y) ≈ Σ_{i=1}^{n−1} (1/2) (t_{i+1} − t_i) [ E_{θ_γ | y, t_{i+1}} log L(y | θ_γ) + E_{θ_γ | y, t_i} log L(y | θ_γ) ].    (8.1)
Following the suggestions in Friel & Pettitt (2008), we set n = 10 and t_i = a_i^5, i = 1, . . . , n, where the a_i are equally spaced points in the [0, 1] interval. Friel & Pettitt (2008) noted that if collecting samples from the posterior distribution of the parameters is possible, then theoretically it is also possible to collect power posterior samples. Once we have power posterior samples, we calculate the right-hand side of (8.1) and hence obtain an estimate of the marginal likelihood.
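To make the estimator concrete, the following sketch applies (8.1) to a toy conjugate normal-mean model where the power posterior at each temperature is itself Gaussian and can be sampled directly, so the exact marginal likelihood is available for comparison. The sketch is in Python and all data values are illustrative, not from the dissertation; in the nonlinear models below one would run a Markov chain at each temperature instead.

```python
import math
import random

random.seed(1)

# Toy conjugate model: y_i ~ N(mu, 1) with prior mu ~ N(0, 1).  The power
# posterior at temperature t is N(t*S/(t*n+1), 1/(t*n+1)).
y = [1.3, 0.7, 2.1, 1.6, 0.9]
n_obs = len(y)
S = sum(y)
SS = sum(v * v for v in y)

def log_lik(mu):
    return -0.5 * n_obs * math.log(2 * math.pi) - 0.5 * sum((v - mu) ** 2 for v in y)

# Temperature ladder t_i = a_i^5 with the a_i equally spaced in [0, 1].
ts = [(i / 10) ** 5 for i in range(11)]

# Estimate E_{mu | y, t}[log L(y | mu)] at each temperature by sampling.
rung_means = []
for t in ts:
    prec = t * n_obs + 1.0
    m, sd = t * S / prec, 1.0 / math.sqrt(prec)
    draws = [log_lik(random.gauss(m, sd)) for _ in range(4000)]
    rung_means.append(sum(draws) / len(draws))

# Trapezoidal rule (8.1).
log_ml = sum(0.5 * (ts[i + 1] - ts[i]) * (rung_means[i] + rung_means[i + 1])
             for i in range(len(ts) - 1))

# Exact log marginal likelihood of this toy model, for comparison:
# marginally y ~ N_n(0, I + 11'), with det(I + 11') = n + 1.
exact = (-0.5 * n_obs * math.log(2 * math.pi) - 0.5 * math.log(n_obs + 1.0)
         - 0.5 * (SS - S * S / (n_obs + 1.0)))
```

The power schedule concentrates the rungs near t = 0, where the integrand changes most rapidly, which is why the estimate stays close to the exact value with only 10 intervals.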
In order to apply the power posterior method to logistic regression, however, some issues need to be addressed.
1. First, note that since we apply the trapezoidal rule and wish to obtain power posterior samples for each temperature, the number of iterations should be kept as small as possible for computational efficiency. A single marginal likelihood calculation actually requires parallel Markov chain sampling from 10 power posteriors.
2. Second, prior specification plays an important role as usual. We place a normal prior in all of our computations. We did not experience a significant difference between the normal prior and the double exponential prior when p is small enough.
3. The usual way to obtain posterior samples for logistic regression is to assume a probit regression (Chib, 1995) and develop a Gibbs sampler. But we wanted to avoid that, to assess our proposed method as accurately as possible. So we sought automatic Bayesian updating software that keeps the logistic link function as it is. OpenBUGS (Thomas et al., 2006) is a widely accepted software for this purpose. However, for a simulation study it is very inconvenient to repeat the same job in OpenBUGS again and again. This led us to use the R2OpenBUGS package (Sturtz et al., 2005) in R (R Core Team, 2015). The R2OpenBUGS package provides an interface between R and OpenBUGS by virtue of which we can set up any experiment in R, carry out the Bayesian updating in OpenBUGS, have OpenBUGS send the posterior samples to R, and finally complete the required calculations using those posterior samples in R. This integration of R and OpenBUGS allows us to compute SSVS and MPM efficiently. However, the computation of the power posterior requires sampling from the power posterior distribution of the parameters, which is not readily available in OpenBUGS. One could use the zeros trick for this purpose; when we did this in practice, the results were not satisfactory. This made us search for a ready-made sampler. We first tried a Metropolis-Hastings (Hastings, 1970) sampler, for which a ready routine is available in the MHadaptive R package (Chivers, 2012). However, the Metropolis-Hastings sampler requires a tuning parameter that has to be changed adaptively, as well as a variance-covariance matrix for the proposal distribution; lack of proper information often yields biased estimates of the marginal likelihood. On the other hand, the diversitree package (FitzJohn, 2012) in R implements the slice sampler (Neal, 2003), which requires only one tuning parameter, the width of the proposal step. In general, the slice algorithm is insensitive to the value of this tuning parameter; in our experience, setting it to 0.1 leads to good mixing of the Markov chain.
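The sampler described above can be sketched in a few lines. The following is a minimal univariate version of the step-out slice sampler of Neal (2003), written in Python for illustration (the dissertation uses the implementation in the R package diversitree); the width w = 0.1 follows the text, everything else here is our own.

```python
import math
import random

def slice_sample(log_f, x0, n_draws, w=0.1, max_steps=200):
    """Univariate slice sampler (Neal, 2003) with step-out; the only tuning
    parameter is the proposal-step width w."""
    draws, x = [], x0
    for _ in range(n_draws):
        # Auxiliary height: uniform under the density at the current point.
        log_y = log_f(x) + math.log(random.random())
        # Step out to find an interval containing the slice.
        left = x - w * random.random()
        right = left + w
        steps = 0
        while log_f(left) > log_y and steps < max_steps:
            left -= w
            steps += 1
        steps = 0
        while log_f(right) > log_y and steps < max_steps:
            right += w
            steps += 1
        # Shrink the interval until a point inside the slice is drawn.
        while True:
            x1 = left + random.random() * (right - left)
            if log_f(x1) > log_y:
                x = x1
                break
            if x1 < x:
                left = x1
            else:
                right = x1
        draws.append(x)
    return draws

random.seed(7)
samples = slice_sample(lambda v: -0.5 * v * v, 0.0, 5000)  # standard normal target
```

Because the interval is found adaptively, a small width such as 0.1 costs only extra step-out iterations rather than poor mixing, which matches the insensitivity noted above.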
Essentially, the power posterior method is simple and easy to implement even for complex nonlinear models. Our contribution is to use the simplicity of this method to estimate the marginal likelihood at each evaluation of the objective function for simulated annealing. Recall that in the simulated annealing algorithm the proposal distributions are based on the posterior probabilities of the neighborhood models. We replace these posterior probabilities with the marginal likelihoods obtained via the power posterior method.
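The resulting search can be sketched as follows. This Python toy is illustrative only: the "log posterior" below is an artificial stand-in, whereas in practice each log_post evaluation would be a power posterior estimate of the log marginal likelihood; the function names and all settings are ours.

```python
import math
import random

def sa_hpm_search(log_post, p, n_iters=200, t0=1.0, cooling=0.99):
    """Simulated-annealing search for the highest-posterior model.  At each
    step the proposal distribution over the current model's neighborhood
    (all single-coordinate flips, plus the model itself) is proportional to
    exp(log posterior / temperature)."""
    random.seed(0)
    current = tuple([0] * p)                  # start at the null model
    best, best_score, temp = current, log_post(current), t0
    for _ in range(n_iters):
        nbhd = [current] + [
            current[:i] + (1 - current[i],) + current[i + 1:] for i in range(p)
        ]
        scores = [log_post(m) / temp for m in nbhd]
        mx = max(scores)                      # log-sum-exp stabilization
        wts = [math.exp(s - mx) for s in scores]
        r, acc = random.random() * sum(wts), 0.0
        for m, w in zip(nbhd, wts):
            acc += w
            if r <= acc:
                current = m
                break
        sc = log_post(current)
        if sc > best_score:
            best, best_score = current, sc
        temp *= cooling                       # cool down
    return best

# Toy "log posterior", sharply peaked at the model (1, 1, 0, 0, 1).
target = (1, 1, 0, 0, 1)
toy_log_post = lambda m: -5.0 * sum(a != b for a, b in zip(m, target))
found = sa_hpm_search(toy_log_post, p=5)
```

The cooling schedule makes the proposal increasingly greedy, so the chain settles on the highest-scoring model while the early high-temperature steps allow escapes from local modes.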
8.3 Simulation Study
In this study we generate the response 100 times according to (2.1) and (2.2). We set n = 100 and p = 5. The data generating model is (2, 4), with β = (0, 2, 0, 2, 0). The predictors are generated from a standard normal distribution such that the first 2 covariates have correlation around 0.95 and the others are independent. The results are shown in Table 8.1.
The SSVS and MPM are estimated using the R2OpenBUGS package (Sturtz et al., 2005) in R
(R Core Team, 2015). We also investigate the performance of LPML here. The posterior sampling for the LPML method is done using the slice sampler (Neal, 2003) available in the R package diversitree (FitzJohn, 2012).
Table 8.1: Number of times the correct model is obtained based on 100 repetitions: Logistic regression in presence of collinearity.

Method       Correct  FDR    FNR
SSVS         60%      0.149  0.015
MPM          53%      0.158  0.007
LPML         35%      0.3    0.123
HPM          71%      0.100  0.039
Lasso        36%      0.246  0.014
Elastic Net  0%       0.424  0
SIS-SCAD     48%      0.241  0.063
SIS-MCP      36%      0.300  0.095
ISIS-SCAD    46%      0.243  0.056
ISIS-MCP     35%      0.306  0.114
SA-HPM       65%      0.13   0.008
The results show that HPM is the most successful method in recovering the data generating model, followed by the proposed SA-HPM method. The MPM and LPML perform poorly. When there is a large number of explanatory variables, the HPM calculation is not feasible, and hence the proposed SA-HPM appears to be the method of choice.
Example 8.1. We set n = 200 and p = 20. The data generating model is (2, 4, 7, 9), with β2 = β4 = β7 = β9 = 2 and all other coefficients zero. The data are generated such that corr(x1, x2) = corr(x5, x6) = corr(x11, x12) = corr(x15, x16) = 0.95. All other regressors are generated independently from a standard normal distribution. Finally, the response is generated 100 times according to (2.1) and (2.2). The results are reported in Table 8.2.
Clearly, the SA-HPM method outperforms the other Bayesian methods, and continues to outperform the frequentist methods.
Table 8.2: Number of times the correct model is obtained based on 100 repetitions: Logistic regression for large p. p = 20.

Method       Correct  FDR    FNR
SSVS         47%      0.146  0.015
MPM          39%      0.17   0.007
LPML         -        -      -
HPM          -        -      -
Lasso        24%      0.277  0
Elastic Net  0%       0.605  -
SIS-SCAD     23%      0.255  0.001
SIS-MCP      58%      0.126  0
ISIS-SCAD    25%      0.248  0.001
ISIS-MCP     61%      0.117  0
SA-HPM       65%      0.087  0.04
Example 8.2. Nodal data.
The nodal data, considered by Collett (1991) and Chib (1995), contain cancer information on fifty-three patients. Researchers tried to explain whether or not a patient has the cancer with the help of the age of the patient in years at diagnosis, the level of serum acid phosphate, the result of an X-ray examination, the size of the tumor, and the pathological grade of the tumor. Collett (1991) concluded, using a frequentist deviance approach, that the second, third, and fourth variables are significant in explaining the cancer status of a patient, while Chib (1995) obtained the same significant variables using Bayesian marginal likelihood computation. Chib (1995) fitted a probit regression model and obtained the highest marginal likelihood for this model. Hence, this model can be regarded as the HPM.
We applied SA-HPM with a probit regression model. We placed independent normal priors on β with mean 0.75 and variance 25, as suggested by Chib (1995). SA-HPM was 100% successful in recovering the HPM based on 100 repetitions.
CHAPTER 9
SURVIVAL MODELS
9.1 Introduction
Here we present a comparison of different Bayesian methods for survival data. LPML is a widely used criterion for model selection in Bayesian survival analysis. In our study we show that the HPM, and hence the proposed SA-HPM method, often outperforms LPML. SSVS and MPM are also included in the comparisons as before.
9.2 Weibull Distribution
In our simulation study, the number of independent variables p is set to 5, since with increasing p the calculation of HPM becomes increasingly time-consuming. The covariates are generated from a normal distribution such that x1, x2, and x3 are highly correlated, with correlation around 0.9; x4 and x5 are independently distributed according to the standard normal distribution. We set n = 100. The data generating model is (2, 4), with β = (0, 1, 0, −1, 0). The rest of the examples assume this setup, unless otherwise mentioned.
The SSVS and MPM are estimated using the R2OpenBUGS package (Sturtz et al., 2005) in R (R Core Team, 2015). The posterior sampling for the LPML method is done using the slice sampler (Neal, 2003) available in the R package diversitree (FitzJohn, 2012). The posterior probabilities required for the computation of HPM and SA-HPM are estimated using the power posterior method discussed in Section 8.2.
Example 9.1. Weibull Regression
We assume that the response y follows a Weibull distribution given by

f(y) = η ν y^{η−1} e^{−ν y^η},  y > 0, η > 0, ν > 0,    (9.1)

where

ν = exp(X_γ β_γ).    (9.2)

The response y is generated according to (9.1) with η = 1.75, and ν is given by (9.2). We assume no censoring. We fit a Weibull regression when estimating SSVS, MPM, LPML, HPM, and SA-HPM. During the analysis a Uniform prior with parameters (0.5, 3) is placed on the shape parameter η. The results are given in Table 9.1.
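Responses from (9.1) can be generated by inversion of the survival function S(y) = exp(−ν y^η): if U is uniform on (0, 1), then y = (−log U / ν)^{1/η} has the desired distribution. A Python sketch, where the covariate row is illustrative rather than taken from the dissertation:

```python
import math
import random

random.seed(42)

def rweibull_reg(x_row, beta, eta):
    """Draw one survival time from (9.1) with nu = exp(x'beta), via the
    inverse-CDF method: S(y) = exp(-nu * y^eta) gives y = (-log U / nu)^(1/eta)."""
    nu = math.exp(sum(xi * bi for xi, bi in zip(x_row, beta)))
    u = random.random()
    return (-math.log(u) / nu) ** (1.0 / eta)

# Example 9.1 settings: eta = 1.75, beta = (0, 1, 0, -1, 0); the covariate
# row below is only an illustration.
eta, beta = 1.75, (0.0, 1.0, 0.0, -1.0, 0.0)
x = (0.2, -0.5, 1.0, 0.3, -0.1)
times = [rweibull_reg(x, beta, eta) for _ in range(20000)]
```

A quick check of the draws is that their sample mean should approach the theoretical mean ν^{−1/η} Γ(1 + 1/η) of this Weibull distribution.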
Table 9.1: Number of times the correct model is obtained based on 100 repetitions: Weibull regression.

Method  Correct  FDR    FNR
SSVS    53%      0.180  0.097
MPM     36%      0.247  0.077
LPML    35%      0.300  0.123
HPM     81%      0.102  0.028
SA-HPM  80%      0.130  0.071
From Table 9.1, we can conclude that SA-HPM outperforms SSVS, MPM, and LPML
in selecting the data generating model. Note that LPML, which is widely used in survival
settings, has the worst performance.
9.3 Mixture of Weibull Distribution
The Weibull distribution is a popular choice among parametric survival models; it satisfies the proportional hazards assumption and hence often fits well when the data come from a proportional hazards mechanism. We propose a Bayesian model where the baseline is modeled by a mixture of Weibull distributions. In simulation examples we show that this model fits reasonably well even when data are generated from distributions other than Weibull.
Thus we propose the following hierarchical representation for a regression model with a mixture of Weibull distributions.
f(y_i | λ_i, η, β_I, β) = η λ_i exp(x_i′β) y_i^{η−1} exp(−λ_i exp(x_i′β) y_i^η)

π(λ_i) ∼ G(α_I, β_I),  i = 1, . . . , n

π(β_I) ∼ G(α_0, β_0)

π(η) ∼ U(α_s, β_s)

π(β) ∼ N(μ_β, Σ_β)
If V ∼ G(φ1, φ2) then the pdf of V is defined as

f(v) = (φ2^{φ1} / Γ(φ1)) v^{φ1−1} exp(−φ2 v),  v > 0, φ1, φ2 > 0.    (9.3)

If W ∼ U(ν1, ν2) then the pdf of W is defined as

f(w) = 1 / (ν2 − ν1),  ν1 ≤ w ≤ ν2, ν2 > ν1 > 0.
We have imposed a prior on the rate parameter β_I of the distribution of the intercept terms λ_i so that information can be shared between the observations. This leads to the following posterior computation.
f(y_i | λ_i, η, β_I, β) π(λ_i | β_I) π(β_I) π(η) π(β)
  = λ_i η exp(x_i′β) y_i^{η−1} exp(−λ_i exp(x_i′β) y_i^η) · (β_I^{α_I} / Γ(α_I)) λ_i^{α_I−1} exp(−λ_i β_I)
    · (β_0^{α_0} / Γ(α_0)) β_I^{α_0−1} exp(−β_I β_0) π(η) π(β).

We analytically integrate the likelihood over λ_i:

π(β_I, η, β | y_i) ∝ η exp(x_i′β) y_i^{η−1} (β_I^{α_I} / Γ(α_I)) (β_0^{α_0} / Γ(α_0)) β_I^{α_0−1} exp(−β_0 β_I) π(η) π(β)
    · ∫_0^∞ λ_i exp(−λ_i exp(x_i′β) y_i^η) λ_i^{α_I−1} exp(−λ_i β_I) dλ_i
  = η exp(x_i′β) y_i^{η−1} (β_I^{α_I} / Γ(α_I)) (β_0^{α_0} / Γ(α_0)) β_I^{α_0−1} exp(−β_0 β_I)
    · [ Γ(α_I + 1) / (β_I + exp(x_i′β) y_i^η)^{α_I + 1} ] π(η) π(β),   i = 1, . . . , n.
Hence this likelihood may be used to compute the power posterior estimate. Note that, once the full conditional likelihood is integrated with respect to the \lambda_i, we are left with n fewer parameters and thereby achieve faster mixing in the MCMC.
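The key step in the marginalization above is the gamma integral \int_0^\infty \lambda^{\alpha} e^{-b\lambda}\, d\lambda = \Gamma(\alpha+1)/b^{\alpha+1}. A quick numerical sanity check in Python (the values of \alpha_I and the rate are hypothetical, standing in for \beta_I + \exp(x_i'\beta) y_i^{\eta}):

```python
import math

def gamma_integral_numeric(alpha, b, upper=60.0, n=200000):
    """Trapezoidal approximation of the integral of lam**alpha * exp(-b*lam)
    over lam in (0, upper); the tail beyond `upper` is negligible here."""
    h = upper / n
    total = 0.0
    for k in range(n + 1):
        lam = k * h
        w = 0.5 if k in (0, n) else 1.0
        total += w * (lam ** alpha) * math.exp(-b * lam)
    return total * h

# The integral that removes lambda_i from the likelihood equals
# Gamma(alpha_I + 1) / (beta_I + exp(x'beta) * y**eta) ** (alpha_I + 1).
alpha_I, rate = 2.0, 1.5          # hypothetical illustrative values
closed = math.gamma(alpha_I + 1) / rate ** (alpha_I + 1)
numeric = gamma_integral_numeric(alpha_I, rate)
print(closed, numeric)            # the two should agree closely
```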
Example 9.2. Mixture of Weibull Regression
88
Table 9.2: Number of times the correct model is obtained based on 100 repetitions. Data-generating model – Weibull regression. Analysis model – Mixture of Weibull regression.

Method    Correct   FDR     FNR
SSVS      51%       0.238   0.122
MPM       32%       0.263   0.032
LPML      30%       0.321   0.135
HPM       74%       0.125   0.042
SA-HPM    72%       0.138   0.063
Table 9.3: Number of times the correct model is obtained based on 100 repetitions. Data-generating model – Gamma regression. Analysis model – Mixture of Weibull regression.

Method    Correct   FDR     FNR
SSVS      43%       0.237   0.053
MPM       32%       0.263   0.028
LPML      41%       0.263   0.055
HPM       70%       0.102   0.029
SA-HPM    66%       0.190   0.047
The response y is generated according to (9.1) with \eta = 1.75 and \nu given by (9.2). No censoring is assumed. We fit a mixture of Weibull regression; Table 9.2 reports the results. Notice that HPM and SA-HPM are the best performers, whereas both LPML and MPM perform poorly.
Example 9.3. Response Follows Gamma Distribution
For this example, the response y is generated according to a Gamma distribution (9.3) with \phi_1 = 1.75 and \phi_2 = \phi_1 / \exp(X\beta). The purpose here is twofold. First, we wish to investigate the performance of fitting a mixture of Weibull distributions when the data are not from a Weibull distribution. Second, we wish to check the performance of the proposed SA-HPM method. Table 9.3 provides the results.
First, we compare these results with those of Table 9.2 and notice that the results are similar. Second, as before, HPM and hence SA-HPM tend to find the data-generating
Table 9.4: Number of times the correct model is obtained based on 100 repetitions. Data-generating model – Log Normal regression. Analysis model – Mixture of Weibull regression.

Method    Correct   FDR     FNR
SSVS      58%       0.215   0.072
MPM       45%       0.190   0.113
LPML      22%       0.392   0.202
HPM       60%       0.176   0.088
SA-HPM    54%       0.265   0.157
model more often than the other Bayesian methods. MPM, followed by LPML, are the two worst-performing methods.
Example 9.4. Response Follows Log Normal Distribution.
Following the spirit of the previous example, here we draw the response from a Log Normal distribution. Recall that if U \sim LN(\mu_0, \sigma_0^2) then \log(U) \sim N(\mu_0, \sigma_0^2). We set \mu_0 = \exp(X\beta) and \sigma_0 = \sqrt{1/1.75}. We report the results in Table 9.4 and notice that the conclusions are similar to those in Example 9.3.
9.4 Censoring
In the presence of censoring, the likelihood for the i-th observation is given by

l(\theta \mid y_i) = f(y_i \mid \theta)^{\delta_i}\, S(y_i \mid \theta)^{1-\delta_i},

where \delta_i = 1 if y_i is an event time and \delta_i = 0 if y_i is right censored, i = 1, \ldots, n; f(\cdot \mid \theta) is the pdf and S(\cdot \mid \theta) is the survival function. When f(\cdot \mid \cdot) is the pdf of a Weibull distribution (9.1), the survival function is

S(y \mid \eta, \nu) = \exp(-\nu y^{\eta}), \quad y > 0,\; \nu > 0,\; \eta > 0.
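A per-observation censored log-likelihood of this form can be sketched in Python (illustrative parameter values; this is not the dissertation's code):

```python
import math

def weibull_loglik_i(y, delta, eta, nu):
    """Per-observation censored log-likelihood:
    log l = delta * log f(y) + (1 - delta) * log S(y), with
    f(y) = eta * nu * y**(eta - 1) * exp(-nu * y**eta) and S(y) = exp(-nu * y**eta)."""
    log_S = -nu * y ** eta
    if delta == 1:                       # observed event time: density contribution
        return math.log(eta * nu) + (eta - 1.0) * math.log(y) + log_S
    return log_S                         # right-censored: survival contribution only

# Contributions of an event versus a censoring observed at the same time y = 2
# (eta and nu are hypothetical values):
print(weibull_loglik_i(2.0, 1, eta=1.75, nu=0.3),
      weibull_loglik_i(2.0, 0, eta=1.75, nu=0.3))
```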
The contribution of the censored data can be added to the likelihood using the following.
\pi(\beta_I, \eta, \beta \mid y_i) \propto \frac{\beta_I^{\alpha_I}}{\Gamma(\alpha_I)}\, \frac{\beta_0^{\alpha_0}}{\Gamma(\alpha_0)}\, \beta_I^{\alpha_0-1} \exp(-\beta_I \beta_0)\, \pi(\eta)\, \pi(\beta) \int_0^{\infty} \exp\{-\lambda_i \exp(x_i'\beta) y_i^{\eta}\}\, \lambda_i^{\alpha_I-1} \exp(-\lambda_i \beta_I)\, d\lambda_i

= \frac{\beta_I^{\alpha_I}}{\Gamma(\alpha_I)}\, \frac{\beta_0^{\alpha_0}}{\Gamma(\alpha_0)}\, \beta_I^{\alpha_0-1} \exp(-\beta_0 \beta_I)\, \frac{\Gamma(\alpha_I)}{(\beta_I + \exp(x_i'\beta) y_i^{\eta})^{\alpha_I}}\, \pi(\eta)\, \pi(\beta), \qquad i = 1, \ldots, n.
Example 9.5. Responses are Right Censored.
We keep a similar setup for generating the covariates. The response is generated using (9.1) with \eta = 1.75 and \nu given by (9.2). The censoring distribution is taken to be (9.1) with \eta = 1.75 and \nu = 1/3. This results in 25% censoring in the simulated data. Recall that information is typically lost through censoring; keeping this in mind, the number of observations is increased to n = 200.
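For intuition, the censoring fraction in such a design can be checked by simulation. Assuming, for illustration, that both the event time and the censoring time are Weibull with a common shape \eta and rates \nu_e and \nu_c, the censoring probability is \nu_c / (\nu_e + \nu_c); the rates below are hypothetical and not the exact simulation design of this chapter.

```python
import math
import random

def rweibull(nu, eta):
    """Draw from S(t) = exp(-nu * t**eta) by inverse CDF."""
    return (-math.log(random.random()) / nu) ** (1.0 / eta)

random.seed(7)
eta, nu_event, nu_cens = 1.75, 1.0, 1.0 / 3.0   # hypothetical rates
n = 100000
censored = 0
for _ in range(n):
    t_event, t_cens = rweibull(nu_event, eta), rweibull(nu_cens, eta)
    censored += t_cens < t_event                 # we observe min(t_event, t_cens)

# With a common shape, P(censored) = nu_cens / (nu_event + nu_cens) = 0.25 here.
print(censored / n)
```

The identity follows because t -> t**eta maps both Weibull times to exponentials with the same rates, and the probability that one exponential precedes another is the ratio of rates.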
Here we also compare with the frequentist regularized regression methods lasso (Tibshirani, 1996) and elastic net (Zou & Hastie, 2005). These are fitted using the R package glmnet (Friedman et al., 2010; Simon et al., 2011). We report the simulation results in Table 9.5.
Table 9.5: Number of times the correct model is obtained based on 100 repetitions. Presence of censoring.

Method       Correct   FDR     FNR
SSVS         47%       0.258   0.148
MPM          42%       0.263   0.116
LPML         11%       0.439   0.115
HPM          76%       0.094   0.050
Lasso        51%       0.179   0.025
Elastic Net  0%        0.463   0.458
SIS-lasso    12%       0.379   0.010
ISIS-lasso   11%       0.375   0.010
SA-HPM       72%       0.123   0.084
The results suggest that SA-HPM clearly outperforms the other Bayesian and frequentist methods in selecting the data-generating model. In particular, LPML is able to find the data-generating model only 11 times out of 100 repetitions.
Example 9.6. Proteomics data.
We apply the methodologies discussed above to a proteomics dataset. This is a right-censored dataset with 110 subjects, 8 of whom are censored. The data involve two types of treatment, which can be represented by a binary variable. In addition, there are 60 biomarkers. Moreover, we created interaction terms between treatment and the biomarkers. This leads us to deal with 121 regressors. We report the results in Table 9.6.
Note that, since the model space is extremely large, the runtime for MPM and SSVS was around 5 hours; we set 25000 iterations and 5000 burn-in. Notice that the model obtained using SA-HPM has the highest log(marginal likelihood); in this sense, SA-HPM outperforms the other methods. We also include the dimensions of the models on the SA-HPM path and their
Table 9.6: Models obtained using different methods for the proteomics data.

Method       Model                                            log(marginal likelihood)
MPM          14, 16, 20, 23, 24, 30, 50, 61, 65, 71, 72,      -359.6159
             75, 77, 84, 85
SSVS         1, 2, 4, 6, 14, 15, 16, 17, 19, 20, 22, 23,      -390.3568
             24, 30, 31, 32, 35, 47, 49, 51, 54, 57, 61,
             62, 63, 65, 70, 71, 77, 79, 84, 85, 88, 92,
             93, 96, 100, 101, 102, 107, 108, 112, 119
Lasso        10, 23                                           -359.5315
Elastic Net  10, 23                                           -359.5315
SIS-lasso    6, 7, 10, 17, 20, 23, 27, 44, 68, 70, 96         -366.4602
ISIS-lasso   4, 7, 10, 11, 14, 20, 23, 24, 28, 30, 32, 41,    -371.4623
             42, 49, 50, 55, 70, 74, 78, 81, 82, 96, 99,
             102, 105, 117
SA-HPM       10, 11, 20, 24, 30, 50, 75                       -352.8674
log(marginal likelihoods) at each step of SA-HPM in Figure 9.1. The solution paths correspond to different models used as starting points (start models) for SA-HPM.
Figure 9.1: (a) The path of model dimensions obtained at each step of SA-HPM. (b) The path of log(marginal likelihood) of the models obtained at each step of SA-HPM.
CHAPTER 10
DISCUSSION
10.1 Introduction
The posterior probability approach is one of the oldest Bayesian model selection criteria. It is widely accepted in the Bayesian community and became popular because of its solid theoretical and logical foundation. However, like some other Bayesian criteria, this approach requires enumerating the whole model space. The work presented in this article tries to find the highest posterior model while avoiding enumeration of the whole model space.
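A generic sketch of this strategy in Python (not the dissertation's SA-HPM code): simulated annealing over the space of inclusion vectors, flipping one coordinate at a time and accepting moves by a cooling Metropolis rule. The toy additive log-score below stands in for the log posterior model probability.

```python
import math
import random

def sa_search(score, p, n_iter=5000, t0=1.0, cooling=0.999):
    """Simulated annealing over {0,1}^p to maximize `score` (a log-posterior surrogate)."""
    random.seed(0)
    gamma = [0] * p                       # start from the null model
    best, best_score = gamma[:], score(gamma)
    current = best_score
    t = t0
    for _ in range(n_iter):
        j = random.randrange(p)           # propose flipping one inclusion indicator
        gamma[j] ^= 1
        new = score(gamma)
        if new >= current or random.random() < math.exp((new - current) / t):
            current = new                 # accept the move
            if new > best_score:
                best, best_score = gamma[:], new
        else:
            gamma[j] ^= 1                 # reject: undo the flip
        t *= cooling                      # cool the temperature
    return best

# Toy log-score: reward including variables 0, 2, 5; penalize any other inclusion.
truth = {0, 2, 5}
score = lambda g: sum(3.0 if j in truth else -3.0 for j in range(len(g)) if g[j])
print(sa_search(score, p=10))
```

In SA-HPM the score would be the (log) posterior model probability, which for non-conjugate models is itself estimated, e.g., by the power posterior method discussed later.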
10.2 Comparison of Methods for Linear Models
When comparing the various Bayesian methods for linear models, the main issue with SSVS and MPM is that both methods take a significant amount of time. We have followed the Gibbs sampling of García-Donato & Martínez-Beneito (2013), in which the continuous parameters are integrated out and we are left with a finite parameter space. To capture this, we wrote the code entirely in R (R Core Team, 2015), neither taking advantage of the MCMC machinery of OpenBUGS (Thomas et al., 2006) nor using efficient code in C, C++, or Fortran. This may explain the large computation time of these methods. But then, we wrote the code for the proposed SA-HPM in R as well, which, in our opinion, makes the comparison fair. The only advantage we have taken is that a closed-form expression for the posterior probabilities is available when we use the g prior. Unfortunately, SSVS and MPM are not blessed by this fact. The MPM could have been calculated using the posterior probabilities as well; however, such a practice is not tenable for large p. On the other hand, the performance of MPM is not good. Thus, either way, the proposed SA-HPM is preferable.
10.3 Non-Linear Models
In general, the non-conjugacy of any prior for non-linear models demands a significant amount of computation. We have used the power posterior method (Friel & Pettitt, 2008) for this purpose. The advantage of the power posterior is that it is easy to understand and simple to employ. To our knowledge, there has been no prior effort to use the computational simplicity of the power posterior for variable selection; our work is thus the first attempt in this direction.
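For reference, the identity from Friel & Pettitt (2008) on which the power posterior method rests can be written as:

```latex
% Power posterior at temperature t in [0, 1]:
\pi_t(\theta \mid y) \propto f(y \mid \theta)^{t}\, \pi(\theta),

% Thermodynamic-integration identity for the marginal likelihood:
\log f(y) = \int_0^1 \mathbb{E}_{\theta \mid y, t}\!\left[\log f(y \mid \theta)\right] dt.
```

Here the expectation is taken under \pi_t; in practice the integral is approximated over a discrete ladder of temperatures, with MCMC draws from each \pi_t.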
10.4 Survival Settings
Again, to our knowledge, the Bayesian variable selection literature is very limited in survival statistics. One of the main issues is the lack of a good or conjugate prior. On top of that, the form of a survival function is often complicated to deal with (for example, the Cox proportional hazards model or the Weibull regression model). The presence of censoring adds another level of difficulty. Bayesian methods depend on the given data and try to recover the true situation from them; in the presence of censoring a significant amount of information is lost, and it becomes challenging to recover the relevant variables. In our method we employed a mixture of Weibull distributions and have shown its good performance under different scenarios. Furthermore, LPML, perhaps the best-known goodness-of-fit statistic in survival models, has been shown to work poorly. In addition, simulations show that DIC suffers from the same problem as LPML.
10.5 Future Extensions
10.5.1 Large p
Although we have run the experiments for non-linear models under small p, the work can easily be extended to large p, and from the convergence theory of simulated annealing, the proposed SA-HPM has the potential to do well there.
10.5.2 Prior
We have also used the double exponential prior on β and obtained similar results for small p; those results are not presented here for brevity. For linear models, when using the double exponential prior, the closed-form expression of the posterior probabilities is not available, and hence the power posterior comes to the rescue, as discussed before. Thus one may be tempted to use any shrinkage prior and assess the performance.
10.5.3 Different Models
The immediate extension of this work would be to use SA-HPM for other models such as proportional odds regression, Poisson regression, and spatial regression. Currently, the performance of SA-HPM is being examined for the proportional odds model.
10.5.4 Posterior Sampling
When a Gibbs sampler is not available (for example, for non-linear models), we have relied on other R (R Core Team, 2015) packages (diversitree, MHadaptive) due to the inefficiency of OpenBUGS in calculating the power posterior using the zeros trick for posterior sampling. Developing an efficient posterior sampler would certainly make the computation of SA-HPM easier and reduce the computation time.
10.5.5 Computation Time
In addition, we believe that the computation time can be reduced by efficient programming in lower-level languages such as C, C++, or Fortran. In that case, our goal will be to deliver an R package that implements our idea with functions written in one of these lower-level languages.
CHAPTER 11
CONCLUSION
In conclusion, first note that we have not tried to find the data-generating model using any direct approach, such as a shrinkage prior or sampling from the model space. Rather, our effort focuses on finding the highest posterior model, which is often perceived to have good properties. If the highest posterior model does not possess sufficient information on the data, our proposed SA-HPM method might fail. Fortunately, no study to date shows the failure of the posterior probability approach. An exception is the MPM of Barbieri & Berger (2004), but the optimality of MPM depends on the assumption of an orthogonal design matrix, which is rare in practice; in the presence of collinearity, the proposed SA-HPM has been shown to outperform MPM.
Furthermore, as García-Donato & Martínez-Beneito (2013) observed, SSVS estimators have the potential to find a good model, but when a large number of predictors is available, it might become infeasible to check their performance.
One important feature of our method is that the computation time is comparable to that taken by various frequentist regularized regressions. Comparison of Bayesian and frequentist methods is always delicate; even keeping this in mind, the proposed SA-HPM has been shown to outperform its frequentist counterparts.
Most regularized regressions require one to choose a tuning parameter, and this is typically done by cross validation. The entire path of the tuning parameter is believed to contain the data-generating solution; examination of that entire path is beyond the scope of this article.
Most of the stochastic search processes available in the literature depend on the posterior probabilities of the models by virtue of their design. We use this idea of working with the posterior probability directly, with the help of the simulated annealing algorithm and the power posterior method. In summary, our research strengthens the classical idea of assessing a model by its posterior probability.
We agree with García-Donato & Martínez-Beneito (2013) that a large volume of near-future research in the Bayesian variable selection literature will involve sampling and stochastic search. We also agree with Hahn & Carvalho (2015) that good models can be obtained by exploring posterior summaries of the models. The highest posterior model, one such posterior summary, is widely known to have excellent properties. Our research thus provides a simple, efficient, quick, and feasible way forward in this direction of variable selection.
REFERENCES
Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions
on Automatic Control , 19 (6), 716 - 723.
Andrews, D. F., & Mallows, C. L. (1974). Scale mixtures of normal distributions. Journal
of the Royal Statistical Society: Series B (Statistical Methodology), 36 (1), 99 – 102.
Barbieri, M. M., & Berger, J. O. (2004). Optimal predictive model selection. The Annals
of Statistics , 32 (3), 870 – 897.
Basu, S., & Chib, S. (2003, March). Marginal likelihood and bayes factor for dirichlet process
mixture models. Journal of the American Statistical Association, 98 (461), 224 – 235.
Basu, S., & Ebrahimi, N. (2003, May). Bayesian software reliability models based on
martingale processes. Technometrics , 45 (2), 150 – 158.
Basu, S., Sen, A., & Banerjee, M. (2003). Bayesian analysis of competing risks with par-
tially masked cause-of-failure. Journal of the Royal Statistical Society: Series C (Applied
Statistics), 52 (1), 77 – 93.
Basu, S., & Tiwari, R. C. (2010, April). Breast cancer survival, competing risks and
mixture cure model: a bayesian analysis. Journal of the Royal Statistical Society: Series
A (Statistics in Society), 173 (2), 307 – 329.
Berger, J. O., & Molina, G. (2005). Posterior model probabilities via path-based pairwise
priors. Statistica Neerlandica, 59 , 3 – 15.
Bertsimas, D., & Tsitsiklis, J. (1993). Simulated annealing. Statistical Science, 8 (1), 10 –
15.
Bottolo, L., & Richardson, S. (2010). Evolutionary stochastic search for bayesian model
exploration. Bayesian Analysis , 5 (3), 583 – 618.
Breheny, P., & Huang, J. (2011). Coordinate descent algorithms for nonconvex penalized
regression, with applications to biological feature selection. Annals of Applied Statistics ,
5 (1), 232–253.
Breiman, L. (1995, November). Better subset selection using the nonnegative garrote.
Technometrics , 37 (4), 373 – 384.
Candes, E., & Tao, T. (2007). The dantzig selector: Statistical estimation when p is much
larger than n. The Annals of Statistics , 35 (6), 2313 – 2351.
Carvalho, C., Polson, N. G., & Scott, J. G. (2010). The horseshoe estimator for sparse
signals. Biometrika, 97 , 465 – 480.
Casella, G., & George, E. I. (1992, August). Explaining the gibbs sampler. The American
Statistician, 46 (3), 167 – 174.
Casella, G., Girón, F. J., Martínez, M. L., & Moreno, E. (2009). Consistency of bayesian
procedures. The Annals of Statistics , 37 (3), 1207 – 1228.
Casella, G., & Moreno, E. (2006, March). Objective bayesian variable selection. Journal of
the American Statistical Association, 101 (473).
Celeux, G., Forbes, F., Robert, C. P., & Titterington, D. M. (2006). Deviance information
criteria for missing data models. Bayesian Analysis , 1 , 651 – 674.
Chen, M. H., Harrington, D. P., & Ibrahim, J. G. (2002). Bayesian cure rate models for
malignant melanoma: A case-study of eastern cooperative oncology group trial e1690.
Journal of the Royal Statistical Society. Series C (Applied Statistics), 51 (2), 135 – 150.
Chen, M. H., Ibrahim, J. G., & Sinha, D. (1999). A new bayesian model for survival data
with a surviving fraction. Journal of the American Statistical Association, 94 (447), 909 –
919.
Chen, M. H., Shao, Q. M., & Ibrahim, J. G. (2001). Monte carlo methods in bayesian
computation. New York, New York: Springer-Verlag.
Chib, S. (1995, December). Marginal likelihood from the gibbs output. Journal of the
American Statistical Association, 90 (432), 1313 – 1321.
Chib, S., & Jeliazkov, I. (2001, March). Marginal likelihood from the metropolis-hastings
output. Journal of the American Statistical Association, 96 (453), 270 – 281.
Chivers, C. (2012). Mhadaptive: General markov chain monte carlo for bayesian inference
using adaptive metropolis-hastings sampling [Computer software manual]. Retrieved from
http://CRAN.R-project.org/package=MHadaptive (R package version 1.1-8)
Clyde, M. A., Ghosh, J., & Littman, M. (2011). Bayesian adaptive sampling for variable
selection and model averaging. Journal of Computational and Graphical Statistics , 20 , 80
– 101.
Collett, D. (1991). Modelling binary data. London: Chapman and Hill.
Cooner, F., Banerjee, S., Carlin, B. P., & Sinha, D. (2007, June). Flexible cure rate
modeling under latent activation schemes. Journal of the American Statistical Association,
102 (478), 560 – 572.
Cruz, J. R., & Dorea, C. C. Y. (1998, December). Simple conditions for the convergence of
simulated annealing type algorithms. Journal of Applied Probability , 35 (4), 885 – 889.
Dey, T., Ishwaran, H., & Rao, J. S. (2008). An in-depth look at highest posterior model
selection. Econometric Theory , 24 , 377 – 403.
DiCiccio, T. J., Kass, R. E., Raftery, A., & Wasserman, L. (1997, September). Computing
bayes factors by combining simulation and asymptotic approximations. Journal of the
American Statistical Association, 92 (439), 903 – 915.
Efron, B., Hastie, T., Johnstone, I., & Tibshirani, R. (2004). Least angle regression. Annals
of Statistics , 32 (2), 407 – 499.
Fan, J., Feng, Y., Saldana, D. F., Samworth, R., & Wu, Y. (2015).
Sis: Sure independence screening [Computer software manual]. Retrieved from
http://CRAN.R-project.org/package=SIS (R package version 0.7-5)
Fan, J., & Li, R. (2001, December). Variable selection via nonconcave penalized likelihood
and its oracle properties. Journal of the American Statistical Association, 96 (456), 1348
– 1360.
Fan, J., & Lv, J. (2008). Sure independence screening for ultrahigh dimensional feature space.
Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70 (5), 849 –
911.
Fan, J., & Lv, J. (2010a). A selective overview of variable selection in high dimensional
feature space. Statistica Sinica, 20 , 101 – 148.
Fan, J., & Lv, J. (2010b). A selective overview of variable selection in high dimensional
feature space. Statistica Sinica, 20 , 101 – 148.
Fernandez, C., Ley, E., & Steel, M. F. J. (2001). Benchmark priors for bayesian model
averaging. Journal of Econometrics , 100 , 381 – 427.
FitzJohn, R. G. (2012). Diversitree: Comparative phylogenetic analyses of diversification in
r. Methods in Ecology and Evolution, 3 , 1084 – 1092.
Flom, P. L., & Cassell, D. L. (2007). Stopping stepwise: Why stepwise and similar selection
methods are bad, and what you should use. NESUG .
Frank, I. E., & Friedman, J. H. (1993, May). A statistical view of some chemometrics
regression tools. Technometrics , 35 (2), 109 – 135.
Friedman, J., Hastie, T., & Tibshirani, R. (2010). Regularization paths for generalized linear
models via coordinate descent. Journal of Statistical Software, 33 (1), 1 – 22.
Friel, N., & Pettitt, A. N. (2008, July). Marginal likelihood estimation via power posteriors.
Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70 (3), 589 – 607.
Garcia-Donato, G., & Forte, A. (2015). Bayesvarsel: Bayes factors, model choice
and variable selection in linear models [Computer software manual]. Retrieved from
http://CRAN.R-project.org/package=BayesVarSel (R package version 1.6.1)
García-Donato, G., & Martínez-Beneito, M. A. (2013, March 15). On sampling strategies
in bayesian variable selection problems with large model spaces. Journal of the American
Statistical Association, 108 (501), 340 – 352.
Geisser, S., & Eddy, W. F. (1979). A predictive approach to model selection. Journal of the
American Statistical Association, 74 (365), 153–160.
Gelfand, A. E., Dey, D. K., & Chang, H. (1992). Model determination using predictive
distributions with implementation via sampling-based methods. Technical Report for the
Office of Naval Research, 462 (1992).
Gelman, A., & Meng, X.-L. (1998). Simulating normalizing constants: From importance
sampling to bridge sampling to path sampling. Statistical Science, 13 (2), 163 – 185.
George, E. I., & McCulloch, R. E. (1993). Variable selection via gibbs sampling. Journal of
the American Statistical Association, 88 (423), 881 – 889.
George, E. I., & McCulloch, R. E. (1997). Approaches for bayesian variable selection.
Statistica Sinica, 7 , 339 – 373.
Ghosh, P., Basu, S., & Tiwari, R. C. (2009). Bayesian analysis of cancer rates from seer
program using parametric and semiparametric joinpoint regression models. Journal of the
Royal Statistical Society: Series A (Statistics in Society), 104 (486), 439 – 452.
Hahn, P. R., & Carvalho, C. M. (2015). Decoupling shrinkage and selection in bayesian
linear models: A posterior summary perspective. Journal of the American Statistical
Association, 110 (509), 435 – 448.
Hall, P., Lee, E. R., & Park, B. U. (2009). Bootstrap based penalty choice for the lasso,
achieving oracle performance. Statistica Sinica, 19 , 449 – 471.
Hans, C. (2009). Bayesian lasso regression. Biometrika, 96 , 835 – 845.
Hans, C., Dobra, A., & West, M. (2007). Shotgun stochastic search for large p regression.
Journal of the American Statistical Association, 102 (478), 507 – 516.
Hastings, W. K. (1970, April). Monte carlo sampling methods using markov chains and
their applications. Biometrika, 57 (1), 97 – 109.
Huang, J., Horowitz, J. L., & Ma, S. (2008). Asymptotic properties of bridge estimators in
sparse high-dimensional regression models. The Annals of Statistics , 36 (2), 587 – 613.
Ibrahim, J. G., Chen, M.-H., & Sinha, D. (2004). Bayesian survival analysis. New York,
New York: Springer Series in Statistics.
Ishwaran, H., & Rao, J. S. (2005). Spike and slab variable selection: Frequentist and bayesian
strategies. The Annals of Statistics , 33 (2), 730 – 773.
Johnson, V. E. (2013). On numerical aspects of bayesian model selection in high and
ultrahigh-dimensional settings. Bayesian Analysis , 8 (4), 741 – 758.
Johnson, V. E., & Rossell, D. (2012, July 24). Bayesian model selection in high-dimensional
settings. Journal of the American Statistical Association, 107 (498), 649 – 660.
Kass, R. E., & Raftery, A. E. (1995, June). Bayes factors. Journal of the American Statistical
Association, 90 (430), 773 – 795.
Kraemer, N., Schaefer, J., & Boulesteix, A.-L. (2009). Regularized estimation of large-scale
gene regulatory networks using gaussian graphical models. BMC Bioinformatics , 10 (384).
Kuo, L., & Mallick, B. (1998). Variable selection for regression models. Sankhya: The Indian
Journal of Statistics. Special Issue on Bayesian Analysis , 60 (1), 65 – 81.
Kyung, M., Gill, J., Ghosh, M., & Casella, G. (2010). Penalized regression, standard errors,
and bayesian lassos. Bayesian Analysis , 5 (2), 369 – 412.
Laud, P. W., & Ibrahim, J. G. (1995). Predictive model selection. Journal of the Royal
Statistical Society: Series B (Statistical Methodology), 57 (1), 247 – 262.
Leng, C., Tran, M.-N., & Nott, D. (2014, April). Bayesian adaptive lasso. Annals of the
Institute of Statistical Mathematics , 66 (2), 221 – 244.
Li, Q., & Lin, N. (2010). The bayesian elastic net. Bayesian Analysis , 5 (1), 151 – 170.
Liang, F., Paulo, R., Molina, G., Clyde, M. A., & Berger, J. O. (2008, March). Mixtures
of g priors for bayesian variable selection. Journal of the American Statistical Association,
103 (481), 410 – 423.
Martin, A. D., Quinn, K. M., & Park, J. H. (2011). MCMCpack: Markov chain
monte carlo in R. Journal of Statistical Software, 42 (9), 22. Retrieved from
http://www.jstatsoft.org/v42/i09/
Mitchell, T. J., & Beauchamp, J. J. (1988, December). Bayesian variable selection in linear
regression. Journal of the American Statistical Association, 83 (404), 1023 – 1032.
Neal, R. M. (2001). Annealed importance sampling. Statistics and Computing , 11 , 125 –
139.
Neal, R. M. (2003). Slice sampling. The Annals of Statistics , 31 (3), 705 – 767.
O’Hara, R. B., & Sillanpaa, M. J. (2009). A review of bayesian variable selection methods:
What, how and which. Bayesian Analysis , 4 (1), 85 – 118.
Park, T., & Casella, G. (2008, June). The bayesian lasso. Journal of the American Statistical
Association, 103 (482).
Picard, R. R., & Cook, R. D. (1984). Cross-validation of regression models. Journal of the
American Statistical Association, 79 (387), 575 – 583.
Polson, N. G., & Scott, J. G. (2012). Local shrinkage rules, levy processes, and regularized
regression. Journal of the Royal Statistical Society: Series B (Statistical Methodology),
74 (2), 287 – 311.
Polson, N. G., Scott, J. G., & Windle, J. (2014, September). The bayesian bridge. Journal
of the Royal Statistical Society: Series B (Statistical Methodology), 76 (4), 713 – 733.
R Core Team. (2015). R: A language and environment for statistical computing [Computer
software manual]. Vienna, Austria. Retrieved from http://www.R-project.org/
Raftery, A. E. (1999). Bayes factors and bic. comment on ”a critique of the bayesian
information criterion for model selection”. Sociol. Methods Res., 27 , 411 – 427.
Raftery, A. E., Newton, M. A., Satagopan, J. M., & Krivitsky, P. N. (2007). Estimating
the integrated likelihood via posterior simulation using the harmonic mean identity. In
Bayesian statistics (Vol. 8, pp. 1 – 45).
Rossell, D., Cook, J. D., Telesca, D., & Roebuck, P. (2014). mombf: Moment
and inverse moment bayes factors [Computer software manual]. Retrieved from
http://CRAN.R-project.org/package=mombf (R package version 1.5.9)
Schwarz, G. E. (1978). Estimating the dimension of a model. Annals of Statistics , 6 (2), 461
- 464.
Seber, G. A. F., & Lee, A. J. (2003). Linear regression analysis. Hoboken, New Jersey: A
John Wiley & Sons Publication.
Shao, J. (1993, June). Linear model selection by cross validation. Journal of the American
Statistical Association, 88 (422), 486 – 494.
Shi, M., & Dunson, D. B. (2011, February 1). Bayesian variable selection via particle
stochastic search. Statistics and Probability Letters , 81 (2), 283 – 291.
Simon, N., Friedman, J., Hastie, T., & Tibshirani, R. (2011). Regularization paths for cox’s
proportional hazards model via coordinate descent. Journal of Statistical Software, 39 (5),
1 – 13. Retrieved from http://www.jstatsoft.org/v39/i05/
Song, Q., & Liang, F. (2015). High dimensional variable selection with reciprocal l1 regu-
larization. Journal of the American Statistical Association.
Spiegelhalter, D. J., Best, N. G., Carlin, B. P., & van der Linde, A. (2002). Bayesian
measures of model complexity and fit. Journal of Royal Statistical Society: Series B
(Statistical Methodology), 64 (4), 583 – 639.
Sturtz, S., Ligges, U., & Gelman, A. (2005). R2winbugs: A package for running
winbugs from r. Journal of Statistical Software, 12 (3), 1 – 16. Retrieved from
http://www.jstatsoft.org
Sun, W., Ibrahim, J. G., & Zou, F. (2009). Variable selection by bayesian adaptive lasso
and iterative adaptive lasso, with application for genome-wide multiple loci mapping.
Biostatistics Technical Report Series , 10 .
Thomas, A., O’Hara, B., Ligges, U., & Sturtz, S. (2006, March). Making bugs open. R
News , 6 (1), 12 – 17.
Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal
Statistical Society: Series B (Statistical Methodology), 58 (1), 267 – 288.
Tibshirani, R. (2011). Regression shrinkage and selection via the lasso: a retrospective.
Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73 (3), 273 – 282.
Tibshirani, R., Saunders, M., Rosset, S., Zhu, J., & Knight, K. (2005). Sparsity and
smoothness via the fused lasso. Journal of Royal Statistical Society: Series B (Statistical
Methodology), 67 (1), 91 – 108.
Tsodikov, A. D., Ibrahim, J. G., & Yakovlev, A. Y. (2003). Estimating cure rates from
survival data: An alternative to two-component mixture models. Journal of the American
Statistical Association, 98 (464), 1063 – 1078.
Watanabe, S. (2010). Asymptotic equivalence of bayes cross validation and widely applicable
information criterion in singular learning theory. Journal of Machine Learning Research,
11 (2010), 3571 – 3594.
Watanabe, S. (2013). A widely applicable bayesian information criterion. Journal of Machine
Learning Research, 14 (2013), 867 – 897.
Yan, X., & Su, X. G. (2009). Linear regression analysis: theory and computing. World
Scientific.
Yin, G., & Ibrahim, J. G. (2005, December). Cure rate models: A unified approach.
Canadian Journal of Statistics , 33 (4), 559 – 570.
Yuan, M., & Lin, Y. (2006). Model selection and estimation in regression with grouped
variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology),
68 (1), 49 – 67.
Zellner, A. (1986). On assessing prior distributions and bayesian regression analysis with
g-prior distributions. In Bayesian Inference and Decision Techniques: Essays in Honor of
Bruno de Finetti , 233 – 243.
Zhang, C.-H. (2010). Nearly unbiased variable selection under minimax concave penalty.
The Annals of Statistics , 38 (2), 894 – 942.
Zou, H. (2006, December). The adaptive lasso and its oracle properties. Journal of the
American Statistical Association, 101 (476), 1418 – 1429.
Zou, H., & Hastie, T. (2005). Regularization and variable selection via the elastic net.
Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67 (2), 301 –
320.
Zou, H., & Zhang, H. H. (2009). On the adaptive elastic-net with a diverging number of
parameters. The Annals of Statistics , 37 (4), 1733 – 1751.