Variable Selection and Model Averaging in Semiparametric Overdispersed Generalized Linear Models
Author(s): Remy Cottet, Robert J. Kohn, and David J. Nott
Source: Journal of the American Statistical Association, Vol. 103, No. 482 (Jun. 2008), pp. 661-671
Published by: American Statistical Association
Stable URL: http://www.jstor.org/stable/27640088
Accessed: 16/06/2014 10:34
Variable Selection and Model Averaging in
Semiparametric Overdispersed Generalized Linear Models
Remy Cottet, Robert J. Kohn, and David J. Nott
We express the mean and variance terms in a double-exponential regression model as additive functions of the predictors and use Bayesian variable selection to determine which predictors enter the model and whether they enter linearly or flexibly. When the variance term is null, we obtain a generalized additive model, which becomes a generalized linear model if the predictors enter the mean linearly. The model is
estimated using Markov chain Monte Carlo simulation, and the methodology is illustrated using real and simulated data sets.
KEY WORDS: Bayesian analysis; Double-exponential family; Hierarchical prior; Markov chain Monte Carlo.
1. INTRODUCTION
Correctly modeling the response variance in regression is important for efficient estimation of mean parameters, correct inference, and understanding the sources of variability in the response. Generalized linear models (GLMs) have traditionally been used to model non-Gaussian regression data (e.g., McCullagh and Nelder 1989), where the response has a distribution from the exponential family and a transformation of the mean response is a linear function of predictors. This framework was extended to generalized additive models (GAMs) by Hastie and Tibshirani (1990), where a transformation of the mean is modeled as a flexible additive function of the predictors. But sometimes the restriction to the exponential family in GLMs and GAMs is insufficiently general, because the variance of these distributions is a function of the mean and the data often exhibit greater variability than is implied by such mean-variance relationships, a phenomenon known as overdispersion. Underdispersion, where the data exhibit less variability than expected, also can occur, although less frequently.
We model the mean and variance terms using the class of double-exponential regression models introduced by Efron (1986). This class of models has the parsimony and interpretability of GLMs and is able to account for both overdispersion and underdispersion. The most general model that we consider describes the mean and dispersion terms, after transformation by link functions, as flexible additive functions of the predictors. One drawback of such a model is that it may contain many additive terms and parameters even when there is a moderate number of predictors. The main contribution of our article is to present a Bayesian variable selection approach that allows the data to determine whether or not to include predictors, whether to model the effect of predictors flexibly or linearly, and whether overdispersion or underdispersion is present. If the dispersion term is null, then we obtain a generalized additive model, which becomes a generalized linear model
Remy Cottet is Lecturer, University of Sydney, Faculty of Business and Economics, Sydney, New South Wales 2006, Australia (E-mail: [email protected]). Robert J. Kohn is Professor, School of Economics, Australian School of Business, University of New South Wales, Sydney, New South Wales 2052, Australia (E-mail: [email protected]). David J. Nott is Associate Professor, Department of Statistics and Applied Probability, Faculty of Science, National University of Singapore, Singapore 117546 (E-mail: [email protected]). This work was supported by Australian Research Council grant DP0667069. The authors thank Dr. Mikis Stasinopoulos for his quick and helpful response to some questions about the GAMLSS package. They also thank a referee, an associate editor, and the joint editors for extensive comments that improved the content and exposition of the article.
if the flexible terms drop out of the mean. An important benefit of our work on variable selection and model averaging is that it leads to more interpretable models and produces more efficient model estimates when there are redundant covariates or parameters. To estimate the model, we develop an efficient Markov chain Monte Carlo (MCMC) sampling scheme.

To the best of our knowledge, alternative approaches to flexibly model the mean and variance functions do not address similar issues of model selection in a systematic way that is practically feasible when there are many predictors. We now review some of the literature related to this article.
The extended quasi-likelihood approach of Nelder and Pregibon (1987) allows modeling of the overdispersion as a function of the covariates, but in general extended quasi-likelihood estimators may not be consistent (Davidian and Carroll 1988). Yee and Wild (1996) estimated additive models using generalized estimating equations. Inference about mean and variance functions using estimating equations has the drawback that there is no fully specified model, making it difficult to deal with characteristics of the predictive distribution for a future response other than its mean and variance. Model-based approaches to overdispersion include exponential dispersion models and related approaches (Jorgensen 1997; Smyth 1989), the extended Poisson process models of Faddy (1997), and mixture models, such as the beta-binomial, negative binomial, and generalized linear mixed models (Breslow and Clayton 1993; Lee and Nelder 1996). One drawback of mixture models is that they cannot model underdispersion. Generalized additive mixed models incorporating random effects in GAMs were considered by Lin and Zhang (1999). Both Yee and Wild (1996) and Rigby and Stasinopoulos (2005) considered very general frameworks for additive modeling and algorithms for estimating the additive terms. However, there is clearly scope for further research on inference, and Rigby and Stasinopoulos (2005) suggested that one use for their methods is as an exploratory tool for a subsequent fully Bayesian analysis of the kind that we consider here. Other recent work on Bayesian GAMs has been done by Brezger and Lang (2005) and Smith and Kohn (1996).

Nott (2006) considered Bayesian nonparametric estimation of a double-exponential family model but did not consider variable selection, and his priors for the unknown functions and smoothing parameters are very different from those used in our
© 2008 American Statistical Association
Journal of the American Statistical Association, June 2008, Vol. 103, No. 482, Theory and Methods
DOI 10.1198/016214508000000346
article. Our article refines and generalizes the work of Shively, Kohn, and Wood (1999) and Yau, Kohn, and Wood (2003) on nonparametric regression in binary and multinomial probit regression models, where a data-based prior was used to carry out variable selection. A serious drawback of this prior is that it is necessary to first estimate the model with all flexible terms included, even though the actual fitted model may require only a much smaller number of such terms. This makes the approach impractical when there is a moderate to large number of terms in the full model, because the Markov chain simulation breaks down. Yau et al. (2003) discussed this problem and gave some strategies to overcome it. Our proposed hierarchical prior overcomes this problem and also is computationally more efficient than the data-based prior approach of Yau et al. (2003), because it requires one simulation run through the data, whereas the data-based approach requires two runs, the first to obtain the data-based prior and the second to estimate the parameters of the model.
2. MODEL AND PRIOR DISTRIBUTIONS
2.1 The Double-Exponential Family
Write the density of a random variable y from a one-parameter exponential family as

    p(y; ψ, φ/A) = exp{ [yψ − b(ψ)] / (φ/A) + c(y, φ/A) },   (1)

where ψ is a location parameter, φ/A is a known scale parameter, and b(·) and c(·,·) are known functions. The mean of y is μ = b′(ψ), and its variance is (φ/A)b″(ψ). This means that ψ = ψ(μ) is a function of μ, as is the variance. Examples of densities that can be written in this form are the Gaussian, binomial, Poisson, and gamma (McCullagh and Nelder 1989). In (1) we write the scale parameter as φ/A in anticipation of later discussion of regression models, where φ is common to all responses but A may vary between responses.
A double-exponential family is defined from a corresponding one-parameter exponential family by

    p(y; μ, θ, φ/A) = Z(μ, θ, φ/A) θ^{1/2} p(y; μ, φ/A)^θ p(y; y, φ/A)^{1−θ},   (2)

where θ is an additional parameter and Z(μ, θ, φ/A) is a normalizing constant. To get some intuition for this definition, consider a Gaussian density with variance 1 and apply the double-exponential family construction. The resulting double-Gaussian distribution is in fact an ordinary Gaussian density with mean μ and variance 1/θ, so we can think of the parameter θ as a scale parameter modeling overdispersion (θ < 1) or underdispersion (θ > 1) with respect to the original one-parameter exponential family density. Whereas the double-Gaussian density is simply the ordinary Gaussian density, for distributions like the binomial and Poisson, where the variance is a function of the mean, the corresponding double-binomial and double-Poisson densities are genuine extensions that allow modeling of the variance.
Efron (1986) showed that

    E(y) ≈ μ,   Var(y) ≈ (1/θ)(φ/A) b″(ψ),   and   Z(μ, θ, φ/A) ≈ 1,   (3)

whereas these expressions are exact for θ = 1. Equation (3) helps interpret the parameters in the double-exponential model and shows how the GLM mean-variance relationship is embedded within the double-exponential family, which is important for parsimonious modeling of the variance in regression.
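The approximations in (3) can be checked numerically for the double-Poisson case. The following sketch is our own illustration (not the authors' code), in plain Python, using the convention 0⁰ = 1 for the p(y; y) factor; it builds the unnormalized density of (2) by direct summation and compares the resulting normalizing constant, mean, and variance with 1, μ, and μ/θ:

```python
import math

def log_poisson(y, mu):
    """Log Poisson density log p(y; mu), with p(y; mu) = exp(-mu) mu^y / y!."""
    if mu == 0.0 or y == 0:
        return -mu  # covers p(0; 0) = 1 via the 0^0 = 1 convention
    return -mu + y * math.log(mu) - math.lgamma(y + 1)

def double_poisson_unnorm(y, mu, theta):
    """Unnormalized double-Poisson density from (2):
    theta^(1/2) p(y; mu)^theta p(y; y)^(1 - theta)."""
    return math.sqrt(theta) * math.exp(
        theta * log_poisson(y, mu) + (1.0 - theta) * log_poisson(y, y)
    )

def double_poisson_summary(mu, theta, ymax=500):
    """Return (1/Z, mean, variance) by direct summation over y = 0..ymax."""
    w = [double_poisson_unnorm(y, mu, theta) for y in range(ymax + 1)]
    total = sum(w)                 # equals 1/Z(mu, theta); approximately 1 by (3)
    pr = [wy / total for wy in w]
    mean = sum(y * p for y, p in enumerate(pr))
    var = sum((y - mean) ** 2 * p for y, p in enumerate(pr))
    return total, mean, var

total, mean, var = double_poisson_summary(mu=5.0, theta=0.5)
```

For μ = 5 and θ = .5 the summation gives a mean close to μ and a variance close to μ/θ = 10, illustrating how θ < 1 induces overdispersion relative to the Poisson.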
2.2 Semiparametric Double-Exponential Regression Models
Efron (1986) considered regression models with a response distribution from a double-exponential family, such that both the mean parameter μ and the dispersion parameter θ are functions of the predictors. Let y₁, ..., yₙ denote n observed responses, and suppose that μᵢ and θᵢ are the location and dispersion parameters in the distribution of yᵢ. For appropriate link functions g(·) and h(·), we consider the model

    g(μᵢ) = β₀^μ + Σ_{j=1}^p x_{ij} β_j^μ + Σ_{j=1}^p f_j^μ(x_{ij}),   (4)

    h(θᵢ) = β₀^θ + Σ_{j=1}^p x_{ij} β_j^θ + Σ_{j=1}^p f_j^θ(x_{ij}).   (5)

We first discuss (4) for the mean. This equation has an overall intercept β₀^μ, with the effect of the jth covariate given by the linear term x_{ij}β_j^μ and the nonlinear term f_j^μ(x_{ij}), which is modeled flexibly using a cubic smoothing spline prior. Let x·j = (x_{ij}, i = 1, ..., n), for j = 1, ..., p. We standardize each of the covariate vectors x·j to have mean 0 and variance 1, which makes the x·j, j = 1, ..., p, orthogonal to the intercept term and comparable in size.
2.3 Prior Specification
We now describe the priors on the parameters for the model given by (2), (4), and (5). The hierarchical prior is specified in terms of indicator variables that allow selection of linear or flexible terms. The prior for β₀^μ is flat. Let β^μ = (β₁^μ, ..., β_p^μ)ᵀ. To allow the elements of β^μ to be in or out of the model, we define a vector of indicator variables J^μ = (J₁^μ, ..., J_p^μ) such that J_j^μ = 0 means that β_j^μ is identically 0 and J_j^μ = 1 otherwise. For given J^μ, let β_J^μ be the subvector of nonzero components of β^μ, that is, those components β_j^μ with J_j^μ = 1. We use the notation N(a, b) for the normal distribution with mean a and variance b, IG(a, b) for the inverse-gamma distribution with shape parameter a and scale parameter b, and U(a, b) for the uniform distribution on the interval [a, b]. With this notation, the prior on β_J^μ for a given value of J^μ is β_J^μ | J^μ ~ N(0, b^μ I), where b^μ ~ IG(s, t), s = 101, and t = 10,100. This choice of parameters produces an inverse-gamma prior with mean 101 and standard deviation 10.15, which works well across a range of examples, both parametric and nonparametric. But in general, the choice of s and t may depend on the scale and location of the dependent variable and is left to the user. For a continuous response, standardizing the dependent variable may be useful here. The issue of sensitivity to prior hyperparameters is addressed in Section 3.5. We also assume that Pr(J_j^μ = 1 | π^μ) = π^μ for j = 1, ..., p, and that the J_j^μ are independent given π^μ. The prior for π^μ is U(0, 1). We now specify the priors for the nonlinear terms f_j^μ, j = 1, ..., p. The discussion that follows assumes that each x·j is
rescaled to the interval [0, 1], so that the priors in the general case are obtained by transforming back to the original scale. We assume that the functions f_j^μ are a priori independent of one another and that, for any m abscissae z₁, ..., z_m, the vector (f_j^μ(z₁), ..., f_j^μ(z_m))ᵀ is normal with mean 0 and

    cov(f_j^μ(z), f_j^μ(z′)) = exp(c_j^μ) Ω(z, z′),   where   Ω(z, z′) = ½ z² (z′ − z/3)   for 0 ≤ z ≤ z′ ≤ 1   (6)

and Ω(z′, z) = Ω(z, z′). This prior on f_j^μ leads to a cubic smoothing spline for the posterior mean of f_j^μ (Wahba 1990, p. 16), with exp(c_j^μ) as the smoothing parameter. This prior is used extensively to flexibly model univariate functions and gives the joint prior distribution of the unknown function f_j^μ over all abscissa values. In particular, it allows the introduction of additional abscissae without changing the form of the prior. Other Gaussian process priors can be used instead of the cubic smoothing spline prior, and the results likely will be very similar. These include radial basis function-type priors, including thin-plate spline priors that readily generalize to multiple dimensions (see Ruppert, Wand, and Carroll 2003).
For j = 1, ..., p, let f_j^μ(x·j) = (f_j^μ(x_{1j}), ..., f_j^μ(x_{nj}))ᵀ and define the n × n matrix V_j^μ as having (i, k)th element Ω(x_{ij}, x_{kj}), so that cov(f_j^μ(x·j)) = exp(c_j^μ) V_j^μ. The matrix V_j^μ is positive definite and can be factored as V_j^μ = Q_j^μ D_j^μ Q_j^μᵀ, where Q_j^μ is an orthogonal matrix of eigenvectors and D_j^μ is a diagonal matrix of eigenvalues. Let W_j^μ = Q_j^μ (D_j^μ)^{1/2}; then f_j^μ(x·j) = W_j^μ a_j^μ, where a_j^μ ~ N(0, exp(c_j^μ) I).

To allow the term f_j^μ to be in or out of the model, we introduce the indicator variable K_j^μ, so that K_j^μ = 0 means that a_j^μ = 0, which is equivalent to f_j^μ = 0. Otherwise, K_j^μ = 1. We also force f_j^μ to be null if the corresponding linear term β_j^μ = 0; that is, if the linear term is 0, then we force the flexible term also to be 0. If J_j^μ = 1, then we assume that K_j^μ is 1 with probability π^{Kμ}, with the prior on π^{Kμ} uniform. When K_j^μ = 1, the prior for c_j^μ is N(a^{cμ}, b^{cμ}), where a^{cμ} ~ N(0, 100) and b^{cμ} ~ IG(s, t), with s and t as defined earlier.

As a practical matter, we order the eigenvalues in D_j^μ in decreasing order and set all but the largest m eigenvalues to 0, where m is chosen as the smallest number such that the sum of the m largest eigenvalues is at least a given fraction (close to 1) of the sum of all the eigenvalues. In our work m usually is quite small, around 3 or 4. By setting the smaller eigenvalues to 0, we set the corresponding elements of a_j^μ to 0, and thus it is necessary to work only with an a_j^μ that is low-dimensional. This achieves a parsimonious parameterization of f_j^μ while retaining its flexibility as a prior. Our approach is similar to the pseudospline approach of Hastie (1996) and has the advantage over other reduced spline basis approaches, such as those used by Eilers and Marx (1996), Yau et al. (2003), and Ruppert et al. (2003), of not requiring the choice of the number or location of knots.
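The low-rank construction just described — evaluate the kernel (6) at the design points, eigendecompose, and keep only the leading eigenvalues — can be sketched as follows. This is our own illustration, assuming NumPy; the retained fraction 0.999 is an illustrative choice, not a value taken from the article:

```python
import numpy as np

def omega(z, zp):
    """Cubic smoothing spline covariance kernel (6) on [0, 1]:
    Omega(z, z') = (1/2) z^2 (z' - z/3) for z <= z', symmetric otherwise."""
    lo, hi = np.minimum(z, zp), np.maximum(z, zp)
    return 0.5 * lo**2 * (hi - lo / 3.0)

# Design points of one covariate, rescaled to [0, 1].
x = np.linspace(0.0, 1.0, 50)
V = omega(x[:, None], x[None, :])   # n x n prior covariance of f at the design points

# Factor V = Q D Q^T; eigh returns ascending eigenvalues, so reverse.
d, Q = np.linalg.eigh(V)
d, Q = d[::-1], Q[:, ::-1]
d = np.clip(d, 0.0, None)           # guard against tiny negative round-off

# Keep the smallest m whose leading eigenvalues sum to a fraction
# close to 1 of the total (0.999 here is illustrative).
frac = np.cumsum(d) / np.sum(d)
m = int(np.searchsorted(frac, 0.999) + 1)

# Low-rank design: f = W a with W = Q_m D_m^{1/2} and a ~ N(0, exp(c) I).
W = Q[:, :m] * np.sqrt(d[:m])
```

Because the eigenvalues of the cubic spline kernel decay rapidly, m comes out very small, consistent with the values of 3 or 4 reported in the text.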
The interpretation of (5) for the variance is similar to that of the mean equation. Let β^θ = (β₁^θ, ..., β_p^θ)ᵀ and define the indicator variable J^θ = 0 if β^θ is identically 0, with J^θ = 1 otherwise; that is, in the variance equation (unlike the mean equation), all of the linear terms are either in or out of the model simultaneously, so we assume that there is linear overdispersion or underdispersion in all of the variables or none of the variables. It would not be difficult to perform selection on the linear terms for individual predictors in the variance model, but in many applications there may be no overdispersion, so that the null model in which all predictors are excluded from the variance model is inherently interesting, with inclusion of all linear terms with a shrinkage prior on coefficients as a reasonable alternative. Our prior parameterizes this comparison directly. When J^θ = 1, we take the prior β^θ ~ N(0, b^θ I) with b^θ ~ IG(s, t), where s and t are as defined earlier, and Pr(J^θ = 1) = .5. We usually use a log link in the dispersion model, h(θ) = log θ; in this case J^θ = 0 implies that all θ values are fixed at 1, corresponding to no overdispersion. In some of the examples that follow, we sometimes fix J^θ = 0, which means that our prior gives a strategy for generalized additive modeling with variable selection and the ability to choose between linear and flexible effects for additive terms.

The hierarchical prior for the nonlinear terms f_j^θ is similar to that for f_j^μ. We write f_j^θ(x·j) = W_j^θ a_j^θ with a_j^θ ~ N(0, exp(c_j^θ) I). The prior for c_j^θ is N(a^{cθ}, b^{cθ}) with a^{cθ} ~ N(0, 100) and b^{cθ} ~ IG(s, t), where s and t are as defined earlier, and K_j^θ is 1 with probability π^{Kθ}, with the prior on π^{Kθ} uniform. We allow the nonlinear terms to be identically zero by introducing the indicator variables K_j^θ, j = 1, ..., p, where K_j^θ = 1 means that f_j^θ is in the model and K_j^θ = 0 means that it is not. Similar to the linear case, we impose that K_j^θ = 0 for all j if J^θ = 0; that is, if J^θ = 0, then all of the nonlinear terms in the variance are 0.
Our framework provides an approach to variable selection and model averaging in GLMs and overdispersed GLMs by fixing K_j^μ = K_j^θ = 0, j = 1, ..., p, so that all of the terms enter the model parametrically. The example given in Section 3.1 illustrates our framework's ability to handle situations in which a simple parametric model is appropriate.
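The hierarchy of indicator variables for the mean model can be illustrated with a small prior-simulation sketch. This is our own code (the function name and the use of Python's random module are our own); the key constraint it encodes is that a flexible term can enter only when its linear term is in:

```python
import random

def draw_mean_indicators(p, rng):
    """One prior draw of the mean-model indicators.

    pi_lin ~ U(0, 1) governs the linear-term indicators J_j, and
    pi_flex ~ U(0, 1) governs the flexible-term indicators K_j,
    with K_j forced to 0 whenever J_j = 0 (no flexible term without
    the corresponding linear term)."""
    pi_lin = rng.random()
    pi_flex = rng.random()
    J = [1 if rng.random() < pi_lin else 0 for _ in range(p)]
    K = [1 if (J[j] == 1 and rng.random() < pi_flex) else 0 for j in range(p)]
    return J, K

J, K = draw_mean_indicators(p=8, rng=random.Random(1))
```

Setting every K_j to 0 in such a draw recovers the fully parametric GLM special case described above.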
We conclude this section by noting that there is a large-sample motivation for using Bayesian model selection and model averaging. We believe that consistency results will hold in our framework under appropriate regularity conditions, but it is beyond the scope of this article to prove this. Our variable selection prior defines a number of models based on the values of the indicator variables J_j^μ, K_j^μ, and J^θ. When all of the indicator variables take the value 1, we obtain the largest model. If the true model is one of the candidate models, then, under general conditions, we can usually show that Bayesian model selection is consistent in the sense that the posterior probability of the true model tends to 1 asymptotically. Furthermore, it also is possible to show that, asymptotically, the predictive density of a new observation under model averaging will tend to the predictive density under the true model. This means that both Bayesian model selection and model averaging will produce parsimonious models in large samples. (See O'Hagan and Forster 2004, chap. 7, particularly
secs. 7.25-7.27, and Berger and Pericchi 2001 for a discussion of Bayesian model selection and Cottet, Kohn, and Nott 2008 for a more complete discussion of the foregoing issues.)
In finite samples we can control for model parsimony by the choice of the priors for the indicator variables, which in turn determine the probabilities of the various implied models. Here the model size for the mean is determined by the data, because we take π^μ and π^{Kμ} as uniformly distributed. We also take .5 as the prior probability that the variance indicator variable J^θ takes the value 1. By choosing nonuniform distributions for π^μ and π^{Kμ}, as well as assuming the probability that J^θ = 1 to be other than .5, we can make informative prior assumptions on the model size (i.e., the number of parameters in the model) in finite samples.
3. EMPIRICAL RESULTS
The sampling scheme that we use to estimate the models was
described by Cottet et al. (2008).
3.1 Fully Parametric Regression
This section illustrates the use of our variable selection methodology in a parametric setting by fitting an overdispersed Poisson model to the pox lesions chick data (available at http://www.statsci.org/data/general/pocklesi.html). The example demonstrates that our methods can be applied with the flexible terms excluded for small data sets, for which it may be feasible to fit only a simple parametric model. There are 58 observations in this data set; such small data sets are common in applications of overdispersed GLMs.

The dependent variable is the lesion count produced on membranes of chick embryos by viruses of the pox group. The independent variable is the dilution level of the viral medium. These data were analyzed by Breslow (1990) and Podlich,
Faddy, and Smyth (2004). In our model, g(μᵢ) = log(μᵢ) and h(θᵢ) = log(θᵢ), where g and h are linear functions of the dilution level. Figure 1 plots the fit for the parameter μ and for log θ as a function of viral dilution. The posterior probability of overdispersion (i.e., the posterior probability of J^θ = 1) is approximately 1, strongly suggesting overdispersion, which is consistent with previous analyses reported by Breslow (1990) and Podlich et al. (2004). The results show that overdispersion increases with increasing viral dilution, whereas the lesion count decreases with dilution.

Figure 2. Lesions on chick embryos: Plot of the log-likelihood versus iteration and estimated autocorrelation based on 2,000 iterations with 2,000 burn-in.
Figure 2 plots the log-likelihood versus iteration number in our MCMC sampling scheme, as well as the autocorrelation function of the log-likelihood values based on 2,000 iterations after 2,000 burn-in iterations. The plots show that our sampling scheme converges rapidly and mixes well. Corresponding plots for our other examples (not shown) confirm the excellent properties of our sampling scheme. The 4,000 iterations of our sampler took 280 seconds on a machine with a 2.8-GHz processor. For all of the examples considered in this article, programs implementing our sampler were written in Fortran 90.
Figure 1. Lesions on chick embryos: Plot of the estimated posterior means of mean and variance parameters as a function of viral dilution, together with pointwise 95% credible intervals. (a) Estimated values of the parameter μ along with the data. (b) Estimated values of log θ.
3.2 Binary Logistic Regression
In this section we fit a main-effects binary logistic regres sion to the Pima Indian diabetes data set obtained from the UCI
repository of machine learning databases (http://archive.ics.uci. edu/mU). A population of women at least 21 years old, of Pima Indian heritage, and living near Phoenix, Arizona was tested for diabetes according to World Health Organization (WHO) crite ria. The data were collected by the U.S. National Institute of Diabetes and Digestive and Kidney Diseases. We follow Yau et al. (2003) and use the 724 complete records after dropping the aberrant cases. The dependent variable is diabetic or not ac
cording to the WHO criteria, with a positive test coded as "1." There are eight covariates: number of pregnancies, plasma glu cose concentration in an oral glucose tolerance test, diastolic
blood pressure (in mm Hg), triceps skinfold thickness (in mm), 2-hour serum insulin (in mu U/mL), body mass index (weight in kg/(height in m)2), diabetes pedigree function, and age in
years. In the notation of Section 2, we fix all ft's at 1 and so fit
a generalized linear additive model with g(?i) = log(/x;/(l ?
/x,-)), which allows for variable selection and choice between flexible and linear effects for the additive terms. The results are
given in Figure 3, with the barplot showing that the posterior probabilities of effects for each predictor are null, linear, and flexible. The barplot suggests that the number of pregnancies, diastolic blood pressure, triceps skinfold thickness, and 2-hour serum insulin do not seem to help predict the occurrence of diabetes when the other covariates are included in the model.
Figure 3 also shows that plasma glucose concentration has a
Figure 3. Logistic diabetes data: Plots of the posterior means of the covariate effects at the design points and 95% credible intervals. The panels show number of times pregnant, glucose concentration, diastolic blood pressure, triceps skin thickness, 2-hour serum insulin, body mass index, diabetes pedigree function, and age. The barplot gives the posterior probability of each covariate function being null (white), linear (gray), and flexible (black).
Table 1. Simulated diabetes data with interaction: Posterior probabilities of null, linear, and flexible main effects and flexible interaction effects

Covariate     1     2     3     4     5     6     7     8
Null        .88     0   .86   .81   .76   .03     0     0
Linear      .08   .58   .08   .05   .11   .53     0     0
1             0     0     0     0     0   .01   .03   .02
2                 .42   .01   .03   .05   .13   .24   .21
3                       .06   .02   .02     0   .02   .03
4                             .14   .04   .03   .07   .07
5                                   .13   .03   .06   .06
6                                         .44   .02   .23
7                                              1.00  1.00
8                                                    1.00

NOTE: Row j of the lower block gives, on the diagonal, the posterior probability of a flexible main effect for covariate j and, off the diagonal, the posterior probability of a flexible interaction between covariates j and k. For example, for covariate 4 the posterior probabilities of a null, linear, and flexible main effect are .81, .05, and .14, and the posterior probability of a flexible interaction between covariates 3 and 4 is .02. Other entries in the table are interpreted similarly.
where π^I is distributed uniformly. The generation of the interaction-effects parameters (a_{jk}^I, c_{jk}^I, K_{jk}^I) is similar to the generation of the other parameters in the model. First, the indicator variable is generated from the prior p(K_{jk}^I | K_j^M, K_k^M, π^I). If K_{jk}^I = 1, then a_{jk}^I and c_{jk}^I are generated as described in the appendix of Cottet et al. (2008) for the generation of the other parameters; otherwise, a_{jk}^I is set to 0.
No interactions were detected when the interaction model was fitted to the data. To test the effectiveness of the methodology at detecting interactions, we also generated observations from the estimated main-effects model, but added an interaction between diabetes pedigree function and age. Writing x and z for these two predictors, the interaction term added to our fitted additive model for log{μᵢ/(1 − μᵢ)} in the simulation takes the simple multiplicative form xz. Table 1 reports the results of estimation when the interaction model is fitted to the artificial data, showing that the interaction effect between variables 7 and 8 is detected.
3.4 Double-Binomial Model
This example considers a data set given by Moore and Tsiatis (1991) and analyzed by Aerts and Claeskens (1997) using a local beta-binomial model. An iron supplement was given to 58 female rats at various dose levels. The rats were impregnated and then sacrificed after 3 weeks. The litter size and the number of dead fetuses, as well as the hemoglobin levels of the mothers, were recorded. We fitted a double-binomial model to the data to try to explain the proportion of dead fetuses with the mother's hemoglobin level and litter size as covariates.

Figure 4 summarizes the estimation results and shows the presence of overdispersion. As usual when dealing with binomial-like data, the count response is rescaled to be a proportion, so the parameter μ lies in [0, 1]. We use a logistic link for μ and a log link for θ. The results suggest no effect for sample size in the mean model, with some support for either linear or flexible effects for hemoglobin in the mean and variance models and for sample size in the variance model.
As in the diabetes examples, we quantify the usefulness of
approximation 2 of Cottet et al. (2008) for constructing the
strong positive linear effect and that body mass index, diabetes pedigree function, and age have nonlinear effects.

Our method extends the approach of Yau et al. (2003) to any GAM, whereas Yau et al. (2003) relied on the probit link to turn a binary regression into a regression with Gaussian errors. Our approach has several other advantages over that of Yau et al. (2003), as explained in Sections 1 and 3.3. We also estimated the model using data-based priors similar to those of Yau et al. (2003) and obtained similar results.

To quantify the usefulness of approximation 2 of Cottet et al. (2008, the appendix) for constructing the Metropolis-Hastings proposal, we report the acceptance rates at step 4 of their sampling scheme for each predictor for which there is posterior probability > .1 of inclusion of a flexible term. The acceptance rates are 19%, 11%, and 38%. The results for this example were based on 4,000 iterations of the sampling scheme with 4,000 burn-in iterations. Running this sampling scheme took 1,300 seconds for 4,000 iterations.
3.3 High-Dimensional Binary Logistic Regression
This section extends the model for the Pima Indian data set to allow for flexible second-order interactions. This means that the model potentially has 36 flexible terms, with 8 main effects and 28 interactions. The purpose of this section is to show how our class of models can handle interactions and to demonstrate that the hierarchical priors allow variable selection with a large number of terms. This is infeasible with the data-based prior approach of Yau et al. (2003), as explained in Section 1. We generalize the mean model (4) as
g(μ_i) = β_0 + \sum_{j=1}^{p} x_{ij} β_j + \sum_{j=1}^{p} f_j^M(x_{ij}) + \sum_{j=1}^{p} \sum_{k=j+1}^{p} f_{jk}^I(x_{ij}, x_{ik}).

Here the superscript μ is dropped from β_j and f_j, because we are dealing with the mean equation only. We write the flexible main effects and interactions as f_j^M and f_{jk}^I, where M represents a main effect and I represents an interaction. The prior for the f_j^M is the same as that for the flexible main effects in Section 2. For the interaction effects, we assume that any collection {f_{jk}^I(x_i, z_i), i = 1, ..., n} is Gaussian with mean 0 and

cov(f_{jk}^I(x, z), f_{jk}^I(x', z')) = exp(c_{jk}^I) Ω(x, x') Ω(z, z'),

where Ω(z, z') is defined by (6). This gives a covariance kernel for the f_{jk}^I that is the tensor product of univariate covariance kernels (Gu 2002, sec. 2.4). Once the covariance matrix for (f_{jk}^I(x_{ij}, x_{ik}), i = 1, ..., n) is constructed, we factor it to get a parsimonious representation as in Section 2. The smoothing parameters c_{jk}^I have a prior similar to that of the c_j^μ in Section 2.
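The tensor-product construction and factorization step can be sketched in code. This is an illustrative sketch only: a squared-exponential kernel stands in for the paper's univariate kernel Ω of equation (6), which is not reproduced here, and the function names and the `tol` truncation threshold are our own.

```python
import numpy as np

def omega(u):
    """Univariate covariance kernel evaluated at all pairs of observed
    points. The paper's actual kernel is its equation (6); a
    squared-exponential is used here purely as a stand-in."""
    d = u[:, None] - u[None, :]
    return np.exp(-0.5 * d ** 2)

def interaction_cov(x, z, c_jk):
    """Tensor-product covariance of the interaction f_jk at the n
    observed (x_i, z_i) pairs: the elementwise product of the two
    univariate kernel matrices, scaled by exp(c_jk)."""
    return np.exp(c_jk) * omega(x) * omega(z)

def parsimonious_factor(K, tol=1e-8):
    """Factor K as B @ B.T, keeping only eigenvalues above a relative
    threshold; the columns of B give a low-rank basis for the term."""
    vals, vecs = np.linalg.eigh(K)
    keep = vals > tol * vals.max()
    return vecs[:, keep] * np.sqrt(vals[keep])
```

Because the Schur (elementwise) product of two covariance matrices is again a covariance matrix, `interaction_cov` is positive semidefinite and the eigendecomposition yields a valid low-rank factor.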
To allow for variable selection of the flexible main effects, let K_j^M be indicator variables such that K_j^M = 0 if f_j^M is null and K_j^M = 1 otherwise. The prior for K_j^M is the same as that for the K_j^μ in Section 2. To allow variable selection on the flexible interaction terms, let K_{jk}^I be an indicator variable, which is 0 if f_{jk}^I is null and 1 otherwise. To make the bivariate interactions interpretable, we allow a flexible interaction between the jth and kth variables only if both flexible main effects are in; that is, if K_j^M = 0, K_k^M = 0, or both, then K_{jk}^I = 0. If both K_j^M and K_k^M are 1, then

p(K_{jk}^I = 1 | K_j^M, K_k^M, π^I) = π^I.
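This heredity constraint on the interaction indicators amounts to a small sampling rule; a sketch, with `pi_I` standing for the interaction inclusion probability π^I (the function name is ours):

```python
import random

def draw_interaction_indicator(K_j, K_k, pi_I, rng=random):
    """K_jk is forced to 0 unless both main-effect indicators K_j and
    K_k equal 1, in which case it is Bernoulli(pi_I)."""
    if K_j == 1 and K_k == 1:
        return int(rng.random() < pi_I)
    return 0
```

For example, `draw_interaction_indicator(1, 0, 0.5)` is always 0, because the second main effect is excluded.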
This content downloaded from 194.29.185.230 on Mon, 16 Jun 2014 10:34:56 AMAll use subject to JSTOR Terms and Conditions
Cottet, Kohn, and Nott: Overdispersed Generalized Linear Models 667
Figure 4. Double-binomial rat data. Left column: posterior means of the effects in the mean model, together with 95% credible intervals. Right column: effects for the dispersion component. The barplot shows the posterior probability of each effect being null (white), linear (gray), and flexible (black).
We quantify the usefulness of approximation 2 for constructing the Metropolis-Hastings proposal by reporting the acceptance rates at steps 4 and 8 of the sampling scheme (the appendix of Cottet et al. 2008) for each predictor where there is posterior probability >.1 of inclusion of a flexible term. The acceptance rates for the mean model are 2.5% for hemoglobin and 2.76% for litter size. No flexible effect is selected in the variance. Although the acceptance rates are quite low, our proposals are still good enough to obtain reasonable mixing. The results for this example are based on 5,000 iterations of our sampling scheme with 5,000 burn-in iterations; the 5,000 iterations took 4,039 seconds.

We also compared an implementation of our methodology using a beta-binomial response distribution to a flexible beta-binomial regression implemented in the GAMLSS library in R (Rigby and Stasinopoulos 2005). Implementation of our method for the beta-binomial family rather than the double-exponential family is straightforward, because our computational scheme makes no particular use of the double-exponential family assumption, but considers only the concept of mean and variance parameters being modeled flexibly as functions of covariates. For beta-binomial regression, Rigby and Stasinopoulos (2005) parameterized the model in terms of a mean parameter μ and a dispersion parameter σ, which equals ρ/(1 − ρ), where ρ is the intracluster correlation. (If we consider each count observation as an observation of a sequence of exchangeable binary random variables, then the intracluster correlation is just the correlation between a pair of these binary random variables.) Large values of σ correspond to overdispersion, whereas σ = 0 corresponds to no overdispersion. Our model is similar to the foregoing, except that the model for h(θ_i) in (5) is replaced with a model of the same form for h(σ_i), where σ_i is the dispersion parameter for observation i and h(·) is a link function, which we choose as the log function. Figures 5 and 6 present the results of our fit and the GAMLSS fit (with all terms flexible) for the rat data, showing that the fits are similar.
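The role of the dispersion parameter σ can be illustrated numerically. The sketch below assumes the GAMLSS beta-binomial variance formula Var(Y) = nμ(1 − μ){1 + σ(n − 1)/(1 + σ)}; the function names are our own.

```python
def sigma_from_rho(rho):
    """Dispersion parameter sigma = rho / (1 - rho), where rho is the
    intracluster correlation between two of the binary components."""
    return rho / (1 - rho)

def beta_binomial_var(n, mu, sigma):
    """Variance of a beta-binomial count with n trials and mean
    probability mu under the assumed GAMLSS parameterization;
    sigma = 0 recovers the binomial variance n*mu*(1-mu)."""
    return n * mu * (1 - mu) * (1 + sigma * (n - 1) / (1 + sigma))
```

As σ grows, the variance inflation factor rises from 1 (binomial) toward n, so large σ corresponds to strong overdispersion.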
668 Journal of the American Statistical Association, June 2008
Figure 5. Plots of effects for hemoglobin in the mean and dispersion models [(a) and (d)] and for sample size [(b) and (e)], together with pointwise 95% credible intervals, for the rat data. The barplots in (c) and (f) show the probabilities for null (white), linear (gray), and flexible (black) effects. A Bayesian beta-binomial model is used.
One advantage of our approach is greater computational stability, a feature that we believe is related to our shrinkage priors. We simulated several data sets from our fitted model for the mean but assuming no overdispersion (σ = 0) and then attempted to fit these simulated data using GAMLSS and our Bayesian approach with a beta-binomial model. The Bayesian approach produced satisfactory results, but attempting to fit the model in GAMLSS, even with only an intercept and no covariates in the variance model, resulted in convergence problems that cannot be easily resolved (D. M. Stasinopoulos, personal communication). But the GAMLSS fit is faster, and we have found the GAMLSS package very useful in the exploratory examination of many potential models.
3.5 Simulation Studies
Here we discuss three simulation studies that illustrate the effectiveness of our methodology for detecting overdispersion when it exists and for distinguishing among null, linear, and flexible effects. We also report the gain in performance that results from using hierarchical variable selection priors instead of a similar hierarchical prior in which variable selection is not done.

Figure 6. Plots of effects for hemoglobin in the mean model (a) and dispersion model (b), together with pointwise 95% credible intervals, for the rat data, with the fit obtained from the GAMLSS package.

Table 2. Rats data simulated from the fitted model: the 25th and 75th percentiles of the probabilities of flexible, linear, and null effects for the mean and variance components, for the two covariates hemoglobin (Hem) and sample size (SS)

                         Mean            Var
Effect     Percentile   Hem    SS      Hem    SS
Flexible   25th         .42    .15     .38    .38
           75th         .47    .27     .45    .45
Linear     25th         .52    .19     .55    .55
           75th         .58    .30     .61    .61
Null       25th         0      .44     0      0
           75th         0      .64     0      0

We generated 50 replicate data sets from the overdispersed model that we fitted to the rats data. Table 2 gives the 25th and 75th percentiles of the probabilities of null, linear, and flexible effects for the two predictors in the mean and dispersion models over the 50 replications. The results are consistent with our fit to the original data, with appreciable probabilities of linear and flexible effects for hemoglobin in the mean model and for sample size and hemoglobin level in the variance model, and an appreciable probability of no effect for sample size in the mean model.
Tables 3 and 4 give the results of the simulation study for our method with the hyperparameters s and t in the inverse-gamma priors of Section 2.2 set to (s, t) = (6, 500) and (s, t) = (27, 1,300) (giving prior means of 100 and 50 and prior standard deviations of 50 and 10). The tables show that the results of our approach are not particularly sensitive to the choices of s and t.
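These prior moments follow from the inverse-gamma mean t/(s − 1) and standard deviation t/{(s − 1)√(s − 2)}; a quick numerical check (the function name is ours):

```python
import math

def invgamma_mean_sd(s, t):
    """Mean and standard deviation of an inverse-gamma distribution
    with shape s and scale t (defined for s > 2)."""
    mean = t / (s - 1)
    sd = t / ((s - 1) * math.sqrt(s - 2))
    return mean, sd

print(invgamma_mean_sd(6, 500))    # -> (100.0, 50.0)
print(invgamma_mean_sd(27, 1300))  # -> (50.0, 10.0)
```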
Table 5 is similar to Table 2, but for 50 replicate data sets simulated from a fitted binomial model, that is, with no overdispersion. The probability of a null effect for both covariates in the variance model is near 1, and again there is a high probability of a null effect for sample size in the mean model and appreciable probabilities of linear and flexible effects for hemoglobin in the mean model. We also examined the performance of our approach for this case with the hyperparameters set to (s, t) = (6, 500) and (s, t) = (27, 1,300) and found that the results were relatively insensitive to these settings.

Table 3. Rats data simulated from the fitted model: the 25th and 75th percentiles of the probabilities of flexible, linear, and null effects for the mean and variance components, for the two covariates hemoglobin (Hem) and sample size (SS)

                         Mean            Var
Effect     Percentile   Hem    SS      Hem    SS
Flexible   25th         .10    .04     .11    .10
           75th         .13    .06     .13    .13
Linear     25th         .86    .27     .87    .88
           75th         .89    .33     .89    .90
Null       25th         0      .60     0      0
           75th         .01    .69     0      0

NOTE: The hyperparameter settings are (s, t) = (27, 1,300).

Table 4. Rats data simulated from the fitted model: the 25th and 75th percentiles of the probabilities of flexible, linear, and null effects for the mean and variance components, for the two covariates hemoglobin (Hem) and sample size (SS)

                         Mean            Var
Effect     Percentile   Hem    SS      Hem    SS
Flexible   25th         .09    .03     .11    .10
           75th         .12    .05     .14    .13
Linear     25th         .86    .23     .86    .87
           75th         .90    .28     .89    .90
Null       25th         0      .66     0      0
           75th         .02    .73     0      0

NOTE: The hyperparameter settings are (s, t) = (6, 500).
Table 6 shows the probabilities of null, linear, and flexible effects for the 8 covariates in the diabetes example for 50 simulated replicate data sets from an additive model fitted to the real data. The results are again consistent with our fit to the full model, with high probabilities of a null effect for covariates 1, 3, 4, and 5 (i.e., number of pregnancies, diastolic blood pressure, triceps skinfold thickness, and 2-hour serum insulin), an appreciable probability of a linear effect for covariate 2 (i.e., plasma glucose concentration), and high probabilities of nonlinear effects for covariates 6, 7, and 8 (i.e., body mass index, diabetes pedigree function, and age). We also studied the performance of our approach for the simulated diabetes data with hyperparameter settings (s, t) = (6, 500) and (s, t) = (27, 1,300) and found that the results were not particularly sensitive to these settings.
Table 5. Rats data simulated from a fitted binomial model with no overdispersion: the 25th and 75th percentiles of the probabilities of flexible, linear, and null effects for the mean and variance components, for the two covariates hemoglobin (Hem) and sample size (SS)

                         Mean            Var
Effect     Percentile   Hem    SS      Hem    SS
Flexible   25th         .40    .08     0      0
           75th         .43    .22     0      0
Linear     25th         .57    .12     0      0
           75th         .60    .36     0      0
Null       25th         0      .43     .99    .99
           75th         0      .79     .99    .99

Table 6. Simulated data from the fitted diabetes model: the 25th and 75th percentiles of the posterior probabilities of flexible, linear, and null effects for the eight covariates

                               Covariate
Effect     Percentile   1     2     3     4     5     6     7     8
Flexible   25th        .03   .55   .06   .05   .04   .62   .35   .99
           75th        .06   .61   .11   .11   .12   .97   .54  1.00
Linear     25th        .03   .39   .04   .03   .04   .03   .19   0
           75th        .05   .45   .08   .09   .08   .38   .40   .01
Null       25th        .88   0     .80   .79   .82   0     .04   0
           75th        .94   0     .90   .91   .92   0     .46   0

We now compare the performance of our hierarchical variable selection priors with the same prior but with all terms flexible (i.e., no variable selection is carried out). Our measure of performance is the Kullback-Leibler divergence, averaged over
the observed covariates. In estimating the true response distribution p_0(y|x) using an estimate p(y|x), where x denotes the covariates, the Kullback-Leibler divergence is defined as

KL(p(·|x), p_0(·|x)) = \int p_0(y|x) \log\{ p_0(y|x) / p(y|x) \} \, dy.

We define the average Kullback-Leibler divergence as

AKLD(p, p_0) = \frac{1}{n} \sum_{i=1}^{n} KL(p(·|x_i), p_0(·|x_i)),

where x_i, i = 1, ..., n, denote the observed predictors. Writing p^V(y|x) for the estimated predictive density at x for the variable selection prior and p^NV(y|x) for the estimated predictive density at x for the prior without variable selection, we define the average percentage increase in Kullback-Leibler loss for variable selection compared with no variable selection as

APKL = \frac{AKLD(p^NV, p_0) − AKLD(p^V, p_0)}{AKLD(p^V, p_0)} \times 100.
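For a discrete response, these two summaries can be computed directly. The sketch below is illustrative only; the array layout and function names are our own (each row of an input holds a predictive distribution over the support of y at one observed x_i).

```python
import numpy as np

def akld(p_hat, p_true):
    """Average Kullback-Leibler divergence (1/n) * sum_i KL(p_hat(.|x_i),
    p_true(.|x_i)) for a discrete response; rows index the n covariate
    points, columns the support of y."""
    with np.errstate(divide="ignore", invalid="ignore"):
        terms = np.where(p_true > 0, p_true * np.log(p_true / p_hat), 0.0)
    return terms.sum(axis=1).mean()

def apkl(p_v, p_nv, p_true):
    """Average percentage increase in KL loss of the no-selection fit
    p_nv over the variable-selection fit p_v; positive values favor
    variable selection."""
    return (akld(p_nv, p_true) - akld(p_v, p_true)) / akld(p_v, p_true) * 100
```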
When APKL is positive, the prior that allows for variable selection outperforms the prior that does not. Table 7 gives the 10th, 25th, 50th, 75th, and 90th percentiles of APKL for the 50 replicate data sets generated in our simulation study for the diabetes data, the rats data when no overdispersion is present, and the rats data when overdispersion is present. The table shows a positive median APKL for all three cases, indicating an improvement from using our hierarchical variable selection prior compared with no variable selection. Furthermore, for the rats data with no overdispersion, even the 10th percentile exceeds 28%.
Table 7. The 10th, 25th, 50th, 75th, and 90th percentiles of the percentage increase in Kullback-Leibler divergence with no variable selection relative to variable selection

                                        Percentile
Data set                        10th     25th    50th     75th     90th
Diabetes                      -14.49    -1.17   16.74    40.85    64.14
Rats with overdispersion      -13.92    -5.18    9.08    30.48    62.20
Rats with no overdispersion    28.67    67.55  155.52   293.43   734.54
4. CONCLUSION
In this article we have developed a general Bayesian framework for variable selection and model averaging in GLMs that allows for overdispersion and underdispersion. The priors and sampling scheme are innovative, and the flexibility of the approach has been demonstrated using examples ranging from fully parametric to fully nonparametric.

There are a number of natural extensions to the work described here. Although we have implemented our approach to flexible regression for the mean and variance using the double-exponential family of distributions, it would be easy to implement a similar approach using other distributional families for overdispersed count data, such as the beta-binomial and negative binomial. We have demonstrated the use of the beta-binomial in one of our real data examples. Flexible modeling of multivariate data can also be easily accommodated in our framework by incorporating other kinds of random effects apart from those involved in the nonparametric functional forms.
[Received August 2005. Revised February 2008.]
REFERENCES
Aerts, M., and Claeskens, G. (1997), "Local Polynomial Estimation in Multiparameter Models," Journal of the American Statistical Association, 92, 1536-1545.
Berger, J. O., and Pericchi, L. R. (2001), "Objective Bayesian Methods for Model Selection: Introduction and Comparison," in Model Selection, ed. P. Lahiri, Beachwood, OH: IMS, pp. 135-207.
Breslow, N. (1990), "Further Studies in the Variability of Pock Counts," Statistics in Medicine, 9, 615-626.
Breslow, N., and Clayton, D. (1993), "Approximate Inference in Generalized Linear Mixed Models," Journal of the American Statistical Association, 88, 9-25.
Brezger, A., and Lang, S. (2005), "Generalized Additive Structured Regression Based on Bayesian P-Splines," Computational Statistics and Data Analysis, 50, 967-991.
Cottet, R., Kohn, R., and Nott, D. (2008), "Variable Selection and Model Averaging in Semiparametric Overdispersed Generalized Linear Models: The Extended Version," available at http://www.amstat.org/publications/jasa/supplementaljnaterials.
Davidian, M., and Carroll, R. (1988), "A Note on Extended Quasi-Likelihood," Journal of the Royal Statistical Society, Ser. B, 50, 74-82.
Efron, B. (1986), "Double-Exponential Families and Their Use in Generalised Linear Regression," Journal of the American Statistical Association, 81, 709-721.
Eilers, P. H. C., and Marx, B. D. (1996), "Flexible Smoothing With B-Splines and Penalties" (with rejoinder), Statistical Science, 11, 89-121.
Faddy, M. (1997), "Extended Poisson Process Modelling and Analysis of Count
Data," Biometrical Journal, 39, 431-440.
Gu, C. (2002), Smoothing Spline ANOVA Models, New York: Springer-Verlag.
Hastie, T. (1996), "Pseudosplines," Journal of the Royal Statistical Society, Ser. B, 58, 379-396.
Hastie, T., and Tibshirani, R. (1990), Generalized Additive Models, New York:
Chapman & Hall.
Jorgensen, B. (1997), The Theory of Dispersion Models, London: Chapman & Hall.
Lee, Y., and Nelder, J. (1996), "Hierarchical Generalized Linear Models" (with discussion), Journal of the Royal Statistical Society, Ser. B, 58, 619-678.
Lin, X., and Zhang, D. (1999), "Inference in Generalized Additive Mixed Models by Using Smoothing Splines," Journal of the Royal Statistical Society, Ser. B, 61, 381-400.
McCullagh, P., and Nelder, J. (1989), Generalized Linear Models (2nd ed.), London: Chapman & Hall.
Moore, D., and Tsiatis, A. (1991), "Robust Estimation of the Variance in Moment Methods for Extra-Binomial and Extra-Poisson Variation," Biometrics, 47, 383-401.
Nelder, J., and Pregibon, D. (1987), "An Extended Quasi-Likelihood Function," Biometrika, 74, 221-232.
Nott, D. (2006), "Semiparametric Estimation of Mean and Variance Functions for Non-Gaussian Data," Computational Statistics, 21, 603-620.
O'Hagan, A., and Forster, J. (2004), Kendall's Advanced Theory of Statistics, Volume 2B: Bayesian Inference (2nd ed.), Oxford, U.K.: Oxford University Press.
Podlich, H., Faddy, M., and Smyth, G. (2004), "Semi-Parametric Extended Poisson Process Models," Statistics and Computing, 14, 311-321.
Rigby, R., and Stasinopoulos, D. (2005), "Generalized Additive Models for Location, Scale and Shape," Applied Statistics, 54, 1-38.
Ruppert, D., Wand, M., and Carroll, R. (2003), Semiparametric Regression, Cambridge: Cambridge University Press.
Shively, T., Kohn, R., and Wood, S. (1999), "Variable Selection and Function Estimation in Additive Nonparametric Regression Using a Data-Based Prior" (with discussion), Journal of the American Statistical Association, 94, 777-807.
Smith, M., and Kohn, R. (1996), "Nonparametric Regression Using Bayesian Variable Selection," Journal of Econometrics, 75, 317-344.
Smyth, G. (1989), "Generalized Linear Models With Varying Dispersion," Journal of the Royal Statistical Society, Ser. B, 51, 47-60.
Wahba, G. (1990), Spline Models for Observational Data, Philadelphia: SIAM.
Wild, C., and Yee, T. (1996), "Additive Extensions to Generalized Estimating Equation Methods," Journal of the Royal Statistical Society, Ser. B, 58, 711-725.
Yau, P., Kohn, R., and Wood, S. (2003), "Bayesian Variable Selection and Model Averaging in High-Dimensional Multinomial Nonparametric Regres sion," Journal of Computational and Graphical Statistics, 12, 23-54.
Yee, T., and Wild, C. (1996), "Vector-Generalized Additive Models," Journal
of the Royal Statistical Society, Ser. B, 58, 481-493.