Variable Selection and Model Averaging in Semiparametric Overdispersed Generalized Linear Models
Author(s): Remy Cottet, Robert J. Kohn, and David J. Nott
Source: Journal of the American Statistical Association, Vol. 103, No. 482 (Jun. 2008), pp. 661-671
Published by: American Statistical Association
Stable URL: http://www.jstor.org/stable/27640088
Accessed: 16/06/2014 10:34
Variable Selection and Model Averaging in
Semiparametric Overdispersed Generalized Linear Models
Remy Cottet, Robert J. Kohn, and David J. Nott
We express the mean and variance terms in a double-exponential regression model as additive functions of the predictors and use Bayesian variable selection to determine which predictors enter the model and whether they enter linearly or flexibly. When the variance term is null, we obtain a generalized additive model, which becomes a generalized linear model if the predictors enter the mean linearly. The model is
estimated using Markov chain Monte Carlo simulation, and the methodology is illustrated using real and simulated data sets.
KEY WORDS: Bayesian analysis; Double-exponential family; Hierarchical prior; Markov chain Monte Carlo.
1. INTRODUCTION
Correctly modeling the response variance in regression is important for efficient estimation of mean parameters, correct inference, and understanding the sources of variability in the response. Generalized linear models (GLMs) have traditionally been used to model non-Gaussian regression data (e.g., McCullagh and Nelder 1989), where the response has a distribution from the exponential family and a transformation of the mean response is a linear function of predictors. This framework was extended to generalized additive models (GAMs) by Hastie and Tibshirani (1990), where a transformation of the mean is modeled as a flexible additive function of the predictors. But sometimes the restriction to the exponential family in GLMs and GAMs is insufficiently general, because the variance of these distributions is a function of the mean and the data often exhibit greater variability than is implied by such mean-variance relationships, a phenomenon known as overdispersion. Underdispersion, where the data exhibit less variability than expected, also can occur, although less frequently.
We model the mean and variance terms using the class of double-exponential regression models introduced by Efron (1986). This class of models has the parsimony and interpretability of GLMs and is able to account for both overdispersion and underdispersion. The most general model that we consider describes the mean and dispersion terms, after transformation by link functions, as flexible additive functions of the predictors. One drawback of such a model is that it may contain many additive terms and parameters even when there is a moderate number of predictors. The main contribution of our article is to present a Bayesian variable selection approach that allows the data to determine whether or not to include predictors, whether to model the effect of predictors flexibly or linearly, and whether overdispersion or underdispersion is present. If the dispersion term is null, then we obtain a generalized additive model, which becomes a generalized linear model
Remy Cottet is Lecturer, University of Sydney, Faculty of Business and Economics, Sydney, New South Wales 2006, Australia (E-mail: [email protected]). Robert J. Kohn is Professor, School of Economics, Australian School of Business, University of New South Wales, Sydney, New South Wales 2052, Australia (E-mail: [email protected]). David J. Nott is Associate Professor, Department of Statistics and Applied Probability, Faculty of Science, National University of Singapore, Singapore 117546 (E-mail: [email protected]). This work was supported by Australian Research Council grant DP0667069. The authors thank Dr. Mikis Stasinopoulos for his quick and helpful response to some questions about the GAMLSS package. They also thank a referee, an associate editor, and the joint editors for extensive comments that improved the content and exposition of the article.
if the flexible terms drop out of the mean. An important benefit of our work on variable selection and model averaging is that it leads to more interpretable models and produces more efficient model estimates when there are redundant covariates or parameters. To estimate the model, we develop an efficient Markov chain Monte Carlo (MCMC) sampling scheme.

To the best of our knowledge, alternative approaches to flexibly model the mean and variance functions do not address similar issues of model selection in a systematic way that is practically feasible when there are many predictors. We now review some of the literature related to this article.
The extended quasi-likelihood approach of Nelder and Pregibon (1987) allows modeling of the overdispersion as a function of the covariates, but in general extended quasi-likelihood estimators may not be consistent (Davidian and Carroll 1988). Yee and Wild (1996) estimated additive models using generalized estimating equations. Inference about mean and variance functions using estimating equations has the drawback that there is no fully specified model, making it difficult to deal with characteristics of the predictive distribution for a future response other than its mean and variance. Model-based approaches to overdispersion include exponential dispersion models and related approaches (Jorgensen 1997; Smyth 1989), the extended Poisson process models of Faddy (1997), and mixture models, such as the beta-binomial, negative binomial, and generalized linear mixed models (Breslow and Clayton 1993; Lee and Nelder 1996). One drawback of mixture models is that they cannot model underdispersion. Generalized additive mixed models incorporating random effects in GAMs were considered by Lin and Zhang (1999). Both Yee and Wild (1996) and Rigby and Stasinopoulos (2005) considered very general frameworks for additive modeling and algorithms for estimating the additive terms. However, there is clearly scope for further research on inference, and Rigby and Stasinopoulos (2005) suggested that one use for their methods is as an exploratory tool for a subsequent fully Bayesian analysis of the kind that we consider here. Other recent work on Bayesian GAMs has been done by Brezger and Lang (2005) and Smith and Kohn (1996).

Nott (2006) considered Bayesian nonparametric estimation of a double-exponential family model but did not consider variable selection, and his priors for the unknown functions and smoothing parameters are very different from those used in our
© 2008 American Statistical Association
Journal of the American Statistical Association, June 2008, Vol. 103, No. 482, Theory and Methods
DOI 10.1198/016214508000000346
article. Our article refines and generalizes the work of Shively, Kohn, and Wood (1999) and Yau, Kohn, and Wood (2003) on nonparametric regression in binary and multinomial probit regression models, where a data-based prior was used to carry out variable selection. A serious drawback of this prior is that it is necessary to first estimate the model with all flexible terms included, even though the actual fitted model may require only a much smaller number of such terms. This makes the approach impractical when there is a moderate to large number of terms in the full model, because the Markov chain simulation breaks down. Yau et al. (2003) discussed this problem and gave some strategies to overcome it. Our proposed hierarchical prior overcomes this problem and also is computationally more efficient than the data-based prior approach of Yau et al. (2003), because it requires one simulation run through the data, whereas the data-based approach requires two runs, the first to obtain the data-based prior and the second to estimate the parameters of the model.
2. MODEL AND PRIOR DISTRIBUTIONS
2.1 The Double-Exponential Family
Write the density of a random variable y from a one-parameter exponential family as

    p(y; ψ, φ/A) = exp{ [yψ − b(ψ)] / (φ/A) + c(y, φ/A) },   (1)

where ψ is a location parameter, φ/A is a known scale parameter, and b(·) and c(·,·) are known functions. The mean of y is μ = b′(ψ), and its variance is (φ/A)b″(ψ). This means that ψ = ψ(μ) is a function of μ, as is the variance. Examples of densities that can be written in this form are the Gaussian, binomial, Poisson, and gamma (McCullagh and Nelder 1989). In (1) we write the scale parameter as φ/A in anticipation of later discussion of regression models, where φ is common to all responses but A may vary between responses.
A double-exponential family is defined from a corresponding one-parameter exponential family by

    p(y; μ, θ, φ/A) = Z(μ, θ, φ/A) θ^{1/2} p(y; μ, φ/A)^θ p(y; y, φ/A)^{1−θ},   (2)

where θ is an additional parameter and Z(μ, θ, φ/A) is a normalizing constant. To get some intuition for this definition, consider a Gaussian density with variance 1 and apply the double-exponential family construction. The resulting double-Gaussian distribution is in fact an ordinary Gaussian density with mean μ and variance 1/θ, so we can think of the parameter θ as a scale parameter modeling overdispersion (θ < 1) or underdispersion (θ > 1) with respect to the original one-parameter exponential family density. Whereas the double-Gaussian density is simply the ordinary Gaussian density, for distributions like the binomial and Poisson, where the variance is a function of the mean, the corresponding double-binomial and double-Poisson densities are genuine extensions that allow modeling of the variance.
Efron (1986) showed that

    E(y) ≈ μ,   Var(y) ≈ (1/θ)(φ/A) b″(ψ),   and   Z(μ, θ, φ/A) ≈ 1,   (3)

whereas these expressions are exact for θ = 1. Equation (3) helps interpret the parameters in the double-exponential model and shows how the GLM mean-variance relationship is embedded within the double-exponential family, which is important for parsimonious modeling of the variance in regression.
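The approximations in (3) can be checked numerically for the double-Poisson case. The following sketch is our own illustration (not the authors' code), in plain Python, using the convention 0⁰ = 1 for the p(y; y) factor; it builds the unnormalized density of (2) by direct summation and compares the resulting normalizing constant, mean, and variance with 1, μ, and μ/θ:

```python
import math

def log_poisson(y, mu):
    """Log Poisson density log p(y; mu), with p(y; mu) = exp(-mu) mu^y / y!."""
    if mu == 0.0 or y == 0:
        return -mu  # covers p(0; 0) = 1 via the 0^0 = 1 convention
    return -mu + y * math.log(mu) - math.lgamma(y + 1)

def double_poisson_unnorm(y, mu, theta):
    """Unnormalized double-Poisson density from (2):
    theta^(1/2) p(y; mu)^theta p(y; y)^(1 - theta)."""
    return math.sqrt(theta) * math.exp(
        theta * log_poisson(y, mu) + (1.0 - theta) * log_poisson(y, y)
    )

def double_poisson_summary(mu, theta, ymax=500):
    """Return (1/Z, mean, variance) by direct summation over y = 0..ymax."""
    w = [double_poisson_unnorm(y, mu, theta) for y in range(ymax + 1)]
    total = sum(w)                 # equals 1/Z(mu, theta); approximately 1 by (3)
    pr = [wy / total for wy in w]
    mean = sum(y * p for y, p in enumerate(pr))
    var = sum((y - mean) ** 2 * p for y, p in enumerate(pr))
    return total, mean, var

total, mean, var = double_poisson_summary(mu=5.0, theta=0.5)
```

For μ = 5 and θ = .5 the summation gives a mean close to μ and a variance close to μ/θ = 10, illustrating how θ < 1 induces overdispersion relative to the Poisson.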
2.2 Semiparametric Double-Exponential Regression Models
Efron (1986) considered regression models with a response distribution from a double-exponential family, such that both the mean parameter μ and the dispersion parameter θ are functions of the predictors. Let y₁, ..., yₙ denote n observed responses, and suppose that μᵢ and θᵢ are the location and dispersion parameters in the distribution of yᵢ. For appropriate link functions g(·) and h(·), we consider the model

    g(μᵢ) = β₀^μ + Σ_{j=1}^p x_{ij} β_j^μ + Σ_{j=1}^p f_j^μ(x_{ij}),   (4)

    h(θᵢ) = β₀^θ + Σ_{j=1}^p x_{ij} β_j^θ + Σ_{j=1}^p f_j^θ(x_{ij}).   (5)

We first discuss (4) for the mean. This equation has an overall intercept β₀^μ, with the effect of the jth covariate given by the linear term x_{ij}β_j^μ and the nonlinear term f_j^μ(x_{ij}), which is modeled flexibly using a cubic smoothing spline prior. Let x·j = (x_{ij}, i = 1, ..., n), for j = 1, ..., p. We standardize each of the covariate vectors x·j to have mean 0 and variance 1, which makes the x·j, j = 1, ..., p, orthogonal to the intercept term and comparable in size.
2.3 Prior Specification
We now describe the priors on the parameters for the model given by (2), (4), and (5). The hierarchical prior is specified in terms of indicator variables that allow selection of linear or flexible terms. The prior for β₀^μ is flat. Let β^μ = (β₁^μ, ..., β_p^μ)ᵀ. To allow the elements of β^μ to be in or out of the model, we define a vector of indicator variables J^μ = (J₁^μ, ..., J_p^μ) such that J_j^μ = 0 means that β_j^μ is identically 0 and J_j^μ = 1 otherwise. For given J^μ, let β_J^μ be the subvector of nonzero components of β^μ, that is, those components β_j^μ with J_j^μ = 1. We use the notation N(a, b) for the normal distribution with mean a and variance b, IG(a, b) for the inverse-gamma distribution with shape parameter a and scale parameter b, and U(a, b) for the uniform distribution on the interval [a, b]. With this notation, the prior on β_J^μ for a given value of J^μ is β_J^μ | J^μ ~ N(0, b^μ I), where b^μ ~ IG(s, t), s = 101, and t = 10,100. This choice of parameters produces an inverse-gamma prior with mean 101 and standard deviation 10.15, which works well across a range of examples, both parametric and nonparametric. But in general, the choice of s and t may depend on the scale and location of the dependent variable and is left to the user. For a continuous response, standardizing the dependent variable may be useful here. The issue of sensitivity to prior hyperparameters is addressed in Section 3.5. We also assume that Pr(J_j^μ = 1 | π^μ) = π^μ for j = 1, ..., p, and that the J_j^μ are independent given π^μ. The prior for π^μ is U(0, 1). We now specify the priors for the nonlinear terms f_j^μ, j = 1, ..., p. The discussion that follows assumes that each x·j is
rescaled to the interval [0, 1], so that the priors in the general case are obtained by transforming back to the original scale. We assume that the functions f_j^μ are a priori independent of one another and that, for any m abscissae z₁, ..., z_m, the vector (f_j^μ(z₁), ..., f_j^μ(z_m))ᵀ is normal with mean 0 and

    cov(f_j^μ(z), f_j^μ(z′)) = exp(c_j^μ) Ω(z, z′),   where   Ω(z, z′) = ½ z² (z′ − z/3)   for 0 ≤ z ≤ z′ ≤ 1   (6)

and Ω(z′, z) = Ω(z, z′). This prior on f_j^μ leads to a cubic smoothing spline for the posterior mean of f_j^μ (Wahba 1990, p. 16), with exp(c_j^μ) as the smoothing parameter. This prior is used extensively to flexibly model univariate functions and gives the joint prior distribution of the unknown function f_j^μ over all abscissa values. In particular, it allows the introduction of additional abscissae without changing the form of the prior. Other Gaussian process priors can be used instead of the cubic smoothing spline prior, and the results likely will be very similar. These include radial basis function-type priors, including thin-plate spline priors that readily generalize to multiple dimensions (see Ruppert, Wand, and Carroll 2003).
For j = 1, ..., p, let f_j^μ(x·j) = (f_j^μ(x_{1j}), ..., f_j^μ(x_{nj}))ᵀ and define the n × n matrix V_j^μ as having (i, k)th element Ω(x_{ij}, x_{kj}), so that cov(f_j^μ(x·j)) = exp(c_j^μ) V_j^μ. The matrix V_j^μ is positive definite and can be factored as V_j^μ = Q_j^μ D_j^μ Q_j^μᵀ, where Q_j^μ is an orthogonal matrix of eigenvectors and D_j^μ is a diagonal matrix of eigenvalues. Let W_j^μ = Q_j^μ (D_j^μ)^{1/2}; then f_j^μ(x·j) = W_j^μ a_j^μ, where a_j^μ ~ N(0, exp(c_j^μ) I).

To allow the term f_j^μ to be in or out of the model, we introduce the indicator variable K_j^μ, so that K_j^μ = 0 means that a_j^μ = 0, which is equivalent to f_j^μ = 0. Otherwise, K_j^μ = 1. We also force f_j^μ to be null if the corresponding linear term β_j^μ = 0; that is, if the linear term is 0, then we force the flexible term also to be 0. If J_j^μ = 1, then we assume that K_j^μ is 1 with probability π^{Kμ}, with the prior on π^{Kμ} uniform. When K_j^μ = 1, the prior for c_j^μ is N(a^{cμ}, b^{cμ}), where a^{cμ} ~ N(0, 100) and b^{cμ} ~ IG(s, t), with s and t as defined earlier.

As a practical matter, we order the eigenvalues in D_j^μ in decreasing order and set all but the largest m eigenvalues to 0, where m is chosen as the smallest number such that the sum of the m largest eigenvalues is at least a given fraction (close to 1) of the sum of all the eigenvalues. In our work m usually is quite small, around 3 or 4. By setting the smaller eigenvalues to 0, we set the corresponding elements of a_j^μ to 0, and thus it is necessary to work only with an a_j^μ that is low-dimensional. This achieves a parsimonious parameterization of f_j^μ while retaining its flexibility as a prior. Our approach is similar to the pseudospline approach of Hastie (1996) and has the advantage over other reduced spline basis approaches, such as those used by Eilers and Marx (1996), Yau et al. (2003), and Ruppert et al. (2003), of not requiring the choice of the number or location of knots.
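The low-rank construction just described — evaluate the kernel (6) at the design points, eigendecompose, and keep only the leading eigenvalues — can be sketched as follows. This is our own illustration, assuming NumPy; the retained fraction 0.999 is an illustrative choice, not a value taken from the article:

```python
import numpy as np

def omega(z, zp):
    """Cubic smoothing spline covariance kernel (6) on [0, 1]:
    Omega(z, z') = (1/2) z^2 (z' - z/3) for z <= z', symmetric otherwise."""
    lo, hi = np.minimum(z, zp), np.maximum(z, zp)
    return 0.5 * lo**2 * (hi - lo / 3.0)

# Design points of one covariate, rescaled to [0, 1].
x = np.linspace(0.0, 1.0, 50)
V = omega(x[:, None], x[None, :])   # n x n prior covariance of f at the design points

# Factor V = Q D Q^T; eigh returns ascending eigenvalues, so reverse.
d, Q = np.linalg.eigh(V)
d, Q = d[::-1], Q[:, ::-1]
d = np.clip(d, 0.0, None)           # guard against tiny negative round-off

# Keep the smallest m whose leading eigenvalues sum to a fraction
# close to 1 of the total (0.999 here is illustrative).
frac = np.cumsum(d) / np.sum(d)
m = int(np.searchsorted(frac, 0.999) + 1)

# Low-rank design: f = W a with W = Q_m D_m^{1/2} and a ~ N(0, exp(c) I).
W = Q[:, :m] * np.sqrt(d[:m])
```

Because the eigenvalues of the cubic spline kernel decay rapidly, m comes out very small, consistent with the values of 3 or 4 reported in the text.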
The interpretation of (5) for the variance is similar to that of the mean equation. Let β^θ = (β₁^θ, ..., β_p^θ)ᵀ and define the indicator variable J^θ = 0 if β^θ is identically 0, with J^θ = 1 otherwise; that is, in the variance equation (unlike the mean equation), all of the linear terms are either in or out of the model simultaneously, so we assume that there is linear overdispersion or underdispersion in all of the variables or none of the variables. It would not be difficult to perform selection on the linear terms for individual predictors in the variance model, but in many applications there may be no overdispersion, so that the null model in which all predictors are excluded from the variance model is inherently interesting, with inclusion of all linear terms with a shrinkage prior on coefficients as a reasonable alternative. Our prior parameterizes this comparison directly. When J^θ = 1, we take the prior β^θ ~ N(0, b^θ I) with b^θ ~ IG(s, t), where s and t are as defined earlier, and Pr(J^θ = 1) = .5. We usually use a log link in the dispersion model, h(θ) = log θ; in this case J^θ = 0 implies that all θ values are fixed at 1, corresponding to no overdispersion. In some of the examples that follow, we sometimes fix J^θ = 0, which means that our prior gives a strategy for generalized additive modeling with variable selection and the ability to choose between linear and flexible effects for additive terms.

The hierarchical prior for the nonlinear terms f_j^θ is similar to that for f_j^μ. We write f_j^θ(x·j) = W_j^θ a_j^θ with a_j^θ ~ N(0, exp(c_j^θ) I). The prior for c_j^θ is N(a^{cθ}, b^{cθ}) with a^{cθ} ~ N(0, 100) and b^{cθ} ~ IG(s, t), where s and t are as defined earlier, and K_j^θ is 1 with probability π^{Kθ}, with the prior on π^{Kθ} uniform. We allow the nonlinear terms to be identically zero by introducing the indicator variables K_j^θ, j = 1, ..., p, where K_j^θ = 1 means that f_j^θ is in the model and K_j^θ = 0 means that it is not. Similar to the linear case, we impose that K_j^θ = 0 for all j if J^θ = 0; that is, if J^θ = 0, then all of the nonlinear terms in the variance are 0.
Our framework provides an approach to variable selection and model averaging in GLMs and overdispersed GLMs by fixing K_j^μ = K_j^θ = 0, j = 1, ..., p, so that all of the terms enter the model parametrically. The example given in Section 3.1 illustrates our framework's ability to handle situations in which a simple parametric model is appropriate.
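The hierarchy of indicator variables for the mean model can be illustrated with a small prior-simulation sketch. This is our own code (the function name and the use of Python's random module are our own); the key constraint it encodes is that a flexible term can enter only when its linear term is in:

```python
import random

def draw_mean_indicators(p, rng):
    """One prior draw of the mean-model indicators.

    pi_lin ~ U(0, 1) governs the linear-term indicators J_j, and
    pi_flex ~ U(0, 1) governs the flexible-term indicators K_j,
    with K_j forced to 0 whenever J_j = 0 (no flexible term without
    the corresponding linear term)."""
    pi_lin = rng.random()
    pi_flex = rng.random()
    J = [1 if rng.random() < pi_lin else 0 for _ in range(p)]
    K = [1 if (J[j] == 1 and rng.random() < pi_flex) else 0 for j in range(p)]
    return J, K

J, K = draw_mean_indicators(p=8, rng=random.Random(1))
```

Setting every K_j to 0 in such a draw recovers the fully parametric GLM special case described above.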
We conclude this section by noting that there is a large-sample motivation for using Bayesian model selection and model averaging. We believe that consistency results will hold in our framework under appropriate regularity conditions, but it is beyond the scope of this article to prove this. Our variable selection prior defines a number of models based on the values of the indicator variables J_j^μ, K_j^μ, and J^θ. When all of the indicator variables take the value 1, we obtain the largest model. If the true model is one of the candidate models, then, under general conditions, we can usually show that Bayesian model selection is consistent in the sense that the posterior probability of the true model tends to 1 asymptotically. Furthermore, it also is possible to show that, asymptotically, the predictive density of a new observation under model averaging will tend to the predictive density under the true model. This means that both Bayesian model selection and model averaging will produce parsimonious models in large samples. (See O'Hagan and Forster 2004, chap. 7, particularly
secs. 7.25-7.27, and Berger and Pericchi 2001 for a discussion of Bayesian model selection and Cottet, Kohn, and Nott 2008 for a more complete discussion of the foregoing issues.)
In finite samples we can control for model parsimony by the choice of the priors for the indicator variables, which in turn determine the probabilities of the various implied models. Here the model size for the mean is determined by the data, because we take π^μ and π^{Kμ} as uniformly distributed. We also take .5 as the prior probability that the variance indicator variable J^θ takes the value 1. By choosing nonuniform distributions for π^μ and π^{Kμ}, as well as assuming the probability that J^θ = 1 to be other than .5, we can make informative prior assumptions on the model size (i.e., the number of parameters in the model) in finite samples.
3. EMPIRICAL RESULTS
The sampling scheme that we use to estimate the models was
described by Cottet et al. (2008).
3.1 Fully Parametric Regression
This section illustrates the use of our variable selection methodology in a parametric setting by fitting an overdispersed Poisson model to the pox lesions chick data (available at http://www.statsci.org/data/general/pocklesi.html). The example demonstrates that our methods can be applied with the flexible terms excluded for small data sets, for which it may be feasible to fit only a simple parametric model. There are 58 observations in this data set; such small data sets are common in applications of overdispersed GLMs.

The dependent variable is the lesion count produced on membranes of chick embryos by viruses of the pox group. The independent variable is the dilution level of the viral medium. These data were analyzed by Breslow (1990) and Podlich,
Faddy, and Smyth (2004). In our model, g(μᵢ) = log(μᵢ) and h(θᵢ) = log(θᵢ), where g and h are linear functions of the dilution level. Figure 1 plots the fit for the parameter μ and for log θ as a function of viral dilution. The posterior probability of overdispersion (i.e., the posterior probability of J^θ = 1) is approximately 1, strongly suggesting overdispersion, which is consistent with previous analyses reported by Breslow (1990) and Podlich et al. (2004). The results show that overdispersion increases with increasing viral dilution, whereas the lesion count decreases with dilution.

Figure 2. Lesions on chick embryos: Plot of the log-likelihood versus iteration and estimated autocorrelation based on 2,000 iterations with 2,000 burn-in.
Figure 2 plots the log-likelihood versus iteration number in our MCMC sampling scheme, as well as the autocorrelation function of the log-likelihood values based on 2,000 iterations after 2,000 burn-in iterations. The plots show that our sampling scheme converges rapidly and mixes well. Corresponding plots for our other examples (not shown) confirm the excellent properties of our sampling scheme. The 4,000 iterations of our sampler took 280 seconds on a machine with a 2.8-GHz processor. For all of the examples considered in this article, programs implementing our sampler were written in Fortran 90.
Figure 1. Lesions on chick embryos: Plot of the estimated posterior means of mean and variance parameters as a function of viral dilution, together with pointwise 95% credible intervals. (a) Estimated values of the parameter μ along with the data. (b) Estimated values of log θ.
3.2 Binary Logistic Regression
In this section we fit a main-effects binary logistic regres sion to the Pima Indian diabetes data set obtained from the UCI
repository of machine learning databases (http://archive.ics.uci. edu/mU). A population of women at least 21 years old, of Pima Indian heritage, and living near Phoenix, Arizona was tested for diabetes according to World Health Organization (WHO) crite ria. The data were collected by the U.S. National Institute of Diabetes and Digestive and Kidney Diseases. We follow Yau et al. (2003) and use the 724 complete records after dropping the aberrant cases. The dependent variable is diabetic or not ac
cording to the WHO criteria, with a positive test coded as "1." There are eight covariates: number of pregnancies, plasma glu cose concentration in an oral glucose tolerance test, diastolic
blood pressure (in mm Hg), triceps skinfold thickness (in mm), 2-hour serum insulin (in mu U/mL), body mass index (weight in kg/(height in m)2), diabetes pedigree function, and age in
years. In the notation of Section 2, we fix all ft's at 1 and so fit
a generalized linear additive model with g(?i) = log(/x;/(l ?
/x,-)), which allows for variable selection and choice between flexible and linear effects for the additive terms. The results are
given in Figure 3, with the barplot showing that the posterior probabilities of effects for each predictor are null, linear, and flexible. The barplot suggests that the number of pregnancies, diastolic blood pressure, triceps skinfold thickness, and 2-hour serum insulin do not seem to help predict the occurrence of diabetes when the other covariates are included in the model.
Figure 3 also shows that plasma glucose concentration has a
Figure 3. Logistic diabetes data: Plots of the posterior means of the covariate effects at the design points and 95% credible intervals. The panels show number of times pregnant, glucose concentration, diastolic blood pressure, triceps skin thickness, 2-hour serum insulin, body mass index, diabetes pedigree function, and age. The barplot gives the posterior probability of each covariate function being null (white), linear (gray), and flexible (black).
Table 1. Simulated diabetes data with interaction: Posterior probabilities of null, linear, and flexible main effects and flexible interaction effects

Covariate     1     2     3     4     5     6     7     8
Null        .88     0   .86   .81   .76   .03     0     0
Linear      .08   .58   .08   .05   .11   .53     0     0
1             0     0     0     0     0   .01   .03   .02
2                 .42   .01   .03   .05   .13   .24   .21
3                       .06   .02   .02     0   .02   .03
4                             .14   .04   .03   .07   .07
5                                   .13   .03   .06   .06
6                                         .44   .02   .23
7                                              1.00  1.00
8                                                    1.00

NOTE: Row j of the lower block gives, on the diagonal, the posterior probability of a flexible main effect for covariate j and, off the diagonal, the posterior probability of a flexible interaction between covariates j and k. For example, for covariate 4 the posterior probabilities of a null, linear, and flexible main effect are .81, .05, and .14, and the posterior probability of a flexible interaction between covariates 3 and 4 is .02. Other entries in the table are interpreted similarly.
where π^I is distributed uniformly. The generation of the interaction-effects parameters (a_{jk}^I, c_{jk}^I, K_{jk}^I) is similar to the generation of the other parameters in the model. First, the indicator variable is generated from the prior p(K_{jk}^I | K_j^M, K_k^M, π^I). If K_{jk}^I = 1, then a_{jk}^I and c_{jk}^I are generated as described in the appendix of Cottet et al. (2008) for the generation of the other parameters; otherwise, a_{jk}^I is set to 0.
No interactions were detected when the interaction model was fitted to the data. To test the effectiveness of the methodology at detecting interactions, we also generated observations from the estimated main-effects model, but added an interaction between diabetes pedigree function and age. Writing x and z for these two predictors, the interaction term added to our fitted additive model for log{μᵢ/(1 − μᵢ)} in the simulation takes the simple multiplicative form xz. Table 1 reports the results of estimation when the interaction model is fitted to the artificial data, showing that the interaction effect between variables 7 and 8 is detected.
3.4 Double-Binomial Model
This example considers a data set given by Moore and Tsiatis (1991) and analyzed by Aerts and Claeskens (1997) using a local beta-binomial model. An iron supplement was given to 58 female rats at various dose levels. The rats were impregnated and then sacrificed after 3 weeks. The litter size and the number of dead fetuses, as well as the hemoglobin levels of the mothers, were recorded. We fitted a double-binomial model to the data to try to explain the proportion of dead fetuses with the mother's hemoglobin level and litter size as covariates.

Figure 4 summarizes the estimation results and shows the presence of overdispersion. As usual when dealing with binomial-like data, the count response is rescaled to be a proportion, so the parameter μ lies in [0, 1]. We use a logistic link for μ and a log link for θ. The results suggest no effect for sample size in the mean model, with some support for either linear or flexible effects for hemoglobin in the mean and variance models and for sample size in the variance model.
As in the diabetes examples, we quantify the usefulness of
approximation 2 of Cottet et al. (2008) for constructing the
strong positive linear effect and that body mass index, diabetes pedigree function, and age have nonlinear effects.

Our method extends the approach of Yau et al. (2003) to any GAM, whereas Yau et al. (2003) relied on the probit link to turn a binary regression into a regression with Gaussian errors. Our approach has several other advantages over that of Yau et al. (2003), as explained in Sections 1 and 3.3. We also estimated the model using data-based priors similar to those of Yau et al. (2003) and obtained similar results.

To quantify the usefulness of approximation 2 of Cottet et al. (2008, the appendix) for constructing the Metropolis-Hastings proposal, we report the acceptance rates at step 4 of their sampling scheme for each predictor for which there is posterior probability > .1 of inclusion of a flexible term. The acceptance rates are 19%, 11%, and 38%. The results for this example were based on 4,000 iterations of the sampling scheme with 4,000 burn-in iterations. Running this sampling scheme took 1,300 seconds for 4,000 iterations.
3.3 High-Dimensional Binary Logistic Regression
This section extends the model for the Pima Indian data set to allow for flexible second-order interactions. This means that the model potentially has 36 flexible terms, with 8 main effects and 28 interactions. The purpose of this section is to show how our class of models can handle interactions and to demonstrate that the hierarchical priors allow variable selection with a large number of terms. This is infeasible with the data-based prior approach of Yau et al. (2003), as explained in Section 1. We generalize the mean model (4) as
g(μ_i) = β_0 + \sum_{j=1}^{p} x_{ij} β_j + \sum_{j=1}^{p} f_j^M(x_{ij}) + \sum_{j=1}^{p} \sum_{k=j+1}^{p} f_{jk}^I(x_{ij}, x_{ik}).

Here the superscript μ is dropped from β_j and f_j, because we are dealing with the mean equation only. We write the flexible main effects and interactions as f_j^M and f_{jk}^I, where M represents a main effect and I represents an interaction. The prior for the f_j^M is the same as that for the flexible main effects in Section 2. For the interaction effects, we assume that any collection {f_{jk}^I(x_i, z_i), i = 1, ..., n} is Gaussian with mean 0 and

cov(f_{jk}^I(x, z), f_{jk}^I(x', z')) = exp(c_{jk}^I) Ω(x, x') Ω(z, z'),

where Ω(z, z') is defined by (6). This gives a covariance kernel for the f_{jk}^I that is the tensor product of univariate covariance kernels (Gu 2002, sec. 2.4). Once the covariance matrix for (f_{jk}^I(x_{ij}, x_{ik}), i = 1, ..., n) is constructed, we factor it to get a parsimonious representation as in Section 2. The smoothing parameters c_{jk}^I have a prior similar to that of the c_j^μ in Section 2.
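The tensor-product construction and factorization step can be sketched in code. This is an illustrative sketch only: a squared-exponential kernel stands in for the paper's univariate kernel Ω of equation (6), which is not reproduced here, and the function names and the `tol` truncation threshold are our own.

```python
import numpy as np

def omega(u):
    """Univariate covariance kernel evaluated at all pairs of observed
    points. The paper's actual kernel is its equation (6); a
    squared-exponential is used here purely as a stand-in."""
    d = u[:, None] - u[None, :]
    return np.exp(-0.5 * d ** 2)

def interaction_cov(x, z, c_jk):
    """Tensor-product covariance of the interaction f_jk at the n
    observed (x_i, z_i) pairs: the elementwise product of the two
    univariate kernel matrices, scaled by exp(c_jk)."""
    return np.exp(c_jk) * omega(x) * omega(z)

def parsimonious_factor(K, tol=1e-8):
    """Factor K as B @ B.T, keeping only eigenvalues above a relative
    threshold; the columns of B give a low-rank basis for the term."""
    vals, vecs = np.linalg.eigh(K)
    keep = vals > tol * vals.max()
    return vecs[:, keep] * np.sqrt(vals[keep])
```

Because the Schur (elementwise) product of two covariance matrices is again a covariance matrix, `interaction_cov` is positive semidefinite and the eigendecomposition yields a valid low-rank factor.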
To allow for variable selection of the flexible main effects, let K_j^M be indicator variables such that K_j^M = 0 if f_j^M is null and K_j^M = 1 otherwise. The prior for K_j^M is the same as that for the K_j^μ in Section 2. To allow variable selection on the flexible interaction terms, let K_{jk}^I be an indicator variable, which is 0 if f_{jk}^I is null and 1 otherwise. To make the bivariate interactions interpretable, we allow a flexible interaction between the jth and kth variables only if both flexible main effects are in; that is, if K_j^M = 0, K_k^M = 0, or both, then K_{jk}^I = 0. If both K_j^M and K_k^M are 1, then

p(K_{jk}^I = 1 | K_j^M, K_k^M, π^I) = π^I.
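This heredity constraint on the interaction indicators amounts to a small sampling rule; a sketch, with `pi_I` standing for the interaction inclusion probability π^I (the function name is ours):

```python
import random

def draw_interaction_indicator(K_j, K_k, pi_I, rng=random):
    """K_jk is forced to 0 unless both main-effect indicators K_j and
    K_k equal 1, in which case it is Bernoulli(pi_I)."""
    if K_j == 1 and K_k == 1:
        return int(rng.random() < pi_I)
    return 0
```

For example, `draw_interaction_indicator(1, 0, 0.5)` is always 0, because the second main effect is excluded.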
This content downloaded from 194.29.185.230 on Mon, 16 Jun 2014 10:34:56 AMAll use subject to JSTOR Terms and Conditions
Cottet, Kohn, and Nott: Overdispersed Generalized Linear Models 667
Figure 4. Double-binomial rat data. Left column: posterior means of the effects in the mean model, together with 95% credible intervals. Right column: effects for the dispersion component. The barplot shows the posterior probability of each effect being null (white), linear (gray), and flexible (black).
We quantify the usefulness of approximation 2 for constructing the Metropolis-Hastings proposal by reporting the acceptance rates at steps 4 and 8 of the sampling scheme (the appendix of Cottet et al. 2008) for each predictor where there is posterior probability >.1 of inclusion of a flexible term. The acceptance rates for the mean model are 2.5% for hemoglobin and 2.76% for litter size. No flexible effect is selected in the variance. Although the acceptance rates are quite low, our proposals are still good enough to obtain reasonable mixing. The results for this example are based on 5,000 iterations of our sampling scheme with 5,000 burn-in iterations; the 5,000 iterations took 4,039 seconds.

We also compared an implementation of our methodology using a beta-binomial response distribution to a flexible beta-binomial regression implemented in the GAMLSS library in R (Rigby and Stasinopoulos 2005). Implementation of our method for the beta-binomial family rather than the double-exponential family is straightforward, because our computational scheme makes no particular use of the double-exponential family assumption, but considers only the concept of mean and variance parameters being modeled flexibly as functions of covariates. For beta-binomial regression, Rigby and Stasinopoulos (2005) parameterized the model in terms of a mean parameter μ and a dispersion parameter σ, which equals ρ/(1 − ρ), where ρ is the intracluster correlation. (If we consider each count observation as an observation of a sequence of exchangeable binary random variables, then the intracluster correlation is just the correlation between a pair of these binary random variables.) Large values of σ correspond to overdispersion, whereas σ = 0 corresponds to no overdispersion. Our model is similar to the foregoing, except that the model for h(θ_i) in (5) is replaced with a model of the same form for h(σ_i), where σ_i is the dispersion parameter for observation i and h(·) is a link function, which we choose as the log function. Figures 5 and 6 present the results of our fit and the GAMLSS fit (with all terms flexible) for the rat data, showing that the fits are similar.
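The role of the dispersion parameter σ can be illustrated numerically. The sketch below assumes the GAMLSS beta-binomial variance formula Var(Y) = nμ(1 − μ){1 + σ(n − 1)/(1 + σ)}; the function names are our own.

```python
def sigma_from_rho(rho):
    """Dispersion parameter sigma = rho / (1 - rho), where rho is the
    intracluster correlation between two of the binary components."""
    return rho / (1 - rho)

def beta_binomial_var(n, mu, sigma):
    """Variance of a beta-binomial count with n trials and mean
    probability mu under the assumed GAMLSS parameterization;
    sigma = 0 recovers the binomial variance n*mu*(1-mu)."""
    return n * mu * (1 - mu) * (1 + sigma * (n - 1) / (1 + sigma))
```

As σ grows, the variance inflation factor rises from 1 (binomial) toward n, so large σ corresponds to strong overdispersion.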
668 Journal of the American Statistical Association, June 2008
Figure 5. Plots of effects for hemoglobin in the mean and dispersion models [(a) and (d)] and for sample size [(b) and (e)], together with pointwise 95% credible intervals, for the rat data. The barplots in (c) and (f) show the probabilities for null (white), linear (gray), and flexible (black) effects. A Bayesian beta-binomial model is used.
One advantage of our approach is greater computational stability, a feature that we believe is related to our shrinkage priors. We simulated several data sets from our fitted model for the mean but assuming no overdispersion (σ = 0) and then attempted to fit these simulated data using GAMLSS and our Bayesian approach with a beta-binomial model. The Bayesian approach produced satisfactory results, but attempting to fit the model in GAMLSS, even with only an intercept and no covariates in the variance model, resulted in convergence problems that cannot be easily resolved (D. M. Stasinopoulos, personal communication). But the GAMLSS fit is faster, and we have found the GAMLSS package very useful in the exploratory examination of many potential models.
3.5 Simulation Studies
Here we discuss three simulation studies that illustrate the effectiveness of our methodology for detecting overdispersion when it exists and for distinguishing among null, linear, and flexible effects. We also report the gain in performance that results from using hierarchical variable selection priors instead of a similar hierarchical prior in which variable selection is not done.

Figure 6. Plots of effects for hemoglobin in the mean model (a) and dispersion model (b), together with pointwise 95% credible intervals, for the rat data, with the fit obtained from the GAMLSS package.

Table 2. Rats data simulated from the fitted model: the 25th and 75th percentiles of the probabilities of flexible, linear, and null effects for the mean and variance components, for the two covariates hemoglobin (Hem) and sample size (SS)

                         Mean            Var
Effect     Percentile   Hem    SS      Hem    SS
Flexible   25th         .42    .15     .38    .38
           75th         .47    .27     .45    .45
Linear     25th         .52    .19     .55    .55
           75th         .58    .30     .61    .61
Null       25th         0      .44     0      0
           75th         0      .64     0      0

We generated 50 replicate data sets from the overdispersed model that we fitted to the rats data. Table 2 gives the 25th and 75th percentiles of the probabilities of null, linear, and flexible effects for the two predictors in the mean and dispersion models over the 50 replications. The results are consistent with our fit to the original data, with appreciable probabilities of linear and flexible effects for hemoglobin in the mean model and for sample size and hemoglobin level in the variance model, and an appreciable probability of no effect for sample size in the mean model.
Tables 3 and 4 give the results of the simulation study for our method with the hyperparameters s and t in the inverse-gamma priors of Section 2.2 set to (s, t) = (6, 500) and (s, t) = (27, 1,300) (giving prior means of 100 and 50 and prior standard deviations of 50 and 10). The tables show that the results of our approach are not particularly sensitive to the choices of s and t.
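These prior moments follow from the inverse-gamma mean t/(s − 1) and standard deviation t/{(s − 1)√(s − 2)}; a quick numerical check (the function name is ours):

```python
import math

def invgamma_mean_sd(s, t):
    """Mean and standard deviation of an inverse-gamma distribution
    with shape s and scale t (defined for s > 2)."""
    mean = t / (s - 1)
    sd = t / ((s - 1) * math.sqrt(s - 2))
    return mean, sd

print(invgamma_mean_sd(6, 500))    # -> (100.0, 50.0)
print(invgamma_mean_sd(27, 1300))  # -> (50.0, 10.0)
```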
Table 5 is similar to Table 2, but for 50 replicate data sets simulated from a fitted binomial model, that is, with no overdispersion. The probability of a null effect for both covariates in the variance model is near 1, and again there is a high probability of a null effect for sample size in the mean model and appreciable probabilities of linear and flexible effects for hemoglobin in the mean model. We also examined the performance of our approach for this case with the hyperparameters set to (s, t) = (6, 500) and (s, t) = (27, 1,300) and found that the results were relatively insensitive to these settings.

Table 3. Rats data simulated from the fitted model: the 25th and 75th percentiles of the probabilities of flexible, linear, and null effects for the mean and variance components, for the two covariates hemoglobin (Hem) and sample size (SS)

                         Mean            Var
Effect     Percentile   Hem    SS      Hem    SS
Flexible   25th         .10    .04     .11    .10
           75th         .13    .06     .13    .13
Linear     25th         .86    .27     .87    .88
           75th         .89    .33     .89    .90
Null       25th         0      .60     0      0
           75th         .01    .69     0      0

NOTE: The hyperparameter settings are (s, t) = (27, 1,300).

Table 4. Rats data simulated from the fitted model: the 25th and 75th percentiles of the probabilities of flexible, linear, and null effects for the mean and variance components, for the two covariates hemoglobin (Hem) and sample size (SS)

                         Mean            Var
Effect     Percentile   Hem    SS      Hem    SS
Flexible   25th         .09    .03     .11    .10
           75th         .12    .05     .14    .13
Linear     25th         .86    .23     .86    .87
           75th         .90    .28     .89    .90
Null       25th         0      .66     0      0
           75th         .02    .73     0      0

NOTE: The hyperparameter settings are (s, t) = (6, 500).
Table 6 shows the probabilities of null, linear, and flexible effects for the 8 covariates in the diabetes example for 50 simulated replicate data sets from an additive model fitted to the real data. The results are again consistent with our fit to the full model, with high probabilities of a null effect for covariates 1, 3, 4, and 5 (i.e., number of pregnancies, diastolic blood pressure, triceps skinfold thickness, and 2-hour serum insulin), an appreciable probability of a linear effect for covariate 2 (i.e., plasma glucose concentration), and high probabilities of nonlinear effects for covariates 6, 7, and 8 (i.e., body mass index, diabetes pedigree function, and age). We also studied the performance of our approach for the simulated diabetes data with hyperparameter settings (s, t) = (6, 500) and (s, t) = (27, 1,300) and found that the results were not particularly sensitive to these settings.
Table 5. Rats data simulated from a fitted binomial model with no overdispersion: the 25th and 75th percentiles of the probabilities of flexible, linear, and null effects for the mean and variance components, for the two covariates hemoglobin (Hem) and sample size (SS)

                         Mean            Var
Effect     Percentile   Hem    SS      Hem    SS
Flexible   25th         .40    .08     0      0
           75th         .43    .22     0      0
Linear     25th         .57    .12     0      0
           75th         .60    .36     0      0
Null       25th         0      .43     .99    .99
           75th         0      .79     .99    .99

Table 6. Simulated data from the fitted diabetes model: the 25th and 75th percentiles of the posterior probabilities of flexible, linear, and null effects for the eight covariates

                               Covariate
Effect     Percentile   1     2     3     4     5     6     7     8
Flexible   25th        .03   .55   .06   .05   .04   .62   .35   .99
           75th        .06   .61   .11   .11   .12   .97   .54  1.00
Linear     25th        .03   .39   .04   .03   .04   .03   .19   0
           75th        .05   .45   .08   .09   .08   .38   .40   .01
Null       25th        .88   0     .80   .79   .82   0     .04   0
           75th        .94   0     .90   .91   .92   0     .46   0

We now compare the performance of our hierarchical variable selection priors with the same prior but with all terms flexible (i.e., no variable selection is carried out). Our measure of performance is the Kullback-Leibler divergence, averaged over
the observed covariates. In estimating the true response distribution p_0(y|x) using an estimate p(y|x), where x denotes the covariates, the Kullback-Leibler divergence is defined as

KL(p(·|x), p_0(·|x)) = \int p_0(y|x) \log\{ p_0(y|x) / p(y|x) \} \, dy.

We define the average Kullback-Leibler divergence as

AKLD(p, p_0) = \frac{1}{n} \sum_{i=1}^{n} KL(p(·|x_i), p_0(·|x_i)),

where x_i, i = 1, ..., n, denote the observed predictors. Writing p^V(y|x) for the estimated predictive density at x for the variable selection prior and p^NV(y|x) for the estimated predictive density at x for the prior without variable selection, we define the average percentage increase in Kullback-Leibler loss for variable selection compared with no variable selection as

APKL = \frac{AKLD(p^NV, p_0) − AKLD(p^V, p_0)}{AKLD(p^V, p_0)} \times 100.
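For a discrete response, these two summaries can be computed directly. The sketch below is illustrative only; the array layout and function names are our own (each row of an input holds a predictive distribution over the support of y at one observed x_i).

```python
import numpy as np

def akld(p_hat, p_true):
    """Average Kullback-Leibler divergence (1/n) * sum_i KL(p_hat(.|x_i),
    p_true(.|x_i)) for a discrete response; rows index the n covariate
    points, columns the support of y."""
    with np.errstate(divide="ignore", invalid="ignore"):
        terms = np.where(p_true > 0, p_true * np.log(p_true / p_hat), 0.0)
    return terms.sum(axis=1).mean()

def apkl(p_v, p_nv, p_true):
    """Average percentage increase in KL loss of the no-selection fit
    p_nv over the variable-selection fit p_v; positive values favor
    variable selection."""
    return (akld(p_nv, p_true) - akld(p_v, p_true)) / akld(p_v, p_true) * 100
```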
When APKL is positive, the prior that allows for variable selection outperforms the prior that does not. Table 7 gives the 10th, 25th, 50th, 75th, and 90th percentiles of APKL for the 50 replicate data sets generated in our simulation study for the diabetes data, the rats data when no overdispersion is present, and the rats data when overdispersion is present. The table shows a positive median APKL for all three cases, indicating an improvement from using our hierarchical variable selection prior compared with no variable selection. Furthermore, for the rats data with no overdispersion, even the 10th percentile exceeds 28%.
Table 7. The 10th, 25th, 50th, 75th, and 90th percentiles of the percentage increase in Kullback-Leibler divergence with no variable selection relative to variable selection

                                        Percentile
Data set                        10th     25th    50th     75th     90th
Diabetes                      -14.49    -1.17   16.74    40.85    64.14
Rats with overdispersion      -13.92    -5.18    9.08    30.48    62.20
Rats with no overdispersion    28.67    67.55  155.52   293.43   734.54
4. CONCLUSION
In this article we have developed a general Bayesian framework for variable selection and model averaging in GLMs that allows for overdispersion and underdispersion. The priors and sampling scheme are innovative, and the flexibility of the approach has been demonstrated using examples ranging from fully parametric to fully nonparametric.

There are a number of natural extensions to the work described here. Although we have implemented our approach to flexible regression for the mean and variance using the double-exponential family of distributions, it would be easy to implement a similar approach using other distributional families for overdispersed count data, such as the beta-binomial and negative binomial. We have demonstrated the use of the beta-binomial in one of our real data examples. Flexible modeling of multivariate data can also be easily accommodated in our framework by incorporating other kinds of random effects apart from those involved in the nonparametric functional forms.
[Received August 2005. Revised February 2008.]
REFERENCES
Aerts, M., and Claeskens, G. (1997), "Local Polynomial Estimation in Multiparameter Models," Journal of the American Statistical Association, 92, 1536-1545.
Berger, J. O., and Pericchi, L. R. (2001), "Objective Bayesian Methods for Model Selection: Introduction and Comparison," in Model Selection, ed. P. Lahiri, Beachwood, OH: IMS, pp. 135-207.
Breslow, N. (1990), "Further Studies in the Variability of Pock Counts," Statistics in Medicine, 9, 615-626.
Breslow, N., and Clayton, D. (1993), "Approximate Inference in Generalized Linear Mixed Models," Journal of the American Statistical Association, 88, 9-25.
Brezger, A., and Lang, S. (2005), "Generalized Additive Structured Regression Based on Bayesian P-Splines," Computational Statistics and Data Analysis, 50, 967-991.
Cottet, R., Kohn, R., and Nott, D. (2008), "Variable Selection and Model Averaging in Semiparametric Overdispersed Generalized Linear Models: The Extended Version," available at http://www.amstat.org/publications/jasa/supplementaljnaterials.
Davidian, M., and Carroll, R. (1988), "A Note on Extended Quasi-Likelihood," Journal of the Royal Statistical Society, Ser. B, 50, 74-82.
Efron, B. (1986), "Double-Exponential Families and Their Use in Generalised Linear Regression," Journal of the American Statistical Association, 81, 709-721.
Eilers, P. H. C., and Marx, B. D. (1996), "Flexible Smoothing With B-Splines and Penalties" (with rejoinder), Statistical Science, 11, 89-121.
Faddy, M. (1997), "Extended Poisson Process Modelling and Analysis of Count
Data," Biometrical Journal, 39, 431-440.
Gu, C. (2002), Smoothing Spline ANOVA Models, New York: Springer-Verlag.
Hastie, T. (1996), "Pseudosplines," Journal of the Royal Statistical Society, Ser. B, 58, 379-396.
Hastie, T., and Tibshirani, R. (1990), Generalized Additive Models, New York:
Chapman & Hall.
Jorgensen, B. (1997), The Theory of Dispersion Models, London: Chapman & Hall.
Lee, Y., and Nelder, J. (1996), "Hierarchical Generalized Linear Models" (with discussion), Journal of the Royal Statistical Society, Ser. B, 58, 619-678.
Lin, X., and Zhang, D. (1999), "Inference in Generalized Additive Mixed Models by Using Smoothing Splines," Journal of the Royal Statistical Society, Ser. B, 61, 381-400.
McCullagh, P., and Nelder, J. (1989), Generalized Linear Models (2nd ed.), London: Chapman & Hall.
Moore, D., and Tsiatis, A. (1991), "Robust Estimation of the Variance in Moment Methods for Extra-Binomial and Extra-Poisson Variation," Biometrics, 47, 383-401.
Nelder, J., and Pregibon, D. (1987), "An Extended Quasi-Likelihood Function," Biometrika, 74, 221-232.
Nott, D. (2006), "Semiparametric Estimation of Mean and Variance Functions for Non-Gaussian Data," Computational Statistics, 21, 603-620.
O'Hagan, A., and Forster, J. (2004), Kendall's Advanced Theory of Statistics, Volume 2B: Bayesian Inference (2nd ed.), Oxford, U.K.: Oxford University Press.
Podlich, H., Faddy, M., and Smyth, G. (2004), "Semi-Parametric Extended Poisson Process Models," Statistics and Computing, 14, 311-321.
Rigby, R., and Stasinopoulos, D. (2005), "Generalized Additive Models for Location, Scale and Shape," Applied Statistics, 54, 1-38.
Ruppert, D., Wand, M., and Carroll, R. (2003), Semiparametric Regression, Cambridge: Cambridge University Press.
Shively, T., Kohn, R., and Wood, S. (1999), "Variable Selection and Function Estimation in Additive Nonparametric Regression Using a Data-Based Prior" (with discussion), Journal of the American Statistical Association, 94, 777-807.
Smith, M., and Kohn, R. (1996), "Nonparametric Regression Using Bayesian Variable Selection," Journal of Econometrics, 75, 317-344.
Smyth, G. (1989), "Generalized Linear Models With Varying Dispersion," Journal of the Royal Statistical Society, Ser. B, 51, 47-60.
Wahba, G. (1990), Spline Models for Observational Data, Philadelphia: SIAM.
Wild, C., and Yee, T. (1996), "Additive Extensions to Generalized Estimating Equation Methods," Journal of the Royal Statistical Society, Ser. B, 58, 711-725.
Yau, P., Kohn, R., and Wood, S. (2003), "Bayesian Variable Selection and Model Averaging in High-Dimensional Multinomial Nonparametric Regres sion," Journal of Computational and Graphical Statistics, 12, 23-54.
Yee, T., and Wild, C. (1996), "Vector-Generalized Additive Models," Journal
of the Royal Statistical Society, Ser. B, 58, 481-493.