Methodology Manual


  • 8/14/2019 Methodology Manual


    BayesX

    Software for Bayesian Inference in Structured Additive Regression Models

    Version 1.51

    Methodology Manual

    Developed by

Andreas Brezger, Thomas Kneib (Ludwig-Maximilians-University Munich)
Stefan Lang (Leopold-Franzens-University Innsbruck)

    With contributions by

Christiane Belitz, Eva-Maria Fronk, Andrea Hennerfeind,
Manuela Hummel, Alexander Jerak, Petra Kragler, Leyre Osuna Echavarría

    Supported by

Ludwig Fahrmeir (mentally), Leo Held (mentally), German Science Foundation


    Acknowledgements

The development of BayesX has been supported by grants from the German National Science Foundation (DFG), Collaborative Research Center 386 "Statistical Analysis of Discrete Structures".

    Special thanks go to (in alphabetical order of first names):

Achim Zeileis for advertising HCL colors;
Dieter Gollnow for computing and providing the map of Munich (a really hard job);
Leo Held for advertising the program;
Ludwig Fahrmeir for his patience with finishing the program and for carefully reading and correcting the manual;
Ngianga-Bakwin Kandala for being the first user of the program (a really hard job);
Samson Babatunde Adebayo for carefully reading and correcting the manual;
Ursula Becker for carefully reading and correcting the manual.

    Licensing agreement

The authors of this software grant to any individual or non-commercial organization the right to use and to make an unlimited number of copies of this software. Usage by commercial entities requires a license from the authors. You may not decompile, disassemble, reverse engineer, or modify the software. This includes, but is not limited to, modifying/changing any icons, menus, or displays associated with the software. This software cannot be sold without written authorization from the authors. This restriction is not intended to apply to connect time charges, or flat rate connection/download fees for electronic bulletin board services. The authors of this program accept no responsibility for damages resulting from the use of this software and make no warranty or representation, either express or implied, including but not limited to, any implied warranty of merchantability or fitness for a particular purpose. This software is provided "as is", and you, its user, assume all risks when using it.

BayesX is available at http://www.stat.uni-muenchen.de/~bayesx

    1 Introduction

In this manual we provide a brief overview of the methodological background for the two regression tools currently implemented in BayesX. The first regression tool (bayesreg objects) relies on Markov chain Monte Carlo simulation techniques and yields fully Bayesian posterior mean estimates. The second regression tool (remlreg objects) is based on the mixed model representation of penalised regression models, with inference based on penalised maximum likelihood and marginal likelihood (a generalisation of restricted maximum likelihood) estimation. Both regression tools allow the estimation of structured additive regression (STAR) models (Fahrmeir, Kneib and Lang, 2004) with complex semiparametric predictors. STAR models cover a number of well known model classes as special cases, including generalized additive models (Hastie and Tibshirani, 1990), generalized additive mixed models (Lin and Zhang, 1999), geoadditive models (Kammann and Wand, 2003), varying coefficient models (Hastie and Tibshirani, 1993), and geographically weighted regression (Fotheringham, Brunsdon and Charlton, 2002). Besides models for responses from univariate exponential families, BayesX also supports non-standard regression situations such as models for categorical responses with either ordered or unordered categories, continuous time survival data, or continuous time multi-state models. To provide a first impression of structured additive regression, Sections 2 to 4 describe STAR models for exponential family regression. Section 5 extends structured additive regression to the analysis of survival times and multi-state data. Full details on STAR methodology can be found in the following references:

    Structured additive regression based on MCMC simulation:

Brezger, A., Lang, S. (2006) Generalized structured additive regression based on Bayesian P-splines. Computational Statistics and Data Analysis, 50, 967-991.

Fahrmeir, L., Lang, S. (2001) Bayesian Inference for Generalized Additive Mixed Models Based on Markov Random Field Priors. Journal of the Royal Statistical Society C (Applied Statistics), 50, 201-220.

Fahrmeir, L., Lang, S. (2001) Bayesian Semiparametric Regression Analysis of Multicategorical Time-Space Data. Annals of the Institute of Statistical Mathematics, 53, 10-30.

Fahrmeir, L., Osuna, L. (2006) Structured additive regression for overdispersed and zero-inflated count data. Applied Stochastic Models in Business and Industry, 22, 351-369.

Hennerfeind, A., Brezger, A., Fahrmeir, L. (2006) Geoadditive survival models. Journal of the American Statistical Association, 101, 1065-1075.

Kneib, T., Hennerfeind, A. (2006) Bayesian Semiparametric Multi-State Models. SFB 386 Discussion Paper 502.

Lang, S., Brezger, A. (2004) Bayesian P-splines. Journal of Computational and Graphical Statistics, 13, 183-212.

    Structured additive regression based on mixed model methodology:

Fahrmeir, L., Kneib, T., Lang, S. (2004) Penalized structured additive regression for space-time data: a Bayesian perspective. Statistica Sinica, 14, 715-745.

Kneib, T. (2006) Mixed model based inference in structured additive regression. Dr. Hut Verlag, München. Available online from http://edoc.ub.uni-muenchen.de/archive/00005011/

Kneib, T. (2006) Geoadditive hazard regression for interval censored survival times. Computational Statistics and Data Analysis, 51, 777-792.


Kneib, T., Fahrmeir, L. (2007) A mixed model approach for geoadditive hazard regression. Scandinavian Journal of Statistics, 34, 207-228.

Kneib, T., Fahrmeir, L. (2006) Structured additive regression for multicategorical space-time data: A mixed model approach. Biometrics, 62, 109-118.

Kneib, T., Hennerfeind, A. (2006) Bayesian Semiparametric Multi-State Models. SFB 386 Discussion Paper 502.

    2 Observation model

Bayesian generalized linear models assume that, given covariates $u$ and unknown parameters $\gamma$, the distribution of the response variable $y$ belongs to an exponential family, i.e.

$$p(y \mid u) = \exp\left(\frac{y\theta - b(\theta)}{\phi}\right) c(y, \phi), \qquad (1)$$

where $b(\cdot)$, $c(\cdot)$, $\theta$ and $\phi$ determine the specific response distribution. A list of the most common distributions and their parameters can be found for example in Fahrmeir and Tutz (2001), page 21. The mean $\mu = E(y \mid u, \gamma)$ is linked to a linear predictor $\eta$ by

$$\mu = h(\eta), \qquad \eta = u'\gamma, \qquad (2)$$

where $h$ is a known response function and $\gamma$ are unknown regression parameters.

In most practical regression situations, however, we face at least one of the following problems:

For the continuous covariates in the data set, the assumption of a strictly linear effect on the predictor may not be appropriate.

    Observations may be spatially correlated.

    Observations may be temporally correlated.

Complex interactions may be required to model the joint effect of some of the covariates adequately.

Heterogeneity among individuals or units may not be sufficiently described by covariates. Hence, unobserved unit- or cluster-specific heterogeneity must be considered appropriately.

To overcome these difficulties, we replace the strictly linear predictor in (2) by a structured additive predictor

$$\eta_r = f_1(x_{r1}) + \cdots + f_p(x_{rp}) + u_r'\gamma, \qquad (3)$$

where $r$ is a generic observation index, the $x_{rj}$ denote generic covariates of different type and dimension, and the $f_j$ are (not necessarily smooth) functions of the covariates. The functions $f_j$ comprise usual nonlinear effects of continuous covariates, time trends and seasonal effects, two-dimensional surfaces, varying coefficient models, i.i.d. random intercepts and slopes, as well as spatial effects. In order to demonstrate the generality of the approach we point out some special cases of (3) well known from the literature:

Generalized additive model (GAM) for cross-sectional data
A GAM is obtained if the $x_j$, $j = 1, \ldots, p$, are univariate and continuous and the $f_j$ are smooth functions. In BayesX the functions $f_j$ are modelled either by random walk priors or P-splines, see subsection 3.1.


Generalized additive mixed model (GAMM) for longitudinal data
Consider longitudinal data for individuals $i = 1, \ldots, n$, observed at time points $t \in \{t_1, t_2, \ldots\}$. For notational simplicity we assume the same time points for every individual, but generalizations to individual-specific time points are obvious. A GAMM extends a GAM by introducing individual-specific random effects, i.e.

$$\eta_{it} = f_1(x_{it1}) + \cdots + f_k(x_{itk}) + b_{1i} w_{it1} + \cdots + b_{qi} w_{itq} + u_{it}'\gamma,$$

where $\eta_{it}$, $x_{it1}, \ldots, x_{itk}$, $w_{it1}, \ldots, w_{itq}$, $u_{it}$ are predictor and covariate values for individual $i$ at time $t$ and $b_i = (b_{1i}, \ldots, b_{qi})'$ is a vector of $q$ i.i.d. random intercepts (if $w_{itj} = 1$) or random slopes. The random effects components are modelled by i.i.d. Gaussian priors, see subsection 3.3. GAMMs can be subsumed into (3) by defining $r = (i, t)$, $x_{rj} = x_{itj}$, $j = 1, \ldots, k$, $x_{r,k+h} = w_{ith}$, and $f_{k+h}(x_{r,k+h}) = b_{hi} w_{ith}$, $h = 1, \ldots, q$. Similarly, GAMMs for cluster data can be written in the general form (3).

Geoadditive models
In many situations additional geographic information is available for the observations in the data set. As an example, compare the demonstrating example on determinants of childhood undernutrition in Zambia in the tutorial manual. The district where the mother of a child lives is given as a covariate and may be used as an indicator for regional differences in the health status of children. A reasonable predictor for such data is given by

$$\eta_r = f_1(x_{r1}) + \cdots + f_k(x_{rk}) + f_{spat}(s_r) + u_r'\gamma, \qquad (4)$$

where $f_{spat}$ is an additional spatially correlated effect of the location $s_r$ an observation pertains to. Models with a predictor that contains a spatial effect are also called geoadditive models, see Kammann and Wand (2003). In BayesX, spatial effects can be modelled by Markov random fields (Besag, York and Mollie, 1991), stationary Gaussian random fields (kriging, Kammann and Wand, 2003), or two-dimensional P-splines (Lang and Brezger, 2006), compare subsection 3.2.

    Varying coefficient model (VCM) - Geographically weighted regression

A VCM as proposed by Hastie and Tibshirani (1993) is defined by

$$\eta_r = g_1(w_{r1}) z_{r1} + \cdots + g_p(w_{rp}) z_{rp},$$

where the effect modifiers $w_{rj}$ are continuous covariates or time scales and the interacting variables $z_{rj}$ are either continuous or categorical. A VCM can be cast into the general form (3) with $x_{rj} = (w_{rj}, z_{rj})$ and by defining the special function $f_j(x_{rj}) = f_j(w_{rj}, z_{rj}) = g_j(w_{rj}) z_{rj}$. Note that in BayesX the effect modifiers are not necessarily restricted to be continuous variables as in Hastie and Tibshirani (1993). For example, the geographical location may be used as effect modifier as well. VCMs with spatially varying regression coefficients are well known in the geography literature as geographically weighted regression, see e.g. Fotheringham, Brunsdon and Charlton (2002).

ANOVA type interaction model
Suppose $w_r$ and $z_r$ are two continuous covariates. Then the joint effect of $w_r$ and $z_r$ may be modelled by a predictor of the form

$$\eta_r = f_1(w_r) + f_2(z_r) + f_{1|2}(w_r, z_r) + \ldots,$$

see e.g. Chen (1993). The functions $f_1$ and $f_2$ are the main effects of the two covariates and $f_{1|2}$ is a two-dimensional interaction surface, which can be modelled e.g. by two-dimensional P-splines, see subsection 3.4. The interaction can be cast into the form (3) by defining $x_{r1} = w_r$, $x_{r2} = z_r$ and $x_{r3} = (w_r, z_r)$.


At first sight it may look strange to use one general notation for nonlinear functions of continuous covariates, i.i.d. random intercepts and slopes, and spatially correlated effects as in (3). However, the unified treatment of the different components in the model has several advantages:

Since we adopt a Bayesian perspective, it is generally not necessary to distinguish between fixed and random effects, because in a Bayesian approach all unknown parameters are assumed to be random.

As we will see below in section 3, the priors for smooth functions, two-dimensional surfaces, and i.i.d., serially and spatially correlated effects can be cast into a general form as well.

The general form of both predictors and priors allows for rather general and unified estimation procedures, see section 4. As a side effect, the implementation and description of these procedures is considerably facilitated.

    3 Prior assumptions

For Bayesian inference, the unknown functions $f_1, \ldots, f_p$ in predictor (3), or more exactly the corresponding vectors of function evaluations, and the fixed effects parameters $\gamma$ are considered as random variables and must be supplemented by appropriate prior assumptions.

In the absence of any prior knowledge, diffuse priors are the appropriate choice for fixed effects parameters, i.e.

$$p(\gamma_j) \propto const.$$

Another common choice, not yet supported by BayesX, are informative multivariate Gaussian priors with mean $\mu_0$ and covariance matrix $\Sigma_0$.

Priors for the unknown functions $f_1, \ldots, f_p$ depend on the type of the covariates and on prior beliefs about the smoothness of $f_j$. In the following we express the vector of function evaluations $f_j = (f_j(x_{1j}), \ldots, f_j(x_{nj}))'$ of a function $f_j$ as the matrix product of a design matrix $X_j$ and a vector of unknown parameters $\beta_j$, i.e.

$$f_j = X_j \beta_j. \qquad (5)$$

Then, we obtain the predictor (3) in matrix notation as

$$\eta = X_1\beta_1 + \cdots + X_p\beta_p + U\gamma, \qquad (6)$$

where $U$ corresponds to the usual design matrix for fixed effects.

A prior for a function $f_j$ is defined by specifying a suitable design matrix $X_j$ and a prior distribution for the vector $\beta_j$ of unknown parameters. The general form of the prior for $\beta_j$ is given by

$$p(\beta_j \mid \tau_j^2) \propto \left(\frac{1}{\tau_j^2}\right)^{rank(K_j)/2} \exp\left(-\frac{1}{2\tau_j^2}\, \beta_j' K_j \beta_j\right), \qquad (7)$$

where $K_j$ is a penalty matrix that shrinks parameters towards zero or penalizes too abrupt jumps between neighboring parameters. In most cases $K_j$ will be rank deficient and therefore the prior for $\beta_j$ is partially improper.

The variance parameter $\tau_j^2$ is equivalent to the inverse smoothing parameter in a frequentist approach and controls the trade-off between flexibility and smoothness.

    In the following we will describe specific priors for different types of covariates and functions fj.


    3.1 Priors for continuous covariates and time scales

Several alternatives have been proposed for specifying smoothness priors for continuous covariates or time scales. These are random walk priors or, more generally, autoregressive priors (see Fahrmeir and Lang, 2001a, and Fahrmeir and Lang, 2001b), Bayesian P-splines (Lang and Brezger, 2006) and Bayesian smoothing splines (Hastie and Tibshirani, 2000). BayesX supports random walk priors and P-splines.

    3.1.1 Random walks

Suppose first that $x_j$ is a time scale or a continuous covariate with equally spaced ordered observations

$$x_j^{(1)} < x_j^{(2)} < \cdots < x_j^{(M_j)}.$$

Here $M_j \le n$ denotes the number of different observed values for $x_j$ in the data set. A common approach in dynamic or state space models is to estimate one parameter $\beta_{jm}$ for each distinct value $x_j^{(m)}$, i.e. $f_j(x_j^{(m)}) = \beta_{jm}$, and to penalize too abrupt jumps between successive parameters using random walk priors. For example, first and second order random walk models are given by

$$\beta_{jm} = \beta_{j,m-1} + u_{jm} \quad\text{and}\quad \beta_{jm} = 2\beta_{j,m-1} - \beta_{j,m-2} + u_{jm} \qquad (8)$$

with Gaussian errors $u_{jm} \sim N(0, \tau_j^2)$ and diffuse priors $p(\beta_{j1}) \propto const$, and $p(\beta_{j1})$ and $p(\beta_{j2}) \propto const$, for initial values, respectively. Both specifications act as smoothness priors that penalize too rough functions $f_j$. A first order random walk penalizes abrupt jumps $\beta_{jm} - \beta_{j,m-1}$ between successive states, while a second order random walk penalizes deviations from the linear trend $2\beta_{j,m-1} - \beta_{j,m-2}$. The joint distribution of the regression parameters $\beta_j$ is easily computed as the product of the conditional densities defined by (8) and can be brought into the general form (7). The penalty matrix is of the form $K_j = D'D$, where $D$ is a first or second order difference matrix. For example, for a random walk of first order the penalty matrix is given by

$$K_j = \begin{pmatrix} 1 & -1 & & & \\ -1 & 2 & -1 & & \\ & \ddots & \ddots & \ddots & \\ & & -1 & 2 & -1 \\ & & & -1 & 1 \end{pmatrix}.$$
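The penalty matrix $K_j = D'D$ can be generated mechanically from a difference matrix. A minimal NumPy sketch (illustrative, not BayesX code):

```python
import numpy as np

def rw_penalty(M, order=1):
    """Penalty matrix K = D'D of a first or second order random walk
    prior with M parameters; D is the difference matrix of that order."""
    D = np.diff(np.eye(M), n=order, axis=0)   # (M - order) x M
    return D.T @ D

K1 = rw_penalty(5, order=1)
K2 = rw_penalty(6, order=2)
```

K1 reproduces the band matrix displayed above; its rank is $M - 1$ for a first order and $M - 2$ for a second order random walk, which is why prior (7) is partially improper.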

The design matrix $X_j$ is a simple 0/1 matrix where the number of columns equals the number of parameters, i.e. the number of distinct covariate values. If for the $r$-th observation $x_{rj} = x_j^{(l)}$, the element in the $r$-th row and $l$-th column of $X_j$ is one, and zero otherwise.

In case of non-equally spaced observations, slight modifications of the priors defined in (8) are necessary, see Fahrmeir and Lang (2001a) for details.

If $x_j$ is a time scale we may introduce an additional seasonal effect of $x_j$. A common smoothness prior for a seasonal component $f_j(x_j^{(m)}) = \beta_{jm}$ is given by

$$\beta_{jm} = -\beta_{j,m-1} - \cdots - \beta_{j,m-per+1} + u_{jm}, \qquad (9)$$

where $u_{jm} \sim N(0, \tau_j^2)$ and $per$ is the period of the seasonal effect (e.g. $per = 12$ for monthly data). Compared to a dummy variable approach, this specification has the advantage that it allows for a time varying rather than a time constant seasonal effect.
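The seasonal prior (9) states that every sum of $per$ successive parameters behaves like white noise, so its penalty matrix is again of the form $K = D'D$, now with rows of $D$ containing $per$ consecutive ones. A small NumPy sketch (illustrative only):

```python
import numpy as np

def seasonal_penalty(M, per=12):
    """K = D'D where each row of D sums `per` consecutive parameters;
    prior (9) says these window sums are N(0, tau^2) white noise."""
    D = np.zeros((M - per + 1, M))
    for i in range(M - per + 1):
        D[i, i:i + per] = 1.0
    return D.T @ D

K = seasonal_penalty(24, per=12)   # two years of monthly parameters
```

Any pattern whose sums over every window of length $per$ vanish lies in the nullspace of $K$ and is therefore not penalized; this is what allows the seasonal effect to vary over time.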


    3.1.2 P-splines

A second approach for effects of continuous covariates, closely related to random walk models, is based on P-splines introduced by Eilers and Marx (1996). The approach assumes that an unknown smooth function $f_j$ of a covariate $x_j$ can be approximated by a polynomial spline of degree $l$ defined by a set of equally spaced knots

$$x_j^{min} = \zeta_0 < \zeta_1 < \cdots < \zeta_{d-1} < \zeta_d = x_j^{max}$$

within the domain of $x_j$. Such a spline can be written as a linear combination of $M_j = d + l$ B-spline basis functions $B_m$, i.e.

$$f_j(x_j) = \sum_{m=1}^{M_j} \beta_{jm} B_m(x_j).$$

In this case, $\beta_j = (\beta_{j1}, \ldots, \beta_{jM_j})'$ corresponds to the vector of unknown regression coefficients and the $n \times M_j$ design matrix $X_j$ consists of the basis functions evaluated at the observations $x_{rj}$, i.e. $X_j[r, m] = B_m(x_{rj})$. The crucial point is the choice of the number of knots. For a small number of knots, the resulting spline may not be flexible enough to capture the variability of the data. For a large number of knots, estimated curves tend to overfit the data and, as a result, too rough functions are obtained. As a remedy, Eilers and Marx (1996) suggest a moderately large number of equally spaced knots (usually between 20 and 40) to ensure enough flexibility, and to define a roughness penalty based on first or second order differences of adjacent B-spline coefficients to guarantee sufficient smoothness of the fitted curves. This leads to penalized likelihood estimation with penalty terms

$$P(\beta_j) = \lambda_j \sum_{m=k+1}^{M_j} (\Delta^k \beta_{jm})^2, \quad k = 1, 2, \qquad (10)$$

where $\lambda_j$ is the smoothing parameter. First order differences penalize abrupt jumps $\beta_{jm} - \beta_{j,m-1}$ between successive parameters, while second order differences penalize deviations from the linear trend $2\beta_{j,m-1} - \beta_{j,m-2}$. In a Bayesian approach we use the stochastic analogue of difference penalties, i.e. first or second order random walks, as priors for the regression coefficients. Note that simple first or second order random walks can be regarded as P-splines of degree $l = 0$ and are therefore included as a special case. More details about Bayesian P-splines can be found in Lang and Brezger (2004) and Brezger and Lang (2006).
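A minimal NumPy sketch of the penalized likelihood idea for a Gaussian response (not BayesX code); the basis construction follows the well-known truncated power trick of Eilers and Marx, and all data and tuning values are hypothetical:

```python
import math
import numpy as np

def bspline_basis(x, xl, xr, ndx=20, deg=3):
    """Equally spaced B-spline basis on [xl, xr] with ndx knot intervals,
    built from truncated power functions (Eilers & Marx construction);
    returns an n x (ndx + deg) design matrix."""
    dx = (xr - xl) / ndx
    knots = xl + dx * np.arange(-deg, ndx + deg + 1)
    P = (x[:, None] - knots[None, :]) ** deg * (x[:, None] > knots[None, :])
    D = np.diff(np.eye(len(knots)), n=deg + 1, axis=0) / (math.factorial(deg) * dx ** deg)
    return ((-1) ** (deg + 1)) * P @ D.T

# Penalized least squares fit of a noisy curve with a second order
# difference penalty, cf. Eq. (10); lam plays the role of lambda_j
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 1, 100))
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, 100)
B = bspline_basis(x, 0.0, 1.0)
D2 = np.diff(np.eye(B.shape[1]), n=2, axis=0)
lam = 1.0
beta = np.linalg.solve(B.T @ B + lam * D2.T @ D2, B.T @ y)
fit = B @ beta
```

With `deg=0` the same function reproduces the piecewise constant basis of a plain random walk model, matching the remark above that random walks are degree-0 P-splines.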

    3.2 Priors for spatial effects

Suppose that the index $s \in \{1, \ldots, S\}$ represents the location or site in connected geographical regions. For simplicity we assume that the regions are labelled consecutively. A common way to introduce a spatially correlated effect is to assume that neighboring sites are more alike than arbitrary sites. Thus, for a valid prior definition, a set of neighbors must be defined for each site $s$. For geographical data one usually assumes that two sites $s$ and $s'$ are neighbors if they share a common boundary.

The simplest (but most frequently used) spatial smoothness prior for the function evaluations $f_j(s) = \beta_{js}$ is given by

$$\beta_{js} \mid \beta_{js'},\ s' \neq s,\ \tau_j^2 \;\sim\; N\left(\frac{1}{N_s}\sum_{s' \sim s} \beta_{js'},\; \frac{\tau_j^2}{N_s}\right), \qquad (11)$$

where $N_s$ is the number of adjacent sites and $s' \sim s$ denotes that site $s'$ is a neighbor of site $s$. Hence, the (conditional) mean of $\beta_{js}$ is an unweighted average of the function evaluations of neighboring sites. The prior is a direct generalization of a first order random walk to two dimensions and is called a Markov random field (MRF).

A more general prior including (11) as a special case is given by

$$\beta_{js} \mid \beta_{js'},\ s' \neq s,\ \tau_j^2 \;\sim\; N\left(\sum_{s' \sim s} \frac{w_{ss'}}{w_{s+}}\, \beta_{js'},\; \frac{\tau_j^2}{w_{s+}}\right), \qquad (12)$$

where the $w_{ss'}$ are known weights and '+' denotes summation over the missing subscript. Such a prior is called a Gaussian intrinsic autoregression, see Besag, York and Mollie (1991) and Besag and Kooperberg (1995). Weights other than $w_{ss'} = 1$ as in (11) can be based on the common boundary length of neighboring sites, or on the distance between the centroids of two sites. All these spatial priors are supported by BayesX, see chapter 5 of the reference manual for more details.

The $n \times S$ design matrix $X$ is a 0/1 incidence matrix. Its value in the $i$-th row and the $s$-th column is 1 if the $i$-th observation is located in site or region $s$, and zero otherwise. The $S \times S$ penalty matrix $K$ has the form of an adjacency matrix.
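Concretely, the MRF prior (11) corresponds to a penalty matrix with the number of neighbors $N_s$ on the diagonal and $-1$ for each adjacent pair. A small sketch with a hypothetical 2 x 2 map of regions (illustrative, not BayesX code):

```python
import numpy as np

def mrf_penalty(neighbors):
    """Penalty matrix of the MRF prior (11): K[s,s] = N_s (number of
    neighbors), K[s,s'] = -1 if s and s' are adjacent, 0 otherwise."""
    S = len(neighbors)
    K = np.zeros((S, S))
    for s, nbrs in neighbors.items():
        K[s, s] = len(nbrs)
        for t in nbrs:
            K[s, t] = -1.0
    return K

# Hypothetical map: four regions on a 2 x 2 grid, rook neighborhood
nbr = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2]}
K = mrf_penalty(nbr)
```

For a connected map the rank deficiency of $K$ is exactly one (the overall level of the spatial effect is unpenalized), so the prior (7) is partially improper, as noted above.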

If exact locations $s = (s_x, s_y)$ are available, we can use two-dimensional surface estimators to model spatial effects. One option are two-dimensional P-splines, see subsection 3.4. Another option are Gaussian random field (GRF) priors, originating from geostatistics. These can also be interpreted as two-dimensional surface smoothers based on radial basis functions and have been employed by Kammann and Wand (2003) to model the spatial component in Gaussian regression models. The spatial component $f_j(s) = \beta_s$ is assumed to follow a zero mean stationary Gaussian random field $\{\beta_s : s \in \mathbb{R}^2\}$ with variance $\tau_j^2$ and isotropic correlation function $cov(\beta_s, \beta_{s+h}) = C(||h||)$. This means that correlations between sites that are $||h||$ units apart are the same, regardless of direction and of the sites' location. For a finite array $s \in \{1, \ldots, S\}$ of sites, as in image analysis, the prior for $\beta_j = (\beta_1, \ldots, \beta_S)'$ is of the general form (7) with $K = C^{-1}$ and

$$C[i, j] = C(||s_i - s_j||), \quad 1 \le i, j \le S.$$

The design matrix $X$ is again a 0/1 incidence matrix.

Several proposals for the choice of the correlation function $C(r)$ have been made. In the kriging literature, the Matérn family $C(r; \phi, \nu)$ is highly recommended. For prechosen values $\nu = m + 1/2$, $m = 0, 1, 2, \ldots$, of the smoothness parameter $\nu$, simple correlation functions $C(r; \phi)$ are obtained, e.g.

$$C(r; \phi) = \exp(-|r/\phi|)(1 + |r/\phi|)$$

with $\nu = 1.5$. The parameter $\phi$ controls how fast correlations die out with increasing $r = ||h||$. It can be determined in a preprocessing step or may be estimated jointly with the variance components by restricted maximum likelihood. A simple rule, which also ensures scale invariance of the estimates, is to choose $\phi$ as

$$\phi = \max_{i,j} ||s_i - s_j|| / c.$$

The constant $c > 0$ is chosen in such a way that $C(c)$ is small, e.g. 0.001. Therefore the different values of $||s_i - s_j||/\phi$ are spread out over the $r$-axis of the correlation function. This choice of $\phi$ has proved to work well in our experience.
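A minimal NumPy sketch of this scale rule (illustrative only): with $\phi = \max_{i,j}||s_i - s_j||/c$ and $C(c; 1) = 0.001$, the most distant pair of sites receives correlation 0.001.

```python
import numpy as np

def matern15(r, phi):
    """Matern correlation with smoothness nu = 1.5:
    C(r; phi) = exp(-|r/phi|) * (1 + |r/phi|)."""
    a = np.abs(r) / phi
    return np.exp(-a) * (1.0 + a)

# Find c with C(c; 1) = 0.001 by bisection (C is strictly decreasing)
lo, hi = 1e-6, 50.0
for _ in range(100):
    mid = 0.5 * (lo + hi)
    if matern15(mid, 1.0) > 0.001:
        lo = mid
    else:
        hi = mid
c = 0.5 * (lo + hi)

# Scale rule phi = max_{i,j} ||s_i - s_j|| / c for hypothetical sites
rng = np.random.default_rng(2)
sites = rng.uniform(0, 10, size=(30, 2))
dist = np.linalg.norm(sites[:, None, :] - sites[None, :, :], axis=-1)
phi = dist.max() / c
C = matern15(dist, phi)     # correlation matrix; the prior uses K = C^{-1}
```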

Although we described them separately, approaches for exact locations can also be used in the case of connected geographical regions, e.g. based on the centroids of the regions. Conversely, we can also apply MRFs to exact locations if neighborhoods are defined based on a distance measure. In general, it is not clear which of the different approaches leads to the best fits. For data observed on a discrete lattice, MRFs seem to be most appropriate. If the exact locations are available, surface estimators may be more natural, particularly because predictions for unobserved locations are available. However, in some situations surface estimators lead to an improved fit compared to MRFs even for discrete lattices, and vice versa. A general approach that can handle both situations is given by Muller et al. (1997).

From a computational point of view, MRFs and P-splines are preferable to GRFs because their posterior precision matrices are band matrices or can be transformed into a band matrix like structure. This special structure considerably speeds up computations, at least for inference based on MCMC techniques. For inference based on mixed models, the main difference between GRFs and MRFs, considering their numerical properties, is the dimension of the penalty matrix. For MRFs the dimension of $K$ equals the number of different regions $S$ and is therefore independent of the sample size. On the other hand, for GRFs, the dimension of $K$ is given by the number of distinct locations, which is usually close to the sample size. Therefore, the number of regression coefficients used to describe a MRF is usually much smaller than for a GRF, and hence the estimation of GRFs is computationally much more expensive. To overcome this difficulty, Kammann and Wand (2003) propose low-rank kriging to approximate stationary Gaussian random fields. Note first that we can define GRFs equivalently based on a design matrix $X$ with entries $X[i, j] = C(||s_i - s_j||)$ and penalty matrix $K = C$. To reduce the dimensionality of the estimation problem we define a subset of knots $D = \{\kappa_1, \ldots, \kappa_M\}$ of the set of distinct locations $\mathcal{C}$. These knots can be chosen to be representative for the set of distinct locations $\mathcal{C}$ based on a space filling algorithm. Therefore consider the distance measure

$$d(s, D) = \left(\sum_{\kappa \in D} ||s - \kappa||^p\right)^{1/p}$$

with $p < 0$, between any location $s \in \mathcal{C}$ and a possible set of knots $D \subset \mathcal{C}$. Obviously this distance measure is zero for all knots. Using a simple swapping algorithm to minimize the overall coverage criterion

$$\left(\sum_{s \in \mathcal{C}} d(s, D)^q\right)^{1/q}$$

with $q > 0$ (compare Johnson et al. (1990) and Nychka and Saltzman (1998) for details) yields an optimal set of knots $D$. Based on these knots we define the approximation $f_j = \tilde{X}_j \beta_j$ with the $n \times M$ design matrix $\tilde{X}[i, j] = C(||s_i - \kappa_j||)$, penalty matrix $K = \tilde{C}$, and $\tilde{C}[i, j] = C(||\kappa_i - \kappa_j||)$. The number of knots $M$ controls the trade-off between accuracy of the approximation ($M$ close to the sample size) and numerical simplification ($M$ small).
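The coverage criterion and a naive, brute-force version of the swapping idea can be sketched as follows (illustrative only; the actual algorithms of Johnson et al. and Nychka and Saltzman are more refined, and all data here are hypothetical):

```python
import numpy as np

def d_sD(sites, D_idx, p=-10.0):
    """Distance d(s, D) = (sum_{kappa in D} ||s - kappa||^p)^(1/p), p < 0;
    evaluates to zero when s itself is a knot."""
    r = np.linalg.norm(sites[:, None, :] - sites[D_idx][None, :, :], axis=-1)
    with np.errstate(divide='ignore'):
        return (r ** p).sum(axis=1) ** (1.0 / p)

def coverage(sites, D_idx, p=-10.0, q=10.0):
    """Overall coverage criterion (sum_s d(s, D)^q)^(1/q)."""
    return (d_sD(sites, D_idx, p) ** q).sum() ** (1.0 / q)

def space_filling_knots(sites, M, p=-10.0, q=10.0, seed=0):
    """Naive swapping: exchange a knot for a non-knot whenever the
    coverage criterion decreases, until no single swap helps."""
    rng = np.random.default_rng(seed)
    S = len(sites)
    D_idx = list(rng.choice(S, size=M, replace=False))
    best = coverage(sites, D_idx, p, q)
    improved = True
    while improved:
        improved = False
        for i in range(M):
            for s in range(S):
                if s in D_idx:
                    continue
                cand = D_idx.copy()
                cand[i] = s
                val = coverage(sites, cand, p, q)
                if val < best:
                    D_idx, best = cand, val
                    improved = True
    return np.array(D_idx), best

rng = np.random.default_rng(4)
sites = rng.uniform(0, 1, size=(40, 2))   # hypothetical distinct locations
knots, crit = space_filling_knots(sites, M=5)
```

The returned knot set is a local optimum: no single exchange of a knot for a non-knot location lowers the coverage criterion.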

    3.3 Unordered group indicators and unstructured spatial effects

In many situations we observe the problem of heterogeneity among clusters of observations caused by unobserved covariates. Neglecting unobserved heterogeneity may lead to considerably biased estimates for the remaining effects as well as false standard error estimates. Suppose $c \in \{1, \ldots, C\}$ is a cluster variable indicating the cluster a particular observation belongs to. A common approach to overcome the difficulties of unobserved heterogeneity is to introduce additional Gaussian i.i.d. effects $f_j(c) = \beta_{jc}$ with

$$\beta_{jc} \sim N(0, \tau_j^2), \quad c = 1, \ldots, C. \qquad (13)$$

The design matrix $X_j$ is again an $n \times C$-dimensional 0/1 incidence matrix that represents the grouping structure of the data, while the penalty matrix is simply the identity matrix, i.e. $K_j = I$. From a classical perspective, (13) defines i.i.d. random effects. However, from a Bayesian point of view all unknown parameters are assumed to be random, and hence the notion "random effects" is misleading in this context. One may therefore also think of (13) as an approach for modelling an unsmooth function.


Prior (13) may also be used for a more sophisticated modelling of spatial effects. In some situations it may be useful to split up the spatial effect $f_{spat}$ into a spatially correlated (structured) part $f_{str}$ and a spatially uncorrelated (unstructured) part $f_{unstr}$, i.e.

$$f_{spat} = f_{str} + f_{unstr}.$$

A rationale is that a spatial effect is usually a surrogate for many unobserved influential factors, some of which obey a strong spatial structure while others are present only locally. By estimating a structured and an unstructured component we aim at distinguishing between the two kinds of influential factors, see also Besag, York and Mollie (1991). For the smooth spatial part we can assume any of the spatial priors discussed in subsection 3.2. For the uncorrelated part we may assume prior (13).

    3.4 Modelling interactions

The models considered so far are not appropriate for modelling interactions between covariates. A common approach is based on varying coefficient models, introduced by Hastie and Tibshirani (1993) in the context of smoothing splines. Here, the effect of covariate $z_{rj}$, $j = 1, \ldots, p$, is assumed to vary smoothly over the range of a second covariate $w_{rj}$, i.e.

$$f_j(w_{rj}, z_{rj}) = g_j(w_{rj}) z_{rj}. \qquad (14)$$

In most cases the interacting covariate $z_{rj}$ is categorical, whereas the effect modifier may be either continuous, spatial or an unordered group indicator. For the nonlinear function $g_j$ we can assume any of the priors already defined in subsection 3.1 for continuous effect modifiers, subsection 3.2 for spatial effect modifiers, and subsection 3.3 for unordered group indicators as effect modifiers. In Hastie and Tibshirani (1993) continuous effect modifiers have been considered exclusively. Models with spatial effect modifiers are used in Fahrmeir, Lang, Wolff and Bender (2003) and Gamerman, Moreira and Rue (2003). From a frequentist point of view, models with unordered group indicators as effect modifiers are called models with random slopes.

In matrix notation we obtain

$$f_j = diag(z_{1j}, \ldots, z_{nj})\, \tilde{X}_j \beta_j$$

for the vector of function evaluations, where $\tilde{X}_j$ is the design matrix corresponding to the prior for $g_j$. Hence the overall design matrix is given by $X_j = diag(z_{1j}, \ldots, z_{nj})\, \tilde{X}_j$.

Suppose now that both interacting covariates are continuous. In this case, a flexible approach for modelling interactions can be based on (nonparametric) two-dimensional surface fitting. In BayesX, surface fitting is based on two-dimensional P-splines, described in more detail in Lang and Brezger (2004) and Brezger and Lang (2006). The unknown surface $f_j(w_{rj}, z_{rj})$ is approximated by the tensor product of two one-dimensional B-splines, i.e.

$$f_j(w_{rj}, z_{rj}) = \sum_{m_1=1}^{M_j} \sum_{m_2=1}^{M_j} \beta_{j,m_1m_2} B_{j,m_1}(w_{rj}) B_{j,m_2}(z_{rj}).$$

In analogy to one-dimensional P-splines, the $n \times M_j^2$ design matrix $X_j$ is composed of products of basis functions. Priors for $\beta_j = (\beta_{j,11}, \ldots, \beta_{j,M_jM_j})'$ can be based on spatial smoothness priors common in spatial statistics, e.g. two-dimensional first order random walks. The most commonly used prior specification, based on the four nearest neighbors, is defined by

$$\beta_{j,m_1m_2} \mid \cdot \;\sim\; N\left(\frac{1}{4}(\beta_{j,m_1-1,m_2} + \beta_{j,m_1+1,m_2} + \beta_{j,m_1,m_2-1} + \beta_{j,m_1,m_2+1}),\; \frac{\tau_j^2}{4}\right) \qquad (15)$$

for $m_1, m_2 = 2, \ldots, M_j - 1$ and appropriate edge corrections. This prior, as well as higher order bivariate random walks, can easily be brought into the general form (7).
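The tensor product design matrix is a row-wise Kronecker product of the two marginal design matrices, and the 2D first order random walk penalty can be written as a Kronecker sum whose interior rows reproduce the four-nearest-neighbor prior (15). A small NumPy sketch (illustrative only; degree-0 marginal bases are used for brevity):

```python
import numpy as np

def tensor_design(B1, B2):
    """Row-wise tensor product: row r holds all products B1[r,m1]*B2[r,m2],
    i.e. the evaluated tensor product basis for observation r."""
    return np.einsum('rm,rk->rmk', B1, B2).reshape(B1.shape[0], -1)

# Hypothetical marginal bases: degree-0 B-splines (interval indicators)
rng = np.random.default_rng(3)
w = rng.uniform(0, 1, 50)
z = rng.uniform(0, 1, 50)
edges = np.linspace(0, 1, 6)          # 5 intervals per dimension
B1 = (np.digitize(w, edges[1:-1])[:, None] == np.arange(5)[None, :]).astype(float)
B2 = (np.digitize(z, edges[1:-1])[:, None] == np.arange(5)[None, :]).astype(float)
T = tensor_design(B1, B2)             # 50 x 25 design matrix

# 2D first order random walk penalty as a Kronecker sum; interior
# diagonal entries equal 4, matching the conditional variance tau^2/4
M = 5
D = np.diff(np.eye(M), axis=0)
K1 = D.T @ D
K2d = np.kron(np.eye(M), K1) + np.kron(K1, np.eye(M))
```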


    3.5 Mixed Model representation

You may skip this section if you are not interested in using the regression tool based on mixed model methodology (remlreg objects).

In this section we show how STAR models can be represented as generalized linear mixed models (GLMMs) after an appropriate reparametrization, see also Lin and Zhang (1999) or Green (1987) in the context of smoothing splines. In fact, model (2) with the structured additive predictor (6) can always be expressed as a GLMM. This provides the key for the simultaneous estimation of the functions f_j, j = 1, ..., p, and the variance parameters τ_j² in the empirical Bayes approach described in subsection 4.2 and used for estimation by remlreg objects. To rewrite the model as a GLMM, the general model formulation again proves to be useful. We proceed as follows:

The vectors of regression coefficients β_j, j = 1, ..., p, are decomposed into an unpenalized and a penalized part. Suppose that the j-th coefficient vector has dimension M_j × 1 and that the corresponding penalty matrix K_j has rank k_j. Then we define the decomposition

    β_j = X_j^unp β_j^unp + X_j^pen β_j^pen,    (16)

where the columns of the M_j × (M_j - k_j) matrix X_j^unp contain a basis of the null space of K_j. The M_j × k_j matrix X_j^pen is given by X_j^pen = L_j(L_j' L_j)^{-1}, where the M_j × k_j matrix L_j is determined by the decomposition of the penalty matrix K_j into K_j = L_j L_j'. A requirement for the decomposition is that L_j' X_j^unp = 0 and (X_j^unp)' L_j = 0 hold. Hence, the parameter vector β_j^unp represents the part of β_j which is not penalized by K_j, whereas the vector β_j^pen represents the deviation of β_j from the null space of K_j.

In general, the decomposition K_j = L_j L_j' is obtained from the spectral decomposition K_j = Γ_j Ω_j Γ_j'. The (k_j × k_j) diagonal matrix Ω_j contains the positive eigenvalues ω_jm, m = 1, ..., k_j, of K_j in descending order, i.e. Ω_j = diag(ω_j1, ..., ω_j,kj). Γ_j is the (M_j × k_j) orthogonal matrix of the corresponding eigenvectors. From the spectral decomposition we can choose L_j = Γ_j Ω_j^{1/2}. In some cases a more favorable decomposition can be found. For instance, for P-splines a simpler choice for L_j is given by L_j = D', where D is the first or second order difference matrix. Of course, for prior (13) of subsection 3.3, and in general for proper priors, a decomposition of K_j is not necessary. In this case the unpenalized part vanishes completely.

The matrix X_j^unp is the identity vector 1 for P-splines with first order random walk penalty and for Markov random fields. For P-splines with second order random walk penalty, X_j^unp is a two-column matrix whose first column again equals the identity vector, while the second column is composed of the (equidistant) knots of the spline.
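The decomposition can be sketched numerically; the example below uses a hypothetical second order random walk penalty (so the null space has dimension two) and is only an illustration of the algebra, not BayesX code:

```python
import numpy as np

# second order difference matrix D for M = 5 coefficients
M = 5
D = np.diff(np.eye(M), n=2, axis=0)          # (M-2) x M
K = D.T @ D                                   # penalty matrix of rank M-2

# spectral decomposition K = Gamma Omega Gamma'
eigval, eigvec = np.linalg.eigh(K)
pos = eigval > 1e-10
L = eigvec[:, pos] * np.sqrt(eigval[pos])     # K = L L'
X_unp = eigvec[:, ~pos]                       # basis of the null space of K
X_pen = L @ np.linalg.inv(L.T @ L)            # X^pen = L (L'L)^{-1}
```

For a second order random walk, the two null space columns span the constant and linear trends, which matches the description of X_j^unp above.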

From the decomposition (16) we get

    (1/τ_j²) β_j' K_j β_j = (1/τ_j²) (β_j^pen)' β_j^pen

and from the general prior (7) for β_j it follows that

    p(β_jm^unp) ∝ const,   m = 1, ..., M_j - k_j,

and

    β_j^pen ~ N(0, τ_j² I).    (17)

Finally, by defining the matrices U_j = X_j X_j^unp and X̃_j = X_j X_j^pen, we can rewrite the predictor (6)


as

    η = Σ_{j=1}^p X_j β_j + U γ = Σ_{j=1}^p (U_j β_j^unp + X̃_j β_j^pen) + U γ = U β^unp + X β^pen.

The design matrix X and the vector β^pen are composed of the matrices X̃_j and the vectors β_j^pen, respectively. More specifically, we obtain X = (X̃_1 X̃_2 ... X̃_p) and the stacked vector β^pen = ((β_1^pen)', ..., (β_p^pen)')'. Similarly, the matrix U and the vector β^unp are given by U = (U_1 U_2 ... U_p U) (with U on the right-hand side denoting the fixed effects design matrix) and β^unp = ((β_1^unp)', ..., (β_p^unp)', γ')'.

Finally, we obtain a GLMM with fixed effects β^unp and random effects β^pen ~ N(0, Λ), where Λ = diag(τ_1², ..., τ_1², ..., τ_p², ..., τ_p²). Hence, we can utilize GLMM methodology for the simultaneous estimation of the smooth functions and the variance parameters τ_j², see the next section.

The mixed model representation also enables us to examine the identification problem inherent to nonparametric regression from a different perspective. For most types of nonparametric effects, the design matrix U_j for the unpenalized part contains the identity vector. Provided that there is at least one such nonlinear effect and that η contains an intercept, the matrix U does not have full column rank. Hence, all identity vectors in U except for the intercept have to be deleted to guarantee identifiability.

    4 Inference

BayesX provides two alternative approaches for Bayesian inference. Bayesreg objects (chapter 7 of the reference manual) estimate STAR models using MCMC simulation techniques, described in subsection 4.1. Remlreg objects (chapter 8 of the reference manual) use mixed model representations of STAR models for empirical Bayesian inference, see subsection 4.2.

    4.1 Full Bayesian inference based on MCMC techniques

This subsection may be skipped if you are not interested in using the regression tool for full Bayesian inference based on MCMC simulation techniques (bayesreg objects).

For full Bayesian inference, the unknown variance parameters τ_j² are also considered as random variables, supplemented with suitable hyperprior assumptions. In BayesX, highly dispersed (but proper) inverse gamma priors τ_j² ~ IG(a_j, b_j) are assigned to the variances. The corresponding probability density function is given by

    p(τ_j²) ∝ (τ_j²)^{-a_j - 1} exp(-b_j / τ_j²).

Using proper priors for τ_j² (with a_j > 0 and b_j > 0) ensures propriety of the joint posterior despite the partial impropriety of the priors for the β_j. A common choice for the hyperparameters are small values for a_j and b_j, e.g. a_j = b_j = 0.001, which is also the default in BayesX.
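For illustration, the log of this (unnormalized) density is easy to write down; the default a_j = b_j = 0.001 is taken from the text above:

```python
import math

def inv_gamma_log_density(tau2, a=0.001, b=0.001):
    """Log of the unnormalized IG(a, b) density:
    p(tau2) is proportional to tau2^(-a-1) * exp(-b / tau2)."""
    return (-a - 1.0) * math.log(tau2) - b / tau2

# for small a and b the prior is proper but very dispersed:
# the log density decays only slowly in tau2
d1 = inv_gamma_log_density(1.0)
d10 = inv_gamma_log_density(10.0)
```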

In some situations, the estimated nonlinear functions f_j may depend considerably on the particular choice of the hyperparameters a_j and b_j. This may be the case for a very low signal-to-noise ratio and/or a small sample size. It is therefore highly recommended to estimate all models under consideration with a (small) number of different choices for a_j and b_j in order to assess how sensitive the results are to minor changes in the model assumptions. In this sense, varying the hyperparameters can be used as a tool for model diagnostics.

Bayesian inference is based on the posterior of the model, given by

    p(β_1, ..., β_p, τ_1², ..., τ_p², γ | y) ∝ L(y, β_1, ..., β_p, γ) Π_{j=1}^p [ p(β_j | τ_j²) p(τ_j²) ],    (18)

where L(·) denotes the likelihood which, under the assumption of conditional independence, is the product of the individual likelihood contributions.

In many practical situations (and in particular for most structured additive regression models) the posterior distribution is numerically intractable. Markov chain Monte Carlo (MCMC) simulation methods overcome this problem by allowing us to draw random samples from the posterior. From these random samples, characteristics of the posterior such as posterior means, standard deviations or quantiles can be estimated by their empirical analogues. Instead of drawing samples directly from the posterior (which is impossible in most cases anyway), MCMC devises a way to construct a Markov chain with the posterior as its stationary distribution. Hence, the iterations of the transition kernel of this Markov chain converge to the posterior, yielding a sample of dependent random numbers. Usually the first part of the sample (the burn-in phase) is discarded, since the algorithm needs some time to converge. In addition, some thinning is typically applied to the Markov chain to reduce autocorrelations. In BayesX, the user can specify options for the number of burn-in iterations, the thinning parameter and the total number of iterations, see chapter 7 of the reference manual for more details.

BayesX provides a number of different sampling schemes, specifically tailored to the distribution of the response. The first sampling scheme is suitable for Gaussian responses. The second sampling scheme is particularly useful for categorical responses and uses the sampling scheme for Gaussian responses as a building block. The third sampling scheme is based on iteratively weighted least squares proposals and is used for general responses from an exponential family. A further sampling scheme, not described in this manual, is based on conditional prior proposals.

    4.1.1 Gaussian responses

Suppose first that the distribution of the response variable is Gaussian, i.e. y_i | η_i, σ² ~ N(η_i, σ²/c_i), i = 1, ..., n, or y | η, σ² ~ N(η, σ² C^{-1}), where C = diag(c_1, ..., c_n) is a known weight matrix. In this case, the full conditionals for the fixed effects as well as for the nonlinear functions f_j are multivariate Gaussian and, as a consequence, a Gibbs sampler can be employed. To be more specific, the full conditional γ | · for fixed effects with diffuse priors is Gaussian with mean

    E(γ | ·) = (U' C U)^{-1} U' C (y - η̃)    (19)

and covariance matrix

    Cov(γ | ·) = σ² (U' C U)^{-1},    (20)

where U is the design matrix of fixed effects and η̃ = η - U γ is the part of the additive predictor associated with the remaining effects in the model. Similarly, the full conditional for the regression coefficients β_j of a function f_j is Gaussian with mean

    m_j = E(β_j | ·) = ( (1/σ²) X_j' C X_j + (1/τ_j²) K_j )^{-1} (1/σ²) X_j' C (y - η̃_j),    (21)
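A draw from such a Gaussian full conditional can be sketched as follows (toy quantities throughout, not BayesX internals; the small ridge added to the penalty is only there to keep the toy precision matrix invertible):

```python
import numpy as np

rng = np.random.default_rng(1)

def draw_full_conditional(X, C, K, y_part, sigma2, tau2):
    """Draw beta_j ~ N(m_j, P_j^{-1}) with
    P_j = X'CX / sigma2 + K / tau2 and
    m_j = P_j^{-1} X'C(y - eta_tilde_j) / sigma2."""
    P = X.T @ (C @ X) / sigma2 + K / tau2
    m = np.linalg.solve(P, X.T @ (C @ y_part) / sigma2)
    # sample m + L^{-T} z, where P = L L' is the Cholesky factorization
    # of the precision matrix; the result has covariance P^{-1}
    L = np.linalg.cholesky(P)
    z = rng.standard_normal(len(m))
    return m + np.linalg.solve(L.T, z)

# toy example: first order random walk penalty on M = 4 coefficients
n, M = 6, 4
X = rng.standard_normal((n, M))
C = np.eye(n)
D = np.diff(np.eye(M), axis=0)
K = D.T @ D + 1e-6 * np.eye(M)
beta = draw_full_conditional(X, C, K, rng.standard_normal(n), 1.0, 0.5)
```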


4. Update τ_j² and σ²
   Update the variance parameters by drawing from inverse gamma full conditionals with parameters given in (23) and (24).

4.1.2 Categorical responses

For most models with categorical responses, efficient sampling schemes can be developed based on latent utility representations. The seminal paper by Albert and Chib (1993) describes algorithms for probit models with ordered categorical responses. The case of probit models with unordered categorical responses is dealt with, e.g., in Fahrmeir and Lang (2001b). Recently, a similar data augmentation approach for logit models has been presented by Holmes and Held (2006). The adaption of these sampling schemes to the STAR models used in BayesX is more or less straightforward. We briefly illustrate the concept for binary data, i.e. y_i takes only the values 0 or 1. We first assume a probit model. Conditional on the covariates and the parameters, y_i follows a Bernoulli distribution, y_i ~ B(1, π_i), with conditional mean π_i = Φ(η_i), where Φ is the cumulative distribution function of the standard normal distribution. Introducing latent variables

    L_i = η_i + ε_i    (25)

with ε_i ~ N(0, 1), we can equivalently define the binary probit model by y_i = 1 if L_i > 0 and y_i = 0 if L_i < 0. The latent variables are augmented as additional parameters and, as a consequence, an additional sampling step for updating the L_i is required. Fortunately, sampling the L_i is relatively easy and fast because the full conditionals are truncated normal distributions. More specifically, L_i | · ~ N(η_i, 1), truncated at the left by zero if y_i = 1 and truncated at the right by zero if y_i = 0. The advantage of defining the probit model through the latent variables L_i is that the full conditionals for the regression parameters β_j (and γ) are again Gaussian, with precision matrix and mean given by

    P_j = X_j' X_j + (1/τ_j²) K_j,    m_j = P_j^{-1} X_j' (L - η̃_j).    (26)

Hence, the efficient and fast sampling schemes for Gaussian responses can be used with slight modifications. Updating of β_j and γ can be done exactly as described in sampling scheme 1, using the current values L^c of the latent utilities as (pseudo) responses and setting σ² = 1 and C = I.

For binary logit models, the sampling schemes become slightly more complicated. A logit model can be expressed in terms of latent utilities by assuming ε_i ~ N(0, λ_i) in (25) with λ_i = 4ψ_i², where ψ_i follows a Kolmogorov-Smirnov distribution (Devroye, 1986). Hence, ε_i is a scale mixture of normals with a marginal logistic distribution (Andrews and Mallows, 1974). The full conditionals for the L_i are still truncated normals, with L_i | · ~ N(η_i, λ_i), but additional draws from the conditional distributions of the λ_i are necessary, see Holmes and Held (2006) for details.
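For the probit case, the truncated normal draws can be sketched with inversion sampling (a standard textbook device; this is illustrative code, not the BayesX implementation):

```python
import math
import random

random.seed(0)

def phi_cdf(x):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def phi_inv(p):
    """Inverse standard normal CDF by bisection (sufficient for a sketch)."""
    lo, hi = -10.0, 10.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if phi_cdf(mid) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def draw_latent_utility(eta, y):
    """L ~ N(eta, 1) truncated to (0, inf) if y = 1, to (-inf, 0) if y = 0."""
    u = random.random()
    t = phi_cdf(-eta)              # P(L <= 0) = P(eps <= -eta)
    if y == 1:
        p = t + u * (1.0 - t)      # uniform on the CDF range of (0, inf)
    else:
        p = u * t                  # uniform on the CDF range of (-inf, 0)
    return eta + phi_inv(p)

L1 = draw_latent_utility(0.3, 1)   # positive by construction
L0 = draw_latent_utility(0.3, 0)   # negative by construction
```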

Similar updating schemes may be developed for multinomial probit models with unordered categories and for cumulative threshold models with ordered response categories, see Fahrmeir and Lang (2001b) for details. BayesX supports both types of models. The cumulative threshold model is, however, restricted to three response categories. For multinomial logit models, updating schemes based on latent utilities are not available in BayesX.

    4.1.3 General uni- or multivariate response from an exponential family

Let us now turn our attention to general responses from an exponential family. In this case the full conditionals are no longer Gaussian, so that more refined algorithms are needed.

BayesX supports several updating schemes based on iteratively weighted least squares (IWLS) proposals, as proposed by Gamerman (1997) in the context of generalized linear mixed models. As an alternative, conditional prior proposals as proposed by Knorr-Held (1999) for estimating dynamic models are also available.

The basic idea behind IWLS proposals is to combine Fisher scoring or IWLS (e.g. Fahrmeir and Tutz, 2001) for estimating regression parameters in generalized linear models with the Metropolis-Hastings algorithm. More precisely, the goal is to approximate the full conditionals of the regression parameters β_j and γ by a Gaussian distribution, obtained by performing one Fisher scoring step in every iteration of the sampler. Suppose we want to update the regression coefficients β_j of the function f_j, with current state β_j^c of the chain. Then, according to IWLS, a new value β_j^p is proposed by drawing a random number from the multivariate Gaussian proposal distribution q(β_j^c, β_j^p) with precision matrix and mean

    P_j = X_j' W(β_j^c) X_j + (1/τ_j²) K_j,    m_j = P_j^{-1} X_j' W(β_j^c) (ỹ(β_j^c) - η̃_j).    (27)

Here, W(β_j^c) = diag(w_1, ..., w_n) is the usual weight matrix for IWLS, with weights w_i^{-1}(β_j^c) = b''(θ_i)(g'(μ_i))² obtained from the current state β_j^c. The vector η̃_j = η - X_j β_j is the part of the predictor associated with all remaining effects in the model. The working observations ỹ_i are defined as

    ỹ_i(β_j^c) = η_i + (y_i - μ_i) g'(μ_i).
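As a concrete illustration (a sketch, not BayesX internals), for a binary logit model these quantities reduce to w_i = μ_i(1 - μ_i) and ỹ_i = η_i + (y_i - μ_i)/(μ_i(1 - μ_i)):

```python
import math

def iwls_logit(eta, y):
    """IWLS weight and working observation for a binary logit model:
    w = mu(1-mu) and ytilde = eta + (y - mu) / (mu(1-mu))."""
    mu = 1.0 / (1.0 + math.exp(-eta))
    w = mu * (1.0 - mu)
    ytilde = eta + (y - mu) / w
    return w, ytilde

w, ytilde = iwls_logit(0.0, 1)
# at eta = 0: mu = 0.5, w = 0.25, ytilde = 0 + (1 - 0.5)/0.25 = 2.0
```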

    The sampling scheme can be summarized as follows:

Sampling scheme 2 (IWLS proposals):

1. Initialization
   Compute the posterior mode for β_1, ..., β_p and γ given fixed smoothing parameters λ_j = 1/τ_j². By default, BayesX uses λ_j = 0.1, but the value may be changed by the user. The mode is computed via backfitting within Fisher scoring. Use the posterior mode estimates as the initial states β_j^c, (τ_j²)^c, γ^c of the chain.

2. Update γ
   Draw a proposed new value γ^p from the Gaussian proposal density q(γ^c, γ^p) with mean

       m = (U' W(γ^c) U)^{-1} U' W(γ^c) (ỹ - η̃)

   and precision matrix

       P = U' W(γ^c) U.

   Accept γ^p as the new state of the chain with acceptance probability

       α = min{ 1, [L(y, β_1^c, ..., β_p^c, γ^p) q(γ^p, γ^c)] / [L(y, β_1^c, ..., β_p^c, γ^c) q(γ^c, γ^p)] },

   otherwise keep γ^c as the current state.

3. Update β_j
   For j = 1, ..., p, draw a proposed new value β_j^p from the Gaussian proposal density q(β_j^c, β_j^p) with mean and precision matrix given in (27). Accept β_j^p as the new state of the chain with probability

       α = min{ 1, [L(y, ..., β_j^p, (τ_j²)^c, ..., γ^c) p(β_j^p | (τ_j²)^c) q(β_j^p, β_j^c)] / [L(y, ..., β_j^c, (τ_j²)^c, ..., γ^c) p(β_j^c | (τ_j²)^c) q(β_j^c, β_j^p)] },

   otherwise keep β_j^c as the current state.


4. Update τ_j²
   Update the variance parameters by drawing from inverse gamma full conditionals with parameters given in (23).

A slightly different updating scheme computes the mean and the precision matrix of the proposal distribution based on the current posterior mode m_j^c (from the last iteration) rather than on the current state β_j^c, i.e. (27) is replaced by

    P_j = X_j' W(m_j^c) X_j + (1/τ_j²) K_j,    m_j = P_j^{-1} X_j' W(m_j^c) (ỹ(m_j^c) - η̃_j).    (28)

The consequence of using m_j^c rather than β_j^c is that the proposal is independent of the current state of the chain, i.e. q(β_j^c, β_j^p) = q(β_j^p). Hence, it is not required to recompute P_j and m_j when evaluating the reverse proposal density q(β_j^p, β_j^c).

Usually, acceptance rates are significantly higher compared to sampling scheme 2. This is particularly useful for updating spatial effects based on Markov random fields where, in many cases, sampling scheme 2 yields quite low acceptance rates.

    The sampling scheme can be summarized as follows:

Sampling scheme 3 (IWLS proposals based on the current mode):

1. Initialization
   Compute the posterior mode for β_1, ..., β_p and γ given fixed smoothing parameters λ_j = 1/τ_j². By default, BayesX uses λ_j = 0.1, but the value may be changed by the user. The mode is computed via backfitting within Fisher scoring. Use the posterior mode estimates as the initial states β_j^c, (τ_j²)^c, γ^c of the chain. Define m_j^c and m_γ^c as the current modes.

2. Update γ
   Draw a proposed new value γ^p from the Gaussian proposal density q(γ^c, γ^p) with mean

       m = (U' W(m_γ^c) U)^{-1} U' W(m_γ^c) (ỹ - η̃)

   and precision matrix

       P = U' W(m_γ^c) U.

   Accept γ^p as the new state of the chain with acceptance probability

       α = min{ 1, [L(y, β_1^c, ..., β_p^c, γ^p) q(γ^p, γ^c)] / [L(y, β_1^c, ..., β_p^c, γ^c) q(γ^c, γ^p)] },

   otherwise keep γ^c as the current state.

3. Update β_j
   For j = 1, ..., p, draw a proposed new value β_j^p from the Gaussian proposal density q(β_j^c, β_j^p) with mean and precision matrix given in (28). Accept β_j^p as the new state of the chain with probability

       α = min{ 1, [L(y, ..., β_j^p, (τ_j²)^c, ..., γ^c) p(β_j^p | (τ_j²)^c) q(β_j^p, β_j^c)] / [L(y, ..., β_j^c, (τ_j²)^c, ..., γ^c) p(β_j^c | (τ_j²)^c) q(β_j^c, β_j^p)] },

   otherwise keep β_j^c as the current state.

4. Update τ_j²
   Update the variance parameters by drawing from inverse gamma full conditionals with parameters given in (23).


    4.2 Empirical Bayes inference based on mixed model methodology

This section may be skipped if you are not interested in using the regression tool based on mixed model methodology (remlreg objects).

For empirical Bayes inference, the variances τ_j² are considered as unknown constants to be estimated from their marginal likelihood. In terms of the GLMM representation outlined in subsection 3.5, the posterior is given by

    p(β^unp, β^pen | y) ∝ L(y, β^unp, β^pen) Π_{j=1}^p p(β_j^pen | τ_j²),    (29)

where p(β_j^pen | τ_j²) is defined in (17).

Based on the GLMM representation, regression and variance parameters can be estimated using iteratively weighted least squares (IWLS) and (approximate) marginal or restricted maximum likelihood (REML) developed for GLMMs. Estimation is carried out iteratively in two steps:

1. Obtain updated estimates β̂^unp and β̂^pen given the current variance parameters as the solution of the system of equations

    ( U'WU            U'WX       ) ( β̂^unp )     ( U'Wỹ )
    ( X'WU    X'WX + Λ^{-1}      ) ( β̂^pen )  =  ( X'Wỹ ),    (30)

where Λ is the covariance matrix of β^pen defined in subsection 3.5. The (n × 1) vector ỹ and the (n × n) diagonal matrix W = diag(w_1, ..., w_n) are the usual working observations and weights in generalized linear models, see subsubsection 4.1.3.
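One iteration of step 1 can be sketched as follows (toy matrices; Λ denotes the prior covariance matrix of the penalized coefficients; illustrative code, not BayesX internals):

```python
import numpy as np

rng = np.random.default_rng(2)

def mixed_model_step(U, X, W, Lam, ytilde):
    """Solve the block system (30) for (beta_unp, beta_pen) given the
    weights W, random effects covariance Lam and working observations."""
    A = np.block([
        [U.T @ W @ U,  U.T @ W @ X],
        [X.T @ W @ U,  X.T @ W @ X + np.linalg.inv(Lam)],
    ])
    rhs = np.concatenate([U.T @ W @ ytilde, X.T @ W @ ytilde])
    sol = np.linalg.solve(A, rhs)
    p = U.shape[1]
    return sol[:p], sol[p:]

# toy dimensions: n observations, p fixed effects, q random effects
n, p, q = 8, 2, 3
U = rng.standard_normal((n, p))
X = rng.standard_normal((n, q))
W = np.eye(n)
Lam = 0.5 * np.eye(q)
ytilde = rng.standard_normal(n)
beta_unp, beta_pen = mixed_model_step(U, X, W, Lam, ytilde)
```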

2. Updated estimates of the variance parameters τ_j² are obtained by maximizing the (approximate) marginal / restricted log-likelihood

    l(τ_1², ..., τ_p²) = -(1/2) log|Σ| - (1/2) log|U' Σ^{-1} U| - (1/2) (ỹ - U β̂^unp)' Σ^{-1} (ỹ - U β̂^unp)    (31)

with respect to the variance parameters τ_1², ..., τ_p². Here, Σ = W^{-1} + X Λ X' is an approximation to the marginal covariance matrix of ỹ.

The two estimation steps are iterated until convergence. In BayesX, the marginal likelihood (31) is maximized by a computationally efficient alternative to the usual Fisher scoring iterations, as described e.g. in Harville (1977); see Fahrmeir, Kneib and Lang (2004) for details.

Convergence problems of the above algorithm may occur if one of the variance parameters τ_j² is small. In this case, the maximum of the marginal likelihood may lie on the boundary of the parameter space, so that Fisher scoring fails to find the marginal likelihood estimates τ̂_j². Therefore, the estimation of a small variance τ_j² is stopped if the criterion

    c(τ_j²) = ||X_j β̂_j^pen|| / ||η̂||    (32)

falls below the user-specified value lowerlim (see section 8.1.2 of the reference manual). This usually corresponds to small values of the variance τ_j², but defines "small" in a data-driven way.

Models for categorical responses are in principle estimated in the same way as presented above, but have to be embedded into the framework of multivariate generalized linear models, see Kneib and Fahrmeir (2006) for details.


    5 Survival analysis and multi-state models

We will now describe some of the capabilities of BayesX for the estimation of survival time and multi-state models. Discrete time duration and multi-state models can be estimated by categorical regression models after some data augmentation, as outlined in subsection 5.1. Continuous time survival models can be estimated using either the piecewise exponential model or structured hazard regression, an extension of the well known Cox model (subsection 5.2). For the latter, extensions allowing for interval censored survival times are described in subsection 5.3. Finally, subsection 5.4 contains information on continuous time multi-state models.

More details on estimating continuous time survival models based on MCMC can be found in Hennerfeind, Brezger and Fahrmeir (2006). Kneib and Fahrmeir (2007) and Kneib (2006) present inference based on mixed model methodology. Kneib and Hennerfeind (2006) present both MCMC and mixed model based inference in multi-state models.

    5.1 Discrete time duration data

In applications, duration data are often measured on a discrete time scale or can be grouped into suitable intervals. In this section we show how data of this kind can be cast as categorical regression models. Estimation is then based on the methodology for categorical regression models described in the previous sections. We start by assuming that there is only one type of failure event, i.e. we consider the case of survival times.

Let the duration time scale be divided into intervals [a_0, a_1), [a_1, a_2), ..., [a_{q-1}, a_q), [a_q, ∞). Usually a_0 = 0 is assumed and a_q denotes the final follow-up time. Identifying the discrete time index t with the interval [a_{t-1}, a_t), the duration time T is considered as discrete, where T = t ∈ {1, ..., q+1} denotes end of duration within the interval [a_{t-1}, a_t). In addition to the duration T, a sequence of possibly time-varying covariate vectors u_t is observed. Let u_t* = (u_1, ..., u_t) denote the history of covariates up to interval t. Then the discrete hazard function is given by

    λ(t; u_t*) = P(T = t | T ≥ t, u_t*),   t = 1, ..., q,

that is, the conditional probability for the end of duration in interval t, given that the interval is reached and the history of the covariates. Discrete time hazard functions can be specified in terms of binary response models. Common choices are binary logit, probit or grouped Cox models.

For a sample of individuals i = 1, ..., n, let T_i denote duration times and C_i right censoring times. Duration data are usually given by (t_i, δ_i, u_{i,t_i}*), i = 1, ..., n, where t_i = min(T_i, C_i) is the observed discrete duration time, δ_i = 1 if T_i ≤ C_i and δ_i = 0 otherwise is the censoring indicator, and u_{i,t_i}* = (u_it, t = 1, ..., t_i) is the observed covariate sequence. We assume that censoring is noninformative and occurs at the end of the interval, so that the risk set R_t includes all individuals who are censored in interval t.

We define binary event indicators y_it, i ∈ R_t, t = 1, ..., t_i, by

    y_it = 1 if t = t_i and δ_i = 1,
    y_it = 0 otherwise.

Then the duration process of individual i can be considered as a sequence of binary decisions between remaining in the transient state (y_it = 0) or leaving for the absorbing state (y_it = 1), i.e. end of duration at t. For i ∈ R_t, the hazard function for individual i can be modelled by the binary response model

    P(y_it = 1 | u_it) = h(η_it),    (33)
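The augmentation to binary event indicators amounts to simple bookkeeping, sketched here (illustrative code only): each individual contributes one row per interval reached, with y = 1 only in the final interval of an uncensored spell:

```python
def person_period(t_obs, delta):
    """Expand one observation (discrete time t_obs, censoring indicator
    delta) into rows (interval t, event indicator y_t), t = 1, ..., t_obs."""
    return [(t, 1 if (t == t_obs and delta == 1) else 0)
            for t in range(1, t_obs + 1)]

rows = person_period(3, 1)   # event in interval 3
cens = person_period(2, 0)   # censored in interval 2
# rows -> [(1, 0), (2, 0), (3, 1)], cens -> [(1, 0), (2, 0)]
```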


5.2 Continuous time survival analysis for right censored survival times

log-baseline effect by a P-spline of arbitrary degree. This approach is less restrictive and does not demand data augmentation.

Let u_t* = {u_s, 0 ≤ s ≤ t} denote the history of possibly time-varying covariates up to time t. Then the continuous hazard function is given by

    λ(t; u_t*) = lim_{Δt → 0} P(t ≤ T < t + Δt | T ≥ t, u_t*) / Δt,

i.e. by the conditional instantaneous rate of end of duration at time t, given that time t is reached and the history of the covariates. In the Cox model, the individual hazard rate is modelled by

    λ_i(t) = λ_0(t) exp(η_it) = exp(f_0(t) + η_it),    (37)

where λ_0(t) is the baseline hazard (f_0(t) = log(λ_0(t)) is the log-baseline hazard) and η_it is an appropriate predictor. Traditionally, the predictor is linear and the baseline hazard is left unspecified. In BayesX, however, a structured additive predictor may be assumed, and the baseline effect is estimated jointly with the covariate effects, either by a piecewise constant function (in the case of a p.e.m.) or by a P-spline.

    5.2.1 Piecewise exponential model (p.e.m.)

The basic idea of the p.e.m. is to assume that all values that depend on time t are piecewise constant on a grid

    (0, a_1], (a_1, a_2], ..., (a_{s-1}, a_s], ..., (a_{t-1}, a_t], (a_t, ∞),

where a_t is the largest of all observed duration times t_i, i = 1, ..., n. This grid may be equidistant or, for example, constructed according to quantiles of the observed survival times. The assumption of a p.e.m. is quite convenient, since estimation can be based on methodology for Poisson regression models. For this purpose the data set has to be modified as described below.

Let δ_i be an indicator of non-censoring (i.e. δ_i = 1 if observation i is uncensored, 0 otherwise) and γ_0s, s = 1, 2, ..., the piecewise constant log-baseline effect. We define an indicator variable y_is as well as an offset Δ_is as follows:

    y_is = 1 if t_i ∈ (a_{s-1}, a_s] and δ_i = 1,
    y_is = 0 otherwise,

    Δ_is = a_s - a_{s-1}   if a_s < t_i,
    Δ_is = t_i - a_{s-1}   if a_{s-1} < t_i ≤ a_s,
    Δ_is = 0               if t_i ≤ a_{s-1},

    o_is = log(Δ_is)   (o_is = -∞ if Δ_is = 0).

The likelihood contribution of observation i in the interval (a_{s-1}, a_s] is

    L_is = exp( y_is(γ_0s + η_is) - exp(o_is + γ_0s + η_is) ).

As this likelihood is proportional to a Poisson likelihood with offset o_is, estimation can be performed using Poisson regression with response variable y, (log-)offset o, and a as a continuous covariate. Due to the assumption of a piecewise constant hazard rate, the estimated log-baseline is a step function on the defined grid. To obtain a smooth step function, a random walk prior is specified for the parameters γ_0s.
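The required data modification can be sketched as follows: each observation is split over the grid, with the time spent in each interval as exposure (illustrative code; rows with zero exposure are simply dropped, which corresponds to the offset -∞):

```python
def pem_rows(t_i, delta_i, grid):
    """Split one survival time over grid intervals (a_{s-1}, a_s].
    Returns tuples (interval index s, y_is, exposure Delta_is);
    intervals not reached contribute nothing."""
    rows = []
    lower = 0.0
    for s, upper in enumerate(grid, start=1):
        if t_i <= lower:
            break
        exposure = min(t_i, upper) - lower
        y = 1 if (lower < t_i <= upper and delta_i == 1) else 0
        rows.append((s, y, exposure))
        lower = upper
    return rows

rows = pem_rows(2.5, 1, grid=[1.0, 2.0, 3.0])
# intervals (0,1], (1,2], (2,3]: exposures 1.0, 1.0, 0.5; event in s = 3
```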


5.3 Continuous time survival analysis for interval censored survival times

observed along with the censoring indicator δ = 1(T ≤ C). Many applications, however, confront the analyst with more complicated data structures involving more general censoring schemes. For example, interval censored survival times T are not observed exactly but are only known to fall into an interval [T_lo, T_up]. If T_lo = 0, such survival times are also referred to as being left censored. Furthermore, each of the censoring schemes may appear in combination with left truncation of the corresponding observation, i.e. the survival time is only observed if it exceeds the truncation time T_tr. Accordingly, some survival times are not observable and the likelihood has to be adjusted appropriately. Figure 1 illustrates the different censoring schemes we will consider in the following: the true survival time is given by T, which is observed for individuals 1 and 2. While individual 1 is not truncated, individual 2 is left truncated at time T_tr. Similarly, individuals 3 and 4 are right censored at time C, and individuals 5 and 6 are interval censored with interval [T_lo, T_up], with the same pattern of left truncation.

[Figure omitted in this extraction: horizontal time lines for individuals 1 to 6 with marks at 0, T_tr, C, T_lo, T and T_up.]

Figure 1: Illustration of different censoring schemes.

In a general framework, an observation can now be uniquely described by the quadruple (T_tr, T_lo, T_up, δ), with

    T_lo = T_up = T, δ = 1   if the observation is uncensored,
    T_lo = T_up = C, δ = 0   if the observation is right censored,
    T_lo < T_up,     δ = 0   if the observation is interval censored.

For left truncated observations we have T_tr > 0, while T_tr = 0 for observations which are not truncated.

Based on these definitions, we can now construct the likelihood contributions for the different censoring schemes in terms of the hazard rate λ(t) and the survivor function S(t) = exp(-∫_0^t λ(u) du). Under the common assumptions of noninformative censoring and conditional independence, the likelihood is given by

    L = Π_{i=1}^n L_i,    (38)

where

    L_i = λ(T_up) S(T_up) / S(T_tr) = λ(T_up) exp( -∫_{T_tr}^{T_up} λ(t) dt )


for an uncensored observation,

    L_i = S(T_up) / S(T_tr) = exp( -∫_{T_tr}^{T_up} λ(t) dt )

for a right censored observation, and

    L_i = (S(T_lo) - S(T_up)) / S(T_tr) = exp( -∫_{T_tr}^{T_lo} λ(t) dt ) [ 1 - exp( -∫_{T_lo}^{T_up} λ(t) dt ) ]

for an interval censored observation. Note that for the explicit evaluation of the likelihood (38) some numerical integration technique has to be employed, since none of the integrals can in general be solved analytically.
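The three likelihood contributions can be sketched with a simple trapezoidal rule for the cumulative hazard (the linear hazard used below is purely a placeholder, not any hazard model from BayesX):

```python
import math

def cum_hazard(lam, a, b, steps=1000):
    """Trapezoidal approximation of the integral of lam(t) from a to b."""
    if b <= a:
        return 0.0
    h = (b - a) / steps
    total = 0.5 * (lam(a) + lam(b))
    for k in range(1, steps):
        total += lam(a + k * h)
    return total * h

def likelihood_contribution(lam, T_tr, T_lo, T_up, delta):
    """L_i for uncensored, right censored and interval censored data."""
    if delta == 1:                       # uncensored: T_lo = T_up = T
        return lam(T_up) * math.exp(-cum_hazard(lam, T_tr, T_up))
    if T_lo == T_up:                     # right censored
        return math.exp(-cum_hazard(lam, T_tr, T_up))
    # interval censored
    return math.exp(-cum_hazard(lam, T_tr, T_lo)) * \
        (1.0 - math.exp(-cum_hazard(lam, T_lo, T_up)))

lam = lambda t: 0.5 * t                  # placeholder hazard
L_right = likelihood_contribution(lam, 0.0, 2.0, 2.0, 0)
# with lam(t) = t/2 the cumulative hazard on (0, 2] is 1, so L_right is exp(-1)
```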

The above notation also allows for the easy inclusion of piecewise constant, time-varying covariates via some data augmentation. Noting that

    ∫_{T_tr}^{T} λ(t) dt = ∫_{T_tr}^{t_1} λ(t) dt + ∫_{t_1}^{t_2} λ(t) dt + ... + ∫_{t_{p-1}}^{t_p} λ(t) dt + ∫_{t_p}^{T} λ(t) dt

for T_tr < t_1 < ... < t_p < T, we can replace an observation (T_tr, T_lo, T_up, δ) by a set of new observations (T_tr, t_1, t_1, 0), (t_1, t_2, t_2, 0), ..., (t_{p-1}, t_p, t_p, 0), (t_p, T_lo, T_up, δ) without changing the likelihood. Therefore, observations with time-varying covariates can be split up into several observations, where the values t_1 < ... < t_p are defined by the changepoints of the covariate and the covariate is now time-constant on each of the intervals. In theory, paths of a covariate x(t) other than piecewise constant ones are also possible, if x(t) is known for T_tr ≤ t ≤ T_lo. In this case the likelihood (38) can also be evaluated numerically, but a general path x(t) may lead to complicated data structures.

Figure 2 illustrates the data augmentation step for a left truncated, uncensored observation and a covariate x(t) that takes the three different values x_1, x_2 and x_3 on the three intervals [T_tr, t_1], [t_1, t_2] and [t_2, T_up]. Here, the original observation (T_tr, T_up, T_up, 1) has to be replaced by (T_tr, t_1, t_1, 0), (t_1, t_2, t_2, 0) and (t_2, T_up, T_up, 1).

[Figure omitted in this extraction: time axis with marks at 0, T_tr, t_1, t_2 and T_up; the covariate takes the values x_1, x_2 and x_3 on the successive intervals.]

Figure 2: Illustration of time-varying covariates.
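The splitting step is pure bookkeeping on the quadruples (T_tr, T_lo, T_up, δ) and can be sketched as follows (illustrative code):

```python
def split_observation(T_tr, T_lo, T_up, delta, changepoints):
    """Replace (T_tr, T_lo, T_up, delta) by pseudo right censored pieces
    at the covariate changepoints T_tr < t_1 < ... < t_p < T_lo."""
    rows, lower = [], T_tr
    for t in changepoints:
        rows.append((lower, t, t, 0))   # censored piece: survival only
        lower = t
    rows.append((lower, T_lo, T_up, delta))
    return rows

rows = split_observation(0.0, 5.0, 5.0, 1, changepoints=[2.0, 3.0])
# -> [(0.0, 2.0, 2.0, 0), (2.0, 3.0, 3.0, 0), (3.0, 5.0, 5.0, 1)]
```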

    Currently, interval censored survival times can only be handled with remlreg objects.

    5.4 Continuous-time multi-state models

Multi-state models are a flexible tool for the analysis of time-continuous phenomena that can be characterized by a discrete set of states. Such data structures naturally arise when observing a discrete response variable for several individuals or objects over time. Some common examples are depicted in Figure 3 in terms of their reachability graph for illustration. For recurrent events (Figure 3 (a)), the observations evolve through time, moving repeatedly between a fixed set of states.


- Covariates \(u_{il}(t)\) with time-varying effects \(g_{hl}(t)\).

- Nonparametric effects \(f_{hj}(x_{ij}(t))\) of continuous covariates \(x_{ij}(t)\).

- Parametric effects \(\gamma_h\) of covariates \(v_i(t)\).

- Frailty terms \(b_{hi}\) to account for unobserved heterogeneity.

For each individual i, i = 1, . . . , n, the likelihood contribution in a multi-state model can be derived from a counting process representation of the multi-state model. Let \(N_{hi}(t)\), h = 1, . . . , k, be a set of counting processes counting transitions of type h for individual i. Consequently, h = 1, . . . , k indexes the observable transitions in the model under consideration, and the jumps of the counting processes \(N_{hi}(t)\) are defined by the transition times of the corresponding multi-state process for individual i.

From classical counting process theory (see e.g. Andersen et al., 1993, Ch. VII.2), the intensity processes \(\alpha_{hi}(t)\) of the counting processes \(N_{hi}(t)\) are defined as the product of the hazard rate for type h transitions, \(\lambda_{hi}(t)\), and a predictable at-risk indicator process \(Y_{hi}(t)\), i.e.

\[
\alpha_{hi}(t) = Y_{hi}(t)\,\lambda_{hi}(t),
\]

where the hazard rates are constructed in terms of covariates as in (39). The at-risk indicator \(Y_{hi}(t)\) takes the value one if individual i is at risk for a type h transition at time t, and zero otherwise. For example, in the multi-state model of Figure 3 (a), an individual in state 2 is at risk for both transitions to state 1 and state 3. Hence, the at-risk indicators for both the transitions 2 to 1 and 2 to 3 will be equal to one as long as the individual remains in state 2.
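A minimal sketch of such an at-risk indicator for a reachability graph like the recurrent-events model of Figure 3 (a) (Python for illustration only; the state labels and the `at_risk` helper are hypothetical, not BayesX code):

```python
def at_risk(current_state, transition, transitions):
    """Y(t) = 1 if the given transition can be made from the current state.
    transitions maps a transition label to its (from_state, to_state) pair."""
    return 1 if transitions[transition][0] == current_state else 0

# Hypothetical three-state model with moves 1 <-> 2 <-> 3:
moves = {"12": (1, 2), "21": (2, 1), "23": (2, 3), "32": (3, 2)}
```

An individual currently in state 2 is then at risk for both "21" and "23", but not for "12", matching the description above.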

Under mild regularity conditions, the individual log-likelihood contributions can now be obtained from counting process theory as

\[
l_i = \sum_{h=1}^{k} \left( \int_0^{T_i} \log(\lambda_{hi}(t))\,dN_{hi}(t) - \int_0^{T_i} \lambda_{hi}(t)\,Y_{hi}(t)\,dt \right), \tag{40}
\]

where \(T_i\) denotes the time until which individual i has been observed. The likelihood contributions can be interpreted similarly as with hazard rate models for survival times (and in fact coincide with these in the case of a multi-state process with only one transition to an absorbing state). The first term corresponds to contributions at the transition times, since the integral with respect to the counting process in fact equals a simple sum over the transition times. Each of the summands is then given by the log-intensity for the observed transition evaluated at this particular time point. In survival models this term simply equals the log-hazard evaluated at the survival time for uncensored observations. The second term reflects cumulative intensities integrated over the corresponding waiting periods between two successive transitions. The integral is evaluated for all transitions for which the individual is at risk during the current period. In survival models there is only one such transition (the transition from alive to dead), and the integral is evaluated from the time of entrance into the study to the survival or censoring time.
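To make the structure of (40) concrete, the following Python sketch evaluates its two terms for one individual under the simplifying assumption of constant transition-specific hazards (all names are illustrative; BayesX itself works with flexible hazard rates as in (39)):

```python
import math

def multistate_loglik(transition_times, at_risk_intervals, hazards):
    """Log-likelihood contribution (40) for one individual, assuming
    constant transition-specific hazards for illustration.

    transition_times: dict h -> list of times a type-h transition occurred
    at_risk_intervals: dict h -> list of (start, stop) intervals with Y_h(t) = 1
    hazards: dict h -> constant hazard rate lambda_h
    """
    li = 0.0
    for h, lam in hazards.items():
        # Jump term: sum of log-intensities at the observed transition times.
        li += sum(math.log(lam) for _ in transition_times.get(h, []))
        # Integral term: cumulative hazard over the at-risk periods.
        li -= sum(lam * (stop - start)
                  for start, stop in at_risk_intervals.get(h, []))
    return li
```

With a single transition type and one absorbing state this reduces to the familiar survival log-likelihood, log-hazard at the event time minus the cumulative hazard over the observation period.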

More details on multi-state models, including an exemplary analysis of human sleep data, can be found in Kneib and Hennerfeind (2006).

    6 References

Albert, J. and Chib, S. (1993): Bayesian analysis of binary and polychotomous response data. Journal of the American Statistical Association, 88, 669-679.


Hastie, T. and Tibshirani, R. (1993): Varying-coefficient Models. Journal of the Royal Statistical Society B, 55, 757-796.

    Hastie, T. and Tibshirani, R. (2000): Bayesian Backfitting. Statistical Science, 15, 193-223.

Hastie, T., Tibshirani, R. and Friedman, J. (2001): The Elements of Statistical Learning: Data Mining, Inference and Prediction. New York: Springer-Verlag.

Hennerfeind, A., Brezger, A. and Fahrmeir, L. (2006): Geoadditive survival models. Journal of the American Statistical Association, 101, 1065-1075.

    Holmes, C., Held, L. (2006): Bayesian auxiliary variable models for binary and multinomialregression. Bayesian Analysis, 1, 145-168.

Johnson, M.E., Moore, L.M. and Ylvisaker, D. (1990): Minimax and maximin designs. Journal of Statistical Planning and Inference, 26, 131-148.

Kammann, E. E. and Wand, M. P. (2003): Geoadditive Models. Journal of the Royal Statistical Society C, 52, 1-18.

Kneib, T. (2006): Geoadditive hazard regression for interval censored survival times. Computational Statistics and Data Analysis, 51, 777-792.

Kneib, T. and Hennerfeind, A. (2006): Bayesian Semiparametric Multi-State Models. Under revision for Statistical Modelling. A preliminary version is available as SFB 386 Discussion Paper 502.

Kneib, T. and Fahrmeir, L. (2006): Structured additive regression for categorical space-time data: A mixed model approach. Biometrics, 62, 109-118.

Kneib, T. and Fahrmeir, L. (2007): A mixed model approach to structured hazard regression. Scandinavian Journal of Statistics, 34, 207-228.

Knorr-Held, L. (1999): Conditional Prior Proposals in Dynamic Models. Scandinavian Journal of Statistics, 26, 129-144.

Kragler, P. (2000): Statistische Analyse von Schadensfällen privater Krankenversicherungen. Master thesis, University of Munich.

Lang, S. and Brezger, A. (2004): Bayesian P-splines. Journal of Computational and Graphical Statistics, 13, 183-212.

Lin, X. and Zhang, D. (1999): Inference in generalized additive mixed models by using smoothing splines. Journal of the Royal Statistical Society B, 61, 381-400.

    McCullagh, P. and Nelder, J.A. (1989): Generalized Linear Models. Chapman and Hall,London.

Nychka, D. and Saltzman, N. (1998): Design of Air-Quality Monitoring Networks. Lecture Notes in Statistics, 132, 51-76.

Osuna, L. (2004): Semiparametric Bayesian Count Data Models. Dr. Hut Verlag, München.

    Rue, H. (2001): Fast Sampling of Gaussian Markov Random Fields with Applications. Journalof the Royal Statistical Society B, 63, 325-338.

    Spiegelhalter, D.J., Best, N.G., Carlin, B.P. and van der Linde, A. (2002): Bayesianmeasures of model complexity and fit. Journal of the Royal Statistical Society B, 65, 583-639.


    Index

Bayesreg objects, 13

Competing risks, 26
Continuous covariates, 7
Continuous time survival models, 22
Cox model, 24

Discrete time survival models, 20
Disease progression, 26

Empirical Bayes inference, 19
Exponential family, 4

Full Bayesian inference, 13

Gaussian random fields, 8
Generalized additive mixed model, 5
Generalized additive model, 4
Generalized linear model, 4
Geoadditive model, 5
Geographically weighted regression, 5
Group indicators, 10

Interactions, 11
Interval censoring, 24
IWLS proposal, 16

Kriging, 8

Left censoring, 24
Left truncation, 24

Marginal likelihood, 12
Markov random fields, 8
MCMC, 13
    Categorical response, 16
    Exponential families, 16
    Gaussian response, 14
Mixed model based inference, 19
Mixed model representation, 12
Multi-state models, 26

P-splines, 8
Piecewise exponential model, 23
Prior assumptions, 6
    Continuous covariates, 7
    Fixed effects, 6
    Interactions, 11
    Spatial effects, 8

Random effects, 10
Random walk priors, 7
Recurrent events, 26
Remlreg objects, 19
Restricted maximum likelihood, 12

Spatial priors, 8
Survival analysis, 20

Time scales, 7
Time-varying covariates, 24
Two-dimensional P-splines, 11

Unstructured spatial effects, 10

Varying coefficient models, 5, 11
