
Bayesian Robust Multivariate Linear Regression With Incomplete Data

Chuanhai Liu

The multivariate t distribution and other normal/independent multivariate distributions, such as the multivariate slash distribution and the multivariate contaminated normal distribution, are used for robust regression with complete or incomplete data. Most previous work focused on the method of maximum likelihood estimation for linear regression using normal/independent distributions. This article considers Bayesian estimation of multivariate linear regression models using normal/independent distributions with fully observed predictor variables and possible missing values from outcome variables. A monotone data augmentation algorithm for posterior simulation of the parameters and missing data imputation is presented. The posterior distributions of functions of the parameters can be obtained using Monte Carlo methods. The monotone data augmentation algorithm can also be used for creating multiple imputations for incomplete data sets. An illustrative example of using the multivariate t is also included.

KEY WORDS: Data augmentation; Expectation-Conditional-Maximization-Either; Expectation-Maximization; Missing data; Monotone data augmentation; Multiple imputation; Normal/independent distributions.

1. INTRODUCTION

The multivariate t distribution and other normal/independent multivariate distributions (see, e.g., Andrews and Mallows 1974; Rogers and Tukey 1972), such as the multivariate slash distribution and the multivariate contaminated normal distribution, are used in statistical practice for robust linear regression. Most previous work focused on the method of maximum likelihood (ML) estimation of the parameters of linear regression models with normal/independent distributions via the expectation-maximization (EM) algorithm (Dempster, Laird, and Rubin 1977). Dempster et al. (1980) considered univariate normal/independent distributions. Little (1988) studied multivariate normal/independent distributions with incomplete data. For the multivariate t distribution, Lange, Little, and Taylor (1989) suggested estimating the location parameters (or linear regression coefficients), the scatter matrix, and the degrees of freedom simultaneously using the EM algorithm. Liu and Rubin (1994, 1995) showed that the Expectation-Conditional-Maximization-Either (ECME) algorithm, a simple extension of the EM algorithm and the Expectation-Conditional Maximization (ECM) algorithm (Meng and Rubin 1993) for the ML estimation of the parameters of the multivariate t distribution, could converge substantially faster than the EM algorithm when the degrees of freedom are to be estimated.

Here we are concerned with Bayesian estimation of linear regression models with normal/independent distributions via the data augmentation (DA) algorithm (Tanner and Wong 1987). DA can be viewed as a stochastic version of the EM algorithm. More specifically, DA is an iterative algorithm, each iteration of which consists of two steps:

1. I step. Impute the missing values with draws from the predictive model given the observed data and the current draw of the parameters.

2. P step. Draw the parameters from their posterior dis- tribution given the current filled-in complete data.
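In code, one DA sweep is just these two draws in alternation. The following is a minimal schematic sketch, not the paper's implementation; `draw_missing` and `draw_params` are hypothetical stand-ins for the I and P steps of whatever model is being fit.

```python
def data_augmentation(y_obs, draw_missing, draw_params, theta0, n_iter=1000):
    """Generic DA sketch: alternate the I step and the P step.

    draw_missing(theta, y_obs) -> one draw of the missing data (I step)
    draw_params(y_obs, y_mis)  -> one draw of the parameters from their
                                  complete-data posterior (P step)
    """
    theta = theta0
    draws = []
    for _ in range(n_iter):
        y_mis = draw_missing(theta, y_obs)   # I step: impute missing values
        theta = draw_params(y_obs, y_mis)    # P step: draw parameters
        draws.append(theta)
    return draws  # after burn-in, a dependent sample from Pr(theta | y_obs)
```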

When applied to a special data structure called a monotone pattern (Little and Rubin 1987; Rubin and Schafer 1990), DA is called monotone data augmentation (MDA). Using MDA and extensions of Bartlett's decomposition (Bartlett 1933; Liu 1993), Liu (1995) provided techniques for creating multiple imputations for incomplete continuous data using the multivariate t distribution.

This article extends the work of Liu (1995) in two aspects: including covariates in the multivariate t models (as in Liu and Rubin 1995), and replacing the multivariate t distribution with a more general class of distributions, that is, the class of normal/independent distributions (as in Lange and Sinsheimer 1993). These extensions provide a flexible class of models for robust multivariate linear regression and multiple imputation. Methods to implement MDA for these models with fully observed predictor variables and possible missing values from outcome variables are described. The methods of this article can be used either to obtain the posterior distributions of the parameters of the models or to create multiple imputations for incomplete data sets.

Section 2 describes the multivariate linear regression models with normal/independent distributions. Section 3 presents MDA for using these models. Section 4 discusses estimation of the regression coefficients of some of the outcome variables on both the other outcome variables and the predictor variables. Section 5 gives a numerical example of using the multivariate t to illustrate the methodology. Finally, Section 6 provides some concluding remarks.

Chuanhai Liu is Member of Technical Staff, Statistics and Information Analysis Research Department, Bell Laboratories, Lucent Technologies, Murray Hill, NJ 07974. The author thanks Constantine Gatsonis, Jun Liu, Daryl Pregibon, Donald B. Rubin, and Alan Zaslavsky for their useful comments. The author is grateful to the referees and editors for numerous suggestions that have improved the presentation of this article. This work was partially supported by National Science Foundation Grant SES 92-07456 when the author was a graduate student in the Department of Statistics, Harvard University. The article was revised using the computing facilities in Bell Laboratories.

© 1996 American Statistical Association
Journal of the American Statistical Association
September 1996, Vol. 91, No. 435, Theory and Methods


2. LINEAR REGRESSION MODELS WITH MULTIVARIATE NORMAL/INDEPENDENT DISTRIBUTIONS

2.1 Normal/Independent Distributions

A normal/independent distribution (see, e.g., Andrews and Mallows 1974; Dempster et al. 1980; Rogers and Tukey 1972) is the distribution of the random vector $Y = \mu + Z/\sqrt{w}$, where μ is a p-dimensional constant vector, w is a positive random variable whose distribution takes the form F(w | ν), and Z, a p-dimensional random vector, is normally distributed as $N_p(0, \Psi)$ and independent of w. For convenience, call μ the location parameter, w the weight, Ψ the scatter matrix, and ν the hyperparameter.

Three normal/independent distributions are commonly used for robust estimation:

* The multivariate t distribution $t_p(\mu, \Psi, \nu)$ (Cornish 1954; Dunnett and Sobel 1954), where ν is called the degrees of freedom and the distribution of w given ν, F(w | ν), is the gamma distribution gamma(ν/2, ν/2) with the density function

$$f(w \mid \nu) = \frac{(\nu/2)^{\nu/2}\, w^{\nu/2-1}\exp\{-w\nu/2\}}{\Gamma(\nu/2)} \qquad (w > 0,\ \nu > 0),$$

where Γ(·) is the standard gamma function.

* The multivariate slash distribution (Lange and Sinsheimer 1993; Rogers and Tukey 1972), where the distribution of w given ν, F(w | ν), has the density function

$$f(w \mid \nu) = \nu w^{\nu - 1} \qquad (0 < w < 1,\ \nu > 0).$$

* The multivariate contaminated normal distribution (Lange and Sinsheimer 1993; Tukey 1960), where ν = (λ, γ) (0 < λ < 1, 0 < γ < 1) and the distribution of w given ν, F(w | ν), is discrete with the probability mass function

$$f(w \mid \lambda, \gamma) = \begin{cases} \gamma & \text{if } w = \lambda, \\ 1 - \gamma & \text{if } w = 1. \end{cases}$$

(For more discussion of these distributions see, e.g., Lange and Sinsheimer 1993.)
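All three weight laws are straightforward to simulate, and a normal/independent draw is then just Y = μ + Z/√w. A minimal sketch in Python (the function names are mine, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def draw_w_t(nu, n):
    # gamma(nu/2, nu/2) (shape, rate): weights for the multivariate t
    return rng.gamma(shape=nu / 2, scale=2 / nu, size=n)

def draw_w_slash(nu, n):
    # density nu * w**(nu - 1) on (0, 1): F(w) = w**nu, invert the cdf
    return rng.uniform(size=n) ** (1 / nu)

def draw_w_contaminated(lam, gam, n):
    # two-point law: w = lam with probability gam, else w = 1
    return np.where(rng.uniform(size=n) < gam, lam, 1.0)

def draw_normal_independent(mu, Psi, w):
    # Y = mu + Z / sqrt(w), with Z ~ N_p(0, Psi) independent of w
    Z = rng.multivariate_normal(np.zeros(len(mu)), Psi, size=len(w))
    return mu + Z / np.sqrt(w)[:, None]
```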

2.2 Linear Regression Models for Robust Linear Regression

Denote by {(Y_i, X_i): i = 1, …, n} the n observations of p outcome variables Y = (y_1, …, y_p)' and q predictor variables X = (x_1, …, x_q)'. Consider the linear regression model

$$Y_i = \beta'X_i + \epsilon_i, \qquad i = 1, \ldots, n, \qquad (1)$$

where β is the (q × p) coefficient matrix and the error terms ε_i are independently and identically distributed with a normal/independent distribution. Let w = (w_1, …, w_n) be the corresponding weights. The distribution of the error terms can be described as follows:

$$\epsilon_i \mid \Psi, \nu, w_i \stackrel{\mathrm{ind}}{\sim} N_p(0, \Psi/w_i), \qquad w_i > 0, \quad \text{for } i = 1, \ldots, n, \qquad (2)$$

and

$$w_i \mid \nu \stackrel{\mathrm{iid}}{\sim} F(w \mid \nu), \qquad \text{for } i = 1, \ldots, n, \qquad (3)$$

where ν is the hyperparameter. Denote by Y the (n × p) matrix whose ith row consists of Y_i, by X the (n × q) matrix whose ith row consists of X_i, and by ε the (n × p) matrix whose ith row consists of the error term ε_i = Y_i − β'X_i. Thus Equation (1) can be written as

$$Y = X\beta + \epsilon.$$
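For concreteness, here is a small simulation from model (1)-(3) with multivariate t errors; the dimensions, β, and Ψ = I are arbitrary illustrative choices, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, q, nu = 100, 3, 2, 4.0

X = np.column_stack([np.ones(n), rng.normal(size=n)])  # (n x q): intercept + one covariate
beta = rng.normal(size=(q, p))                          # (q x p) coefficient matrix
w = rng.gamma(nu / 2, 2 / nu, size=n)                   # weights, eq. (3), t case
Z = rng.multivariate_normal(np.zeros(p), np.eye(p), size=n)
eps = Z / np.sqrt(w)[:, None]                           # eq. (2): N_p(0, Psi / w_i), Psi = I
Y = X @ beta + eps                                      # eq. (1): Y = X beta + eps
```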

2.3 Monotone Patterns and the Associated Ignorable Missing-Data Models

This section describes monotone patterns and the associated ignorable missing-data models. An incomplete rectangular data set of observations of p variables is called monotone if it can be sorted in such a way that the jth variable is at least as observed as the (j − 1)th variable for j = 2, …, p (cf. Little and Rubin 1987). A rectangular data set (Y, X) with fully observed predictor variables and incomplete monotone outcome variables can be represented by

$$(Y_{MP}, X) = \big\{\big(y_{i,k}^{(k)}, y_{i,k+1}^{(k)}, \ldots, y_{i,p}^{(k)}, X_i^{(k)}\big): i = 1, \ldots, n_k;\ k = 1, \ldots, p\big\}, \qquad (4)$$

where $n_k \ge 0$ for k = 1, …, p, $\sum_{k=1}^{p} n_k = n$, the total number of observations, and the superscript (k) indexes the p possible patterns. Figure 1 displays such a data structure. A complete data set is a special case of (4) in which n_1 = n and n_i = 0 for i = 2, …, p. Therefore, the method of this article is applicable for both complete data and incomplete data.
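The monotone condition is easy to check and to arrange by sorting. A sketch, assuming the outcome matrix Y uses NaN for missing entries and its columns are already ordered from least to most observed, as in (4):

```python
import numpy as np

def is_monotone(Y):
    """True if, in every row, the observed entries of Y (NaN = missing)
    form a trailing block y_k, ..., y_p, as in Equation (4)."""
    obs = ~np.isnan(Y)
    # trailing block <=> observedness is nondecreasing across each row
    return bool(np.all(np.diff(obs.astype(int), axis=1) >= 0))

def sort_into_patterns(Y):
    """Sort rows by pattern index k = first observed column (1-based),
    grouping pattern 1 first, then pattern 2, and so on."""
    k = np.argmax(~np.isnan(Y), axis=1)   # first observed column per row
    order = np.argsort(k, kind="stable")
    return Y[order], k[order] + 1         # sorted data, pattern labels
```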

When Y contains missing values, assume that the missing-data mechanism is ignorable (Little and Rubin 1987; Rubin 1976). Let w = {w_i^{(k)}: i = 1, …, n_k; k = 1, …, p} be the missing weights corresponding to Equation (4). The model given in Equations (1), (2), and (3) for (Y_MP, X) can be represented as follows:

$$\big(y_{i,k}^{(k)}, \ldots, y_{i,p}^{(k)}\big)' \mid \beta, \Psi, \nu, w_i^{(k)} \stackrel{\mathrm{ind}}{\sim} N_{p-k+1}\big((\beta^{(k)})'X_i^{(k)},\ \Psi^{(k)}/w_i^{(k)}\big), \qquad \text{for } i = 1, \ldots, n_k;\ k = 1, \ldots, p, \qquad (5)$$

and

$$w_i^{(k)} \mid \nu \stackrel{\mathrm{iid}}{\sim} F(w \mid \nu), \qquad \text{for } i = 1, \ldots, n_k;\ k = 1, \ldots, p,$$

where β^{(k)} is the last p − k + 1 columns of β and Ψ^{(k)} is the lower right (p − k + 1) × (p − k + 1) submatrix of Ψ.


[Figure 1. Data Sorted Into a Monotone Pattern With Fully Observed Predictor Variables. Units (the first subscript) are grouped into patterns 1, …, p (the superscript): pattern k consists of $n_k$ units for which the outcome variables $y_k, \ldots, y_p$ are observed, the outcome variables $y_1, \ldots, y_{k-1}$ are missing, and all q predictor variables $x_1, \ldots, x_q$ are observed; in total there are $n = \sum_{k=1}^{p} n_k$ units.]

2.4 Prior Distributions for (β, Ψ)

Assume that β, Ψ, and ν are independent a priori. For the prior distribution of (β, Ψ), use

$$\Pr(\beta, \Psi) \propto |\Psi|^{-(m+1)/2}\exp\Big\{-\frac{1}{2}\mathrm{tr}\big(\Psi^{-1}A\big)\Big\}, \qquad (6)$$

where m is a scalar and A is a (p × p) nonnegative definite matrix. If m = p and A = 0, then the prior distribution (6) becomes the noninformative prior, or Jeffreys's prior (Box and Tiao 1973), which is commonly used in applied statistics. If m = −1 and A = 0, then the prior (6) is flat; that is, Pr(β, Ψ) ∝ constant.

For the multivariate t distributions, Liu (1995) gave a brief discussion on prior distributions for the hyperparameter, the degrees of freedom. For the slash and contaminated normal distributions, the prior distributions for the hyperparameters are given in the next section.

3. MONOTONE DATA AUGMENTATION

This section presents MDA for taking draws of the parameters of the models from their posterior distribution. For the sake of clarity, MDA is presented step by step: first with known weights and observed data that have a monotone pattern; then with known weights and observed data that do not have a monotone pattern; then with unknown weights and a known hyperparameter; and finally with unknown weights and an unknown hyperparameter.

3.1 MDA With Known Weights and Observed Data Having a Monotone Pattern

With known weights and observed data having a monotone pattern, MDA is noniterative. Posterior simulation of (β, Ψ) can be completed based on the equation

$$\Pr(\beta, \Psi \mid Y_{MP}, X, w) = \Pr(\Psi \mid Y_{MP}, X, w) \cdot \Pr(\beta \mid \Psi, Y_{MP}, X, w).$$

Techniques for taking draws from Pr(Ψ | Y_MP, X, w) and Pr(β | Ψ, Y_MP, X, w) are summarized in the following Theorem 1, Corollary 1, and Corollary 2 (Liu 1994), which extend the results of Liu (1993, 1995).

Theorem 1. For k = 1, …, p, let

$$\hat\beta^{(k)} = \Bigg(\sum_{j=1}^{k}\sum_{i=1}^{n_j} w_i^{(j)} X_i^{(j)} \big(X_i^{(j)}\big)'\Bigg)^{-1} \sum_{j=1}^{k}\sum_{i=1}^{n_j} w_i^{(j)} X_i^{(j)} \big(Y_{i,[k:p]}^{(j)}\big)' = (X'A_kX)^{-1}X'A_kY_{[k:p]},$$

where $A_k = \mathrm{diag}\{w_1^{(1)}, \ldots, w_{n_1}^{(1)}, \ldots, w_1^{(k)}, \ldots, w_{n_k}^{(k)}, 0, \ldots, 0\}$ and $Y_{[k:p]}$ is the last p − k + 1 columns of Y (i.e., $\hat\beta^{(k)}$ is the weighted least squares estimate of the regression coefficients of $Y_{[k:p]}$ on X based on the sample $\{(Y_{i,[k:p]}^{(j)}, X_i^{(j)}): i = 1, \ldots, n_j;\ j = 1, \ldots, k\}$). Let $S_k$ be the corresponding weighted total sum of residual squares and cross-products matrix; that is,

$$S_k = \sum_{j=1}^{k}\sum_{i=1}^{n_j} w_i^{(j)} \big(Y_{i,[k:p]}^{(j)} - (\hat\beta^{(k)})'X_i^{(j)}\big)\big(Y_{i,[k:p]}^{(j)} - (\hat\beta^{(k)})'X_i^{(j)}\big)',$$

and

$$B_k = A_k + S_k,$$

where $A_k$ is the lower right (p − k + 1) × (p − k + 1) submatrix of the nonnegative definite matrix A given in (6). Suppose that $B_1$ is positive definite (i.e., $B_1 > 0$); then, for k = 1, …, p, $B_k$ is positive definite and has the Cholesky factorization $B_k^{-1} = L_kL_k'$, where $L_k$ is lower triangular. Let $H = (h_1, \ldots, h_p)$ be a lower triangular (p × p) matrix with its lower triangular part formed by the columns $L_1u_1, L_2u_2, \ldots, L_pu_p$, where $u_k = (u_{k,k}, \ldots, u_{p,k})'$ and $u_1, \ldots, u_p$ satisfy (a) the $u_{i,j}$ are independent for $1 \le j \le i \le p$; (b) $u_{i,j} \sim N(0, 1)$ for $1 \le j < i \le p$; and (c) $u_{j,j}^2 \sim \chi^2_{n_1+n_2+\cdots+n_j-j+(m-p-q+1)}$ for j = 1, …, p. If $n_1 + \cdots + n_k > k + (p - m - 1 + q)$ for k = 1, …, p, then, given w, ν, and Y_MP, the conditional posterior distribution of $\Psi^{-1}$ determined by Equations (5) and (6) is the distribution of HH'.

The proof of Theorem 1 is given in the Appendix. This proof leads to the following corollary.

Corollary 1. The conditional posterior distribution of vec(β), given Ψ, w, ν, and Y_MP, is normal with mean vec(θ), where

$$\theta = \big(\hat\beta^{(1)}h_1, \ldots, \hat\beta^{(p)}h_p\big)H^{-1},$$

and covariance

$$D = \big((H')^{-1} \otimes I_q\big)\,\mathrm{diag}\big\{(X'A_1X)^{-1}, \ldots, (X'A_pX)^{-1}\big\}\,\big(H^{-1} \otimes I_q\big),$$

where $h_k = (h_{k,k}, \ldots, h_{p,k})'$ denotes the nonzero part of the kth column of H.


Corollary 1 leads to the following result, which provides a simple way to generate β.

Corollary 2. Let $Z = (Z_1, \ldots, Z_p)$ be a (q × p) random matrix whose elements are independent standard normal deviates. Then the conditional distribution of β, given Ψ, w, and (Y_MP, X), is the distribution of

$$(G_1Z_1, \ldots, G_pZ_p)H^{-1} + \big(\hat\beta^{(1)}h_1, \ldots, \hat\beta^{(p)}h_p\big)H^{-1},$$

where $G_kG_k' = (X'A_kX)^{-1}$ with $G_k$ lower triangular for k = 1, …, p.
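To make the mechanics concrete, here is a sketch of the P step specialized to complete data (a single pattern, n_1 = n), where every A_k reduces to diag(w) and each β̂^{(k)} and B_k is a sub-block of the full weighted least squares fit. This is an illustrative reading of Theorem 1 and Corollary 2 under those simplifying assumptions, not the paper's general monotone implementation:

```python
import numpy as np
from scipy.linalg import cholesky, solve

rng = np.random.default_rng(2)

def pstep_complete_data(Y, X, w, m=None, A=None):
    """One draw of (beta, Psi) from Pr(beta, Psi | Y, X, w), complete data.
    Defaults to the Jeffreys prior (m = p, A = 0); requires n large enough
    that all chi-square degrees of freedom below are positive."""
    n, p = Y.shape
    q = X.shape[1]
    if m is None:
        m = p
    if A is None:
        A = np.zeros((p, p))
    W = np.diag(w)
    XtWX = X.T @ W @ X
    beta_hat = solve(XtWX, X.T @ W @ Y)          # weighted LS, all p columns
    resid = Y - X @ beta_hat
    B = A + resid.T @ W @ resid                  # B_1 = A + S_1
    # Theorem 1: build H column by column; B_k is the lower-right block of B_1
    H = np.zeros((p, p))
    for k in range(p):                           # 0-based; the paper's k is k + 1
        Lk = cholesky(np.linalg.inv(B[k:, k:]), lower=True)  # B_k^{-1} = L_k L_k'
        u = rng.normal(size=p - k)               # u_{i,k} ~ N(0, 1) for i > k
        df = n - (k + 1) + (m - p - q + 1)
        u[0] = np.sqrt(rng.chisquare(df))        # u_{k,k}^2 ~ chi^2_df
        H[k:, k] = Lk @ u
    Psi = np.linalg.inv(H @ H.T)                 # Psi^{-1} distributed as H H'
    # Corollary 2: beta = (G Z + (bhat^(1) h_1, ..., bhat^(p) h_p)) H^{-1}
    G = cholesky(np.linalg.inv(XtWX), lower=True)  # same G_k for all k here
    M = np.zeros((q, p))
    for k in range(p):
        M[:, k] = beta_hat[:, k:] @ H[k:, k]     # bhat^(k) h_k
    beta = (G @ rng.normal(size=(q, p)) + M) @ np.linalg.inv(H)
    return beta, Psi
```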

3.2 MDA With Known Weights and Observed Data Not Having a Monotone Pattern

With known weights and observed data that do not have a monotone pattern, a monotone pattern Y_MP is created consisting of Y_obs and the missing values Y_MP,mis that destroy the monotone pattern. Thus Y_MP = {Y_MP,mis, Y_obs}. A complete iteration of MDA with known weights and Y_MP consists of two steps:

1. I step. Fill in the missing data Y_MP,mis with a draw from Pr(Y_MP,mis | β, Ψ, Y_obs, X, w), which is a multivariate normal distribution.

2. P step. Simulate (β, Ψ) from Pr(β, Ψ | Y_MP, X, w), as in MDA with known weights and observed data having a monotone pattern.

3.3 MDA With Unknown Weights and Known Hyperparameters

When the weights w are unknown, each iteration of MDA consists of two steps:

1. I step. Impute the unknown weights w with a draw from Pr(w | β, Ψ, Y_obs, X, ν), then fill in the missing data Y_MP,mis (if Y_MP,mis ≠ ∅) with a draw from Pr(Y_MP,mis | w, β, Ψ, Y_obs, X, ν), as in MDA with known weights.

2. P step. Simulate (β, Ψ) from Pr(β, Ψ | Y_MP, X, w), as in MDA with monotone missing data.

To take draws from Pr(w | β, Ψ, Y_obs, X, ν), use the fact that, given β, Ψ, Y_obs, X, and ν, the $w_i^{(k)}$ (i = 1, …, n_k; k = 1, …, p) are independent and

$$\Pr\big(w_i^{(k)} = w \mid \beta, \Psi, Y_{obs}, X, \nu\big) \propto w^{p_i^{(k)}/2}\exp\big\{-w\,\delta_{i,obs}^{(k)}/2\big\}\,f(w \mid \nu), \qquad (7)$$

where $p_i^{(k)}$ is the dimension of the observed components $Y_{i,obs}^{(k)}$ of $Y_i^{(k)}$,

$$\delta_{i,obs}^{(k)} = \big(Y_{i,obs}^{(k)} - \mu_{i,obs}^{(k)}\big)'\big(\Psi_{i,obs}^{(k)}\big)^{-1}\big(Y_{i,obs}^{(k)} - \mu_{i,obs}^{(k)}\big),$$

$\mu_{i,obs}^{(k)}$ consists of the components of $\beta'X_i^{(k)}$ that correspond to the observed components of $Y_i^{(k)}$, and $\Psi_{i,obs}^{(k)}$ is the submatrix of Ψ corresponding to the scatter matrix of the observed components of $Y_i^{(k)}$. Therefore, a draw of w can be obtained by taking n independent draws, each from a one-dimensional distribution (7).

Clearly, for the multivariate t distribution,

$$w_i^{(k)} \mid \beta, \Psi, Y_{obs}, X, \nu \sim \mathrm{gamma}\big(\nu/2 + p_i^{(k)}/2,\ \nu/2 + \delta_{i,obs}^{(k)}/2\big);$$

for the multivariate slash distribution, the distribution of $w_i^{(k)} \mid \{\beta, \Psi, Y_{obs}, X, \nu\}$ is the incomplete gamma distribution with the density function proportional to

$$w^{\nu + p_i^{(k)}/2 - 1}\exp\big\{-w\,\delta_{i,obs}^{(k)}/2\big\} \qquad (0 < w < 1,\ \nu > 0),$$

from which random deviates can be generated using, for example, the techniques of Schmeiser and Lal (1980); and for the multivariate contaminated normal distribution,

$$\Pr\big(w_i^{(k)} = w \mid \beta, \Psi, Y_{obs}, X, \nu\big) = \begin{cases} \dfrac{\gamma\lambda^{p_i^{(k)}/2}\exp\{-\lambda\delta_{i,obs}^{(k)}/2\}}{\gamma\lambda^{p_i^{(k)}/2}\exp\{-\lambda\delta_{i,obs}^{(k)}/2\} + (1-\gamma)\exp\{-\delta_{i,obs}^{(k)}/2\}} & \text{if } w = \lambda, \\[2ex] \dfrac{(1-\gamma)\exp\{-\delta_{i,obs}^{(k)}/2\}}{\gamma\lambda^{p_i^{(k)}/2}\exp\{-\lambda\delta_{i,obs}^{(k)}/2\} + (1-\gamma)\exp\{-\delta_{i,obs}^{(k)}/2\}} & \text{if } w = 1. \end{cases}$$
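These one-dimensional weight draws are cheap. A sketch for a single case, with p_i and delta_i standing for $p_i^{(k)}$ and $\delta_{i,obs}^{(k)}$; the slash draw below uses cdf inversion of a truncated gamma as one alternative to the Schmeiser and Lal squeeze method:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

def draw_w_t_post(nu, p_i, delta_i):
    # w | ... ~ gamma(nu/2 + p_i/2, nu/2 + delta_i/2)  (shape, rate)
    return rng.gamma(shape=nu / 2 + p_i / 2, scale=1.0 / (nu / 2 + delta_i / 2))

def draw_w_slash_post(nu, p_i, delta_i):
    # density prop. to w**(nu + p_i/2 - 1) * exp(-w * delta_i / 2) on (0, 1):
    # a gamma(nu + p_i/2, rate delta_i/2) truncated to (0, 1), drawn by inversion
    g = stats.gamma(a=nu + p_i / 2, scale=2.0 / delta_i)
    return g.ppf(rng.uniform() * g.cdf(1.0))

def draw_w_contaminated_post(lam, gam, p_i, delta_i):
    # two-point posterior on {lam, 1} with the displayed probabilities
    a = gam * lam ** (p_i / 2) * np.exp(-lam * delta_i / 2)
    b = (1 - gam) * np.exp(-delta_i / 2)
    return lam if rng.uniform() < a / (a + b) else 1.0
```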

3.4 MDA With Unknown Weights and Unknown Hyperparameter

When the hyperparameter ν is unknown, it appears that there are two simple ways of grouping the variables w, Y_MP,mis, β, Ψ, and ν to use MDA for taking draws of the parameters and missing values. One way is to partition these variables into two groups: {w, Y_MP,mis} and {β, Ψ, ν}. Because (β, Ψ) and ν are conditionally independent given {w, Y_MP,mis}, the corresponding MDA algorithm can be implemented as follows:

1. I step. Impute {w, Y_MP,mis} as in MDA with unknown weights and known hyperparameters.

2. P step. Simulate (β, Ψ) as in MDA with unknown weights and known hyperparameters, and draw ν from Pr(ν | w).

This MDA version corresponds to the EM algorithm (Dempster et al. 1977) for the ML estimates of the parameters. Call this version of MDA the EM version.

The other way is to partition the variables into the two groups: {ν, w, Y_MP,mis} and {β, Ψ}. Then replace the step of the EM version that takes a draw of ν from Pr(ν | w) with the step that takes a draw of ν from Pr(ν | Y_obs, β, Ψ). This version of MDA corresponds to the ECME algorithm (Liu and Rubin 1994) for the ML estimates of the parameters. Call this version of MDA the ECME version. The EM version of MDA is easier to implement but converges more slowly than the ECME version.

Table 1. Data for the Creatinine Clearance Example (Shih and Weisberg 1986)

              Outcome variables           Predictor variable
No.    ln(SC)    ln(WT)    ln(CR)        ln(140 - AGE)
 1     -.3389     4.263     4.883            4.625
 2      .3931     4.234     3.970            4.127
 3      .7909     4.443     3.912            4.263
 4      .3542     4.605     4.407            4.248
 5     -.3877     4.078     4.700            4.554
 6     -.2774     4.290     4.605            4.317
 7      .1131     4.143     4.220            4.159
 8     -.0876     4.394     4.522            4.369
 9      .4379     4.304     4.094            4.277
10     -.0632     4.466     4.543            4.331
11     -.0047     4.369     4.654            4.304
12      .0718     4.533     4.585            4.511
13     -.3549     4.094     4.718            4.575
14     -.3389     4.248     4.828            4.585
15     -.0047     4.419     4.682            4.304
16      .9251     4.248     3.401            4.127
17      .1231     4.290     4.710            4.654
18      .1131     4.443     4.868            4.663
19      .3220     4.220     4.543            4.654
20      .1131     4.174     4.868            4.820
21     -.0277     3.970     4.078            4.454
22      .4738     3.912     3.638            4.205
23      .4596     4.304     4.174            4.304
24      .3382     4.205     4.443            4.691
25     -.3877     4.382     4.942            4.682
26      .1814     4.205     4.382            4.779
27     2.0280     4.220     1.459            4.078
28      .7419     4.279     3.766            4.575
29      .3054      ?        4.317            4.127
30      .0505      ?        3.714            4.625
31       -        4.673     4.787            4.357
32       -        4.317     3.951            4.248
33       -        4.127     4.290            4.344
34       -        3.951     4.043            4.277

NOTE: This table displays a created monotone pattern, where Y_MP,mis consists of the two missing values represented by "?", which will be filled in at each cycle of MDA. The four missing values denoted by "-" do not need to be filled in by MDA.

Discussion on drawing the degrees of freedom ν of the multivariate t distribution appears in earlier work (Liu 1995). For the multivariate slash distribution, use gamma(a, b), a conjugate prior with respect to w, with small positive values of a and b (b < a), as the prior distribution for ν. The posterior distribution of ν given w turns out to be

$$\nu \mid w \sim \mathrm{gamma}\Bigg(a + n + 1,\ b - \sum_{k=1}^{p}\sum_{i=1}^{n_k}\ln w_i^{(k)}\Bigg).$$

Thus the EM version of MDA is straightforward. However, the ECME version of MDA is more difficult than the EM version, because it requires evaluation of the incomplete gamma function.
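In code, the EM-version draw of ν for the slash model is a single gamma variate with the shape and rate displayed above. A minimal sketch (the default a and b are arbitrary small values with b < a, as in the text):

```python
import numpy as np

rng = np.random.default_rng(4)

def draw_nu_slash(w, a=0.01, b=0.001):
    """nu | w ~ gamma(a + n + 1, b - sum(log w_i))  (shape, rate)."""
    rate = b - np.log(w).sum()       # positive, since 0 < w_i < 1
    return rng.gamma(shape=a + len(w) + 1, scale=1.0 / rate)
```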

The hyperparameter of the multivariate contaminated normal distribution has two components: λ and γ. First, consider the case with fixed γ, which can be understood as fixing the proportion of the outliers in the data set, and use U(0, 1) as the prior distribution for λ. In this situation, w and λ are confounded. To generate λ, w must be integrated out; otherwise, λ will typically stay at its starting point forever. The MDA is as follows:

1. Draw (β, Ψ) from Pr(β, Ψ | Y_MP, w, λ, γ) as in MDA with known weights.

2. Draw (λ, w, Y_MP,mis) from

$$\Pr(\lambda \mid \beta, \Psi, Y_{obs}, \gamma)\,\Pr(Y_{MP,mis}, w \mid \beta, \Psi, Y_{obs}, \lambda, \gamma),$$

where Pr(λ | β, Ψ, Y_obs, γ) is proportional to

$$\prod_{i=1}^{n}\Big[\gamma\lambda^{p_i/2}\exp\{-\lambda\delta_{i,obs}/2\} + (1-\gamma)\exp\{-\delta_{i,obs}/2\}\Big] \qquad (0 < \lambda < 1),$$

from which draws can be obtained using, for example, the acceptance and rejection method with the Griddy technique (Ritter and Tanner 1992), as in earlier work (Liu 1995) for taking draws of the degrees of freedom of the multivariate t distribution.
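A simple grid-based ("griddy") approximation to this draw evaluates the unnormalized density of λ on a grid over (0, 1), normalizes, and samples a grid point. A sketch in that spirit (not the exact Ritter and Tanner scheme); p_obs and delta_obs collect $p_i$ and $\delta_{i,obs}$ over the n cases, and the computation is done on the log scale for stability:

```python
import numpy as np

rng = np.random.default_rng(5)

def draw_lambda_griddy(gam, p_obs, delta_obs, ngrid=200):
    """Approximate draw from Pr(lambda | beta, Psi, Y_obs, gamma) on (0, 1)."""
    lam = (np.arange(ngrid) + 0.5) / ngrid            # grid midpoints in (0, 1)
    # log prod_i [gam lam^{p_i/2} e^{-lam delta_i/2} + (1-gam) e^{-delta_i/2}]
    terms = np.log(
        gam * lam[:, None] ** (p_obs / 2) * np.exp(-lam[:, None] * delta_obs / 2)
        + (1 - gam) * np.exp(-delta_obs / 2)
    )
    logdens = terms.sum(axis=1)
    prob = np.exp(logdens - logdens.max())            # stabilize before normalizing
    return rng.choice(lam, p=prob / prob.sum())
```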

When γ is also treated as unknown, use beta(a, b) as the prior distribution for γ and take draws from Pr(γ | w, λ, β, Ψ, Y_MP) = Pr(γ | w, λ), which is simply

$$\mathrm{beta}\Bigg(a + \sum_{k=1}^{p}\sum_{i=1}^{n_k}\frac{1 - w_i^{(k)}}{1 - \lambda},\ \ b + \sum_{k=1}^{p}\sum_{i=1}^{n_k}\frac{w_i^{(k)} - \lambda}{1 - \lambda}\Bigg).$$

4. ESTIMATION OF REGRESSION COEFFICIENTS

Denote the outcome variables of interest by Y_out and the other components of Y by Y_pre. The ultimate linear regression model is of the form

$$E(Y_{out} \mid Y_{pre}, X, \beta, \Psi, \nu) = \alpha_{Y_{out}|Y_{pre}}Y_{pre} + \alpha_{Y_{out}|X}X. \qquad (8)$$

From Equations (1) and (2), it follows that

$$E(Y_{out} \mid Y_{pre}, X, \beta, \Psi) = \Psi_{Y_{out},Y_{pre}}\Psi_{Y_{pre},Y_{pre}}^{-1}\big(Y_{pre} - \beta_{pre}'X\big) + \beta_{out}'X, \qquad (9)$$

where β_out is the matrix comprising the columns of β corresponding to Y_out, β_pre is the matrix comprising the columns of β corresponding to Y_pre, Ψ_{Y_out,Y_pre} and Ψ_{Y_pre,Y_pre} partition Ψ_pre according to Y_out and Y_pre, and Ψ_pre is the matrix comprising the columns of Ψ that correspond to Y_pre. Comparing (8) to (9) yields the following equalities:

$$\alpha_{Y_{out}|Y_{pre}} = \Psi_{Y_{out},Y_{pre}}\Psi_{Y_{pre},Y_{pre}}^{-1} \qquad (10)$$

and

$$\alpha_{Y_{out}|X} = \beta_{out}' - \Psi_{Y_{out},Y_{pre}}\Psi_{Y_{pre},Y_{pre}}^{-1}\beta_{pre}'. \qquad (11)$$

With draws of (β, Ψ), samples of (α_{Y_out|Y_pre}, α_{Y_out|X}) can be created using Equations (10) and (11). Therefore, it is straightforward to estimate the regression coefficient matrix (α_{Y_out|Y_pre}, α_{Y_out|X}) using Monte Carlo methods.
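Given MDA draws of (β, Ψ), Equations (10) and (11) are a few lines of linear algebra per draw. A sketch, with out_idx and pre_idx the column indices of Y_out and Y_pre (an illustrative split, not from the paper):

```python
import numpy as np

def regression_coefs(beta, Psi, out_idx, pre_idx):
    """Map one draw of (beta, Psi) to (alpha_{Yout|Ypre}, alpha_{Yout|X})
    via Equations (10) and (11)."""
    P_op = Psi[np.ix_(out_idx, pre_idx)]                    # Psi_{Yout,Ypre}
    P_pp = Psi[np.ix_(pre_idx, pre_idx)]                    # Psi_{Ypre,Ypre}
    a_pre = P_op @ np.linalg.inv(P_pp)                      # eq. (10)
    a_x = beta[:, out_idx].T - a_pre @ beta[:, pre_idx].T   # eq. (11)
    return a_pre, a_x
```

Applying this map to every retained draw and summarizing the resulting samples gives the Monte Carlo posterior of (α_{Y_out|Y_pre}, α_{Y_out|X}).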

5. AN ILLUSTRATIVE EXAMPLE

Here an illustrative example uses the data in Table 1, tabulated according to Figure 1 using the data from table 1 of Shih and Weisberg (1986). The data are from a clinical trial on 34 male patients with three covariates, body weight (WT) in kg, serum creatinine (SC) concentration in mg/deciliter, and age (AGE) in years, and one outcome variable, endogenous creatinine (CR) clearance. Of the 34 male patients, 2 had no recorded WT and 4 were missing SC; the missingness is assumed to be ignorable (as in Shih and Weisberg 1986). A typical model recommended in many pharmacokinetics textbooks for modeling CR as a function of WT, SC, and AGE (see Shih and Weisberg 1986) is of the form

$$E\big(\ln(\mathrm{CR}) \mid \mathrm{WT}, \mathrm{SC}, \mathrm{AGE}\big) = \beta_0 + \beta_{\mathrm{WT}}\ln(\mathrm{WT}) + \beta_{\mathrm{SC}}\ln(\mathrm{SC}) + \beta_{140-\mathrm{AGE}}\ln(140 - \mathrm{AGE}). \qquad (12)$$


[Figure 2. Estimated Posterior Density Functions of the Regression Coefficients (a) β_0, (b) β_WT, (c) β_SC, and (d) β_{140−AGE} in the Clinical Trial Example, Based on 10,000 Draws Using MDA, With Model t_3 With All of the Observations (solid line), Model N_3 With All of the Observations (dashed line), and Model N_3 With Patient 27 Deleted From the Data Set (dotted line).]


To illustrate the methodology of the article, consider two multivariate linear regression models:

* Model t_3, with the multivariate t distribution: $(\ln(\mathrm{SC}), \ln(\mathrm{WT}), \ln(\mathrm{CR}))' \sim t_3\big(\beta'(1, \ln(140 - \mathrm{AGE}))', \Psi, \nu\big)$, with the prior distribution $\Pr(\beta, \Psi, \nu) \propto |\Psi|^{-(p+1)/2}\,\nu^{-2}$ (ν > 1).

* Model N_3, with the multivariate normal distribution: $(\ln(\mathrm{SC}), \ln(\mathrm{WT}), \ln(\mathrm{CR}))' \sim N_3\big(\beta'(1, \ln(140 - \mathrm{AGE}))', \Psi\big)$, with the prior distribution $\Pr(\beta, \Psi) \propto |\Psi|^{-(p+1)/2}$.

MDA was used to simulate the posterior distributions of the regression coefficients (β_0, β_WT, β_SC, β_{140−AGE}) in Equation (12) under both models. The results are displayed in Figure 2. Because of the influential observation, patient 27, the results from model N_3 with all of the patients are quite different from those based on t_3 with all of the patients, especially the posterior distribution of the regression coefficient β_SC. With the influential observation removed from the data set, the posterior distributions of the regression coefficients from model N_3 tend to agree with those from model t_3 with all of the patients. However, the posterior distributions obtained using model N_3 with the influential observation deleted have smaller tail probabilities than those obtained using model t_3 with all of the observations. This implies that deletion of influential observations may result in underestimation of uncertainties, which leads to unreliable inferences or multiple imputation.

6. CONCLUSIONS

This article has described Bayesian robust estimation for multivariate linear regression using normal/independent distributions with fully observed predictor variables and possible ignorable missing values from outcome variables. These models provide modeling flexibility in practice for data containing missing values and observations that are influential for linear regression models with the normal distribution. An efficient MDA algorithm using extensions of Bartlett's decomposition is provided to implement Bayesian estimation for these models. The algorithm gives a way to simulate the posterior distributions of the parameters (and their functions) of the models using Monte Carlo methods. The methods can also be used for robust multiple imputations for incomplete data sets.

For data analysis and multiple imputation using these models, transformations of continuous variables should always be considered; for example, the models require that the distribution of the residuals be ellipsoidally symmetric. Some diagnostic methods for model checking using, for example, posterior predictive checking (Gelman, Meng, and Stern 1996; Rubin 1984) can be useful.

MDA is an iterative simulation method except in the case of a (complete) monotone pattern and known weights. As with any other iterative simulation method, a general issue concerns how to choose starting points for MDA. ECME (Liu and Rubin 1994, 1995) can be useful in finding starting points. Another outstanding issue in using iterative simulation methods is how to judge the convergence of MDA. The method of Gelman and Rubin (1992) can be helpful in monitoring the convergence of MDA.
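For the convergence check mentioned above, the Gelman and Rubin (1992) diagnostic compares between- and within-chain variances of parallel MDA runs started from dispersed points. A minimal sketch of the basic (unsplit) potential scale reduction factor for one scalar parameter:

```python
import numpy as np

def gelman_rubin(chains):
    """Basic potential scale reduction factor R-hat.
    chains: array of shape (m_chains, n_draws) for one scalar parameter."""
    m, n = chains.shape
    W = chains.var(axis=1, ddof=1).mean()        # within-chain variance
    B = n * chains.mean(axis=1).var(ddof=1)      # between-chain variance
    var_plus = (n - 1) / n * W + B / n           # pooled variance estimate
    return np.sqrt(var_plus / W)                 # values near 1 suggest convergence
```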

APPENDIX: PROOF OF THEOREM 1

Following the proof of Theorem 1 of Liu (1993), the conditional posterior distribution of β and Ψ given w, ν, and (Y_MP, X) is

$$\Pr(\beta, \Psi \mid Y_{MP}, \nu, w) \propto |\Psi|^{-(m+1)/2}\exp\Bigg\{-\frac{1}{2}\Bigg(\mathrm{tr}\big[(\Psi^{(1)})^{-1}A\big] + \sum_{k=1}^{p}\mathrm{tr}\big[(\Psi^{(k)})^{-1}R^{(k)}\big]\Bigg)\Bigg\} \times \prod_{k=1}^{p}\big|\Psi^{(k)}\big|^{-n_k/2},$$


where R^{(k)} is the weighted total sum of residual squares and cross-products matrix of the sample $\{(y_{i,k}^{(k)}, \ldots, y_{i,p}^{(k)}): i = 1, \ldots, n_k\}$ with weights $\{w_i^{(k)}: i = 1, \ldots, n_k\}$; that is,

$$R^{(k)} = \sum_{i=1}^{n_k} w_i^{(k)}\big(Y_{i,[k:p]}^{(k)} - (\beta^{(k)})'X_i^{(k)}\big)\big(Y_{i,[k:p]}^{(k)} - (\beta^{(k)})'X_i^{(k)}\big)',$$

where β^{(k)} is the last p − k + 1 columns of the regression coefficients β of Y on X. Let $\Psi^{-1} = HH'$ with $H = (h_{i,j})$ lower triangular; then

$$\Pr(\beta, H \mid Y_{MP}, \nu, w) \propto \exp\Bigg\{-\frac{1}{2}\sum_{k=1}^{p}\big(h_k'R_kh_k + h_k'A_kh_k\big)\Bigg\} \times \prod_{k=1}^{p} h_{k,k}^{(\sum_{j=1}^{k}n_j + m - p - k)}, \qquad (A.1)$$

where $h_k = (h_{k,k}, \ldots, h_{p,k})'$ for k = 1, …, p, and R_k is the weighted total sum of residual squares and cross-products matrix of the samples $\{(Y_{i,[k:p]}^{(j)}, X_i^{(j)}): i = 1, \ldots, n_j;\ j = 1, \ldots, k\}$; that is,

$$R_k = \sum_{j=1}^{k}\sum_{i=1}^{n_j} w_i^{(j)}\big(Y_{i,[k:p]}^{(j)} - (\beta^{(k)})'X_i^{(j)}\big)\big(Y_{i,[k:p]}^{(j)} - (\beta^{(k)})'X_i^{(j)}\big)'.$$

Let $A^{(1)} = A_1$ and $A^{(j)} = A_j - A_{j-1} = \mathrm{diag}\{0, \ldots, 0, w_1^{(j)}, \ldots, w_{n_j}^{(j)}, 0, \ldots, 0\}$ for j = 2, …, p, and let

$$\theta = \big(\hat\beta^{(1)}h_1, \ldots, \hat\beta^{(p)}h_p\big)H^{-1}$$

and

$$D = \big((H')^{-1} \otimes I_q\big)\,\mathrm{diag}\big\{(X'A_1X)^{-1}, \ldots, (X'A_pX)^{-1}\big\}\,\big(H^{-1} \otimes I_q\big).$$

Then

$$\sum_{k=1}^{p} h_k'R_kh_k = \sum_{k=1}^{p} h_k'S_kh_k + \sum_{k=1}^{p}\sum_{j=1}^{k} h_k'\big(\beta^{(k)} - \hat\beta^{(k)}\big)'X'A^{(j)}X\big(\beta^{(k)} - \hat\beta^{(k)}\big)h_k.$$

Using theorems 1, 2, and 3 of Searle (1982, pp. 332-333) yields

$$\sum_{k=1}^{p}\begin{pmatrix}0\\ h_k\end{pmatrix}\begin{pmatrix}0\\ h_k\end{pmatrix}' \otimes X'A_kX = \sum_{j=1}^{p}\big(HJ_jH'\big) \otimes \big(X'A^{(j)}X\big)$$
$$= (H \otimes I_q)(I_p \otimes X')\,\mathrm{diag}\{A_1, \ldots, A_p\}\,(I_p \otimes X)(H' \otimes I_q) = D^{-1},$$

where

$$J_k = \begin{pmatrix}0 & 0\\ 0 & I_{p-k+1}\end{pmatrix}_{p \times p}.$$

Similarly,

$$D\sum_{k=1}^{p}\Bigg[\begin{pmatrix}0\\ h_k\end{pmatrix}\begin{pmatrix}0\\ h_k\end{pmatrix}' \otimes X'A_kX\Bigg]\mathrm{vec}\big((X'A_kX)^{-1}X'A_kY\big) = D\sum_{k=1}^{p}\mathrm{vec}\Bigg(X'A_kY\begin{pmatrix}0\\ h_k\end{pmatrix}\begin{pmatrix}0\\ h_k\end{pmatrix}'\Bigg)$$
$$= D\sum_{j=1}^{p}\mathrm{vec}\big(X'A^{(j)}YHJ_jH'\big) = \mathrm{vec}(\theta),$$

so that

$$(\mathrm{vec}(\theta))'D^{-1}\mathrm{vec}(\theta) = \sum_{k=1}^{p} h_k'\big(\hat\beta^{(k)}\big)'X'A_kX\,\hat\beta^{(k)}h_k = \sum_{k=1}^{p}\Big(\mathrm{vec}\big((X'A_kX)^{-1}X'A_kY\big)\Big)'\,\mathrm{vec}\Bigg(X'A_kY\begin{pmatrix}0\\ h_k\end{pmatrix}\begin{pmatrix}0\\ h_k\end{pmatrix}'\Bigg),$$


and, hence,

$$\sum_{k=1}^{p}\sum_{j=1}^{k} h_k'\big(\beta^{(k)} - \hat\beta^{(k)}\big)'X'A^{(j)}X\big(\beta^{(k)} - \hat\beta^{(k)}\big)h_k$$
$$= \sum_{k=1}^{p}\Big(\mathrm{vec}\big(\beta - (X'A_kX)^{-1}X'A_kY\big)\Big)'\Bigg[\begin{pmatrix}0\\ h_k\end{pmatrix}\begin{pmatrix}0\\ h_k\end{pmatrix}' \otimes X'A_kX\Bigg]\mathrm{vec}\big(\beta - (X'A_kX)^{-1}X'A_kY\big)$$
$$= (\mathrm{vec}(\beta))'D^{-1}\mathrm{vec}(\beta) - 2(\mathrm{vec}(\beta))'D^{-1}\mathrm{vec}(\theta) + (\mathrm{vec}(\theta))'D^{-1}\mathrm{vec}(\theta)$$
$$= (\mathrm{vec}(\beta - \theta))'D^{-1}\mathrm{vec}(\beta - \theta).$$

Thus (A.1) can be factored into

$$\Pr(\beta, H \mid Y_{MP}, \nu, w) \propto \exp\Bigg\{-\frac{1}{2}\sum_{k=1}^{p} h_k'(S_k + A_k)h_k\Bigg\} \times \prod_{k=1}^{p} h_{k,k}^{(\sum_{j=1}^{k}n_j + m - p - k)}$$
$$\times \exp\Bigg\{-\frac{1}{2}(\mathrm{vec}(\beta - \theta))'D^{-1}\mathrm{vec}(\beta - \theta)\Bigg\}. \qquad (A.2)$$

Therefore, Pr(β | Ψ, Y_MP, ν, w) is normal with mean vec(θ) and covariance D, and then

$$\Pr(H \mid Y_{MP}, \nu, w) \propto \exp\Bigg\{-\frac{1}{2}\sum_{k=1}^{p} h_k'B_kh_k\Bigg\}\prod_{k=1}^{p} h_{k,k}^{(\sum_{j=1}^{k}n_j - k - q + m - p)}.$$

Finally, because $J(H \to (u_1, \ldots, u_p)) = \prod_{k=1}^{p}|L_k|$, it follows that

$$\Pr(u_1, \ldots, u_p \mid Y_{MP}, \nu, w) \propto \exp\Bigg\{-\frac{1}{2}\sum_{k=1}^{p} u_k'u_k\Bigg\}\prod_{k=1}^{p} u_{k,k}^{(\sum_{j=1}^{k}n_j - k + m - p - q)};$$

that is, the $u_{i,j}$ are mutually independent with $u_{i,j} \sim N(0, 1)$ for $1 \le j < i \le p$ and $u_{k,k}^2 \sim \chi^2_{n_1+\cdots+n_k-k+(m-p-q+1)}$ for k = 1, …, p.

This finishes the proof of Theorem 1.

[Received October 1994. Revised September 1995.]

REFERENCES

Andrews, D. F., and Mallows, C. L. (1974), "Scale Mixtures of Normal Distributions," Journal of the Royal Statistical Society, Ser. B, 36, 99-102.

Bartlett, M. S. (1933), "On the Theory of Statistical Regression," Proceedings of the Royal Society of Edinburgh, 53, 260-283.

Box, G. E. P., and Tiao, G. C. (1973), Bayesian Inference in Statistical Analysis, Reading, MA: Addison-Wesley.

Cornish, E. A. (1954), "The Multivariate t Distribution Associated With a Set of Normal Standard Deviates," Australian Journal of Physics, 7, 531-542.

Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977), "Maximum Likelihood From Incomplete Data via the EM Algorithm," Journal of the Royal Statistical Society, Ser. B, 39, 1-38.

(1980), "Iteratively Reweighted Least Squares for Linear Regression When Errors Are Normal/Independent Distributed," in Multivariate Analysis, Vol. V, ed. P. R. Krishnaiah, North-Holland, pp. 35-57.

Dunnett, C. W., and Sobel, M. (1954), "A Bivariate Generalization of Student's t Distribution With Tables for Certain Cases," Biometrika, 41, 153-169.

Gelman, A., Meng, X. L., and Stern, H. (1996), "Posterior Predictive Assessment of Model Fitness via Realized Discrepancies" (with discussion), Statistica Sinica, to appear.

Gelman, A., and Rubin, D. B. (1992), "Inference From Iterative Simulation Using Multiple Sequences" (with discussion), Statistical Science, 7, 457-511.

Lange, K. L., Little, R. J. A., and Taylor, J. M. G. (1989), "Robust Statistical Modeling Using the t Distribution," Journal of the American Statistical Association, 84, 881-896.

Lange, K., and Sinsheimer, J. S. (1993), "Normal/Independent Distributions and Their Applications in Robust Regression," Journal of Computational and Graphical Statistics, 2, 175-198.

Little, R. J. A. (1988), "Robust Estimation of the Mean and Covariance Matrix From Data With Missing Values," Applied Statistics, 37, 23-39.

Little, R. J. A., and Rubin, D. B. (1987), Statistical Analysis With Missing Data, New York: John Wiley.

Liu, C. H. (1993), "Bartlett's Decomposition of the Posterior Distribution of the Covariance for Normal Monotone Ignorable Missing Data," Journal of Multivariate Analysis, 46, 198-206.

(1994), "Statistical Analysis Using the Multivariate t Distribution," Ph.D. thesis, Harvard University, Dept. of Statistics.

(1995), "Missing Data Imputation Using the Multivariate t Distribution," Journal of Multivariate Analysis, 53, 139-158.

Liu, C. H., and Rubin, D. B. (1994), "The ECME Algorithm: A Simple Extension of EM and ECM With Faster Monotone Convergence," Biometrika, 81, 633-648.

(1995), "ML Estimation of the Multivariate t Distribution With Unknown Degrees of Freedom," Statistica Sinica, 5, 19-39.

Meng, X. L., and Rubin, D. B. (1993), "Maximum Likelihood Estimation via the ECM Algorithm: A General Framework," Biometrika, 80, 267-278.

Ritter, C., and Tanner, M. (1992), "Facilitating the Gibbs Sampler: The Gibbs Stopper and the Griddy-Gibbs Sampler," Journal of the American Statistical Association, 87, 861-868.

Rogers, W. H., and Tukey, J. W. (1972), "Understanding Some Long-Tailed Distributions," Statistica Neerlandica, 26, 211-226.

Rubin, D. B. (1976), "Inference and Missing Data," Biometrika, 63, 581-592.

(1984), "Bayesianly Justifiable and Relevant Frequency Calculations for the Applied Statistician," The Annals of Statistics, 12, 1151-1172.


Rubin, D. B., and Schafer, J. L. (1990), "Efficiently Creating Multiple Imputations for Incomplete Multivariate Normal Data," in Proceedings of the Statistical Computing Section, American Statistical Association, pp. 83-88.

Schmeiser, B. W., and Lal, R. (1980), "Squeeze Methods for Generating Gamma Variables," Journal of the American Statistical Association, 75, 679-682.

Searle, S. R. (1982), Matrix Algebra Useful for Statistics, New York: John Wiley.

Shih, W. J., and Weisberg, S. (1986), "Assessing Influence in Multiple Linear Regression With Incomplete Data," Technometrics, 28, 231-239.

Tanner, M. A., and Wong, W. H. (1987), "The Calculation of Posterior Distributions by Data Augmentation" (with discussion), Journal of the American Statistical Association, 82, 528-550.

Tukey, J. W. (1960), "A Survey of Sampling From Contaminated Distributions," in Contributions to Probability and Statistics I, ed. I. Olkin, Stanford, CA: Stanford University Press, pp. 448-485.