[Wiley Series in Probability and Statistics] Methods and Applications of Linear Models (Regression...

PART II

ANALYSIS OF VARIANCE MODELS

Methods and Applications of Linear Models: Regression and the Analysis of Variance, 2nd Edition. Ronald R. Hocking

Copyright © 2003 John Wiley & Sons, Inc. ISBN: 0-471-23222-X

9 Introduction to Analysis of Variance Models

In Part II we describe and develop the methodology for a general class of analysis of variance models. In this chapter we introduce some basic concepts in terms of two specific models and, in subsequent chapters, the analysis of these and more complex models is developed in detail. Throughout Part II, we will focus on basic principles, providing explanations for some controversies that have arisen over the years and developing alternative methods for computing and analyzing the data.

9.1 BACKGROUND INFORMATION

We have encountered the term, analysis of variance, in our discussion of regression models as a description of the tabular display of the sums of squares associated with the tests of certain hypotheses. Historically, the term arose in the study of situations where the response is determined by one or more factors and the objective was to examine sources of variation in the response as a function of these factors. The use of the term, design matrix, for the X matrix in the regression model arose in that same setting, where the matrix reflected the design of the experiment that had been used for the study. The two terms, Analysis of Variance Models and Design of Experiment Models, are commonly used to describe the models for the analysis of such experiments.

The models to be encountered in Part II are mathematically equivalent to the regression models from Part I and hence the concepts of inference developed in Chapter 2 are applicable. The primary distinction lies in the emphasis on hypothesis testing and the nature of the hypotheses that are considered with the AOV models. In particular, we will be making inferences on the means, variances and covariances of normal populations. The structure on the means and the covariance matrices will be a consequence of the experimental design that led to the data.


We will discuss the classical approach to model description and analysis and also introduce an alternative way to describe the model. This will lead to different computational methods and a more informative procedure for estimating the components of variation. The basis for the alternative discussion is the cell means model which will be described in the next section.

9.2 CELL MEANS MODEL

We begin the discussion by considering a generalization of an application that we encountered in Example 7.6, where we were interested in comparing the means of two normal populations. The extension of that example to comparing the means of several populations is fundamental to our development.

The concept of the cell means model was introduced by Hocking and Speed (1975) to resolve some of the confusion associated with analysis of variance models. In fact the idea was not new and, as noted in the historical account by Urquhart, Weeks, and Henderson (1973), this form of the model preceded the current expressions used for analysis of variance models. We will see that this concept is especially useful for understanding the analysis of variance models. To motivate the model description and the nature of the analysis, we recall two simple examples from elementary statistical inference.

Example 9.1. Let $y_r$, $r = 1, \ldots, N$ denote a random sample from $N(\mu, \sigma^2)$. In linear model form we write

$$y_r = \mu + e_r, \tag{9.1}$$

where the errors, $e_r$, are independent, $N(0, \sigma^2)$. In matrix form we write

$$y = \mu J + e, \tag{9.2}$$

where the design matrix, $J$, is the column vector of ones. Of interest is the estimation of $\mu$ and $\sigma^2$ and a test of the hypothesis, $H_0 : \mu = \mu_0$.

Example 9.2. Suppose that we sample two normal populations with means $\mu_1$ and $\mu_2$ and common variance $\sigma^2$. Let $y_{ir}$, $i = 1, 2$ and $r = 1, \ldots, n_i$ denote the $r$th observation on the $i$th population. Thus $y_{ir} \sim N(\mu_i, \sigma^2)$. In equation form we write

$$y_{ir} = \mu_i + e_{ir}, \tag{9.3}$$

where the errors are independent, $N(0, \sigma^2)$. To express this model in regression form, we write

$$y_{ir} = \mu_1 x_{1i} + \mu_2 x_{2i} + e_{ir}, \tag{9.4}$$


where $x_{1i} = 1$ if $i = 1$ and 0 otherwise. Similarly, $x_{2i}$ indicates the observations from the second population.

To write this model in matrix form, let $y_1$ and $y_2$ denote the vectors of lengths $n_1$ and $n_2$ of observations from the two populations and let $y$ be the vertical concatenation of these two vectors. Thus

$$y = \begin{bmatrix} y_1 \\ y_2 \end{bmatrix}. \tag{9.5}$$

Further, let $\mu$ be the parameter vector with elements $\mu_1$ and $\mu_2$, and let the design matrix be written as

$$W = \begin{bmatrix} J_1 & 0 \\ 0 & J_2 \end{bmatrix}, \tag{9.6}$$

where $J_i$ is a vector of ones of length $n_i$. It follows that we may describe the linear model in matrix form as

$$y = W\mu + e, \tag{9.7}$$

where $e \sim N(0, \sigma^2 I)$. Of interest is the estimation of the means and variance and a test of the hypothesis, $H_0 : \mu_1 = \mu_2$.

These two examples are special cases of the problem where we take a sample of size $n_i$ from each of $p$ normal populations with common variance, $\sigma^2$, but, possibly, different means, $\mu_i$. Let $y$ denote the vertical concatenation of the individual sample vectors $y_i$, let $\mu$ be the vector of means and let the design matrix be

$$W = \begin{bmatrix} J_1 & 0 & \cdots & 0 \\ 0 & J_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & J_p \end{bmatrix}, \tag{9.8}$$

where $J_i$ is a vector of ones of length $n_i$. The model is given by (9.7) with design matrix (9.8). The role of $W$ in this model is to associate the observations with their mean values and to indicate the number of observations with mean $\mu_i$. For emphasis, we will refer to $W$ as the cell means design matrix.

For reference, we make the following definition:

Definition 9.1. The cell means model is defined to be the special case of a linear model for which $y = W\mu + e$ with $W$ defined by (9.8). The parameters $\mu_i$ are known as cell means, and the sample sizes $n_i$ are known as cell frequencies.
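As an illustration of Definition 9.1, the following minimal sketch builds the cell means design matrix for unequal cell frequencies. The function name `cell_means_design` and the cell frequencies used are hypothetical choices made for this example, not part of the text.

```python
import numpy as np

def cell_means_design(cell_sizes):
    """Return the N x p cell means design matrix W for the given cell frequencies."""
    p = len(cell_sizes)
    N = sum(cell_sizes)
    W = np.zeros((N, p))
    row = 0
    for i, n_i in enumerate(cell_sizes):
        W[row:row + n_i, i] = 1.0   # J_i: a column of n_i ones in block i
        row += n_i
    return W

cell_sizes = [2, 3, 1]               # hypothetical cell frequencies n_1, n_2, n_3
W = cell_means_design(cell_sizes)

# W'W is diagonal with the cell frequencies on the diagonal
WtW = W.T @ W
```

Each row of `W` contains a single one, which marks the population from which that observation was drawn, and `WtW` recovers the cell frequencies.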


The name population means model, or simply means model, might seem more appropriate. We will see that the populations are usually associated with the cells of a table, hence the term, cell means, has been adopted.

In this section we have focused on the structure of the expected value of the response vector. In general, the phrase, cell means model, will refer only to this mean structure. In Section 9.3, in our introduction to fixed effects models, we will assume the scalar covariance structure $V = \sigma^2 I$. In our discussion of mixed effects models in Section 9.4, we will allow for more complex covariance matrices.

If the cell frequencies $n_i$ are all equal to $n$ then the data are said to be balanced. We will see that this greatly simplifies the notation and the analysis. To develop the notation for the balanced case, we note that the design matrix $W$ in (9.8) can be written compactly using the Kronecker product notation. From Appendix A.1.8, we see that the Kronecker product of the matrices $A$ and $B$ is given by

$$A \otimes B = (a_{ij} B). \tag{9.9}$$

Thus the Kronecker product of two matrices is a matrix whose basic shape is that of $A$ and whose elements are the matrices $a_{ij} B$. Using this notation, the design matrix for the cell means model is

$$W = I_p \otimes J_n. \tag{9.10}$$

In addition to the notational convenience, we will see that the properties of the Kronecker product described in Appendix A.1.8 will greatly simplify the matrix computations to be developed for analysis of variance models. The idea of balance in this setting is quite simple. The term will be defined generally when we discuss more complex applications.
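The Kronecker product form (9.10) can be checked numerically. The sketch below uses NumPy's `np.kron`; the values of `p` and `n` and the small matrices `A` and `B` are hypothetical and chosen only for illustration.

```python
import numpy as np

p, n = 3, 2                          # hypothetical: 3 populations, 2 observations each
J_n = np.ones((n, 1))                # column vector of n ones
W = np.kron(np.eye(p), J_n)          # W = I_p (x) J_n, shape (p*n, p)

# A small generic case: the result has the block shape of A with blocks a_ij * B
A = np.array([[1, 2],
              [3, 4]])
B = np.array([[0, 1]])
AB = np.kron(A, B)
```

Here `W` has one block of `n` ones per population, and `AB` has the shape of `A` with each element replaced by a scaled copy of `B`.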

Parameter estimation in the cell means model is elementary with

$$\hat{\mu}_i = \bar{y}_{i\cdot} = \frac{1}{n_i} \sum_{r=1}^{n_i} y_{ir}, \quad i = 1, \ldots, p, \tag{9.11}$$

and

$$\hat{\sigma}^2 = \sum_{i=1}^{p} (n_i - 1) s_i^2 \Big/ \sum_{i=1}^{p} (n_i - 1), \tag{9.12}$$

where $s_i^2$ is the sample variance for the $i$th population. Thus population means are estimated by sample means and the common variance estimate is the degrees-of-freedom weighted average of the individual population variances. (In the expression for the sample mean we have used the standard dot ($\cdot$) and bar ($\bar{\ }$) notation to indicate that we have summed over the second subscript and divided by the range on that subscript. We will make extensive use of this notation in this and later chapters.)
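The estimates (9.11) and (9.12) can be computed directly and cross-checked against least squares on the cell means design matrix. This is a minimal sketch; the data values are hypothetical and the variable names are chosen for this example only.

```python
import numpy as np

# Hypothetical samples from p = 3 populations with unequal cell frequencies
samples = [np.array([4.0, 6.0]),
           np.array([1.0, 2.0, 3.0]),
           np.array([10.0, 12.0, 14.0, 16.0])]

mu_hat = np.array([y_i.mean() for y_i in samples])      # (9.11): sample means
df = np.array([len(y_i) - 1 for y_i in samples])        # n_i - 1
s2 = np.array([y_i.var(ddof=1) for y_i in samples])     # sample variances s_i^2
sigma2_hat = (df * s2).sum() / df.sum()                 # (9.12): pooled variance

# Cross-check: least squares on the cell means design matrix W
y = np.concatenate(samples)
W = np.zeros((len(y), len(samples)))
start = 0
for i, y_i in enumerate(samples):
    W[start:start + len(y_i), i] = 1.0
    start += len(y_i)
mu_ls, *_ = np.linalg.lstsq(W, y, rcond=None)
rss = np.sum((y - W @ mu_ls) ** 2)                      # residual sum of squares
```

The least squares solution `mu_ls` agrees with the sample means, and `rss` divided by $N - p$ agrees with the pooled variance estimate.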


The utility of this model will be apparent as we consider special cases in the next three chapters. In the following sections, we provide an introduction to some of the issues that will arise.

9.3 FIXED EFFECTS MODELS

To illustrate the use of the cell means model we discuss two applications. In these applications, we assume that all populations have a common variance and hence our primary interest is in making inferences on the population means. The populations in this model may be explicitly defined such as the populations of males and females at a given university. Alternatively, the populations may be defined implicitly in that the data are obtained by applying treatments to identical experimental units. The treatments may be defined by a quantitative variable, such as temperature in the particle board study of Example 1.1, or by a qualitative variable such as type of fertilizer or method of preparation. The essential point is that the population is observable and we need not distinguish between explicitly and implicitly defined populations.

The simplicity of the cell means model is deceptive. We will see that it can be used to describe any of the models that are classically known as fixed effects, analysis of variance models. In this class of models, the factors describing the populations are typically qualitative, but quantitative factors may be included. The simplest of these is the one-way classification model and we introduce it in the next section.

9.3.1 One-way Classification Model

In Example 7.6, we examined the strength of products produced by two different processes. Those data may be described by the cell means model with $p = 2$. Of interest in that example was a comparison of the means and we might have been interested in testing the hypothesis of equal means or developing a confidence interval on the difference, $\mu_1 - \mu_2$. In our regression analysis of those data, we did not use the cell means model but rather, we considered the model

$$y = \theta_0 + \theta_1 z + e, \tag{9.13}$$

where $y$ is the concatenated data vector described in (9.7) and $z$ has elements 1 if $i = 1$ and 0 if $i = 2$. We noted the parameter relations $\mu_1 = \theta_0 + \theta_1$ and $\mu_2 = \theta_0$. Equivalently, $\theta_1 = \mu_1 - \mu_2$. Thus inference about the difference in the population means is conveniently made using the parameter $\theta_1$ and applying basic theory from linear regression. (Recall, from Exercise 7.12, that the test statistic for the hypothesis, $H_0 : \theta_1 = 0$, is identical to the standard t-statistic for the hypothesis of equal means.) This is our first example of redefining the model in terms of linear functions of the original parameters to simplify the analysis. Such reparameterizations will be of fundamental importance in our discussion of the classical and cell means approach to such models.
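The equivalence between the regression test of $H_0 : \theta_1 = 0$ in model (9.13) and the standard pooled two-sample t-statistic can be verified numerically. The sketch below uses hypothetical data (not the data of Example 7.6) and forms both statistics from their usual definitions.

```python
import numpy as np

y1 = np.array([5.1, 4.8, 5.6, 5.0])       # hypothetical data, process 1
y2 = np.array([4.2, 4.9, 4.4])            # hypothetical data, process 2
y = np.concatenate([y1, y2])
z = np.concatenate([np.ones(len(y1)), np.zeros(len(y2))])
X = np.column_stack([np.ones(len(y)), z])  # columns for theta_0 and theta_1

# Regression estimates and the t statistic for theta_1
theta_hat = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ theta_hat
s2 = resid @ resid / (len(y) - 2)
cov_theta = s2 * np.linalg.inv(X.T @ X)
t_regression = theta_hat[1] / np.sqrt(cov_theta[1, 1])

# Standard pooled two-sample t statistic
n1, n2 = len(y1), len(y2)
sp2 = ((n1 - 1) * y1.var(ddof=1) + (n2 - 1) * y2.var(ddof=1)) / (n1 + n2 - 2)
t_pooled = (y1.mean() - y2.mean()) / np.sqrt(sp2 * (1 / n1 + 1 / n2))
```

The estimate `theta_hat[0]` is the sample mean of the second process, `theta_hat[1]` is the difference in sample means, and the two t statistics coincide.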

To extend that example, suppose we were interested in comparing three processes. To test for differences in the processes we consider the hypothesis of equal means. Further, we might like to examine confidence intervals on differences in pairs of means. We will see that the cell means model may be analyzed directly but, for illustration, we introduce a reparameterization so that we can use regression theory. We encountered one such parameterization in Example 7.7, where we used the model

$$y = \theta_0 + \theta_1 z_1 + \theta_2 z_2 + e. \tag{9.14}$$

Here $y$ is the data vector, $z_1$ has elements 1 if $i = 1$ and 0 otherwise, and $z_2$ has elements 1 if $i = 2$ and 0 otherwise. The parameter relations are $\theta_1 = \mu_1 - \mu_3$, $\theta_2 = \mu_2 - \mu_3$ and $\theta_0 = \mu_3$. The hypothesis of equal means is $H_0 : \theta_1 = \theta_2 = 0$. Confidence intervals on $\theta_1$ or $\theta_2$ allow us to make inferences on the associated mean differences. This parameterization would be useful if process three was the standard and the others were new proposed processes. To test the hypothesis $H_0 : \mu_1 = \mu_2$, we must test the hypothesis $H_0 : \theta_1 = \theta_2$. We will consider this and other reparameterizations in Chapter 10, where we provide a detailed discussion of the one-way classification model.
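The reparameterization in (9.14) amounts to a nonsingular linear map between the cell means and the regression parameters. The following sketch makes that map explicit; the matrix name `T`, the cell means, and the common cell frequency are hypothetical choices for illustration.

```python
import numpy as np

# Map theta = (theta_0, theta_1, theta_2) to mu = (mu_1, mu_2, mu_3):
# mu_1 = theta_0 + theta_1, mu_2 = theta_0 + theta_2, mu_3 = theta_0
T = np.array([[1.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [1.0, 0.0, 0.0]])

mu = np.array([12.0, 9.0, 10.0])     # hypothetical cell means
theta = np.linalg.solve(T, mu)       # (mu_3, mu_1 - mu_3, mu_2 - mu_3)

# The regression design matrix is X = W T, with W the cell means design matrix
n = 2                                 # hypothetical common cell frequency
W = np.kron(np.eye(3), np.ones((n, 1)))
X = W @ T                             # columns: intercept, z_1, z_2
```

Because `T` is nonsingular, the two parameterizations describe the same mean vector, and the columns of `X` are the intercept and the indicator variables $z_1$ and $z_2$.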

9.3.2 Two-way Classification Model

In more complex experiments the populations may be defined by combinations of several factors. The basic cell means model may be applied directly but it is often convenient to use more than one subscript to describe the populations. The following example will illustrate this concept and the new issues that arise.

Example 9.3. Consider an agricultural experiment in which it is desired to examine the yields obtained by using one of four different types of fertilizers on each of three varieties of cotton. For the study we have available a total of 60 experimental plots that are homogeneous, in the sense that we have no reason to expect differences in the yields if there are no differences in the fertilizers or in the varieties. To further control for unforeseen differences, we randomly allocate the twelve treatment combinations, that is, combinations of the factors, fertilizers and varieties, to each of five plots. We refer to the plots as experimental units.

The cell means model may be applied to describe the data in this example, with $y_{ijr}$ denoting the yield from the $r$th plot using the $i$th variety with the $j$th fertilizer. The mean yield for this treatment combination is denoted by $\mu_{ij}$. The model is written in algebraic form as

$$y_{ijr} = \mu_{ij} + e_{ijr}, \tag{9.15}$$

for $i = 1, \ldots, a$, $j = 1, \ldots, b$, and $r = 0, 1, \ldots, n_{ij}$ where, in this example, $a = 3$, $b = 4$, and $n_{ij} = 5$ for all pairs. The use of two subscripts to define the means is primarily for notational convenience in referring to the treatment combinations.

In a carefully designed study, we would, if possible, allocate the same number of plots to each factor combination, assuring a balanced design. Of course, there is always the possibility that, for reasons not related to the treatments, observations on some plots may not be available, and hence we would have unequal cell frequencies. The variable range on the subscript $r$ indicates this, including the possibility that there may be no observations on some treatment combinations. The assumption of constant variance across treatment combinations is also appropriate since the variability is assumed to be a function of plot differences and not a function of the treatments. Thus the scalar covariance structure is appropriate.

To write this model in the matrix form of (9.7), let $y_{ij}$ denote the $n_{ij}$-vector of observations on the $(ij)$th treatment combination. The observation vector is written by vertically concatenating the $y_{ij}$. The order is determined by first concatenating the vectors $y_{ij}$ for fixed $i$ and $j = 1, \ldots, b$. We then concatenate these vectors for $i = 1, \ldots, a$. This suggests a general convention that will be used for ordering the observations in the response vector when the entries have multiple subscripts. The same ordering is then implicit in the vector of cell means. The design matrix is identical in appearance to (9.8) except for the use of two subscripts to denote the column vectors of ones of length $n_{ij}$.
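The ordering convention just described, in which the second subscript varies fastest, corresponds to row-major flattening of a two-way table. A minimal sketch follows, with placeholder values standing in for the cell means; the names `labels`, `mu_table`, and `mu_vec` are hypothetical.

```python
import numpy as np

a, b = 3, 4                          # varieties and fertilizers, as in Example 9.3

# Subscript pairs (i, j) in the order used for the response and mean vectors
labels = [(i, j) for i in range(1, a + 1) for j in range(1, b + 1)]

# Placeholder a x b table of cell means, flattened so that j varies fastest
mu_table = np.arange(1, a * b + 1, dtype=float).reshape(a, b)
mu_vec = mu_table.reshape(-1)        # row-major (C) order
```

With this ordering, the element for the pair $(i, j)$ sits at position $(i - 1)b + j$ of the vector of cell means, counting from one.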

The model in (9.15) is called a two-factor, cell means model, recognizing that the populations are defined by two factors. When the factor combinations are combined as in Example 9.3, the phrase two-way cross-classification design is used to reflect the fact that we consider all combinations of the two factors. The phrases two-way factorial or two-way classification are also used.

It is convenient to think of the populations in the two-way classification design as being described symbolically in a two-way table. For example, in Table 9.1, we display the means for the two factors in Example 9.3. The categories defined in this table are often called cells and this is the source of the name for the model.

Table 9.1. Cell means array for Example 9.3

                          Fertilizer
               1            2            3            4
Variety  1   $\mu_{11}$   $\mu_{12}$   $\mu_{13}$   $\mu_{14}$
         2   $\mu_{21}$   $\mu_{22}$   $\mu_{23}$   $\mu_{24}$
         3   $\mu_{31}$   $\mu_{32}$   $\mu_{33}$   $\mu_{34}$


In our numerical examples we will display the data in this same format, and for summary purposes we will display the sample cell means in such a table. The sample cell means are defined by

$$\bar{y}_{ij\cdot} = \frac{1}{n_{ij}} \sum_{r=1}^{n_{ij}} y_{ijr}. \tag{9.16}$$

The cell means are estimated as usual by the sample cell means and the variance by the degrees-of-freedom weighted average of the sample cell variances.

The general hypothesis of equal means may be of interest, but there are other hypotheses that are commonly considered. The first of these is the hypothesis of no interaction. We have encountered this concept with the SAT data in Example 7.6. In that example, the year × gender term reflected the difference in the slopes of the fitted lines for males and females. A non-zero value for this term means that the difference between male and female scores is not the same for different years. The idea is the same for Example 9.3. Thus, there is said to be an interaction between varieties and fertilizers if the difference in the mean response for two varieties is not the same for all pairs of fertilizers, or equivalently, the difference in mean response for two fertilizers is not the same for all pairs of varieties. The hypothesis of no interaction is thus written as

$$\mu_{ij} - \mu_{i^*j} - \mu_{ij^*} + \mu_{i^*j^*} = 0 \quad \text{for all } i \ne i^*,\ j \ne j^*. \tag{9.17}$$

Another hypothesis of interest is that there is no difference in the mean responses for varieties when averaged over fertilizers. This will be called the variety main effect hypothesis and it is written as

$$H_V : \bar{\mu}_{i\cdot} = \bar{\mu}_{i^*\cdot} \quad \text{for } i \ne i^*. \tag{9.18}$$

By analogy, the fertilizer main effect hypothesis says there is no difference in column means when averaged over rows. In Chapter 11 we give a detailed account of some of the issues that arise in the interpretation of these hypotheses.
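The interaction contrasts in (9.17) and the row means compared in (9.18) are easy to evaluate for a given table of cell means. In the sketch below the function `interaction_contrasts` and the numerical tables are hypothetical illustrations: an additive table satisfies the no-interaction constraints, and perturbing one cell violates them.

```python
import numpy as np

def interaction_contrasts(mu):
    """Return mu_ij - mu_i*j - mu_ij* + mu_i*j* as an array indexed [i, i*, j, j*]."""
    return (mu[:, None, :, None] - mu[None, :, :, None]
            - mu[:, None, None, :] + mu[None, :, None, :])

variety = np.array([10.0, 12.0, 15.0])             # hypothetical row contributions
fertilizer = np.array([0.0, 1.0, 3.0, 4.0])        # hypothetical column contributions
mu_additive = variety[:, None] + fertilizer[None, :]   # 3 x 4 table, no interaction

mu_interacting = mu_additive.copy()
mu_interacting[0, 0] += 2.0                        # perturb a single cell

has_interaction_additive = not np.allclose(interaction_contrasts(mu_additive), 0.0)
has_interaction_perturbed = not np.allclose(interaction_contrasts(mu_interacting), 0.0)

row_means = mu_additive.mean(axis=1)               # variety means averaged over fertilizers
```

In the additive table every contrast vanishes, so differences between varieties are the same for every fertilizer; the row means differ, so the variety main effect hypothesis does not hold for this table.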

In our discussion of hypothesis testing in regression models, all hypotheses were tested by comparing the residual sum of squares for the original model with that obtained by deleting one or more terms from that model. For testing hypotheses in the cell means model there are two basic approaches. First, as in Examples 7.6 and 7.7, we can rewrite the model in terms of parameters that are zero if the hypothesis is true and then use the regression approach. Alternatively, we can use a general expression for the test statistic as a function of the appropriate hypothesis matrix. This expression will be illustrated in subsequent chapters and the development is given in Chapter 17.

No-Interaction Model. An interesting special case of the two-factor model arises when it is appropriate to assume that there are no interactions. That is, it is known a priori that the constraints on the means in (9.17) are satisfied. We will see that there are two advantages of making this assumption. First, it will enable us to make stronger inferences about the main effect hypotheses and second, we will be able to obtain an estimate of $\sigma^2$ and hence make inferences, even when we have only one observation per treatment combination. The assumption that these constraints are satisfied is a strong one and should not be made without justification.

9.3.3 Constrained Cell Means Model

The two-factor, no-interaction model is our first example of a constrained cell means model. In the general constrained model, we write our constraints as a system of linear equations, $G\mu = g$, and assume, without loss of generality, that $G$ is $q \times p$ of rank $q$. We make the following definition:

Definition 9.2. The constrained cell means model is written

$$y = W\mu + e,$$

subject to the constraints

$$G\mu = g.$$

Here $W$ is the usual cell means design matrix and $G$ and $g$ are known.

The way in which we deal with these constraints will depend on the nature of the problem, the complexity of the constraints and our objectives. Conceptually, a natural procedure is to create a reduced model by imposing the constraints on the expression for the mean vector. That is, we could solve the equations $G\mu = g$ for $q$ of the means in terms of the remaining $r = p - q$ means and substitute into the mean vector to obtain an unconstrained problem. The reduced model is, conceptually, the simplest way to incorporate the constraints into the model statement, but it may be notationally complicated. We will see that this notational difficulty can sometimes be eliminated by rewriting the model in terms of a new set of parameters as defined by the constraints. For our theoretical discussions and for practical interpretations, we will often find it convenient to work in terms of the original parameters.

In the constrained model the imposition of $q$ non-redundant constraints reduces the number of parameters from $p$ to $r = p - q$ and we refer to this as the correct dimension of the problem. Formally, the dimension is defined as the rank of the design matrix after imposing the constraints.
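The dimension $r = p - q$ can be checked numerically for the no-interaction model with $a = 2$ and $b = 3$. The sketch below does not solve for $q$ of the means explicitly; as an assumption made for convenience, it parameterizes the constrained means through a basis of the null space of $G$ (taking $g = 0$), which yields an equivalent unconstrained model. The names `N` and `X_reduced` are hypothetical.

```python
import numpy as np

# Means ordered as (mu_11, mu_12, mu_13, mu_21, mu_22, mu_23).
# No-interaction constraints: mu_1j - mu_13 - mu_2j + mu_23 = 0 for j = 1, 2.
G = np.array([[1.0, 0.0, -1.0, -1.0, 0.0, 1.0],
              [0.0, 1.0, -1.0, 0.0, -1.0, 1.0]])
q, p = G.shape

# Columns of N span the null space of G, so mu = N @ phi satisfies G mu = 0
_, _, Vt = np.linalg.svd(G)
N = Vt[q:].T                         # p x (p - q); valid because G has full row rank q

n = 1                                 # hypothetical: one observation per cell
W = np.kron(np.eye(p), np.ones((n, 1)))
X_reduced = W @ N                     # design matrix of the unconstrained reduced model
dimension = np.linalg.matrix_rank(X_reduced)
residual_df = W.shape[0] - dimension
```

With one observation per cell the unconstrained model leaves no residual degrees of freedom, whereas the reduced model has dimension four and leaves two, which is why the variance can be estimated under the no-interaction assumption.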

The concept of a constrained model also arises in the context of hypothesis testing. The reduced model is appropriate for computing the residual sum of squares under the hypothesis. Such computations are often simpler in terms of the reparameterized models.

This brief discussion introduces the basic concepts of the cell means model. In Chapters 10, 11, and 12 we elaborate on these and other fixed effects models.


9.4 MIXED EFFECTS MODELS

Analysis of variance models with a non-scalar covariance structure are known as mixed effects models, or simply mixed models. The basis for the terminology lies in the fact that such models often arise when the contribution of a factor is viewed as a random variable. In such cases interest focuses on the variances of these random variables as well as the cell means. The term variance component arose in this setting as the variance of the random effect. In this book we take a slightly broader view of the problem, which allows the structure of the experiment to specify a non-scalar covariance matrix. Our development will be related to the classical formulation of the mixed model and the relative merits of the two approaches will be discussed. The following example will illustrate the classical formulation of a mixed model:

Example 9.4. Two-Factor Mixed Model. Suppose that we wish to examine the yield of $a$ varieties of cotton, and we would like to apply each variety to $n$ experimental plots. Ideally we would have $na$ homogeneous plots to which we would randomly assign the treatments with $n$ plots for each variety. Since the treatments are described by one factor, the one-way classification model can be used, with $y_{ir}$ denoting the yield from the $r$th replicate of the $i$th variety and with $\mu_i$ denoting the mean response for the $i$th variety. Frequently we do not have the luxury of a sufficient number of similar experimental units, but we do have groups of units such that, within a group, the units are expected to respond similarly if there are no differences in the varieties. There may be differences between groups caused by such things as differences in fertility. In this example suppose that we have $b$ different fields from which we can select $a$ plots. The revised experiment is to assign each of the varieties at random to one plot in each field. With this design we obtain $b$ observations on each variety, and within a field we obtain a comparison of varieties which is free of differences between the fields.

To model this experiment, let $y_{ij}$ denote the yield of the $i$th variety in the $j$th field, and let $\mu_i$ be the mean yield for the $i$th variety. Note that the mean structure is still that of the one-factor model. To incorporate the differences in the fields into the model, we assume that an amount $b_j$ is added to the $j$th field to allow for the differences in fertility. The fields in this experiment might be viewed as a random sample from a large population of such fields. We are not interested in making inferences about these particular fields and we would like to apply our conclusions about the varieties to this whole population of fields. We extend the fixed effect concept by assuming that the $b_j$ are random variables. In particular, we assume that these variables are independent, usually normal, with zero means and $\mathrm{Var}[b_j] = \sigma_b^2$. To account for differences in plots within a field, we assume that an amount $e_{ij}$ is added to each response. These random errors are independent with zero mean, $\mathrm{Var}[e_{ij}] = \sigma_e^2$, and are assumed to be independent of the $b_j$. The response for the plot in the $j$th field that receives the $i$th variety is thus given by

$$y_{ij} = \mu_i + b_j + e_{ij}. \tag{9.19}$$

Using the results on moments of linear functions given in Chapter 2, we can show that the covariance structure that is implicit in these assumptions is as follows:

$$\begin{aligned} \mathrm{Var}[y_{ij}] &= \sigma_b^2 + \sigma_e^2 && \text{for all } i \text{ and } j, \\ \mathrm{Cov}[y_{ij}, y_{i^*j^*}] &= \sigma_b^2 && i \ne i^*,\ j = j^*, \\ &= 0 && j \ne j^*. \end{aligned} \tag{9.20}$$

This covariance structure is easily written in matrix form and a convenient expression for the covariance matrix is obtained by using the Kronecker product notation defined in Appendix A.1.8. To do so, let $U_a$ denote a square matrix of ones of size $a$, and assume the usual convention for ordering the vector of responses. Letting $V$ denote the covariance matrix we may write

$$V = \sigma_e^2 I_a \otimes I_b + \sigma_b^2 U_a \otimes I_b. \tag{9.21}$$

In this expression, the term $U_a \otimes I_b$ indicates an $a \times a$ array of identity matrices of size $b$.

Recalling the design matrix for the one-factor model, we write the model for this experiment as

$$y = (I_a \otimes J_b)\mu + e, \tag{9.22}$$

where $\mathrm{Var}[e] = V$ as defined in (9.21). Since the variance components $\sigma_b^2$ and $\sigma_e^2$ are variances, and hence positive, the positive definiteness of $V$ is assured.
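The Kronecker form of the covariance matrix can be assembled and compared entry by entry with the algebraic description of the covariance structure. In this minimal sketch the sizes `a` and `b`, the values of the two variance components, and the helper `idx` are hypothetical choices for illustration; the symbols `sigma2_b` and `sigma2_e` stand for the field and plot variance components.

```python
import numpy as np

a, b = 2, 3                           # hypothetical: 2 varieties, 3 fields
sigma2_b, sigma2_e = 2.0, 1.0         # hypothetical variance components

I_a, I_b = np.eye(a), np.eye(b)
U_a = np.ones((a, a))                 # square matrix of ones of size a
V = sigma2_e * np.kron(I_a, I_b) + sigma2_b * np.kron(U_a, I_b)

def idx(i, j):
    """Position of y_ij (1-based i, j) in the response vector ordered by i, then j."""
    return (i - 1) * b + (j - 1)

var_y11 = V[idx(1, 1), idx(1, 1)]             # variance of a single response
cov_same_field = V[idx(1, 1), idx(2, 1)]      # different varieties, same field
cov_diff_field = V[idx(1, 1), idx(2, 2)]      # different fields

eigenvalues = np.linalg.eigvalsh(V)           # all positive when both components are positive
```

Responses in the same field share the field variance component, responses in different fields are uncorrelated, and the smallest eigenvalue of `V` equals the plot variance component, so `V` is positive definite here.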

This example illustrates a general linear model in the sense of Definition 1.3, and represents a common source of mixed models. In Chapter 14, we will give an alternative development of the model.

While we may have been forced into this modified design because of the lack of experimental units, it actually may be better to conduct the experiment in this way. Suppose that the fields differed because of different fertility levels. If we conduct the experiment on one field of constant fertility, then our inferences on the variety differences would be restricted to that level of fertility. By running the experiment as described above, we get variety comparisons on a range of fertility levels.

The experimental procedure used in this example is often referred to as blocking. In this case the fields are the blocks and the idea is that, if there are differences in the experimental units within a block, applying each treatment in each block will allow for treatment comparisons that are not affected by block differences. In other situations the blocking may be defined by other factors such as time, location, laboratory, or analyst. This experimental setting is known as a randomized block design, to reflect the fact that treatments are allocated at random to the experimental units within a block. The treatment structure can be more complicated. For example, a two-factor, cross-classified treatment structure, such as the fertilizer by variety combination in Example 9.3, might be run in a randomized block design. A detailed discussion of the blocking concept and some controversies associated with the model descriptions is given in Chapter 13.

9.5 CONCLUDING COMMENTS

These simple examples serve to introduce the concepts of fixed effects and mixed effects analysis of variance models. The details of the analysis for specific models and general results and notation will be developed in the remaining chapters of Part II. These models provide a means of analyzing complex experiments and there is an elegant mathematical theory associated with the development. Our approach will be to develop the analysis using simple principles. This will reveal some potential pitfalls that arise if one becomes too fascinated with the mathematical theory. It will also reveal some new ideas for understanding the data, especially for mixed effects models.

EXERCISES

Section 9.2

9.1 Use the regression methodology to develop the estimates for $\mu$ and $\sigma^2$ in Example 9.1 and also to develop an algebraic expression for the test statistic for the hypothesis, $H_0 : \mu = \mu_0$.

9.2 Use the regression methodology to develop estimates of $\mu_1$ and $\mu_2$ and to develop an algebraic expression for the test statistic for the hypothesis, $H_0 : \mu_1 = \mu_2$.

9.3 a. Suppose $A$ and $B$ are matrices of dimensions $(a_1 \times a_2)$ and $(b_1 \times b_2)$. What is the dimension of $A \otimes B$? b. Write out explicit expressions for the Kronecker products, $I \otimes J$, $J \otimes I$, $U \otimes I$, $I \otimes U$, $U \otimes J$ and $J \otimes U$, where $I$ and $U$ are $3 \times 3$ and $J$ is $2 \times 1$.

9.4 For the model in Equation (9.14), describe how you would use regression computations to develop the test statistic for the hypothesis, $H_0 : \theta_1 = \theta_2$.


Section 9.3

9.5 Consider the two-factor model in Equation (9.15) with $a = b = 2$. Write the design matrix associated with the reparameterization given by

$$\theta_0 = \mu_{11}, \quad \theta_1 = \mu_{11} - \mu_{21}, \quad \theta_2 = \mu_{11} - \mu_{12}, \quad \theta_{12} = \mu_{11} - \mu_{12} - \mu_{21} + \mu_{22}.$$

Comment on the implications of testing the hypotheses, $H_I : \theta_{12} = 0$, $H_A : \theta_1 = 0$ and $H_B : \theta_2 = 0$.

9.6 Consider the no-interaction, two-factor model with $a = 2$ and $b = 3$. a. Show that a non-redundant set of constraints is given by

$$\mu_{1j} = \mu_{13} + \mu_{2j} - \mu_{23} \quad \text{for } j = 1, 2.$$

b. Use these relations to replace $\mu_{11}$ and $\mu_{12}$ and write the reduced model in algebraic form. c. Write the design matrix for this reduced model.

9.7 Consider an experiment designed to compare the response to two different fertilizers, each at three different amounts. The two-factor cell means model is appropriate with $\mu_{ij}$ denoting the mean response of the $j$th amount of the $i$th fertilizer. Suppose that the lowest level of each fertilizer is zero. It follows that the means satisfy the relation $\mu_{11} = \mu_{21}$. a. Write the reduced model that is appropriate if this constraint is imposed. b. Write the design matrix for this reduced model. c. Assuming $n$ observations per cell, determine the estimates of the cell means. d. Suppose this constraint is ignored in the analysis. Comment on the possible effect on the hypothesis of no interaction.

Section 9.4

9.8 For the two-factor, mixed model described in Example 9.4, write out the covariance matrix associated with the structure described algebraically in (9.20), and in matrix form in (9.21), for the case $a = 3$, $b = 2$.

9.9 Suppose all variances are equal, Var[$y_i$] = $\sigma^2$, and all covariances are equal, Cov[$y_i$, $y_k$] = d2. Determine the matrices $V_1$ and $V_2$ to write the covariance matrix in the form


9.10 Let $V = (v_{ij})$ be an arbitrary covariance matrix of size three. Write $V$ in the form

9.11 In the two sample problem described in Example 9.2, write the covariance matrix if it is assumed that the two populations have different variances, $\sigma_1^2$ and $\sigma_2^2$.
