
Transcript of Chapter 5

Page 1: Chapter 5


Chapter 5: Comparisons among several samples

Case Study 5.1: A randomized experiment to compare lifetimes under six different diets among female mice.

Principles of good experimental design:

• Randomization: female mice were randomly assigned to the six treatments. Randomization ensures no bias in the assignment of mice to treatments. It does not guarantee that the groups will be identical, but it allows us to use probability to assess whether the differences observed could have occurred by chance.

• Replication – important for estimating variability within groups
• Other?

Scope of inference
• The scope of inference is to what would have happened if all mice had been fed each diet.
• The scope of inference can be expanded further if these mice can be viewed as representative of a larger population of female mice.
• Since it’s an experiment, we can infer cause-and-effect if the experiment was well-run.

Comparison of all six diets was of interest, but there were some specific comparisons of interest, as outlined in Display 5.3. Although comparisons between pairs of treatments could be handled with two-sample t-procedures (if normality assumptions are satisfied), there are some advantages to a more comprehensive procedure:

• If the variability within each treatment is about the same for all treatments, then it makes sense to estimate a pooled standard deviation from all the treatments even if we’re only comparing any two at a time.

• We may want to carry out more complicated comparisons, such as a comparison of a control group to the average of the other five groups.

• A standard first question of interest when comparing several groups is whether there is evidence that any of the means are different from each other. Comparing all the treatments pairwise with two-sample t tests results in a lot of individual tests (15 for 6 treatments). An overall test of equality of all the treatment means is much more efficient and will not suffer from the problem of running multiple tests (where statistically significant results have to be considered in the context of how many tests were run).

An Ideal Model which allows the problems above to be solved fairly easily

• Population distributions are normal
• Population standard deviations are equal
• Independent random samples from each population (a randomized experiment satisfies this assumption)


This model is exactly the model for the pooled two-sample t-test when there are two groups: different means, but a common standard deviation.

The assumption of equal standard deviations is very important and must be checked. If there are large differences in variability, this may be of interest in and of itself, and the reasons for it should be addressed. Often, differing variability is caused by higher values of the variable in some groups than in others. For example, the variability in lifetimes of animals is likely to be greater the longer they tend to live. Transformations (such as the log) can sometimes solve this problem.

Comparing pairs of means

The two-sample pooled t-procedure for comparing any pair of means, say μ₁ and μ₂, uses

Ȳ₁ − Ȳ₂  and  SE(Ȳ₁ − Ȳ₂) = sₚ √(1/n₁ + 1/n₂),

where

sₚ² = [(n₁ − 1)s₁² + (n₂ − 1)s₂²] / (n₁ + n₂ − 2).

The only change in adapting this to several groups is to use the pooled standard deviation from all of the groups if the assumption of equal standard deviations seems reasonable.

Descriptives: Months survived

Diet          N    Mean    Std. Deviation   Minimum   Maximum
NP            49   27.40   6.134            6.4       35.5
N/N85         57   32.69   5.125            17.9      42.3
N/R50         71   42.30   7.768            18.6      51.9
R/R50         56   42.89   6.683            24.2      50.7
N/R50 lopro   56   39.69   6.992            23.4      49.7
N/R40         60   45.12   6.703            19.6      54.6

The equal variance assumption seems reasonable for this experiment so we will use the pooled standard deviation from all 6 treatments.

sₚ² = [(n₁ − 1)s₁² + (n₂ − 1)s₂² + ⋯ + (n_I − 1)s_I²] / [(n₁ − 1) + (n₂ − 1) + ⋯ + (n_I − 1)]

    = [48(6.134)² + 56(5.125)² + ⋯ + 59(6.703)²] / (48 + 56 + ⋯ + 59) = 44.599

so sₚ = √44.599 = 6.678.

The degrees of freedom for the t-distribution when you use this pooled standard deviation is the denominator in the above expression, which is n − I, where n is the total sample size (349 in our example) and I is the number of groups or treatments (6 in our example). So we use a t-distribution with 343 degrees of freedom for the mice experiment.
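The pooled-SD arithmetic above can be checked with a short script. This is just a sketch in plain Python, using the sample sizes and standard deviations quoted in the Descriptives table; it is not SPSS output.

```python
import math

# Sample sizes and standard deviations from the Descriptives table
# (diets NP, N/N85, N/R50, R/R50, N/R50 lopro, N/R40).
n = [49, 57, 71, 56, 56, 60]
s = [6.134, 5.125, 7.768, 6.683, 6.992, 6.703]

# Pooled variance: weighted average of the group variances,
# with weights n_i - 1; degrees of freedom = n - I.
df = sum(ni - 1 for ni in n)                                   # 349 - 6 = 343
sp2 = sum((ni - 1) * si ** 2 for ni, si in zip(n, s)) / df
sp = math.sqrt(sp2)
print(df, round(sp2, 3), round(sp, 3))                         # sp is about 6.678
```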


• One desired comparison is between groups 1 and 2: the unrestricted non-purified diet (NP) to a standard 85-calorie diet (N/N85). The result is summarized in part e) on p. 116.

First, note that SE(Ȳ₁ − Ȳ₂) = sₚ √(1/n₁ + 1/n₂) = 6.678 √(1/49 + 1/57) = 1.301.

A 95% confidence interval for μ₁ − μ₂:

Ȳ₁ − Ȳ₂ ± t₃₄₃(.975) SE(Ȳ₁ − Ȳ₂) = 35.5 − 42.3 ± 1.967(1.301) = −6.8 ± 2.56 ≈ −9.4 months to −4.2 months

Conclusion: It is estimated that the 85 calorie standard diet increases mean life expectancy by 6.8 months over an unrestricted diet with a 95% confidence interval of 4.2 to 9.4 months.

• A test of the null hypothesis that μ₁ = μ₂ against the one-sided alternative that μ₁ < μ₂ (we would have to decide before collecting the data that we were only interested in detecting an increase in mean life expectancy with the 85-calorie diet):

Test statistic = (Ȳ₁ − Ȳ₂) / SE(Ȳ₁ − Ȳ₂) = −6.8 / 1.301 = −5.23

Compare to a t-distribution with 343 d.f. P-value = area to the left of −5.23 < .0001

Conclusion: The data provide very strong evidence that the 85-calorie diet increases life expectancy over the unrestricted diet.

Note: if the equal standard deviations assumption did not appear reasonable, then we could have done the confidence interval and hypothesis test the usual way using the pooled standard deviation from the two groups or the unpooled Welch’s t procedures. The advantage of pooling all 6 groups is a better estimate with increased degrees of freedom.
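As a numerical check, the interval and test statistic above can be reproduced in a few lines. This is a sketch: the group means (35.5 and 42.3), the pooled SD (6.678), and the multiplier t₃₄₃(.975) = 1.967 are taken from the notes as given rather than recomputed from raw data.

```python
import math

sp = 6.678                 # pooled SD from all 6 groups (343 d.f.)
n1, n2 = 49, 57            # NP and N/N85 sample sizes
y1, y2 = 35.5, 42.3        # group means as quoted in the notes
t975 = 1.967               # t_343(.975), taken from a t table

se = sp * math.sqrt(1 / n1 + 1 / n2)           # standard error of Y1 - Y2
diff = y1 - y2                                 # -6.8
lo, hi = diff - t975 * se, diff + t975 * se    # 95% CI, about (-9.4, -4.2)
t_stat = diff / se                             # about -5.23
```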

One-way Analysis of Variance F-Test

Designed to answer the question: is there evidence of a difference between any of the means?

That is, we wish to test the null hypothesis H₀: μ₁ = μ₂ = μ₃ = μ₄ = μ₅ = μ₆. The alternative hypothesis is that at least one mean is different from the others. The alternative hypothesis would include all these possibilities:

• All the means are different from one another
• Five means are the same and one is different


• Three of the means are the same, and the other three are the same as each other but different from the first three

The idea of a one-sided alternative hypothesis is meaningless with three or more groups.

Testing the hypothesis of equal means relies on a general approach which we will use frequently in the rest of the course:

Extra Sum of Squares Principle

General principle for testing hypotheses.

Full model: a general model which adequately describes the data.

Reduced model: a special case of the full model obtained by imposing the restriction of the null hypothesis. For testing the equality of several population means, these models are:

Full model: the population distributions are normal with the same standard deviation, but (possibly) different means

Reduced model: the population distributions are normal with the same standard deviations, and the same means

The general idea is that we “fit” both these models to the data (like regression). Each model gives a predicted value for every case. The full model uses each observation’s group mean as the predicted value. The reduced model uses the mean of all the observations together. We then measure how well the data fit the models by computing the sum of squared residuals. The full model can fit no worse than the reduced model because the reduced model is a special case of the full model. So, the predicted responses are

Group     1    2    3    4    5    6
Full      Ȳ₁   Ȳ₂   Ȳ₃   Ȳ₄   Ȳ₅   Ȳ₆
Reduced   Ȳ    Ȳ    Ȳ    Ȳ    Ȳ    Ȳ

Example: To illustrate these calculations, we’ll use a small hypothetical example, with 3 groups and 10 observations in all.

Group 1: 10.7  13.2  15.7          n₁ = 3    Ȳ₁ = 13.2   s₁ = 2.500
Group 2: 12.1  14.2  16.0  16.5    n₂ = 4    Ȳ₂ = 14.7   s₂ = 1.995
Group 3: 20.9  24.4  27.3          n₃ = 3    Ȳ₃ = 24.2   s₃ = 3.205
Total:                             n = 10    Ȳ = 17.1    sₚ = 2.535


Ȳ is called the “grand mean” and is the mean of all 10 observations.

Group  Obs  Response   Predicted   Residual   Squared     Predicted   Residual   Squared
                       (reduced)   (reduced)  residual    (full)      (full)     residual
                                              (reduced)                          (full)
  1     1     10.7       17.1        -6.4       40.96       13.2        -2.5       6.25
  1     2     13.2       17.1        -3.9       15.21       13.2         0.0       0.00
  1     3     15.7       17.1        -1.4        1.96       13.2         2.5       6.25
  2     1     12.1       17.1        -5.0       25.00       14.7        -2.6       6.76
  2     2     14.2       17.1        -2.9        8.41       14.7        -0.5       0.25
  2     3     16.0       17.1        -1.1        1.21       14.7         1.3       1.69
  2     4     16.5       17.1        -0.6        0.36       14.7         1.8       3.24
  3     1     20.9       17.1         3.8       14.44       24.2        -3.3      10.89
  3     2     24.4       17.1         7.3       53.29       24.2         0.2       0.04
  3     3     27.3       17.1        10.2      104.04       24.2         3.1       9.61
Total                                          264.88                             44.98

Extra sum of squares = Residual sum of squares (reduced) – Residual sum of squares (full)

= 264.88 − 44.98 = 219.9

The residual sum of squares for a model represents the variability in the original data which is not explained by the model. The extra sum of squares therefore represents the amount of the unexplained variability in the reduced model that is explained by the full model. The question now is whether the improved fit represents something real or could just be attributed to sampling variability. We use the F-statistic to test the null hypothesis that the populations follow the reduced model against the alternative that they follow the full model and not the reduced model.

F-statistic = (Extra sum of squares / Extra degrees of freedom) / σ̂²_full

Extra degrees of freedom = # parameters for full model − # parameters for reduced model = 4 − 2 = 2

σ̂²_full = estimate of σ² based on the full model = sₚ² (the square of the pooled standard deviation)

The numerator of the F-statistic is the average reduction in residual sum of squares for each parameter added, and the denominator is the reduction we would expect per extra parameter just by chance. For the above small example,


F₂,₇ = (219.9 / 2) / 2.535² = 109.95 / 6.426 = 17.11

This statistic is compared to an F distribution. F distributions have two parameters: numerator degrees of freedom and denominator degrees of freedom.
Numerator d.f. = extra degrees of freedom
Denominator d.f. = d.f. for sₚ = n − I
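The whole calculation for the hypothetical three-group example (residual sums of squares for both models, the extra sum of squares, and the F-statistic) can be verified directly. A sketch in plain Python:

```python
# Hypothetical example from the notes: 3 groups, 10 observations in all.
groups = [
    [10.7, 13.2, 15.7],
    [12.1, 14.2, 16.0, 16.5],
    [20.9, 24.4, 27.3],
]

all_obs = [y for g in groups for y in g]
n, I = len(all_obs), len(groups)
grand = sum(all_obs) / n                                    # grand mean, 17.1

# Reduced model predicts the grand mean; full model predicts each group mean.
rss_reduced = sum((y - grand) ** 2 for y in all_obs)        # 264.88
rss_full = sum(
    (y - sum(g) / len(g)) ** 2 for g in groups for y in g   # 44.98
)

extra_ss = rss_reduced - rss_full                           # 219.90
extra_df = I - 1                                            # 3 mean params vs 1
sigma2_full = rss_full / (n - I)                            # s_p^2 = 6.426
F = (extra_ss / extra_df) / sigma2_full                     # about 17.11
```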

ANOVA: Response

                 Sum of Squares   df   Mean Square   F        Sig.
Between Groups   219.900          2    109.950       17.111   .002
Within Groups    44.980           7    6.426
Total            264.880          9

Notice that there are 3 different sums of squares.

Total sum of squares = SST = Σᵢ Σⱼ (Yᵢⱼ − Ȳ)²

Sum of squares between groups = SSB = Σᵢ nᵢ (Ȳᵢ − Ȳ)²

Sum of squares within groups = SSW = Σᵢ Σⱼ (Yᵢⱼ − Ȳᵢ)²

(The sums run over groups i = 1, …, I and over observations j = 1, …, nᵢ within group i.)

Note that SST = SSB + SSW, and Extra sum of squares = SST − SSW; hence SSB = ESS.

Mean square between groups = MSB = SSB / (I − 1)

Mean square within groups = MSW = SSW / (n − I) = sₚ²

Source           Sum of squares   d.f.    Mean square   F-statistic   P-value
Between groups   SSB              I − 1   MSB           MSB/MSW
Within groups    SSW              n − I   MSW
Total            SST              n − 1
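The identity SSB = ESS can also be checked numerically: computing SSB directly from the group means on the same small hypothetical example reproduces the 219.9 obtained earlier as the extra sum of squares. A sketch in plain Python:

```python
# Same hypothetical data as before: 3 groups, 10 observations.
groups = [
    [10.7, 13.2, 15.7],
    [12.1, 14.2, 16.0, 16.5],
    [20.9, 24.4, 27.3],
]

all_obs = [y for g in groups for y in g]
grand = sum(all_obs) / len(all_obs)

# SSB: each group mean's squared distance from the grand mean, weighted by n_i.
ssb = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups)   # 219.90

# SSW: squared distances of observations from their own group mean.
ssw = sum((y - sum(g) / len(g)) ** 2 for g in groups for y in g)     # 44.98

# SST: squared distances from the grand mean; note SST = SSB + SSW.
sst = sum((y - grand) ** 2 for y in all_obs)                         # 264.88
```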


Logic behind the F-test

It’s easiest to see if the sample sizes are equal: n₁ = n₂ = ⋯ = n_I. Call the common sample size n*. Remember that we always assume that the population distributions are normal, the standard deviations are all equal, and the samples are independent.

MSW = sₚ² is an estimate of σ² no matter which model (equal means or separate means) is correct.

If the population means are equal (i.e., if the null hypothesis is true), then

Ȳ₁ is N(μ, σ/√n*)
Ȳ₂ is N(μ, σ/√n*)
…
Ȳ_I is N(μ, σ/√n*)

Since the samples are independent, Ȳ₁, Ȳ₂, …, Ȳ_I are like a random sample from a normal population with mean μ and standard deviation σ/√n*. Therefore, the sample variance of Ȳ₁, Ȳ₂, …, Ȳ_I is an estimate of σ²/n*:

[1/(I − 1)] Σᵢ (Ȳᵢ − Ȳ)²  is an estimate of  σ²/n*.

Hence,

[n*/(I − 1)] Σᵢ (Ȳᵢ − Ȳ)²  =  MSB  is an estimate of  σ².

To summarize:

• MSW is an estimate of σ² no matter whether the full or reduced model is correct.
• MSB is an estimate of σ² only if the reduced model (the equal means model) is correct. If the reduced model is not correct, then MSB will tend to overestimate σ².

Therefore,

• if the null hypothesis is true (i.e., the equal means model is correct), then MSB/MSW should be about 1, apart from sampling error
• if the null hypothesis is false, MSB/MSW will tend to be bigger than 1
• if the null hypothesis is true, the sampling distribution of MSB/MSW is an F distribution with I − 1 d.f. in the numerator and n − I d.f. in the denominator
• large values of MSB/MSW are evidence in favor of the alternative hypothesis; therefore, the P-value is the area to the right of MSB/MSW in the F distribution

Case Study 5.1 (comparing diets in mice)


ANOVA: Months survived

                 Sum of Squares   df    Mean Square   F        Sig.
Between Groups   12733.942        5     2546.788      57.104   .000000
Within Groups    15297.415        343   44.599
Total            28031.357        348

Conclusion: There is overwhelming evidence of a difference in the mean lifetimes under the different diets. This does not mean that all the diets are different, only that at least one of them is.

Robustness to assumptions: see Section 5.5.1, p. 130. The main distributional assumptions we need to worry about are:
• Population standard deviations are roughly equal
• There are no extreme outliers; the F-test is not resistant to outliers, particularly with small samples

We can judge these assumptions from side-by-side dotplots or boxplots of the raw data. Judging equality of standard deviations is a little easier if we subtract off each group’s mean. That is, we examine the residuals from the full (separate means) model: Yᵢⱼ − Ȳᵢ. As in regression, we plot the residuals versus the predicted values. The predicted value for an observation is its group mean.

[Plot of residuals versus predicted values not reproduced here.]

Judging from this plot, the original boxplots, and the sample standard deviations, there doesn’t seem to be any reason to doubt the assumptions of the F-test.
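The mice ANOVA table above can be reconstructed from the Descriptives summary statistics alone. Since the group means and SDs are quoted to only a few decimals, this sketch matches the SPSS table only up to rounding.

```python
# Summary statistics from the Descriptives table (rounded, so the results
# agree with the SPSS ANOVA table only approximately).
n = [49, 57, 71, 56, 56, 60]
ybar = [27.40, 32.69, 42.30, 42.89, 39.69, 45.12]
s = [6.134, 5.125, 7.768, 6.683, 6.992, 6.703]

N, I = sum(n), len(n)                                         # 349 mice, 6 diets
grand = sum(ni * yi for ni, yi in zip(n, ybar)) / N

ssb = sum(ni * (yi - grand) ** 2 for ni, yi in zip(n, ybar))  # ~12734 (table: 12733.942)
ssw = sum((ni - 1) * si ** 2 for ni, si in zip(n, s))         # ~15297 (table: 15297.415)

msb = ssb / (I - 1)
msw = ssw / (N - I)
F = msb / msw                                                 # ~57 (table: 57.104)
```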


Examining models between the separate means and the equal means models

Suppose we wanted to examine the model which assumes the two control groups (NP and N/N85) have the same mean lifetime and the remaining four calorie-restricted diets have the same mean lifetime. The question is: how much of the difference among the means is due simply to the difference between these two groups of diets? This is a two-means model that sits between the separate means model (with 6 parameters to describe the means) and the equal means model (with 1 parameter to describe the means).

Model            NP   N/N85   N/R50   R/R50   N/R50 lopro   N/R40
Separate means   μ₁   μ₂      μ₃      μ₄      μ₅            μ₆
Two means        μ₁   μ₁      μ₂      μ₂      μ₂            μ₂
Equal means      μ    μ       μ       μ       μ             μ

These three models are said to be nested because each model is a special case of the ones above it. We can test the two-means model against the separate means model in SPSS by creating a new categorical variable which identifies the first two diets as group 1 and the remaining four diets as group 2. We then run the ANOVA with this new variable as the explanatory variable.

Control diets (NP and N/N85) vs. restricted diets

ANOVA: Months survived

                 Sum of Squares   df    Mean Square   F         Sig.
Between Groups   11131.393        1     11131.393     228.556   .000
Within Groups    16899.964        347   48.703
Total            28031.357        348

This ANOVA table is comparing the two-means model to the equal means model. We see that it is significant. Now, to compare the two-means model to the separate means model we need to use the sums of squares to compute a new F-statistic. Recall

F = (Extra sum of squares / Extra degrees of freedom) / σ̂²_full

where

Extra sum of squares = SSR(reduced) − SSR(full)


Here are both ANOVA tables, side by side:

Comparing the separate means model to the equal means model

ANOVA: Months survived

                 Sum of Squares   df    Mean Square   F        Sig.
Between Groups   12733.942        5     2546.788      57.104   .000000
Within Groups    15297.415        343   44.599
Total            28031.357        348

Comparing the two-means model to the equal means model

ANOVA: Months survived

                 Sum of Squares   df    Mean Square   F         Sig.
Between Groups   11131.393        1     11131.393     228.556   .000
Within Groups    16899.964        347   48.703
Total            28031.357        348

Calculate the F statistic to test the separate means model against the two-means model:

F₄,₃₄₃ = (Extra sum of squares / Extra degrees of freedom) / σ̂²_full
       = [(16899.964 − 15297.415) / (347 − 343)] / 44.599
       = 400.64 / 44.599 = 8.98

Comparing 8.98 to an F distribution with 4 and 343 degrees of freedom gives P < .0001, so there is strong evidence that the separate means model fits better than the two-means model: the four restricted diets do not all share a single mean lifetime.
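All the numbers needed for this F-statistic come from the within-groups rows of the two ANOVA tables above; a minimal sketch of the computation:

```python
# Residual (within-group) sums of squares and d.f. from the two ANOVA tables.
rss_full, df_full = 15297.415, 343        # separate means model (6 mean params)
rss_reduced, df_reduced = 16899.964, 347  # two-means model (2 mean params)

extra_ss = rss_reduced - rss_full         # 1602.549
extra_df = df_reduced - df_full           # 4
sigma2_full = rss_full / df_full          # 44.599 (MSW from the full model)

F = (extra_ss / extra_df) / sigma2_full   # about 8.98, compared to F(4, 343)
```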