STA 5126 -- 1 of Chp 2
©D. Mohr
CHAPTER 2: SOME TRULY USEFUL BASIC TESTS FOR QUANTITATIVE
VARIABLES
TOPICS RANDOM SAMPLES
INDEPENDENT AND DEPENDENT VARIABLES
PAIRED T-TESTS (Before/After comparisons)
INDEPENDENT SAMPLE T-TESTS (Separate group comparisons)
A. Large samples or unequal variances
B. Equal variances
TESTS TO COMPARE VARIANCES
NORMALITY ASSUMPTION
Section 1. Introduction
Many important experimental results are based on statistical analyses no more difficult
than those we will review in this chapter. They are among the most useful test statistics ever
devised, simply because the experimental designs which they match are easy, powerful, and
popular.
All statistical tests require that the sample studied be a random sample from the
population of interest. This is an extremely stringent requirement. It means that every person or
item in the population had an equal chance of making it into the sample. These are the ONLY
conditions under which sampling variability can be calculated. It ensures that sampling
variability is the sole source of error in your results. If samples are not selected randomly, biases can easily contaminate the results. At the very least, if random sampling is not possible, randomization has to be used to create comparison groups.
We should note that most hypothesis tests are loosely stated as questions of the form
"Does variable X affect variable Y?" For instance, we might ask if Gender affects a person's
opinion on Abortion, or whether Age affects a person's Blood Pressure. In questions where one
variable can be thought of as "affecting" the other variable, the "cause" is referred to as the
independent variable. The outcome is referred to as the dependent variable. Hence, in the
preceding examples, Gender and Age are independent variables possibly affecting the dependent
variables Opinion and Blood Pressure.
The tools discussed in this chapter are only applicable when the dependent variable is
quantitative. Basically, that is because all these methods focus on the effect of the independent
variable on the mean and standard deviation of the dependent variable. These parameters only
make sense if the variable is quantitative. There are further mathematical requirements, which
we will list at the end of the chapter.
Section 2. Paired t-test
The most common experimental designs are in the form of a comparison. Often, the
comparison is on values collected on the same experimental subjects. For instance, we may have
reading proficiency scores for children before and after they undergo a six-week training
program. We may have strength scores on right and left arms of the same person. We may have
yields from tomato plants of type A and B, when one of each were planted in the same pot. In
each of these examples, the key feature is that there is a "matching" mechanism which pairs an
observation of one type unambiguously with an observation of the other type. The statistical
technique we discuss will preserve the information due to the pairing by using one of the
observations as a "baseline" against which the other is measured.
The method is simple. Consider the before and after scores for reading proficiency. If we
are interested in whether the program (independent variable) affected the reading proficiency
(dependent variable), we are really interested in whether there was typically a change in the
scores from before to after. We will calculate the individual changes for each child and use the
one-sample t-test (Chapter 1) to test the null hypothesis that the mean change is zero (no effect exists).
Recipe for Paired t-test
Data structure: For n individuals, we have measurement 1 (X1) and measurement 2 (X2)
which we wish to compare. X1 and X2 must be quantitative variables. Form a new column D = X1 − X2 which contains the differences in the two measurements for each individual.
Perform a one-sample t-test on D:
1) Ho: μ_D = 0 (typical values do not differ for measurements 1 and 2)
   H1: μ_D ≠ 0 (typical values do differ for measurements 1 and 2)
2) Since the sample of D which we have observed has n observations, the t-statistic will have n-1 d.f. The test statistic is

   t = (x̄_D − 0) / (s_D / √n)

The subscript "D" is to remind you that these statistics (the sample mean x̄_D and standard deviation s_D) are calculated from the column of D=Differences. Form your critical region using the table of the t-distribution with n-1 df, where n is the number of pairs.
3) Calculate the value of t for your sample. If you use a statistical computer package, it
may give you the p-value for this test automatically.
4) Write the appropriate conclusion.
Example of a paired comparison. Notice the presence of a natural pairing mechanism between
observations with the different "treatments". What are the advantages of such a mechanism?

The data below are from Darwin's study of cross- and self-fertilization.
Pairs of seedlings of the same age, one produced by cross-fertilization and
the other by self-fertilization, were grown together so that the members of
each pair were reared under nearly identical conditions. The data are the
final heights of each plant after a fixed period of time, in inches. Darwin
consulted the famous 19th century statistician Francis Galton about the
analysis of these data. The summary information was produced by the
statistical package SAS for Windows.
PAIR CROSS SELF DIFF = cross-self
1 23.5 17.4 6.1
2 12.0 20.4 -8.4
3 21.0 20.0 1.0
4 22.0 20.0 2.0
5 19.1 18.4 0.7
6 21.5 18.6 2.9
7 22.1 18.6 3.5
8 20.4 15.3 5.1
9 18.3 16.5 1.8
10 21.6 18.0 3.6
11 23.3 16.3 7.0
12 21.0 18.0 3.0
13 22.1 12.8 9.3
14 23.0 15.5 7.5
15 12.0 18.0 -6.0
Summary on variable DIFF:

   N      Mean       Std Dev     Minimum     Maximum
  15   2.6066667   4.7128194  -8.4000000   9.3000000
The null hypothesis is that the mean difference in the population is 0, implying that mean heights of cross- and self-fertilized plants would not differ. In symbols,

Ho: μ_D = 0 vs H1: μ_D ≠ 0

There are 15 observations in the data set, so 14 d.f. If we use α = 5%, then the critical region would be "Reject Ho if t < -2.145 or t > 2.145". In this sample, t = 2.142. Hence, there is no significant evidence, at α = 5%, that cross- and self-fertilized seedlings differ in mean height.
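The arithmetic above can be checked in a few lines of code. Here is a minimal sketch in Python (standard library only), using the DIFF column from the table above:

```python
# Paired t-test on Darwin's data: a one-sample t-test on the differences.
from math import sqrt
from statistics import mean, stdev

# DIFF = cross - self, from the table above
diff = [6.1, -8.4, 1.0, 2.0, 0.7, 2.9, 3.5, 5.1, 1.8, 3.6,
        7.0, 3.0, 9.3, 7.5, -6.0]

n = len(diff)                      # 15 pairs -> 14 d.f.
d_bar = mean(diff)                 # about 2.607
s_d = stdev(diff)                  # about 4.713
t = (d_bar - 0) / (s_d / sqrt(n))  # about 2.142

# Critical region at alpha = 5% with 14 d.f.: |t| > 2.145
reject = abs(t) > 2.145            # False: not significant
```

Since 2.142 falls just inside the critical value 2.145, the test narrowly fails to reject Ho, matching the conclusion above.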
Section 3. Two-sample t-test (also called the independent samples t-test)
Frequently, we have two separate groups on which we wish to make comparisons. We
may be interested in comparing mean salaries for male and female entry-level employees, or
length of hospital stays for HMO and PPC plan insurees. In the first case, our independent
variable is gender while the dependent variable is salary. Salary is a quantitative variable for
which we summarize typical values using the mean. Unlike the paired t-test, where the values in
each group are naturally matched, here we assume the two groups are completely independent.
Diagram 1 gives a schematic of the statistical situation. We have two populations summarized by the means in each (μ1 and μ2) and the standard deviations (σ1 and σ2). Our hypotheses are

   Ho: μ1 = μ2 (μ1 − μ2 = 0) "group" has no effect on mean
   Ha: μ1 ≠ μ2 (μ1 − μ2 ≠ 0) "group" has an effect on mean

Since we cannot observe μ1 and μ2, we must use our sample data to reach conclusions. Looking at the hypotheses, our natural move is to compare the two sample means to each other, or equivalently, their difference to 0.
Ho: μ1 = μ2

DIAGRAM 1. Comparing two populations. (Schematic: Population 1, with parameters μ1 and σ1, yields Sample 1 with statistics n1, x̄1, s1; Population 2, with parameters μ2 and σ2, yields Sample 2 with statistics n2, x̄2, s2.)
If the population variances are known, probability theory shows that the appropriate statistic would be

   Z = (x̄1 − x̄2) / √(σ1²/n1 + σ2²/n2)
The two-sample t-test has two versions, which differ in how they "doctor up" the denominator of this statistic, since the population variances are hardly ever known. The versions differ depending on whether the two population variances can be assumed equal or unequal. In Section 4 we cover a method for checking this assumption.
Section 3A. Large samples or unequal variances
When the variances (or standard deviations) in the two groups appear very dissimilar, the
best method may be the unequal variance version. This method does not require the assumption
of equal variances. The disadvantage of this method is that the degrees of freedom are
sometimes small, and they are always difficult to calculate (this is referred to as Satterthwaite's approximation). While the test statistic itself is easy to calculate, the degrees of freedom are best computed by a statistical package. Without the computer, it helps to know that the d.f. are always between n_s − 1, where n_s is the size of the smallest sample, and n1 + n2 − 2, so if you get the same conclusion using both those d.f., you are safe. If both samples are large (at least 50), it is probably safe to use infinite (∞) d.f. The value of the test statistic is computed by:
   t = (x̄1 − x̄2) / √(s1²/n1 + s2²/n2)
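As a sketch of this version (with hypothetical data, invented here for illustration), the statistic and Satterthwaite's approximate d.f. can be computed directly; note that the d.f. land between n_s − 1 and n1 + n2 − 2, as claimed:

```python
# Unequal-variance (Welch) t statistic with Satterthwaite's approximate d.f.
from math import sqrt
from statistics import mean, variance

# Hypothetical samples with visibly different spreads
x1 = [3.1, 2.8, 3.4, 3.0, 2.9]
x2 = [5.2, 1.1, 7.8, 0.4, 6.5, 3.3, 9.0, 2.2]

n1, n2 = len(x1), len(x2)
v1, v2 = variance(x1), variance(x2)          # sample variances s1^2, s2^2

se2 = v1 / n1 + v2 / n2                      # squared standard error
t = (mean(x1) - mean(x2)) / sqrt(se2)

# Satterthwaite's approximation for the degrees of freedom
df = se2 ** 2 / ((v1 / n1) ** 2 / (n1 - 1) + (v2 / n2) ** 2 / (n2 - 1))

# The d.f. always fall between (smallest n) - 1 and n1 + n2 - 2
assert min(n1, n2) - 1 <= df <= n1 + n2 - 2
```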
Section 3B. Equal variance t-test
When the variances in the two samples appear similar, it is advantageous to "pool" the
two estimates into an estimate of the alleged single underlying variance. This allows us to pool
the degrees of freedom in the two groups as well, giving more sensitive critical regions.
   s_p² = [(n1 − 1)s1² + (n2 − 1)s2²] / (n1 + n2 − 2)   (the pooled variance)

   t = (x̄1 − x̄2) / (s_p √(1/n1 + 1/n2))   (note s_p, not s_p²)

   d.f. = n1 + n2 − 2
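A quick sketch of the pooled computation (again with hypothetical numbers, invented here for illustration) may make the recipe concrete:

```python
# Equal-variance (pooled) two-sample t statistic
from math import sqrt
from statistics import mean, variance

# Hypothetical samples with similar spreads
x1 = [10, 12, 14, 16, 18]
x2 = [11, 13, 15, 17, 19, 21]

n1, n2 = len(x1), len(x2)
v1, v2 = variance(x1), variance(x2)   # s1^2 = 10, s2^2 = 14

# Pool the two variance estimates, weighting by their degrees of freedom
sp2 = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
sp = sqrt(sp2)                        # pooled standard deviation (s_p, not s_p^2)

t = (mean(x1) - mean(x2)) / (sp * sqrt(1 / n1 + 1 / n2))
df = n1 + n2 - 2                      # 9 degrees of freedom
```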
Section 4. Comparing two standard deviations
Some authors now argue that we should always use the unequal variance version of the
test. Traditionally, however, the equal variance version was preferred both because of the
potentially greater degrees of freedom and because its relation to more advanced topics (like the
one-way ANOVA) is well understood. In this tradition, before we decide which version of the
two-sample t-test to use, we need a tool for deciding whether the variances in two groups are
equal or different. This amounts to a hypothesis test for the hypotheses

   Ho: σ1 = σ2 (or equivalently, σ1² = σ2²)
   vs Ha: σ1 ≠ σ2 (or equivalently, σ1² ≠ σ2²)
Note the restatement of the hypotheses in terms of the variances. There are many tests available
for testing these hypotheses. The most commonly cited are Fisher's test (F-test classic!) and
Levene's test.
Section 4a. Fisher's test
The test statistic used to compare the variances is the F-statistic. F is for Sir R. A. Fisher, who pioneered many classic statistical techniques.

   F = s1²/s2²   or   F' = s_max²/s_min²
F' differs from F only in that it always places the larger of the two sample variances in the
numerator. If the null hypothesis is true, we expect F (or F') to be near 1. If F is either very
much larger or very much smaller than 1 (F' very much larger than 1), we would believe Ha is
true. As always, the question is where to draw the line (critical region).
The table of the F-distribution is provided in most statistics texts. Most tables only give
the cutpoint which marks off the lower 1-A of area from the upper A of area in the righthand tail.
It can be quite confusing to understand how to use this to get the critical values for all the
varieties of test.
[Figure: generic shape of an F distribution, with area 1−A below the cutpoint and area A in the righthand tail.]
a) F is explicitly a two-tailed test. So we need cutpoints which mark off the lower α/2 area in the lefthand tail, and α/2 in the righthand tail. Most tables only give the righthand cutpoint. To get the lefthand cutpoint, you use

   Lefthand cutpoint for lower α/2 with M, N df = 1 / (Righthand cutpoint for upper α/2 with N, M df)
Example: Suppose we are using α = 5%, and sample 1 has n=10 while sample 2 has n=6 (9 and 5 df, respectively). We should put 2.5% in each tail. From the table, we see that the cutpoint for the upper tail is 6.68. To get the lower cutpoint, we need to reverse the order of the d.f. (now 5 and 9), then take the reciprocal of the upper cutpoint. That is, the lower cutpoint is 1/4.48 = .223.
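If you don't have the full table handy, the reciprocal rule is easy to verify numerically. Here is a simulation sketch (standard library only; the quoted quantile values come from the F table above):

```python
# Monte Carlo check of the F-table values and the reciprocal rule.
# An F(9,5) variate is (chi-square_9 / 9) / (chi-square_5 / 5), and a
# chi-square with k d.f. is a sum of k squared standard normals.
import random

random.seed(1)

def chi2(df):
    return sum(random.gauss(0.0, 1.0) ** 2 for _ in range(df))

N = 100_000
f_samples = sorted((chi2(9) / 9) / (chi2(5) / 5) for _ in range(N))

upper = f_samples[int(0.975 * N)]   # should be near the table value 6.68
lower = f_samples[int(0.025 * N)]   # should be near 1/4.48 = 0.223
```

The estimates will not match the table exactly, but they land close to 6.68 and .223, confirming that the lower cutpoint is the reciprocal of the upper cutpoint with the d.f. reversed.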
b) F' is also two-tailed, but it finesses the problem of getting the lower cutpoint by arranging to always put the largest variance on top. Hence, if we had sample 1 with n=10 and sample 2 with n=6, our critical region would be: if s1 is the largest, reject if F' > 6.68 (9 and 5 df, with α/2 in the upper tail); if s2 is the largest, reject if F' > 4.48 (5 and 9 df, with α/2 in the upper tail).

Example of an F test.
Notice that sometimes hypothesis tests about the variances (or standard deviations) are of interest in their own right.

Drill press operators in a manufacturing plant must drill holes of specified diameter in sheets of metal. One goal is that all holes should have the same diameter (small variability in the individual diameters). Actual diameters are measured for 20 holes drilled by inexperienced operators, and 10 holes drilled by experienced operators. The data are summarized below. Is there significant evidence, at α = 5%, that the population variances differ for experienced and inexperienced operators?

Inexperienced n = 20 s = .52 mm
Experienced   n = 10 s = .21 mm
1) Ho: σ_I = σ_E (variability is the same for experienced and inexperienced operators)
   Ha: σ_I ≠ σ_E (variability is not the same)
2) We will reject Ho if F' > 3.69, using the F table for upper area of .025, 19 df in numerator and 9 in denominator.
3) F' = .52² / .21² = 6.13
4) There is significant evidence that the variability in the diameters differs for experienced and inexperienced operators. Inexperienced operators have larger variability (less consistency) in the diameters of the holes.
Section 4b. Levene's Test (used by SPSS)
Levene's test actually tests the null hypothesis that the mean values of the magnitude of
the distances from individual observations to the mean are the same. Instead of defining
dispersion in terms of 'squared distances' as variances do, it uses absolute values of distances.
The actual algorithm is as follows:
1) Within each group, compute the difference between the individual observations and
the group mean.
2) Take the absolute value of these differences.
3) Do an independent-sample t-test (equal variance version) of the null hypothesis that the means of the absolute differences are equal.
4) Square the t-value from the t-test. (Under Ho, the square of a t should have an F distribution with 1 df in the numerator and n1 + n2 − 2 in the denominator.) Compare it to the cutpoint which places α (usually 5%) area in the upper tail of the distribution. You are only interested in large values of F, because only large values would indicate that the variances are different. (Note the difference between this and the cutpoints for Fisher's test, which place α/2 in each tail.)
Large values of F indicate that one of the means must be different from the other (Ha
true). Bear in mind at this point that we are no longer talking about the means of the raw data,
but of the distance of the raw values around their group means. In the example above, a large
value for F would indicate that the typical (mean) distance of individual diameters from the
group mean was larger in one group than in another, indicating more variability in one group.
Levene's Test and Fisher's Test do not give exactly the same result. Except in borderline
cases, however, they usually give comparable values. There is some intuitive evidence that
Levene's Test is less sensitive to departures from the normality assumption, and I think that is
why it is the default in SPSS.
Example of Levene's Test

The following data show test scores for five freshmen and five juniors on an assessment test for critical thinking. Does variability differ in the two groups, using α = 5%?

Freshman: 28 32 21 36 33 (sample mean = 30.0)
Juniors:  34 49 43 32 27 (sample mean = 37.0)
Ho: σ1² = σ2² vs Ha: σ1² ≠ σ2². Reject Ho if F > 5.32 (using the table with 1 and 8 d.f., and 5% in the upper tail.)
Absolute values of differences from mean within each group:
Freshman: 2 2 9 6 3 (sample mean = 4.4, s=3.05)
Juniors: 3 12 6 5 10 (sample mean = 7.2, s=3.70)
Sp = 3.39, df = 8, t = 1.31, F = 1.71. Since 1.71 is less than the
cutpoint of 5.32, there is no significant evidence that the variances are
different in the two groups. When we compute the t-test to compare the means
of the test scores, we can use the equal variance version.
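The four-step algorithm above is easy to code directly. Here is a sketch using the freshman/junior data (standard library only); the squared t agrees with the F = 1.71 computed above:

```python
# Levene's test, following the algorithm above: t-test on absolute deviations.
from math import sqrt
from statistics import mean, variance

freshman = [28, 32, 21, 36, 33]
juniors = [34, 49, 43, 32, 27]

# Steps 1-2: absolute deviations of each score from its group mean
def abs_devs(group):
    m = mean(group)
    return [abs(x - m) for x in group]

z1, z2 = abs_devs(freshman), abs_devs(juniors)

# Step 3: pooled (equal-variance) t-test on the absolute deviations
n1, n2 = len(z1), len(z2)
sp2 = ((n1 - 1) * variance(z1) + (n2 - 1) * variance(z2)) / (n1 + n2 - 2)
t = (mean(z2) - mean(z1)) / (sqrt(sp2) * sqrt(1 / n1 + 1 / n2))

# Step 4: square t; compare to the F cutpoint with 1 and n1+n2-2 d.f.
F = t ** 2                      # about 1.71
reject = F > 5.32               # False: variances not significantly different
```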
Section 5. A Full Example!!
Recipe for a two sample t-test:
Data structure: Two separate groups are measured for a quantitative variable Y.
Performing the two sample t-test
1) State the null and alternative hypotheses in terms of the means in the two groups.
2) Decide which version of the t-test to use by using the F-test or Levene's test to examine the
variances within the two samples.
3) Calculate the number of degrees of freedom for the appropriate version of the test. Use your α and a table of the t-distribution to set the critical region.
4) Calculate the appropriate version of t.
5) Write your conclusion.
Most computer programs will automatically calculate both versions of t as well as F, along with their p-
values, saving you a lot of effort.
Example for two-sample test (two independent samples)
Notice that the two groups of patients are completely separate, with no natural pairing.
In small to moderate samples, the particular version of the two-sample t-test depends on whether
the variances within the two groups seem similar. SAS for Windows computes an F-test to help
you decide which version is appropriate.

The data summarized below show cholesterol values for the 39 heaviest men in the Western Collaborative Group Study. (This study was carried out in California in 1960-1961 and involved 3,154 middle-aged men. The purpose was to study behaviour patterns and risk of coronary heart disease.) All the cholesterols summarized below are for men weighing more than 225 pounds. Cholesterols are given in mg per 100 ml. Each man was rated as generally having Behaviour Type A (urgency, aggression, ambition) or Behaviour Type B (relaxed, non-competitive, less hurried). In heavy, middle-aged men, is cholesterol level related to behaviour type?
1) The null hypothesis is that behavior type has no effect on mean
cholesterol, while the alternative hypothesis is that it does have an effect
on mean cholesterol. In symbols:
Ho: μ_A = μ_B vs Ha: μ_A ≠ μ_B
2) Since the hypotheses concern the means in two separate groups, we will use the two-sample t-test. To decide which version, we notice that the program has printed the value of F', along with the p-value (labeled Prob>F'). Recall that this statistic tests the null hypothesis that the two population variances are equal. Since the p-value of .2927 is greater than any reasonable α (.1 to .01), it is reasonable to assume that the variances are equal and use that version of the t-test.
3) For the pooled (equal) variance version, the d.f.=37. With a significance
level of 5%, we would reject Ho if t < -2.021 or t > 2.021. Alternatively, we
reject if the p-value is less than .05.
4) For the equal variance version, t=2.5191, df=37 and the p-value is .0162.
You should use the table of sample means and standard deviations to check
these results.
5) If we are using a significance level of .05, we would reject the null
hypothesis. Hence, we can say there is significant evidence that behaviour
type is associated with differences in mean cholesterol.
COMPUTER PRINTOUT - TTEST PROCEDURE
Variable: CHOL
TYPE N Mean Std Dev Std Error
---------------------------------------------------------------------
A 19 245.36842105 37.61384279 8.62920735
B 20 210.30000000 48.33991486 10.80913356
Variances T DF Prob>|T| <----- note how SAS labels
--------------------------------------- the p-values
Unequal 2.5355 35.7 0.0158
Equal 2.5191 37.0 0.0162
For H0: Variances are equal, F' = 1.65 DF = (19,18)
Prob>F' = 0.2927 <------note how SAS labels
the p-values
Boxplot for CHOL by TYPE
|
400 +
|
| | 0
300 + |
| +-----+ |
| *--+--* +-----+
200 + | *--+--*
| | +-----+
| |
100 +
------------+-----------+-----------
TYPE A B
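The equal-variance t on the printout can be reproduced from the summary table alone. Here is a sketch (standard library only; the means and standard deviations are copied from the SAS output above):

```python
# Reproducing SAS's pooled t-test from the summary statistics alone.
from math import sqrt

# From the printout: Type A and Type B cholesterol summaries
n_a, mean_a, sd_a = 19, 245.36842105, 37.61384279
n_b, mean_b, sd_b = 20, 210.30000000, 48.33991486

# Pooled variance and the equal-variance t statistic
sp2 = ((n_a - 1) * sd_a ** 2 + (n_b - 1) * sd_b ** 2) / (n_a + n_b - 2)
t = (mean_a - mean_b) / (sqrt(sp2) * sqrt(1 / n_a + 1 / n_b))

df = n_a + n_b - 2      # 37, as on the printout
# t comes out near 2.5191, matching the "Equal" line of the printout
```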
CASE STUDY
Jerrold et al. (2009) compared typically developing children to young adults who had Down syndrome, with respect to a number of psychological measures thought to be related to the ability to learn new words. Data on two of the measures are summarized in Table 5.6. Recall Score is a measure of verbal short-term memory. Raven's CPM is a task in which the participant must correctly identify an image which completes a central pattern.

The authors used the pooled t-test to compare the typical scores in the two groups. For Raven's CPM, t = .485, p-value = .629. For Recall Score, t = 7.007, p-value < .0001. Hence, the two groups did not differ significantly with respect to mean Raven's CPM, but the Down syndrome group scored significantly differently (apparently lower) on Recall Score. Based on this and a number of other comparisons, the authors conclude that verbal short-term memory is a primary factor in the ability to learn new words.

The authors' choice of the pooled t-test rather than the unequal-variance t-test appears reasonable here. For Raven's CPM, F = 0.700, p-value = .379. For Recall Score, F = 0.691, p-value = .361. Neither variable showed a significant difference in the variances within the groups. The other distributional assumption underlying t-tests is that the data come from normal distributions. Journal publications rarely have space in which to present graphical evidence with which the reader can check this assumption. However, the discussion will often include a sentence addressing this issue, and remark on any transformations (e.g. logarithms) used to make the variable more nearly normal.

The authors actually presented the results of the pooled t-test (with 80 degrees of freedom) as an F-test with 1 degree of freedom in the numerator and 80 in the denominator. The relation between these two test statistics will be explained in Chapter 4. Summary statistics from Jerrold (2009):
Down Syndrome young adults Typically developing children
n = 21 n = 61
Mean S.D. Mean S.D.
Raven’s CPM 19.33 4.04 19.90 4.83
Recall Score 12.00 3.05 18.25 3.67
(Source: Jerrold, C., Thorn, A. S. C, and Stephens, E. (2009). The relationship among verbal short-term memory, phonological awareness, and new word learning: evidence from typical development and Down syndrome. J. Experimental Child Psychology, 102(2) 196-218.)
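Both t statistics quoted in the case study can be recovered from Table 5.6's summary statistics with the pooled formula. A sketch (standard library only):

```python
# Checking the case study's pooled t values from the summary statistics.
from math import sqrt

def pooled_t(n1, m1, s1, n2, m2, s2):
    """Equal-variance two-sample t from summary statistics."""
    sp2 = ((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / (n1 + n2 - 2)
    return (m1 - m2) / (sqrt(sp2) * sqrt(1 / n1 + 1 / n2))

# Down Syndrome group: n = 21; typically developing children: n = 61
t_raven = pooled_t(21, 19.33, 4.04, 61, 19.90, 4.83)    # |t| about .485
t_recall = pooled_t(21, 12.00, 3.05, 61, 18.25, 3.67)   # |t| about 7.007
```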
Section 6. Nasty mathematical assumptions
We already know of two fundamental assumptions underlying the tests in this chapter,
and that of the t-test in Chapter 1.
1) The sample must be random
2) The dependent variable must be quantitative
In addition, the derivations of the t and F-distributions have a nasty mathematical assumption:
that the distribution of the variable in the population must follow a normal distribution. In plain
language, if you could draw a histogram of the values for all the observations in the entire
population, you should see the famous "bell curve". So we have a third assumption:
3) The distribution for the individual values is normal.
It is not very likely that we will ever know for sure whether assumption 3 is met. What can we
do to check and how important is it anyway? There are several graphical techniques we can use
to check for normality. So far, we have seen dotplots and boxplots, though we have not
discussed them. (See your elementary text.) In chapter 4 we will meet a tool called a normal
probability plot which gives a more sensitive check. What are we really looking for? An
immediate cause of trouble in a small or moderate data set would be when one or two values are
very far away from the rest. The self/cross-fertilization data used as an example of the paired t-
test may be a case where the data contains two "outliers". Outliers should be rare in normally
distributed data. Outliers can cause the p-values and critical regions to be only approximate.
The most common problem is to make the p-value larger than what it should be.
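One quick numerical screen for outliers, sketched here on the Darwin differences, is to flag observations more than two standard deviations from the mean (a crude rule of thumb for small samples, not a formal test):

```python
# Crude outlier screen: flag values more than 2 standard deviations from the mean.
from statistics import mean, stdev

# DIFF column from the Darwin paired example
diff = [6.1, -8.4, 1.0, 2.0, 0.7, 2.9, 3.5, 5.1, 1.8, 3.6,
        7.0, 3.0, 9.3, 7.5, -6.0]

m, s = mean(diff), stdev(diff)
outliers = [d for d in diff if abs(d - m) > 2 * s]
# The pair with difference -8.4 is flagged; -6.0 falls just inside 2 s.d.
```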
If there are no outliers, and the data show a nearly symmetric pattern with the points
clustering in the middle of the range, then it is unlikely that non-normality is a serious problem.
If you do seriously suspect nonnormality in your data, consult a statistician on a variety of
"nonparametric" statistical tests which do not require the normality assumption.
There is one frequent case in social science data where normality is very questionable. If
you have data collected on an ordinal scale (e.g. 0 = strongly disagree to 4 = strongly agree), it is
unlikely to be normally distributed. Recall that the normal distribution is for continuous or
nearly continuous random variables, and data on a five point scale is quite discrete. This is
especially true if the values cluster at one end or the other end of the scale (e.g. almost all agree
or strongly agree). In this case, one of the techniques of Chapter 3 might be appropriate.
Furthermore, it is questionable as to whether one can legitimately average values on this kind of scale: does ("disagree" + "strongly agree")/2 = "agree"? Nevertheless, treating this ordinal data AS IF it were numerical on a 0-4 scale, and conducting averaging operations, is a sloppy but common practice in the social sciences. Averages over several questions frequently produce values which appear reasonably normally distributed.
Finally, when comparing two population means, the choice of the version of the test
depends on whether variances can be assumed equal. Since this assumption, referred to as "homogeneity of variance", underlies much of the Analysis of Variance, we list it as a fourth assumption:
4) variances in the two groups are equal.
EXERCISES FOR CHAPTER 2
*Exercise 1. Data below show blood pressures for 5 subjects. The first value was taken while the
subject was resting. The second was taken while the subject was resting, but asked to work a mental
arithmetic problem. Does math affect mean blood pressure? Use α = 5%.
Subject number Resting BP During Math BP
1 115 125
2 125 125
3 110 130
4 120 115
5 110 125
*Exercise 2. Occupancy rates (average annual percentage of beds filled) are compared for randomly
selected urban and suburban hospitals in a state.
a. Is there evidence of a difference in variability between the two groups? Compute both Levene's Test and Fisher's Test to answer this. Use α = 5%.
b. Is there evidence of a difference in the mean occupancy rates? Use the results of part a to help you decide on an appropriate version of the t-test. Use α = 5%.
Urban: 76.5 79.6 77.5 79.4 79.3 78.1
Suburban: 71.5 73.4 71.2 67.8 63.0 76.5
Exercise 3. Eight students volunteer to participate in a test of the effect of caffeine on the speed with
which they can respond to a flashing light. Each student takes the test on a morning when they have
had no caffeine, then again a week later on a morning after having had the equivalent of two cups of
coffee. The data is given below, in hundredths of seconds to respond to the light.
Subject Without caffeine With caffeine
1 12 10
2 18 14
3 22 20
4 9 8
5 14 14
6 24 21
7 21 19
8 16 14
Does caffeine have an effect? Use α = 5%.
Exercise 4. Do HMOs really reduce costs of care? 40 adults aged 55-60 enrolled in HMOs are questioned on their health care within the last 2 years. They report an average of 1.19 days hospitalized during that period, with a standard deviation of 1.4 days. A similar sample of 40 adults with ordinary healthcare insurance reports an average of 1.35 days with a standard deviation of 1.7 days.
a) Is there evidence of a difference in the variability within the groups? (You don't have enough information to do Levene's test here; you must use Fisher's.)
b) Is there evidence of a difference in the means for the groups?
Use α = 5% for each test.
Exercise 5. We are comparing math FCAT scores for rural and urban high schools. We have a
random sample of 20 urban high schools and 20 rural high schools. Their school-aggregated math
FCAT scores for 10th graders are summarized below.
Location n sample mean sample standard deviation
Rural 20 1925 252
Urban 20 1982 212
a. Use the F' test to say whether it is reasonable to assume that the two populations have the same variance. Why can you not prove that the variances are equal?
b. Do the means differ significantly in the two groups? Use α = 1%.
Exercise 6. From each of 4 different litters of mice, a researcher chooses two female mice (for a total
of 8 mice). Within each pair of sisters, one is chosen to be fed a standard diet, and the other is fed a
high-protein diet. Their weight in grams, at the end of 6 weeks, is shown below.
Diet Pair 1 Pair 2 Pair 3 Pair 4
Standard 19.4 18.2 18.5 19.8
High-protein 17.6 19.4 17.2 19.2
Do the different diets seem to affect mean weight? Use α = 5%.
Exercise 7. (A development from Exercise 1.) The researcher wishes to know whether girls and boys
differ in their reaction to arithmetic. 5 girls are recruited, and their blood pressures are tested resting, and again resting but doing mental arithmetic. 5 boys are tested under the same circumstances. The data are given below. Is there significant evidence, at α = 5%, that boys and girls differ in the mean change in BP experienced while doing arithmetic? Note: this experimental design uses ideas from both paired and two-sample experiments!
Girls Boys
Resting During Math Resting During Math
120 130 120 125
110 115 110 125
115 115 105 115
110 120 110 120
110 115 120 115
Exercise 8. Pedersen (2007, Perceptual and Motor Skills, 104(1), pp 201-211) interviewed a sample
of students enrolled in psychology courses in a large private university in the western U.S. regarding
their attitudes towards sports. Each student was asked to self-rate his or her degree of sport
participation, on a scale of 1 to 5. The 112 men in the sample had M = 4.3 and SD = 1.7. The 173
women had M = 3.6 and S.D. = 1.7. (M is a common abbreviation for the sample mean, and SD a
common abbreviation for the standard deviation.) Is there significant evidence, at α = 1%, that men and women at this university differ in their mean self-rankings of sport participation?
Exercise 9. Martinussen et al. (2007, J. Criminal Justice 35, 239-249) compared 'burnout' among a
sample of Norwegian police officers to a comparison group of air traffic controllers, journalists and
building constructors. Burnout was measured on three scales: exhaustion, cynicism, and efficacy. The
data is summarized in the table below. The authors state
The overall level of burnout was not high among police compared to other occupational
groups sampled from Norway. In fact, police scored significantly lower on exhaustion and
cynicism than the comparison group, and the difference between groups was largest for
exhaustion.
Substantiate the authors' claim regarding Exhaustion. That is, check that it does show a significant difference between the two groups.
Summary Statistics for Exercise 9
Police, n = 222 Comparison group, n = 473
Mean std dev mean std dev
Exhaustion 1.38 1.14 2.20 1.46
Cynicism 1.50 1.33 1.75 1.34
Efficacy 4.72 0.97 4.69 0.89
SOLUTIONS TO STARRED PROBLEMS
Exercise 1. Notice the existence of a pairing mechanism between items. Each experimental unit (a subject) has two
blood pressures--a resting and a ‗during math‘ blood pressure. This should be done via a paired t-test.
a) D = mean difference in during math – resting blood pressure in the population
Ho: D = 0 versus Ha: D 0.
b) The 5 differences in the sample are: 10 0 20 –5 15. There will be 4 degrees of freedom. We will reject Ho if t
< -2.776 or t> 2.776
c) 8 0
8 10.368, 1.72510.368/ 5
Dd and s t
d) Do not reject Ho. There is no significant evidence that math affects mean blood pressure, at = 5%.
Further Note. A computer package would not tell you the cutpoints for t. Instead, it would report that the p-value for this
data was .1595. Since .1595 > .05 (your α), you would not reject Ho.
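The arithmetic in the solution can be checked in a few lines of Python; since a paired t-test is just a one-sample t-test on the differences, `scipy.stats.ttest_1samp` reproduces both the t statistic and the p-value (a sketch, using the five differences from part b):

```python
import math
from scipy import stats

# The five during-math minus resting differences from Exercise 1
diffs = [10, 0, 20, -5, 15]

n = len(diffs)
dbar = sum(diffs) / n                                            # mean difference: 8.0
s_d = math.sqrt(sum((d - dbar) ** 2 for d in diffs) / (n - 1))   # sd of differences: 10.368
t = dbar / (s_d / math.sqrt(n))                                  # t = 1.725 with df = n - 1 = 4

# Same test done for us: one-sample t-test of the differences against 0
res = stats.ttest_1samp(diffs, 0.0)
print(round(res.statistic, 3))   # 1.725
print(round(res.pvalue, 4))      # p-value, approximately .1595 as noted above
```

(The sign is positive here because the differences were taken as during-math minus resting; SPSS's REST - MATH ordering flips it to -1.725, with the same p-value.)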
Exercise 2.
Urban group had mean = 78.4 and s.d. = 1.2458
Suburban group had mean = 70.5667 and s.d. = 4.6779
a) Difference in variability: Ho: σ²S = σ²U versus Ha: σ²S ≠ σ²U
Fisher's test. Reject if F' > F-table value with 6-1=5 and 6-1=5 df and 2.5% in the tail.
F' = 4.6779² / 1.2458² = 14.1. The cutpoint in the table is 7.15. Since F' > 7.15, we reject Ho. There is significant evidence of
a difference in variability, at α = 5%. (Note: the tail value is half the desired α for the F' version.)
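The F' computation above is easy to reproduce numerically; the table cutpoint can be looked up with `scipy.stats.f` instead of a printed F-table (a sketch, using the standard deviations from the solution):

```python
from scipy.stats import f

s_urban, s_suburban = 1.2458, 4.6779
n_urban, n_suburban = 6, 6

# Fisher's F': larger sample variance over smaller sample variance
F = s_suburban**2 / s_urban**2

# Upper 2.5% point of F with 5 and 5 df (half of alpha = 5% in the tail)
cutpoint = f.ppf(0.975, n_suburban - 1, n_urban - 1)

print(round(F, 1), round(cutpoint, 2))   # 14.1 7.15
print(F > cutpoint)                      # True: reject Ho
```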
Levene's test. Absolute values of the differences of individual scores from their group mean:
Urban      0.9    1.2    0.9    1.0    0.9
Suburban   .93    2.83   .63    2.77   7.57   5.93
Running an independent samples t-test (equal variance version) on these absolute deviations gives t = 2.099, so F = 2.099² = 4.406 with 1 and
10 df. Since the cutpoint in the F-table is 4.96 for α = 5% (don't split the α!) we would not reject Ho: that is, we have no
significant evidence of a difference in variability.
b) Fisher's test and Levene's test differ on the advisability of using the equal-variance versus unequal-variance version.
Fortunately, in this case, the answers don't differ. Both versions of t come out to 3.96, which would be significant
whether you use df = 10 (equal variance version) or df = 5 (smallest df possible under the unequal variance version).
SPSS NOTES IF YOU WANT TO GET STARTED ON YOUR OWN!
Step 1. Deciding how to set up your data.
When you double-click on the SPSS icon, the first thing you see is a spreadsheet-like grid for entering your data. This is
called the Data Editor. Before you charge in and start typing, you have to think about how the data is structured. The
format in which you enter the data must follow that structure.
The basic rule-of-thumb is that entries on the same row, or line, are from the SAME subject, or experimental unit.
Things on different lines are from different subjects. Things in different columns are different measurements. Let's see
how that plays out in the Starred Exercises. By double-clicking on the heading of each column, you can change the name
to something sensible, and also indicate whether your data is nominal ('string') or numerical.
Exercise 1. There are 5 different subjects, so our data entry will have five rows. There should be one column for subject
number ('SUBJ'), one for the resting BP ('REST') and one for the during-math BP ('MATH'). In other words, the data
entry will look very much like the table given in the problem.
Exercise 2. There are 14 different hospitals. Each will have its own row in the data entry. In addition to a column for
occupancy rate ('O_RATE'), I will need a column which tells me whether it is an Urban or Suburban hospital. I will call
this column 'LOCATION'. Many of the ANOVA and T-test routines in SPSS want group variables to be coded AS IF
they were numeric. I am going to code Urban=0, Suburban=1. Keep notes of the codes you define.
LOCATION   O_RATE
0          76.5
0          79.6
...
1          76.5
Step 2. Request the appropriate t-test
Exercise 1.
In SPSS, click on the ANALYZE option at the top. From the drop-down menu, request COMPARE MEANS. Choose the
type of T-test you need, in this case, the PAIRED SAMPLES T-TEST. You will see a 'Dialog Box' like the one below.
You need to click on the column names with the two variables you are trying to compare (REST and MATH), and move
them into the big box on the right using the key that looks like an arrow >. Then hit the OK button. You will see printout
like that on the next page.
Paired Samples Statistics

                         Mean       N    Std. Deviation   Std. Error Mean
  Pair 1   REST         116.0000    5       6.51920           2.91548
           MATH         124.0000    5       5.47723           2.44949

Paired Samples Correlations

                          N    Correlation    Sig.
  Pair 1   REST & MATH    5       -.490       .402

Paired Samples Test

                         Paired Differences
                         Mean      Std.        Std. Error   95% Confidence Interval      t        df   Sig.
                                   Deviation   Mean         of the Difference                          (2-tailed)
                                                            Lower        Upper
  Pair 1   REST - MATH   -8.0000   10.36822    4.63681      -20.8738     4.8738         -1.725    4    .160
The first panel gives you some of the summary statistics within each group. The last panel reports the results of the t-test.
The p-value is labeled Sig., which is short for 'Observed Significance Level'. Since the .16 is greater than your α, you do
not have significant evidence of a math effect.
Exercise 2. From the ANALYZE / COMPARE MEANS menu, choose INDEPENDENT SAMPLES T-TEST. You need
to click on O_RATE and use the > key to move it into the Test Variable(s) box. You need to click on LOCATION and
move it into the Grouping Variable box. Then hit OK.
Group Statistics

           LOCATION    N    Mean      Std. Deviation   Std. Error Mean
  O_RATE   0           6    78.4000      1.24579           .50859
           1           6    70.5667      4.67789          1.90974

Independent Samples Test (O_RATE)

Levene's Test for Equality of Variances:   F = 4.406,   Sig. = .062

t-test for Equality of Means
                                  t       df      Sig.         Mean         Std. Error   95% Confidence Interval
                                                  (2-tailed)   Difference   Difference   of the Difference
  Equal variances assumed       3.964     10        .003        7.8333       1.97630     (3.42985, 12.23681)
  Equal variances not assumed   3.964     5.706     .008        7.8333       1.97630     (2.93638, 12.73029)
Note that SPSS automatically gives you Levene's test to help you choose the version of the t-test. The p-value is once
again labeled Sig. in SPSS.
BOXPLOTS
Simple boxplots use a box to mark off the middle 50% of the data. The box extends from the first
quartile to the third quartile, with a thick mark at the median. The purpose of the box is to draw your
eye to the central 'typical' half of the data. The lowest 25% of the data is marked off by a whisker that
extends from the minimum value to the first quartile. The highest 25% of the data is marked off by a
whisker that extends from the third quartile (75th percentile) to the maximum.
Modified boxplots alter the whiskers to draw your attention to outliers, or wild values in the data set.
To define outliers, the computer calculates a value called the hinge width, which is 1.5 x (75th
percentile - 25th percentile) = 1.5 x length of 'box'. Any value lying more than one hinge width
ABOVE the 75th percentile is an outlier on the high side. Any value lying more than one hinge width
BELOW the 25th percentile is an outlier on the low side. Modified boxplots draw the whiskers from
the quartile to the most extreme value that is not an outlier. Any outliers are marked off with a
separate symbol.
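The outlier rule can be sketched numerically in a few lines. (A caveat: quartile conventions differ slightly between packages, so the fences below use NumPy's default interpolation and may not match SPSS's hinges exactly; the toy data set is made up for illustration.)

```python
import numpy as np

data = np.array([1, 2, 3, 4, 5, 100])

q1, q3 = np.percentile(data, [25, 75])   # first and third quartiles
hinge_width = 1.5 * (q3 - q1)            # 1.5 x length of the 'box'

low_fence = q1 - hinge_width
high_fence = q3 + hinge_width
outliers = data[(data < low_fence) | (data > high_fence)]

print(q1, q3)      # 2.25 4.75
print(outliers)    # [100] -- flagged as an outlier on the high side
```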
Boxplots give you a quick view as to whether typical values (denoted by the boxes) are changing.
They also help you see whether the spread (variance) is relatively stable. They can also help you
diagnose non-normality, by helping you spot asymmetries or outliers.
Outliers are wild, unusual values. Normally distributed data should have very few, if any, outliers.
A very large data set might reasonably have a few (1%?) outliers without causing harm, but very
severe or frequent outliers can cause statistical trouble. Moreover, outliers are of interest in their
own right: what causes these people to be so different from the rest?
Example of Boxplots The typical values are higher in Group 1 than in Group 2. The spreads are
similar, except that Group 2 has an outlier with an unusually large value.
[Figure: side-by-side boxplots of X for GROUP = 1.0000 and GROUP = 2.0000 (N = 15 in each group), y-axis running from -2 to 14; case 25 is marked as a high outlier in Group 2.]