Page 1: Study Guide


EDUC9762

Study Guide Week 8

Testing for differences among multiple groups

So far, when examining differences on a continuous variable between different groups (categories) we have looked at:

differences between the scores within a group and a single reference value – the one sample t-test;

differences in the mean scores between two separate groups – the independent samples t-test; and

differences between the scores of matched individuals (e.g. twins) or the same person across two occasions (e.g. pre-and post-tests) – the dependent or paired samples t-test.

These tests are very useful and indeed are used very commonly.

However, there are many situations in which we have multiple groups – three or more – and we want to know whether there are differences between any of these groups on a continuous variable that interests us.

We could do this by using multiple t-tests, but as you see below, this is poor practice and can lead to errors in making inferences.

In your data file, you have variables that have multiple groups, e.g. ASDHHER (home educational resources), ASDGSRSC (reading self-concept), and ASDHEDUP (parents' education level). We will focus on ASDHHER for now and we will ignore cases with missing values.

The variable HER has three ‘levels’, i.e. there are three categories of HER, namely low, medium and high levels of access to educational resources at home, but there are also likely to be missing values and we need to check the ‘missingness’ of this variable before we use it for other analyses. The results of this check are shown in Table 1.

Table 1: Index of home educational resources

                  Frequency   Per cent   Valid per cent   Cumulative per cent
Valid    High          546       11.6             11.9                  11.9
         Medium       3975       84.6             87.0                  98.9
         Low            50        1.1              1.1                 100.0
         Total        4571       97.2            100.0
Missing  Omitted       130        2.8
Total                 4701      100.0

For this variable, we have 2.8% missing data, and that is generally regarded as not being a threat to the validity of conclusions we might draw. This might not be the case for your country, so you will need to check this.

So, we have three distinct groups based on HER. If we want to compare the means for these groups to see if there are differences in reading achievement between them, we could do a series of t-tests. If we use a t-test for independent samples, we could compare low- with medium-, medium- with high-, and low- with high-HER. For three groups, we would need to use three t-tests. If you have g groups, you will need to do g(g-1)/2 tests and this soon becomes many tests.
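The growth in the number of pairwise tests can be illustrated with a short Python sketch (Python is used here for illustration only; the analyses in this guide are done in SPSS):

```python
from itertools import combinations

# Number of pairwise comparisons among g groups: g(g - 1) / 2.
def n_pairwise(g):
    return g * (g - 1) // 2

# Cross-check by listing the actual pairs for the three HER groups.
pairs = list(combinations(["low", "medium", "high"], 2))
assert len(pairs) == n_pairwise(3) == 3

# The count grows quickly: 4 groups need 6 tests, 10 groups need 45.
assert n_pairwise(4) == 6
assert n_pairwise(10) == 45
```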

There is a problem with this method. Recall from previous discussions of statistical inference that when we conduct a statistical test, we are using our sample data to make inferences about the population from which the sample is drawn. Because samples are a random selection from the population, we expect them to represent the population, but we know that there are variations between the samples we select. This leads to the possibility that we will draw a sample that is not truly representative.

In inferential testing, we know about this problem and we agree, before we conduct any tests, that we are prepared to accept a 5% chance that we will make such an error – a Type I error, in which we reject a null hypothesis (that there is no difference between two groups) that is in fact true. That is, we claim there is a difference when there is not. That is acceptable, because we can expect that other researchers will repeat our study with other samples from that population and that, eventually, our theory will be thoroughly tested.

But if we keep making comparisons using a single sample, the likelihood of making a Type I error increases. It is like throwing dice. If you throw a die once, you have the same chance of getting a six as any other number (1/6). But if you throw the die twice, you increase the chance of getting at least one '6', and the more often you throw it, the more likely it is that you will eventually get a '6'. If you do three t-tests, the probability of not making a Type I error is .95 × .95 × .95 (i.e. .95³) or .857. Thus, the probability of making at least one Type I error over three tests is 1 − .857 = .143, or 14.3%. This is much higher than we would want. If you are planning to do three t-tests, you could set the p-value at .05/3 or .017 for each test. The problem with this method is that it is very demanding, and you would rarely reject a null hypothesis, even if it is false (a Type II error), unless you had very large samples.
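The inflation of the Type I error rate across repeated tests, and the per-test correction just mentioned, can be computed directly (a Python sketch; the figures match those quoted above):

```python
# Familywise Type I error rate over k independent tests at alpha = .05:
# P(at least one Type I error) = 1 - (1 - alpha)^k.
alpha, k = 0.05, 3

p_no_error = (1 - alpha) ** k      # .95^3, approximately .857
p_family = 1 - p_no_error          # approximately .143, i.e. 14.3%

# Bonferroni-style correction: test each comparison at alpha / k.
alpha_per_test = alpha / k         # approximately .017
```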

This dilemma is resolved by using a one-way analysis of variance or a one-way ANOVA.

ANOVA

The name of this procedure gives us a clue about how it works; it analyses sources of variance in the continuous outcome variable (reading achievement). We can understand how it works by thinking about the variation in reading achievement between groups based on their access to educational resources at home. We might expect that students with high access to educational resources should do rather well, but there will be variation in achievement within that group. Similarly, we might expect that students with access to a medium level of HER will do moderately well, but not as well as those with more HER. Finally, we might predict that students with very limited access to HER would do relatively poorly but we would also expect that there will be considerable variation within this group. Of course, we expect exceptions to this pattern and we know that within each group, there will be variation in reading ability. The question that interests us is:

Is the variation in reading ability between the three groups substantial compared with the variation that we observe within the groups?

Most students in Lithuania have moderate access to HER, but about 11% have very good access and about 1% have very limited access (see Table 1).

We should explore the variation that occurs within the sample and the sub-groups. This is exploratory data analysis. We are just having a quick look without doing any serious tests.

If we think about variation that occurs between individuals, we can first think about the entire sample, without separating students into their HER categories, and we can find the mean reading score and the variance or standard deviation. Then we can do the same investigation for each sub-group separately. The mean for the overall reading score is 540.846 (sd = 56.280; N = 4701). (Note the discrepancy between the mean for the total sample and the mean shown for the total in Table 2. This is because there are 130 cases with missing data on HER, and these are not included in the total shown in Table 2.) The table was generated using the Case Summaries... command from the Analyze, Reports menu option.

I moved reading achievement (ASRREA01) into the Variables box and ASDHHER into the Grouping Variable(s) box. I unchecked the Display cases check box.

Using the Statistics button, I moved Number of cases, Mean, and Standard Deviation into the Cell Statistics box then clicked Continue and OK.

Table 2: Descriptive statistics for reading achievement for HER groups

Index of Home Educational Resources (HER)      N      Mean   Std. Deviation
High                                          546   577.668          51.312
Medium                                       3975   536.926          54.782
Low                                            50   499.867          58.731
Total                                        4571   541.387          56.161

In the discussion above, I guessed that high-HER students would have the highest reading score and that low-HER students would have the lowest. While it is useful to begin an analysis with an expectation of the outcome, we need to be open to surprising results. In this case the means do appear to be quite different. The standard deviations of the groups are a bit different, but I am not going to be concerned about that at this stage. When we do an ANOVA, it is assumed that the variances within each of the groups are similar. If they are very different, our analysis might be invalid. In this case, I am comfortable that they are sufficiently similar not to be a concern.

The information in Table 2 is not a basis for drawing a firm conclusion about the differences between the groups. We would like to know if the differences are statistically significant.

We turn our attention to how students vary. We calculate variance by finding the difference between an individual’s score and the mean. We square this, sum the squared values for all individuals before finally dividing by the number of cases. This gives us the variance of the entire sample.

But we can calculate the variance in another way. We can think about the variance as being based on two components: the difference between the individual’s score and their group’s mean and the difference between their group’s mean and the overall mean. That is, we are saying that the variance can be split into two components, a within group component and a between group component.
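This partitioning identity (total sum of squares = between-groups sum of squares + within-groups sum of squares) can be verified on a small set of made-up scores (the numbers below are illustrative only, not from the study data):

```python
# Toy illustration that total variation splits into a between-group
# component and a within-group component.
groups = {
    "low":    [480, 500, 520],
    "medium": [530, 540, 550],
    "high":   [570, 580, 590],
}
scores = [x for g in groups.values() for x in g]
grand_mean = sum(scores) / len(scores)

# Total: squared deviations of every score from the grand mean.
ss_total = sum((x - grand_mean) ** 2 for x in scores)

# Within: squared deviations of each score from its own group mean.
ss_within = sum((x - sum(g) / len(g)) ** 2
                for g in groups.values() for x in g)

# Between: squared deviations of each group mean from the grand mean,
# weighted by group size.
ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2
                 for g in groups.values())

assert abs(ss_total - (ss_between + ss_within)) < 1e-9
```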

In order to see how this partitioning of the variance into two components works in practice, we will examine the differences between the three sub-groups identified in the HER variable and we will do this using an ANOVA.

Moving beyond an exploratory analysis

I have been a bit lax in developing this example through exploratory analysis because I did not specify a null hypothesis and an alternative hypothesis. As in previous comparisons, the null hypothesis states:

Ho: there are no differences in reading achievement between the groups based on HER.

The alternative hypothesis is:

Ha: there is a difference in reading achievement between at least one pair of groups.

Notice the wording of the null and alternative hypotheses. We are not saying anything about which groups might be different from which other groups. The alternative hypothesis simply states that there is at least one difference between groups. There might be more than one difference. If we reject the Ho, this test does not tell us which groups are different, just that there is a statistical difference in there somewhere. Later we will find ways to identify which groups are different.

In order to test the null hypothesis, we run an ANOVA by selecting the Analyze, Compare Means, One-way ANOVA option from the SPSS menu. This produces a familiar dialogue box (see Figure 1).

Figure 1: One-way ANOVA dialogue box with the reading achievement variable

selected and the HER variable as the grouping or factor variable

When this analysis is run, we get the following output table. In this table, the variances as we are accustomed to seeing them are not calculated. Instead, part of the variance calculation is done (sums of squares and mean squares), first for the differences between the groups and then for the differences between individual scores and their group means, i.e. a within-group component (see Table 3).

Table 3 Result of partitioning the variance of reading achievement scores into a between-groups component and a within-groups component on HER

                 Sum of Squares     df   Mean Square         F   Sig.
Between Groups       884026.626      2    442013.313   149.232   .000
Within Groups      13530048.711   4568      2961.920
Total              14414075.336   4570

If there were not much difference between the groups, i.e. if the groups were fairly similar, the variance between the groups would be low compared to the variance within the groups. In this case, we can see that the variation between groups, indicated by the mean square, is relatively high at 442013, whereas the variation within the groups is much smaller at 2962. So we observe much more variation between groups than within them; the ratio of between-group to within-group variance is large (149 times as much variance between groups as within them), and we find that this ratio is statistically significant (p < 0.001). So we conclude that the difference in reading achievement between at least two of the HER groups is significant.

SPSS and other statistical programs compare these indicators of variance (the mean squares) by finding their ratio. In this case, the ratio is 149.232, i.e. the variation between groups is 149 times the variation within groups. This suggests very strongly that there are probably differences in reading proficiency between the various groups based on their HER status.
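The arithmetic behind this ratio can be reproduced from the sums of squares and degrees of freedom reported in Table 3 (a Python sketch using those figures):

```python
# Mean squares and F ratio reconstructed from the ANOVA table (Table 3).
ss_between, df_between = 884026.626, 2          # 3 groups -> 3 - 1 = 2 df
ss_within, df_within = 13530048.711, 4568       # 4571 valid cases -> 4571 - 3 df

ms_between = ss_between / df_between            # "mean square" = SS / df: 442013.313
ms_within = ss_within / df_within               # approximately 2961.920

f_ratio = ms_between / ms_within                # approximately 149.232
```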

We can reject the Ho that there are no differences because the F statistic (149.232) has a very low probability (p<0.001) that we would see this much variation between groups in our sample if there were no differences in the population.

In reporting the results of this analysis, we might say something like:


In order to investigate possible differences in reading achievement between students based on their access to educational resources at home (HER), a one-way analysis of variance was undertaken. We find evidence that there is a difference in reading achievement based on HER status (F(2, 4568) = 149.232, p < 0.001).

Which groups are different?

If we reject Ho, we can begin a search to locate which groups are different from which others. Before we do that, we should look again at Table 2. There do seem to be substantial differences between all groups, but there is quite a bit of variation within each group and the low-HER group is a relatively small sample, so it is hard to guess which differences might be statistically significant.

In order to test for statistical significance, we need to supplement the ANOVA with a post-hoc test. This simply means that after we have found that there seems to be a difference, the post-hoc test helps us to locate the source of the significance we found in the ANOVA.

We need to re-run the ANOVA, but this time we will click on the Post-hoc button (see Figure 1). This leads to the dialogue box shown in Figure 2. There are many post-hoc tests available. I have checked ‘Tukey’s b’ for no other reason than it is the one I normally use if the variances of the groups are reasonably similar. Feel free to try others to see if they make any difference. Field (2009, p. 378) has a more detailed discussion and he selects several of these tests.

Figure 2: Post-hoc comparisons dialogue box for a one-way ANOVA

In addition to the ANOVA table that we saw as Table 3 that led us to reject the null hypothesis, we get the table shown in Table 4 that enables us to identify where any group difference(s) lie.

Table 4 Results of a post-hoc comparison of reading achievement between three HER groups

Index of Home Educational Resources (HER)      N       Subset for alpha = 0.05
                                                          1         2         3
Low                                            50   499.867
Medium                                       3975             536.926
High                                          546                       577.668

This table tells us that there are significant differences between all three pairs of groups. That is, the differences between the high-HER and medium-HER groups, the medium- and the low-HER groups and the high- and low-HER groups are all statistically significant.


Reading

Field, A. (2009), Chapter 10, pp. 347-394

Note that Field, like most other authors, goes into much more detail than I have done about partitioning of variance. We will look at this in class.

Activity

Duplicate the above analyses using HER and reading achievement for your country.

In your data set, you have some other categorical variables such as ASDGSRSC and ASDHEDUP.

Investigate whether there are differences in reading achievement scores for the subgroups recorded in these categorical variables.

These investigations are amenable to analysis using one-way ANOVA.

Reference

Field, A. P. (2009). Discovering statistics using SPSS (3rd ed.). London: Sage.


EDUC9762

Study Guide Week 7

Review of a test for difference

In the past few weeks, we have examined sampling distributions. We made two important findings. First, we found that when you take a sample from a population, the sample mean is a good (unbiased) estimate of the population mean. Second, we found that, if we repeatedly take samples of size n from a population and calculate the mean for each sample, the standard deviation of the sample means is given by:

σx̄ = σx / √n        (Eq. 1)

This is very useful. It tells us that the standard deviation of a set of sample means is equal to the standard deviation of the original variable in the population divided by the square root of the sample size. If the variability of the variable in the population is very large, the variability of the mean across many samples will be relatively large. It also tells us that as we take larger samples, the variability of the sample means will be smaller – i.e. they will be better estimates of the population mean. Both of these findings should make intuitive sense to you.

But we can say more than simply making a literal interpretation of equation 1. We made a third important finding. Even if the distribution of the original variable is not normal, the distribution of the sample means tends to be normal, especially if the sample size is large. This is important because knowing that the distribution is normal tells us that we can use the properties of that distribution to estimate how reliable our estimate is. Review the diagram of the normal distribution (Gay, Mills, & Airasian, 2012, Figure 12.1, p. 327). Notice that 95% of the area under the normal distribution is bounded by the values -1.96 and +1.96 standard deviations.

In class, we undertook a test to see if the average reading score in a country was equal to the international average of 500. (See the Class Notes summary for Week 6.) For Latvia, we found that the mean reading score was 492.73 with a standard error of the mean of 1.32. (The standard deviation of reading scores in Latvia was 90 and the sample size was 4627. We could use equation 1 to estimate the standard error of the mean, or the standard deviation of means if we took repeated samples of n = 4627.)
If we apply the properties of the normal distribution to the mean reading score from Latvia, we can say that our estimate of the population mean is 492.73, that its standard deviation is 1.32, and that we can be 95% confident that the true population mean lies between 492.73 − 1.96 × 1.32 (or 490.14) and 492.73 + 1.96 × 1.32 (or 495.32). We can make the very useful statement that 'we are 95% confident that the true mean reading score in the Latvian 15-year-old population is between 490.14 and 495.32 units'. This finding implies that there is a 5% chance that the true mean reading score lies outside this range.

We were interested to know whether the mean for Latvia was the same as or different from the international mean. We should expect that Latvia's score is not exactly the same as the international mean, but we tested that proposition by doing a difference test using a one-sample t-test in SPSS. (See the Class Notes summary for Week 6. Remember that we stated a belief, then set up a null hypothesis and an alternative hypothesis.)

The t-test is used to find out how likely it is to observe a difference between means, given the variability in means. If sample means are very variable, it would not surprise us to find substantial differences between two samples, but if means do not vary much, then large differences between samples are unexpected. The t statistic is calculated using:

t = (x̄ − μ0) / σx̄        (Eq. 2)

In our case, we want to test how likely it would be to observe a mean of 492.73 in a sample of N = 4627 if the real population mean is 500 (μ0), knowing that the standard error of the mean (i.e. the standard deviation of means in samples of n = 4627) is 1.32. Using the above equation, we get a t value of -5.51. The calculation tells us that the mean we observed is 5.51 standard error units away from the international mean of 500. For large sample sizes, the t statistic follows an approximately normal distribution. Having run the t-test in SPSS, we get the results of our test in the following tables:

Table 1: One sample statistics

                               N     Mean   Std. Deviation   Std. Error Mean
Plausible value in reading   4627   492.73           89.69              1.32

Table 2: One sample test

                                t     df   Sig. (2-tailed)   Mean Difference   95% CI Lower   95% CI Upper
Plausible value in reading  -5.51   4626             0.000            -7.265         -9.850         -4.680

The first table (Table 1) gives us the sample statistics, and we have already seen these. The second table (Table 2) shows us the results of the t-test. We can see the mean difference, which we had already calculated. The output also shows us the confidence interval for that difference. Above, we calculated the confidence interval for the sample mean; in this table, we see that range converted into differences from 500. Given the expected variation between samples, we could expect to see differences in a range from 9.85 units below 500 to 4.68 units below.

This tells us that we should be quite confident that the mean reading score we observe in the Latvian sample reflects a population mean that is below the international mean of 500. But how sure can we be? The column labelled 'Sig.' (significance) tells us this. In other statistical programs, this column would be labelled 'Probability' or 'p'. It is the probability that we might observe a difference as large as 7.265 from the international mean of 500 if the true mean in Latvia is 500, that is, if the null hypothesis (H0) is true.

You can see that the probability of seeing a difference of this magnitude is extremely small. The output is shown to three decimal places, so we are sure that, whatever the probability is, it is less than 0.001, or less than one chance in 1000. This is so unlikely that we must conclude that our null hypothesis is not true and we must reject it. Using the idea of significance, we would say that we have a significant result. That is, the difference that we have observed is significantly different from what we would expect if the null hypothesis were true. If we were to report the above results in a paper, we might say something like:

In order to test the proposition that the mean reading score in the 15-year-old population in Latvia is no different from the international mean reading score, a one-sample t-test was undertaken. The mean difference was found to be -7.265 (se=1.32) score units (t=-5.51, df=4626, p<0.001). The hypothesis is therefore rejected and it is concluded that the reading achievement of Latvian students is lower than the international average.
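The standard error, t statistic and confidence interval quoted in this section can be reproduced with a few lines of Python (using the rounded figures from the text, so the results agree to about two decimal places):

```python
import math

# One-sample t-test by hand, using the Latvian figures quoted above.
mean, mu0, sd, n = 492.73, 500.0, 89.69, 4627

se = sd / math.sqrt(n)          # Eq. 1: approximately 1.32
t = (mean - mu0) / se           # Eq. 2: approximately -5.51

# 95% confidence interval for the sample mean.
ci_low = mean - 1.96 * se       # approximately 490.14
ci_high = mean + 1.96 * se      # approximately 495.32
```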


Activity

Run a test for your country to see whether the mean reading performance is equal to or different from the international average. Write this as a research question, write a null hypothesis and write an alternative hypothesis. When you have run the analysis, keep a copy of the two tables generated in the output.

• What is the mean for your country? How different is that from 500?
• What is the sample size?
• What is the standard error of the mean?
• How many standard deviations is the mean of your sample away from 500? (Hint: What is the t statistic in your analysis?)
• What is the probability that you would observe the mean value in your sample if the true mean of the population is 500?
• Based on your sample, what is the likelihood that the true mean in the population is 500?

Write a short paragraph answering the research question suggested above, summarising and interpreting the result of your analysis in your answer.

Confidence intervals

When we estimate a parameter for a population using data from a sample, there is always some uncertainty in our estimate because we know that samples taken from a population are not exact miniatures of the population – there are always some differences between samples taken from a population, and we refer to this as sampling error. However, even if our estimates are not exact, we need to know how closely they reflect the true parameter of the population. Because we know the distribution of sampling statistics, we can estimate a likely range. We just did that in relation to the mean reading achievement scores of 15-year-old students from Latvia, and I hope you have calculated similar statistics for your country. For Latvia, we were able to say 'we are 95% confident that the true mean reading score in the Latvian 15-year-old population is between 490.14 and 495.32 units'.

Our ability to make these estimates depends on knowing the distribution of the statistic of interest. We estimated the population mean using the sample mean. Then, knowing the standard deviation of the variable, we used equation 1 to calculate the standard deviation of the sampling distribution of the mean. Then, knowing some properties of the normal distribution – in particular, that 95% of the area under the normal curve is bounded by the range 1.96 standard deviations above and below the mean – enables the 95% confidence interval to be calculated. Other common confidence intervals are the 90% range, bounded by 1.645 standard deviations above and below the mean, and the 99% interval, bounded by 2.576 standard deviations above and below the mean. The 95% confidence interval is the one most commonly computed.
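These critical z values can be obtained from Python's standard library (a quick check; note that the exact 90% value is 1.645, which tables sometimes round differently):

```python
from statistics import NormalDist

# z values bounding the central 90%, 95% and 99% of the standard
# normal distribution: the upper tail cut-off is at 0.5 + level / 2.
z = {level: NormalDist().inv_cdf(0.5 + level / 2)
     for level in (0.90, 0.95, 0.99)}
# z[0.90] is about 1.645, z[0.95] about 1.960, z[0.99] about 2.576
```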

Statistical and practical significance

We say that a finding is statistically significant if we have evidence from our sample that leads us to reject the null hypothesis. We found in the case of Latvia that the likelihood of getting a sample mean of 492.7 if the true population mean in Latvia is 500 is extremely small (p < 0.001). We interpret this to mean that, given the evidence of our sample, the true population mean is most unlikely to be 500. We can say that we have a statistically significant finding.

This raises the question 'How unlikely does an outcome have to be before we say it is too improbable?' In most statistical analyses in the social sciences, a probability of 0.05 is taken as the threshold. If the p value (Sig. in SPSS) reported for the analysis is < 0.05, i.e. if there is less than 1 chance in 20 that we would observe the outcome in our sample if our null hypothesis were true, we reject the null hypothesis in favour of the alternative. There is nothing magical about p < 0.05; it is a convention.

Before we decide on an appropriate probability as our threshold for making decisions about whether to reject or not reject the null hypothesis (H0), we should think about the consequences of making an error in that decision. Recall from our last class discussion the decision table we constructed (see Table 3).

Table 3: Decision table for statistical analyses

                           Population
Based on our sample        H0 is true                H0 is false
Reject H0                  Type I error              Good decision
                           (reject a true H0)
Do not reject H0           Good decision             Type II error
                                                     (fail to reject a false H0)
We will consider Type II errors a little later. For now, we focus on the Type I error – the error we make if we reject the null hypothesis when we should not. Let us illustrate this problem by considering an example like reading scores in a country. Suppose that the true mean reading score in a country is 500 but that, by chance, we have selected a sample in which the reading score is quite low. We would probably get a test result like the one we found for Latvia, but in this case, the statistically significant difference is a result of sampling variation. We would reject the null hypothesis, but in doing so, we would be making a Type I error. Of course, we would not know this because we would not know the true population mean.

This is one reason why replication studies are undertaken. If another researcher repeated the study, but drew a different sample from the same population, we would be interested to know whether they arrived at the same conclusion as the first researcher. The likelihood of both researchers independently making a Type I error is very low – 0.05 × 0.05, or 1 in 400 (0.0025, i.e. .25%). By setting the p value at 0.05, we are saying that we are prepared, in 5% of the tests that we undertake, to make such an error. We could decide that this level of error is too high and we could set the p value to 1% (0.01). This is done in some research, but a consequence of this is that we would increase the likelihood of making a Type II error. So statistics, like politics, is the art of compromise.

Activity

A teacher lobby group approaches the education ministry and advocates a 20% reduction in class sizes, arguing that in smaller classes, students will get much more individualised attention and that this will raise educational achievement. A statistical consultant is asked to design an experiment in which class sizes are reduced in some schools, but not in comparison schools, and achievement scores are measured after several terms of instruction.

• What is your null hypothesis for this study?
• What are the consequences of reducing class sizes?
• What are the consequences of reducing class sizes if your null hypothesis is true?
• What p value would you establish before reaching a conclusion about the influence of class size on achievement?

In summary, we can say that a finding is statistically significant if we reject the null hypothesis, and we do this if the outcome we observe is unlikely to be due to sampling variation and is therefore likely to represent a 'real' effect or relationship in the population. Usually we use the 0.05 'level of significance' to make a judgment about whether the observation is due to sampling variation or reflects a real feature of the population from which the sample was taken.

Two aspects of the sample influence the probability of the effect we observe. First, if there is a very strong feature of the population, then this is very likely to be observed in any sample that we might select. Second, if the feature in the population is small, but we take a very large sample, i.e. a sample that is likely to be fully representative of the population, then we are still likely to get a significant result – a result that is not attributable to sampling variation.

In the preceding discussion, we have dealt with the problem of statistical significance. However, a new problem arises for us. If it is possible to find that a small difference is statistically significant when we use very large samples, how do we know if a small, statistically significant effect is practically important? In order to answer this question, we turn to the issue of effect size.

Effect size

Before we look at the notion of effect size, undertake the following activity.

Activity
The SES variable that you have in your data set, ESCS, has been created by the managers of the PISA study to have an international average of 0. What is the average ESCS score in your country? Undertake an investigation of whether the ESCS of your country is equal to the international average.
• Pose a research question.
• State a null hypothesis.
• State an alternative hypothesis.
• Conduct a test.
• Do you have a statistically significant finding?
• Write a brief paragraph in which you answer the research question by summarising and reporting the results of the statistical analysis.
Now, look back over your analyses.
• How big is the difference in ESCS between your country and the international average?
• How big was the difference in reading achievement between your country and the international average?
• Can you comment on the relative magnitudes of these two differences?

We have a problem when we are comparing variables that are measured on different scales. Reading achievement is measured on a scale with a mean of 500 and a standard deviation of 100. The ESCS variable is measured on a scale where the international mean is 0 and the standard deviation is 1. When I completed the above activity for Latvia, I found the difference between the international reading average and Latvia’s was -7.27, but the corresponding difference in the ESCS was +0.164. However, I know that these two differences are not comparable because the variables are measured on different scales. Nonetheless, I want to get an idea about the relative magnitudes of these differences. This is done using an effect size. I am aware of about ten different measures of effect size, but we will work with the simplest of these, Cohen’s d (Cohen, 1992). Cohen’s d is calculated by finding the difference between means and dividing that difference by the standard deviation of the variable. I compared the reading scores of males and females in Latvia and found that the mean female score is 512.3 while for males it is 471.7, a difference of 40.6. The standard deviation of scores is about 90, so the effect size measured by Cohen’s d is 0.45. Cohen (1992, p. 157) suggested that effect sizes of 0.2, 0.5 and 0.8 can be regarded as small, medium and large respectively. Thus, it seems we have a medium effect size or a medium difference between the reading abilities of males and females. Incidentally, I compared their mathematics scores. The mean male score is 4.9 units higher than the mean female score, an effect size of 0.05 – a very small effect. A policy maker in Latvia might recommend an intervention to enhance the reading ability of boys, but would probably not recommend a corresponding intervention for mathematics instruction for girls.
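The Cohen’s d calculation above is simple enough to check by hand or in a couple of lines of Python, using the Latvian reading figures quoted in the text:

```python
# The Cohen's d calculation from the text, using the Latvian reading figures:
# female mean 512.3, male mean 471.7, and an overall SD of about 90.
mean_female = 512.3
mean_male = 471.7
sd = 90.0

d = (mean_female - mean_male) / sd  # difference in means, in SD units
print(f"Cohen's d = {d:.2f}")       # 0.45 -- a medium effect by Cohen's benchmarks
```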

Tests of differences

Research questions about differences are very common.
• Is there a difference between Group A and Group B on Variable X?
• Is there a difference between Variable X on Occasion 1 and Occasion 2?
Here we will preview some tests of difference. All the tests we will consider require the test variable to be a continuous (interval or ratio) variable, and it should be normally distributed.

One-sample t-test

We have used this test to see whether, in our country sample, reading achievement is equal to the international mean. This test is a one-sample t-test. It is a t-test because it calculates the t statistic and provides the significance (p value) of that statistic. It is a one-sample test because we are comparing the performance of one group with a specified value. You have seen an example of the research questions that can be answered using this test.

Dependent or paired sample t-test

This test is used when we have two samples that are related in some way and we want to test for a difference in a variable. A common situation in educational research occurs when we test a group of students, provide an instructional intervention, then re-test the same group. Because we are interested in change between occasions within individuals, i.e. we want to compare each student’s pre-test score with their post-test score, the participants are paired between the two tests. The question we are asking is whether the difference within individuals between the two test occasions is significant.

Another less common application of this type of test would be a comparison of twins or even of siblings. Here two different individuals are paired because of their twin or sibling status. The term dependent is also used because the observations that we make are not statistically independent; we compare the same person on two occasions or we compare two related individuals on a common task.

Independent samples t-test

An independent samples t-test is used when we are comparing two groups but we do not attempt to match individuals between the groups. We are simply testing whether there is a difference between the mean of one group and the mean of the other. A very common application of this test is the comparison of boys and girls on a particular variable.

Reading

Field (2009, pp. 326-342) has an extensive discussion of paired sample t-tests and independent sample t-tests.

Analysis of variance

An extension of the t-test occurs when we have more than two groups. In our country data sets, we have the variable IMMIG – immigrant status, and this includes three groups, namely native-born, first-generation and immigrant students. In order to see if there are differences between these groups we could do a series of independent sample t-tests comparing native-born with first-generation students, native-born with immigrant students, and first-generation with immigrant students. This would be very bad practice. The reason is that each time we undertake a test, there is a chance that we will make a Type I error. The more tests we do, the greater the likelihood that we will make such an error. The solution is to use a method that is able to manage multiple comparisons without compromising the Type I error rate, and that technique is known as Analysis of Variance or ANOVA. There are several types of analysis of variance. We will focus on one-way ANOVA. It is one-way because we will compare one variable, e.g. reading achievement, across several groups.

Reading

Field (2009, pp. 347-394) devotes a chapter to this method.

References

Cohen, J. (1992). A power primer. Psychological Bulletin, 112(1), 155-159.
Field, A. P. (2009). Discovering statistics using SPSS (3rd ed.). London: Sage.
Gay, L. R., Mills, G. E., & Airasian, P. W. (2012). Educational research: Competencies for analysis and applications (10th ed.). Boston: Pearson.

EDUC9762 Week 13 Study Guide

Non-parametric statistics

This week we step back to address some old questions about differences and relationships, but we do so using a new set of methods. You will recall that in our discussion of inferential statistics, we made the assumption that our variables were normally distributed. This meant that we could ‘parameterise’ the distributions – we could describe the distribution of a variable by specifying just two parameters, the mean and the standard deviation. This is enough information to reproduce the distribution that we would get if we plotted a histogram of all the data. We know, however, that not all variables behave in this way, and we need to do analyses on those variables despite their disagreeable distributions. Methods of analysis that do not depend on the assumption that the variables are normally distributed are called non-parametric methods. In almost all cases, instead of depending on the mathematical properties of the normal distribution, we use the rank order of cases on a variable. I prefer to use parametric tests when I can because they are usually more powerful than non-parametric ones. We have discussed the power of statistical tests previously. The power of a test is the probability of finding an effect if it is present in the population. Because non-parametric tests use ranks instead of observed values, some information is lost. Consider the following data (see Table 1). In the top row, we have the original values for 12 cases and in the bottom row, their ranks from lowest to highest. Notice that the difference in the value of the 10th and 11th ranked cases is 9 units while the difference between the 11th and 12th ranked cases is 23 units. When we use ranks, we effectively collapse differences between values and therefore we lose information and hence we lose statistical power.

Table 1 Ranking data values
Value 16 30 30 30 34 39 42 45 45 58 67 90
Rank   1  3  3  3  5  6  7 8.5 8.5 10 11 12

As an aside, notice that where scores are tied, ranks are averaged, so we have no cases ranked 2nd or 4th, but three ranked 3rd and two cases ranked 8.5th. This week we will consider just a few research questions that we can answer using data that are non-normal. The research questions we have considered to date include difference questions and relationship questions.
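The tie-handling described above (averaging the ranks that tied values span) is the standard convention, and you can reproduce Table 1 outside SPSS with a short Python sketch:

```python
# Reproducing the ranks in Table 1 with scipy's rankdata, which averages the
# ranks of tied values by default (method='average'), as described above.
from scipy.stats import rankdata

values = [16, 30, 30, 30, 34, 39, 42, 45, 45, 58, 67, 90]
ranks = rankdata(values)  # the three 30s share rank 3; the two 45s share 8.5
print(ranks)
```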

Checking for normality

One of the early steps we take before running any analysis is to check the properties of each variable. We look for missing data and we check the distributions of continuous variables. We have examined socioeconomic status and for this construct, we have used the continuous variable ESCS and the categorical variable HSECATEG. There is another socioeconomic status variable in the data set, HISEI. This index is based on ratings of the status of parental occupation. It takes values between 16 and 90. Basic descriptive statistics (see Table 2) suggest that the variable is not skewed nor does it show much kurtosis.

Table 2 Descriptive statistics for HISEI

The problem with occupationally based indices is that some occupations are very common (e.g. teachers and nurses) while others are relatively rare (e.g. judges). This means that the distribution can be rather ‘patchy’ (see Figure 1).

Figure 1 Frequency histogram for HISEI for Latvia

The distribution shown in Figure 1 would make me rather nervous about doing any parametric tests as it appears that this variable does not comply with the normal distribution. I could get a QQ plot to see if I could convince myself that it might be safe to proceed with a parametric test (see Figure 2). The points approximately follow the expected normal line with some deviations above and below the line.

Figure 2 QQ plot to check the normality of HISEI

I decided to undertake a third test for normality and requested a Kolmogorov-Smirnov test (Analyze, Descriptive statistics, Explore). I did this test twice. On the first occasion, I used all 4627 cases, but I know that in large samples, small departures from normality are shown as being highly significant (p<<0.05), so I repeated the test using 400 cases. The result is shown in Table 3. The test is significant (p<0.001) so I have to conclude that the variable is not normally distributed.

Table 3 K-S test to check for normality of HISEI

As much as I would like to treat this variable as being normally distributed, it appears that it is not and that if I want to do any inferential statistics, I will need to use non-parametric methods. If I use parametric methods and they are not justified, I will report findings that are potentially misleading.
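If you are curious about what the Kolmogorov-Smirnov test is doing, here is a hedged Python sketch on simulated data. Note that SPSS’s Explore procedure applies the Lilliefors correction, so plain `scipy.stats.kstest` with estimated parameters is only an approximation, and every number below is invented for the illustration.

```python
# A hedged sketch of a K-S normality check on simulated, 'patchy' data,
# built to mimic an occupational-status index with common and rare clusters.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
hisei_like = np.concatenate([
    rng.normal(30, 3, 200),  # a common occupational cluster
    rng.normal(55, 3, 150),  # a second cluster
    rng.normal(80, 3, 50),   # a rare high-status cluster
])

# Compare the sample with a normal distribution fitted to its mean and SD
stat, p = stats.kstest(hisei_like, 'norm',
                       args=(hisei_like.mean(), hisei_like.std()))
print(f"K-S statistic = {stat:.3f}, p = {p:.2e}")  # tiny p: reject normality
```

Because the simulated distribution is clearly multi-modal, the test rejects normality decisively, just as it did for HISEI.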

Exercise 1

Undertake the above analysis on your own country’s data set. Contrast the findings for your country with those shown for Latvia.

Comparing ranks for independent samples

When we posed research questions about differences between groups, we used a t-test (parametric) when the data were normally distributed. We have already determined that mathematics achievement is normally distributed, but I want you to pretend that it is not quite normal enough for your liking and that you need to do a non-parametric test. I want you to do this because I want you to compare the results that you found from the t-test with the results of its non-parametric equivalent – the Mann-Whitney test. The research question is similar: ‘Is there a difference in the mathematics achievement of males and females in Latvia?’ For this question, we can write the null and alternative hypotheses and set the level of significance (I’ll stay with 0.05 for α). If the data were not normally distributed, it would not make sense to talk about the mean. Instead, we would focus on the median value, so if we present descriptive statistics for our variable, we should report medians rather than means. However, in non-parametric statistics, we work with ranks. If H0 is true, the mathematics achievement of males and females will be about the same. That is, among the top-performing students, we should find about equal numbers of males and females, and similarly we should find about equal proportions in the lower-performing groups. If we rank all students in order, we would expect the male and female groups to have similar numbers of low and high ranks. The Mann-Whitney test compares the ranks of the two groups to see whether their shares of high and low ranks are different.

Running a Mann-Whitney test

I want you to run this test the ‘old’ way as the output is more informative than the new way. Select the ‘Analyze, Non-parametric tests, Legacy dialogs, Independent samples’ option (see Figure 3).
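As a sketch of what the test is doing under the dialogue, here is a Python illustration with invented scores (not the PISA file). The two groups are given a gap of about 40 points, similar in size to the reading difference discussed earlier, so the rank comparison should detect it:

```python
# An illustration (invented data) of what the Mann-Whitney test does:
# it compares the ranks of two independent groups.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
girls = rng.normal(520, 90, 300)  # hypothetical reading-like scores
boys = rng.normal(480, 90, 300)

u_stat, p = stats.mannwhitneyu(girls, boys, alternative='two-sided')
print(f"U = {u_stat:.0f}, p = {p:.2e}")  # p < 0.05: reject H0
```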

Figure 3 Legacy dialogue box for a Mann-Whitney test of difference for mathematics achievement

The command that is generated from this dialogue produces two tables. The first is a summary of the ranks for males and females (see Table 4).

Table 4 Summary of ranks on mathematics achievement for males and females

If males and females were equally distributed at the various score levels, their mean ranks would be about the same – about 2313 or half way along the list of ranks of the 4627 students. However, we notice that girls have a lower average rank than males.

As always, we cannot be sure whether this difference in average ranks is significant, so we look for a test of significance. This is found in the second table (see Table 5).

Table 5 Results of the significance test of differences in rank order on mathematics achievement between males and females

We see in that table the Mann-Whitney statistic and its level of significance (p=0.185). This is greater than our chosen level of significance (α=0.05), so we have no grounds for rejecting H0. Notice also that SPSS provides the results of similar tests, the Wilcoxon rank sum (sum of ranks) test and the Kolmogorov-Smirnov Z test. (This is different from the Kolmogorov-Smirnov test for normality used earlier.) Field (2009, pp. 540-548) briefly describes these tests.

Exercise 2

Compare this result with the t-test we conducted several weeks ago. Is the result different? If so, why? If you were to repeat this test for reading achievement, what would you expect to find? Why? Try it.

I wrote above that we would do this test the ‘old’ way. I prefer this because it shows you the results for the test quite directly. However, in the most recent versions of SPSS, a new dialogue is available for testing this type of hypothesis. You select this using the ‘Analyze, Non-parametric tests, Independent samples’ menu options. This presents a different dialogue box (see Figure 4).

Figure 4 Dialogue box anticipating a Mann-Whitney test for difference between two groups

Notice that we need to select the ‘compare medians’ option and we need to select the ‘Fields’ tab in order to enter the variables. This is similar to other dialogues. You need to enter the test variable (PV1Math) and the grouping variable (ST3Q01). On this occasion, you do not have to specify the number of groups as SPSS works that out from the values in the variable. Finally, you must select the ‘Settings’ tab to specify the tests that you want. For now, select the Mann-Whitney test, but we will return to this dialogue later. Now click ‘Run’ and examine the output. (Notice that SPSS uses Run instead of OK in this new dialogue). The output is shown in Figure 5.

Figure 5 The results of the Mann-Whitney test using the new dialogue in SPSS

The results are displayed in a user-friendly way. The null hypothesis being tested is shown, the test is described, the p value is shown, and its interpretation – whether to retain or reject H0 – is given. What you do not see is any of the underlying calculations. You do not see how many cases were included, nor the mean ranks or the sums of ranks. Perhaps you do not need this, but I like to look at this information because it helps me to track exactly what is being analysed.

Exercise 3

Run the above analysis using your own data. Do you prefer the old or new dialogue?

Associations between non-normal variables

In Week 11 we looked at associations between categorical variables, for which we used cross-tabulations, and between continuous normally distributed variables, for which we used the Correlation command and requested the Pearson product-moment correlation coefficient (r). This would be a good time to review what we did then. When you have a variable that is not normal, you need to decide whether it can be categorised into groups (e.g. low, medium and high SES) or whether it is better to leave it as a continuous variable and use non-parametric statistics. Now, I want to focus on describing relationships between continuous variables, but ones that are not normally distributed. We have been looking at HISEI and we are confident that this is not normal. I will use Sense of belonging at school as the other variable (Belong), although I think it is not too far from normal. While we are doing this, we might consider some other attitudinal variables, perhaps teacher-student relations. The correlation coefficient that we use for non-normal data is the Spearman rank order correlation coefficient (ρ, rho). We can compare the results that we get if we assume normality (and use Pearson’s r) with the correlation we would get if we do not assume normality (and use Spearman’s ρ). Select the ‘Analyze, Correlate, Bivariate’ menu option to get the dialogue shown in Figure 6. Note that both Pearson’s r and Spearman’s ρ are being requested.

Figure 6 Dialogue for requesting correlation coefficients – both Pearson’s r and Spearman’s ρ

Two tables are generated. In the first, we see Pearson’s r and in the second Spearman’s ρ. Compare the values for these two correlation coefficients. The first is based on the scores on each variable and the second on the ranks of those scores.
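The contrast between the two coefficients can be made concrete with a small Python sketch on invented data. Because the relationship below is monotonic but curved, the rank-based coefficient comes out higher than the score-based one:

```python
# Comparing Pearson's r with Spearman's rho on invented data. The relationship
# is monotonic but non-linear, so the rank-based coefficient is higher.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
x = rng.uniform(1, 10, 200)
y = x ** 3 + rng.normal(0, 20, 200)  # monotonic but curved, with some noise

r, _ = stats.pearsonr(x, y)     # based on the scores themselves
rho, _ = stats.spearmanr(x, y)  # based on the ranks of the scores
print(f"Pearson r = {r:.3f}, Spearman rho = {rho:.3f}")
```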

Exercise 4

Run the above analysis using your own data. How close are the two correlation coefficients (r and ρ)? Is one systematically higher or lower than the other? Can you explain why you might see these differences?

Reading

Field devotes a chapter (15) to non-parametric statistics: Field (2009, Chapter 15, especially pp. 539-568). See also the table ‘Commonly used parametric and non-parametric tests’ taken from Gay et al. (2012, p. 369).

References

Field, A. P. (2009). Discovering statistics using SPSS (3rd ed.). London: Sage.
Gay, L. R., Mills, G. E., & Airasian, P. W. (2012). Educational research: Competencies for analysis and applications (10th ed.). Boston: Pearson.

EDUC9762 Week 12 Study Guide

Simple Linear Regression

Last week we introduced measures of association and correlation so that we could answer research questions about associations between variables.

We will extend that discussion to consider regression relationships, noting that regression seeks to describe the association between continuous variables in a way that enables prediction and explanation.

Regression is a powerful technique and is used very commonly in statistical analysis and modelling. Indeed, this method marks our introduction to the idea of creating and testing statistical models as a method for understanding and theorising about relationships.

A model is a way of representing reality. Sometimes a model is a scaled-down version of a large construction, or a scaled-up version of something small like an atom, but models are frequently simplifications of the real situation. The reality is simplified so that we can focus on a few important features of the construction. We know that human behaviour is variable and we use statistical models to ‘see through’ the variability and to focus on suspected underlying regularities. That is exactly what regression models do.

A typical research question involving a regression relationship between variables might be “To what extent does socioeconomic status predict (or explain) mathematics achievement at school?”

We will restrict our discussion to simple linear regression involving only two variables – an outcome, criterion or dependent variable and an explanatory, predictor or independent variable. However, regression models can be quite complex; they can include many predictor or independent variables, they can use discrete outcome variables, and they can involve predictors at multiple levels (e.g. student and school characteristics). The ideas of regression have also been combined with some other methods, notably factor analysis, and this has led to structural equation modelling. These advanced techniques lie beyond the scope of this topic, but a sound understanding of simple linear regression will provide a foundation for your later exploration of these more complex approaches.

Year 9 Algebra revisited

In high school, I recall doing some simple experiments. Here are the results of one to investigate Hooke’s Law – the extension of a spring is proportional to the force (of gravity) applied to the spring (Figure 1).

We should notice a few things about the graph. First, it is a straight line. We can say that the relationship between the extension of a spring (shown on the vertical axis) and the mass suspended by the spring (horizontal axis) is linear. Second, the line has a particular slope that represents the influence of increasing the mass on the extension of the spring. This is expressed as a ratio of the change in extension to the change in mass. In this case, the extension of the spring changes from 17 to 67 mm, a change of 50 mm, for a change from 0 to 100 g in mass. This ratio is, therefore, 50/100 or 0.5 and this is the slope of the line. Third, the slope is positive; an increase in mass is associated with an increase in extension. Fourth, the line cuts the y-axis at 17. This point is called the intercept. This is the displacement of the pointer when there is no mass on the spring. Fifth, this relationship is a deterministic one. The points are not scattered at all – they all lie along the line. This is clearly not educational data and, even for a physics experiment, the data must have been collected by a meticulous researcher. In the social sciences, we expect to see variability in the data with points scattered about a line.
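The slope and intercept worked out above translate directly into a prediction equation, which a few lines of Python make explicit:

```python
# The slope and intercept from the Hooke's Law example, computed directly:
# the extension rises from 17 mm to 67 mm as the mass goes from 0 g to 100 g.
rise = 67 - 17   # change in extension (mm)
run = 100 - 0    # change in mass (g)
slope = rise / run
intercept = 17   # extension (mm) with no mass on the spring

predicted_at_60g = intercept + slope * 60
print(f"extension = {intercept} + {slope} * mass")
print(f"predicted extension for a 60 g mass: {predicted_at_60g} mm")
```

So a 60 g mass predicts an extension of 17 + 0.5 × 60 = 47 mm, and because the relationship is deterministic, the observed value would fall exactly on the line.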

Figure 1: A graph of the data from an experiment to investigate Hooke’s Law

A key point to notice about the line is that we can characterise it by specifying just two values, the slope and the intercept. If they are given, you can draw the line without any additional information.

In educational research, where our data have much more variability, if we think there is a linear relationship between variables, we can find out what the line is – that is, we can find the slope of the line and its intercept. If we do this, we are creating a model to describe an aspect of the data – we are saying, in effect, that there is an underlying linear relationship but it is obscured by the variability. When we estimate the two parameters for the line (intercept and slope) there will be some uncertainty about them because of the variability in the data.
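This idea of recovering a line that is obscured by variability can be sketched in Python. The ‘true’ intercept and slope, and all of the data below, are invented for the illustration; with enough cases, ordinary least squares gives estimates close to the values used to generate the data:

```python
# A sketch of recovering an underlying line from noisy, invented data.
import numpy as np

rng = np.random.default_rng(3)
true_intercept, true_slope = 480.0, 40.0
x = rng.normal(0, 1, 2000)       # an ESCS-like predictor
noise = rng.normal(0, 80, 2000)  # the variability that obscures the line
y = true_intercept + true_slope * x + noise

# np.polyfit returns the coefficients from highest power down: [slope, intercept]
slope_hat, intercept_hat = np.polyfit(x, y, deg=1)
print(f"estimated line: y = {intercept_hat:.1f} + {slope_hat:.1f} x")
```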

Last week we looked at a more typical social science relationship – one between ESCS and mathematics achievement, shown again in Figure 2.

Figure 2: Scatterplot of Mathematics against SES index (ESCS) scores


When I look at Figure 2, I am reasonably confident that there is an underlying linear relationship between these two variables. I think that some of the variation in mathematics scores can be explained by variation in their socioeconomic status. To test this idea, I will run a regression in SPSS. I will regress PV1Math on ESCS.

Regression in SPSS

To run a regression model in SPSS, select the Analyze, Regression, Linear… option. This opens the following dialogue box (Figure 3).

Figure 3: Dialogue box in which dependent and independent variables are specified

When we run this command, SPSS generates four tables. We will ignore the first, which lists the variables that we have entered, and look at the other three. The Model Summary table (Figure 4) gives us information about the correlation (r) between the dependent and independent variables. It also tells us the value of r², which is an indication of how much of the variance of the dependent variable is explained by the independent variable. In this case, 11.6% of the variation in mathematics scores is explained by differences in students’ socioeconomic status.

Figure 4: Model summary of the regression of mathematics achievement on ESCS

In regression models, SPSS also does an analysis of variance (Figure 5). We have seen this before when we compared the means of three or more groups. The analysis of variance in this case tells us whether the amount of variance predicted by our linear model is large relative to the residual variance, given the degrees of freedom. Here the F ratio – the mean square for the regression divided by the mean square for the residuals – is very large (over 600) and this value is statistically significant. In other words, the model explains significantly more variance than we would expect from sampling variation alone, even though, as the r² value showed, much of the variance remains unexplained.

Figure 5: Analysis of variance for the regression of mathematics achievement on ESCS

We are most interested in the coefficients that define the regression line. They are shown in the Coefficients table (Figure 6). The intercept is shown as the ‘Constant’ in the coefficients table. The slope of the line is shown first as an ‘unstandardized coefficient’ with its standard error. The slope of 40.262 indicates that if a student’s ESCS score is one unit more than another’s, the difference in their predicted mathematics scores is 40.262 points. Notice that this is a positive number, so as ESCS increases, the predicted mathematics score increases. For the slope, there is also a ‘standardized coefficient’. Compare this with the correlation coefficient that you found last week when you examined these variables.

Figure 6: The coefficients (intercept and slope) for the regression of mathematics achievement on ESCS

For both coefficients, there is a corresponding t statistic and a p-value. In both cases, there is an implied null hypothesis that the coefficient is no different from zero. We can reject both of these hypotheses and conclude that the intercept and slope are statistically different from zero and that there is a relationship between ESCS and mathematics achievement. Moreover, we can use this relationship to predict a student’s mathematics score given their ESCS score, and we can say that 11.6% of the variance in mathematics achievement is explained by ESCS.
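Using the fitted line for prediction amounts to plugging an ESCS value into the regression equation. In the Python sketch below, the slope (40.262) is the coefficient reported in the text, but the intercept is a hypothetical placeholder, since the text does not quote the constant – read the actual value from the Coefficients table (Figure 6):

```python
# Using the fitted regression line for prediction. The slope is the
# coefficient reported in the text; the intercept is a HYPOTHETICAL
# placeholder used for illustration only.
intercept = 480.0   # hypothetical -- substitute the constant from Figure 6
slope = 40.262      # unstandardized coefficient from the Coefficients table

def predict_math(escs):
    """Predicted PV1Math score for a given ESCS value."""
    return intercept + slope * escs

print(predict_math(0.0))  # a student at the international-average ESCS
print(predict_math(1.0))  # one ESCS unit higher: 40.262 points more
```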

Exercise 1

Undertake the above analysis on your own country’s data set.

Contrast the findings for your country with those shown for Latvia. We will make a list of countries with the regression slope for mathematics achievement on ESCS. What will this table tell us?

What do you conclude about equity in your country?

Comparing correlation with regression

Both correlation and regression are methods of telling us about associations between variables. A correlation coefficient tells us whether the association is positive or negative and how weak or strong the relationship is.

Regression provides similar information to what correlation tells us, and in its output includes information on correlations between variables, but it goes further and enables us to make predictions about the score on one variable if we know the score on the other.

Both correlation and regression evaluate a possible relationship between variables, and both assume that there is an underlying linear relationship. Correlation tells us how closely data points are clustered about the line. Regression tells us the intercept and slope of the line. If the correlation is weak and the points are widely dispersed, the coefficients for the line (intercept and slope) will have high standard errors and may not be significantly different from zero.
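The link between the two methods can be made concrete: in simple linear regression the slope equals r multiplied by the ratio of the standard deviations, so the standardized slope is Pearson’s r itself. The sketch below demonstrates this identity on invented data:

```python
# Demonstrating that slope = r * (sd_y / sd_x) in simple linear regression,
# using invented data.
import numpy as np

rng = np.random.default_rng(4)
x = rng.normal(0, 1, 500)
y = 2.0 * x + rng.normal(0, 1, 500)

r = np.corrcoef(x, y)[0, 1]          # Pearson's r
slope = np.polyfit(x, y, deg=1)[0]   # least-squares slope
slope_from_r = r * y.std() / x.std() # the same quantity, built from r

print(f"least-squares slope: {slope:.4f}")
print(f"r * (sd_y / sd_x):   {slope_from_r:.4f}")
```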

Reading

Field devotes a substantial chapter to regression, reflecting its importance.

Field (2009, Chapter 7, especially pp. 197-208).

Reference

Field, A. P. (2009). Discovering statistics using SPSS (3rd ed.). London: Sage.