DTC Quantitative Methods Bivariate Analysis: Cross-tabulations and Chi-square Thursday 23rd February 2012


DTC Quantitative Methods

Bivariate Analysis: Cross-tabulations and Chi-square

Thursday 23rd February 2012

 

A reminder...

• Hypothesis testing involves testing the NULL HYPOTHESIS that there is no difference/effect/relationship, e.g. variables are unrelated.

• In testing such a hypothesis, we test whether the relationship or difference that we have observed in our sample is likely to have occurred assuming the null hypothesis is true for the population.

• The sampling distribution (a distribution reflecting the full range of possible samples) plays an important role in inferential statistics, by allowing us to work out how much variation is likely to have been produced by sampling error (e.g. if levels of happiness vary a lot, then small gender differences are quite likely to reflect sampling error).

• When we find a very low probability (p<0.05, or less than 5%) that we would have found what we found in our sample if the null hypothesis were true, we can infer that it is unlikely to be true.

• And therefore, we can infer that the ALTERNATIVE HYPOTHESIS (saying that there is a difference/effect) is likely to be true.

• This is a sort of backwards logic. So… if you find it easier to think forwards, the simpler version (although less correct, technically) is that if we find that p<0.05 then we have identified a relationship/effect.

So far...

So far we have looked at tests that have had interval-level variables (such as income, or years spent at an address) as dependent variables. Specifically we looked at tests that investigated:

• Whether a population has a mean that is different from a suggested mean (z-tests, or ‘one-sample t-tests’ in SPSS). e.g. Do people on average stay at an address for 10 years?

• Whether two groups have population means that differ from one another (t-tests, or ‘independent samples t-test’ in SPSS). e.g. Is there a gender difference in the mean time spent at one’s current address?

• Whether the different categories of a variable (e.g. social class) have population means that differ in some way (ANOVA, or ‘one way ANOVA’ in SPSS). e.g. Do people in different social classes have different average lengths of time at their addresses?

Categorical data analysis

• Today we are going to look at relationships between categorical variables (e.g. gender, ‘race’, religious denomination).

• When both variables are categorical we cannot produce means. Instead we construct contingency tables that show the frequency with which cases fall into each combination of categories, e.g. ‘man’ and ‘Christian’ (we cannot do this directly with continuous variables, e.g. age, as these take so many distinct values that the resulting tables are enormous and unmanageable).

• When we conduct statistical analyses of cross-tabulated data, we are trying to work out whether there is any systematic relationship between the variables being analyzed or whether any ‘patterns’ are ‘random’, i.e. only reflect sampling error.

• Therefore the tests that we do compare what we find (observe) with what would be expected if there were no relationship – i.e. given the null hypothesis of no relationship in the population.
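As a rough illustration of how a contingency table is built from case-level data, the sketch below counts how many cases fall into each combination of two categorical variables. The records are invented purely for illustration (they do not come from any dataset discussed in these slides), and the function name `cross_tabulate` is hypothetical.

```python
from collections import Counter

# Hypothetical case-level records: (gender, religion). Invented for illustration only.
cases = [
    ("man", "Christian"), ("woman", "Christian"), ("man", "None"),
    ("woman", "Muslim"), ("woman", "Christian"), ("man", "None"),
    ("woman", "None"), ("man", "Christian"),
]

def cross_tabulate(records):
    """Return (row_labels, col_labels, table); table[i][j] is a cell frequency."""
    counts = Counter(records)                      # frequency of each (row, col) pair
    rows = sorted({r for r, _ in records})
    cols = sorted({c for _, c in records})
    table = [[counts[(r, c)] for c in cols] for r in rows]
    return rows, cols, table

rows, cols, table = cross_tabulate(cases)
for label, row in zip(rows, table):
    print(label, dict(zip(cols, row)), "total:", sum(row))
```

Each printed row is one row of the cross-tabulation; a combination that never occurs simply gets a frequency of 0.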

From: Phoenix, A. 1991. Young Mothers? Cambridge: Polity Press. Table 3.2 in Chapter 3: ‘How the Women Came to be Mothers’ (p61)

MARITAL STATUS (AT CONCEPTION) by ORIENTATION TO PREGNANCY

              Wanted to   Did not    Had not thought   Important   TOTAL
              conceive    mind       about it          not to

Single        4 (8%)      12 (23%)   13 (25%)          24 (45%)    53
Cohabiting    4 (44%)     2 (22%)    1 (11%)           2 (22%)     9
Married       9 (53%)     6 (35%)    0 (0%)            2 (12%)     17

TOTAL         17          20         14                28          79

From: Jupp, P. 1993. ‘Cremation or burial? Contemporary choice in city and village’. In Clark, D. (ed.) The Sociology of Death. Oxford: Blackwell. (Derived from Tables 5 and 6 on pages 177 and 178).

OCCUPATIONAL CLASS by DISPOSAL CHOICE

                 Cremation   Burial     TOTAL

Working class    20 (59%)    14 (41%)   34
Middle class     21 (88%)    3 (13%)    24

TOTAL            41          17         58

What can be learned from these cross-tabulations?

Do you think that the patterns in the MARITAL STATUS by ORIENTATION TO PREGNANCY and OCCUPATIONAL CLASS by DISPOSAL CHOICE cross-tabulations provide sufficient evidence to conclude that relationships exist?

Where can the relationship, if any, be found in the MARITAL STATUS (AT CONCEPTION) by ORIENTATION TO PREGNANCY cross-tabulation?

How do you work out whether a gender difference is likely to be due to chance?

• For each cell in the table, we work out the frequency that we would expect and see how much the observed frequency differs from this.

• We then square this difference and divide by the expected frequency.

• We then sum these values.

• The observed frequency for each cell is what we see.

• The expected frequency is the frequency that you would get in each cell if men and women were exactly as likely as each other to fall into each of the categories of the other variable.

We use chi-square ($\chi^2$) to look at the difference between what we observe and what would be likely if there were no difference except that generated by chance (i.e. sampling error):

$$\chi^2 = \sum_{i}\sum_{j} \frac{(\text{Observed}_{ij} - \text{Expected}_{ij})^2}{\text{Expected}_{ij}}$$

where i indexes the rows and j the columns of the cross-tabulation.

It may or may not be clear from this slide that the above formula simply (or not so simply?!) represents the process described in the bullet points, so working through an example may help...

DEGREE SUBJECT by GENDER: ‘BLACK’ GRADUATES

Subject            Male       Female     Total

Arts               16 (47%)   18 (53%)   34
Sciences           29 (58%)   21 (42%)   50
Social Sciences    42 (53%)   38 (47%)   80
Education          3 (19%)    13 (81%)   16

TOTAL              90 (50%)   90 (50%)   180

The above table is based on a random sample of 180 ‘Black’ graduates born in the UK and aged 25-34 in 1991. (Data adapted from the 1991 Census SARs; ‘Black’ includes Black-African, Black-Caribbean and Black-Other).

‘Expected’ frequencies

Subject            Male       Female     Total

Arts               17 (50%)   17 (50%)   34
Sciences           25 (50%)   25 (50%)   50
Social Sciences    40 (50%)   40 (50%)   80
Education          8 (50%)    8 (50%)    16

TOTAL              90 (50%)   90 (50%)   180
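The expected frequencies can be reproduced from the row and column totals alone. The sketch below assumes the usual rule (expected count = row total × column total ÷ grand total); the dictionary layout and the function name `expected_counts` are illustrative choices, not part of the original slides.

```python
# Observed counts from the DEGREE SUBJECT by GENDER table (male, female).
observed = {
    "Arts": (16, 18),
    "Sciences": (29, 21),
    "Social Sciences": (42, 38),
    "Education": (3, 13),
}

def expected_counts(table):
    """Expected cell count = row total * column total / grand total."""
    col_totals = [sum(col) for col in zip(*table.values())]
    grand_total = sum(col_totals)
    return {
        label: tuple(sum(row) * ct / grand_total for ct in col_totals)
        for label, row in table.items()
    }

expected = expected_counts(observed)
for subject, values in expected.items():
    print(subject, values)
```

Because men and women each make up exactly half of this sample, every expected count is simply half of its row total.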

Differences (‘Observed’ minus ‘Expected’)

Subject            Male   Female   Total

Arts               -1     1        0
Sciences           4      -4       0
Social Sciences    2      -2       0
Education          -5     5        0

TOTAL              0      0        0

Squared differences

Subject            Male   Female

Arts               1      1
Sciences           16     16
Social Sciences    4      4
Education          25     25

... divided by ‘Expected’ values and summed

Subject            Male            Female

Arts               1/17 = 0.06     1/17 = 0.06
Sciences           16/25 = 0.64    16/25 = 0.64
Social Sciences    4/40 = 0.10     4/40 = 0.10
Education          25/8 = 3.13     25/8 = 3.13

0.06 + 0.06 + 0.64 + 0.64 + 0.10 + 0.10 + 3.13 + 3.13 = 7.86

7.86 is the value of the chi-square statistic ($\chi^2$) for the original table (cross-tabulation).
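The whole worked example can be checked in a few lines. This is a minimal sketch with an illustrative function name (`chi_square`). Because it does not round the intermediate cell values, it gives roughly 7.85 rather than the 7.86 obtained by summing the rounded figures.

```python
# Observed counts from the DEGREE SUBJECT by GENDER table (male, female).
observed = [
    [16, 18],   # Arts
    [29, 21],   # Sciences
    [42, 38],   # Social Sciences
    [3, 13],    # Education
]

def chi_square(table):
    """Pearson chi-square: sum over cells of (observed - expected)^2 / expected."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    total = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            exp = row_totals[i] * col_totals[j] / n   # expected count for this cell
            total += (obs - exp) ** 2 / exp
    return total

stat = chi_square(observed)
df = (len(observed) - 1) * (len(observed[0]) - 1)     # (rows - 1) x (columns - 1)
print(round(stat, 2), df)
```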

 

Degrees of freedom

If a cross-tabulation has R rows and C columns, the corresponding chi-square statistic has (R-1) x (C-1) ‘degrees of freedom’, i.e. in this case it has (4 - 1) x (2 - 1) = 3 degrees of freedom.

Degrees of freedom can be thought of as sources of variation. In an independent samples t-test, the number of sources of variation depends on the numbers of cases in the samples being compared. In a chi-square test, however, it depends on the number of cells in the cross-tabulation being examined.

Chi-square distributions for 1 to 5 degrees of freedom

Here, chi-square values of more than 6 are relatively rare.

However, higher values of chi-square are more common when the number of degrees of freedom (k in this chart) is larger.

Chi-square tables reflect this: it can be seen in these that the values for p=0.05 (the point at which only 5% of cases lie to the right) increase as the degrees of freedom increase.
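To see how the p=0.05 critical value grows with the degrees of freedom, the values found in a printed chi-square table can be approximated numerically. The sketch below is one possible approach using only the standard library: `chi2_sf` uses the closed-form expressions for the upper-tail probability that hold for whole-number degrees of freedom, and `critical_value` finds the cut-off by bisection. Both function names are illustrative, not taken from any package.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability P(chi-square >= x) for a whole-number df."""
    z = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-z)              # Q(1, z) = exp(-z)
    else:
        a, q = 0.5, math.erfc(math.sqrt(z))   # Q(1/2, z) = erfc(sqrt(z))
    while a < df / 2.0:
        # Recurrence: Q(a + 1, z) = Q(a, z) + z^a * exp(-z) / Gamma(a + 1)
        q += z ** a * math.exp(-z) / math.gamma(a + 1.0)
        a += 1.0
    return q

def critical_value(p, df, lo=0.0, hi=100.0):
    """Bisection search for the x at which the upper-tail probability equals p."""
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if chi2_sf(mid, df) > p:
            lo = mid      # tail still too heavy: move right
        else:
            hi = mid
    return (lo + hi) / 2.0

for df in range(1, 6):
    print(df, round(critical_value(0.05, df), 2))
```

The printed cut-offs rise steadily from 1 to 5 degrees of freedom, which is the pattern the chi-square table shows.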

Checking the chi-square statistic

• To check whether a result is statistically significant (i.e. unlikely to simply reflect sampling error) we look up critical values of chi-square in a table.

• For 3 d.f. the critical values are 7.81 (p=0.05) and 11.34 (p=0.01).

• Because our chi-square value (7.86) is bigger than 7.81 we can say that it is significant at p<0.05. (Since it is not bigger than 11.34 we cannot say that it is significant at p<0.01).

• What this means is that we would find a difference in our sample of at least the magnitude of the difference that we observed in the distributions of proportions for men and for women between 1% and 5% of the time if there were no real difference between the distributions of proportions in the population (which is the null hypothesis).

• This is rare enough that we consider the null hypothesis to be unlikely. We can therefore reject the null hypothesis and accept the alternative hypothesis of a relationship between gender and subject.

• When you use SPSS, it produces an estimate of the precise probability of obtaining a chi-square statistic at least as big as (in this case) 7.86 by chance. A p-value of less than 0.05 is said to be statistically significant (and to imply a significant relationship between the two variables).
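The ‘precise probability’ that software reports can be approximated without statistical packages. The sketch below is an assumption-laden illustration rather than SPSS's own method: `chi2_sf` (an illustrative name) uses closed-form expressions that are valid for whole-number degrees of freedom, and is applied to the chi-square value of 7.86 with 3 degrees of freedom.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability P(chi-square >= x) for a whole-number df."""
    z = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-z)              # Q(1, z) = exp(-z)
    else:
        a, q = 0.5, math.erfc(math.sqrt(z))   # Q(1/2, z) = erfc(sqrt(z))
    while a < df / 2.0:
        # Recurrence: Q(a + 1, z) = Q(a, z) + z^a * exp(-z) / Gamma(a + 1)
        q += z ** a * math.exp(-z) / math.gamma(a + 1.0)
        a += 1.0
    return q

p = chi2_sf(7.86, 3)
print(round(p, 3))           # p-value for the worked example
print(p < 0.05, p < 0.01)    # significant at 5% but not at 1%
```

The p-value comes out just under 0.05, consistent with 7.86 lying only slightly above the critical value of 7.81.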


A note on using chi-square

• Chi-square tests can be safely used if each cell in the table has an expected value of at least 5. Opinions differ as to what is appropriate where this is not the case: one ‘rule of thumb’ is that no expected values should be less than 1 and no more than 20% of expected values should be less than 5.

• The categories must be discrete, i.e. mutually exclusive. No case should fall into more than one category (which might pose some difficulties when carrying out analyses focusing on degree subjects!)
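The rule of thumb on expected values is easy to check mechanically. The sketch below applies it to the MARITAL STATUS (AT CONCEPTION) by ORIENTATION TO PREGNANCY table; the function names `expected_counts` and `rule_of_thumb` are illustrative, and the thresholds are the ones quoted in the rule of thumb (no expected value below 1, no more than 20% below 5).

```python
# Observed counts: wanted to conceive, did not mind,
# had not thought about it, important not to.
observed = [
    [4, 12, 13, 24],   # Single
    [4, 2, 1, 2],      # Cohabiting
    [9, 6, 0, 2],      # Married
]

def expected_counts(table):
    """Expected cell count = row total * column total / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]

def rule_of_thumb(expected):
    """Return (smallest expected value, share of cells below 5, rule satisfied?)."""
    cells = [e for row in expected for e in row]
    smallest = min(cells)
    share_below_5 = sum(1 for e in cells if e < 5) / len(cells)
    acceptable = smallest >= 1 and share_below_5 <= 0.20
    return smallest, share_below_5, acceptable

smallest, share_below_5, acceptable = rule_of_thumb(expected_counts(observed))
print(round(smallest, 2), round(share_below_5, 2), acceptable)
```

For this table well over 20% of the expected values fall below 5, so the rule of thumb is not met.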

Chi-square statistics

(The subscript on each $\chi^2$ is the number of degrees of freedom.)

MARITAL STATUS (AT CONCEPTION) by ORIENTATION TO PREGNANCY: $\chi^2_6 = 24.86$ (p < 0.001)
(but the sparseness of cases means that the chi-square statistic is invalid!)

OCCUPATIONAL CLASS by DISPOSAL CHOICE: $\chi^2_1 = 5.58$ (p < 0.05)
(but the chi-square statistic for a 2x2 cross-tabulation needs, arguably, to be adjusted using ‘Yates’ correction for continuity’, giving an adjusted value of 4.29 (p < 0.05))

GENDER by SUBJECT, OTHER GRADUATES: $\chi^2_3 = 1542$ (p < 0.0001)
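Yates' correction for continuity subtracts 0.5 from the absolute difference between observed and expected in each cell before squaring. The sketch below applies this to the OCCUPATIONAL CLASS by DISPOSAL CHOICE table; the function name and its `yates` flag are illustrative.

```python
# Observed counts (cremation, burial).
observed = [
    [20, 14],   # Working class
    [21, 3],    # Middle class
]

def chi_square_2x2(table, yates=False):
    """Pearson chi-square for a 2x2 table, optionally with Yates' correction."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    total = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            exp = row_totals[i] * col_totals[j] / n
            diff = abs(obs - exp)
            if yates:
                diff = max(diff - 0.5, 0.0)   # shrink each |O - E| by 0.5
            total += diff ** 2 / exp
    return total

plain = chi_square_2x2(observed)
corrected = chi_square_2x2(observed, yates=True)
print(round(plain, 2), round(corrected, 2))
```

The two results correspond to the unadjusted and adjusted values quoted for this table (5.58 and 4.29).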

Marital Status (at Conception) by Orientation to Pregnancy: Collapsed/reduced versions of cross-tabulation

                  Wanted to    Did not     Had not thought about it
                  conceive     mind        + Important not to

Single            4 (8%)       12 (23%)    37 (70%)
Cohab./Married    13 (50%)     8 (31%)     5 (19%)

$\chi^2_2 = 23.46$ (p < 0.0001)

                  Wanted to    Did not     Had not thought   Important
                  conceive     mind        about it          not to

Cohabiting        4 (44%)      2 (22%)     1 (11%)           2 (22%)
Married           9 (53%)      6 (35%)     0 (0%)            2 (12%)

$\chi^2_3 = 2.72$ (p > 0.05) [N.B. ‘Sparse’ table: chi-square invalid]
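Collapsing categories is just a matter of adding rows or columns together before recomputing the statistic. The sketch below rebuilds the first collapsed table (Single versus Cohabiting/Married, with the last two orientation categories merged) from the full table; the helper names are illustrative.

```python
# Full table: wanted to conceive, did not mind,
# had not thought about it, important not to.
full = {
    "Single":     [4, 12, 13, 24],
    "Cohabiting": [4, 2, 1, 2],
    "Married":    [9, 6, 0, 2],
}

def add_rows(a, b):
    """Combine two row categories by adding their counts cell by cell."""
    return [x + y for x, y in zip(a, b)]

def merge_last_two_columns(row):
    """Combine the last two column categories into one."""
    return row[:2] + [row[2] + row[3]]

def chi_square(table):
    """Pearson chi-square: sum over cells of (observed - expected)^2 / expected."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return sum(
        (obs - row_totals[i] * col_totals[j] / n) ** 2
        / (row_totals[i] * col_totals[j] / n)
        for i, row in enumerate(table)
        for j, obs in enumerate(row)
    )

collapsed = [
    merge_last_two_columns(full["Single"]),
    merge_last_two_columns(add_rows(full["Cohabiting"], full["Married"])),
]
stat = chi_square(collapsed)
print(collapsed, round(stat, 2))
```

In the collapsed table every expected count is above 5, which is why its chi-square statistic can be relied on where the full table's could not.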

 

 

Strength of association

• Chi-square tells us whether there is a ‘significant’ relationship between two variables (or whether a relationship exists that would have been unlikely to have been found by chance).

• However it does not tell us in a clear-cut way how strong this association is, since the size of the chi-square statistic depends in part on the sample size (as well as the cross-tabulation shape).

• We will therefore look at a different measure (which we ‘met’ in Week 3) that does tell us about the strength of association: Cramér’s V. This is a chi-square-based measure that tells us the strength of association in a cross-tabulation. No association is represented by 0; ‘perfect’ association by 1.

Example: Sex, Age and Sport

Data from Young People’s Social Attitudes Study 2003 (available from Nesstar!!)

Boys:

YP played sport as part of sports club YP * Age band Crosstabulation (a)

                                Age band
                                12-13     14-16     17-19     Total

Yes      Count                  62        68        34        164
         % within Age band      65.3%     54.4%     41.0%     54.1%
No       Count                  33        57        49        139
         % within Age band      34.7%     45.6%     59.0%     45.9%
Total    Count                  95        125       83        303
         % within Age band      100.0%    100.0%    100.0%    100.0%

a. YP sex household grid [BSA2003] YP22 = Male

Girls:

YP played sport as part of sports club YP * Age band Crosstabulation (a)

                                Age band
                                12-13     14-16     17-19     Total

Yes      Count                  55        48        11        114
         % within Age band      44.4%     32.0%     12.8%     31.7%
No       Count                  69        102       75        246
         % within Age band      55.6%     68.0%     87.2%     68.3%
Total    Count                  124       150       86        360
         % within Age band      100.0%    100.0%    100.0%    100.0%

a. YP sex household grid [BSA2003] YP22 = Female


What can we say about these tables?

• It looks like boys play sport as part of a club more than girls do.

• And it looks like both boys and girls become less likely to play sport as part of a club as they get older.

• But is the association between age and sports club membership stronger/weaker for boys than it is for girls?

Does age significantly affect sports club membership for boys?

YP played sport as part of sports club YP * Age band Crosstabulation (a)

                                Age band
                                12-13     14-16     17-19     Total

Yes      Count                  62        68        34        164
         % within Age band      65.3%     54.4%     41.0%     54.1%
No       Count                  33        57        49        139
         % within Age band      34.7%     45.6%     59.0%     45.9%
Total    Count                  95        125       83        303
         % within Age band      100.0%    100.0%    100.0%    100.0%

a. YP sex household grid [BSA2003] YP22 = Male

To answer this question we work out chi-square, by calculating:

$$\chi^2 = \sum_{i}\sum_{j} \frac{(\text{Observed}_{ij} - \text{Expected}_{ij})^2}{\text{Expected}_{ij}}$$

$$\begin{aligned}
\chi^2 &= \frac{(62 - (95 \times 164)/303)^2}{(95 \times 164)/303} + \frac{(33 - (95 \times 139)/303)^2}{(95 \times 139)/303} + \frac{(68 - (125 \times 164)/303)^2}{(125 \times 164)/303} \\
&\quad + \frac{(57 - (125 \times 139)/303)^2}{(125 \times 139)/303} + \frac{(34 - (83 \times 164)/303)^2}{(83 \times 164)/303} + \frac{(49 - (83 \times 139)/303)^2}{(83 \times 139)/303} \\
&= \frac{(62 - 51.4)^2}{51.4} + \frac{(33 - 43.6)^2}{43.6} + \frac{(68 - 67.7)^2}{67.7} + \frac{(57 - 57.3)^2}{57.3} + \frac{(34 - 44.9)^2}{44.9} + \frac{(49 - 38.1)^2}{38.1} \\
&= 2.2 + 2.6 + 0 + 0 + 2.6 + 3.1 = 10.5
\end{aligned}$$

Does age significantly affect sports club membership for boys?

• Chi-square = 10.5.

• d.f. = (r-1) x (c-1) = 1 x 2 = 2

• If we look up the .05 value for 2 degrees of freedom it is 5.99, and the .01 value is 9.21. Since 10.5 is bigger than both of these, it is significant at p < 0.01.

• The following SPSS output confirms that, in fact, the p-value is .005, which is less than 0.01 (N.B. the chi-square value without rounding is 10.541).

Chi-Square Tests (b)

                                Value         df    Asymp. Sig. (2-sided)

Pearson Chi-Square              10.541 (a)    2     .005
Likelihood Ratio                10.626        2     .005
Linear-by-Linear Association    10.457        1     .001
N of Valid Cases                303

a. 0 cells (.0%) have expected count less than 5. The minimum expected count is 38.08.
b. YP sex household grid [BSA2003] YP22 = Male

And for girls…?

• Work out the chi-square statistic and test whether it is significant for girls…

YP played sport as part of sports club YP * Age band Crosstabulation (a)

                                Age band
                                12-13     14-16     17-19     Total

Yes      Count                  55        48        11        114
         % within Age band      44.4%     32.0%     12.8%     31.7%
No       Count                  69        102       75        246
         % within Age band      55.6%     68.0%     87.2%     68.3%
Total    Count                  124       150       86        360
         % within Age band      100.0%    100.0%    100.0%    100.0%

a. YP sex household grid [BSA2003] YP22 = Female

And for girls…?

• The SPSS output for the chi-square test for girls shows a chi-square value of 23.394.

• The p-value for this chi-square statistic, with 2 degrees of freedom, is rounded to 0.000 (SPSS only shows you results to 3 decimal places). This means that it is less than 0.001 (or, more precisely, that it is less than 0.0005).

• Therefore “Age has a significant effect on whether or not girls play sport in clubs (p < 0.001)”:

Chi-Square Tests (b)

                                Value         df    Asymp. Sig. (2-sided)

Pearson Chi-Square              23.394 (a)    2     .000
Likelihood Ratio                25.370        2     .000
Linear-by-Linear Association    22.862        1     .000
N of Valid Cases                360

a. 0 cells (.0%) have expected count less than 5. The minimum expected count is 27.23.
b. YP sex household grid [BSA2003] YP22 = Female
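For anyone wanting to check the girls' result by hand, the sketch below recomputes the Pearson chi-square from the observed counts. It also uses the fact that, for exactly 2 degrees of freedom, the upper-tail probability of the chi-square distribution is exp(-χ²/2), so no table or statistical package is needed. The function name `chi_square` is illustrative.

```python
import math

# Girls: played sport as part of a sports club, by age band (12-13, 14-16, 17-19).
observed = [
    [55, 48, 11],    # Yes
    [69, 102, 75],   # No
]

def chi_square(table):
    """Pearson chi-square: sum over cells of (observed - expected)^2 / expected."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    total = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            exp = row_totals[i] * col_totals[j] / n
            total += (obs - exp) ** 2 / exp
    return total

stat = chi_square(observed)
df = (len(observed) - 1) * (len(observed[0]) - 1)
p = math.exp(-stat / 2.0)    # exact upper-tail probability when df == 2
print(round(stat, 3), df, p < 0.0005)
```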

Is the effect of age stronger/weaker for boys than it is for girls?

• The chi-square statistic is bigger for girls than for boys; however, the sample of girls is also bigger (360 as compared to 303) so this will have affected the relative size of the two values.

• To work out the strength of association, we need to correct for both sample size and for the table shape (since this also affects the magnitude of chi-square statistics). A frequently-used measure of association is Cramér’s V:

$$V = \sqrt{\frac{\chi^2}{N(L - 1)}}$$

where $\chi^2$ is chi-square, N is the sample size, and L is the lesser (smaller) of the number of rows and number of columns.

Note: In any table where either the number of rows or the number of columns is equal to 2, Cramér’s V is equal to another measure of association, referred to as phi (or Φ).


Comparing strength of association between age and involvement in sport for boys and for girls

Boys: $\chi^2 = 10.541$

Girls: $\chi^2 = 23.394$

Cramér’s V values for the two tables are therefore:

Boys: $$V = \sqrt{\frac{\chi^2}{N(L - 1)}} = \sqrt{\frac{10.5}{303(2 - 1)}} = 0.186$$

Girls: $$V = \sqrt{\frac{\chi^2}{N(L - 1)}} = \sqrt{\frac{23.4}{360(2 - 1)}} = 0.255$$

Cramér’s V in SPSS

• We can also see Cramér’s V in SPSS output.

• SPSS reports values of 0.187 for boys and 0.255 for girls. These match the values we worked out by hand (0.186 and 0.255), with the small difference for boys due to rounding of the chi-square statistic.

Boys:

Symmetric Measures (c)

                                     Value    Approx. Sig.

Nominal by Nominal    Phi            .187     .005
                      Cramer's V     .187     .005
N of Valid Cases                     303

a. Not assuming the null hypothesis.
b. Using the asymptotic standard error assuming the null hypothesis.
c. YP sex household grid [BSA2003] YP22 = Male

Girls:

Symmetric Measures (c)

                                     Value    Approx. Sig.

Nominal by Nominal    Phi            .255     .000
                      Cramer's V     .255     .000
N of Valid Cases                     360

a. Not assuming the null hypothesis.
b. Using the asymptotic standard error assuming the null hypothesis.
c. YP sex household grid [BSA2003] YP22 = Female
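The Cramér's V values in the SPSS output can be reproduced from the unrounded chi-square statistics and the sample sizes. This is a minimal sketch; the function name `cramers_v` and its argument names are illustrative.

```python
import math

def cramers_v(chi2, n, n_rows, n_cols):
    """Cramér's V = sqrt(chi-square / (N * (L - 1))), L = min(rows, columns)."""
    smaller = min(n_rows, n_cols)
    return math.sqrt(chi2 / (n * (smaller - 1)))

# Unrounded chi-square values and sample sizes from the SPSS output (2 x 3 tables).
boys = cramers_v(10.541, 303, 2, 3)
girls = cramers_v(23.394, 360, 2, 3)
print(round(boys, 3), round(girls, 3))
```

Dividing by N(L - 1) is what removes the dependence on sample size and table shape, so the two values can be compared directly even though the samples of boys and girls differ in size.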


What do the results mean substantively?

• We can say that age has a significant effect on boys’ participation in sport.

• And that age has a significant and somewhat stronger effect on girls’ participation in sport.

• Hence both girls and boys are likely to decrease their participation in sports clubs as they get older but this effect is more pronounced among girls than among boys.

• There is thus a (small-ish) gender difference in the relationship between age and participation in sport.

Two more Cramér’s V values…

Subject and Gender, ‘Black’ graduates: $$V = \sqrt{\frac{7.86}{180(2 - 1)}} = 0.209$$

Subject and Gender, Other graduates: $$V = \sqrt{\frac{1542}{17094(2 - 1)}} = 0.300$$

But could this difference just reflect sampling error?

Log-linear model: Test for difference between form of Subject/Gender relationship for ‘Black’ graduates and for Other graduates:

$\chi^2_3 = 3.67$ (p > 0.05)

i.e. Not enough evidence to conclude subject ‘gendering’ varies between ‘Black’ graduates and Other graduates.