Copyright ©2011 Brooks/Cole, Cengage Learning

More about Inference for Categorical Variables

Chapter 15

Principal Question:

Is there a relationship between the two variables, so that the category into which individuals fall for one variable seems to depend on the category they are in for the other variable?


15.1 Chi-Square Test for Two-Way Tables

• Data displayed in a contingency or two-way table.

• Each combination of row/column is a cell of the table.

• Two types of conditional percents: row and column.

• Row percents: percents across a row, based on total number in the row.

• Column percents: percents down a column, based on total number in the column.

• If one variable is explanatory, use it to define rows and use row percents.
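As a minimal sketch of the two kinds of conditional percents, the Python below computes row and column percents for a small two-way table. The counts and the function names `row_percents` and `column_percents` are hypothetical, chosen only for illustration; they are not data from the text.

```python
# Hypothetical 2 x 2 table of counts (illustration only, not from the text).
# Rows play the role of the explanatory variable, columns the response.
table = [
    [30, 70],   # row 1
    [45, 55],   # row 2
]

def row_percents(counts):
    """Percent of each row total that falls in each cell."""
    return [[100 * cell / sum(row) for cell in row] for row in counts]

def column_percents(counts):
    """Percent of each column total that falls in each cell."""
    col_totals = [sum(col) for col in zip(*counts)]
    return [[100 * cell / col_totals[j] for j, cell in enumerate(row)]
            for row in counts]

row_pct = row_percents(table)
col_pct = column_percents(table)
print("Row percents:   ", row_pct)
print("Column percents:", col_pct)
```

Each list of row percents sums to 100 across a row, and the column percents sum to 100 down a column, which is a quick way to check which kind of percent a published table is reporting.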


Recall: Five steps for assessing statistical significance.

Step 1: Null and alternative hypotheses

H0: The two variables are not related.

Ha: The two variables are related.

Sometimes associated is used instead of related.


Example 15.1 Ear Infections and Xylitol

Experiment: n = 533 children randomized to 3 groups.

Group 1: Placebo Gum; Group 2: Xylitol Gum; Group 3: Xylitol Lozenge

Response = Did child have an ear infection?

Only 16.2% of children in Xylitol Gum group had infection.


Example 15.1 Infections and Xylitol

H0: p1 = p2 = p3 (no relationship between treatment and outcome)

Ha: p1, p2, p3 are not all the same (there is a relationship)

Let

p1 = proportion who would get an ear infection in the population given placebo gum

p2 = proportion who would get an ear infection in the population given xylitol gum

p3 = proportion who would get an ear infection in the population given xylitol lozenges


Example 15.2 Making Friends

Q: With whom do you find it easiest to make friends – opposite sex or same sex or no difference?

H0: No difference in distribution of responses of men and women (no relationship between gender and response)

Ha: There is a difference in distribution of responses of men and women (there is a relationship between gender and response)


Tech Note: Homogeneity and Independence

There are two variations of the general hypothesis statements, depending on the method of sampling.

• If samples have been taken from separate populations, the null hypothesis statement is a statement of homogeneity (sameness) among the populations.

• If a sample has been taken from a single population, and two categorical variables measured for each individual, the statement of no relationship is a statement of independence between the two variables.


Step 2: Chi-square Statistic and Necessary Conditions

Compute expected count for each cell:

$$\text{Expected count} = \frac{\text{Row total} \times \text{Column total}}{\text{Total } n}$$

Compute test statistic by totaling over all cells:

$$\chi^2 = \sum \frac{(\text{Observed} - \text{Expected})^2}{\text{Expected}}$$

Chi-square statistic measures the difference between the observed counts and the counts that would be expected if there were no relationship (i.e. if null hypothesis were true).
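The two formulas in Step 2 can be sketched in a few lines of Python. The function names and the counts below are hypothetical and serve only to illustrate the arithmetic; this is not a replacement for statistical software.

```python
def expected_counts(observed):
    """Expected count per cell = row total * column total / total n."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]

def chi_square_statistic(observed):
    """Total of (observed - expected)^2 / expected over all cells."""
    expected = expected_counts(observed)
    return sum((o - e) ** 2 / e
               for obs_row, exp_row in zip(observed, expected)
               for o, e in zip(obs_row, exp_row))

# Hypothetical counts (illustration only, not from the text).
observed = [[30, 70],
            [45, 55]]
expected = expected_counts(observed)
stat = chi_square_statistic(observed)
print("Expected counts:", expected)
print("Chi-square statistic:", round(stat, 2))
```

In this made-up table both rows have the same total, so the expected counts are identical in the two rows: under the null hypothesis every row should show the same split across the columns.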


More on the Chi-square Statistic

Large difference ⇒ evidence of a relationship.

Guidelines for large sample:

1. All expected counts should be greater than 1.

2. At least 80% of the cells should have an expected count greater than 5.
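The two large-sample guidelines are easy to check mechanically once the expected counts are known. The sketch below is one possible reading of them; the function name and the example expected counts are hypothetical.

```python
def meets_large_sample_guidelines(expected):
    """Check both guidelines on a table (list of rows) of expected counts."""
    cells = [e for row in expected for e in row]
    all_above_one = all(e > 1 for e in cells)                 # guideline 1
    share_above_five = sum(e > 5 for e in cells) / len(cells)  # guideline 2
    return all_above_one and share_above_five >= 0.80

# Hypothetical expected counts (illustration only).
ok_table = [[37.5, 62.5], [37.5, 62.5]]
tiny_cell = [[0.8, 9.2], [7.2, 82.8]]          # one expected count below 1
too_many_small = [[4.0, 6.0, 10.0], [4.0, 6.0, 10.0]]  # only 4 of 6 cells above 5

print(meets_large_sample_guidelines(ok_table))
print(meets_large_sample_guidelines(tiny_cell))
print(meets_large_sample_guidelines(too_many_small))
```

Note that the guidelines apply to the expected counts, not the observed counts, so a table with an observed zero can still satisfy them.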



Expected count for “Placebo Gum, Yes Infection” cell:

Expected Counts:



Chi-square Test Statistic:
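The table of observed counts and the worked values for these three items did not survive in this transcript. The sketch below therefore rests on an assumption: the counts in `observed` are not taken from the text above. They were chosen because they reproduce the figures that do appear here (n = 533, 16.2% infections in the Xylitol Gum group, and a chi-square statistic of 6.69), and they should be checked against the original table before being relied on.

```python
# ASSUMED counts (infection yes, infection no) per group. The original table
# is missing from this transcript; these values are an assumption that happens
# to reproduce n = 533, 16.2% for Xylitol Gum, and chi-square = 6.69.
observed = {
    "Placebo Gum":     (49, 129),
    "Xylitol Gum":     (29, 150),
    "Xylitol Lozenge": (39, 137),
}

rows = list(observed.values())
row_totals = [sum(row) for row in rows]
col_totals = [sum(col) for col in zip(*rows)]
n = sum(row_totals)

# Expected count = row total * column total / total n, for every cell.
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Chi-square statistic = total of (observed - expected)^2 / expected.
chi_square = sum((o - e) ** 2 / e
                 for obs_row, exp_row in zip(rows, expected)
                 for o, e in zip(obs_row, exp_row))

xylitol_gum_pct = 100 * observed["Xylitol Gum"][0] / sum(observed["Xylitol Gum"])

print("n =", n)
print("Expected count, Placebo Gum / Yes:", round(expected[0][0], 2))
print("Chi-square statistic:", round(chi_square, 2))
print("Xylitol Gum infection percent:", round(xylitol_gum_pct, 1))
```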


Step 3: p-value of Chi-square Test

p-value = probability the chi-square test statistic could have been as large or larger if the null hypothesis were true.

Large test statistic ⇒ evidence of a relationship.

So how large is enough to declare significance?

Chi-square probability distribution used to find p-value.

Degrees of freedom df = (Rows – 1)(Columns – 1) = (r – 1)(c – 1)


Chi-square Distributions

• Skewed to the right distributions.

• Minimum value is 0.

• Indexed by the degrees of freedom (df).



Chi-square statistic was 6.69; df = (3 – 1)(2 – 1) = 2

p-value = 0.035


Finding the p-value from Table A.5:

Look in the corresponding “df” row of Table A.5. Scan across until you find where the statistic falls.

• If value of statistic falls between two table entries, p-value is between values of p (column headings) for these entries.

• If value of statistic is larger than entry in rightmost column (labeled p = 0.001), p-value is less than 0.001 (p < 0.001).

• If value of statistic is smaller than entry in leftmost column (labeled p = 0.50), p-value is greater than 0.50 (p > 0.50).
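The table-scanning procedure can be sketched as a small lookup. Table A.5 itself is not reproduced in this transcript, so the column headings in `P_COLUMNS` are an assumption and the real table may have more columns. The sketch builds only the df = 2 row, where the critical value for right-tail area p is exactly -2 ln(p), so no table entries need to be copied by hand.

```python
import math

# ASSUMED column headings; the actual Table A.5 may include other columns.
P_COLUMNS = [0.50, 0.25, 0.10, 0.05, 0.025, 0.01, 0.001]

# For df = 2 the critical value with right-tail area p is exactly -2 * ln(p).
DF2_ROW = [(-2 * math.log(p), p) for p in P_COLUMNS]

def bracket_p_value(statistic, row):
    """Return (lower, upper) bounds on the p-value from one table row.

    The row lists (critical value, p) pairs from the leftmost column
    (largest p) to the rightmost column (smallest p).
    """
    lower, upper = 0.0, 1.0
    for critical_value, p in row:
        if statistic >= critical_value:
            upper = p      # statistic is at or beyond this column
        else:
            lower = p      # first column the statistic does not reach
            break
    return lower, upper

print(bracket_p_value(6.69, DF2_ROW))   # statistic between two entries
print(bracket_p_value(20.0, DF2_ROW))   # beyond the rightmost column
print(bracket_p_value(1.0, DF2_ROW))    # below the leftmost column
```

A result of `(0.0, 0.001)` corresponds to reporting p < 0.001, and `(0.5, 1.0)` corresponds to reporting p > 0.50.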



There is a statistically significant relationship between the risk of an ear infection and the preventative treatment.

Chi-square statistic was 6.69; df = (3 – 1)(2 – 1) = 2

0.025 < p-value < 0.05


Example 15.6 A Moderate p-Value

Table has three rows and three columns. The computed chi-square statistic is 8.12. Degrees of freedom are df = (3 – 1)(3 – 1) = 4.

Finding the p-value: Scan the df = 4 row in Table A.5 and the value of 8.12 is between the entries 7.78 (p = 0.10) and 8.50 (p = 0.075). Thus, the p-value is between 0.075 and 0.10.

0.075 < p-value < 0.10
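When software is available, the table bounds can be cross-checked against an exact right-tail area. The sketch below uses a standard closed form that holds only when df is even, which covers both this example (df = 4) and the xylitol example (df = 2). The function name is hypothetical, and a statistics library would normally be used instead.

```python
import math

def chi_square_p_value_even_df(statistic, df):
    """Right-tail area of the chi-square distribution for even df.

    Uses the closed form exp(-x/2) * sum_{j < df/2} (x/2)^j / j!,
    which is valid only when df is a positive even integer.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form only covers positive even df")
    half = statistic / 2
    series = sum(half ** j / math.factorial(j) for j in range(df // 2))
    return math.exp(-half) * series

p_xylitol = chi_square_p_value_even_df(6.69, 2)   # about 0.035
p_moderate = chi_square_p_value_even_df(8.12, 4)  # about 0.087

print(round(p_xylitol, 3), round(p_moderate, 3))
```

Both values fall inside the intervals read from the table: 0.025 to 0.05 for the xylitol example and 0.075 to 0.10 for Example 15.6.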


Steps 4 and 5: Making a Decision and Reporting a Conclusion

Two equivalent rules: Reject H0 when …

• p-value ≤ 0.05

• Chi-square statistic is greater than the entry in the 0.05 column of Table A.5 (the critical value).

Large test statistic ⇒ small p-value ⇒ evidence a real relationship exists in population.

Note: For 2x2 tables, a test statistic of 3.84 or larger is significant.
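To illustrate that the two rejection rules agree, the sketch below applies both to a 2 x 2 table, where df = 1 and the right-tail area can be written with the complementary error function. The function names and the list of trial statistics are hypothetical.

```python
import math

CRITICAL_VALUE_DF1 = 3.84  # rounded 0.05-column entry quoted in the note above

def p_value_df1(statistic):
    """Right-tail area of the chi-square distribution with df = 1."""
    return math.erfc(math.sqrt(statistic / 2))

def reject_by_p_value(statistic, alpha=0.05):
    return p_value_df1(statistic) <= alpha

def reject_by_critical_value(statistic):
    return statistic >= CRITICAL_VALUE_DF1

# Hypothetical statistics, chosen away from the boundary.
checks = [(s, reject_by_p_value(s), reject_by_critical_value(s))
          for s in (1.0, 3.0, 4.8, 10.0)]
for stat, by_p, by_cv in checks:
    print(stat, by_p, by_cv)
```

Because 3.84 is a rounded table entry, a statistic within a few thousandths of it can land on different sides of the two rules; in that narrow band the p-value from software is the safer guide.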


Reporting a Conclusion

Ways to write “do not reject H0”

• The relationship between smoking and drinking alcohol is not statistically significant.

• The proportions of smokers who never drink, drink occasionally, and drink often are not significantly different from the proportions of non-smokers who do so.

• There is insufficient evidence to conclude that there is a relationship in the population between smoking and drinking alcohol.

Example: Testing whether there is a relationship between smoking (yes or no) and drinking alcohol (never, occasionally, often).


Reporting a Conclusion

Ways to write “reject H0”

• There is a statistically significant relationship between smoking and drinking alcohol.

• The proportions of smokers who never drink, drink occasionally, and drink often are not the same as the proportions of non-smokers who do so.

• Smokers have significantly different drinking behavior than non-smokers.

Example: Testing whether there is a relationship between smoking (yes or no) and drinking alcohol (never, occasionally, often).


Example 15.8 Making Friends

Q: With whom do you find it easiest to make friends – opposite sex or same sex or no difference?

df = (2 – 1)(3 – 1) = 2. Table A.5: value of 8.515 falls between entries in 0.025 column (7.38) and 0.01 column (9.21), so 0.01 < p-value < 0.025.

There is a statistically significant relationship at the 0.05 level.

There appears to be a difference in the distribution of responses of men and women if the populations were asked this question.


Supporting Analyses

To learn about the specific nature of the relationship:

• Description of row (or column) percents.

• Bar chart of counts or percents.

• Examination of each cell’s “contribution to chi-square.” Cells with the largest values have contributed most to the significance of the relationship and deserve attention in any description of the relationship.

• Confidence intervals for important proportions or for differences between proportions.
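As one illustration of a supporting analysis, the sketch below ranks each cell's contribution to chi-square, that is, its own (observed - expected)^2 / expected term. The counts are an assumption rather than data from this transcript: they were chosen to be consistent with the xylitol example's reported n = 533 and chi-square statistic of 6.69, and should be checked against the original table.

```python
# ASSUMED counts for the xylitol example (the original table is missing from
# this transcript); columns are infection Yes / No.
groups = ["Placebo Gum", "Xylitol Gum", "Xylitol Lozenge"]
outcomes = ["Yes", "No"]
observed = [[49, 129], [29, 150], [39, 137]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Contribution of a cell = (observed - expected)^2 / expected for that cell.
contributions = {}
for i, group in enumerate(groups):
    for j, outcome in enumerate(outcomes):
        expected = row_totals[i] * col_totals[j] / n
        contributions[(group, outcome)] = (observed[i][j] - expected) ** 2 / expected

ranked = sorted(contributions.items(), key=lambda item: item[1], reverse=True)
for cell, value in ranked:
    print(cell, round(value, 3))
```

With these assumed counts, the "Yes" cells for Xylitol Gum and Placebo Gum account for most of the statistic, while the Xylitol Lozenge cells contribute almost nothing.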


Chi-Square Test or Z-Test for Difference in Two Proportions?

Does it make a difference?

• If desired Ha has no specific direction (two-sided), the two tests give exactly the same p-value. The squared value of the z-statistic equals the chi-square statistic.

• If desired Ha has a direction (one-sided), the z-test should be used.
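The equivalence in the two-sided case can be checked numerically. The sketch below runs both tests on the same hypothetical 2 x 2 data (the counts are made up for illustration), using the pooled standard error for the z-test and no continuity correction for the chi-square test.

```python
import math

# Hypothetical data (illustration only): successes and sample sizes per group.
x1, n1 = 30, 100
x2, n2 = 45, 100

# Two-sided z-test for a difference in two proportions (pooled standard error).
p1_hat, p2_hat = x1 / n1, x2 / n2
pooled = (x1 + x2) / (n1 + n2)
std_error = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
z = (p1_hat - p2_hat) / std_error
p_value_z = math.erfc(abs(z) / math.sqrt(2))   # two-sided normal tail area

# Chi-square test on the same data laid out as a 2 x 2 table.
observed = [[x1, n1 - x1], [x2, n2 - x2]]
row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)
chi_square = sum(
    (observed[i][j] - row_totals[i] * col_totals[j] / n) ** 2
    / (row_totals[i] * col_totals[j] / n)
    for i in range(2) for j in range(2)
)
p_value_chi = math.erfc(math.sqrt(chi_square / 2))  # right-tail area, df = 1

print("z^2 =", round(z ** 2, 4), " chi-square =", round(chi_square, 4))
print("p (z-test) =", round(p_value_z, 4), " p (chi-square) =", round(p_value_chi, 4))
```

The sign of z is lost when it is squared, which is why the chi-square test cannot express a one-sided alternative.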


15.3 Testing Hypotheses about One Categorical Variable: GOF

Step 1: Determine the null and alternative hypotheses.

H0: The probabilities for k categories are p1, p2, . . . , pk.

Ha: Not all probabilities specified in H0 are correct.

Note: Probabilities in the null hypothesis must sum to 1.

Goodness of Fit (GOF) Test



Step 2: Verify necessary data conditions, and if met, summarize the data into an appropriate test statistic.

If at least 80% of the expected counts are greater than 5 and none are less than 1, compute

$$\chi^2 = \sum \frac{(\text{Observed} - \text{Expected})^2}{\text{Expected}}$$

where the expected count for the ith category is computed as $np_i$.



Step 3: Assuming the null hypothesis is true, find the p-value. Use chi-square distribution with df = k – 1.

Step 4: Decide whether or not the result is statistically significant based on the p-value. The result is statistically significant if the p-value ≤ α (the level of significance).

Step 5: Report the conclusion in the context of the situation.
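Steps 1 through 3 of the goodness of fit test can be sketched as follows. The function name and the die-rolling counts are hypothetical and are not data from the text; the sketch returns the statistic and df and leaves the p-value to a table or software.

```python
import math

def goodness_of_fit(observed, null_probs):
    """Return (chi-square statistic, df, expected counts) for a GOF test."""
    if len(observed) != len(null_probs):
        raise ValueError("need one null probability per category")
    if not math.isclose(sum(null_probs), 1.0):
        raise ValueError("null probabilities must sum to 1")
    n = sum(observed)
    expected = [n * p for p in null_probs]          # expected count = n * p_i
    statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    df = len(observed) - 1                          # df = k - 1
    return statistic, df, expected

# Hypothetical data: 60 rolls of a die; H0 says each face has probability 1/6.
counts = [8, 12, 9, 11, 6, 14]
statistic, df, expected = goodness_of_fit(counts, [1 / 6] * 6)
print("Expected counts:", [round(e, 2) for e in expected])
print("Chi-square statistic:", round(statistic, 2), " df =", df)
```

The check on the null probabilities mirrors the note in Step 1 that they must sum to 1; the expected-count conditions from Step 2 would still need to be verified before interpreting the result.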


Example 15.15 Pennsylvania Daily Number

State lottery game: Three-digit number made by drawing a digit between 0 and 9 from each of three different containers.

Focus = draws from the first container. If numbers randomly selected, each value would be equally likely to occur.

H0: p = 1/10 for each of the 10 possible digits

Ha: Not H0


Example 15.15 Daily Number

Data: n = 500 days between 7/19/99 and 11/29/00


Example 15.15 Daily Number

Chi-square goodness of fit statistic:

From Table A.5: df = k – 1 = 10 – 1 = 9; p-value > 0.50

Result is not statistically significant; the null hypothesis is not rejected.
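The observed digit counts and the computed statistic are missing from this transcript, so the example cannot be reproduced here. What can be sketched is what "p-value > 0.50" implies when df = 9. The function below uses a standard closed form for the chi-square right-tail area that is valid for odd df only; the function name is hypothetical, and the cutoff of roughly 8.34 is a result of this calculation rather than a value quoted from Table A.5.

```python
import math

def chi_square_p_value_odd_df(statistic, df):
    """Right-tail area of the chi-square distribution for odd df.

    Closed form: erfc(chi / sqrt(2)) + 2 * Z(chi) * sum of
    chi^(2r-1) / (1 * 3 * ... * (2r-1)) for r = 1 .. (df - 1) / 2,
    where chi = sqrt(statistic) and Z is the standard normal density.
    """
    if df <= 0 or df % 2 != 1:
        raise ValueError("this closed form only covers positive odd df")
    chi = math.sqrt(statistic)
    normal_tails = math.erfc(chi / math.sqrt(2))
    density = math.exp(-statistic / 2) / math.sqrt(2 * math.pi)
    series, denom = 0.0, 1.0
    for r in range(1, (df - 1) // 2 + 1):
        denom *= 2 * r - 1               # 1, 1*3, 1*3*5, ...
        series += chi ** (2 * r - 1) / denom
    return normal_tails + 2 * density * series

# With df = 9, statistics below roughly 8.34 have p-values above 0.50.
for stat in (5.0, 8.34, 12.0):
    print(stat, round(chi_square_p_value_odd_df(stat, 9), 3))
```

So a reported p-value above 0.50 with df = 9 means the goodness of fit statistic was smaller than about 8.34, which is close to what would be typical if the digits really were equally likely.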
