
Chi Squared Tests

Chapter 16

16.1 Introduction

• Two statistical techniques are presented to analyze nominal data:
– A goodness-of-fit test for the multinomial experiment.
– A contingency table test of independence.

• Both tests use the χ² (chi-squared) distribution as the sampling distribution of the test statistic.

• The hypothesis tested involves the probabilities p1, p2, …, pk of a multinomial distribution.

• The multinomial experiment is an extension of the binomial experiment.
– There are n independent trials.
– The outcome of each trial can be classified into one of k categories, called cells.
– The probability pi that the outcome falls into cell i remains constant for each trial. Moreover, p1 + p2 + … + pk = 1.

16.2 Chi-Squared Goodness-of-Fit Test


• We test whether there is sufficient evidence to reject a pre-specified set of values for the pi.

• The hypotheses:

$$H_0: p_1 = a_1,\ p_2 = a_2,\ \ldots,\ p_k = a_k$$

$$H_1: \text{At least one } p_i \neq a_i$$

• The test compares the actual (observed) frequencies with the expected frequencies of occurrences in all the cells.

• Example 16.1
– Two competing companies, A and B, have enjoyed a dominant position in the market. The companies conducted aggressive advertising campaigns.
– Market shares before the campaigns were:
• Company A = 45%
• Company B = 40%
• Other competitors = 15%

The multinomial goodness of fit test - Example

• Example 16.1 – continued
– To study the effect of the campaigns on the market shares, a survey was conducted.
– 200 customers were asked to indicate their preference regarding the product advertised.
– Survey results:
• 102 customers preferred company A’s product,
• 82 customers preferred company B’s product,
• 16 customers preferred a competitor’s product.


• Example 16.1 – continued
Can we conclude at the 5% significance level that the market shares were affected by the advertising campaigns?

• Solution
– The population investigated is the brand preferences.
– The data are nominal (A, B, or other).
– This is a multinomial experiment (three categories).
– The question of interest: Are p1, p2, and p3 different after the campaigns from their values before the campaigns?



• The hypotheses are:
H0: p1 = .45, p2 = .40, p3 = .15
H1: At least one pi changed.

The expected frequency for each category (cell) if the null hypothesis is true, and the actual frequency the sample returned, are shown below:

Cell            Expected frequency (H0 true)   Actual frequency
1 (Company A)   90 = 200(.45)                  102
2 (Company B)   80 = 200(.40)                   82
3 (Others)      30 = 200(.15)                   16

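• The test statistic is

$$\chi^2 = \sum_{i=1}^{k} \frac{(f_i - e_i)^2}{e_i}, \quad \text{where } e_i = n p_i$$

• The rejection region is

$$\chi^2 > \chi^2_{\alpha,\,k-1}$$

• For Example 16.1:

$$\chi^2 = \frac{(102-90)^2}{90} + \frac{(82-80)^2}{80} + \frac{(16-30)^2}{30} = 8.18$$

$$\chi^2_{\alpha,\,k-1} = \chi^2_{.05,\,3-1} = 5.99147$$

The p-value = P(χ² > 8.18) ≈ .0167 [from Excel: CHIDIST(8.18, 2)].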

[Figure: density of the χ² distribution with 2 degrees of freedom. The rejection region lies to the right of the critical value 5.99; the test statistic 8.18 falls inside it, so the p-value is smaller than alpha.]

Conclusion: Since 8.18 > 5.99, there is sufficient evidence at the 5% significance level to reject the null hypothesis. At least one of the probabilities pi is different. Because the shares must sum to 1, at least two market shares have changed.
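The arithmetic of Example 16.1 can be checked with a short script. This is a minimal sketch using only the Python standard library; the function name `chi_squared_gof` is a hypothetical helper, not something from the course software. It relies on the fact that, for exactly 2 degrees of freedom, the upper-tail probability of the χ² distribution is exp(-x/2), so the shortcut below does not carry over to other degrees of freedom.

```python
import math

def chi_squared_gof(observed, probs):
    """Return (statistic, expected frequencies) for a goodness-of-fit test."""
    n = sum(observed)
    expected = [n * p for p in probs]          # e_i = n * p_i
    stat = sum((f - e) ** 2 / e for f, e in zip(observed, expected))
    return stat, expected

observed = [102, 82, 16]       # survey counts: company A, company B, others
h0_probs = [0.45, 0.40, 0.15]  # market shares specified by H0

stat, expected = chi_squared_gof(observed, h0_probs)

# Assumption: df = k - 1 = 2, where P(chi2 > x) = exp(-x / 2) exactly.
p_value = math.exp(-stat / 2)
critical = -2 * math.log(0.05)  # critical value for alpha = .05, df = 2

print("statistic:", round(stat, 2))
print("critical value:", round(critical, 4))
print("p-value:", round(p_value, 4))
print("reject H0:", stat > critical)
```

The script works with the unrounded statistic, so its p-value can differ in the later decimals from one computed from the rounded value 8.18.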

Required conditions – the rule of five

• The test statistic used to perform the test is only approximately Chi-squared distributed.

• For the approximation to apply, the expected cell frequency has to be at least 5 for all the cells (npi ≥ 5).

• If the expected frequency in a cell is less than 5, combine it with other cells.

16.3 Chi-squared Test of a Contingency Table

• This test is used to determine whether
– two nominal variables are related, or
– there are differences between two or more populations of a nominal variable.
• To accomplish the test objectives, we need to classify the data according to two different criteria.

Contingency table χ² test – Example

• Example 16.2
– In an effort to better predict the demand for courses offered by a certain MBA program, it was hypothesized that students’ academic background affects their choice of MBA major and thus their course selection.
– A random sample of last year’s MBA students was selected. The following contingency table summarizes the relevant data.


The observed values

Undergraduate              MBA Major
Degree        Accounting   Finance   Marketing   Total
BA                31          13        16         60
BENG               8          16         7         31
BBA               12          10        17         39
Other             10           5         7         22
Total             61          44        47        152

There are two ways to address the problem:
– If each undergraduate degree is considered a population, do these populations differ?
– If each classification is considered a nominal variable, are these two variables dependent?

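• Solution
– The hypotheses are:
H0: The two variables are independent
H1: The two variables are dependent

– The test statistic

$$\chi^2 = \sum_{i=1}^{k} \frac{(f_i - e_i)^2}{e_i}$$

where k is the number of cells in the contingency table.

– The rejection region

$$\chi^2 > \chi^2_{\alpha,\,(r-1)(c-1)}$$

where r and c are the numbers of rows and columns in the table.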


Since ei = npi but pi is unknown, we need to estimate the unknown probability from the data, assuming H0 is true.

Under the null hypothesis the two variables are independent:

P(Accounting and BA) = P(Accounting)*P(BA)

Undergraduate              MBA Major
Degree        Accounting   Finance   Marketing   Total   Probability
BA                                                 60      60/152
BENG                                               31      31/152
BBA                                                39      39/152
Other                                              22      22/152
Total             61          44        47        152
Probability     61/152      44/152    47/152

The number of students expected to fall in the cell “Accounting - BA” is
eAcct-BA = n(pAcct-BA) = 152(61/152)(60/152) = [61*60]/152 = 24.08,
where pAcct-BA = [61/152][60/152].

The number of students expected to fall in the cell “Finance - BBA” is
eFinance-BBA = n(pFinance-BBA) = 152(44/152)(39/152) = [44*39]/152 = 11.29.

Estimating the expected frequencies

The expected frequencies for a contingency table

• The expected frequency of the cell in row i and column j of the contingency table is calculated by

$$e_{ij} = \frac{(\text{Column } j \text{ total})(\text{Row } i \text{ total})}{\text{Sample size}}$$
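The row-total times column-total rule can be applied to every cell at once. The following is a small sketch in plain Python; `expected_frequencies` is an illustrative name chosen here, and the table is the observed data of Example 16.2.

```python
def expected_frequencies(table):
    """Expected cell counts under independence: row total * column total / n."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]

# Observed counts from Example 16.2 (columns: Accounting, Finance, Marketing)
observed = [
    [31, 13, 16],  # BA
    [8, 16, 7],    # BENG
    [12, 10, 17],  # BBA
    [10, 5, 7],    # Other
]

expected = expected_frequencies(observed)
for row in expected:
    print([round(e, 2) for e in row])
```

Each row of expected counts sums to the corresponding observed row total, which is a quick sanity check on the calculation.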

Calculation of the χ² statistic

Observed frequencies, with expected frequencies in parentheses:

Undergraduate                 MBA Major
Degree        Accounting    Finance       Marketing     Total
BA            31 (24.08)    13 (17.37)    16 (18.55)      60
BENG           8 (12.44)    16 (8.97)      7 (9.59)       31
BBA           12 (15.65)    10 (11.29)    17 (12.06)      39
Other         10 (8.83)      5 (6.37)      7 (6.80)       22
Total         61            44            47             152

$$\chi^2 = \sum_{i=1}^{k} \frac{(f_i - e_i)^2}{e_i} = \frac{(31-24.08)^2}{24.08} + \cdots + \frac{(5-6.37)^2}{6.37} + \cdots + \frac{(7-6.80)^2}{6.80} = 14.70$$

• Solution – continued
– The critical value in our example is:

$$\chi^2_{\alpha,\,(r-1)(c-1)} = \chi^2_{.05,\,(4-1)(3-1)} = 12.5916$$

• Conclusion: Since χ² = 14.70 > 12.5916, there is sufficient evidence to infer at the 5% significance level that students’ undergraduate degree and MBA students’ course selection are dependent.
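The whole test of independence for Example 16.2 can be reproduced in a few lines. This is a sketch under stated assumptions: the helper names are hypothetical, and the p-value uses a closed form for the χ² upper tail that is valid only when the degrees of freedom are even (here (4-1)(3-1) = 6). The critical value 12.5916 is taken from the text rather than computed.

```python
import math

def chi2_sf_even_df(x, df):
    """P(chi2 > x) for a positive even df, via the finite-sum closed form."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form requires a positive even df")
    half = x / 2
    terms = sum(half ** j / math.factorial(j) for j in range(df // 2))
    return math.exp(-half) * terms

def chi_squared_independence(table):
    """Return (statistic, degrees of freedom) for a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, f in enumerate(row):
            e = row_totals[i] * col_totals[j] / n  # expected count e_ij
            stat += (f - e) ** 2 / e
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df

observed = [[31, 13, 16], [8, 16, 7], [12, 10, 17], [10, 5, 7]]
stat, df = chi_squared_independence(observed)
p_value = chi2_sf_even_df(stat, df)
CRITICAL_05_DF6 = 12.5916  # critical value quoted in the text

print(round(stat, 4), df, round(p_value, 4), stat > CRITICAL_05_DF6)
```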

Using the computer

Define a code to specify each nominal value. Input the data in columns, one column for each classification (here, degree and MBA major).

Code:
Undergraduate degree: 1 = BA, 2 = BENG, 3 = BBA, 4 = OTHERS
MBA Major: 1 = ACCOUNTING, 2 = FINANCE, 3 = MARKETING

Degree   MBA Major
3        1
1        1
1        1
1        1
2        2
1        3
.        .
.        .

Select the Chi squared / raw data option from Data Analysis Plus under Tools. See Xm16-02.

Contingency Table
         1     2     3   Total
1       31    13    16     60
2        8    16     7     31
3       12    10    17     39
4       10     5     7     22
Total   61    44    47    152

Test Statistic CHI-Squared = 14.7019
P-Value = 0.0227
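Turning coded raw data into a contingency table is a counting exercise. The sketch below is not the Data Analysis Plus procedure; it only illustrates the idea with `collections.Counter`, using the first six coded rows shown in the listing (the full data file is not reproduced here). The names `DEGREES`, `MAJORS`, and `tabulate` are illustrative.

```python
from collections import Counter

DEGREES = {1: "BA", 2: "BENG", 3: "BBA", 4: "OTHERS"}
MAJORS = {1: "ACCOUNTING", 2: "FINANCE", 3: "MARKETING"}

def tabulate(pairs):
    """Count (degree code, major code) pairs into a rows-by-columns table."""
    counts = Counter(pairs)
    return [[counts[(d, m)] for m in sorted(MAJORS)] for d in sorted(DEGREES)]

# Only the first six rows of the coded data; the rest are elided in the listing.
sample_rows = [(3, 1), (1, 1), (1, 1), (1, 1), (2, 2), (1, 3)]

table = tabulate(sample_rows)
for code, row in zip(sorted(DEGREES), table):
    print(DEGREES[code], row)
```

Run on the complete data set, the same counting step would yield the 4 x 3 table of observed frequencies used in the test.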

Required condition – Rule of five

– The χ² distribution provides an adequate approximation to the sampling distribution under the condition that eij ≥ 5 for all the cells.
– When eij < 5, rows or columns must be combined so that the condition is met.

Example (expected frequencies in parentheses)

Before combining:
Column 1      Column 2      Column 3
10 (10.1)     14 (12.8)     4 (5.1)
12 (12.7)     16 (16.0)     7 (6.3)
 8 (7.2)       8 (9.2)      4 (3.6)

We combine columns 2 and 3:
Column 1      Columns 2 + 3
10 (10.1)     14 + 4 = 18   (12.8 + 5.1 = 17.9)
12 (12.7)     16 + 7 = 23   (16.0 + 6.3 = 22.3)
 8 (7.2)       8 + 4 = 12   (9.2 + 3.6 = 12.8)
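Combining columns to satisfy the rule of five is mechanical once the offending cell is found. A minimal sketch, assuming a table stored as a list of rows; `combine_columns` is a hypothetical helper, and the numbers are the observed and expected frequencies of the example above.

```python
def combine_columns(table, keep, drop):
    """Add column `drop` into column `keep`, then remove column `drop`."""
    merged = []
    for row in table:
        new_row = list(row)
        new_row[keep] += new_row[drop]
        del new_row[drop]
        merged.append(new_row)
    return merged

observed = [[10, 14, 4], [12, 16, 7], [8, 8, 4]]
expected = [[10.1, 12.8, 5.1], [12.7, 16.0, 6.3], [7.2, 9.2, 3.6]]

# The rule of five fails because one expected frequency (3.6) is below 5.
violates = any(e < 5 for row in expected for e in row)

# Merge the third column (index 2) into the second (index 1).
combined_observed = combine_columns(observed, keep=1, drop=2)
combined_expected = combine_columns(expected, keep=1, drop=2)

print(violates)
print(combined_observed)
print([[round(e, 1) for e in row] for row in combined_expected])
```

After merging, the degrees of freedom must be recomputed from the new table dimensions.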

16.5 Chi-Squared test for Normality

• The goodness-of-fit Chi-squared test can be used to determine whether data were drawn from a specified distribution.

• The general procedure:
– Hypothesize on the parameter values of the distribution we test (i.e. μ0, σ0 for the normal distribution).
– For the variable tested, X, specify disjoint ranges that cover all its possible values.
– Build a Chi-squared statistic that (in aggregate) compares the expected frequency under H0 and the actual frequency of observations that fall in each range.
– Run a goodness-of-fit test based on the multinomial experiment.

• Testing for normality in Example 12.1: For a sample of size n = 50 (see Xm12-01), the sample mean was 460.38 with a standard deviation of 38.83. Can we infer from the data provided that this sample was drawn from a normal distribution with μ = 460.38 and σ = 38.83? Use a 5% significance level.


χ² test for normality

Solution
First, let us select z values that define each cell (expected frequency > 5 for each cell).
z1 = -1; P(z < -1) = p1 = .1587; e1 = np1 = 50(.1587) = 7.94
z2 = 0; P(-1 < z < 0) = p2 = .3413; e2 = np2 = 50(.3413) = 17.07
z3 = 1; P(0 < z < 1) = p3 = .3413; e3 = np3 = 17.07
P(z > 1) = p4 = .1587; e4 = np4 = 7.94

The cell boundaries are calculated from the corresponding z values under H0:
z1 = (x1 - 460.38)/38.83 = -1; x1 = 421.55
Similarly, x2 = 460.38 and x3 = 499.21.

The expected frequencies can now be determined for each cell:

Cell                Probability   Expected frequency
x < 421.55            .1587        e1 = 7.94
421.55 to 460.38      .3413        e2 = 17.07
460.38 to 499.21      .3413        e3 = 17.07
x > 499.21            .1587        e4 = 7.94

The observed frequencies in the four cells are f1 = 10, f2 = 13, f3 = 19, f4 = 8, and the expected frequencies are e1 = 7.94, e2 = 17.07, e3 = 17.07, e4 = 7.94.

$$\chi^2 = \frac{(10-7.94)^2}{7.94} + \frac{(13-17.07)^2}{17.07} + \frac{(19-17.07)^2}{17.07} + \frac{(8-7.94)^2}{7.94} = 1.72$$

– The rejection region

$$\chi^2 > \chi^2_{\alpha,\,k-1-L}, \quad \text{where } L \text{ is the number of parameters estimated from the data.}$$

Here L = 2 (the mean and the standard deviation), so the degrees of freedom are k - 3:

$$\chi^2_{\alpha,\,k-3} = \chi^2_{.05,\,4-3} = 3.84146$$

Conclusion: Since 1.72 < 3.84146, there is insufficient evidence to conclude at the 5% significance level that the data are not normally distributed.
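The normality test can be sketched end to end with the standard library. The cell probabilities come from the normal CDF written with `math.erf`, and, because the test has a single degree of freedom, the p-value follows from P(χ² > x) = erfc(sqrt(x/2)). The variable names are illustrative, and the critical value 3.84146 is the one quoted in the text. The script uses unrounded probabilities, so its statistic may differ slightly in the second decimal from the hand calculation based on table values.

```python
import math

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

n = 50
mean, sd = 460.38, 38.83    # estimated from the sample (L = 2 parameters)
z_cuts = [-1.0, 0.0, 1.0]   # z values that define the four cells
observed = [10, 13, 19, 8]  # f1..f4 from the example

bounds = [mean + z * sd for z in z_cuts]  # cell boundaries on the x scale
cdf = [0.0] + [normal_cdf(z) for z in z_cuts] + [1.0]
probs = [hi - lo for lo, hi in zip(cdf, cdf[1:])]
expected = [n * p for p in probs]

stat = sum((f - e) ** 2 / e for f, e in zip(observed, expected))
df = len(observed) - 1 - 2  # k - 1 - L

# Valid for df = 1 only: chi-squared with 1 df is the square of a standard normal.
p_value = math.erfc(math.sqrt(stat / 2))
CRITICAL_05_DF1 = 3.84146   # critical value quoted in the text

print([round(b, 2) for b in bounds])
print([round(e, 2) for e in expected])
print(round(stat, 2), df, stat > CRITICAL_05_DF1)
```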