Chi Squared Tests
Chapter 16
16.1 Introduction
• Two statistical techniques are presented to analyze nominal data.
– A goodness-of-fit test for the multinomial experiment.
– A contingency table test of independence.
• Both tests use the χ² distribution as the sampling distribution of the test statistic.
• The hypothesis tested involves the probabilities p1, p2, …, pk of a multinomial distribution.
• The multinomial experiment is an extension of the binomial experiment.
– There are n independent trials.
– The outcome of each trial can be classified into one of k categories, called cells.
– The probability pi that the outcome falls into cell i remains constant for each trial. Moreover, p1 + p2 + … + pk = 1.
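A multinomial experiment can be simulated to make the definition concrete. The sketch below is illustrative only: the cell labels, probabilities, number of trials and seed are assumptions chosen for the demonstration, not data from the chapter.

```python
import random
from collections import Counter

def simulate_multinomial(n, probs, labels, seed=0):
    """Run n independent trials; each trial falls into cell i with probability probs[i]."""
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("cell probabilities must sum to 1")
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    outcomes = rng.choices(labels, weights=probs, k=n)
    counts = Counter(outcomes)
    # Report every cell, including cells that received no observations.
    return {label: counts.get(label, 0) for label in labels}

# Hypothetical experiment with k = 3 cells.
cell_counts = simulate_multinomial(n=200, probs=[0.5, 0.3, 0.2], labels=["c1", "c2", "c3"])
print(cell_counts)
```

The cell counts always add up to n, which is why only k - 1 of them are free to vary.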
16.2 Chi-Squared Goodness-of-Fit Test
• We test whether there is sufficient evidence to reject a pre-specified set of values for pi.
• The hypothesis:
$H_0: p_1 = a_1,\ p_2 = a_2,\ \ldots,\ p_k = a_k$
$H_1: \text{At least one } p_i \neq a_i$
• The test builds on comparing actual frequency and the expected frequency of occurrences in all the cells.
• Example 16.1
– Two competing companies, A and B, have enjoyed a dominant position in the market. The companies conducted aggressive advertising campaigns.
– Market shares before the campaigns were:
  • Company A = 45%
  • Company B = 40%
  • Other competitors = 15%.
The multinomial goodness of fit test - Example
• Example 16.1 – continued
– To study the effect of the campaigns on the market shares, a survey was conducted.
– 200 customers were asked to indicate their preference regarding the product advertised.
– Survey results:
  • 102 customers preferred company A’s product,
  • 82 customers preferred company B’s product,
  • 16 customers preferred a competitor’s product.
– Can we conclude at the 5% significance level that the market shares were affected by the advertising campaigns?
• Solution
– The population investigated is the brand preferences.
– The data are nominal (A, B, or other).
– This is a multinomial experiment (three categories).
– The question of interest: Are p1, p2, and p3 different after the campaign from their values before the campaign?
• The hypotheses are:
H0: p1 = .45, p2 = .40, p3 = .15
H1: At least one pi changed.
• The expected frequency for each category (cell) if the null hypothesis is true, and the actual frequency the sample returned:

Category             Expected frequency (ei)   Actual frequency (fi)
Company A            90 = 200(.45)             102
Company B            80 = 200(.40)              82
Other competitors    30 = 200(.15)              16
• The statistic is
$\chi^2 = \sum_{i=1}^{k} \frac{(f_i - e_i)^2}{e_i}, \quad \text{where } e_i = np_i$
• The rejection region is
$\chi^2 > \chi^2_{\alpha,\,k-1}$
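The statistic can be computed directly from the observed counts and the hypothesized probabilities. The following is a minimal sketch using the Example 16.1 figures; the function and variable names are illustrative choices, not part of the original slides.

```python
def chi_squared_statistic(observed, probabilities):
    """Sum of (f_i - e_i)^2 / e_i over all cells, with e_i = n * p_i."""
    n = sum(observed)
    expected = [n * p for p in probabilities]
    return sum((f - e) ** 2 / e for f, e in zip(observed, expected))

# Example 16.1: survey counts for company A, company B, and other competitors.
observed = [102, 82, 16]
h0_probabilities = [0.45, 0.40, 0.15]  # market shares before the campaigns

statistic = chi_squared_statistic(observed, h0_probabilities)
print(round(statistic, 2))  # 8.18
```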
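• Example 16.1 – continued
$\chi^2 = \sum_{i=1}^{k} \frac{(f_i - e_i)^2}{e_i} = \frac{(102 - 90)^2}{90} + \frac{(82 - 80)^2}{80} + \frac{(16 - 30)^2}{30} = 8.18$
$\chi^2_{\alpha,\,k-1} = \chi^2_{.05,\,3-1} = 5.99147$
The p-value = P(χ² > 8.18) ≈ .0167 [from Excel: =CHIDIST(8.18, 2)]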
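With exactly 2 degrees of freedom the χ² upper-tail probability has a closed form, P(χ² > x) = exp(-x/2), so the critical value and p-value for Example 16.1 can be checked without tables. This is a sketch that holds only for 2 degrees of freedom; the helper names are hypothetical.

```python
import math

def chi2_sf_2df(x):
    """P(chi-squared > x) for exactly 2 degrees of freedom."""
    return math.exp(-x / 2.0)

def chi2_critical_2df(alpha):
    """Value c with P(chi-squared > c) = alpha for exactly 2 degrees of freedom."""
    return -2.0 * math.log(alpha)

critical_value = chi2_critical_2df(0.05)
p_value = chi2_sf_2df(8.18)

print(round(critical_value, 2))  # 5.99
print(round(p_value, 4))         # 0.0167
print(8.18 > critical_value)     # True, so H0 is rejected at the 5% level
```

For other degrees of freedom a statistics library or a table is needed; the closed form above does not generalize as is.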
• Example 16.1 – continued
Conclusion: Since 8.18 > 5.99, there is sufficient evidence at 5% significance level to reject the null hypothesis. At least one of the probabilities pi is different. Thus, at least two market shares have changed.
[Figure: χ² density with 2 degrees of freedom. The rejection region (area alpha) lies to the right of 5.99; the p-value is the area to the right of 8.18.]
Required conditions – the rule of five
• The test statistic used to perform the test is only approximately Chi-squared distributed.
• For the approximation to apply, the expected cell frequency has to be at least 5 for all the cells (npi ≥ 5).
• If the expected frequency in a cell is less than 5, combine it with other cells.
16.3 Chi-squared Test of a Contingency Table
• This test is used to test whether…
– two nominal variables are related;
– there are differences between two or more populations of a nominal variable.
• To accomplish the test objectives, we need to classify the data according to two different criteria.
Contingency table χ² test – Example
• Example 16.2
– In an effort to better predict the demand for courses offered by a certain MBA program, it was hypothesized that students’ academic background affects their choice of MBA major and thus their course selection.
– A random sample of last year’s MBA students was selected. The following contingency table summarizes the relevant data.
The observed values

Undergraduate            MBA Major
Degree        Accounting   Finance   Marketing   Total
BA                31          13        16         60
BENG               8          16         7         31
BBA               12          10        17         39
Other             10           5         7         22
Total             61          44        47        152

There are two ways to address the problem:
– If each undergraduate degree is considered a population, do these populations differ?
– If each classification is considered a nominal variable, are these two variables dependent?
• Solution
– The hypotheses are:
H0: The two variables are independent
H1: The two variables are dependent
– The test statistic
$\chi^2 = \sum_{i=1}^{k} \frac{(f_i - e_i)^2}{e_i}$
where k is the number of cells in the contingency table.
– The rejection region
$\chi^2 > \chi^2_{\alpha,\,(r-1)(c-1)}$
Estimating the expected frequencies
Since ei = npi but pi is unknown, we need to estimate the unknown probability from the data, assuming H0 is true.
Under the null hypothesis the two variables are independent:
P(Accounting and BA) = P(Accounting)*P(BA) = [61/152][60/152].

Undergraduate            MBA Major
Degree        Accounting   Finance   Marketing   Total   Probability
BA                                                 60      60/152
BENG                                               31      31/152
BBA                                                39      39/152
Other                                              22      22/152
Total             61          44        47        152
Probability     61/152      44/152    47/152

The number of students expected to fall in the cell “Accounting - BA” is
eAcct-BA = n(pAcct-BA) = 152(61/152)(60/152) = [61*60]/152 = 24.08
The number of students expected to fall in the cell “Finance - BBA” is
eFinance-BBA = n(pFinance-BBA) = 152(44/152)(39/152) = [44*39]/152 = 11.29
The expected frequencies for a contingency table
• The expected frequency of the cell in row i and column j of the contingency table is calculated by
$e_{ij} = \frac{(\text{Column } j \text{ total})(\text{Row } i \text{ total})}{\text{Sample size}}$
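The expected-frequency rule can be sketched in a few lines using the observed counts of Example 16.2. The function name and the list-of-rows layout are illustrative assumptions.

```python
# Observed counts from Example 16.2; columns are Accounting, Finance, Marketing.
observed = [
    [31, 13, 16],  # BA
    [8, 16, 7],    # BENG
    [12, 10, 17],  # BBA
    [10, 5, 7],    # Other
]

def expected_frequencies(table):
    """e_ij = (row i total) * (column j total) / sample size."""
    row_totals = [sum(row) for row in table]
    column_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in column_totals] for r in row_totals]

expected = expected_frequencies(observed)
print(round(expected[0][0], 2))  # Accounting - BA: 24.08
print(round(expected[2][1], 2))  # Finance - BBA: 11.29
```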
Calculation of the χ² statistic
Observed frequencies, with the expected frequencies in parentheses:

Undergraduate                MBA Major
Degree        Accounting    Finance       Marketing     Total
BA            31 (24.08)    13 (17.37)    16 (18.55)      60
BENG           8 (12.44)    16 (8.97)      7 (9.58)       31
BBA           12 (15.65)    10 (11.29)    17 (12.06)      39
Other         10 (8.83)      5 (6.37)      7 (6.80)       22
Total         61            44            47             152

$\chi^2 = \sum_{i=1}^{k} \frac{(f_i - e_i)^2}{e_i} = \frac{(31 - 24.08)^2}{24.08} + \ldots + \frac{(5 - 6.37)^2}{6.37} + \ldots + \frac{(7 - 6.80)^2}{6.80} = 14.70$
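The full sum over all twelve cells is tedious by hand. The sketch below, with hypothetical function and variable names, recomputes the statistic and the degrees of freedom from the observed table of Example 16.2.

```python
def contingency_chi_squared(table):
    """Return (chi-squared statistic, degrees of freedom) for a two-way table."""
    row_totals = [sum(row) for row in table]
    column_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, f in enumerate(row):
            e = row_totals[i] * column_totals[j] / n  # expected frequency under H0
            statistic += (f - e) ** 2 / e
    df = (len(table) - 1) * (len(table[0]) - 1)  # (r - 1)(c - 1)
    return statistic, df

observed = [
    [31, 13, 16],  # BA
    [8, 16, 7],    # BENG
    [12, 10, 17],  # BBA
    [10, 5, 7],    # Other
]
statistic, df = contingency_chi_squared(observed)
print(f"{statistic:.2f}", df)  # 14.70 6
```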
• Solution – continued
– The critical value in our example is:
$\chi^2_{\alpha,\,(r-1)(c-1)} = \chi^2_{.05,\,(4-1)(3-1)} = 12.5916$
• Conclusion: Since χ² = 14.70 > 12.5916, there is sufficient evidence to infer at the 5% significance level that students’ undergraduate degree and their choice of MBA major are dependent.
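When the degrees of freedom are even, the χ² upper-tail probability reduces to a finite sum, which allows a quick check of the critical value and of the p-value reported in the computer output for this example. This is a sketch valid only for even degrees of freedom; the helper name is hypothetical.

```python
import math

def chi2_sf_even_df(x, df):
    """P(chi-squared > x) when df is a positive even integer.

    Uses exp(-x/2) * sum_{j < df/2} (x/2)^j / j!.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form only covers positive even df")
    half_x = x / 2.0
    return math.exp(-half_x) * sum(
        half_x ** j / math.factorial(j) for j in range(df // 2)
    )

p_value = chi2_sf_even_df(14.7019, 6)
tail_at_critical = chi2_sf_even_df(12.5916, 6)

print(round(p_value, 4))           # 0.0227
print(round(tail_at_critical, 3))  # 0.05
```

Since the p-value is below .05, the decision agrees with the comparison of 14.70 against 12.5916.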
Using the computer
Define a code to specify each nominal value. Input the data in columns, one column for each category.

Code:
Undergraduate degree: 1 = BA, 2 = BENG, 3 = BBA, 4 = OTHERS
MBA Major: 1 = ACCOUNTING, 2 = FINANCE, 3 = MARKETING

Degree   MBA Major
3        1
1        1
1        1
1        1
2        2
1        3
.        .
.        .

Select the Chi squared / raw data option from Data Analysis Plus under Tools. See Xm16-02.

Contingency Table
          1     2     3   Total
1        31    13    16     60
2         8    16     7     31
3        12    10    17     39
4        10     5     7     22
Total    61    44    47    152
Test Statistic CHI-Squared = 14.7019
P-Value = 0.0227
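Outside Excel, coded raw data of this kind can be cross-tabulated in a few lines. The sketch below uses only the six coded rows visible in the listing; the rest of the data file is not reproduced here, so the resulting counts are not the full contingency table. All names are illustrative.

```python
from collections import Counter

DEGREE = {1: "BA", 2: "BENG", 3: "BBA", 4: "OTHERS"}
MAJOR = {1: "ACCOUNTING", 2: "FINANCE", 3: "MARKETING"}

# (degree code, major code) pairs: only the rows shown in the listing.
raw_rows = [(3, 1), (1, 1), (1, 1), (1, 1), (2, 2), (1, 3)]

def cross_tabulate(rows):
    """Count how many observations fall in each (degree, major) cell."""
    counts = Counter(rows)
    return [[counts.get((d, m), 0) for m in sorted(MAJOR)] for d in sorted(DEGREE)]

table = cross_tabulate(raw_rows)
for code, row in zip(sorted(DEGREE), table):
    print(DEGREE[code], row)
```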
Required condition – the rule of five
– The χ² distribution provides an adequate approximation to the sampling distribution under the condition that eij >= 5 for all the cells.
– When eij < 5, rows or columns must be combined so that the condition is met.

Example (expected frequencies in parentheses):

Column 1     Column 2     Column 3
10 (10.1)    14 (12.8)    4 (5.1)
12 (12.7)    16 (16.0)    7 (6.3)
 8 (7.2)      8 (9.2)     4 (3.6)

We combine columns 2 and 3:

Column 1     Columns 2 and 3
10 (10.1)    14 + 4 = 18 (12.8 + 5.1 = 17.9)
12 (12.7)    16 + 7 = 23 (16.0 + 6.3 = 22.3)
 8 (7.2)      8 + 4 = 12 (9.2 + 3.6 = 12.8)
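The column merge in this example can be sketched as follows. The helper names are hypothetical, and which columns to merge remains a judgment call for the analyst; the code only performs the addition and rechecks the rule of five.

```python
def combine_columns(table, keep, absorb):
    """Add column `absorb` into column `keep` (keep < absorb) and drop `absorb`."""
    if keep >= absorb:
        raise ValueError("expected keep < absorb")
    combined = []
    for row in table:
        new_row = list(row)
        new_row[keep] += new_row[absorb]
        del new_row[absorb]
        combined.append(new_row)
    return combined

def violates_rule_of_five(expected):
    """True if any expected cell frequency is below 5."""
    return any(e < 5 for row in expected for e in row)

observed = [[10, 14, 4], [12, 16, 7], [8, 8, 4]]
expected = [[10.1, 12.8, 5.1], [12.7, 16.0, 6.3], [7.2, 9.2, 3.6]]

observed_combined = combine_columns(observed, keep=1, absorb=2)
expected_combined = combine_columns(expected, keep=1, absorb=2)

print(violates_rule_of_five(expected))           # True (3.6 < 5)
print(observed_combined)                         # [[10, 18], [12, 23], [8, 12]]
print(violates_rule_of_five(expected_combined))  # False
```

Merging columns reduces c by one, so the degrees of freedom (r - 1)(c - 1) must be recomputed for the combined table.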
16.5 Chi-Squared test for Normality
• The goodness-of-fit Chi-squared test can be used to determine whether data were drawn from a specified distribution.
• The general procedure:
– Hypothesize the parameter values of the distribution we test (i.e. μ0, σ0 for the normal distribution).
– For the variable tested, X, specify disjoint ranges that cover all its possible values.
– Build a Chi-squared statistic that (in aggregate) compares the expected frequency under H0 and the actual frequency of observations that fall in each range.
– Run a goodness-of-fit test based on the multinomial experiment.
• Testing for normality in Example 12.1: For a sample of size n = 50 (see Xm12-01), the sample mean was 460.38 with a standard deviation of 38.83. Can we infer from the data provided that this sample was drawn from a normal distribution with μ = 460.38 and σ = 38.83? Use a 5% significance level.
χ² test for normality
Solution
First, let us select z values that define each cell (expected frequency > 5 for each cell):
z1 = -1; P(z < -1) = p1 = .1587; e1 = np1 = 50(.1587) = 7.94
z2 = 0; P(-1 < z < 0) = p2 = .3413; e2 = np2 = 50(.3413) = 17.07
z3 = 1; P(0 < z < 1) = p3 = .3413; e3 = 17.07
P(z > 1) = p4 = .1587; e4 = 7.94
The cell boundaries are calculated from the corresponding z values under H0:
z1 = (x1 - 460.38)/38.83 = -1; x1 = 421.55
Likewise, x2 = 460.38 and x3 = 499.21.
The expected frequencies can now be determined for each cell.
[Figure: normal curve split at 421.55, 460.38 and 499.21 into four cells with probabilities .1587, .3413, .3413, .1587 and expected frequencies e1 = 7.94, e2 = 17.07, e3 = 17.07, e4 = 7.94.]
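The cell boundaries, probabilities and expected frequencies can be recomputed from the standard normal distribution. The sketch below uses the error function from the standard library; the variable names are illustrative.

```python
import math

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

mu, sigma, n = 460.38, 38.83, 50   # hypothesized parameters and sample size
z_cuts = [-1.0, 0.0, 1.0]          # z values that separate the four cells

boundaries = [mu + z * sigma for z in z_cuts]
cdf_values = [0.0] + [normal_cdf(z) for z in z_cuts] + [1.0]
probabilities = [hi - lo for lo, hi in zip(cdf_values, cdf_values[1:])]
expected = [n * p for p in probabilities]

print([round(x, 2) for x in boundaries])     # [421.55, 460.38, 499.21]
print([round(p, 4) for p in probabilities])  # [0.1587, 0.3413, 0.3413, 0.1587]
print([round(e, 2) for e in expected])
```

The unrounded probabilities give expected frequencies of about 7.93 and 17.07; the 7.94 in the text comes from multiplying the rounded probability .1587 by 50.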
χ² test for normality
– The observed frequencies are f1 = 10, f2 = 13, f3 = 19, f4 = 8; the expected frequencies are e1 = 7.94, e2 = 17.07, e3 = 17.07, e4 = 7.94.
– The test statistic
$\chi^2 = \sum_{i=1}^{k} \frac{(f_i - e_i)^2}{e_i} = \frac{(10 - 7.94)^2}{7.94} + \frac{(13 - 17.07)^2}{17.07} + \frac{(19 - 17.07)^2}{17.07} + \frac{(8 - 7.94)^2}{7.94} = 1.72$
– The rejection region
$\chi^2 > \chi^2_{\alpha,\,k-1-L}$, where L is the number of parameters estimated from the data.
Here L = 2, so
$\chi^2_{\alpha,\,k-3} = \chi^2_{.05,\,4-3} = 3.84146$
Conclusion: Since 1.72 < 3.84146, there is insufficient evidence to conclude at the 5% significance level that the data are not normally distributed.
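The decision for the normality test can be sketched as follows, using the observed and expected frequencies from the example and the critical value quoted in the text. With 1 degree of freedom the upper-tail probability equals erfc(sqrt(x/2)); the variable and function names are hypothetical.

```python
import math

observed = [10, 13, 19, 8]
expected = [7.94, 17.07, 17.07, 7.94]
estimated_parameters = 2  # L: the mean and the standard deviation

statistic = sum((f - e) ** 2 / e for f, e in zip(observed, expected))
df = len(observed) - 1 - estimated_parameters  # k - 1 - L

def chi2_sf_1df(x):
    """P(chi-squared > x) for exactly 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))

critical_value = 3.84146  # chi-squared critical value for alpha = .05, df = 1 (from the text)
p_value = chi2_sf_1df(statistic)
reject_h0 = statistic > critical_value

print(f"{statistic:.2f}", df, reject_h0)  # 1.72 1 False
```

The p-value is well above .05, consistent with not rejecting normality.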