Chi square

CHI-SQUARE AND ANALYSIS OF VARIANCE


Transcript of Chi square

Page 1: Chi square

CHI-SQUARE AND ANALYSIS OF VARIANCE

Page 2: Chi square

PROBABILITY DISTRIBUTIONS

In this session:

Chi-Square as a Test of Independence

Contingency Tables

Chi-Square as a Test of Goodness of Fit

Analysis of Variance

Page 3: Chi square

- the B-school

When do we use Chi-Square?

The Chi-Square test is used when

1. We have to compare the proportions of more than two populations.

2. We have to determine if certain attributes of a population are independent of each other. E.g. if we classify a population with respect to two attributes such as age and job performance, we can use a Chi-Square test to establish whether the two attributes are independent of each other.

Page 4: Chi square


Chi-Square as a test of independence

If a population is classified into categories, the dependence or independence of the categorical variables may be established through a contingency table using the Chi-Square test.

An m×n contingency table shows the observed frequencies for two categorical variables (say A and B) arranged in m rows and n columns. The sum of all observed frequencies is N, the sample size.

If a sampled individual has the ith value for A and the jth value for B then this individual is assigned to the (i,j)th cell of the contingency table.

Page 5: Chi square


Chi-Square as a test of independence

Two attributes A and B are independent if the value of one has no influence on the value of the other.

Test H0: ‘A and B are independent’ against HA: ‘A and B are not independent’.

Assume H0 to be true and calculate the expected frequencies,

$$E_{ij} = \frac{(\text{sum of } i\text{th row}) \times (\text{sum of } j\text{th column})}{N}$$

Chi-square statistic,

$$\chi^2_c = \sum_{i,j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$
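The two formulas above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides: the function names `expected_frequencies` and `chi_square_statistic` and the 2×2 table are hypothetical and chosen only for demonstration.

```python
def expected_frequencies(observed):
    """Expected counts under H0: E_ij = (row i total) * (column j total) / N."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


def chi_square_statistic(observed):
    """Sum of (O_ij - E_ij)^2 / E_ij over every cell of the table."""
    expected = expected_frequencies(observed)
    return sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )


# Hypothetical 2x2 table, used only to exercise the two functions.
table = [[10, 20], [30, 40]]
expected = expected_frequencies(table)
statistic = chi_square_statistic(table)
print(expected)
print(round(statistic, 4))
```

The table is passed as a list of rows, so the same functions work for any m×n contingency table whose expected counts are all non-zero.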

Page 6: Chi square


The Chi-Square test statistic

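The Chi-Square test statistic, $\chi^2_c$, is a random variable with its own probability distribution, known as the Chi-square distribution.

When the sample size N is large, the probability histogram of $\chi^2_c$ can be approximated by a chi-square curve with k = (m-1)×(n-1) degrees of freedom, where m and n are the numbers of rows and columns of the contingency table.

Illustration for a 3×4 table: with the row and column totals fixed, only the (3-1)×(4-1) = 6 cells marked ___ can be filled in freely; the cells marked * are then determined.

         Col 1   Col 2   Col 3   Col 4
Row 1    ___     ___     ___     *
Row 2    ___     ___     ___     *
Row 3    *       *       *       *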
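As a rough sketch of how the degrees of freedom and the chi-square curve are used, the snippet below computes k for a table and the upper-tail area of the chi-square distribution. The helper names `degrees_of_freedom` and `chi2_upper_tail` are hypothetical, and the tail area is obtained from a standard series for the regularized incomplete gamma function rather than from any method given in the slides.

```python
import math


def degrees_of_freedom(rows, cols):
    """k = (m - 1) * (n - 1) for an m x n contingency table."""
    return (rows - 1) * (cols - 1)


def chi2_upper_tail(x, k, terms=500):
    """P(chi-square with k df > x).

    Uses the series for the regularized lower incomplete gamma function
    P(a, z) with a = k/2 and z = x/2; adequate for the moderate values
    of x that appear in textbook tables.
    """
    if x <= 0:
        return 1.0
    a, z = k / 2.0, x / 2.0
    term = 1.0 / a
    total = term
    for n in range(1, terms):
        term *= z / (a + n)
        total += term
    lower = total * math.exp(-z + a * math.log(z) - math.lgamma(a))
    return 1.0 - lower


k = degrees_of_freedom(3, 4)          # the 3x4 layout gives 6 free cells
tail_3df = chi2_upper_tail(7.815, 3)  # area to the right of 7.815 with 3 df
tail_2df = chi2_upper_tail(5.991, 2)  # area to the right of 5.991 with 2 df
print(k, round(tail_3df, 3), round(tail_2df, 3))
```

A tail area close to 0.05 for a tabulated value indicates that the value is the 5% critical point for that number of degrees of freedom.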

Page 7: Chi square


Contingency Table - Example

A brand manager is concerned that her brand’s share may be unevenly distributed throughout the country. In a survey in which the country was divided into four geographical regions, a random sample of 100 consumers in each region was surveyed, with the following results.

                      NE     NW     SE     SW     TOTAL
Purchase the brand    40     55     45     50     190
Do not purchase       60     45     55     50     210
Total                 100    100    100    100    400

(a) Construct the contingency table and calculate the chi-square statistic.

(b) State the null and alternative hypotheses.

(c) At α = 0.05, test whether brand share depends on the region OR the brand share is the same across the 4 regions.

Page 8: Chi square


Contingency Table

                      NE            NW            SE            SW            TOTAL
Purchase the brand    O11 = 40      O12 = 55      O13 = 45      O14 = 50      190
                      E11 = 47.5    E12 = 47.5    E13 = 47.5    E14 = 47.5
Do not purchase       O21 = 60      O22 = 45      O23 = 55      O24 = 50      210
                      E21 = 52.5    E22 = 52.5    E23 = 52.5    E24 = 52.5
Total                 100           100           100           100           400

H0: Purchasing is independent of region.

HA: Purchasing depends on the region.

OR

H0: P_NE = P_NW = P_SE = P_SW

HA: Not all proportions are equal (at least two are unequal)

Page 9: Chi square

The Chi-Square test statistic

If $\chi^2_c > \chi^2_{R,k}$, reject H0 and accept HA. If $\chi^2_c \le \chi^2_{R,k}$, accept H0 and reject HA.

$\chi^2_c = 5.012$ and $\chi^2_{R,k} = 7.815$ for k = (2-1)×(4-1) = 1×3 = 3 df and α = 0.05.

Since $\chi^2_c = 5.012 < 7.815 = \chi^2_{R,k}$, we accept H0 that purchasing is independent of region.

(Use CHIINV)
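The brand-share calculation can be reproduced with a short script. This is only a sketch: the variable names are illustrative, and the critical value 7.815 is taken from the slide rather than computed.

```python
# Observed counts from the survey: columns are NE, NW, SE, SW.
observed = [
    [40, 55, 45, 50],  # purchase the brand
    [60, 45, 55, 50],  # do not purchase
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected counts under H0 (purchasing independent of region).
expected = [[r * c / n for c in col_totals] for r in row_totals]

chi2 = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)
df = (len(observed) - 1) * (len(observed[0]) - 1)

critical = 7.815  # tabulated chi-square value for 3 df, alpha = 0.05 (from the slide)
decision = "reject H0" if chi2 > critical else "do not reject H0"
print(round(chi2, 3), df, decision)
```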

Page 10: Chi square


Exercise

A researcher studying the relationship between having a particular disease and the addiction of individuals interviewed 32 male subjects. For each individual, the researcher recorded his disease status (Y = Yes, N = No) and addiction type (Type 1, Type 2, Type 3) as shown.

Disease status    N N N Y N N N N Y N N Y N N N N
Addiction type    1 1 2 1 1 1 3 1 1 1 1 2 1 1 2 1

Disease status    Y N Y Y N Y N N N N N Y N Y N N
Addiction type    1 3 1 2 3 1 1 1 1 1 3 1 3 2 1 2

From the above dataset, can the researcher conclude that males with addiction types 2 or 3 are more likely to have the disease than those with addiction type 1?

Page 11: Chi square


Exercise

H0: Disease is independent of addiction type

HA: Disease is prevalent among those who have addiction types 2 and 3

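                 TYPE 1         TYPE 2        TYPE 3        Row totals
N                O11 = 15       O12 = 3       O13 = 5       23
                 E11 = 15.09    E12 = 4.31    E13 = 3.59
Y                O21 = 6        O22 = 3       O23 = 0       9
                 E21 = 5.91     E22 = 1.69    E23 = 1.41
Column totals    21             6             5             32

$$\chi^2_c = \sum_{i,j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}} = 3.379$$

$\chi^2_{R,k} = 5.99$ for k = (2-1)×(3-1) = 1×2 = 2 df and α = 0.05.

Since $\chi^2_c = 3.379 < 5.99 = \chi^2_{R,k}$, we accept H0: the data do not show that disease status depends on addiction type.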
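The exercise can be checked by tallying the 32 raw observations into the 2×3 table and summing over all six cells, including the Type 3 / Y cell, whose expected count is 9 × 5 / 32 ≈ 1.41. The sketch below uses illustrative variable names and takes the critical value 5.99 from the slide.

```python
from collections import Counter

# Raw data from the exercise, in subject order (first 16, then the next 16).
disease = (
    "N N N Y N N N N Y N N Y N N N N "
    "Y N Y Y N Y N N N N N Y N Y N N"
).split()
addiction = [
    int(t)
    for t in (
        "1 1 2 1 1 1 3 1 1 1 1 2 1 1 2 1 "
        "1 3 1 2 3 1 1 1 1 1 3 1 3 2 1 2"
    ).split()
]

# Assign each subject to the (disease status, addiction type) cell.
counts = Counter(zip(disease, addiction))
observed = [[counts[(status, t)] for t in (1, 2, 3)] for status in ("N", "Y")]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)
expected = [[r * c / n for c in col_totals] for r in row_totals]

chi2 = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)
critical = 5.99  # tabulated chi-square value for 2 df, alpha = 0.05 (from the slide)
print(observed, round(chi2, 3), chi2 > critical)
```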

Page 12: Chi square


Chi-Square as a Test of Goodness of Fit: Testing the Appropriateness of a Distribution

The Chi-Square test can also be used to decide whether a particular probability distribution such as the Binomial, Poisson or Normal is the appropriate distribution for representing a given data set.

The Chi-Square test enables us to test whether there is a significant difference between the observed frequency distribution and the theoretical distribution.

Page 13: Chi square


Chi-Square as a Test of Goodness of Fit: Testing the Appropriateness of a Distribution

The salesman of a paper company has five accounts to visit per day. It is suggested that his number of sales per day be described by a binomial distribution, with the probability of selling to each account being 0.4. Given the following frequency distribution of the number of sales per day, can we conclude that the data do not follow the suggested distribution? Use the 0.05 significance level.

No. of sales per day    0     1     2     3     4     5
Frequency               10    41    60    20    6     3

Page 14: Chi square


Chi-Square as a Test of Goodness of Fit: Testing the Appropriateness of a Distribution

Solution:

H0: A binomial distribution with p = 0.4 is a good description of the sales.

HA: A binomial distribution with p = 0.4 is NOT a good description of the sales.

k (degrees of freedom) = 5, α = 0.05

Chi-Square statistic,

$$\chi^2_c = \sum \frac{(f_o - f_e)^2}{f_e} = 11.9413$$

We reject H0 (that the distribution is well described by a binomial distribution with p = 0.4 and n = 5) since

$$\chi^2_c > \chi^2_{R,k} = \chi^2_{R,5,0.05} = 11.070$$
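The goodness-of-fit statistic can be reproduced by generating the expected frequencies from the binomial probabilities. This is a minimal sketch with illustrative variable names; the critical value 11.070 is taken from the slide.

```python
from math import comb

observed = [10, 41, 60, 20, 6, 3]  # days with 0, 1, ..., 5 sales
n_trials, p = 5, 0.4               # five accounts per day, P(sale) = 0.4
total_days = sum(observed)

# Expected frequency for x sales = total days * binomial probability of x.
expected = [
    total_days * comb(n_trials, x) * p**x * (1 - p) ** (n_trials - x)
    for x in range(n_trials + 1)
]

chi2 = sum((fo - fe) ** 2 / fe for fo, fe in zip(observed, expected))

critical = 11.070  # tabulated chi-square value for 5 df, alpha = 0.05 (from the slide)
decision = "reject H0" if chi2 > critical else "do not reject H0"
print([round(fe, 2) for fe in expected])
print(round(chi2, 4), decision)
```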

Page 15: Chi square


What is Analysis of Variance (ANOVA) ?

ANOVA

1. Enables us to test for the significance of the differences among more than two sample means.

2. Using ANOVA we can make inferences about whether our samples are drawn from populations having the same mean, e.g. it is used in research on the evaluation of new drugs, effects of diseases, frequency of medication etc.

3. ASSUMPTION: Each of the samples is drawn from a normal population, and each of these populations has the same variance σ².

4. ANOVA helps us to compare two different estimates of the variance of our overall population (the variance among the sample means and the variance within the samples).

Page 16: Chi square


ANOVA – An Example

Three training methods were compared to see if they led to greater productivity after training. The following are the productivity measures for individuals trained by each method.

Method 1    45  40  50  39  53  44
Method 2    59  43  47  51  39  49
Method 3    41  37  43  40  52  37

Use ANOVA at the 0.05 level of significance to determine whether these training methods lead to different levels of productivity.

Page 17: Chi square


ANOVA –AN EXAMPLE

Solution Method:

Statement of Hypothesis:

H0: µ1 = µ2 = µ3

HA: µ1, µ2 and µ3 are not all equal

Step 1: Calculate the grand mean, $\bar{\bar{X}}$

Step 2: Calculate the three sample means, $\bar{x}_1, \bar{x}_2, \bar{x}_3$

Step 3: Calculate the three sample variances, $s_1^2, s_2^2, s_3^2$

Step 4: Estimate the between-column variance, i.e. the variance among the sample means,

$$\hat{\sigma}_b^2 = \frac{\sum n_i (\bar{x}_i - \bar{\bar{X}})^2}{k - 1}$$

Step 5: Estimate the within-column variance, i.e. the variance within the samples,

$$\hat{\sigma}_w^2 = \sum \left( \frac{n_i - 1}{n_T - k} \right) s_i^2$$

where n_i = size of the ith sample, n_T = total sample size and k = number of samples.

Step 6: Calculate the F-ratio/F-statistic,

$$F = \frac{\hat{\sigma}_b^2}{\hat{\sigma}_w^2}$$
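The six steps can be sketched as one small function. The name `one_way_anova_f` is hypothetical, the sample variances use the n - 1 denominator that the weights (n_i - 1) in Step 5 assume, and the data are the productivity measures from the training-method example.

```python
def one_way_anova_f(samples):
    """Return (between-column variance, within-column variance, F-ratio)."""
    k = len(samples)
    n_total = sum(len(s) for s in samples)

    # Step 1: grand mean of all observations.
    grand_mean = sum(sum(s) for s in samples) / n_total
    # Step 2: sample means.
    means = [sum(s) / len(s) for s in samples]
    # Step 3: sample variances (denominator n_i - 1).
    variances = [
        sum((x - m) ** 2 for x in s) / (len(s) - 1)
        for s, m in zip(samples, means)
    ]
    # Step 4: between-column variance.
    between = sum(
        len(s) * (m - grand_mean) ** 2 for s, m in zip(samples, means)
    ) / (k - 1)
    # Step 5: within-column variance, weighting each s_i^2 by (n_i - 1)/(n_T - k).
    within = sum(
        (len(s) - 1) / (n_total - k) * v for s, v in zip(samples, variances)
    )
    # Step 6: F-ratio.
    return between, within, between / within


training = [
    [45, 40, 50, 39, 53, 44],  # Method 1
    [59, 43, 47, 51, 39, 49],  # Method 2
    [41, 37, 43, 40, 52, 37],  # Method 3
]
between, within, f_ratio = one_way_anova_f(training)
print(round(between, 2), round(within, 2), round(f_ratio, 2))
```

Because the function works from the group sizes rather than a fixed n, it also handles samples of unequal size.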

Page 18: Chi square


ANOVA –AN EXAMPLE

Solution Method:

Statement of Hypothesis:

H0: µ1 = µ2 = µ3

HA: µ1, µ2 and µ3 are not all equal

Step 1: $\bar{\bar{X}} = 44.94$

Step 2: $\bar{x}_1 = 45.17,\ \bar{x}_2 = 48,\ \bar{x}_3 = 41.67$

Step 3: $s_1^2 = 30.17,\ s_2^2 = 47.60,\ s_3^2 = 31.07$

Step 4:

$$\hat{\sigma}_b^2 = \frac{\sum n_i (\bar{x}_i - \bar{\bar{X}})^2}{k - 1} = \frac{120.78}{3 - 1} = 60.39$$

Step 5:

$$\hat{\sigma}_w^2 = \sum \left( \frac{n_i - 1}{n_T - k} \right) s_i^2 = \frac{5}{15}(30.17 + 47.60 + 31.07) = \frac{108.84}{3} = 36.28$$

Step 6: F-ratio/F-statistic,

$$F = \frac{\hat{\sigma}_b^2}{\hat{\sigma}_w^2} = \frac{60.39}{36.28} = 1.66$$

Page 19: Chi square

The F - Distribution

The F statistic has a particular sampling distribution called the F-distribution.

Similar to the t and chi-square distributions, the F-distribution represents a family of distributions.

Each distribution has a pair of degrees of freedom (a, b), where

a = no. of df of the numerator ($\hat{\sigma}_b^2$) = k – 1

b = no. of df of the denominator ($\hat{\sigma}_w^2$) = nT – k

For our problem

a = k – 1 = 3 – 1 = 2

b = nT – k = 18 – 3 = 15

Page 20: Chi square

The F - Distribution

The F-table at the 0.05 level of significance with (2, 15) as the df indicates that 3.68 (the critical F value) is the upper limit of the acceptance region.

Since our F-ratio = 1.66 is within this region, we accept the null hypothesis and conclude that there are no significant differences in the effects of the three training methods on employee productivity.

The above example illustrates one-way ANOVA, since we have considered only one factor, i.e. the effect of training method on employee productivity.
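The tabulated value 3.68 can be cross-checked without a table, because an F-distribution whose numerator has exactly 2 df has a closed-form upper-tail area, P(F > f) = (1 + 2f/b)^(-b/2). The sketch below relies on that special case only; the function names are hypothetical and it does not apply to other numerator df.

```python
def f_upper_tail_df1_2(f, df2):
    """P(F > f) for an F-distribution with (2, df2) degrees of freedom."""
    return (1 + 2 * f / df2) ** (-df2 / 2)


def f_critical_df1_2(alpha, df2):
    """Value of f whose upper-tail area is alpha, for (2, df2) df."""
    return (df2 / 2) * (alpha ** (-2 / df2) - 1)


critical = f_critical_df1_2(0.05, 15)     # upper limit of the acceptance region
tail_at_table_value = f_upper_tail_df1_2(3.68, 15)
print(round(critical, 2), round(tail_at_table_value, 3))
```

For other pairs of degrees of freedom an F-table or a statistics package is needed.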

Page 21: Chi square


ANOVA – EXERCISE

The following data show the number of claims processed per day for a group of four insurance company employees observed for a number of days. Using ANOVA, test the hypothesis that the employees’ mean claims per day are all the same. Use the 0.05 level of significance.

Employee 1    15  17  14  12
Employee 2    12  10  13  17
Employee 3    11  14  13  15  12
Employee 4    13  12  12  14  10   9

Page 22: Chi square


ANOVA – EXERCISE

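Solution Method:

Statement of Hypothesis:

H0: µ1 = µ2 = µ3 = µ4

HA: µ1, µ2, µ3, µ4 are not all equal

Step 1: $\bar{\bar{X}} = 12.89$

Step 2: $\bar{x}_1 = 14.5,\ \bar{x}_2 = 13,\ \bar{x}_3 = 13,\ \bar{x}_4 = 11.67$

Step 3: $s_1^2 = 4.33,\ s_2^2 = 8.67,\ s_3^2 = 2.5,\ s_4^2 = 3.47$

Step 4:

$$\hat{\sigma}_b^2 = \frac{\sum n_i (\bar{x}_i - \bar{\bar{X}})^2}{k - 1} = \frac{19.456}{4 - 1} = 6.49$$

Step 5:

$$\hat{\sigma}_w^2 = \sum \left( \frac{n_i - 1}{n_T - k} \right) s_i^2 = \frac{3}{15}(4.33) + \frac{3}{15}(8.67) + \frac{4}{15}(2.5) + \frac{5}{15}(3.47) = 4.42$$

Step 6: F-ratio/F-statistic,

$$F = \frac{\hat{\sigma}_b^2}{\hat{\sigma}_w^2} = \frac{6.49}{4.42} = 1.47$$

Step 7: Critical F value (with df (3,15), α = 0.05) = 3.29

Since F = 1.47 is less than 3.29, we accept H0: the employees’ mean claims per day do not differ significantly.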
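The exercise has unequal sample sizes, so it is a useful check that the weights in Steps 4 and 5 are applied per sample. The sketch below uses Python's standard `statistics` module (`mean` and `variance`, the latter with the n - 1 denominator); the variable names are illustrative and the critical value 3.29 is taken from Step 7.

```python
from statistics import mean, variance

claims = {
    "Employee 1": [15, 17, 14, 12],
    "Employee 2": [12, 10, 13, 17],
    "Employee 3": [11, 14, 13, 15, 12],
    "Employee 4": [13, 12, 12, 14, 10, 9],
}
samples = list(claims.values())

k = len(samples)
n_total = sum(len(s) for s in samples)
grand_mean = sum(sum(s) for s in samples) / n_total

# Between-column variance: weighted squared deviations of sample means.
between = sum(len(s) * (mean(s) - grand_mean) ** 2 for s in samples) / (k - 1)
# Within-column variance: sample variances weighted by (n_i - 1) / (n_T - k).
within = sum((len(s) - 1) * variance(s) for s in samples) / (n_total - k)
f_ratio = between / within

critical_f = 3.29  # tabulated F for df (3, 15), alpha = 0.05 (from Step 7)
decision = "reject H0" if f_ratio > critical_f else "do not reject H0"
print(round(grand_mean, 2), round(between, 2), round(within, 2), round(f_ratio, 2), decision)
```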