Chi-square Distribution and Its Applications

8/10/2019 Chi-square Distribution and Its Applications


CHI-SQUARE DISTRIBUTION AND ITS APPLICATIONS

Introduction

The science of statistics comprises two parts: descriptive statistics and inferential statistics.

Descriptive statistics deals with the data and information base of analysis. It is concerned with the condensation and preparation of data for presentation through classification and tabulation, and through summary statistics such as measures of central tendency and measures of variation. It also covers the graphical presentation of data; graphs portray the basic features of the frequency distribution. Descriptive statistics can also deal with information on qualitative attributes, which falls within the purview of non-parametric statistics.

In contrast to descriptive statistics, inferential statistics goes deeper to fathom what underlies the data. It focuses on the analysis of data in order to arrive at conclusions or inferences. It deals with i) the estimation of the values of population parameters; and ii) hypothesis testing. The hypotheses may relate to a) the values of population parameters; b) the representative character of one or more samples, which may involve evaluating the differences between the sample estimates and the true values; or c) the relationships between two or more variables. Inferential statistics has the theory of probability as its base. Probability theory may, however, be applied to either parametric or non-parametric statistics. Hence, both descriptive and inferential statistics are further classifiable into i) parametric statistics; and ii) non-parametric statistics. Parametric statistics revolves around variables that are cardinally measurable, whereas non-parametric statistics deals with nominally or ordinally measurable qualitative attributes of the phenomena under study.

The following diagram portrays this classification:

[Figure: Parametric and Non-parametric Statistics. Statistics divides into Descriptive Statistics and Inferential Statistics. Descriptive Statistics divides into Parametric Descriptive Statistics and Non-Parametric Descriptive Statistics; Inferential Statistics divides into Parametric Inferential Statistics and Non-Parametric Inferential Statistics.]



Sampling theory constitutes a large part of parametric statistics, which uses numerous test statistics. The test statistics discussed earlier enable us to test the differences between sample and population means, differences between two sample means, the equality of more than two sample means, the equality of sample and/or population variances, etc. All these tests assume either that the variate values are normally distributed or that, under certain conditions, they approximate the normal distribution well. The z and t tests are the important tests in parametric statistics, and the F test complements them. These tests are frequently used to evaluate one, two or multiple sample estimates. The causal relations between two or more cardinally measured variables may also be evaluated by these tests. But numerous research investigations focus exclusively on ordinally or even nominally measured qualitative attributes and their classification into distinct categories. Since the population in such cases has no parameters as such, there is no need for parametric sample estimates. Such investigations do, however, involve the testing of hypotheses. These hypotheses may pertain to one or more samples, and the study may relate to one, two or even more attributes. Where two or more attributes are involved, the analysis may focus on the inter-relations among them. Rigorous analysis of such data needs non-parametric methods. Non-parametric statistics is now well developed, and several non-parametric tests and methods of analysis are available to meet research requirements. The choice among these tests depends on the nature of the measurement, the conditions of the problem setting and the nature of the sample(s).

χ² Distribution

The χ² distribution was discovered by Helmert in 1875. Subsequently, Karl Pearson developed it in 1900 independently of Helmert. The χ² distribution constitutes a part of non-parametric statistics of major importance, and χ² is probably the most widely used test statistic in social science research. Its large-scale use is explained by the versatility of the test statistic on the one hand, and by the newer uses that researchers have continuously evolved on the other. The χ² test mostly suits problems where:

i) the nominal scale of measurement in one sample is used;

ii) the higher scale of ordinal measurement is involved;

iii) the goodness of fit of some curve, mathematical/statistical function or model has to be tested. Karl Pearson developed χ² as a test of the goodness of fit of a mathematical/statistical distribution. For example, one may have data on the monthly earnings of 500 employees of different managerial cadres in a company. The company may be interested in knowing whether the distribution of earnings follows the normal, binomial, multinomial or hypergeometric distribution. χ² can help answer such questions, and the answer may also help the company sort out knotty HRM issues that revolve around earning profiles. A related problem is to evaluate the significance of the differences between the observed and the expected/theoretical values;

iv) two or more samples, and not only one sample, are to be studied;

v) the statistical significance of some value or characteristic is to be evaluated;

vi) multi-collinearity in econometric models is to be evaluated (Koutsoyiannis, 198 );

vii) homoskedasticity is to be tested: the values of the test statistic applied for this purpose tend to be distributed like χ² with k-1 degrees of freedom, if these are normally and independently distributed (Kane, p. 373);

viii) the association between attributes is to be evaluated; and

ix) the coefficient of contingency is to be calculated, for which the value of χ² is used.

χ² may also be defined as the sum of squares of n independently and normally distributed variables x_i, i = 1, 2, ..., n, once these have been standardized as follows:

$$z_i = \frac{x_i - \mu_i}{\sigma_i} \sim N(0, 1)$$

$$\chi^2 = \sum_{i=1}^{n} z_i^2 = \sum_{i=1}^{n} \frac{(x_i - \mu_i)^2}{\sigma_i^2}$$

with n degrees of freedom. The degrees of freedom equal the number of independent variables. The distribution is skewed towards the right; it starts from the origin and extends to +∞ in the right tail. As n increases, χ² tends more and more towards symmetry.

The mean and variance of the χ² distribution are given by

$$E(\chi^2) = n$$

$$V(\chi^2) = 2n$$

Thus, the mean equals the number of variables and the variance is twice the number of variables.
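The definition of χ² as a sum of squared standard normal variates, and the resulting mean n and variance 2n, can be checked by simulation. The following is a minimal sketch using only the Python standard library; the function name, the seed and the number of draws are illustrative assumptions, not part of the original text.

```python
import random
import statistics


def simulate_chi_square(n, draws=20000, seed=0):
    """Simulate `draws` values of the sum of n squared N(0, 1) variates."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    return [
        sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(n))
        for _ in range(draws)
    ]


n = 4
values = simulate_chi_square(n)
sample_mean = statistics.fmean(values)          # should be close to n
sample_variance = statistics.pvariance(values)  # should be close to 2n

print(f"mean     ~ {sample_mean:.2f} (theory: {n})")
print(f"variance ~ {sample_variance:.2f} (theory: {2 * n})")
```

The simulated mean and variance only approximate n and 2n; the agreement improves as the number of draws grows.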

)52(6/1)(2 kanx

The χ² distribution is of great importance in the statistical testing of hypotheses. It is used to test the correspondence between the sample data and expected values or frequencies, especially those which relate to the variance, the independence of a classification and the rank order.

It is interesting to note that the expected/theoretical values may be generated from alternative assumptions or theoretical frameworks. Each option furnishes an opportunity to formulate and evaluate the empirical validity of a set of hypotheses rather than focusing on one single hypothesis. This advantage is not available from most other tests; it is in-built in the definition given above. The test enables us to compare not only one sample value, such as the mean or the variance, but the whole set of sample values with the corresponding theoretical or expected values. It is also considered to be one of the several non-parametric test statistics. Thus, this test statistic is unique in so far as it can be used to test hypotheses relating to both qualitative and quantitative data. In some cases, χ² offers an alternative to the binomial distribution. The binomial distribution suits the problem best if the small size of the sample precludes the use of the χ² test. In large samples, however, χ² is an alternative to the binomial distribution, especially for problems with binary outcomes such as success or failure.

Definition:

$$\chi^2 = \sum \frac{(f_0 - f_e)^2}{f_e}$$

where f_0 refers to the actually observed values/frequencies, f_e to the expected values/frequencies, and n is the sample size. If the sum of independent normal variates with zero mean is itself normally distributed, the sum of the squares of these variates has the Chi-square distribution with n-1 degrees of freedom. This property furnishes the general definition and, hence, the following equation of this distribution:

$$Y = y_0 \, e^{-\chi^2/2} \, (\chi^2)^{(v-2)/2} \qquad (1)$$

where v = n-1 is the number of degrees of freedom and Y, the ordinate of the curve, is a function of these degrees of freedom. Since the degrees of freedom are generally given exogenously, there is no need to estimate them as a distributional parameter. This makes the χ² distribution parameter-free. The shape of the distribution depends upon the degrees of freedom, which is the only parameter of the distribution.
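To see how the shape of the curve depends on the degrees of freedom alone, the density can be evaluated numerically. The sketch below assumes the standard normalising constant y_0 = 1 / (2^(v/2) Γ(v/2)), which the text does not state explicitly; the function names and the crude midpoint integration are illustrative choices.

```python
import math


def chi_square_density(x, v):
    """Density of Chi-square with v degrees of freedom at x > 0.

    Assumes y0 = 1 / (2**(v/2) * Gamma(v/2)), the usual normalising constant.
    """
    log_y0 = -(v / 2) * math.log(2) - math.lgamma(v / 2)
    return math.exp(log_y0 + (v / 2 - 1) * math.log(x) - x / 2)


def integrate_density(v, upper=80.0, steps=80000):
    """Midpoint-rule area under the density on (0, upper)."""
    width = upper / steps
    return width * sum(
        chi_square_density((k + 0.5) * width, v) for k in range(steps)
    )


total_area = integrate_density(4)  # should be very close to 1
print(f"area under the curve for v = 4: {total_area:.6f}")
for v in (2, 4, 8):
    print(v, [round(chi_square_density(x, v), 4) for x in (1.0, 4.0, 8.0)])
```

Working with logarithms (`math.lgamma`) keeps the constant stable for large v, where the Gamma function itself would overflow.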

For n = 2, the distribution will be inverse exponential:

$$Y = y_0 \, e^{-\chi^2/2} \, (\chi^2)^{-1/2}$$

If n is large, it can be shown that the quantity

$$\sum \frac{(f_0 - f_e)^2}{f_e}$$

is distributed like Chi-square with n-1 degrees of freedom.

The following diagram shows the Chi-square curves for three different values of n: n = 2, n = 4 and n = 8.

[Figure: Chi-square distributions for n = 2, n = 4 and n = 8]



These curves show that i) for each value of n, there exists one χ² distribution. Thus, like the t distribution, Chi-square is a family of distributions, with one distribution corresponding to each value of the degrees of freedom, ranging from 1 to infinity; ii) for one degree of freedom, the χ² curve is approximated by a hyperbola; iii) as the degrees of freedom change, the shape of the curve changes; iv) as the degrees of freedom increase, a peak appears; v) the larger the n, the further towards the right the peak shifts.

For purposes of calculation, the χ² distribution may be defined differently for different cases. If one is interested in evaluating the goodness of fit of a function, or the differences between actual and expected frequencies, χ² can be defined by the following relation:

$$\chi^2 = \sum \frac{(f_0 - f_e)^2}{f_e} \qquad (2)$$

Properties

The χ² distribution has the following characteristic features:

i) Chi-square is a sum of squares, and therefore its value can never be negative. The distribution, in fact, ranges from zero to positive infinity;

ii) The distribution is uni-modal and positively skewed, that is, skewed towards the right. For v > 2 degrees of freedom, the mode of the Chi-square density is located at the point where Chi-square equals v-2. Beyond the mode, the curve falls gradually and ultimately approaches zero as Chi-square tends towards infinity;

iii) The χ² curve is a Pearsonian type III curve;

iv) As the size of the sample increases, the distribution approaches normality. Fisher showed that, if n > 30, the quantity [(2χ²)^(1/2) - (2v-1)^(1/2)] is distributed normally with zero mean and unit standard deviation. Therefore, it is a normal deviate, having values of 1.96 and 2.58 at the 5 and 1 per cent probability levels. This quantity can be both positive and negative; it is distributed like a usual normal variate. Thus, the 5 and 1 per cent significance points of χ², like those of other test statistics, refer to both the tails of the distribution;

v) A sum of several independent Chi-squares is also distributed as Chi-square, and its degrees of freedom equal the sum of the degrees of freedom of the component Chi-squares (for proof, see Kenney and Keeping);

vi) Generally, Chi-square could be defined as

$$\chi^2 = \frac{\sum (x_i - \bar{x})^2}{\sigma^2} \qquad (3)$$

that is, it is the ratio of the sum of the squared deviations of the sample values from the sample mean to the population variance. The greater the departure of the data from the stipulated hypothesis, the greater will be the value of Chi-square and the greater will be the probability of rejecting the hypothesis that the data are in consonance with a priori theory;

vii) The Chi-square distribution is a continuous distribution even though the actual frequencies of occurrence may be discontinuous. This introduces an error, especially if the frequency is small. A frequency of less than 5 is considered to be small. The following alternative solutions are available to overcome this limitation:

a) Fisher suggested that the expected frequency should not be less than five in any cell or class. Classes with low frequencies can be clubbed together if their individual expected frequencies are less than five;

b) Yates' correction may also be used. According to this procedure, ½ is added to the small frequencies, while ½ is deducted from the large frequencies, so as to keep the marginal totals unchanged. This is, however, applicable only to 2x2 tables (see Yule and Kendall);

c) In some cases, it may be possible to arrive at an exact solution, which renders this correction unnecessary; and

viii) As pointed out earlier, the equation of Chi-square does not involve any parameter of the population. Hence, it does not depend upon the form of the population distribution. Since it does not contain any population parameter, it is known as a non-parametric statistic.
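Fisher's large-sample result in property iv) can be illustrated with a short calculation. The sketch below converts a Chi-square value into the approximate normal deviate; the function name is hypothetical, and the value 55.758 is the standard tabulated upper 5% point of Chi-square for 40 degrees of freedom, used here only as a check.

```python
import math


def fisher_normal_deviate(chi_square, v):
    """Fisher's approximation: sqrt(2 * chi-square) - sqrt(2v - 1).

    For large v this quantity is approximately N(0, 1).
    """
    return math.sqrt(2 * chi_square) - math.sqrt(2 * v - 1)


# Tabulated upper 5% point of Chi-square for 40 d.f. (standard table value).
z = fisher_normal_deviate(55.758, 40)
print(f"approximate normal deviate: {z:.3f}")
```

The result lies near 1.645, the upper-tail 5% point of the standard normal distribution, which suggests how close the approximation already is at 40 degrees of freedom.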

Conditions For The Application of Chi-Square Test:

i) The sample must be large; it should preferably contain 50 or more items, even though the number of cells or class intervals may be small. Aggregation and classification generally reduce the number of cells;

ii) The N individual items in the sample must have been drawn independently;

iii) The number of cells must be neither too small nor too large. It is preferable to have the class intervals or cells in the range of 5 to 20;

iv) The constraints to which the cell frequencies are subjected must be linear. The researcher can choose to formulate the constraints so as to satisfy the condition of linearity;

v) The cell frequencies must not be small. In any case, no cell frequency should be less than five, and it is preferable to have 10 or more as the smallest cell frequency. This condition can easily be satisfied by clubbing together classes whose frequencies are less than five and aggregating the corresponding frequencies.

Numerical Examples:

If the population variance is known or its value is postulated, the Chi-square statistic can be used to determine whether the sample estimate of the population variance differs from the actual population variance by more than is warranted by sampling fluctuations. For such problems, Chi-square is defined as follows:

$$\chi^2 = \frac{nS^2}{\sigma^2} = \frac{\sum (x - \bar{x})^2}{\sigma^2}$$

where n is the sample size, S² is the sample estimate of the variance, x̄ is the sample mean, and σ² is the population variance.

Test of Variances

Example 1: A random sample of the heights of 20 males is collected from the records of an army recruitment board, which gives a variance of 3.21. Is it consistent with the population variance of 2.79?

$$\chi^2 = \frac{nS^2}{\sigma^2} = \frac{20 \times 3.21}{2.79} = \frac{64.2}{2.79} = 23.01$$

For 19 d.f., the 5% value of Chi-square is 30.14, which is greater than the calculated value. Hence, the difference between the true population variance and its sample estimate could have arisen from sampling fluctuations.
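The arithmetic of Example 1 can be reproduced in a few lines. This is a minimal sketch; the function name is an illustrative assumption, and the critical value is simply the table value quoted in the example.

```python
def variance_chi_square(n, sample_variance, population_variance):
    """Chi-square for testing a sample variance: n * S^2 / sigma^2."""
    return n * sample_variance / population_variance


chi_square = variance_chi_square(n=20, sample_variance=3.21,
                                 population_variance=2.79)
critical_5pct_19df = 30.14  # table value quoted in Example 1

print(f"chi-square = {chi_square:.2f}")
print("reject at 5%" if chi_square > critical_5pct_19df
      else "consistent with the population variance")
```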

Example 2: Ten random samples of 50 persons each were selected for an opinion poll, with the following numbers of votes favouring a particular candidate: 25, 25, 27, 27, 28, 29, 30, 31, 33, 33. Is this an unusual variation to expect? In the opinion poll, the respondents are given two choices (binary: success or failure): either they favour the particular candidate or they do not. If they favour the candidate, we denote the outcome by S; if they do not, we denote it by S̄. The probabilities of the occurrence of S and S̄ are denoted by p and q. The occurrences of S and S̄ may be calculated from the binomial distribution (q+p)^n. If all ten sub-samples are pooled together, the overall sample size is N x n = 10 x 50 = 500.

The variance of the binomial distribution is given by npq, where n is the sample size, p is the probability of success and q is the probability of failure in any trial. The mean number of successes, that is, the expected value, is np. For testing the acceptability of the sample variance, equation 3 has to be used:

$$\chi^2 = \frac{nS^2}{\sigma^2} = \frac{\sum (x_i - \bar{x})^2}{npq}$$

The total number of votes in all the samples favouring the particular candidate = (25+25+27+27+28+29+30+31+33+33) = 288. The total number of voters interviewed = 50 x 10 = 500. Hence, the observed probability of any voter favouring the particular candidate, in all the samples taken together, is p = 288/500 = .576, and q = 1 - p = .424.

The mean number of successes in the pooled sample = np = 500 x .576 = 288, that is, 28.8 per sample of 50.

$$\sum (x - \bar{x})^2 = \sum x^2 - \frac{\left(\sum x\right)^2}{N}$$

where N = 10 is the number of samples, and

$$\sum x^2 = 625+625+729+729+784+841+900+961+1089+1089 = 8372$$

$$\chi^2 = \frac{8372 - (288 \times 288)/10}{(50)(.576)(.424)} = \frac{77.6}{12.2112} = 6.35$$

For 9 d.f., the 5% table value of Chi-square is 16.919. The table value is much greater than the calculated value. Hence, the evidence supports the hypothesis that the variation could have arisen from sampling fluctuations.
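The steps of Example 2 can be sketched as follows: estimate p from the pooled data, compute the sum of squared deviations of the ten counts, and divide by the binomial variance npq for one sample of 50. Variable names are illustrative assumptions.

```python
votes = [25, 25, 27, 27, 28, 29, 30, 31, 33, 33]  # favourable votes per sample
sample_size = 50                                   # persons per sample

total_votes = sum(votes)                           # 288
p = total_votes / (sample_size * len(votes))       # pooled estimate, 0.576
q = 1 - p                                          # 0.424

mean_votes = total_votes / len(votes)              # 28.8 per sample
sum_sq_dev = sum((x - mean_votes) ** 2 for x in votes)

binomial_variance = sample_size * p * q            # npq for one sample
chi_square = sum_sq_dev / binomial_variance
dof = len(votes) - 1

print(f"sum of squared deviations = {sum_sq_dev:.1f}")
print(f"chi-square = {chi_square:.2f} on {dof} d.f.")
```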

The Test of Goodness of Fit:

The most common use of the Chi-square test is to assess the goodness of fit of a theoretical distribution/curve or a mathematical function to a given set of data. The hypothesis to be tested is whether the observed sample frequencies are in consonance with the expected frequencies, or whether the values predicted by the function are acceptable. The test statistic is

$$\chi^2 = \sum \frac{(f_0 - f_e)^2}{f_e}$$

where f_0 denotes the observed frequency and f_e is the expected/theoretical frequency/value. The expected frequencies may be obtained by any one of the following methods: they may be derived from some theoretical distribution like the binomial distribution; they may be obtained by applying the principle of independence, as in the case of a 2x2 frequency table; they may be obtained from expected ratios such as the sex ratio; or they may be based on empirical data. It may also be assumed that the frequencies occur randomly. If a mathematical/econometric function has been used (see Prakash and Subramanian, 2006), the value predicted by this function may be used.

Example 3: In 120 throws of a single die, the following results are obtained:

Number (X):      1    2    3    4    5    6    Total
Frequency (f):   30   25   18   10   22   15   120

Do these frequencies discredit the hypothesis of equal probability of each number?

If the die is unbiased, the probability of each of the six numbers is 1/6 and the expected frequency of each number/face is np = 120/6 = 20.

$$\chi^2 = \frac{(30-20)^2}{20} + \frac{(25-20)^2}{20} + \frac{(18-20)^2}{20} + \frac{(10-20)^2}{20} + \frac{(22-20)^2}{20} + \frac{(15-20)^2}{20} = \frac{100+25+4+100+4+25}{20} = \frac{258}{20} = 12.9, \qquad v = 6-1 = 5$$

For 5 d.f., the 5% value of Chi-square is 11.07, which is less than the calculated value. Hence, the hypothesis of an unbiased die, with equal probability of each number, is rejected.

The evidence does not support the thesis that the die is unbiased.
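The goodness-of-fit calculation for the die can be written as a small reusable function. This is a sketch under the assumption that expected frequencies are supplied by the caller; the function name is hypothetical, and 11.07 is the standard tabulated 5% point of Chi-square for 5 degrees of freedom.

```python
def goodness_of_fit(observed, expected):
    """Sum of (observed - expected)^2 / expected over all classes."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


observed = [30, 25, 18, 10, 22, 15]                    # faces 1 to 6
expected = [sum(observed) / len(observed)] * len(observed)  # 20 per face

chi_square = goodness_of_fit(observed, expected)
dof = len(observed) - 1
critical_5pct_5df = 11.07  # standard table value for 5 d.f.

print(f"chi-square = {chi_square:.1f} on {dof} d.f.")
print("reject equal probabilities" if chi_square > critical_5pct_5df
      else "no evidence against equal probabilities")
```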

χ² For Contingency Table

Example 4: The following table shows the classification of employed male graduates by occupation, related to the occupation of their fathers.

Number of Youth Reporting Father's Occupation

Youth's Occupation   Unskilled   Skilled   White Collar   Total
White Collar                 8        39             56     103
Skilled                     34       118             38     190
Unskilled                    5        16             10      31
Total                       47       173            104     324

Test the association between the occupation of the youth and that of their fathers.

The probability of the joint occurrence of two independent events is the product of the probabilities of their individual occurrences. If n is the sample size, n(io) is the total frequency of the i-th row, n(oj) is the total frequency of the j-th column and n(ij) is the frequency of the i-j-th cell, the expected frequency of this cell is given by

$$n\,p(ij) = \frac{n(io)\,n(oj)}{n}$$

If we assume independence between the two classifications, that is, that the occupation chosen by the sons is independent of the occupation of their fathers, the probability of the i-j-th cell is p(ij) = p(i) x p(j). But p(i) = n(io)/n and p(j) = n(oj)/n. Hence,

$$p(ij) = \frac{n(io)\,n(oj)}{n \cdot n}$$

This probability multiplied by the sample size gives the expected frequency:

$$n_e(ij) = n\,p(ij) = \frac{n(io)\,n(oj)}{n}$$

Hence,

$$\chi^2 = \sum_i \sum_j \frac{\{n(ij) - n_e(ij)\}^2}{n_e(ij)}$$

Thus, the expected frequencies are calculated with the help of the grand total and the marginal frequencies. If we have p rows and q columns, the number of degrees of freedom is given by (p-1)(q-1). If the null hypothesis of independence is rejected, the strength of the relationship between the given attributes is estimated by the coefficient of contingency:

$$C = \sqrt{\frac{\chi^2}{n + \chi^2}}$$

In the given example, the expected frequencies are as follows: n(11) = 47x103/324 = 14.9, n(12) = 103x173/324 = 55, n(13) = 103x104/324 = 33.1, n(21) = 190x47/324 = 27.6, n(22) = 190x173/324 = 101.4, n(23) = 190x104/324 = 61, n(31) = 31x47/324 = 4.5, n(32) = 31x173/324 = 16.6, n(33) = 31x104/324 = 9.9. The value of χ² will, therefore, be given by

$$\chi^2 = \frac{(8-14.9)^2}{14.9} + \frac{(39-55)^2}{55} + \frac{(56-33.1)^2}{33.1} + \frac{(34-27.6)^2}{27.6} + \frac{(118-101.4)^2}{101.4} + \frac{(38-61)^2}{61} + \frac{(5-4.5)^2}{4.5} + \frac{(16-16.6)^2}{16.6} + \frac{(10-9.9)^2}{9.9} = 36.645$$



For d.f. = 2x2 = 4, the 5% value of Chi-square is 9.488, which is less than the calculated value of χ². Hence, the evidence is against the hypothesis of independence. Now the question is: what is the degree of relationship between the occupation of fathers and sons?

The coefficient of contingency is

$$C = \sqrt{\frac{36.645}{324 + 36.645}} = 0.32$$

while the maximum value of the coefficient for a 3x3 table is $\sqrt{2/3} \approx 0.82$. Hence, the calculated value of the contingency coefficient is moderate: it is neither too low nor too high. The empirical evidence suggests a relation of only moderate strength between the occupation of fathers and sons.
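The whole of Example 4 (expected frequencies from the marginal totals, the Chi-square sum, the degrees of freedom and the coefficient of contingency) can be sketched in one function. The function name is an illustrative assumption. Because the code keeps the expected frequencies unrounded, its Chi-square comes out slightly higher than the 36.645 obtained with expected frequencies rounded to one decimal.

```python
import math


def contingency_chi_square(table):
    """Return (chi_square, dof, n) for a rows-by-columns frequency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi_square = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected frequency under independence: n(io) * n(oj) / n
            expected = row_totals[i] * col_totals[j] / n
            chi_square += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return chi_square, dof, n


# Rows: youth's occupation (white collar, skilled, unskilled).
# Columns: father's occupation (unskilled, skilled, white collar).
observed = [
    [8, 39, 56],
    [34, 118, 38],
    [5, 16, 10],
]

chi_square, dof, n = contingency_chi_square(observed)
contingency_coefficient = math.sqrt(chi_square / (n + chi_square))

print(f"chi-square = {chi_square:.2f} on {dof} d.f.")
print(f"coefficient of contingency C = {contingency_coefficient:.2f}")
```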

Test of the Coefficient of Association

2x2 Frequency Table. A 2x2 frequency table is a special case of the p x q contingency table. Such a table is used to illustrate the following points: i) the formula for calculating Chi-square without correction; ii) the calculation of χ² with Yates' correction for small frequencies; and iii) an exact solution for Chi-square. Let the 2x2 table be as follows:

        A      B      Total
C       a      b      a+b
D       c      d      c+d
Total   a+c    b+d    n

Chi-square for such a table is given by:

$$\chi^2 = \frac{(a+b+c+d)\,(ad-bc)^2}{(a+b)(c+d)(a+c)(b+d)}$$

where n = a+b+c+d.

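Example 5: Find the value of Chi-square for the following table without Yates' correction for the small frequencies:

        A      B      Total
C       10     4      14
D       2      8      10
Total   12     12     24

Calculated value of Chi-square:

$$\chi^2 = \frac{24\,(10 \times 8 - 4 \times 2)^2}{12 \times 12 \times 14 \times 10} = \frac{24 \times 72 \times 72}{12 \times 12 \times 14 \times 10} = \frac{216}{35} = 6.17$$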


The value of Chi-square at the 5% level for (r-1)(c-1) = (2-1)(2-1) = 1 d.f. is 3.841. This is less than the calculated value. Hence, the attributes are not independent; the result suggests that they are associated.

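Alternatively, the expected frequencies may be calculated as follows:

$$n(11) = \frac{12 \times 14}{24} = 7, \quad n(12) = \frac{12 \times 14}{24} = 7, \quad n(21) = \frac{12 \times 10}{24} = 5, \quad n(22) = \frac{12 \times 10}{24} = 5$$

Then, the value of Chi-square is

$$\chi^2 = \frac{(10-7)^2}{7} + \frac{(4-7)^2}{7} + \frac{(2-5)^2}{5} + \frac{(8-5)^2}{5} = \frac{18}{7} + \frac{18}{5} = \frac{90+126}{35} = \frac{216}{35} = 6.17$$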

Results remain the same irrespective of the procedure followed. Since two cell frequencies are small, we may apply Yates' correction by raising the small frequencies by ½ and decreasing the large frequencies by ½. The table thus corrected will be as given below:

        A      B      Total
C       9.5    4.5    14
D       2.5    7.5    10
Total   12     12     24

Then,

$$\chi^2 = \frac{24\,(9.5 \times 7.5 - 4.5 \times 2.5)^2}{12 \times 12 \times 14 \times 10} = \frac{24 \times 60 \times 60}{12 \times 12 \times 14 \times 10} = 4.29$$

Without correction, the term (ad-bc)² = 72x72 = 5184, which is now reduced to 60x60 = 3600 by the correction. This reduction in the numerator accounts for the reduction in the value of Chi-square from 6.17 to 4.29. But the calculated value is still greater than the table value. Hence, it is again significant at the 5% probability level and the inference drawn earlier is not altered.
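The 2x2 shortcut formula, with and without Yates' correction, can be sketched as below. The function name and the `yates` flag are illustrative assumptions. The correction is applied here in its usual algebraic form, reducing |ad - bc| by n/2, which for this table gives the same 60 as moving each cell by ½ towards its expected value.

```python
def chi_square_2x2(a, b, c, d, yates=False):
    """Chi-square for the 2x2 table [[a, b], [c, d]] via the shortcut formula."""
    n = a + b + c + d
    diff = abs(a * d - b * c)
    if yates:
        # Continuity correction: shrink |ad - bc| by n/2, never below zero.
        diff = max(diff - n / 2, 0.0)
    return n * diff ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


uncorrected = chi_square_2x2(10, 4, 2, 8)
corrected = chi_square_2x2(10, 4, 2, 8, yates=True)
critical_5pct_1df = 3.841  # table value quoted in Example 5

print(f"without correction: {uncorrected:.2f}")
print(f"with Yates' correction: {corrected:.2f}")
```

Both values exceed 3.841, so the conclusion of association is the same with or without the correction.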

Derivation of Expected Values

Expected values may be derived in alternative ways. Some scientific principle may be used to estimate the expected values. Alternatively, empirical evidence may be used to evolve the criteria for generating expected values. The evidence may be used to determine the relative frequencies or probabilities, as we have done in solving Example 4. Some statistical distribution or mathematical/econometric model may also be used to determine the theoretical values.

Scientific Principle as the Base

Example 6: In an experiment on pea breeding, Mendel obtained the following frequencies of seeds:

Round and Yellow   Wrinkled Yellow   Round Green   Wrinkled Green   Total
             315               101           108               32     556

The theory predicts that the frequencies should be in the proportions 9:3:3:1. Examine the consistency of the data with this expectation.

According to the theoretical prediction of the relative shares of the 4 types of seeds in a total of 16, the expected frequencies will be

$$\frac{9 \times 556}{16}, \quad \frac{3 \times 556}{16}, \quad \frac{3 \times 556}{16}, \quad \frac{1 \times 556}{16}$$

that is, 313, 104, 104 and 35 respectively, rounded to whole numbers.

The value of

$$\chi^2 = \frac{(315-313)^2}{313} + \frac{(101-104)^2}{104} + \frac{(108-104)^2}{104} + \frac{(32-35)^2}{35} = \frac{4}{313} + \frac{9}{104} + \frac{16}{104} + \frac{9}{35} = 0.51$$

But the 5% value of Chi-square for 3 d.f. is 7.815, which is much greater than the observed value. Hence, the difference between the observed and the theoretical frequencies is not significant. The data are thus consistent with the theoretical expectation.
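When the expected frequencies come from a theoretical ratio, as in Mendel's experiment, they can be generated directly from the ratio. The following sketch uses a hypothetical helper name and keeps the expected frequencies unrounded (312.75, 104.25, 104.25, 34.75), so it gives a Chi-square of about 0.47 rather than the 0.51 obtained with expected frequencies rounded to whole numbers. The conclusion is unaffected.

```python
def chi_square_from_ratio(observed, ratio):
    """Chi-square against expected frequencies implied by a theoretical ratio."""
    total = sum(observed)
    parts = sum(ratio)
    expected = [total * r / parts for r in ratio]
    chi_square = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return chi_square, expected


# Round yellow, wrinkled yellow, round green, wrinkled green
observed = [315, 101, 108, 32]
chi_square, expected = chi_square_from_ratio(observed, ratio=[9, 3, 3, 1])
dof = len(observed) - 1
critical_5pct_3df = 7.815  # table value quoted in Example 6

print("expected:", expected)
print(f"chi-square = {chi_square:.2f} on {dof} d.f.")
```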

Example 7: Ten random samples of 100 items each have been collected. Are the following frequencies of males and females consistent with an expectation of equal division of the sexes?

Male     40   52   49   50   43   48   42   45   41   51
Female   60   48   51   50   57   52   58   55   59   49

Expected number of males or females = np = ½ x 100 = 50 in each sample. The differences between the observed and expected frequencies are the same in magnitude for males and females. Hence, Chi-square = 2(100+4+1+0+49+4+64+25+81+1)(1/50) = 2 x 329/50 = 13.16.

But the 5% value of Chi-square for 10 d.f. is 18.307, which is greater than the observed value. Therefore, the data do not discredit the hypothesis of equal proportions of males and females in the population.
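Example 7 can be checked from the male counts alone, since each sample has 100 items and the female count is therefore 100 minus the male count. This is a minimal sketch with illustrative variable names; it assumes every sample has exactly 100 items, as the example states.

```python
males = [40, 52, 49, 50, 43, 48, 42, 45, 41, 51]
sample_size = 100
expected = sample_size * 0.5  # 50 males and 50 females per sample

chi_square = 0.0
for m in males:
    f = sample_size - m  # female count implied by the sample size
    chi_square += (m - expected) ** 2 / expected
    chi_square += (f - expected) ** 2 / expected

squared_deviations = [int((m - expected) ** 2) for m in males]

print("squared deviations:", squared_deviations)
print(f"chi-square = {chi_square:.2f}")
```

Each sample contributes its squared deviation twice, once for males and once for females, which is the factor of 2 in the hand calculation.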

Books Recommended:

1. Croxton and Cowden: Applied General Statistics.

2. Kenney, J.F. and Keeping, E.S.: Mathematics of Statistics, Parts I & II, Affiliated East-West Press Pvt. Ltd., New Delhi.

3. Rosander, A.C.: Elementary Principles of Statistics, Affiliated East-West Press Pvt. Ltd., New Delhi.

4. Yamane, Taro: Statistics.

5. Yule and Kendall: Introduction to Theory of Statistics.

6. Weatherburn, C.E.: A First Course in Mathematical Statistics, Cambridge University Press, London.

7. Kane, Edward J.: Economic Statistics & Econometrics, Harper & Row, New York, Evanston & London and John Weatherhill, Inc., Tokyo.

8. Koutsoyiannis

9. Cooper, Donald R. and Schindler, Pamela S.: Business Research Methods, Tata McGraw Hill, New Delhi.

10. Levine, David M., Krehbiel, Timothy C. and Berenson, Mark L.: Business Statistics, Pearson Education Asia, 2001.

11. Prakash and Subramanian (2006): Determination of Share Prices: Analysis of A Select Group of Indian Companies. Forthcoming in Finance India.
