Chi-square Distribution and Its Applications
8/10/2019 Chi-square Distribution and Its Applications
CHI-SQUARE DISTRIBUTION AND ITS APPLICATIONS
Introduction
The science of statistics comprises two parts: descriptive statistics and inferential statistics. Descriptive statistics deals with the data and information base of analysis. It is basically concerned with the preparation of data for presentation through classification and tabulation, and with the condensation of data through summary statistics such as measures of central tendency and measures of variation. It also deals with the graphical presentation of data; graphs portray the basic features of the frequency distribution. Descriptive statistics can also deal with the information base pertaining to qualitative attributes, which fall within the purview of non-parametric statistics.
In contrast to descriptive statistics, inferential statistics goes deeper to fathom what underlies the data. For this, inferential statistics focuses on the analysis of data in order to arrive at conclusions or inferences. It basically deals with i) problems of estimating the values of the population parameters; and ii) hypothesis testing. The hypotheses may relate to a) population parametric values; b) the representative character of one or more samples, which may involve the evaluation of the differences between the sample estimates and true values; or c) the relationships between two or more variables. Inferential statistics has the theory of probability as its pivotal base. Probability theory may, however, be applied to either parametric or non-parametric statistics. Hence, both descriptive and inferential statistics are further classifiable into i) parametric statistics; and ii) non-parametric statistics. Parametric statistics revolves around variables which are cardinally measurable, whereas non-parametric statistics deals with nominally or ordinally measurable qualitative attributes of the phenomena under study.
The following chart portrays this classification:
Parametric and Non-parametric Statistics
Statistics
    Descriptive Statistics
        Parametric Descriptive Statistics
        Non-Parametric Descriptive Statistics
    Inferential Statistics
        Parametric Inferential Statistics
        Non-Parametric Inferential Statistics
Sampling theory constitutes a large part of parametric statistics, which uses numerous test statistics. The test statistics discussed earlier enable us to test the differences between the sample and population means, the differences between two sample means, the equality of more than two sample means, the equality of sample and/or population variances, etc. All these tests are based on the assumption that the variate values either have a normal distribution or, under certain conditions, a good approximation to it. The z and t tests are the important tests in parametric statistics, and the F test complements them. These tests are frequently used to evaluate one, two or multiple sample estimates. The causal relations between two or more cardinally measured variables may also be evaluated by these tests.
But numerous research investigations focus exclusively on ordinally or even nominally measured qualitative attributes and their classification into distinct categories. Since the population in such cases has no parameters as such, there is no need for parametric sample estimates. The research investigations in such cases do, however, involve the testing of hypotheses. These hypotheses may pertain to one or more samples. Besides, the study may relate to one, two or even more attributes. In cases involving two or more attributes, the focus of analysis may be the inter-relations among the attributes under investigation. Rigorous statistical analysis of such data needs non-parametric methods. Non-parametric statistics is now well developed, and several non-parametric tests and methods of statistical analysis are available to fulfil the research requirements. The choice of any of these tests depends on the nature of the measurement, the conditions of the problem setting and the nature of the sample(s).
χ² Distribution
The χ² distribution was discovered by Helmert in 1875. Subsequently, Karl Pearson developed it in 1900 independently of Helmert. The χ² distribution constitutes a part of non-parametric statistics of major importance, and χ² is probably the most widely used test statistic in social science research. The large-scale use of χ² in research is accounted for by the versatility of this test statistic on the one hand, and by the newer uses that researchers have continuously evolved on the other. The χ² test mostly suits problems of the following kinds:
i) the nominal scale of measurement is used in one sample;
ii) the higher, ordinal scale of measurement is involved;
iii) the goodness of fit of some curve, mathematical/statistical function or model has to be tested. Karl Pearson developed χ² as a test of the goodness of fit of a mathematical/statistical distribution. For example, one may have data on the monthly earnings of 500 employees of different managerial cadres in a company. The company may be interested in knowing whether the distribution of earnings follows the normal, binomial, multinomial or hypergeometric distribution. χ² may come in quite handy for answering such questions, and the answer may also help the company in sorting out some knotty issues of HRM that revolve around the earning profiles. A related problem is to evaluate the significance of the differences between the observed and expected/theoretical values;
iv) two or more samples, and not only one sample, are studied;
v) the statistical significance of some value or characteristic is to be evaluated;
vi) multi-collinearity in econometric models is to be evaluated (Koutsoyiannis, 198 );
vii) homoskedasticity is to be tested: the values of the test statistic applied for this purpose tend to be distributed like χ² with k-1 degrees of freedom if they are normally and independently distributed (Kane, p. 373);
viii) the association between attributes is to be evaluated; and
ix) the coefficient of contingency is to be calculated, for which the value of χ² is used.
χ² may also be defined as the sum of squares of n independently and normally distributed variables x_i, i = 1, 2, ..., n, once these have been standardized as follows:
$$z_i = \frac{x_i - \mu_i}{\sigma_i} \sim N(0, 1)$$
$$\chi^2 = \sum_{i=1}^{n} z_i^2 = \sum_{i=1}^{n} \left(\frac{x_i - \mu_i}{\sigma_i}\right)^2$$
with n degrees of freedom. The degrees of freedom equal the number of independent variables.
The distribution is skewed towards the right; it starts from the origin and extends to +∞ in the right tail. As n increases, χ² tends more and more towards symmetry.
The mean and variance of the χ² distribution are given by
$$E(\chi^2) = n$$
$$V(\chi^2) = 2n$$
Thus, the mean equals the number of variables and the variance is twice the number of variables.
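The sum-of-squares definition and the two moments can be checked numerically. The following is a minimal simulation sketch, not part of the original notes: the function name, the choice n = 5, the number of draws and the seed are all illustrative assumptions.

```python
import random

def simulate_chi_square(n, draws=20000, seed=0):
    """Draw chi-square values as sums of squares of n independent N(0, 1) variates.

    Returns the sample mean and sample variance of the simulated values.
    """
    rng = random.Random(seed)  # fixed seed so that the run is reproducible
    values = []
    for _ in range(draws):
        values.append(sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(n)))
    mean = sum(values) / draws
    variance = sum((v - mean) ** 2 for v in values) / (draws - 1)
    return mean, variance

# Illustrative choice: n = 5, so theory gives mean 5 and variance 10.
mean, variance = simulate_chi_square(n=5)
print(f"simulated mean = {mean:.2f} (theory: 5)")
print(f"simulated variance = {variance:.2f} (theory: 10)")
```

The simulated values will not equal n and 2n exactly; they should only lie close to them, and closer as the number of draws grows.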
)52(6/1)(2 kanx
The χ² distribution is of great importance in the statistical testing of hypotheses. It is used to test the correspondence between the sample data and expected values or frequencies, especially those which relate to the variance, the independence of classification and the rank order.
It is interesting to note that the expected/theoretical values may be generated from alternative assumptions or theoretical frameworks. Each option furnishes an opportunity to formulate and evaluate the empirical validity of a set of hypotheses rather than focusing on one single hypothesis. This advantage is not available from most other tests, and it is in-built in the definition given above. The test enables us to compare not only one sample value such as the mean or variance but the whole set of sample values with the corresponding theoretical or expected values. It is also considered to be one of the several non-parametric test statistics. Thus, this test statistic is unique in so far as it can be used to test hypotheses relating to both qualitative and quantitative data. In some cases, χ² offers an alternative to the binomial distribution. The binomial distribution suits the problem best if the small size of the sample precludes the use of the χ² test. But in large samples, χ² is an alternative to the binomial distribution, especially for problems having binary values or such events as success or failure.
Definition:
$$\chi^2 = \sum \frac{(f_0 - f_e)^2}{f_e}$$
where f_0 refers to the actually observed values/frequencies, f_e to the expected values/frequencies, and n is the sample size. If the sum of independent normal variates with zero mean is itself normally distributed, the sum of the squares of these variates has the Chi-square distribution with n-1 degrees of freedom. This property furnishes the general definition and, hence, the following equation of this distribution:
$$Y = y_0 \, e^{-\frac{1}{2}\chi^2} \, (\chi^2)^{\frac{v-2}{2}} \qquad (1)$$
where v = n-1 denotes the degrees of freedom and Y, the ordinate of the curve, is a function of these degrees of freedom. Since the degrees of freedom are generally given exogenously, there is no need to estimate them as a distributional parameter. This makes the χ² distribution parameter-free. The shape of the distribution depends only upon the degrees of freedom, which is the only parameter of the distribution.
For n = 2, the distribution will be inverse exponential:
$$Y = y_0 \, e^{-\frac{1}{2}\chi^2} \, (\chi^2)^{-\frac{1}{2}}$$
If n is large, it can be shown that the quantity $\sum \frac{(f_0 - f_e)^2}{f_e}$ is distributed like Chi-square with n-1 degrees of freedom.
The following diagram shows the Chi-square curves for three different values of n: n=2, n=4, and n=8.
[Figure: Chi-square distributions for n=2, n=4 and n=8]
These curves show that i) for each value of n, there exists one χ² distribution. Thus, like the t distribution, Chi-square is a family of distributions, with one distribution corresponding to each given value of the degrees of freedom, ranging from 1 to infinity; ii) for one degree of freedom, the χ² curve is approximated by a hyperbola; iii) as the degrees of freedom change, the shape of the curve changes; iv) as the degrees of freedom increase, a peak appears; v) the larger the n, the further towards the right the peak shifts.
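The behaviour of the peak can be traced numerically. The sketch below is an illustration, not part of the original notes: it assumes that the curve labels 2, 4 and 8 are degrees of freedom v, it uses the standard normalizing constant 1/(2^(v/2) Γ(v/2)) in place of y_0, and the helper names and grid settings are hypothetical.

```python
import math

def chi_square_pdf(x, v):
    """Density of the chi-square distribution with v degrees of freedom at x > 0."""
    log_pdf = ((v / 2 - 1) * math.log(x) - x / 2
               - (v / 2) * math.log(2) - math.lgamma(v / 2))
    return math.exp(log_pdf)

def peak_location(v, step=0.01, upper=30.0):
    """Grid search for the x in (0, upper] at which the density is highest."""
    grid = [step * k for k in range(1, int(upper / step) + 1)]
    return max(grid, key=lambda x: chi_square_pdf(x, v))

# Degrees of freedom matching the three curves in the figure.
for v in (2, 4, 8):
    print(f"v = {v}: highest ordinate near chi-square = {peak_location(v):.2f}")
```

For v = 2 the curve has no interior peak and simply falls away from the origin, so the search returns the first grid point.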
For purposes of calculation, the χ² distribution may be defined differently for different cases. If one is interested in evaluating the goodness of fit of a function or the differences between actual and expected frequencies, the χ² distribution can be defined by the following relation:
$$\chi^2 = \sum \frac{(f_0 - f_e)^2}{f_e} \qquad (2)$$
Properties
The χ² distribution has the following characteristic features:
i) Chi-square is a sum of squares and, therefore, its value can never be negative. The distribution, in fact, ranges from zero to positive infinity;
ii) For v ≥ 2 degrees of freedom, the mode of the Chi-square distribution is located at the point where Chi-square equals v-2. The distribution is uni-modal and positively skewed, that is, skewed towards the right. The curve falls gradually and ultimately approaches zero as Chi-square tends towards infinity;
iii) The χ² curve is a Pearsonian type III curve;
iv) As the size of the sample increases, the distribution approaches normality. Fisher showed that, if n > 30, the quantity $\sqrt{2\chi^2} - \sqrt{2v - 1}$ is distributed normally with zero mean and unit standard deviation. Therefore, it is a normal deviate, having values of 1.96 and 2.58 at the 5 and 1 per cent probability levels. This quantity can be both positive and negative; it is distributed like a usual normal variate. Thus, the 5 and 1 per cent significance points obtained in this way, like those of other test statistics, refer to both the tails of the distribution;
v) A sum of several independent Chi-squares is also distributed as Chi-square, and its degrees of freedom equal the sum of the degrees of freedom of the component Chi-squares (for proof, see Kenney and Keeping);
vi) Generally, Chi-square could be defined as
$$\chi^2 = \frac{\sum (x_i - \bar{x})^2}{\sigma^2} \qquad (3)$$
that is, it is the ratio of the sum of the squared deviations of the sample values from the sample mean to the population variance. The greater the degree of departure of the data from the stipulated hypothesis, the greater will be the value of Chi-square and the greater will be the probability of rejecting the hypothesis that the data are in consonance with a priori theory;
vii) The Chi-square distribution is a continuous distribution even though the actual frequencies of occurrence may be discontinuous. This introduces an error, especially if the frequency is small. A frequency of less than 5 is considered to be small. Alternative solutions are available to overcome this limitation:
a) Fisher suggested that the expected frequency should not be less than five in any cell or class. This limitation may be overcome since several classes having low frequencies can be clubbed together if their individual expected frequencies are less than five;
b) Yates' correction may also be used. According to this procedure, ½ is added to the small frequencies, while ½ is deducted from the large frequencies so as to keep the marginal totals unchanged. This is, however, applicable only to 2x2 tables (see Yule and Kendall);
c) In some cases, it may be possible to arrive at an exact solution, which renders this correction unnecessary; and
viii) As pointed out earlier, the equation of Chi-square does not involve any parameter of the population. Hence, it does not depend upon the form of the population distribution. Since it does not contain any population parameter, it is known as a non-parametric statistic.
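Fisher's normal approximation in property iv) can be compared with a direct calculation of the tail probability. The sketch below is illustrative only: the function names are hypothetical, the tail probability is computed from the standard power series for the regularized incomplete gamma function (a standard-library stand-in for a statistical package), and the values v = 40 and χ² = 55.76 are chosen for illustration and do not come from the text.

```python
import math

def chi_square_upper_tail(x, v):
    """P(chi-square with v d.f. exceeds x), via the power series of the
    regularized lower incomplete gamma function."""
    if x <= 0:
        return 1.0
    a, t = v / 2.0, x / 2.0
    term = 1.0 / a
    total = term
    k = 0
    while term > 1e-16 * total:
        k += 1
        term *= t / (a + k)  # next term of the series sum t^k / (a (a+1) ... (a+k))
        total += term
    lower = total * math.exp(a * math.log(t) - t - math.lgamma(a))
    return 1.0 - lower

def fisher_upper_tail(x, v):
    """Upper-tail probability from Fisher's normal deviate sqrt(2x) - sqrt(2v - 1)."""
    z = math.sqrt(2.0 * x) - math.sqrt(2.0 * v - 1.0)
    return 0.5 * math.erfc(z / math.sqrt(2.0))

v, x = 40, 55.76  # illustrative values, not taken from the text
exact = chi_square_upper_tail(x, v)
approx = fisher_upper_tail(x, v)
print(f"series: {exact:.4f}   Fisher approximation: {approx:.4f}")
```

The two probabilities should agree to about two decimal places for degrees of freedom of this size, which is the sense in which the approximation is usable beyond the range of the printed tables.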
Conditions for the Application of the Chi-Square Test:
i) The sample must be large; it should preferably contain 50 or more items even though the number of cells or class intervals may be small. Aggregation and classification generally reduce the number of cells;
ii) The N individual items in the sample must have been drawn independently;
iii) The number of cells must be neither too small nor too large. It is preferable to have the class intervals or cells in the range of 5 to 20;
iv) The constraints to which the cell frequencies are subjected must be linear. The researcher can choose to formulate the constraints so as to satisfy the condition of linearity;
v) The cell frequencies must not be small. In any case, no cell frequency should be less than five. It is preferable to have 10 or more as the smallest cell frequency. This condition can easily be satisfied by clubbing several classes and aggregating the corresponding frequencies together in case their frequencies are less than five.
Numerical Examples:
If the population variance is known or its value is postulated, the Chi-square statistic can be used to determine whether the sample estimate of the population variance differs from the actual population variance by more than what is warranted by sampling fluctuations. For such problems, Chi-square is defined as follows:
$$\chi^2 = \frac{nS^2}{\sigma^2} = \frac{\sum (x - \bar{x})^2}{\sigma^2},$$
where n is the sample size, S² is the sample estimate of the variance, x̄ is the sample mean, and σ² is the population variance.
Test of Variances
Example 1: A random sample of the heights of 20 males is collected from the records of an army recruitment board, which gives a variance of 3.21. Is it consistent with the population variance of 2.79?
$$\chi^2 = \frac{nS^2}{\sigma^2} = \frac{20 \times 3.21}{2.79} = \frac{64.2}{2.79} = 23.01.$$
For 19 d.f., the 5% value of Chi-square is 30.14, which is greater than the calculated value. Hence, the difference between the true population variance and its sample estimate could have arisen from sampling fluctuations.
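The arithmetic of Example 1 can be reproduced in a few lines. This is a minimal sketch; the function and variable names are illustrative, and the critical value is the 5% table value for 19 d.f. quoted in the example.

```python
def variance_chi_square(n, sample_variance, population_variance):
    """Chi-square statistic n * S^2 / sigma^2 for a test of the variance."""
    return n * sample_variance / population_variance

statistic = variance_chi_square(20, 3.21, 2.79)
critical_5pct_19df = 30.14  # table value quoted in Example 1
consistent = statistic < critical_5pct_19df

print(f"chi-square = {statistic:.2f}")
print(f"consistent with a population variance of 2.79: {consistent}")
```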
Example 2: Ten random samples of 50 persons each were selected for an opinion poll, with the following numbers of votes favouring a particular candidate: 25, 25, 27, 27, 28, 29, 30, 31, 33, 33. Is this an unusual variation to expect? In the opinion poll, the respondents are given two choices (binary: success or failure): either they favour the particular candidate or they do not. If they favour the candidate, we denote this by S, and if they do not, by S̄. The probabilities of the occurrence of S and S̄ may then be denoted by p and q. Occurrences of S and S̄ may be calculated from the binomial distribution (q+p)^n. If all the ten sub-samples are pooled together, the overall sample size is N × n = 10 × 50 = 500.
The variance of the binomial distribution is given by npq, where n is the sample size, p is the probability of success and q is the probability of failure in any trial. The mean number of successes, that is, the expected value, is np. For testing the acceptability of the sample variance, equation 3 has to be used:
$$\chi^2 = \frac{nS^2}{\sigma^2} = \frac{\sum (x_i - \bar{x})^2}{npq}$$
The total number of votes favouring the particular candidate in all the samples = (25+25+27+27+28+29+30+31+33+33) = 288. But the total number of voters who have been interviewed = 50 × 10 = 500. Hence, the actual probability of any voter favouring the particular candidate in all the samples taken together is p = 288/500 = .576, and q = 1 - p = .424.
The mean number of successes in the pooled sample = 500 × .576 = 288.
With N = 10 samples,
$$\sum (x - \bar{x})^2 = \sum x^2 - \frac{(\sum x)^2}{N}$$
and
$$\sum x^2 = 625+625+729+729+784+841+900+961+1089+1089 = 8372$$
$$\chi^2 = \frac{8372 - 288 \times 288/10}{(50)(.576)(.424)} = \frac{77.6}{12.2112} = 6.35.$$
For 9 d.f., the 5% table value of Chi-square is 16.919. The table value is much greater than the calculated value. Hence, the evidence is in support of the hypothesis that the variation could have arisen from sampling fluctuations.
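The steps of Example 2 can be sketched as follows. The variable names are illustrative; the critical value is the 5% table value for 9 d.f. quoted in the example.

```python
votes = [25, 25, 27, 27, 28, 29, 30, 31, 33, 33]  # favourable votes per sample
sample_size = 50                                   # persons in each sample

# Pooled estimate of the probability of a favourable vote.
p = sum(votes) / (sample_size * len(votes))
q = 1 - p

# Sum of squared deviations of the sample counts from their mean.
mean_votes = sum(votes) / len(votes)
sum_sq_dev = sum((x - mean_votes) ** 2 for x in votes)

# Binomial variance npq for one sample of 50 persons.
binomial_variance = sample_size * p * q

statistic = sum_sq_dev / binomial_variance
critical_5pct_9df = 16.919  # table value quoted in Example 2

print(f"p = {p:.3f}, q = {q:.3f}")
print(f"chi-square = {statistic:.2f}, below critical value: {statistic < critical_5pct_9df}")
```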
The Test of Goodness of Fit:
The most common use of the Chi-square test is to assess the goodness of fit of a theoretical distribution/curve or a mathematical function to a given set of data. The hypothesis to be tested is whether the observed sample frequencies are in consonance with the expected frequencies, or whether the values predicted by the function are acceptable. The test statistic is
$$\chi^2 = \sum \frac{(f_0 - f_e)^2}{f_e},$$
where f_0 denotes the observed frequency and f_e is the expected/theoretical frequency/value. The expected frequencies may be obtained by any one of the following methods: they may be derived from some theoretical distribution like the binomial distribution; they may be obtained by applying the principle of independence in the case of a 2X2 frequency table; they may be obtained from expected ratios such as the sex ratio; or they may be based on empirical data. It may also be assumed that the frequencies occur randomly. If a mathematical/econometric function has been used (see Prakash and Subramanian, 2006), the value predicted by this function may be used.
Example 3: In 120 throws of a single die, the following results are obtained:
Number (X):      1    2    3    4    5    6    Total
Frequency (f):   30   25   18   10   22   15   120
Do these frequencies discredit the hypothesis of equal probability of each number?
If the die is unbiased, the probability of each of the six numbers is 1/6 and the expected frequency of each number/face is np = 120/6 = 20.
$$\chi^2 = \frac{(30-20)^2}{20} + \frac{(25-20)^2}{20} + \frac{(18-20)^2}{20} + \frac{(10-20)^2}{20} + \frac{(22-20)^2}{20} + \frac{(15-20)^2}{20} = \frac{100+25+4+100+4+25}{20} = \frac{258}{20} = 12.9,$$
with v = 6-1 = 5 degrees of freedom. The 5% value of Chi-square for 5 d.f. is 11.07, which is less than the calculated value. Hence, the hypothesis of an unbiased die with equal probability of each number is rejected.
The evidence does not support the thesis that the die is unbiased.
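The goodness-of-fit calculation for the die can be sketched as below. The function name is illustrative, and the critical value 11.07 is the standard 5% table value of Chi-square for 5 d.f.

```python
def goodness_of_fit(observed, expected):
    """Sum of (f0 - fe)^2 / fe over all classes."""
    return sum((fo - fe) ** 2 / fe for fo, fe in zip(observed, expected))

observed = [30, 25, 18, 10, 22, 15]                      # frequencies of faces 1..6
expected = [sum(observed) / len(observed)] * len(observed)  # 20 per face if unbiased

statistic = goodness_of_fit(observed, expected)
dof = len(observed) - 1
critical_5pct_5df = 11.07  # standard 5% table value for 5 d.f.

print(f"chi-square = {statistic:.1f} with {dof} d.f.")
print(f"reject the hypothesis of an unbiased die: {statistic > critical_5pct_5df}")
```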
χ² for Contingency Table
Example 4: The following table shows the classification of employed male graduates by occupation, related to the occupation of their fathers.
Number of youths reporting, by father's occupation:
Youth's Occupation   Unskilled   Skilled   White Collar   Total
White Collar             8          39          56         103
Skilled                 34         118          38         190
Unskilled                5          16          10          31
Total                   47         173         104         324
Test the association between the occupation of the youth and that of their fathers.
The probability of the joint occurrence of two independent events is the product of the probabilities of their individual occurrences. If n is the sample size, n(io) is the total frequency of the i-th row, n(oj) is the total frequency of the j-th column and n(ij) is the frequency of the i-j-th cell, the expected frequency of this cell is given by
$$n\,p(ij) = \frac{n(io)\,n(oj)}{n}$$
If we assume independence between the classifications, that is, that the occupation chosen by the sons is independent of the occupation of their fathers, the probability of the i-j-th cell is p(ij) = p(i) × p(j). But p(i) = n(io)/n and p(j) = n(oj)/n. Hence,
$$p(ij) = \frac{n(io)\,n(oj)}{n \cdot n}.$$
This probability multiplied by the sample size gives the expected frequency:
$$n_e(ij) = n\,p(ij) = \frac{n(io)\,n(oj)}{n}$$
Hence,
$$\chi^2 = \sum \frac{\{n(ij) - n_e(ij)\}^2}{n_e(ij)}.$$
Thus, the expected frequencies are calculated with the help of the grand total and the marginal frequencies. If we have p rows and q columns, the number of degrees of freedom is given by (p-1)(q-1). If the null hypothesis of the independence of the classifications is rejected, the relationship between the given attributes is estimated by the coefficient of contingency:
$$C = \sqrt{\frac{\chi^2}{n + \chi^2}}$$
In the given example, the expected frequencies are as given below: n(11) = 47 × 103/324 = 14.9, n(12) = 103 × 173/324 = 55, n(13) = 103 × 104/324 = 33.1, n(21) = 190 × 47/324 = 27.6, n(22) = 190 × 173/324 = 101.4, n(23) = 190 × 104/324 = 61, n(31) = 31 × 47/324 = 4.5, n(32) = 31 × 173/324 = 16.6, n(33) = 31 × 104/324 = 9.9. The value of χ² will, therefore, be given by
$$\chi^2 = \frac{(8-14.9)^2}{14.9} + \frac{(39-55)^2}{55} + \frac{(56-33.1)^2}{33.1} + \frac{(34-27.6)^2}{27.6} + \frac{(118-101.4)^2}{101.4} + \frac{(38-61)^2}{61} + \frac{(5-4.5)^2}{4.5} + \frac{(16-16.6)^2}{16.6} + \frac{(10-9.9)^2}{9.9} = 36.645.$$
For d.f. = 2 × 2 = 4, the 5% value of Chi-square is 9.488, which is less than the calculated value of χ². Hence, the evidence is against the hypothesis of independence. Now the question is: what is the degree of relationship between the occupations of fathers and sons?
The coefficient of contingency is
$$C = \sqrt{\frac{36.645}{324 + 36.645}} = 0.32,$$
while the maximum value of the coefficient for a 3X3 table is about 0.82. Hence, the calculated value of the contingency coefficient is moderate: it is neither too low nor too high. The empirical evidence suggests that the relation between the occupations of fathers and sons, though statistically significant, is not strong.
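The contingency-table procedure of Example 4 can be sketched as follows. The function name is illustrative. The code keeps the expected frequencies unrounded, so its statistic (about 36.7) differs slightly from the 36.645 obtained above with expected frequencies rounded to one decimal place; the critical value 9.488 is the standard 5% table value for 4 d.f.

```python
import math

# Rows: youth's occupation; columns: father's occupation
# (unskilled, skilled, white collar), as in Example 4.
table = [
    [8, 39, 56],    # youth: white collar
    [34, 118, 38],  # youth: skilled
    [5, 16, 10],    # youth: unskilled
]

def contingency_chi_square(rows):
    """Chi-square for independence, with expected cell = row total * column total / n."""
    row_totals = [sum(r) for r in rows]
    col_totals = [sum(c) for c in zip(*rows)]
    n = sum(row_totals)
    chi_sq = 0.0
    for i, row in enumerate(rows):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi_sq += (observed - expected) ** 2 / expected
    dof = (len(rows) - 1) * (len(col_totals) - 1)
    return chi_sq, dof, n

chi_sq, dof, n = contingency_chi_square(table)
contingency = math.sqrt(chi_sq / (n + chi_sq))  # coefficient of contingency C
critical_5pct_4df = 9.488                        # standard 5% table value for 4 d.f.

print(f"chi-square = {chi_sq:.2f} with {dof} d.f.")
print(f"independence rejected: {chi_sq > critical_5pct_4df}")
print(f"coefficient of contingency C = {contingency:.2f}")
```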
Test of the Coefficient of Association
2X2 Frequency Table. A 2X2 frequency table is a special case of the pxq contingency table. Such a table is used to illustrate the following points: i) the formula for calculating Chi-square without correction; ii) the calculation of χ² with Yates' correction for small frequencies; and iii) an exact solution for Chi-square. Let the 2X2 table be as follows:
        A      B      Total
C       a      b      a+b
D       c      d      c+d
Total   a+c    b+d    n
Chi-square for such a table is given by:
$$\chi^2 = \frac{(a+b+c+d)(ad-bc)^2}{(a+b)(c+d)(a+c)(b+d)}$$
where n = a+b+c+d
Example 5: Find the value of Chi-square for the following table without Yates' correction for the small frequencies:
        A      B      Total
C       10     4      14
D       2      8      10
Total   12     12     24
Calculated value of Chi-square:
$$\chi^2 = \frac{24\,(10 \times 8 - 4 \times 2)^2}{12 \times 12 \times 14 \times 10} = \frac{24 \times 72 \times 72}{12 \times 12 \times 14 \times 10} = \frac{216}{35} = 6.17$$
The value of Chi-square at the 5% level for (r-1)(c-1) = (2-1)(2-1) = 1 d.f. is 3.841. This is much less than the calculated value. Hence, the attributes are not independent. The results suggest that the attributes are associated.
Alternatively, the expected frequencies may be calculated as follows:
$$n(11) = \frac{12 \times 14}{24} = 7,\quad n(12) = \frac{12 \times 14}{24} = 7,\quad n(21) = \frac{12 \times 10}{24} = 5,\quad n(22) = \frac{12 \times 10}{24} = 5$$
Then, the value of Chi-square is
$$\chi^2 = \frac{(10-7)^2}{7} + \frac{(4-7)^2}{7} + \frac{(2-5)^2}{5} + \frac{(8-5)^2}{5} = \frac{18}{7} + \frac{18}{5} = \frac{90+126}{35} = \frac{216}{35} = 6.17$$
The results remain the same irrespective of the procedure followed. Since two of the cell frequencies are small, we may apply Yates' correction by raising the small frequencies by ½ and decreasing the large frequencies by ½. The table thus corrected will be as given below:
        A      B      Total
C       9.5    4.5    14
D       2.5    7.5    10
Total   12     12     24
Then,
$$\chi^2 = \frac{24\,(9.5 \times 7.5 - 4.5 \times 2.5)^2}{12 \times 12 \times 14 \times 10} = \frac{24 \times 60 \times 60}{12 \times 12 \times 14 \times 10} = 4.29$$
Without correction, the term (ad-bc)² = 72 × 72 = 5184, which is now reduced to 60 × 60 = 3600 by the correction. This reduction in the value of the numerator accounts for the reduction in the value of Chi-square from 6.17 to 4.29. But the calculated value is still greater than the table value. Hence, it is again significant at the 5% probability level and the inference drawn earlier is not altered.
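Both versions of the 2X2 calculation can be sketched in one function. The function name and the `yates` flag are illustrative. Instead of adjusting the four cells by ½ one at a time, the sketch uses the equivalent shortcut of reducing |ad - bc| by n/2, which gives the same 72 - 12 = 60 as the corrected table above.

```python
def chi_square_2x2(a, b, c, d, yates=False):
    """Chi-square for a 2x2 table with cells a, b (first row) and c, d (second row).

    With yates=True, |ad - bc| is reduced by n/2, which is equivalent to moving
    each cell by 1/2 while keeping the marginal totals unchanged.
    """
    n = a + b + c + d
    diff = abs(a * d - b * c)
    if yates:
        diff = max(diff - n / 2, 0.0)
    return n * diff ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

plain = chi_square_2x2(10, 4, 2, 8)
corrected = chi_square_2x2(10, 4, 2, 8, yates=True)
critical_5pct_1df = 3.841  # table value quoted in Example 5

print(f"without correction: {plain:.2f}")
print(f"with Yates' correction: {corrected:.2f}")
print(f"still significant at 5%: {corrected > critical_5pct_1df}")
```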
Derivation of Expected Values
Expected values may be derived in alternative ways. Some scientific principle may be used to estimate the expected values. Alternatively, empirical evidence may be used to evolve the criteria for generating expected values. The evidence may be used to determine the relative frequencies or probabilities, as we have done in solving Example 4. Some statistical distribution or mathematical/econometric model may also be used to determine the theoretical values.
Scientific Principle as the Base
Example 6: In an experiment on pea breeding, Mendel obtained the following frequencies of seeds:
Round Yellow   Wrinkled Yellow   Round Green   Wrinkled Green   Total
315            101               108           32               556
The theory predicts that the frequencies should be in the proportions 9:3:3:1. Examine the consistency of the data with this expectation.
According to the theoretical prediction of the relative shares of the 4 types of seeds in a total of 16, the expected frequencies will be
$$\frac{9 \times 556}{16},\quad \frac{3 \times 556}{16},\quad \frac{3 \times 556}{16},\quad \frac{1 \times 556}{16}$$
that is, 313, 104, 104 and 35 respectively.
The value of
$$\chi^2 = \frac{(315-313)^2}{313} + \frac{(101-104)^2}{104} + \frac{(108-104)^2}{104} + \frac{(32-35)^2}{35} = \frac{4}{313} + \frac{9}{104} + \frac{16}{104} + \frac{9}{35} = 0.51.$$
But the 5% value of Chi-square for 3 d.f. is 7.815, which is much greater than the observed value. Hence, the difference between the observed and the theoretical frequencies is not significant. The data are thus consistent with the theoretical expectation.
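Example 6 rounds the expected frequencies to whole numbers before computing χ². The sketch below, with illustrative names, computes the statistic both with those rounded values and with the unrounded expectations 556 × 9/16 and so on, to show that the rounding changes the statistic slightly but not the conclusion.

```python
def goodness_of_fit(observed, expected):
    """Sum of (f0 - fe)^2 / fe over all classes."""
    return sum((fo - fe) ** 2 / fe for fo, fe in zip(observed, expected))

observed = [315, 101, 108, 32]  # seed counts in the order listed in Example 6
ratio = [9, 3, 3, 1]            # theoretical proportions

total = sum(observed)
exact_expected = [total * r / sum(ratio) for r in ratio]
rounded_expected = [313, 104, 104, 35]  # whole-number values used in Example 6

exact_stat = goodness_of_fit(observed, exact_expected)
rounded_stat = goodness_of_fit(observed, rounded_expected)
critical_5pct_3df = 7.815  # table value quoted in Example 6

print(f"unrounded expectations: chi-square = {exact_stat:.2f}")
print(f"rounded expectations:   chi-square = {rounded_stat:.2f}")
print(f"consistent with 9:3:3:1: {max(exact_stat, rounded_stat) < critical_5pct_3df}")
```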
Example 7: Ten random samples of 100 items each have been collected. Are the following frequencies of males and females consistent with an expectation of equal division of the sexes?
Male     40  52  49  50  43  48  42  45  41  51
Female   60  48  51  50  57  52  58  55  59  49
The expected number of males or females = np = ½ × 100 = 50 in each sample. The squared differences between the observed and expected frequencies are the same for males and females. Hence, Chi-square = 2(100+4+1+0+49+4+64+25+81+1)(1/50) = 2 × 329/50 = 13.16.
But the 5% value of Chi-square for 10 d.f. is 18.307, which is greater than the calculated value. Therefore, the data do not discredit the hypothesis of equal proportions of males and females in the population.
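The pooled calculation of Example 7 can be sketched as follows. The variable names are illustrative. The sketch assumes that each sample contains exactly 100 items, so the number of females is taken as 100 minus the number of males, and it uses the standard 5% table value of Chi-square for 10 d.f., which is 18.307.

```python
males = [40, 52, 49, 50, 43, 48, 42, 45, 41, 51]  # males in each of the ten samples
sample_size = 100                                  # items per sample (assumed exact)
expected = sample_size / 2                         # 50 of each sex under equal division

statistic = 0.0
for m in males:
    f = sample_size - m  # females are the remainder of the sample
    statistic += (m - expected) ** 2 / expected + (f - expected) ** 2 / expected

dof = len(males)            # one degree of freedom contributed by each sample
critical_5pct_10df = 18.307  # standard 5% table value for 10 d.f.

print(f"chi-square = {statistic:.2f} with {dof} d.f.")
print(f"equal division not discredited: {statistic < critical_5pct_10df}")
```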
Books Recommended:
1. Croxton and Cowden: Applied General Statistics.
2. Kenney, J.F. and Keeping, E.S.: Mathematics of Statistics, Parts I & II, Affiliated East-West Press Pvt. Ltd., New Delhi.
3. Rosander, A.C.: Elementary Principles of Statistics, Affiliated East-West Press Pvt. Ltd., New Delhi.
4. Yamane, Taro: Statistics.
5. Yule and Kendall: Introduction to the Theory of Statistics.
6. Weatherburn, C.E.: A First Course in Mathematical Statistics, Cambridge University Press, London.
7. Kane, Edward J.: Economic Statistics & Econometrics, Harper & Row, New York, Evanston & London and John Weatherhill, Inc., Tokyo.
8. Koutsoyiannis
9. Cooper, Donald R. and Schindler, Pamela S.: Business Research Methods, Tata McGraw Hill, New Delhi.
10. Levine, David M., Krehbiel, Timothy L. and Berenson, Mark L.: Business Statistics, Pearson Education Asia, 2001.
11. Prakash and Subramanian (2006): Determination of Share Prices: Analysis of a Select Group of Indian Companies, forthcoming in Finance India.