Categorical data II The association between two categorical variables دکتر سید ابراهیم...
-
Upload
braedon-gilliom -
Category
Documents
-
view
218 -
download
3
Transcript of Categorical data II The association between two categorical variables دکتر سید ابراهیم...
Categorical data II
The association between two categorical
variables
/ 1388دکتر سید ابراهیم جباری فر تاریخ : 2010
دانشیار دانشگاه علوم پزشکی اصفهان بخش دندانپزشکی جامعه نگر
By the end of the session you should be able to
Construct 2-way table to examine association between two
categorical variatles
Conduct Chi-squared test to assess the association between
two categorical variables
Conduct Chi-square test for trend when the exposure is
ordered categorical
Example
Random sample of individuals were questioned about their
occupatim and their BP was measured. Based on SBP and DBP
measures they were classified as hypertensive or non
hypertensive. Among 300 people in non-manual jobs, there
were 72 hypertensive individuals. Among 240 people in
manual jobs, there were 96 hypertensive indwiduals.
Constructing 2- way table
As a first step we need to organize our data in a formal way
we construct 2-way table
Hypertension
Total No Yes
240 144 96 Manual
300 228 72 Non- manual
540 372 168 Total
What does it mean when we speak about an
association between two categorical variables?
What does it mean when we speak about an association
between two categorical variables ?
It means that knowing the value of one variable tells us
sanething about the value of the other variable .
Two variables are therefore saidto be associated if the
distribution of one variable varies according to the value of
the other varialde .
What does it mean when we speak about an association
between two categorical variables ?
In our example, the two variables, occupation and hypertension,
are associated if the distribution of hypertension varies between
occupatioml groups .
And, if distribution of hypertension is same in both occupational
groups, we can say that there is no association between
hypertension and occupational category - because knowing a
occupational category of indi\idual wil! not tell us anything
about hypertension.
What does it mean when we speak about an association between
two categorical variables ?
•Having constructed a two-way table, the next step is to look
whether the distribution of one variable differs according to the
value of the other variable .
We need to calculate either row or column percentages .
Often, one variable can be regarded as the response variable, while
the other is the explanatory variable, and this should help to decide
what percentages are shown
•If the columns represent the explanatory variable, then column
percentages are more appropriate, and vice versa.
Constructing 2-way table
As a second step we calculate proportion of hypertensive
individuals among manual workers. non-manual workers and in
the whole sample
Hypertension
Total No Yes
240 144(60.0%) 96(40.0%) Manual
300 228(76.0%) 72(24.0%) Non-manual
450 372(68.9%) 168(31.1%) Total
The numbers in the four categories in the 2 way
table in the previous slide all called
OBSERVED NUMBERS I
The data seem to suggest sane association between
hypertension and occLPation )40% of manual workers with
hypertension compared to 24% of non-manual workers with
hypertension(
The calculation and examination of such percentages is an
essential step in the analysis of a two-way table, and should
always be done before starting formal significance tests.
Significance test the association
Although it seemsthat there is an association in the table, the
question is whether this may be attributable to samping variability
Each of the percentages i1 the table is subject to sampling error,
and we need to assess whether the differences between them may
be due to chance
This is done by conducting a sig1ificance test
The null hypothesis is "there is no association between the two
variables'
Expected numbers
The significance test is Chi-squared test
This test canpares the observed numbers in each of four
categories of contingency table with the numbers to be
expected if there was no difference in proportion of
hypertensive indviduals in two occupational groups
Hypertension
Total No Yes
240 74.64 Manual
300 Non-manual
450 372(68.9%) 168(31.1%) Total
From the table above, the overall proportion of hypertensive
individuals is 168/372 )31.1 %(.
If the null hypothesis were true, the expected number of manual
subjects with hypertension is 31.1 % of 240, which is 74.64
Hypertension
Total No Yes
240 74.64 Manual
300 Non-manual
450 372(68.9%) 168(31.1%) Total
Expected numbers in the other cells of the table can be
calculated similarly, using the general formula:
Row×Culumn totalExpected number=
Overall total
Hypertension OBSERVED
Total No Yes
240 144 96 Manual
300 228 72 Non-manual
450 372(68.9%) 168(31.1%) Total
Next step – compare observed and expected numbers
Hypertension EXPECTED
Total No Yes
240 165.36 74.64 Manual
300 206.64 93.36 Non-manual
450 372(68.9%) 168(31.1%) Total
Chi-squared test (X2test)
Calculate )O-E(/E for each cell and sum over all cells
In our example
X2 = [)96-74.64( 2/74.64 + )144-165.36(2 / 165.36 + )72-
93.36(2 / 93.36 + )228-206.64(2 / 206.64] = 15.97
/E]E)OX 22
If X2 value is large then )O-E( is, in general, large and data do not
support H0 = association
If X2 value is small then )O-E( is, in general, Small and data do
support
H0= no association
Large values of X2 suggest that the data are inconsistent with the
null hypothesis, and therefore that there is an associaion between
the two variables.
Obtaining p-value
Under Ho: X2 distribution
Obtaining p-value
The P-value is obtained by referring the calculated value of X2
to tables of the en-squared
distribution .
The P-value in this case corresp:mds to the value shown as a in
the tables .
The degrees of freedom are given by the formula:
d.f=(r-1)x(c-1)
R= number of rows , c=number of columns
Back to our example:
X2=15.97
Table 2x2
d.f.=1
and from the table
P<O.001
Relation with normal test for the difference between 2
proportions:
Pl-P2= 40%-24% = 16%
=4.01
Z=)P1-P2(/SE=16/4.01=3.99
For 2 × 2 table, X2 = Z2, i.e. 15.97 = 3.992
Pvalues for both tests are identical
3001
2401
*68.9*31.1P2)SE(P1
Quick formula
There is a quick formula calculating chi-square test for 2×2 table
If we use following notation:
Quicker formula for calculating chi-squared test is
Total
e b a
f d c
N h g Total
Quicker formula for calculating chi- squared test is
d.f=1
efghbc)(ad*N
X2
2
Continuity correction
The chi-squared test for 2×2 table can be improved by using
a continuity correction, called Yates' continuity correction
When we compare two p-oportions using the Z test )or the
X2 test(, we are making the assumption that the sampling
distributions of each of the proportions, and of the
difference between them, are approximately normal.
Continuity correction - cont.
In fact, the distribution et a proportion is not exactly
normal, because it is a dscrete variable, whereas the
normal distribution elates to continuous variables.
Continuity correction
For large enough samples, the binomial distribution
approximates very closely tothe normal distribution
This is why we use of the Z and x2 tests for the comparison of
proportions
However, the approximation can be improved by making a
continuity correction, which adjusts the value of z to allow for
the fact thatwe are modelling a discrete variable with a
continuous distribution
Yates’c ontinuity correction - formula
The effect of the continuity correction is to reduce the value
of x2; with smaller samples, the effect is more substantial -
which is exactly what we want given that the normal
distribution is better approximation of binomial distribution
when sample is larger
Exact test for 2×2 tables
The exact test to compare 2 proportions is needed when the
numbers in the 2 × 2 table are very small )the consensus: if the
total number of individuals
is less than 40 and one of the cells )or more( is lesslhan 5(
It is based on calcLJating exact probabilities of the observed
table and of more 'extreme' tables with the same row and
column totas
Formula and more details - Kirkwood and Sterne, p.169-171
Larger tables )r x c tables(
No continuity correction
Valid if less than 20% of expected numbers are under 5 and none
is less than 1
If low expected numbers- combine either rows or columns to
overcome this problem
How to calculate expected number in particular cell
Row total Column total
Expected number=
Overall total
Ordered exposures 1X2 test for trend
A special case = there are two rows but more than two columns
)or it can be two columns araJ more than two rows(
AND an ordered categorical variable is involved )as the
variable with more than 2 categories(
Education M1
Total University Secondary Primary
132 20(22.2%) 64(32%) 48(40%) Yes
278 70 136 72 No
410 90 200 120 Total
3 2 1 Score for trend test
Example
In this 2 x 3 table, the aim is to examine if there is any
association between educaton and MI.
MI is the outcome of interest
Education is the exposure of nterest
We want to compare the proportions of subjects with MI in
the three education groups.
Standard chi-squared test
X2=7.45
Table 2×3-d.f=2
P=0.024
So there is statistical evidence of an association between the two
variables.
However, when the column variable is an ordered categorical
variable, a more
sensitive procedure for detecting an association is to use a x2 test-
for-trend
X2 test – for trend
The null hypothesis is the same as for an ordinary 2 × c X2 test,
that there is no association between the two variables. But the
test-for-trend is particularly sensitive to an upward or downward
trend in proportions across the columns of the table
The test is conducted by assigning the numerical scores 1, 2, 3 ...
to the columns of the table. We then calculate the mean scores X1
and X2 observed in those with MI and in those without MI.
• Mean score among those withoutMI
X1=[48*1+64*2+20*3]/132=236/132=1.79
Mean score among those without MI
X2= [72*1 + 136* 2+ 70* 3]/278= 554/278= 1.99
If there is no difference between the three proportions, two mean
scores will be equal.
If there is an upward )or downward( trend in the proportions, the
mean score in the MJ group will be greater )or less( than that in the
no MI group.
We now need to calculate stancard deviation s.
Let's assume that we have a sampe of 41 0 points made up
of 120 ones, 200 twos and 90 threes. From this sample we
find the standard deviation of the whole sample in the
usual way.
S=
Mean is overall mean of ones, twos and threes
Example
mean = (120 *1 + 200 *2 +90 *3)/410= 1.93
And
S=
S=0.71
4091.93)(3*901.93)(2*2001.93)(1*120 222
Chi-squared test for trend
obtained by comparing the mean scores for the 2 groups:
d.f.=1
p = 0.006
Where n, and n2 are total numbers of individuals with and without
MI
You can see that square root of the formula is formula for Z test for
comparing means of 2 samples
Z is square root of X2
7.6
n1
n1
*s
)xx(X
21
21
You can see that p value is smaller than with ordinary chi-
squared test
There is even stronger evidence of an association between MI
and education
The alternative formula for chi-squared test for trend is
described by Kirkwood cnd Sterne, p.173 176
Chi- squared tests in STATA
Let's use data from practical 10 )last session(
We try to evaluate whether there is an association between current
smoking and age
We have age grouped into 4 groups )30-39, 40-49, 50-59, 60 69(
Smoking )variable smok( was coded 1 =current smokers, 2=non-
smokers
Firstly, we recode smoking
Recode smok 2=0 )we need the outcome to be coded as 0.1(
Let’s check proportion of smokers in each age category
Tab smok agegroup col
30-39,40-49,50-59,60-69
Total 60 50 40 30 1=Yes 0=no
1.67565.69
49178.81
49072.38
35756.31
33754.71
0
87534.31
13221.19
18727.62
27743.69
27945.29
1
2.550100.00
623100.00
677100.00
634100.00
616100.00
Total
Now , standard chi- squared test
Tab smok agegroup , col chi
30-39,40-49,50-59,60-69
Total 60 50 40 30 1=Yes 0=no
1.67565.69
49178.81
49072.38
35756.31
33754.71
0
87534.31
13221.19
18727.62
27743.69
27945.29
1
2.550100.00
623100.00
677100.00
634100.00
616100.00
Total
Pearson chi2)3(=118.7458 pr=0.000
Degrees of freedom chi-squared test value P<0.001
Chi – squared test for trend
COMMAND OUTCOME EXPOSURE
Tabodds smok ageroup
interval 95% conf odds controls cases agegroup
0.970220.907750.451660.32580
0.706440.663220.322460.22184
0.827890.775910.381630.26884
337357490491
279277187132
30405060
Test of homogeneity (equal odds): chi2(3)=118.70
pr>chi2=0.0000
Score test for trend of odds: chi2(1)=108.83
pr>chi2=0.0000
This is part that is interesting at themomenti
What have we learnt in the session?
1. Construct 2-way table to examine association
between two categorical variatles
2. Conduct Chi-squared test to assess the association
between two categorical variables
3. Conduct Chi-squared test for trend for ordered
exposures