Categorical data 1
Single proportion and comparison of 2 proportions
Dr. Seyed Ebrahim Jabarifar. Date: 1388 (2010)
Associate Professor, Department of Community Dentistry, Isfahan University of Medical Sciences
The objectives of the session
Sampling distribution of a single proportion
Calculation of 95% confidence interval for a proportion
The comparison of two proportions (or percentages)
Statistical test of significance for comparison of two proportions
Calculation of 95% Confidence interval for the difference in two proportions.
Examples of categorical data
Education: primary, secondary, university
Marital status: married, single, divorced, widowed
Cigarette smoking history: never smoker, ex-smoker, current smoker
More examples of categorical data
Endpoint in a study
Person is dead or alive
Person with MI or without MI
Person can rate their own health as very good, good, average, bad or very bad
More examples of categorical data
Quantitative measurements or assessments can be used
as categorical data:
Hypertension: Yes (for example systolic BP≥ 160 or
diastolic BP ≥ 90 mm Hg) or no
Alcohol consumption: none, light (< 200 ml of ethanol/week), heavy (≥ 200 ml of ethanol/week)
Proportions and percentages
In this session, we will concentrate on the use of binomial data (= data with just two categories).
Example: in a survey, interviews were conducted with 5335 middle-aged women. Of these, 1476 were current smokers while 3859 were not.
Proportion of smokers = 1476/5335 = 0.277
Percentage of smokers = 0.277 × 100 = 27.7%
Sampling variability of a proportion
• It is important to take into account the number of subjects included
• The greater the number of subjects, the more reliable our estimates are
• Example: if we want to estimate the proportion of men in a population who smoke cigarettes, a study of 1000 men will be more trustworthy than a study of 10 men
Important assumption
We need to know that the sample of
individuals studied has been randomly
selected from some population of interest
Sampling distribution of single proportion
Let's continue with the example of middle-aged women.
Among 5335 women, there were 1476 smokers.
If we want to say something about the population which this study sample represents, we need the concept of a sampling distribution.
• Let's assume that we repeatedly took a sample of 5335 women.
• For each sample, we calculate the proportion of smokers and then construct a histogram of these values.
• This histogram represents the sampling distribution of the proportion and will take the following shape.
The curve is centred over the value of the proportion of smokers in the population, often referred to as the true proportion and represented by π. Some of the sample proportions will be larger than π, others will be smaller. Many will be close in value to π; a few will be a lot larger or a lot smaller.
In practice we conduct only one survey, from which we have a sample proportion represented by P.
Is P close to π, or is it very different from π?
Only if we are very lucky will P actually be equal to π.
In any random sample, there will be some sampling variation in P.
The larger the sample, the smaller the extent of such sampling variation.
Consider $(P-\pi)^2$ as a measure of the variation in P from the true proportion π.
Then it can be shown mathematically that if you took lots of random samples, each of n subjects, the average value of $(P-\pi)^2$ would be equal to
$\pi(1-\pi)/n$
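The repeated-sampling idea can be tried out on a computer. The following is a minimal sketch, not part of the original slides: it assumes, purely for illustration, that the true proportion is 0.277 (in practice it is unknown), and the function name `simulate_sample_proportions`, the number of repeats and the seed are all hypothetical choices. It draws many samples of 5335 subjects and compares the average squared deviation of the sample proportions with the value given by the formula.

```python
import random

def simulate_sample_proportions(true_p, n, repeats, seed=0):
    """Draw `repeats` random samples of n subjects; return each sample proportion."""
    rng = random.Random(seed)
    proportions = []
    for _ in range(repeats):
        # Each subject is a smoker with probability true_p.
        smokers = sum(1 for _ in range(n) if rng.random() < true_p)
        proportions.append(smokers / n)
    return proportions

# Assumed values for illustration only: the true proportion is normally unknown.
true_p, n = 0.277, 5335
props = simulate_sample_proportions(true_p, n, repeats=400)

mean_p = sum(props) / len(props)
mean_sq_dev = sum((p - true_p) ** 2 for p in props) / len(props)
theoretical = true_p * (1 - true_p) / n

print("mean of the sample proportions:", round(mean_p, 4))
print("average squared deviation:     ", mean_sq_dev)
print("true_p * (1 - true_p) / n:      ", theoretical)
```

With a few hundred repeats, the two last numbers should be of similar size, though not identical, because the simulation is itself subject to sampling variation.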
Variance and standard error of proportion
$\pi(1-\pi)/n$ is the variance of a proportion
$\sqrt{\pi(1-\pi)/n}$ is the standard error of a proportion
It is a measure of the average extent of error in P, that is, how far we can expect the observed proportion to differ from π on average.
Example:
π = 0.4:
n = 100, then SE = 0.049
n = 1000, then SE = 0.0155 (SE smaller)
SE does not depend much on π:
n = 1000, π = 0.5: SE = 0.0158
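The figures in this example can be checked with a few lines of Python. This is only a sketch of the standard error formula for a proportion; the function name `se_proportion` is a hypothetical choice, not something from the slides.

```python
import math

def se_proportion(pi, n):
    """Standard error of a proportion: sqrt(pi * (1 - pi) / n)."""
    return math.sqrt(pi * (1 - pi) / n)

# The three cases from the example above.
for pi, n in [(0.4, 100), (0.4, 1000), (0.5, 1000)]:
    print(f"pi = {pi}, n = {n}: SE = {se_proportion(pi, n):.4f}")
```

Multiplying the sample size by 10 divides the standard error by about 3.2 (the square root of 10), while moving π from 0.4 to 0.5 barely changes it.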
Back to the example: 5335 women, 1476 current smokers
This means that 27.7% of the women are smokers.
The estimated standard error of the proportion of smokers is
$\sqrt{0.277(1-0.277)/5335} = 0.0061$
We can also use percentages:
$\sqrt{27.7(100-27.7)/5335} = 0.61\%$
95% confidence limits for a proportion
We want to get an interval of possible values within which the true population proportion might lie.
This can be done using the theoretical properties of the Normal distribution.
It can be shown that P will be within 1.96 standard errors of π with probability 0.95.
That is, there is just a 2.5% risk that the observed proportion will exceed the true population proportion by more than 1.96 standard errors, and another 2.5% risk that P will underestimate it by more than 1.96 standard errors.
95% Confidence limits for a proportion
We use this fact to define a 95% confidence interval:
$P - 1.96\sqrt{P(1-P)/n}$ to $P + 1.96\sqrt{P(1-P)/n}$
Usually written as P ± 1.96 × standard error of P
Back to example
The true population percentage of smokers has the following 95% confidence interval:
$27.7\% \pm 1.96\sqrt{27.7 \times 72.3/5335}$
This means that the 95% confidence interval is from 26.5% to 28.9%.
These two values are the lower and upper confidence limits, respectively.
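The same calculation can be written as a short Python function. This is a sketch of the Normal-approximation interval described above, working with proportions rather than percentages; the name `proportion_ci95` is hypothetical.

```python
import math

def proportion_ci95(successes, n):
    """95% confidence limits for a proportion: P +/- 1.96 * sqrt(P(1 - P)/n)."""
    p = successes / n
    se = math.sqrt(p * (1 - p) / n)
    return p - 1.96 * se, p + 1.96 * se

# 1476 current smokers among 5335 women.
low, high = proportion_ci95(1476, 5335)
print(f"95% CI: {100 * low:.1f}% to {100 * high:.1f}%")  # 95% CI: 26.5% to 28.9%
```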
95% confidence interval
A 95% confidence interval is the most common statistical technique for displaying the degree of uncertainty that should be attached to any proportion.
There is a 5% risk that the true population proportion lies outside the interval.
That is, you can anticipate that one in every 20 confidence intervals you calculate will not include the true population proportion.
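The "one in every 20" statement can be illustrated by simulation. The sketch below is not from the slides: it assumes a true proportion of 0.3 and samples of 500 subjects (both arbitrary illustrative values), builds a 95% confidence interval from each simulated sample, and counts how often the interval contains the true proportion. The function name `coverage_of_ci95` is hypothetical.

```python
import math
import random

def coverage_of_ci95(true_p, n, repeats, seed=1):
    """Fraction of simulated 95% confidence intervals that contain true_p."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(repeats):
        # Simulate one survey of n subjects and compute its sample proportion.
        p = sum(1 for _ in range(n) if rng.random() < true_p) / n
        se = math.sqrt(p * (1 - p) / n)
        if p - 1.96 * se <= true_p <= p + 1.96 * se:
            hits += 1
    return hits / repeats

# Assumed values for illustration only.
coverage = coverage_of_ci95(true_p=0.3, n=500, repeats=2000)
print("fraction of intervals containing the true proportion:", coverage)
```

The fraction should come out close to 0.95, so roughly one interval in 20 misses the true value.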
Two proportions
Example
Smoking   Men            Women          Total
Yes       566 (45.1%)    313 (23.8%)    879 (34.2%)
No        690            1001           1691
Total     1256           1314           2570
Question
From the table, we want to evaluate how strong is
the evidence that men smoke more than women
The null hypothesis
We need to define the null hypothesis.
In our case, the null hypothesis is that smoking is as frequent among women as it is among men (same proportion of smokers among men and women).
If the null hypothesis were true, then men and women in the whole population would have an identical percentage (%) of smokers.
Alternatively, one can say that if the null hypothesis were true, then for any randomly selected person (man or woman), the probability of being a smoker is the same, independent of the sex of the person selected.
Significance testing for comparing 2 proportions
After defining the null hypothesis, the main question is:
If the null hypothesis is true, what are the chances of getting a difference between the two percentages as big as that observed?
For example, in the Czech study, what is the probability of getting a sex difference in smoking as large as (or larger than) 45% versus 24%?
Observed difference in percentages
= P1-P2
= 45.1%-23.8%=21.3%
The overall percentage response = 879/2570 = 34.2%
If the null hypothesis is true , then the only reason that
P1-P2 differs from 0 is due to the sampling variation
Under the null hypothesis we are assuming that the
two samples of size n1=1256 and n2 =1314 are random
samples of people with equal true probabilities of
response .
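We need to calculate the standard error of the difference in two percentages, where P is the overall percentage response (34.2%):
$\sqrt{P(100-P)\left(\frac{1}{n_1}+\frac{1}{n_2}\right)} = \sqrt{34.2 \times 65.8 \left(\frac{1}{1256}+\frac{1}{1314}\right)} = 1.9\%$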
Now we compare the observed difference with the standard error of the difference, simply by dividing one by the other.
Thus, we compute
$Z = \frac{\text{observed difference in percentages}}{\text{standard error of difference}} = \frac{21.3}{1.9} = 11.2$
• How large does Z have to be in order for us to assert that we have strong evidence that the null hypothesis is untrue?
• We need to make use of the fact that the difference between two observed proportions has approximately a Normal distribution, since this enables us to convert any value of Z into a probability P (as we have already learnt in previous sessions).
In our example, Z = 11.2 and so the probability P is (substantially) less than 0.001. That means, if the proportion of smokers is the same among men and women, the chance of getting such a big percentage difference in our study is less than 0.001.
We therefore have strong evidence that the proportion of smokers in men and women in the defined population is different (and is lower in women). We may also say the difference between the percentages is statistically significant at the 0.1% level.
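The whole test can be sketched in Python as follows. The function name `two_proportion_z_test` is hypothetical, and the P-value is computed as a two-sided probability from the Normal distribution, which is an assumption about how the conversion from Z to P is meant here. The code works with unrounded proportions, so Z comes out at about 11.3 rather than the 11.2 obtained above from the rounded values 21.3 and 1.9.

```python
import math

def two_proportion_z_test(x1, n1, x2, n2):
    """Z test for two proportions, using the pooled proportion under the null hypothesis."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)          # overall proportion of responders
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided P-value from the standard Normal distribution (assumed convention).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Men: 566 smokers out of 1256; women: 313 smokers out of 1314.
z, p_value = two_proportion_z_test(566, 1256, 313, 1314)
print(f"Z = {z:.1f}, P = {p_value:.2g}")
```

The same function can be applied to any 2 × 2 table of counts, for instance the marital status exercise.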
Exercise: we want to know whether smoking depends on marital status
Smoking   Married        Unmarried      Total
Yes       732 (34.1%)    147 (34.9%)    879 (34.2%)
No        1417           274            1691
Total     2149           421            2570
The observed difference in percentages is
The standard error of the difference (using the formula given above) is
Z = d / SE = 0.8 / 2.53 = 0.32
P =
95% confidence interval for a difference in two percentages
While giving the actual P-value is useful, we also need to give attention to estimating the magnitude of the difference and to express the uncertainty in such an estimate by using a confidence interval.
The 95% confidence interval for the difference between two
percentages is
Observed difference ±1.96×Standard Error of difference
In the calculation of the confidence interval, the formula
for the standard error of the difference does not assume
the null hypothesis of the two proportions being equal . A
slightly different formula is used for the standard error.
SE (difference in proportions) = $\sqrt{\frac{P_1(1-P_1)}{n_1}+\frac{P_2(1-P_2)}{n_2}}$
In our study, for the smoking difference between men and women, the 95% confidence interval is
$(45.1\% - 23.8\%) \pm 1.96\sqrt{\frac{45.1 \times 54.9}{1256}+\frac{23.8 \times 76.2}{1314}}$
= 17.7% to 24.9%
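A minimal Python sketch of this interval is given below. The function name `difference_ci95` is hypothetical. It uses the standard error that does not assume the null hypothesis, and it works with unrounded proportions, so the limits may differ from the hand calculation in the last decimal place.

```python
import math

def difference_ci95(x1, n1, x2, n2):
    """95% confidence limits for the difference between two proportions."""
    p1, p2 = x1 / n1, x2 / n2
    # Standard error without assuming the two true proportions are equal.
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    diff = p1 - p2
    return diff - 1.96 * se, diff + 1.96 * se

# Men: 566 smokers out of 1256; women: 313 smokers out of 1314.
low, high = difference_ci95(566, 1256, 313, 1314)
print(f"95% CI for the difference: {100 * low:.1f}% to {100 * high:.1f}%")
```

Because the whole interval lies well above 0, it agrees with the significance test: the data are not compatible with equal proportions of smokers in men and women.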
Exercise
Calculate the 95% confidence interval for the difference in the percentage of smokers among married and unmarried individuals
SE = 2.54%
CI = 0.8% ± 1.96 × 2.54% = -4.2% to 5.8%
Note that if such a 95% confidence interval for a difference includes the value 0.0 (i.e., one limit is positive and the other is negative), then P is greater than 0.05.
Conversely, if the 95% confidence interval does not include 0.0, then P is less than 0.05.
This illustrates that there is a close link between
significance testing and confidence intervals.