Categorical data 1: Single proportion and comparison of 2 proportions, Dr. Seyed Ebrahim...


Categorical data 1

Single proportion and comparison of 2 proportions

Dr. Seyed Ebrahim Jabarifar. Date: 1388 / 2010

Associate Professor, Department of Community Dentistry, Isfahan University of Medical Sciences

The objectives of the session

Sampling distribution of simple proportion

Calculation of 95% confidence interval for a proportion

The comparison of two proportions (or percentages)

Statistical test of significance for comparison of two proportions

Calculation of 95% Confidence interval for the difference in two proportions.

Categorical data

• What is categorical data?

• Examples?

Examples of categorical data

Education: primary, secondary, university

Marital status: married, single, divorced, widowed

Cigarette smoking history: never smoker , ex-smoker, current smoker

More examples of categorical data

Endpoint in a study

Person is dead or alive

Person with MI or without MI

Person can rate their own health as very good, good, average, bad or very bad

More examples of categorical data

Quantitative measurements or assessments can be used as categorical data:

Hypertension: yes (for example systolic BP ≥ 160 or diastolic BP ≥ 90 mm Hg) or no

Alcohol consumption: none, light (< 200 ml of ethanol/week), heavy (≥ 200 ml of ethanol/week)

Proportions and percentages

In this session, we will concentrate on the use of binomial data (= data with just two categories)

Example: in a survey, interviews were conducted with 5335 middle-aged women. Of these, 1476 were current smokers while 3859 were not.

Proportion of smokers = 1476 / 5335 = 0.277

Percentage of smokers = 0.277 × 100 = 27.7%

Sampling variability of a proportion

• It is important to take into account the number of subjects included

• The greater the number of subjects, the more reliable our estimates are

• Example: if we want to estimate the proportion of men in a population who smoke cigarettes, a study of 1000 men will be more trustworthy than a study of 10 men

Important assumption

We need to know that the sample of individuals studied has been randomly selected from some population of interest

Sampling distribution of single proportion

Let’s continue with the example of middle-aged women.

Among 5335 women, there were 1476 smokers

If we want to say something about the population which this study sample represents, we need the concept of a sampling distribution.

• Let’s assume that we repeatedly took a sample of 5335 women

• For each sample, we calculate the proportion of smokers and then construct a histogram of these values

• This histogram represents the sampling distribution of the proportion and will take the following shape.

The curve is centred over the value of the proportion of smokers in the population, often referred to as the true proportion and represented by π. Some of the sample proportions will be larger than π, others will be smaller. Many will be close in value to π; a few will be a lot larger or a lot smaller.

In practice we only conduct one survey, from which we have a sample proportion represented by P.

Is P close to π, or is it very different from π?

Only if we are very lucky will P actually be equal to π.

In any random sample, there will be some sampling variation in P.

The larger the sample, the smaller the extent of such sampling variation.

Consider (P - π)² as a measure of variation in P from the true proportion π.

Then it can be shown mathematically that if you took lots of random samples, each of n subjects, then the average value of (P - π)² is equal to

π(1 - π) / n
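The statement about the average value of (P - π)² can be checked numerically. The sketch below is only an illustration: it treats 0.277 as if it were the true proportion π, and the function name, the number of repeated samples (300), and the random seed are assumptions made for the demonstration, not part of the original survey.

```python
import random

def simulate_sample_proportions(true_pi, n, n_samples, seed=1):
    """Draw n_samples random samples of size n; return each sample proportion."""
    rng = random.Random(seed)
    proportions = []
    for _ in range(n_samples):
        # Each subject is a smoker with probability true_pi
        smokers = sum(1 for _ in range(n) if rng.random() < true_pi)
        proportions.append(smokers / n)
    return proportions

# Assumed "population" value and the survey size from the example
true_pi, n = 0.277, 5335
props = simulate_sample_proportions(true_pi, n, n_samples=300)

# Average squared deviation of P from pi across the simulated samples
mean_sq_dev = sum((p - true_pi) ** 2 for p in props) / len(props)
theoretical = true_pi * (1 - true_pi) / n

print(f"simulated average of (P - pi)^2: {mean_sq_dev:.2e}")
print(f"pi(1 - pi)/n:                    {theoretical:.2e}")
```

With only a few hundred simulated samples the two numbers agree roughly rather than exactly; increasing n_samples should bring them closer.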

Variance and standard error of proportion

π(1 - π) / n is the variance of a proportion

√( π(1 - π) / n ) is the standard error of a proportion

It is a measure of the average extent of error in P, that is, how far we can expect the observed proportion to differ from π on average

Example:

π = 0.4:

n = 100, then SE = 0.049

n = 1000, then SE = 0.0155 (SE smaller)

SE does not depend much on π:

n = 1000, π = 0.5: SE = 0.0158
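The three standard errors in this example can be reproduced with a few lines of code. This is a minimal sketch; the function name standard_error is an illustrative choice, not something defined in the slides.

```python
import math

def standard_error(pi, n):
    """Standard error of a proportion: sqrt(pi * (1 - pi) / n)."""
    return math.sqrt(pi * (1 - pi) / n)

for pi, n in [(0.4, 100), (0.4, 1000), (0.5, 1000)]:
    print(f"pi = {pi}, n = {n}: SE = {standard_error(pi, n):.4f}")
```

A tenfold increase in n shrinks the standard error by a factor of about √10, while moving π from 0.4 to 0.5 barely changes it.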

Back to the example: 5335 women, 1476 current smokers

It means that 27.7% of the women in the sample are smokers

The estimated standard error of the proportion of smokers is

√( 0.277 × (1 - 0.277) / 5335 ) = 0.0061

We can also use percentages:

√( 27.7 × (100 - 27.7) / 5335 ) = 0.61%

95% confidence limits for a proportion

We want to get an interval of possible values within which the true population proportion might lie

This can be done using the theoretical properties of the Normal distribution

It can be shown that P will be within 1.96 standard errors of π with probability 0.95

That is, there is just a 2.5% risk that the observed proportion P will exceed the true population proportion π by more than 1.96 standard errors, and another 2.5% risk that P will underestimate π by more than 1.96 standard errors.

95% Confidence limits for a proportion

We use this fact to define a 95% confidence interval:

P - 1.96 × √( P(1 - P) / n ) to P + 1.96 × √( P(1 - P) / n )

Usually written as P ± 1.96 × standard error of P

Back to example

The true population percentage of smokers has the following 95% confidence interval:

27.7% ± 1.96 × √( 27.7 × 72.3 / 5335 ) = 27.7% ± 1.2%

This means that the 95% confidence interval is from 26.5% to 28.9%

These two values are the lower and upper confidence limits, respectively.
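The same interval can be computed directly from the counts. The sketch below assumes the Normal-approximation formula P ± 1.96 × SE used in this session; the function name proportion_ci is an illustrative choice.

```python
import math

def proportion_ci(successes, n, z=1.96):
    """Normal-approximation confidence interval for a single proportion."""
    p = successes / n
    se = math.sqrt(p * (1 - p) / n)  # estimated standard error of P
    return p - z * se, p + z * se

# Survey example: 1476 current smokers among 5335 women
lower, upper = proportion_ci(1476, 5335)
print(f"95% CI: {lower * 100:.1f}% to {upper * 100:.1f}%")
```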

95% confidence interval

95% confidence intervals = the most common statistical technique for displaying the degree of uncertainty that should be attached to any proportion.

There is a 5% risk that the true population proportion lies outside the interval

That is, you can anticipate that one in every 20 confidence intervals you calculate will not include π
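The "one in every 20" statement can be illustrated by simulation. The sketch below is hypothetical throughout: it assumes a true proportion of 0.277, surveys of 1000 subjects, 400 repeated surveys, and a fixed random seed, none of which come from the original study.

```python
import math
import random

def covers_true_proportion(true_pi, n, rng, z=1.96):
    """Simulate one survey and report whether its 95% CI contains true_pi."""
    successes = sum(1 for _ in range(n) if rng.random() < true_pi)
    p = successes / n
    se = math.sqrt(p * (1 - p) / n)
    return p - z * se <= true_pi <= p + z * se

rng = random.Random(2010)                   # fixed seed, arbitrary choice
true_pi, n, n_surveys = 0.277, 1000, 400    # assumed values for illustration
hits = sum(covers_true_proportion(true_pi, n, rng) for _ in range(n_surveys))
coverage = hits / n_surveys
print(f"{hits} of {n_surveys} intervals include the true proportion ({coverage:.1%})")
```

The observed coverage should be close to 95%, though it will not equal it exactly in any finite run.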

Two proportions

Example

Smoking   Men            Women          Total
Yes       566 (45.1%)    313 (23.8%)    879 (34.2%)
No        690            1001           1691
Total     1256           1314           2570

Question

From the table, we want to evaluate how strong the evidence is that men smoke more than women

The null hypothesis

We need to define the null hypothesis

In our case, the null hypothesis is that smoking is as frequent among women as it is among men (same proportion of smokers among men and women)

If the null hypothesis were true, then the whole population would have an identical percentage (%) of smokers.

Alternatively, one can say that if the null hypothesis were true, then for any randomly selected person (man or woman), the probability of being a smoker is the same, independent of the sex of the person selected.

Significance testing for comparing 2 proportions

After defining the null hypothesis, the main question is:

If the null hypothesis is true, what are the chances of getting a difference between the two percentages as big as that observed?

For example, in the Czech study, what is the probability of getting a sex difference in smoking as large as (or larger than) 45% versus 24%?

Observed difference in percentages

= P1 - P2

= 45.1% - 23.8% = 21.3%

The overall percentage response = 879 / 2570 = 34.2%

If the null hypothesis is true, then the only reason that P1 - P2 differs from 0 is sampling variation

Under the null hypothesis we are assuming that the two samples of size n1 = 1256 and n2 = 1314 are random samples of people with equal true probabilities of response.

We need to calculate the standard error of the difference in two percentages, where P is the overall percentage response:

SE of difference = √( P × (100 - P) × (1/n1 + 1/n2) )

= √( 34.2 × 65.8 × (1/1256 + 1/1314) )

= 1.9%

Now we compare the observed difference with the standard error of the difference, simply by dividing one by the other.

Thus, we compute

Z = Observed difference in percentages / Standard Error of difference

= 21.3 / 1.9

= 11.2
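The calculation of Z can be sketched in code, working from the raw counts in the men/women table. The function name two_proportion_z is an illustrative choice. Because the code does not round the percentages or the standard error before dividing, it gives a Z slightly above the 11.2 obtained here from the rounded values 21.3 and 1.9.

```python
import math

def two_proportion_z(x1, n1, x2, n2):
    """Z statistic for comparing two proportions under the null hypothesis.

    Uses the pooled (overall) proportion in the standard error, as in the
    formula sqrt(P * (1 - P) * (1/n1 + 1/n2)).
    """
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se, se

# Men: 566 smokers of 1256; women: 313 smokers of 1314
z, se = two_proportion_z(566, 1256, 313, 1314)
print(f"SE of difference = {se * 100:.1f}%, Z = {z:.1f}")
```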

• How large does Z have to be in order for us to assert that we have strong evidence that the null hypothesis is untrue?

• We need to make use of the fact that the difference between two observed proportions has approximately a Normal distribution, since this enables us to convert any value of Z into a probability P (as we have already learnt in previous sessions)

Z exceeds    With probability
0.674        0.5
1.282        0.2
1.645        0.1
1.960        0.05
2.576        0.01
3.291        0.001

In our example, Z = 11.2 and so the probability P is (substantially) less than 0.001. That means, if the proportion of smokers is the same among men and women, the chance of getting such a big percentage difference in our study is less than 0.001

We therefore have strong evidence that the proportion of smokers in men and women in the defined population is different (and is lower in women). We may also say the difference between the percentages is statistically significant at the 0.1% level.
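The conversion from Z to P can also be done without a printed table. The sketch below assumes the two-sided probability (Z exceeded in either direction, ignoring sign), which is what the tabulated values correspond to; it uses the complementary error function from Python's standard library, and the function name two_sided_p is an illustrative choice.

```python
import math

def two_sided_p(z):
    """Probability that a standard Normal value exceeds |z| in either direction."""
    return math.erfc(abs(z) / math.sqrt(2))

for z in [0.674, 1.282, 1.645, 1.960, 2.576, 3.291, 11.2]:
    print(f"Z = {z:6.3f}  ->  P = {two_sided_p(z):.3g}")
```

For Z = 11.2 the probability is far smaller than 0.001, which is why the tabulated value for 3.291 is enough to report P < 0.001.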

Exercise: we want to know whether smoking depends on marital status

Smoking   Married        Unmarried      Total
Yes       732 (34.1%)    147 (34.9%)    879 (34.2%)
No        1417           274            1691
Total     2149           421            2570

The observed difference in percentages is d = 34.9% - 34.1% = 0.8%

The standard error of the difference (using the formula given above) is SE = 2.53%

Z = d / SE = 0.8 / 2.53 = 0.32

P > 0.5 (Z is smaller than 0.674, the value exceeded with probability 0.5)

95% confidence interval for a difference in two percentages

While giving the actual P-value is useful, we also need to give attention to estimating the magnitude of the difference and express the uncertainty in such an estimate by using a confidence interval.

The 95% confidence interval for the difference between two percentages is

Observed difference ± 1.96 × Standard Error of difference

In the calculation of the confidence interval, the formula for the standard error of the difference does not assume the null hypothesis of the two proportions being equal. A slightly different formula is used for the standard error.

SE (difference in proportions) = √( P1(1 - P1)/n1 + P2(1 - P2)/n2 )

In our study, for the smoking difference between men and women, the 95% confidence interval is

(45.1% - 23.8%) ± 1.96 × √( 45.1 × 54.9 / 1256 + 23.8 × 76.2 / 1314 )

= 17.7% to 24.9%
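A short sketch of this calculation from the raw counts follows; the function name difference_ci is an illustrative choice. Note that the code keeps full precision throughout, so its upper limit comes out near 24.8% rather than the 24.9% obtained from the rounded percentages.

```python
import math

def difference_ci(x1, n1, x2, n2, z=1.96):
    """95% CI for a difference in proportions (unpooled standard error)."""
    p1, p2 = x1 / n1, x2 / n2
    # This SE does not assume the two true proportions are equal
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    d = p1 - p2
    return d - z * se, d + z * se

# Men: 566 smokers of 1256; women: 313 smokers of 1314
lower, upper = difference_ci(566, 1256, 313, 1314)
print(f"95% CI for the difference: {lower * 100:.1f}% to {upper * 100:.1f}%")
```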

Exercise

Calculate the 95% confidence interval for the difference in percentage of smokers among married and unmarried individuals

SE = 2.54%

CI = 0.8% ± 1.96 × 2.54% = -4.2% to 5.8%

Note that if such a 95% confidence interval for a difference includes the value 0.0 (i.e. one limit is positive and the other is negative), then P is greater than 0.05

Conversely, if the 95% confidence interval does not include 0.0, then P is less than 0.05

This illustrates that there is a close link between significance testing and confidence intervals.
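The link can be illustrated with the two comparisons from this session. The sketch below is a hedged illustration: the function name and the dictionary labels are assumptions, the P-value is two-sided and based on the pooled standard error, and the interval uses the unpooled standard error. Because the two standard errors differ slightly, the correspondence is approximate for results very close to P = 0.05.

```python
import math

def z_test_and_ci(x1, n1, x2, n2, z_crit=1.96):
    """Return (two-sided P-value, 95% CI) for a difference in two proportions."""
    p1, p2 = x1 / n1, x2 / n2
    d = p1 - p2
    # Significance test: pooled SE, which assumes the null hypothesis
    pooled = (x1 + x2) / (n1 + n2)
    se_null = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    p_value = math.erfc(abs(d / se_null) / math.sqrt(2))
    # Confidence interval: unpooled SE, no assumption of equality
    se_ci = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return p_value, (d - z_crit * se_ci, d + z_crit * se_ci)

comparisons = {
    "men vs women": (566, 1256, 313, 1314),
    "unmarried vs married": (147, 421, 732, 2149),
}
results = {name: z_test_and_ci(*counts) for name, counts in comparisons.items()}

for name, (p_value, (lower, upper)) in results.items():
    includes_zero = lower <= 0.0 <= upper
    print(f"{name}: P = {p_value:.3g}, CI includes 0: {includes_zero}")
```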