
STAT171 Statistical Data Analysis

(2015)


Topic 9

Inference regarding proportions

(one and two populations)

1. Testing a hypothesis about pi.

2. Confidence interval for pi.

J & B Chapter 8 section 5 (one proportion)

Chapter 10 section 7 (two proportions)


3. Testing a hypothesis about two proportions, pi1 and pi2.

4. Confidence interval for pi1 - pi2.

Notation

Text & Lecture notes:
n = sample size (number of independent Bernoulli trials)
X = count of the number of successes

Lecture notes:
pi = population proportion (a constant)
P = the sample proportion, P = X / n

Text book:
P = population proportion (a constant)
$\hat{P}$ = the sample proportion, $\hat{P} = X / n$

Care is needed due to the different notation!

Testing a single proportion

Example: In past years, 15% of people who insured their car made a claim each year. This year, of a random sample of 400 policies, 76 made a claim. Is there any evidence that the proportion has changed?

Setting up the problem:

X = the number in the sample who made a claim this year

$X \sim B(n, \pi)$, with $x = 0, 1, \ldots, n$

P = the sample proportion who made a claim this year

$P = X / n$, with possible values $p = \frac{0}{n}, \frac{1}{n}, \frac{2}{n}, \ldots, \frac{n}{n}$

We have to assume the policyholders are independent

Distribution results for X

We have: X = number of claims made in the sample this year

X ~ Binomial, with n = 400 and pi = 0.15 IF the claim rate is unchanged:

X ~ B(400, pi)

For $X \sim \mathrm{Bin}(n, \pi)$:

$$E(X) = n\pi, \qquad \mathrm{Var}(X) = n\pi(1-\pi)$$

For n "sufficiently large" (CLT applies), that is, both $n\pi > 15$ and $n(1-\pi) > 15$:

$$X \overset{\text{approx}}{\sim} N\big(n\pi,\ n\pi(1-\pi)\big) \quad \text{or} \quad \frac{X - n\pi}{\sqrt{n\pi(1-\pi)}} \overset{\text{approx}}{\sim} N(0, 1)$$

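Distribution results for P

The sample proportion of claims, P, has a scaled binomial distribution:

P ~ (1/n) * B(400, pi), with pi = 0.15 IF the claim rate is unchanged

For $P = X/n$:

$$E(P) = E\left(\frac{X}{n}\right) = \frac{E(X)}{n} = \frac{n\pi}{n} = \pi$$

$$\mathrm{Var}(P) = \mathrm{Var}\left(\frac{X}{n}\right) = \frac{\mathrm{Var}(X)}{n^2} = \frac{n\pi(1-\pi)}{n^2} = \frac{\pi(1-\pi)}{n}$$

For n "sufficiently large" (CLT applies), that is, both $n\pi > 15$ and $n(1-\pi) > 15$:

$$P \overset{\text{approx}}{\sim} N\left(\pi,\ \frac{\pi(1-\pi)}{n}\right) \quad \text{or} \quad \frac{P - \pi}{\sqrt{\pi(1-\pi)/n}} \overset{\text{approx}}{\sim} N(0, 1)$$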

Example (cont)

Here, we have observed the sample result p = 76/400 = 0.19.

We want to test: given that p is 0.19, do we have evidence that pi is no longer 0.15? (i.e. has the claim rate changed from 0.15?)

Under the assumption that pi is 0.15, we want the probability of getting a sample proportion at least as far from 0.15 as 0.19 is (that is, p ≤ 0.11 or p ≥ 0.19).

This is the same as getting a sample count X ≥ 76 or X ≤ 44, since 0.15*400 = 60:

76 − 60 = 16 (we observed 16 more than we expected)

and 60 − 16 = 44 (the same distance away in the other direction)

We can obtain this probability in two ways:

(1) using the exact binomial:

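$$\text{Prob} = \sum_{x=0}^{44} \binom{400}{x}\, 0.15^{x}\, 0.85^{400-x} + \sum_{x=76}^{400} \binom{400}{x}\, 0.15^{x}\, 0.85^{400-x}$$

(2) using the normal approximation to the binomial:

$$\text{Prob}(P \le 0.11) + \text{Prob}(P \ge 0.19), \quad \text{where } \frac{P - \pi}{\sqrt{\pi(1-\pi)/n}} \overset{\text{approx}}{\sim} N(0, 1)$$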
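The two routes to this probability can be compared numerically. The following is a minimal sketch in Python, using only the standard library; the helper names `binom_pmf` and `normal_cdf` are illustrative and are not part of the lecture notes or of Minitab. The normal version here applies no continuity correction.

```python
from math import comb, erf, sqrt

def binom_pmf(x, n, pi):
    """P(X = x) for X ~ B(n, pi)."""
    return comb(n, x) * pi**x * (1 - pi)**(n - x)

def normal_cdf(z):
    """Standard normal CDF, written with the error function."""
    return 0.5 * (1 + erf(z / sqrt(2)))

n, pi0 = 400, 0.15

# (1) exact binomial: P(X <= 44) + P(X >= 76)
exact = (sum(binom_pmf(x, n, pi0) for x in range(0, 45))
         + sum(binom_pmf(x, n, pi0) for x in range(76, n + 1)))

# (2) normal approximation: P(P <= 0.11) + P(P >= 0.19)
se = sqrt(pi0 * (1 - pi0) / n)
approx = normal_cdf((0.11 - pi0) / se) + (1 - normal_cdf((0.19 - pi0) / se))

print(f"exact binomial tail sum: {exact:.4f}")
print(f"normal approximation:    {approx:.4f}")
```

The exact sum adds the two tails that are equally far from the expected count of 60; other software may define a two-sided exact p-value differently, so small discrepancies are to be expected.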

For the general case

Following the steps as for a test of mu:

H0: pi = pi0 (e.g. H0: pi = 0.15)
H1: pi ≠ pi0 (e.g. H1: pi ≠ 0.15)
alpha = 0.05

CLT assumption check

The text book states that to use the z-test for proportions, we need:

both n*pi0 ≥ 15 and n*(1 − pi0) ≥ 15

We will use this check (as it is the one in the quizzes).

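If H0 is true, set up the test statistic:

if $P \overset{\text{approx}}{\sim} N\left(\pi_0,\ \dfrac{\pi_0(1-\pi_0)}{n}\right)$

then $\dfrac{P - \pi_0}{\sqrt{\pi_0(1-\pi_0)/n}} \overset{\text{approx}}{\sim} N(0, 1)$

Note: the mean and variance are exact; the normality is approximate here.

with observed value

$$z_{obs} = \frac{p - \pi_0}{\sqrt{\pi_0(1-\pi_0)/n}}$$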

Obtain the p-value

This enables us to determine whether z_obs is a believable or not-believable value from the Z distribution.

For HA: pi ≠ pi0, p-value ≈ P(|Z| ≥ |z_obs|)

Make the decision:

p-value ≤ alpha: Reject H0
p-value > alpha: Retain H0

Write a meaningful conclusion

Continuity Correction

We are approximating a discrete (binomial) distribution with a continuous (normal) distribution. Therefore, the continuity correction should be used.

For a two-sided alternative, the corrected test statistic is:

$$z_{obs} = \frac{|p - \pi_0| - \frac{1}{2n}}{\sqrt{\pi_0(1-\pi_0)/n}}$$

See J&B p.254

Subtracting 1/(2n) from |p − pi0| allows finding the area in both tails including the observed sample p.

The larger n is, the less important it is to use the continuity correction.

Note: The text book (J&B) does not use the c.c. in hypothesis testing for proportions (which leads to less accurate approximations to the p-value), and this is also the case for the quizzes.

One tailed tests

The hypothesis test can be one or two-tailed.

If one-tailed where H1: pi > pi0, the test statistic is:

$$z_{obs} = \frac{p - \pi_0 - \frac{1}{2n}}{\sqrt{\pi_0(1-\pi_0)/n}}$$

p-value ≈ P(Z ≥ z_obs)

If one-tailed where H1: pi < pi0, the test statistic is:

$$z_{obs} = \frac{p - \pi_0 + \frac{1}{2n}}{\sqrt{\pi_0(1-\pi_0)/n}}$$

p-value ≈ P(Z ≤ z_obs)

For the example

H0: pi = 0.15
H1: pi ≠ 0.15
alpha = 0.05

Checking the assumption of approximate normality:

n*pi0 = 400*0.15 = 60 ≥ 15

and n*(1 − pi0) = 400*0.85 = 340 ≥ 15, so it is reasonable to assume normality.

The test statistic is

$$Z = \frac{|P - \pi_0| - \frac{1}{2n}}{\sqrt{\pi_0(1-\pi_0)/n}}$$

With observed value

$$z_{obs} = \frac{|0.19 - 0.15| - \frac{1}{800}}{\sqrt{0.15(1-0.15)/400}} = \frac{0.03875}{0.0178536} \approx 2.17$$

p-val = P(|Z| ≥ 2.17) ≈ 0.030

Reject H0 at the 5% level of significance. There is sufficient evidence to conclude the proportion of claimants is different from previous years. The sample proportion of insured claiming this year is significantly greater than 15%.

In the above example, not using the continuity correction gives z_obs = 2.24 with an associated p-value of 0.025. That is, no c.c. will give a smaller p-value than when c.c. is used, so the actual Type I error rate will be higher than specified by the significance level.
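The effect of the continuity correction on this example can be checked with a short calculation. This is a sketch only: the function name `one_prop_z` and its `continuity` flag are hypothetical, chosen for illustration, and the two-sided p-value is taken from the standard normal distribution.

```python
from math import erf, sqrt

def normal_cdf(z):
    """Standard normal CDF, written with the error function."""
    return 0.5 * (1 + erf(z / sqrt(2)))

def one_prop_z(x, n, pi0, continuity=True):
    """Two-sided z-test of H0: pi = pi0, with optional continuity correction."""
    p = x / n
    se = sqrt(pi0 * (1 - pi0) / n)          # standard error under H0
    cc = 1 / (2 * n) if continuity else 0.0  # continuity correction term
    z = (abs(p - pi0) - cc) / se
    p_value = 2 * (1 - normal_cdf(z))
    return z, p_value

z_cc, p_cc = one_prop_z(76, 400, 0.15, continuity=True)
z_raw, p_raw = one_prop_z(76, 400, 0.15, continuity=False)

print(f"with c.c.:    z = {z_cc:.2f}, p-value = {p_cc:.3f}")
print(f"without c.c.: z = {z_raw:.2f}, p-value = {p_raw:.3f}")
```

Under these assumptions the corrected statistic is the smaller of the two, which is why the uncorrected test rejects slightly more readily.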

Confidence interval for pi

[Usually a confidence interval is of the form:

statistic ± z_{alpha/2} * std. error(statistic) ]

Here, it should be:

$$p \pm z_{\alpha/2}\sqrt{\frac{\pi(1-\pi)}{n}}$$

But pi is unknown. We don't even have a hypothesised value!

(Ideally, to get the CI for pi, we have to solve a quadratic.)

So, we use p as our best estimate of pi, and an approx confidence interval for pi is:

$$p \pm z_{\alpha/2}\sqrt{\frac{p(1-p)}{n}}$$

... we use an approximation for the standard error of P instead of the exact standard error, but we still refer to the z-tables, not the t ... theory to be done next year.

Confidence interval CLT check:In the hypothesis test for pi, we used the normal approximation to the binomial, and had to check the validity of this under H0.

We also need to check the validity of using the CLT for the confidence interval, but here we do not have a pi0.

Instead, we check using the sample p.

CLT check: we need np ≥ 15 and n(1−p) ≥ 15, where np = the sample number of successes and n(1−p) = the sample number of fails.

Continuity correction:

When doing confidence intervals for pi, we don't worry about continuity correction. It is pointless trying to improve accuracy when the standard error is only approximated.

For the example

CLT check: np = 76 ≥ 15 and n(1−p) = 324 ≥ 15

We can validly use the normal approximation to the binomial here.

95% confidence interval for pi:

$$0.19 \pm 1.96\sqrt{\frac{0.19(0.81)}{400}} = 0.19 \pm 1.96(0.01962) = 0.19 \pm 0.0385 = (0.1515,\ 0.2285)$$

We are 95% confident that the interval (0.1515, 0.2285) includes the true population proportion pi of claimants this year.
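The interval calculation can be reproduced in a few lines. This is a minimal sketch; the function name `wald_ci` is an illustrative label (this form of interval is commonly called a Wald interval), and z = 1.96 is the usual table value for 95% confidence.

```python
from math import sqrt

def wald_ci(x, n, z=1.96):
    """Approximate CI for pi: p +/- z * sqrt(p(1-p)/n), with p = x/n."""
    p = x / n
    se = sqrt(p * (1 - p) / n)  # estimated standard error, using the sample p
    return p - z * se, p + z * se

lower, upper = wald_ci(76, 400)
print(f"95% CI for pi: ({lower:.4f}, {upper:.4f})")
```

Because the code carries full precision rather than rounding the margin to 0.0385, the last decimal place can differ slightly from the hand calculation.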

Using the CI for pi for testing H0

Even though this interval for pi does not contain 0.15 (the hypothesised proportion for this year), we cannot accurately use it to test the hypothesis H0: pi = 0.15 vs H1: pi ≠ 0.15.

Why is this so?

In evaluating the standard error of P: the hypothesis test uses pi0, but the C.I. uses the sample p.

Under H0: pi = 0.15 we used:

$$se(p) = \sqrt{\frac{\pi_0(1-\pi_0)}{n}} = \sqrt{\frac{0.15(0.85)}{400}} = 0.01785$$

For the C.I. calculations we used:

$$se(p) = \sqrt{\frac{p(1-p)}{n}} = \sqrt{\frac{0.19(0.81)}{400}} = 0.019615$$

However, in most cases, the difference between the two s.e.s will be very small.

Only if pi0 is close to the CI boundaries is there a problem with using the CI to perform the hypothesis test.

Here, the 95% CI for pi was (0.1515, 0.2285) and we were testing H0: pi = 0.15, so it is a bit too close to call in this case (so we would have to do the hypothesis test).

Limits on c.i.s for pi

A two-sided approx confidence interval for pi is:

$$p \pm z_{\alpha/2}\sqrt{\frac{p(1-p)}{n}}$$

(Ideally, to get the CI for pi, we have to solve a quadratic.)

However, pi must be in the interval (0,1) as it is a proportion.

The confidence interval CLT check, np ≥ 15 and n(1−p) ≥ 15, should guarantee the CI will not be outside the interval (0,1).

The 3-sigma CLT check will guarantee the CI for pi is in the interval (0,1), as long as z_{alpha/2} < 3.

One sided c.i.s for pi

For a one-sided CI for pi using the normal approximation, we cannot have a boundary of ±∞; we have boundaries of 0 or 1 for a proportion.

The 100(1−alpha)% CI for pi:

For a < alternative:

$$\left(0,\ p + z_{\alpha}\sqrt{\frac{p(1-p)}{n}}\right)$$

For a > alternative:

$$\left(p - z_{\alpha}\sqrt{\frac{p(1-p)}{n}},\ 1\right)$$

Using Minitab (16):

Under Stat > Basic Stats > 1 Proportion

Under options, click on "use test based on normal distribution" to carry out the z test. (In MTB 17, there is a drop-down panel for this.)

Otherwise, the p-value is calculated using exact binomial probabilities.

Large n: normal approx quite accurate and quicker than many binomial calculations

Small n: normal approx not necessarily accurate and a small number of binomial calculations is quite quick

MTB > POne 400 76;
SUBC> Test 0.15;
SUBC> UseZ.

Resulting Minitab output (Minitab does not use pi notation in the output, it uses p):

Test and CI for One Proportion

Test of p = 0.15 vs p not = 0.15

X    N    Sample p  95% CI                Z-Value  P-Value
76   400  0.190     (0.151555, 0.228445)  2.24     0.025

Using the normal approximation.

MTB > POne 400 76;
SUBC> Test 0.15.

Test and CI for One Proportion

Test of p = 0.15 vs p not = 0.15

                                          Exact
X    N    Sample p  95% CI                P-Value
76   400  0.190     (0.152721, 0.231938)  0.036

Two sample test of proportions

Used if we have two independent samples, where we measure the proportion of something in each.

Example

Children randomly selected from two different schools take the same test.

The number who pass at each school is recorded.

At School1, 40 out of 70 pass the test.

At School2, 45 out of 100 pass the test.

We want to know: is there any difference between the two schools in their overall pass rates?

The (hypothetical) populations of interest are all students who may ever be in either of the schools.

Here we have two independent samples:

School1 p1 = 40/70 ≈ 0.57 n1 = 70

School2 p2 = 45/100 = 0.45 n2 = 100

Based on these samples, we need to decide which scenario we believe:

The proportions estimate the same pi (and the difference between p1 and p2 can be explained by random variation)

or

The proportions estimate two different population proportions pi1 and pi2 (and the difference between p1 and p2 is due to this systematic difference)

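General case for two proportions

Sample1: observe X1 successes from n1 observations, P1 = X1 / n1

Sample2: observe X2 successes from n2 observations, P2 = X2 / n2

We want to test: H0: pi1 = pi2 (= pi) vs H1: pi1 ≠ pi2 at sig level alpha

If n1 and n2 are large enough to apply the CLT:

$$P_1 \overset{\text{approx}}{\sim} N\left(\pi_1,\ \frac{\pi_1(1-\pi_1)}{n_1}\right), \qquad P_2 \overset{\text{approx}}{\sim} N\left(\pi_2,\ \frac{\pi_2(1-\pi_2)}{n_2}\right)$$

If the two samples are independent:

$$P_1 - P_2 \overset{\text{approx}}{\sim} N\left(\pi_1 - \pi_2,\ \frac{\pi_1(1-\pi_1)}{n_1} + \frac{\pi_2(1-\pi_2)}{n_2}\right)$$

If H0 is true, i.e. if pi1 = pi2 = pi:

$$P_1 - P_2 \overset{\text{approx}}{\sim} N\left(\pi - \pi,\ \frac{\pi(1-\pi)}{n_1} + \frac{\pi(1-\pi)}{n_2}\right) = N\left(0,\ \pi(1-\pi)\left(\frac{1}{n_1} + \frac{1}{n_2}\right)\right)$$

Therefore the test statistic is:

$$z_{obs} = \frac{p_1 - p_2}{\sqrt{\pi(1-\pi)\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}$$

But pi is unknown!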

We cannot get the exact standard error of P1 − P2, as we need the (unknown) value of pi to substitute in, so we use the pooled sample proportion to estimate pi.

Use $\hat{\pi}$ = p = weighted average of p1 and p2

= (number of successes in sample 1 + number of successes in sample 2) / total n

So

$$\hat{\pi} = p = \frac{n_1 p_1 + n_2 p_2}{n_1 + n_2} = \frac{x_1 + x_2}{n_1 + n_2}$$

The sample proportions are weighted by the sample sizes.

This then gives an estimated standard error of P1 − P2, and we get the observed test statistic:

$$z_{obs} = \frac{p_1 - p_2}{\sqrt{\hat{\pi}(1-\hat{\pi})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}$$

Obtain the p-value and then reject or retain H0 like any other z-test. As for any test, this can be one or two tailed.

This IS a z-test (even though we have estimated the standard error of (P1 − P2) using the pooled p-hat) ... as we have used binomial distributional properties in this estimation.

CLT check for two proportions

We need approximate normality for both sample proportions under H0, but we don't know the value of pi, so use its estimate, the pooled sample proportion p:

Need: n1*p ≥ 15 and n1*(1−p) ≥ 15
      n2*p ≥ 15 and n2*(1−p) ≥ 15

These are just the expected numbers of successes and failures in the two samples under H0, estimated using the pooled p.

Continuity correction?

There is no need for continuity correction in two sample proportions tests, as you need to add and subtract a correction term (one for p1 and one for p2) and they will approximately cancel.

For the school example

H0: pi1 = pi2
H1: pi1 ≠ pi2
alpha = 0.05

p1 = 40/70 ≈ 0.57 based on n1 = 70
p2 = 45/100 = 0.45 based on n2 = 100

Under H0, the pooled proportion is:

$$p = \hat{\pi} = \frac{40 + 45}{70 + 100} = \frac{85}{170} = \frac{1}{2} = 0.50$$

Checking for approximate normality:

n1*p = 35 ≥ 15 and n1(1−p) = 35 ≥ 15
n2*p = 50 ≥ 15 and n2(1−p) = 50 ≥ 15

so the CLT applies.

$$z_{obs} = \frac{\frac{40}{70} - \frac{45}{100}}{\sqrt{0.5(0.5)\left(\frac{1}{70} + \frac{1}{100}\right)}} = \frac{17/140}{0.07792} = 1.558$$

p-value ≈ P(|Z| ≥ 1.55) ≈ 2*0.061 ≈ 0.121

p-value > 0.05, so retain H0.

There is insufficient evidence, at the 5% significance level, to be able to conclude there is a difference in the pass rate between the two schools.
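The pooled two-sample calculation for the school example can be sketched as follows. The function name `two_prop_z` is hypothetical and used only for illustration; the p-value is two-sided and uses the unrounded z, so it is slightly smaller than the table-based 0.121.

```python
from math import erf, sqrt

def normal_cdf(z):
    """Standard normal CDF, written with the error function."""
    return 0.5 * (1 + erf(z / sqrt(2)))

def two_prop_z(x1, n1, x2, n2):
    """Two-sided z-test of H0: pi1 = pi2 using the pooled sample proportion."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)                        # estimate of the common pi
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))  # pooled standard error
    z = (p1 - p2) / se
    p_value = 2 * (1 - normal_cdf(abs(z)))
    return pooled, se, z, p_value

pooled, se, z, p_value = two_prop_z(40, 70, 45, 100)
print(f"pooled p = {pooled:.2f}, se = {se:.5f}, z = {z:.3f}, p-value = {p_value:.3f}")
```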

Confidence interval for pi1 − pi2

Here, we have no null hypothesis, so we are not assuming that pi1 = pi2.

When doing the hypothesis test, we averaged p1 and p2 to get a pooled p, and used that in our estimate of s.e.(p1 − p2). However, to evaluate the confidence interval, we still need an estimate of

$$se(p_1 - p_2) = \sqrt{\frac{\pi_1(1-\pi_1)}{n_1} + \frac{\pi_2(1-\pi_2)}{n_2}}$$

so we use p1 as an estimate of pi1 and p2 as an estimate of pi2.

So, an approx 100(1−alpha)% C.I. for pi1 − pi2 is:

$$(p_1 - p_2) \pm z_{\alpha/2}\sqrt{\frac{p_1(1-p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2}}$$

Warning: We can't use the confidence interval to accurately carry out the hypothesis test H0: pi1 = pi2, as the standard error of the difference in the two sample proportions is evaluated in different ways:

Under H0, there was only one value of pi to estimate, and we then used the pooled p to estimate the relevant standard error:

$$se(p_1 - p_2) = \sqrt{\hat{\pi}(1-\hat{\pi})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}$$

In the CI calculations, we are not assuming pi1 = pi2, so the relevant standard error is estimated by:

$$se(p_1 - p_2) = \sqrt{\frac{p_1(1-p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2}}$$

For the example

95% C.I. for pi1 − pi2 is:

$$\left(\frac{40}{70} - \frac{45}{100}\right) \pm 1.96\sqrt{\frac{\frac{40}{70}\cdot\frac{30}{70}}{70} + \frac{\frac{45}{100}\cdot\frac{55}{100}}{100}}$$

$$= (0.5714 - 0.45) \pm 1.96(0.077289) = 0.1214 \pm 0.1514 = (-0.030,\ 0.273)$$

We are 95% confident that the above interval includes the true difference between the population proportions.

Note: the standard error used in the hypothesis test calculations was 0.07792;

the standard error used in the ci calculations was 0.07729
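The unpooled interval can be checked the same way. This is a minimal sketch; `two_prop_ci` is an illustrative name, and z = 1.96 is the table value for 95% confidence.

```python
from math import sqrt

def two_prop_ci(x1, n1, x2, n2, z=1.96):
    """Approximate CI for pi1 - pi2 using the unpooled standard error."""
    p1, p2 = x1 / n1, x2 / n2
    # each sample proportion estimates its own population proportion
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    diff = p1 - p2
    return diff - z * se, diff + z * se, se

lower, upper, se_unpooled = two_prop_ci(40, 70, 45, 100)
print(f"se = {se_unpooled:.5f}, 95% CI = ({lower:.3f}, {upper:.3f})")
```

Comparing `se_unpooled` with the pooled standard error from the hypothesis test shows how small the difference between the two estimates is in this example.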

Using Minitab

Under Stat > Basic Stats > 2 Proportions

Note the only option is to use the normal approximation. There is no exact binomial test.

"Use pooled estimate of p for test" must be ticked, or the CI (unpooled p) estimate of the standard error is used in the hypothesis test.

Output (pooled option)

MTB > PTwo 70 40 100 45;
SUBC> Pooled.

Test and CI for Two Proportions

Sample  X   N    Sample p
1       40  70   0.571429
2       45  100  0.450000

Difference = p(1) - p(2)
Estimate for difference: 0.121429
95% CI for diff: (-0.0300545, 0.272912)
Test for difference = 0 (vs not = 0): Z = 1.56 P-Value = 0.119

Fisher's exact test: P-Value = 0.161 (ignore this until next year)

Output (unpooled option)

MTB > PTwo 70 40 100 45.

Test and CI for Two Proportions

Sample  X   N    Sample p
1       40  70   0.571429
2       45  100  0.450000

Estimate for difference: 0.121429
95% CI for diff: (-0.0300545, 0.272912)
Test for difference = 0 (vs not = 0): Z = 1.57 P-Value = 0.116

Results: same CI; only a small difference in z (1.56 vs 1.57) and p-values (0.119 vs 0.116)

Topic 9. Appendix A

Insurance claims example: In past years, 15% of the policy holders have made an insurance claim per year.

This year, of a random sample of 400 policies, 76 have made a claim.

Is there any evidence that the proportion has increased by a factor of more than 1.2 times?

Sample estimate of pi is p = 76/400 = 0.19

Ratio of proportions is

$$\frac{\text{sample proportion}}{\text{past proportion}} = \frac{0.19}{0.15} = 1.267 \quad (\text{bigger than } 1.2)$$

The results of previous inference were:

H0: pi = 0.15 vs H1: pi ≠ 0.15 was rejected with p-val = 0.030

The 95% CI for pi is (0.1515, 0.2285)

We found that there is evidence that the true proportion this year is higher than 15%

BUT we have not yet answered the question about whether the proportion has increased by a factor of more than 1.2 times!

We can approach this a couple of ways:

(1) Hypothesis test

H0: pi = 0.15 * 1.2 = 0.18
H1: pi > 0.18
alpha = 0.05

CLT check: n*pi0 = 400 * 0.18 = 72 ≥ 15 and n(1 − pi0) = 400 * 0.82 = 328 ≥ 15

$$z_{obs} = \frac{p - \pi_0 - \frac{1}{2n}}{\sqrt{\pi_0(1-\pi_0)/n}} = \frac{\frac{76}{400} - 0.18 - \frac{1}{800}}{\sqrt{0.18(0.82)/400}} = \frac{0.00875}{0.0192094} = 0.46$$

p-value = P(Z ≥ 0.46) ≈ 0.3228

Insufficient evidence (at alpha = 5%) to conclude the proportion has increased by a factor of more than 1.2.
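The one-sided test with continuity correction can be sketched in code as well. The function name `one_prop_z_greater` is hypothetical. Note that the code uses the unrounded z, so its p-value (roughly 0.32) differs slightly from the table lookup at z = 0.46.

```python
from math import erf, sqrt

def normal_cdf(z):
    """Standard normal CDF, written with the error function."""
    return 0.5 * (1 + erf(z / sqrt(2)))

def one_prop_z_greater(x, n, pi0):
    """One-sided z-test of H0: pi = pi0 vs H1: pi > pi0, with continuity correction."""
    p = x / n
    se = sqrt(pi0 * (1 - pi0) / n)   # standard error under H0
    z = (p - pi0 - 1 / (2 * n)) / se  # subtract 1/(2n) for an upper-tail test
    return z, 1 - normal_cdf(z)

z_obs, p_value = one_prop_z_greater(76, 400, 0.15 * 1.2)
print(f"z = {z_obs:.2f}, p-value = {p_value:.3f}")
```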

(2) Confidence interval (one sided) (not strictly equivalent)

95% CI lower bound for pi:

$$p - z_{0.05}\sqrt{\frac{p(1-p)}{n}} = \frac{76}{400} - 1.645\sqrt{\frac{\frac{76}{400}\cdot\frac{324}{400}}{400}} = 0.19 - 0.032267 = 0.1577$$

CLT check (sample numbers): np = 76 ≥ 15 and n(1−p) = 324 ≥ 15

Max value for a proportion is 1, so the upper limit cannot be +∞.

We are 95% certain the interval (0.1577, 1) contains the true proportion. Because 0.18 is in the interval (and is not close to the boundary), there is insufficient evidence to be able to claim that the proportion has increased by a factor of more than 1.2.

To claim the proportion has increased by a factor of more than 1.2, the CI for the new pi would have to be completely ABOVE 0.18.

(3) Confidence interval for the ratio

The claim is that the ratio of the proportions is more than 1.2, i.e. that pi / 0.15 > 1.2

The appropriate 95% one-sided CI for the new pi was found to be (0.1577, 1). So, the appropriate 95% one-sided CI for the ratio pi / 0.15 is

$$\left(\frac{0.1577}{0.15},\ \frac{1}{0.15}\right) = (1.051,\ 6.667)$$

Because 1.2 is in the interval (and is not close to the boundary), there is insufficient evidence to be able to claim that the ratio of the proportions is more than 1.2.

To claim the ratio of proportions is more than 1.2, the CI would have to be completely ABOVE 1.2.
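The one-sided lower bound and the resulting interval for the ratio can be reproduced as below. This is a sketch under the same assumptions as the hand calculation: `lower_bound` is an illustrative name, z = 1.645 is the table value for a one-sided 95% bound, and the past proportion 0.15 is treated as a known constant.

```python
from math import sqrt

def lower_bound(x, n, z=1.645):
    """One-sided lower confidence bound for pi: p - z * sqrt(p(1-p)/n)."""
    p = x / n
    return p - z * sqrt(p * (1 - p) / n)

past = 0.15
lb = lower_bound(76, 400)

# The upper limit for a proportion is 1, so the interval for pi is (lb, 1).
# Dividing both ends by the past proportion gives the interval for the ratio.
ratio_ci = (lb / past, 1 / past)

print(f"one-sided 95% CI for pi:      ({lb:.4f}, 1)")
print(f"one-sided 95% CI for pi/0.15: ({ratio_ci[0]:.3f}, {ratio_ci[1]:.3f})")
print("1.2 inside ratio interval:", ratio_ci[0] < 1.2 < ratio_ci[1])
```

The lower end of the ratio interval may differ in the third decimal from the hand value, which divides the already rounded bound 0.1577 by 0.15.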

Summary: One Sample Proportions Test & C.I.

Hypothesis test:
H0: pi = pi0 versus H1: pi ≠ pi0
Sample: P = X/n where X ~ Bin(n, pi)
CLT check: n*pi0 ≥ 15 and n(1 − pi0) ≥ 15

$$z_{obs} = \frac{|p - \pi_0| - \frac{1}{2n}}{\sqrt{\pi_0(1-\pi_0)/n}}$$

Confidence Interval:
CLT check: np ≥ 15 and n(1−p) ≥ 15
An approximate 100(1−alpha)% CI for pi is

$$p \pm z_{\alpha/2}\sqrt{\frac{p(1-p)}{n}}$$

One-sided CIs for pi must have as their limit 0 or 1 (not ±∞)

Summary: Two Sample Proportions Test

Hypothesis test:
H0: pi1 = pi2 versus H1: pi1 ≠ pi2
Sample: P1 = X1/n1 and P2 = X2/n2

Pooled estimate of pi is

$$P = \frac{X_1 + X_2}{n_1 + n_2}$$

CLT check:
n1*p ≥ 15 and n2*p ≥ 15
n1*(1−p) ≥ 15 and n2*(1−p) ≥ 15

$$z_{obs} = \frac{p_1 - p_2}{\sqrt{p(1-p)\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}$$

Summary: Two Sample Proportions C.I.

Confidence Interval:
Sample: P1 = X1/n1 and P2 = X2/n2

CLT check (simply uses observed counts):

n1*p1 ≥ 15 and n2*p2 ≥ 15
n1*(1−p1) ≥ 15 and n2*(1−p2) ≥ 15

An approximate 100(1−alpha)% CI for (pi1 − pi2) is

$$(p_1 - p_2) \pm z_{\alpha/2}\sqrt{\frac{p_1(1-p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2}}$$

One-sided CIs for (pi1 − pi2) must have as their limit −1 or +1 (not ±∞)