STAT171 Statistical Data Analysis
(2015)
Topic 9
Inference regarding proportions
(one and two populations)
1. Testing a hypothesis about pi. (J & B 8.5)
2. Confidence interval for pi. (J & B 8.5)
3. Testing a hypothesis about two proportions, pi1 and pi2. (J & B 10.7)
4. Confidence interval for pi1 - pi2. (J & B 10.7)
Reading: J & B Chapter 8 section 5 (one proportion); Chapter 10 section 7 (two proportions)
Notation
Text & Lecture notes:
n = sample size (number of independent Bernoulli trials)
X = count of the number of successes
Lecture notes:
pi = population proportion (a constant)
P = the sample proportion, P = X / n
Text book:
P = population proportion (a constant)
$\hat{P}$ = the sample proportion, $\hat{P}$ = X / n
Care is needed due to the different notation!
Testing a single proportion
Example: In past years, 15% of people who insured their car made a claim each year. This year, of a random sample of 400 policies, 76 made a claim. Is there any evidence that the proportion has changed?
Setting up the problem:
X = the number in the sample who made a claim this year
X ~ B(n, pi)   (x = 0, 1, ..., n)
P = the sample proportion who made a claim this year
P = X / n   (p = 0/n, 1/n, 2/n, ..., n/n)
We have to assume the policyholders are independent.
Distribution results for X
We have: X = number of claims made in the sample this year
X ~ B(400, pi), with pi = 0.15 IF the claim rate is unchanged.
For $X \sim \mathrm{Bin}(n, \pi)$:
\[ E(X) = n\pi, \qquad \mathrm{Var}(X) = n\pi(1-\pi) \]
For n "sufficiently large" (both $n\pi > 15$ and $n(1-\pi) > 15$), the CLT applies:
\[ X \overset{\text{approx}}{\sim} N\big(n\pi,\; n\pi(1-\pi)\big) \quad\text{or}\quad \frac{X - n\pi}{\sqrt{n\pi(1-\pi)}} \overset{\text{approx}}{\sim} N(0,1) \]
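Distribution results for P
The sample proportion of claims, P, has a scaled binomial distribution:
P ~ (1/n) * B(400, pi), with pi = 0.15 IF the claim rate is unchanged.
For $P = X/n$:
\[ E(P) = E\!\left(\frac{X}{n}\right) = \frac{E(X)}{n} = \frac{n\pi}{n} = \pi \]
\[ \mathrm{Var}(P) = \mathrm{Var}\!\left(\frac{X}{n}\right) = \frac{\mathrm{Var}(X)}{n^2} = \frac{n\pi(1-\pi)}{n^2} = \frac{\pi(1-\pi)}{n} \]
For n "sufficiently large" (both $n\pi > 15$ and $n(1-\pi) > 15$), the CLT applies:
\[ P \overset{\text{approx}}{\sim} N\!\left(\pi,\; \frac{\pi(1-\pi)}{n}\right) \quad\text{or}\quad \frac{P - \pi}{\sqrt{\pi(1-\pi)/n}} \overset{\text{approx}}{\sim} N(0,1) \]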
Example (cont)
Here, we have observed the sample result p = 76/400 = 0.19.
We want to test: given that p is 0.19, do we have evidence that pi is no longer 0.15? (i.e. has the claim rate changed from 0.15?)
Under the assumption that pi is 0.15, we want the probability of getting a sample proportion at least as far away from 0.15 as 0.19 is (that is, p ≤ 0.11 or p ≥ 0.19).
This is the same as getting a sample count X ≥ 76 or X ≤ 44, since 0.15*400 = 60:
76 - 60 = 16 (we observed 16 more than we expected)
and 60 - 16 = 44 (the same distance away in the other direction)
We can obtain this probability in two ways:
(1) Using the exact binomial:
\[ \text{Prob} = \sum_{x=0}^{44} \binom{400}{x}\, 0.15^{x}\, 0.85^{400-x} \;+\; \sum_{x=76}^{400} \binom{400}{x}\, 0.15^{x}\, 0.85^{400-x} \]
(2) Using the normal approximation to the binomial:
\[ \text{Prob} \approx P(P \le 0.11) + P(P \ge 0.19), \quad\text{where}\quad \frac{P - \pi}{\sqrt{\pi(1-\pi)/n}} \overset{\text{approx}}{\sim} N(0,1) \]
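The two routes above can be compared numerically. The following is a minimal sketch using only the Python standard library; the helper name `binom_pmf` and the variable names are illustrative assumptions, not part of the lecture notes, and the normal approximation here uses no continuity correction.

```python
from math import comb, sqrt
from statistics import NormalDist

n, pi0 = 400, 0.15      # sample size and claim rate under H0 (from the example)
lo, hi = 44, 76         # counts at least 16 away from the expected count of 60

def binom_pmf(x, n, p):
    """P(X = x) for X ~ B(n, p) (hypothetical helper)."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

# (1) exact binomial: P(X <= 44) + P(X >= 76)
exact = (sum(binom_pmf(x, n, pi0) for x in range(0, lo + 1))
         + sum(binom_pmf(x, n, pi0) for x in range(hi, n + 1)))

# (2) normal approximation for P: P(P <= 0.11) + P(P >= 0.19)
se0 = sqrt(pi0 * (1 - pi0) / n)
Z = NormalDist()
approx = Z.cdf((lo / n - pi0) / se0) + (1 - Z.cdf((hi / n - pi0) / se0))

print(f"exact = {exact:.4f}, normal approx = {approx:.4f}")
```

For a sample this large the two answers should be of the same order; the exact sum needs 446 binomial terms, which is why the approximation was attractive for hand calculation.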
For the general case
Following the same steps as for a test of a mean (mu):
H0: pi = pi0        e.g. H0: pi = 0.15
H1: pi ≠ pi0             H1: pi ≠ 0.15
alpha = 0.05
CLT assumption check
The text book states that to use the z-test for proportions, we need:
both n*pi0 ≥ 15 and n(1-pi0) ≥ 15
We will use this check (as it is the one in the quizzes).
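If H0 is true, set up the test statistic:
\[ \text{if}\quad P \overset{\text{approx}}{\sim} N\!\left(\pi_0,\; \frac{\pi_0(1-\pi_0)}{n}\right) \quad\text{then}\quad \frac{P - \pi_0}{\sqrt{\pi_0(1-\pi_0)/n}} \overset{\text{approx}}{\sim} N(0,1) \]
Note: the mean and variance are exact, the normality is approximate here.
With observed value
\[ z_{obs} = \frac{p - \pi_0}{\sqrt{\pi_0(1-\pi_0)/n}} \]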
Obtain the p-value
This enables us to determine whether z_obs is a believable or not-believable value from the Z distribution.
For HA: pi ≠ pi0, p-value = P( |Z| ≥ |z_obs| )
Make the decision:
p-value ≤ alpha ⇒ Reject H0
p-value > alpha ⇒ Retain H0
Write a meaningful conclusion
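The steps above (CLT check, test statistic, p-value, decision) can be sketched as a small function. This is an illustrative sketch only: the function name `one_prop_ztest` and its return format are assumptions, and it implements the uncorrected two-sided z-test.

```python
from math import sqrt
from statistics import NormalDist

def one_prop_ztest(x, n, pi0, alpha=0.05):
    """Two-sided z-test of H0: pi = pi0, without continuity correction."""
    # CLT check used in these notes: n*pi0 >= 15 and n*(1 - pi0) >= 15
    if n * pi0 < 15 or n * (1 - pi0) < 15:
        raise ValueError("CLT check fails; the normal approximation is doubtful")
    p = x / n
    z_obs = (p - pi0) / sqrt(pi0 * (1 - pi0) / n)
    p_value = 2 * (1 - NormalDist().cdf(abs(z_obs)))   # P(|Z| >= |z_obs|)
    decision = "reject H0" if p_value <= alpha else "retain H0"
    return z_obs, p_value, decision

# Insurance example: 76 claims out of 400 policies, H0: pi = 0.15
z_obs, p_value, decision = one_prop_ztest(76, 400, 0.15)
print(round(z_obs, 2), round(p_value, 3), decision)
```

The function stops with an error when the CLT check fails rather than returning a p-value, since the approximation cannot be trusted in that case.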
Continuity Correction
We are approximating a discrete (binomial) distribution with a continuous (normal) distribution. Therefore, the continuity correction should be used.
For a two-sided alternative, the corrected test statistic is:
\[ z_{obs} = \frac{|p - \pi_0| - \frac{1}{2n}}{\sqrt{\pi_0(1-\pi_0)/n}} \]
See J&B p.254
The correction allows finding the area in both tails including the observed sample p.
The larger n is, the less important it is to use the continuity correction.
Note: The text book (J&B) does not use the c.c. in hypothesis testing for proportions (which leads to less accurate approximations to the p-value) and this is also the case for the quizzes.
One tailed tests
The hypothesis test can be one or two-tailed.
If one-tailed where H1: pi > pi0, the test statistic is:
\[ z_{obs} = \frac{p - \frac{1}{2n} - \pi_0}{\sqrt{\pi_0(1-\pi_0)/n}} \]
p-value = P(Z ≥ z_obs)
If one-tailed where H1: pi < pi0, the test statistic is:
\[ z_{obs} = \frac{p + \frac{1}{2n} - \pi_0}{\sqrt{\pi_0(1-\pi_0)/n}} \]
p-value = P(Z ≤ z_obs)
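The three continuity-corrected statistics (two-sided, H1: pi > pi0, H1: pi < pi0) can be collected in one function. This is a sketch under stated assumptions: the function name `z_cc` and the `alternative` labels are hypothetical, and the two-sided p-value is capped at 1 for the edge case where |p - pi0| is smaller than the correction.

```python
from math import sqrt
from statistics import NormalDist

def z_cc(x, n, pi0, alternative="two-sided"):
    """Continuity-corrected z statistic and p-value for H0: pi = pi0."""
    p = x / n
    se0 = sqrt(pi0 * (1 - pi0) / n)   # standard error evaluated under H0
    cc = 1 / (2 * n)                  # continuity correction term
    Z = NormalDist()
    if alternative == "greater":      # H1: pi > pi0
        z = (p - cc - pi0) / se0
        return z, 1 - Z.cdf(z)
    if alternative == "less":         # H1: pi < pi0
        z = (p + cc - pi0) / se0
        return z, Z.cdf(z)
    z = (abs(p - pi0) - cc) / se0     # H1: pi != pi0
    return z, min(1.0, 2 * (1 - Z.cdf(z)))

# Insurance data, two-sided test of H0: pi = 0.15
z_two, p_two = z_cc(76, 400, 0.15)
print(round(z_two, 2), round(p_two, 3))
```

Because the correction is 1/(2n), its effect on z shrinks as n grows, which matches the remark that the correction matters less for large samples.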
For the example
H0: pi = 0.15
H1: pi ≠ 0.15
alpha = 0.05
Checking the assumption of approximate normality:
n*pi0 = 400*0.15 = 60 ≥ 15
and n*(1-pi0) = 400*0.85 = 340 ≥ 15
⇒ reasonable to assume normality
The test statistic is
\[ Z = \frac{|P - \pi_0| - \frac{1}{2n}}{\sqrt{\pi_0(1-\pi_0)/n}} \]
With observed value
\[ z_{obs} = \frac{|0.19 - 0.15| - \frac{1}{800}}{\sqrt{0.15(1-0.15)/400}} = \frac{0.03875}{0.0178536} = 2.17 \]
p-value = P( |Z| ≥ 2.17 ) ≈ 0.030
Reject H0 at the 5% level of significance.
There is sufficient evidence to conclude the proportion of claimants is different from previous years. The sample proportion of insured claiming this year is significantly greater than 15%.
In the above example, not using the continuity correction gives z_obs = 2.24 with an associated p-value of 0.025.
That is, no c.c. will give a smaller p-value than when c.c. is used ⇒ the actual Type I error rate will be higher than specified by the significance level.
Confidence interval for pi
[Usually a confidence interval is of the form:
statistic ± z_{alpha/2} * std. error(statistic) ]
Here, it should be:
\[ p \pm z_{\alpha/2}\sqrt{\frac{\pi(1-\pi)}{n}} \]
But pi is unknown.
We don't even have a hypothesised value!
Ideally, to get the CI for pi, we have to solve a quadratic.
So, we use p as our best estimate of pi ⇒ an approx confidence interval for pi is:
\[ p \pm z_{\alpha/2}\sqrt{\frac{p(1-p)}{n}} \]
... we use an approximation for the standard error of P instead of the exact standard error, but we still refer to the z-tables, not the t ... theory to be done next year.
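For the curious, the quadratic mentioned above can be solved directly. The sketch below sets (p - pi)^2 = z^2 * pi(1 - pi)/n and solves for pi; this is not the interval used in these notes (it is usually called the score, or Wilson, interval, a name that does not appear in the lecture material), and the function name `score_interval` is an assumption.

```python
from math import sqrt
from statistics import NormalDist

def score_interval(x, n, conf=0.95):
    """Solve (p - pi)^2 = z^2 * pi * (1 - pi) / n for pi."""
    p = x / n
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    # Rearranged into a*pi^2 + b*pi + c = 0
    a = 1 + z**2 / n
    b = -(2 * p + z**2 / n)
    c = p**2
    disc = sqrt(b**2 - 4 * a * c)
    return (-b - disc) / (2 * a), (-b + disc) / (2 * a)

# Insurance example: 76 claims out of 400 policies
lo, hi = score_interval(76, 400)
print(round(lo, 4), round(hi, 4))
```

The two roots are the values of pi whose exact standard error puts the observed p exactly z standard errors away, so no estimate of the standard error is needed.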
Confidence interval CLT check:
In the hypothesis test for pi, we used the normal approximation to the binomial, and had to check the validity of this under H0.
We also need to check the validity of using the CLT for the confidence interval, but here we do not have a pi0.
Instead, we check using the sample p:
CLT check: we need np ≥ 15 and n(1-p) ≥ 15
where np = the sample number of successes and n(1-p) = the sample number of fails
Continuity correction:
When doing confidence intervals for pi, we don't worry about continuity correction. It is pointless trying to improve accuracy when the standard error is only approximated.
For the example
CLT check: np = 76 ≥ 15 and n(1-p) = 324 ≥ 15
We can validly use the normal approximation to the binomial here.
95% confidence interval for pi:
\[ 0.19 \pm 1.96\sqrt{\frac{0.19(0.81)}{400}} = 0.19 \pm 1.96(0.019615) = 0.19 \pm 0.0385 = (0.1515,\ 0.2285) \]
We are 95% confident that the interval (0.1515, 0.2285) includes the true population proportion pi of claimants this year.
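The interval calculation above can be reproduced with a short sketch. The function name `approx_ci` is an assumption for illustration; it uses the sample p in both the CLT check and the standard error, as described in the notes, and takes the z multiplier from the standard normal quantile rather than rounding it to 1.96.

```python
from math import sqrt
from statistics import NormalDist

def approx_ci(x, n, conf=0.95):
    """Approximate two-sided CI for pi: p +/- z * sqrt(p(1-p)/n)."""
    p = x / n
    # CLT check on the sample counts of successes and failures
    if n * p < 15 or n * (1 - p) < 15:
        raise ValueError("CLT check fails; the normal approximation is doubtful")
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)   # about 1.96 for 95%
    half_width = z * sqrt(p * (1 - p) / n)
    return p - half_width, p + half_width

lower, upper = approx_ci(76, 400)
print(round(lower, 4), round(upper, 4))
```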
Using the CI for pi for testing H0
Even though this interval for pi does not contain 0.15 (the hypothesised proportion for this year), we cannot accurately use it to test the hypothesis H0: pi = 0.15 vs H1: pi ≠ 0.15.
Why is this so?
In evaluating the standard error of P: the hypothesis test uses pi0, but the C.I. uses the sample p.
Under H0: pi = 0.15 we used:
\[ se(p) = \sqrt{\frac{\pi_0(1-\pi_0)}{n}} = \sqrt{\frac{0.15(0.85)}{400}} = 0.01785 \]
For the C.I. calculations we used:
\[ se(p) = \sqrt{\frac{p(1-p)}{n}} = \sqrt{\frac{0.19(0.81)}{400}} = 0.019615 \]
However, in most cases, the difference between the two s.e.s will be very small.
Only if pi0 is close to the CI boundaries is there a problem with using the CI to perform the hypothesis test.
Here, the 95% CI for pi was (0.1515, 0.2285) and we were testing H0: pi = 0.15, so it is a bit too close to call in this case (so we would have to do the hypothesis test).
Limits on c.i.s for pi
A two-sided approx confidence interval for pi is:
\[ p \pm z_{\alpha/2}\sqrt{\frac{p(1-p)}{n}} \]
However, pi must be in the interval (0,1) as it is a proportion.
Ideally, to get the CI for pi, we have to solve a quadratic.
The confidence interval CLT check
np ≥ 15 and n(1-p) ≥ 15
should guarantee the CI will not be outside the interval (0,1).
This CLT check will guarantee the CI for pi is in the interval (0,1), as long as z_{alpha/2} < 3.
One sided c.i.s for pi
For a one-sided CI for pi using the normal approximation, we cannot have a boundary of ±∞; we have boundaries of 0 or 1 for a proportion.
The 100(1-alpha)% CI for pi:
For a < alternative:
\[ \left(0,\; p + z_{\alpha}\sqrt{\frac{p(1-p)}{n}}\right) \]
For a > alternative:
\[ \left(p - z_{\alpha}\sqrt{\frac{p(1-p)}{n}},\; 1\right) \]
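The two one-sided intervals can be sketched as follows. The function name `one_sided_ci` and the `alternative` labels are assumptions for illustration; note that z_alpha (not z_{alpha/2}) is used, and the open end of the interval is fixed at 0 or 1.

```python
from math import sqrt
from statistics import NormalDist

def one_sided_ci(x, n, alternative, conf=0.95):
    """One-sided approximate CI for pi, bounded by 0 or 1 instead of infinity."""
    p = x / n
    z = NormalDist().inv_cdf(conf)            # z_alpha, about 1.645 for 95%
    margin = z * sqrt(p * (1 - p) / n)
    if alternative == "greater":              # for a > alternative: lower bound only
        return max(0.0, p - margin), 1.0
    if alternative == "less":                 # for a < alternative: upper bound only
        return 0.0, min(1.0, p + margin)
    raise ValueError("alternative must be 'greater' or 'less'")

# Insurance data: 95% lower bound for pi
lower, upper = one_sided_ci(76, 400, "greater")
print(round(lower, 4), upper)
```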
Using Minitab (16):
Under Stat > Basic Stats > 1 Proportion
Under options, click on "use test based on normal distribution" to carry out the z test.
Otherwise, the p-value is calculated using exact binomial probabilities.
In MTB 17, there is a drop-down panel for this.
Large n ⇒ normal approx quite accurate and quicker than many binomial calculations
Small n ⇒ normal approx not necessarily accurate and a small number of binomial calculations is quite quick
Resulting Minitab output
(Minitab does not use pi notation in the output, it uses p)

MTB > POne 400 76;
SUBC> Test 0.15;
SUBC> UseZ.

Test and CI for One Proportion
Test of p = 0.15 vs p not = 0.15
X   N    Sample_p  95%CI                 Z-Val  P-Val
76  400  0.190     (0.151555,0.228445)   2.24   0.025
Using the normal approximation.

MTB > POne 400 76;
SUBC> Test 0.15.

Test and CI for One Proportion
Test of p = 0.15 vs p not = 0.15
                                         Exact
X   N    Sample p  95% CI                P-Value
76  400  0.190     (0.152721,0.231938)   0.036
Two sample test of proportions
Used if we have two independent samples, where we measure the proportion of something in each.
Example
Children randomly selected from two different schools take the same test.
The number who pass at each school is recorded.
At School1, 40 out of 70 pass the test.
At School2, 45 out of 100 pass the test.
We want to know: is there any difference between the two schools in their overall pass rates?
The (hypothetical) populations of interest are all students who may ever be in either of the schools.
Here we have two independent samples:
School1   p1 = 40/70 ≈ 0.57    n1 = 70
School2   p2 = 45/100 = 0.45   n2 = 100
Based on these samples, we need to decide which scenario we believe:
The proportions estimate the same pi (and the difference between p1 and p2 can be explained by random variation)
or
The proportions estimate two different population proportions pi1 and pi2 (and the difference between p1 and p2 is due to this systematic difference)
General case for two proportions
Sample1: observe X1 successes from n1 observations, P1 = X1 / n1
Sample2: observe X2 successes from n2 observations, P2 = X2 / n2
We want to test: H0: pi1 = pi2 (= pi)
H1: pi1 ≠ pi2 at sig level alpha
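If n1 and n2 are large enough to apply the CLT:
\[ P_1 \overset{\text{approx}}{\sim} N\!\left(\pi_1,\; \frac{\pi_1(1-\pi_1)}{n_1}\right) \qquad P_2 \overset{\text{approx}}{\sim} N\!\left(\pi_2,\; \frac{\pi_2(1-\pi_2)}{n_2}\right) \]
If the two samples are independent:
\[ P_1 - P_2 \overset{\text{approx}}{\sim} N\!\left(\pi_1 - \pi_2,\; \frac{\pi_1(1-\pi_1)}{n_1} + \frac{\pi_2(1-\pi_2)}{n_2}\right) \]
If H0 is true, i.e. if pi1 = pi2 = pi:
\[ P_1 - P_2 \overset{\text{approx}}{\sim} N\!\left(\pi - \pi,\; \frac{\pi(1-\pi)}{n_1} + \frac{\pi(1-\pi)}{n_2}\right) \]
Therefore the test statistic is:
\[ \frac{P_1 - P_2}{\sqrt{\pi(1-\pi)\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}} \overset{\text{approx}}{\sim} N(0,1) \]
with observed value
\[ z_{obs} = \frac{p_1 - p_2}{\sqrt{\pi(1-\pi)\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}} \]
But pi is unknown !!!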
We cannot get the exact standard error of P1 - P2, as we need the (unknown) value of pi to substitute in ⇒ use the pooled sample proportion to estimate pi.
Use $\hat{\pi}$ = p = weighted average of p1 and p2
= (number of successes in sample1 + number of successes in sample2) / total n
So
\[ \hat{\pi} = p = \frac{n_1 p_1 + n_2 p_2}{n_1 + n_2} = \frac{x_1 + x_2}{n_1 + n_2} \]
The sample proportions are weighted by the sample sizes.
This then gives an estimated standard error of P1 - P2, and we get the observed test statistic:
\[ z_{obs} = \frac{p_1 - p_2}{\sqrt{\hat{\pi}(1-\hat{\pi})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}} \]
Obtain p-value and then reject or retain H0 like any other z-test.
As for any test, this can be one or two tailed.
This IS a z-test (even though we have estimated the standard error of (P1 - P2) using the pooled p-hat) ... as we have used binomial distributional properties in this estimation.
CLT check for two proportions
We need approximate normality for both sample proportions under H0, but we don't know the value of pi, so use its estimate, the pooled sample proportion p:
Need: n1*p ≥ 15 and n1(1-p) ≥ 15
      n2*p ≥ 15 and n2(1-p) ≥ 15
These are the expected numbers of successes and failures in the two samples under H0.
Continuity correction?
There is no need for continuity correction in two sample proportions tests, as you need to add and subtract a correction term (one for p1 and one for p2) and they will approximately cancel.
For the school example
H0: pi1 = pi2
H1: pi1 ≠ pi2
alpha = 0.05
p1 = 40/70 ≈ 0.57 based on n1 = 70
p2 = 45/100 = 0.45 based on n2 = 100
Under H0, the pooled proportion is:
\[ p = \hat{\pi} = \frac{40 + 45}{70 + 100} = \frac{85}{170} = \frac{1}{2} = 0.50 \]
Checking for approximate normality:
n1*p = 35 ≥ 15 and n1(1-p) = 35 ≥ 15
n2*p = 50 ≥ 15 and n2(1-p) = 50 ≥ 15
⇒ CLT applies
\[ z_{obs} = \frac{\frac{40}{70} - \frac{45}{100}}{\sqrt{0.5(0.5)\left(\frac{1}{70} + \frac{1}{100}\right)}} = \frac{17/140}{0.07792} = 1.558 \]
p-value = P( |Z| ≥ 1.55 ) = 2*0.061 ≈ 0.121
p-value > 0.05 ⇒ retain H0
There is insufficient evidence, at the 5% significance level, to be able to conclude there is a difference in the pass rate between the two schools.
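The school calculation can be checked with a short sketch of the pooled two-sample z-test. The function name `two_prop_ztest` is an assumption for illustration. The p-value it prints uses the unrounded z, so it may differ slightly in the third decimal from the table-based value obtained by truncating z to 1.55.

```python
from math import sqrt
from statistics import NormalDist

def two_prop_ztest(x1, n1, x2, n2):
    """Two-sided z-test of H0: pi1 = pi2 using the pooled sample proportion."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)              # estimate of the common pi under H0
    # CLT check with the pooled proportion, as described in the notes
    for n in (n1, n2):
        if n * pooled < 15 or n * (1 - pooled) < 15:
            raise ValueError("CLT check fails for a sample of size %d" % n)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z_obs = (p1 - p2) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z_obs)))
    return pooled, se, z_obs, p_value

# School example: 40 of 70 pass at School1, 45 of 100 pass at School2
pooled, se, z_obs, p_value = two_prop_ztest(40, 70, 45, 100)
print(pooled, round(se, 5), round(z_obs, 3), round(p_value, 3))
```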
Confidence interval for pi1 - pi2
Here, we have no null hypothesis, so we are not assuming that pi1 = pi2.
When doing the hypothesis test, we averaged p1 and p2 to get a pooled p, and used that in our estimate of s.e.(p1 - p2). However, to evaluate the confidence interval, we still need an estimate of
\[ se(p_1 - p_2) = \sqrt{\frac{\pi_1(1-\pi_1)}{n_1} + \frac{\pi_2(1-\pi_2)}{n_2}} \]
⇒ use p1 as an estimate of pi1 and p2 as an estimate of pi2
So, an approx 100(1-alpha)% C.I. for pi1 - pi2 is:
\[ (p_1 - p_2) \pm z_{\alpha/2}\sqrt{\frac{p_1(1-p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2}} \]
Warning: We can't use the confidence interval to accurately carry out the hypothesis test H0: pi1 = pi2, as the standard error of the difference in the two sample proportions is evaluated in different ways:
Under H0, there was only one value of pi to estimate, and we then used the pooled p to estimate the relevant standard error:
\[ se(p_1 - p_2) = \sqrt{\hat{\pi}(1-\hat{\pi})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)} \]
In the CI calculations, we are not assuming pi1 = pi2, so the relevant standard error is estimated by:
\[ se(p_1 - p_2) = \sqrt{\frac{p_1(1-p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2}} \]
For the example
95% C.I. for pi1 - pi2 is:
\[ \left(\frac{40}{70} - \frac{45}{100}\right) \pm 1.96\sqrt{\frac{\frac{40}{70}\cdot\frac{30}{70}}{70} + \frac{\frac{45}{100}\cdot\frac{55}{100}}{100}} = (0.5714 - 0.45) \pm 1.96(0.077289) = 0.1214 \pm 0.1514 = (-0.030,\ 0.273) \]
We are 95% confident that the above interval includes the true difference between the population proportions.
Note: the standard error used in the hypothesis test calculations was 0.07792;
the standard error used in the CI calculations was 0.07729.
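A minimal sketch of the interval for pi1 - pi2, using the unpooled standard error, is given below. The function name `two_prop_ci` is an assumption for illustration; the CLT check uses the observed counts of successes and failures in each sample.

```python
from math import sqrt
from statistics import NormalDist

def two_prop_ci(x1, n1, x2, n2, conf=0.95):
    """Approximate CI for pi1 - pi2 using the unpooled standard error."""
    # CLT check on the observed successes and failures in each sample
    if min(x1, n1 - x1, x2, n2 - x2) < 15:
        raise ValueError("CLT check fails; the normal approximation is doubtful")
    p1, p2 = x1 / n1, x2 / n2
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    diff = p1 - p2
    return se, (diff - z * se, diff + z * se)

# School example: 40 of 70 versus 45 of 100
se_unpooled, (lower, upper) = two_prop_ci(40, 70, 45, 100)
print(round(se_unpooled, 5), round(lower, 3), round(upper, 3))
```

The interval includes 0, which is consistent with retaining H0 here, although the standard error differs from the pooled one used in the test.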
Using Minitab
Under Stat > Basic Stats > 2 Proportions
Note the only option is to use the normal approximation. There is no exact binomial test.
"Use pooled estimate of p for test" must be ticked, or the CI (unpooled p) estimate of the standard error is used in the hypothesis test.

Output (pooled option)
MTB > PTwo 70 40 100 45;
SUBC> Pooled.

Test and CI for Two Proportions
Sample  X   N    Sample p
1       40  70   0.571429
2       45  100  0.450000
Difference = p(1) - p(2)
Estimate for difference: 0.121429
95% CI for diff: (-0.0300545, 0.272912)
Test for difference = 0 (vs not = 0): Z = 1.56  P-Value = 0.119
Fisher's exact test: P-Value = 0.161 (Ignore this until next year)

Output (unpooled option)
MTB > PTwo 70 40 100 45.

Test and CI for Two Proportions
Sample  X   N    Sample p
1       40  70   0.571429
2       45  100  0.450000
Difference = p(1) - p(2)
Estimate for difference: 0.121429
95% CI for diff: (-0.0300545, 0.272912)
Test for difference = 0 (vs not = 0): Z = 1.57  P-Value = 0.116

Results: same CI; only a small difference in z (1.56 vs 1.57) and p-values (0.119 vs 0.116)
Topic 9. Appendix A
Insurance claims example: In past years, 15% of the policy holders have made an insurance claim per year.
This year, of a random sample of 400 policies, 76 have made a claim.
Is there any evidence that the proportion has increased by a factor of more than 1.2 times?
Sample estimate of pi is p = 76/400 = 0.19
Ratio of proportions is
\[ \frac{\text{sample proportion}}{\text{past proportion}} = \frac{0.19}{0.15} = 1.267 \quad (\text{bigger than } 1.2) \]
The results of previous inference were:
H0: pi = 0.15 vs H1: pi ≠ 0.15 was rejected with p-val = 0.030
The 95% CI for pi is (0.1515, 0.2285)
We found that there is evidence that the true proportion this year is higher than 15%
BUT we have not yet answered the question about whether the proportion has increased by a factor of more than 1.2 times!
We can approach this a couple of ways:
(1) Hypothesis test
H0: pi = 0.15 * 1.2 = 0.18
H1: pi > 0.18
alpha = 0.05
CLT check:
n*pi0 = 400 * 0.18 = 72 ≥ 15
n(1-pi0) = 400 * 0.82 = 328 ≥ 15
\[ z_{obs} = \frac{p - \frac{1}{2n} - \pi_0}{\sqrt{\pi_0(1-\pi_0)/n}} = \frac{\frac{76}{400} - \frac{1}{800} - 0.18}{\sqrt{0.18(0.82)/400}} = \frac{0.00875}{0.0192094} = 0.46 \]
p-value = P(Z ≥ 0.46) = 0.3228
Insufficient evidence (at alpha = 5%) to conclude the proportion has increased by a factor of more than 1.2.
(2) Confidence interval (one sided) (not strictly equivalent)
95% CI lower bound for pi:
\[ p - z_{0.05}\sqrt{\frac{p(1-p)}{n}} = \frac{76}{400} - 1.645\sqrt{\frac{\frac{76}{400}\cdot\frac{324}{400}}{400}} = 0.19 - 0.032267 = 0.1577 \]
CLT check (sample numbers): np = 76 ≥ 15 and n(1-p) = 324 ≥ 15
Max value for a proportion is 1, so the upper limit cannot be ∞.
We are 95% certain the interval (0.1577, 1) contains the true proportion. Because 0.18 is in the interval (and is not close to the boundary), there is insufficient evidence to be able to claim that the proportion has increased by a factor of more than 1.2.
To claim the proportion has increased by a factor of more than 1.2, the CI for the new pi would have to be completely ABOVE 0.18.
(3) Confidence interval for the ratio
The claim is that the ratio of the proportions is more than 1.2
i.e. that pi / 0.15 > 1.2
The appropriate 95% one-sided CI for the new pi was found to be (0.1577, 1).
So, the appropriate 95% one-sided CI for the ratio pi / 0.15 is
\[ \left(\frac{0.1577}{0.15},\ \frac{1}{0.15}\right) = (1.051,\ 6.667) \]
Because 1.2 is in the interval (and is not close to the boundary), there is insufficient evidence to be able to claim that the ratio of the proportions is more than 1.2.
To claim the ratio of proportions is more than 1.2, the CI would have to be completely ABOVE 1.2.
Summary: One Sample Proportions Test & C.I.
Hypothesis test:
H0: pi = pi0 versus H1: pi ≠ pi0
Sample: P = X/n where X ~ Bino(n, pi)
CLT check: n*pi0 ≥ 15 and n(1-pi0) ≥ 15
\[ z_{obs} = \frac{|p - \pi_0| - \frac{1}{2n}}{\sqrt{\pi_0(1-\pi_0)/n}} \]
Confidence Interval:
CLT check: np ≥ 15 and n(1-p) ≥ 15
An approximate 100(1-alpha)% CI for pi is
\[ p \pm z_{\alpha/2}\sqrt{\frac{p(1-p)}{n}} \]
One-sided CIs for pi must have as their limit 0 or 1 (not ±∞)
Summary: Two Sample Proportions Test
Hypothesis test:
H0: pi1 = pi2 versus H1: pi1 ≠ pi2
Sample: P1 = X1/n1 and P2 = X2/n2
Pooled estimate of pi is
\[ P = \frac{X_1 + X_2}{n_1 + n_2} \]
CLT check:
n1*p ≥ 15 and n2*p ≥ 15
n1*(1-p) ≥ 15 and n2*(1-p) ≥ 15
\[ z_{obs} = \frac{p_1 - p_2}{\sqrt{p(1-p)\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}} \]
Summary: Two Sample Proportions C.I.
Confidence Interval:
Sample: P1 = X1/n1 and P2 = X2/n2
CLT check (simply uses observed counts):
n1*p1 ≥ 15 and n2*p2 ≥ 15
n1*(1-p1) ≥ 15 and n2*(1-p2) ≥ 15
An approximate 100(1-alpha)% CI for (pi1 - pi2) is
\[ (p_1 - p_2) \pm z_{\alpha/2}\sqrt{\frac{p_1(1-p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2}} \]
One-sided CIs for (pi1 - pi2) must have as their limit -1 or +1 (not ±∞)