NIPRL Chapter 10. Discrete Data Analysis 10.1 Inferences on a Population Proportion 10.2 Comparing...

NIPRL

Chapter 10. Discrete Data Analysis

10.1 Inferences on a Population Proportion10.2 Comparing Two Population Proportions10.3 Goodness of Fit Tests for One-Way Contingency Tables10.4 Testing for Independence in Two-Way Contingency Tables10.5 Supplementary Problems

NIPRL 2

10.1 Inferences on a Population Proportion

µ

µ

µ µ

µ

µ

Sample proportion

~ ( , )

(1 )( ) and ( )

For large ,

(1 )~ ,

~ (0,1)(1 )

p

X B n p

xp

np p

E p p Var pn

n

p pp N p

n

p pN

p pn

·

=

-= =

·

æ ö- ÷ç ÷ç ÷÷çè ø

-

-

Population

Proportion p with characteristic

Random sample of size n

With characteristic Without characteristic

Cell probability p

Cell frequency x

Cell probability 1-p

Cell frequency n-x

NIPRL 3

10.1.1 Confidence Intervals for Population Proportions

µ

µ µ

µ µ

µµ µ

/ 2 / 2

Since ~ (0,1) for large , two-sided 1- confidence interval is(1 )

(1 ) (1 ),

(1 )where , . .( ) .

(1 ) 1But it is customary that . .( )

p pN n

p pn

p p p pp p z p z

n n

x p pp s e p

n n

p ps e p

n n

a a

a-

·-

æ ö- - ÷ç ÷Î - +ç ÷ç ÷÷çè ø

-= =

-= =

µ µ/ 2 / 2

( ).

Two-Sided Confidence Intervals for a Population Proportion

~ ( , )

Approximate two-sided 1- confidence interval for is

( ) ( ) , .

The approximation

x n x

n

X B n p

p

z zx n x x n xp p p

n n n na a

a

-

·

æ ö- - ÷ç ÷Î - +ç ÷ç ÷÷çè ø

is reasonable as long as both and - are larger than 5.x n x

NIPRL 4


Example 55 : Building Tile Cracks

Random sample n = 1250 of tiles in a certain group of downtown building for cracking. x = 98 are found to be cracked.

µ

µ µ

0.005

/ 2 / 2

980.0784, 2.576

125099% two-sided confidence interval

( ) ( ),

2.575 98 (1250 98) 2.575 98 (1250 98)0.0784 ,0.0784

1250 1250 1250 1250

(0.0588,0.0980)

xp z

n

z zx n x x n xp p p

n n n na a

= = = =

æ ö- - ÷ç ÷Î - +ç ÷ç ÷÷çè ø

æ ö´ - ´ - ÷ç ÷= - +ç ÷ç ÷÷çè ø

= (0.06,0.10)

Total number of cracked tiles is between

0.06 5,000,000 300,000 and 0.10 5,000,000 500,000

»

´ = ´ =

NIPRL 5


µ

µ

Since ~ (0,1) for large , one-sided 1- confidence interval (1 )

(1 )for a lower bound is ,1 .

One-Sided Confidence Intervals for a Population Proportion

~ ( , )

Ap

p pN n

p pn

p pp p z

n

X B n p

a

a-

·-

æ ö- ÷ç ÷Î -ç ÷ç ÷÷çè ø

·

µ µ

proximate two-sided 1- confidence interval for is

( ) ( ) ,1 and 0, .

The approximation is reasonable as long as both and - are larger than 5.

p

z zx n x x n xp p p p

n n n n

x n x

a a

a

æ ö æ ö- -÷ ÷ç ç÷ ÷Î - Î -ç ç÷ ÷ç ç÷ ÷÷ ÷ç çè ø è ø

NIPRL 6

10.1.2 Hypothesis Tests on a Population Proportion

µ

µ

0 0 0

0

0

0

0 0

Two-Sided Hypothesis Tests

: versus :

- 2 ( ) if /

- 2 ( ) if /

where ~ ( , ).

When and (1 ) are both larger than 5, a normal a

AH p p H p p

p value P X x p x n p

p value P X x p x n p

X B n p

np n p

·

= ¹

= ´ ³ = >

= ´ £ = <

· -

µ0 0

0 0 0 0

pproximation

can be used to give the p-value of

- 2 ( | |) where ( ) is the standard normal c.d.f. and

.(1 ) (1 )

p value z

p p x npz

p p np p

n

= ´ F - F ×

- -= =

- -

NIPRL 7


0 0

0

Continuity Correction

In order to improve the normal approximation

the value in the numerator of the -statistic

0.5 may be used when 0.5

0.5 may

z

x np x np

x np

·

- - - >

- +

0

0

/ 2

be used when 0.5.

A size hypothesis test when

| | or

and

x np

accepts H

z z p value

rea

a

a

- <

·

£ - ³

0

/ 2

when

| | or .

jects H

z z p valuea a> - <

NIPRL 8


0 0 0

0

0

0 0

One-Sided Hypothesis Tests for a Population Proportion

: versus :

- ( ) where ~ ( , ).

- The normal approximation to this is

0.5 - ( ) and .

(1 )

AH p p H p p

p value P X x X B n p

x npp value z z

np p

·

³ <

= £

- +=F =

-

0 0

0 0 0

0

- if and if .

For : versus : ,

- ( ) where ~ ( , ).

- The normal approximation to this is

- 1 ( ) and

A

Accept H z z reject H z z

H p p H p p

p value P X x X B n p

p value z

a a³ - <-

· < ³

= ³

= - F

0

0 0

0 0

0.5.

(1 )

- if and if .

x npz

np p

Accept H z z reject H z za a

- -=

-

£ - >-

NIPRL 9



10% or more of the building tiles are cracked ?

0

0

0

0 0

: 0.1 versus : 0.1

1250, 98, 0.1.

0.52.50.

(1 )

- ( 2.50) 0.0062.

AH p H p

n x p

x npz

np p

p value

³ <

= = =

- += =-

-

=F - =

z = -2.50 0

NIPRL 10

10.1.3 Sample Size Calculations

µ µ µ µ2/ 2

/ 2 2

The most convenient way to assess the amount of precision

afforded by a sample size is to consider of two-sided confidence

interval for .

4 (1 )(1 )2

n L

p

z p pp pL z n

n La

a

·

--= Þ =

µ µ µ µ

µ

µ µ

2/ 22

2/ 2

.

Since (1 ) is unknown, if we take (1 ) the largest,

1 and .

2

However, if is far from 0.5, we can take * from prior information for

4 *(1 *and

p p p p

zn p

L

p p p p

z p pn

a

a

- -

= =

=

-;

2

).(either * 0.5 or * 0.5)p p

L< >

NIPRL 11


Example 59 : Political Polling

To determine the proportion p of people who agree with the statement “The city mayor is doing a good job.” within 3% accuracy.

(-3% ~ +3%), how many people do they need to poll?

µ µ

2 2/ 22 2

Construct a 99% confidence interval for with a length

no large than 6% ( 3% ~ 3%)

2.576 1843.3 ( 0.01)

0.06representative sample : at least 1844

If "1000 respondents, 3% sampling erro

p

L p p

zn

La a

= - +

Þ = = = =

Þ

±

/ 2

r",

1000 0.06 1.90, ( 1.90) 0.0287.

0.057 95% confidence interval

z nLa

a

Þ = = ´ = F - =

Þ Þ;

NIPRL 12



2/ 2

2

Recall a 99% confidence interval for is (0.0588, 0.0980).

We need to know within 1% with 99% confidence.( 2%)

Based on the above interval, let * 0.1 0.5.

4 *(1 *)5972.2 or about 6000.

r

p

p L

p

z p pn

La

=

=

-Þ =

Þ

=

;

µ

µ µ/ 2 / 2

epresentative sample : 4750 (initial sample 1250)

After the secondary sampling, 6000 and 98 308 406.

4060.0677 and

6000

( ) ( ),

(0.0593, 0.0761)

n x

p

z zx n x x n xp p p

n n n na a

=

= = + =

= =

æ ö- - ÷ç ÷Î - +ç ÷ç ÷÷çè ø

=

NIPRL 13

10.2 Comparing Two Population Proportions

µ µ µ µ

µ µ µ µ

µ µ

Comparing two population proportions

Assume ~ ( , ), ~ ( , ) and .

and .

(1 ) (1 )( ) ( ) ( ) .

( ) (

A B

A B A B

A A B BA B A B

AA B

X B n p Y B m p X Y

x y x yp p p p

n m n mp p p p

Var p p Var p Var pn m

p p p pz

·

^

= = Þ - = -

- -- = + = +

- - -=

µ µ µ µ

µ µ

µµ µ

µ µ

3 3

0

) ( ) ( ).

( ) ( )(1 ) (1 )

If : , the pooled estimate is

and .1 1

(1 )

B A BA B

A A B B

A B

A B

p p p p

x n x y m yp p p pn mn m

H p p

p px yp z

n mp p

n m

- - -=

- -- - ++

· =

-+= =

+ æ ö÷ç- + ÷ç ÷çè ø

NIPRL 14

10.2.1 Confidence Intervals for the Difference Between Two Population Proportions

µ µµ

/ 2

Confidence Intervals for the Difference Between Two Population Proportions

Assume ~ ( , ), ~ ( , ) and .

Approximate two-sided 1- confidence level confidence interval isA B

A B A B

X B n p Y B m p X Y

pp p p p za

a

·

^

- Î - -

µ µ µµ µ

µ µ µ µ

µ µ µ µ

/ 2

/ 2 / 23 3 3 3

(1 ) (1 ) (1 ) (1 ),

( ) ( ) ( ) ( ),

Approximate one-sided 1- confidence level con

A A B B A A B BA B

A B A B

p p p p p p pp p z

n m n m

x n x y m y x n x y m yp p z p p z

n m n m

a

a a

a

æ ö÷ç - - - - ÷ç ÷+ - - +ç ÷ç ÷ç ÷è ø

æ ö- - - - ÷ç ÷= - - + - - +ç ÷ç ÷÷çè ø

·

µ µ

µ µ

3 3

3 3

fidence interval is

( ) ( ), 1 and

( ) ( )1, .

The approximations are reasonable as long as , - , and - ar

A B A B

A B A B

x n x y m yp p p p z

n m

x n x y m yp p p p z

n m

x n x y m y

a

a

æ ö- - ÷ç ÷- Î - - +ç ÷ç ÷÷çè ø

æ ö- - ÷ç ÷- Î - - - +ç ÷ç ÷÷çè ø

·

e

all larger than 5.

NIPRL 15

10.2.1 Confidence Intervals for the Difference Between Two Population Proportions


Building A : 406 cracked tiles out of n = 6000.

Building B : 83 cracked tiles out of m = 2000.

µ µ

µ µ µ µ/ 2 / 23 3 3 3

406 830.0677 and 0.0415.

6000 2000

( ) ( ) ( ) ( ),

(0.0120, 0.0404) where 0.01.

The fact this confidence interval contain

A B

A B A B A B

x yp p

n m

x n x y m y x n x y m yp p p p z p p z

n m n ma a

a

= = = = = =

æ ö- - - - ÷ç ÷- Î - - + - - +ç ÷ç ÷÷çè ø

= =

s only positive values indicates that .A Bp p>

NIPRL 16

10.2.2 Hypothesis Tests on the Difference Between Two Population Proportions

¶ ¶

µ µ

0

Hypothesis Tests of the Equality of Two Population Proportions

Assume ~ ( , ), ~ ( , ) and .

: versus : has - 2 ( | |)

where1 1

(1 )

A B

A B A A

A

B

B

X B n p Y B m p X Y

H p p H p p p value z

z

p p

p

m

p

n

·

^

· = ¹ = ´ F -

=æ

- +

-

µ

0 / 2 0 / 2

0

0 0

0

and

- Accept if | | and Reject if | | .

: versus : has - ( )

Accept if and Reject if .

: versus : ha

A B A A B

A B A A B

x yp

n m

H z z H z z

H p p H p p p value z

H z z H z z

H p p H p p

a a

a a

+=

+ö÷ç ÷ç ÷çè ø

£ >

· ³ < =F

³ - <-

· £ >

0 0

s - 1 ( )

Accept if and Reject if .

p value z

H z z H z za a

= - F

£ - >-

NIPRL 17

10.2.2 Hypothesis Tests on the Difference Between Two Population Proportions

Example 59 : Political Polling Population

age 18-39 age >= 40

A B

“The city mayor is doing a good job.”

Random sample n=952 Random sample m=1043

Agree : x = 627

Disagree : n-x = 325

Agree : y = 421

Disagree : m-y = 622

µ µ

µ

0

627 4210.652 and 0.404

952 1043

: versus :

627 421Pooled estimate 0.525

952 104311.39, - 2 ( 11.39) 0

- Two-sided 99% confidence interval is

(0.199, 0.311)

A B

A B A A B

A B

p p

H p p H p p

p

z p value

p p

= = = =

= ¹

+= =

+= = ´ F -

- Î

;

NIPRL 18

Summary problems

(1) Why do we assume large sample sizes for statistical inferences concerning proportions?

So that the Normal approximation is a reasonable approach.

(2) Can you find an exact size test concerning proportions?

No, in general.

NIPRL 19

10.3 Goodness of Fit Tests for One-Way Contingency Tables10.3.1 One-Way Classifications

µ

1 2 1 2

1 2 1 2

Each of observations is classified into one of categories or cells.

cell frequencies : , , , with

cell probabilities : , , , with 1

~ ( , )

Hypothesis T

k k

k k

ii i i

n k

x x x x x x n

p p p p p p

xx B n p p

n

·

¼ + + + =

¼ + + + =

Þ =

·

L

L

*

0 0

*

est

: 1 versus : is false

The null hypothesis of homogeneity is that

1for 1

i

i

i AH p p i k H H

p i kk

= £ £

= £ £

NIPRL 20

10.3.1 One-Way Classifications

Example 1 : Machine Breakdowns

n = 46 machine breakdowns.

x1 = 9 : electrical problems

x2 = 24 : mechanical problems

x3 = 13 : operator misuse

It is suggested that the probabilities of these three kinds are

p*1 = 0.2, p*2 = 0.5, p*3 = 0.3.

NIPRL 21


1 2

1 2

Goodness of Fit Tests for One-Way Contingency Tables

Consider a multinomial distribution with k-cells and

a set of unknown cell probabilities , , , and

a set of observed cell frequencies , ,kp p p

x x

·

¼

¼

1 2

*0

2 2 2 21 1

22 2

1 1

*

, with

: 1

- ( ) or - ( )

( )where and 2 ln

with , 1

The - are appropriate as long as

i

i

k k

i

k k

k ki i i

ii ii i

i

x x x x n

H p p i k

p value P X p value P G

x e xX G x

e e

e np i k

p value e

c c- -

= =

+ + + =

· = £ £

= ³ = ³

æ ö- ÷ç ÷= = ç ÷ç ÷çè ø

= £ £

å å

L

2 2 2 20 , 1 , 1

2 2 2 20 , 1 , 1

5, 1 .

At size , accept if (or if ) and

reject if (or if ).

i

k k

k k

i k

H X G

H X G

a a

a a

a c c

c c

- -

- -

³ £ £

£ £

> >

NIPRL 22


Example 1 : Machine Breakdowns

H0 : p1 = 0.2, p2 = 0.5, p3 = 0.3

Electrical MechanicalOperator misuse

Observed cell freq.

x1 = 9 x2 = 24 x3 = 13 n = 46

Expected cell freq.

e1 = 46*0.2

= 9.2

e2 = 46*0.5 =23.0

e3 = 46*0.3 =13.8

n = 46

2 2 22

2

22

(9.0 9.2) (24.0 23.0) (13.0 13.8)0.0942

9.2 23.0 13.8

9.0 24.0 13.02 9.0 ln 24.0 ln 13.0 ln 0.0945

9.2 23.0 13.8

- ( 0.094) 0.095 ( 1 3 1 2)

X

G

p value P kc

- - -= + + =

æ öæ ö æ ö æ ö÷ç ÷ ÷ ÷ç ç ç= ´ ´ + ´ + ´ =÷÷ ÷ ÷ç ç ç ç ÷÷ ÷ ÷÷ ÷ ÷ç ç çç ÷ç è ø è ø è øè ø

= ³ - = - =;

NIPRL 23


( )

20 2

2

2

is plausible from the - ( 0.094) 0.095.

Consider 95% confidence interval for .

24 1.96 24 (46 24) 24 1.96 24 (46 24),

46 46 46 46 46 46

0.378, 0.666

Check for the homo

H p value P

p

p

c· = ³

æ ö´ - ´ - ÷ç ÷Î - +ç ÷ç ÷÷çè ø

=

·

;

( )

11 2 3 3

461 2 3 3

2 2 22

22

13

geneity.

Let .

Then 15.33.

(9.0 15.33) (24.0 15.33) (13.0 15.33)7.87

15.33 15.33 15.33

- ( 7.87) 0.02

Not plausible. Also notice 0.378, 0.666 .

p p p

e e e

X

p value P c

= = =

= = = =

- - -= + + =

= ³

Þ Ï

;

NIPRL 24

10.3.2 Testing Distributional Assumptions

Example 3 : Software Errors

For some of expected values are smaller than 5, it is appropriate to group the cells.

Test if the data are from a Poisson distribution with mean=3.

NIPRL 25

10.3.2 Testing Distributional Assumptions

2 2 22

2 2 2

25

(17.00 16.93) (20.0 19.04) (25.00 19.04)

16.93 19.04 19.04

(14.00 14.28) (6.00 8.57) (3.00 7.14)

14.28 8.57 7.145.12

- ( 5.12) 0.40

Plausible that the software errors have a Po

X

p value P c

- - -= + +

- - -+ + +

=

= ³ =

Þ

0

isson distribution

with mean 3.

If : the software errors have a Poisson distribution,

would be calculated using 2.76 and - would

be calculated from a chi-square distribution withi

H

e x p value

l

l

=

·

= =

1 1 4 degrees of freedom.k - - =

NIPRL 26

10.4 Testing for Independence in Two-Way Contingency Tables10.4.1 Two-Way Classifications

Level 1 Level 2 Level j Level c

Level 1 x11 x12 x1c x1.

Level 2 x21 x22 x2c x2.

Level i xij xi.

Level r xr1 xr2 xrc xr.

x.1 x.2 x.j x.c n = x..

A two-way (r x c) contingency table.

Second Categorization

First C

ategorization

Row

marginal frequencies

Column marginal frequencies

NIPRL 27

10.4.1 Two-Way Classifications


Notice that the column marginal frequencies are fixed.

( x.1 = 6000, x.2 = 2000)

Location

Tile Condition

Building A Building B

Undamaged x11 = 5594 x12 = 1917 x1. = 7511

Cracked x21 = 406 x22 = 83 x2. = 489

x.1 = 6000 x.2 = 2000 n = x.. = 8000

NIPRL 28

10.4.2 Testing for Independence

0

22 2

1 1 1 1

. .

Testing for Independence in a Two-Way Contingency Table

: Two categorizations are independent.

Test statistics

( )or 2 ln

where

-

r c r cij ij ij

iji j i jij ij

i jij

H

x e xX G x

e e

x xe

n

p

= = = =

·

æ ö- ÷ç ÷ç= = ÷ç ÷÷çè ø

=

å å å å

2 2 2 2

2 2 20 ,

2 2 20 ,

( ) or - ( )

where ( 1) ( 1)

A size hypothesis test if (or )

if (or )

value P X p value P G

r c

accepts H X G

rejects H X G

n n

a n

a n

c c

n

a c

c

= ³ = ³

= - ´ -

>

<

NIPRL 29

10.4.2 Testing for Independence Example 55 : Building Tile Cracks

Building A Building B

Undamagedx11 = 5594

e11 = 5633.25

x12 = 1917

e12 = 1877.75x1. = 7511

Crackedx21 = 406

e21 = 366.75

x22 = 83

e22 = 122.25x2. = 489

x.1 = 6000 x.2 = 2000 n = x.. = 8000

11 12

21 22

7511 6000 7511 20005633.25, 1877.75

8000 8000489 6000 489 2000

366.75, 122.258000 8000

e e

e e

´ ´= = = =

´ ´= = = =

NIPRL 30

10.4.2 Testing for Independence2 2

2

2 2

21

(5594.00 5633.25) (1917.00 1877.75)

5633.25 1877.75

(406.00 366.75) (83.00 122.25)

366.75 122.2517.896

- ( 17.896) 0

Consequently, it is not plausible that (the probability tA

X

p value P

p

c

- -= +

- -+ +

=

= ³ ;

hat a tile on Building A

becomes cracked) and (the probability that a tile on Building B becomes cracked)

are equal. This is consistent that the confidence interval of and

does not c

B

A B

p

p p

( )

22 11 22 12 21

1 1 2 2

ontain zero.

0.0120, 0.0404

A formula for the Pearson chi-square statistic for 2 2 table is

( )

A Bp p

n x x x xX

x x x x

- Î

´

-=

g g g g

NIPRL 31

Summary problems

1. Construct a goodness-of-fit test for testing a distributional assumption of a normal distribution by applying the one-way classification method.

2.

*0 1 0In testing : : , what is the df of the Pearson

chi-square statistic, when 1,..., , 1,..., ?

ij ijH p p vs H not H

i r j c

NIPRL Chapter 10. Discrete Data Analysis 10.1 Inferences on a Population Proportion 10.2 Comparing...

Documents

Transcript of NIPRL Chapter 10. Discrete Data Analysis 10.1 Inferences on a Population Proportion 10.2 Comparing...