Copyright (c) 2004 Brooks/Cole, a division of Thomson Learning, Inc. Chapter 14 Goodness-of-Fit...

Chapter 14 Goodness-of-Fit Tests and Categorical Data Analysis


Page 1: Copyright (c) 2004 Brooks/Cole, a division of Thomson Learning, Inc. Chapter 14 Goodness-of-Fit Tests and Categorical Data Analysis.

Chapter 14

Goodness-of-Fit Tests and Categorical Data Analysis

Page 2:

14.1 Goodness-of-Fit Tests When Category Probabilities Are Completely Specified

Page 3:

Terminology

A binomial experiment consists of a sequence of independent trials in which each trial can result in one of two possible outcomes. A multinomial experiment generalizes a binomial experiment by allowing each trial to result in one of k outcomes, where k is an integer greater than 2.

Page 4:

Multinomial Experiment

The expected number of trials resulting in category i is $E(N_i) = np_i$. When $H_0\colon p_1 = p_{10}, \ldots, p_k = p_{k0}$ is true, these expected values become $E(N_1) = np_{10}, \ldots, E(N_k) = np_{k0}$.

Page 5:

Recall: Chi-squared Critical Value

Let $\chi^2_{\alpha,\nu}$, called a chi-squared critical value, denote the number on the measurement axis such that $\alpha$ of the area under the chi-squared curve with $\nu$ df lies to the right of $\chi^2_{\alpha,\nu}$.
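As a rough numerical illustration of this definition, the sketch below computes the upper-tail area of a chi-squared curve for integer df and then finds the critical value by bisection. The function names `chi2_upper_tail` and `chi2_critical_value` are hypothetical, and the tail area is built from the standard closed forms for 1 and 2 df plus a recurrence that raises the df in steps of 2. In practice a printed table or a statistics library would normally be used instead.

```python
import math


def chi2_upper_tail(x, df):
    """Area to the right of x under the chi-squared curve with integer df."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    # Closed-form starting points: df = 1 (odd case) or df = 2 (even case).
    if df % 2 == 1:
        tail = math.erfc(math.sqrt(half))
        a = 0.5
    else:
        tail = math.exp(-half)
        a = 1.0
    # Raise df by 2 at a time: Q(a + 1, y) = Q(a, y) + y^a e^{-y} / Gamma(a + 1).
    while 2 * a < df:
        tail += math.exp(a * math.log(half) - half - math.lgamma(a + 1))
        a += 1.0
    return tail


def chi2_critical_value(alpha, df):
    """Value c such that the area to the right of c equals alpha (bisection)."""
    lo, hi = 0.0, 1.0
    while chi2_upper_tail(hi, df) > alpha:  # expand until the root is bracketed
        hi *= 2.0
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if chi2_upper_tail(mid, df) > alpha:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0
```

For example, `chi2_critical_value(0.05, 3)` should agree with the tabled value of about 7.815.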

Page 6:

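Notation Illustrated: $\chi^2_{\alpha,\nu}$

[Figure: the $\chi^2_\nu$ pdf, with the shaded area $= \alpha$ lying to the right of $\chi^2_{\alpha,\nu}$.]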

Page 7:

Multinomial Experiment

Provided that $np_i \ge 5$ for every i, the random variable

$$\chi^2 = \sum_{i=1}^{k} \frac{(N_i - np_i)^2}{np_i} = \sum_{\text{all cells}} \frac{(\text{observed} - \text{expected})^2}{\text{expected}}$$

has approximately a chi-squared distribution with k – 1 df.

Page 8:

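Test With Significance Level $\alpha$

Null hypothesis: $H_0\colon p_1 = p_{10}, \ldots, p_k = p_{k0}$

Alternative hypothesis: $H_a$: at least one $p_i \ne p_{i0}$

Test statistic value:

$$\chi^2 = \sum_{i=1}^{k} \frac{(n_i - np_{i0})^2}{np_{i0}}$$

Rejection region: $\chi^2 \ge \chi^2_{\alpha,k-1}$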
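The test above can be sketched in a few lines of Python. The function name `chi2_goodness_of_fit` and the die-rolling counts are made up for illustration and are not from the slides. The critical value is passed in by the caller (here the tabled value of about 11.070 for $\alpha = .05$ and 5 df).

```python
def chi2_goodness_of_fit(observed, null_probs, critical_value):
    """Return (statistic, df, reject) for completely specified cell probabilities."""
    n = sum(observed)
    expected = [n * p for p in null_probs]  # n * p_i0 for each cell
    if min(expected) < 5:
        raise ValueError("every expected count n * p_i0 should be at least 5")
    statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    df = len(observed) - 1
    return statistic, df, statistic >= critical_value


# Hypothetical data: 60 rolls of a die, H0: each face has probability 1/6.
counts = [8, 12, 9, 11, 13, 7]
stat, df, reject = chi2_goodness_of_fit(counts, [1 / 6] * 6, critical_value=11.070)
print(stat, df, reject)
```

With these counts every expected count is 10, so the statistic is (4 + 4 + 1 + 1 + 9 + 9)/10 = 2.8, well below the critical value, and H0 is not rejected.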

Page 9:

P-Values for Chi-Squared Tests

The P-value for an upper-tailed chi-squared test is the area under the $\chi^2_\nu$ curve to the right of the calculated $\chi^2$.
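To make "area under the curve to the right" concrete, the following sketch integrates the chi-squared pdf numerically with the composite Simpson rule. The names `chi2_pdf` and `chi2_p_value`, the integration width of 100, and the step count are assumptions chosen for illustration. The sketch assumes a positive statistic and a moderate number of df, so that the area beyond the truncation point is negligible.

```python
import math


def chi2_pdf(x, df):
    """Chi-squared density with df degrees of freedom (0 for x <= 0)."""
    if x <= 0:
        return 0.0
    half_df = df / 2.0
    log_density = ((half_df - 1) * math.log(x) - x / 2.0
                   - half_df * math.log(2.0) - math.lgamma(half_df))
    return math.exp(log_density)


def chi2_p_value(statistic, df, width=100.0, steps=20000):
    """Approximate the upper-tail area from `statistic` to `statistic + width`."""
    h = width / steps  # steps must be even for Simpson's rule
    total = chi2_pdf(statistic, df) + chi2_pdf(statistic + width, df)
    for i in range(1, steps):
        weight = 4 if i % 2 else 2
        total += weight * chi2_pdf(statistic + i * h, df)
    return total * h / 3.0
```

As a check, a statistic equal to the tabled critical value 7.815 with 3 df should give a P-value close to .05.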

Page 10:

$\chi^2$ When the $p_i$'s Are Functions of Other Parameters

Frequently the $p_i$'s are hypothesized to depend on a smaller number of parameters $\theta_1, \ldots, \theta_m$ (m < k). Then a specific hypothesis involving the $\theta_i$'s yields specific $p_{i0}$'s, which are then used in the $\chi^2$ test.

Page 11:

$\chi^2$ When the Underlying Distribution Is Continuous

Let X denote the variable being sampled. The hypothesized pdf is $f_0(x)$. Subdivide the measurement scale of X into k intervals $[a_0, a_1), \ldots, [a_{k-1}, a_k)$. The cell probabilities specified by $H_0$ are

$$p_{i0} = P(a_{i-1} \le X < a_i) = \int_{a_{i-1}}^{a_i} f_0(x)\,dx$$
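As a small illustration, each cell probability is a difference of the hypothesized cdf at the two interval endpoints. The sketch below assumes, purely as an example, that the hypothesized distribution is standard normal and that the cell boundaries are -1, 0, and 1. The helper names `normal_cdf` and `cell_probabilities` are hypothetical.

```python
import math


def normal_cdf(x, mu=0.0, sigma=1.0):
    """Normal cdf computed from the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def cell_probabilities(boundaries, cdf):
    """p_i0 = F0(a_i) - F0(a_{i-1}) for consecutive boundaries a_0 < ... < a_k."""
    return [cdf(b) - cdf(a) for a, b in zip(boundaries, boundaries[1:])]


# Assumed cells: (-inf, -1), [-1, 0), [0, 1), [1, inf)
boundaries = [float("-inf"), -1.0, 0.0, 1.0, float("inf")]
probs = cell_probabilities(boundaries, normal_cdf)
print([round(p, 4) for p in probs])
```

These probabilities then play the role of the $p_{i0}$'s in the chi-squared statistic.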

Page 12:

14.2 Goodness of Fit for Composite Hypotheses

Page 13:

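$\chi^2$ When Parameters Are Estimated

The null hypothesis states that each $p_i$ is a function of a small number of parameters $\theta_1, \ldots, \theta_m$, with the $\theta_i$'s otherwise unspecified.

$H_0\colon p_1 = \pi_1(\boldsymbol{\theta}), \ldots, p_k = \pi_k(\boldsymbol{\theta})$, where $\boldsymbol{\theta} = (\theta_1, \ldots, \theta_m)$

$H_a$: the hypothesis $H_0$ is not true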

Page 14:

Joint Distribution

For general k, the joint distribution of $N_1, \ldots, N_k$ is the multinomial distribution with

$$P(N_1 = n_1, \ldots, N_k = n_k) \propto p_1^{n_1} \cdots p_k^{n_k}$$

When $H_0$ is true this becomes

$$P(N_1 = n_1, \ldots, N_k = n_k) \propto [\pi_1(\boldsymbol{\theta})]^{n_1} \cdots [\pi_k(\boldsymbol{\theta})]^{n_k}$$

Page 15:

Method of Estimation

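Let $n_1, \ldots, n_k$ denote the observed values of $N_1, \ldots, N_k$. Then $\hat\theta_1, \ldots, \hat\theta_m$ are those values of the $\theta_i$'s that maximize

$$P(N_1 = n_1, \ldots, N_k = n_k) \propto [\pi_1(\boldsymbol{\theta})]^{n_1} \cdots [\pi_k(\boldsymbol{\theta})]^{n_k}$$

$\hat\theta_1, \ldots, \hat\theta_m$ are the maximum likelihood estimators of $\theta_1, \ldots, \theta_m$.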
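The maximization can be sketched numerically for a one-parameter example. The model used here, with three cells and probabilities θ², 2θ(1 - θ), and (1 - θ)², is a Hardy-Weinberg type model chosen purely for illustration and is not taken from the slides, and the counts are made up. For this model the log-likelihood is concave in θ, so a simple ternary search suffices, and the result can be compared with the known closed form (2n₁ + n₂)/(2n).

```python
import math


def cell_probs(theta):
    """Assumed illustrative model: pi_1, pi_2, pi_3 as functions of theta."""
    return [theta ** 2, 2 * theta * (1 - theta), (1 - theta) ** 2]


def log_likelihood(theta, counts):
    """Log of pi_1^n1 * ... * pi_k^nk (the multinomial constant is dropped)."""
    return sum(n_i * math.log(p_i) for n_i, p_i in zip(counts, cell_probs(theta)))


def maximize_theta(counts, lo=1e-9, hi=1 - 1e-9, iterations=200):
    """Ternary search for the maximizer of a concave log-likelihood on (0, 1)."""
    for _ in range(iterations):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if log_likelihood(m1, counts) < log_likelihood(m2, counts):
            lo = m1
        else:
            hi = m2
    return (lo + hi) / 2


counts = [30, 50, 20]  # hypothetical cell counts, n = 100
theta_hat = maximize_theta(counts)
closed_form = (2 * counts[0] + counts[1]) / (2 * sum(counts))
print(theta_hat, closed_form)
```

Both values should be close to 0.55 for these counts.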

Page 16:

Theorem

Under general "regularity" conditions on $\theta_1, \ldots, \theta_m$ and the $\pi_i(\boldsymbol{\theta})$'s, if $\theta_1, \ldots, \theta_m$ are estimated by the method of maximum likelihood as described previously and n is large,

$$\chi^2 = \sum_{i=1}^{k} \frac{[N_i - n\pi_i(\hat{\boldsymbol{\theta}})]^2}{n\pi_i(\hat{\boldsymbol{\theta}})}$$

has approximately a chi-squared distribution with k – 1 – m df when $H_0$ is true.

Page 17:

Level $\alpha$ Test

An approximate level $\alpha$ test of $H_0$ versus $H_a$ is then to reject $H_0$ if $\chi^2 \ge \chi^2_{\alpha,k-1-m}$. In practice, the test can be used if $n\pi_i(\hat{\boldsymbol{\theta}}) \ge 5$ for every i.
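A minimal sketch of this test follows, again under an assumed three-cell model with cell probabilities θ², 2θ(1 - θ), (1 - θ)² and made-up counts, neither of which comes from the slides. One parameter is estimated from the cell counts, so m = 1 and the statistic is referred to k - 1 - m = 1 df. The function name `chi2_estimated` is hypothetical, and the critical value 3.841 is the usual tabled value for α = .05 and 1 df.

```python
def chi2_estimated(observed, fitted_probs, num_estimated):
    """Return (statistic, df) when the cell probabilities use estimated parameters."""
    n = sum(observed)
    expected = [n * p for p in fitted_probs]  # n * pi_i(theta_hat)
    if min(expected) < 5:
        raise ValueError("every estimated expected count should be at least 5")
    statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    df = len(observed) - 1 - num_estimated
    return statistic, df


counts = [30, 50, 20]  # hypothetical data
n = sum(counts)
theta_hat = (2 * counts[0] + counts[1]) / (2 * n)  # MLE for the assumed model
fitted = [theta_hat ** 2, 2 * theta_hat * (1 - theta_hat), (1 - theta_hat) ** 2]
stat, df = chi2_estimated(counts, fitted, num_estimated=1)
print(stat, df, stat >= 3.841)
```

Here the estimated expected counts are 30.25, 49.5, and 20.25, the statistic is about 0.0102, and H0 is not rejected.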

Page 18:

Degrees of Freedom

A general rule of thumb for degrees of freedom in a chi-squared test is

$$\chi^2 \text{ df} = \left(\text{number of freely determined cell counts}\right) - \left(\text{number of independent parameters estimated}\right)$$

Page 19:

Test Procedure

If $\chi^2 \ge \chi^2_{\alpha,k-1}$, reject $H_0$.

If $\chi^2 \le \chi^2_{\alpha,k-1-m}$, do not reject $H_0$.

If $\chi^2_{\alpha,k-1-m} < \chi^2 < \chi^2_{\alpha,k-1}$, withhold judgement.

Page 20:

Goodness of Fit for Discrete Distributions

Let $\hat\theta_1, \ldots, \hat\theta_m$ be the maximum likelihood estimators of $\theta_1, \ldots, \theta_m$ based on the full sample $X_1, \ldots, X_n$, and let $\chi^2$ denote the statistic based on these estimators. Then the critical value $c_\alpha$ that specifies a level $\alpha$ upper-tailed test satisfies

$$\chi^2_{\alpha,k-1-m} \le c_\alpha \le \chi^2_{\alpha,k-1}$$
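Because the exact critical value is only known to lie between the two tabled values, the decision rule has three outcomes. The sketch below encodes that rule. The function name `three_way_decision` is hypothetical, and the example assumes k = 5 cells and m = 2 estimated parameters at α = .05, so the bounds are the tabled values of about 5.991 (2 df) and 9.488 (4 df).

```python
def three_way_decision(statistic, lower_critical, upper_critical):
    """Decide using chi2_{alpha, k-1-m} (lower) and chi2_{alpha, k-1} (upper)."""
    if statistic >= upper_critical:
        return "reject H0"
    if statistic <= lower_critical:
        return "do not reject H0"
    return "withhold judgement"


# Assumed setting: k = 5, m = 2, alpha = .05.
for value in (10.0, 7.0, 3.0):
    print(value, three_way_decision(value, 5.991, 9.488))
```

A statistic of 7.0 falls between the two bounds, so neither conclusion is warranted from the tables alone.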

Page 21:

Goodness of Fit for Continuous Distributions

The chi-squared test can be used to test whether the sample comes from a specified family of continuous distributions. Once the cells are chosen (independent of the observations) it is usually difficult to estimate unspecified parameters from the observed cell counts, so mle’s based on the full sample are computed.

Page 22:

Special Test for Normality

Let $y_i = \Phi^{-1}[(i - .375)/(n + .25)]$ and compute the sample correlation coefficient r for the pairs $(x_{(1)}, y_1), \ldots, (x_{(n)}, y_n)$. The Ryan-Joiner test of

$H_0$: the population distribution is normal

versus

$H_a$: the population distribution is not normal

consists of rejecting $H_0$ when $r \le c_\alpha$.
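The correlation r used in this test can be computed as in the sketch below, which pairs the ordered observations with the normal scores defined by the (i - .375)/(n + .25) plotting positions. The function names `normal_scores` and `ryan_joiner_r` are hypothetical. The critical value $c_\alpha$ is not computed here, since it has to come from a published Ryan-Joiner table.

```python
import math
from statistics import NormalDist


def normal_scores(n):
    """y_i = Phi^{-1}((i - .375) / (n + .25)) for i = 1, ..., n."""
    std_normal = NormalDist()
    return [std_normal.inv_cdf((i - 0.375) / (n + 0.25)) for i in range(1, n + 1)]


def ryan_joiner_r(sample):
    """Sample correlation between the ordered data and the normal scores."""
    xs = sorted(sample)
    ys = normal_scores(len(xs))
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)
```

Values of r near 1 are consistent with normality. A strongly skewed sample such as `[1.0, 2.0, 4.0, 8.0, 16.0]` gives a noticeably smaller r.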

Page 23:

14.3 Two-Way Contingency Tables

Page 24:

Data With Counts or Frequencies

1. There are I populations of interest, each corresponding to a different row of the table, and each population is divided into the same J categories. A sample is taken from the ith population, and the counts are entered in the cells in the ith row of the table.

Page 25:

Data With Counts or Frequencies

2. There is a single population of interest, with each individual in the population categorized with respect to two different factors. There are I categories associated with the first factor and J categories associated with the second factor.

Page 26:

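Two-Way Contingency Table

$$\begin{array}{cccccc}
n_{11} & n_{12} & \cdots & n_{1j} & \cdots & n_{1J} \\
n_{21} &        &        & \vdots &        & \vdots \\
\vdots &        &        &        &        &        \\
n_{i1} &        & \cdots & n_{ij} & \cdots &        \\
\vdots &        &        &        &        &        \\
n_{I1} &        &        & \cdots &        & n_{IJ}
\end{array}$$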

Page 27:

Estimated Expected Counts Under $H_0$ (Homogeneity)

$$\hat e_{ij} = \text{estimated expected count in cell } (i, j) = \frac{n_{i\cdot}\, n_{\cdot j}}{n} = \frac{(i\text{th row total})(j\text{th column total})}{n}$$

Page 28:

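Test for Homogeneity

Null hypothesis: $H_0\colon p_{1j} = p_{2j} = \cdots = p_{Ij}$, $j = 1, \ldots, J$

Alternative hypothesis: $H_a$: $H_0$ is not true

Test statistic value:

$$\chi^2 = \sum_{i=1}^{I} \sum_{j=1}^{J} \frac{(n_{ij} - \hat e_{ij})^2}{\hat e_{ij}}$$

Rejection region: $\chi^2 \ge \chi^2_{\alpha,(I-1)(J-1)}$

Apply as long as $\hat e_{ij} \ge 5$ for all cells.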
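The computation for a two-way table can be sketched as follows. The function name `two_way_chi2` and the 2 × 2 table of counts are made up for illustration. The same arithmetic applies to the test for independence, since the estimated expected counts and the df have the same form.

```python
def two_way_chi2(table):
    """Return (statistic, df) for an I x J table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, n_ij in enumerate(row):
            # (ith row total)(jth column total) / n
            e_ij = row_totals[i] * col_totals[j] / n
            if e_ij < 5:
                raise ValueError("every estimated expected count should be at least 5")
            statistic += (n_ij - e_ij) ** 2 / e_ij
    df = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, df


# Hypothetical counts: two populations (rows), two categories (columns).
stat, df = two_way_chi2([[20, 30], [30, 20]])
print(stat, df, stat >= 3.841)
```

Every estimated expected count is 25 here, so the statistic is 4.0 with 1 df, which exceeds the tabled value of about 3.841 for α = .05.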

Page 29:

Multinomial Experiment

Provided that $np_i \ge 5$ for every i, the random variable

$$\chi^2 = \sum_{i=1}^{k} \frac{(N_i - np_i)^2}{np_i} = \sum_{\text{all cells}} \frac{(\text{observed} - \text{expected})^2}{\text{expected}}$$

has approximately a chi-squared distribution with k – 1 df.

Page 30:

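Estimated Expected Counts (Independence)

$$\hat e_{ij} = \text{estimated expected count in cell } (i, j) = n \cdot \hat p_{i\cdot} \cdot \hat p_{\cdot j} = n \cdot \frac{n_{i\cdot}}{n} \cdot \frac{n_{\cdot j}}{n} = \frac{n_{i\cdot}\, n_{\cdot j}}{n} = \frac{(i\text{th row total})(j\text{th column total})}{n}$$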

Page 31:

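Test for Independence

Null hypothesis: $H_0\colon p_{ij} = p_{i\cdot} \cdot p_{\cdot j}$, $i = 1, \ldots, I$; $j = 1, \ldots, J$

Alternative hypothesis: $H_a$: $H_0$ is not true

Test statistic value:

$$\chi^2 = \sum_{i=1}^{I} \sum_{j=1}^{J} \frac{(n_{ij} - \hat e_{ij})^2}{\hat e_{ij}}$$

Rejection region: $\chi^2 \ge \chi^2_{\alpha,(I-1)(J-1)}$

Apply as long as $\hat e_{ij} \ge 5$ for all cells.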