Assessment of Misclassification Error in Stratification Due to Incomplete Frame Information

Assessment of Misclassification Error in Stratification Due to Incomplete Frame Information. Donsig Jang, Xiaojing Lin, Amang Sukasih (Mathematica Policy Research, Inc.); Steve Cohen, Kelly Kang (National Science Foundation). ITSEW 2008, Research Triangle Park, NC, June 2, 2008.


Transcript of Assessment of Misclassification Error in Stratification Due to Incomplete Frame Information

Page 1: Assessment of Misclassification Error in Stratification Due to Incomplete Frame Information

Assessment of Misclassification Error in Stratification Due to Incomplete Frame Information

Donsig Jang, Xiaojing Lin, Amang Sukasih
Mathematica Policy Research, Inc.

Steve Cohen, Kelly Kang
National Science Foundation

ITSEW 2008
Research Triangle Park, NC, June 2, 2008

Page 2: Assessment of Misclassification Error in Stratification Due to Incomplete Frame Information

Disclaimer

The opinions and assertions are those of the authors and do not reflect the views or policies of the National Science Foundation

Page 3: Assessment of Misclassification Error in Stratification Due to Incomplete Frame Information

Survey Data Collection

Involves many complex processes, including
– Sampling frame construction
– Sample selection
– Data collection
– Data processing
– Estimation

Each process is subject to error

Attempt to decompose total survey error by process stage

Page 4: Assessment of Misclassification Error in Stratification Due to Incomplete Frame Information

Total Survey Errors

[Diagram: survey stages (Parameter, Sampling Frame, Sample, Respondent, Data, Estimator) with the associated error sources: misclassification error, coverage error, sampling error, nonresponse error, measurement error, estimation error]

Page 5: Assessment of Misclassification Error in Stratification Due to Incomplete Frame Information

Misclassification Error in Stratification

Focus of this talk

A part of non-sampling error

Important but often overlooked component

Page 6: Assessment of Misclassification Error in Stratification Due to Incomplete Frame Information

Stratification in Sampling

Enhance precision of survey estimates

Precision requirements for analytic domains

Often imperfect information on stratification variables

– Misclassification in stratification

– Trade-off: cost to gather stratification information at the frame construction vs. optimal sample allocation

– Loss of effective sample sizes for some analytic domains

Page 7: Assessment of Misclassification Error in Stratification Due to Incomplete Frame Information

Misclassification Matrix

$\theta_{jk}$ = the proportion of units classified as category $j$ among units in true category $k$, and

$$\sum_{j=1}^{m} \theta_{jk} = 1$$

$$\Theta = \begin{pmatrix} \theta_{11} & \theta_{12} & \cdots & \theta_{1m} \\ \theta_{21} & \theta_{22} & \cdots & \theta_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ \theta_{m1} & \theta_{m2} & \cdots & \theta_{mm} \end{pmatrix}$$

Columns: true classification $A$

Rows: stratification classification $A^*$

Page 8: Assessment of Misclassification Error in Stratification Due to Incomplete Frame Information

Measures for Misclassification Effects

Bias

Effective sample size change

Page 9: Assessment of Misclassification Error in Stratification Due to Incomplete Frame Information

Bias Due to Misclassification

$$\mathrm{Bias}(p_{A^*}) = (\Theta - I)\, P_A$$

where

$P_A = (P_A(1), \ldots, P_A(m))^{\top}$ = true population proportions

$p_{A^*} = (p_{A^*}(1), \ldots, p_{A^*}(m))^{\top}$ = sample proportions

$$p_{A^*}(j) = \sum_{i \in s} w_i \, I(A_i^* = j)$$

$s$ denotes the sample, $w_i$ the sampling weight for unit $i$, and $I(\cdot)$ the indicator function

$I$ = identity matrix

Kuha and Skinner 1997
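As a small worked illustration of the bias formula, consider a hypothetical two-category case. The values of $\Theta$ and $P_A$ below are invented for the arithmetic and do not come from the NSRCG data.

$$\Theta = \begin{pmatrix} 0.9 & 0.2 \\ 0.1 & 0.8 \end{pmatrix}, \qquad P_A = \begin{pmatrix} 0.5 \\ 0.5 \end{pmatrix}$$

$$\Theta P_A = \begin{pmatrix} 0.9(0.5) + 0.2(0.5) \\ 0.1(0.5) + 0.8(0.5) \end{pmatrix} = \begin{pmatrix} 0.55 \\ 0.45 \end{pmatrix}, \qquad \mathrm{Bias}(p_{A^*}) = (\Theta - I)\, P_A = \begin{pmatrix} 0.05 \\ -0.05 \end{pmatrix}$$

Each column of $\Theta$ sums to 1, so the bias components sum to zero: units wrongly gained by one category are lost by another.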

Page 10: Assessment of Misclassification Error in Stratification Due to Incomplete Frame Information

Bias Estimation

If the true classification is available from the sample:

$$\widehat{\mathrm{ReBias}}(p_{A^*}) = D^{-1}(p_A)\,(\hat{\Theta} - I)\, p_A$$

where

$D(p_A) = (d_{jk})$; $d_{jk} = p_A(j)$ if $j = k$; $d_{jk} = 0$ otherwise

$$p_A(j) = \sum_{i \in s} w_i \, I(A_i = j)$$

$\hat{\Theta} = (\hat{\theta}_{jk})$,

$$\hat{\theta}_{jk} = \frac{\sum_{i \in s} w_i \, I(A_i^* = j,\ A_i = k)}{\sum_{i \in s} w_i \, I(A_i = k)}$$
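The estimator on this slide can be sketched in a few lines of Python. This is an illustration only, not the authors' code: the function name, argument names, and toy data are hypothetical, and the sketch assumes the weights are normalized by their total so that the weighted sums are proportions.

```python
def estimate_theta_and_rebias(weights, frame, true, categories):
    """Return (theta_hat, rebias) for unit-level data.

    weights    -- sampling weights w_i
    frame      -- stratification (frame) category A*_i for each unit
    true       -- true (reported) category A_i for each unit
    categories -- ordered list of category labels

    theta_hat[j][k] is the weighted share of units in true category k
    that the frame placed in category j, so each column sums to 1.
    Assumes every category has at least one unit with that true value.
    """
    m = len(categories)
    idx = {c: n for n, c in enumerate(categories)}

    # Weighted cross-tabulation: rows = frame (A*), columns = true (A).
    cross = [[0.0] * m for _ in range(m)]
    for w, a_star, a in zip(weights, frame, true):
        cross[idx[a_star]][idx[a]] += w

    total = sum(weights)
    col_tot = [sum(cross[j][k] for j in range(m)) for k in range(m)]
    theta_hat = [[cross[j][k] / col_tot[k] for k in range(m)] for j in range(m)]

    # p_A(j), using weights normalized by their total (assumption).
    p_true = [col_tot[k] / total for k in range(m)]

    # ReBias_j = [(Theta_hat - I) p_A]_j / p_A(j)
    rebias = []
    for j in range(m):
        theta_p = sum(theta_hat[j][k] * p_true[k] for k in range(m))
        rebias.append((theta_p - p_true[j]) / p_true[j])
    return theta_hat, rebias


# Toy data: four equally weighted units, one of which is misclassified.
theta_hat, rebias = estimate_theta_and_rebias(
    weights=[1, 1, 1, 1],
    frame=["a", "a", "a", "b"],
    true=["a", "a", "b", "b"],
    categories=["a", "b"],
)
```

In the toy data, one unit whose true category is "b" sits in frame category "a", so the frame-based proportion of "a" is overstated and that of "b" is understated by the same relative amount.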

Page 11: Assessment of Misclassification Error in Stratification Due to Incomplete Frame Information

Effective Sample Sizes and Variance Inflation Factors

Measures the inflation of variance due to weight variation

$$n_{\mathrm{eff},d} = \frac{n_d}{\mathrm{deff}_{w,d}}, \quad \text{where } \mathrm{deff}_{w,d} = 1 + CV_{w,d}^2 = \frac{\sum_{i \in d} w_i^2}{n_d \, \bar{w}_d^2}$$

$$\mathrm{VIF}_{w,d} = \frac{\mathrm{deff}_{w,d}(A)}{\mathrm{deff}_{w,d}(A^*)}$$

$\mathrm{deff}_{w,d}(A)$: for domain $d$ constructed based on true value

$\mathrm{deff}_{w,d}(A^*)$: for domain $d$ constructed based on misclassified value
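A minimal sketch of these quantities follows. It assumes the weight-based design effect takes the usual form 1 + CV² of the weights within the domain (equivalently n Σw² / (Σw)²); the function names and the example weights are hypothetical.

```python
def weight_deff(weights):
    """Design effect due to unequal weighting: n * sum(w^2) / (sum(w))^2."""
    n = len(weights)
    return n * sum(w * w for w in weights) / sum(weights) ** 2


def effective_sample_size(weights):
    """Effective sample size n_eff = n / deff_w for one domain."""
    return len(weights) / weight_deff(weights)


def variance_inflation_factor(weights_true_domain, weights_frame_domain):
    """VIF = deff_w(A) / deff_w(A*) for one domain.

    weights_true_domain  -- weights of units whose TRUE value puts them in the domain
    weights_frame_domain -- weights of units whose FRAME value puts them in the domain
    """
    return weight_deff(weights_true_domain) / weight_deff(weights_frame_domain)


# Hypothetical domain: equal weights under the frame classification, but the
# true-value domain picks up one unit from a stratum sampled at a lower rate
# (hence a larger weight).
frame_domain = [10, 10, 10, 10]
true_domain = [10, 10, 10, 40]

deff_frame = weight_deff(frame_domain)
deff_true = weight_deff(true_domain)
n_eff_true = effective_sample_size(true_domain)
vif = variance_inflation_factor(true_domain, frame_domain)
```

With equal weights the design effect is 1, so the frame-based domain loses nothing; the single large weight in the true-value domain pushes the design effect above 1.5 and cuts the effective sample size from 4 to under 3.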

Page 12: Assessment of Misclassification Error in Stratification Due to Incomplete Frame Information

Example: National Survey of Recent College Graduates (NSRCG)

Sponsored by the National Science Foundation

Collects education, employment, and demographic information from recent graduates with a bachelor's or master's degree in science, engineering, or health fields

For details,

– http://www.nsf.gov/statistics/srvyrecentgrads

Page 13: Assessment of Misclassification Error in Stratification Due to Incomplete Frame Information

NSRCG (Continued)

Two-stage sample design: school sample at the first stage and graduate sample at the second stage

Crucial to collect key sampling variables (degree date, degree level, field of major, race/ethnicity, and gender) from schools for eligibility determination and stratification (frame variables)

Sample was designed to have moderate weight variation within domains while meeting certain sample size thresholds

Quality of sampling variables compromised by schools' reluctance to release student information, non-standard formats used by schools, and inaccurate/incomplete administrative data

Jang and Lin (2007 JSM)

Page 14: Assessment of Misclassification Error in Stratification Due to Incomplete Frame Information

NSRCG (Continued)

The same information (degree date, degree level, field of major, race/ethnicity, and gender) was also collected from sampled graduates

Able to measure the quality of school-provided information for stratification by assessing discrepancies between school-provided information and reported values

Looking at data from two survey cycles (2003 and 2006 NSRCG)

Page 15: Assessment of Misclassification Error in Stratification Due to Incomplete Frame Information

Misclassification for Gender

Rows: stratification (frame) classification. Columns: response (reported) classification. Percentages are column percentages. Prop_F and Prop_R are the proportions (%) based on stratification and response values, respectively.

NSRCG 2003 (ReBias for P_Male = -0.01%)

Stratification   Response: Male     Response: Female     Total
Male             472,866 (98.5%)    7,042 (1.0%)         479,908
Female           7,081 (1.5%)       667,266 (99.0%)      674,347
Total            479,946            674,309              1,154,255

          Prop_F   Prop_R
Male      41.58    41.58
Female    58.42    58.42
Total     100      100

NSRCG 2006 (ReBias for P_Male = 0.50%)

Stratification   Response: Male     Response: Female     Total
Male             824,984 (99.4%)    8,750 (0.8%)         833,734
Female           4,611 (0.6%)       1,095,674 (99.2%)    1,100,284
Total            829,594            1,104,424            1,934,018

          Prop_F   Prop_R
Male      43.11    42.89
Female    56.89    57.11
Total     100      100

Page 16: Assessment of Misclassification Error in Stratification Due to Incomplete Frame Information

Misclassification for Race/Ethnicity

Rows: stratification (frame) classification. Columns: response (reported) classification. Percentages are column percentages. Prop_F and Prop_R are the proportions (%) based on stratification and response values, respectively.

NSRCG 2003

Stratification   Response: White     Response: Asian    Response: Minority   Total
White            678,516 (82.4%)     4,891 (3.4%)       12,586 (6.7%)        695,992
Asian            136,099 (16.5%)     134,386 (94.7%)    26,834 (14.2%)       297,320
Minority         8,546 (1.0%)        2,659 (1.9%)       149,739 (79.2%)      160,943
Total            823,161             141,936            189,158              1,154,255

            Prop_F   Prop_R
White       60.30    71.32
Asian       25.76    12.30
Minority    13.94    16.39
Total       100      100

NSRCG 2006

Stratification   Response: White      Response: Asian    Response: Minority   Total
White            1,196,301 (90.6%)    9,636 (3.5%)       28,473 (8.4%)        1,234,409
Asian            113,823 (8.6%)       262,197 (95.0%)    39,869 (11.8%)       415,889
Minority         9,841 (0.7%)         4,130 (1.5%)       269,749 (79.8%)      283,720
Total            1,319,964            275,963            338,091              1,934,018

            Prop_F   Prop_R
White       63.83    68.25
Asian       21.50    14.27
Minority    14.67    17.48
Total       100      100

Relative Bias

                                                    NSRCG2003   NSRCG2006
Relative Bias of P_White (White vs. Others)         -15.4%      -6.5%
Relative Bias of P_Asian (Asian vs. Others)         109.5%      50.7%
Relative Bias of P_Minority (Minority vs. Others)   -14.9%      -16.1%
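The 2003 relative biases can be reproduced from the 2003 cross-tabulation on this slide. The sketch below is illustrative; the variable names are hypothetical. It relies on the fact that when the misclassification matrix is estimated from the same sample, (Θ̂ − I) p_A equals p_A* − p_A, so each relative bias reduces to (row total − column total) / column total.

```python
# NSRCG 2003 race/ethnicity counts from the slide.
# Outer key: stratification (frame) category; inner key: response category.
counts_2003 = {
    "White":    {"White": 678516, "Asian": 4891,   "Minority": 12586},
    "Asian":    {"White": 136099, "Asian": 134386, "Minority": 26834},
    "Minority": {"White": 8546,   "Asian": 2659,   "Minority": 149739},
}
categories = list(counts_2003)

# Row totals give the frame-based counts, column totals the response-based counts.
frame_total = {j: sum(counts_2003[j].values()) for j in categories}
response_total = {k: sum(counts_2003[j][k] for j in categories) for k in categories}

# Relative bias of the frame-based proportion for each category.
rebias = {
    c: (frame_total[c] - response_total[c]) / response_total[c]
    for c in categories
}

for c in categories:
    print(f"{c}: {100 * rebias[c]:.1f}%")
```

The totals computed here can differ by one unit from the totals printed on the slide, presumably because the slide's cell values are rounded; the relative biases agree with the slide to one decimal place.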

Page 17: Assessment of Misclassification Error in Stratification Due to Incomplete Frame Information

Effective Sample Sizes and Variance Inflation Factors

What if reported values are used for discrepant cases?

Results in more weight variation within domains based on reported values, due to unequal selection probabilities across classes

Check domain-specific sample sizes and variance inflation factors

Page 18: Assessment of Misclassification Error in Stratification Due to Incomplete Frame Information

Variance Inflation Factors

[Figure: panels for NSRCG 2003 and NSRCG 2006; legend: White, Asian, Minority]

Domain: race/ethnicity by degree level by major field by gender

Page 19: Assessment of Misclassification Error in Stratification Due to Incomplete Frame Information

Ratio of Sample Size, n_R / n_F

[Figure: panels for NSRCG 2003 and NSRCG 2006; legend: White, Asian, Minority]

Domain: race/ethnicity by degree level by major field by gender

Page 20: Assessment of Misclassification Error in Stratification Due to Incomplete Frame Information

Ratio of Effective Sample Size, n_R / n_F

[Figure: panels for NSRCG 2003 and NSRCG 2006; legend: White, Asian, Minority]

Domain: race/ethnicity by degree level by major field by gender

Page 21: Assessment of Misclassification Error in Stratification Due to Incomplete Frame Information

Variance Inflation Factors

[Figure: panels for NSRCG 2003 and NSRCG 2006; legend: White, Asian, Minority]

Domain: race/ethnicity by degree level by major field

Page 22: Assessment of Misclassification Error in Stratification Due to Incomplete Frame Information

Ratio of Sample Size, n_R / n_F

[Figure: panels for NSRCG 2003 and NSRCG 2006; legend: White, Asian, Minority]

Domain: race/ethnicity by degree level by major field

Page 23: Assessment of Misclassification Error in Stratification Due to Incomplete Frame Information

Ratio of Effective Sample Size, n_R / n_F

[Figure: panels for NSRCG 2003 and NSRCG 2006; legend: White, Asian, Minority]

Domain: race/ethnicity by degree level by major field

Page 24: Assessment of Misclassification Error in Stratification Due to Incomplete Frame Information

Variance Inflation Factors

[Figure: panels for NSRCG 2003 and NSRCG 2006; legend: White, Asian, Minority]

Domain: race/ethnicity by gender

Page 25: Assessment of Misclassification Error in Stratification Due to Incomplete Frame Information

Ratio of Sample Size, n_R / n_F

[Figure: panels for NSRCG 2003 and NSRCG 2006; legend: White, Asian, Minority]

Domain: race/ethnicity by gender

Page 26: Assessment of Misclassification Error in Stratification Due to Incomplete Frame Information

Ratio of Effective Sample Size, n_R / n_F

[Figure: panels for NSRCG 2003 and NSRCG 2006; legend: White, Asian, Minority]

Domain: race/ethnicity by gender

Page 27: Assessment of Misclassification Error in Stratification Due to Incomplete Frame Information

Summary

Misclassification in stratification may reduce the effective sample sizes for domains that were sampled with high sampling rates

Crucial to have good classification in stratification, especially with substantially unequal probability selections implemented

Page 28: Assessment of Misclassification Error in Stratification Due to Incomplete Frame Information

Next Steps

Population counts for key domains available but based on misclassification

Estimation of population counts:

– Weighted sums of correct classification from the sample

– Use of misclassification parameter estimates,

$$\hat{T}_A = \hat{\Theta}^{-1} T_{A^*}$$

where $T_{A^*}$ is the vector with population counts of domains defined by $A^*$

Raking adjustments of the weights using $\hat{T}_A$

Comparison of key estimates
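The second estimation option amounts to solving the linear system Θ̂ T_A = T_A* for T_A. The sketch below illustrates this with the 2006 gender counts from the earlier slide; it is not the authors' implementation, and the helper name `solve_linear` is hypothetical. In practice T_A* would come from the frame. Here it is taken from the row totals of the same cross-tabulation, so the result recovers the response-based column totals by construction.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [a | b] as floats.
    aug = [[float(v) for v in row] + [float(rhs)] for row, rhs in zip(a, b)]
    for col in range(n):
        # Pick the row with the largest pivot to limit rounding error.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if aug[pivot][col] == 0.0:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    # Back substitution.
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - s) / aug[r][r]
    return x


# NSRCG 2006 gender counts: rows = stratification (Male, Female),
# columns = response (Male, Female).
cross = [
    [824984, 8750],
    [4611, 1095674],
]
m = len(cross)

# Estimated misclassification matrix: each column divided by its total.
col_tot = [sum(cross[j][k] for j in range(m)) for k in range(m)]
theta_hat = [[cross[j][k] / col_tot[k] for k in range(m)] for j in range(m)]

# Counts of domains defined by the frame classification A* (row totals here).
t_frame = [sum(row) for row in cross]

# Estimated counts under the true classification: T_hat_A = Theta_hat^{-1} T_A*.
t_hat = solve_linear(theta_hat, t_frame)
```

Solving the system directly avoids forming the inverse explicitly, which is the usual choice for numerical stability; for a matrix this small either approach gives the same answer.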