Variance Estimation When Donor Imputation is Used to Fill in Missing Values

Jean-François Beaumont and Cynthia Bocci
Statistics Canada

Third International Conference on Establishment Surveys

Montréal, June 18-21, 2007

2

Overview

Context
Donor imputation
Variance estimation
Simulation study
Conclusion

3

Context

Population parameter to be estimated:
  Domain total: $t_{dy} = \sum_{k \in U} d_k y_k$

Estimator in the case of full response:
  Calibration estimator
  Horvitz-Thompson estimator
  $\hat{t}_{dy} = \sum_{k \in s} w_k d_k y_k$
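As a small illustration of the full-response estimator, the sketch below evaluates the weighted sum over a made-up sample. It assumes that d_k is a 0/1 domain indicator and that the weights are simple expansion weights; the function name and the data are hypothetical and are not taken from the presentation.

```python
def domain_total_estimate(y, w, d):
    """Weighted domain total: sum over the sample of w_k * d_k * y_k."""
    return sum(wk * dk * yk for yk, wk, dk in zip(y, w, d))


# Hypothetical sample of four units; the third unit lies outside the domain.
y = [10.0, 20.0, 30.0, 40.0]
w = [5.0, 5.0, 5.0, 5.0]   # assumed expansion weights (e.g. N/n)
d = [1, 1, 0, 1]           # assumed 0/1 domain indicator

t_hat = domain_total_estimate(y, w, d)
print(t_hat)
```

Units outside the domain contribute nothing to the sum, whatever their weight.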

4

Donor Imputation

Imputed estimator: $\hat{t}^{I}_{dy} = \sum_{k \in s} w_k d_k \tilde{y}_k$, with $\tilde{y}_k = y_k$ if $k \in s_r$ and $\tilde{y}_k = y_k^{*}$ otherwise

With donor imputation, the imputed value is $y_k^{*} = y_{l(k)}$, for a donor $l(k) \in s_r$

A variety of methods can be considered in order to find a donor $l(k)$ for the recipient $k$

5

Donor Imputation

Two simple examples:
  Random hot-deck imputation within classes
  Nearest-neighbour imputation

Practical considerations that add some complexity to the imputation process:
  Post-imputation edit rules
  Hierarchical imputation classes
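The two simple examples can be sketched as follows. This is only an illustration under stated assumptions: a single numeric matching variable x with absolute distance for the nearest neighbour, one fixed level of imputation classes for the hot deck, and no post-imputation edit rules. The function names and data are hypothetical.

```python
import random


def random_hot_deck(responded, classes, rng):
    """Donor index l(k) for each unit; respondents act as their own donor."""
    donors_by_class = {}
    for k, (r, c) in enumerate(zip(responded, classes)):
        if r:
            donors_by_class.setdefault(c, []).append(k)
    # A class with no respondent raises KeyError here; handling that case
    # is left out of this sketch.
    return [k if r else rng.choice(donors_by_class[c])
            for k, (r, c) in enumerate(zip(responded, classes))]


def nearest_neighbour(x, responded):
    """Donor index l(k): the respondent whose x value is closest to x_k."""
    respondents = [k for k, r in enumerate(responded) if r]
    return [k if responded[k]
            else min(respondents, key=lambda l: abs(x[l] - x[k]))
            for k in range(len(x))]


# Hypothetical sample of five units; None marks a missing y value.
x = [1.0, 2.0, 2.4, 5.0, 5.5]
y = [10.0, None, 14.0, 30.0, None]
responded = [True, False, True, True, False]
classes = ["a", "a", "a", "b", "b"]

nn_donor = nearest_neighbour(x, responded)
hd_donor = random_hot_deck(responded, classes, random.Random(1))
y_nn = [y[l] for l in nn_donor]   # imputed data set under NN imputation
print(nn_donor, y_nn, hd_donor)
```

Both functions return the donor index rather than the imputed value, because the donor identities are what the variance formulas later in the deck rely on.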

6

Imputation Model

Most imputation methods can be justified by an imputation model:
  $E_m(y_k \mid s, s_r) = \mu_k$, $V_m(y_k \mid s, s_r) = \sigma_k^2$, $\mathrm{Cov}_m(y_k, y_l \mid s, s_r) = 0$ for $k \neq l$

The donor imputed estimator is assumed to be approximately unbiased under the model:
  $E_m(\hat{t}^{I}_{dy} - \hat{t}_{dy} \mid s, s_r) = \sum_{k \in s_m} w_k d_k (\mu_{l(k)} - \mu_k) \approx 0$

7

Current Variance Estimation Methods

Assuming negligible sampling fractions:
  Chen and Shao (2000, JOS) for NN imputation
  Resampling methods

Our method is closely related to:
  Rancourt, Särndal and Lee (1994, proc. SRMS): assumes a ratio model holds
  Brick, Kalton and Kim (2004, SM): condition on the selected donors

8

Imputation Model Approach

Variance decomposition of Särndal (1992, SM):
  $E_{mpq}(\hat{t}^{I}_{dy} - t_{dy})^2 = E_m V_p(\hat{t}_{dy}) + E_{mpq}(\hat{t}^{I}_{dy} - \hat{t}_{dy})^2 + V_{MIX} = V_{SAM} + V_{NR} + V_{MIX}$

For any donor imputation method, we have:
  $\hat{t}^{I}_{dy} = \sum_{k \in s_r} W_{dk} y_k$, where $W_{dk} = w_k d_k + \sum_{i \in s_m} w_i d_i I(l(i) = k)$
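The respondent-level weights W_dk can be accumulated directly from the donor assignments: each respondent keeps its own w_k d_k and receives w_i d_i from every recipient i it donates to. The sketch below is an illustration with hypothetical data and function names; it checks numerically that the weighted sum over respondents reproduces the imputed estimator computed unit by unit.

```python
def donor_weights(w, d, responded, donor):
    """W_dk = w_k d_k + sum of w_i d_i over recipients i with l(i) = k."""
    W = {k: w[k] * d[k] for k, r in enumerate(responded) if r}
    for i, r in enumerate(responded):
        if not r:
            W[donor[i]] += w[i] * d[i]
    return W


# Hypothetical sample: units 1 and 4 are nonrespondents (y missing).
w = [2.0, 2.0, 2.0, 2.0, 2.0]
d = [1, 1, 1, 0, 1]                     # assumed 0/1 domain indicator
y = [10.0, None, 14.0, 30.0, None]
responded = [True, False, True, True, False]
donor = [0, 2, 2, 3, 3]                 # l(k); respondents point to themselves

W = donor_weights(w, d, responded, donor)

# Imputed estimator computed two ways.
t_imputed = sum(w[k] * d[k] * y[donor[k]] for k in range(len(w)))
t_via_W = sum(W[k] * y[k] for k in W)
print(W, t_imputed, t_via_W)
```

In this example unit 3 is outside the domain yet donates to a recipient inside it, so it still carries a nonzero W_dk.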

9

Estimation of the nonresponse variance

The estimation of the nonresponse variance is achieved by estimating
  $E_m\{(\hat{t}^{I}_{dy} - \hat{t}_{dy})^2 \mid s, s_r\} \approx V_m(\hat{t}^{I}_{dy} - \hat{t}_{dy} \mid s, s_r)$

Noting that the nonresponse error is:
  $\hat{t}^{I}_{dy} - \hat{t}_{dy} = \sum_{k \in s_r} (W_{dk} - w_k d_k) y_k - \sum_{k \in s_m} w_k d_k y_k$

Then, the nonresponse variance estimator is:
  $\hat{V}_m(\hat{t}^{I}_{dy} - \hat{t}_{dy} \mid s, s_r) = \sum_{k \in s_r} (W_{dk} - w_k d_k)^2 \hat{\sigma}_k^2 + \sum_{k \in s_m} (w_k d_k)^2 \hat{\sigma}_k^2$
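A minimal sketch of the nonresponse variance estimator, assuming the model variance estimates are already available for every sampled unit. The quantity W_dk - w_k d_k is simply the weight a respondent receives from its recipients, so the code accumulates that directly. Names and data are hypothetical.

```python
def nonresponse_variance(w, d, responded, donor, sigma2_hat):
    """Sum over respondents of (W_dk - w_k d_k)^2 * sigma2_k
    plus sum over nonrespondents of (w_k d_k)^2 * sigma2_k."""
    n = len(w)
    received = [0.0] * n               # W_dk - w_k d_k for each respondent
    for i in range(n):
        if not responded[i]:
            received[donor[i]] += w[i] * d[i]
    total = 0.0
    for k in range(n):
        if responded[k]:
            total += received[k] ** 2 * sigma2_hat[k]
        else:
            total += (w[k] * d[k]) ** 2 * sigma2_hat[k]
    return total


# Hypothetical inputs; sigma2_hat would come from the fitted imputation model.
w = [2.0, 2.0, 2.0, 2.0, 2.0]
d = [1, 1, 1, 0, 1]
responded = [True, False, True, True, False]
donor = [0, 2, 2, 3, 3]
sigma2_hat = [4.0, 4.0, 4.0, 4.0, 4.0]

v_nr = nonresponse_variance(w, d, responded, donor, sigma2_hat)
print(v_nr)
```

A respondent that is never used as a donor contributes nothing, which is why donor reuse inflates this component.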

10

Estimation of the mixed component

Similarly, the estimation of the mixed component is achieved by estimating
  $2\,\mathrm{Cov}_m(\hat{t}^{I}_{dy} - \hat{t}_{dy},\ \hat{t}_{dy} - t_{dy} \mid s, s_r)$

The mixed component estimator is:
  $\hat{V}_{MIX} = 2 \sum_{k \in s_r} (w_k - 1)(W_{dk} - w_k d_k)\, d_k \hat{\sigma}_k^2 - 2 \sum_{k \in s_m} (w_k - 1)\, w_k d_k^2 \hat{\sigma}_k^2$

This component can be either positive or negative and may not always be negligible
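The mixed component can be computed from the same ingredients as the nonresponse variance. The sketch below follows the covariance between the nonresponse error and the sampling error term by term; the data and function name are hypothetical, and the example happens to give a negative value.

```python
def mixed_component(w, d, responded, donor, sigma2_hat):
    """2 * sum_{s_r} (w_k - 1)(W_dk - w_k d_k) d_k sigma2_k
    - 2 * sum_{s_m} (w_k - 1) w_k d_k^2 sigma2_k."""
    n = len(w)
    received = [0.0] * n               # W_dk - w_k d_k for each respondent
    for i in range(n):
        if not responded[i]:
            received[donor[i]] += w[i] * d[i]
    resp_sum = sum((w[k] - 1) * received[k] * d[k] * sigma2_hat[k]
                   for k in range(n) if responded[k])
    miss_sum = sum((w[k] - 1) * w[k] * d[k] ** 2 * sigma2_hat[k]
                   for k in range(n) if not responded[k])
    return 2 * resp_sum - 2 * miss_sum


# Hypothetical inputs, as in a small sample with two nonrespondents.
w = [2.0, 2.0, 2.0, 2.0, 2.0]
d = [1, 1, 1, 0, 1]
responded = [True, False, True, True, False]
donor = [0, 2, 2, 3, 3]
sigma2_hat = [4.0, 4.0, 4.0, 4.0, 4.0]

v_mix = mixed_component(w, d, responded, donor, sigma2_hat)
print(v_mix)
```

When every weight equals 1 (a census), both sums vanish and the mixed component is zero.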

11

Estimation of the sampling Estimation of the sampling variancevariance

Let be the full response Let be the full response variance est.variance est.

The strategy consists of The strategy consists of EstimatingEstimating Replace by their estimates the unknownReplace by their estimates the unknown

This leads to the sampling variance This leads to the sampling variance estimator: estimator:

E ( ) | , ,m r rv y s s Y

)ˆ(V̂)( θyv p

2and kk

msk

kkkk dwyv 22ˆ ˆ)1()(

12

Estimation of the sampling Estimation of the sampling variancevariance

This strategy is essentially equivalent This strategy is essentially equivalent toto Randomly imputing the missing values Randomly imputing the missing values

using the imputation modelusing the imputation model Computing the full response sampling Computing the full response sampling

variance estimator by treating these variance estimator by treating these imputed values as true valuesimputed values as true values

Repeating this process a large number of Repeating this process a large number of times and taking the average of the times and taking the average of the sampling variance estimatessampling variance estimates

Similar to multiple imputation sampling Similar to multiple imputation sampling variance estimatorvariance estimator
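The equivalence described above can be checked numerically in a simple case. The sketch below assumes simple random sampling without replacement, the whole population as the domain (d_k = 1), the usual variance estimator v(y) = N^2 (1 - n/N) s^2 / n, and normal draws from the imputation model; all of these are assumptions made for illustration. For this v, the model expectation works out to v evaluated with the missing values set to their estimated means, plus (1 - n/N)(N/n)^2 times the sum of the estimated model variances of the nonrespondents.

```python
import random


def sample_variance(values):
    """Unbiased sample variance s^2 (denominator n - 1)."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)


def v_srs(y, N):
    """Full-response variance estimator of the estimated total under SRS."""
    n = len(y)
    return N ** 2 * (1 - n / N) * sample_variance(y) / n


# Hypothetical inputs: 6 respondents, 4 nonrespondents, population of 100.
N = 100
y_obs = [12.0, 15.0, 9.0, 20.0, 14.0, 11.0]
mu_hat = [13.0, 16.0, 10.0, 18.0]       # estimated model means (assumed)
sigma2_hat = [4.0, 4.0, 9.0, 9.0]       # estimated model variances (assumed)
n = len(y_obs) + len(mu_hat)
weight, pi = N / n, n / N

# Closed form: v at the estimated means plus the model-variance correction.
closed_form = v_srs(y_obs + mu_hat, N) + sum(
    (1 - pi) * weight ** 2 * s2 for s2 in sigma2_hat)

# Monte Carlo: impute at random from the model, compute v, and average.
rng = random.Random(2007)
reps = 5000
total = 0.0
for _ in range(reps):
    y_star = [rng.gauss(m, s2 ** 0.5) for m, s2 in zip(mu_hat, sigma2_hat)]
    total += v_srs(y_obs + y_star, N)
monte_carlo = total / reps

print(closed_form, monte_carlo)
```

Only the first two moments of the imputation model matter for the average, so the normal distribution is a convenience rather than a requirement.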

13

Simulation study

Generated a population of size 1000
Two y-variables:
  LIN: linear relationship between y and x
  NLIN: nonlinear relationship between y and x
Two different sample sizes:
  Small sampling fraction: n=50
  Large sampling fraction: n=500
Response probability depends on x, with an average of 0.5

14

Simulation study

Imputation: nearest-neighbour imputation using x as the matching variable
Estimation of $\mu_k$ and $\sigma_k^2$:
  LIN: linear model in perfect agreement with the LIN y-variable
  NPAR: nonparametric estimation using the procedure TPSPLINE of SAS
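A minimal sketch of what the LIN option could look like, assuming a simple linear regression of y on x fitted by ordinary least squares on the respondents, with a constant residual variance. The slides do not spell out the exact model, so both choices are assumptions, and the function name and data are hypothetical.

```python
def fit_linear_model(x_resp, y_resp):
    """OLS fit of y = a + b x on respondents; returns (a, b, sigma2)."""
    n = len(x_resp)
    x_bar = sum(x_resp) / n
    y_bar = sum(y_resp) / n
    sxx = sum((x - x_bar) ** 2 for x in x_resp)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(x_resp, y_resp))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    sse = sum((y - intercept - slope * x) ** 2
              for x, y in zip(x_resp, y_resp))
    sigma2 = sse / (n - 2)             # assumed constant model variance
    return intercept, slope, sigma2


# Hypothetical respondent data and x values of two nonrespondents.
x_resp = [1.0, 2.0, 3.0, 4.0]
y_resp = [2.1, 3.9, 6.2, 7.8]
x_missing = [2.5, 5.0]

intercept, slope, sigma2_hat = fit_linear_model(x_resp, y_resp)
mu_hat = [intercept + slope * x for x in x_missing]
print(intercept, slope, sigma2_hat, mu_hat)
```

The fitted means and variance are the inputs needed by the nonresponse, mixed and sampling variance estimators.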

15

Simulation study

Two objectives:
  Compare the two ways of estimating $\mu_k$ and $\sigma_k^2$: LIN and NPAR
  Compare three nonparametric methods:
    NPAR
    NPAR_Naïve: NPAR with the sampling variance being estimated by the naïve sampling variance, that is, the full response estimator $v(\cdot)$ applied to the imputed data (Brick, Kalton and Kim, 2004)
    CS: method of Chen and Shao (2000)

16

Results: Large sampling fraction

Method        Relative bias in %        RRMSE in %
              y-LIN     y-NLIN          y-LIN     y-NLIN
LIN            -2.4      358.4           15.7      514.1
NPAR           -0.3      -18.8           21.5       54.6
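The two performance measures in the table can be computed from the simulated variance estimates as sketched below. The slides do not give the formulas, so the usual definitions are assumed here: relative bias is the mean estimate minus the true variance, divided by the true variance, and RRMSE is the root mean squared error divided by the true variance, both in percent. The numbers are made up.

```python
def relative_bias_pct(estimates, true_value):
    """100 * (mean of estimates - true value) / true value."""
    mean_est = sum(estimates) / len(estimates)
    return 100 * (mean_est - true_value) / true_value


def rrmse_pct(estimates, true_value):
    """100 * sqrt(mean squared error) / true value."""
    mse = sum((e - true_value) ** 2 for e in estimates) / len(estimates)
    return 100 * mse ** 0.5 / true_value


# Hypothetical variance estimates from four simulation replicates.
variance_estimates = [90.0, 110.0, 100.0, 120.0]
true_variance = 100.0

rb = relative_bias_pct(variance_estimates, true_variance)
rrmse = rrmse_pct(variance_estimates, true_variance)
print(rb, rrmse)
```

A method can have a small relative bias and still a large RRMSE when its variance estimates are unstable across replicates.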

17

Results: Small sampling fraction

Method        Relative bias in %        RRMSE in %
              y-LIN     y-NLIN          y-LIN     y-NLIN
NPAR           -4.9      -13.3           41.8      245.4
NPAR_Naïve     -5.9      -10.4           42.1      265.8
CS             -9.1       -9.4           52.8      257.8

18

Results: Large sampling fraction

Method        Relative bias in %        RRMSE in %
              y-LIN     y-NLIN          y-LIN     y-NLIN
NPAR           -0.3      -18.8           21.5       54.6
NPAR_Naïve     -0.3      -12.0           21.8       69.1
CS             33.9       59.6           53.7      118.7

19

Conclusion

Nonparametric estimation of $\mu_k$ and $\sigma_k^2$ seems beneficial (robust) with nearest-neighbour imputation
Our proposed method is valid even for large sampling fractions
It seems to be slightly better to use our sampling variance estimator instead of the naïve sampling variance estimator

20

Conclusion

Work done in the context of developing a variance estimation system (SEVANI)
Methodology implemented in the next version 2.0 of SEVANI
Estimation of $\mu_k$ and $\sigma_k^2$:
  Linear model
  Nonparametric estimation

21

Thanks - Merci

For more information, please contact:
  Jean-François Beaumont: [email protected]
  Cynthia Bocci: Cynthia.Bocci@statcan.ca
