Variance Estimation When Donor Imputation is Used to Fill in Missing Values
Jean-François Beaumont and Cynthia Bocci, Statistics Canada
Third International Conference on Establishment Surveys
Montréal, June 18-21, 2007
2
Overview
- Context
- Donor imputation
- Variance estimation
- Simulation study
- Conclusion
3
Context
- Population parameter to be estimated: the domain total
  $$t_{dy} = \sum_{k \in U} d_k y_k$$
- Estimator in the case of full response:
  $$\hat{t}_{dy} = \sum_{k \in s} w_k d_k y_k$$
  - Calibration estimator
  - Horvitz-Thompson estimator
4
Donor Imputation
- Imputed estimator:
  $$\hat{t}^{I}_{dy} = \sum_{k \in s} w_k d_k \tilde{y}_k, \quad \text{with } \tilde{y}_k = \begin{cases} y_k, & k \in s_r \\ y_k^{*}, & \text{otherwise} \end{cases}$$
- With donor imputation, the imputed value is
  $$y_k^{*} = y_{l(k)}, \quad \text{for a donor } l(k) \in s_r$$
- A variety of methods can be considered in order to find a donor $l(k)$ for the recipient $k$
5
Donor Imputation
- Two simple examples:
  - Random hot-deck imputation within classes
  - Nearest-neighbour imputation
- Practical considerations that add some complexity to the imputation process:
  - Post-imputation edit rules
  - Hierarchical imputation classes
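The two simple donor-selection methods named on this slide can be sketched as follows. This is only an illustration under stated assumptions: the function names, the toy data, and the use of a single matching variable x with absolute distance are hypothetical, and neither post-imputation edit rules nor hierarchical classes are handled.

```python
import random


def random_hot_deck(classes, responded, rng):
    """Map each nonrespondent k to a donor l(k) drawn at random among
    the respondents of the same imputation class."""
    donors_by_class = {}
    for k, (c, r) in enumerate(zip(classes, responded)):
        if r:
            donors_by_class.setdefault(c, []).append(k)
    # A class with no respondent raises KeyError here; collapsing classes
    # (hierarchical imputation classes) is deliberately not handled.
    return {k: rng.choice(donors_by_class[c])
            for k, (c, r) in enumerate(zip(classes, responded)) if not r}


def nearest_neighbour(x, responded):
    """Map each nonrespondent k to the respondent whose x is closest."""
    donors = [k for k, r in enumerate(responded) if r]
    return {k: min(donors, key=lambda j: abs(x[j] - x[k]))
            for k, r in enumerate(responded) if not r}


# Hypothetical toy sample (all values are made up for illustration).
x = [1.0, 2.0, 3.5, 4.0, 10.0, 11.0]
classes = ["a", "a", "a", "b", "b", "b"]
responded = [True, False, True, False, True, True]

hot_deck_donors = random_hot_deck(classes, responded, random.Random(1))
nn_donors = nearest_neighbour(x, responded)
print(hot_deck_donors, nn_donors)
```

Both functions return the mapping from recipient k to donor l(k), which is all that the variance formulas later in the talk need from the imputation step.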
6
Imputation Model
- Most imputation methods can be justified by an imputation model:
  $$E_m(y_k \mid s, s_r) = \mu_k, \quad V_m(y_k \mid s, s_r) = \sigma_k^2, \quad \mathrm{Cov}_m(y_k, y_l \mid s, s_r) = 0 \ (k \neq l)$$
- The donor imputed estimator is assumed to be approximately unbiased under the model:
  $$E_m(\hat{t}^{I}_{dy} - \hat{t}_{dy} \mid s, s_r) = \sum_{k \in s_m} w_k d_k (\mu_{l(k)} - \mu_k) \approx 0$$
7
Current Variance Estimation Methods
- Assuming negligible sampling fractions:
  - Chen and Shao (2000, JOS) for NN imputation
  - Resampling methods
- Our method is closely related to:
  - Rancourt, Särndal and Lee (1994, Proc. SRMS): assumes a ratio model holds
  - Brick, Kalton and Kim (2004, SM): condition on the selected donors
8
Imputation Model Approach
- Variance decomposition of Särndal (1992, SM):
  $$E_{mpq}(\hat{t}^{I}_{dy} - t_{dy})^2 = E_m V_p(\hat{t}_{dy}) + E_{mpq}(\hat{t}^{I}_{dy} - \hat{t}_{dy})^2 + V_{MIX} = V_{SAM} + V_{NR} + V_{MIX}$$
- For any donor imputation method, we have:
  $$\hat{t}^{I}_{dy} = \sum_{k \in s_r} W_{dk} y_k, \quad \text{where } W_{dk} = w_k d_k + \sum_{i \in s_m} w_i d_i \, I\{l(i) = k\}$$
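The rewriting of the imputed estimator as a weighted sum over respondents only can be checked numerically. The sketch below uses made-up weights, domain indicators, and donor assignments; the names `donor_weights` and `donor_of` are hypothetical.

```python
def donor_weights(w, d, responded, donor_of):
    """W_dk = w_k d_k + sum of w_i d_i over recipients i whose donor is k,
    computed for every respondent k."""
    W = {k: w[k] * d[k] for k, r in enumerate(responded) if r}
    for i, k in donor_of.items():
        W[k] += w[i] * d[i]
    return W


# Hypothetical toy sample: units 1 and 3 are nonrespondents.
w = [2, 2, 4, 4, 5]                # survey weights
d = [1, 1, 0, 1, 1]                # 0/1 domain indicators
y = [10, None, 7, None, 3]         # None marks a missing value
responded = [True, False, True, False, True]
donor_of = {1: 0, 3: 4}            # recipient -> donor, i.e. l(1)=0, l(3)=4

# Imputed estimator computed directly from the completed data.
y_tilde = [y[k] if responded[k] else y[donor_of[k]] for k in range(len(y))]
direct = sum(wk * dk * yk for wk, dk, yk in zip(w, d, y_tilde))

# Same estimator as a weighted sum over respondents only.
W = donor_weights(w, d, responded, donor_of)
via_W = sum(W[k] * y[k] for k in W)

print(direct, via_W)  # 67 67
```

Each donor simply absorbs the weight of the recipients it serves, which is why the nonresponse error on the next slide can be written in terms of W_dk - w_k d_k.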
9
Estimation of the nonresponse variance
- The estimation of the nonresponse variance is achieved by estimating
  $$E_m\{(\hat{t}^{I}_{dy} - \hat{t}_{dy})^2 \mid s, s_r\} \approx V_m(\hat{t}^{I}_{dy} - \hat{t}_{dy} \mid s, s_r)$$
- Noting that the nonresponse error is:
  $$\hat{t}^{I}_{dy} - \hat{t}_{dy} = \sum_{k \in s_r} (W_{dk} - w_k d_k) y_k - \sum_{k \in s_m} w_k d_k y_k$$
- Then, the nonresponse variance estimator is:
  $$\hat{V}_m(\hat{t}^{I}_{dy} - \hat{t}_{dy} \mid s, s_r) = \sum_{k \in s_r} (W_{dk} - w_k d_k)^2 \hat{\sigma}_k^2 + \sum_{k \in s_m} w_k^2 d_k \hat{\sigma}_k^2$$
  (using $d_k^2 = d_k$ for a 0/1 domain indicator)
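As a rough illustration of the nonresponse variance estimator, the sketch below evaluates the two sums on a made-up sample. It assumes 0/1 domain indicators and takes the variance estimates (here called `sigma2_hat`) as given; how they are obtained is a separate modelling step, and all names and numbers are hypothetical.

```python
def nonresponse_variance(w, d, responded, donor_of, sigma2_hat):
    """Sum over respondents of (W_dk - w_k d_k)^2 * sigma2_k plus
    sum over nonrespondents of w_k^2 d_k * sigma2_k (d_k is 0 or 1)."""
    # W_dk: own weight plus the weights of the recipients served by k.
    W = {k: w[k] * d[k] for k, r in enumerate(responded) if r}
    for i, k in donor_of.items():
        W[k] += w[i] * d[i]
    respondent_part = sum((W[k] - w[k] * d[k]) ** 2 * sigma2_hat[k] for k in W)
    nonrespondent_part = sum(w[i] ** 2 * d[i] * sigma2_hat[i] for i in donor_of)
    return respondent_part + nonrespondent_part


# Hypothetical toy sample: units 1 and 3 are nonrespondents.
w = [2, 2, 4, 4, 5]
d = [1, 1, 0, 1, 1]
responded = [True, False, True, False, True]
donor_of = {1: 0, 3: 4}
sigma2_hat = [1.0, 1.0, 2.0, 0.5, 0.5]   # made-up model variance estimates

v_nr = nonresponse_variance(w, d, responded, donor_of, sigma2_hat)
print(v_nr)  # 24.0
```

A respondent that is never used as a donor has W_dk = w_k d_k and contributes nothing, so the estimate grows with how heavily individual donors are reused.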
10
Estimation of the mixed component
- Similarly, the estimation of the mixed component is achieved by estimating
  $$2\,\mathrm{Cov}_m(\hat{t}^{I}_{dy} - \hat{t}_{dy},\ \hat{t}_{dy} - t_{dy} \mid s, s_r)$$
- The mixed component estimator is:
  $$\hat{V}_{MIX} = 2 \sum_{k \in s_r} (w_k - 1)(W_{dk} - w_k d_k)\, d_k \hat{\sigma}_k^2 - 2 \sum_{k \in s_m} w_k (w_k - 1)\, d_k \hat{\sigma}_k^2$$
- This component can be either positive or negative and may not always be negligible
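The mixed component estimator can be evaluated in the same way as the nonresponse variance. The sketch below is an illustration only, again assuming 0/1 domain indicators and made-up variance estimates; the function name and data are hypothetical.

```python
def mixed_component(w, d, responded, donor_of, sigma2_hat):
    """2 * sum over respondents of (w_k - 1)(W_dk - w_k d_k) d_k sigma2_k
    minus 2 * sum over nonrespondents of w_k (w_k - 1) d_k sigma2_k."""
    W = {k: w[k] * d[k] for k, r in enumerate(responded) if r}
    for i, k in donor_of.items():
        W[k] += w[i] * d[i]
    respondent_part = sum((w[k] - 1) * (W[k] - w[k] * d[k]) * d[k] * sigma2_hat[k]
                          for k in W)
    nonrespondent_part = sum(w[i] * (w[i] - 1) * d[i] * sigma2_hat[i]
                             for i in donor_of)
    return 2 * respondent_part - 2 * nonrespondent_part


# Hypothetical toy sample: units 1 and 3 are nonrespondents.
w = [2, 2, 4, 4, 5]
d = [1, 1, 0, 1, 1]
responded = [True, False, True, False, True]
donor_of = {1: 0, 3: 4}
sigma2_hat = [1.0, 1.0, 2.0, 0.5, 0.5]   # made-up model variance estimates

v_mix = mixed_component(w, d, responded, donor_of, sigma2_hat)
print(v_mix)  # 4.0
```

Because the two sums enter with opposite signs, the result can be positive or negative depending on the weights and on which donors are used.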
11
Estimation of the sampling variance
- Let $v(y) = \hat{V}_p(\hat{\theta})$ be the full response variance estimator
- The strategy consists of:
  - Estimating $E_m\{v(y) \mid s, s_r, Y_r\}$
  - Replacing the unknown $\mu_k$ and $\sigma_k^2$ by their estimates
- This leads to the sampling variance estimator:
  $$\hat{V}_{SAM} = v(\hat{y}) + \sum_{k \in s_m} (1 - \pi_k)\, w_k^2 d_k \hat{\sigma}_k^2$$
  where $\hat{y}$ denotes the sample values with each missing $y_k$ replaced by $\hat{\mu}_k$, and $\pi_k$ is the inclusion probability of unit $k$
12
Estimation of the sampling variance
- This strategy is essentially equivalent to:
  - Randomly imputing the missing values using the imputation model
  - Computing the full response sampling variance estimator by treating these imputed values as true values
  - Repeating this process a large number of times and taking the average of the sampling variance estimates
- Similar to the multiple imputation sampling variance estimator
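The equivalence described on this slide can be illustrated with a small Monte Carlo experiment. The sketch below rests on several assumptions that are not in the talk: the full response variance estimator is taken to be the Horvitz-Thompson estimator under Poisson sampling, the imputation model is taken to be normal, and all data and names are made up.

```python
import random


def v_full(w, pi, d, y):
    """Assumed full response variance estimator (Poisson sampling, HT weights):
    sum over the sample of (1 - pi_k) w_k^2 d_k y_k^2, with d_k equal to 0 or 1."""
    return sum((1 - p) * wk ** 2 * dk * yk ** 2
               for wk, p, dk, yk in zip(w, pi, d, y))


def v_sam_closed_form(w, pi, d, y, mu_hat, sigma2_hat):
    """v evaluated at the model means plus the variance correction term."""
    y_hat = [mu_hat[k] if k in mu_hat else y[k] for k in range(len(y))]
    correction = sum((1 - pi[k]) * w[k] ** 2 * d[k] * sigma2_hat[k]
                     for k in mu_hat)
    return v_full(w, pi, d, y_hat) + correction


def v_sam_monte_carlo(w, pi, d, y, mu_hat, sigma2_hat, reps, rng):
    """Average of v over repeated random imputations from the assumed model."""
    total = 0.0
    for _ in range(reps):
        y_sim = [rng.gauss(mu_hat[k], sigma2_hat[k] ** 0.5) if k in mu_hat else y[k]
                 for k in range(len(y))]
        total += v_full(w, pi, d, y_sim)
    return total / reps


# Hypothetical toy sample: units 1 and 3 are nonrespondents.
pi = [0.5, 0.5, 0.25, 0.25, 0.2]
w = [1 / p for p in pi]
d = [1, 1, 0, 1, 1]
y = [10.0, None, 7.0, None, 3.0]
mu_hat = {1: 9.0, 3: 4.0}        # made-up estimated model means
sigma2_hat = {1: 1.0, 3: 0.5}    # made-up estimated model variances

closed = v_sam_closed_form(w, pi, d, y, mu_hat, sigma2_hat)
simulated = v_sam_monte_carlo(w, pi, d, y, mu_hat, sigma2_hat,
                              reps=20000, rng=random.Random(0))
print(closed, simulated)
```

With enough repetitions the simulated average settles near the closed-form value, which is why the closed form can replace the repeated imputations in practice.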
13
Simulation study
- Generated a population of size 1000
- Two y-variables:
  - LIN: linear relationship between y and x
  - NLIN: nonlinear relationship between y and x
- Two different sample sizes:
  - Small sampling fraction: n=50
  - Large sampling fraction: n=500
- Response probability depends on x with an average of 0.5
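A setup of this kind could be generated along the following lines. Only the population size, the two sample sizes, and the average response probability of 0.5 come from the slide; the distribution of x, the two y models, the response function, and the use of simple random sampling are all assumptions made for illustration.

```python
import random

rng = random.Random(2007)
N = 1000

# Assumed auxiliary variable and y-models (not specified in the talk).
x = [rng.uniform(0.0, 1.0) for _ in range(N)]
y_lin = [10.0 + 5.0 * xk + rng.gauss(0.0, 1.0) for xk in x]
y_nlin = [10.0 + 5.0 * xk ** 3 + rng.gauss(0.0, 1.0) for xk in x]

# Assumed response mechanism: probability increases with x, mean about 0.5.
resp_prob = [0.2 + 0.6 * xk for xk in x]


def draw_sample(n, rng):
    """Simple random sample without replacement (an assumed design),
    followed by independent response indicators."""
    sample = rng.sample(range(N), n)
    responded = {k: rng.random() < resp_prob[k] for k in sample}
    return sample, responded


small_sample, small_resp = draw_sample(50, rng)    # sampling fraction 5%
large_sample, large_resp = draw_sample(500, rng)   # sampling fraction 50%
mean_resp_prob = sum(resp_prob) / N
print(len(small_sample), len(large_sample), round(mean_resp_prob, 3))
```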
14
Simulation study
- Imputation: nearest-neighbour imputation using x as the matching variable
- Estimation of $\mu_k$ and $\sigma_k^2$:
  - LIN: linear model in perfect agreement with the LIN y-variable
  - NPAR: nonparametric estimation using the procedure TPSPLINE of SAS
15
Simulation study
- Two objectives:
  - Compare the two ways of estimating $\mu_k$ and $\sigma_k^2$: LIN and NPAR
  - Compare three nonparametric methods:
    - NPAR
    - NPAR_Naïve: NPAR with the sampling variance being estimated by the naïve sampling variance $v(\tilde{y})$, that is, $v(\cdot)$ evaluated at the imputed data (Brick, Kalton and Kim, 2004)
    - CS: method of Chen and Shao (2000)
16
Results: Large sampling fraction

Method        Relative Bias in %       RRMSE in %
              y-LIN     y-NLIN         y-LIN     y-NLIN
LIN           -2.4      358.4          15.7      514.1
NPAR          -0.3      -18.8          21.5      54.6
17
Results: Small sampling fraction

Method        Relative Bias in %       RRMSE in %
              y-LIN     y-NLIN         y-LIN     y-NLIN
NPAR          -4.9      -13.3          41.8      245.4
NPAR_Naïve    -5.9      -10.4          42.1      265.8
CS            -9.1      -9.4           52.8      257.8
18
Results: Large sampling fraction

Method        Relative Bias in %       RRMSE in %
              y-LIN     y-NLIN         y-LIN     y-NLIN
NPAR          -0.3      -18.8          21.5      54.6
NPAR_Naïve    -0.3      -12.0          21.8      69.1
CS            33.9      59.6           53.7      118.7
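The slides do not spell out how the two reported measures are computed. The sketch below uses the usual Monte Carlo definitions for a variance estimator, which is an assumption about what was done here: relative bias is the average error relative to the true variance, and RRMSE is the root mean squared error relative to the true variance, both in percent. The numbers are made up.

```python
import math


def relative_bias_pct(estimates, true_value):
    """100 * (mean of estimates - true value) / true value."""
    mean_est = sum(estimates) / len(estimates)
    return 100.0 * (mean_est - true_value) / true_value


def rrmse_pct(estimates, true_value):
    """100 * sqrt(mean squared error) / true value."""
    mse = sum((e - true_value) ** 2 for e in estimates) / len(estimates)
    return 100.0 * math.sqrt(mse) / true_value


# Hypothetical variance estimates from four simulated samples.
variance_estimates = [90.0, 110.0, 100.0, 120.0]
true_variance = 100.0

rb = relative_bias_pct(variance_estimates, true_variance)
rrmse = rrmse_pct(variance_estimates, true_variance)
print(round(rb, 1), round(rrmse, 1))  # 5.0 12.2
```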
19
Conclusion
- Nonparametric estimation of $\mu_k$ and $\sigma_k^2$ seems beneficial (robust) with nearest-neighbour imputation
- Our proposed method is valid even for large sampling fractions
- It seems to be slightly better to use our sampling variance estimator instead of the naïve sampling variance estimator
20
Conclusion
- Work done in the context of developing a variance estimation system (SEVANI)
- Methodology implemented in the next version 2.0 of SEVANI
- Estimation of $\mu_k$ and $\sigma_k^2$:
  - Linear model
  - Nonparametric estimation
21
Thanks - Merci
For more information, please contact:
Jean-François Beaumont, [email protected]
Cynthia Bocci, Cynthia.Bocci@statcan.ca