Background and terminology | Generating imputed datasets | Brief list of introductory references | References
Multiple Imputation in Stata: -mi- and -ice- commands
Henrik Støvring ([email protected])
Department of Biostatistics
November 19, 2010
H Støvring Stata, MI, and ICE
Outline
Background and terminology
Generating imputed datasets
Brief list of introductory references
Background and setting
◮ Assume we “know” that
  ◮ Data belong to the MAR (missing at random) category
  ◮ More than one variable has missing values
  ◮ The missing data pattern is not monotone
◮ What is needed in order to proceed?
  ◮ A prediction model for the missing values
  ◮ A tool to impute the missing values
  ◮ A tool to combine the estimates from the analysis of each imputed dataset into an overall estimate
Terminology
Dataset zero: The original dataset with missing values
Imputed dataset: A copy of the original dataset with all missing values replaced with imputed values (j = 1, . . . , m)
Imputed values: (Randomly generated) values substituted for unobserved values
Multiple imputation analysis: The ordinary analysis on each imputed dataset AND combination of the estimates into a single estimate
Iteration: Do a procedure (compute some numbers), update the starting values with the result (the computed numbers), and repeat the procedure
Passive variable: A variable that depends on an imputed variable
ICE | Building the prediction model using -ice- | Imputed datasets and multiple imputation analysis | Interaction terms and imputations
Obtaining and installing
◮ Obtained in Stata with the commands
    . search ice
    . net sj 9-3 st0067_4
  and then click “(click here to install)”
◮ Background information in
  ◮ the help file and references therein
  ◮ Royston (2009)
Rationale
Multiple (Imputation)
Iterated: Repeat to achieve stability . . .
Chained: In a specific order, one by one . . .
Equations: Based on a set of regression equations
◮ Consists of two “steps”
  1. Estimate relationships between each variable to be imputed and predictive variables (covariates)
  2. Impute values from the fitted model
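The two steps can be written down generically. The following is a sketch in the usual chained-equations notation, not taken from the slides or from the -ice- documentation: the symbols x_k (the k-th of p variables to be imputed), θ_k (the parameters of its regression model) and t (the cycle number) are introduced here for illustration only.

```latex
% Cycle t, variable k: condition on the most recent values of all other variables
\begin{align*}
\theta_k^{(t)} &\sim p\bigl(\theta_k \mid x_k^{\mathrm{obs}},\,
    x_1^{(t)},\dots,x_{k-1}^{(t)},\, x_{k+1}^{(t-1)},\dots,x_p^{(t-1)}\bigr)
    && \text{(step 1: estimate)}\\
x_k^{\mathrm{mis},(t)} &\sim p\bigl(x_k \mid
    x_1^{(t)},\dots,x_{k-1}^{(t)},\, x_{k+1}^{(t-1)},\dots,x_p^{(t-1)},\,
    \theta_k^{(t)}\bigr)
    && \text{(step 2: impute)}
\end{align*}
```

Running k = 1, . . . , p once is one cycle; the whole process is repeated independently for each imputed dataset j = 1, . . . , m.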
Initial observations regarding -ice-
◮ Variables take turns being predictor and predicted (“outcome”/to be imputed)
◮ Variables to be imputed can be predicted from variables without missing values
◮ Allows regression types for categorical data
  ◮ Logistic
  ◮ Multinomial/ordered logistic
◮ Allows imputation of interval-censored data
Major options
m()       Number of imputed datasets
eq()      Equations (sets of predictors) used to predict each variable
cmd()     Regression type used to model each “dependent” variable
by()      Impute separately for subsets of the dataset
cycles()  Number of cycles used to obtain the estimated regression coefficients used in subsequent prediction/imputation
Useful options
dryrun    Checks syntax and equations for consistency, but doesn’t do any actual imputations
saving()  Save the imputed datasets to a named file
seed()    Set the random-number seed, i.e. make the process reproducible
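A minimal sketch of how these three options might be combined, reusing the variable names from the -ice- examples elsewhere in these slides. The file name ess_imputed and the seed value are arbitrary choices made here for illustration, not part of the original material.

```stata
* Sketch only: variable names follow the examples in these slides

* 1. Check syntax and prediction equations without imputing anything
ice total_noncompl health incmean edulvl sex, m(10) dryrun

* 2. Reproducible run that writes the imputed datasets to a file
*    ("ess_imputed" is a hypothetical file name, the seed is arbitrary)
ice total_noncompl health incmean edulvl sex, m(10) seed(19112010) ///
    saving(ess_imputed, replace)
```

Running the dryrun first gives a cheap way to read the equations that -ice- will use before spending time on the actual imputations.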
Other options
◮ Countless: clear, match(), passive(), boot(), . . .
◮ WARNING: The package is updated often AND options sometimes change definition!
Imputing values: running -ice-
◮ Two types of approaches (at least!):
  1. “Black box”: minimal specification, maximal complexity
       . ice total_noncompl health incmean edulvl sex, m(10) clear
     where sex is a binary indicator for gender
  2. “Dedicated”: detailed specification, transparent modeling
       . ice total_noncompl health logincmean edulvl sex, ///
             m(10) clear ///
             eq(logincmean: total_noncompl edulvl sex, ///
                total_noncompl: incmean edulvl sex) ///
             passive(incmean: exp(logincmean))
     [etc....]
Advice and cautions
◮ Always include the outcome when predicting covariates
◮ Always check the reported equations used by -ice-
◮ Always check that imputed values are sensible (Barnard and Meng (1999), Meng (1994))
◮ When you estimate many parameters, m should be large
◮ Build from simple to complex: Try it out in a simple setting which you understand!
What you get from -ice-
◮ An output report on how it all went
◮ A dataset consisting of
  ◮ m + 1 sub-datasets: the original plus m imputed datasets
  ◮ Two new variables:
    _mi: Record identifier across all imputed datasets, i = 1, . . . , n
    _mj: Identifier of imputed dataset, j = 1, . . . , m (_mj = 0 marks the original data)
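A short sketch of how the stacked dataset might be inspected. It assumes the output of -ice- is in memory (for example after running -ice- with the clear option) and that incmean was among the imputed variables, as in the examples in these slides.

```stata
* Assumes the stacked -ice- output, including _mi and _mj, is in memory
tabulate _mj                         // number of records in each sub-dataset
summarize incmean if _mj == 0        // original data, missing values still present
summarize incmean if _mj == 1        // first imputed dataset
count if missing(incmean) & _mj > 0  // records still missing after imputation
```

Comparing the summaries for _mj == 0 and _mj > 0 is one simple way to follow the advice to check that imputed values are sensible.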
Importing data into -mi- command family
◮ Stata has its own suite of commands for multiple imputation analysis: -mi-
◮ Requires
  ◮ Specific organization of datasets
  ◮ Specific naming of variables ← different from those of -ice-
  ◮ Registration of relation between variables
◮ Obtained with
    . mi import ice, clear auto
◮ Creates variables: _mi_m, _mi_id, _mi_miss
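After the import it can be useful to check what -mi- has registered. The sketch below assumes the stacked -ice- output is in memory and spells out the automatic option in full. The two follow-up commands are standard -mi- utilities and are not part of the original slide.

```stata
* Assumes the stacked -ice- output (with _mi and _mj) is in memory
mi import ice, automatic clear

mi describe     // style, number of imputations, registered variables
mi varying      // variables that vary across imputations, by registration status
```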
Formats for multiply imputed datasets
◮ Stata has four different formats:
    wide      Adds new variables with imputed values
    mlong     Adds a new record for each record with missing values, per imputed dataset
    flong     Adds a full dataset per imputed dataset
    flongsep  Stores each of the imputed datasets in separate files
◮ The first two are efficient, computationally and storage-wise
◮ The latter two are transparent
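Switching between formats is done with -mi convert-. A minimal sketch, assuming mi data are already in memory (for example after -mi import ice-):

```stata
* Assumes mi data are in memory
mi convert mlong, clear   // keep extra records only for incomplete observations
mi convert wide, clear    // one record per subject, imputed values in extra variables
```

The clear option tells Stata that it is acceptable to convert data that have not been saved since they were last changed.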
-mi estimate-
◮ General syntax is a prefix, similar to -bysort variable: -
    . mi estimate, post: regress outcome covar1 covar2
    . mi estimate, or post: logit binoutcome covar1 covar2
◮ Runs the regression on each imputed dataset and combines the results using Rubin’s rules
◮ Leaves behind results that allow use of -mi test-
◮ This part is the easiest!
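For reference, the combination step for a single parameter can be sketched in standard notation. The symbols are introduced here and do not appear in the slides: \hat{Q}_j is the estimate from imputed dataset j and U_j is its estimated variance.

```latex
\begin{align*}
\bar{Q} &= \frac{1}{m}\sum_{j=1}^{m}\hat{Q}_j
    && \text{(combined estimate)}\\
\bar{U} &= \frac{1}{m}\sum_{j=1}^{m}U_j
    && \text{(within-imputation variance)}\\
B &= \frac{1}{m-1}\sum_{j=1}^{m}\bigl(\hat{Q}_j-\bar{Q}\bigr)^2
    && \text{(between-imputation variance)}\\
T &= \bar{U} + \Bigl(1+\frac{1}{m}\Bigr)B
    && \text{(total variance)}
\end{align*}
```

The standard error of the combined estimate is the square root of T, so the uncertainty from imputing enters through B.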
Interactions
◮ Consider ESS (European Social Survey) data
◮ Suppose we are interested in
    Outcome: Health (binary)
    Covariates: Age (3 categories), Income (4 categories) and their interaction
◮ Should the interaction term be included in the imputations?
◮ Yes!
Not including interactions in imputations?
◮ Consider all subjects with missing health
◮ If we impute health based on Age and Income alone:
  1. In imputed subjects, the interaction is absent from any subsequent analysis
  2. For the entire dataset, the interaction effect (if truly present) becomes diluted
◮ Basically the same reasoning applies if, for example, Income category is missing: if an interaction effect is truly present, then it should be allowed for in the imputations of Income
◮ Remember to use the passive() option of the -ice- command!
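A deliberately simplified sketch of the passive() mechanism. It uses two hypothetical binary indicators instead of the 3 × 4 category setting above: old (complete) and lowinc (with missing values), plus their product old_lowinc. These variable names and the seed are assumptions made for illustration. A categorical interaction needs one product term per pair of dummy variables.

```stata
* Hypothetical variables: health (binary outcome), old (binary, complete),
* lowinc (binary, some values missing)
generate old_lowinc = old * lowinc    // missing wherever lowinc is missing

* Check the prediction equations first, without imputing
ice health old lowinc old_lowinc, m(10) dryrun ///
    passive(old_lowinc: old*lowinc)

* Impute; the product is recomputed from the imputed lowinc, not imputed itself
ice health old lowinc old_lowinc, m(10) seed(2010) clear ///
    passive(old_lowinc: old*lowinc)
```

The dryrun output shows which predictors enter each equation, which is the place to confirm that the interaction is handled as intended before trusting the imputations.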
References
Barnard, J. and X. L. Meng (1999, March). Applications of multiple imputation in medical studies: from AIDS to NHANES. Statistical Methods in Medical Research 8(1), 17–36.
Donders, A. R. T., G. J. M. G. van der Heijden, T. Stijnen, and K. G. M. Moons (2006, October). Review: a gentle introduction to imputation of missing values. Journal of Clinical Epidemiology 59(10), 1087–91. PMID: 16980149.
Greenland, S. and W. D. Finkle (1995, December). A critical look at methods for handling missing covariates in epidemiologic regression analyses. American Journal of Epidemiology 142(12), 1255–64. PMID: 7503045.
Koopman, L., G. J. M. G. van der Heijden, D. E. Grobbee, and M. M. Rovers (2008, March). Comparison of methods of handling missing data in individual patient data meta-analyses: An empirical example on antibiotics in children with acute otitis media. American Journal of Epidemiology 167(5), 540–545.
Larsen, J., H. Stovring, J. Kragstrup, and D. G. Hansen (2009). Can differences in medical drug compliance between European countries be explained by social factors: analyses based on data from the European Social Survey, round 2. BMC Public Health 9, 145.
Little, R. J. A. and D. B. Rubin (1987). Statistical Analysis with Missing Data. New York: Wiley.
Meng, X.-L. (1994). Multiple-imputation inferences with uncongenial sources of input. Statistical Science 9(4), 538–558.
Meng, X.-L. (2000). Missing data: dial M for ??? Journal of the American Statistical Association 95(452), 1325–1330.
Potthoff, R. F., G. E. Tudor, K. S. Pieper, and V. Hasselblad (2006, June). Can one assess whether missing data are missing at random in medical studies? Statistical Methods in Medical Research 15(3), 213–234. PMID: 16768297.
Royston, P. (2009). Multiple imputation of missing values: Further update of ice, with an emphasis on categorical variables. Stata Journal 9(3), 466–477.
Rubin, D. B. (1976). Inference and missing data. Biometrika 63, 581–92.
Rubin, D. B. (1987). Multiple imputation for nonresponse in surveys. Wiley Series in Probability and Mathematical Statistics: Applied Probability and Statistics. New York: John Wiley & Sons Inc.
Sterne, J. A. C., I. R. White, J. B. Carlin, M. Spratt, P. Royston, M. G. Kenward, A. M. Wood, and J. R. Carpenter (2009). Multiple imputation for missing data in epidemiological and clinical research: potential and pitfalls. BMJ 338, b2393.
van der Heijden, G. J. M. G., A. R. T. Donders, T. Stijnen, and K. G. M. Moons (2006, October). Imputation of missing values is superior to complete case analysis and the missing-indicator method in multivariable diagnostic research: a clinical example. Journal of Clinical Epidemiology 59(10), 1102–9. PMID: 16980151.
Thank you for your attention!
Slides prepared with LaTeX and Beamer