Multiple Imputation in Stata --- -mi- and -ice- commands

Henrik Støvring ([email protected])

Department of Biostatistics

November 19, 2010

H Støvring Stata, MI, and ICE

Outline

Background and terminology

Generating imputed datasets

Brief list of introductory references

Background and setting

◮ Assume we “know” that
  ◮ The data are missing at random (MAR)
  ◮ More than one variable has missing values
  ◮ The missing data pattern is not monotone
◮ What is needed in order to proceed?
  ◮ Prediction model for missing values
  ◮ Tool to impute the missing values
  ◮ Tool to combine estimates from analysis of each imputed dataset into an overall estimate

Terminology

Dataset zero: The original dataset with missing values

Imputed dataset: A copy of the original dataset with all missing values replaced with imputed values (j = 1, ..., m)

Imputed values: (Randomly generated) values substituted for unobserved values

Multiple imputation analysis: The ordinary analysis on each imputed dataset AND combination of estimates into a single estimate

Iteration: Do a procedure (compute some numbers), update starting values with the result (the computed numbers), and repeat the procedure

Passive variable: A variable that depends on an imputed variable

Obtaining and installing

◮ Obtained in Stata with the commands
    . search ice
    . net sj 9-3 st0067_4
  and then click “(click here to install)”
◮ Background information in
  ◮ help-file and references therein
  ◮ Royston (2009)

Rationale

Multiple (Imputation)
Iterated: Repeat to achieve stability...
Chained: In a specific order, one by one...
Equations: Based on a set of regression equations

◮ Consists of two “steps”
  1. Estimate relationships between each variable to be imputed and predictive variables (covariates)
  2. Impute values from fitted model

Initial observations regarding -ice-

◮ Variables take turns being predictor and predicted (“outcome”/to be imputed)
◮ Variables to be imputed can be predicted from variables without missing values
◮ Allows regression types for categorical data
  ◮ Logistic
  ◮ Multinomial/Ordered logistic

◮ Allows imputation of interval censored data

Major options

m()       Number of imputed datasets
eq()      Equations used to predict from
cmd()     Regression type used to model “dependent” variable
by()      Impute separately for subsets of dataset
cycles()  Number of cycles used to obtain estimated regression coefficients used in subsequent prediction/imputation

Useful options

dryrun    Checks syntax and equations for consistency, but doesn’t do any actual imputations
saving()  Save imputed dataset to named file
seed()    Fix random origin, i.e. make process reproducible
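
A minimal sketch of how these options might be combined, assuming -ice- is already installed and behaves as described on these slides (later versions of the package may differ). The example data (Stata's shipped auto dataset), the artificially created missing values, and the file name auto_imputed are assumptions made for illustration only and are not part of the original analysis.

```stata
* Load a small example dataset shipped with Stata (illustrative assumption)
sysuse auto, clear

* Create artificial missing values in mpg, purely for demonstration
replace mpg = . if mod(_n, 7) == 0

* Step 1: check syntax and prediction equations without imputing anything
ice mpg weight price foreign, dryrun

* Step 2: impute 5 datasets reproducibly and save them to a named file
* (auto_imputed is a hypothetical file name)
ice mpg weight price foreign, m(5) seed(2010) saving(auto_imputed, replace)
```

Running the dryrun step first makes it easy to inspect the reported equations before any imputed values are generated.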

Other options

◮ Countless: clear, match(), passive(), boot(), ...
◮ WARNING: The package is often updated AND options sometimes change definition!

Imputing values: running -ice-

◮ Two types of approaches (at least!):
  1. “Black box”: minimal specification, maximal complexity
       . ice total_noncompl health incmean edulvl sex, m(10) clear
     where sex is a binary indicator for gender
  2. “Dedicated”: detailed specification, transparent modeling
       . ice total_noncompl health logincmean edulvl sex, ///
             m(10) clear ///
             eq(logincmean: total_noncompl edulvl sex, ///
                total_noncompl: incmean edulvl sex) ///
             passive(incmean:exp(logincmean))
       [etc....]

Advice and cautions

◮ Always include outcome in predictions of covariates
◮ Always check the reported equations used by -ice-
◮ Always check that imputed values are sensible
  (Barnard and Meng (1999), Meng (1994))
◮ When you estimate many parameters, m should be large
◮ Build from simple to complex: Try it out in a simple setting which you understand!

What you get from -ice-

◮ Output report on how it all went
◮ A dataset consisting of
  ◮ m + 1 sub-datasets: The original plus m imputed datasets
  ◮ Two new variables:
      _mi: Record identifier across all imputed datasets, i = 1, ..., n
      _mj: Identifier of imputed dataset, j = 1, ..., m

Importing data into -mi- command family

◮ Stata has its own suite of commands for multiple imputation analysis: -mi-
◮ Requires
  ◮ Specific organization of datasets
  ◮ Specific naming of variables ← different from those of -ice-
  ◮ Registration of relation between variables
◮ Obtained with
    . mi import ice, clear auto
◮ Creates variables: _mi_m, _mi_id, _mi_miss

Formats for multiply imputed datasets

◮ Stata has four different types of formats:
    wide      Adds new variables with imputed values
    mlong     Adds a new record for each missing value per imputed dataset
    flong     Adds a full dataset per imputed dataset
    flongsep  Stores each of the imputed datasets in separate files
◮ The first two are efficient, computationally and storage-wise
◮ The latter two are transparent
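
The difference between the formats is easiest to see by switching between them. The sketch below is an illustration under stated assumptions: it uses Stata's shipped auto dataset and the built-in -mi impute regress- instead of -ice-, purely so that it runs without extra packages. Treating rep78 as continuous is a simplification for the example, not a modeling recommendation.

```stata
* Example data shipped with Stata; rep78 has a few missing values
sysuse auto, clear

* Declare the data as mi data in the compact mlong style
mi set mlong
mi register imputed rep78

* Create 5 imputations of rep78 (linear regression used only for illustration)
mi impute regress rep78 mpg weight price, add(5) rseed(2010)

* Inspect the style and the number of imputations
mi describe

* Convert to the more transparent flong style and inspect again
mi convert flong, clear
mi describe
```

Comparing the number of observations reported before and after the conversion shows how much extra storage the transparent style requires.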

-mi estimate-

◮ General syntax is similar to -bysort variable: -:
    . mi estimate, post: regress outcome covar1 covar2
    . mi estimate, or post: logit binoutcome covar1 covar2
◮ Runs regression on each imputed dataset and combines results using Rubin’s rules
◮ Leaves behind results that allow use of -mi test-

◮ This part is the easiest!
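
For reference, the combination step can be written out explicitly. The notation below is not used on the slides and is introduced here as an assumption: \hat{Q}_j is the estimate from imputed dataset j, U_j is its estimated variance, and m is the number of imputed datasets.

```latex
\bar{Q} = \frac{1}{m}\sum_{j=1}^{m}\hat{Q}_j, \qquad
\bar{U} = \frac{1}{m}\sum_{j=1}^{m}U_j, \qquad
B = \frac{1}{m-1}\sum_{j=1}^{m}\bigl(\hat{Q}_j-\bar{Q}\bigr)^2, \qquad
T = \bar{U} + \Bigl(1+\frac{1}{m}\Bigr)B
```

Here \bar{Q} is the overall estimate, \bar{U} the within-imputation variance, B the between-imputation variance, and T the total variance of \bar{Q}.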

Interactions

◮ Consider ESS (European Social Survey) data
◮ Suppose we are interested in
    Outcome: Health (binary)
    Covariates: Age (3 categories), Income (4 categories) and their interaction
◮ Should interaction term be included in imputations?
◮ Yes!

Not including interactions in imputations?

◮ Consider all subjects with missing health
◮ If we impute health based on Age and Income alone:
  1. In imputed subjects, the interaction is non-existent in any subsequent analysis
  2. For the entire dataset, the interaction effect (if truly present) becomes diluted
◮ Essentially the same reasoning applies if, for example, Income category is missing: if the interaction effect is truly present, it should be allowed for in imputations of Income
◮ Remember to use the passive option of the ice command!

References

Barnard, J. and X. L. Meng (1999, Mar). Applications of multiple imputation in medical studies: from AIDS to NHANES. Stat Methods Med Res 8(1), 17–36.

Donders, A. R. T., G. J. M. G. van der Heijden, T. Stijnen, and K. G. M. Moons (2006, October). Review: a gentle introduction to imputation of missing values. Journal of clinical epidemiology 59(10), 1087–91. PMID: 16980149.

Greenland, S. and W. D. Finkle (1995, December). A critical look at methods for handling missing covariates in epidemiologic regression analyses. American journal of epidemiology 142(12), 1255–64. PMID: 7503045.

Koopman, L., G. J. M. G. van der Heijden, D. E. Grobbee, and M. M. Rovers (2008, March). Comparison of methods of handling missing data in individual patient data meta-analyses: An empirical example on antibiotics in children with acute otitis media. Am. J. Epidemiol. 167(5), 540–545.

Larsen, J., H. Stovring, J. Kragstrup, and D. G. Hansen (2009). Can differences in medical drug compliance between European countries be explained by social factors: analyses based on data from the European Social Survey, round 2. BMC Public Health 9, 145.

Little, R. J. A. and D. B. Rubin (1987). Statistical Analysis with Missing Data. New York: Wiley.

Meng, X.-L. (1994). Multiple-Imputation inferences with uncongenial sources of input. Statistical Science 9(4), 538–558.

Meng, X.-L. (2000). Missing data: dial M for ??? J. Amer. Statist. Assoc. 95(452), 1325–1330.

Potthoff, R. F., G. E. Tudor, K. S. Pieper, and V. Hasselblad (2006, June). Can one assess whether missing data are missing at random in medical studies? Statistical Methods in Medical Research 15(3), 213–234. PMID: 16768297.

Royston, P. (2009). Multiple imputation of missing values: Further update of ice, with an emphasis on categorical variables. Stata Journal 9(3), 466–477.

Rubin, D. B. (1976). Inference and missing data. Biometrika 63, 581–92.

Rubin, D. B. (1987). Multiple imputation for nonresponse in surveys. Wiley Series in Probability and Mathematical Statistics: Applied Probability and Statistics. New York: John Wiley & Sons Inc.

Sterne, J. A. C., I. R. White, J. B. Carlin, M. Spratt, P. Royston, M. G. Kenward, A. M. Wood, and J. R. Carpenter (2009). Multiple imputation for missing data in epidemiological and clinical research: potential and pitfalls. BMJ 338, b2393.

van der Heijden, G. J. M. G., A. R. T. Donders, T. Stijnen, and K. G. M. Moons (2006, October). Imputation of missing values is superior to complete case analysis and the missing-indicator method in multivariable diagnostic research: a clinical example. Journal of clinical epidemiology 59(10), 1102–9. PMID: 16980151.

Thank you for your attention!

Slides prepared with LaTeX and Beamer
