Wish You Were Here! Strategies for Handling Missing Data
Overview
Types of Missing Data
Strategies for Handling Missing Data
Software Applications and Examples
Agenda
Sources of Missing Data
◦ Item non-response: missing value for any given item
◦ Scale non-response: missing value for any given scale; often a result of item non-response
◦ Attrition: missing value (item and/or scale) for any given time point
◦ Data entry error: observed value not included
Overview
So I have missing data…what’s the big deal?
◦ Missing data, no matter how minimal, can (and probably do) result in biased results
◦ Statistical power
◦ Validity
Overview
How much missing data is “problematic”? Depends on who you ask…
◦ Answer #1: ANY
◦ Answer #2: It’s never “too much”; optimal methods can easily accommodate 50% missing data
◦ Answer #3: >5% (Schafer, 1999); >10% (Bennett, 2001); >20% (Peng et al., 2006)
◦ Answer #4 (Widaman, 2006): 1%-2% (Negligible); 5%-10% (Minor); 10%-25% (Moderate); 25%-50% (High); >50% (Excessive)
Overview
Missing Completely at Random (MCAR)
Missing at Random (MAR)
Not Missing at Random (NMAR)
Types of Missing Data
Missing Completely at Random (MCAR)
◦ Missing values on Y are unrelated to any other variable in the analysis
◦ Cases with missing data can be treated as a random subset of the entire sample
◦ Best case scenario; difficult to ascertain
Types of Missing Data
Missing at Random (MAR)
◦ Missing values on Y are related to X but not to Y
◦ Missing values on Y are random (random effect) after controlling for X (systematic effect)
◦ Can test systematic effect but not random effect
Types of Missing Data
Not Missing at Random (NMAR)
◦ Missing values on Y are related to Y itself
◦ Missing data are “non-ignorable”
◦ Difficult to ascertain; difficult to manage
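The three mechanisms differ only in what drives the probability that Y is missing. The sketch below is a hypothetical simulation (the coefficients, missingness probabilities, and the function name `simulate` are assumptions chosen for illustration, not values from the talk). It shows that the mean of the observed Y stays near the true mean of 0 under MCAR but drifts under MAR and NMAR.

```python
import random
import statistics

def simulate(mechanism, n=20000, seed=1):
    """Return (xs, ys) where ys holds None for values deleted under the mechanism."""
    rng = random.Random(seed)
    xs, ys = [], []
    for _ in range(n):
        x = rng.gauss(0, 1)
        y = 0.6 * x + rng.gauss(0, 0.8)      # Y is correlated with X; true mean of Y is 0
        if mechanism == "MCAR":
            p_miss = 0.3                      # same probability for every case
        elif mechanism == "MAR":
            p_miss = 0.5 if x > 0 else 0.1    # depends only on the observed X
        elif mechanism == "NMAR":
            p_miss = 0.5 if y > 0 else 0.1    # depends on the unseen Y itself
        else:
            raise ValueError(mechanism)
        xs.append(x)
        ys.append(None if rng.random() < p_miss else y)
    return xs, ys

def observed_mean(ys):
    """Mean of the values that were not deleted."""
    return statistics.mean(v for v in ys if v is not None)

for mech in ("MCAR", "MAR", "NMAR"):
    _, ys = simulate(mech)
    print(mech, round(observed_mean(ys), 3))
```

Under MAR the bias can be corrected by conditioning on X; under NMAR no observed variable carries the information needed to correct it.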
Types of Missing Data
Testing for MCAR
◦ Little’s Test of MCAR
Omnibus χ² test of all specified variables
If significant, data are not MCAR; may be MAR or NMAR
If not significant, there is no evidence against MCAR (can assume MCAR)
Available in SPSS under “Missing Value Analysis” and as a SAS macro
Determining Type of Missing Data
Testing for MAR
◦ Create a “dummy” variable for not missing/missing on the variable of interest
◦ Conduct statistical tests to see if other relevant variables are associated with values of the new variable: binomial logistic regression; χ² test of independence; t-tests
◦ If significant relationships are found, then the data are MAR; these variables need to be included in any analyses
◦ If no significant relationships are found, then you have more work to do
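A minimal sketch of the dummy-variable check, using the χ² test of independence against a binary grouping variable. The function name `missingness_chi2` and the data are hypothetical; the p-value uses the exact upper tail of a χ² distribution with 1 df, without a continuity correction.

```python
import math

def missingness_chi2(values, group):
    """Chi-square test of independence between a missing/not-missing dummy
    and a binary group code (0/1). Returns (chi2, p_value) with df = 1."""
    dummy = [1 if v is None else 0 for v in values]
    # observed[g][d] = number of cases in group g with dummy value d
    observed = [[0, 0], [0, 0]]
    for g, d in zip(group, dummy):
        observed[g][d] += 1
    n = len(dummy)
    row = [sum(observed[g]) for g in (0, 1)]
    col = [observed[0][d] + observed[1][d] for d in (0, 1)]
    chi2 = 0.0
    for g in (0, 1):
        for d in (0, 1):
            expected = row[g] * col[d] / n   # assumes every margin is non-zero
            chi2 += (observed[g][d] - expected) ** 2 / expected
    p_value = math.erfc(math.sqrt(chi2 / 2))  # upper tail of chi-square, 1 df
    return chi2, p_value

# Hypothetical data: group 1 is missing far more often than group 0.
group = [0] * 50 + [1] * 50
values = [1.0] * 45 + [None] * 5 + [1.0] * 25 + [None] * 25
chi2, p = missingness_chi2(values, group)
print(round(chi2, 2), p < 0.001)
```

A significant result here points to the grouping variable as a predictor of missingness, so it would need to be carried into the analysis or imputation model.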
Determining Type of Missing Data
If not MCAR or MAR, does that mean it is NMAR?
◦ Not necessarily…
Might still be MAR but you haven’t found the right indicator variable
◦ Consider other potentially relevant variables and test against the missing data “dummy” variable
Determining Type of Missing Data
Patterns of missing data
◦ Monotone pattern
Variables v1-vj can be ordered so that if data are missing on v1, they are missing on all successive variables
VERY common with longitudinal data
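Whether a pattern is monotone can be checked mechanically. The sketch below (the function name `is_monotone` is hypothetical) orders variables from fewest to most missing values, then confirms that no case has an observed value after a missing one under that ordering.

```python
def is_monotone(rows):
    """rows: list of equal-length lists, None = missing.
    Returns (flag, order): flag is True if the columns can be ordered so that
    a missing value is only ever followed by missing values."""
    n_cols = len(rows[0])
    # In any monotone ordering, columns with fewer missing values come first.
    missing_counts = [sum(r[j] is None for r in rows) for j in range(n_cols)]
    order = sorted(range(n_cols), key=lambda j: missing_counts[j])
    for r in rows:
        seen_missing = False
        for j in order:
            if r[j] is None:
                seen_missing = True
            elif seen_missing:        # observed value after a missing one
                return False, order
    return True, order

# Attrition across three waves: once a case drops out, it stays out.
waves = [[1, 2, 3], [1, 2, None], [1, None, None]]
print(is_monotone(waves))

# Arbitrary pattern: no ordering of the columns works.
arbitrary = [[1, None, 3], [None, 2, 3]]
print(is_monotone(arbitrary))
```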
Determining Type of Missing Data
Patterns of missing data
◦ Non-monotone pattern
Patterns of missing data are arbitrary
Determining Type of Missing Data
Deletion Methods
◦ Remove cases with missing values
Non-Stochastic Methods
◦ Replace missing values with “known” values
Stochastic Methods
◦ Replace missing values with estimated values
Methods for Handling Missing Data
List-Wise Deletion
◦ Mechanism: Deletes cases from analysis with missing data on any variable (even if that variable isn’t part of the analysis); only uses “complete cases”
◦ Pros: Easy to implement; works for any kind of statistical analysis; if data are MCAR, does not introduce any bias in parameter estimates; standard error estimates are appropriate
◦ Cons: May delete a large proportion of cases, resulting in loss of statistical power; may introduce bias if data are MAR but not MCAR
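As a small illustration of how quickly complete cases disappear, the sketch below applies list-wise deletion to a handful of hypothetical records (the variable names and values are invented for the example).

```python
def listwise_delete(rows):
    """Keep only complete cases: rows with no None in any column."""
    return [r for r in rows if all(v is not None for v in r)]

# Hypothetical records: (behavior, parenting, skills); None marks a missing value.
data = [
    (55, 18, 50),
    (60, None, 48),
    (None, 20, 52),
    (52, 17, None),
    (58, 19, 51),
]
complete = listwise_delete(data)
print(len(data), "cases ->", len(complete), "complete cases")
```

Each of the three incomplete cases is missing only one value, yet all of their observed values are discarded.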
Deletion Methods
Pair-Wise Deletion
◦ Mechanism: Deletes a case only when it is missing data on a variable involved in the parameter being estimated; uses all available information for each estimate, independent of the information available for other estimates
◦ Pros: Approximately unbiased if MCAR; uses all available information
◦ Cons: Standard errors are incorrect (each estimate rests on a different subset of cases, so there is no single sample size)
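A minimal sketch of pair-wise deletion for correlations, with hypothetical data. Each correlation uses every case observed on both variables, so the number of cases differs from one estimate to the next.

```python
import math

def pairwise_corr(xs, ys):
    """Pearson r using every case observed on BOTH variables; returns (r, n_used)."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    syy = sum((y - my) ** 2 for _, y in pairs)
    return sxy / math.sqrt(sxx * syy), n

a = [1, 2, 3, 4, 5, 6]
b = [2, None, 6, 8, 10, None]
c = [None, 1, 4, 3, None, 7]
results = {}
for name, (u, v) in {"a-b": (a, b), "a-c": (a, c), "b-c": (b, c)}.items():
    results[name] = pairwise_corr(u, v)
    print(name, "n =", results[name][1], "r =", round(results[name][0], 3))
```

The differing n per estimate is what makes a single, defensible standard error hard to define for models built from these correlations.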
Deletion Methods
Mean Imputation
◦ Mechanism: All missing values on a given variable are replaced by the sample mean for that variable
◦ Pros: Leaves sample mean of non-missing values unchanged
◦ Cons: Often leads to biased parameter estimates (e.g., variances); usually leads to standard error estimates that are biased downward; treats imputed data as real data and ignores the inherent uncertainty in imputed values
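The variance bias is easy to see with a few hypothetical scores: the mean is preserved, but every imputed value sits exactly at the center of the distribution, so the spread shrinks.

```python
import statistics

def mean_impute(values):
    """Replace each None with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    m = statistics.mean(observed)
    return [m if v is None else v for v in values]

scores = [10, 12, None, 15, None, 9, 14, None]
observed = [v for v in scores if v is not None]
filled = mean_impute(scores)

print("mean:", statistics.mean(observed), "->", statistics.mean(filled))
print("variance:", statistics.variance(observed), "->", round(statistics.variance(filled), 3))
```

The artificially larger n combined with the smaller variance is also why standard errors come out too small.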
Non-Stochastic Methods
Individual Mean Imputation
◦ Mechanism: Scale scores are computed by taking the mean of non-missing values. Ex: Respondent answered 8 of 10 questions on Miller Anxiety Scale; compute the scale score by taking the mean of the 8 available items
◦ Pros: All available information for a given individual is used in the estimation of missing values
◦ Cons: Assumes the items with missing values are similar in difficulty or extremity to items with non-missing data; may lead to biased scores
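A sketch of the 8-of-10 example with invented item responses. The `min_answered` threshold is a hypothetical safeguard added for illustration; the talk does not specify a minimum number of answered items.

```python
def scale_score(item_responses, min_answered=1):
    """Mean of the items a respondent actually answered (None = skipped).
    Returns None if fewer than min_answered items are available."""
    answered = [v for v in item_responses if v is not None]
    if len(answered) < min_answered:
        return None
    return sum(answered) / len(answered)

# Hypothetical respondent who skipped items 3 and 7 of a 10-item scale.
respondent = [3, 4, None, 2, 5, 4, None, 3, 4, 3]
print(scale_score(respondent))
```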
Non-Stochastic Methods
Regression
◦ Mechanism: Missing values are replaced by “predicted” values derived from multiple regression (MR) using all relevant variables
◦ Pros: Predicted values maintain relationships among variables
◦ Cons: Predicted values are “perfect” (they fall exactly on the regression line) and lead to positively biased estimates
Non-Stochastic Methods
Stochastic Regression (aka “Simple Imputation”)
◦ Mechanism: Similar to non-stochastic regression in that the available data are used to predict missing values; adds a random value to each predicted value by sampling from a normal distribution with a mean of zero and variance equal to the residual variance of the regression equation
◦ Pros: Improvement over non-stochastic methods; provides unbiased variance estimates
◦ Cons: Only uses a single estimation step and may produce inaccurate or unusual values
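The sketch below contrasts plain and stochastic regression imputation with one predictor. The simulated data, seed, and function names are assumptions for illustration; missingness is generated completely at random to keep the comparison simple.

```python
import random
import statistics

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b, residual_sd)."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    a = my - b * mx
    resid = [y - (a + b * x) for x, y in zip(xs, ys)]
    resid_sd = (sum(e * e for e in resid) / (len(xs) - 2)) ** 0.5
    return a, b, resid_sd

def regression_impute(xs, ys, stochastic, rng):
    """Fill None entries of ys from x. With stochastic=True, add N(0, resid_sd) noise."""
    cc = [(x, y) for x, y in zip(xs, ys) if y is not None]
    a, b, resid_sd = fit_line([x for x, _ in cc], [y for _, y in cc])
    out = []
    for x, y in zip(xs, ys):
        if y is None:
            y = a + b * x + (rng.gauss(0, resid_sd) if stochastic else 0.0)
        out.append(y)
    return out

rng = random.Random(42)
xs = [rng.gauss(50, 10) for _ in range(2000)]
true_y = [20 + 0.5 * x + rng.gauss(0, 5) for x in xs]
ys = [None if rng.random() < 0.4 else y for y in true_y]

plain = regression_impute(xs, ys, stochastic=False, rng=rng)
noisy = regression_impute(xs, ys, stochastic=True, rng=rng)
print("variance, full data:            ", round(statistics.variance(true_y), 1))
print("variance, regression imputation:", round(statistics.variance(plain), 1))
print("variance, stochastic regression:", round(statistics.variance(noisy), 1))
```

Plain regression imputation understates the variance of Y because imputed values carry no residual scatter; adding the random residual restores it.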
Stochastic Methods
Stochastic Methods (Regression)
Expectation Maximization (EM)
◦ Mechanism: 2-step iterative process
Step 1 (Expectation): Use parameter values (initially based on complete-case data) to estimate values for missing data
Step 2 (Maximization): Use the observed data together with the estimated values for missing data to estimate new model parameters
Repeat until results converge (successive iterations no longer yield different parameters)
◦ Pros: Minimizes bias in parameter estimates (larger samples yield less bias); ideal for exploratory and reliability analyses
◦ Cons: Initial estimates based on list-wise deletion (doesn’t use all available data); biased standard errors (minimized with larger samples); less efficient than FIML for hypothesis testing
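The two steps can be sketched for the simplest case: two normally distributed variables, X fully observed and Y partly missing. This is an illustrative reduction, not the routine any package uses; the simulated data, the fixed iteration count, and the function name `em_bivariate` are assumptions.

```python
import random

def em_bivariate(xs, ys, n_iter=200):
    """EM estimates of the means, variances, and covariance of (x, y)
    when x is complete and y contains None for missing values."""
    n = len(xs)
    cc = [(x, y) for x, y in zip(xs, ys) if y is not None]
    # Starting values from complete cases.
    mu_x = sum(x for x, _ in cc) / len(cc)
    mu_y = sum(y for _, y in cc) / len(cc)
    s_xx = sum((x - mu_x) ** 2 for x, _ in cc) / len(cc)
    s_yy = sum((y - mu_y) ** 2 for _, y in cc) / len(cc)
    s_xy = sum((x - mu_x) * (y - mu_y) for x, y in cc) / len(cc)
    for _ in range(n_iter):
        # E-step: expected y and y^2 for each case given x and current parameters.
        slope = s_xy / s_xx
        resid_var = s_yy - s_xy ** 2 / s_xx
        ey, ey2 = [], []
        for x, y in zip(xs, ys):
            if y is None:
                pred = mu_y + slope * (x - mu_x)
                ey.append(pred)
                ey2.append(pred ** 2 + resid_var)
            else:
                ey.append(y)
                ey2.append(y ** 2)
        # M-step: re-estimate parameters from the completed sufficient statistics.
        mu_x = sum(xs) / n
        mu_y = sum(ey) / n
        s_xx = sum(x * x for x in xs) / n - mu_x ** 2
        s_yy = sum(ey2) / n - mu_y ** 2
        s_xy = sum(x * e for x, e in zip(xs, ey)) / n - mu_x * mu_y
    return mu_x, mu_y, s_xx, s_yy, s_xy

# Simulated MAR data: y is often missing when x is high.
rng = random.Random(7)
xs = [rng.gauss(50, 10) for _ in range(1000)]
full_y = [20 + 0.5 * x + rng.gauss(0, 5) for x in xs]
ys = [None if (x > 55 and rng.random() < 0.7) else y for x, y in zip(xs, full_y)]

mu_x, mu_y, s_xx, s_yy, s_xy = em_bivariate(xs, ys)
cc_mean = sum(y for y in ys if y is not None) / sum(y is not None for y in ys)
print("complete-case mean of y:", round(cc_mean, 2))
print("EM mean of y:           ", round(mu_y, 2))
print("mean of y before deletion:", round(sum(full_y) / len(full_y), 2))
```

Note that the E-step carries forward the residual variance as well as the predicted value; filling in predictions alone would repeat the variance problem of regression imputation.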
Stochastic Methods
Stochastic Methods (EM)
Full Information Maximum Likelihood (FIML)
◦ Mechanism: Directly estimates parameters using all observed data for every case
◦ Pros: Only requires a single step for imputation and analysis; uses all available data even if some cases are missing data; unbiased standard errors; can be used with smaller samples (N<100)
◦ Cons: All variables related to missing data need to be included in the analysis
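“Using all observed data for every case” means each case contributes the likelihood of whatever it has observed. The sketch below shows that case-wise log-likelihood for a bivariate normal model in which only Y can be missing; this is a simplifying assumption, and the parameter values are hypothetical. FIML estimation would search for the parameters that maximize this function.

```python
import math

def normal_logpdf(v, mean, var):
    """Log density of a univariate normal distribution."""
    return -0.5 * (math.log(2 * math.pi * var) + (v - mean) ** 2 / var)

def observed_loglik(xs, ys, mu_x, mu_y, s_xx, s_yy, s_xy):
    """Observed-data log-likelihood for bivariate normal (x, y); y may be None.
    Complete cases contribute f(x) * f(y | x); cases missing y contribute f(x) only."""
    slope = s_xy / s_xx
    resid_var = s_yy - s_xy ** 2 / s_xx
    total = 0.0
    for x, y in zip(xs, ys):
        total += normal_logpdf(x, mu_x, s_xx)            # marginal density of x
        if y is not None:                                 # conditional density of y given x
            total += normal_logpdf(y, mu_y + slope * (x - mu_x), resid_var)
    return total

# Hypothetical data: two of four cases are missing y but still contribute through x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [1.5, None, 3.5, None]
ll = observed_loglik(xs, ys, mu_x=2.5, mu_y=2.5, s_xx=1.25, s_yy=1.5, s_xy=1.0)
print(round(ll, 3))
```

No values are imputed at any point, which is why imputation and analysis collapse into a single step.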
Stochastic Methods
Stochastic Methods (FIML)
Multiple Imputation (MI)
◦ Mechanism: Creates multiple data sets using stochastic regression; a minimum of 3-5 is recommended, but there is no limit on the maximum (Schafer, 1997); each data set will be slightly different because of the random component; parameters are estimated for each data set and then averaged
◦ Pros: Produces unbiased parameter estimates; produces unbiased standard errors; easy to include auxiliary variables
◦ Cons: Labor intensive; can be difficult to integrate multiple data sets
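Integrating the results is usually done with Rubin's rules: the point estimates are averaged, and the pooled variance adds the between-imputation variance to the average within-imputation variance. A minimal sketch, with hypothetical estimates from m = 5 imputed data sets:

```python
import statistics

def pool_rubin(estimates, std_errors):
    """Combine one parameter across m imputed data sets using Rubin's rules.
    Returns (pooled_estimate, pooled_standard_error)."""
    m = len(estimates)
    q_bar = statistics.mean(estimates)                       # pooled point estimate
    within = statistics.mean(se ** 2 for se in std_errors)   # average sampling variance
    between = statistics.variance(estimates)                 # variance across imputations
    total = within + (1 + 1 / m) * between
    return q_bar, total ** 0.5

# Hypothetical slopes and standard errors from five imputed data sets.
b = [-1.20, -1.28, -1.17, -1.25, -1.22]
se = [0.25, 0.26, 0.24, 0.26, 0.25]
est, pooled_se = pool_rubin(b, se)
print(round(est, 3), round(pooled_se, 3))
```

The between-imputation term is what carries the uncertainty due to missing data into the standard error; a single imputed data set has no way to express it.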
Stochastic Methods
Stochastic Methods (MI)
Comparison of Stochastic Methods
Stochastic Methods
Good: Stochastic Regression
Better: Expectation-Maximization
Best: Multiple Imputation; Full Information Maximum Likelihood
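Software Applications
[Table: availability of each method (rows: Deletion, Non-Stochastic Replacement, Simple Imputation, EM, FIML, MI) by software package (columns: SPSS/PASW, SAS, AMOS/MPLUS/LISREL); cell entries not preserved in the transcript]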
Modeling problematic child behavior outcomes
Predictors
◦ Positive Parenting
◦ Social Skills
◦ Interpartner Violence
◦ Child Sex
N=181
Original data set missing 4 observations (<.5%)
New data set created for purpose of demonstration
Example
◦ Little’s Test of MCAR can be obtained as part of PASW “Missing Values Analysis”
Little's MCAR test: Chi-Square = 36.014, DF = 18, Sig. = .007
Conclude that data are not MCAR (not surprising given that I did not delete values in a random manner)
Testing for Type of Missing Data
Test of MAR can be conducted by creating a new dichotomous variable for “Not Missing/Missing” and using it as the outcome variable in a logistic regression model
Most interested in missing data on the outcome variable in this example, but the method is not limited to that
Conclude that pattern of missing data is related to Gender
Little's MCAR test for Boys: Chi-Square = 8.338, DF = 14, Sig. = .871*
Little's MCAR test for Girls: Chi-Square = 13.026, DF = 18, Sig. = .790*
*We can conclude that data are MCAR within each group. Gender must be included in any missing data analysis to minimize bias.
Testing for Type of Missing Data
Variables in the Equation
                      B        S.E.     Wald     df    Sig.    Exp(B)
Step 1a  Gender       3.091    1.046    8.726    1     .003    22.003
         Parenting    .074     .087     .718     1     .397    1.076
         Skills       .010     .023     .195     1     .658    1.010
         Aggression   -.003    .022     .024     1     .877    .997
         Constant     -9.058   2.936    9.516    1     .002    .000
a. Variable(s) entered on step 1: Gender, Parenting, Skills, Aggression.
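The Wald, Sig., and Exp(B) columns follow directly from B and S.E. The sketch below recomputes them for the Gender row (B = 3.091, S.E. = 1.046); small differences from the printed values are expected because the table reports rounded coefficients. The function name `wald_summary` is hypothetical.

```python
import math

def wald_summary(b, se):
    """Wald chi-square (1 df), its p-value, and the odds ratio for a logistic coefficient."""
    wald = (b / se) ** 2
    p_value = math.erfc(math.sqrt(wald / 2))   # upper tail of chi-square, 1 df
    return wald, p_value, math.exp(b)

wald, p, odds_ratio = wald_summary(3.091, 1.046)   # Gender row of the table
print(round(wald, 2), round(p, 3), round(odds_ratio, 1))
```

An odds ratio of about 22 means the odds of having a missing outcome differ by roughly that factor between the two gender codes, which is why Gender has to be carried into any missing data model here.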
Patterns of missing data can be obtained using “Analyze Patterns” option available under “Multiple Imputation”
Results of pattern analysis
Variable Summary (a, b)
                        Missing N   Missing %   Valid N   Mean      Std. Deviation
Behavior                59          32.6%       122       55.75     10.333
Positive Parenting      44          24.3%       137       18.4293   3.04990
Interpartner Violence   36          19.9%       145       12.77     12.229
Social Skills           27          14.9%       154       51.75     11.501
a. Maximum number of variables shown: 25
b. Minimum percentage of missing values for variable to be included: 10.0%
Results of pattern analysis
Although the pattern is not monotone, these cases only make up a very small %
PASW provides several options for handling missing data
The add-on module for “Missing Values Analysis” allows you to implement several different strategies simultaneously
◦ In addition to saving time, comparison output is provided for means, SDs, and correlation/covariance matrices
Available options:
◦ List-wise deletion
◦ Pair-wise deletion
◦ Stochastic regression
◦ EM
Missing Values Analysis in PASW
Missing Values Analysis in PASW
Enter continuous and categorical variables
Choose strategies
Additional options
The “Multiple Imputation” option is part of the basic PASW package
◦ Provides numerous options
Choose # of iterations
Choose estimation method (monotone vs. non-monotone patterns)
Create new data sets
Multiple Imputation in PASW
Multiple Imputation in PASW
Enter all variables to use in imputation (model + auxiliary)
Choose # of iterations
Create a new data set with imputed data
Note: PASW allows you to run analysis on all imputed sets simultaneously
Multiple Imputation in PASW
“Automatic” is the default
Can manually select method based on pattern of missing data
If your data include interactions, so should your imputation model
Missing Data and LISREL
Multiple Imputation available in PreLIS under “Statistics”
I have included both model and auxiliary variables
Select estimation method
EM -> monotone
MCMC -> non-monotone
Decide how to handle cases when all data are missing
Output is a “complete” data set for analysis
Missing Data and LISREL
An alternative to MI is to use FIML estimation with the original data set containing missing values
LISREL will default to this option if there is missing data
Comparing Results
                        Complete                List-Wise               Pair-Wise
                        B       S.E.    Sig.    B       S.E.    Sig.    B       S.E.    Sig.
(Constant)              83.71   5.29    .000    91.47   6.57    .000    91.34   7.01    .000
Child's Sex             -.75    1.38    .586    -.64    1.72    .709    -.58    1.79    .748
Positive Parenting      -1.03   .22     .000    -1.27   .27     .000    -1.34   .28     .000
Social Skills           -.20    .06     .001    -.26    .08     .001    -.21    .08     .000
Interpartner Violence   .14     .06     .024    .10     .07     .136    .07     .07     .006
Comparing Results
                        Complete                Mean Substitution       Simple Imputation
                        B       S.E.    Sig.    B       S.E.    Sig.    B       S.E.    Sig.
(Constant)              83.71   5.29    .000    85.37   5.21    .000    80.87   6.01    .000
Child's Sex             -.75    1.38    .586    -.42    1.19    .709    -.18    1.48    .904
Positive Parenting      -1.03   .22     .000    -1.17   .22     .000    -1.06   .24     .000
Social Skills           -.20    .06     .001    -.16    .05     .001    -.12    .06     .049
Interpartner Violence   .14     .06     .024    .07     .05     .136    .05     .06     .390
Comparing Results
                        Complete                EM-PASW                 MCMC-LISREL             FIML-LISREL
                        B       S.E.    Sig.    B       S.E.    Sig.    B       S.E.    Sig.    B       S.E.    Sig.
(Constant)              83.71   5.29    .000    91.52   4.99    .000    92.96   5.32    .000    88.83   5.86    .000
Child's Sex             -.75    1.38    .586    -.35    1.16    .761    -.18    1.59    .359    -.23    .79     .799
Positive Parenting      -1.03   .22     .000    -1.36   .21     .000    -1.24   .26     .000    -1.19   .26     .000
Social Skills           -.20    .06     .001    -.22    .05     .000    -.23    .06     .000    -.25    .07     .000
Interpartner Violence   .14     .06     .024    .09     .05     .073    .11     .06     .051    .11     .06     .076
The goal of handling missing data is to find values close to the “real” (but absent) values. (T or F)
◦ FALSE: the goal is to estimate unbiased standard errors and parameter estimates
Which is more important: amount of missing data or type of missing data?
◦ Both are important, but type is more important than amount
List-wise deletion is a good strategy for handling missing data. (T or F)
◦ TRUE: if data are MCAR; if not MCAR, then there are better alternatives
There are no “good” strategies for handling data that are NMAR. (T or F)
◦ TRUE: but FIML is considered to yield the least biased results
Test
Deletion is the only strategy for handling missing categorical data. (T or F)
◦ FALSE: can use both non-stochastic and stochastic methods
If using multiple imputation, it is best to include all available variables. (T or F)
◦ FALSE: only include variables related to those with missing data
Values such as “not applicable”, “not sure”, “I don’t know”, etc. should be treated as missing data. (T or F)
◦ FALSE: if you included these as possible response categories, then they constitute valid responses (i.e., they are not missing)
List-wise deletion is better than non-stochastic imputation. (T or F)
◦ TRUE: if data are MCAR and/or unless using a small sample with minimal power
Test
Missing data should only be imputed for predictor variables and never for outcome variables. (T or F)
◦ DEPENDS: if you have good auxiliary variables for the outcome variable, then you should impute on the outcome variable; otherwise you should not impute.
Values such as “not applicable”, “not sure”, “I don’t know”, etc. can be treated as missing data. (T or F)
◦ TRUE: IF you have a strong theoretical argument that a different response would have been obtained under different circumstances
The most important factor in choosing a strategy is the type of missing data. (T or F)
◦ TRUE
Analyses should always be conducted and reported using data with and without missing values. (T or F)
◦ TRUE
Test
Causes (actual and/or hypothesized) of missing data should be discussed
The amount of missing data and the strategy used to handle it should be reported
Results of analyses with and without missing data should be discussed
The most appropriate strategy should be used
Summary
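[Table: suitability of each strategy (rows: List-wise Deletion, Pair-wise Deletion, Non-stochastic Replacement, Simple Imputation, EM, FIML, Multiple Imputation) by type of missing data (columns: MCAR, MAR, NMAR); cell entries not preserved in the transcript]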
Summary
Allison, P. D. (2001). Missing data. Thousand Oaks, CA: Sage Publications.
Bennett, D. A. (2001). How can I deal with missing data in my study? Australian and New Zealand Journal of Public Health, 25, 464-469.
Little, R. J. A. (1988). A test of missing completely at random for multivariate data with missing values. Journal of the American Statistical Association, 83, 1198-1202.
Little, R. J. A., & Rubin, D. B. (1987). Statistical analysis with missing data. New York: John Wiley & Sons.
Peng, C. Y., Harwell, M., Liou, S. M., & Ehman, L. H. (2006). Advances in missing data methods and implications for educational research. In S. Sawilowsky (Ed.), Real data analysis (pp. 31-78). Greenwich, CT: Information Age.
Schafer, J. L. (1997). Analysis of incomplete multivariate data. Thousand Oaks, CA: Sage.
Schafer, J. L. (1999). Multiple imputation: A primer. Statistical Methods in Medical Research, 8, 3-15.
Schlomer, G. L., Bauman, S., & Card, N. A. (2010). Best practices for missing data management in counseling psychology. Journal of Counseling Psychology, 57(1), 1-10.
Widaman, K. F. (2006). Missing data: What to do with or without them. Monographs of the Society for Research in Child Development, 71(3), 42-64.
References
Questions?