1
Practical Approaches for Dealing with Missing Data in
Longitudinal Analyses of Adolescent Addiction Programs
Michael Dennis, Ph.D., Chestnut Health Systems, Bloomington, IL
Presentation at the Advisory Committee Meeting for “Economic Evaluation Methods: Development and Applications (R01 DA018645)”, Coconut Grove, FL, November 10-11, 2006. Preparation of this manuscript was supported by funding from the Center for Substance Abuse Treatment (CSAT Contract no. 270-2003-00006). The content of this presentation reflects the opinions of the author and does not reflect the views or policies of the government. Available online at www.chestnut.org/LI/Posters or by contacting Joan Unsicker at 720 West Chestnut, Bloomington, IL 61701, phone: (309) 827-6026, fax: (309) 829-4661, e-mail: [email protected]
2
This presentation provides...
• A quick review of the problems of missingness and methods of imputation based on Schafer & Graham (2002)
• A summary of the practical approach Chestnut uses to deal with missing data
• Focus here is on the conceptual issues and actual effectiveness – not the math or computational formulas per se
3
Types of Missingness
• By design
• Logical skipouts
• Item missing
• Wave missing
• Unobserved latent constructs
4
Key Terms (From Rubin)
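• Missing Completely at Random (MCAR): Missingness has no relationship to predictors or dependent variables
• Missing at Random (MAR): Missingness can be predicted from observed data and, given those data, has no relationship with the missing value of the dependent variable
• Missing Not at Random (MNAR): Missingness is related to the unobserved values themselves (predictors and/or dependent variables), even after accounting for observed data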
5
The Problem With Listwise Deletion (default)
Source: Schafer & Graham (2002)
Each estimate is increasingly biased as we move away from MCAR
Smaller SD inflates significance tests
Unstable
Changes correlations & Relationships
Loss of sample is also problematic for multivariate analyses
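A minimal simulation sketch of the bias pattern described on this slide. Everything here is an illustrative assumption rather than material from the cited source: the outcome `y` depends on a single predictor `x`, the true mean of `y` is 0, and the missingness probabilities (0.5, 0.2, 0.8) and the function name `complete_case_mean` are made up for the example.

```python
import random

def complete_case_mean(mechanism, n=20000, seed=0):
    """Mean of y among cases that survive listwise deletion.

    The true mean of y is 0, so a systematic departure from 0 is bias.
    """
    rng = random.Random(seed)
    kept = []
    for _ in range(n):
        x = rng.gauss(0, 1)
        y = x + rng.gauss(0, 1)
        if mechanism == "MCAR":      # missingness ignores both x and y
            p_missing = 0.5
        elif mechanism == "MAR":     # missingness depends on the observed x only
            p_missing = 0.8 if x > 0 else 0.2
        else:                        # MNAR: depends on the unseen y itself
            p_missing = 0.8 if y > 0 else 0.2
        if rng.random() >= p_missing:
            kept.append(y)
    return sum(kept) / len(kept)

for mech in ("MCAR", "MAR", "MNAR"):
    print(mech, round(complete_case_mean(mech), 2))
```

Under these assumptions the complete-case mean stays near 0 for MCAR and is pulled well below 0 for MAR and MNAR, because high-scoring cases are the ones most often deleted.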
6
Pair-wise
• Pair-wise deletion is particularly efficient and unbiased under the assumption of MCAR
• Becomes rapidly unstable even under MAR
• Often narrows covariance or variance estimates and distorts relationships in regression or structural equation models (SEM)
7
Problems with other common methods of replacement
Source: Schafer & Graham (2002)
• Mean substitution: narrows variance
• Regression estimate: still narrows variance
• Hot deck: better but still biased
• Only models using real variance are relatively unbiased
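A small worked sketch of why mean substitution narrows variance. The scores and the number of missing cases are hypothetical; the point is only that every filled-in case sits exactly at the mean and so adds nothing to the spread.

```python
from statistics import mean, pstdev

observed = [2, 4, 6, 8, 10]   # hypothetical observed scores
n_missing = 5                 # hypothetical number of missing cases

# Mean substitution: every missing case receives the observed mean.
filled = observed + [mean(observed)] * n_missing

sd_observed = pstdev(observed)   # spread among the real cases
sd_filled = pstdev(filled)       # spread after substitution
print(round(sd_observed, 2), round(sd_filled, 2))
```

The sum of squared deviations is unchanged while the case count doubles, so the standard deviation shrinks, which is what inflates significance tests downstream.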
8
Examples of Predictive
• Weighted hot-deck: sort people based on related variables, then randomly replace
• Maximum Likelihood (ML): predict from all other available data.
• Restricted Maximum Likelihood (RML): predict from all other available data within the same condition (site, time, etc.) to preserve differences
• Multiple imputations: Average over several imputations – a form of bootstrapping that does not assume a normal distribution
9
Problem with these methods…
• Complicated with many variables and/or for multiple analyses
• All methods have unknown biases under MNAR unless there is a known a priori basis for modeling missingness (e.g., a common factor)
• In longitudinal analysis, this includes knowing the expected trajectory over time.
10
Chestnut Strategy 1: Minimize it
• Train, monitor, and do quality assurance to get staff to minimize missing data
• Use simple logical skips to minimize not applicable questions and burden
• Differentiate between refusals (rare), don’t knows (more common) and skip outs (common) – track and do problem solving if refusals start occurring on specific items (which is MNAR)
• Put more effort into follow-up
11
Follow-up Rates are PRIMARILY related to effort
Source: Scott (2004)
12
Accepting a lower follow-up rate “biases” results
Source: Scott (2004)
• The easiest to find people are different on the outcome – which is MNAR
• The differences are as large as or larger than the treatment effects we are looking for
13
Strategy 2: Make Logical Edits
1. Design questionnaire so that there are clear simple logical edits with implied value
2. Test logic of edits (not all work, e.g., M1)
3. Replace logical skip outs with implied value
4. Test logic of complex edits to create summary measures (not all work, e.g., NHSDA)
5. Make complex edits
14
Strategy 3: Replace missing data within known factors
• Recall that this was one of the few ways to deal with MNAR
• Known common factors should have a Cronbach’s alpha of at least .7
• Evaluate the amount and type of missing data:
- by design (e.g., adding an item in a new version) is MCAR
- systematic refusal is MNAR
• Calculate scale as mean of valid items × expected number of items (require at least 3 valid items)
• Generally do above within subscale, then sum up to higher order scales
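The scoring rule above can be sketched as follows. The function name `prorated_scale` and the 5-item example are hypothetical; the rule itself (mean of valid items times expected number of items, with at least 3 valid items required) is the one stated on the slide.

```python
def prorated_scale(items, min_valid=3):
    """Score a scale as mean of valid items x expected number of items.

    `items` holds one respondent's answers, with None marking a missing item.
    Returns None when fewer than `min_valid` items were answered.
    """
    valid = [v for v in items if v is not None]
    if len(valid) < min_valid:
        return None
    return sum(valid) / len(valid) * len(items)

# Hypothetical 5-item subscale scored 0/1.
one_missing = prorated_scale([1, 0, 1, None, 1])        # 4 valid items
too_sparse = prorated_scale([1, None, None, None, 0])   # only 2 valid items
print(one_missing, too_sparse)
```

A higher-order scale would then be the sum of the subscale scores produced this way, leaving the total missing if any required subscale could not be scored.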
15
[Figure: Rasch person-item map for the SPS. Persons (each '#' is 24) are plotted on the left and items on the right along a scale running from about -4 to 2, with rarer items toward the top; the top and bottom of the map are truncated. Items from rarest to most common: HlthProbs; Withdrawal/ill; ProbW/Law; Unsafe, GiveUpActs, DespiteMedPsyProbs; DepressedNervous, NeededMoreAOD, UnableCutDown; ResponNotMet, LargerAmnt/more; HideWhenUseAOD, Fights/trouble; SpentTimeGetting; ParentComplained; WeeklyAOD.]
Example: GAIN Substance Problems Scale (SPS)
Rasch Model Demonstrating That Item Severities Are NOT Equal
Source: Riley et al (in press)
16
Use of Rasch Measurement Model / Computer Adaptive Tests (CAT) models
GAIN Substance Problem Scale (SPS)
Measure               Withdrawal  Frequency  Emotional  Recovery     Health
                      Symptoms    of Use     Problems   Environment  Problems
Symptom Count (16)    0.53        0.38       0.36       0.37         0.19
Full Rasch (16)       0.54        0.43       0.41       0.39         0.22
CAT (5-11 items)      0.57        0.45       0.44       0.40         0.23
• CAT can closely approximate with a fraction of items
• Weighting items with Rasch does a little better
Construct validation: Comparing alternative measures to “expected” correlates
Source: Riley et al (in press)
17
Strategy 4: Replace structural missing data (e.g., by site)
• Where data are missing structurally by design (i.e., MCAR), use regression to impute values based on correlated factors in other sites (seeking a formula that explains 70% or more of the variance).
• Simple regression if small percent of data (under 5%)
• As the amount of missing data goes up to 15%, it is worth considering the use of ML or MI
• Above 15% missing, all methods are questionable
• At this point we usually have less than 1% missing within wave, but 5-20% or more by wave
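A rough sketch of how these rules of thumb might be encoded. The function names, the single-predictor regression, and the treatment of exactly 5% and 15% are assumptions made for illustration; only the thresholds (under 5%, up to 15%, above 15%, and 70% of variance explained) come from the slide.

```python
def suggested_method(pct_missing):
    """Map percent missing to the approach named on the slide."""
    if pct_missing < 5:
        return "simple regression"
    if pct_missing <= 15:
        return "consider ML or MI"
    return "all methods questionable"

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b, r_squared)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    syy = sum((y - my) ** 2 for y in ys)
    b = sxy / sxx
    a = my - b * mx
    return a, b, (sxy ** 2) / (sxx * syy)

def impute_from_other_sites(donor_x, donor_y, target_x, min_r2=0.70):
    """Predict a structurally missing variable from a correlated one.

    The formula is fitted on sites that collected both variables and is
    used only if it explains at least `min_r2` of the variance there.
    """
    a, b, r2 = fit_line(donor_x, donor_y)
    if r2 < min_r2:
        return None   # formula too weak; do not impute from it
    return [a + b * x for x in target_x]

# Hypothetical donor-site data with a perfectly linear relationship.
print(impute_from_other_sites([1, 2, 3], [2, 4, 6], [4, 5]))
```

In practice the regression would use several correlated factors rather than one; the single predictor keeps the variance-explained check easy to follow.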
18
Strategy 5: Replacement within wave
• Identify remaining items with more than 1-2% missing and the feasibility of replacing via regression (or ML/MI)
• For the rest, sort data on key dimensions of variation and do modified weighted hot deck on the 2-3 people above or below
- we typically sort on a total symptom count and the baseline dependent variable within count, condition & site
- Can replace with mean, median or random choice – we have found that the median was more stable because of the skewed nature of several distributions and use it by default
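A simplified sketch of the within-wave replacement described above. It is an unweighted stand-in for the modified weighted hot deck, and the record layout (`stratum`, `sort_key`, `value`) and function name are hypothetical: records are grouped by condition and site, sorted on symptom count and the baseline dependent variable, and each missing value takes the median of the 2 nearest observed people above and below.

```python
from statistics import median

def hot_deck_median(records, k=2):
    """Fill missing 'value' fields with the median of nearby donors.

    Records are grouped by 'stratum' and sorted by 'sort_key'; each missing
    value is replaced by the median of up to k observed values above and up
    to k observed values below it in that order. Returns new records and
    never uses an imputed value as a donor.
    """
    by_stratum = {}
    for rec in records:
        by_stratum.setdefault(rec["stratum"], []).append(rec)

    filled = []
    for group in by_stratum.values():
        ordered = sorted(group, key=lambda r: r["sort_key"])
        for i, rec in enumerate(ordered):
            new = dict(rec)
            if rec["value"] is None:
                above = [r["value"] for r in reversed(ordered[:i])
                         if r["value"] is not None][:k]
                below = [r["value"] for r in ordered[i + 1:]
                         if r["value"] is not None][:k]
                donors = above + below
                if donors:   # stays None when the stratum has no donor
                    new["value"] = median(donors)
            filled.append(new)
    return filled

# Hypothetical records: sort_key = (symptom count, baseline dependent variable).
records = [
    {"id": 1, "stratum": ("site A", "control"), "sort_key": (3, 10), "value": 4},
    {"id": 2, "stratum": ("site A", "control"), "sort_key": (4, 12), "value": None},
    {"id": 3, "stratum": ("site A", "control"), "sort_key": (5, 8), "value": 6},
    {"id": 4, "stratum": ("site A", "control"), "sort_key": (6, 20), "value": 30},
    {"id": 5, "stratum": ("site A", "experimental"), "sort_key": (4, 11), "value": 100},
]
filled = hot_deck_median(records)
print({r["id"]: r["value"] for r in filled})
```

The median keeps the skewed donor (30) from dominating the replacement, and the experimental-condition record is never used as a donor for the control stratum.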
19
Understanding Multidimensional Nature can be used to Create Additional Strata for Replacement
[Figure: HIV risk groups plotted on risk dimensions. Dimensions: Female Sex Risk, Needle Risk, Male Sex Risk; legend: Crack Risk % Blue. Labeled groups: High Risk Needle Sharers, Male Sex Buyers, Female Sex Traders.]
Source: Dennis et al (2001)
20
Important to block on Condition in Experiments or Quasi-Experiments
[Figure: two panels, each with a vertical axis from 0% to 100%, comparing the Control and Experimental conditions.]
Unrestricted replacement would average out the real variance due to the experimental condition
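A tiny worked sketch of the blocking point, using made-up scores and simple mean replacement for brevity. Filling both groups from the pooled data pulls them toward each other, while filling within condition leaves the observed difference intact.

```python
from statistics import mean

# Hypothetical outcome scores; None marks a missing value.
control = [10, 12, None, 11]
experimental = [20, None, 22, 21]

def fill(values, replacement):
    """Replace every missing entry with the given value."""
    return [replacement if v is None else v for v in values]

def own_mean(values):
    """Mean of the observed entries only."""
    return mean(v for v in values if v is not None)

# Unrestricted: both groups are filled with the pooled mean.
grand_mean = own_mean(control + experimental)
diff_unrestricted = (mean(fill(experimental, grand_mean))
                     - mean(fill(control, grand_mean)))

# Blocked: each group is filled only from its own observed cases.
diff_blocked = (mean(fill(experimental, own_mean(experimental)))
                - mean(fill(control, own_mean(control))))

print(diff_unrestricted, diff_blocked)
```

With one of four values missing per group, the unrestricted fill already shrinks the group difference by a quarter in this example.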
21
Strategy 6: Replacement Across Waves
• Create a summary measure based on the average across waves times the expected number of waves to get a total (e.g., total days of abstinence)
- Works best when most people have only 1-2 waves of several (e.g., 4-8) missing
- The above can become biased if missing data by wave is high or systematic
• Can regress from first/last or all available to fill in
• Need to know the expected trajectory
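The summary-measure idea can be sketched as follows. The function name, the `max_missing=2` cutoff (taken from the "1-2 waves" guidance), and the example values are illustrative assumptions.

```python
def total_across_waves(wave_values, max_missing=2):
    """Average of observed waves x expected number of waves.

    `wave_values` has one entry per expected wave, with None for a missed
    wave. Returns None when more than `max_missing` waves are missing,
    since the estimate becomes less trustworthy as wave missingness grows.
    """
    observed = [v for v in wave_values if v is not None]
    n_missing = len(wave_values) - len(observed)
    if not observed or n_missing > max_missing:
        return None
    return sum(observed) / len(observed) * len(wave_values)

# Hypothetical days abstinent (out of 90) at four quarterly waves, one missed.
print(total_across_waves([60, None, 75, 90]))
```

This treats the missed wave as if it looked like the person's average wave, which is why the approach needs a known, roughly stable trajectory to be defensible.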
22
Special Case of A Curvilinear Trajectory
[Figure: line chart; vertical axis 0 to 35; horizontal axis Intake, 3, 6, 9, 12; series: Actual.]
Source: Godley et al (2004)
23
Special Case of A Curvilinear Trajectory
[Figure: line chart; vertical axis 0 to 35; horizontal axis Intake, 3, 6, 9, 12; series: Mean Replacement, Actual. Annotation: Very Biased.]
Source: Godley et al (2004)
24
Special Case of A Curvilinear Trajectory
[Figure: line chart; vertical axis 0 to 35; horizontal axis Intake, 3, 6, 9, 12; series: Mean Replacement, Avg of Neighbors, Actual. Annotation: Much less biased.]
Source: Godley et al (2004)
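A minimal sketch of the two replacement rules compared in these charts. The trajectory values are hypothetical and are not taken from Godley et al (2004); they only mimic a sharp change after intake followed by gradual drift.

```python
def neighbor_average(series, i):
    """Average of the nearest observed values before and after position i."""
    before = next((v for v in reversed(series[:i]) if v is not None), None)
    after = next((v for v in series[i + 1:] if v is not None), None)
    neighbors = [v for v in (before, after) if v is not None]
    return sum(neighbors) / len(neighbors) if neighbors else None

def mean_replacement(series, i):
    """Mean of all other observed waves, ignoring where they fall in time."""
    others = [v for j, v in enumerate(series) if j != i and v is not None]
    return sum(others) / len(others) if others else None

# Hypothetical curvilinear trajectory at Intake, 3, 6, 9, 12; wave 6 is missing.
series = [30, 8, None, 13, 15]
actual_at_6 = 10   # hypothetical true value, used only to compare errors

by_neighbors = neighbor_average(series, 2)
by_mean = mean_replacement(series, 2)
print(by_neighbors, by_mean)
```

Mean replacement drags the missing wave toward the high intake value, while the neighbor average follows the local shape of the curve. Neighbor averaging can still miss badly when the missing wave is itself the turning point, which is one reason the expected trajectory needs to be known.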
25
Strategy 7: Use of Maximum Likelihood (ML)
• Where possible, use ML or Restricted ML (RML) as part of software applications like AMOS, Stata, etc.
• Need to evaluate how much data it is replacing
• Need to be confident that the data are MAR (vs. MNAR) by virtue of small n missing, knowledge of the reason, or other analyses
• Restricted ML (RML) preferred to control for site, condition, and/or subject differences.
Alternative: We have not used them, but have been thinking about exploring some of the new methods of multiple imputation
26
References
• Dennis, M. L., Wechsberg, W. M., McDermeit (Ives), M., Campbell, R. S., & Rasch, R. R. (2001). The correlates and predictive validity of HIV risk groups among drug users in a community-based sample: Methodological findings from a multi-site cluster analysis. Evaluation and Program Planning, 24, 187-206.
• Godley, S. H., Dennis, M. L., Godley, M. D., & Funk, R. R. (2004). Thirty-month relapse trajectory cluster groups among adolescents discharged from outpatient treatment. Addiction, 99, 129-139.
• Riley, B. B., Conrad, K. J., Bezruczko, N., & Dennis, M. (in press). Relative precision, efficiency and construct validity of different starting and stopping rules for a Computerized Adaptive Test: The GAIN Substance Problem Scale. Journal of Applied Measurement.
• Schafer, J. L., & Graham, J. W. (2002). Missing data: Our view of the state of the art. Psychological Methods, 7, 147-177.
• Scott, C. K. (2004). A replicable model for achieving over 90% follow-up rates in longitudinal studies of substance abusers. Drug and Alcohol Dependence, 74, 21-36.