Handling Missing Data in the Analysis of CTN Trials: Pitfalls and Possible Solutions

Neal Oden, PhD, DSC2-EMMES

Gaurav Sharma, PhD, DSC2-EMMES

Paul Van Veldhuisen, PhD, DSC2-EMMES

Paul Wakim, PhD, CCTN, NIDA

CTN Design & Analysis Workshop

15 March 2011

Today’s Workshop

• The problem
• Prevention
• Types of missing data
• Analysis methods
• Case study
• Open discussion

Missing Data

• Information within a trial that is meaningful for analysis but not collected
• Focus here mostly on primary outcome data, but relevant to missing secondary outcomes and covariates too

Missing Data

• Randomization
  • Balances treatment groups for known and unknown factors
  • Lose benefits if there is drop-out, as groups at outcome may not have been similar at baseline
• Intention-to-treat principle
  • Violates principle if not all participants contribute to the primary analysis

Missing Data

• If missing unrelated to assigned treatment
  • Reduces statistical power
• If missing related to assigned treatment or to outcome
  • Biases the estimate of the treatment effect

Causes of Missing Data

• Due to discontinuation of study treatment
• Outcomes undefined for some participants
  • QOL measures after death
  • Quantitative drug use hair analysis in individuals without hair
• Test fails/specimen lost
• Attrition
  • Related to health status/drug use
  • Unrelated to health status/drug use (e.g., moved)

Continuing Data Collection for “Drop-Outs”

• Distinction between premature end of treatment AND end of study
• Does collecting data after premature end of treatment make sense?

Rationale

• Preserves intention-to-treat approach
• Many CTN trials are pragmatic trials
  • NOT “Does treatment work if perfectly delivered?”
  • but RATHER “Is this a good treatment strategy or policy?”
  • OR “What happens once treatment starts or is recommended?”

Rationale

• Delivery of medicine deals with people in the real world
  • A 100% efficacious cure for stimulant use is useless for public health if nobody can stand it.
• Strive to collect complete data for primary outcome on ALL participants, even in those who do not complete intervention
  • Too much missing data -> no way result will be believable, no matter how sophisticated the statistical method

Why Do We Like It?

• Weight loss diet
  • People on the effective arm lose weight and stay in the study
  • Some on the ineffective arm get discouraged and quit
  • If we analyzed only the people who stayed in the trial, the ineffective arm would look too good
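The weight loss example can be made concrete with a small simulation. This is only a sketch under invented assumptions: the effect sizes, the spread, and the dropout rule (people on the ineffective arm quit if they lost no weight) are hypothetical and are not taken from any CTN trial.

```python
import random
from statistics import mean


def simulate_diet_trial(n_per_arm=5000, seed=0):
    """Return (true_effect, completers_only_effect) in kg of weight loss."""
    rng = random.Random(seed)
    # Hypothetical weight loss in kg: effective arm averages 5, ineffective arm 0.
    effective = [rng.gauss(5.0, 3.0) for _ in range(n_per_arm)]
    ineffective = [rng.gauss(0.0, 3.0) for _ in range(n_per_arm)]
    # Assumed dropout rule: everyone on the effective arm stays; on the
    # ineffective arm, people who lost no weight get discouraged and quit.
    ineffective_completers = [w for w in ineffective if w > 0.0]
    true_effect = mean(effective) - mean(ineffective)
    completers_effect = mean(effective) - mean(ineffective_completers)
    return true_effect, completers_effect


true_effect, completers_effect = simulate_diet_trial()
print(f"effect using everyone randomized: {true_effect:.2f} kg")
print(f"effect using completers only:     {completers_effect:.2f} kg")
```

Under these assumptions the completers-only comparison understates the treatment effect, because the ineffective arm has lost its worst outcomes.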

Approaches to Missing Data

• Design and conduct of clinical trial that minimizes missing data
  • May require trade-offs with generalizability
• Apply analysis methods that use information in observed data to help analyze primary outcome data in the presence of missing data

“An ounce of prevention is worth a pound of cure” (B. Franklin)

http://www.ushistory.org/franklin/quotable/quote05.htm

Minimize Missing Data in Trial Design

• Flexible dose
• Target population
• Allow rescue therapy for poor responders
• Define primary outcomes that are highly ascertainable
• Minimize participant burden/reduce follow-up
  • Number of visits/assessments

Minimize Missing Data in Trial Conduct

• Explain importance of trial participation during consent process
• Emphasize to staff importance of maintaining follow-up even when treatment is refused
• Incentives
  • For participants, need to ensure level is not viewed as coercive

Minimize Missing Data in Trial Conduct

• Expression of thanks
  • Written/verbal
• Assistance with travel
• Reminders before visits
• Welcoming staff/friendly environment
• Keep locator information current
• Monitor and report to investigators extent of missing data

[Figure: Availability of Primary Outcome: Percent of Measures with Values (N=29 trials). Horizontal axis: CTN Study; vertical axis: Percent, 0 to 100.]

What’s the big deal?

We need N = 400 (based on power analysis)

But we expect 20% missing

So we set the initial N = 500

So that the final (analyzed) N = 400
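The arithmetic behind this inflation is simply N / (1 − expected missing fraction). A minimal sketch follows; the function name and the choice to pass the missing rate as a whole percent are illustrative assumptions. Note that it only restores the analyzed sample size. It does nothing about bias when missingness is related to treatment or outcome.

```python
def inflated_sample_size(n_required, expected_missing_percent):
    """Smallest enrolled N such that the expected analyzed N is at least n_required.

    expected_missing_percent is a whole number (e.g. 20 for 20%), which keeps
    the calculation in exact integer arithmetic.
    """
    if not 0 <= expected_missing_percent < 100:
        raise ValueError("expected_missing_percent must be in [0, 100)")
    remaining = 100 - expected_missing_percent
    # Ceiling division without floating point: ceil(a / b) == -(-a // b).
    return -(-n_required * 100 // remaining)


print(inflated_sample_size(400, 20))
```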

National Institute on Drug Abuse ─ National Institutes of Health ─ U.S. Department of Health and Human Services

Technical terms that we can’t escape…

Missing at random (MAR)

Missing completely at random (MCAR)

Missing not at random (MNAR)

Ignorable

Non-ignorable

… but what do they mean?

Missing Completely at Random (MCAR)

(Non-technical) Definition: The fact that Y is missing has nothing to do with the unobserved value of Y, or with other variables

Therefore: The set of participants with complete data can be regarded as a simple random (or representative) sample of all participants

What to do? Ignore the missing data and analyze the available data


Missing at Random (MAR)

(Non-technical) Definition: The fact that Y is missing can be explained by other observed values of Y, or by other measured variables

Therefore: The observed data can be used to account for the missing data

What to do? Use a Maximum Likelihood or Multiple Imputation approach, and include in the model the other measured variables that explain missingness

Missing Not at Random (MNAR)

(Non-technical) Definition: The fact that Y is missing cannot be explained by other observed values of Y, or by other measured variables

Therefore: The observed data cannot be used to account for the missing data; outside information is needed

In simple English: We have a problem

In Summary…


Missingness (i.e. whether the data are missing or not)

        is related to      is not related to
MCAR                       observed or unobserved data
MAR     observed data      unobserved data
MNAR    unobserved data

Based on Graham 2009

Bottom Line

MCAR: No big deal

MAR: Use available collected data to “explain” missing mechanism, and use existing statistical methods

MNAR: Need outside information to “explain” missing mechanism
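The three mechanisms can be compared in a toy simulation. Everything here is an illustrative assumption: a single observed covariate x, an outcome y = x + noise with true mean 0, and made-up missingness probabilities. The "adjusted" estimate fits a line of y on x among the observed cases and averages its predictions over everyone, a simple stand-in for the model-based methods discussed in this workshop.

```python
import random
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope


def keep_probability(mechanism, x, y):
    """Hypothetical chance that y is observed under each mechanism."""
    if mechanism == "MCAR":
        return 0.5                      # unrelated to anything
    driver = x if mechanism == "MAR" else y
    return 0.8 if driver < 0 else 0.2   # MAR: depends on x; MNAR: on y itself


def simulate_mechanisms(n=20000, seed=1):
    """Return {mechanism: (complete_case_mean, x_adjusted_mean)}; truth is 0."""
    rng = random.Random(seed)
    xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
    ys = [x + rng.gauss(0.0, 1.0) for x in xs]
    results = {}
    for mechanism in ("MCAR", "MAR", "MNAR"):
        kept = [(x, y) for x, y in zip(xs, ys)
                if rng.random() < keep_probability(mechanism, x, y)]
        kx = [x for x, _ in kept]
        ky = [y for _, y in kept]
        intercept, slope = fit_line(kx, ky)
        adjusted = mean([intercept + slope * x for x in xs])
        results[mechanism] = (mean(ky), adjusted)
    return results


for name, (cc, adj) in simulate_mechanisms().items():
    print(f"{name}: complete-case mean {cc:+.2f}, x-adjusted mean {adj:+.2f}")
```

In this setup the complete-case mean is fine under MCAR, the x-adjusted mean recovers the truth under MAR, and neither is trustworthy under MNAR.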


Ignorable & Non-Ignorable (roughly speaking)

Ignorable (available data are sufficient):
• Missing Completely At Random (MCAR)
• Missing At Random (MAR)

Non-Ignorable (need outside information):
• Missing Not At Random (MNAR)


Missing Data Analysis Methods

Complete Case and Pairwise Deletion

(Correlation Illustration)

      CC                PD
Y1  Y2  Y3        Y1  Y2  Y3
X   X   X         X   X   X
X   X   X         X   X   X
X   X   -         X   X   -
X   X   -         X   X   -

• Simple, default in statistical software
• Potential loss of information and precision
• Biased when missingness is not MCAR
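The correlation illustration can be sketched in code. The data values below are invented; `None` marks a missing measurement. Complete case analysis keeps only rows with every variable observed, while pairwise deletion keeps every row in which the two variables being correlated are both observed.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length lists."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sqrt(sxx * syy)


def cc_and_pd_correlation(rows, i, j):
    """Correlate columns i and j; return (r_cc, n_cc, r_pd, n_pd)."""
    complete = [r for r in rows if None not in r]
    pairwise = [r for r in rows if r[i] is not None and r[j] is not None]
    r_cc = pearson([r[i] for r in complete], [r[j] for r in complete])
    r_pd = pearson([r[i] for r in pairwise], [r[j] for r in pairwise])
    return r_cc, len(complete), r_pd, len(pairwise)


# Hypothetical data laid out like the illustration: Y3 is missing in two rows.
rows = [(1, 2, 3), (2, 4, 1), (3, 5, None), (4, 9, None)]
r_cc, n_cc, r_pd, n_pd = cc_and_pd_correlation(rows, 0, 1)
print(f"CC uses {n_cc} rows (r = {r_cc:.2f}); PD uses {n_pd} rows (r = {r_pd:.2f})")
```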

Single Imputation

• Impute a single value, e.g. mean, BOCF, LOCF, imputing missing as positive…
• Simple, artificially increases sample size
• Underestimates SE and gives incorrect p-values
• Most SI methods require MCAR assumptions to hold, while some, such as LOCF, require very strong and often unrealistic assumptions
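A short sketch of why single imputation understates the standard error, using mean imputation on made-up data (the sample size, distribution and missing fraction are all assumptions). Filling in the observed mean adds cases without adding information, so the spread shrinks and n grows at the same time.

```python
import random
from math import sqrt
from statistics import mean, stdev


def se_of_mean(values):
    """Usual standard error of a sample mean."""
    return stdev(values) / sqrt(len(values))


def mean_imputation_demo(n=200, missing_fraction=0.4, seed=2):
    """Return (SE from observed values only, SE after mean imputation)."""
    rng = random.Random(seed)
    y = [rng.gauss(50.0, 10.0) for _ in range(n)]
    # Values go missing completely at random in this toy example.
    observed = [v for v in y if rng.random() >= missing_fraction]
    filled = observed + [mean(observed)] * (n - len(observed))
    return se_of_mean(observed), se_of_mean(filled)


se_observed, se_filled = mean_imputation_demo()
print(f"SE from observed data:      {se_observed:.3f}")
print(f"SE after mean imputation:   {se_filled:.3f}  (too small)")
```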

Multiple Imputation (MI)

[Figure: observed data alongside imputations 1, 2, …, m]

• A simulation-based approach to missing data

The General Idea

(1) IMPUTATION: Incomplete Data -> Imputed Data
(2) ANALYSIS: Imputed Data -> Analysis Results
(3) POOLING: Analysis Results -> Final Results

(1) IMPUTATION Models

• The imputation model should include primary predictive variables and other variables associated with missingness
• Multiple Imputation method is robust even with approximate imputation models

(2) ANALYSIS Models

• Regression Model
• General Linear Model
• Generalized Linear Model (Logistic Regression, Poisson Regression)

(3) Rules for POOLING

• Each of the m imputed data sets gives an estimate and a variance: (Estimate 1, Variance 1), …, (Estimate m, Variance m)
• Mean of Estimate = average of the m estimates
• Within Variance + Between Variance = Total Variance
• Confidence Interval for Parameter of Interest is given by

$\text{Mean of Estimate} \pm t_{df}\,\sqrt{\text{Total Variance}}$
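The pooling step can be sketched as follows. This uses Rubin's rules, in which the between-imputation variance is inflated by (1 + 1/m) before being added to the within-imputation variance; the slide's "Within + Between" is the same idea stated roughly. The function names and the input numbers are illustrative, and the t critical value is passed in by the caller rather than computed here.

```python
from math import inf, sqrt
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Pool m estimates and their variances; return (estimate, total_var, df)."""
    m = len(estimates)
    pooled = mean(estimates)
    within = mean(variances)          # average of the m variances
    between = variance(estimates)     # sample variance of the m estimates
    inflated_between = (1.0 + 1.0 / m) * between
    total = within + inflated_between
    # Rubin's degrees of freedom; infinite if the imputations all agree.
    if between == 0:
        df = inf
    else:
        df = (m - 1) * (1.0 + within / inflated_between) ** 2
    return pooled, total, df


def confidence_interval(pooled, total, t_critical):
    """Mean of Estimate +/- t_df * sqrt(Total Variance)."""
    half_width = t_critical * sqrt(total)
    return pooled - half_width, pooled + half_width


# Hypothetical results from m = 3 imputed data sets.
pooled, total, df = pool_rubin([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
print(pooled, round(total, 4), round(df, 4))
```

The caller would look up the t critical value for the returned degrees of freedom in a table or a statistics library before calling `confidence_interval`.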

Desirable Features

• MI gives approximately unbiased estimates of all parameters
• MI provides good estimates of the standard errors
• MI can be used with many kinds of data and analyses without specialized software
• Requires MAR assumption

Maximum likelihood

• Basic idea
  • Given some data,
  • Try to guess the parameter(s) of the probability distribution that generated the data
• MLE of a parameter is the value that maximizes the probability of the data you already have

Example:

• Flip a coin, get 45 heads, 36 tails
• We don’t know p, but whatever it is:
  $\Pr(45\ \text{H in 81 tosses}) = K\,p^{45}(1-p)^{36}$
• How to guess p?
  • Pick the value of p that maximizes the probability of what already happened
  • Pick p to maximize $L = p^{45}(1-p)^{36}$
• Best guess turns out to be 45/81
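The coin example can be checked numerically. The sketch below maximizes the log of L (which has the same maximizer and avoids tiny numbers) with a simple ternary search; the search method and function names are illustrative choices, not part of the original slides.

```python
import math


def log_likelihood(p, heads, tails):
    """log L = heads * log(p) + tails * log(1 - p); the constant K is dropped."""
    return heads * math.log(p) + tails * math.log(1.0 - p)


def mle_by_search(heads, tails, iterations=200):
    """Ternary search for the p in (0, 1) that maximizes the log-likelihood."""
    lo, hi = 1e-9, 1.0 - 1e-9
    for _ in range(iterations):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if log_likelihood(m1, heads, tails) < log_likelihood(m2, heads, tails):
            lo = m1   # the maximum lies to the right of m1
        else:
            hi = m2   # the maximum lies to the left of m2
    return (lo + hi) / 2.0


p_hat = mle_by_search(45, 36)
print(f"numerical MLE: {p_hat:.6f}   45/81 = {45 / 81:.6f}")
```

Ternary search works here because the log-likelihood has a single peak on (0, 1).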

Maximum likelihood estimates have nice properties

• Consistent
• Asymptotically
  • Normal
  • Unbiased
  • Minimum variance
• etc.

New problem

• H = 45
• T = 36
• ? = 19
• Now how to guess p?
  • If we knew how many missing were H and how many T, we would know what to do.
  • But we don’t.
  • What to do?

A solution

• If data are MAR, you can get MLEs by
  • maximizing the (conditional) likelihood for the nonmissing data
  • ignoring the missing data mechanism.
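For the coin with 19 unknown flips, one way to see this is the EM algorithm, offered here as an illustrative sketch (the slides do not name EM). It assumes the 19 flips went missing for reasons unrelated to how they landed. Each pass fills in the expected number of heads among the missing flips, then re-estimates p from the completed counts; the fixed point is the observed-data answer, 45/81.

```python
def em_coin(heads, tails, missing, p_start=0.5, tol=1e-12, max_iter=1000):
    """EM estimate of P(heads) when `missing` flips have unknown outcomes."""
    n = heads + tails + missing
    p = p_start
    for _ in range(max_iter):
        expected_heads = heads + missing * p   # E-step: fill in expectations
        p_new = expected_heads / n             # M-step: re-estimate p
        if abs(p_new - p) < tol:
            return p_new
        p = p_new
    return p


p_hat = em_coin(45, 36, 19)
print(f"EM estimate: {p_hat:.6f}   observed-data MLE 45/81 = {45 / 81:.6f}")
```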

Important Application

• Longitudinal analysis
  • Participant 1, visit 1, 2, 3, …
  • Participant 2, visit 1, 2, 3, …
• For each visit, $y = a + b_1 x_1 + b_2 x_2 + \dots$
• First approach:
  • Treat all visits as independent
  • Do the regression on all visits together
  • Wrong, because visits from a single participant are related, not independent

Important Application (cont’d)

• Second approach
  • The visits from a single participant have covariance
  • Use a mixed model
• It used to be that you had to have all visits nonmissing for this analysis
• But modern software (SAS MIXED, GLIMMIX) ignores the missing-data mechanism and gets MLEs from only the nonmissing data, even if some visits are missing.
• If data are MAR, this is fine!

Modern longitudinal ML software uses more data

[Figure: grid of participants (rows, 1 to 5) by visits (columns, 1 to 4), with each cell marked as a complete or incomplete visit. Annotations: “Older CC analysis would use only these cases”; “Neither old nor new method can use this visit”. Legend: Complete visit; Incomplete visit.]
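The difference in how much data each approach uses can be counted directly. The layout below (5 participants, 4 visits, `None` for an incomplete visit) is invented to resemble the figure and is not the actual pattern shown on the slide.

```python
def usable_visits(visits_by_participant):
    """Return (visits used by complete-case analysis, visits used by ML methods)."""
    complete_case = sum(
        len(visits)
        for visits in visits_by_participant.values()
        if all(v is not None for v in visits)   # participant has every visit
    )
    all_available = sum(
        1
        for visits in visits_by_participant.values()
        for v in visits
        if v is not None                        # any nonmissing visit counts
    )
    return complete_case, all_available


# Hypothetical outcomes (1 = positive, 0 = negative, None = missing).
data = {
    1: [1, 0, 1, 1],
    2: [1, None, None, None],
    3: [0, 0, None, 1],
    4: [1, 1, 1, 0],
    5: [None, 1, 0, None],
}
cc, available = usable_visits(data)
print(f"complete-case analysis uses {cc} visits; ML on available data uses {available}")
```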

Another application

• Survival analysis
  • Example: time to relapse
  • For some people, you have the time
  • For others, you don’t because
    • Study ended
    • People died
    • People dropped out
    • etc.
  • People without relapse times are said to be CENSORED

Another application (cont’d)

• For censored people, you don’t know the relapse time, but you know it is after the censor time
• Survival analysis handles censored data, but
  • You have to make the assumption that censoring is noninformative.
  • If people drop out because they know they are going to relapse the next day, the censoring is informative.
  • Informative censoring gives biased survival time estimates
• The “noninformative censoring” assumption is basically an MAR assumption.
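As one concrete example of how survival analysis uses censored observations, here is a minimal Kaplan-Meier sketch (the slides do not single out a particular estimator, and the data are made up). A censored person counts in the risk set up to the censor time and then drops out of it without counting as a relapse. The estimate is only trustworthy if censoring is noninformative.

```python
def kaplan_meier(times, events):
    """Kaplan-Meier curve as [(time, survival_probability)] at each event time.

    times  : follow-up time for each participant
    events : 1 if relapse was observed at that time, 0 if censored
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        n_events = 0
        n_at_t = 0
        # Gather everyone whose follow-up ends at time t.
        while i < len(data) and data[i][0] == t:
            n_events += data[i][1]
            n_at_t += 1
            i += 1
        if n_events > 0:
            survival *= 1.0 - n_events / at_risk
            curve.append((t, survival))
        at_risk -= n_at_t   # relapsed and censored people both leave the risk set
    return curve


# Hypothetical weeks to relapse; 0 marks a censored participant.
for t, s in kaplan_meier([2, 3, 3, 5, 8], [1, 0, 1, 1, 0]):
    print(f"week {t}: estimated proportion relapse-free = {s:.2f}")
```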

What if data are not MAR?

• When the missing data are nonignorable (i.e., MNAR), standard statistical models can yield badly biased results
• Cannot test MAR versus MNAR

Sensitivity Analysis

• The missing data mechanism is not identifiable from observed data
  • We don’t know what we don’t know
• One or more analyses can be performed using different assumptions
  • Example: Worst Case Analysis (won’t work with a lot of missing data)
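A worst case analysis for a binary outcome can be sketched as bounds: count every missing outcome against the treatment arm and in favor of the control arm, then do the reverse. The counts and function names below are hypothetical. The width of the resulting interval shows why the approach stops being useful when a lot of data are missing.

```python
def success_rate_bounds(n_success, n_failure, n_missing):
    """Lowest and highest possible success rate given the missing outcomes."""
    total = n_success + n_failure + n_missing
    lowest = n_success / total                  # all missing were failures
    highest = (n_success + n_missing) / total   # all missing were successes
    return lowest, highest


def treatment_difference_bounds(treatment, control):
    """Bounds on (treatment rate - control rate); inputs are (success, failure, missing)."""
    t_low, t_high = success_rate_bounds(*treatment)
    c_low, c_high = success_rate_bounds(*control)
    worst_case = t_low - c_high   # everything missing goes against treatment
    best_case = t_high - c_low    # everything missing goes in favor of treatment
    return worst_case, best_case


# Hypothetical arms of 100 participants each, 20 missing outcomes per arm.
worst, best = treatment_difference_bounds((40, 40, 20), (30, 50, 20))
print(f"difference in success rates lies between {worst:+.2f} and {best:+.2f}")
```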

Goals of Sensitivity Analysis

• Consider a range of potential associations between missingness and response
• Assess the degree to which conclusion can be influenced by the missingness mechanism
  • If the conclusion is largely unchanged the result may be considered robust
  • Otherwise, the conclusion should be interpreted cautiously and may be misleading

MNAR models

• Use of non-ignorable models can be helpful in conducting a sensitivity analysis
• Not necessarily a good idea to rely on a single MNAR model, because the assumptions about the missing data are impossible to assess with the observed data
• One should use MNAR models sensibly, possibly examining several types of such models for a given dataset

Two general classes of MNAR models

• Selection Models – use model for the full data response and a selection mechanism
• Pattern Mixture Models – use mixture of missing data pattern information in the model

Case Study: CTN0010 - BUP for Adolescents

Two groups: Bup/Nal detoxification over 2 weeks vs. Bup/Nal maintenance over 12 weeks

N (analyzed) = 152 at 6 community treatment programs

Main outcome measure: Opioid-positive urine test result at weeks 4, 8 & 12

Evaluation: weekly for 12 weeks, comprehensive at 4, 8, 12, 24, 36 & 52 weeks


Woody, JAMA 2008

Missingness in CTN0010 (from Paul Allison’s analysis)

20 participants had missing outcome for all 12 weeks (effective sample size = N – 20)

Available Data (after removing the 20 cases)

Week        1   2   3   4   5   6   7   8   9   10  11  12
% present   90  74  60  78  48  45  44  69  40  37  37  67

Paul Allison’s Analysis

• Included in the model each of Weeks 1 to 12

• Used Maximum Likelihood Estimation (MLE) and Multiple Imputation (MI) approaches (MLE is preferred over MI)

• Used random effects (mixed) logit model with SAS PROC GLIMMIX


Take-Home Messages

1) Model all the available outcome data at all time points, including outcome at baseline (t=0), and then test the time points (contrasts) of interest

2) There are good data analytic methods for dealing with missing data in repeated-measures designs (under MAR assumption): use random effects (mixed) models estimated by maximum likelihood

3) Allow for a linear and quadratic time trend (saves degrees of freedom), or spline model (broken line)

4) If no time-related pattern, use time as a class variable, i.e. each time point is a category (not continuous)


Take-Home Messages (cont’d)

5) Imputing missing outcomes as positive is a crude approach – one can often do better

6) Incorporate covariates and auxiliary variables

7) Sensitivity analysis is absolutely vital


References

Allison, Missing Data, Sage University Papers Series on Quantitative Applications in the Social Sciences, 07-136, Thousand Oaks, CA: Sage, 2001.

Fitzmaurice, Laird & Ware, Applied Longitudinal Analysis, Wiley, 2004.

Graham, Missing Data Analysis: Making It Work in the Real World, Annual Review of Psychology, 2009, 60: 549-576.

Liang & Zeger, Longitudinal Data Analysis of Continuous and Discrete Responses for Pre-Post Designs, Sankhya, 2000, 62(B): 134-148.

Weiss, An Introduction to Modeling Longitudinal Data, presentation at UCLA CALDAR Summer Institute on Longitudinal Research, August 2010.

Woody et al., Extended vs Short-term Buprenorphine-Naloxone for Treatment of Opioid-Addicted Youth: A Randomized Trial, JAMA, 2008, 300(17): 2003-2011.


Contact Information

Neal Oden: [email protected]

Gaurav Sharma: [email protected]

Paul Van Veldhuisen: [email protected]

Paul Wakim: [email protected]


Questions & Comments

