Handling XHandling X--side Missing side Missing DihMData ... · Contents X-side and Y-side Missing...

44
Handling X Handling X-side Missing side Missing D ihM D ihM l Data with M Data with Mplus plus Tor Neilands Tor Neilands Center for AIDS Prevention Studies Center for AIDS Prevention Studies November 19, 2010 November 19, 2010

Transcript of Handling XHandling X--side Missing side Missing DihMData ... · Contents X-side and Y-side Missing...

  • Handling XHandling X--side Missing side Missing D i h MD i h M llData with MData with Mplusplus

    Tor NeilandsTor NeilandsCenter for AIDS Prevention StudiesCenter for AIDS Prevention StudiesC o S v o S d sC o S v o S d s

    November 19, 2010November 19, 2010

  • Contents

    X-side and Y-side Missing DataMi i D M h i Missing Data Mechanisms

    Approaches for Addressing Missing Data in Statistical A lAnalyses

    ExamplesE l 1 Li R i Example 1: Linear Regression

    Example 2: Logistic Regression with Clustered DataLogistic Regression with Clustered Data Example 3: Multilevel AnalysisExample 3: Multilevel Analysis Example 3: Multilevel AnalysisExample 3: Multilevel Analysis Example 4: MI for Clustered Categorical DataExample 4: MI for Clustered Categorical Data

    MMplusplus Availability & ResourcesAvailability & Resources and Acknowledgements MMplusplus Availability & ResourcesAvailability & Resources and Acknowledgements

  • Missing Data OverviewMissing Data OverviewMissing Data OverviewMissing Data Overview Missing data are ubiquitous in applied quantitative Missing data are ubiquitous in applied quantitative

    didistudiesstudies Don’t know/don’t remember/refused responses on crossDon’t know/don’t remember/refused responses on cross--

    i l d lfi l d lf d i i dd i i dsectional surveys and selfsectional surveys and self--administered paper surveysadministered paper surveys Skip patternsSkip patterns Interviewer error/AInterviewer error/A--CASI programming errors or CASI programming errors or

    omissions.omissions.i di l l f lli di l l f ll Longitudinal loss to followLongitudinal loss to follow--upup

    3

  • Preventing Missing DataPreventing Missing DataPreventing Missing DataPreventing Missing Data

    Prevention is the best first stepPrevention is the best first step Prevention is the best first stepPrevention is the best first step AA--CASI, CAPI, etc. CASI, CAPI, etc. Rigorous retention protocols for participant trackingRigorous retention protocols for participant tracking Rigorous retention protocols for participant tracking, Rigorous retention protocols for participant tracking,

    etc.etc. Diane Bill and Lance’s work with flexibleDiane Bill and Lance’s work with flexible Diane, Bill, and Lance s work with flexible Diane, Bill, and Lance s work with flexible

    interviewing methods. interviewing methods. Asking longitudinal study participants if theyAsking longitudinal study participants if they Asking longitudinal study participants if they Asking longitudinal study participants if they

    anticipate barriers to returning for followanticipate barriers to returning for follow--up visits, up visits, then problem solving those issues. then problem solving those issues. p gp g

    4

  • Missing Data MechanismsMissing Data MechanismsMissing Data MechanismsMissing Data Mechanisms What mechanisms lead to missing data?What mechanisms lead to missing data? Rubin’s taxonomy of missing data mechanismsRubin’s taxonomy of missing data mechanisms

    Rubin (1976), Rubin (1976), BiometrikaBiometrika MCAR: Missing Completely at RandomMCAR: Missing Completely at Random MAR: Missing at RandomMAR: Missing at Randomgg NMAR: Not Missing at RandomNMAR: Not Missing at Random

    Also known as MNAR (Missing Not at Random). Also known as MNAR (Missing Not at Random). ( g )( g )

    Good articles that spell this out:Good articles that spell this out: Schafer & Graham, 2002, Schafer & Graham, 2002, Psychological MethodsPsychological Methods

    5

    , ,, , y g M dy g M dGraham, 2009, Graham, 2009, Annual Review of PsychologyAnnual Review of Psychology

  • Missing at Random (MAR)Missing at Random (MAR)Missing at Random (MAR)Missing at Random (MAR)

    DefineDefine RR as an indicator of (non)as an indicator of (non)missingnessmissingness forforDefine Define RR as an indicator of (non)as an indicator of (non)missingnessmissingness for for variable variable YY. . RR = 1 if = 1 if YY is observed; is observed; RR = 0 if = 0 if YY is missing. is missing.

    Denote Denote YYcompletecomplete as the complete data. Partition as the complete data. Partition pp ppYYcompletecomplete as as YYcompletecomplete = (= (YYobservedobserved, , YYmissingmissing))pp (( gg))

    MAR occurs when the distribution of MAR occurs when the distribution of missingnessmissingness does does not depend on the values of Y that would have been not depend on the values of Y that would have been observed had Y not been missing:observed had Y not been missing: P(P(RR|Y|Ycompletecomplete) = P() = P(R|YR|Yobservedobserved))

    6

  • Missing Completely at Random Missing Completely at Random (MCAR)(MCAR)

    Put another way, MAR allows the probabilities of Put another way, MAR allows the probabilities of y, py, pmissingnessmissingness to depend on observed data, but not on to depend on observed data, but not on missing data. missing data.

    MAR is a much less restrictive assumption than MCAR. MAR is a much less restrictive assumption than MCAR. MCAR is a special case of MAR where the distribution of MCAR is a special case of MAR where the distribution of

    missing data does not depend on missing data does not depend on YYobservedobserved, also:, also: P(P(RR|Y|Ycompletecomplete) = P() = P(RR) )

    If incomplete data are MCAR, the cases with complete If incomplete data are MCAR, the cases with complete data are then a random subset of the original sample. data are then a random subset of the original sample.

    7

  • Not Missing at Random (NMAR)Not Missing at Random (NMAR)Not Missing at Random (NMAR)Not Missing at Random (NMAR)

    The probability that Y is missing is a function of The probability that Y is missing is a function of Y itself. Y itself.

    Missing data mechanism must be modeled to Missing data mechanism must be modeled to ggobtain good parameter estimates. Examples:obtain good parameter estimates. Examples: Heckman’s selection modelHeckman’s selection model Pattern mixture modelsPattern mixture models

    Disadvantages of NMAR modeling: RequiresDisadvantages of NMAR modeling: Requires Disadvantages of NMAR modeling: Requires Disadvantages of NMAR modeling: Requires high level of knowledge about missingness high level of knowledge about missingness mechanism; results are often sensitive to themechanism; results are often sensitive to themechanism; results are often sensitive to the mechanism; results are often sensitive to the choice of NMAR model selected.choice of NMAR model selected.

    8

  • MCAR, MAR, NMAR RevisitedMCAR, MAR, NMAR RevisitedMCAR, MAR, NMAR RevisitedMCAR, MAR, NMAR Revisited

    From Schafer & Graham, 2002, p. 151: Another way toFrom Schafer & Graham, 2002, p. 151: Another way toFrom Schafer & Graham, 2002, p. 151: Another way to From Schafer & Graham, 2002, p. 151: Another way to think about MAR, MCAR, and NMAR: If you have think about MAR, MCAR, and NMAR: If you have observed data X and incomplete data Y, and assuming observed data X and incomplete data Y, and assuming independence of observations:independence of observations: MCAR indicates that the probability of Y being missing MCAR indicates that the probability of Y being missing

    f d d df d d dfor a participant does not depend her values on X or Y.for a participant does not depend her values on X or Y. MAR indicates that the probability of Y being missing MAR indicates that the probability of Y being missing

    for the participant may depend on her X values but notfor the participant may depend on her X values but notfor the participant may depend on her X values but not for the participant may depend on her X values but not her Y values.her Y values.

    NMAR indicates that the probability of Y being missing NMAR indicates that the probability of Y being missing NM R indicates that the probability of Y being missingNM R indicates that the probability of Y being missingdepends on the participant’s actual Y values. depends on the participant’s actual Y values.

    9

  • ListwiseListwise Deletion of Missing DataDeletion of Missing DataListwiseListwise Deletion of Missing DataDeletion of Missing Data

    Standard statistical programs typically delete the whole case Standard statistical programs typically delete the whole case f l i if i bl ’ l i if l i if i bl ’ l i ifrom an analysis if one or more variables’ values are missing from an analysis if one or more variables’ values are missing ((listwiselistwise deletion). Consequences:deletion). Consequences:If i i d d MCARIf i i d d MCAR If missing data are due to MCAR: If missing data are due to MCAR: Parameter estimates unbiased but standard errors are enlarged and Parameter estimates unbiased but standard errors are enlarged and

    power for hypothesis testing is reducedpower for hypothesis testing is reducedpower for hypothesis testing is reducedpower for hypothesis testing is reduced

    If missing data are due to MAR:If missing data are due to MAR: Parameter estimates may be biased, standard errors enlarged, and Parameter estimates may be biased, standard errors enlarged, and y , g ,y , g ,

    power for hypothesis testing reducedpower for hypothesis testing reduced

    If missing data are due to NMAR: If missing data are due to NMAR:

    10

    Parameter estimates may be biased standard errors enlarged, and Parameter estimates may be biased standard errors enlarged, and power for hypothesis testing reducedpower for hypothesis testing reduced

  • Occurrence ofOccurrence of MissingnessMissingness TypesTypesOccurrence of Occurrence of MissingnessMissingness TypesTypes

    MCAR: Missing Completely at RandomMCAR: Missing Completely at Random A very stringent assumption unlikely to be met in practiceA very stringent assumption unlikely to be met in practice Example: computer failure loses some cases’ data but not othersExample: computer failure loses some cases’ data but not others

    MAR Mi i R dMAR Mi i R d MAR: Missing at RandomMAR: Missing at Random Much more likely to be met in practice, especially in social and Much more likely to be met in practice, especially in social and

    behavioral research (Schafer & Graham, 2002, behavioral research (Schafer & Graham, 2002, Psychological MethodsPsychological Methods))( , ,( , , y gy g )) NMAR: Not Missing at RandomNMAR: Not Missing at Random

    Unknown. MCAR vs. MAR can be formally tested via statistical tests, Unknown. MCAR vs. MAR can be formally tested via statistical tests, b MAR NMAR b db MAR NMAR b dbut MAR vs. NMAR cannot be tested. but MAR vs. NMAR cannot be tested.

    Inclusion of measures during the study design phase that are likely to Inclusion of measures during the study design phase that are likely to be correlated with subsequent data be correlated with subsequent data missingnessmissingness can help to minimize can help to minimize

    11

    qq gg ppNMAR NMAR missingnessmissingness. .

    Some NMAR Some NMAR missingnessmissingness may be inevitable, however. may be inevitable, however.

  • Addressing Missing DataAddressing Missing DataAddressing Missing DataAddressing Missing Data

    Ad hocAd hoc methods such as methods such as listwiselistwise deletion, deletion, pairwisepairwise deletion, deletion, or substitution of the variable’s mean value usually assume or substitution of the variable’s mean value usually assume MCAR and are not recommended. See Paul Allison’s 2002 MCAR and are not recommended. See Paul Allison’s 2002 Sage publication for a readable treatment of the reasonsSage publication for a readable treatment of the reasonsSage publication for a readable treatment of the reasons Sage publication for a readable treatment of the reasons why these methods don’t usually work well. why these methods don’t usually work well.

    ListwiseListwise deletion may yield unbiased results in some deletion may yield unbiased results in some y yy ycircumstances, however: circumstances, however:

    Regression models where the probability of missing data Regression models where the probability of missing data th i d d t i bl d t d d thth i d d t i bl d t d d thon the independent variables does not depend on the on the independent variables does not depend on the

    value of the dependent variable (Allison, 2002, pp. 6value of the dependent variable (Allison, 2002, pp. 6--7). 7).

    12

  • Addressing Missing DataAddressing Missing DataAddressing Missing DataAddressing Missing Data Regression models where the probability of Regression models where the probability of missingnessmissingness on Y on Y

    d d X l ( i td d X l ( i t d d td d t i ii i S LittlS Littldepends on X values (covariatedepends on X values (covariate--dependent dependent missingnessmissingness. See: Little, . See: Little, 1995, 1995, JASAJASA). ).

    In general, estimates of sample statistics such as means are In general, estimates of sample statistics such as means are g , pg , pmore biased due to missing data than are regression more biased due to missing data than are regression parameter estimates (Little & Rubin, 2002: parameter estimates (Little & Rubin, 2002: Statistical Analysis Statistical Analysis

    ith Mi i D tith Mi i D t Wil 2002)Wil 2002)with Missing Datawith Missing Data, Wiley, 2002). , Wiley, 2002). Remember, though, that efficiency in the regression analysis Remember, though, that efficiency in the regression analysis

    context is reduced due to missing data You can lose a lot ofcontext is reduced due to missing data You can lose a lot ofcontext is reduced due to missing data. You can lose a lot of context is reduced due to missing data. You can lose a lot of statistical power, especially if there are many cases and statistical power, especially if there are many cases and missing data patterns, and the number of complete cases is missing data patterns, and the number of complete cases is a small fraction of the original number of cases.a small fraction of the original number of cases.

    13

  • Addressing Missing DataAddressing Missing DataAddressing Missing DataAddressing Missing Data

    NMARNMAR missingnessmissingness can only be addressedcan only be addressedNMAR NMAR missingnessmissingness can only be addressed can only be addressed through explicitly assuming a given through explicitly assuming a given missingnessmissingness mechanism, which can lead to mechanism, which can lead to ggsuboptimal results if the incorrect suboptimal results if the incorrect missingnessmissingnessmechanism is specified (Allison, 2002). mechanism is specified (Allison, 2002).

    There is even some evidence that methods There is even some evidence that methods that assume MAR that assume MAR missingnessmissingness may may

    f h h f NMARf h h f NMARoutperform other approaches for NMAR outperform other approaches for NMAR situations (situations (MuthenMuthen, Kaplan, & Hollis, 1987, , Kaplan, & Hollis, 1987, PsychometrikaPsychometrika))PsychometrikaPsychometrika). ).

    14

  • Methods for MAR MissingnessMethods for MAR MissingnessMethods for MAR MissingnessMethods for MAR Missingness

    Ibrahim Ibrahim ((JASAJASA, 2005), 2005) reviewed four general approaches and reviewed four general approaches and ((JJ )) g ppg ppfound all to perform about equally well:found all to perform about equally well: Inverse censoring weightsInverse censoring weights Fully Bayesian analysisFully Bayesian analysis Multiple imputation (MI)Multiple imputation (MI)

    Di i lik lih d i i (ML)Di i lik lih d i i (ML) Direct maximum likelihood estimation (ML)Direct maximum likelihood estimation (ML)

    A full treatment of each technique is beyond the scope of A full treatment of each technique is beyond the scope of today’s presentation We will concentrate on how to employtoday’s presentation We will concentrate on how to employtoday s presentation. We will concentrate on how to employ today s presentation. We will concentrate on how to employ MMplusplus to address Xto address X--side missingness using direct maximum side missingness using direct maximum likelihood (ML) and multiple imputation (MI) under the likelihood (ML) and multiple imputation (MI) under the

    15

    e ood ( ) a d p e p a o ( ) de ee ood ( ) a d p e p a o ( ) de eassumption of MAR for incomplete data. assumption of MAR for incomplete data.

  • XX--side and Yside and Y--sideside MissingnessMissingnessXX side and Yside and Y side side MissingnessMissingness

    Some software programs implicitly incorporate direct Some software programs implicitly incorporate direct p g p y pp g p y pmaximum likelihood handling of an outcome variable Y. maximum likelihood handling of an outcome variable Y. These are typically mixed models routines that can be These are typically mixed models routines that can be employed to analyze longitudinal data with missing outcomesemployed to analyze longitudinal data with missing outcomes PROCs MIXED, GLIMMIX, and NLMIXED in SASPROCs MIXED, GLIMMIX, and NLMIXED in SAS

    MIXED i SPSSMIXED i SPSS MIXED in SPSSMIXED in SPSS StataStata --xtxt-- commands (there are many) and commands (there are many) and --gllammgllamm--

    However these commands will drop the whole observationHowever these commands will drop the whole observation However, these commands will drop the whole observation However, these commands will drop the whole observation when one or more X values are missing. when one or more X values are missing.

    They cannot conveniently be used to handle crossThey cannot conveniently be used to handle cross--sectionalsectional

    16

    They cannot conveniently be used to handle crossThey cannot conveniently be used to handle cross--sectional sectional missing data. missing data.

  • MMplusplus for Addressing Missing Datafor Addressing Missing DataMMplus plus for Addressing Missing Datafor Addressing Missing Data

    Some of the most important developments in handling nonSome of the most important developments in handling non--p p gp p gnormal and incomplete data arose in the latent variable normal and incomplete data arose in the latent variable (structural equation modeling) field in the 1990s.(structural equation modeling) field in the 1990s.

    For many years, EQS had the best nonFor many years, EQS had the best non--normal data routines normal data routines and AMOS had the most userand AMOS had the most user--friendly implementation of friendly implementation of di ML i bl f i hdi ML i bl f i h i l di l ddirect ML suitable for use with crossdirect ML suitable for use with cross--sectional and sectional and longitudinal Xlongitudinal X--side and Yside and Y--side missing data. side missing data.

    I th l t 1990 B t d Li d M thé d l d MI th l t 1990 B t d Li d M thé d l d Mplpl In the late 1990s, Bengt and Linda Muthén developed MIn the late 1990s, Bengt and Linda Muthén developed Mplusplus, a , a general latent variable modeling program that included and general latent variable modeling program that included and enhanced these approaches to include among other thingsenhanced these approaches to include among other things

    17

    enhanced these approaches to include, among other things, enhanced these approaches to include, among other things, the ability to handle categorical outcome variables. the ability to handle categorical outcome variables.

  • MMplus plus for Addressing Missing Datafor Addressing Missing Datapp g gg g

    MMplusplus is mainly known as a latent variable modeling is mainly known as a latent variable modeling b i fi id l f ib i fi id l f iprogram, but it can fit a wide class of regression program, but it can fit a wide class of regression

    models, including linear, binary and ordinal logistic, models, including linear, binary and ordinal logistic, d d l (P i i bi i l h dld d l (P i i bi i l h dlcount data models (Poisson, negative binomial, hurdle, count data models (Poisson, negative binomial, hurdle,

    zerozero--inflated Poisson, zeroinflated Poisson, zero--inflated negative inflated negative bi i l) d i dbi i l) d i d i (C )i (C )binomial), and parametric and nonbinomial), and parametric and non--parametric (Cox) parametric (Cox) discrete and continuous time survival models. discrete and continuous time survival models.

    For clustered data structures, robust standard errors For clustered data structures, robust standard errors with or without weights and stratification are available with or without weights and stratification are available

    18

    for complex survey data. Multilevel modeling with two for complex survey data. Multilevel modeling with two levels is available for modellevels is available for model--based multilevel analyses. based multilevel analyses.

  • How to transfer data to MHow to transfer data to Mplusplus: : A very brief guideA very brief guideA very brief guideA very brief guide

    MMplusplus only accepts delimited ASCII (raw text data as input)only accepts delimited ASCII (raw text data as input) Delimeter can be spaces commas etcDelimeter can be spaces commas etc Delimeter can be spaces, commas, etc.Delimeter can be spaces, commas, etc.

    My approach: Use Stat/Transfer to save a spaceMy approach: Use Stat/Transfer to save a space--delimeted delimeted text file with variable names in the first row and a specific text file with variable names in the first row and a specific ppvalue (e.g., value (e.g., --999) to denote missing values for all variables. 999) to denote missing values for all variables.

    Change the file extension from Change the file extension from .csv.csv to to .txt.txt.. Open the text data file with Notepad and cut the first row Open the text data file with Notepad and cut the first row

    containing the variable names into Mcontaining the variable names into Mplusplus. Delete the . Delete the resulting blank line in the data file so that the first row ofresulting blank line in the data file so that the first row ofresulting blank line in the data file so that the first row of resulting blank line in the data file so that the first row of the data file starts with the first case (i.e., with actual data). the data file starts with the first case (i.e., with actual data).

    Paste the variable names into MPaste the variable names into Mplusplus and remove the and remove the

    19

    ppquotation marks. Note that Mquotation marks. Note that Mplus plus prefers variable names to prefers variable names to be only 8 characters or less in length. be only 8 characters or less in length.

  • Example 1: Linear RegressionExample 1: Linear Regressionp gp g Example from Paul Allison’s monograph Example from Paul Allison’s monograph Missing DataMissing Data (Sage (Sage

    Publications, 2002, p. 21)Publications, 2002, p. 21), , p ), , p ) NN = 1,302 US colleges and universities= 1,302 US colleges and universities Outcome: Outcome: gradratgradrat (ratio of graduating seniors to number (ratio of graduating seniors to number

    ll d f i l * 100)ll d f i l * 100)enrolled four years previously * 100)enrolled four years previously * 100) Explanatory variables: Explanatory variables:

    CSATCSAT –– Combined mean scores on verbal and math SATCombined mean scores on verbal and math SAT CSAT CSAT Combined mean scores on verbal and math SATCombined mean scores on verbal and math SAT LNEnrollLNEnroll –– Natural log of number of freshmenNatural log of number of freshmen Private Private –– 0 = public school; 1 = private school0 = public school; 1 = private school

    St fSt f R ti f t d t t f lt * 100R ti f t d t t f lt * 100 StufacStufac –– Ratio of students to faculty * 100Ratio of students to faculty * 100 RMBrdRMBrd –– Total cost of room & board in thousands of dollarsTotal cost of room & board in thousands of dollars

    Auxiliary variable: ACT Auxiliary variable: ACT –– Mean ACT scoreMean ACT score

    20

    yy Not in the regression model, but correlated with CSAT. Not in the regression model, but correlated with CSAT.

    Only Private has complete data for all 1,302 colleges. Only Private has complete data for all 1,302 colleges.

  • Example 1: Linear RegressionExample 1: Linear Regressionp gp g NN = 455 cases with complete data. This would be the default = 455 cases with complete data. This would be the default

    availableavailable NN for multiple linear regression analysis infor multiple linear regression analysis in StataStataavailable available NN for multiple linear regression analysis in for multiple linear regression analysis in StataStata, , SAS, SPSS, etc. SAS, SPSS, etc.

    And MAnd Mplusplus, too: by default, M, too: by default, Mplus plus assumes that any xassumes that any x--side side pp yy pp yyvariable is fixed and drops cases with missing data for xvariable is fixed and drops cases with missing data for x--side side variables. If, however, we are willing to convert xvariables. If, however, we are willing to convert x--side side variables with missing data to random variables withvariables with missing data to random variables withvariables with missing data to random variables with variables with missing data to random variables with distributions and assume multivariate normality of the distributions and assume multivariate normality of the underlying joint distribution, those variables can be brought underlying joint distribution, those variables can be brought into the likelihood and the cases with partial information will into the likelihood and the cases with partial information will be included in the analysis. See be included in the analysis. See http://www statmodel com/verhistory shtmlhttp://www statmodel com/verhistory shtml “Analysis“Analysis

    21

    http://www.statmodel.com/verhistory.shtmlhttp://www.statmodel.com/verhistory.shtml Analysis Analysis Conditional on Covariates” for more information on the Conditional on Covariates” for more information on the thinking behind this selection of this default behavior. thinking behind this selection of this default behavior.

  • Example 1: Linear RegressionExample 1: Linear Regressionp gp g See MSee Mplus plus model output files:model output files:

    Allison1.out: Allison1.out: Default listwise deletion of cases with missing X data (Default listwise deletion of cases with missing X data (N N 455)455)= 455). = 455).

    Allison2.out: Allison2.out: Direct ML analysis (Direct ML analysis (N N = 1,302) with ML standard errors.= 1,302) with ML standard errors. Allison3.out: Allison3.out: Direct ML analysis (Direct ML analysis (NN = 1,302) with robust standard = 1,302) with robust standard

    errors.errors. Allison4.out: Allison4.out: Direct ML analysis (Direct ML analysis (NN = 1,302) with bootstrap standard = 1,302) with bootstrap standard

    errors and biaserrors and bias--corrected confidence intervals based on 5,000 corrected confidence intervals based on 5,000 b lb lbootstrap samples. bootstrap samples.

    Results: All regression coefficients are significantly different Results: All regression coefficients are significantly different from zero across all analyses, except for studentfrom zero across all analyses, except for student--faculty ratio:faculty ratio:y , py , p yy Listwise analysis result: Listwise analysis result: BB = = --.123, .123, SESE = .131, = .131, pp = .347= .347 Direct ML result: Direct ML result: BB = = --.194, .194, SESE = .102, = .102, pp = .058= .058 Direct ML robust SEs:Direct ML robust SEs: BB == 194194 SESE = 106= 106 pp = 068= 068

    22

    Direct ML, robust SEs: Direct ML, robust SEs: BB = = --.194, .194, SESE = .106, = .106, pp = .068= .068 Direct ML, bootstrap SEs: Direct ML, bootstrap SEs: BB = = --.194, .194, SESE = .123, = .123, pp = .116= .116

    Bootstrap 95% biasBootstrap 95% bias--corrected CI: corrected CI: --.391, .099.391, .099

  • Example 1: Linear RegressionExample 1: Linear RegressionExample 1: Linear RegressionExample 1: Linear Regression

    The number of MThe number of Mplusplus input and output files caninput and output files can The number of MThe number of Mplusplus input and output files can input and output files can grow rapidly for any given project. For instance, grow rapidly for any given project. For instance, in the previous example we have 4 input + 4in the previous example we have 4 input + 4in the previous example we have 4 input + 4 in the previous example we have 4 input + 4 output files = 8 Moutput files = 8 Mplusplus files.files.

    An alternative forAn alternative for StataStata users is to use theusers is to use the An alternative for An alternative for StataStata users is to use the users is to use the runmplusrunmplus utility that enables calls to Mutility that enables calls to Mplusplus from from withinwithin StataStata Input file syntax and data areInput file syntax and data arewithin within StataStata. Input file syntax and data are . Input file syntax and data are passed to the Mpassed to the Mplusplus engine from engine from StataStata and and results are passed back toresults are passed back to StataStata to display in theto display in theresults are passed back to results are passed back to StataStata to display in the to display in the StataStata Results window. Results window.

    23

  • Example 1: Linear RegressionExample 1: Linear RegressionExample 1: Linear RegressionExample 1: Linear Regression

    StataStata do filedo file usnews Mplus Models dousnews Mplus Models do exampleexample StataStata do file do file usnews_Mplus_Models.do usnews_Mplus_Models.do example example syntax blocksyntax block: :

    ** ** ----> Allison 2 (Direct ML estimation > Allison 2 (Direct ML estimation -- N = N = 1,302)

  • Example 1: Linear RegressionExample 1: Linear Regressionp gp g Features of the analysis:Features of the analysis:

    Direct ML handling of incomplete XDirect ML handling of incomplete X--side and Yside and Y--side data.side data.g pg p Easy incorporation of auxiliary variable(s) into the analysis. Easy incorporation of auxiliary variable(s) into the analysis.

    The automatic auxiliary variable inclusion feature is available only for The automatic auxiliary variable inclusion feature is available only for singlesingle level analyses involving all continuous outcomes or mediatorslevel analyses involving all continuous outcomes or mediatorssinglesingle--level analyses involving all continuous outcomes or mediators. level analyses involving all continuous outcomes or mediators. You can still include auxiliary variables manually in other types of You can still include auxiliary variables manually in other types of analyses, however. analyses, however.

    Th LISTWISE ON ti i il bl i th MTh LISTWISE ON ti i il bl i th Mplpl DATADATA The LISTWISE = ON option is available in the MThe LISTWISE = ON option is available in the Mplusplus DATA DATA command to explicitly request a listwise analysis. Useful for: command to explicitly request a listwise analysis. Useful for:

    Verifying that your Mplus model specification is correct by Verifying that your Mplus model specification is correct by y g y p p yy g y p p ycomparing Mcomparing Mplusplus results to those from the same model fitted in results to those from the same model fitted in SAS, Stata, etc. SAS, Stata, etc.

    Use with care: LISTWISE = ON will delete cases with both YUse with care: LISTWISE = ON will delete cases with both Y--

    25

    Use with care: LISTWISE = ON will delete cases with both YUse with care: LISTWISE = ON will delete cases with both Y--side and Xside and X--side missing data.side missing data.

  • Example 1: Linear RegressionExample 1: Linear RegressionExample 1: Linear RegressionExample 1: Linear Regression

    Robust standard errors (MRobust standard errors (Mplusplus estimator MLR) andestimator MLR) and Robust standard errors (MRobust standard errors (Mplusplus estimator MLR) and estimator MLR) and the bootstrap are available to address assumption the bootstrap are available to address assumption violations due to nonviolations due to non--normal or normal or heteroskedasticheteroskedasticoutcome data.outcome data.

    Mplus MLR estimator assumes MCAR Mplus MLR estimator assumes MCAR missingnessmissingnessMplus MLR estimator assumes MC RMplus MLR estimator assumes MC R missingnessmissingnessand finite fourthand finite fourth--order moments (i.e., kurtosis is order moments (i.e., kurtosis is nonnon--zero); initial simulation studies show zero); initial simulation studies show low SE low SE ))bias bias for this estimator with MAR data. See for this estimator with MAR data. See http://www.statmodel.com/download/webnotes/http://www.statmodel.com/download/webnotes/mc2.pdfmc2.pdf for more information about this estimator. for more information about this estimator.

    26

  • Example 2: Logistic Regression with Example 2: Logistic Regression with Clustered DataClustered DataClustered DataClustered Data

    Gay Couples Study (Colleen Hoff, PI)Gay Couples Study (Colleen Hoff, PI) St d of 566 ga male co ples from San FranciscoSt d of 566 ga male co ples from San Francisco Study of 566 gay male couples from San FranciscoStudy of 566 gay male couples from San Francisco Outcome: Any unprotected anal intercourse in the past three months Outcome: Any unprotected anal intercourse in the past three months

    with an outside partner of discordant or unknown with an outside partner of discordant or unknown serostatusserostatus(UAIOUTDU 1 0 )(UAIOUTDU 1 0 )(UAIOUTDU; 1 = yes, 0 = no). (UAIOUTDU; 1 = yes, 0 = no).

    Explanatory variables:Explanatory variables: Relationship length in years (coupleRelationship length in years (couple--level)level) NonNon--monogamous (couplemonogamous (couple--level; 1 = yes, 0 = no)level; 1 = yes, 0 = no) Age in years (individualAge in years (individual--level)level) CESCES--D depression (individualD depression (individual--level)level)

    S l i (i di id lS l i (i di id l l l)l l) Sexual agreement investment (individualSexual agreement investment (individual--level)level) Couple Couple serostatusserostatus (couple(couple--level; 1 = level; 1 = --//--; 2 = +/; 2 = +/--; 3 = +/+); 3 = +/+)

    Missing data: Missing data: 131 f 1 132 (11 6%) did h i l d131 f 1 132 (11 6%) did h i l d

    27

    131 out of 1,132 men (11.6%) did not report having a sexual agreement and so 131 out of 1,132 men (11.6%) did not report having a sexual agreement and so were skipped out of the sexual agreement questions. were skipped out of the sexual agreement questions.

  • Example 2: Logistic Regression with Example 2: Logistic Regression with Clustered DataClustered DataClustered DataClustered Data

    Analysis: Logistic regression of UAIOUTDU onto the Analysis: Logistic regression of UAIOUTDU onto the explanatory variables listed on the previous slide.explanatory variables listed on the previous slide.

    Clustered data structure: 1,132 individual men nested Clustered data structure: 1,132 individual men nested within 566 couples. within 566 couples.

    Requires logistic regression approach that allows for Requires logistic regression approach that allows for q g g ppq g g ppclustered data. Some possibilitiesclustered data. Some possibilities: : SubjectSubject--specific coefficients:specific coefficients:jj pp

    Multilevel generalized linear mixed modeling (GLMM)Multilevel generalized linear mixed modeling (GLMM)

    PopulationPopulation--average coefficients:average coefficients:

    28

    Logistic regression analysis with robust standard errorsLogistic regression analysis with robust standard errors GEEGEE

  • Example 2: Logistic Regression with Example 2: Logistic Regression with Clustered DataClustered DataClustered DataClustered Data

    W ill h b d d h ThiW ill h b d d h Thi We will use the robust standard error approach. This We will use the robust standard error approach. This is similar to using PROC SURVEYLOGISTIC in SAS is similar to using PROC SURVEYLOGISTIC in SAS

    l i il i i i h hi h h ll i ii i SS IIor or --logisticlogistic-- with the with the --clustercluster-- option in option in StataStata. In . In MMplusplus, we use the MLR estimator with the CLUSTER , we use the MLR estimator with the CLUSTER

    i d h COMPLEX l ii d h COMPLEX l ioption and the COMPLEX analysis type. option and the COMPLEX analysis type. But… a default But… a default listwiselistwise deletion analysis would use deletion analysis would use NN

    = 1,001, a 12% reduction from the original = 1,001, a 12% reduction from the original NN of of 1,132.1,132.

    29

  • Example 2: Logistic Regression with Example 2: Logistic Regression with Clustered DataClustered DataClustered DataClustered Data

    How to handle the missing data?How to handle the missing data?Multiple imputation? This would require reshaping Multiple imputation? This would require reshaping

    of long data into a wide format; imputing of long data into a wide format; imputing agreement investment scores; backagreement investment scores; back--transposing to transposing to the long format, and analyzing. Tedious. the long format, and analyzing. Tedious.

    Direct ML in MDirect ML in Mplusplus: Run once with the original : Run once with the original long data structure. long data structure. gg

    30

  • Example 2: Logistic Regression Example 2: Logistic Regression with Clustered Datawith Clustered Data

    Compare the default analysis results whereCompare the default analysis results where Compare the default analysis results where Compare the default analysis results where cases with missing xcases with missing x--data are excluded to an data are excluded to an

    l i h h i l d d bl i h h i l d d banalysis where they are included by analysis where they are included by specifying the xspecifying the x--variables with missing data variables with missing data as random variables. as random variables. Default analysis results (Default analysis results (HoffDemo1.outHoffDemo1.out))y (y ( ffff ))Direct ML analysis results (Direct ML analysis results (HoffDemo2.outHoffDemo2.out))

    31

  • Example 2: Logistic Regression Example 2: Logistic Regression with Clustered Datawith Clustered Data

    Eff Li i OR ( ) Di ML OR ( )Effect Listwise OR (p) Direct ML OR (p)

    HIV-Discordant Couples 2.16 (.004) 2.30 (.001)

    HIV-Positive Couples 1.82 (.034) 1.43 (.168)

    Relationship Length 1.03 (.125) 1.02 (.280)p g ( ) ( )

    Non-Monogamous 9.68 (

  • Example 2: Logistic Regression with Example 2: Logistic Regression with Clustered DataClustered DataClustered DataClustered Data

    Features of the analysisFeatures of the analysis Binary outcome variableBinary outcome variable Binary outcome variableBinary outcome variable Clustered data structureClustered data structure Direct ML handling of incomplete XDirect ML handling of incomplete X--side dataside data Direct ML handling of incomplete XDirect ML handling of incomplete X--side dataside data

    33

  • Example 3: Multilevel Analysis Example 3: Multilevel Analysis (Linear Mixed Model)(Linear Mixed Model)(Linear Mixed Model)(Linear Mixed Model)

    Study of 99,966 HIVStudy of 99,966 HIV--positive patients from five geographic positive patients from five geographic subregions (North America; East Africa; West Africa; Southsubregions (North America; East Africa; West Africa; Southsubregions (North America; East Africa; West Africa; South subregions (North America; East Africa; West Africa; South Africa; Asia) measured repeatedly following initiation of ART. Africa; Asia) measured repeatedly following initiation of ART.

    Outcome: CD4 countOutcome: CD4 count Outcome: CD4 countOutcome: CD4 count Explanatory variables: Explanatory variables:

    SexSex Sex Sex Age Age Baseline CD4 (represented by three restricted cubic spline variables)Baseline CD4 (represented by three restricted cubic spline variables)( p y p )( p y p ) Baseline viral load stage (five stages via four dummy variables)Baseline viral load stage (five stages via four dummy variables) On AZT at baselineOn AZT at baseline

    34

    On Efavirenz at baselineOn Efavirenz at baseline

  • Example 3: Multilevel AnalysisExample 3: Multilevel Analysisp yp y Missing data: Missing data:

    Baseline viral load category: N = 24,486 (24.5%) spread across theBaseline viral load category: N = 24,486 (24.5%) spread across the Baseline viral load category: N 24,486 (24.5%) spread across the Baseline viral load category: N 24,486 (24.5%) spread across the five geographic subregionsfive geographic subregions

    Baseline Efavirenz: N = 3,055 (3.06%) in the S. Africa group onlyBaseline Efavirenz: N = 3,055 (3.06%) in the S. Africa group only Baseline AZT: N = 2 852 (2 85%) in the South Africa group onlyBaseline AZT: N = 2 852 (2 85%) in the South Africa group only Baseline AZT: N = 2,852 (2.85%) in the South Africa group onlyBaseline AZT: N = 2,852 (2.85%) in the South Africa group only

    All other explanatory variables and the outcome were fully All other explanatory variables and the outcome were fully observed. observed.

    Goal: Fit a linear mixed model with linear splines representing Goal: Fit a linear mixed model with linear splines representing three time windows (1.5 mosthree time windows (1.5 mos--4 mos; 4 mos4 mos; 4 mos--12 mos; 12 mos12 mos; 12 mos--36 mos) with random slopes and intercepts with a completely 36 mos) with random slopes and intercepts with a completely ) p p p y) p p p yunstructured covariance matrix among the random effects. unstructured covariance matrix among the random effects.

    Main effects only for all covariates but baseline CD4. Main effects only for all covariates but baseline CD4. M ll d i d bM ll d i d b

    35

    Moreover, all mean and covariance parameters need to be Moreover, all mean and covariance parameters need to be groupgroup--specific. Time was centered at 4 months. specific. Time was centered at 4 months.

  • Example 3: Multilevel AnalysisExample 3: Multilevel Analysisp yp y

    If we analyze these data with SAS PROC MIXED or If we analyze these data with SAS PROC MIXED or StataStata xtmixedxtmixed the analysisthe analysis NN is 72 731 a 27% loss ofis 72 731 a 27% loss ofStataStata --xtmixedxtmixed--, the analysis , the analysis NN is 72,731, a 27% loss of is 72,731, a 27% loss of observations. observations.

    Using direct maximum likelihood in MUsing direct maximum likelihood in Mplusplus the fullthe full NN Using direct maximum likelihood in MUsing direct maximum likelihood in Mplusplus, the full , the full NNof 99,966 can be used. of 99,966 can be used.

    Sidebar: Convergence with the full model could notSidebar: Convergence with the full model could not Sidebar: Convergence with the full model could not Sidebar: Convergence with the full model could not be obtained with SAS or be obtained with SAS or StataStata (even with multiply (even with multiply imputed data) but was possible with Mimputed data) but was possible with Mplusplus in thisin thisimputed data), but was possible with Mimputed data), but was possible with Mplusplus in this in this particular scenario. particular scenario. In other scenarios, SAS or In other scenarios, SAS or StataStata may converge, but Mmay converge, but Mplusplus

    36

    ,, y g ,y g , ppmay not. may not.

  • Example 3: Multilevel AnalysisExample 3: Multilevel Analysisp yp y Refer to the handout Refer to the handout model_d_3spline_sqrt.out.model_d_3spline_sqrt.out. Features of the analysis:Features of the analysis:Features of the analysis:Features of the analysis:

    Uses all available dataUses all available data Separate variances and covariances among random effects Separate variances and covariances among random effects p v v gp v v g

    are easily implemented within the multiple groups approach are easily implemented within the multiple groups approach (55 random effects estimated). (55 random effects estimated).

    Various model parameters are easily fixed or set to be Various model parameters are easily fixed or set to be equal. Nonequal. Non--focal covariate effects are set equal across focal covariate effects are set equal across splines and geographic regions via named parameterssplines and geographic regions via named parameterssplines and geographic regions via named parameters. splines and geographic regions via named parameters.

    ModelModel--based standard errors used here, but MCAR robust based standard errors used here, but MCAR robust MLR standard errors are available if desired. MLR standard errors are available if desired.

    37

    PostPost--estimation of spline knots at 4, 12, & 36 months is estimation of spline knots at 4, 12, & 36 months is available via the MODEL CONSTRAINT command. available via the MODEL CONSTRAINT command.

  • Example 4: MI for Clustered Example 4: MI for Clustered Categorical DataCategorical Data

    The range of model types that can be fitted in MThe range of model types that can be fitted in Mplus plus g ypg yp ppusing direct ML or Bayesian estimationusing direct ML or Bayesian estimation is very broad. is very broad.

    However, sometimes you may still want to generateHowever, sometimes you may still want to generateHowever, sometimes you may still want to generate However, sometimes you may still want to generate multiple imputations. Examples:multiple imputations. Examples: To obtain regression diagnostics which are more readilyTo obtain regression diagnostics which are more readily To obtain regression diagnostics, which are more readily To obtain regression diagnostics, which are more readily

    available in programs like SAS and Stataavailable in programs like SAS and Stata To conveniently obtain specific tests or assumptions (e.g., To conveniently obtain specific tests or assumptions (e.g., To co ve e t y obta spec c tests o assu pt o s (e.g.,To co ve e t y obta spec c tests o assu pt o s (e.g.,

    score test for proportional odds for ordinal logistic score test for proportional odds for ordinal logistic regression; Hosmerregression; Hosmer--Lemeshow lackLemeshow lack--ofof--fit test for logistic fit test for logistic

    38

    regression, etc.). regression, etc.).

  • Example 4: MI for Clustered Example 4: MI for Clustered Categorical DataCategorical Data

    SAS and SAS and StataStata have wellhave well--developed suites for generating developed suites for generating l i l i i d h i f j il i l i i d h i f j imultiple imputations under the assumption of joint multiple imputations under the assumption of joint

    multivariate normalitymultivariate normality This approach is robust to nonThis approach is robust to non--normality if the subsequent datanormality if the subsequent data This approach is robust to nonThis approach is robust to non normality if the subsequent data normality if the subsequent data

    analyses are carried out using approaches that address the nonanalyses are carried out using approaches that address the non--normality (Schafer, 1997, p. 147normality (Schafer, 1997, p. 147--148). 148).

    Though it is not large there is still some bias in this approachThough it is not large there is still some bias in this approach Though it is not large, there is still some bias in this approach, Though it is not large, there is still some bias in this approach, however (Horton, however (Horton, LipsitzLipsitz, , ParzenParzen, American Statistician, 2003)., American Statistician, 2003).

    Multiple imputations via chained equations (MICE) provides Multiple imputations via chained equations (MICE) provides p p q ( ) pp p q ( ) pan alternative approach for imputing categorical data. an alternative approach for imputing categorical data.

    StataStata has a MICE implementation via the userhas a MICE implementation via the user--written written ii f i i IID (i d d d id i llf i i IID (i d d d id i ll

    39

    program program ––iceice-- for imputing IID (independent and identically for imputing IID (independent and identically distributed; i.e., distributed; i.e., unclusteredunclustered data)data)

  • Example 4: MI for Clustered Example 4: MI for Clustered Categorical DataCategorical Data

    For repeated measures data with a small number of fixed time For repeated measures data with a small number of fixed time points, it is possible to transpose or reshape a multiple record points, it is possible to transpose or reshape a multiple record long data file into a multiple variable wide data file with one long data file into a multiple variable wide data file with one

    d bj t t th i t ti d thd bj t t th i t ti d threcord per subject, generate the imputations, and then record per subject, generate the imputations, and then retransform the imputed data back into the long file format retransform the imputed data back into the long file format used by mixed models and GEE software routinesused by mixed models and GEE software routinesused by mixed models and GEE software routines.used by mixed models and GEE software routines.

    Handling clustered data with unequally spaced repeated Handling clustered data with unequally spaced repeated measurements or nonmeasurements or non--longitudinal clustered data structures is longitudinal clustered data structures is easu e e ts o oeasu e e ts o o o g tud a c uste ed data st uctu es so g tud a c uste ed data st uctu es smore difficult. For instance, one could impute within each more difficult. For instance, one could impute within each cluster and include dummy variables indicating cluster cluster and include dummy variables indicating cluster

    40

    membership in the imputation model (Graham, 2009, membership in the imputation model (Graham, 2009, Annual Annual Review of PsychologyReview of Psychology). ).

  • Example 4: MI for Clustered Example 4: MI for Clustered Categorical DataCategorical Data

    MMplus plus 6 allows for imputation of clustered continuous and 6 allows for imputation of clustered continuous and pp ppordered categorical data assuming random intercepts for ordered categorical data assuming random intercepts for clusters and an unrestricted covariance structure among clusters and an unrestricted covariance structure among th i blth i blthe variables. the variables.

    It is also possible to impute data under a more complex It is also possible to impute data under a more complex random effects structure (e g random intercepts andrandom effects structure (e g random intercepts andrandom effects structure (e.g.,. random intercepts and random effects structure (e.g.,. random intercepts and random slopes), but the analyst must specify a specific random slopes), but the analyst must specify a specific model under which the data must be imputed. If an model under which the data must be imputed. If an ppordered categorical variable is represented in the model as ordered categorical variable is represented in the model as a series of dummy variables, this can be problematic a series of dummy variables, this can be problematic b Mb M ll i 1 l i l d i bli 1 l i l d i bl

    41

    because Mbecause Mplusplus may assign 1s to multiple dummy variables may assign 1s to multiple dummy variables during the imputation process. during the imputation process.

  • Example 4: MI for Clustered Example 4: MI for Clustered Categorical DataCategorical Data

    Impute missing values for Impute missing values for EfavirenzEfavirenz (binary), AZT (binary), and viral (binary), AZT (binary), and viral load WHO status (ordinal) among South African participants in theload WHO status (ordinal) among South African participants in theload WHO status (ordinal) among South African participants in the load WHO status (ordinal) among South African participants in the previous example (previous example (NN = 65,401). = 65,401). As noted previously, only As noted previously, only EfavirenzEfavirenz, AZT, and baseline VL category had , AZT, and baseline VL category had

    missing valuesmissing values 41,185 (63%) had complete data41,185 (63%) had complete data 21,137 (32%) were missing baseline VL but nothing else21,137 (32%) were missing baseline VL but nothing else 2,547 (3.9%) were missing 2,547 (3.9%) were missing EfavirenzEfavirenz and AZT, but nothing elseand AZT, but nothing else 305 (

  • MMplusplus Availability & ResourcesAvailability & Resourcespp yy For CAPS investigators and staff, two MFor CAPS investigators and staff, two Mplusplus licenses are accessible licenses are accessible

    via the Citrix terminal server system.via the Citrix terminal server system.via the Citrix terminal server system. via the Citrix terminal server system. For all others or remote use without an internet connection, For all others or remote use without an internet connection,

    licenses are available for purchase at licenses are available for purchase at http://www.statmodel.comhttp://www.statmodel.com. . Online user’s guide available at this same URL, along with many Online user’s guide available at this same URL, along with many

    examples and a discussion forum. examples and a discussion forum. Sample programs and Monte Carlo counterparts come with theSample programs and Monte Carlo counterparts come with the Sample programs and Monte Carlo counterparts come with the Sample programs and Monte Carlo counterparts come with the

    program. program. Extra manuals from previous versions are in the CAPS reading Extra manuals from previous versions are in the CAPS reading

    room. room. For For StataStata users, there is a userusers, there is a user--written ado program that enables written ado program that enables

    StataStata do files to pass Mdo files to pass Mplusplus syntax to the Msyntax to the Mplusplus engine fromengine from StataStata

    43

    StataStata do files to pass Mdo files to pass Mplusplus syntax to the Msyntax to the Mplusplus engine from engine from StataStata. . http://sites.google.com/site/lvmworkshop/home/runmplushttp://sites.google.com/site/lvmworkshop/home/runmplus--stuffstuff

  • AcknowledgementsAcknowledgementsgg Slide Reviewers:Slide Reviewers:

    Dee Chakravarty, MSDee Chakravarty, MSyy Steve Gregorich, PhDSteve Gregorich, PhD Estie Hudes, PhD, MPHEstie Hudes, PhD, MPH

    StataStata--toto--MMplusplus utility: utility: Adam Carle, PhD, James M. Anderson Center for Health Systems Adam Carle, PhD, James M. Anderson Center for Health Systems

    E ll Ci i i OHE ll Ci i i OHExcellence, Cincinnati, OHExcellence, Cincinnati, OH Richard Jones, ScD, Institute for Aging Research, Hebrew Richard Jones, ScD, Institute for Aging Research, Hebrew

    SeniorLife, Boston, MASeniorLife, Boston, MASe o e, os o ,Se o e, os o ,

    Data Sharing:Data Sharing: Colleen Hoff, PhD, PI: Gay Couples StudyColleen Hoff, PhD, PI: Gay Couples Study

    44

    , , y p y, , y p y Jeff Martin, MD, MPH & Elvin Geng, MD, MPH: NAJeff Martin, MD, MPH & Elvin Geng, MD, MPH: NA--ACCORD ACCORD

    CD4 data CD4 data