Working with Missing Values

Alan C. Acock, February 2007

Supporting material is available at www.oregonstate.edu/~acock/missing


Why are the Values Missing: The reason instructs the solution

By Design: Completely Random
– Missing Completely at Random (MCAR)
– 50% of items selected randomly for each interview
– 50% randomly selected for follow-up
– Effective when there are too many items or high costs

Intentionally Missing: Researcher Controlled
– Boys not asked about age at first menstruation
– Drop from analysis
– Sometimes unintentionally imputed
– Imputing doesn’t necessarily hurt


Why are the Values Missing

Refusals: We may know the mechanism
– Adjusted for gender, race, education
– May be missing at random
– Otherwise, bias is likely without auxiliary variables

Missing because of “don’t know” responses
– Between agree and disagree?
– Can we impute a better value?
– Should we?


Why are the Values Missing

Missing by researcher error
– May be missing completely at random
– May reflect researcher bias
– Perceived risk to researcher
– Missing observation worse than missing value

Code the reason each value is missing
– NLSY97 uses 5 types of missing values
– Treat each differently


Why are the Values Missing

• Understand why each value is missing

• Delete observations or variables where you do not intend to impute a value

– Drop variable

– Drop observation


Four Questions

• Do I want to have a value for this person?

• Is the value missing completely at random, or

• Do I have auxiliary variables that explain why it is missing, and

• Do I have covariates that predict the score?


Patterns of Missing Values

MISSING DATA PATTERNS
Patterns: 1 2 3 4 5 6 7 8 9 10
[x = value observed; the column positions of the x marks were lost in extraction]
HLTH      x x x x
CHILDS    x x x x x x x x x x
HAP_GEN   x x x x x
INCOME98  x x x x x x
AGE       x x x x x x x x
EDUC      x x x x x

– What is the problem with
  • HLTH?
  • INCOME98?
  • EDUC?


Patterns of Missing Values

MISSING DATA PATTERN FREQUENCIES

Pattern  Freq    Pattern  Freq    Pattern  Freq
1        550     5        27      9        4
2        81      6        2       10       14
3        77      7        12
4        30      8        21

• Throw out 81 people in pattern 2?
• We have data on five of the six variables
• Income might not be a key predictor
• Why is health missing in patterns 5 to 10? Was this by design?


Amount of Missing Values

PROPORTION OF DATA PRESENT

           HLTH  CHILDS  HAP_GEN  INC   AGE   EDUC
HLTH       .90
CHILDS     .90   1.00
HAP_GEN    .77   .82     .82
INCOME98   .76   .83     .70      .83
AGE        .90   .99     .81      .82   .99
EDUC       .77   .82     .82      .70   .81   .822

• Income is low with educ, hlth, hap_gen
• If income is “just” a control variable, find a substitute or impute
• Over 50% of cases for all the combinations
• Could be worse if you did 3-way (hlth, income, educ)
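A "proportion of data present" table like the one on this slide can be computed by counting, for each pair of variables, the share of cases in which both are observed. The sketch below is only illustrative: the four toy rows and the helper name proportion_present are assumptions, not the data or software behind the slide.

```python
from itertools import combinations_with_replacement

# Hypothetical toy data; None marks a missing value.
rows = [
    {"hlth": 3, "income": 40, "educ": 12},
    {"hlth": None, "income": 55, "educ": 16},
    {"hlth": 2, "income": None, "educ": None},
    {"hlth": 4, "income": None, "educ": 14},
]


def proportion_present(rows, a, b):
    """Share of rows in which both variable a and variable b are observed."""
    both = sum(1 for r in rows if r[a] is not None and r[b] is not None)
    return both / len(rows)


variables = ["hlth", "income", "educ"]

# Lower triangle plus diagonal; the diagonal is the coverage of one variable.
coverage = {
    (a, b): proportion_present(rows, a, b)
    for a, b in combinations_with_replacement(variables, 2)
}

for (a, b), p in coverage.items():
    print(f"{a:>7} x {b:<7} {p:.2f}")
```

The diagonal entries give the coverage of a single variable, and the off-diagonal entries show how many cases a pairwise analysis would actually use.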


Raw Data Missingness

Raw data ("." = missing)

ID  Var1  Var2  Var3
1   9     7     .
2   .     3     5
3   7     4     .
4   9     4     6
5   6     2     7
6   .     .     5

Missingness indicators (1 = missing, 0 = observed)

ID  D1  D2  D3
1   0   0   1
2   1   0   0
3   0   0   1
4   0   0   0
5   0   0   0
6   1   1   0
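The step from raw data to missingness indicators can be sketched in a few lines. The values are the six cases shown on this slide; representing "." as None and counting patterns with a Counter are choices made for this illustration only.

```python
from collections import Counter

# Raw data from the slide; None stands for a missing value (".").
raw = {
    1: (9, 7, None),
    2: (None, 3, 5),
    3: (7, 4, None),
    4: (9, 4, 6),
    5: (6, 2, 7),
    6: (None, None, 5),
}

# D = 1 where the value is missing, 0 where it is observed.
indicators = {
    case_id: tuple(1 if v is None else 0 for v in values)
    for case_id, values in raw.items()
}

# Count how often each missingness pattern (D1, D2, D3) occurs.
pattern_counts = Counter(indicators.values())

for case_id, d in indicators.items():
    print(case_id, *d)
print(pattern_counts.most_common())
```

The printed indicators reproduce the D1, D2, D3 table, and the pattern counts are the small-scale analogue of a missing data pattern frequency table.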


Missing Completely at Random (MCAR)

• Missingness is random: D1, D2, D3 are uncorrelated with anything!
• Correlate variables with D1, D2, D3 (or use logistic regression)
• Consider race, gender, age, education
• None of these should be correlated with D1, D2, or D3
• This is not correlating variables with the raw score!
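A rough version of this check correlates a covariate with a missingness indicator rather than with the raw score. The sketch below uses invented age and income values and a hand-written Pearson correlation; it is an assumption-laden illustration, and a real check would use a significance test or a logistic regression on a full sample.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data: income is missing more often for older respondents.
age = [25, 31, 38, 44, 52, 59, 63, 70]
income = [30, None, 42, 51, None, 48, None, None]

# Missingness indicator for income: 1 = missing, 0 = observed.
d_income = [1 if v is None else 0 for v in income]

# Correlate the covariate with the indicator, not with income itself.
r = pearson(age, d_income)
print(f"correlation between age and income missingness: {r:.2f}")
```

A correlation near zero for every covariate is consistent with MCAR; a clearly nonzero correlation, as in this toy example, suggests the covariate belongs in the analysis as an auxiliary variable.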


Missing at Random (MAR)

• Missingness is a random pattern after you control for
– Variables in your analysis
– Auxiliary variables
– Probability of missingness NOT dependent on unobserved variables
• Correlate variables with D1, D2, D3
• Consider auxiliary variables: race, gender, age, education


Missing at Random (MAR)

• Include auxiliary variables as mechanisms for missingness
– If they are correlated significantly with the missingness indicators D1, D2, D3

• Data is MAR after controlling auxiliary variables

• Auxiliary variables available in many datasets


Problem with Traditional Approaches

Listwise deletion: the standard default
– It excludes many observations (50%?)
– A case may be missing only one variable, and that variable may not be important
– In longitudinal program evaluations
  • Missing those with low level of implementation
– If MCAR, this reduces power, but is unbiased
– Without MCAR this is biased
– Political Science Journal: 50% deleted


Problem with Traditional Approaches

Mean Substitution

– Mean often bad estimate

– Attenuates variance

– Reduces effects of variables with missing data, or

– Exaggerates effects of variables with little missing data

– Reduces R²
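The variance attenuation can be seen with a small numeric sketch. The five observed scores and the count of five missing cases are invented for illustration; only the standard library is used.

```python
import statistics

# Hypothetical variable: five observed scores and five missing ones.
observed = [2.0, 4.0, 6.0, 8.0, 10.0]
n_missing = 5

# Mean substitution: every missing case receives the observed mean.
mean_obs = statistics.mean(observed)
filled = observed + [mean_obs] * n_missing

var_before = statistics.variance(observed)  # sample variance, observed cases
var_after = statistics.variance(filled)     # sample variance after filling

print(f"mean stays at {statistics.mean(filled):.1f}")
print(f"variance before: {var_before:.2f}, after: {var_after:.2f}")
```

The mean is unchanged, but every filled case sits exactly at the mean and adds nothing to the sum of squares, so the variance shrinks as the share of filled cases grows.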


Problem with Traditional Approaches

Pairwise Deletion (rarely used)

– Each correlation on different subsample

– Set of correlations—no single sample

– May not be able to invert matrix

– What is the right sample size?

– If it works, usually better than mean substitution or listwise deletion


Problem with Traditional Approaches

Ordinary regression imputation
– Multiple regression used to predict their score
– Predicted value will have no new information if predictors are in your model (collinearity)
– Does nothing about uncertainty of predictions
  • If R² = .90, the predicted value is good
  • If R² = .10, the predicted value has a lot of noise
– Thus, predicted values are “too good”


Problem with Traditional Approaches

Single Imputation (SPSS Module) (MAR)

– American Statistician article--done incorrectly

– Single imputation does not incorporate variability between multiple imputations

– Reviewers for many journals not aware of limitations of single imputation so . . .

– Easy to implement using SPSS


Modern Approaches

Multiple Imputation--Assumes MAR

– Imputation is done 5-20 times

– Model is estimated 5-20 times

– Estimates (R’s, B’s, Betas) are averaged

– Standard errors--variances between solutions incorporated

– Reflects uncertainty of the process

– Always better than single imputation


Modern Approaches

Multiple Imputation
– Available with best statistical packages
  • Stata
  • SAS
– Available with freeware programs that work in conjunction with statistical packages
  • Norm
  • Amelia
  • IVEware
  • Mice


Modern Approaches

Full Information Maximum Likelihood (FIML)
– Assumes MAR
– Uses all available information
– Assumes patterns same if no missing
– Results similar to multiple imputation
– Available with SEM programs
  • Mplus
  • LISREL
  • AMOS
  • EQS


Modern Approaches

Full Information Maximum Likelihood
– Easy changes in SEM programs will do this
– Researchers rarely include auxiliary variables
– Researchers rarely include covariates unless in model
– Possible to add auxiliary/predictor variables
– Mplus allows for both FIML estimation and multiple imputation (nice to compare results)


How Multiple Imputation Works: Non-technical Explanation

• All variables may have some missing values, including DV

• Eliminate observations with missing values on all variables
– A missing wave of a panel is just missing values

• Estimate covariance matrix (listwise)

• Regress each variable x_i on the remaining variables


How Multiple Imputation Works

• Add residual based on strength of prediction
– R² = .90: add small error
– R² = .10: add big error
• You now have an actual or imputed value for all observations on all variables
• Estimate a covariance matrix
• This covariance matrix should be “better” because it utilizes more information
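The "predict, then add a residual" step can be sketched with a single predictor. Everything here is a simplifying assumption for illustration: the educ and income values are invented, the helper names are hypothetical, and a full multiple imputation routine would also draw the regression coefficients from their sampling distribution and iterate, which this sketch does not do.

```python
import math
import random


def fit_simple_regression(x, y):
    """Ordinary least squares for y = a + b*x; returns (a, b, residual_sd)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = my - b * mx
    sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    # A weak prediction (low R-squared) leaves a large residual SD.
    residual_sd = math.sqrt(sse / (n - 2))
    return a, b, residual_sd


def impute(x, y, rng):
    """Fill None entries of y with the prediction plus a random residual."""
    complete = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
    a, b, sd = fit_simple_regression(*zip(*complete))
    return [
        yi if yi is not None else a + b * xi + rng.gauss(0, sd)
        for xi, yi in zip(x, y)
    ]


rng = random.Random(2007)  # fixed seed so the run is repeatable

# Hypothetical data: income is missing for two cases.
educ = [10, 12, 12, 14, 16, 16, 18, 20]
income = [22, 30, None, 41, 47, None, 60, 66]

completed = impute(educ, income, rng)
print([round(v, 1) for v in completed])
```

Running impute again with a different seed gives different filled values for the same cases, which is the variability that multiple imputation later carries into the standard errors.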


How Multiple Imputation Works

• If covariance matrices are different
– Repeat process until successive covariance matrices are virtually identical
• This provides first imputed dataset
• Repeat this process m times
– Results: m imputed datasets with no missing values


How Multiple Imputation Works

• Estimate your model with each of your m imputed datasets

• Combine the results using Rubin’s rules
– Parameter estimates: mean of their m values
– Standard errors: inflate mean of standard errors based on how much solutions vary
– Standard errors (hence t-tests) will be unbiased if the data is MAR
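Rubin's rules for a single parameter can be written out directly: the pooled estimate is the mean of the m estimates, and the total variance is the average squared standard error (within-imputation variance) plus (1 + 1/m) times the variance of the estimates across imputations (between-imputation variance). The function name pool_rubin and the five example estimates below are assumptions made for illustration.

```python
import math
import statistics


def pool_rubin(estimates, standard_errors):
    """Pool one parameter across m imputed datasets using Rubin's rules."""
    m = len(estimates)
    q_bar = statistics.mean(estimates)                      # pooled estimate
    within = statistics.mean([se ** 2 for se in standard_errors])
    between = statistics.variance(estimates)                # uses m - 1
    total = within + (1 + 1 / m) * between
    return q_bar, math.sqrt(total)


# Hypothetical regression coefficient from m = 5 imputed datasets.
estimates = [0.50, 0.54, 0.46, 0.52, 0.48]
standard_errors = [0.10, 0.10, 0.10, 0.10, 0.10]

pooled_estimate, pooled_se = pool_rubin(estimates, standard_errors)
print(f"pooled estimate = {pooled_estimate:.3f}, pooled SE = {pooled_se:.4f}")
```

The pooled standard error is larger than the 0.10 reported by each single analysis because the estimates disagree across imputations; that disagreement is the uncertainty a single imputation ignores.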


How FIML is Implemented: Mplus

Title:     Missing values including mechanisms
Data:      File is miss_systematic-999.dat ;
Variables: Names are childs satfin male hap_gen ident income98 educ hlth age ;
           Missing are all (-999) ;
           Usevariables are hlth childs hap_gen income98 age educ satfin male ;
Analysis:  Type = missing ;   ! without this you get listwise deletion


FIML: Mplus Example

Model:  hlth on childs hap_gen income98 age educ ;
        satfin on childs hap_gen income98 age educ ;
        male on childs hap_gen income98 age educ ;
Output: standardized ;

1. The “hlth” and “satfin” lines are the model
2. The “male” line is a nonsense equation that includes any covariates or auxiliary variables


Freeware Dedicated Packages

Package    Single Imputation    Multiple Imputation    FIML
Amelia     X                    X
IVEware    X                    X
Norm       X                    X
MICE       X                    X
Mx                                                     X


Commercial Statistical Packages

Package        Single Imputation    Multiple Imputation    FIML
SAS (MI)                            X
SPSS (EM)      X
Stata (ice)    X                    X


Commercial FIML Packages

Package    Single Imputation    Multiple Imputation    FIML
AMOS                                                   X
EQS                                                    X
HLM                                                    X
LISREL                                                 X
Mplus                           X                      X


Web Pages for Selected Software

• Amelia gking.harvard.edu/amelia/
• IVEware http://www.isr.umich.edu/src/smp/ive/
• Norm http://www.stat.psu.edu/~jls/misoftwa.html#aut
• Mx www.vcu.edu/mx/
• SPSS www.spss.com
• EQS www.mvsoft.com/
• LISREL http://www.ssicentral.com/hlm/index.html
• Mplus www.statmodel.com
• SAS www.sas.com
• Stata www.stata.com