Timothy B. Gravelle

Principal Scientist & Director, Insights Lab

September 13, 2013

What’s Missing?

Handling Missing Data with SAS

© 2000-2013 PriceMetrix Inc. Patents granted and pending.

2

Missing data

• Survey data frequently contain missing observations due to respondent refusal, errors in fieldwork, etc.

• Business data can also contain missing observations.

• Large amounts of missing data can bias survey estimates.

• Many statistical techniques assume (or require) complete data, so missing data can reduce effective sample size (and statistical power).

3

Types of missing data

• Patterns of data loss are typically described as either ignorable or non-ignorable.

• Types of ignorable missing data:

- Missing completely at random (MCAR): the missing observations on a given variable differ from the observed scores on that variable only by chance, and the missingness is furthermore unrelated to any other variable.

- Missing at random (MAR): the missing observations on a given variable differ from the observed scores on that variable only by chance once other observed variables are taken into account; the missingness may be related to those other variables.

4

Types of missing data

• Non-ignorable missing data, or data that are missing not at random (MNAR): cases with missing data differ systematically from cases with complete data, because the probability that a value is missing depends on the unobserved value itself.
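
The three mechanisms are often written more compactly in the notation of Little and Rubin (2002). The symbols below are introduced here for illustration and do not appear on the slides: R is the indicator of which values are missing, and Y_obs and Y_mis are the observed and missing parts of the data.

```latex
% R: missingness indicator; Y_obs, Y_mis: observed and missing parts of the data
\begin{align*}
\text{MCAR:}\quad & P(R \mid Y_{\text{obs}}, Y_{\text{mis}}) = P(R) \\
\text{MAR:}\quad  & P(R \mid Y_{\text{obs}}, Y_{\text{mis}}) = P(R \mid Y_{\text{obs}}) \\
\text{MNAR:}\quad & P(R \mid Y_{\text{obs}}, Y_{\text{mis}}) \text{ depends on } Y_{\text{mis}}
\end{align*}
```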

5

Dealing with missing data

• Listwise deletion (or complete-case analysis): removes all cases with any missing data from the analysis.

• Pairwise deletion (or available-case analysis): different parts of the analysis are conducted with different subsets of the data.

• Imputation: missing data points in a dataset are replaced with plausible values.
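
As a rough illustration of listwise deletion, the sketch below keeps only the rows in which no analysis variable is missing and then counts missing values per variable. The dataset work.survey and its variables are hypothetical and exist only for this example.

```sas
/* Hypothetical example data: '.' marks a missing numeric value */
data work.survey;
   input id age income;
   datalines;
1 34 52000
2 . 61000
3 45 .
4 29 48000
;
run;

/* Listwise deletion: keep a row only if none of the analysis variables is missing */
data work.complete_cases;
   set work.survey;
   if cmiss(age, income) = 0;
run;

/* Count non-missing (N) and missing (NMISS) values per variable */
proc means data=work.survey n nmiss;
   var age income;
run;
```

Here two of the four rows survive, even though only two of the eight analysis values are missing. This is how listwise deletion can shrink the effective sample size quickly.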

6

Types of imputation

• Mean imputation: missing data points are simply replaced with the mean of the observed values on that variable.

• Random imputation: missing data points are replaced with values drawn at random from a uniform distribution.

• Regression-based imputation: missing values are replaced with predicted scores generated by a regression model fitted to the non-missing data.
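
One way to carry out mean imputation in SAS is PROC STDIZE with the REPONLY option. This is a minimal sketch rather than a recommended workflow, and the dataset and variable names are hypothetical.

```sas
/* Hypothetical data with one missing income value */
data work.survey;
   input id income;
   datalines;
1 52000
2 .
3 61000
4 48000
;
run;

/* REPONLY replaces missing values only (no standardization);
   METHOD=MEAN fills each gap with the variable's observed mean */
proc stdize data=work.survey out=work.survey_mean method=mean reponly;
   var income;
run;
```

Every missing value receives the same number, so the imputed variable has less spread than the real data would have.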

7

Single vs. multiple imputation

• A problem with imputing only a single value for every missing value is that this does not reflect our uncertainty about the predictions. Standard errors may therefore be biased (too small).

• An alternative is to replace each missing value with multiple plausible values. This represents the uncertainty about the right value to impute.

• Data analyses from multiply-imputed datasets can be combined to produce estimates and confidence intervals that incorporate missing-data uncertainty.

8

Steps for multiple imputation

• Impute the missing values m times (m is usually 3 to 10).

• Analyze each of the m completed data sets. This results in m analyses.

• Pool the results from the m analyses into a final result.

[Figure: incomplete data → m complete (imputed) data sets → analysis results (β1, β2, β3) → final pooled result (β)]
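
The slides do not spell out the pooling step. The standard approach is Rubin's rules (Rubin 1987), sketched below for a single coefficient. The notation is introduced here for illustration: the i-th imputed data set gives an estimate and a squared standard error U_i, B is the between-imputation variance, and T is the total variance.

```latex
\begin{align*}
\bar{\beta} &= \frac{1}{m}\sum_{i=1}^{m}\hat{\beta}_i \\
\bar{U} &= \frac{1}{m}\sum_{i=1}^{m} U_i \\
B &= \frac{1}{m-1}\sum_{i=1}^{m}\left(\hat{\beta}_i-\bar{\beta}\right)^2 \\
T &= \bar{U} + \left(1+\frac{1}{m}\right)B, \qquad \operatorname{SE}(\bar{\beta}) = \sqrt{T}
\end{align*}
```

The between-imputation term B is what single imputation leaves out, which is why its standard errors tend to be too small.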

9

Multiple imputation vs. listwise deletion: an example

Predicting concern about illegal immigration: United States (OLS regression)

                                    Multiple Imputation          Listwise Deletion
Variable                            Coeff.  SE    p      Sig.    Coeff.  SE    p      Sig.
Intercept                             2.08  0.41  0.000  ***       1.88  0.45  0.000  ***
Male                                 -0.02  0.06  0.693           -0.02  0.07  0.725
ln Age (Years)                        0.14  0.08  0.093            0.15  0.10  0.127
Education: College                   -0.24  0.07  0.000  ***      -0.25  0.08  0.002  **
Education: Some College               0.03  0.07  0.696            0.01  0.08  0.864
Monthly HH Income: 2K-4K             -0.01  0.10  0.958            0.06  0.10  0.552
Monthly HH Income: 4K-7.5K            0.07  0.09  0.438            0.12  0.09  0.208
Monthly HH Income: 7.5K+             -0.03  0.10  0.747           -0.02  0.11  0.865
Race: Black                          -0.03  0.11  0.812            0.03  0.12  0.833
Race: Other                           0.10  0.11  0.379            0.07  0.12  0.544
Hispanic                             -0.28  0.13  0.031  *        -0.28  0.14  0.044  *
Party: Democrat                      -0.11  0.11  0.323           -0.14  0.12  0.233
Party: Republican                     0.15  0.11  0.154            0.17  0.12  0.149
Ideology (Conservative)               0.10  0.04  0.009  **        0.10  0.04  0.017  *
ln Distance to US-Mex Border (km)     0.07  0.03  0.049  *         0.08  0.03  0.016  *

n                                    1,037                          763
R²                                   0.142                        0.165
Adjusted R²                          0.130                        0.149

10

Multiple imputation vs. listwise deletion: a second example

Predicting positive impressions of NAFTA: Canada (logistic regression)

                                    Multiple Imputation                Listwise Deletion
Variable                            Coeff.  S.E.  O.R.  p      Sig.    Coeff.  S.E.  O.R.  p      Sig.
Intercept                             2.05  0.98  7.76  0.037  *         1.66  1.88  5.27  0.378
Male                                  0.60  0.18  1.83  0.001  ***       0.55  0.32  1.74  0.084
ln Age (Years)                       -0.57  0.21  0.56  0.007  **       -0.49  0.38  0.61  0.198
Education: University                 0.16  0.22  1.17  0.481            0.53  0.36  1.70  0.141
Education: Community College         -0.14  0.23  0.87  0.525            0.06  0.40  1.06  0.888
Province: NL                          0.56  1.31  1.75  0.671           -0.26  1.73  0.77  0.880
Province: NS/PEI                      1.37  0.54  3.92  0.011  **       -0.02  0.78  0.98  0.983
Province: NB                          0.26  0.59  1.30  0.663            1.70  0.91  5.48  0.061
Province: QC                         -0.48  0.24  0.62  0.047  *         0.91  0.44  2.49  0.037  *
Province: MB                          0.12  0.48  1.13  0.796            0.86  0.68  2.37  0.202
Province: SK                          0.21  0.50  1.24  0.670            0.17  0.75  1.18  0.821
Province: AB                          0.06  0.35  1.06  0.869            0.93  0.78  2.52  0.233
Province: BC                         -0.23  0.29  0.79  0.422           -0.42  0.48  0.66  0.382
HH Income: Comfortable                0.18  0.19  1.20  0.347            0.38  0.37  1.46  0.306
HH Income: Finding it Difficult      -0.73  0.31  0.48  0.018  *        -0.14  0.41  0.87  0.725
City Economy Getting Better           0.46  0.19  1.58  0.017  *         0.77  0.36  2.16  0.030  *
Local Job Market Good                 0.19  0.20  1.21  0.358           -0.41  0.36  0.67  0.259
Confident in National Government      0.56  0.21  1.75  0.009  **        1.73  0.40  5.63  0.000  ***
Approve of Canadian Leadership        0.65  0.22  1.91  0.004  **       -0.37  0.40  0.69  0.356
Approve of American Leadership        0.36  0.23  1.44  0.118            0.43  0.37  1.54  0.239
ln Distance to Can-US Border (km)    -0.23  0.12  0.80  0.051  *        -0.35  0.23  0.71  0.131

n                                      885                                379
Model Chi Square                    177.37                              91.69
Cox & Snell R²                       0.182                              0.216
Nagelkerke R²                        0.244                              0.295

Examples: IVEware, PROC MI and PROC MIANALYZE

13

PROC MI

• Provides analyses of missing data patterns.

• Creates imputed values (mainly for interval-level variables using linear regression; handling of categorical data is new in SAS/STAT 12.1).
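
Both uses of PROC MI can be sketched as follows. The simulated dataset, variable names and seed are hypothetical. The second step assumes a SAS/STAT release that supports the FCS statement, where continuous variables are imputed by regression by default.

```sas
/* Simulate a small hypothetical survey with some values set to missing */
data work.survey;
   call streaminit(2013);
   do id = 1 to 200;
      age = 18 + floor(60 * rand('uniform'));
      income = 20000 + 800 * age + 5000 * rand('normal');
      education = 1 + (rand('uniform') > 0.5);   /* two categories: 1 or 2 */
      if rand('uniform') < 0.2 then income = .;
      if rand('uniform') < 0.1 then education = .;
      output;
   end;
run;

/* Step 1: NIMPUTE=0 skips imputation and reports only the missing data patterns */
proc mi data=work.survey nimpute=0;
   var age income education;
run;

/* Step 2: create 10 imputed data sets; the categorical variable is imputed
   by logistic regression, the continuous ones by the default method */
proc mi data=work.survey nimpute=10 seed=20110718 out=work.survey_mi;
   class education;
   fcs logistic(education);
   var age income education;
run;
```

The OUT= data set stacks the imputed copies and identifies them with the variable _Imputation_.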

14

PROC MIANALYZE

• Combines the analyses of multiply imputed data performed in other SAS procedures, e.g., PROC REG, PROC LOGISTIC, PROC SURVEYREG, PROC SURVEYLOGISTIC, PROC CALIS.
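
A minimal sketch of the analyze-then-pool pattern with PROC REG follows. The tiny stacked dataset is hypothetical and stands in for real imputed data: two imputed copies identified by _Imputation_.

```sas
/* Hypothetical stacked data: two imputed copies identified by _Imputation_ */
data work.stacked;
   input _Imputation_ y x;
   datalines;
1 2.1 1
1 3.9 2
1 6.2 3
1 7.8 4
1 10.1 5
2 2.1 1
2 4.3 2
2 6.2 3
2 8.4 4
2 10.1 5
;
run;

/* Fit the same model within each imputed copy; OUTEST= with COVOUT
   stores the estimates together with their covariance matrices */
proc reg data=work.stacked outest=work.est covout noprint;
   model y = x;
   by _Imputation_;
run;
quit;

/* Pool the per-imputation estimates into one set of results */
proc mianalyze data=work.est;
   modeleffects Intercept x;
run;
```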

15

IVEware (SAS-callable)

• Developed at the University of Michigan and distributed free of charge.

• Can accommodate interval, ordinal, nominal, count and “mixed” data (using linear, binary logistic, generalized logistic and Poisson regression).

• Can accommodate bounds on the imputed values.

• Can restrict the imputation to a subset of cases (useful for imputing data for contingent/skip-based questions).

16

IVEware (SAS-callable)

/* Write the IVEware setup file (impute.set) from the in-stream lines below */
DATA _null_;
  INFILE datalines;
  FILENAME setup "impute.set";
  FILE setup;
  INPUT;
  PUT _infile_;
  REPLACE;
DATALINES4;
DATAIN work.data_3;
DATAOUT work.data_MI;
DEFAULT CATEGORICAL;
CONTINUOUS LN_AGE WP1220 DISTANCE_CAN_US_BORDER
LN_DISTANCE_USA;
TRANSFER CASE_ID YEAR WP8018 WP12596 WP5 WP1220 WEIGHT LAT LON
DISTANCE_CAN_US_BORDER GEO_MATCH;

17

IVEware (SAS-callable)

BOUNDS WP30(>=1,<=2) WP31(>=1,<=3) WP87(>=1,<=2) WP88(>=1,<=3)
WP89(>=1,<=2) WP137(>=1,<=2) WP138(>=1,<=2) WP139(>=1,<=2)
WP141(>=1,<=2) WP142(>=1,<=2) WP143(>=1,<=2) WP144(>=1,<=2)
WP145(>=1,<=2) WP146(>=1,<=2) WP148(>=1,<=2) WP150(>=1,<=2)
WP151(>=1,<=2) WP6879(>=1,<=2) WP1219(>=1,<=2)
LN_AGE(>=2.7080502,<=4.5951199) EDUCATION(>=1,<=3)
INCOME(>=1,<=7) INCOME_GET_BY(>=1,<=7) WP2319(>=1,<=4)
WP4657(>=1,<=2) REGION_CAN(>=1,<=8);
MINRSQD .01;
ITERATIONS 10;
MULTIPLES 10;
PERTURB=COEF;
SEED 20110718;
RUN;
;;;;

18

IVEware (SAS-callable)

%IMPUTE(name=impute, dir=.);
%PUTDATA(name=impute, dir=., mult=1, dataout=data_MI1);
%PUTDATA(name=impute, dir=., mult=2, dataout=data_MI2);
%PUTDATA(name=impute, dir=., mult=3, dataout=data_MI3);
%PUTDATA(name=impute, dir=., mult=4, dataout=data_MI4);
%PUTDATA(name=impute, dir=., mult=5, dataout=data_MI5);
%PUTDATA(name=impute, dir=., mult=6, dataout=data_MI6);
%PUTDATA(name=impute, dir=., mult=7, dataout=data_MI7);
%PUTDATA(name=impute, dir=., mult=8, dataout=data_MI8);
%PUTDATA(name=impute, dir=., mult=9, dataout=data_MI9);
%PUTDATA(name=impute, dir=., mult=10, dataout=data_MI10);
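
The ten %PUTDATA calls follow a single pattern, so they could also be generated by a short macro loop. The macro name %extract_imputations is hypothetical. The sketch assumes that IVEware's %PUTDATA macro is available and that %IMPUTE has already been run as above.

```sas
/* Hypothetical helper macro: calls IVEware's %PUTDATA once per imputation */
%macro extract_imputations(m=10);
   %local i;
   %do i = 1 %to &m;
      %PUTDATA(name=impute, dir=., mult=&i, dataout=data_MI&i);
   %end;
%mend extract_imputations;

/* Extract imputations 1 through 10 into data_MI1 ... data_MI10 */
%extract_imputations(m=10);
```

Changing the number of imputations then means editing one argument rather than adding or removing lines.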

19

IVEware (SAS-callable)

PROC SQL;
  CREATE TABLE data_MI AS
  SELECT * FROM data_MI1
  UNION ALL SELECT * FROM data_MI2
  UNION ALL SELECT * FROM data_MI3
  UNION ALL SELECT * FROM data_MI4
  UNION ALL SELECT * FROM data_MI5
  UNION ALL SELECT * FROM data_MI6
  UNION ALL SELECT * FROM data_MI7
  UNION ALL SELECT * FROM data_MI8
  UNION ALL SELECT * FROM data_MI9
  UNION ALL SELECT * FROM data_MI10
  ORDER BY _mult_, CASE_ID
  ;
QUIT;

20

PROC SURVEYLOGISTIC (using multiply-imputed data)

PROC SURVEYLOGISTIC DATA=data_MI;
  BY _mult_;
  MODEL NAFTA_POS(EVENT='1') = YEAR_2009 YEAR_2010 YEAR_2011
    MALE LN_AGE EDU_UNIV EDU_COLLEGE
    INCOME_0_1999_MTH INCOME_2000_2999_MTH INCOME_3000_3999_MTH
    INCOME_5000_7499_MTH INCOME_7500_9999_MTH INCOME_10000_PL_MTH
    PROVINCE_NL PROVINCE_NS_PE PROVINCE_NB PROVINCE_QC PROVINCE_MB
    PROVINCE_SK PROVINCE_AB PROVINCE_BC
    GOOD_TIME_FIND_JOB CITY_ECON_BETTER CITY_ECON_WORSE
    NATL_ECON_BETTER NATL_ECON_WORSE CAN_LEADERSHIP USA_LEADERSHIP
    LN_DISTANCE_USA
    / RSQ;
  ODS OUTPUT ParameterEstimates=ParmEst;
  STRATA PROVINCE;
  WEIGHT WT;
RUN;

21

PROC MIANALYZE (combining parameter estimates)

PROC MIANALYZE PARMS=ParmEst;
  MODELEFFECTS
    Intercept
    YEAR_2009 YEAR_2010 YEAR_2011
    MALE LN_AGE EDU_UNIV EDU_COLLEGE
    INCOME_0_1999_MTH INCOME_2000_2999_MTH INCOME_3000_3999_MTH
    INCOME_5000_7499_MTH INCOME_7500_9999_MTH INCOME_10000_PL_MTH
    PROVINCE_NL PROVINCE_NS_PE PROVINCE_NB PROVINCE_QC PROVINCE_MB
    PROVINCE_SK PROVINCE_AB PROVINCE_BC
    GOOD_TIME_FIND_JOB CITY_ECON_BETTER CITY_ECON_WORSE
    NATL_ECON_BETTER NATL_ECON_WORSE CAN_LEADERSHIP USA_LEADERSHIP
    LN_DISTANCE_USA;
RUN;
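
One detail may need attention when combining IVEware output with PROC MIANALYZE. PROC MIANALYZE identifies imputations through a variable named _Imputation_, while the parameter estimates above were produced with BY _mult_. If ParmEst carries _mult_ instead, a data set option can rename it on the way in. The sketch below rests on that assumption, and its MODELEFFECTS list is abbreviated for illustration.

```sas
/* Assumption: ParmEst was produced with BY _mult_, so it carries _mult_
   rather than the _Imputation_ variable that PROC MIANALYZE looks for */
proc mianalyze parms=ParmEst(rename=(_mult_=_Imputation_));
   modeleffects Intercept MALE LN_AGE;   /* abbreviated list of effects */
run;
```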

22

Wrap-up

• How one chooses to deal with missing data has implications for one’s analyses and the substantive conclusions that one can reach.

• The default of listwise deletion is a choice, though often an implicit one. It may not be the best choice.

• State-of-the-art multiple imputation techniques are now (relatively) easy to implement in SAS. They allow us to use all available data while still accounting for the uncertainty inherent in the imputation process.

23

Wrap-up

• The creation of multiply-imputed datasets, analysis of multiply-imputed data and pooling of estimates are distinct steps.

• Consequently, one can conduct the different steps using different software according to the software’s capabilities and the analyst’s preference.

• When we encounter missing data, we should give greater thought to why they are missing.

24

References

Allison, Paul D. (2002) Missing Data, Sage.

Horton, Nicholas J. and Ken P. Kleinman (2007) “Much Ado About Nothing: A Comparison of Missing Data Methods and Software to Fit Incomplete Data Regression Models.” American Statistician 61(1): 79–90.

Little, Roderick J.A. and Donald B. Rubin (2002) Statistical Analysis with Missing Data, Wiley.

Raghunathan, Trivellore E., Peter W. Solenberger and John Van Hoewyk (2002) IVEware: Imputation and Variance Estimation Software User Guide. Ann Arbor, MI: Survey Research Center, Institute for Social Research, University of Michigan.

Rubin, Donald B. (1987) Multiple Imputation for Nonresponse in Surveys, Wiley.

Thank you!
