An assessment of methods to impute risk exposure into model actor’s risk profile for microsimulation.

Deirdre Hennessy and Claude Nadeau

STAR webinar

September 30th 2011

Missing data in general

Missing data are a common and very frustrating problem in survey research! Variables such as income and self-reported body mass index (BMI) may be regarded as sensitive and are prone to non-response.

Types of missing data and why they occur:

• Missing completely at random (MCAR): e.g. administrative or collection errors

• Missing at random (MAR): missingness is related to known respondent characteristics but not to the value of the predictor itself

• Missing not at random (MNAR), or informative missingness: missingness is related to the (unobserved) value of the predictor itself
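The three mechanisms can be illustrated with a small simulation. This is only a sketch: the age range, the BMI model and the non-response rates below are invented for illustration and do not come from any survey discussed here.

```python
import random
import statistics

def simulate_missingness(n=20000, seed=1):
    """Compare mean BMI among respondents under three missingness mechanisms.

    All numbers (age range, BMI model, non-response rates) are invented.
    """
    rng = random.Random(seed)
    full = []
    observed = {"MCAR": [], "MAR": [], "MNAR": []}
    for _ in range(n):
        age = rng.uniform(20, 80)
        bmi = 22 + 0.08 * (age - 20) + rng.gauss(0, 4)
        full.append(bmi)
        # Probability that BMI is missing under each mechanism
        p_missing = {
            "MCAR": 0.30,                        # unrelated to anything
            "MAR": 0.10 if age < 50 else 0.50,   # depends on observed age only
            "MNAR": 0.10 if bmi < 28 else 0.60,  # depends on BMI itself
        }
        for mechanism, p in p_missing.items():
            if rng.random() >= p:
                observed[mechanism].append(bmi)
    means = {k: statistics.mean(v) for k, v in observed.items()}
    return statistics.mean(full), means

full_mean, respondent_means = simulate_missingness()
print(round(full_mean, 2), {k: round(v, 2) for k, v in respondent_means.items()})
```

In this toy setup the respondent mean is close to the full-sample mean under MCAR and too low under MAR and MNAR. The MAR bias could be removed by conditioning on age, because age is observed for everyone; the MNAR bias cannot be removed from the observed data alone.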

Missing data in general: why impute?

• Reduce non-response bias

• Maintain sample size and statistical efficiency

The aim: to produce an approximately unbiased and efficient estimator by choosing an appropriate imputation method, which should be:

• Robust under misspecification

• Suited to the type of analysis to be conducted, i.e. the purpose of the imputed data

• Practically appropriate (computing time, availability of variance estimation formulae, etc.)

Missing data in general: potential solutions

• Very simple solutions: complete case (respondent) analysis. Easy to implement and understand, but only valid under limited conditions; it can result in estimates that are biased or imprecise, so its applicability is limited.

• Less simple solutions: regression methods are still easy to implement and understand. Regression, linear or logistic, can be used to model numeric or categorical variables and can make use of many auxiliary variables. However, it may distort the distribution of the predictor variable and inflate the association between the predictor and other variables. In addition, the imputed values are predicted, not actually observed in another data source. This is a parametric approach and may be sensitive to misspecification of the regression model.

Missing data in general: potential solutions (continued)

• Less simple solutions: “hot-deck” methods are still easy to implement and understand. These methods assign the value from a record with observed data (“donor data”) to a record with missing data. This approach is suitable for dealing with categorical data. The method is non- or semi-parametric, making few distributional assumptions. However, a reasonably large sample size is required for it to work well.

• “Modern” imputation methods: multiple imputation is a newer method. It involves imputing the missing data using an appropriate imputation model that incorporates a random component, repeating this several times (3-10), carrying out the analysis of interest in each of the resulting datasets, and combining the estimates using prescribed rules.
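The combining step of multiple imputation is usually done with Rubin's rules (the slides do not name the rules, so this is an assumption about what is meant). A minimal sketch, with invented estimates and variances from m = 5 imputed datasets:

```python
import statistics

def pool_rubin(estimates, variances):
    """Pool m point estimates and their variances using Rubin's rules."""
    m = len(estimates)
    q_bar = statistics.mean(estimates)      # pooled point estimate
    within = statistics.mean(variances)     # average within-imputation variance
    between = statistics.variance(estimates)  # between-imputation variance (divisor m - 1)
    total = within + (1 + 1 / m) * between  # total variance of the pooled estimate
    return q_bar, total

# Hypothetical mean SBP estimates from five imputed datasets
estimates = [120.0, 122.0, 121.0, 119.0, 123.0]
variances = [4.0, 4.0, 4.0, 4.0, 4.0]
pooled_estimate, pooled_variance = pool_rubin(estimates, variances)
print(pooled_estimate, pooled_variance)
```

The between-imputation term is what carries the extra uncertainty due to the missing data; a single imputation has no way to estimate it.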

Imputation: Probability theory insight

Let X and Y be vectors of random variables, with Z = (X, Y) ~ f(z) = f(x, y).

Lose/misplace Y.

Generate Y* ~ f(y|x) = f(x, y) / ∫ f(x, y) dy. Then Z* = (X, Y*) ~ f(z): Z and Z* have the same distribution.

Let Y** = E[Y|X]. Then Z** = (X, Y**) does not follow f(z): Z and Z** have different distributions.
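One way to see why imputing the conditional mean changes the distribution is the law of total variance. The sketch below assumes a scalar Y with finite variance, which the slide does not state:

```latex
\operatorname{Var}(Y)
  = \mathbb{E}\left[\operatorname{Var}(Y \mid X)\right]
  + \operatorname{Var}\left(\mathbb{E}[Y \mid X]\right)
\quad\Longrightarrow\quad
\operatorname{Var}(Y^{**})
  = \operatorname{Var}(Y) - \mathbb{E}\left[\operatorname{Var}(Y \mid X)\right]
  \le \operatorname{Var}(Y).
```

Unless Y is an exact function of X, the inequality is strict, so Y** is less variable than Y and Z** cannot have the same distribution as Z.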

Imputation: Probability theory insight

Let Y be a random variable and X a vector of random variables, with Z = (X, Y) ~ f(z) = f(x, y).

Hide Y. Electric shock = (guess − Y)^2

Two guesses:

• Generate Y* ~ f(y|x) = f(x, y) / ∫ f(x, y) dy

• Y** = E[Y|X]

E[shock*] = 2 E[shock**]. Better off with Y** = E[Y|X].
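The factor of 2 can be checked by simulation. The sketch below assumes a toy model, Y | X ~ N(2X, 1) with X ~ N(0, 1), chosen only for illustration. Given X, the random draw Y* and the hidden Y are independent with the same conditional variance, so the expected squared error of Y* is twice that of the conditional mean.

```python
import random
import statistics

def simulate_shocks(n=50000, seed=7):
    """Average squared error of two guesses for a hidden Y (toy model)."""
    rng = random.Random(seed)
    shock_draw, shock_mean = [], []
    for _ in range(n):
        x = rng.gauss(0, 1)
        y = 2.0 * x + rng.gauss(0, 1)       # hidden truth, Y | X ~ N(2x, 1)
        y_star = 2.0 * x + rng.gauss(0, 1)  # guess 1: random draw from f(y|x)
        y_2star = 2.0 * x                   # guess 2: conditional mean E[Y|X]
        shock_draw.append((y_star - y) ** 2)
        shock_mean.append((y_2star - y) ** 2)
    return statistics.mean(shock_draw), statistics.mean(shock_mean)

mean_shock_draw, mean_shock_mean = simulate_shocks()
print(mean_shock_draw / mean_shock_mean)
```

The trade-off across the two slides is that Y** minimizes the prediction error for each individual, while Y* preserves the joint distribution of (X, Y).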

Missing data in microsimulation

A slightly different proposition! Data for microsimulation are assembled from multiple sources and require imputation of both missing items and missing variables (i.e. important variables that can be gleaned only from “donor” data).

In reality, assembling a database for microsimulation modelling is one big imputation!

Microsimulation modellers are very practical people!

Patchwork quilt of microsimulation data sources


Disclaimer: Not my work!

Missing data in microsimulation

Microsimulation modellers are very practical people, so they have used a variety of approaches to impute missing variables. The approach depends on:

• The data sources available

• How the resulting imputed variables will be used in the microsimulation model, i.e. the purpose of the imputed data

• How important the resulting imputed variable is: is it a main outcome or exposure of interest?

Imputation for microsimulation is NOT standardized, and it is unclear which approach produces the best results. However, the standard imputation approaches described above can be used and assessed for best results!

Population health model (POHEM)

Developed at Statistics Canada, POHEM is an example of a microsimulation tool that has been used to inform health policy issues such as chronic disease screening and treatment.

It has been applied to cancer, osteoarthritis (OA) and cardiovascular disease (CVD)/acute myocardial infarction (AMI), with plans to develop models for diabetes and stroke.

POHEM integrates data distributions and equations derived from a wide range of sources, including nationally representative cross-sectional and longitudinal surveys, cancer registries, hospitalization databases, vital statistics and the Census, as well as parameters from the published literature: a very complicated patchwork quilt!

POHEM, more technically

• Starts with a cross-sectional sample of the Canadian adult population (CCHS 1.1) and generates individual life histories by simulating various types of events (e.g. births, deaths, migration, changes in risk factors, disease onset and progression, treatments, changes in quality of life).

• It is a case-by-case, longitudinal, continuous-time, stochastic, Monte Carlo microsimulation.

• It directly encompasses competing risks and comorbidity.

• Has disease-specific sub-modules (OA and AMI) and incorporates models of risk factor and disease development.

• It generates plausible health biographies over the life course of synthetic individuals from empirical observations.

Why do we impute BP and cholesterol?

• Not available by self-report!

• These are core risk factors for CVD development and progression.

• Calculations to determine AMI incidence in POHEM use the Framingham risk function, derived from the famous Framingham Heart Study, a long-term follow-up study of CVD risk factors (including physical and laboratory measures) started in 1948.

[Diagram labels: Smoking, Diab, BP, HDL, chol, eF]

Current imputation of blood pressure and cholesterol

• Uses an old data source, the Canadian Heart Health Study (CHHS), collected 1986-1992 on a provincial basis.

• Data not exactly comparable to CCHS 2.2 in terms of other data elements collected, geographic coverage etc.

• Specifically, using variables common to the CCHS and CHHS, individuals’ BP and cholesterol categories were imputed using “hot-deck” methods. In other words, individuals in the CCHS were matched to those in the CHHS based on 5-year age group, sex, BMI category and diabetes status, and were assigned the corresponding BP and total cholesterol categories available in the CHHS.
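The matching step can be sketched as a cell-based hot-deck. This is an illustration, not the POHEM implementation: the field names are hypothetical, and drawing one donor at random within the matching cell is an assumption, since the slides do not say how a donor is chosen among several matches.

```python
import random
from collections import defaultdict

# Hypothetical field names for the four matching variables
MATCH_KEYS = ("age_group", "sex", "bmi_cat", "diabetes")

def hot_deck(donors, recipients, targets=("bp_cat", "chol_cat"), seed=0):
    """Copy target categories from a randomly chosen donor in the same cell."""
    rng = random.Random(seed)
    cells = defaultdict(list)
    for d in donors:
        cells[tuple(d[k] for k in MATCH_KEYS)].append(d)
    imputed = []
    for r in recipients:
        pool = cells.get(tuple(r[k] for k in MATCH_KEYS))
        row = dict(r)
        donor = rng.choice(pool) if pool else None
        for t in targets:
            # None marks an empty cell; in practice cells would be collapsed
            row[t] = donor[t] if donor else None
        imputed.append(row)
    return imputed

donors = [
    {"age_group": "40-44", "sex": "F", "bmi_cat": "normal", "diabetes": 0,
     "bp_cat": "optimal", "chol_cat": "desirable"},
    {"age_group": "60-64", "sex": "M", "bmi_cat": "obese", "diabetes": 1,
     "bp_cat": "high", "chol_cat": "high"},
]
recipients = [
    {"age_group": "60-64", "sex": "M", "bmi_cat": "obese", "diabetes": 1},
    {"age_group": "20-24", "sex": "F", "bmi_cat": "normal", "diabetes": 0},
]
result = hot_deck(donors, recipients)
print(result)
```

The second recipient has no donor in its cell, which is why a reasonably large donor sample is needed for this approach to work well.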

New model: POHEM nutrition and health outcomes

In preparation for constructing a model of nutrition and health outcomes we revisited the imputation of BP and cholesterol.

Why? To incorporate nutrition and food intake into POHEM (CCHS 2.2), and to improve and update the imputation.

[Figure: Males, optimal systolic blood pressure. Percent (0 to 0.2) by 5-year age group, from [20,25) to [75,80). Series: 2007 POHEM, 2008 POHEM, 2009 POHEM, 2007-2009 CHMS.]

Data/graph courtesy of Meltem Tuna

New model: Population model of nutrition and health outcomes

Times have changed! Awareness and treatment of CVD risk factors have changed drastically.

Graph taken from: F.A. McAlister, K. Wilkins et al. Changes in awareness, treatment and control of hypertension in Canada over the past 2 decades. CMAJ, June 14, 2011.

New model: Population model of nutrition and health outcomes

We also have new data: the Canadian Health Measures Survey (CHMS) 2007-2009, which collects BP and cholesterol using validated methods.

In addition, it collects many common variables, i.e. variables that are available in both the CHMS and CCHS 2.2, which makes imputation a bit easier.

Simplified POHEM nutrition to outcome model

[Diagram. Labels recovered from the slide, in order:]

• initial values: Obesity, Smoking, Nutrition, Physical activity, Alcohol, Income, Education, Region, Sex; CCHS 2.2

• initial values & transition models: Diabetes, Total cholesterol & HDL, blood pressure; CHMS 2007-2009

• competing risk of death from other causes; Vital statistics (and other POHEM disease modules)

• Progression, Death; Registered Persons database for Ontario (ICES) (CCORT I); survival data for each transition

• Health Outcome*; Health Person-Oriented Information (HPOI) (HIRD); incidence rates by province, age and sex

• transition models; NPHS 1994-2004

* Outcomes associated with high cholesterol and high blood pressure include hypertension, heart disease, AMI, stroke, heart failure and gastric cancer.

Objective of the study

To investigate various techniques to create imputed variables for BP and cholesterol, using the CHMS 2007-2009 as the donor data and CCHS 2.2 as the recipient data.


Methods: Donor Data Source

CHMS 2007-2009

• Sample size: 5,604, of whom 3,719 were >18 years.

• Collects data on self-reported health, chronic disease status, physical activity etc. in the same or a very similar manner to CCHS 2.2.

• Collects physical measures of BMI, CVD risk factors, physical activity and fitness: a very innovative survey.

• Uses validated measures to collect BP and cholesterol, even using a fasting sub-sample to collect cholesterol (~2,600).

Methods: Donor Data Source

Disadvantages of CHMS 2007-2009

• Small sample size, limited age range.

• Limited geographic coverage compared to CCHS. Because of cost and logistics considerations, 15 collection sites were chosen (primary sampling units), from 5 regional strata.

• Analysts are advised to perform analysis at national level only.

• Analytic options are somewhat limited, due to small number of degrees of freedom (11). This needs to be considered in the analysis to obtain the proper results in statistical tests or confidence intervals.
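To give a feel for what 11 degrees of freedom means in practice, the sketch below compares a 95% confidence interval built with the Student's t critical value for 11 df (2.201, a standard table value) against one built with the normal value 1.960. The estimate and standard error are invented for illustration.

```python
T_975_DF11 = 2.201  # 97.5th percentile of Student's t with 11 df (table value)
Z_975 = 1.960       # 97.5th percentile of the standard normal

def confidence_interval(estimate, std_err, critical):
    """Two-sided interval: estimate plus or minus critical * std_err."""
    half_width = critical * std_err
    return estimate - half_width, estimate + half_width

# Hypothetical mean SBP of 120 mmHg with a standard error of 2 mmHg
ci_t = confidence_interval(120.0, 2.0, T_975_DF11)
ci_z = confidence_interval(120.0, 2.0, Z_975)
print(ci_t, ci_z, T_975_DF11 / Z_975)
```

With 11 df the interval is roughly 12% wider than the normal-based one, so ignoring the limited degrees of freedom overstates precision.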


Methods: Recipient Data Source

CCHS 2.2 (2004)

• Sample size: 35,107, of whom 21,160 were >18 years.

• Nationally representative.

• Collects data on self-reported health, chronic disease status, physical activity etc., but also detailed food intake data, including a 24-hour dietary recall (the gold standard of food intake data available in Canada).

• Collects measured BMI on a subsample only (~12,500), but calculates a special weight to account for missing BMI data.

Methods: Study Sample (donor and recipient data)

[Diagram: CHMS, n=5,604, of whom 3,719 adults, provides SBP/DBP and total chol/HDL/LDL. CCHS, n=35,107, of whom 21,160 adults. Imputation produces a complete microdata file including food intake and CVD risk factors.]

Results: Regression imputation v1.0

• Case of Y** = E[Y|X].

• At first we attempted a complete case analysis: there is not much missing data in the CHMS, but the sample size became even smaller.

• Constructed a linear regression model using variables common to the CHMS and CCHS, including income, education, home ownership, marital status, immigrant status, racial/ethnic origin, chronic disease status, smoking status and BMI.

• Modelled the data by sex.

Results: Female model

Survey: Linear regression

Number of obs   = 1919
Population size = 11900450
Replications    = 500
Design df       = 499
F(4, 496)       = 167.36
Prob > F        = 0.0000
R-squared       = 0.4140

adjusted_SBP        Coef.   BRR* Std. Err.        t    P>|t|    [95% Conf. Interval]
dhh_age          .4405467        .0186056     23.68    0.000    .4039918    .4771016
bmi              .3878245        .0883162      4.39    0.000    .2143071    .5613418
hbp_aware         8.74016        1.249431      7.00    0.000    6.285367    11.19495
edu              2.645542        .8434096      3.14    0.002    .9884707    4.302614
_cons            81.89744          2.5815     31.72    0.000    76.82549    86.96939

Results: Male model

Survey: Linear regression

Number of obs   = 1712
Population size = 11822052
Replications    = 500
Design df       = 499
F(4, 496)       = 87.87
Prob > F        = 0.0000
R-squared       = 0.2038

adjusted_SBP        Coef.   BRR* Std. Err.        t    P>|t|    [95% Conf. Interval]
dhh_age          .4569414        .0980163      4.66    0.000    .2643658     .649517
bmi              .5470305        .1854258      2.95    0.003    .1827189     .911342
hbp_aware        5.371427        1.280766      4.19    0.000    2.855068    7.887786
age_bmi_int     -.0074843        .0037551     -1.99    0.047   -.0148619   -.0001066
_cons            91.21298          5.3256     17.13    0.000    80.74962    101.6763
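Deterministic regression imputation (the Y** = E[Y|X] case) then amounts to plugging each CCHS respondent's covariates into the fitted equation for their sex. The sketch below uses the coefficients reported above. The function name is hypothetical, and two coding details are assumptions not confirmed by the output: hbp_aware and edu are treated as 0/1 indicators, and age_bmi_int is taken to be age multiplied by BMI.

```python
# Coefficients copied from the female and male regression output
FEMALE = {"_cons": 81.89744, "dhh_age": 0.4405467, "bmi": 0.3878245,
          "hbp_aware": 8.74016, "edu": 2.645542}
MALE = {"_cons": 91.21298, "dhh_age": 0.4569414, "bmi": 0.5470305,
        "hbp_aware": 5.371427, "age_bmi_int": -0.0074843}

def impute_sbp(sex, age, bmi, hbp_aware, edu=0):
    """Predicted SBP (mmHg) from the sex-specific linear model.

    Assumes hbp_aware and edu are coded 0/1 and age_bmi_int = age * bmi.
    """
    if sex == "F":
        c = FEMALE
        return (c["_cons"] + c["dhh_age"] * age + c["bmi"] * bmi
                + c["hbp_aware"] * hbp_aware + c["edu"] * edu)
    c = MALE
    return (c["_cons"] + c["dhh_age"] * age + c["bmi"] * bmi
            + c["hbp_aware"] * hbp_aware + c["age_bmi_int"] * age * bmi)

sbp_female = impute_sbp("F", age=50, bmi=25, hbp_aware=0, edu=0)
sbp_male = impute_sbp("M", age=50, bmi=25, hbp_aware=0)
print(round(sbp_female, 1), round(sbp_male, 1))
```

For a 50-year-old with a BMI of 25 and both indicators at 0, this gives roughly 113.6 mmHg for women and 118.4 mmHg for men. Every respondent with the same covariates receives exactly the same value.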

Results: Comparing measured and imputed data relationship with age

[Figure: two scatter plots with fitted values. CHMS: measured SBP (adjusted_SBP, axis 50 to 200) against DHH_AGE (20 to 80). CCHS: imputed SBP (imputed_SBP, axis 100 to 160) against DHHD_AGE (20 to 100).]

Results: Comparing measured and imputed data relationship with age

[Figure: two scatter plots with fitted values. Measured SBP (adjusted_SBP, axis 50 to 200) against bmi (20 to 60). Imputed SBP (imputed_SBP, axis 100 to 160) against MHWDDBMI (0 to 80).]
Results: Comparing the distributions

[Figure: two frequency histograms, one of adjusted_SBP and one of imputed_SBP, each with an x-axis from 50 to 200 and a frequency axis from 0 to 1.5e+06.]

Results: Relationship of imputed BP with salt intake

xi: regress imputed_SBP i.na_quint dhhd_age
i.na_quint    _Ina_quint_1-5    (naturally coded; _Ina_quint_1 omitted)

Source              SS       df           MS
Model        1037614.8        5    207522.96
Residual    270046.623    12304   21.9478725
Total       1307661.42    12309   106.236203

Number of obs = 12310
F(5, 12304)   = 9455.27
Prob > F      = 0.0000
R-squared     = 0.7935
Adj R-squared = 0.7934
Root MSE      = 4.6849

imputed_SBP         Coef.    Std. Err.         t    P>|t|    [95% Conf. Interval]
_Ina_quint_2    -.0927673     .1375922     -0.67    0.500   -.3624695     .176935
_Ina_quint_3       .01377     .1364071      0.10    0.920   -.2536093    .2811493
_Ina_quint_4     .1428075     .1356282      1.05    0.292   -.1230451      .40866
_Ina_quint_5     1.061184     .1363676      7.78    0.000    .7938822    1.328486
dhhd_age         .4537406     .0021169    214.34    0.000    .4495912    .4578901
_cons            95.81601     .1510044    634.52    0.000    95.52001      96.112

Next steps in imputation of BP and cholesterol from CHMS

• Repeat for DBP and total cholesterol/HDL: all important variables in the Framingham risk equation.

• Repeat, modelling categories of BP/cholesterol.

• Try random regression imputation: impute from a conditional distribution, the case of Y* ~ f(y|x) = f(x, y) / ∫ f(x, y) dy.

• Try hot-deck to impute categorical BP and cholesterol and compare to the regression results.
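Random regression imputation can be sketched by adding a random residual to each regression prediction. The version below assumes normally distributed residuals with a constant standard deviation; the fitted values and the residual SD of 15 mmHg are hypothetical, since the survey regression output does not report a residual SD.

```python
import random
import statistics

def random_regression_impute(predictions, residual_sd, seed=0):
    """Add a N(0, residual_sd^2) draw to each prediction (assumed residual model)."""
    rng = random.Random(seed)
    return [p + rng.gauss(0.0, residual_sd) for p in predictions]

# Hypothetical fitted values for ages 20 to 79, repeated to mimic a large file
predictions = [95.0 + 0.45 * age for age in range(20, 80)] * 200
imputed = random_regression_impute(predictions, residual_sd=15.0)
residuals = [i - p for i, p in zip(imputed, predictions)]

print(round(statistics.pstdev(predictions), 1), round(statistics.pstdev(imputed), 1))
```

The imputed values are more spread out than the deterministic predictions, which is the point of the random step: it aims to restore the variability that E[Y|X] removes.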

Case study: POHEM BMI model

Over to Claude….


Overall conclusions:

The imputation technique must be fit for purpose.

• Purpose of the data/eventual role in the microsimulation.

The model used must be assessed and its performance reported in a standard way.

It may not be possible to fully standardize an approach to imputation for microsimulation, because it is heavily dependent on the data source and purpose, but at least we can make the process and techniques more transparent.

Can multiple imputation be used in microsimulation?

• This technique may not be fit for purpose because it is difficult and computationally intensive.

• Proper multiple imputation, which involves running the analysis in multiple datasets and then combining the estimates according to prescribed rules, may not be appropriate for microsimulation. How would we combine multiple runs of POHEM?

• Improper multiple imputation, which runs a regression or hot-deck model multiple times and then incorporates the imputed values using a random process, may be more appropriate. I shall investigate.

Acknowledgements and contact:

Carol Bennett, Tracey Bushnik, Bill Flanagan, Doug Manuel, Claude Nadeau, Meltem Tuna

[email protected] 613-951-3725
