Thierry Chaussalet: Predictive risk modelling: Does technique or time matter?

Predictive risk modelling: Does technique or time matter?

Professor Thierry Chaussalet

Department of Business Information Systems, ECS

University of Westminster, London

www.healthcareinformatics.org.uk

Nuffield Trust, 13 June 2012


Acknowledgements

- Ian Winkworth, who conducted the analysis
- The Nuffield Trust for advice throughout


Motivation


• If patients at risk of (re-)admission could be identified and offered early interventions, then their lives and long-term health may be improved by reducing the chances of readmission, and hopefully their cost of care reduced

• This has led to the development of a flurry of predictive risk modelling tools:
o Most are based on logistic regression, such as the PARR+ tool (J. Billings et al. 2006); however, there exist many other algorithms such as neural networks or decision trees
o Most are concerned with predicting the risk of (re-)admission within the following year; however, readmission within different time intervals is also of interest

Our objectives

1. To develop and compare alternative statistical/data mining algorithms (Logistic Regression, Classification Tree and Neural Network) in order to predict the likelihood of a readmission within 12 months, based on England hospital inpatient admissions data

2. To develop and compare predictive risk models based on the three methodologies (logistic regression, classification trees, neural network) within shorter timeframes, i.e. 1, 3, 6, and 9 months.

3. In addition, to explore the benefit of adding a measure of condition severity in a “PARR-like” model
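The second objective needs one binary outcome per timeframe. The sketch below shows one way such labels could be derived. It is an illustration only: the function names are hypothetical, and it assumes the window is counted in calendar months from the discharge date of the triggering admission, which the slides do not specify.

```python
from datetime import date
import calendar

def add_months(d, months):
    """Return the date `months` calendar months after d, clamped to month end."""
    month_index = d.month - 1 + months
    year = d.year + month_index // 12
    month = month_index % 12 + 1
    day = min(d.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)

def readmission_labels(discharge, next_emergency_admission, windows=(1, 3, 6, 9, 12)):
    """Map each window (in months) to True if a readmission falls inside it.

    Assumption: the window starts at discharge; pass None when the patient
    had no later emergency admission.
    """
    labels = {}
    for months in windows:
        cutoff = add_months(discharge, months)
        labels[months] = (
            next_emergency_admission is not None
            and discharge < next_emergency_admission <= cutoff
        )
    return labels

# Hypothetical patient: discharged 31/01/2003, readmitted 15/05/2003.
labels = readmission_labels(date(2003, 1, 31), date(2003, 5, 15))
```

A separate model can then be fitted to each of the five labels, which is how the timeframe comparison is framed in the objectives.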


Standard PARR Model Timeframe

[Figure: timeline running from 01/04/1999 to 31/03/2004, divided into the prior hospital utilisation period, the triggering year (01/04/2002 to 31/03/2003) and the prediction time period]

Data Extraction and Manipulation

• Data source: Hospital Episode Statistics (HES) which holds all inpatient episodes of care.

• Software used to extract the data: MySQL was used to extract a sample of just over 100,000 emergency inpatient admissions that started and ended between 01/04/2002 and 31/03/2003. The data were then split into training (70%) and validation (30%) data sets

• Software used to fit models to the data: SAS Enterprise Miner was used to fit models to the extracted data [but SPSS and open source software could be used e.g. R, Rapid Miner, etc.].
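The 70%/30% split described above can be sketched in a few lines. This is a minimal illustration, assuming a simple random split over admission records; the slides do not say how the split was drawn, and the function name and placeholder identifiers are hypothetical.

```python
import random

def split_training_validation(records, training_fraction=0.7, seed=0):
    """Shuffle a copy of the records and cut it into training and validation sets."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    cut = round(len(shuffled) * training_fraction)
    return shuffled[:cut], shuffled[cut:]

# Placeholder identifiers standing in for extracted admissions.
admission_ids = list(range(1000))
training, validation = split_training_validation(admission_ids)
```

If one patient can contribute several admissions, splitting by patient rather than by admission would avoid the same person appearing in both sets.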


Independent variables

The following independent variables were used in the models:
– Age group at triggering admission, gender and ethnic origin
– Presence of certain diseases/conditions in the triggering admission or in the previous three years
– The summed total of disease severity calculated by the Charlson Comorbidity Severity Index, determined by looking at all diseases/conditions that the patient had over the previous three years. The list of diseases used in this measure is on the next slide
– Variables such as the number of emergency inpatient admissions in the previous three years


Condition | Charlson Comorbidity Severity Index | ICD-10 codes
Ischaemic heart disease | 1 | I21-I25
Congestive heart failure (CHF) | 1 | I50, I110, I130
Peripheral vascular disease (PVD) | 1 | I700-I702, I71-I72, I731-I739, I709, I792, I771, R2
Cerebrovascular disease (CVD) | 1 | I60-I67, I69, G45, H340, R298, R470
Mental illness | 1 | F00-F09, F17-F69, F90-F99
Chronic obstructive pulmonary disease (COPD) | 1 | J43-J44
Connective tissue disease/rheumatoid arthritis (CTDRA) | 1 | M32-M36, M05, M06, M08, I39, I528, I418, I328, J990, G737
Peptic Ulcer | 1 | K25-K28
Mild Liver Disease | 1 | K703, K743-K746, K760, K769
Diabetes without complications | 1 | E100, E101, E106, E108, E109, E110, E111, E116, E118, E119, E120, E121, E126, E128, E129, E130, E131, E136, E138, E139, E140, E141, E146, E148, E149
Hemiplegia | 2 | G041, G114, G801, G802, G81, G82, G830-G834, G839
Renal Failure | 2 | N18-N20, Z940
Diabetes with complications | 2 | E102-E105, E107, E112, E115, E117, E122-E125, E127, E132-E135, E137, E142-E145, E147
Cancer | 2 | All codes beginning with C, D00-D48
Moderate to severe Liver Disease | 3 | I850, I859, I864, I982, K704, K711, K721, K729, K765, K766, K767
Metastatic Cancer | 6 | C77-C80
HIV/AIDS | 6 | B20-B24
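As an illustration of how a severity index total could be computed from this table, the sketch below sums the weights of the conditions found in a patient's ICD-10 codes. It covers only a subset of the rows, with ranges expanded by hand into code prefixes. It assumes each condition counts once however many matching codes appear; how the study handled overlapping rows (for example Cancer and Metastatic Cancer) is not stated on the slides. The function and table names are hypothetical.

```python
# Subset of the table above: condition -> (weight, ICD-10 code prefixes).
SEVERITY_TABLE = {
    "Ischaemic heart disease": (1, ("I21", "I22", "I23", "I24", "I25")),
    "Congestive heart failure": (1, ("I50", "I110", "I130")),
    "COPD": (1, ("J43", "J44")),
    "Renal failure": (2, ("N18", "N19", "N20", "Z940")),
    "Metastatic cancer": (6, ("C77", "C78", "C79", "C80")),
    "HIV/AIDS": (6, ("B20", "B21", "B22", "B23", "B24")),
}

def severity_index(icd10_codes):
    """Sum the weights of all conditions matched by at least one code."""
    # Normalise codes such as "i21.0" to the undotted upper-case form "I210".
    codes = [code.replace(".", "").upper() for code in icd10_codes]
    total = 0
    for weight, prefixes in SEVERITY_TABLE.values():
        # str.startswith accepts a tuple of prefixes.
        if any(code.startswith(prefixes) for code in codes):
            total += weight
    return total

# Hypothetical patient with CHF, two COPD codes and a metastatic cancer code.
example_total = severity_index(["I500", "J449", "J440", "C780"])
```

In this example the two COPD codes contribute a single point, so the total is 1 (CHF) + 1 (COPD) + 6 (metastatic cancer).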

Effect of severity index

Patients are more likely to have a readmission if they have a high severity index total score


The methods used

• Logistic Regression
o Somewhat like regression but with a binary dependent variable; will lead to:

$$P(R) = \frac{1}{1 + e^{-\left(\beta_0 + \sum_{n} \beta_n X_n\right)}}$$

• Decision Trees
o Partitions the independent variables into a set of homogeneous regions
o Popular algorithms are CART, CHAID, C4.5
o C4.5 uses the idea of information gain (entropy)

• Neural Network
o Aims at mimicking the brain with many neurons in hidden layers that connect through “synapses”
o Mathematically is a generalisation of logistic regression
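To make the logistic regression formula concrete, the sketch below evaluates P(R) for one patient. The intercept and coefficients are made-up numbers chosen for illustration, not estimates from the study, and the function name is hypothetical.

```python
import math

def readmission_probability(intercept, coefficients, features):
    """Evaluate P(R) = 1 / (1 + exp(-(b0 + sum of b_n * x_n)))."""
    linear_predictor = intercept + sum(b * x for b, x in zip(coefficients, features))
    return 1.0 / (1.0 + math.exp(-linear_predictor))

# Hypothetical model with two predictors: number of emergency admissions
# in the previous 3 years and the severity index total.
intercept = -2.0
coefficients = [0.5, 0.2]

low_risk = readmission_probability(intercept, coefficients, [0, 0])
high_risk = readmission_probability(intercept, coefficients, [4, 6])
```

With positive coefficients, more prior admissions or a higher severity total moves the probability towards 1, and a linear predictor of zero gives exactly 0.5.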


Logistic Regression - Results

• Most significant variables
o Number of emergency admissions within the previous 3 years
o Age 75 plus at admission
o Number of emergency admissions within the previous 6 months
o Average number of episodes per emergency admission spell
o Reference condition in the previous 3 years
o The severity index is also significant


Decision tree – Results

These factors were also found significant with logistic regression. However, factors such as age, ethnic origin and some conditions were significant in the regression model but not in the tree model.

Factor | Factor name in tree | Relative importance in model
The number of emergency admissions within the previous 3 years | NumberOfEMAD_within_3years | 1.000
The severity index total score for conditions in the current admission and in the previous 3 years | Severity_Index | 0.246
The number of emergency admissions within the previous 6 months | NumberOfEMAD_within_6months | 0.068
Whether the patient had an emergency admission due to COPD in the previous 3 years | COPD | 0.062
Whether the patient had a reference condition in the current admission or in the previous 3 years | Ref_condition_prev_3_yrs | 0.060


Decision Trees – Results

If a patient had 2 emergency admissions within the previous 3 years and a severity index of 4 or more in the previous 3 years, then s/he is predicted to have an emergency readmission within 12 months. 62.3% of the 780 patients in this group who were predicted to have a readmission actually had a readmission.
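The tree rule quoted on this slide can be written as a single condition, which is part of why trees are easy to read. The sketch below is an illustration only: it takes "2 emergency admissions" to mean exactly two, which is an assumption since the actual split in the fitted tree may cover a range, and the function name is hypothetical.

```python
def tree_rule_predicts_readmission(emergency_admissions_3y, severity_index):
    """One leaf of the tree: flag the patient when both conditions hold.

    Assumption: the admissions condition is read as exactly 2.
    """
    return emergency_admissions_3y == 2 and severity_index >= 4

# Share of correctly flagged patients in this group, from the slide's figures.
flagged_patients = 780
share_readmitted = 0.623
approx_readmitted = round(flagged_patients * share_readmitted)
```

On the slide's figures, 62.3% of 780 corresponds to roughly 486 patients who were actually readmitted.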


Neural Network

Number of hidden layers | 1
Number of hidden neurons | 9
Network architecture | Multilayer Perceptron

Due to their complex structure neural network results are a lot more difficult to interpret

[Figure: neural network diagram, labelled "9 nodes"]
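A multilayer perceptron with one hidden layer of 9 neurons can be sketched as a forward pass in plain Python. This is an illustration of the architecture only: the weights are random placeholders rather than fitted values, the number of inputs is arbitrary, and the tanh hidden activation is an assumption, since the slides do not state which activation was used.

```python
import math
import random

def mlp_forward(features, hidden_weights, hidden_biases, output_weights, output_bias):
    """Forward pass: inputs -> tanh hidden layer -> logistic output in (0, 1)."""
    hidden = [
        math.tanh(bias + sum(w * x for w, x in zip(weights, features)))
        for weights, bias in zip(hidden_weights, hidden_biases)
    ]
    z = output_bias + sum(w * h for w, h in zip(output_weights, hidden))
    return 1.0 / (1.0 + math.exp(-z))

# Placeholder weights for 4 hypothetical inputs and 9 hidden neurons.
rng = random.Random(0)
n_inputs, n_hidden = 4, 9
hidden_weights = [[rng.uniform(-1, 1) for _ in range(n_inputs)] for _ in range(n_hidden)]
hidden_biases = [rng.uniform(-1, 1) for _ in range(n_hidden)]
output_weights = [rng.uniform(-1, 1) for _ in range(n_hidden)]
output_bias = rng.uniform(-1, 1)

risk = mlp_forward([1.0, 0.0, 3.0, 2.0], hidden_weights, hidden_biases,
                   output_weights, output_bias)
```

The output layer is the same logistic function used in logistic regression, applied to the hidden neurons instead of the raw inputs. The 9 hidden neurons each mix all inputs, which is why individual weights are hard to interpret.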

Neural Network vs Logistic Regression Results

[Figure: Percentage of patients flagged by the neural network and logistic regression models to have an emergency readmission within 12 months that did have a readmission. x-axis: Risk Score Threshold (40 to 95); y-axis: Percentage of Flagged Patients who were Readmitted (0% to 100%). Series: this project, training data and validation data, for the neural network model and for the logistic regression model; 2006 PARR paper]
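The quantity plotted in this figure, the percentage of flagged patients who were readmitted at each risk score threshold, can be computed as sketched below. The scores and outcomes are invented for illustration. The sketch assumes risk scores on a 0 to 100 scale and that a patient is flagged when the score is at or above the threshold; the function name is hypothetical.

```python
def ppv_at_threshold(risk_scores, readmitted, threshold):
    """Share of patients with score >= threshold who were readmitted.

    Returns None when no patient is flagged at this threshold.
    """
    flagged = [outcome for score, outcome in zip(risk_scores, readmitted)
               if score >= threshold]
    if not flagged:
        return None
    return sum(flagged) / len(flagged)

# Invented example data: ten patients with risk scores and observed outcomes.
scores = [35, 48, 52, 61, 67, 73, 80, 88, 92, 97]
outcomes = [False, False, True, False, True, False, True, True, True, True]

# One value per threshold on the figure's x-axis (40, 45, ..., 95).
curve = {t: ppv_at_threshold(scores, outcomes, t) for t in range(40, 100, 5)}
```

Raising the threshold flags fewer patients, so the curve usually rises while the number of patients behind each point shrinks.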

Algorithms comparison for different timeframes

[Figure: Percentage accuracy in classification of the three modelling techniques at predicting readmission within 1, 3, 6, 9 and 12 months. Axes: readmission within (1, 3, 6, 9, 12 months) against percentage accuracy in classification (0% to 100%). Series: neural network model, logistic regression model, classification tree model]

[Figure: Positive predictive values of the three modelling techniques for predicting readmission within 1, 3, 6, 9 and 12 months. Axes: readmission within (1, 3, 6, 9, 12 months) against positive predictive value (0% to 100%). Series: neural network model, logistic regression model, classification tree model]

Conclusions (1)

• The accuracy (and PPV) in classification of the three models predicting readmission within 12 months is almost identical

• Neural networks were the best models for accurately identifying the highest number of actual readmissions, with a sensitivity of 45.4%, possibly due to their nonlinear nature

Measure | Logistic Regression | Classification Tree | Neural Network
Accuracy | 71.5% | 71.6% | 72.1%
PPV | 67.4% | 66.8% | 66.2%
Sensitivity | 40.1% | 41.7% | 45.4%
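For reference, the three measures in the table follow from a 2x2 confusion matrix as sketched below. The counts are invented to show the arithmetic and are not the study's data; the function name is hypothetical.

```python
def classification_metrics(tp, fp, tn, fn):
    """Accuracy, positive predictive value and sensitivity from confusion counts."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,   # all correct predictions / all patients
        "ppv": tp / (tp + fp),           # readmitted among those flagged
        "sensitivity": tp / (tp + fn),   # flagged among those readmitted
    }

# Invented counts: 60 patients flagged (40 correctly), 100 actual readmissions.
metrics = classification_metrics(tp=40, fp=20, tn=120, fn=60)
```

A model can have similar accuracy and PPV to another while differing in sensitivity, which is the pattern the table shows for the neural network.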


Conclusions (2)

• Number of emergency admissions in the three years prior to the triggering emergency admission is the strongest factor in predicting readmission within 12 months in ALL models. So is the number of emergency admissions in the previous 6 months.

• Severity and number of conditions that a patient has also play a role in accurately predicting readmission in all the models, with those patients who have a reference condition or COPD being more likely to have a readmission.


Conclusions (3)

• Although the neural network model gives good results at higher risk scores, the results of the technique are much more difficult to explain to a non-technical audience.

• Classification trees have a strong advantage as they allow us to visualise the important factors immediately.

• However, classification trees are not designed to allocate probabilities of readmission for individuals, as patients are sorted into groups and then the groups are allocated a probability.

• For these reasons, Logistic Regression often remains the method which gives the most easily understandable results to a non-technical audience.


Conclusions (4)

• As the prediction interval to readmission decreases, the performance of the logistic regression model in terms of PPV decreases, while the other two models retain relatively stable values irrespective of the timeframe to readmission. This is particularly true of decision trees.

• This study suggests that alternative algorithms have great potential in terms of performance, ease of use, and robustness over timeframe

• This also opens the door for exploring the benefits of newer, more sophisticated machine learning techniques: support vector machines, fuzzy approaches, etc.

• However, greater prediction improvement would probably be achieved with better and more comprehensive data (e.g. GP, social care, etc.)
