Thierry Chaussalet: Predictive risk modelling: Does technique or time matter?
Predictive risk modelling: Does technique or time matter?
Professor Thierry Chaussalet
Department of Business Information Systems, ECS
University of Westminster, London
www.healthcareinformatics.org.uk
Nuffield Trust, 13 June 2012
Acknowledgements
• Ian Winkworth, who conducted the analysis
• The Nuffield Trust, for advice throughout
Motivation
• If patients at risk of (re-)admission could be identified and offered early interventions, their lives and long-term health might be improved by reducing the chance of readmission, and the cost of their care might hopefully fall.
• This has led to the development of a flurry of predictive risk modelling tools:
  o Most are based on logistic regression, such as the PARR+ tool (J. Billings et al. 2006); however, many other algorithms exist, such as neural networks and decision trees.
  o Most are concerned with predicting the risk of (re-)admission within the following year; however, readmission within different time intervals is also of interest.
Our objectives
1. To develop and compare alternative statistical/data mining algorithms (Logistic Regression, Classification Tree and Neural Network) in order to predict the likelihood of a readmission within 12 months, based on England hospital inpatient admissions data
2. To develop and compare predictive risk models based on the three methodologies (logistic regression, classification trees, neural network) within shorter timeframes, i.e. 1, 3, 6, and 9 months.
3. In addition, to explore the benefit of adding a measure of condition severity to a “PARR-like” model.
Standard PARR Model Timeframe
[Figure: timeline running from 01/04/1999 to 31/03/2004, marking the prior hospital utilisation period, the triggering year (01/04/2002 to 31/03/2003) and the prediction time period.]
Data Extraction and Manipulation
• Data source: Hospital Episode Statistics (HES) which holds all inpatient episodes of care.
• Software used to extract the data: MySQL was used to extract a sample of just over 100,000 emergency inpatient admissions that started and ended between 01/04/2002 and 31/03/2003. The data were then split into training (70%) and validation (30%) data sets.
• Software used to fit models to the data: SAS Enterprise Miner was used to fit models to the extracted data (SPSS or open-source software such as R or RapidMiner could also be used).
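The 70/30 split described above could be sketched as follows. This is only an illustration in Python, not the SAS Enterprise Miner procedure used in the study; the function name, the fixed seed and the placeholder admission IDs are assumptions.

```python
import random

def split_train_validation(records, train_fraction=0.7, seed=0):
    """Shuffle records and split them into training and validation sets."""
    rng = random.Random(seed)      # fixed seed so the split is reproducible
    shuffled = list(records)       # copy, so the caller's list is not mutated
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Placeholder IDs standing in for the extracted admissions (not real HES data).
admission_ids = list(range(100_000))
train, validation = split_train_validation(admission_ids)
print(len(train), len(validation))
```

A random split is assumed here; the slides do not say whether the split was random or stratified.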
Independent variables
The following independent variables were used in the models:
– Age group at triggering admission, gender and ethnic origin
– Presence of certain diseases/conditions in the triggering admission or in the previous three years
– The summed total of disease severity calculated with the Charlson Comorbidity Severity Index, determined by looking at all diseases/conditions that the patient had over the previous three years. The list of diseases used in this measure is on the next slide.
– Variables such as the number of emergency inpatient admissions in the previous three years
Condition | Charlson Comorbidity Severity Index | ICD-10 codes
Ischaemic heart disease | 1 | I21-I25
Congestive heart failure (CHF) | 1 | I50, I110, I130
Peripheral vascular disease (PVD) | 1 | I700-I702, I71-I72, I731-I739, I709, I792, I771, R2
Cerebrovascular disease (CVD) | 1 | I60-I67, I69, G45, H340, R298, R470
Mental illness | 1 | F00-F09, F17-F69, F90-F99
Chronic obstructive pulmonary disease (COPD) | 1 | J43-J44
Connective tissue disease/rheumatoid arthritis (CTDRA) | 1 | M32-M36, M05, M06, M08, I39, I528, I418, I328, J990, G737
Peptic ulcer | 1 | K25-K28
Mild liver disease | 1 | K703, K743-K746, K760, K769
Diabetes without complications | 1 | E100, E101, E106, E108, E109, E110, E111, E116, E118, E119, E120, E121, E126, E128, E129, E130, E131, E136, E138, E139, E140, E141, E146, E148, E149
Hemiplegia | 2 | G041, G114, G801, G802, G81, G82, G830-G834, G839
Renal failure | 2 | N18-N20, Z940
Diabetes with complications | 2 | E102-E105, E107, E112, E115, E117, E122-E125, E127, E132-E135, E137, E142-E145, E147
Cancer | 2 | All codes beginning with C, D00-D48
Moderate to severe liver disease | 3 | I850, I859, I864, I982, K704, K711, K721, K729, K765, K766, K767
Metastatic cancer | 6 | C77-C80
HIV/AIDS | 6 | B20-B24
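As a rough illustration of how the severity index total might be computed from the table above, the sketch below sums the weights of the conditions matched by a patient's ICD-10 codes. It covers only a handful of rows, and it makes several assumptions: codes are stored without a dot (as in the table), ranges such as N18-N20 are expanded by hand, matching is by code prefix, and each condition counts once. How the study handled overlaps (for example cancer versus metastatic cancer) is not stated in the slides and is not modelled here.

```python
# Subset of the table above: (condition, weight, ICD-10 code prefixes).
# Only a few rows are included to keep the sketch short.
SEVERITY_TABLE = [
    ("Congestive heart failure", 1, ("I50", "I110", "I130")),
    ("COPD", 1, ("J43", "J44")),
    ("Renal failure", 2, ("N18", "N19", "N20", "Z940")),
    ("Metastatic cancer", 6, ("C77", "C78", "C79", "C80")),
    ("HIV/AIDS", 6, ("B20", "B21", "B22", "B23", "B24")),
]

def severity_index_total(icd10_codes):
    """Sum the weights of all conditions matched by at least one code.

    Assumption: each condition counts once, however many codes match it.
    """
    total = 0
    for _condition, weight, prefixes in SEVERITY_TABLE:
        # str.startswith accepts a tuple of prefixes
        if any(code.startswith(prefixes) for code in icd10_codes):
            total += weight
    return total

# Hypothetical patient history: COPD, renal failure and an unrelated code.
print(severity_index_total(["J440", "N189", "S720"]))  # 1 + 2 = 3
```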
Effect of severity index
Patients are more likely to have a readmission if they have a high severity index total score
The methods used
• Logistic Regression
  o Somewhat like regression but with a binary dependent variable; it leads to:
    $P(R) = \frac{1}{1 + e^{-\left(\beta_0 + \sum_{n} \beta_n X_n\right)}}$
• Decision Trees
  o Partition the independent variables into a set of homogeneous regions
  o Popular algorithms are CART, CHAID, C4.5
  o C4.5 uses the idea of information gain (entropy)
• Neural Network
  o Aims at mimicking the brain with many neurons in hidden layers that connect through “synapses”
  o Mathematically, it is a generalisation of logistic regression
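Two of the ideas named on this slide can be illustrated with a few lines of Python: the logistic formula for the probability of readmission, and the entropy-based information gain used to choose tree splits. This is a minimal sketch; the coefficients and labels are made up for illustration and are not results from the study. C4.5 itself normalises information gain into a gain ratio, which is not shown.

```python
import math
from collections import Counter

def readmission_probability(beta0, betas, xs):
    """Logistic model: P(R) = 1 / (1 + exp(-(beta0 + sum(beta_n * x_n))))."""
    z = beta0 + sum(b * x for b, x in zip(betas, xs))
    return 1.0 / (1.0 + math.exp(-z))

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, children):
    """Entropy of the parent minus the size-weighted entropy of the children."""
    n = len(parent)
    return entropy(parent) - sum(len(ch) / n * entropy(ch) for ch in children)

# Made-up coefficients: intercept -2, plus 0.5 per prior emergency admission.
p = readmission_probability(-2.0, [0.5], [4])            # z = 0, so p = 0.5
# Made-up split: 1 = readmitted, 0 = not readmitted.
gain = information_gain([1, 1, 0, 0], [[1, 1], [0, 0]])  # perfect split: 1 bit
print(p, gain)
```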
Logistic Regression - Results
• Most significant variables
  o Number of emergency admissions within the previous 3 years
  o Age 75 plus at admission
  o Number of emergency admissions within the previous 6 months
  o Average number of episodes per emergency admission spell
  o Reference condition in the previous 3 years
  o The severity index is also significant
Decision tree – Results
Factor | Factor name in tree | Relative importance in model
The number of emergency admissions within the previous 3 years | NumberOfEMAD_within_3years | 1.000
The severity index total score for conditions in the current admission and in the previous 3 years | Severity_Index | 0.246
The number of emergency admissions within the previous 6 months | NumberOfEMAD_within_6months | 0.068
Whether the patient had an emergency admission due to COPD in the previous 3 years | COPD | 0.062
Whether the patient had a reference condition in the current admission or in the previous 3 years | Ref_condition_prev_3_yrs | 0.060
These factors were also found significant with logistic regression; however, factors such as age, ethnic origin and some conditions were significant in the regression model but not in the tree model.
Decision Trees – Results
If a patient had 2 emergency admissions within the previous 3 years and a severity index of 4 or more in the previous 3 years, then s/he is predicted to have an emergency readmission within 12 months. Of the 780 patients in this group, 62.3% actually had a readmission.
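The rule quoted on this slide is a single leaf of the tree, and can be written as a one-line predicate. The sketch below reads "had 2 emergency admissions" literally as exactly two; whether the actual split was "2 or more" is not stated, so that is an assumption, as are the function and argument names.

```python
def predicted_readmission(emergency_admissions_3y, severity_index):
    """Leaf rule quoted on the slide; the full tree has other leaves."""
    return emergency_admissions_3y == 2 and severity_index >= 4

print(predicted_readmission(2, 5))   # True
print(predicted_readmission(2, 3))   # False

# 62.3% of the 780 patients in this leaf were readmitted: about 486 patients.
readmitted_in_leaf = round(0.623 * 780)
print(readmitted_in_leaf)            # 486
```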
Neural Network
• Number of hidden layers: 1
• Number of hidden neurons: 9
• Network architecture: Multilayer Perceptron
Due to their complex structure, neural network results are a lot more difficult to interpret.
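To show what a multilayer perceptron with one hidden layer of 9 neurons computes, here is a minimal forward-pass sketch. The sigmoid activation, the three inputs and the random weights are assumptions for illustration only; the slides do not give the activation function or the fitted weights.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def mlp_probability(x, hidden_weights, hidden_biases, out_weights, out_bias):
    """One hidden layer of sigmoid units feeding a single sigmoid output."""
    hidden = [
        sigmoid(b + sum(w * xi for w, xi in zip(ws, x)))
        for ws, b in zip(hidden_weights, hidden_biases)
    ]
    return sigmoid(out_bias + sum(w * h for w, h in zip(out_weights, hidden)))

rng = random.Random(0)               # untrained, random weights
n_inputs, n_hidden = 3, 9            # 9 hidden neurons, as on the slide
hidden_weights = [[rng.uniform(-1, 1) for _ in range(n_inputs)]
                  for _ in range(n_hidden)]
hidden_biases = [0.0] * n_hidden
out_weights = [rng.uniform(-1, 1) for _ in range(n_hidden)]

p = mlp_probability([2.0, 1.0, 0.0], hidden_weights, hidden_biases,
                    out_weights, 0.0)
print(p)
```

The output unit has the same form as the logistic regression formula, applied to the hidden-layer outputs rather than to the raw variables, which is one way to read the "generalisation of logistic regression" remark.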
Neural Network vs Logistic Regression Results
[Figure: percentage of patients flagged by the neural network and logistic regression models as having an emergency readmission within 12 months who did have a readmission, plotted against risk score threshold (40 to 95). Series: this project's training and validation data for each model, and the 2006 PARR paper.]
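The quantity plotted in this figure can be computed as sketched below: for a given risk score threshold, take the patients whose score reaches the threshold and report what share of them were readmitted. The scores and outcomes are made up, and treating "flagged" as score greater than or equal to the threshold is an assumption.

```python
def percent_flagged_readmitted(scores, readmitted, threshold):
    """Share (%) of patients with score >= threshold who were readmitted."""
    flagged = [r for s, r in zip(scores, readmitted) if s >= threshold]
    if not flagged:
        return None                  # nobody flagged at this threshold
    return 100.0 * sum(flagged) / len(flagged)

# Made-up risk scores (0-100) and outcomes (1 = readmitted).
scores = [42, 55, 61, 70, 83, 91]
readmitted = [0, 1, 0, 1, 1, 1]
for threshold in (40, 60, 80):
    print(threshold, percent_flagged_readmitted(scores, readmitted, threshold))
```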
Algorithms comparison for different timeframes
[Figure: percentage accuracy in classification of the three modelling techniques (neural network, logistic regression, classification tree) at predicting readmission within 1, 3, 6, 9 and 12 months.]
[Figure: positive predictive values of the three modelling techniques (neural network, logistic regression, classification tree) for predicting readmission within 1, 3, 6, 9 and 12 months.]
Conclusions (1)
• The accuracy (and PPV) in classification of the three models predicting readmission within 12 months is almost identical.
• Neural networks were the best models for identifying the highest number of actual readmissions, with a sensitivity of 45.4%, possibly due to their nonlinear nature.
Measure | Logistic Regression | Classification Tree | Neural Network
Accuracy | 71.5% | 71.6% | 72.1%
PPV | 67.4% | 66.8% | 66.2%
Sensitivity | 40.1% | 41.7% | 45.4%
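For reference, the three measures in the table follow the standard definitions from a 2x2 confusion matrix. The sketch below uses made-up counts, not the study's; the underlying counts are not given in the slides.

```python
def classification_metrics(tp, fp, tn, fn):
    """Accuracy, PPV and sensitivity from confusion-matrix counts."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,     # share of all patients classified correctly
        "ppv": tp / (tp + fp),             # share of flagged patients who were readmitted
        "sensitivity": tp / (tp + fn),     # share of readmitted patients who were flagged
    }

# Made-up counts for illustration only.
m = classification_metrics(tp=40, fp=10, tn=90, fn=60)
print(m)
```

With these definitions a model can have high PPV but low sensitivity, as in the table: most flagged patients are readmitted, yet fewer than half of all readmissions are flagged.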
Conclusions (2)
• The number of emergency admissions in the three years prior to the triggering emergency admission is the strongest factor in predicting readmission within 12 months in ALL models. The number of emergency admissions in the previous 6 months is also a strong factor.
• The severity and number of conditions that a patient has also play a role in accurately predicting readmission in all the models, with patients who have a reference condition or COPD being more likely to have a readmission.
Conclusions (3)
• Although the neural network model gives good results at higher risk scores, the results of the technique are much more difficult to explain to a non-technical audience.
• Classification trees have a strong advantage as they allow us to visualise the important factors immediately.
• However, classification trees are not designed to allocate probabilities of readmission to individuals: patients are sorted into groups, and each group is then allocated a probability.
• For these reasons, Logistic Regression often remains the method which gives the most easily understandable results to a non-technical audience.
Conclusions (4)
• As the prediction interval to readmission decreases, the performance of the logistic regression model in terms of PPV decreases, while the other two models retain relatively stable values irrespective of the timeframe to readmission. This is particularly true of decision trees.
• This study suggests that alternative algorithms have great potential in terms of performance, ease of use, and robustness across timeframes.
• This also opens the door to exploring the benefits of newer, more sophisticated machine learning techniques, such as support vector machines and fuzzy approaches.
• However, greater prediction improvement would probably be achieved with better and more comprehensive data (e.g. GP and social care data).