Rating organ failure via adverse events using data mining in the intensive care unit


Alvaro Silva a, Paulo Cortez b,*, Manuel Filipe Santos b, Lopes Gomes c, Jose Neves d

a Servico de Cuidados Intensivos, Hospital Geral de Santo Antonio, Porto, Portugal
b Departamento de Sistemas de Informacao, Universidade do Minho, Campus de Azurem, 4800-058 Guimaraes, Portugal
c Clinica Medica I, Inst. de Ciencias Biomedicas Abel Salazar, Porto, Portugal
d Departamento de Informatica, Universidade do Minho, Braga, Portugal

Received 5 February 2007; received in revised form 28 March 2008; accepted 31 March 2008

Artificial Intelligence in Medicine (2008) 43, 179—193

http://www.intl.elsevierhealth.com/journals/aiim

KEYWORDS: Adverse event; Artificial neural networks; Critical care; Data mining; Multinomial logistic regression; Organ failure assessment

Summary

Objective: The main intensive care unit (ICU) goal is to avoid or reverse the organ failure process by adopting a timely intervention. Within this context, early identification of organ impairment is a key issue. The sequential organ failure assessment (SOFA) is an expert-driven score that is widely used in European ICUs to quantify organ disorder. This work proposes a complementary data-driven approach based on adverse events, defined from commonly monitored biometrics. The aim is to study the impact of these events when predicting the risk of ICU organ failure.

Materials and methods: A large database was considered, with a total of 25,215 daily records taken from 4425 patients and 42 European ICUs. The input variables include the case mix (i.e. age, diagnosis, admission type and admission from) and adverse events defined from four bedside physiologic variables (i.e. systolic blood pressure, heart rate, pulse oximeter oxygen saturation and urine output). The output target is the organ status (i.e. normal, dysfunction or failure) of six organ systems (respiratory, coagulation, hepatic, cardiovascular, neurological and renal), as measured by the SOFA score. Two data mining (DM) methods were compared: multinomial logistic regression (MLR) and artificial neural networks (ANNs). These methods were tested in the R statistical environment, using 20 runs of a 5-fold cross-validation scheme. The area under the receiver operator characteristic (ROC) curve and Brier score were used as the discrimination and calibration measures.

Results: The best performance was obtained by the ANNs, outperforming the MLR in both discrimination and calibration criteria. The ANNs obtained an average (over all

* Corresponding author. Tel: +351 253 510313; fax: +351 253 510300. E-mail address: [email protected] (P. Cortez).

0933-3657/$ - see front matter © 2008 Elsevier B.V. All rights reserved. doi:10.1016/j.artmed.2008.03.010


180 A. Silva et al.

1. Introduction

Since the early 1980s, clinical scores have been developed to assess severity of illness and organ dysfunction in the intensive care unit (ICU) setting [1]. Indeed, in the context of intensive medicine, severity scores are instruments that aim primarily at stratifying patients based on risk adjustment of the clinical condition. Furthermore, these tools have been used to improve the quality of intensive care and guide local planning of resources.

The majority of these scores are static, since they use data collected only on the first ICU day, such as the acute physiology and chronic health evaluation system (APACHE) [2], the simplified acute physiology score (SAPS) [3] or the mortality probability model (MPM) [4]. Yet, these static scores fail to recognize several factors that can influence the patient outcome after the first 24 h (e.g. the therapeutic strategy and the patient's response).

More recently, dynamic (or repetitive) scores have been designed, where the data and scores are updated on a daily basis. The most used scores include [5]: the sequential organ failure assessment (SOFA), multiple organs dysfunction score (MODS) and logistic organ dysfunction (LOD). Our focus is on the SOFA score, which was first proposed to evaluate morbidity (degree of organ failure) [6] and was later shown to be related to mortality risk [7,8].

The SOFA scores six organ systems (respiratory, coagulation, hepatic, cardiovascular, neurological and renal) on a scale ranging from 0 to 4, according to the degree of failure. This is an expert-driven score, in the sense that it was developed by a panel of experts who chose a set of variables and rules based on their personal opinions [5]. The SOFA is widely used in European ICUs; nevertheless, some issues remain unsolved. Firstly, for some of the variables (e.g. platelets and bilirubin), the SOFA uses the worst value obtained in the last 24 h and it is not clear how many times a day they should be measured. Also, the SOFA is a classification system that does not provide a risk (i.e. probability) of the outcome of interest (i.e. organ failure).

On the other hand, bedside monitoring of physiologic variables is universal and routinely registered during the patient's ICU stay. Indeed, ICU physicians tend to analyze these monitoring data in an empirical fashion in order to trigger an action given a specific condition. The relationships within these data are complex, nonlinear and not fully understood. For instance, if a severe arterial hypotension (i.e. low blood pressure) arises, then renal or cardiovascular failure may follow. Yet, it is not clear what duration and/or severity of the hypotension triggers the latter outcomes. Thus, monitoring analysis is not standardized and mainly relies on the physicians' knowledge and experience to interpret the data. The SOFA score uses both physiological parameters (e.g. hypotension) and laboratory data (e.g. platelets). However, the latter usually depend on previous physiological impairments. For example, a severe and long hypotension associated with hypoxemia can lead to hepatic failure (i.e. bilirubin increase). Therefore, using only biometric data should potentially allow a more adequate evaluation and early therapeutic intervention.

Yet, as more and more biometrics are continuously monitored (e.g. mechanical ventilator, cardiovascular device), the amount of data available increases exponentially, generating alarms that need to be interpreted. In previous work [9], we have shown that out of range measurements (or adverse events) of four biometrics (i.e. systolic blood pressure, heart rate, pulse oximeter oxygen saturation and urine output) have an impact on the mortality outcome of ICU patients. Since multiple organ failure is a major cause of ICU mortality [8], it is rational to assess the impact of the adverse events on organ system function at an early stage.

One of the most promising recent developments in intensive care consists in the use of artificial intelligence/data mining (DM) techniques [1,10]. The fast growing amount of data collected has led to vast and complex databases that exceed the human capability for comprehension without using computational resources. The goal of data

organs) area under the ROC curve of 69, 64 and 74% and Brier scores of 0.18, 0.16 and 0.09 for the normal, dysfunction and failure organ conditions, respectively. In particular, very good results were achieved when predicting renal failure (ROC curve area of 76% and Brier score of 0.06).

Conclusion: Adverse events, taken from bedside monitored data, are important intermediate outcomes, contributing to a timely recognition of organ dysfunction and failure during ICU length of stay. The obtained results show that it is possible to use DM methods to get knowledge from easily obtainable data, thus making room for the development of intelligent clinical alarm monitoring.

© 2008 Elsevier B.V. All rights reserved.


mining is to discover interesting knowledge from the raw data by using automatic discovery tools [11].

There are several DM techniques, each one with its own purposes and advantages. The majority of the severity scores use statistical methods such as the logistic regression (LR), which is easy to interpret. Yet, such classical statistics may not be suitable for the complex nonlinear relationships often found in biomedical data [1]. Artificial neural networks (ANNs) are connectionist models inspired by the behavior of the human brain [12]. In ICUs, ANNs are gaining increasing acceptance due to the advantages of nonlinear learning and high flexibility. Indeed, ANNs have been applied to predict mortality and length of stay [1,10].

Motivated by the results obtained in [13], a novel approach is presented in this work, where the main goal is to explore the impact of the adverse events, during the last 24 h, on the current day organ risk condition (i.e. normal, dysfunction or failure). As a secondary goal, two DM techniques (i.e. LR and ANNs) are evaluated and compared. The proposed approach will be tested on a large database, which includes daily records of 4425 patients taken from 42 European ICUs.

The paper is organized as follows. Section 2 presents the ICU clinical data, DM models, feature selection approach and computational environment. Next, the results are analyzed (Section 3) and discussed (Section 4). Finally, closing conclusions are drawn (Section 5).

2. Materials and methods

2.1. Intensive care data

The database used in the present study was constructed by the authors from the EURICUS II study. The EURICUS II project was conducted from November 1998 to August 1999 and encompassed 42 ICUs from 9 European Union countries (see [14] for more details).

Table 1 The protocol for the out of range physiologic measurements

| | BP | SpO2 | HR | UR |
|---|---|---|---|---|
| Normal range | 90-180 mmHg | ≥ 90% | 60-120 bpm | ≥ 30 ml/h |
| Event a | ≥ 10 min | ≥ 10 min | ≥ 10 min | ≥ 1 h |
| Event b | ≥ 10 min in 30 min | ≥ 10 min in 30 min | ≥ 10 min in 30 min | - |
| Critical event a | ≥ 1 h | ≥ 1 h | ≥ 1 h | ≥ 2 h |
| Critical event b | ≥ 1 h in 2 h | ≥ 1 h in 2 h | ≥ 1 h in 2 h | - |
| Critical event c | < 60 mmHg | < 80% | < 30 bpm or > 180 bpm | ≤ 10 ml/h |

BP, blood pressure; HR, heart rate; SpO2, pulse oximeter oxygen saturation; UR, urine output.
a Defined when continuously out of range.
b Defined when intermittently out of range.
c Defined anytime.

In each participating ICU, monitoring data were collected and registered manually. According to the universal monitoring practice, every hour all ICU patient biometrics were recorded in a standardized sheet form by the nursing staff. Also, the adverse events were recorded in a specific sheet on an hourly basis. The registered data were submitted to a double check, at both the local (i.e. ICU) and central (i.e. Health Services Research Unit of the Groningen University Hospital, the Netherlands) levels. The latter unit was used to gather the full database.

Two main criteria were used for the event definition. First, its occurrence and duration should be registered by physiological changes (e.g. shock and not pneumonia). Second, the related physiological variables should be routinely registered at regular intervals. Four biometrics fulfilled these requirements: the systolic blood pressure (BP), the heart rate (HR), the pulse oximeter oxygen saturation (SpO2) and the hourly urine output (UR). The normal ranges for these parameters (see Table 1) were set by a panel of seven experts. An alarm is triggered if there is an out of range value during a given time, defining an event. It should be noted that the minimum time period was set to 10 min to minimize the number of false alarms triggered by technical problems (e.g. disconnected sensor). For each biometric, the daily number of events was stored. When a longer event occurs or a more extreme physiologic measurement is found, it is called a critical event. For this last case, the database includes daily entries with the number of critical events and their duration. Table 2 shows a synopsis of the ICU variables considered. The first four attributes (the case mix) are static, being collected during the patient's admission. The next 12 variables are related to the adverse events.

Table 2 The intensive care variables

| Attribute | Description | Min. | Max. | Mean a |
|---|---|---|---|---|
| admtype | Admission type | Categorical b | | |
| admfrom | Admission origin | Categorical c | | |
| SAPS II | SAPS II score | 0 | 118 | 40.9 ± 16.4 |
| age | Age of the patient | 18 | 100 | 62.5 ± 18.2 |
| NBP | Daily number of blood pressure events | 0 | 24 | 0.8 ± 1.9 |
| NHR | Daily number of heart rate events | 0 | 24 | 0.6 ± 2.3 |
| NSpO2 | Daily number of oxygen events | 0 | 24 | 0.4 ± 1.8 |
| NUR | Daily number of urine events | 0 | 24 | 1.0 ± 3.0 |
| NCRBP | Daily number of critical blood pressure events | 0 | 10 | 0.3 ± 0.7 |
| NCRHR | Daily number of critical heart rate events | 0 | 10 | 0.2 ± 0.6 |
| NCRSpO2 | Daily number of critical oxygen events | 0 | 6 | 0.1 ± 0.4 |
| NCRUR | Daily number of critical urine events | 0 | 7 | 0.4 ± 0.8 |
| TCRBP | Time of critical blood pressure events (% of 24 h) | 0 | 24.7 | 0.8 ± 2.7 |
| TCRHR | Time of critical heart rate events (% of 24 h) | 0 | 24.7 | 1.0 ± 3.4 |
| TCRSpO2 | Time of critical oxygen events (% of 24 h) | 0 | 24.7 | 0.4 ± 2.1 |
| TCRUR | Time of critical urine events (% of 24 h) | 0 | 24.7 | 1.6 ± 4.5 |

a Mean and sample standard deviation.
b 1, unscheduled surgery; 2, scheduled surgery; 3, medical.
c 1, operating theatre; 2, recovery room; 3, emergency room; 4, general ward; 5, other ICU; 6, other hospital; 7, other sources.

On a daily basis, the SOFA score was computed for six organ systems (respiratory, coagulation, hepatic, cardiovascular, neurological and renal) by collecting the raw data presented in Table 3 during the last 24 h. The SOFA values range from 0 to 4, with the following interpretation: 0, normal; 1 or 2, dysfunction; 3 or 4, failure.
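The two data definitions of this section (the continuous adverse event of Table 1 and the three-level reading of the SOFA score) can be illustrated with a minimal Python sketch. The function names are hypothetical, and the event counter assumes one sample per minute, which is an illustrative assumption rather than the recording protocol of the study.

```python
def sofa_to_condition(score):
    """Map a daily SOFA organ score (0..4) to normal, dysfunction or failure."""
    if score == 0:
        return "normal"
    if score in (1, 2):
        return "dysfunction"
    if score in (3, 4):
        return "failure"
    raise ValueError(f"SOFA organ score must be in 0..4, got {score!r}")


def count_continuous_events(samples, low, high, min_samples):
    """Count runs of at least `min_samples` consecutive out-of-range values.

    Assumption: `samples` holds one measurement per minute, so that
    min_samples=10 corresponds to the 10 min rule for continuous events.
    """
    events, run = 0, 0
    for value in samples:
        if value < low or value > high:
            run += 1              # still out of range: extend the current run
        else:
            if run >= min_samples:
                events += 1       # the run that just ended was long enough
            run = 0
    if run >= min_samples:        # a run may still be open at the end
        events += 1
    return events
```

For the systolic blood pressure, a call such as `count_continuous_events(bp, 90, 180, 10)` would apply the normal range and minimum duration listed in Table 1. Intermittent and critical events would need additional rules that are not sketched here.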

The exclusion criteria followed the SAPS II definitions [3], i.e. patients younger than 18 years old, burned or with recent coronary bypass surgery. Also, the last day of stay data entries were discarded, since the SOFA score is only defined for a 24 h time frame and several of these patients were discharged earlier. The final database contains a total of 25,215 daily records taken from 4425 critically ill patients.

Table 3 The SOFA variables and scoring rules (adapted from [7])

| Organ/variable | SOFA 0 | SOFA 1 | SOFA 2 | SOFA 3 | SOFA 4 |
|---|---|---|---|---|---|
| Respiratory: PaO2/FIO2 (mmHg) | > 400 | ≤ 400 | ≤ 300 | ≤ 200 a | ≤ 100 a |
| Coagulation: Platelets (×10^3 mm^-3) | > 150 | ≤ 150 | ≤ 100 | ≤ 50 | ≤ 20 |
| Hepatic: Bilirubin (μmol/l) | > 20 | < 32 | < 101 | < 204 | > 204 |
| Cardiovascular: Hypotension b | None | MAP < 70 mmHg | dop: ≤ 5 or dobutamine (any dose) | dop: < 5 or epi: ≤ 0.1 or norepi. ≤ 0.1 | dop: > 15 or epi: > 0.1 or norepi. > 0.1 |
| Neurological: Glasgow coma score | 15 | 13-14 | 10-12 | 6-9 | < 6 |
| Renal: Creatinine (μmol/l) | < 110 | ≥ 110 | ≥ 171 | ≥ 300 | ≥ 440 |
| Renal: Or urine output | | | | < 500 ml/day | < 200 ml/day |

PaO2, arterial oxygen tension; FIO2, fractional inspired oxygen; MAP, mean arterial pressure; dop., dopamine; epi., epinephrine; norepi., norepinephrine.
a With respiratory support.
b Agents administered for at least 1 h (doses in μg/kg per min).

Figure 1 The organ condition prevalence during the ICU length of stay (x-axis denotes the daily SOFA value and the y-axis the frequency of the x value within the whole dataset).

Fig. 1 plots the histograms of the SOFA values for each organ (computed over the whole database). The figure shows the prevalence of each condition, denoting skewed distributions, i.e. the number of normal conditions is higher than the number of failures. During the preprocessing stage, each SOFA variable was transformed into a three-class output, with one class for each organ condition: normal, dysfunction and failure.

For demonstrative purposes, Fig. 2 presents the boxplots of the time of critical events associated with each renal status. In the boxplots, it is difficult to find a clear pattern that relates adverse events to the organ condition, suggesting that this is a nontrivial task.

2.2. Data mining methods

Data mining is an emerging area that lies at the intersection of statistics, artificial intelligence and data management. DM tasks can be classified into two categories [11]: descriptive, where the intention is to characterize the properties of the data; and predictive, to forecast the unknown value of an output target given known values of other variables (the inputs). Predictive tasks can be further divided into classification, when the output domain is discrete, and regression, when the dependent variable is continuous.

The multinomial logistic regression (MLR) is the extension of the common logistic method to multi-class tasks. Let $c_j \in C$ be the condition $j$ and $C$ the set of all possible classes; then the respective estimated probability ($\hat{p}_j$) is given by [15]:

$$\hat{p}_j = \frac{\exp(\eta_j(x))}{\sum_{k=1}^{C} \exp(\eta_k(x))}, \qquad \eta_j(x) = \sum_{i=1}^{I} \beta_{j,i}\, x_i \qquad (1)$$

where $\beta_{j,0}, \ldots, \beta_{j,I}$ denote the parameters of the model, and $x_1, \ldots, x_I$ the independent (input) variables. This model requires that $\eta_k(x) \equiv 0$ for one $c_k \in C$ (the baseline group) and this assures that $\sum_{j=1}^{C} \hat{p}_j = 1$. It should be noted that the selection of the baseline class ($c_k$) does not affect the MLR performance.
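As an illustration of Eq. (1), the following Python sketch computes the class probabilities from a set of coefficients, with the linear predictor of the baseline class fixed at zero. The function name, the dictionary layout and the numeric coefficients are hypothetical. An intercept can be emulated by prepending a constant input of 1 to `x`, which is an assumption of this sketch.

```python
import math


def mlr_probabilities(x, betas, baseline):
    """Class probabilities of Eq. (1).

    x        -- list of input values
    betas    -- dict: class name -> coefficient list (same length as x),
                for every class except the baseline
    baseline -- name of the baseline class, whose linear predictor is 0
    """
    eta = {baseline: 0.0}
    for label, coefs in betas.items():
        eta[label] = sum(b * xi for b, xi in zip(coefs, x))
    # Subtracting the maximum avoids overflow and leaves the ratios unchanged.
    top = max(eta.values())
    expo = {label: math.exp(v - top) for label, v in eta.items()}
    total = sum(expo.values())
    return {label: v / total for label, v in expo.items()}


# Hypothetical coefficients for two inputs, with dysfunction as baseline.
probs = mlr_probabilities(
    [1.0, 2.0],
    {"normal": [0.5, -0.2], "failure": [-0.3, 0.1]},
    baseline="dysfunction",
)
```

Because the baseline predictor is zero, its probability equals one minus the sum of the other class probabilities.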

The multilayer perceptron is a popular ANN, where processing neurons are grouped into layers and connected by weighted links [12]. The ANN is activated by feeding the input layer with the input variables and then propagating the activations in a feedforward fashion, via the weighted connections, through the entire network.

A fully connected network, with one hidden layer of H nodes, will be adopted in this work. For multi-class data, the ANN outputs can be interpreted as probabilities if the logistic function is applied to the hidden neurons and the linear function is used at the C output nodes. Then, the final ANN probability estimate for the class j is given by [15]:

$$\hat{p}_j = \frac{\exp(y_j)}{\sum_{k=1}^{C} \exp(y_k)} \quad \text{(softmax function)}, \qquad y_i = w_{i,0} + \sum_{m=I+1}^{I+H} f\!\left(\sum_{n=1}^{I} x_n w_{m,n} + w_{m,0}\right) w_{i,m} \qquad (2)$$

where $y_i$ is the output of the network for the node $i$; $f(x) = 1/(1+\exp(-x))$ is the logistic function; $I$ represents the number of input neurons; $w_{d,s}$ the weight of the connection between nodes $s$ and $d$; and $w_{d,0}$ is a constant called bias. The first equation, known as the softmax function, guarantees that $\hat{p}_j \in [0,1]$ and $\sum_{j=1}^{C} \hat{p}_j = 1$. The simplest ANN (with $H = 0$) is equivalent to the MLR model and more complex discrimination functions can be learned with a higher number of hidden neurons (Fig. 3). Yet, a high value of $H$ will induce generalization loss (i.e. overfitting).

Figure 2 Boxplots of the time of critical events for each renal condition. Each box is delimited by first (bottom) and third (top) quartiles. Mean values are represented by black diamonds and outliers by open circles. The latter were defined if outside 1.5× the interquartile range of the box.

Figure 3 Example of a multinomial logistic regression (left) and artificial neural network with two hidden nodes (right).

The logistic model is easier to interpret than ANNs. Nevertheless, it is possible to gather knowledge about what the ANN has learned by measuring the relative importance of the inputs (Section 2.3) and extracting rules. The latter issue is still an active research domain [16]. In this work, the pedagogical technique presented in [9] will be adopted, where the direct relationships between the inputs and outputs of the ANN are extracted by using a decision tree [17].
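The forward pass of Eq. (2) can be sketched in a few lines of Python. This is an illustrative reading of the equation under stated assumptions, not the nnet implementation: the function names and the weight layout (one bias plus one weight list per node) are hypothetical. The weight count helper assumes a fully connected network with one hidden layer, biases and no direct input to output links.

```python
import math


def logistic(z):
    """Logistic activation f of the hidden nodes."""
    return 1.0 / (1.0 + math.exp(-z))


def ann_probabilities(x, hidden, output):
    """Softmax class probabilities of a one-hidden-layer network.

    hidden -- list of (bias, input_weights) pairs, one per hidden node
    output -- list of (bias, hidden_weights) pairs, one per class
    """
    h = [logistic(b + sum(w * xi for w, xi in zip(ws, x))) for b, ws in hidden]
    y = [b + sum(w * hi for w, hi in zip(ws, h)) for b, ws in output]
    top = max(y)                              # stabilizes the exponentials
    expo = [math.exp(v - top) for v in y]
    total = sum(expo)
    return [v / total for v in expo]


def n_weights(n_inputs, n_hidden, n_classes):
    """Number of weights, biases included."""
    return (n_inputs + 1) * n_hidden + (n_hidden + 1) * n_classes
```

Under these assumptions, `n_weights(19, 8, 3)` gives 187, which agrees with the number of weights reported in Section 3.2 for the renal network with 19 input, 8 hidden and 3 output neurons.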


2.3. Sensitivity analysis and feature selection

The sensitivity analysis [18] is a simple procedure that analyses the model responses when the inputs are changed. Although originally proposed for ANNs, this sensitivity method can also be applied to other DM models, such as logistic regression or support vector machines [19]. Let $\hat{p}^{\,i}_{c_j}$ denote the probability of condition $c_j$ when all input variables are held at their average values. The exception is the attribute $x_a$, which varies through its range with $i \in \{1, \ldots, L\}$ levels. In this work, we will adopt the average gradient ($G_a$) as the sensitivity measure. For a multi-class domain, it is given by

$$G_a = \frac{\sum_{j=1}^{C} \sum_{i=1}^{L-1} \left| \hat{p}^{\,i+1}_{c_j} - \hat{p}^{\,i}_{c_j} \right|}{C(L-1)}, \qquad R_a = \frac{G_a}{\sum_{k=1}^{A} G_k} \times 100\ (\%) \qquad (3)$$

where $A$ denotes the number of input attributes and $R_a$ the relative importance of attribute $a$ (in %). In the experiments, $L$ will be set to the number of discrete values for the nominal attributes and to 6 for the continuous inputs ($x_a \in \{-1.0, -0.6, \ldots, 1.0\}$).
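A minimal Python sketch of the average gradient and relative importance of Eq. (3) is given below. It assumes a hypothetical `predict` callable that returns the list of class probabilities for one input vector; any fitted model could play that role. The names and the zero-sum fallback are choices of this sketch.

```python
def average_gradient(predict, baseline, a, levels):
    """Average gradient G_a: input `a` sweeps `levels`, others stay at baseline."""
    responses = []
    for value in levels:
        x = list(baseline)
        x[a] = value
        responses.append(predict(x))
    n_classes, n_levels = len(responses[0]), len(levels)
    total = sum(
        abs(responses[i + 1][j] - responses[i][j])
        for j in range(n_classes)
        for i in range(n_levels - 1)
    )
    return total / (n_classes * (n_levels - 1))


def relative_importance(predict, baseline, levels_per_input):
    """Relative importance R_a (in %) of every input."""
    grads = [
        average_gradient(predict, baseline, a, levels)
        for a, levels in enumerate(levels_per_input)
    ]
    total = sum(grads)
    if total == 0:                    # the model ignores every input
        return [0.0 for _ in grads]
    return [100.0 * g / total for g in grads]
```

With six levels per continuous input, `levels` would be `[-1.0, -0.6, -0.2, 0.2, 0.6, 1.0]`, matching the grid stated above for scaled inputs.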

Feature selection methods [20] are useful to discard irrelevant inputs, leading to simpler models that are easier to interpret and often present higher predictive accuracies. A covariance analysis was applied to the attributes of Table 2, revealing weak relationships except for the variables related to the same biometric (e.g. the correlation between NCRBP and TCRBP is 0.7). This suggests that the number of irrelevant features is low, although the covariance procedure is only capable of measuring linear dependences. Therefore, a backward variable selection method will be applied to both the MLR and ANN models.

The backward search will be guided by the sensitivity measure [18], allowing a reduction of the computational effort by a factor of A when compared to the standard backward selection algorithm [20]. All inputs are used at the beginning and the data is randomly split into training (66.6%) and validation (33.3%) sets. In each iteration, the former set is used to fit the model and get the importance values ($R_a$), while the validation data is used to assess the generalization error. Then, the least relevant feature (i.e. with the lowest $R_a$) is discarded. The process is repeated until there is no error improvement during E iterations (in this work set to E = 3) or after A cycles. Finally, the lowest validation error is the criterion for selecting the best set of variables.
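The backward search described above can be outlined as follows. This is a sketch under assumptions: `fit_and_score` is a hypothetical callback that fits a model on the training split with the given features and returns the validation error together with a dictionary of importance values, and the handling of ties and of the last remaining feature is a choice of this sketch.

```python
def backward_selection(features, fit_and_score, patience=3):
    """Sensitivity-guided backward selection.

    fit_and_score(active) -> (validation_error, {feature: importance})
    Stops after `patience` iterations without improvement (E in the text)
    or when a single feature is left (at most A cycles).
    """
    active = list(features)
    best_error, best_set = float("inf"), list(active)
    stalled = 0
    while active:
        error, importance = fit_and_score(active)
        if error < best_error:
            best_error, best_set, stalled = error, list(active), 0
        else:
            stalled += 1
            if stalled >= patience:
                break
        if len(active) == 1:
            break
        # Discard the least relevant feature and iterate.
        weakest = min(active, key=lambda name: importance[name])
        active.remove(weakest)
    return best_set, best_error
```

Only one model is fitted per iteration, since the importance values replace the trial removal of every remaining feature. This is where the reduction by a factor of A comes from.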

2.4. Evaluation

The receiver operating characteristic (ROC) curve shows the performance of a two class classifier across the range of possible threshold ($D$) values, plotting one minus the specificity (x-axis) versus the sensitivity (y-axis) [21]. The overall accuracy is given by the area under the curve ($\mathrm{AUC} = \int_0^1 \mathrm{ROC}\, dD$), measuring the degree of discrimination that can be obtained from a given model. In intensive care, the AUC is the most popular metric for prognostic scores [10], where the ideal method should present an AUC of 1.0, while an AUC of 0.5 denotes a random classifier. In the medical literature, values of AUC above 0.7 are considered acceptable [1,10]. Multi-class problems can be handled by producing one ROC for each class [21]. The ROC graph for the class reference $c_i$ is generated by considering the positive ($c_i$) and negative ($C \setminus c_i$) labels. The global AUC can then be computed by summing the AUCs weighted by the prevalence of $c_i$ in the data, using [22]:

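$$\mathrm{AUC}_{Global} = \sum_{c_i \in C} \mathrm{AUC}(c_i)\, \mathrm{prev}(c_i), \qquad \mathrm{prev}(c_i) = \#c_i / N \qquad (4)$$

where $\mathrm{AUC}(c_i)$ denotes the AUC for class reference $c_i$, $\#c_i$ the number of patients with condition $c_i$ and $N$ is the total number of patients.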
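Eq. (4) can be illustrated with the short Python sketch below. Instead of tracing the ROC curve, it uses the rank (Mann-Whitney) formulation of the AUC, i.e. the fraction of positive/negative pairs that are ordered correctly, which is a substitution made here for brevity. The function names are hypothetical and every class is assumed to have at least one positive and one negative example.

```python
def auc(scores, labels):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    # Ties between a positive and a negative score count as half a win.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def global_auc(prob_rows, targets, classes):
    """Prevalence-weighted sum of the one-vs-rest AUCs, as in Eq. (4)."""
    n = len(targets)
    total = 0.0
    for j, label in enumerate(classes):
        is_positive = [t == label for t in targets]
        prevalence = sum(is_positive) / n
        total += auc([row[j] for row in prob_rows], is_positive) * prevalence
    return total
```

The pairwise loop is quadratic in the number of records, so a sort-based procedure would be preferable for a database of this size.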

Another important criterion is the calibration, which measures how close the predictions ($\hat{p}$) are to the true probabilities ($p$) of an event. In this work, calibration will be assessed using the widely used Brier score ($\in [0,1]$), which is defined for a two-class scenario as [23]:

$$\mathrm{Brier}(c_j) = \frac{1}{N} \sum_{i=1}^{N} (p_{ij} - \hat{p}_{ij})^2 \qquad (5)$$

where $p_{ij}$ and $\hat{p}_{ij}$ denote the actual $c_j$ outcome (0 or 1) for the patient $i$ and the respective probability estimation. Inspired by the multi-class AUC metric, the global Brier score is defined as

$$\mathrm{Brier}_{Global} = \sum_{c_i \in C} \mathrm{Brier}(c_i)\, \mathrm{prev}(c_i) \qquad (6)$$

The lower the value, the better is the calibration, with the perfect model presenting a Brier score of 0.
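Eqs. (5) and (6) translate almost directly into code. The following Python sketch uses hypothetical function names and assumes that each row of `prob_rows` holds the predicted probabilities in the same order as `classes`.

```python
def brier(prob_rows, targets, classes, j):
    """Brier score of Eq. (5) for the class at position j."""
    label = classes[j]
    squared = [((t == label) - row[j]) ** 2 for row, t in zip(prob_rows, targets)]
    return sum(squared) / len(targets)


def global_brier(prob_rows, targets, classes):
    """Prevalence-weighted sum of the per-class Brier scores, Eq. (6)."""
    n = len(targets)
    total = 0.0
    for j, label in enumerate(classes):
        prevalence = sum(t == label for t in targets) / n
        total += brier(prob_rows, targets, classes, j) * prevalence
    return total
```

A model that always assigns probability 1 to the observed condition scores 0 on both measures.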

Calibration can also be visualized with the regression error characteristic (REC) curve [24], which is used to compare regression models. It plots the error tolerance (x-axis), given in terms of the absolute deviation, versus the percentage of points predicted within the tolerance (y-axis). Similarly to the ROC concept, the ideal regressor should present a REC area of 1.0.
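The REC curve can be sketched as the share of predictions whose absolute deviation stays within each tolerance. The Python functions below are illustrative only: the names, the tolerance grid on [0, 1] and the trapezoidal area are assumptions of this sketch.

```python
def rec_curve(actual, predicted, tolerances):
    """Fraction of points with absolute error within each tolerance."""
    errors = [abs(a - p) for a, p in zip(actual, predicted)]
    n = len(errors)
    return [sum(e <= t for e in errors) / n for t in tolerances]


def rec_area(actual, predicted, steps=100):
    """Area under the REC curve over tolerances in [0, 1] (trapezoid rule)."""
    grid = [i / steps for i in range(steps + 1)]
    coverage = rec_curve(actual, predicted, grid)
    return sum(
        (grid[i + 1] - grid[i]) * (coverage[i] + coverage[i + 1]) / 2
        for i in range(steps)
    )
```

For a binary condition, `actual` holds the 0 or 1 outcomes and `predicted` the estimated probabilities, so a tolerance of 0.1 counts the examples whose probability lies within 0.1 of the observed outcome.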

The K-fold cross-validation [25] is a commonly used method to estimate generalization performances. In each run, the data is divided into K partitions of equal size. Sequentially, one different subset is tested and the remaining data is used for fitting the model. Under this scheme, all data is used for testing, although K different models are fitted. This work will use 20 runs of a 5-fold scheme, in a total of 20 × 5 = 100 experiments for each tested configuration. Statistical significance for the AUC and Brier values will be given by using a Mann-Whitney non-parametric test at the 95% confidence level. According to [26], this test is equivalent to the test proposed by DeLong et al. [27] to compare ROC areas.
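The resampling scheme can be sketched as a generator of train/test index lists. This is a minimal illustration with hypothetical names, and the seed is arbitrary. When the number of records is not a multiple of K, the fold sizes differ by at most one.

```python
import random


def repeated_kfold(n_records, k=5, runs=20, seed=0):
    """Yield (run, fold, train_indices, test_indices) for repeated K-fold."""
    rng = random.Random(seed)
    for run in range(runs):
        indices = list(range(n_records))
        rng.shuffle(indices)                          # new random partition per run
        folds = [indices[f::k] for f in range(k)]     # k nearly equal parts
        for f in range(k):
            test = folds[f]
            train = [i for g in range(k) if g != f for i in folds[g]]
            yield run, f, train, test
```

With the default arguments the generator produces 20 × 5 = 100 splits, and within each run every record is tested exactly once.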

2.5. Computational environment

All experiments were conducted using the RMiner [28], an open source library for the R statistical environment [29] that facilitates the use of DM techniques in classification and regression tasks. In particular, the RMiner uses the multinom and nnet functions of the nnet package to implement the MLR and ANN models [15]. Also, the efficient Algorithms 1 and 2 presented in [21] are used to compute the ROC curves and AUC values.

In this work, we will adopt the default suggestions of the nnet developers [15] to adjust the DM techniques. The nominal inputs were encoded into 1-of-($\#C - 1$) binary variables. As an example, admtype from Table 2 is transformed with: $1 \to (0\ 0)$; $2 \to (1\ 0)$; and $3 \to (0\ 1)$. For the ANNs, the continuous inputs were scaled to zero mean and unit standard deviation. Both the MLR and ANN models were trained using 100 iterations (known as epochs) of the efficient BFGS algorithm [30], from the family of quasi-Newton methods. Within a given epoch, the whole training dataset is presented to the ANN, in order to compute an error function that is used to adjust the neural weights. For multi-class data, the algorithm is set to maximize the likelihood, which is equivalent to minimizing the cost error function ($\varphi$) given by

$$\varphi = \sum_{i=1}^{N} \sum_{j=1}^{C} \left[ p_{ij} \ln \frac{p_{ij}}{\hat{p}_{ij}} + (1 - p_{ij}) \ln \frac{1 - p_{ij}}{1 - \hat{p}_{ij}} \right] \qquad (7)$$

In contrast with the MLR, the adopted ANN model requires the definition of one hyperparameter, the number of hidden nodes ($H$). To set this value, the RMiner provides a grid search facility, where $H \in \{H_L, H_L + g, H_L + 2g, \ldots, H_U\}$, $H_L$ and $H_U$ denote the lower and upper bounds, and $g$ is a constant value. To prevent the overfitting phenomenon and also to reduce the search time, we will adopt a small range (i.e. $H \in \{2, 4, 6, 8, 10\}$). Also, and due to computational limitations, $H$ will be fixed to the median of the grid range during the feature selection phase [19]. Then, the grid search is applied, using a random (2/3)/(1/3) data split for the training and validation sets. The best $H$ will be the one that provides the lowest validation error. After selecting the best attributes and $H$ value (in case of ANN), the final model is retrained with all available data.
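The input preprocessing described earlier in this section (the 1-of-(#C - 1) coding of nominal inputs and the scaling of continuous inputs) can be illustrated with a short Python sketch. The function names are hypothetical, and the use of the sample standard deviation is an assumption of this sketch.

```python
import statistics


def encode_nominal(value, levels):
    """1-of-(C-1) coding: the first level maps to all zeros."""
    index = levels.index(value)
    return [1 if index == k else 0 for k in range(1, len(levels))]


def standardize(values):
    """Scale a continuous input to zero mean and unit standard deviation.

    Assumption: the sample standard deviation (n - 1 denominator) is used.
    """
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]
```

For admtype, `encode_nominal(2, [1, 2, 3])` returns `[1, 0]`, in line with the mapping given above.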

3. Results

3.1. Predictive performance

A total of 6 (organs) × 2 (methods) = 12 different configurations were tested. The median number of the selected hidden nodes was 8 for all organs except the neurological, where the median was 10. For the tested configurations, the feature selection algorithm only discarded an average of two attributes. In general, the few removed variables are related to the adverse events. Nevertheless, all four biometrics are used in all models (e.g. NCRUR may be deleted but TCRUR is not). These results confirm the covariance analysis performed in Section 2.3.

The discrimination results evaluated over the test sets are summarized in Table 4. The best results are obtained by the ANNs, which outperform the MLR with an average (last row) margin of 2.2, 1.8 and 2.8 percentage points for the normal, dysfunction and failure status, respectively. The AUC differences (ANN vs. MLR) are significant (p-value < 0.05) in all cases. When analyzing the organ condition discrimination, the dysfunction condition is the most difficult to predict. In effect, none of the presented models has acceptable values (AUC higher than 70%). The normal status shows a higher discrimination, with one MLR and three ANN acceptable models. Finally, the failure condition presents the most accurate predictions. The MLR models are acceptable for the coagulation, hepatic, neurological and renal systems, while the ANNs obtain good performances for all organs except respiratory. In particular, the hepatic, neurological and renal AUCs are above 75%. When weighted by the condition prevalence, the global AUC reveals three acceptable models (ANN for the cardiovascular, neurological and renal systems). All ROC curves are plotted in Fig. 4. In the graphs, the ANN curves are above the MLR ones, confirming the superiority of the discrimination power of the ANNs.


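Table 4 The discrimination power (mean AUC value of the 20 runs, in %) for each organ, condition and method (values of AUC > 70% are in bold)

| Organ | Normal MLR | Normal ANN | Dysfunction MLR | Dysfunction ANN | Failure MLR | Failure ANN | Global MLR | Global ANN |
|---|---|---|---|---|---|---|---|---|
| Respiratory | 67.2 | 69.5 | 59.2 | 61.0 | 65.6 | 68.9 | 63.6 | 66.0 |
| Coagulation | 63.6 | 65.5 | 60.1 | 62.0 | **72.6** | **73.9** | 63.3 | 65.1 |
| Hepatic | 64.7 | 66.7 | 62.5 | 64.2 | **72.6** | **76.0** | 64.6 | 66.6 |
| Cardiovascular | 67.9 | **71.2** | 63.8 | 65.6 | 67.3 | **71.0** | 67.1 | **70.2** |
| Neurological | **70.0** | **72.1** | 58.8 | 61.2 | **74.7** | **76.7** | 68.8 | **70.9** |
| Renal | 69.4 | **70.7** | 66.0 | 66.8 | **73.5** | **76.1** | 69.1 | **70.4** |
| Average | 67.1 | 69.3 | 61.7 | 63.5 | **71.0** | **73.8** | 66.1 | 68.2 |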

The calibration results are presented in Table 5.The global Brier scores are particularly good for bothDM methods on three organs (coagulation, hepaticand renal). Nonetheless, the ANN outperforms thelogistic model in all cases except the hepatic dys-function and coagulation failure conditions (thedifferences are significant, with p-value < 0:05).Regarding the organ status, the best calibration isobtained for the failure state (average Brier scorefor all organs of 0.093), followed by the dysfunction(0.156) and normal (0.181) conditions. These resultsare complemented by a REC analysis (Fig. 5). Highquality curves (REC close to 1) were achieved for theprediction of the coagulation, hepatic and renalfailures, precisely where lowest Brier scores wereobtained. Although MLR and ANN curves are close,latter ones present a higher area. Also, more patientconditions are correctly predicted for low admittederrors. For instance, if a 0.1 tolerance is accepted(e.g. a 0.9 output is interpreted as positive), then27.7% of the coagulation failure (positive or nega-tive) examples are correctly estimated for the ANNmethod. This value decreases to 18% for the MLR.

3.2. Descriptive knowledge

This section provides explanatory knowledge that can be useful for the intensive care domain. The goal is not to infer the predictive capabilities of each model, as measured in Section 3.1, but to give a simple description that summarizes the DM models. Thus, the whole dataset will be used in the descriptive experiments.

Table 5 The calibration values (mean Brier score of the 20 runs) for each organ, condition and method (values in bold denote statistical significance when compared with MLR)

| Organ | Normal MLR | Normal ANN | Dysfunction MLR | Dysfunction ANN | Failure MLR | Failure ANN | Global MLR | Global ANN |
|---|---|---|---|---|---|---|---|---|
| Respiratory | 0.213 | 0.204 | 0.233 | 0.230 | 0.171 | 0.166 | 0.211 | 0.205 |
| Coagulation | 0.173 | 0.171 | 0.155 | 0.154 | 0.038 | 0.038 | 0.134 | 0.133 |
| Hepatic | 0.132 | 0.130 | 0.116 | 0.116 | 0.026 | 0.025 | 0.101 | 0.100 |
| Cardiovascular | 0.205 | 0.197 | 0.132 | 0.130 | 0.138 | 0.133 | 0.160 | 0.155 |
| Neurological | 0.208 | 0.202 | 0.153 | 0.151 | 0.136 | 0.132 | 0.169 | 0.165 |
| Renal | 0.182 | 0.179 | 0.155 | 0.155 | 0.065 | 0.063 | 0.144 | 0.142 |
| Average | 0.185 | 0.181 | 0.157 | 0.156 | 0.096 | 0.093 | 0.153 | 0.150 |

Figure 4 The receiver operating characteristic curves for each organ and condition (artificial neural network: solid line; multinomial logistic regression: dashed line; random: gray line).

Tables 6 and 7 present the relevance (in %) of each input variable for the two DM methods. For both MLR and ANN, the four biometrics are important for all organs, although the relative impact may differ. For the logistic model, the overall influence of the adverse events ranges from 52.5% (cardiovascular) to 69.8% (hepatic), while for the ANN it ranges from 38.6% (coagulation) to 50.3% (respiratory). Regarding the MLR model, the most important biometrics are on average the oxygen saturation and heart rate. The oxygen alarms are also the most relevant for the ANNs, followed by the blood pressure.

For demonstrative purposes, more detail will be given to the renal models, which obtained satisfactory discrimination and calibration values. Table 8 shows the $\beta_{i,j}$ MLR coefficients (the model was fitted with all available data). The R environment automatically selected the dysfunction class as the baseline group, thus $\hat{p}_{\text{dysfunction}} = 1 - (\hat{p}_{\text{failure}} + \hat{p}_{\text{normal}})$ and no coefficients are used by this condition. These coefficients should not be read separately, since the organ function condition results from the impact of complex interactions between all physiological metrics. For instance, regarding the urine output, while the values suggest that renal failure is negatively influenced by the number of events (NUR), it is also positively influenced by long lasting critical events (TCRUR).
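How a baseline class enters the probability computation can be illustrated with a small Python sketch (the authors worked in R; Python is used here only for illustration). It assumes the usual multinomial logit form, in which the linear predictor of the baseline class (dysfunction) is fixed at zero. Only the intercepts and the NUR and TCRUR terms of Table 8 are included, so the outputs are not the predictions of the fitted renal model; the function name `mlr_probabilities` and the input values are hypothetical.

```python
import math

# Partial linear predictors for the renal MLR model: intercept plus the
# NUR and TCRUR terms only. All other Table 8 terms are omitted here,
# so the resulting probabilities are purely illustrative.
COEFFICIENTS = {
    "failure": {"intercept": -0.32, "NUR": -0.03, "TCRUR": 0.12},
    "normal": {"intercept": 3.56, "NUR": 0.06, "TCRUR": -0.07},
}
BASELINE = "dysfunction"  # baseline class: linear predictor fixed at 0


def mlr_probabilities(inputs):
    """Return the class probabilities of a multinomial logit model."""
    etas = {BASELINE: 0.0}
    for condition, beta in COEFFICIENTS.items():
        etas[condition] = beta["intercept"] + sum(
            weight * inputs[name]
            for name, weight in beta.items()
            if name != "intercept"
        )
    total = sum(math.exp(eta) for eta in etas.values())
    return {condition: math.exp(eta) / total for condition, eta in etas.items()}


# Hypothetical day with few urine output events and a long critical event time.
probs = mlr_probabilities({"NUR": 2, "TCRUR": 30})
print(probs)
# The baseline probability is whatever remains after the other two classes.
print(1 - (probs["failure"] + probs["normal"]), probs["dysfunction"])
```

With these partial coefficients, increasing TCRUR raises the failure probability while increasing NUR lowers it, which mirrors the opposite signs discussed in the text.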

Figure 5 The regression error curves for each organ and condition (artificial neural network: solid line; multinomial logistic regression: dashed line).

Table 6 The relative importance of the input variables for the multinomial logistic regression ($R_a$ values, in %)

| Organ | admtype | admfrom | SAPS II | age | BP^a | HR^a | SpO2^a | UR^a |
|---|---|---|---|---|---|---|---|---|
| Average | 9.1 | 10.1 | 15.7 | 6.5 | 13.0 | 18.5 | 17.4 | 9.7 |

^a All attributes related to the variable were summed (number of events, critical events and the time).

Table 7 The relative importance of the input variables for the artificial neural networks ($R_a$ values, in %)

| Organ | admtype | admfrom | SAPS II | age | BP^a | HR^a | SpO2^a | UR^a |
|---|---|---|---|---|---|---|---|---|
| Average | 19.7 | 11.3 | 16.4 | 9.7 | 11.4 | 5.9 | 16.0 | 9.6 |

^a All attributes related to the variable were summed (number of events, critical events and the time).

In this example, the feature selection algorithm discarded one variable (NCRUR) for the MLR, while the final neural model did not include three attributes (NCRSpO2, TCRBP, TCRHR). The latter contains 19 input, 8 hidden and 3 output neurons, with a total of 187 weights. Instead of presenting all these weights, and to simplify the analysis, a decision tree will be used to describe the ANN behavior [9]. The tree was fit using the default values of the rpart R library [15] and a training set composed of the ANN inputs and outputs. The latter were preprocessed into the condition related to the highest ANN probability. The obtained model (Fig. 6) managed to mimic the ANN behavior with a low classification error (3.4%) and it includes the two most relevant biometrics from Table 7 (UR and HR). As an example, the next two rules for renal failure prediction can be extracted from the tree:

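$$
\begin{array}{l}
\text{IF } \mathrm{TCRUR} \geq 13.8 \text{ AND } \mathrm{NUR} \geq 15 \text{ THEN failure} \\
\text{IF } \mathrm{TCRUR} < 13.8 \text{ AND } \mathrm{admfrom} \notin \{5, 6\} \text{ AND } \mathrm{NCRHR} = 0 \text{ AND } \mathrm{SAPSII} \geq 93 \text{ THEN failure}
\end{array}
\qquad (8)
$$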
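The network size and the two rules above can be checked with a short Python sketch (illustrative only; the study used R). Two assumptions are made. First, the network is taken to be fully connected with one bias per hidden and output neuron, which is consistent with the reported total of 187 weights. Second, the comparison operators of rule (8) that are not a plain "<" or "=" are read as "greater than or equal", the usual complement of the "<" splits of a decision tree. The function names are hypothetical.

```python
def count_mlp_weights(n_inputs, n_hidden, n_outputs):
    """Weights of a fully connected one-hidden-layer network, biases included."""
    return (n_inputs + 1) * n_hidden + (n_hidden + 1) * n_outputs


def renal_failure_by_rules(tcr_ur, n_ur, admfrom, ncr_hr, saps2):
    """Return True when one of the two example rules predicts renal failure.

    False only means that neither of these two rules fires; other branches
    of the tree are not encoded here.
    """
    # Rule 1: long critical urine output time and many urine output events.
    rule_1 = tcr_ur >= 13.8 and n_ur >= 15
    # Rule 2: shorter critical time, admission origin outside {5, 6},
    # no critical heart rate events and a very high SAPS II score.
    rule_2 = (
        tcr_ur < 13.8
        and admfrom not in (5, 6)
        and ncr_hr == 0
        and saps2 >= 93
    )
    return rule_1 or rule_2


print(count_mlp_weights(19, 8, 3))  # 187
print(renal_failure_by_rules(tcr_ur=20.0, n_ur=16, admfrom=1, ncr_hr=2, saps2=40))
```

Encoding the rules as a plain function keeps them easy to audit, which is the main appeal of extracting a tree from the network in the first place.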

4. Discussion

Table 8 The multinomial logistic coefficients for the renal system

| Condition | $\beta_{i,j}$ coefficients |
|---|---|
| Failure | $-0.32 - 0.50\,\mathrm{admtype}_2 + 0.10\,\mathrm{admtype}_3 + 0.14\,\mathrm{admfrom}_2 + 0.11\,\mathrm{admfrom}_3 + 0.13\,\mathrm{admfrom}_4 + 0.51\,\mathrm{admfrom}_5 + 0.03\,\mathrm{admfrom}_6 - 0.04\,\mathrm{admfrom}_7 + 0.01\,\mathrm{SAPSII} - 0.02\,\mathrm{age} - 0.05\,\mathrm{NBP} - 0.05\,\mathrm{NCRBP} - 0.01\,\mathrm{NHR} - 0.17\,\mathrm{NCRHR} - 0.03\,\mathrm{NSpO2} + 0.09\,\mathrm{NCRSpO2} - 0.03\,\mathrm{NUR} + 0.03\,\mathrm{TCRBP} - 0.03\,\mathrm{TCRHR} - 0.06\,\mathrm{TCRSpO2} + 0.12\,\mathrm{TCRUR}$ |
| Normal | $3.56 - 0.20\,\mathrm{admtype}_2 - 0.05\,\mathrm{admtype}_3 - 0.11\,\mathrm{admfrom}_2 + 0.15\,\mathrm{admfrom}_3 + 0.15\,\mathrm{admfrom}_4 - 0.05\,\mathrm{admfrom}_5 + 0.18\,\mathrm{admfrom}_6 + 0.55\,\mathrm{admfrom}_7 - 0.03\,\mathrm{SAPSII} - 0.02\,\mathrm{age} - 0.04\,\mathrm{NBP} - 0.13\,\mathrm{NCRBP} - 0.01\,\mathrm{NHR} - 0.12\,\mathrm{NCRHR} + 0.04\,\mathrm{NSpO2} - 0.15\,\mathrm{NCRSpO2} + 0.06\,\mathrm{NUR} + 0.01\,\mathrm{TCRBP} - 0.02\,\mathrm{TCRHR} - 0.01\,\mathrm{TCRSpO2} - 0.07\,\mathrm{TCRUR}$ |

Binary variables are denoted by $V_i$, denoting the $i$th categorical value of variable $V$.

Figure 6 The extracted rules given in terms of a decision tree for the renal system.

The assessment of the degree of organ failure is crucial in ICUs, since one of the main ICU tasks is to avoid or reverse the organ failure process through the early identification of patients at risk and the adoption of the respective therapy. Indeed, several expert-driven scores have been developed to quantify organ disorder, such as the SOFA, which is widely used in Europe.

This study proposes a novel data-driven bedside monitoring approach, where the major goal is to study the impact of adverse events on the daily prediction of the organ condition risk of six systems (i.e. respiratory, coagulation, hepatic, cardiovascular, neurological and renal). The assumption behind our approach is to use only data collected in the last 24 h of the ICU length of stay. A large database was considered using bedside monitoring data. The input variables included the case mix (i.e. admission type/origin, SAPSII index and the age) and adverse events. The latter were measured as the out of range values of four commonly monitored physiological variables (e.g. heart rate).

The second goal was to compare two data mining techniques, namely MLR and ANNs. The experiments were conducted in the R statistical tool [29] using discrimination and calibration criteria. As argued in [31], it is difficult to compare DM methods in a fair way, with data analysts tending to favor models that they know better. To reduce the bias towards a given model, we adopted the default suggestions of the nnet package [15] for the R environment. The only exception is the number of hidden neurons, which was set using a simple grid search procedure. The default settings are more likely to be used by common (non-expert) users, thus this seems a reasonable assumption for the comparison.

The results show that the ANNs are the best learning models, outperforming the MLR for both criteria. The average (over all organs) ANN ROC area is 64, 69 and 74% for the dysfunction, normal and failure conditions, while the respective Brier scores were 0.16, 0.18 and 0.09 (Table 5). In particular, good ANN discrimination results (ROC area higher than 75%) were achieved for three systems (hepatic, neurological and renal). Also, highly calibrated models (Brier score below 0.1) were attained for the coagulation, hepatic and renal organs. These results can be explained by the fact that the SOFA score is more reliable and robust when classifying the clinical condition of these organs. For instance, the renal function condition is classified using well defined and objective intervals, unlike the respiratory condition, which can be influenced by an inadequate FIO2 setting.

The risk estimates for the normal and dysfunction conditions were less accurate. This may be explained by several factors. Normality is at one of the extremes, with the dysfunction being an in-between state. Hence, in principle the normal condition should be easier to predict. However, as shown in Fig. 2, there are several outliers (e.g. rare or extreme events) in the data. Since ICU patients are critically ill, the normal function label describes a clinical condition where the severity is not enough to define a failure or dysfunction but does not exclude a disease process. Furthermore, organ failure development is a continuous process where the borders for each stage are necessarily fuzzy and not well known.
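The per-condition ROC areas discussed above can be read as one binary problem per condition. A minimal Python sketch of such a computation follows (illustrative only; the study used R). It assumes a one-vs-rest setup, where the probability of one condition is scored against the other two, and it uses the rank-based formulation of the area under the ROC curve: the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half. The function names and data are hypothetical.

```python
def roc_auc(scores, positives):
    """Area under the ROC curve via pairwise comparison (ties count 0.5)."""
    pos = [s for s, is_pos in zip(scores, positives) if is_pos]
    neg = [s for s, is_pos in zip(scores, positives) if not is_pos]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def one_vs_rest_auc(prob_rows, labels, condition):
    """AUC for one condition, treating the other conditions as negatives."""
    scores = [row[condition] for row in prob_rows]
    positives = [label == condition for label in labels]
    return roc_auc(scores, positives)


# Made-up predicted probabilities for four daily records.
rows = [
    {"normal": 0.7, "dysfunction": 0.2, "failure": 0.1},
    {"normal": 0.2, "dysfunction": 0.3, "failure": 0.5},
    {"normal": 0.3, "dysfunction": 0.4, "failure": 0.3},
    {"normal": 0.5, "dysfunction": 0.1, "failure": 0.4},
]
labels = ["normal", "failure", "dysfunction", "failure"]

for condition in ("normal", "dysfunction", "failure"):
    print(condition, one_vs_rest_auc(rows, labels, condition))
```

The pairwise loop is quadratic in the number of records, which is fine for a sketch; a sort-based rank computation would be preferable for a dataset of the size used in this study.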

Regarding the interpretability issue, the MLR is easier to understand than the neural model. Yet, under the adopted experimental settings, the latter presented the best results and it is possible to extract knowledge from trained ANNs, given in terms of input variable importance or human friendly rules (Section 3.2).

The major outcome of this work is that we show that adverse events, taken from bedside monitored data, have a relevant impact on the degree of organ failure. Although this finding was expected, our main contribution is to quantify such impact (i.e. discrimination, calibration and input relevance), allowing knowledge to be extracted from easily obtainable data. Rather than an empirical subjective analysis (e.g. performed by the individual physician), the obtained results strengthen the pursuit of a systematic intelligent data-driven approach to monitor ICU patients.

4.1. Related work

In the past, the majority of studies using data mining methods in ICU environments focused on mortality assessment [10], while the application of DM to organ failure is rather scarce. Matis et al. [32] used 15 variables (e.g. age, bilirubin, creatinine) to train an ANN in order to predict liver failure after transplantation. The obtained accuracy ranged from 70% (using data prior to the operation) to 88% (5 days after the transplantation). An ANN was also successfully used to assess the cardiac failure of 58 patients, using 20 variables (e.g. heart rate, blood pressure) [33]. In previous work [13], ANNs outperformed decision trees for organ failure prediction, obtaining an overall classification accuracy of 70%. More recently, a kernel logistic regression was used by Pearce et al. [34] in order to predict the severity of acute pancreatitis. The model included eight variables (e.g. age, respiratory rate, creatinine) and outperformed a daily updated APACHE II prognostic model.

This work is quite distinct from the previous studies, since we use adverse events based on daily bedside monitored data. Moreover, we model the degree of organ failure of six organ systems. This study largely extends our previous work [13] by predicting three conditions (i.e. normal, dysfunction and failure), testing also a logistic model in the experiments and evaluating the results under calibration and discrimination analysis.

Regarding the use of daily SOFA scores by artificial intelligence techniques, most of the literature is also focused on mortality prediction. For instance, Kayaalp et al. [35] adapted Bayesian networks under a time series approach, where 23 variables (e.g. urine output, bilirubin, SOFA scores for five organ systems) were used to predict ICU mortality. In previous work [9], we tested the use of ANN and adverse alarms of four biometrics, outperforming the SAPSII logistic model for mortality assessment. Toma et al. [23] followed a distinct dynamic approach, where organ failure scores were used to discover patterns of sequences (called episodes). Several logistic regression models, built for each of the first 5 days, were tested for mortality prognosis and the best results were attained by the models that included the episodes.

In contrast with the above studies, this work models the degree of organ impairment. Since multiple organ failure is the main cause of ICU mortality, there is a need to identify the degree of ICU patient illness in a continuous form, in order to apply a timely intervention. In fact, this was the rationale behind the SOFA score development [7]. Our study follows a similar and complementary approach, adding a risk estimate (i.e. probability) of the organ condition to bedside alarms. The proposed work could be applied using precise, low cost and real-time variables, by using a real-time computerized data acquisition system from bedside monitors and applying quality procedures (e.g. data validated by the ICU staff) [36]. Moreover, such a system could give more frequently updated predictions (e.g. every 6 or 12 h).

4.2. Future work

To our knowledge, this is the first attempt to relate adverse events to organ failure, and further exploratory research is needed. For instance, outlier detection techniques [37] could be used to discard rare or extreme cases. This is expected to improve the results, especially for the normal and dysfunction conditions. Moreover, while the adverse events have an impact on organ failure (Section 3.2), there are complex dependencies between the biometrics. Therefore, a temporal analysis, such as presented in [23,35], where the evolution of each organ during the patient length of stay is modeled, is a very promising direction. Some of the limitations of this work, namely the manual collection of the data and the lack of temporal sequence analysis, could be addressed by testing our approach in a real environment, using real-time data. We intend to explore all these possibilities in the INTCare pilot project [36], where a friendly decision support system is currently being developed at the ICU of the Hospital Geral de Santo Antonio, Oporto, Portugal.

5. Conclusion

A data-driven analysis was performed on a large ICU database, with an emphasis on the use of daily adverse events, taken from four commonly monitored biometrics. Two data mining methods, artificial neural networks and multinomial logistic regression, were tested to predict the degree of failure regarding six organ systems. The former method provided better discrimination and calibration results, with average ROC curve areas of 74, 64 and 69% and Brier scores of 0.09, 0.16 and 0.18 (Table 5) for the failure, dysfunction and normal conditions, respectively. The obtained results show that adverse events are important intermediate outcomes, reflecting the patient condition and the ICU way of working. Hence, this work contributes to an improvement of the process of critically ill patient care, by means of generating more intelligent bedside intensive care alarms.

Acknowledgments

The authors wish to thank FRICE and the BIOMED project BMH4-CT96-0817 for the provision of part of the EURICUS II data, which is integrated in a PhD program, developed at Instituto de Ciencias Biomedicas Abel Salazar from University of Porto and the Departments of Computer Science/Information Systems from the University of Minho. The work of P. Cortez, M.F. Santos and J. Neves is supported by the FCT project PTDC/EIA/72819/2006. We also would like to thank the anonymous reviewers for their helpful comments.

References

[1] Rosenberg A. Recent innovations in intensive care unit risk-prediction models. Curr Opin Crit Care 2002;8:321–30.

[2] Knaus W, Wagner D, Draper E, Zimmerman J, Bergner M, Bastos P, et al. The APACHE III prognostic system. Risk prediction of hospital mortality for critically ill hospitalized adults. Chest 1991;100:1619–36.

[3] Le Gall J, Lemeshow S, Saulnier F. A new simplified acute physiology score (SAPS II) based on a European/North American multicenter study. JAMA 1993;270:2957–63.

[4] Lemeshow S, Teres D, Klar J, Avrunin J, Gehlbach S, Rapoport J. Mortality Probability Models (MPM II) based on an international cohort of intensive care unit patients. JAMA 1993;270:2478–86.

[5] Le Gall J. The use of severity scores in the intensive care unit. Intens Care Med 2005;31:1618–23.

[6] Vincent J, Moreno R, Takala J, Willatts S, Mendonca A, Bruining H, et al. The SOFA (Sepsis-related Organ Failure Assessment) score to describe organ dysfunction/failure. Intens Care Med 1996;22:707–10.

[7] Moreno R, Vincent J, Matos R, Mendonca A, Cantraine F, Thijs L, et al. The use of maximum SOFA score to quantify organ dysfunction/failure in intensive care. Results of a prospective, multicentre study. Intens Care Med 1999;25:686–96.

[8] Amaral A, Andrade F, Moreno R, Artigas A, Cantraine F, Vincent J. Use of the sequential organ failure assessment score as a severity score. Intens Care Med 2005;31:243–9.

[9] Silva A, Cortez P, Santos MF, Gomes L, Neves J. Mortality assessment in intensive care units via adverse events using artificial neural networks. Artif Intell Med 2006;36:223–34.

[10] Ohno-Machado L, Resnic F, Matheny M. Prognosis in critical care. Annu Rev Biomed Eng 2006;8:567–99.

[11] Hand D, Mannila H, Smyth P. Principles of data mining. Cambridge, MA: MIT Press; 2001.

[12] Haykin S. Neural networks: a comprehensive foundation, 2nd ed. New Jersey: Prentice-Hall; 1999.

[13] Silva A, Cortez P, Santos M, Gomes L, Neves J. Multiple organ failure diagnosis using adverse events and neural networks. In: Seruca I, Cordeiro J, Hammoudi S, Filipe J, editors. Enterprise information systems VI. The Netherlands: Springer; 2006.

[14] Fidler V, Nap R, Miranda R. The effect of a managerial-based intervention on the occurrence of out-of-range-measurements and mortality in intensive care units. J Crit Care 2004;19(3):130–4.

[15] Venables W, Ripley B. Modern applied statistics with S, 4th ed. Springer; 2003.

[16] Setiono R. Techniques for extracting classification and regression rules from artificial neural networks. In: Fogel D, Robinson C, editors. Computational intelligence: the experts speak. Piscataway, NJ, USA: IEEE; 2003. p. 99–114.

[17] Breiman L, Friedman J, Olshen R, Stone C. Classification and regression trees. Monterey, CA: Wadsworth; 1984.

[18] Kewley R, Embrechts M, Breneman C. Data strip mining for the virtual design of pharmaceuticals with neural networks. IEEE Trans Neural Netw 2000;11(3):668–79.

[19] Cortez P, Portelinha M, Rodrigues S, Cadavez V, Teixeira A. Lamb meat quality assessment by support vector machines. Neural Process Lett 2006;24(1):41–51.

[20] Guyon I, Elisseeff A. An introduction to variable and feature selection. J Mach Learn Res 2003;3:1157–82.

[21] Fawcett T. An introduction to ROC analysis. Pattern Recogn Lett 2006;27:861–74.

[22] Provost F, Domingos P. Tree induction for probability-based ranking. Mach Learn 2003;52(3):199–215.

[23] Toma T, Abu-Hanna A, Bosman R. Discovery and inclusion of SOFA score episodes in mortality prediction. J Biomed Inform 2007;40(6):649–60.

[24] Bi J, Bennett K. Regression error characteristic curves. In: Fawcett T, Mishra N, editors. Proceedings of 20th international conference on machine learning (ICML). Washington DC, USA: AAAI Press; 2003.

[25] Kohavi R. A study of cross-validation and bootstrap for accuracy estimation and model selection. In: Proceedings of the international joint conference on artificial intelligence (IJCAI), vol. 2. Montreal, Quebec, Canada: Morgan Kaufmann; August 1995.

[26] Molodianovitch K, Faraggi D, Reiser B. Comparing the areas under two correlated ROC curves: parametric and non-parametric approaches. Biometrical J 2006;48(5):745–57.

[27] DeLong E, DeLong D, Clarke-Pearson D. Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach. Biometrics 1988;44(3):837–45.

[28] Cortez P. RMiner: data mining with neural networks and support vector machines using R. In: Rajesh R, editor. Introduction to advanced scientific softwares and toolboxes. IAENG Publications, in press.

[29] R Development Core Team. R: a language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing. ISBN 3-900051-00-3, http://www.R-project.org [accessed 26 March 2008].

[30] Moller M. A scaled conjugate gradient algorithm for fast supervised learning. Neural Netw 1993;6(4):525–33.

[31] Hand D. Classifier technology and the illusion of progress. Stat Sci 2006;21(1):1–15.

[32] Matis S, Doyle H, Marino I, Mural R, Uberbacher E. Use of neural networks for prediction of graft failure following liver transplantation. In: Proceedings of the eighth IEEE symposium on computer-based medical systems. Washington, DC, USA: IEEE; 1995. p. 133–40.

[33] Gils M, Jansen H, Nieminen K, Summers R, Weller P. Using artificial neural networks for classifying ICU patient states. Eng Med Biol Mag 1997;16:41–7.

[34] Pearce C, Gunn S, Ahmed A, Johnson C. Machine learning can improve prediction of severity in acute pancreatitis using admission values of APACHE II score and C-reactive protein. Pancreatology 2006;6:123–31.

[35] Kayaalp M, Cooper G, Clermont G. Predicting ICU mortality: a comparison of stationary and nonstationary temporal models. In: Proceedings of AMIA symposium. Los Angeles, CA, USA: AMIA; 2000. p. 418–22.

[36] Gago P, Santos MF, Silva A, Cortez P, Neves J, Gomes L. INTCare: a knowledge discovery based intelligent decision support system for intensive care medicine. J Decision Syst 2005;14(3):241–59.

[37] Hodge V, Austin J. A survey of outlier detection methodologies. Artif Intell Rev 2004;22(2):85–126.
