Application of irregular and unbalanced data to predict diabetic nephropathy using visualization and feature selection methods

Baek Hwan Cho a, Hwanjo Yu b, Kwang-Won Kim c, Tae Hyun Kim c, In Young Kim a,*, Sun I. Kim a

Artificial Intelligence in Medicine (2008) 42, 37—53

http://www.intl.elsevierhealth.com/journals/aiim

a Department of Biomedical Engineering, Hanyang University, Seoul, Republic of Korea
b Department of Computer Science, University of Iowa, Iowa City, IA, USA
c Division of Endocrinology and Metabolism, Department of Medicine, Samsung Medical Center, Sungkyunkwan University School of Medicine, Seoul, Republic of Korea

Received 6 April 2007; received in revised form 21 September 2007; accepted 21 September 2007

KEYWORDS: Decision support systems; Diabetic nephropathy; Support vector machines; Visualization; Risk factor analysis; Feature selection

Summary

Objective: Diabetic nephropathy is damage to the kidney caused by diabetes mellitus. It is a common complication and a leading cause of death in people with diabetes. However, the decline in kidney function varies considerably between patients, and the determinants of diabetic nephropathy have not been clearly identified. Therefore, it is very difficult to predict the onset of diabetic nephropathy accurately with simple statistical approaches such as the t-test or χ²-test. To accurately predict the onset of diabetic nephropathy, we applied various machine learning techniques to an irregular and unbalanced diabetes dataset, such as support vector machine (SVM) classification and feature selection methods. Visualization of the risk factors was another important objective, to give physicians intuitive information on each patient's clinical pattern.

Methods and materials: We collected medical data from 292 patients with diabetes and performed preprocessing to extract 184 features from the irregular data. To predict the onset of diabetic nephropathy, we compared several classification methods such as logistic regression, SVM, and SVM with a cost-sensitive learning method. We also applied several feature selection methods to remove redundant features and improve the classification performance. For risk factor analysis with SVM classifiers, we developed a new visualization system which uses a nomogram approach.

Results: Linear SVM classifiers combined with wrapper or embedded feature selection methods showed the best results. Among the 184 features, the classifiers selected the same 39 features and gave 0.969 of the area under the curve by receiver operating characteristics analysis. The visualization tool was able to present the effect of each feature on the decision via graphical output.

Conclusions: Our proposed method can predict the onset of diabetic nephropathy about 2—3 months before the actual diagnosis with high prediction performance from an irregular and unbalanced dataset, which statistical methods such as the t-test and logistic regression could not achieve. Additionally, the visualization system provides physicians with intuitive information for risk factor analysis. Therefore, physicians can benefit from the automatic early warning for each patient and can visualize the risk factors, which facilitates planning of effective and proper treatment strategies.

© 2007 Elsevier B.V. All rights reserved.

* Corresponding author at: Sungdong P.O. Box 55, Seoul 133-605, Republic of Korea. Tel.: +82 2 2291 1713; fax: +82 2 2296 5943. E-mail address: [email protected] (I.Y. Kim).

0933-3657/$ — see front matter © 2007 Elsevier B.V. All rights reserved. doi:10.1016/j.artmed.2007.09.005


1. Introduction

Diabetes mellitus is a metabolic disorder characterized by chronic hyperglycemia (a high blood sugar level) resulting from defects in insulin secretion, insulin action, or both [1]. A medical insurance cohort study that included 1.2 million subjects indicated that diabetes mellitus is the leading cause of burden of disease in Korea, and 8.4% of the population suffers from the disease [2]. Diabetes can cause devastating complications including cardiovascular diseases, kidney failure, leg and foot amputations, and blindness, which often result in disability and death. Diabetic nephropathy is damage to the kidney caused by diabetes; it is a common diabetic complication and a leading cause of death in people with diabetes [3].

Many researchers have tried to determine risk factors for mortality in patients with diabetes via statistical approaches, and there are also many studies on predictors of diabetic nephropathy. The long-term occurrence of high blood glucose levels and high blood pressure is highly indicative of the development of renal disease, other microvascular lesions, and macrovascular disease [4,5]. Clinical trials have consistently demonstrated that suppression of the glycosylated hemoglobin (HbA1C) level is associated with decreased risk for clinical and structural manifestations of diabetic nephropathy in patients with type 1 and type 2 diabetes [6,7]. Torffvit and Agardh showed that poor metabolic control and high blood pressure are associated with the development of diabetic nephropathy in patients with type 2 diabetes [8]. A logistic regression analysis has been adopted to generate prediction rules for identifying patients with diabetes at high risk of complications and for analyzing risk factors [9]. However, most of those prognostic studies compared the mean values of independent predictors between patients with diabetic nephropathy and control groups, using simple statistical methods such as Student's t-test and the non-parametric Mann-Whitney U-test. The decline in kidney function varies considerably between patients, and the determinants of diabetic nephropathy have not been identified clearly. Therefore, it is difficult to predict diabetic nephropathy accurately using simple statistical approaches.

A number of studies have taken advantage of data mining techniques in the diabetes domain. A 1998 review provided evidence of the use of decision support systems to guide physicians through the clinical consultation [10]. That paper used time series methods and a causal probabilistic network, and proposed telemedicine and telecare for patients with diabetes. Others introduced various types of artificial neural networks (ANNs) or decision trees to predict the onset of diabetes and to identify risk factors in the data [11—14]. However, most of those papers attempted to predict the onset of diabetes itself, even though its complications matter far more for quality of life and mortality.

Although many have tried to approach health problems with data mining techniques, there are many constraints and difficulties in doing so [15]. The main difficulty arises in data collection. Because hospitals have not used electronic medical record systems for long, it is very hard to collect a dataset large enough for such research. Moreover, in the university hospital setting, physicians and medical practitioners occasionally transfer to other medical institutions, which may lead to different patient care protocols, including physical examination and interview formats. Furthermore, because of overconfidence in their health condition or for other personal reasons, some outpatients visit the hospital irregularly. These circumstances can produce very irregular, incomplete, or missing data in the clinical setting, and thus make it difficult to extract meaningful information from the data.

Other difficult problems in medical data mining lie in the artificial intelligence algorithms themselves. To diagnose whether patients have a disease, or to predict whether they will develop a disease in the future, several studies have applied learning algorithms such as decision trees, ANNs, and support vector machines (SVMs). Although these give high generalization performance, users can barely interpret the results in terms of the input variables, especially with ANNs and SVMs; that is why they are termed "black box" algorithms. Consequently, despite its relatively poor generalization performance, physicians and medical practitioners still use logistic regression (LR) as a gold standard, because it gives more information about the results: it provides not only the percentage of variance in the output variable (i.e., a probabilistic output), but also the odds ratio of each input variable.

In information retrieval, there are various feature (i.e., variable) selection methods that rank the importance of input features and thereby enhance generalization performance by eliminating disturbing features. Most of those methods are concerned mainly with the ranking of the input features rather than with insight into each feature. In a practical clinical situation, however, physicians may wish to understand the actual effect of each feature on the results: an interpretation of how the prediction would change if a feature's value were to change. Jakulin et al. introduced a nomogram approach for visualizing SVMs that graphically exposes the model's internal structure and visualizes the effect of each feature by means of the log odds ratio, as in LR [16].

In this study, we aimed to predict the onset of diabetic nephropathy by learning an SVM predictive model from an irregular and unbalanced diabetes dataset. We also attempted to identify risk factors among the clinical parameters using feature selection and nomogram visualization.

2. Classification methods

2.1. Logistic regression and the ridge estimator

LR is a popular method for generating a predictive model for a dichotomous target variable. The relationship between the input variables and the response is not a linear function but the logistic regression function:

P(x) = \frac{1}{1 + e^{-(a + b_1 x_1 + b_2 x_2 + \cdots + b_d x_d)}}   (1)

where P is the probability that an event happens, a is the intercept of the equation, and b_i is the coefficient of the ith input variable. The log-likelihood function of the data is

L(b) = \sum_{i} \left( y_i \log P(x_i) + (1 - y_i) \log(1 - P(x_i)) \right)   (2)

The conventional LR optimizer finds the LR coefficients by maximizing L(b) using the maximum likelihood estimation method. However, unstable parameter estimates may arise when the number of input variables is large or the input variables are highly correlated. Cessie and Houwelingen introduced ridge estimators to resolve this problem [17]. They included a restriction term in the log-likelihood function:

L_\lambda(b) = L(b) - \lambda \|b\|^2   (3)

where L(b) is the unrestricted log-likelihood function and \|b\| is the norm of the parameter vector b. The ridge parameter \lambda shrinks the norm of b, and the restriction term stabilizes the system to provide estimates with smaller variance.
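As an illustration of the ridge-penalized LR of Eq. (3), the following minimal Python sketch uses scikit-learn's L2-penalized LogisticRegression, where C plays the role of 1/λ. The random data, the value of λ, and the use of scikit-learn are assumptions for illustration only; they are not the implementation used in this study.

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(292, 184))      # placeholder for 292 records with 184 features
y = rng.integers(0, 2, size=292)     # placeholder binary labels

lam = 1.0                            # ridge parameter lambda of Eq. (3), illustrative value
ridge_lr = LogisticRegression(penalty="l2", C=1.0 / lam, max_iter=1000)
ridge_lr.fit(X, y)
print(ridge_lr.coef_.shape)          # (1, 184): one coefficient b_k per feature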

2.2. Cost sensitive learning in SVMs

SVMs, an emerging classification technique, have been intensively benchmarked against a variety of techniques; they are among the best-known classification techniques, with computational advantages and good generalization performance. The main idea of SVMs is to maximize the margin, which is defined as the distance from the separating hyperplane to the closest training samples (the support vectors) [18].

However, for an unbalanced dataset that has far more positives than negatives or vice versa, general classifiers may produce poor generalization performance, as the hyperplane may be pushed toward the minority training samples. For this reason, Veropoulos et al. used a cost-sensitive learning approach for SVMs; their key point is to assign a different cost (penalty) to the errors of each class [19]. Accordingly, the optimization problem of the SVM becomes:

minimize   \frac{1}{2}\|w\|^2 + C^{+} \sum_{\{i \mid y_i = +1\}} \xi_i + C^{-} \sum_{\{i \mid y_i = -1\}} \xi_i   (4)

subject to   y_i (w \cdot x_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \ldots, n   (5)

where w is the weight vector of the separating hyperplane, \xi_i is a slack variable that allows the margin constraints to be violated, and C is a user parameter to be tuned. The first term of the objective function concerns margin maximization, and the second and third terms control the penalties for errors on positive and negative training samples. Geometrically, when the positive data are in the minority, this gives more weight to the positive support vectors (C^{+} larger than C^{-}) and pushes the hyperplane towards the negative (majority) data.
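A minimal sketch of the cost-sensitive SVM of Eqs. (4) and (5), assuming scikit-learn rather than the LIBSVM implementation used in this study; the per-class weights below are illustrative, not the tuned values reported later.

from sklearn.svm import SVC

C_plus, C_minus = 7.0, 1.0           # hypothetical error penalties for the positive/negative class
svm = SVC(kernel="linear",
          C=1.0,                      # base penalty; class_weight scales it per class
          class_weight={1: C_plus, 0: C_minus},
          probability=True)           # sigmoid-calibrated probabilistic output
# svm.fit(X, y) would train on a feature matrix X and binary labels y (0 = negative, 1 = positive).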


Table 1 Algorithm of the ReliefF method

3. Feature selection methods

Feature selection is a machine learning process that selects a feature subset from the whole feature set and removes redundant features that do not contribute to the performance. Feature selection methods have been introduced to avoid the "curse of dimensionality", which means that the required number of calculations becomes huge as the number of dimensions increases, while retaining or even enhancing the performance.

There are three main approaches to feature selection: filter, wrapper, and embedded methods [20]. Filter methods select highly ranked features based on a statistical score as a preprocessing step; ReliefF is a popular filter algorithm in microarray classification problems because of its simplicity. Wrapper methods perform selection by treating the classifier as a black box and ranking subsets of features by their predictive power. Because a full search requires 2^n different evaluations, forward selection or backward elimination methods are used; sensitivity analysis can be adopted to calculate the importance of each feature with any classifier. Embedded methods, in contrast to wrapper approaches, select features while considering the classifier design at the same time.

3.1. ReliefF algorithm

A key idea in ReliefF is to evaluate the contribution of each feature to inter-class difference and intra-class similarity [21]. For a randomly selected instance, the algorithm looks for the k nearest hits (instances with the same class label) and misses (instances with a different class label). It then updates the quality of the contribution of each feature with respect to the difference between the feature values of the selected instance and those of its nearest neighbors. The pseudocode is shown in Table 1. The function diff(f, I_i, I_j) calculates the difference between the feature values of two instances. Thus, the weight W[f] increases when the feature value of the selected instance differs from that of the nearest misses M_j(C), and it decreases when there is a difference between the feature values of the selected instance and the nearest hits H_j. Finally, according to the weight values W[f], one can rank the features and perform feature selection by eliminating the features with the smallest weights.
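A simplified sketch of the ReliefF weight update described above, assuming binary labels and numeric features scaled to [0, 1]; it follows the logic of Table 1 only loosely and is not the Weka implementation used in the experiments.

import numpy as np

def relieff_weights(X, y, n_iter=100, k=10, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W = np.zeros(d)
    for _ in range(n_iter):
        i = rng.integers(n)
        dist = np.linalg.norm(X - X[i], axis=1)
        dist[i] = np.inf                       # exclude the selected instance itself
        hits = np.where(y == y[i])[0]
        misses = np.where(y != y[i])[0]
        hits = hits[np.argsort(dist[hits])[:k]]
        misses = misses[np.argsort(dist[misses])[:k]]
        # the weight of a feature grows when it separates the instance from its misses
        # and shrinks when it differs between the instance and its hits
        W += np.abs(X[misses] - X[i]).mean(axis=0) / n_iter
        W -= np.abs(X[hits] - X[i]).mean(axis=0) / n_iter
    return W    # rank features by W and drop those with the smallest weights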

3.2. Sensitivity analysis with SVM

Sensitivity analysis is another method that has been widely used to rank input features in terms of their contribution to the deviation of the outputs [22]. It involves varying every input feature over a reasonable range with the others fixed, and observing the relative changes in the outputs. Features that produce a larger deviation in the output are considered more important, and one can select features by the same procedure as in the ReliefF method above. Table 2 shows the pseudocode of the sensitivity analysis method. The algorithm calculates the difference between the maximum and minimum output of the predictive model when a feature value varies from its possible minimum to maximum with the other features fixed at their mean values.
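A minimal sketch of the sensitivity-analysis ranking described above: each feature is swept from its minimum to its maximum with all other features fixed at their means, and the spread of the model's probabilistic output is recorded. The fitted model object with a predict_proba method is an assumption of this sketch, not the implementation used in the study.

import numpy as np

def sensitivity_ranking(model, X, n_steps=20):
    means = X.mean(axis=0)
    spread = np.zeros(X.shape[1])
    for k in range(X.shape[1]):
        grid = np.tile(means, (n_steps, 1))                    # all features held at their means
        grid[:, k] = np.linspace(X[:, k].min(), X[:, k].max(), n_steps)
        p = model.predict_proba(grid)[:, 1]
        spread[k] = p.max() - p.min()                          # larger spread = more important feature
    return np.argsort(spread)[::-1]                            # feature indices, most important first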


Table 2 Algorithm of the sensitivity analysis method

3.3. Recursive feature elimination with SVM (SVM-RFE)

SVM-RFE is an embedded method that similarly removes less important features recursively, but uses the weight magnitude as the ranking criterion [23]. The outline of the algorithm is presented in Table 3. Through the iterative training process in SVM-RFE, one can find the subset of features that provides the highest performance. However, the algorithm is theoretically limited to the linear kernel in SVM, because the implicit mapping of a non-linear kernel makes it difficult to calculate the weight vector. The authors of [23] also mention a non-linear kernel version of SVM-RFE that is computationally more expensive.
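A minimal sketch of SVM-RFE with a linear kernel: train, rank the features by the magnitude of their weights, drop the weakest, and repeat. It assumes scikit-learn, removes one feature per round, and uses an illustrative stopping point (39 features); it is not the implementation used in the study.

import numpy as np
from sklearn.svm import SVC

def svm_rfe(X, y, n_keep=39):
    remaining = list(range(X.shape[1]))
    while len(remaining) > n_keep:
        clf = SVC(kernel="linear", C=1.0).fit(X[:, remaining], y)
        w = np.abs(clf.coef_).ravel()          # weight magnitude as the ranking criterion
        del remaining[int(np.argmin(w))]       # eliminate the least important feature
    return remaining                           # indices of the surviving features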

Table 3 Algorithm of the SVM-RFE method

4. Risk factor analysis with nomogram visualization of SVM

4.1. General concept of the nomogram in predictive models

The nomogram gives insight into a model by visualizing the effect of each feature on the prediction. Fig. 1 shows an imaginary example of a nomogram visualizing a predictive model. Suppose that one is trying to predict whether a patient will have diabetic nephropathy by considering two features: whether the patient smokes, and the systolic blood pressure level. In the figure, the log odds ratio (Log OR) scores of the individual features are summed up using the topmost axis of the nomogram and used to estimate the probability of having diabetic nephropathy (the bottommost axis of the nomogram). In this example, the Log OR score of smoking is 0.46 and that of a systolic blood pressure level of 170 is 0.63. Thus, the sum of the Log OR scores becomes 0.46 + 0.63 = 1.09, which corresponds to a probability of 0.86 that the patient will have diabetic nephropathy within a year.

Figure 1 An imaginary nomogram example of a support vector machine (SVM) model that predicts the probability of having diabetic nephropathy within 1 year. The nomogram gives insight into the model by presenting the Log OR score for each feature, which denotes the effect of the individual feature on the prediction (the higher the Log OR score, the higher the risk).

Using the Log OR line, one can easily see how much each feature influences the target probability. When the Log OR score of a feature is high, it has a more positive effect on the probability (the higher the Log OR score, the higher the risk that the patient will have diabetic nephropathy). Moreover, longer feature lines on the nomogram have a wider range of Log OR scores and thus stronger effects on the target prediction probability. For example, the systolic blood pressure line is longer than the smoking line, which implies that the probability of diabetic nephropathy is more strongly associated with systolic blood pressure.

4.2. How to draw a nomogram with SVM

Jakulin et al. [16] introduced a nomogram method for SVM predictive models, in which the distance from a data sample (x, y) to the separating hyperplane of the SVM is considered an independent variable and denoted d(x). Given a kernel function K(x, z), the distance can be replaced by the SVM decision function as follows:

d(x) \approx b + \sum_{j=1}^{N} y_j \alpha_j K(x, z_j)   (6)

where b is the bias, \alpha_j are the coefficients of the support vectors z_j in the SVM, and N is the number of support vectors. When the kernel is linearly decomposable with respect to each feature, the distance becomes:

d(x) \approx b + \sum_{k=1}^{M} [w]_k   (7)

and

[w]_k = \sum_{j=1}^{N} y_j \alpha_j K(x_k, z_{j,k})   (8)

where M is the number of features, x_k is the kth feature of the data vector x, and z_{j,k} is the kth feature of the jth support vector.

Considering the class label y as a dependent variable, the probability that the sample belongs to the positive group (in a binary classification problem) is denoted as

P(y = 1 \mid x) = \frac{1}{1 + e^{-(A + B \cdot d(x))}}   (9)

The parameters A and B can be calculated by optimizing the log-likelihood function, as is done in LR. A cross-validation is performed internally to prevent overfitting [26]. After optimizing the parameters A and B, one can revise (9) as

P(y = 1 \mid x) = \frac{1}{1 + e^{-\left( b_0 + \sum_{k=1}^{M} [b]_k \right)}}   (10)

where b_0 = A + B \cdot b and [b]_k = B \cdot [w]_k. Here b_0 is an intercept, a constant representing the prior probability in the absence of any features, and [b]_k is the effect vector that maps the value of the kth feature into a point score, which finally becomes the Log OR line for that feature in the nomogram, as seen in Fig. 1. Hence, when the sum of the Log OR scores of the features is high, the probability that y equals one (i.e., that the sample belongs to the positive class) is also high. Note that the linear kernel can be decomposed by feature in this way, whereas neither the polynomial kernel nor the radial basis function (RBF) kernel is linearly decomposable.
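For a linear kernel, d(x) = b + Σ_k w_k x_k, so [w]_k is simply w_k x_k and [b]_k = B·w_k·x_k. The sketch below is an illustration under that assumption: it fits the sigmoid of Eq. (9) with a plain logistic regression on d(x) rather than the internally cross-validated procedure cited above, and it is not the VRIFA implementation.

import numpy as np
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

def linear_svm_log_or(X, y):
    svm = SVC(kernel="linear", C=1.0).fit(X, y)
    d = svm.decision_function(X).reshape(-1, 1)    # distances d(x) to the hyperplane
    platt = LogisticRegression().fit(d, y)         # fits P(y=1|x) = 1/(1 + e^-(A + B*d(x)))
    B, A = platt.coef_[0, 0], platt.intercept_[0]
    w, b = svm.coef_.ravel(), svm.intercept_[0]
    log_or = B * w * X                             # [b]_k = B * w_k * x_k for every sample and feature
    intercept = A + B * b                          # b_0 of Eq. (10)
    return log_or, intercept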

4.3. Nomogram-based recursive feature elimination (nomogram-RFE)

Because the prediction output is mainly associated with the effect vector (i.e., the Log OR), one can deduce that a feature is more important when the length of its line in the nomogram is longer, as described above. Consequently, a feature selection method based on the nomogram determines the more important features according to the lengths of the lines.

From the SVM model trained at each iterative round, the nomogram-RFE method calculates the lengths of the lines that correspond to the features in the nomogram. As with SVM-RFE, nomogram-RFE recursively removes the features that have low effects on the prediction output (i.e., short lines). During the iterative training process, one can find the subset of features that provides the best predictive performance.
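Under the same linear-kernel assumption, the length of a feature's line is the range of its Log OR contribution over the observed values of that feature; a minimal sketch of this ranking step, building on the hypothetical log_or matrix above, is:

import numpy as np

def nomogram_line_lengths(log_or):
    # log_or: (n_samples, n_features) matrix of per-sample Log OR scores, one column per feature
    return log_or.max(axis=0) - log_or.min(axis=0)     # short lines are eliminated first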

Figure 2 A virtual example of irregular and incomplete data in the clinical setting. This patient was diagnosed with diabetic nephropathy on 10 November 2004 by measuring microalbumin secretion.

5. Experimental set-up

5.1. Data preparation

Data from all patients with diabetes who attended the outpatient clinic of Samsung Medical Center, Seoul, Korea, were collected consecutively for up to 10 years (1996—2005). A total of 4321 adult patients with type 2 diabetes mellitus underwent several of the following 20 physical and laboratory examinations at each visit: glycosylated hemoglobin (HbA1C, %), cholesterol (mg/dL), alkaline phosphatase (ALP, U/L), alanine aminotransferase (ALT, U/L), aspartate aminotransferase (AST, U/L), creatinine (mg/dL), blood urea nitrogen (BUN, mg/dL), triglyceride (mg/dL), white blood cell count (WBC, ×10³/μL), hemoglobin (g/dL), platelet count (×10³/μL), high-density lipoprotein cholesterol (HDL-C, mg/dL), low-density lipoprotein cholesterol (LDL-C, mg/dL), Na+ (mmol/L), K+ (mmol/L), uric acid (mg/dL), microalbumin (μg/min), systolic blood pressure (sBP, mmHg), diastolic blood pressure (dBP, mmHg) and body mass index (BMI, kg/m²). For the differential diagnosis of diabetic nephropathy, we selected the positive patients from the whole dataset when the following criteria were fulfilled: (1) urinary albumin (microalbumin) of 20—200 μg/min, (2) no evidence of microalbuminuria or renal failure at the time of diabetes diagnosis, and (3) prior evidence of diabetic retinopathy, as the development of renal disease is strongly associated with the occurrence of retinopathy [24,25].

As mentioned above, there was some difficulty in data collection: the patients did not always take all the tests at each visit, and their visits were very irregular. Some patients visited the hospital less than once a year. Fig. 2 illustrates an imaginary example of a patient that could exist in the dataset. This patient's first visit to the hospital was in the early part of 2000, and only two parameters (HbA1C and cholesterol) were measured at that time. The patient took five tests 8 months later, four tests another 7 months later, and so on. As shown, this sequential dataset does not always contain the same items at every visit, and the time gaps are irregular. Thus, one cannot directly use such irregular and incomplete data in SVM classification, which led us to preprocess the data.

5.2. Feature extraction using the quantitative temporal abstraction

Over the past decade, much work has been done to extract relevant features from time-stamped longitudinal data, and temporal abstraction (TA) is one of the most interesting approaches for that purpose [26—28]. In clinical domains, the goal of the TA task is to evaluate and summarize the state of the patient over a period. Several studies in the diabetes mellitus domain have also incorporated the TA framework [29,30]. Those studies deal with high-frequency domains in which the clinical parameters are measured at least a few times a day, such as monitoring in the intensive care unit, and they focus mainly on the knowledge-based TA approach, which requires the clinician's domain knowledge and derives meta features that form symbolic descriptions of the data. Verduijn et al. compared the qualitative (knowledge-based) TA procedure with quantitative (data-driven) TA, which extracts meta features from the data using statistical summaries (such as the mean, variance, and slope coefficient) with minimal use of domain knowledge [31]. They concluded that the quantitative TA procedure is preferable to the qualitative TA in their case study on prediction from intensive care monitoring data. Thus, we extracted features from our sparse time-stamped data (each variable was measured a few times per year) by simple data-driven abstraction rather than by the knowledge-based TA method.

Using the quantitative TA, we derived the following nine meta features from each laboratory examination dataset to represent the trend of the sequential data over the period between the first visit and the latest visit before the diagnosis: the minimum, maximum, mean, variance, slope, estimated value on the date of prediction (EST), initial value, latest value before the prediction, and K value (see below).

We calculated the slope using linear regression analysis on the consecutive values of the laboratory examination data over a given period. We also estimated the EST by applying the date of prediction to the fitted regression model. The K value is similar to the stochastic oscillator in the financial domain, which computes the location of the latest feature value relative to its range over a given period, as follows:

K = \frac{L - \mathrm{Min}}{\mathrm{Max} - \mathrm{Min}}   (11)

where L is the latest value, Min is the minimum and Max is the maximum.
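A minimal sketch of this quantitative TA step for a single lab item of one patient; the measurement times (in years), values, and prediction date are hypothetical inputs, and the linear fit requires at least two measurements.

import numpy as np

def ta_features(times, values, t_pred):
    times, values = np.asarray(times, float), np.asarray(values, float)
    slope, intercept = np.polyfit(times, values, 1)     # linear regression trend
    est = slope * t_pred + intercept                    # estimated value on the date of prediction
    vmin, vmax = values.min(), values.max()
    k = (values[-1] - vmin) / (vmax - vmin) if vmax > vmin else 0.0   # Eq. (11)
    return {"min": vmin, "max": vmax, "mean": values.mean(), "variance": values.var(),
            "slope": slope, "EST": est, "initial": values[0], "latest": values[-1], "K": k}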

After preprocessing, we had 292 records (33 positives and 259 negatives) comprising 184 features per instance, including the demographic features of the patients: age on the date of prediction, age at the onset of diabetes, diabetes duration, and sex.

Table 4 The extracted features after the preprocessing

Feature index   Feature (a)
1—4             Onset age, diabetes duration, age, sex
5—13            White blood cell count (WBC)
14—22           Hemoglobin
23—31           Platelet count
32—40           Serum cholesterol level
41—49           Serum aspartate aminotransferase (AST) level
50—58           Serum alanine aminotransferase (ALT) level
59—67           Serum alkaline phosphatase (ALP) level
68—76           Blood urea nitrogen (BUN)
77—85           Creatinine
86—94           Uric acid
95—103          Na+
104—112         K+
113—121         Serum triglyceride level
122—130         High-density lipoprotein cholesterol (HDL-C) level
131—139         Low-density lipoprotein cholesterol (LDL-C) level
140—148         Glycosylated hemoglobin (HbA1C)
149—157         Microalbumin
158—166         Systolic blood pressure (sBP)
167—175         Diastolic blood pressure (dBP)
176—184         Body mass index (BMI)

(a) Each feature set except for 1—4 comprises nine features: slope, mean, variance, maximum, minimum, K, EST (estimated value on the date of prediction), initial value and latest value.

Table 4 summarizes the features extracted with thepreprocessing step.

5.3. Performance evaluation

Several classification algorithms were compared with each other. For the SVMs in particular, the effect of cost-sensitive learning was examined and compared with equal cost learning. More importantly, we evaluated the effects of the various feature selection methods: statistical feature selection, ReliefF, sensitivity analysis, SVM-RFE, and nomogram-RFE.

Throughout the experimental process, we used leave-one-out cross-validation (LOOCV) to evaluate the performance of each prediction method. LOOCV is equivalent to k-fold cross-validation where k is the number of data objects. Because the sigmoid parameters must be obtained to use the nomogram approach, we used the probabilistic outputs of SVM classification in all the experiments [32]. Assuming equal loss for false negatives and false positives, the optimal threshold for the probabilistic output is P(y = 1|f) = 0.5; an alternative threshold was also applied in the test phase. We used Weka [33] and LIBSVM [34] for the implementations of ReliefF and SVM.
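A minimal sketch of the LOOCV protocol with probabilistic SVM outputs, assuming scikit-learn rather than the Weka/LIBSVM tools used in the study; X and y stand in for the 292-record dataset.

import numpy as np
from sklearn.model_selection import LeaveOneOut
from sklearn.svm import SVC

def loocv_probabilities(X, y):
    probs = np.zeros(len(y))
    for train_idx, test_idx in LeaveOneOut().split(X):
        clf = SVC(kernel="linear", C=1.0, probability=True)
        clf.fit(X[train_idx], y[train_idx])
        probs[test_idx] = clf.predict_proba(X[test_idx])[:, 1]   # held-out probability of being positive
    return probs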

Because the dataset in this study had an unbalanced distribution of positives and negatives, classification accuracy (the rate of correctly classified test samples) was not sufficient as a performance measure for the predictive models. Suppose that one has a dataset including 10 positives and 90 negatives. A simple decision rule that classifies all instances as negative would achieve 90% accuracy, yet it could not correctly predict any positive instance. From a medical point of view, a false negative is the most critical error, because such a patient would not receive appropriate medical care. However, one also needs to reduce the number of false positives, which lead to unnecessary additional physical examinations or treatment.

Accordingly, sensitivity and specificity analysis is common in medicine. Sensitivity is the percentage of diseased patients correctly recognized by the classifier, whereas specificity is the percentage of healthy subjects correctly recognized. However, there is a trade-off between sensitivity and specificity, so other, more stable evaluation metrics are also needed. Therefore, we used the area under the curve (AUC) of the receiver operating characteristic (ROC) curve as the target performance metric. The ROC curve plots the false positive rate (1 − specificity) against the true positive rate (sensitivity) while the decision threshold is varied. The AUC is a convenient way of comparing classifiers: a random classifier has an area of 0.5, and an ideal classifier has an area of 1.0.

We also calculated other criteria to evaluate a classifier, such as the balanced error rate (BER) and the harmonic mean of sensitivity and specificity (HMSS). The definition of BER is

\mathrm{BER} = \frac{1}{2} \left( \frac{\text{false negatives}}{\text{pos}} + \frac{\text{false positives}}{\text{neg}} \right)   (12)

In the example mentioned above, the BER would be 50% because the first term of (12) is 10/10 but the second would be 0/90. The HMSS is analogous to the F measure in information retrieval. The general definition of the F measure for a non-negative real value a is

F_a = \frac{(1 + a) \times \text{precision} \times \text{recall}}{a \times \text{precision} + \text{recall}}   (13)

where the precision is the fraction of predicted positives that are actual positives, while the recall is the fraction of the actual positives that are predicted by the classifier, which is identical to sensitivity. The F1 measure, where a is one, is the traditional F measure and is indeed the harmonic mean of precision and recall, i.e.,

F_1 = \frac{2 \times \text{precision} \times \text{recall}}{\text{precision} + \text{recall}}   (14)

However, the F measure does not take into account performance on the negative class, because it is hard to estimate the true negatives in information retrieval. To overcome this drawback of the F measure, we used the HMSS, defined as

\mathrm{HMSS} = \frac{2 \times \text{sensitivity} \times \text{specificity}}{\text{sensitivity} + \text{specificity}}   (15)

This gives equal weight to both sensitivity and specificity, as the F1 measure does for precision and recall.

Table 5 Comparison of the logistic regression and SVM kernel methods without cost-sensitive learning or feature selection method (a)

Parameter     Ridge LR (b)   SVM with linear kernel   SVM with RBF kernel
# Features    184            184                      184
C+ (c)        n/a (d)        1                        1
AUC           0.571          0.675                    0.644
Accuracy      0.774          0.887                    0.887
HMSS          0.559          0                        0
BER           0.379          0.5                      0.5
Sensitivity   0.424          0                        0
Specificity   0.819          1                        1

(a) Accuracy, HMSS, BER, sensitivity and specificity are measured using a threshold of 0.5 in the probabilistic output.
(b) Ridge LR: logistic regression with a ridge estimator.
(c) C+: cost weight for positive instances.
(d) n/a: not available.
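A minimal sketch of the evaluation metrics defined above (AUC, BER and HMSS), computed from true labels and probabilistic outputs at a chosen decision threshold; it assumes both classes are present in the labels.

import numpy as np
from sklearn.metrics import roc_auc_score, confusion_matrix

def evaluate(y_true, y_prob, threshold=0.5):
    y_pred = (np.asarray(y_prob) >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    sens = tp / (tp + fn)                                   # sensitivity
    spec = tn / (tn + fp)                                   # specificity
    ber = 0.5 * (fn / (tp + fn) + fp / (tn + fp))           # Eq. (12)
    hmss = 2 * sens * spec / (sens + spec) if (sens + spec) > 0 else 0.0   # Eq. (15)
    return {"AUC": roc_auc_score(y_true, y_prob), "BER": ber, "HMSS": hmss,
            "sensitivity": sens, "specificity": spec}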

6. Results

6.1. Baseline performances with all the features

First, we performed conventional LR without any feature selection method, and it failed to optimize the solution. Therefore, we used ridge LR as a replacement for the conventional method. Table 5 compares ridge LR, SVM with a linear kernel, and SVM with an RBF kernel using all 184 features. We applied equal weights to both positives and negatives for the baseline SVM experiments. For the SVM methods, the prediction accuracies were as high as 0.887 for both kernels when the decision threshold was 0.5. However, the HMSSs were zero and the BERs were about 0.5, clearly because of the imbalanced dataset and the low sensitivity of the classifiers. Although the SVM with the linear kernel showed the best performance in terms of AUC, all three methods performed poorly (all AUCs were lower than 0.7), suggesting the need for feature selection.

6.2. Statistical feature selection method

We applied the χ²-test to the feature sex and Student's t-test to the remaining 183 features. Among the 184 features, Table 6 summarizes the significantly different features between the two groups: those who developed diabetic nephropathy (positive) and those who did not (negative).



Table 6 Results of statistical analysis for each feature between the two groups (only these 17 features showed significant differences from the 184 features tested)

Feature index   Feature name              Positive (n = 33)   Negative (n = 259)   Significance
                                          Mean ± S.D.         Mean ± S.D.
12              WBC (initial)             7.26 ± 1.70         6.63 ± 1.61          *
63              ALP (min)                 72.76 ± 25.83       63.07 ± 18.22        *
117             Triglyceride (min)        100.82 ± 44.91      82.56 ± 34.59        **
123             HDL-C (mean)              44.76 ± 10.20       49.50 ± 11.31        *
125             HDL-C (max)               53.15 ± 12.59       60.08 ± 14.33        **
130             HDL-C (latest)            41.54 ± 9.93        46.81 ± 13.60        *
141             HbA1C (mean)              8.10 ± 1.12         7.58 ± 1.17          *
143             HbA1C (max)               10.42 ± 1.61        9.75 ± 1.86          *
144             HbA1C (min)               6.46 ± 0.95         6.05 ± 0.94          *
148             HbA1C (latest)            8.10 ± 1.31         7.53 ± 1.57          **
150             Microalbumin (mean)       10.16 ± 4.08        6.68 ± 3.08          ***
151             Microalbumin (variance)   2.86 ± 1.87         1.72 ± 1.39          **
152             Microalbumin (max)        13.31 ± 4.31        8.65 ± 4.03          ***
153             Microalbumin (min)        7.24 ± 4.75         4.85 ± 2.77          **
156             Microalbumin (initial)    10.23 ± 5.30        6.72 ± 3.70          ***
157             Microalbumin (latest)     9.92 ± 5.50         6.79 ± 3.96          **
172             dBP (K)                   0.45 ± 0.21         0.53 ± 0.27          *

* P < 0.05; ** P < 0.01; *** P < 0.001.

For the positive patients, the initial value of the WBC count (P < 0.05), the minimum value of ALP (P < 0.05), and the minimum value of triglyceride (P < 0.01) were significantly higher than those of the negative patients. The mean (P < 0.05), maximum (P < 0.01), and latest (P < 0.05) HDL-C values of the positive patients were significantly lower than those of the negatives. The positive patients had significantly higher mean (P < 0.05), maximum (P < 0.05), minimum (P < 0.05), and latest (P < 0.01) HbA1C values than the negatives. Lastly, the mean (P < 0.001), variance (P < 0.01), maximum (P < 0.001), minimum (P < 0.01), initial (P < 0.01), and latest (P < 0.01) microalbumin values of the positive patients were much higher.

Using only these 17 statistically different features, we performed five different combinational approaches. Table 7 shows the results, where an SVM classifier using the RBF kernel and cost-sensitive learning gave the best AUC (0.807) and the SVM classifiers with equal cost learning gave the worst AUC. The HMSSs and BERs of all the combinational methods were poor because of the low sensitivity of the classifiers, which means that the classifiers predicted most of the test data as negatives with the threshold of 0.5.

Table 7 Comparison of logistic regression and SVM kernel methods with statistically selected features (accuracy, HMSS, BER, sensitivity, and specificity were measured using a threshold of 0.5 in the probabilistic output)

Parameter     Ridge LR (a)   SVM linear   SVM RBF   SVM linear* (b)   SVM RBF* (b)
# Features    17             17           17        17                17
C+ (c)        n/a (d)        1            1         16                13
AUC           0.776          0.276        0.679     0.793             0.807
Accuracy      0.880          0.887        0.890     0.894             0.890
HMSS          0.348          0            0.059     0.167             0.167
BER           0.411          0.5          0.485     0.456             0.458
Sensitivity   0.212          0            0.030     0.091             0.091
Specificity   0.965          1            1         0.996             0.992

(a) Ridge LR: logistic regression with a ridge estimator.
(b) (*) SVM with cost-sensitive learning, including equal cost learning.
(c) C+: cost weight for positive instances.
(d) n/a: not available.


Figure 3 Performance variation of each classifier using the ReliefF method. Ridge LR stands for logistic regression with a ridge estimator, and the SVM classifiers marked with an asterisk show the best area under the curve (AUC) values of the receiver operating characteristic curve among the cost-sensitive learning and equal cost learning methods.

Figure 4 Performance variation of each classifier using the sensitivity analysis method. Ridge LR stands for logistic regression with a ridge estimator, and the SVM classifiers marked with asterisks show the best AUCs among the cost-sensitive learning and equal cost learning methods.

6.3. ReliefF method

To evaluate the ranking of the features in terms of their effect on discriminating ability, we applied the ReliefF method to the whole dataset as a preprocessing step. Based on this feature ranking, we eliminated low-ranked features one by one and retrained the classifiers with the remaining features. Fig. 3 plots the performance variation against the number of features selected by ReliefF. Note that it is easier to understand the graph by reading it from right to left, because we used backward elimination. Even after the classifiers were trained based on the feature ranking by ReliefF, no classifier showed any significant enhancement of performance compared with the classifiers using the statistical feature selection method. In Table 8, which shows the performance of the classifiers in the best cases, the results are poorer than those from the statistical feature selection method. The HMSSs and BERs were 0 and around 0.5, respectively.

Table 8 Comparison of logistic regression and SVM kernel methods with the ReliefF method in the best cases (accuracy, HMSS, BER, sensitivity and specificity were measured using a threshold of 0.5 in the probabilistic output)

Parameter     Ridge LR (a)   SVM linear   SVM RBF   SVM linear* (b)   SVM RBF* (b)
# Features    2              159          44        8                 8
C+ (c)        n/a (d)        1            1         7                 7
AUC           0.758          0.714        0.742     0.784             0.796
Accuracy      0.877          0.887        0.886     0.884             0.886
HMSS          0              0            0         0                 0
BER           0.506          0.5          0.5       0.502             0.5
Sensitivity   0              0            0         0                 0
Specificity   0.988          1            1         0.996             1

(a) Ridge LR: logistic regression with a ridge estimator.
(b) (*) The best SVM classifier that includes the cost-sensitive learning method.
(c) C+: cost weight for positive instances.
(d) n/a: not available.

6.4. Sensitivity analysis method

Unlike with the ReliefF method, we trained a classifier first and then eliminated the lowest-ranked feature, estimating the feature ranking by sensitivity analysis. Fig. 4 illustrates the performance variation versus the number of features selected by sensitivity analysis. Unlike the results with the ReliefF method, the performance showed a common tendency: it increased as the number of features decreased to a certain point, and then started to degrade as the number approached one. In most cases, the linear-kernel SVM classifier trained with the cost-sensitive learning method (SVM linear*) gave the best performance, and LR was the worst with respect to AUC.

Table 9 compares the five classifiers in their best cases for each classification method.



Table 9 Comparison of logistic regression and SVM kernel methods with sensitivity analysis method in the best cases(accuracy, HMSS, BER, sensitivity and specificity were measured using a threshold of 0.5 in the probabilistic output)

Parameter     Ridge LR (a)   SVM linear   SVM RBF   SVM linear* (b)   SVM RBF* (b)
# Features    76             39           75        39                63
C+ (c)        n/a (d)        1            1         1                 7
AUC           0.879          0.969        0.918     0.969             0.943
Accuracy      0.911          0.921        0.894     0.921             0.897
HMSS          0.867          0.532        0.263     0.532             0.264
BER           0.130          0.322        0.430     0.322             0.428
Sensitivity   0.818          0.364        0.152     0.364             0.152
Specificity   0.848          0.992        0.988     0.992             0.992

(a) Ridge LR: logistic regression with a ridge estimator.
(b) (*) The best SVM classifier that includes the cost-sensitive learning method.
(c) C+: cost weight for positive instances.
(d) n/a: not available.

Among these, the linear-kernel SVM classifier with 39 selected features showed the highest AUC (0.969). Although the HMSS and BER of LR were better than those of the SVM classifiers, the SVM classifiers had higher AUCs. Notably, the best classifier (SVM linear) used equal cost learning rather than cost-sensitive learning.

6.5. SVM-RFE and nomogram-RFE

Fig. 5 shows that the embedded feature selection methods (SVM-RFE and nomogram-RFE) had exactly the same pattern of performance variation as the number of selected features decreased to a certain point (approximately 10), meaning that they eliminated the lowest-ranked features in the same order. Classifiers with cost-sensitive learning appeared superior to those with equal cost learning in most cases; however, both had the same performance peaks at the same numbers of features. The results of SVM-RFE and nomogram-RFE (both using the linear kernel) also showed exactly the same pattern as the SVM classifier using the linear kernel and sensitivity analysis discussed in Section 6.4.

Figure 5 Performance variation of each classifier using the SVM-RFE and nomogram-RFE methods. Ridge LR stands for logistic regression with a ridge estimator, and the SVM classifiers marked with an asterisk show the best AUCs among the cost-sensitive learning and equal cost learning methods.

For the best cases, all four approaches gave the same highest AUC (0.969), and all used 39 features. Although the accuracies of the classifiers were greater than 0.9, the HMSSs and BERs were still poor with a threshold of 0.5.

6.6. The best classifier

From the results of the various methods above, the SVM classifier with a linear kernel and the 39 selected features showed the best performance in terms of AUC. However, it needed an alternative threshold to improve the HMSS and BER. We therefore used as an alternative threshold the rate of positives over the total records in the training dataset (this probability threshold varied between 0.12 and 0.13 depending on the training set). When the alternative threshold was applied (Table 10), the HMSS increased and the BER decreased, although the accuracy became lower. These changes imply that SVM approaches may need an alternative probability decision threshold instead of 0.5 for imbalanced datasets.

Table 10 Comparison of the threshold variations in the best prediction

Parameter     SVM linear with 39 features
Threshold     0.5         Ratio (b)
# Features    39
C+ (a)        1
AUC           0.969
Accuracy      0.921       0.839
HMSS          0.532       0.890
BER           0.322       0.104
Sensitivity   0.364       0.970
Specificity   0.992       0.822

(a) C+: cost weight for positive instances.
(b) Ratio: the rate of positives over the total data records.


Figure 6 Receiver operating characteristic (ROC) curve of the best classifier. The AUC of the classifier is 0.969. The solid arrow indicates an optimum threshold point (probability of 0.13) for the classifier, which gives a sensitivity of 97% and a specificity of 85%. The dashed arrow indicates another threshold point (probability of 0.25) for the classifier, which gives 94% sensitivity and 95% specificity.

Figure 7 The VRIFA system applied to the linear-kernel-based SVM classifier with 39 selected features, as described in Section 6.6.

Fig. 6 shows the ROC curve of the best classifier. The solid arrow indicates the optimum point of the classifier, giving a sensitivity of 0.97 and a specificity of 0.85 at a threshold of 0.13. The dashed arrow indicates another option, which gives a sensitivity of 0.94 and a specificity of 0.95 at a threshold of 0.25.

6.7. Prediction time gap

We applied another statistical test related to the time gap, that is, the interval between the date of the latest examination and the date that we were trying to predict. As shown in Table 11, there was no significant difference in the prediction gap between positives and negatives: both groups had a prediction gap of about 0.2 years (2.5 months). For the microalbumin measurements, there was also no significant difference: in both groups, around 2 years passed before the microalbumin examination used for the final diagnosis of diabetic nephropathy. The total follow-up period was also not significantly different: approximately 5—6 years for both groups.

Table 11 The results of Student's t-test applied to the means of the prediction gap and time of follow-up between the two groups (all tests were non-significant)

Time                       Positive (n = 33)   Negative (n = 259)
                           Mean ± S.D.         Mean ± S.D.
Prediction gap (year)      0.21 ± 0.24         0.19 ± 0.12
Microalbumin gap (year)    2.05 ± 0.99         1.87 ± 0.91
Follow-up period (year)    5.37 ± 1.98         5.86 ± 1.99

6.8. Interpretation of the results using the VRIFA system

Based on the best classifier (nomogram-RFE with the 39 features) described in Section 6.6, we plotted the effects of the selected features using the nomogram-based visualization tool for risk factor analysis (VRIFA) in Fig. 7. The upper left part of the figure shows the ranges of the effect values of the 39 features. In that part, features 124 (HDL-C variance) and 170 (dBP maximum) appear to have strong effects on the prediction because they show wider ranges than the others. The detailed Log OR (effect value) of each feature at a specific input value can be seen in the right part of VRIFA. By applying all the weight values and the intercept to Eq. (10), we show an estimate of the probability of having diabetic nephropathy at the bottom of the figure.



Fig. 8 shows the detailed Log OR for each of the selected features. When the Log OR line rises towards the upper right as the feature value increases, as for the initial WBC, a higher feature value means a higher probability that the patient will develop diabetic nephropathy. By contrast, when the line goes down as the feature value increases (as for the minimum hemoglobin value), a higher feature value means a lower probability that the patient will develop diabetic nephropathy.

Examining each feature closely, some of the features selected statistically in Section 6.2 are also included in this selected feature set (initial WBC; minimum ALP; mean, maximum, minimum and variance of microalbumin), and they show similar tendencies: a higher value of each feature indicates a positive (bad) effect on having diabetic nephropathy. However, some other features have effects that are inconsistent with commonly believed facts. For example, the effect values decrease as the mean, EST and initial values of systolic blood pressure increase; in addition, the effect values decrease as the mean and minimum values of BMI increase. However, these features showed no statistically significant difference between the groups.

Figure 8 The detailed effect values (log odds ratios) of the selected features. *P < 0.05; **P < 0.01; ***P < 0.001.


7. Discussion

7.1. Data preparation and feature extraction

It was difficult to acquire such a large electronic medical dataset, with 4321 patients' follow-up records spanning up to 10 years. Moreover, the original data were irregular and incomplete sequences, which made this study challenging. Therefore, we extracted 184 meta features in the preprocessing, shrinking the number of records to 292; this step was essential for the machine learning application. To represent a record for each physical and laboratory examination (20 items), we employed the quantitative TA approach, which calculates several fundamental statistical features such as the minimum, maximum, mean and variance; we also estimated the trend of each item. These extracted features sufficed to describe the data and to distinguish the two groups of patients who did or did not develop diabetic nephropathy, because the classification performance with several feature selection methods showed promising results. Nevertheless, our future work will compare this simple statistical abstraction with more complex abstraction using the knowledge-based TA procedure.

Despite this reduction in data, there was unlikely to be any bias in sampling. In the preprocessing step, we selected all data records under the same conditions. Thus, they all needed the full 184 features, which means that each patient had to have had at least two examinations of all 20 items before the final diagnosis. Moreover, there were no significant differences in age, sex, age at the onset of diabetes, or duration of diabetes between the two groups.

7.2. Feature selection methods

There were too many features to train on compared with the number of records; this raises the issue of the "curse of dimensionality", which refers to the exponential growth of hypervolume as a function of dimensionality. To deal with this, we adopted several feature selection methods, some of which showed better results than the classifier using all 184 features. This accords with Occam's razor: the fewer the assumptions made to explain a phenomenon, the better.

Using statistical analysis, we established that some features differed significantly between the two groups. However, those features were not sufficient to distinguish the two groups, because the classification performance using them was poor compared with that of the other feature selection methods. Statistical methods such as Student's t-test compare the means of groups and indicate whether there is any probable difference between them. However, such statistical differences do not guarantee that the values of each group of patients are linearly or non-linearly separable, so it was hard to predict diabetic nephropathy using statistical feature selection alone.

Filter methods such as ReliefF have computational advantages because they do not interact with classifiers, which is why they have been used as simple approaches. By contrast, the wrapper and embedded methods consider the classifier design and are thus computationally expensive. However, these methods (sensitivity analysis, SVM-RFE and nomogram-RFE) showed better performance than ReliefF.

7.3. Classification methods

The failure of conventional LR here could havearisen from the large number of features. The pre-processing step might be another source of thisfailure, as we extracted these features from irre-gular sequential datasets and they must have hadhigh correlations with each other (i.e., multicoli-nearity). With every feature selection method usedin this study, SVM classification was superior to ridgeLR in terms of the AUC. However, SVM classificationusing cost-sensitive learning gave almost the sameresults as the equal cost learning method. Cost-sensitive learning moves the separating hyperplaneaway to the minority class data; this is very similarto the bias variation scheme, in which the user canalter the decision threshold using bias. Moreover,threshold variation does not related with the AUC ofthe ROC curve. Therefore, the AUC may dependmostly on the tuning of other parameters, ratherthan the penalty to the error of the minority classdata.

7.4. Considerations of the VRIFA system

LR is used widely in medicine because it can describe feature ranking through odds ratios and thus give intuitive information to medical practitioners. However, odds ratios are very difficult to interpret when the independent variables (input features) take continuous values. Ridge LR therefore made little use of this advantage, because all features used in this study were continuous except for one: sex.
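For reference, the odds-ratio reading works as in the brief sketch below: exponentiating a logistic regression coefficient gives the multiplicative change in odds per one-unit increase of that feature, which is only easy to interpret when the unit is meaningful (e.g., a binary feature such as sex). The data and parameter values are placeholders.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(292, 5))     # placeholder continuous features
y = rng.integers(0, 2, size=292)  # placeholder labels

# L2-penalized (ridge-like) logistic regression; C is illustrative.
lr = LogisticRegression(penalty="l2", C=1.0, max_iter=1000).fit(X, y)

# Odds ratio per one-unit increase of each feature; for continuous
# features the "unit" is arbitrary, which is the interpretability
# problem discussed above.
odds_ratios = np.exp(lr.coef_.ravel())
```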

VRIFA, a visualization system using a nomogram for SVM, has been developed as a prototype. It gives an intuitive visualization of the effect of each feature together with a probabilistic output. Physicians can use it to predict diabetic nephropathy and to analyze the effects of the features. It may therefore be very useful for developing effective treatment strategies and for guiding patients in their lifestyle choices. In addition to its graphical output, we exploited VRIFA for a feature selection method (nomogram-RFE) based on the dynamic ranges of the features. In this application, nomogram-RFE achieved an AUC of 0.969 using 39 features, as did the other wrapper and embedded methods.
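Because the exact formula behind nomogram-RFE is not reproduced here, the sketch below only illustrates the underlying idea under our assumptions: in a linear SVM, the contribution of feature i to the decision value varies over a range proportional to |w_i| times the observed span of that feature, and features with the smallest effect ranges are candidates for elimination.

```python
import numpy as np

def nomogram_effect_ranges(w, X):
    """Effect range of each feature in a linear SVM decision function.

    w: weight vector of a trained linear SVM, shape (n_features,)
    X: training data, shape (n_samples, n_features)

    The effect of feature i spans |w[i]| * (max_i - min_i); recursively
    removing the feature with the smallest span is one plausible reading
    of nomogram-RFE (an assumption, not the authors' stated formula).
    """
    spans = X.max(axis=0) - X.min(axis=0)
    return np.abs(w) * spans

# Example with placeholder numbers: feature 0 has the widest effect range.
w = np.array([0.8, -0.1, 0.3])
X = np.array([[0.0, 5.0, 1.0],
              [2.0, 6.0, 1.5],
              [1.0, 5.5, 2.0]])
ranges = nomogram_effect_ranges(w, X)
```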

7.5. Clinical considerations

The features selected by our feature selection methods were not always concordant with the statistically significant features. Some of the features selected in VRIFA showed unexpected tendencies contrary to common medical opinion: for example, that high blood pressure and a high BMI might be associated with diabetic nephropathy. However, these features were not significantly different between the two groups of patients. A possible explanation is that the best classifiers used linear functions, so linear relationships exist among the features; for example, feature A could be important only when combined with feature B in a linear SVM classifier. Such relationships cannot be identified by a t-test, which does not take linear relationships between features into account.

On the other hand, the statistically significant features showed the same tendencies as the visualization output. The patients with diabetic nephropathy already had higher WBC counts and microalbumin values than the unaffected patients, which is consistent with the previous research described in Section 1. In particular, the frequent high microalbumin values before the onset of diabetic nephropathy suggest that the diagnostic threshold for microalbumin might be lowered for such patients.

The onset of diabetic nephropathy does not depend on a few particular features, but on a complex interrelationship involving many. Nevertheless, physicians could use the VRIFA system to estimate a patient's overall probability of developing diabetic nephropathy and easily find the most important risk factors. Early-stage diabetic nephropathy is reversible, meaning that a patient who has not yet developed overt albuminuria (over 200 µg/min) could revert to normal kidney function if treated appropriately. Early diagnosis of this disease is therefore very important, and this study could be an immediate option for that purpose. When we analyzed the prediction time interval, our proposed method used, on average, 5 years of follow-up data and predicted diabetic nephropathy about 2—3 months before the actual diagnosis.

A limitation of this study is that it did not deal with medication information, for example the use of insulin, oral agents, or antihypertensive drugs. Including such information remains difficult. Moreover, the concept of prediction is becoming increasingly difficult to pursue, because many patients are treated with antihypertensive drugs and other interventions as soon as microalbuminuria is diagnosed; such measures often return the patient's albumin excretion to normal [35,36]. Handling this problem may require other challenging data mining tasks, for example ontology and text mining. Future work will include medication information and more data records to support the existing models.

8. Conclusions

The goal of any prognostic study with machine learning methods is to support rather than replace clinical judgment. Our model scores are not sufficiently accurate or complete to supplant decision-making by physicians. In this respect, the main contribution of this study is to suggest a new way of giving physicians information with which to plan efficient and proper treatment strategies.

In this study, we tried to predict diabetic nephropathy and to determine its risk factors. We performed a preprocessing step to deal with the incomplete and unbalanced practical dataset, and assessed various machine learning techniques such as SVMs, cost-sensitive learning, and feature selection methods. The proposed method predicted the onset of diabetic nephropathy with promising performance and provided a graphical tool for risk factor analysis, which may generate new hypotheses that motivate further investigation and research. In addition, we were able to detect high microalbumin values in patients before they developed diabetic nephropathy.

The most significant aspect of this study is that, to our knowledge, it is the first attempt to apply data mining technology to predicting a diabetic complication. The number of hospitals that have adopted an electronic medical record system has increased geometrically during the past decade. This will probably lead to more efficient data collection and make it easier to study diseases such as diabetes and hypertension with machine learning or artificial intelligence technology. We therefore believe that this study will have wide application.

Acknowledgement

This study was supported by a Nano-bio Technology Development Project, Ministry of Science and Technology, Republic of Korea (2005-01249).


References

[1] Alberti KG, Zimmet PZ. Definition, diagnosis and classification of diabetes mellitus and its complications. Part 1. Diagnosis and classification of diabetes mellitus provisional report of a WHO consultation. Diabetic Med 1998;17:539—53.

[2] Jo HS, Sung JH, Choi JS, Hwang MS, Jeong HJ, Bae SC. Quality control of diagnostic coding in the Korean Burden of Disease Project. In: Schellekens W, editor. Proceedings of the international conference of the international society for quality in health care. Oxford: Oxford University Press; 2004. p. 181.

[3] Gross JL, De Azevedo MJ, Silveiro SP, Canani LH, Caramori ML, Zelmanovitz T. Diabetic nephropathy: diagnosis, prevention, and treatment. Diabetes Care 2005;28:164—76.

[4] Mogensen CE. Microalbuminuria, blood pressure and diabetic renal disease: origin and development of ideas. Diabetologia 1999;42:263—85.

[5] Rossing P, Hougaard P, Borch-Johnsen K, Parving H. Predictors of mortality in insulin dependent diabetes: 10 year observational follow up study. Br Med J 1996;313:779—84.

[6] Shichiri M, Kishikawa H, Ohkubo Y, Wake N. Long-term results of the Kumamoto study on optimal diabetes control in type 2 diabetic patients. Diabetes Care 2000;28(Suppl. 2):B21—9.

[7] UKPDS. Intensive blood-glucose control with sulphonylureas or insulin compared with conventional treatment and risk of complications in patients with type 2 diabetes. Lancet 1998;352:837—53.

[8] Agardh C, Agardh E, Torffvit O. The prognostic value of albuminuria for the development of cardiovascular disease and retinopathy: a 5-year follow-up of 451 patients with type 2 diabetes mellitus. Diabetes Res Clin Pract 1996;32:35—44.

[9] Selby JV, Ferrara A, Karter AJ, Liu J, Ackerson LM. Developing a prediction rule from automated clinical databases to identify high-risk patients in a large population with diabetes. Diabetes Care 2001;24:1547—55.

[10] Carson ER. Decision support systems in diabetes: a systems perspective. Comput Methods Progr Biomed 1998;56:77—91.

[11] Breault JL, Goodall CR, Fos PJ. Data mining a diabetic data warehouse. Artif Intell Med 2002;26:37—54.

[12] Park J, Edington DW. A sequential neural network model for diabetes prediction. Artif Intell Med 2001;23:277—93.

[13] Polak S, Mendyk A. Artificial intelligence technology as a tool for initial GDM screening. Expert Syst Appl 2004;26:455—60.

[14] Shanker MS. Using neural networks to predict the onset of diabetes mellitus. J Chem Inform Comput Sci 1996;36:35—41.

[15] Cios KJ, Moore GW. Uniqueness of medical data mining. Artif Intell Med 2002;26:1—24.

[16] Jakulin A, Mozina M, Demsar J, Bratko I, Zupan B. Nomograms for visualizing support vector machines. In: Grossman R, Bayardo R, Bennett K, editors. Proceedings of the 17th international conference on knowledge discovery and data mining (KDD '05). New York: ACM Press; 2005. p. 108—17.

[17] Cessie S, Houwelingen JC. Ridge estimators in logistic regression. J R Stat Soc Ser C: Appl Stat 1992;41:191—201.

[18] Vapnik V. The nature of statistical learning theory. New York: Springer; 1995.

[19] Veropoulos K, Cristianini N, Campbell C. Controlling the sensitivity of support vector machines. In: Aiello LC, Dean T, Kollerbaur A, editors. Proceedings of the 16th international joint conference on artificial intelligence (IJCAI '99), workshop ML3. San Francisco: Morgan Kaufmann; 1999. p. 55—60.

[20] Guyon I, Elisseeff A. An introduction to variable and feature selection. J Mach Learn Res 2003;3:1157—82.

[21] Kononenko I. Estimating attributes: analysis and extensions of relief. In: Bergadano F, Raedt LD, editors. Proceedings of the 7th European conference on machine learning (ECML '94). Berlin: Springer; 1994. p. 171—82.

[22] Stevensen M, Winter R, Widrow B. Sensitivity of feed-forward neural networks to weight errors. IEEE Trans Neural Networks 1990;1:71—80.

[23] Guyon I, Weston J, Barnhill S, Vapnik V. Gene selection for cancer classification using support vector machines. Mach Learn 2002;46:389—422.

[24] Gilbert RE, Tsalamandris C, Allen TJ, Colville D, Jerums G. Early nephropathy predicts vision-threatening retinal disease in patients with type 1 diabetes mellitus. J Am Soc Nephrol 1998;9:85—9.

[25] Mogensen CE, Vigstrup J, Ehlers N. Microalbuminuria predicts proliferative diabetic retinopathy. Lancet 1985;1:1512—3.

[26] Shahar Y, Tu SW, Musen MA. Knowledge acquisition for temporal-abstraction mechanisms. Knowl Acquis 1992;4:217—36.

[27] Shahar Y, Musen MA. Knowledge-based temporal abstraction in clinical domains. Artif Intell Med 1996;8:267—98.

[28] Stacey M, McGregor C. Temporal abstraction in intelligent clinical data analysis: a survey. Artif Intell Med 2007;39:1—24.

[29] Bellazzi R, Larizza C, Magni P, Montani S, Stefanelli M. Intelligent analysis of clinical time series: an application to diabetic patients monitoring. Artif Intell Med 2000;20:37—57.

[30] Seyfang A, Miksch S, Marcos M. Combining diagnosis and treatment using Asbru. Int J Med Inform 2002;68:49—57.

[31] Verduijn M, Sacchi L, Peek N, Bellazzi R, de Jonge E, de Mol BAJM. Temporal abstraction for feature extraction: a comparative case study in prediction from intensive care monitoring data. Artif Intell Med 2007;41:1—12.

[32] Platt JC. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. In: Smolar AJ, Bartlett P, Schoelkopf B, Schuurmans D, editors. Advances in large margin classifiers. Cambridge: MIT Press; 1999. p. 61—74.

[33] Witten IH, Frank E. Data mining: practical machine learning tools and techniques. San Francisco: Morgan Kaufmann; 2005.

[34] Chang C-C, Lin C-J. LIBSVM: a library for support vector machines. Software available at http://www.csie.ntu.edu.tw/cjlin/libsvm; 2001 [accessed September 4, 2007].

[35] Christensen CK, Krussel LR, Mogensen CE. Increased blood pressure in diabetes: essential hypertension or diabetic nephropathy? Scand J Clin Lab Invest 1987;47:363—70.

[36] Pedersen EB, Mogensen CE. Effect of antihypertensive treatment on urinary albumin excretion, glomerular filtration rate and renal plasma flow in patients with essential hypertension. Scand J Clin Lab Invest 1976;36:231—7.