Thesis: Building better predictive models for health-related outcomes


School of Computing and Information Systems
The University of Melbourne

Building better predictive models for health-related outcomes

Yamuna Kankanige

Supervisors: Prof. James Bailey and Assoc. Prof. Benjamin Rubinstein

Submitted in total fulfilment of the requirements of the degree of Doctor of Philosophy

Produced on archival quality paper

April, 2018


ABSTRACT

Predicting health-related outcomes is important for developing decision support systems that assist clinicians and other healthcare workers regularly faced with critical decisions. Such models will save time, help to manage healthcare resources and ultimately provide better quality of care for patients. These outcomes are now made possible thanks to complex medical data routinely generated at hospitals and laboratories, and to developments in data mining methods. This thesis focuses on the development of such decision support systems, as well as on techniques for improving the data, such as feature selection and acquisition, that are generically useful for building better prognostic models for predicting health-related outcomes.

Data mining in healthcare is an interesting and unique domain. The data available is heterogeneous, including demographic and diagnostic information of patients, clinical notes, medical imaging results and whole genome sequence data. Since most data is not collected for research purposes, there can be issues with data quality such as missing, ambiguous and erroneous information. In addition, some data might not be available in electronic format, which makes it time consuming to collect. Missing values are a major problem in this domain, and they arise not only from data entry or collection issues: some information is simply not available for some records. For example, the pathology test results available for a patient depend on the laboratory tests ordered by the clinician for that patient.

Another aspect of data mining in healthcare is that these models need to be sufficiently transparent for users to trust and use them. The techniques and algorithms that can be used for such models therefore depend on how much trust users place in those methods. In particular, it is imperative that data analysis on healthcare data generalizes.

The topic of this thesis, building better predictive models for health-related data, can be divided roughly into two parts. The first part investigates various data mining techniques used to improve the performance of prediction models, especially with regard to healthcare data, which helps to build better prognostic models for health-related outcomes. The second part of the thesis concerns applications of data mining models to clinical and biomedical data, to provide better health-related outcomes.

A common occurrence in classification at test time is partially missing test-case features. Since obtaining all missing features is rarely cost effective or even feasible, identifying and acquiring those features that are most likely to improve prediction accuracy is of significant impact. This challenge arises frequently in health data, where clinicians order only a subset of test panels for a patient at a time. In this thesis, we propose a confidence-based solution to this generic scenario using random forests. We sequentially suggest the features that are most likely to improve the prediction accuracy of each test instance, using a set of existing training instances which may themselves suffer missing values.

Density based logistic regression is a recently introduced classification technique, which has been successful in real clinical settings, that performs a one-to-one non-linear transformation of the original feature space to another feature space based on density estimations. This new feature space is particularly well suited for learning a logistic regression model, a popular technique for predicting health-related outcomes. Whilst performance gains, good interpretability and time efficiency make density based logistic regression attractive, there are limitations to its formulation. As another technique for improving features, we tackle these limitations of the feature transformation method and propose several new extensions in this thesis.

Liver transplants are a common type of organ transplantation, second only to kidney transplantations in frequency. The ability to predict organ failure or primary non-function at liver transplant decision time facilitates utilization of the scarce resource of donor livers, while ensuring that patients who are urgently in need of a liver transplant are prioritized. An index derived to predict organ failure using donor as well as recipient characteristics, based on local datasets, is of benefit in the Australian context. In a study using real liver transplant data, we show that by applying data mining techniques to donor, transplant and recipient characteristics known at the decision time of a transplantation, we can achieve high accuracy in matching donors and recipients, potentially providing better organ survival outcomes.


Serotyping is a common bacterial typing process in which isolated microorganism samples are grouped according to their distinctive surface structures, called antigens, which is important for public health and epidemiological surveillance. In a study using whole genome sequencing data from four publicly available Streptococcus pneumoniae datasets, we demonstrate that data mining approaches can be used to predict the serotypes of isolates faster than, and as accurately as, the traditional approaches.

In summary, this thesis focuses on techniques for improving data, such as feature selection, transformation and acquisition, that are generically useful for building better prognostic models for predicting health-related outcomes, as well as on applications of data mining techniques to clinical and biomedical data for improving health-related outcomes.


DECLARATION

This is to certify that

1. The thesis comprises only my original work towards the degree of Doctor of Philosophy except where indicated in the Preface,

2. Due acknowledgment has been made in the text to all other material used,

3. The thesis is fewer than 80,000 words in length, exclusive of tables, maps, bibliographies and appendices.

Yamuna Kankanige


PREFACE

This thesis has been written at the School of Computing and Information Systems, The University of Melbourne. Each chapter is based on manuscripts published, under review or in preparation for publication. I declare that I am the primary author and have contributed to more than 50% of each of these papers.

Chapter 3 is based on the manuscript in preparation:

• “TABASCO: Sequential Feature Acquisition for Classifier Learning”, Yamuna Kankanige, Benjamin Rubinstein, and James Bailey.

Chapter 4 is based on the paper:

• “Improved Feature Transformations for Classification Using Density Estimation”, Yamuna Kankanige and James Bailey. Proceedings of the 13th Pacific Rim International Conference on Artificial Intelligence, 2014, pp. 117–129.

Chapter 5 is based on the paper:

• “Machine-Learning Algorithms Predict Graft Failure Following Liver Transplantation”, Lawrence Lau¹, Yamuna Kankanige¹, Benjamin Rubinstein, Robert Jones, Christopher Christophi, Vijayaragavan Muralidharan² and James Bailey². Transplantation, 101(4):e125–e132, April 2017.

Chapter 6 is based on the manuscript in preparation:

• “A Novel Data Mining Approach to Prediction of Streptococcus Pneumoniae Serotype”, Yamuna Kankanige, Benjamin Goudey and Thomas Conway.

¹ Joint first authors. ² Joint last authors.


ACKNOWLEDGMENTS

My PhD journey is about to come to an end. This is the toughest challenge that I have taken on in my life so far, and yet it has been such a vibrant experience. This thesis would not have been possible without the help and guidance of my supervisors, collaborators, friends and family. I take this opportunity to acknowledge everyone who helped me throughout this journey.

First, I sincerely thank my supervisors, Prof James Bailey and Assoc Prof Benjamin Rubinstein. I am very lucky to have you as my supervisors. Not only are you brilliant academics who mentored me through all the problems and challenges, but you are also two of the nicest and most considerate people that I have met. Thank you James, for accepting me as one of your students and guiding, encouraging and supporting me through this PhD journey. I am always inspired by your positive attitude and balanced perspective towards a problem. Thank you Ben, for accepting me as one of your students and for actively guiding and mentoring me even before you became one of my supervisors. I value all the technical and even personal discussions we had and all the constructive feedback you have given me to improve myself. I am very fortunate to have taken this PhD journey under the guidance of the two of you.

I would also like to express my gratitude to IBM Research Australia and Dr Thomas Conway for providing me the opportunity to gain experience as an intern during my PhD. During my internship, I was fortunate to be mentored by two amazing researchers, Dr Thomas Conway and Dr Benjamin Goudey, with whom I continue to collaborate even after my internship. I am grateful that I got to know you, be mentored by you and work with you. This collaboration opened up the whole new area of genomics to my research interests. Both of you have been supportive and patient in guiding me through our work in an entirely new domain. Furthermore, I would like to thank the Biomedical Data Sciences team led by Dr Natalie Gunn for all the encouragement, support and the wonderful experience.

xi

Page 12: Thesis: Building better predictive models for health-related outcomes

During my PhD, I was privileged to be able to work with the Liver Transplant Unit at Austin Health, Heidelberg. This collaboration provided me with real-life experience of where data mining can be applied in healthcare and of the challenges that need to be addressed when doing such projects. I would like to thank Doctor Lawrence Lau, Doctor Su Kah Goh, Prof Vijayaragavan Muralitharan and all the others at the Liver Transplant Unit, Austin Health. It was such a privilege to work with you, and I was amazed at how quickly you all responded to all my queries amidst your very busy schedules as surgeons.

Next, I am so thankful to the members of my advisory panel, Prof Shanika Karunasekera and Assoc Prof Udaya Parampalli. Thank you for all the valuable feedback and support you provided over the years. Both of you have been available for me whenever I needed you and I am very grateful for that. I consider myself very lucky to be a PhD candidate at the School of Computing and Information Systems, University of Melbourne. I take this opportunity to thank the head of the school, Prof Justin Zobel, and all the other academic, research and professional staff members of the school. I feel privileged to be part of this prestigious group.

I acknowledge the financial support of the Australian Government Research Training Program Scholarship during my PhD. I would also like to thank Melbourne Bioinformatics for providing the computing facilities required for one of my projects.

For some of us, a PhD can be a very lonely chapter in our lives. But thanks to the past and present colleagues at the data mining group and our school, I was never left alone. It was such a pleasure to get to know you all, have fun, and share our stories, problems and challenges. Yang, Jiazhen, Mohadeseh, Donia, Sobia, Goce, Simone, Sergey, Yun, Florin, Zhou, Alvin, Daniel, Irum, Lakshmi, Kushani, Sameendra, Pasan, Neelofar and all the others whom I have forgotten to mention here; without you all this PhD experience would not have been this exciting, and I will cherish those memories forever.

There is a saying that “friends are the family that we choose for ourselves”. I am lucky to be surrounded by good friends here in Melbourne, in Sri Lanka and all around the world, to share my joys and sorrows. I sincerely thank all of you and your families.

I cannot forget my sister Indika, brother Maduranga, their families and my in-laws. I don’t have words to say how thankful I am to you for all you have done for me and my family. And to the person who supported and encouraged me at every step, my dear husband Gratian: thank you for believing in me and being you. My loving daughter Kisali and son Dylain, your mom is very proud to have wonderful kids like you; the thought of you kept me going at every hurdle I faced.

Last but not least, none of this would have been possible without your unconditional love, blessings and encouragement, my dear mother and father. I am me because of you.

Thank you all,
Yamuna


CONTENTS

1 Introduction
  1.1 Motivation
  1.2 Thesis Overview

2 Background
  2.1 Data Mining for Predicting Health Related Outcomes
  2.2 Classification Techniques
    2.2.1 Logistic Regression
    2.2.2 Artificial Neural Networks and Deep Learning
    2.2.3 Support Vector Machines (SVM)
    2.2.4 Naive Bayes
    2.2.5 Decision Trees
    2.2.6 Random Forests
  2.3 Evaluation and Performance Measurement
    2.3.1 Evaluation Techniques
    2.3.2 Performance Measurement
      2.3.2.1 Sensitivity (Recall)
      2.3.2.2 Specificity
      2.3.2.3 Classification Accuracy
      2.3.2.4 Positive Predictive Value (Precision)
      2.3.2.5 Negative Predictive Value
      2.3.2.6 Area Under the Receiver Operating Characteristic Curve
      2.3.2.7 Precision-Recall Curve
  2.4 Missing Values and Active Feature Acquisition
    2.4.1 Active Learning and Feature Acquisition
    2.4.2 Handling Missing Values
  2.5 Data Pre-processing
    2.5.1 Feature Selection
    2.5.2 Feature Transformation
  2.6 Transfer Learning

3 TABASCO: Sequential Feature Acquisition for Classifier Learning
  3.1 Introduction
  3.2 Related Work
  3.3 The TABASCO Algorithm
    3.3.1 Calculating the Set of Possible Values
    3.3.2 Calculating the Confidence Gain per Feature
  3.4 Experiments
    3.4.1 Datasets
      3.4.1.1 Clinical Dataset
      3.4.1.2 Public Datasets
    3.4.2 Methodology
  3.5 Results and Discussion
    3.5.1 Clinical Case Study
    3.5.2 Public Datasets
    3.5.3 Public Datasets with Introduced Missing Values
  3.6 Summary

4 Improved Feature Transformations for Classification Using Density Estimation
  4.1 Introduction
  4.2 Background and Preliminaries
    4.2.1 DLR
    4.2.2 Kernel Density Estimation (KDE)
  4.3 Robust Density Estimations
  4.4 Higher Order Transformations
  4.5 Transfer Learning
  4.6 Experiments and Analysis
    4.6.1 Datasets
    4.6.2 Experimental Set-up
    4.6.3 Size of the Density Estimation Dataset
    4.6.4 Higher Order Transformations
    4.6.5 Transfer Learning (LandMine Datasets)
  4.7 Summary

5 Data Mining Algorithms Predict Graft Failure Following Liver Transplantation
  5.1 Introduction
  5.2 Materials and Methods
    5.2.1 Study Cohort
    5.2.2 Dataset Collation
    5.2.3 Model Development
      5.2.3.1 Outcome Parameters
      5.2.3.2 Donor Risk Index
      5.2.3.3 DRI +/- MELD by Random Forest
      5.2.3.4 SOFT Score
      5.2.3.5 Statistical Analysis
  5.3 Results
    5.3.1 Dataset Characteristics
    5.3.2 Algorithm Performances
      5.3.2.1 DRI, SOFT Score and DRI +/- MELD by Random Forest Performance
  5.4 Discussion
  5.5 Summary

6 A Novel Data Mining Approach to Prediction of Streptococcus Pneumoniae Serotype
  6.1 Introduction
    6.1.1 Related Work
  6.2 Materials and Methods
    6.2.1 Datasets
    6.2.2 Features
    6.2.3 Random Forest
    6.2.4 Experimental Setup
    6.2.5 Leave-one-out Cross Validation
    6.2.6 PneumoCaT
    6.2.7 Merged Dataset
    6.2.8 Variable Importance
    6.2.9 Population Adaptation
    6.2.10 Confidence in Predictions
  6.3 Results
    6.3.1 PneumoCaT Results
    6.3.2 Leave-one-out Cross Validation
    6.3.3 Merged Dataset
    6.3.4 Variable Importance
    6.3.5 Population Adaptation
    6.3.6 Confidence in Predictions
  6.4 Discussion
  6.5 Summary

7 Conclusions
  7.1 Contributions of the Presented Work
  7.2 Future Directions

LIST OF FIGURES

Figure 1.1 Overview of the different topics discussed in this thesis
Figure 2.1 Different types of data available for clinical and biomedical data mining
Figure 2.2 Logistic function
Figure 2.3 A multilayer perceptron with two hidden layers
Figure 2.4 Classification in Support Vector Machines
Figure 2.5 A Decision Tree
Figure 2.6 Example of a random forest
Figure 2.7 Generating datasets for evaluation using bootstrap sampling
Figure 2.8 A ROC curve
Figure 2.9 A Precision-Recall curve
Figure 3.1 The process of feature acquisition during prediction time
Figure 3.2 Example random forest comprising three trees; the ellipses denote leaf nodes, and the path taken by the test instance is highlighted in red
Figure 3.3 Comparison of AUC-ROC values when requesting 5 features sequentially, with 2 features per test instance to begin with
Figure 3.4 Disagreement percentage between two methods for the first feature acquisition
Figure 3.5 Comparison of sequential feature acquisition of five features with respect to AUC-ROC (datasets with introduced missing values)
Figure 4.1 Dataset with two classes (light blue and dark blue colors)
Figure 4.2 Comparison of original feature 1 values and transformed feature 1 values; classes (light blue and dark blue) are better separated in b) than in a)
Figure 4.3 1D and 2D transformations, focusing on class (light blue and dark blue) separation
Figure 4.4 Full process of higher order transformations
Figure 4.5 Performance of the classifiers when the proportion of the density estimation dataset is changed
Figure 4.6 Class distributions of synthetic datasets according to the feature values; dark blue points belong to the positive class while light blue points belong to the negative class
Figure 4.7 Box-plots of AUC-ROC values
Figure 5.1 ROC curve comparison of different models created during the study
Figure 6.1 Distribution of serotypes within the four datasets
Figure 6.2 Comparison of isolates per serotype
Figure 6.3 Comparison of percentage distribution of the serotypes within the four datasets, such that the sum of percentages per dataset adds to 100
Figure 6.4 Comparison of the prediction performances of all the serotypes in leave-one-out cross validation
Figure 6.5 Accuracy distribution of the 1000 bootstrap samples
Figure 6.6 The variation of the percentage of SNPs within the cps gene cluster according to the number of top ranked features (based on the 1000 bootstrap samples)
Figure 6.7 Comparison of the accuracies when training on one or more datasets and evaluating on another
Figure 6.8 Comparison of the confidences of predictions for correct and incorrect predictions (based on the results of training on three datasets and evaluating on the remaining dataset)
Figure 6.9 Relationships between the accuracy, the confidence threshold and the number of isolates not typed (based on the results of training on the combination of Mass, UK1 and UK2 and evaluating on the Thai dataset)

LIST OF TABLES

Table 2.1 Confusion matrix of a binary classifier
Table 3.1 Public datasets used
Table 3.2 Comparison of AUC-ROC values when requesting 5 features sequentially, while having 2 features per test instance in the test set at the start
Table 3.3 Distribution of missing values of the features between instances belonging to the two classes
Table 3.4 Comparison of AUC-ROC values when requesting the first feature, while having 10% of the features per test instance in the test set
Table 3.5 Public datasets with introduced missing values
Table 3.6 Comparison of the classification performance of the two sets of datasets during the second feature acquisition
Table 4.1 Comparison of AUC-ROC values of models when using original features, 1D features and 2D features
Table 4.2 Comparison of AUC-ROC values of models when using original features, 1D features and 2D features
Table 4.3 Comparison of AUC-ROC values of models when using original features only and transformed features with original features
Table 4.4 p-values of Friedman post-hoc analysis when using NN and original as the base
Table 5.1 Summary of some donor and recipient characteristics included in the study
Table 5.2 Comparison of AUC-ROC values of different models created during the study
Table 6.1 Summary of the datasets before and after preprocessing
Table 6.2 Comparison of the accuracies of leave-one-out cross validation (LOO-CV) with PneumoCaT
Table 6.3 Accuracies of the experiments, when training on two datasets and testing on another
Table 6.4 Accuracies of the experiments, when training on three datasets and testing on another
Table 6.5 Accuracies of the experiments, when training on one dataset and testing on another

1 INTRODUCTION

Data mining is an area of computer science in which computers are taught to learn potentially useful information (for prediction, explanation or understanding) from existing data without explicit programming [Witten and Frank, 2005]. Data mining techniques have been applied successfully in numerous domains such as gaming, finance, education, power systems, healthcare, marketing and sales.

1.1 Motivation

Large amounts of complex and heterogeneous medical data are generated every day at hospitals and laboratories, and can be used to inform decision support systems that assist clinicians, hospital management and other related parties in their day-to-day activities. The aim of developing data mining applications in healthcare is to use the knowledge available in these clinical datasets to learn information that is valuable and actionable for clinicians, patients and other healthcare workers, and to support them when making important decisions [Bellazzi and Zupan, 2008]. Such models will save time, help to manage healthcare resources and ultimately provide better quality of care for patients.

Health-related data can exist in various forms, such as clinical data including demographic and diagnostic information of patients, text such as clinical notes, scans from medical imaging, and DNA sequence data such as read data and microarray data. One of the challenges in data mining using healthcare data is that these datasets are not collected with research as the primary focus, but generated via patient treatment. Some medical practices might not have all the clinical and diagnostic information available in electronic format, because some of it can be paper based, such as clinical notes and imaging results, especially for historical data [Yoo et al., 2012]. Moreover, data in clinical datasets may be incorrect due to errors that can happen during the process, such as data entry errors and misleading values caused by equipment errors [Bellazzi et al., 2011].

Features are the ingredients of data mining models, representing the structured attributes of data. Therefore, it is important that available features are used to their maximum potential to create optimal models. As discussed earlier, missing values in features are an inherent attribute of clinical data. However, in practice, when training or test instances contain missing labels or values, it might be possible to acquire them at a nominal cost, where the costs can be feature acquisition costs or time delays. Active learning and active feature acquisition investigate the problem of suggesting which labels or features to acquire.

Feature selection is an important aspect of data mining for health-related outcomes, since decision support systems with fewer features are attractive due to their transparency and ease of use, and because some data mining techniques are not designed to handle datasets with high dimensionality. Feature transformation is another important aspect of data preparation used to extract the maximum potential from the features, where the original feature space is converted to another feature space linearly or non-linearly. Non-linear transformations are useful when the independent variables are related to the dependent variable in a non-linear manner but the data mining technique used learns only linear relationships, whereas linear transformations are used to standardize or normalize the input feature space.
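To make the distinction concrete, the sketch below applies a linear transformation (z-score standardization) and a non-linear one (log-scaling) to a small made-up feature column; the function name and data are purely illustrative.

```python
import math

def standardize(values):
    """Linear transformation: shift and scale to zero mean and unit variance."""
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [(v - mean) / std for v in values]

# Made-up feature column, e.g. a laboratory measurement.
raw = [30.0, 45.0, 60.0, 75.0]

z = standardize(raw)                 # linear: standardized values
logged = [math.log(v) for v in raw]  # non-linear: compresses large values
```

The linear transformation preserves the ordering and relative spacing of the values, while the log transform changes their relative spacing, which is what allows a linear model on the transformed feature to capture a non-linear relationship.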

Data mining techniques can be categorized by the learning task as well as the characteristics of the available data, into classification, regression, clustering and association analysis. Classification is a supervised task where the objective is to classify data according to a set of known categories, whereas clustering is used to group data according to similarity. The primary focus of this thesis is supervised classification tasks on health-related data.

The data mining techniques used for learning tasks in healthcare should be able to handle the available datasets while providing acceptable levels of performance and transparency. Usability of the developed models will increase if the models are interpretable, allowing clinicians to validate the results based on their domain knowledge, which is a popular research area [Vellido et al., 2012; Rudin, 2014]. For these reasons, only some of the available data mining algorithms can be used with healthcare data [Liu et al., 2006 Sept; Cios and Moore, 2002; Iavindrasana et al., 2009]. Popular techniques used in this domain for classification include logistic regression, decision trees, artificial neural networks, support vector machines and the naive Bayes classifier [Bellazzi and Zupan, 2008; Yoo et al., 2012; Kononenko, 2001 Aug]. Harper used several classification algorithms, such as decision trees, neural networks and regression techniques, to predict useful outcomes using medical data. He compared the performance of the models developed and their practical usefulness, and concluded that the most suitable classification technique differs between applications, depending on the features available as well as the preferences of the end-users [Harper, 2005 Mar].

Despite all these issues and challenges of data mining applications on health-related data, the domain is clearly fruitful, because implementing a decision support system might mean saving a life. As Cios and Moore rightly said, “These potential rewards more than compensate for the many extraordinary difficulties along the pathway to success.” [Cios and Moore, 2002].

This thesis contributes to building better predictive models for health-related outcomes by proposing techniques for improving data, such as feature selection, transformation and acquisition. The techniques proposed are for active feature acquisition at prediction time and for improving density based non-linear feature transformations. Moreover, we discuss two applications of deriving predictive models. The first application is predicting graft failure following liver transplantation, and the second is predicting the serotypes of bacterial isolates of a species using DNA sequence data. Figure 1.1 gives an overview of how the different aspects discussed in this thesis are related.

The following section outlines the chapters presented in this thesis.


[Figure 1.1 is a flowchart: input data passes through feature selection and transformation, incoming data with missing values passes through active feature acquisition, and both feed the data mining model, which produces predictions; the steps are annotated with Chapters 3 to 6.]

Figure 1.1: Overview of the different topics discussed in this thesis.

1.2 thesis overview

Chapter 2 explores the fundamental concepts in data mining for predicting health-related outcomes that are relevant to this thesis. It introduces the challenges and current trends in this domain, as well as popular techniques for classification, feature handling, evaluation and performance measurement.

The next two chapters propose techniques for improving the data used when developing data mining models, through missing-feature acquisition and feature transformation.

In Chapter 3, the problem of active feature acquisition at prediction time is investigated. The presence of missing values in training and test instances is commonplace in data mining applications, while possessing more information about the data allows data mining models to provide more accurate predictions. Acquiring complete instances can be time consuming, prohibitively costly, or impossible [Melville et al., 2004]. In some settings, however, it is feasible to acquire additional features at a nominal cost. For example, in medical diagnostics clinicians can order laboratory tests for a patient that have not yet been performed. In this scenario, if the clinician orders only the laboratory tests that will assist in providing a better, more accurate diagnosis for the patient, rather than ordering all possible laboratory tests, then the diagnosis can be performed faster and more economically [Saar-Tsechansky et al., 2009].

In this chapter, we propose a novel feature acquisition method, TABASCO, for classification tasks at prediction time. TABASCO acquires the feature that maximizes the confidence of the prediction provided by the classifier. It improves the classification accuracy of test instances by requesting the most informative missing features sequentially, for one test instance at a time. We demonstrate that TABASCO significantly outperforms a number of baselines, using a real clinical case study. Using publicly available benchmark datasets, we show that our proposed method enjoys better performance when the datasets contain missing values in a class-dependent manner, motivated by the clinical setting where sicker patients get more tests.

Chapter 4 explores a non-linear feature transformation technique which has been successful with clinical datasets, and proposes an extension to the method. Density based logistic regression (DLR) is a non-linear classification technique where the original feature space is transformed using density estimations, and the transformed space is used to learn the classification model. When compared with other classification techniques such as support vector machines (SVM), logistic regression (LR) and kernel logistic regression, DLR shows higher accuracy with lower time complexity. However, the full potential of this method has not yet been exploited.

In this chapter, we try to answer some questions that arise when using DLR for classification. Based on the results of our experiments we propose the following, which are the contributions of this work: 1) a methodology to perform unbiased density estimations; 2) a novel extension to DLR [Chen et al., 2013a] for numerical features, which transforms two or more features into one feature based on multidimensional kernel density estimation (KDE) and appears more effective than DLR for some datasets; 3) transfer learning scenarios as possible applications of DLR and its extensions, demonstrated using publicly available transfer learning datasets. We also suggest that this transformation technique is capable of using some of the information that is discarded during pre-processing due to records with missing values, which are prevalent in health-related datasets. This could be achieved by using those records when calculating the density estimations.

The next two chapters present two applications of data mining techniques to health-related data. In Chapter 5 we present an application on liver transplant data, while Chapter 6 describes an application on whole genome sequencing data of bacterial isolates.

Liver transplantation is an option offered to patients suffering from chronic liver conditions, when their life expectancy is likely to be higher after the transplantation [Merion, 2004]. Outcomes following liver transplantation depend upon a complex interaction between donor, recipient and transplantation features. Driven by the disparity between the increasing number of potential transplant recipients and the limited number of suitable organ donors, there is increasing use of organs of marginal quality [Busuttil and Tanaka, 2003; Tector et al., 2006]. Therefore, the ability to predict graft survival at liver transplant decision time allows utilization of the scarce resource of donor livers, while ensuring that patients who are urgently in need of a liver transplant are prioritized.

By comparing data mining approaches with well-known indexes such as the donor risk index, the model for end-stage liver disease score and the survival outcomes following liver transplantation score, we demonstrate that, using donor, transplant and recipient characteristics known at the decision time of a transplant, data mining approaches can achieve high accuracy in matching donors and recipients, potentially providing better graft survival outcomes.

Chapter 6 focuses on evaluating a data mining approach for predicting the serotypes of a bacterial species using whole genome sequencing data, without using any prior knowledge about the gene cluster associated with the serotype.

Serotyping is a common bacterial typing process where isolates are grouped according to their distinctive surface structures called antigens. This grouping is important for public health and epidemiological purposes such as detecting outbreaks, identifying pathogenic variants and identifying occurrences of antibiotic resistant variants [Tenover et al., 1995; van Belkum et al.]. Nonetheless, this typing process is labour intensive, time consuming and requires expert knowledge [Ashton et al., 2016]. Meanwhile, whole genome sequencing is becoming cheaper and routinely available. Therefore, a data mining approach that can predict serotypes quickly and accurately from whole genome sequence data would be beneficial.

In this work, we demonstrate that random forests can predict the serotypes with accuracies above 88% within populations, using four different populations of Streptococcus pneumoniae. Moreover, we demonstrate that using samples from a few different populations when training the data mining models increases the generalization performance on unseen populations.

The last chapter (Chapter 7) summarizes the implications of the work presented in this thesis, its limitations and potential future directions.

In summary, this thesis concentrates on techniques for improving data, such as feature selection, transformation and acquisition, generically useful for building better prognostic models for predicting health-related outcomes. Furthermore, we present two applications on clinical and whole genome sequencing data for improving health-related outcomes.


2 BACKGROUND

This chapter provides an overview of existing data mining techniques and applications specific to clinical and biomedical data. Section 2.1 introduces the background on data mining in the clinical and biomedical domains, and discusses the challenges and current trends associated with data mining in these domains. Subsequent sections explore some of the popular data mining techniques used for developing predictive models in these domains and the standard evaluation and performance measurements used. The last two sections discuss data preparation and feature improvement strategies such as missing value handling, actively acquiring missing values, feature selection and feature transformation techniques. Finally, we introduce transfer learning, an important subarea of data mining providing useful insights for health-related data mining applications. In addition, each later chapter contains a section discussing the related work specific to that chapter.

2.1 data mining for predicting health-related outcomes

With the advancement of technology, modern hospitals, laboratories and even people are equipped with devices for monitoring, collecting and storing large volumes of complex, unstructured and heterogeneous health-related data, also known as big data [Costa, 2014; Kononenko, 2001 Aug; Reddy and Aggarwal, 2015; Mittelstadt and Floridi, 2016; Huddar et al., 2016]. Technological advancements in parallel computing, distributed big data infrastructure and cloud computing have enabled the storing and processing of this vast amount of data [Luo et al., 2016]. This data can be in various forms such as paper-based clinical notes, electronic medical records, images from medical image analysis, genomic and proteomic data known as omics data, and data collected from various devices. Pre-processing, integrating and analyzing these data collections provide challenges and opportunities specific to the types and volumes of data [Luo et al., 2016; Reddy and Aggarwal, 2015; Churches and Christen, 2004; Esfandiari et al., 2014]. Figure 2.1 exhibits the different types of data available in this domain. In this thesis, we focus on clinical and genomic data, as highlighted in yellow in the figure.

[Figure 2.1 shows five data types: textual clinical notes, clinical and hospital data, genomic data, medical imaging data, and data generated by personal devices and apps.]

Figure 2.1: Different types of data available for clinical and biomedical data mining

The conventional decision-making process in healthcare is based on information such as ground-truth knowledge, lessons learnt from past experience and scoring indexes built on a few known features. However, data mining techniques are capable of generating useful knowledge from the huge amount of complex, high-dimensional data sources available in this domain [Kaur and Wasan, 2006].

Studies focusing on predictive data mining using clinical datasets are popular in the literature. It has been shown that common laboratory test results can be used to predict ICU admission, Medical Emergency Team activation or death within 24 hours for patients in the emergency department [Loekito et al., 2013 Apr]. Furthermore, it has been shown that they can be used to predict death within 24 hours in ward patients [Loekito et al., 2013 Mar]. Predicting the length of stay of a patient at hospital has also been the topic of many studies [Li et al., 2013 Mar; Yang et al., 2010 Dec; Sigakis et al., 2013 Jun 13; Kollef et al., 2017], since such outcomes assist hospitals in the planning and management of beds, medical staff and elective admissions [Gustafson, 1968 Spring]. Data mining models have also been developed for the diagnosis of various medical conditions such as diabetes [Temurtas et al., 2009], ischaemic heart disease [Kukar et al., 1999] and cancer [Xu et al., 2016; Polat and Güneş, 2007; Zhang et al., 2017].

Predictive data mining using genomic and proteomic data has received a great deal of attention recently, especially from research in molecular biology [Bellazzi and Zupan, 2008; Bellazzi et al., 2011]. Single nucleotide polymorphisms (single-letter differences when aligned with a reference sequence [Szymczak et al., 2009]), kmers (fixed-length words generated from the read data [Zhang et al., 2003]), gene expression microarray data (generated by measuring the expression levels of thousands of genes simultaneously [Brazma and Vilo, 2000]) and protein expression data (generated by analyzing complex protein mixtures [Clarke et al., 2008]) are popular types of genomic data used in data mining approaches. Unsupervised learning techniques such as hierarchical clustering [Sørlie et al., 2001], K-means clustering [Gasch and Eisen, 2002] and self-organizing maps [Tamayo et al., 1999], as well as supervised learning techniques such as support vector machines [Brown et al., 2000] and random forests [Wang et al., 2009], have been used for data mining on genomic data [Brazma and Vilo, 2000].

An important characteristic of clinical and biomedical data is that most of it is generated through the treatment process of patients. When using sensitive data collected on humans for research, there are issues that must be dealt with due to the nature of the data, such as the ethical collection, usage and publishing of data and research outcomes, data ownership issues, reluctance to provide data for fear of lawsuits, and data confidentiality and security issues. Some of these privacy issues are dealt with by anonymizing or de-identifying the data or encrypting it, which is a popular research domain [Bayardo and Agrawal, 2005]. Moreover, most of these studies involving humans require approvals from the related institutional ethics committees [Cios and Moore, 2002; Costa, 2014; Mittelstadt and Floridi, 2016].

Humans are the most complex species that have ever existed, and every human being is different from one another [Fichman et al., 2011]. When developing decision support systems using medical data, one of the challenges is the non-generalizability of these learning models due to the differences between future patients and the study population [Lin, 2006]. It is important to personalize the delivery of patient care to individuals by using information about the particular patient, as well as similar patients from the large historical datasets available. In current practice, clinicians perform various kinds of tests for clinical diagnosis. Data mining techniques can help tailor this treatment process by discovering knowledge from millions of similar patients, delivering personalized patient care [Chawla and Davis, 2013].

Having domain knowledge about the clinical problem being addressed by data mining techniques is an important advantage for successful application development [Cao and Zhang, 2007]. Such knowledge is useful in various aspects of the data mining process, such as data interpretation and understanding, feature selection, selecting data mining techniques, developing the learning models, and interpreting the results [Bellazzi et al., 2011].

With the advancement of relevant technologies, there is plenty of complex, non-generalizable information available to health professionals during their decision-making process, which makes decision making more and more challenging. Data mining techniques have the potential to assist them in handling this information when making these critical decisions, by delivering fast, accurate and individualized predictions [Lin, 2006]. Despite all the advantages that these data-driven techniques can provide, the clinical community is still hesitant to accept data mining applications in practice [Fichman et al., 2011]. There can be a few reasons for this, such as a lack of trust in these systems, the undesirable effect of complicating already complex clinical jobs by adding another set of results and tools to consider, and the fear of some jobs being replaced by data mining systems. It is important to understand that even though these systems provide assurance or assistance when taking complex medical decisions, the decision power will and should remain with the person who is making the decision [Kononenko, 2001 Aug].

However, as Cios and Moore said, “For an appropriately formulated medical question, finding an answer could mean extending a life, or giving comfort to an ill person. These potential rewards more than compensate for the many extraordinary difficulties along the pathway to success.” [Cios and Moore, 2002].

2.2 classification techniques

Data mining techniques have rapidly evolved over recent years, with several suitable techniques often available for a task, producing broadly similar results. The techniques available for a particular task differ from one another based on 1) how they handle noisy and missing data, and different types of features such as numerical, ordinal and categorical data, 2) the dimensionality of data that can be handled efficiently, 3) the presentation and transparency of the resulting models, 4) the transparency of the decisions, and 5) the computational costs of generating and using the data mining models [Bellazzi and Zupan, 2008].

A data mining method used successfully for the development of a clinical decision support system will have the following attributes: good predictive performance, the ability to handle missing and noisy data, comprehensibility of the learning process and transparency of the decisions [Kononenko, 2001 Aug; Harper, 2005 Mar; Bellazzi and Zupan, 2008].

Some of the popular classification techniques used for data mining on health-related data are logistic regression, naive Bayes, support vector machines, decision trees, neural networks and random forests [Esfandiari et al., 2014]. The basic concepts of these techniques are summarized below.


2.2.1 Logistic Regression

Logistic regression is a well-established classification technique belonging to the family of generalized linear models, which mathematically models the relationship between the independent input variables and a binary response variable. The response variable is bound between 0 and 1 by using the logistic function (2.1) in modelling. As seen in Figure 2.2, the logistic function stays between 0 and 1 while taking any real value [Kleinbaum and Klein, 2011].

[Plot of logistic(q) for q from −10 to 10, rising from 0 to 1.]

Figure 2.2: Logistic function

logistic(q) = 1 / (1 + exp(−q))    (2.1)

The logistic regression conditional likelihood model is formally defined in Equation 2.2, where the probability of the binary response Y being 1, given the input variables X1, X2, ..., Xn, is modelled as the logistic function of a linear combination of the input variables, with constant coefficients a0, a1, ..., an.

P(Y = 1 | X1, X2, ..., Xn) = 1 / (1 + exp(−(a0 + Σ_{i=1}^{n} ai Xi)))    (2.2)

Logistic regression has been widely used in data mining applications for predicting health-related outcomes [Delen et al., 2005; Loekito et al., 2013 Apr, Mar; Austin et al., 2013]. It is attractive for clinical studies, since it estimates the probabilities of classes directly and the output model is transparent and easy to understand. Furthermore, the coefficients of the output model can provide insights into the relative importance of the input variables [Kleinbaum and Klein, 2011].
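As a sketch of Equations (2.1) and (2.2), the following computes the modelled probability for one test instance; the coefficient and input values are made up purely for illustration.

```python
import math

def logistic(q):
    # Equation (2.1): squashes any real value into (0, 1).
    return 1.0 / (1.0 + math.exp(-q))

def predict_proba(x, a0, a):
    # Equation (2.2): P(Y = 1 | x) for intercept a0 and coefficients a.
    return logistic(a0 + sum(ai * xi for ai, xi in zip(a, x)))

# Illustrative, made-up coefficients for two input variables.
p = predict_proba([1.2, 0.5], a0=-1.0, a=[0.8, 1.5])
```

Because the coefficients enter linearly before the logistic squashing, each ai can be read as the change in the log-odds of the outcome per unit change in Xi, which is the source of the interpretability noted above.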

2.2.2 Artificial Neural Networks and Deep Learning

Inspired by the human nervous system, artificial neural networks attempt to model the relationship between input and output variables using a collection of neurons called perceptrons. Each of these neurons receives input from other neurons, performs some processing and sends the results on to other neurons or terminal nodes. The basic idea of single-layered perceptrons was introduced by Rosenblatt [Rosenblatt, 1961], and the works of Hopfield [Hopfield, 1982, 1984] and Rumelhart et al. [Rumelhart et al., 1985] introduced multilayered neural networks.

Multilayer perceptrons model complex relationships using multiple hidden layers. Each neuron (node) in a layer is connected to all the neurons in the next layer, where each connection has a specific weight. The learning process of an artificial neural network involves learning these weights from training data. An example with two hidden layers is given in Figure 2.3.
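The forward pass through such a network can be sketched as below; the weights, biases and inputs are made-up numbers, and each layer applies the logistic activation for simplicity.

```python
import math

def dense_layer(inputs, weights, biases):
    """One fully connected layer with a logistic (sigmoid) activation.

    weights[j][i] is the weight of the connection from input i to neuron j.
    """
    return [1.0 / (1.0 + math.exp(-(b + sum(w * x for w, x in zip(row, inputs)))))
            for row, b in zip(weights, biases)]

# Made-up network: 2 inputs -> 2 hidden units -> 2 hidden units -> 1 output.
x = [0.5, -1.0]
h1 = dense_layer(x,  [[0.1, 0.4], [-0.3, 0.2]], [0.0, 0.1])
h2 = dense_layer(h1, [[0.6, -0.2], [0.3, 0.3]], [0.1, -0.1])
y = dense_layer(h2, [[0.7, -0.5]], [0.2])
```

Training amounts to adjusting every weight and bias above (typically by backpropagation of a loss gradient) rather than computing them by hand as done here.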

[Figure 2.3 shows an input layer, two hidden layers and an output layer, with every node connected to all nodes in the next layer.]

Figure 2.3: A multilayer perceptron with two hidden layers

Artificial neural networks have been a popular technique in data mining applications for health-related outcomes [Baxt, 1991; Burke, 1994; Amato et al., 2013]. However, it is well known in the data mining literature that artificial neural networks are prone to overfitting and learning noise in data, resulting in unstable models with poor generalization ability [Cheng and Titterington, 1994; Gardner and Dorling, 1998; Adya and Collopy, 1998; Zhang, 2000]. Moreover, they are often considered to be black-box models because of their non-transparent classification models [Kononenko, 2001 Aug].

In recent times, deep learning has become very popular in medical data analysis, especially for medical image analysis and natural language processing [Zheng et al., 2013; He et al., 2016]. Deep learning can be described as neural networks consisting of many layers, where the layers can be trained efficiently thanks to advancements in the related technologies. These different layers can be trained to learn specific abstractions, so that the learning model can be designed hierarchically from these abstractions. Deep learning has been successful in providing excellent results on complex, high dimensional data, at the expense of high computational power requirements [LeCun et al., 2015]. These large models also require large datasets (in addition to approaches to regularization) to avoid overfitting; however, in many healthcare problems data is scarce.

2.2.3 Support Vector Machines (SVM)

Support vector machines, introduced by Vladimir Vapnik and colleagues [Boser et al., 1992; Vapnik, 1995] and designed for binary classification, operate by finding a hyperplane with the maximum ‘margin’ distance that separates the instances belonging to different classes. The instances closest to the hyperplane are called support vectors, and the hyperplane with the maximum distance or margin between these support vectors is chosen, which provides a linear classifier. Figure 2.4 illustrates these fundamental ideas behind support vector machines. However, support vector machines can also be used with non-linear kernel functions such as polynomial, radial basis and sigmoid kernels, where the original feature space is transformed to a higher dimensional space whose inner product corresponds to the kernel, on which a linear classifier as described above is learnt [Hearst et al., 1998].

Figure 2.4: Classification in Support Vector Machines

SVMs are popular for their good predictive accuracy and have been used extensively in data mining applications for predicting health-related outcomes [Guyon et al., 2002; Akay, 2009; Statnikov et al., 2008; Cheng et al., 2006]. However, they can be very slow to train, needing extensive computational power [Yoo et al., 2012]. Moreover, the model is not transparent when kernels other than the linear kernel are used [Bellazzi and Zupan, 2008; Hearst et al., 1998].
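For illustration, the kernel functions mentioned above can be written directly as inner products in the transformed space; the parameter values below (degree, c, gamma) are arbitrary examples, not recommendations.

```python
import math

def linear_kernel(u, v):
    # Plain inner product: corresponds to no transformation at all.
    return sum(a * b for a, b in zip(u, v))

def polynomial_kernel(u, v, degree=3, c=1.0):
    # Inner product in a space of all monomials up to `degree`.
    return (linear_kernel(u, v) + c) ** degree

def rbf_kernel(u, v, gamma=0.5):
    # Radial basis function: similarity decays with squared distance.
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))

u, v = [1.0, 2.0], [2.0, 0.5]
```

An SVM never computes the high-dimensional transformation explicitly; it only evaluates such kernel functions between pairs of instances, which is why non-linear decision boundaries come at modest extra cost but at the price of a less transparent model.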


2.2.4 Naive Bayes

The naive Bayes classifier is a simple probabilistic classification technique based on Bayes’ theorem. Given two events A and B, the posterior probability of A given B is defined by Bayes’ rule in Equation 2.3, where P(A) and P(B) are the prior probabilities of the events and P(B|A) is the conditional probability of B given A.

P(A|B) = P(B|A) P(A) / P(B)    (2.3)

If we consider a binary classification problem with response Y ∈ {0, 1} and input variables X1, X2, ..., Xn, the probability of the class Y1 can be defined as the posterior probability of Y1 given X1, X2, ..., Xn, P(Y1 | X1, X2, ..., Xn). In the naive Bayes classifier, by “naively” assuming all the attributes to be conditionally independent from one another given the class, this probability is calculated by taking the product of the individual conditional probabilities of each attribute Xi given class Y1, as in Equation 2.4.

P(Y1 | X1, X2, ..., Xn) = ∏_{i=1}^{n} P(Xi | Y1) P(Y1) / P(Xi)    (2.4)

This is a simple but powerful technique. Moreover, naive Bayes models are fast, hyperparameter-free and handle missing values naturally, which makes them a popular technique for data mining in healthcare domains [Yousef et al., 2006; Leroy et al., 2008; Anbarasi et al., 2010; Pattekari and Parveen, 2012; Palaniappan and Awang, 2008]. However, a major drawback of this technique is the assumption that all attributes are conditionally independent from one another, since in most real-world datasets, and especially in clinical datasets, some features are known to be correlated [Yoo et al., 2012].
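A minimal sketch of a naive Bayes prediction: instead of dividing by the evidence terms as written in Equation 2.4, it normalizes the prior-times-likelihood products over the classes, which yields the same posterior. The class names and probability tables below are made up for illustration.

```python
def naive_bayes_posterior(x, priors, cond):
    """Posterior over classes for instance x, assuming attribute independence.

    priors[y] is P(y); cond[y][i][v] is P(X_i = v | y).
    """
    scores = {}
    for y, prior in priors.items():
        score = prior
        for i, xi in enumerate(x):
            score *= cond[y][i][xi]   # multiply the per-attribute likelihoods
        scores[y] = score
    total = sum(scores.values())      # normalize over the classes
    return {y: s / total for y, s in scores.items()}

# Toy made-up model: two binary symptoms, classes 'sick' / 'healthy'.
priors = {"sick": 0.2, "healthy": 0.8}
cond = {
    "sick":    [{1: 0.9, 0: 0.1}, {1: 0.7, 0: 0.3}],
    "healthy": [{1: 0.2, 0: 0.8}, {1: 0.1, 0: 0.9}],
}
post = naive_bayes_posterior([1, 1], priors, cond)
```

Note how a patient showing both symptoms overturns the low prior for the 'sick' class, illustrating why the method is powerful despite its simplicity.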


2.2.5 Decision Trees

A decision tree is built by generating recursive splits such that the variance of the dependent variable within the resulting subgroups is reduced. At each node of the tree, the feature that produces the purest split amongst all the features is selected for producing the split. Tree building stops when the nodes are pure (belong to one class), when a predefined threshold on the amount of variance has been reached, or when the predefined number of levels within the tree has been reached. The terminal nodes are called leaves. Figure 2.5 shows the structure of a decision tree.

[Figure 2.5 shows a tree with a root node, internal nodes and terminal nodes (leaves).]

Figure 2.5: A Decision Tree

Decision trees are popular for being simple, transparent and non-parametric. Due to these qualities, decision trees have been commonly used in data mining applications for health-related outcomes [Delen et al., 2005; Anbarasi et al., 2010; Olanow and Koller, 1998]. However, they are known to be sensitive to errors and inconsistencies in data [Esfandiari et al., 2014]. Most of the current implementations of decision trees are variants of the C4.5 [Quinlan, 1993] and CART [Breiman et al., 1984] decision trees.
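The purity-based splitting described above can be sketched with the Gini impurity measure (the criterion used by CART); the helper functions and the toy data are illustrative only.

```python
def gini(labels):
    """Gini impurity: 0 for a pure node, higher for mixed nodes."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def split_impurity(values, labels, threshold):
    """Weighted impurity of splitting a numeric feature at `threshold`."""
    left = [l for v, l in zip(values, labels) if v <= threshold]
    right = [l for v, l in zip(values, labels) if v > threshold]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)
```

A tree builder would evaluate `split_impurity` at candidate thresholds for every feature and pick the split with the lowest value, recursing until a stopping condition from the paragraph above is met.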


2.2.6 Random Forests

The random forest data mining algorithm, described by Breiman in 2001 [Breiman, 2001], can be explained as an ensemble of unpruned decision trees, each of which is built using a sample training dataset (a bootstrap sample), with a random subset of input variables considered at each node for splitting. Predictions of the random forest algorithm are determined by majority voting over all the trees [Liaw and Wiener, 2002]. An example of a random forest is given in Figure 2.6.

Figure 2.6: Example of a random forest (the data is resampled into bootstrap samples, one tree is built per sample, and predictions are combined by voting)

Three of the main parameters that can be tuned when training a random forest are the number of trees in the forest, the number of randomly selected features considered at each node, and the minimum node size [Liaw and Wiener, 2002]. The general consensus in parameter tuning is that it is best to have a large number of trees; the square root of the number of available features is a common rule of thumb for the number of features per node; and, by default, the trees in the forest are grown unpruned for classification tasks [Wright and Ziegler, 2015].
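The bootstrap-and-vote scheme just described can be sketched in miniature. The toy code below is an assumed illustration, not the thesis's implementation: each "tree" is a one-level stump fit on a bootstrap sample using one randomly chosen feature, and predictions are combined by majority vote.

```python
import random

# Miniature random forest sketch (illustrative): bootstrap sampling, a random
# feature per stump, and majority voting over the ensemble.

def fit_stump(rows, f):
    """One-level tree: pick the threshold on feature f with fewest training errors."""
    maj = lambda ys: max(set(ys), key=ys.count) if ys else 0
    best = None
    for t in sorted({x[f] for x, _ in rows}):
        left = [y for x, y in rows if x[f] <= t]
        right = [y for x, y in rows if x[f] > t]
        lc, rc = maj(left), maj(right)
        err = sum(y != lc for y in left) + sum(y != rc for y in right)
        if best is None or err < best[0]:
            best = (err, t, lc, rc)
    return (f,) + best[1:]

def fit_forest(data, n_trees, rng):
    forest = []
    for _ in range(n_trees):
        boot = [rng.choice(data) for _ in data]        # bootstrap sample
        if len({y for _, y in boot}) < 2:              # guard a degenerate sample
            boot = list(data)
        f = rng.randrange(len(data[0][0]))             # random feature choice
        forest.append(fit_stump(boot, f))
    return forest

def predict(forest, x):
    votes = [lc if x[f] <= t else rc for f, t, lc, rc in forest]
    return max(set(votes), key=votes.count)            # majority voting

data = [([i, 2 * i], int(i >= 5)) for i in range(10)]  # both features informative
forest = fit_forest(data, 25, random.Random(0))
```

A real random forest grows deep trees and re-draws the feature subset at every node; the stump version only illustrates the ensemble mechanics.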


Random forests handle features with missing values naturally and, because of the way they work, can handle many features [Hapfelmeier et al., 2014], which makes random forests a good technique for modelling high-dimensional medical data. Due to these desirable qualities, random forests have been well received as a data mining technique for predicting health-related outcomes [Khalilia et al., 2011; Gray et al., 2012; Fan et al., 2011; Xu et al., 2011]. The feature importance measures available with random forests indicate the importance of each variable, thereby improving the transparency of the algorithm [Cutler et al., 2007]. One popular method for calculating the importance of each feature is to observe the differences in prediction accuracy when the values of that feature are permuted across the out-of-bag samples of the random forest [Liaw and Wiener, 2002].
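The permutation idea can be shown on its own, outside the out-of-bag machinery. The sketch below (names and the toy model are illustrative assumptions) shuffles one feature column and measures how much the accuracy of a fixed model drops; informative features produce large drops, ignored ones produce none.

```python
import random

# Sketch of permutation-based feature importance: shuffle one feature column
# and measure the drop in a fixed model's accuracy.

def accuracy(model, X, y):
    return sum(model(x) == target for x, target in zip(X, y)) / len(y)

def permutation_importance(model, X, y, feature, rng):
    base = accuracy(model, X, y)
    col = [x[feature] for x in X]
    rng.shuffle(col)                                   # permute one column
    X_perm = [x[:feature] + [v] + x[feature + 1:] for x, v in zip(X, col)]
    return base - accuracy(model, X_perm, y)           # importance = accuracy drop

rng = random.Random(1)
X = [[i, rng.random()] for i in range(20)]   # feature 0 informative, feature 1 noise
y = [int(i >= 10) for i in range(20)]
model = lambda x: int(x[0] >= 10)            # this toy model only looks at feature 0

drop_informative = permutation_importance(model, X, y, 0, random.Random(2))
drop_noise = permutation_importance(model, X, y, 1, random.Random(2))
```

Since the toy model never reads feature 1, permuting that column leaves accuracy unchanged, while permuting feature 0 degrades it.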

Time complexity of an algorithm estimates the time taken to execute it. The time complexity of building a decision tree is O(mn log n) for n instances and m attributes [Witten and Frank, 2005]. Therefore, the time complexity of a random forest with z trees is O(zmn log n).

2.3 evaluation and performance measurement

When developing a predictive model, it is important to validate the results using datasets not seen at the model development stage, which provides an estimate of the expected performance. The standard practice is to use internal and external validation data for this purpose. Internal validation tests the models using data from the same population, whereas external validation tests the models using data from a population different to the study population [Lin, 2006].

2.3.1 Evaluation Techniques

When evaluating the performance of data mining models, especially in internal validation, the intuitive approach would be to split the dataset into training and test sets. However,


this approach has the disadvantage of not using the whole dataset for training the models, and the performance estimate is biased by the splitting criteria.

There are more sophisticated methods available for evaluation. A popular technique is k-fold cross validation, where the training dataset is divided into k folds, and k tests are performed, leaving out one fold at a time for testing while the other (k-1) folds are combined for training the predictive models. 10-fold cross validation is widely accepted in the literature as a fair evaluation technique. An extreme case of k-fold cross validation is known as leave-one-out cross validation, where each record in the dataset is left out in turn for testing while all the other records are used for training the model [Bellazzi and Zupan, 2008; Steyerberg et al., 2001].
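The k-fold splitting scheme can be sketched as follows; this is an illustrative minimal version, not a library API. Record indices are dealt into k folds, and each fold serves once as the test set while the remaining k-1 folds form the training set.

```python
# Minimal k-fold index splitting sketch.

def k_fold_indices(n, k):
    """Partition indices 0..n-1 into k folds; yield (train, test) index pairs."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, test

splits = list(k_fold_indices(10, 5))   # 5-fold split of 10 records
```

Each record appears in exactly one test fold, so over the k rounds every record is used once for testing and k-1 times for training.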

Another popular evaluation technique is bootstrap sampling with replacement, where a number of records equal to the size of the original dataset are drawn from it, randomly with replacement, to create a sample training set. It has been shown in the literature that such a bootstrap sample will contain about 63% of the unique cases in the original dataset. The remaining records, not included in the training set, are combined to form the corresponding test set. This procedure is repeated several times, say x times, to obtain x pairs of bootstrap training and test sets [Breiman, 1996], as illustrated in Figure 2.7.
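The ~63% figure can be checked numerically: a bootstrap sample of size n drawn with replacement contains on average 1 - (1 - 1/n)^n ≈ 1 - 1/e ≈ 63.2% of the distinct original records, and the rest become the out-of-bag test set. A quick sketch:

```python
import random

# Numeric check of the ~63% unique-records figure for bootstrap sampling.
rng = random.Random(42)
n = 10_000
sample = [rng.randrange(n) for _ in range(n)]   # bootstrap sample of record indices
unique_fraction = len(set(sample)) / n          # expected to be close to 0.632
out_of_bag = set(range(n)) - set(sample)        # records left out, used for testing
```

For n this large the fraction concentrates tightly around 0.632, matching the figure quoted above.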

2.3.2 Performance Measurement

Well-accepted methods for quantifying the performance of classifiers are important when comparing different techniques for developing predictive models. Moreover, these measures are used to decide whether a particular learning model is satisfactory for the predictive task. There are a few standard performance measurements used in the literature for measuring the predictive ability of classifiers in the clinical and biomedical domain, such as sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), area under the receiver operating characteristic curve (AUC-ROC), precision-recall curve and classification accuracy [Ray et al., 2010; Steyerberg et al., 2010]. Consider the confusion matrix for a binary classifier given in Table 2.1, which is used to explain the performance measurements mentioned above.


Figure 2.7: Generating datasets for evaluation using bootstrap sampling (bootstrap samples of the original data for training; the remaining records for testing)

Table 2.1: Confusion matrix of a binary classifier

                    Predicted positive     Predicted negative
Actual positive     True Positive (TP)     False Negative (FN)
Actual negative     False Positive (FP)    True Negative (TN)

2.3.2.1 Sensitivity (Recall)

This is the proportion of positive records correctly classified as positive.

Sensitivity = TP / (TP + FN)    (2.5)

2.3.2.2 Specificity

The proportion of negative records correctly classified as negative is known as specificity.

Specificity = TN / (TN + FP)    (2.6)


2.3.2.3 Classification Accuracy

The proportion of correctly classified records out of all records is known as the accuracy of a classifier, which is adequate for measuring the performance of a classifier when the classes are evenly represented in the population. However, when the classes are imbalanced, which is the reality in most applications, classification accuracy is biased and does not represent the true discriminative power of the classifier [Provost and Fawcett, 1997].

Accuracy = (TP + TN) / (TP + FP + TN + FN)    (2.7)

2.3.2.4 Positive Predictive Value (Precision)

This is the proportion of true positives from all the records classified as positives.

Precision = TP / (TP + FP)    (2.8)

2.3.2.5 Negative Predictive Value

The proportion of true negatives out of all the records classified as negative is known as the negative predictive value.

Negative Predictive Value = TN / (TN + FN)    (2.9)

2.3.2.6 Area Under the Receiver Operating Characteristic Curve

The receiver operating characteristic (ROC) curve exhibits how the positive and negative records are ranked according to the predictions provided by a classifier, across the entire range of cut-off thresholds. An example of a ROC curve is given in Figure 2.8; such a curve can be used to determine a suitable cut-off threshold when using the classifier.


Figure 2.8: A ROC curve (sensitivity against 1 − specificity)

The area under the ROC curve (AUC-ROC) is a measure of the discriminative ability of the model, which is well suited to evaluating applications with class imbalance [Ray et al., 2010; Bradley, 1997; Steyerberg et al., 2010]. AUC-ROC values vary from 0 to 1, where > 0.9 is considered excellent discrimination, > 0.75 is considered good discrimination and 0.5 is equivalent to random guessing [Ray et al., 2010].
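AUC-ROC can equivalently be read as the probability that a randomly chosen positive record is ranked above a randomly chosen negative one. A direct pairwise computation of that quantity (an illustrative sketch; ties counted as 0.5):

```python
# AUC-ROC as the fraction of (positive, negative) pairs ranked correctly.

def auc_roc(pos_scores, neg_scores):
    pairs = [(p, n) for p in pos_scores for n in neg_scores]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p, n in pairs)
    return wins / len(pairs)

perfect = auc_roc([0.9, 0.8], [0.4, 0.3])   # complete separation
partial = auc_roc([0.8, 0.3], [0.5, 0.1])   # one discordant pair out of four
```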

2.3.2.7 Precision-Recall Curve

Recent studies [He and Garcia, 2009; Saito and Rehmsmeier, 2015] have shown that a ROC curve can be over-optimistic about classification performance in the presence of highly imbalanced data. In such scenarios, precision-recall curves are known to provide a better estimate of the discriminative ability of the classifier. The precision-recall value pairs are generated using a similar approach to the ROC curve. An example of a precision-recall curve is given in Figure 2.9.
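The generation of those value pairs can be sketched by sweeping the cut-off threshold over the classifier's scores (illustrative names, not a library API):

```python
# Precision-recall pairs obtained by sweeping the decision threshold.

def pr_points(scores, labels):
    points = []
    for t in sorted(set(scores), reverse=True):
        pred = [s >= t for s in scores]
        tp = sum(p and y for p, y in zip(pred, labels))
        fp = sum(p and not y for p, y in zip(pred, labels))
        fn = sum(not p and y for p, y in zip(pred, labels))
        points.append((tp / (tp + fp), tp / (tp + fn)))   # (precision, recall)
    return points

points = pr_points([0.9, 0.7, 0.4], [1, 0, 1])
```

At the strictest threshold precision is high and recall low; lowering the threshold trades precision for recall until every record is predicted positive.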

Sensitivity, specificity and classification accuracy have been widely used for measuring the performance of classifiers on clinical and biomedical datasets. These simple measurements are easy to grasp while providing the level of confidence that can be


Figure 2.9: A precision-recall curve (precision against recall)

placed on the predictions [Wolfe et al., 1990; Cios and Moore, 2002; Lin, 2006]. AUC-ROC is one of the most common performance measurements in this domain, providing the overall discriminative ability of the classifier. Moreover, AUC-ROC values vary from 0 to 1, making them easy to comprehend [Ray et al., 2010].

2.4 missing values and active feature acquisition

As discussed before, the presence of missing values in data is a major challenge when developing and applying data mining techniques to clinical and biomedical data. These missing values may exist due to reasons such as data entry errors and noise. However, in clinical data, missing values may also occur due to the nature of the domain: the data is simply not available. For example, certain laboratory test results may not be available in some patients' records because the tests were not performed. Furthermore, missing values can occur in blocks, especially in clinical datasets, where certain laboratory tests are usually ordered in blocks.


2.4.1 Active Learning and Feature Acquisition

In some situations, it might be possible to acquire more labels or features at a cost. While possessing more information about the data facilitates data mining models that provide more accurate predictions, acquiring complete instances can be time consuming, prohibitively costly, or impossible [Melville et al., 2004]. The techniques studied in active learning attempt to acquire training labels for the few instances with maximum benefit, i.e. those training labels providing the highest expected predictive accuracy on the test instances [Settles, 2009]. Label acquisition techniques for improving data mining models have been applied in clinical and biomedical domains [Hoi et al., 2006; Chen et al., 2013b]. A complementary problem setting is when missing features can be obtained during prediction time, the focus of Chapter 3.

2.4.2 Handling Missing Values

There are several techniques for handling missing values. One simple approach is to use only the records without missing values, known as complete case analysis [Little et al., 2012]. However, this approach is suitable only when the number of records with missing values is relatively low. Furthermore, considering only the complete cases can bias the analysis [Luengo et al., 2012].

Imputing, or replacing the missing values in the records with "suitable values", is a popular method for handling missing values. Various techniques are used for deciding these "suitable values", such as mean imputation, expectation maximization imputation, k-nearest neighbour imputation and multiple imputation [Graham, 2009; Wood et al., 2004]. Another approach is to treat missing values as special values, which is especially suitable when the missingness has a meaning [Allison, 2012].
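The simplest of these, mean imputation, can be sketched in a few lines of plain Python (illustrative; here None marks a missing value):

```python
# Mean imputation sketch: replace each missing value with its column mean.

def mean_impute(rows):
    cols = list(zip(*rows))
    means = [sum(v for v in c if v is not None) / sum(v is not None for v in c)
             for c in cols]
    return [[m if v is None else v for v, m in zip(row, means)] for row in rows]

data = [[1.0, None], [3.0, 4.0], [None, 8.0]]
imputed = mean_impute(data)
```

Mean imputation is simple but shrinks the variance of the imputed feature, which is one reason the more elaborate techniques above are often preferred.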

Some data mining techniques, such as random forests, decision trees and Naive Bayes, can handle missing values naturally, without the need to impute or remove missing values before modelling [Graham, 2009].


2.5 data pre-processing

Data pre-processing is the first step of the knowledge discovery process, and refers to data cleaning, preparation, integration and summarization. This step is known to take the bulk of the time allocated for a data mining application [Esfandiari et al., 2014]. Feature selection and feature transformation are popular techniques employed during the pre-processing stage of the data mining process [Liu and Motoda, 1998; Kotsiantis et al., 2006].

2.5.1 Feature Selection

As discussed earlier, the clinical and biomedical datasets available for a prediction task can be high dimensional. However, when building predictive models, irrelevant features may lead to poorer classifier performance while consuming valuable computational power. Some techniques available for developing prediction models are not able to handle high-dimensional data efficiently [Saeys et al., 2007]. Furthermore, simple models built using a few features are more transparent, faster and easier to use than models built using high-dimensional data [Ma and Huang, 2008].

Selecting the features that are most related to an outcome of interest is a highly investigated problem in the clinical and biomedical domain [Haury et al., 2011]. The task of feature selection is to find the best subset of features from the input feature space, removing noisy and redundant features. In classification problems, this means selecting the features based on their discriminative ability [Tang et al., 2014].

Supervised feature selection approaches can be roughly categorized into three groups: the filter approach, the wrapper approach and the embedded approach. In the wrapper approach, the classifier is wrapped within the feature selection process. In the filter approach, the feature selection process and the classification process are separate. Approaches where feature selection is embedded in classification are called embedded approaches [Ma and Huang, 2008].

The filter approach works by ranking the features according to some function based on distance, correlation or information content, and choosing the features according


to their ranking. Information gain, Relief, Fisher score and correlation are some of the scores used in filter-based feature selection [Tang et al., 2014; Saeys et al., 2007]. Information gain is a popular method for feature ranking, which has been used for decades in choosing the features to split on in decision trees and in measuring the ability of features to generate pure splits [Raileanu and Stoffel, 2004]. It is based on the entropy of the samples, which measures the level of impurity in a group of examples, as defined in Equation 2.10, where pi is the probability of class i. Information gain is the change in entropy due to an operation performed on a given dataset, such as splitting on a variable.

H = − Σ (i = 1 .. k) pi log2(pi)    (2.10)

Relief [Kira and Rendell, 1992] is another well-known feature ranking method, where the importance of a feature is determined by its ability to distinguish nearby instances belonging to different classes. The extended version, ReliefF, is known to work well with noisy and missing data [Kononenko, 1994].

A problem with filter-based feature selection is that the feature selection is completely separate from the classification process. The wrapper approach utilizes the predictive performance of a classification algorithm for feature selection. At a high level, this method selects subsets of features and evaluates them to select the best-performing subset. Different methods are used for selecting the subsets of features, such as best first, sequential forward selection, sequential backward elimination, randomized hill climbing and genetic algorithms [Tang et al., 2014; Saeys et al., 2007]. Even though wrapper methods can select features accurately, they are computationally expensive for high-dimensional data.

Decision trees and penalized regression are embedded feature selection methods which are popular with biomedical data [Saeys et al., 2007; Ma and Huang, 2008]. The Lasso (least absolute shrinkage and selection operator) penalty, the Bridge penalty and the Elastic net are popular penalty functions used for penalized regression [Ma and Huang, 2008; Haury et al., 2011; Ghosh and Chinnaiyan, 2005; Tang et al., 2014].


2.5.2 Feature Transformation

Feature transformation is a well-known technique used to create a new set of features based on the original features. Such techniques can be either supervised or unsupervised. Examples of unsupervised transformations include mapping a feature x to x^k, log x, sin x or 1/x, or normalizing x into the range [0, 1]. An example of a supervised transformation is the discretization of a continuous feature into a discrete feature using the class labels.
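Two of the unsupervised transformations just mentioned can be sketched in plain Python (illustrative, not a library API): a log transform and min-max normalization into [0, 1].

```python
import math

# Two unsupervised feature transformations: log transform and min-max scaling.

def log_transform(xs):
    return [math.log(x) for x in xs]

def min_max(xs):
    """Normalize values into the range [0, 1]."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

scaled = min_max([2.0, 4.0, 6.0, 10.0])
```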

Feature transformation is advantageous since it can stabilize variance in data, model non-linearity, place features on equal scales or construct feature spaces with better separation between classes [Kusiak, 2001]. Some popular classification techniques for healthcare data, such as artificial neural networks [Baxt, 1991; Burke, 1994] and support vector machines [Guyon et al., 2002; Akay, 2009], use feature transformation internally [Hinton and Salakhutdinov, 2006; Hearst et al., 1998]. Furthermore, popular feature reduction techniques used in this domain, such as principal component analysis [Allen et al., 2003; Raamsdonk et al., 2001], use feature transformations to create a new low-dimensional feature space [Guyon and Elisseeff, 2006].

2.6 transfer learning

In data mining, it is typically assumed that the training dataset and the testing dataset are independent and identically distributed samples from the same population [Pan and Yang, 2010]. However, in many scenarios this assumption may not hold, and the testing set may have a different distribution from the data used to train the model. This leads to degraded accuracy and has stimulated the research area of transfer learning. This topic of transferring knowledge gained on one population to another is known as transfer learning or domain adaptation, and has been widely studied in data mining [Pan and Yang, 2010; Weiss et al., 2016].

Transfer learning techniques are useful for data mining applications predicting health-related outcomes, whereby an application developed for a particular hospital or population could be utilized to generate predictions for another hospital or population. There are a number of studies in the literature that adapt transfer learning techniques to predicting health-related outcomes [Wiens et al., 2014; Dubois et al., 2017; Chattopadhyay et al., 2012; Song et al., 2017].

In summary, this chapter introduced the background on data mining in the clinical and biomedical domains, and presented an overview of classification techniques, evaluation methods, performance measurements and feature pre-processing methods specific to clinical and biomedical data. Furthermore, we discussed transfer learning, an important sub-area of data mining providing useful insights for health-related data mining applications. In addition, each main chapter contains a section discussing the related work specific to that chapter.


3 TABASCO: SEQUENTIAL FEATURE ACQUISITION FOR CLASSIFIER LEARNING

A common occurrence for classification at test time is partially missing test case features. Since obtaining all missing features is rarely cost effective or even feasible, identifying and acquiring those features that are most likely to improve prediction accuracy is of significant potential impact. The missing feature that is most likely to improve the classification accuracy of a test instance will typically depend on the set of available features as well as their values. Therefore, a personalized approach for each test instance should be followed when deciding which feature should be acquired next. In this chapter1, we propose a confidence-based solution to this generic scenario using random forests, where we sequentially suggest the features that are most likely to improve the prediction accuracy of each test instance, using a set of existing training instances which may themselves suffer missing values. We demonstrate that our proposed method significantly outperforms the baselines of selecting features randomly, selecting features by information gain or Relief feature ranking, and via cost-sensitive decision trees [Ling et al., 2004]. Motivated by clinical diagnostics, where rule-of-thumb feature acquisition is prevalent, we demonstrate these results using a real clinical case study. Using publicly available benchmark datasets, we show that our proposed method enjoys better performance when the datasets contain missing values in a class-dependent manner, motivated by the clinical setting where sicker patients get more tests.

1 This chapter is based on the following manuscript in preparation: "TABASCO: Sequential Feature Acquisition for Classifier Learning", Yamuna Kankanige, Benjamin Rubinstein, and James Bailey.


3.1 introduction

The presence of missing values in training and test instances is commonplace in data mining applications, while possessing more information about data facilitates data mining models that provide more accurate predictions. Acquiring complete instances can be time consuming, prohibitively costly, or impossible [Melville et al., 2004].

There are well-known methods available for handling missing values, such as complete case analysis, imputation, and learners that naturally deal with missing values, such as decision tree-based learners. These methods have their own advantages and disadvantages, but are limited in that they must do without access to any of the missing features [Graham, 2009].

In some settings, it is feasible to acquire additional features at a nominal cost. For example, in medical diagnostics, clinicians order laboratory tests for a patient which have not yet been performed. In this scenario, if the clinician can order the laboratory tests which will assist in providing a better, more accurate diagnosis for the patient, rather than ordering all possible laboratory tests, then the diagnosis can be performed faster and more economically. Indeed, this is what occurs across the medical profession [Saar-Tsechansky et al., 2009].

Test instances, and medical patients, are likely to differ from one another. The decision to request another laboratory test, and the choice of which test to request, should vary depending on the information already available about that individual. Therefore, the decision of which features to acquire to improve prediction should ideally be test-instance specific, depending on the features present and their values. The process of feature acquisition during prediction time is illustrated in Figure 3.1.

In this chapter, we propose a novel feature acquisition method2, TABASCO, for classification tasks during prediction time. TABASCO acquires the feature that maximizes the confidence of the prediction provided by the classifier. It aims to improve the classification accuracy of test instances by requesting the most informative missing features sequentially (i.e. we request the first suggested feature, inspect the gain in classification accuracy and re-run the algorithm to select the next feature to acquire), for one test

2 sequenTial feature Acquisition BASed ClassificatiOn


Figure 3.1: The process of feature acquisition during prediction time (incoming data with missing features undergoes active feature acquisition before a prediction is made by the data mining model)

instance at a time. The operator can continue procuring features until they are satisfied with the confidence of the prediction. The intuition behind our approach is that the accuracy of the prediction can be improved by requesting the feature that is most likely to improve the confidence of the prediction.

To the best of our knowledge, ours is the first study that investigates the generic problem of sequential acquisition of features for individual test instances by maximizing classification accuracy, without being bound to a cost-sensitive framework. We argue that this is an important problem, especially in medical diagnosis settings where providing the best diagnosis for patients is important, and the cost of misclassification can far exceed feature acquisition costs.

As a baseline for comparison, we used a state-of-the-art method, the cost-sensitive decision trees introduced by Ling et al. [Ling et al., 2004], where a new splitting criterion based on minimal total cost is employed. Follow-up work [Sheng and Ling, 2006; Ling et al., 2006; Sheng et al., 2005] made incremental updates to this approach. To represent this line of work, we implemented and compared to the Ling et al. [Ling et al., 2004] method. For additional baselines, we selected features for acquisition randomly, and we acquired features according to feature importance, adopting information gain and ReliefF as the methods of defining feature importance.


The main contributions of this chapter are: 1) We present TABASCO, a confidence-based sequential feature acquisition method for individual test instances based on random forests, which can be applied even when the training set contains missing values. 2) Using a clinical dataset containing missing values, we demonstrate that TABASCO outperforms existing baselines. 3) Using eight publicly available datasets, we further demonstrate that when missing values are introduced to features of the datasets in a class-dependent manner, TABASCO outperforms the other baselines.

3.2 related work

Active learning resolves the issue of unlabeled instances in learning by querying for labels. The techniques studied in active learning attempt to acquire labels for the few instances with maximum benefit, i.e. those labels providing the highest expected predictive accuracy [Settles, 2009]. Among the previous work in this domain, some studies looked at improving the predictive accuracy of the generated models [Melville et al., 2004; Thahir et al., 2012; Dhurandhar and Sankaranarayanan, 2015], while others compared the cost of feature acquisition against the expected improvement in model accuracy [Melville et al., 2005; Saar-Tsechansky et al., 2009].

A complementary problem setting is when missing features can be obtained during prediction time, the focus of this chapter. There are a few studies in the literature that consider this problem of feature acquisition during test time. Most of the existing work in this area investigates the problem within a cost-sensitive framework, considering different costs such as feature acquisition cost, misclassification cost and delay cost.

Ling et al. [Ling et al., 2004] proposed a novel method for building cost sensitivedecision trees, which is used in their proposed sequential and batch feature acquisitionstrategies.

Sheng et al. [Sheng and Ling, 2006] explored feature acquisition during prediction in sequential batches, where misclassification cost, feature acquisition costs and delay costs are considered. A similar problem was addressed by Ji et al. [Ji and Carin, 2007] using hidden Markov models, where the aim is to minimize feature acquisition and misclassification costs. Weiss et al. [Weiss et al., 2013] proposed a cost-sensitive feature selection method using histograms. A feature-budgeted random forest method is presented by Nan et al. [Nan et al., 2015], who consider feature acquisition costs when building the random forest. Our TABASCO approach differs from these works since we consider the problem of acquiring the features that deliver the best classification accuracy.

Bilgic et al. [Bilgic and Getoor, 2007, 2011] considered a cost-sensitive setting, proposing a data structure for the calculation of the value of information, and compared feature acquisition strategies using their proposed data structure.

Kanani et al. [Kanani and Melville, 2008] considered the case where a fixed set of features can be acquired for a subset of instances during prediction time. This work differs from our study because we do not consider a fixed set of features, and we acquire features for each individual test instance.

Kapoor et al. [Kapoor and Horvitz, 2009] investigated simultaneously updating the predictive model and exploring the test case at hand. Given a training dataset and a test instance, both of which can be incomplete, and a budget for acquiring information, the goal was to determine the missing features from the training data or the test instance to achieve the best prediction for the test instance. desJardins et al. [desJardins et al., 2010] also looked at feature acquisition at both training and test time simultaneously. The confidence-based feature acquisition method that they proposed seeks to optimize the total acquisition cost required to attain a given level of expected predictive performance per instance.

3.3 the tabasco algorithm

We next introduce the terminology and notation used throughout this chapter. Let X be the input feature space with a binary response Y ∈ {0, 1}. A dataset with N records is denoted by D = (Xi, Yi) for i = 1, ..., N. Let rf be a random forest classifier that defines the mapping from X to Y, rf : X → Y.

The probability of instance i belonging to class m is Pim, while the confidence of the prediction for the ith instance is Ci. The possible confidence gain from requesting the jth feature, Fj, of the ith instance is denoted by Gij.


In each tree treek (k ∈ {1, ..., Z}) of the random forest, let SPjk denote the set of split points of the tree based on feature Fj, while TVjk denotes a minimum set of representative values such that all the branches of the tree are covered.

Our approach is based on the confidence of a prediction provided by the classifier. The intuition behind TABASCO is that the feature that is likely to provide the highest confidence should be acquired next. For a given test instance, we determine the projected confidence of the prediction if we acquire a particular missing feature. This process is performed for all the missing features of that test instance. The proposed feature to be acquired next is then the feature with the highest gain in confidence compared with the original prediction for the test instance.

In this section, the expressions for confidence and confidence gain are first derived, followed by a detailed explanation of the algorithm. Given a test instance (Xi, Yi), let the class probabilities provided by rf(Xi) be Pi1 and Pi0 respectively. We define the confidence of the prediction for a test instance, 0 ≤ Ci ≤ 1, in Equation 3.1 as the absolute difference of the two probabilities, which we use to quantify how reliable the prediction is.

Ci = |Pi1 − Pi0|. (3.1)

Let the confidence of the prediction for the test instance after acquiring feature j be Cij. Then the confidence gain of the feature acquisition, −1 ≤ Gij ≤ 1, is given by

Gij = Cij − Ci. (3.2)
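Equations 3.1 and 3.2 translate directly into code. The following is an illustrative sketch (function names are assumptions, not from the thesis):

```python
# Equations 3.1 and 3.2: prediction confidence and confidence gain.

def confidence(p1, p0):
    """C_i = |P_i1 - P_i0| (Equation 3.1)."""
    return abs(p1 - p0)

def confidence_gain(p1_after, p0_after, p1_before, p0_before):
    """G_ij = C_ij - C_i (Equation 3.2)."""
    return confidence(p1_after, p0_after) - confidence(p1_before, p0_before)

g = confidence_gain(0.9, 0.1, 0.6, 0.4)   # confidence rises from 0.2 to 0.8
```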

Our solution starts by constructing a random forest classifier rf using the trainingsample Dtrain. The random forest is a well-known bootstrap aggregation ensemble learn-ing method for classification, introduced by Breiman in 2001 [Breiman, 2001], which isknown to handle missing values naturally owing to it being an ensemble of decision trees,and deal with large number of features [Hapfelmeier et al., 2014]. The following processis performed per test instance Xi ∈ Dtest, to acquire features sequentially. Let the orig-inal confidence of the prediction for the test instance Xi calculated as per Equation 3.1


be Ci. If that confidence is less than the user's desired confidence Cd, the feature acquisition process proceeds: the missing feature Fj ∈ {F1, F2, ..., Fq} with the highest possible confidence gain Gij is requested next. This process is repeated until the desired confidence has been reached or all the missing values have been obtained, as explained in Algorithm 3.1.

Algorithm 3.1 TABASCO Sequential Feature Acquisition Algorithm

 1: procedure FeatureAcquisition(Xi, Cd)
 2:     Ci ← confidence(rf(Xi))
 3:     while Ci < Cd do
 4:         GAi ← emptyList
 5:         for all Fn ∈ missingFeatures(Xi) do
 6:             Gin ← confidenceGain(Fn, Xi)
 7:             update(GAi, Gin)
 8:         end for
 9:         X′i ← acquire Fj with max(GAi)
10:         Ci ← confidence(rf(X′i))
11:     end while
12: end procedure
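A runnable sketch of this control flow follows; `predict_proba`, `estimate_gain` and `oracle` are caller-supplied stand-ins for rf(·), confidenceGain(·) and the (costly) measurement of a feature's true value, not the thesis code.

```python
# Greedy acquisition loop of Algorithm 3.1: keep acquiring the missing feature
# with the highest projected confidence gain until the desired confidence is
# reached or no missing features remain.

def acquire_features(x, missing, predict_proba, estimate_gain, oracle, c_desired):
    p1, p0 = predict_proba(x)
    c = abs(p1 - p0)
    while c < c_desired and missing:
        # Score every still-missing feature by its projected confidence gain.
        gains = {f: estimate_gain(x, f) for f in missing}
        best = max(gains, key=gains.get)
        x[best] = oracle(best)       # acquire the true value of that feature
        missing.remove(best)
        p1, p0 = predict_proba(x)
        c = abs(p1 - p0)
    return x, c
```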

The estimated confidence gain per missing feature Fn of the test instance Xi is determined by generating a set of typical values for that feature using the random forest classifier. To produce good results, these representative values should cover all the possible branches that depend on the feature of interest along the paths relevant to the test instance. Once this set of values is identified, each value in turn replaces the missing feature of the test instance to generate candidate predictions.

3.3.1 Calculating the Set of Possible Values

We track the test instance of interest as it traverses the forest. In each tree treek (k ∈ {1, ..., Z}) we identify the nodes where the splits are based on the feature of interest Fj, and note the split points from the nodes that the instance Xi passes through. This process generates the set of split points per tree, SPjk, based on feature Fj. Using this set of split points per tree, we can determine a minimum set of representative values, TVjk, such that all the branches of the tree are covered. By aggregating the representative values



Figure 3.2: Example random forest comprising three trees; the ellipses denote leaf nodes, and the path taken by the test instance is highlighted in red.

over the forest, combining values without adding duplicate representations of the same range, we can determine a set of typical values for the feature of interest, TVj, representing the whole forest. This corresponds to all the possible paths through the forest for the test instance. The following equations define this process of generating the set of typical values for a feature and a test instance.

SPjk = splitPointsPerTree(treek, Fj, Xi). (3.3)

TVjk = typicalValues(SPjk). (3.4)

TVj = ⋃_{k=1}^{Z} uniqueRanges(typicalValues(SPjk)). (3.5)

For example, consider a random forest based on the clinical features (age, height, urea, gender, weight, creatinine, haemoglobin). Let the test instance be (5.6, 7.1, ?, ?, ?, ?, 10), where ? indicates a missing value and the values are given in the same order as the features. Suppose that the forest consists of the three trees given in Figure 3.2, with the path the instance travels through highlighted.

Assume that we are interested in calculating the set of possible values for urea, which is missing for the test instance. According to our method, first we have to identify the split points of urea that the test instance passes through (Equation 3.3). In the


rightmost tree we have 3, in the second tree none, and in the leftmost tree we have 10 and 4. First, a candidate set of typical values based on the split points is generated for each tree (Equation 3.4). Considering the rightmost tree first, the values can be 1 and 5, representing both paths of the split point 3. For each subsequent tree, typical values are added such that duplicate values representing a path are not added. Therefore, when considering the leftmost tree we need one more value representing greater than 10, which can be 11, resulting in a set of typical values (1, 5 and 11) for the forest covering all the unique paths (Equation 3.5).
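A minimal sketch of this coverage idea (an assumed helper, not the thesis implementation; the offset `delta` is arbitrary, since only which side of each split a value falls on matters):

```python
# Build representative values so that, for every split point the instance
# encounters, at least one value falls on each side of the split. Values
# generated for earlier trees are reused, so duplicates covering the same
# range are never added (Eqs. 3.3-3.5).

def typical_values(split_points_per_tree, delta=2.0):
    values = []
    for splits in split_points_per_tree:
        for s in splits:
            if not any(v < s for v in values):   # need a value below the split
                values.append(s - delta)
            if not any(v > s for v in values):   # need a value above the split
                values.append(s + delta)
    return values

# Worked example from the text: urea splits are 3 (rightmost tree), none
# (second tree), and 10 and 4 (leftmost tree).
tv = typical_values([[3], [], [10, 4]])
```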

3.3.2 Calculating the Confidence Gain per Feature

Using the set of typical values TVj generated per feature Fj according to the above step, the confidence of the prediction can be determined for each value in TVj as follows. The missing feature Fj of the test instance Xi is replaced with one value TV from the set TVj at a time, and the prediction on the resulting instance Xi^{Fj,TV} is generated. This results in an array of possible confidence gains for the feature of interest. From this array, we calculate the confidence gain of that feature as the maximum of the possible confidence gains. The maximum was chosen over the average based on empirical evidence, and a desire to adopt an optimistic policy akin to upper-confidence-bound (UCB) bandit algorithms [Cesa-Bianchi and Lugosi, 2006]. This process is repeated for all the features with missing values in the test instance Xi.

Equation 3.6 presents the mathematical expression of the confidence gain, where the possible confidence gain of a missing feature is defined as the maximum of the confidence gains achieved by substituting the representative values. After calculating the confidence gains for all the missing features, the feature with the maximum gain is selected to be acquired next.

Gij = max { confidence(rf(Xi^{Fj,TV})) − Ci : TV ∈ TVj } (3.6)
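A sketch of Equation 3.6 follows; the `predict_proba` argument stands in for rf(·), and all names are illustrative rather than taken from the thesis code.

```python
# Projected confidence gain for one missing feature j: substitute each
# representative value, re-predict, and take the optimistic maximum (Eq. 3.6).

def projected_gain(x, j, typical_vals, predict_proba, c_orig):
    gains = []
    for tv in typical_vals:
        x_sub = dict(x)          # copy: the real value is not acquired yet
        x_sub[j] = tv
        p1, p0 = predict_proba(x_sub)
        gains.append(abs(p1 - p0) - c_orig)
    return max(gains)
```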


3.4 experiments

We now describe the datasets and methodology behind the experimental study of TABASCO.

3.4.1 Datasets

Our experiments evaluate TABASCO against the aforementioned baselines using two kinds of datasets: one real-world dataset from clinical diagnostics, a domain that strongly motivates the use of active feature acquisition, and a number of public benchmark datasets used to stress-test TABASCO.

3.4.1.1 Clinical Dataset:

We evaluate performance on the dataset studied by Loekito et al. [Loekito et al., 2013], consisting of 22,905 records of routine laboratory test results collected in a hospital emergency department. The classification outcome of interest is death within 24 hours of availability of the test results. The features available in the dataset are: age, albumin, bilirubin, creatinine, urea, total bicarbonate, white cell count, haematocrit, haemoglobin, bicarbonate, platelet count, pH, Na, K and Cl. There are 251 deaths within 24 hours in our dataset of 18,130 patients, while 30% of the feature values are missing.

3.4.1.2 Public Datasets:

We also evaluated TABASCO and the baseline approaches using 14 datasets available from the publicly accessible KEEL data repository [Alcalá et al., 2010] and the UCI Machine Learning Repository [Bache and Lichman, 2013]. All the selected datasets are for binary classification and only numerical features are used (some binary features were converted to numerical features, e.g. binary labels were replaced with 0 and 1). Table 3.1 provides an overview of these datasets. The WEKA data mining software package and the R statistical computing environment were used in our experiments.


Table 3.1: Public datasets used.

Name                          Features  Instances  Class Distribution  Missing Values
Ionosphere                       33        351        126/225              No
Breast Cancer                    30        569        212/357              No
Pima Indians                      8        768        268/500              No
Australian credit                 8        690        307/383              No
Indian liver patient             10        583        167/416              No
South African heart               9        462        160/302              No
German credit                     9       1000        300/700              No
QSAR biodegradation              41       1055        356/699              No
Congressional voting             16        435        168/267              Yes
Horse colic                      12        368        136/232              Yes
Credit approval                  10        690        307/393              Yes
Cervical cancer                  35        858         55/803              Yes
Cylinder Bands                   19        539        227/312              Yes
Kidney                           24        400        150/250              Yes

3.4.2 Methodology

We used 10-fold cross-validation to determine and compare the classification accuracy of TABASCO and the baselines in all the experiments. 10-fold cross-validation results in 10 pairs of training and test datasets. We randomly removed features from the test instances, leaving the same number of features per test instance. In our experiments, we elected to leave 10% of the features non-missing per test instance.
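For concreteness, the masking step could be sketched as follows (a hypothetical helper; the thesis does not specify the exact sampling routine):

```python
import math
import random

# Keep roughly `keep_fraction` of an instance's features (at least one) and
# mark the rest as missing (None), mirroring the evaluation setup.

def mask_instance(x, keep_fraction=0.1, rng=random):
    n_keep = max(1, math.ceil(len(x) * keep_fraction))
    kept = set(rng.sample(list(x), n_keep))
    return {f: (v if f in kept else None) for f, v in x.items()}
```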

implementation details: The sequential feature acquisition algorithm was implemented by modifying the source code of the WEKA data mining package, which is available under the GNU General Public License. For each training set, we train random forest models using the following standard parameters: 1000 trees, with the number of randomly selected features per tree node taken as the square root of the number of available features [Lunetta et al., 2004].


The baseline CSDT classifier was implemented using the WEKA source code. The numerical features of the datasets were discretized by applying the supervised discretize filter in WEKA, which implements the minimal entropy method [Fayyad and Irani, 1993], precisely following the methodology used in the original papers [Ling et al., 2004; Sheng and Ling, 2006]. Ling et al. highlight the following as important properties of CSDT: the tree prefers features with lower costs at the top nodes when the features have different acquisition costs, and the relative difference between attribute acquisition costs and misclassification costs affects the depth of the tree. Therefore, to not bias our comparisons with CSDT, we chose to fix the attribute costs, and to fix the false positive and false negative costs to 100. The fixed attribute cost per dataset per fold was decided by trying a range of costs (i.e. 0, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50) for one feature acquisition, splitting the training fold into 2/3 and 1/3 splits and choosing the cost which yields the best classification performance (area under the receiver operating characteristic curve) for that fold. In case of ties, the lowest cost was preferred.

feature acquisition process: Feature ranking, or defining the importance of features, is a field in data mining which has been studied extensively [Guyon and Elisseeff, 2003; Geng et al., 2007; Dash and Liu, 1997]. Information gain is a popular feature ranking method which has for decades been used to choose the features to split on in decision trees, and to measure the ability of features to generate pure splits [Raileanu and Stoffel, 2004]. Relief [Kira and Rendell, 1992] is another well-known feature ranking method, where the importance of a feature is determined by its ability to distinguish nearby instances belonging to different classes. The extended version, ReliefF, is known to work well with noisy and missing data [Kononenko, 1994]. We calculated the global feature importance per fold according to these methods using the training data of that fold, and used the result for data acquisition on the corresponding test set.

After acquiring a feature value, the corresponding test instance was updated. If the first-suggested feature acquisition was infeasible due to a real missing value in the test instance, the next feature in order of acquisition preference was acquired, and so on. However, note that baseline CSDT recommends only one next feature to be acquired;


when unavailable, this would mean no further feature acquisitions for that test instance. For every subsequent acquisition, TABASCO and CSDT must be queried again, to maintain the instance-specificity of the suggested features. We did not examine alternate forms of the stopping criterion during the experiments, since it is user-specific: the user can choose to stop feature acquisition when the required level of confidence has been reached. In other words, our experiments are designed to yield thorough comparisons irrespective of the specific criterion used; we view the stopping criterion as orthogonal to the method of feature acquisition and relatively straightforward.

measuring performance: To compare the obtained results, we calculated the area under the receiver operating characteristic curve (AUC-ROC) on test data: for example, the AUC-ROC value of test instances without any acquired features, and the AUC-ROC value of test instances with one acquired feature. The AUC-ROC value is a reliable performance measure that is extensively used to evaluate classification performance [Steyerberg et al., 2010]. We adopted this measure as it factors in relative classifier prediction confidence when ranking test points: critical in domains such as clinical diagnostics, where interpretability and calibration demand more than simple yes/no responses.
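As a reminder of the metric (a standard textbook formulation, not code from the thesis), AUC-ROC equals the probability that a randomly chosen positive instance is scored above a randomly chosen negative one:

```python
# Rank-based AUC-ROC: fraction of (positive, negative) pairs where the
# positive is scored higher; ties count one half.

def auc_roc(labels, scores):
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

auc_roc([1, 1, 0, 0], [0.9, 0.4, 0.6, 0.1])   # 3 of 4 pairs ranked correctly: 0.75
```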

We carried out two sets of experiments using the publicly available datasets, as described below.

1. 10% of the features were kept in each test instance and features were acquired sequentially, for all the above datasets.

2. 10% of the features were kept in each test instance and features were acquired sequentially, for all the above datasets originally without missing values, after randomly introducing missing values to a few randomly selected features in a class-dependent manner.

The objectives of the experiments were to show that the TABASCO approach works with complete datasets as well as incomplete datasets. Moreover, we hypothesize


that when datasets contain missing values in a class-dependent manner, TABASCO outperforms the other baselines.

3.5 results and discussion

3.5.1 Clinical Case Study

10-fold cross-validation results of five sequential feature acquisitions are compared in Table 3.2. The results of the four baselines (random acquisition, and acquisition according to information gain (IG), ReliefF and CSDT) are compared with our proposed method (TABASCO). Figure 3.3 displays the same results, where the starting AUC-ROC is the performance before any feature acquisition. According to the results, it is clear that TABASCO performs better at selecting informative features than all the baselines. Note that CSDT does not suggest acquiring any features for this dataset. The main reason is that, under the cost settings decided per fold according to the described method, the decision tree contains only one node in most cases, suggesting that feature acquisition costs are higher than the misclassification costs.

Table 3.2: Comparison of AUC-ROC values when requesting 5 features sequentially, while having 2 features per test instance in the test set at the start.

Method    1      2      3      4      5
Random    0.749  0.771  0.794  0.805  0.816
IG        0.785  0.798  0.812  0.822  0.834
ReliefF   0.792  0.807  0.817  0.821  0.829
CSDT      0.717  0.717  0.717  0.717  0.717
TABASCO   0.798  0.828  0.839  0.844  0.847

One interesting question is how often TABASCO disagrees with the baselines about the feature that should be acquired next. The disagreement percentage between pairs of methods, across all the folds for the first feature acquisition, is shown in the bar chart in Figure 3.4. From the figure it is clear that there is some degree of disagreement between all the methods, as expected from the variety of AUC-ROC results.



Figure 3.3: Comparison of AUC-ROC values when requesting 5 features sequentially, with 2 features per test instance to begin with.

[Bar chart: disagreement between TABASCO & IG, TABASCO & ReliefF, TABASCO & Random, and IG & ReliefF]

Figure 3.4: Disagreement percentage between two methods for the first feature acquisition.

The next feature to be acquired for one test instance may not be the best feature to acquire for another test instance, even when they have the same features present to start with. For example, in this clinical dataset across all the folds, when the test instance had age and creatinine values to start with, our method suggested acquiring


urea 47% of the time, white cell count 28% of the time and total bicarbonate 22% of the time. We believe that this is a key property that should be possessed by any practical feature acquisition method, and it is successfully exhibited by TABASCO.

When the missing value structure of the real clinical dataset was explored, it was evident that the missing value percentages of some features differed between positive and negative instances, as shown in Table 3.3. The question of whether TABASCO is more suitable for datasets with class-dependent missing values is investigated further in the next section using publicly available datasets.

Table 3.3: Distribution of missing values of the features between instances belonging to the two classes.

Feature            %positive  %negative
Age                    0          0
pH                    44         84
Bicarbonate           44         84
Na                    45         18
K                     46         21
Cl                    45         18
Urea                  45         18
Creatinine            45         18
Total bicarbonate     45         18
Bilirubin             62         51
Albumin               59         46
Haemoglobin           48         16
Haematocrit           48         17
White cell count      48         17
Platelet count        49         17

3.5.2 Public Datasets

The classification performance on the datasets after acquiring the first feature was evaluated based on AUC-ROC. The results suggest that TABASCO achieves better results than random acquisition and CSDT on most of the datasets, while providing results comparable to IG and ReliefF. Table 3.4 shows the comparison of AUC-ROC values


of the public datasets when requesting the first feature, comparing the four baselines with our proposed method TABASCO.

Table 3.4: Comparison of AUC-ROC values when requesting the first feature, while having 10% of thefeatures per test instance in the test set.

Name                          Random  IG     ReliefF  CSDT   TABASCO
Ionosphere                    0.857   0.883  0.851    0.859  0.895
Breast Cancer                 0.938   0.973  0.973    0.969  0.972
Pima Indians diabetes         0.684   0.797  0.796    0.782  0.791
Australian credit approval    0.699   0.784  0.734    0.791  0.796
Indian liver patient          0.671   0.682  0.695    0.651  0.706
South African heart           0.667   0.713  0.719    0.674  0.712
German credit                 0.538   0.650  0.547    0.555  0.646
QSAR biodegradation           0.794   0.843  0.796    0.849  0.833
Congressional voting records  0.908   0.983  0.990    0.985  0.987
Horse colic                   0.736   0.853  0.847    0.843  0.856
Credit approval               0.774   0.890  0.886    0.885  0.889
Cervical cancer               0.697   0.935  0.935    0.935  0.935
Cylinder Bands                0.634   0.646  0.633    0.619  0.672
Kidney                        0.922   0.982  0.949    0.984  0.981

Friedman test                 base    5.41e-05  1.93e-02  9.05e-02  5.03e-05
                                      base      5.86e-01  2.60e-01  9.99e-01
                                                base      9.82e-01  5.46e-01
                                                          base      2.31e-01

3.5.3 Public Datasets with Introduced Missing Values

In this section, we experimented by introducing missing values into the training sets of the folds in a class-dependent manner. The features for missing value induction were selected randomly. To compare how the results change when the missing value proportions change, two sets of datasets were generated by introducing missing values


to these randomly chosen features according to two majority vs minority proportions: 6:1 and 12:1. Missing values were introduced to the training dataset of each fold used in 10-fold cross-validation. Table 3.5 shows the number of features with missing values per dataset, while Table 3.6 shows the classification performance on the two sets of datasets during the second feature acquisition, comparing IG and our proposed solution.

Table 3.5: Public datasets with introduced missing values.

Name                        # of features missing
Ionosphere                  9/33
Breast Cancer               10/30
Pima Indians diabetes       5/8
Australian credit approval  5/8
Indian liver patient        5/10
South African heart         5/9
German credit               5/9
QSAR biodegradation         12/41

Figure 3.5 compares sequential acquisition of 5 features for the above datasets with missing values, across the two missing value settings, with respect to AUC-ROC. According to the results, the average AUC-ROC gain of TABASCO over IG (which outperforms the other baselines) increases as the missing value proportion increases.

The results of these experiments, where we use training datasets with introduced missing values, confirm that our proposed method is most suitable for feature acquisition when datasets contain missing values in a class-dependent manner. This makes sense, since TABASCO can take into account test-case-dependent characteristics while the feature ranking baselines cannot.

3.6 summary

In this chapter, we introduced TABASCO, a confidence-based sequential feature acquisition method for classification tasks using a random forest classifier, which aims to acquire the features that maximize classification accuracy. Using a real clinical dataset of about


[Figure 3.5 panels: (a) Ionosphere 6:1, (b) Ionosphere 12:1, (c) Breast cancer 6:1, (d) Breast cancer 12:1, (e) Pima Indians 6:1, (f) Pima Indians 12:1, (g) Australian credit 6:1, (h) Australian credit 12:1]

Figure 3.5: Comparison of sequential feature acquisition of five features with respect to AUC-ROC(datasets with introduced missing values).


[Figure 3.5 panels, continued: (i) Indian liver patient 6:1, (j) Indian liver patient 12:1, (k) South African heart 6:1, (l) South African heart 12:1, (m) German credit 6:1, (n) German credit 12:1, (o) QSAR biodegradation 6:1, (p) QSAR biodegradation 12:1]

Figure 3.5 (continued): Comparison of sequential feature acquisition of five features with respect to AUC-ROC (datasets with introduced missing values).


Table 3.6: Comparison of the classification performance on the two sets of datasets during the second feature acquisition.

Name                        IG 6:1  TABASCO 6:1  IG 12:1  TABASCO 12:1
Ionosphere                  0.912   0.937        0.876    0.916
Breast Cancer               0.972   0.976        0.972    0.980
Pima Indians diabetes       0.813   0.811        0.782    0.783
Australian credit approval  0.735   0.730        0.706    0.733
Indian liver patient        0.826   0.788        0.785    0.761
South African heart         0.714   0.706        0.695    0.716
German credit               0.587   0.638        0.603    0.641
QSAR biodegradation         0.849   0.855        0.808    0.868
paired t-test               p-value = 0.3327     p-value = 0.02714

22,000 records and 15 features, we demonstrated that our proposed method outperforms all baselines over 5 sequential feature acquisitions, starting with only 2 features available per test instance. As future work, we could test our algorithm using a large publicly available medical records dataset such as MIMIC-III [Johnson et al., 2016].

Using 8 publicly available datasets, we showed that when missing values are introduced to the training datasets in a class-dependent manner, our method outperforms the other baselines. This is important because the presence of missing values is very common in data mining, especially in medical datasets, where the data has not been collected for data mining purposes [Cios and Moore, 2002]. Missing values can also be class dependent in such domains; for example, sicker patients are likely to go through more laboratory tests.

There are some limitations to this study. Here we considered datasets with numerical features only; however, the algorithm can easily be extended to datasets with categorical features. Our solution works by choosing a set of possible values and trying each of them to find the value with the highest confidence gain in predictions. As such, the efficiency of this solution could be improved when choosing the next features to request for a set of incoming instances; however, the solution works efficiently for a single test instance at a time. In our experiments, we considered the setting where one missing feature is acquired at a time. Nonetheless, in practice most laboratory tests are


performed in blocks. Therefore, it would be interesting to investigate how our approach could be extended to incorporate such scenarios. Moreover, as features may not be independent in practice, it might be possible to make a good guess at some of the missing values. As future work, we could explore how such guesses could be used to improve our algorithm.

The next chapter (Chapter 4) presents another technique for improving data: a density-based feature transformation method.


4 IMPROVED FEATURE TRANSFORMATIONS FOR CLASSIFICATION USING DENSITY ESTIMATION

Chapter 3 presented an active feature acquisition method for classification tasks, which is particularly applicable when predicting health-related outcomes. In this chapter1, we explore a feature transformation method called density based logistic regression (DLR). DLR is a recently introduced classification technique that performs a one-to-one non-linear transformation of the original feature space to another feature space based on density estimations, and it has been successful on real clinical datasets. The new feature space is particularly well suited to learning a logistic regression model. Whilst performance gains, good interpretability and time efficiency make DLR attractive, its formulation has some limitations. In this chapter, we tackle these limitations and propose several extensions: 1) a more robust methodology for performing density estimations; 2) a method that can transform two or more features into a single target feature, based on higher order kernel density estimation; 3) an analysis of the utility of DLR for transfer learning scenarios. We evaluate our extensions using several synthetic and publicly available datasets, demonstrating that higher order transformations have the potential to boost prediction performance and that DLR is a promising method for transfer learning.

1 This chapter is based on the following publication: “Improved Feature Transformations for Classification Using Density Estimation”, Yamuna Kankanige and James Bailey. Proceedings of the 13th Pacific Rim International Conference on Artificial Intelligence, 2014, pp. 117-129.


4.1 introduction

Logistic regression (LR) is a classification technique belonging to the family of generalized linear models. It is attractive because it estimates class probabilities directly, and the coefficients of the output model can provide insight into the relative importance of the input features. Density based logistic regression (DLR) [Chen et al., 2013a], proposed by Chen et al., is a non-linear one-to-one mapping of the original features into another feature space based on density estimations. This new feature space is well suited for training an LR model for classification. Chen et al. demonstrate that this transformation method can achieve good accuracy and area under the receiver operating characteristic curve (AUC-ROC) values.

Feature transformation is a well-known technique used to create a new set of features based on the original set. It may be either supervised or unsupervised. Examples of unsupervised transformations include mapping a feature x to x^k, log x, sin x or 1/x, or normalising x into the range [0, 1]. An example of a supervised transformation is discretization of a continuous feature into a discrete feature using knowledge of the class label. Transformation is advantageous since it can stabilize variance in the data, model non-linearity, place features on equal scales, or construct feature spaces with better separation between classes. DLR can be viewed as performing a supervised non-linear feature transformation which constructs a new feature space with better separation between the classes.

To illustrate how the feature transformation in DLR works, consider the dataset in Figure 4.1. Figure 4.2 shows a) the projection of this dataset on feature 1 and b) the transformed values of feature 1 using the DLR technique. We can see that the transformed feature values provide better separation between the classes than the original feature values.

However, we can identify some limitations of DLR that were not addressed in the original work of [Chen et al., 2013a]. For example, it is important to ensure robust density estimations. If robustness is not ensured, the result can be non-representative transformed feature values, leading to classifiers which are unduly sensitive and prone to overfitting. In this chapter, we address some open questions with respect to the use of DLR.



Figure 4.1: Dataset with two classes (light blue and dark blue colors)

1. What is an appropriate way to provide robust density estimations?

2. Is it beneficial to extend this to higher order transformations?

3. In what circumstances is the use of DLR likely to be effective?

In addressing these questions, we make the following contributions.

1. A methodology for density based transformation, whereby a separate dataset mustbe used for density estimations.

2. A technique for higher order transformations and evidence that it can sometimesbe more effective than the original method.

3. Evidence that density based transformations are useful for improving the perfor-mance of models in transfer learning scenarios.


[Figure 4.2 panels: (a) Feature 1 values; (b) Transformed feature 1 values]

Figure 4.2: Comparison of original feature 1 values and transformed feature 1 values; the classes (light blue and dark blue) are better separated in (b) than in (a).


4.2 background and preliminaries

4.2.1 DLR

Feature transformation in DLR is defined as follows. Given a binary class label y ∈ {0, 1} and dimensions d = 1, . . . , D, where x_d is the feature value which is transformed to φ_d(x), the transformation function is defined as

\[
\phi_d(x) = \ln\frac{p(y=1\mid x_d)}{p(y=0\mid x_d)} \;-\; \frac{D-1}{D}\,\ln\frac{p(y=1)}{p(y=0)} \tag{4.1}
\]

The Nadaraya-Watson estimator is used to estimate p(y = k|xd), k ∈ {0, 1}

\[
\hat{p}(y=k\mid x_d) = \frac{\sum_{i\in D_k} K\!\left(\frac{x_d-x_{i,d}}{h_d}\right)}{\sum_{i=1}^{N} K\!\left(\frac{x_d-x_{i,d}}{h_d}\right)} \tag{4.2}
\]

In the above estimator, K is a kernel function and h_d is the bandwidth of the kernel density estimation for feature d. The Gaussian kernel is a popular choice:

\[
K(x) = \frac{1}{\sqrt{2\pi}}\exp\!\left(-\frac{x^2}{2}\right) \tag{4.3}
\]

The feature transformation function (for numerical features) is then

\[
\phi_d(x) = \ln\frac{\sum_{i\in D_1}\exp\!\left(-\frac{(x_d-x_{i,d})^2}{2h_d^2}\right)}{\sum_{i\in D_0}\exp\!\left(-\frac{(x_d-x_{i,d})^2}{2h_d^2}\right)} \;-\; \frac{D-1}{D}\,\ln\frac{|D_1|}{|D_0|} \tag{4.4}
\]


When n is the number of records in the dataset and σ is the standard deviation of the feature, the rule-of-thumb bandwidth h_d is

\[
h_d = 1.06\,\sigma\, n^{-1/5} \tag{4.5}
\]

The above is a simplified description of Chen et al.'s DLR and corresponds to their first version, which uses Silverman's rule-of-thumb bandwidth for Gaussian kernels [Silverman, 1986].
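As a concrete illustration, the 1D transformation above can be sketched in a few lines of Python. This is an illustrative sketch only (function and variable names are ours, not from the thesis), combining Equations 4.4 and 4.5 with a Gaussian kernel.

```python
import numpy as np

def dlr_transform_1d(xd, x_train, y_train, D):
    """Sketch of the 1D DLR transform (Eq. 4.4) for a single value xd of
    feature d, assuming a Gaussian kernel and Silverman's rule-of-thumb
    bandwidth (Eq. 4.5). x_train holds the observed values of this feature,
    y_train the binary class labels, D the total number of features."""
    n = len(x_train)
    h = 1.06 * np.std(x_train) * n ** (-1 / 5)          # Eq. 4.5
    k = np.exp(-((xd - x_train) ** 2) / (2 * h ** 2))   # Gaussian kernels
    num = k[y_train == 1].sum()                         # mass from class D1
    den = k[y_train == 0].sum()                         # mass from class D0
    n1, n0 = (y_train == 1).sum(), (y_train == 0).sum()
    # log likelihood ratio minus the prior-correction term of Eq. 4.4
    return np.log(num / den) - (D - 1) / D * np.log(n1 / n0)
```

A value lying in a region dense with class-1 records receives a large positive transformed value, and conversely for class 0.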

4.2.2 Kernel Density Estimation (KDE)

The concept of density estimation "is the construction of an estimate of the density function from the observed data" [Silverman, 1986]. KDE can be described as the summation of all the kernels, when a kernel is placed at each observed point. The idea of KDE was first introduced by Rosenblatt [Rosenblatt, 1956]. Suppose that the observed sample of random values x_1, x_2, . . . , x_n has a density f. The KDE f̂ at x, with kernel K and bandwidth h, is given as

\[
\hat{f}(x) = \frac{1}{nh}\sum_{i=1}^{n} K\!\left(\frac{x-x_i}{h}\right) \tag{4.6}
\]

For a multidimensional scenario, with kernel K and dimensionality d, the kernel density estimator can be defined as [Wand and Jones, 1995]

\[
\hat{f}_H(x) = \frac{1}{n}\sum_{i=1}^{n} K_H(x-x_i) \tag{4.7}
\]

where

\[
K_H(x) = |H|^{-1/2}\,K\!\left(H^{-1/2}x\right) \tag{4.8}
\]


H is known as the bandwidth matrix, a symmetric positive definite d × d matrix. A commonly used kernel function for higher order KDEs is the standard multivariate normal density function, which is given by

\[
K(x) = (2\pi)^{-d/2}\exp\!\left(-\tfrac{1}{2}x^{\mathsf{T}}x\right) \tag{4.9}
\]

Recent work on estimating the full bandwidth matrix includes estimating a pilot bandwidth matrix for the two dimensional scenario [Duong and Hazelton, 2003], known as the sum of asymptotic mean square error (SAMSE) pilot bandwidth method, and estimating cross-validation bandwidth matrices for the multi-dimensional scenario [Duong and Hazelton, 2005].

4.3 robust density estimations

The success of learning a logistic regression model using the DLR transformed features depends on how the density estimations have been calculated, and inappropriate or non-robust density estimation can cause problems. One candidate method for density estimation is a leave-one-out style procedure (Algorithm 4.1): hold out one record t and use the remaining records t′ for computing the density of t, repeating this for all records. However, suppose there exist two records a and b which have the same value x on feature d, and these two records belong to different classes. During the transformations, b will be used for the density estimation of a and vice versa. Note that a and b have the same value but belong to different classes, so the likelihood ratios will be largely affected by the inclusion or exclusion of the other record. After transforming feature d to trans(d), each of these two records will have a different transformed value, even though they ought to be indistinguishable². This issue arises because the transformed value of a held out record t is unduly sensitive to the set of records being used to compute its density.

2 Note that the class label information is not used for each held out record. Class labels are only available for the remaining records used in density estimation.


Changing the set of records used for density estimation even by a single record may result in large changes in the density estimate for a held out record.

Algorithm 4.1 Leave-one-out method
1: for all t do
2:   trans(t) ← DLRTransform(densityEstimation(t′))
3: end for

Consequently, a leave-one-out style of density estimation is inappropriate. To avoid this issue, we propose that when transforming data, one should use a completely separate dataset, the density estimation (DE) set, which should be kept separate from the training and test sets when learning a LR model. In the remainder of the chapter, the training and test set together will be referred to as the model set. Determining the right balance between the size of the DE set and the size of the model set then becomes an important issue.
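The proposed protocol can be sketched as follows. This is our illustrative code, not the thesis implementation: densities for every model-set record are computed from a disjoint DE set, so records with identical feature values are guaranteed identical transformed values, unlike under the leave-one-out procedure above.

```python
import numpy as np

def transform_with_de_set(x_model, x_de, y_de, h):
    """Sketch: transform model-set values x_model of one feature using only
    the disjoint density-estimation set (x_de, y_de). Assumes a Gaussian
    kernel with a fixed bandwidth h; the prior-correction term of Eq. 4.4
    is omitted for brevity."""
    out = []
    for xd in np.atleast_1d(x_model):
        k = np.exp(-((xd - x_de) ** 2) / (2 * h ** 2))
        # log likelihood ratio computed from the DE set only
        out.append(np.log(k[y_de == 1].sum() / k[y_de == 0].sum()))
    return np.array(out)
```

Two model-set records with the same feature value now receive the same transformed value, since neither participates in the other's density estimate.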

4.4 higher order transformations

The DLR method employs a one-to-one transformation of features. We now consider extending this to a higher order case. Our insight here is that instead of using one single feature at a time for transformation, it is worthwhile to consider a many-to-one transformation, whereby two or more features are the input and the output is a single transformed feature. To illustrate the underlying intuition about the utility of such higher order transformations, consider again the dataset from Figure 4.1. Figure 4.3 compares the one dimensional (1D) transformed features using DLR by Chen et al., versus a two dimensional (2D) transformation³. As we can see, there is a clear advantage in using the 2D transformation here, due to the greater class separability achieved.

3 For the 2D transformation we used the full bandwidth matrix calculated using SAMSE pilot bandwidth estimation.


[Figure panels: (a) 1D transformation of feature 1; (b) 1D transformation of feature 2; (c) 2D transformation of features 1 and 2.]

Figure 4.3: 1D and 2D transformations, focusing on class (light blue and dark blue) separation


The formula for a d dimensional kernel density estimator can be derived as follows, using a diagonal bandwidth matrix and a product kernel. It is the product of the univariate kernels for each dimension [Bowman and Azzalini, 1997].

\[
\hat{f}(x_1,\ldots,x_d) = \frac{1}{n\,h_1\cdots h_d}\sum_{i=1}^{n} K\!\left(\frac{x_1-x_{i,1}}{h_1}\right)\cdots K\!\left(\frac{x_d-x_{i,d}}{h_d}\right) \tag{4.10}
\]

Using Equation 4.4, the d dimensional feature transformation equation for numeric features can be derived as

\[
\phi_d(x_1,\ldots,x_d) = \ln\frac{\sum_{i\in D_1}\exp\!\left(-\frac{(x_1-x_{i,1})^2}{2h_1^2}\right)\cdots\exp\!\left(-\frac{(x_d-x_{i,d})^2}{2h_d^2}\right)}{\sum_{i\in D_0}\exp\!\left(-\frac{(x_1-x_{i,1})^2}{2h_1^2}\right)\cdots\exp\!\left(-\frac{(x_d-x_{i,d})^2}{2h_d^2}\right)} \;-\; \frac{D-1}{D}\,\ln\frac{|D_1|}{|D_0|} \tag{4.11}
\]

We note that higher order transformations have increased time complexity. For DLR, the time complexity of the transformations is O(DN²), where D is the number of original features and N the number of records, whereas 2D transformations and three dimensional (3D) transformations increase the time complexity to O(D²N²) and O(D³N²) respectively. However, if higher order transformations significantly boost classification performance, this increase in time complexity can be worthwhile⁴. Even more complex higher order transformations can be performed using the full bandwidth matrix, by adapting the equations given in Section 4.2, where bandwidth estimation methods such as the SAMSE pilot bandwidth matrix can be used.

The dimensionality of the dataset will increase when higher order transformations are used, possibly resulting in some redundant features. It is therefore important to employ a mechanism to select the most predictive subset of features. Lasso [Tibshirani, 1996] is a well known method which can simultaneously learn a model and select features, and logistic regression variants of it exist. We thus adopt this as the classifier of choice for use with higher order transformations. The full workflow is given in Figure 4.4.

4 In our experiments, we focus on 2D and 3D higher order transformations.
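The workflow of Figure 4.4 can be sketched as below. The code is an illustrative reconstruction: scikit-learn's L1-penalised LogisticRegression stands in for the glmnet lasso used in the thesis, and a product kernel with per-feature rule-of-thumb bandwidths is used for the 2D transformations.

```python
import numpy as np
from itertools import combinations
from sklearn.linear_model import LogisticRegression

def pairwise_dlr_features(X_model, X_de, y_de):
    """Append 2D (pairwise) DLR features to the original ones. Densities
    come from the separate DE set only; prior-correction term omitted."""
    n, d = X_de.shape
    h = 1.06 * X_de.std(axis=0) * n ** (-1 / 5)   # per-feature bandwidths
    new_cols = []
    for a, b in combinations(range(d), 2):        # every feature pair
        col = []
        for x in X_model:
            k = (np.exp(-((x[a] - X_de[:, a]) ** 2) / (2 * h[a] ** 2))
                 * np.exp(-((x[b] - X_de[:, b]) ** 2) / (2 * h[b] ** 2)))
            col.append(np.log(k[y_de == 1].sum() / k[y_de == 0].sum()))
        new_cols.append(np.array(col))
    return np.column_stack([X_model] + new_cols)  # originals + 2D features

def fit_lasso_lr(X_train, y_train):
    """L1-penalised LR prunes redundant transformed features."""
    return LogisticRegression(penalty="l1", solver="liblinear").fit(
        X_train, y_train)
```

In a full run, both the training and the test sets would be transformed with the same DE set before fitting and evaluating the lasso LR model.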


[Workflow: original dataset → sampling → train, DE and test datasets → transformation of train and test sets using the DE set → transformed train and test sets → lasso LR model → evaluation.]

Figure 4.4: Full process of higher order transformations

4.5 transfer learning

In data mining it is typically assumed that the training dataset and testing dataset areindependent and identically distributed samples from the same population [Pan andYang, 2010]. However, in many scenarios, this assumption may not hold and the testingset may have a different distribution from the data that is used to train the model. Thisleads to degraded accuracy and has stimulated the research area of transfer learning.

We argue that density based transformations are a good basis for transfer learning scenarios, due to their foundation on likelihood ratios. To illustrate, suppose we have two different domains that measure the same feature. Let x denote a feature in one domain and let ax + b denote the corresponding feature in the other domain, where a and b are constants (i.e. the second domain is a linear shift of the first). The DLR transformation of the first domain is given by φ_d(x). In the DLR transformation for the second domain, φ_d(ax + b), we can see that the effect of b cancels during density estimation, while the effect of a gets proportionately reduced during transformation. As a result, the transformed features for the two domains, φ_d(x) and φ_d(ax + b), will have closer distributions than the original features.

\[
\phi_d(ax+b) = \ln\frac{\sum_{i\in D_1}\exp\!\left(-\frac{(ax_d-ax_{i,d})^2}{2h_d^2}\right)}{\sum_{i\in D_0}\exp\!\left(-\frac{(ax_d-ax_{i,d})^2}{2h_d^2}\right)} \;-\; \frac{D-1}{D}\,\ln\frac{|D_1|}{|D_0|}
= \ln\frac{\sum_{i\in D_1}\left(\exp\!\left(-\frac{(x_d-x_{i,d})^2}{2h_d^2}\right)\right)^{a^2}}{\sum_{i\in D_0}\left(\exp\!\left(-\frac{(x_d-x_{i,d})^2}{2h_d^2}\right)\right)^{a^2}} \;-\; \frac{D-1}{D}\,\ln\frac{|D_1|}{|D_0|}
\]

\[
\phi_d(x) = \ln\frac{\sum_{i\in D_1}\exp\!\left(-\frac{(x_d-x_{i,d})^2}{2h_d^2}\right)}{\sum_{i\in D_0}\exp\!\left(-\frac{(x_d-x_{i,d})^2}{2h_d^2}\right)} \;-\; \frac{D-1}{D}\,\ln\frac{|D_1|}{|D_0|}
\]

DLR thus looks to be promising for inductive transfer learning situations, where the distributions of data in the source and target domains are different but the learning tasks are the same [Ma et al., 2012].
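A small numerical check (our own illustration, not from the thesis) makes this concrete. If the rule-of-thumb bandwidth is recomputed from each domain's own data, so that h scales with |a|, the linear shift cancels exactly and the two domains yield identical transformed values; with a shared bandwidth, the effect of a is only reduced, as in the derivation above.

```python
import numpy as np

def dlr_1d(x_query, x_de, y_de):
    """1D DLR transform (prior term omitted) with Silverman's
    rule-of-thumb bandwidth computed from the supplied DE data."""
    h = 1.06 * x_de.std() * len(x_de) ** (-1 / 5)
    out = []
    for xd in np.atleast_1d(x_query):
        k = np.exp(-((xd - x_de) ** 2) / (2 * h ** 2))
        out.append(np.log(k[y_de == 1].sum() / k[y_de == 0].sum()))
    return np.array(out)

# Domain 2 is a linear shift of domain 1: x -> 3x + 5. Because the
# bandwidth recomputed from the shifted data is 3h, both a and b cancel
# and the transformed features coincide.
```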

4.6 experiments and analysis

4.6.1 Datasets:

For our experiments, we selected 24 publicly available classification datasets from the UCI machine learning repository [Bache and Lichman, 2013] and the knowledge extraction based on evolutionary learning (KEEL) dataset repository [Alcalá et al., 2010]. Datasets with numerical features, fewer than 40 features in total and two class labels were selected. Records with missing numerical features were removed during pre-processing. Synthetic datasets with two features and class labels were created for some experiments. Experiments on transfer learning were conducted using remote sensing landmine datasets [Xue et al., 2007], which consist of data collected from 29 landmine fields, each with 9 features extracted from radar images and class labels specifying whether each record is a true mine or not.


4.6.2 Experimental Set-up:

We randomly split our original dataset into two stratified datasets (50%-50%), one of which is the DE set, while the other is further randomly split into stratified 70%-30% splits. Features of the 70% split (training set) and the 30% split (testing set) are transformed using the DE set. The training set is used for model development to predict the outcome of interest using LR, and the testing set is used to evaluate the model. This process is repeated 10 times and the average AUC-ROC values are used to compare the models, thus avoiding sampling bias. This experimental set-up guarantees that the resulting density estimations are not biased. We use the lasso implementation for two class LR in the glmnet package for R [Friedman et al., 2010]. The penalty term which minimizes the error (lambda.min), used for evaluation, is selected via cross validated glmnet.
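The set-up above can be sketched as follows. This is an illustrative sketch in Python with scikit-learn rather than the R/glmnet implementation actually used; the transformation step is elided, so plain LR on the untransformed features stands in for the full pipeline.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def repeated_auc(X, y, repeats=10, seed=0):
    """Average AUC-ROC over repeated stratified splits: 50% DE set, then a
    70%/30% train/test split of the remainder. X_de/y_de would drive the
    DLR transformation in the full pipeline (omitted in this sketch)."""
    aucs = []
    for r in range(repeats):
        X_rest, X_de, y_rest, y_de = train_test_split(
            X, y, test_size=0.5, stratify=y, random_state=seed + r)
        X_tr, X_te, y_tr, y_te = train_test_split(
            X_rest, y_rest, test_size=0.3, stratify=y_rest,
            random_state=seed + r)
        clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
        aucs.append(roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))
    return float(np.mean(aucs))
```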

4.6.3 Size of the Density Estimation Dataset:

Experiments using two publicly available real datasets are performed to decide the right balance between the DE set and the model set. Figure 4.5 compares the average AUC-ROC values of the classifiers with two groups of features: original features (O) with 1D transformations, and original features with 1D and 2D transformations. The ratios of records between the density estimation dataset and the model dataset used are 0.2, 0.4, 0.6, 0.8 and 1 respectively. The results suggest that a 50%-50% split, which corresponds to ratio 1, is a good choice for our experiments. From the results, it can be seen that the size of the dataset used for KDEs can change the performance of the classifiers substantially.

4.6.4 Higher Order Transformations:

To identify the characteristics of datasets which will benefit from higher order transformations, we use some synthetic datasets. Class distributions of the datasets with respect


[Figure panels: (a) Ionosphere dataset; (b) Phoneme dataset. x-axis: ratio of DE set size to model set size; y-axis: AUC-ROC; curves: O+1D and O+1D+2D.]

Figure 4.5: Performance of the classifiers when the proportion of density estimation dataset is changed

Table 4.1: Comparison of AUC-ROC values of models when using original features, 1D features and 2D features

Dataset  Original  1D     2D
1        0.764     0.845  0.973
2        0.51      0.702  0.968
3        0.88      0.946  0.995

to the features are shown in Figure 4.6. Table 4.1 compares the performance of the LR models on the synthetic datasets when using original features, 1D transformed features and 2D transformed features. When transforming features in the 2D scenario, we use the SAMSE pilot bandwidth estimation method, while in the 1D scenario we use a slightly modified version of the rule-of-thumb bandwidth, h_d = 3.0σn^{-1/5}.

When evaluated with some real datasets, 2D transformations did not improve performance compared to the 1D counterpart. We simulated that situation using a synthetic dataset, by gradually increasing the noise (reducing the number of positive records). Table 4.2 compares the AUC-ROC values of the classifiers when the number of positive records in the dataset decreases. According to the results of the studies using synthetic datasets, we can see that our higher order transformation method can be more useful when the features transformed are correlated and there is less noise in the datasets.


[Scatter plots of feature 1 versus feature 2 for synthetic datasets (a) 1, (b) 2 and (c) 3.]

Figure 4.6: Class distributions of synthetic datasets according to the feature values; dark blue points belong to the positive class while light blue points belong to the negative class

Table 4.2: Comparison of AUC-ROC values of models when using original features, 1D features and 2D features

% of positive records  Original  1D     2D
18.7                   0.533     0.740  0.756
11.1                   0.564     0.706  0.712
5.7                    0.540     0.686  0.691
3.6                    0.532     0.634  0.672
2.1                    0.563     0.631  0.624
1.7                    0.573     0.680  0.617


We next discuss the experiments carried out using the 24 publicly available datasets. After sampling the datasets, numerical features are transformed using 1D, 2D and 3D feature transformations. A product kernel with a diagonal bandwidth matrix, where the bandwidth of each dimension is the same as in 1D, has been used to calculate the KDEs in higher dimensions. The bandwidth of each feature is calculated using Silverman's rule-of-thumb bandwidth given in Equation 4.5. Table 4.3 compares the performance of the classification models using four groups of features: original features only; original features and 1D transformed features; original features, 1D and 2D transformed features; and original features with all the transformed features. AUC-ROC values when using only the original features are obtained using LR as the classifier, whereas lasso LR in glmnet is used in the other scenarios.

According to the results, we can see that for some of the datasets, transformed features do boost the performance of the classifiers. The original work [Chen et al., 2013a] contains a comprehensive comparison of the 1D transformation with other non-linear methods such as support vector machines and kernel logistic regression. To compare the results of the four groups over all the datasets, we used the Friedman test, which is considered to be a good choice for comparing results of multiple classifiers over multiple datasets [Demšar, 2006]. The Friedman test on the above results gives a p value of 0.0243, which means that at least two groups have significantly different distributions at the 95% confidence level. Post-hoc analysis to identify the significantly different groups is done on pairs of groups. The p values of the Friedman post-hoc analysis are given in the last rows of Table 4.3, where the other groups are compared with the group specified as base. The post-hoc analysis revealed that the performances of original+1D, original+1D+2D and original+1D+2D+3D are significantly different from the performance of the group using only the original features, at the 95% confidence level.

4.6.5 Transfer Learning (LandMine Datasets):

In this section we discuss the experiments investigating the idea that DLR and its proposed extensions will be useful in transfer learning situations, using the landmine datasets. To create an inductive transfer learning scenario we use the dataset of one


Table 4.3: Comparison of AUC-ROC values of models when using original features only and transformed features together with original features

Dataset                               Original  Orig.+1D  Orig.+1D+2D  Orig.+1D+2D+3D
Pima Indians Diabetes                 0.818     0.832     0.835        0.83
Hepatitis                             0.807     0.792     0.812        0.831
Indian Liver Patient                  0.764     0.722     0.713        0.722
Liver Disorders                       0.685     0.689     0.68         0.685
Vertebral Column                      0.916     0.922     0.919        0.92
Breast Cancer Wisconsin (Original)    0.992     0.992     0.992        0.992
Statlog (German Credit Data)          0.633     0.619     0.617        0.604
Ionosphere                            0.788     0.949     0.967        0.967
Mammographic Mass                     0.892     0.897     0.898        0.898
Statlog (Heart)                       0.87      0.873     0.886        0.874
Breast Cancer Wisconsin (Prognostic)  0.686     0.714     0.687        0.684
Breast Cancer Wisconsin (Diagnostic)  0.974     0.988     0.985        0.987
Haberman's Survival                   0.653     0.676     0.664        0.66
Blood Transfusion Service Center      0.778     0.76      0.766        0.759
Banknote Authentication               0.999     1         1            1
Banana                                0.551     0.778     0.971        0.971
MONK's Problems                       0.893     0.996     0.998        1
Phoneme                               0.811     0.87      0.916        0.929
Appendicitis                          0.819     0.829     0.856        0.823
Titanic                               0.712     0.748     0.771        0.772
South African Heart                   0.734     0.744     0.734        0.739
Twonorm                               0.998     0.998     0.997        0.996
Cylinder Bands                        0.611     0.654     0.664        0.72
Credit Approval                       0.795     0.786     0.79         0.789
Friedman post-hoc p                   base      0.0105    0.033        0.033
Friedman post-hoc p                             base      0.3938


[Box-plots of AUC-ROC values for the feature groups OD, NN, O+1D, O+1D+2D and O+1D+2D+3D.]

Figure 4.7: Box-plots of AUC-ROC values

domain for model development and another for model evaluation, which represents the situation of using a model developed for one domain to make predictions in another. To ensure that the distributions of data in the source and target domains are different, we used the Kolmogorov-Smirnov (KS) test to compare the distributions of the same feature in the source and target domains. For our experiments we selected domain pairs where the KS test gives p values less than 0.05 for all the features, which means that the distribution of data is significantly different for every feature of the source and target domains. The first 36 pair combinations which met the above criteria were selected for the experiments.

Nearest neighbour filtering (NN) is a transfer learning methodology proposed by Turhan et al. [Turhan et al., 2009] for cross company defect prediction. They propose to select a subset from the training set which is closer to the test records, and to train the classifier using that subset. To compare the performance of the models when using density based transformations, we used NN and training and testing on the original


Table 4.4: p-values of Friedman post-hoc analysis when using NN and original as the base

                         OD         NN
Original + 1D            6.334e-05  2.381e-05
Original + 1D + 2D       3.061e-06  3.061e-06
Original + 1D + 2D + 3D  5.733e-07  3.061e-06

features of the domains (OD) as baselines. For NN we calculated the Euclidean distance between each test record and all the training records and selected the closest 10 records. The subset of the training records selected by at least one test record is used for training the classifier. Figure 4.7 compares the box-plots of the AUC-ROC values when using original features, NN, original+1D transformed features, original+1D+2D transformed features and original+1D+2D+3D transformed features. The Friedman test over all the results of the 36 datasets gives a p value of 1.389e-13, suggesting that the performances of some groups have significantly different distributions. Table 4.4 shows the p values of the Friedman post-hoc analysis when using NN and original as the base. According to the results, we can see that density based transformations do help in boosting performance in transfer learning scenarios, where the classification models are trained using data from domains different to the target domain.
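The NN filtering baseline described above can be sketched as follows (our illustration of the procedure, with names of our choosing):

```python
import numpy as np

def nn_filter_indices(X_train, X_test, k=10):
    """Nearest-neighbour filtering [Turhan et al., 2009] as used here: for
    each test record, pick its k closest training records by Euclidean
    distance; the union of all picks is the filtered training subset."""
    selected = set()
    for t in X_test:
        dist = np.linalg.norm(X_train - t, axis=1)   # distances to train set
        selected.update(np.argsort(dist)[:k].tolist())
    return sorted(selected)   # indices into X_train
```

The classifier is then trained only on `X_train[nn_filter_indices(X_train, X_test)]`.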

4.7 summary

In this chapter, we have extended the use of density based logistic regression (DLR) in a number of ways. Firstly, we identified an issue with the robustness of density estimation in DLR, in the context of leave-one-out estimation. To address this, we proposed to reserve a completely separate dataset for all density estimation, and we found that a 50:50 split works well. Secondly, we extended the use of DLR to higher dimensional transformations. Our results showed that 2D transformations can help to boost classification performance in several situations, but transformations beyond 2D do not appear to be worthwhile. We also observed that higher order transformations are likely to be more useful when the proportion of positive records is higher. Thirdly, we demonstrated that the DLR technique can be used successfully in transfer learning situations.


As future work, we propose that this transformation technique has the capability of using some information from records that are discarded during pre-processing due to missing values. This issue is prominent in clinical and biomedical datasets, which are not collected for research purposes. We believe that the density based transformation technique could utilize those records for bandwidth calculations and kernel density estimations.

The next two chapters present two applications of data mining techniques to health-related data. In Chapter 5 we present an application on liver transplant data, while Chapter 6 describes an application on whole genome sequencing data of bacterial isolates.


5 DATA MINING ALGORITHMS PREDICT GRAFT FAILURE FOLLOWING LIVER TRANSPLANTATION

In the previous chapters, we discussed data mining techniques used during the data pre-processing stage for improving data by feature acquisition and transformation. In this chapter¹, we demonstrate an application of data mining techniques to liver transplant data. The ability to predict graft failure or primary non-function at liver transplant decision time allows better utilization of the scarce resource of donor livers, while ensuring that patients who are urgently in need of a liver transplant are prioritized. An index derived to predict graft failure using donor as well as recipient features, based on local datasets, will be more beneficial in the Australian context. Using liver transplant data from Austin Health, Melbourne, Australia, we show that, using donor, transplant and recipient characteristics which are known at the decision time of a transplant, data mining algorithms can achieve high accuracy in matching donors and recipients, potentially providing better graft survival outcomes.

5.1 introduction

Liver transplantation is an option offered to patients suffering from chronic liver conditions, when their life expectancy is likely to be higher after the transplantation [Merion, 2004]. Outcomes following liver transplantation depend upon a complex interaction between donor, recipient and process features. Driven by the disparity between the increasing number of potential transplant recipients and the limited number of suitable organ donors, there is increasing use of organs of marginal quality [Busuttil and Tanaka, 2003; Tector et al., 2006].

1 This chapter is based on the following publication: "Machine-Learning Algorithms Predict Graft Failure Following Liver Transplantation", Lawrence Lau, Yamuna Kankanige, Benjamin Rubinstein, Robert Jones, Christopher Christophi, Vijayaragavan Muralidharan and James Bailey. Transplantation, Apr;101(4):e125-e132, 2017.

This shift brings into focus the delicate balance in organ allocation between organ utility and the potential to cause harm to the recipient. Adding to this the significant financial costs and regulatory pressures associated with each transplant, a quantitative tool which can help the transplant surgeon optimize this decision-making process is urgently required.

Surgeon intuition in the evaluation of donor risk is inconsistent and often inaccurate [Volk et al., 2013]. Scoring indices such as the DRI [Feng et al., 2006] attempt to quantify the quality of the donor liver based on donor characteristics, but include characteristics which may not be applicable internationally (e.g. ethnicity and regional location of the donor), use cold ischemia time, which is not available until the transplantation operation commences [Feng et al., 2006; Ioannou, 2006], and do not include features which are known to be strong predictors of outcome but may not be consistently appraised (e.g. hepatic steatosis). The DRI has not found wide adoption in routine practice [Mataya et al., 2014].

Beyond the assessment of donor organ quality is the concept of donor-recipient matching [Briceño et al., 2013], which aims to maximize organ utilization while protecting patients from post-transplant complications. Risk scores that use both donor and recipient characteristics, such as the SOFT score [Rana et al., 2008], have been proposed for this purpose. Theoretically, the success of a transplant may be altered if a given donor organ were transplanted into different recipients. Unfortunately, aside from blood group matching and recipient urgency, currently there is little that guides this decision, and the ideal donor-recipient matching algorithm [Briceño et al., 2013] remains a long-term vision. Attempts to match donors to recipients based on recipient MELD score have had conflicting results [Halldorson et al., 2009; Croome et al., 2012].

Data mining algorithms can be used to predict the outcome of a new observation, based on a training dataset containing previous observations where the outcome is known. They can detect complex non-linear relationships between numerous variables and are used for predictive applications in a wide range of fields including agriculture, financial markets, search engines and match-making [Fayyad, 1996; Kaur et al., 2014; Joachims, 2002; Langley, 1997]. They are also finding increasing application in medicine [Kononenko, 2001]. A data mining algorithm, developed from the experience of a particular liver transplant unit, may be able to predict the likelihood of transplant success in a unit-specific way, potentially allowing for evolving practice.

It has been shown in the literature that recipient characteristics also influence the outcome of a liver transplant [Desai et al., 2004]. The model for end-stage liver disease (MELD) score is derived from three common laboratory test results of a patient: serum bilirubin, serum creatinine and the international normalized ratio for prothrombin time [Kamath et al., 2001; Kamath and Kim, 2007]. The MELD score predicts the mortality risk of patients suffering from end-stage liver disease, and is used as a disease severity index when prioritizing patients on the waiting list [Merion, 2004; Kamath et al., 2001]. A MELD-based allocation system was implemented to reduce deaths on the waiting list and make the organ allocation method fair for everyone. However, MELD has not been successful in predicting post-transplant survival [Desai et al., 2004]. The reason is that "mortality in the post transplantation period is related not only to the degree of liver dysfunction prior to transplantation, but to other factors, such as donor characteristics, experience of the transplantation team, and random postoperative complications which cannot be predicted" [Kamath and Kim, 2007].
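As an aside, the MELD calculation from these three laboratory values is commonly implemented as sketched below. The thesis does not reproduce the formula, so the coefficients and clamping conventions shown (from the usual implementation of the Kamath et al., 2001 score) should be treated as assumptions of this sketch, not as details taken from the thesis.

```python
import math

def meld_score(bilirubin_mg_dl, creatinine_mg_dl, inr):
    """Illustrative MELD sketch: 3.78*ln(bilirubin) + 11.2*ln(INR)
    + 9.57*ln(creatinine) + 6.43, with each value floored at 1.0 and
    creatinine capped at 4.0 (assumed conventions, not from the thesis)."""
    bili = max(bilirubin_mg_dl, 1.0)
    inr_c = max(inr, 1.0)
    crea = min(max(creatinine_mg_dl, 1.0), 4.0)
    return (3.78 * math.log(bili) + 11.2 * math.log(inr_c)
            + 9.57 * math.log(crea) + 6.43)
```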

The objective of this chapter is to evaluate the utility of data mining algorithms, such as random forests and artificial neural networks, for predicting outcome based on donor and recipient variables which are known prior to organ allocation. The performance of these algorithms is compared against current standards of donor and recipient risk assessment, such as the DRI, MELD and SOFT scores, in predicting transplant outcome. This risk quantification tool may potentially assist donor-recipient matching, with improved balancing of the considerable risks associated with liver transplantation.


5.2 materials and methods

5.2.1 Study Cohort

This study used the Liver Transplant Database from Austin Health, Melbourne, Australia, covering January 1988 to October 2013. Austin Health is one of five state-based liver transplant units within Australia and serves the population of the States of Victoria and Tasmania. Brain-dead and cardiac-death organ donors of whole liver and split liver transplants were included. Transplants involving paediatric recipients (under 18 years of age) and transplants from living related donors were excluded from the study. Although transplant records are available from 1988, due to the significant number of values not available in the records prior to 2010 (particularly for the features used to calculate DRI), only transplants which occurred after January 1st 2010 were included for analysis. Transplants from November 2013 to May 2015 were used for validating the results. This research was approved by the Austin Health Human Research Ethics Committee (Project Number: LNR/14/Austin/368).

5.2.2 Dataset Collation

The prospectively maintained database contains comprehensive information about each transplant, including donor features, transplant features, recipient features as well as recipient outcomes. The database was collated into the working dataset, with all fields arranged into categorical, ordinal or continuous variables.

5.2.3 Model Development

Well-known data mining techniques such as random forests [Breiman, 2001; Liaw and Wiener, 2002], artificial neural networks and logistic regression were employed for model development [Bellazzi and Zupan, 2008]. However, logistic regression was not used for models with many features, due to its comparatively poor performance during initial testing.

Training and test datasets were created by bootstrap sampling with replacement. In brief, a number of cases equal to the size of the original dataset were randomly selected, with duplicates, to create a sample training set. It has been shown in the literature that such a bootstrap sample will contain about 63% unique cases from the original dataset. The remaining transplants, not included in the training set, were allocated as the corresponding test set. This methodology, known as out-of-bag error estimation, ensures that there is no overlap between the training and test sets [Breiman, 1996], and is similar to the leave-one-out bootstrap technique for estimating prediction error [Efron, 1983]. This process was repeated 1000 times to yield 1000 training and corresponding testing datasets. The performance of each algorithm was evaluated as the average of the AUC-ROC values over the corresponding 1000 testing samples. The random forest and artificial neural network implementations in the Weka data mining software were used for the experiments.
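The sampling scheme above can be sketched as follows (a minimal illustration; `bootstrap_split` is our name, not from the thesis code):

```python
import random

def bootstrap_split(records, seed=None):
    """Draw a bootstrap sample (with replacement) as a training set; the
    records never drawn form the corresponding out-of-bag test set."""
    rng = random.Random(seed)
    n = len(records)
    drawn = [rng.randrange(n) for _ in range(n)]
    train = [records[i] for i in drawn]
    test = [records[i] for i in sorted(set(range(n)) - set(drawn))]
    return train, test

# With n draws from n records, the expected fraction of unique records in
# the training set is 1 - (1 - 1/n)**n, which approaches 1 - 1/e (~63.2%).
records = list(range(1000))
train, test = bootstrap_split(records, seed=0)
unique_fraction = len(set(train)) / len(records)
```

By construction the test set and the training set share no records, which is what makes the out-of-bag estimate honest.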

A random forest is an ensemble of decision trees, each of which is built using a bootstrap sample of the data with a random subset of features. Predictions of the forest are determined by majority voting [Liaw and Wiener, 2002]. Random forests handle features with missing values naturally and, by construction, cope well with large numbers of features [Hapfelmeier et al., 2014], which makes them a good match for our study. Inspired by the human nervous system, artificial neural networks model the relationship between input and output variables using a collection of neurons, each of which receives input from other neurons, performs some processing and sends the result to other neurons or terminal nodes. Multilayer perceptrons model complex relationships using multiple hidden layers. Each neuron (node) in a layer is connected to all the neurons in the next layer, and every connection carries a specific weight. The learning process of an artificial neural network involves learning these weights from training data [Witten and Frank, 2005].
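The majority-voting step of a random forest can be shown in miniature; the decision stumps and feature names below are illustrative stand-ins, not the thesis's trained trees:

```python
from collections import Counter

def forest_predict(trees, x):
    """A random forest's prediction: each tree votes, the majority wins."""
    votes = Counter(tree(x) for tree in trees)
    return votes.most_common(1)[0][0]

# Three hypothetical decision stumps standing in for full trees, each built
# on a different random subset of features (thresholds are invented).
trees = [
    lambda x: "fail" if x["donor_age"] > 60 else "ok",
    lambda x: "fail" if x["cold_ischemia_h"] > 10 else "ok",
    lambda x: "ok",
]
prediction = forest_predict(trees, {"donor_age": 70, "cold_ischemia_h": 12})
```

Here two of the three stumps vote "fail", so the forest predicts "fail"; a real forest simply does the same with hundreds or thousands of deeper trees.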

First, random forests and artificial neural networks were trained using all available features for the 1000 bootstrapped samples.


Next, all the features were ranked per training sample using an AUC-ROC based feature ranking method, which is suitable for datasets with a high number of features, missing values and imbalanced class sizes [Janitza et al., 2013; Hapfelmeier et al., 2014]. The implementation in the 'party' package for the R statistical software [Hothorn et al., 2010] was used for this task. By scoring the features according to their importance in each sample, over the 1000 samples, we determined the overall ranks of the features for our training data.
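Aggregating the per-sample rankings into overall ranks can be sketched as below. The thesis does not state the exact scoring function; averaging each feature's rank position over the samples is one straightforward choice, and the feature names are illustrative:

```python
from collections import defaultdict

def overall_ranking(per_sample_rankings):
    """Combine per-sample feature rankings (each an ordered list of feature
    names, best first) into one overall ranking by average rank position."""
    totals, counts = defaultdict(float), defaultdict(int)
    for ranking in per_sample_rankings:
        for position, feature in enumerate(ranking):
            totals[feature] += position
            counts[feature] += 1
    # Lower average position means more important overall.
    return sorted(totals, key=lambda f: totals[f] / counts[f])

# Three toy samples standing in for the 1000 bootstrap repeats.
rankings = [
    ["donor_age", "meld", "cold_ischemia"],
    ["meld", "donor_age", "cold_ischemia"],
    ["donor_age", "cold_ischemia", "meld"],
]
```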

As the next step, models using the top 15 features of each sample were trained and evaluated with random forests and artificial neural networks. Fifteen was chosen as the number of features to be considered based on clinical utility. When training random forests, the following standard parameters were used [Lunetta et al., 2004]: 5000 trees, with the square root of the number of available features as the number of randomly selected features considered at each decision point. Two hidden layers were used when training artificial neural networks.
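The quoted forest settings amount to the following (a sketch; flooring the square root to an integer is our assumption, as Weka's rounding is not stated in the text):

```python
import math

def rf_settings(n_features, n_trees=5000):
    """Standard parameters quoted in the text: 5000 trees, with the square
    root of the number of available features considered at each split."""
    return {"n_trees": n_trees,
            "features_per_split": int(math.sqrt(n_features))}
```

For example, with all 276 characteristics this gives 16 candidate features per split, and with the top 15 features it gives 3.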

By scoring the features according to their ranks per sample, over the 1000 samples, we determined the overall ranks of the features for our training data. Random forests and artificial neural networks with the overall top 15 ranked features were employed to determine the performance on the validation data.

5.2.3.1 Outcome Parameters

The primary outcome parameter used to develop and evaluate the prediction model was graft failure or primary non-function, defined as death or re-transplantation within 30 days of the transplant. As a secondary outcome parameter, the performance of the developed model in predicting graft failure at 3 months was evaluated, using a separate validation dataset.

5.2.3.2 Donor Risk Index

As a comparative predictor of outcome, the DRI was calculated using the definition provided by Feng et al. [Feng et al., 2006]. In the dataset, some features required to calculate the DRI for a particular donor may not have been recorded. The DRI was considered missing for a record if any of its constituent features were missing: age, cause of death (stroke, anoxia, trauma, other), whether the organ offer is after brain death or cardiac death, height, race (white, African American, other), donor hospital location (local, regional, national), cold ischemia time, and partial/split liver. The actual cold ischemia time recorded was used in the calculations. Donor hospital location was assigned as follows: offers from hospitals in the Melbourne metropolitan area as local, within Victoria state as regional, and others as national. Logistic regression was used to evaluate the performance of the samples with DRI.
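The missingness rule for the DRI can be sketched as follows. The Feng et al. coefficients are deliberately not reproduced here; a caller-supplied `compute_dri` stands in for that calculation, and the dictionary key names are ours:

```python
# Features whose absence makes the DRI undefined for a record, as listed
# in the text (key names are illustrative).
DRI_FEATURES = (
    "age", "cause_of_death", "donation_after_cardiac_death", "height",
    "race", "donor_hospital_location", "cold_ischemia_time", "partial_split",
)

def dri_or_missing(donor, compute_dri):
    """Return the DRI for a donor record, or None when any required
    feature is missing, matching the rule used in the study."""
    if any(donor.get(feature) is None for feature in DRI_FEATURES):
        return None
    return compute_dri(donor)

# A complete record and one with a missing height, using a stub calculator.
complete = {feature: 1.0 for feature in DRI_FEATURES}
partial = dict(complete, height=None)
```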

5.2.3.3 DRI +/- MELD by Random Forest

The coefficients of the features in the DRI were derived from a Cox regression analysis of a large dataset from the United States [Feng et al., 2006]. If the coefficients were recalculated, or the features used to develop a non-linear model, the features considered in the DRI might become more specific to the local Australian context. Therefore, a random forest model was developed using the DRI features to assess their predictive capability.

A further random forest model was developed using the features required to calculate the DRI together with the MELD score. This was an attempt to consider the contributions of both donor and recipient features to transplant outcome.

5.2.3.4 SOFT Score

We calculated the SOFT score as another comparative predictor of the outcome, using the definition provided by Rana et al. [Rana et al., 2008]. Portal bleed 48 hours pre-transplant was removed from the formula due to its unavailability in the dataset. The SOFT score was considered missing for a record if any of the 18 features used in its calculation were missing. Due to the high proportion of missing SOFT scores (56%), performance with the SOFT score was evaluated using random forests.

5.2.3.5 Statistical Analysis

The predictive performance of all the models was assessed using AUC-ROC analysis, a measurement of the discriminative ability of a model which is especially suited to imbalanced class classification [Ray et al., 2010; Bradley, 1997; Steyerberg et al., 2010]. AUC-ROC values vary from 0 to 1, where > 0.9 is considered excellent discrimination, > 0.75 good discrimination, and 0.5 is equivalent to random guessing [Ray et al., 2010]. AUC-ROC values were computed for each of the 1000 sample training/testing datasets and 95% confidence intervals were determined.
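The evaluation can be sketched as below: the AUC-ROC via its rank interpretation, and a percentile confidence interval over the bootstrap repeats (the text does not state which CI method was used, so the percentile form is an assumption):

```python
import statistics

def auc_roc(labels, scores):
    """AUC-ROC via its rank interpretation: the probability that a random
    positive case is scored above a random negative one (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def mean_with_ci(aucs, alpha=0.05):
    """Average AUC over the bootstrap repeats with a percentile 95% CI."""
    s = sorted(aucs)
    lo = s[int(alpha / 2 * len(s))]
    hi = s[min(len(s) - 1, int((1 - alpha / 2) * len(s)))]
    return statistics.mean(aucs), (lo, hi)
```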

Table 5.1: Summary of some donor and recipient characteristics included in the study
(average ± standard deviation (range) for numerical features, % for nominal features)

Donor Factors                                     | Study dataset              | Validation dataset
Age                                               | 45.8 ±16.8 (14-78)         | 45.4 ±16.2 (14-78)
Gender: Male / Female / Not recorded              | 52.8% / 46.7% / 0.5%       | 53.3% / 46.7% / 0%
BMI                                               | 26.3 ±4.5 (17.6-40.4)      | 26.9 ±5.6 (16.8-54.5)
Number of organs from donor                       | 2.5 ±0.8 (1-4)             | 2.6 ±0.9 (1-4)
Donor offer: Donation after brain death / Donation after cardiac death / Not recorded | 91.1% / 8.9% / 0% | 91.1% / 5.6% / 3.3%
Ethnicity: Caucasian / Other / Not recorded       | 87.2% / 8.3% / 4.5%        | 76.7% / 7.8% / 15.5%
Cause of death: Stroke / Anoxia / Trauma / Other / Not recorded | 65% / 16.1% / 10.6% / 7.8% / 0.5% | 56.7% / 22.2% / 10% / 8.9% / 2.2%
Donor pancreas retrieved: Yes / No / Not recorded | 36.7% / 53.9% / 9.4%       | 27.8% / 72.2% / 0%
Smoking history: Yes / No / Ex-smoker / Not recorded | 56.1% / 37.2% / 5% / 1.7% | 55.6% / 27.8% / 14.4% / 2.2%
Insulin use: Yes / No / Not recorded              | 41.1% / 40.6% / 18.3%      | 6.7% / 21.1% / 72.2%
Alcohol consumption: No / Yes (quantity unknown) / Mild (<1/d) / Mod (2-4/d, up to 14/w) / Heavy (>4/d, >14/w) / Not recorded | 19.4% / 27.8% / 33.3% / 11.1% / 6.7% / 1.7% | 15.6% / 25.5% / 38.9% / 1.1% / 8.9% / 10%
Bilirubin                                         | 13.4 ±17.1 (2-166)         | 9.5 ±6.2 (2-37)
Plasma sodium                                     | 144.3 ±6.5 (128-164)       | 140.4 ±4.2 (133-156)
Creatinine                                        | 86.8 ±48.4 (26-392)        | 94.1 ±47.4 (39-305)
ALT                                               | 77.7 ±107.5 (5-733)        | 110.8 ±166.9 (10-668)
Hb                                                | 116.7 ±26.4 (60-183)       | 128.0 ±23.5 (74-175)
Cold ischemia time                                | 6.4 ±2.0 (3-18.8)          | 6.5 ±2.6 (0.9-20.3)
Cut down: Whole / Split                           | 95.6% / 4.4%               | 95.6% / 4.4%

Recipient Factors                                 | Study dataset              | Validation dataset
Age at transplant                                 | 50.6 ±11.6 (19.3-70.9)     | 53.5 ±11.3 (20.8-66.8)
Gender: Male / Female                             | 66.1% / 33.9%              | 72.2% / 27.8%
MELD score                                        | 18.2 ±7.5 (6-43)           | 19.6 ±8.6 (6-46)
Re-transplant: No / Yes                           | 91.1% / 8.9%               | 98.9% / 1.1%
If HCC, number of tumours                         | 1.4 ±0.5 (1-3)             | 2.1 ±1.1 (1-6)
Oesophageal varices: <1/4 of lumen, not bandable / Large / Not present / Not recorded | 31.1% / 25.6% / 17.2% / 26.1% | 25.5% / 16.7% / 6.7% / 51.1%
Bilirubin                                         | 134.6 ±172.0 (5-902)       | 94.9 ±131.0 (4-682)
INR                                               | 1.6 ±0.5 (1-3.8)           | 1.5 ±0.4 (1-3.2)
Albumin                                           | 29.3 ±6.4 (13-47)          | 30.1 ±7.8 (16-44)
Portal vein patency: Patent / Thrombosed / Partial thrombosis / Patent transjugular transhepatic portosystemic shunt / Not recorded | 78.9% / 3.3% / 2.2% / 1.7% / 13.9% | 82.2% / 4.5% / 6.7% / 2.2% / 4.4%
Ethnicity: Caucasian / Asian / Other / Not recorded | 55% / 7.8% / 3.3% / 33.9% | 37.8% / 8.9% / 3.3% / 50%
Primary diagnosis/Disease category: Hepatitis C / Malignancy-Hepatoma / Primary Sclerosing Cholangitis / Alcoholic Cirrhosis / Other / Chronic Active Hepatitis / Metabolic Disease / Primary Biliary Cirrhosis / Acute Hepatic Necrosis / Cirrhosis-Cryptogenic / Chronic Active Hepatitis B / Biliary Atresia / Not recorded | 22.8% / 14.4% / 10.6% / 8.9% / 8.9% / 5.6% / 4.4% / 4.4% / 3.9% / 3.9% / 2.8% / 0.5% / 8.9% | 11.2% / 37.8% / 14.4% / 6.7% / 13.3% / 1.1% / 3.3% / 5.6% / 1.1% / 3.3% / 1.1% / 0% / 1.1%

5.3 results

5.3.1 Dataset Characteristics

The final dataset comprised 180 transplants, including 16 re-transplants, with 11 graft failures (6.1%) within 30 days. 276 available donor and recipient characteristics (95 dichotomous, 25 non-dichotomous, 156 numerical) were included for feature selection; 32% of the values in the dataset were missing. One hundred and seventy-three (173) donor characteristics, including demographic, clinical and logistical information, were included. The recipient characteristics used in the study comprised 103 items of demographic and pre-transplant clinical information. A summary of the donor and recipient demographic and clinical characteristics is shown in Table 5.1 and the full list of features is given in the supplementary material of the publication2.

2 Machine-Learning Algorithms Predict Graft Failure Following Liver Transplantation. Lawrence Lau, Yamuna Kankanige, Benjamin Rubinstein, Robert Jones, Christopher Christophi, Vijayaragavan Muralidharan and James Bailey. Transplantation, Apr;101(4):e125-e132, 2017.


5.3.2 Algorithm Performances

The ranks of the features were determined from the sample training datasets using the random forest feature importance method, and the overall top 15 predictive donor and recipient features were selected.

The donor features were: cause of death (stroke, anoxia, trauma, other), serum albumin level, donation after brain or cardiac death, the state in which the donor hospital is located, alcohol consumption (no, unknown quantity, <1, 2-4, >4 drinks per day), haemoglobin level, total protein level, insulin usage, age, previous surgery, whether the pancreas was retrieved concurrently, and donor cytomegalovirus status.

The recipient features were: disease category, medical status at activation (home, frequent hospital care, hospital bound, ICU, ventilated) and serum herpes simplex antibodies. Table 3 provides the ranking of the overall top 15 features with their percentages of missingness in the study and validation datasets. It is noteworthy that most of these top predictors have lower percentages of missing values than the dataset-wide average of 32%.

Without feature selection, neural networks had an average AUC-ROC of 0.734 (95% CI 0.729-0.739) while random forests achieved 0.787 (95% CI 0.782-0.793). By comparison, when using the top 15 features of each sample for 30-day graft failure, random forests had an average AUC-ROC of 0.818 (95% CI 0.812-0.824) and neural networks 0.835 (95% CI 0.831-0.840).

The validation dataset contained 90 transplants with 3 graft failures within 3 months, which was selected as the outcome for validation due to the lack of graft failures within 30 days. When the final model with the overall top 15 features, trained for graft failure at 30 days, was assessed on its ability to predict graft failure at 3 months, random forests achieved an average AUC-ROC of 0.715 (95% CI 0.705-0.724), whereas neural networks yielded 0.559 (95% CI 0.548-0.569).

5.3.2.1 DRI, SOFT Score and DRI +/- MELD by Random Forest Performance

For comparison, the DRI for each donor in our dataset was calculated, with a mean value of 1.56 (±0.37). DRI predicted graft failure within 30 days with an average AUC-ROC of 0.680 (95% CI 0.669-0.690). Using DRI, trained for graft failure at 30 days, to predict graft failure at 3 months on the validation dataset, the average AUC-ROC was 0.595 (95% CI 0.587-0.602).

Table 5.2: Comparison of AUC-ROC values of different models created during the study

Characteristics used                                          | AUC-ROC (95% CI)
Donor risk index                                              | 0.680 (0.669-0.690)
SOFT score                                                    | 0.638 (0.632-0.645)
Neural network with all the features                          | 0.734 (0.729-0.739)
Random forest with all the features                           | 0.787 (0.782-0.793)
DRI features in random forest                                 | 0.697 (0.688-0.705)
DRI features and MELD score in random forest                  | 0.764 (0.756-0.771)
Random forest with feature selection (Top 15)                 | 0.818 (0.812-0.824)
Neural network with random forest feature selection (Top 15)  | 0.835 (0.831-0.840)

Using the same features that are used in the DRI, we developed a model using random forests. This model achieved an average AUC-ROC of 0.697 (95% CI 0.688-0.705). When the MELD score was added to the DRI features for random forest modelling, a predictive average AUC-ROC of 0.764 (95% CI 0.756-0.771) was observed.

The SOFT score was also assessed, with a mean value of 5.5 (±4.3). As a predictor of 30-day graft failure, it had an average AUC-ROC of 0.638 (95% CI 0.632-0.645).

The performance of the SOFT score, and of the DRI features with MELD score, on the validation dataset is not available, since some features required for the calculations were missing from the validation dataset.

A comparison of all the results on the study dataset is given in Table 5.2 and Figure 5.1.

5.4 discussion

This study is a proof-of-concept that data mining algorithms can be an invaluable tool supporting the decision-making process for liver transplant organ allocation. This is particularly relevant in the current high-stakes environment where suboptimal organ utility leads to either increased waiting-list mortality or patient mortality following transplantation.

Figure 5.1: ROC curve comparison of different models created during the study

The results of this study revealed that the 15 top-ranking donor and recipient variables available prior to transplantation were the best predictors of outcome, with an average AUC-ROC of 0.818 with the random forest algorithm and 0.835 with artificial neural networks. Both data mining techniques showed significant improvements in AUC-ROC with feature selection. This was followed by the random forest classifier trained with the variables used to calculate the DRI plus the MELD score (AUC-ROC = 0.764). Using the random forest classifier with the features used to calculate the DRI improved the discrimination of the DRI from 0.680 to 0.697. The SOFT score achieved an average AUC-ROC of 0.638. Assessing the predictive accuracy of the final models with the top 15 features, as trained for the 30-day outcome, on graft failure at 3 months, the AUC-ROC decreased from 0.818 to 0.715 with random forests and from 0.835 to 0.559 with neural networks. By comparison, the DRI's prediction of 3-month graft failure was 0.595.

There are many data mining paradigms, of which two of the most widely used are artificial neural networks and random forest classifiers. In a recent landmark paper in which 179 different data mining classifiers were used to classify all 121 datasets representing the entire University of California Irvine Machine Learning Repository, random forest classifiers were found to be the most accurate [Fernández-Delgado et al., 2014]. There are four reports in the literature using artificial neural networks to predict transplant outcome [Dvorchik et al., 1996; Matis et al., 1995; Briceño et al., 2014; Cruz-Ramirez et al., 2013]. The present study is the first report using a random forest data mining algorithm to predict outcome following liver transplantation.

There are multiple theoretical advantages to the use of random forest algorithms in this application. It is well known in the data mining literature that artificial neural networks are prone to overfitting and to learning noise in the data, resulting in unstable models with poor generalization ability [Cheng and Titterington, 1994; Gardner and Dorling, 1998; Adya and Collopy, 1998; Zhang, 2000]. By design, random forest classifiers are less prone to overfitting, producing more stable models [Anaissi et al., 2013; Amaratunga et al., 2008; Cutler et al., 2007]. In medical datasets there is frequently a large degree of missing data, since the data is often not collected for research purposes and some tests are not routinely performed even though they may be highly prognostic (e.g. donor liver biopsy for assessment of steatosis). Simply excluding these cases may bias the results, because the missingness of the data is not completely at random [Acuna and Rodriguez, 2004; Schafer and Graham, 2002]. Random forest algorithms are superior in handling datasets missing a significant proportion of input data, as in this study [Pantanowitz and Marwala, 2009]. Furthermore, while artificial neural networks are essentially a black box into which data is input and from which a prediction is output, the feature importance measure of a random forest can indicate the importance of each variable in the dataset, thereby improving the transparency of the algorithm [Cutler et al., 2007].

Myriad factors interact to influence liver transplant outcome, including donor, recipient and locally specific transplant features. There have been many attempts in the literature to predict graft failure following liver transplant [Rana et al., 2008; Halldorson et al., 2009; Desai et al., 2004; Avolio et al., 2008; Ioannou, 2006; Amin et al., 2004; Avolio et al., 2011; Dutkowski et al., 2011]. Some studies predicted graft failure using donor features alone, recipient features alone [Desai et al., 2004], or a combination of both [Rana et al., 2008; Halldorson et al., 2009; Ioannou, 2006; Amin et al., 2004; Avolio et al., 2011; Dutkowski et al., 2011]. However, these approaches have all failed to gain wider adoption because they were developed from patient populations that may not generalize to other centres, owing to regional differences in patient, donor or process features, or to changes in practice since their development [Mataya et al., 2014; Briceño et al., 2013]. Furthermore, they are calculated from simple multiple regression statistical models which assume a linear influence of the different variables. A predictive model for effective organ allocation needs to be locally and temporally applicable, and to account for the complex interactions within the data available prior to transplantation.

Currently, decisions for organ allocation are largely subjective, or based on a recipient sickest-first or waiting-time approach rather than an outcome-based approach. Data mining algorithms are increasingly used in modern clinical decision-making. Compared with current methods, they are data driven, able to accommodate numerous interdependent variables, and specific to the population on which they were trained. In addition, compared with static indices, they are dynamic, able to learn case by case as the training set expands.

Using the feature importance measure, the most influential donor and recipient variables were determined. Most of these features, such as donor age, whether the offer is after brain death or cardiac death, donor cause of death, donor hospital state (geographical distance), donor alcohol consumption, recipient disease category and medical status at activation, are already known to be important [Feng et al., 2006; Ioannou, 2006; Mateo et al., 2006; Moore et al., 2005]. Donor haemoglobin, protein level and insulin usage were also top-ranking predictive features which make sense clinically. Donor CMV and recipient HSV status were also predictive and, although less intuitive, have been shown to be associated with acute viral infection and rejection [Linares et al., 2011; Pedersen and Seetharam, 2014]. Interestingly, the decision to retrieve the pancreas for islet cell or whole organ transplant was also a top-ranking feature, although the decisions to retrieve kidneys, lungs or heart were not significant features. This is likely because the decision for pancreas retrieval is usually more stringent, requiring more ideal donor conditions.

This study highlights the importance of feature selection and tailoring in predictive modelling. The predictive accuracy of the well-known DRI improved when tailored to the specific influences at the Austin Health Liver Transplant Unit. Accuracy was further improved with the addition of the recipient MELD characteristic, with the best accuracy found with the application of a unit-specific random forest algorithm using the top-ranking predictive features.

The main limitation of data mining algorithms is that they are best suited to predicting outcome in the environment from which they are derived. Conversely, this limitation is also a strength, in that such a model is highly specific to the peculiarities of a particular transplant centre, enabling the best decision for each individual transplant. Therefore, while it is not ideal to export a trained algorithm from one transplant centre to the next, the approach, with an algorithm tailored to each transplant centre, is certainly possible. A further limitation is that while the algorithm is trained to predict 30-day graft failure, its predictive accuracy may not extend to other important liver transplant outcomes such as 3-, 6- or 12-month graft failure, early graft dysfunction, acute/chronic rejection, infections, immunosuppression or late biliary strictures. Each of these outcomes might require a separately trained algorithm.

Another limitation of this study is that the data mining algorithm was derived from an observational database. While the bootstrapping-with-replacement methodology is well validated for the development of robust predictive data mining models [Breiman, 1996; Austin and Tu, 2004], and our attempts to predict 3-month graft failure on a separate validation dataset look promising, prospective validation for 30-day graft failure would be valuable to confirm the predictive ability.

5.5 summary

This study confirms that data mining algorithms based on donor and recipient variables known prior to organ allocation can be used to predict transplantation outcomes. This approach may serve as a tool for transplant surgeons to improve organ allocation decisions. The ability to quantify risk may allow improved confidence in the use of marginal organs and better outcomes following transplantation.

However, given the limited number of 30-day graft failures in the study dataset, it would be worthwhile to obtain data from other transplant centres and validate the findings in a multi-centre study. Moreover, similar studies could be conducted to predict other useful outcomes such as 3-, 6- or 12-month graft failure, early graft dysfunction, acute/chronic rejection, infections, immunosuppression or late biliary strictures.

The next chapter (Chapter 6) explores a data mining approach for predicting the serotypes of a bacterial species using whole genome sequencing data, without using any prior knowledge about the gene cluster associated with the serotype.


6 A NOVEL DATA MINING APPROACH TO PREDICTION OF STREPTOCOCCUS PNEUMONIAE SEROTYPE

In Chapter 5, we discuss an application of data mining techniques to liver transplant data, where we demonstrate that data mining algorithms can be utilized to predict transplant outcomes using donor and recipient variables known prior to organ allocation. In this chapter1, we explore another application of data mining techniques, on genomic sequencing data, to predict serotypes, the subgroups of a bacterial species.

Serotyping is a common bacterial typing process in which isolates are grouped according to distinctive surface structures called antigens, and it is important for public health and epidemiological surveillance. This typing process is labour intensive, time consuming and requires expert knowledge [Ashton et al., 2016]. A data mining approach using whole genome sequencing data, which is becoming cheaper and routinely available, would help to provide accurate results faster. To the best of our knowledge, all existing approaches proposed to predict serotypes from whole genome sequencing data use knowledge about the gene clusters associated with the serotypes of that species. However, the relationship between genotypes and serotypes might not be understood for all bacterial species. In this work, we gathered whole genome sequencing data from four publicly available Streptococcus Pneumoniae datasets with serologically derived serotype information, resulting in one of the largest Streptococcus Pneumoniae collections. Using this collection, we demonstrate that data mining approaches can be used to accurately predict serotypes without any prior knowledge about the associated gene clusters. However, organisms exist in populations (groups of organisms of a species living in one habitat), and a learning model developed for one population might not perform well for another. In this chapter, we demonstrate that using samples from multiple different populations when training the data mining models boosts the generalization performance on unseen populations.

1 This chapter is based on the following manuscript in preparation: "A Novel Data Mining Approach to Prediction of Streptococcus Pneumoniae Serotype", Yamuna Kankanige, Benjamin Goudey and Thomas Conway.

6.1 introduction

Bacterial typing is the process of clustering closely related microorganisms together. This clustering is important in public health and for epidemiological purposes, since serotyping is used for detecting outbreaks, identifying pathogenic variants and identifying occurrences of antibiotic-resistant variants [Tenover et al., 1995; van Belkum et al.]. There are many typing methods for grouping the microorganisms of a species, such as multilocus sequence typing (MLST), serotyping, pulsed-field gel electrophoresis and ribotyping [Miljković-Selimović et al., 2009; Ashton et al., 2016; Tenover et al., 1995; van Belkum et al.].

Serotyping is a common typing process in which grouping is performed according to various surface structures called antigens. The process of serotyping involves preparing a range of antibodies and performing a series of binary tests to determine which antibodies lock on to the surface antigens. Therefore, this process is time consuming, labour intensive and requires expert knowledge [Ashton et al., 2016; Kapatai et al., 2016]. In addition, issues like cross-reactivity make it harder to resolve the serotype, and it is known that the "reading of the results can be subjective" [Kapatai et al., 2016].

In this chapter, we investigate whether data mining techniques can be used to predict serotypes from the whole genome sequencing data of bacterial isolates, where an isolate is the representation of a given microorganism isolated from a sample. Predictive data mining using genomic and proteomic data has received a great deal of attention recently, especially in molecular biology research [Bellazzi and Zupan, 2008; Bellazzi et al., 2011]. Single nucleotide polymorphisms, the single-letter differences when aligned with a reference sequence [Szymczak et al., 2009]; k-mers, fixed-length words generated from the read data [Zhang et al., 2003]; gene expression microarray data, generated by measuring the expression levels of thousands of genes simultaneously [Brazma and Vilo, 2000]; and protein expression data, generated by analyzing complex protein mixtures [Clarke et al., 2008], are popular types of genomic data used in data mining approaches. Unsupervised learning techniques such as hierarchical clustering [Sørlie et al., 2001], K-means clustering [Gasch and Eisen, 2002] and self-organizing maps [Tamayo et al., 1999], and supervised learning techniques such as support vector machines [Brown et al., 2000] and random forests [Wang et al., 2009], have been used for data mining on genomic data [Brazma and Vilo, 2000]. With the advancement of next generation sequencing technologies, whole genome sequencing is becoming cheaper, faster and routinely available, which opens up a whole area of previously unimaginable research projects using this vast amount of data [Soon et al., 2013; Stephens et al., 2015; Jolley and Maiden, 2010]. Therefore, a data mining based method for predicting the serotypes of bacterial isolates from whole genome sequencing data would be useful in providing results more quickly, without the need for domain experts.
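The k-mer representation mentioned above can be sketched in a few lines (a minimal illustration; `kmer_counts` is our name, and real pipelines typically use k around 21-31 over millions of reads):

```python
from collections import Counter

def kmer_counts(sequence, k):
    """Count the fixed-length words (k-mers) occurring in a sequence; the
    counts over all reads of an isolate form its feature vector."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))

# A toy sequence: "ATGCATG" with k=3 yields ATG, TGC, GCA, CAT, ATG.
counts = kmer_counts("ATGCATG", k=3)
```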

For our investigation, we selected Streptococcus Pneumoniae as the bacterial species and formed one of the largest collections of Streptococcus Pneumoniae isolates available with serologically derived serotype information. Streptococcus Pneumoniae is a pathogenic bacterium which can cause multiple diseases such as pneumonia, meningitis, bronchitis and other pneumococcal infections, mainly in children under 5 years and the elderly. It is also known to be highly recombinogenic [Chewapreecha et al., 2014b]. Currently there are 92 distinct serotypes of Streptococcus Pneumoniae, organized into 25 individual serotypes and 21 serogroups; each serogroup consists of two to five closely related serotypes, accounting for the remaining 67 serotypes [Kapatai et al., 2016]. It has been shown in the literature that in Streptococcus Pneumoniae the capsular polysaccharide biosynthesis (cps) gene cluster governs the serotype [Mavroidi et al., 2007].

Even though a data mining model developed for one population should provide accurate results for isolates within that population, it might perform poorly for another population. One possible reason is the population structure that exists within populations of organisms, where a population refers to a group of organisms of a species living in a specific habitat. Population structure is the relationship between organisms within a population, which arises through recombinations and horizontal gene transfers within the population [Smith et al., 2000].

This topic of transferring knowledge gained on one population to another is known as transfer learning or domain adaptation, and has been widely studied in data mining [Pan and Yang, 2010; Weiss et al., 2016]. Domain adaptation is a challenging problem when the “domains” are different populations of organisms, represented by distinct population structures. In a data mining context, population structure can be viewed as the distribution differences of the variants among the populations, which influence the decisions and their accuracies when training on data from one population to make predictions on another population [Stephan et al., 2015].

To the best of our knowledge, all approaches proposed for domain adaptation require unlabelled or labelled samples from the target domain. However, our objective is to investigate the ability to give predictions for samples from a new population, without having any prior samples from that population. Using four Streptococcus Pneumoniae datasets from three different populations, we demonstrate that we can achieve higher prediction accuracies for new populations by including data from a few different populations when training data mining models.

The primary objective of this study is to examine whether data mining techniques making use of whole genome sequencing data can accurately predict the serotypes of bacterial isolates belonging to Streptococcus Pneumoniae, without using any prior knowledge about the gene clusters associated with serotypes. Such an approach would make it convenient to identify serotypes quickly and accurately for public health purposes.

The contributions of this chapter are:

• Using a large collection of publicly available Streptococcus Pneumoniae datasets, we demonstrate that the serotypes of bacterial isolates of known serotypes can be predicted accurately using data mining techniques, when samples from all the test populations are included in the training datasets.

• By comparing our results with those obtained using a state-of-the-art rule-based system, we show that our method provides comparable results.


• Using the variable importance measures available in Random Forest, we demonstrate that a high proportion of the important variants for predicting the serotypes lie within the cps gene cluster, which is compatible with the current consensus. These results indicate that our approach would be useful for predicting serotypes of bacterial species where the association between the genotypes and phenotypes is not known. Moreover, it would be useful for identifying such associations.

• By training the data mining models on one or more populations and evaluating on another population, we show that the generalization performance for previously unseen populations can be improved by using samples from a few different populations during training.

• By analyzing the differences in confidence levels of correct and incorrect predictions, we propose that the accuracy of the learning models could be improved by choosing not to provide predictions when not confident, allowing the users to seek alternative methods for those isolates.

To the best of our knowledge, ours is the first study that explores the use of whole genome sequencing data of bacterial isolates to determine serotypes using a data mining approach, without any prior knowledge about the gene clusters associated with serotypes.

6.1.1 Related Work

Some of the prior studies exploring similar objectives investigated the possibility of predicting serotypes using other laboratory techniques, such as MLST, which require considerable time and effort, while others looked at using known gene clusters to predict serotypes. By comparison, we propose a novel approach for predicting serotypes using whole genome sequencing data, without any prior knowledge about the gene clusters associated with serotypes.

Achtman et al. [Achtman et al., 2012] proposed that MLST would be a better alternative to serotyping in Salmonella enterica, while Ashton et al. [Ashton et al., 2016] explored the possibility of substituting serotyping with an MLST approach using 6,887 isolates of Salmonella enterica, achieving an accuracy of 96%. In another work, it has been shown that a data mining approach based on high resolution melt curves can accurately predict the serotypes of Streptococcus Pneumoniae [Athamanolap et al., 2014]. Brisse et al. [Brisse et al., 2004] employed Polymerase Chain Reaction (PCR) amplification of the cps gene cluster in Klebsiella pneumoniae, while Pai et al. [Pai et al., 2006] used PCRs of the cps gene cluster of Streptococcus Pneumoniae for determining serotypes. The effectiveness of pulsed-field gel electrophoresis for predicting serotypes was studied by Gaul et al. [Gaul et al., 2007] using 674 isolates of swine Salmonella.

Sequence matching strategies, where databases of major serotype determinants are constructed and raw sequence reads are mapped against them, have been proposed by Zhang et al. [Zhang et al., 2015] for predicting the serotypes of Salmonella, and by Joensen et al. [Joensen et al., 2015] for Escherichia coli, using the gene clusters coding the O and H antigens. Kapatai et al. [Kapatai et al., 2016] proposed an automated pipeline called PneumoCaT (Pneumococcal Capsular Typing) which accurately predicts the serotype of Streptococcus Pneumoniae using the whole genome sequencing data of the cps gene cluster.

6.2 Materials and Methods

This section presents the experiments we performed using four publicly available Streptococcus Pneumoniae datasets. To compare our results, an automated pipeline called “PneumoCaT” was used. The primary objective of the experiments was to show that data mining techniques can predict the serotypes of a bacterial species without using knowledge about the gene clusters associated with serotypes. In addition, we investigated whether data mining techniques are able to predict serotypes of unseen populations accurately.


Table 6.1: Summary of the datasets before and after preprocessing

Dataset  Origin                                                               Before preprocessing  After preprocessing
Mass     Massachusetts, USA [Croucher et al., 2013]                           616 isolates          588 isolates
Thai     Maela refugee camp, Thailand [Chewapreecha et al., 2014a]            3085 isolates         2566 isolates
UK1      Public Health England National Reference Lab [Kapatai et al., 2016]  871 isolates          858 isolates
UK2      UK [Kapatai et al., 2016]                                            2065 isolates         2025 isolates

6.2.1 Datasets

Four publicly available datasets of Streptococcus Pneumoniae were used in our experiments: 616 isolates collected from Massachusetts, USA (Mass) [Croucher et al., 2013], 3085 isolates from the Maela refugee camp, Thailand (Thai) [Chewapreecha et al., 2014a], 871 isolates from the Public Health England National Reference Lab (UK1) [Kapatai et al., 2016] and 2065 isolates from the UK (UK2) [Kapatai et al., 2016]. During the data preprocessing stage, serologically non-typable isolates and isolates with serotypes occurring just once in a dataset were removed due to lack of representative information. After preprocessing, the final study datasets contained 588 isolates belonging to 33 serotypes in Mass; the Thai dataset had 2566 isolates of 55 serotypes, the UK1 dataset had 858 isolates of 80 serotypes and the UK2 dataset had 2025 isolates of 57 serotypes. Table 6.1 summarizes these datasets before and after preprocessing, while the distribution of serotypes within these datasets is shown in Figure 6.1.

Figure 6.2 compares the distributions of the number of isolates per serotype across the four datasets. It is evident from the figure that while the combined number of isolates per serotype varies from 2 to 495, some of the serotypes occur in very few isolates. Figure 6.3 compares the percentage distributions of the serotypes in the Mass, Thai, UK1 and UK2 datasets. This figure shows that while the other datasets are more evenly distributed across the serotypes, isolates from a few serotypes dominate the Thai dataset. A possible reason for this observation is that while each of the other datasets is


Figure 6.1: Distribution of serotypes within the four datasets.

collected from a broader population, such as the Public Health England National Reference Lab and Massachusetts in the USA, the Thai dataset is collected from the Maela refugee camp in Thailand [Chewapreecha et al., 2014a], which is a concentrated population within 2.4 km², where isolates belonging to a few serotypes could be circulating.

6.2.2 Features

Single nucleotide polymorphisms (SNPs) are a common form of genetic variation whereby a single base mutation exists between two DNA sequences. This type of variation is relatively easy to measure compared to longer or more complicated variation, and has been shown to be a useful proxy for many types of genetic variation in Streptococcus [Chewapreecha et al., 2014b]. In this work, SNPs were derived for each of the four populations separately using a standard mapping and variant calling pipeline. Reads for each isolate were mapped against the PMEN1 reference (NC_011900.1, Streptococcus Pneumoniae ATCC 700669) using bowtie2 [Langmead and Salzberg, 2012] and bcftools [Li et al., 2009]. SNPs were then combined across the different datasets, where SNPs that were not called in a particular dataset were called as the reference base. Finally,


[Figure: horizontal bar chart of the number of isolates per serotype (0-500), coloured by dataset (Mass, Thai, UK1, UK2).]

Figure 6.2: Comparison of isolates per serotype.


[Figure: horizontal bar chart of the percentage distribution per serotype (0-30%), coloured by dataset (Mass, Thai, UK1, UK2).]

Figure 6.3: Comparison of percentage distribution of the serotypes within the four datasets, such that the sum of percentages per dataset adds to 100.


a further round of filtering was conducted, removing SNPs that were marked as missing in more than 10% of samples or had a mutation called in less than 1% of isolates, as well as removing records that had more than 10% of SNPs marked as missing. The resulting 86,704 SNPs were used in our experiments without further filtering.
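The filtering rules above can be expressed as a small matrix operation. This is an illustrative sketch rather than the thesis pipeline: the genotype encoding (0 = reference, 1 = alternate, -1 = missing) and the function name are assumptions.

```python
import numpy as np

def filter_snp_matrix(X, missing_value=-1, max_snp_missing=0.10,
                      min_mutation_rate=0.01, max_isolate_missing=0.10):
    """Filter an (isolates x SNPs) genotype matrix.

    Hypothetical encoding: 0 = reference base, 1 = alternate base,
    -1 = missing call.
    """
    missing = X == missing_value
    mutated = X == 1

    # Drop SNPs missing in >10% of samples or mutated in <1% of isolates.
    keep_snps = (missing.mean(axis=0) <= max_snp_missing) & \
                (mutated.mean(axis=0) >= min_mutation_rate)
    X = X[:, keep_snps]

    # Drop records (isolates) with >10% of the remaining SNPs missing.
    keep_isolates = (X == missing_value).mean(axis=1) <= max_isolate_missing
    return X[keep_isolates]
```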

6.2.3 Random Forest

The ensemble learning technique Random Forest was used in our experiments, as it inherently handles multi-class classification. A random forest is an ensemble of decision trees, each of which is built using a bootstrap sample of the training data, with a random subset of features considered for generating splits at each node. Predictions by the forest are determined by majority voting of the individual trees [Breiman, 2001].

The three main parameters specified when training a random forest are the number of trees in the forest, the number of randomly selected features to be considered at each node and the minimum node size of the trees [Liaw and Wiener, 2002]. The general consensus is that it is best to have a large number of trees, that the square root of the number of available features is a good estimate for the number of features per node, and that by default the trees in the forest are allowed to grow unpruned for classification tasks [Wright and Ziegler, 2015]. We generated 1000 trees in all the experiments. The other two parameters were chosen as described above: unpruned trees, with the square root of the number of features in the dataset as the number of features considered at each node. The ‘ranger’ package in R [Wright and Ziegler, 2015] was used for producing random forests and their predictions.
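As a sketch of this configuration: the thesis uses the ‘ranger’ package in R, so scikit-learn's RandomForestClassifier below is only an illustrative stand-in with the equivalent settings (1000 unpruned trees, square root of the number of features per split).

```python
from sklearn.ensemble import RandomForestClassifier

# Equivalent settings to those described in the text, expressed in scikit-learn.
forest = RandomForestClassifier(
    n_estimators=1000,    # a large number of trees
    max_features="sqrt",  # sqrt(#features) candidates per split
    max_depth=None,       # unpruned trees for classification
    n_jobs=-1,
    random_state=0,
)
# forest.fit(X, y) would then train on a SNP matrix X and serotype labels y.
```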

6.2.4 Experimental Setup

We conducted the following experiments to achieve the objectives given in the introduction. The performance of our experiments was evaluated using classification accuracy (1 - error rate), which is a well-known measure for evaluating and comparing the performance of classification models [Demšar, 2006; Huang and Ling, 2005].


6.2.5 Leave-one-out Cross Validation

Due to the limited number of isolates for most of the serotypes, as shown in Figures 6.2 and 6.3, leave-one-out cross validation was used as the experimental scheme when evaluating predictive performance within each population. Leave-one-out cross validation is an unbiased accuracy estimation method, where one record of the dataset is left out in the training phase and the accuracy of the generated model is estimated using the left-out record. This process is repeated for all the records in the dataset and the final accuracy is calculated by considering all the individual accuracies [Kohavi, 1995; Efron and Tibshirani, 1997].
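The scheme can be sketched as follows, assuming a feature matrix X and labels y; the reduced tree count keeps the sketch fast (the thesis uses 1000 trees), and scikit-learn again stands in for ranger.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import LeaveOneOut

def loo_accuracy(X, y, n_trees=100):
    """Leave-one-out CV: train on every record but one, test on the
    held-out record, and average the per-record accuracies."""
    hits = []
    for train_idx, test_idx in LeaveOneOut().split(X):
        model = RandomForestClassifier(n_estimators=n_trees,
                                       max_features="sqrt", random_state=0)
        model.fit(X[train_idx], y[train_idx])
        hits.append(model.predict(X[test_idx])[0] == y[test_idx][0])
    return float(np.mean(hits))
```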

6.2.6 PneumoCaT

“PneumoCaT” is a recently introduced whole genome sequencing based framework for serotyping Streptococcus Pneumoniae isolates [Kapatai et al., 2016]. PneumoCaT is an automated pipeline which suggests serotypes using whole genome sequencing data, following a two-step process that matches the read data to the cps sequences of the serotypes. For comparison, we evaluated the performance of PneumoCaT in predicting the serotypes of all the isolates used in our experiments.

6.2.7 Merged Dataset

In order to estimate the accuracy of random forests in predicting serotypes when isolates from a broader population are available, we combined these four datasets (Mass, Thai, UK1 and UK2) using stratified bootstrap sampling with replacement. In stratified sampling, samples are selected such that the representation of the serotypes in the selected sample is similar to the representation in the original combined data [Teddlie and Yu, 2007]. In brief, an equivalent number of records from the original datasets were randomly selected, with duplicates, to create a combined training set. It has been shown in the literature that such a bootstrap sample will contain about 63% unique records from the original datasets [Breiman, 1996]. The remaining isolates, not included in the training set, were combined as the corresponding test set. This methodology, known as out-of-bag error estimation, ensures that there will be no overlaps between training and test sets [Breiman, 1996]. This process was repeated 1000 times to generate 1000 pairs of training and corresponding test datasets, each of which was used to train a random forest model and evaluate performance.
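A minimal sketch of this stratified bootstrap with an out-of-bag test set, under the assumption that sampling is performed per serotype to preserve class proportions; the function name and interface are illustrative.

```python
import numpy as np

def stratified_bootstrap_split(y, rng):
    """Draw a stratified bootstrap sample (with replacement) and use the
    records never drawn as the out-of-bag test set."""
    train_idx = []
    for label in np.unique(y):
        members = np.flatnonzero(y == label)
        # Sample as many records as the class holds, with replacement,
        # so class proportions match the combined data.
        train_idx.extend(rng.choice(members, size=len(members), replace=True))
    train_idx = np.array(train_idx)
    # Records never sampled form the out-of-bag test set (~37% on average).
    oob_idx = np.setdiff1d(np.arange(len(y)), train_idx)
    return train_idx, oob_idx
```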

6.2.8 Variable Importance

The variable importance measures available with random forest indicate the importance of each variable in the dataset, thereby improving the transparency of the algorithm [Cutler et al., 2007]. One popular method for calculating the importance of each feature is to observe the differences in prediction accuracies when that feature is permuted in the out-of-bag samples across the random forest [Liaw and Wiener, 2002]. By ranking SNPs according to their feature importance and scoring them according to their ranks in each of the 1000 datasets, we generated the overall ranking of the SNPs across the four datasets, with respect to their ability to predict serotypes.
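A hedged sketch of this ranking step: scikit-learn's permutation_importance permutes features on a held-out set rather than on the out-of-bag samples used by ranger, but the principle is the same. The function name is illustrative.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

def rank_snps(X_train, y_train, X_test, y_test, n_trees=200, seed=0):
    """Rank features by the accuracy drop observed when each is permuted."""
    forest = RandomForestClassifier(n_estimators=n_trees, max_features="sqrt",
                                    random_state=seed).fit(X_train, y_train)
    result = permutation_importance(forest, X_test, y_test,
                                    n_repeats=5, random_state=seed)
    # Feature indices, most important first.
    return np.argsort(result.importances_mean)[::-1]
```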

6.2.9 Population Adaptation

The final set of experiments was conducted to evaluate the generalizability of our approach when presented with isolates from a different population. This is an important aspect of data mining approaches for predicting serotypes, since it indicates the wider usability of the models beyond the original population on which the algorithms were trained. In these experiments, we evaluated the accuracies of random forest models when training on one dataset, or a combination of datasets, to make predictions on another dataset. Three sets of experiments were conducted: 1) training on one dataset and evaluating on another (12 experiments); 2) training on a combination of two datasets and evaluating on another dataset (12 experiments); 3) training on a combination of three datasets and evaluating on the remaining dataset (4 experiments).
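The train-on-some-populations, test-on-another protocol can be sketched as below. This is an illustrative reconstruction; excluding test isolates whose serotype never occurs in the training data matches the accuracy calculation used in the results, since the classifier cannot output an unseen class.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def cross_population_accuracy(train_sets, test_set, n_trees=200, seed=0):
    """Train on one or more populations and evaluate on an unseen one.

    train_sets: list of (X, y) pairs, one per training population;
    test_set: (X, y) for the held-out population.
    """
    X_train = np.vstack([X for X, _ in train_sets])
    y_train = np.concatenate([y for _, y in train_sets])
    X_test, y_test = test_set

    # Serotypes absent from training cannot be predicted; exclude them.
    known = np.isin(y_test, np.unique(y_train))
    forest = RandomForestClassifier(n_estimators=n_trees, max_features="sqrt",
                                    random_state=seed).fit(X_train, y_train)
    return float((forest.predict(X_test[known]) == y_test[known]).mean())
```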


Table 6.2: Comparison of the accuracies of leave-one-out cross validation (LOO-CV) with PneumoCaT

Dataset  LOO-CV  PneumoCaT
Mass     0.95    0.96
Thai     0.98    0.95
UK1      0.89    0.99
UK2      0.97    0.98

6.2.10 Confidence in Predictions

Confidence provides a level of certainty or reliability for the associated predictions, allowing the users to accept the predictions or attempt alternative methods [Bhattacharyya, 2013; Harper, 2005]. The learning algorithm used in our work, Random Forest, is capable of providing the confidence of its predictions, based on the number of trees that voted for the final prediction [Bureau et al., 2005]. We used this confidence to investigate whether the confidence levels of the correct and incorrect predictions are different.

6.3 Results

6.3.1 PneumoCaT Results

For comparison with our data mining approach, we used the recently introduced pipeline PneumoCaT [Kapatai et al., 2016] to identify the serotypes of the isolates belonging to the four datasets. The development dataset of PneumoCaT is the full UK1 dataset used in our experiments (without removing serotypes represented by a single record). Table 6.2 displays the predictive performance of PneumoCaT, where records with the prediction ‘6E’, a novel serotype introduced by PneumoCaT, were removed from the accuracy calculations (due to incompatibility with the other datasets). We also excluded records where the prediction was ‘Reference data not available’, which indicates cases where reference sequences were missing for some serotypes.


6.3.2 Leave-one-out Cross Validation

Table 6.2 shows the accuracies of the leave-one-out cross validation experiments within each population, for the same isolates used for calculating the PneumoCaT results. The accuracies of three out of four of the experiments are around 95%, which is comparable with the results reported in similar studies [Pai et al., 2006; Kapatai et al., 2016]. The lowest accuracy occurs in the UK1 dataset, where the isolates are spread across most of the serotypes with limited sample sizes, as shown in Figures 6.2 and 6.3. For example, sixteen serotypes of the UK1 dataset have only two records, compared with 4 in UK2, 3 in Mass and 2 in Thai, making it harder for a classifier to predict correctly, thereby explaining the lower accuracy in the UK1 dataset when compared with the other results. Figure 6.4 shows the comparisons between the actual serotypes and the predicted serotypes for the four datasets.

According to the comparison given in Table 6.2 between leave-one-out cross validation and PneumoCaT, our results are comparable with the state-of-the-art method, except for one dataset. PneumoCaT works well with the UK1 dataset with 99% accuracy, contrasting with the 89% accuracy during leave-one-out cross validation. This observation is explained by the fact that the UK1 dataset is the development dataset of PneumoCaT [Kapatai et al., 2016], whereas possible reasons for the lower accuracy during the leave-one-out experiments have been discussed above.

6.3.3 Merged Dataset

To estimate the accuracy of random forests in predicting serotypes when isolates from a broader population are available, we combined these four datasets (Mass, Thai, UK1 and UK2) using stratified bootstrap sampling with replacement. Figure 6.5 displays the accuracy distribution of the 1000 bootstrap samples, where the estimated accuracy on the merged dataset is 97%.


[Figure: heatmaps of predicted serotype versus actual serotype, shaded by the proportion per actual serotype. Panels: (a) Mass, (b) Thai.]

Figure 6.4: Comparison of the prediction performances of all the serotypes in leave-one-out cross validation


[Figure: heatmap of predicted serotype versus actual serotype, shaded by the proportion per actual serotype. Panel: (c) UK1.]

Figure 6.4: Comparison of the prediction performances of all the serotypes in leave-one-out cross validation (cont.)


[Figure: heatmap of predicted serotype versus actual serotype, shaded by the proportion per actual serotype. Panel: (d) UK2.]

Figure 6.4: Comparison of the prediction performances of all the serotypes in leave-one-out cross validation (cont.)


[Figure: histogram of bootstrap accuracies, ranging roughly from 0.960 to 0.975.]

Figure 6.5: Accuracy distribution of the 1000 bootstrap samples

6.3.4 Variable Importance

We used the same 1000 sets of training and test samples to determine the most important SNPs for predicting the serotypes of the isolates, using the permutation variable importance measure provided by random forest models. By scoring the SNPs according to their importance rating per bootstrap sample set, over the 1000 sets, the overall ranks of the SNPs were determined. It has been shown in the literature that in Streptococcus Pneumoniae, serotypes are governed by the capsular polysaccharides biosynthesis (cps) gene cluster [Mavroidi et al., 2007]. Figure 6.6 demonstrates how the percentage of SNPs belonging to the cps gene cluster varies according to the number of top ranked features considered. For example, 83.5% of the top 200 SNPs are within the cps gene cluster, as shown in the plot. These results are compatible with the current consensus that the cps gene cluster governs the serotypes of Streptococcus Pneumoniae, and indicate that our approach would be useful for predicting serotypes of bacterial species where the association between the genotypes and phenotypes is not known. Moreover, it would be useful for identifying such associations.

[Figure: line plot of the percentage of top ranked SNPs falling within the cps gene cluster (0-80%) against the number of top ranked SNPs considered (0-10000).]

Figure 6.6: The variation of the percentage of SNPs within the cps gene cluster according to the number of top ranked features (based on the 1000 bootstrap samples)

6.3.5 Population Adaptation

We investigated how this kind of approach generalizes to an unseen population, by training on one or more datasets and testing on another dataset. Three sets of experiments were conducted, using one dataset, two datasets or three datasets for training the models.

Tables 6.3, 6.4 and 6.5 present the results of these experiments when using one dataset, two datasets and three datasets in training the models, respectively. As we have shown in Figure 6.1, these four datasets have only 27 serotypes in common. Therefore, when training on one or more datasets and evaluating on another, some of the serotypes existing in the evaluation dataset may not exist in the training dataset. Thus, when


Table 6.3: Accuracies of the experiments, when training on one dataset and testing on another

Train \ Test   Mass   Thai   UK1    UK2
Mass           -      0.60   0.78   0.72
Thai           0.27   -      0.32   0.35
UK1            0.84   0.23   -      0.82
UK2            0.96   0.67   0.88   -

Table 6.4: Accuracies of the experiments, when training on two datasets and testing on another

Train \ Test   Mass   Thai   UK1    UK2
Mass & Thai    -      -      0.70   0.68
Mass & UK1     -      0.52   -      0.89
Mass & UK2     -      0.69   0.87   -
Thai & UK1     0.94   -      -      0.81
Thai & UK2     0.96   -      0.87   -
UK1 & UK2      0.97   0.63   -      -

calculating the accuracies, the isolates with serotypes not available in the training data were excluded from the calculations.

Figure 6.7 displays how the overall accuracies change across these experiments. According to the results, it is evident that when the training data contains samples from a range of populations, the accuracies of the predictions increase, providing better generalization performance for unseen populations. However, when training on a combination of three datasets and testing on one dataset, we can see that all the other datasets perform better than the Thai dataset. One possible

Table 6.5: Accuracies of the experiments, when training on three datasets and testing on the remaining one

Train \ Test          Mass   Thai   UK1    UK2
Mass & Thai & UK1     -      -      -      0.91
Mass & Thai & UK2     -      -      0.87   -
Mass & UK1 & UK2      -      0.70   -      -
Thai & UK1 & UK2      0.96   -      -      -


[Figure: box plots of accuracy (0.2-1.0) against the number of datasets used in training (one, two, three).]

Figure 6.7: Comparison of the accuracies when training on one or more datasets and evaluating on another

explanation for this observation could be population structure issues within the Thai dataset, which contains isolates from a concentrated population (within 2.4 km²) [Chewapreecha et al., 2014a].

6.3.6 Confidence in Predictions

Being an ensemble learner, random forest offers a simple method of obtaining confidence levels for its predictions: the fraction of trees that voted for a prediction [Bureau et al., 2005]. We investigated how these confidences vary for correct and incorrect predictions, using the experiments where we train on three populations and evaluate on one population. Isolates with serotypes not present in the training data were not excluded from the calculations here, since in practice they represent new serotypes not known in the training population. From Figure 6.8 it is apparent that the confidence levels of the correct predictions are higher than those of the incorrect predictions. The mean confidence gap between correct and incorrect predictions is smallest in the Mass dataset, which is the smallest dataset of the four and provides the highest accuracy, as given in Table 6.5. Therefore, in Mass there are only a few incorrect predictions available, explaining the smaller gap.

[Figure: box plots of prediction confidence for correct versus incorrect predictions. Panels: (a) Mass, (b) Thai, (c) UK1, (d) UK2.]

Figure 6.8: Comparison of the confidences of predictions for correct and incorrect predictions (based on the results of training on three datasets and evaluating on the remaining dataset)

Using the results of the Thai dataset from above, we investigated how the accuracies of the predictions vary if we provide predictions only when they are associated with confidences higher than a predefined threshold. The accuracy calculations here are based only on the predictions made. The results are displayed in Figure 6.9a, which confirms that we can increase the accuracy of the predictions provided by choosing not to predict when not confident, allowing the users to seek alternative methods for those isolates. Furthermore, the relationship between the possible confidence thresholds and the number of isolates that would not be typed at each threshold is given in Figure 6.9b, while Figure 6.9c demonstrates the relationship between the accuracy of the predictions and the number of isolates not typed.
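A sketch of this confidence-thresholded prediction rule, using the fraction of tree votes as the confidence; the function name and the use of None for untyped isolates are illustrative assumptions.

```python
from sklearn.ensemble import RandomForestClassifier

def predict_with_threshold(forest, X, threshold):
    """Return a prediction only where the forest's vote fraction for the
    winning class reaches the threshold; other isolates stay untyped (None)."""
    proba = forest.predict_proba(X)           # per-class fraction of tree votes
    winners = forest.classes_[proba.argmax(axis=1)]
    confidence = proba.max(axis=1)
    return [label if conf >= threshold else None
            for label, conf in zip(winners, confidence)]
```

Raising the threshold trades coverage for accuracy: fewer isolates are typed, but those that are typed carry higher vote agreement.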

6.4 Discussion

In this chapter, we have explored an interesting application of data mining techniques on whole genome sequencing data. We explored the problem of predicting the subgroups of a bacterial species, called serotypes, which are important to identify for public health and epidemiological surveillance [Zhang et al., 2015]. The process of serotyping is time consuming, labour intensive and requires expert knowledge [Ashton et al., 2016; Kapatai et al., 2016]. Thus, we investigated whether this process could be automated with the help of data mining techniques, where serotyping becomes a multi-class classification problem.

The primary objective of our study was to evaluate the performance of data mining techniques in predicting serotypes of a bacterial species, without using any prior knowledge about the gene clusters associated with serotypes. Using one of the largest collections of Streptococcus Pneumoniae datasets with available serologically derived serotype information, we demonstrated that data mining techniques can successfully predict the serotypes within populations (by conducting experiments within individual datasets and on merged datasets). The results also highlight that if the training datasets contain enough records representing all the serotypes, the accuracies of the predictions are better.

Using the permutation based feature ranking provided by Random Forest, SNPs were scored according to their importance across the 1000 sets of bootstrap samples, generating an overall ranking of the SNPs according to their importance for predicting the serotypes. The results show that most of the top-ranking features are within the cps gene cluster, in line with current understanding. These results suggest that our approach would be useful for predicting


[Figure 6.9 appears here: three line plots.
(a) Relationship between the accuracy of the predictions and the confidence threshold for providing predictions.
(b) Relationship between the confidence thresholds of the predictions and the number of isolates not typed.
(c) Relationship between the accuracy of the predictions and the number of isolates not typed.]

Figure 6.9: Relationships between the accuracy, the confidence threshold and the number of isolates not typed (based on the results of training on the combination of Mass, UK1 and UK2 and evaluating on the Thai dataset)


serotypes of bacterial species where the association between the genotypes and the phenotypes is not known. Moreover, it would be useful for discovering such associations.
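The ranking procedure described above can be sketched with scikit-learn's permutation importance. Everything below is an illustrative stand-in, not the thesis pipeline: the SNP matrix and serotype labels are synthetic (only the first two SNPs, a mock "cps" region, determine the label), and a single permutation-importance pass replaces the 1000 bootstrap repetitions used in the study.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)

# Synthetic stand-in: 200 isolates x 50 binary SNPs, where only the first
# two SNPs (a mock "cps" region) determine the serotype label.
X = rng.integers(0, 2, size=(200, 50))
y = X[:, 0] + 2 * X[:, 1]  # four mock serotypes

forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Permutation importance: shuffle one feature at a time and measure how much
# the accuracy drops; the informative SNPs cause the largest drops.
result = permutation_importance(forest, X, y, n_repeats=10, random_state=0)
ranking = np.argsort(result.importances_mean)[::-1]
print("top-ranked SNPs:", ranking[:5])
```

With real data, the features at the top of `ranking` would be the candidate variants to inspect against the known gene clusters.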

A secondary objective of our work was to investigate the generalization performance of the data mining models when predicting for previously unseen populations. By conducting a group of experiments in which data from one or more populations were combined to form the training dataset and another dataset was used as the testing dataset, we showed that generalization performance for unseen populations increases when a number of populations are represented in the training dataset. The fact that data storage and powerful computers are getting cheaper makes this solution practical.

Our results were compared with PneumoCaT, an existing tool for identifying the serotypes of Streptococcus pneumoniae. According to the results, PneumoCaT is able to identify the serotypes with higher accuracies across all the datasets, while our method produces comparable results for most of the datasets. However, developing such a tool for a bacterial species takes considerable time and effort, since it requires identifying how each serotype is genetically different from the others. Moreover, PneumoCaT uses knowledge about the gene cluster associated with the serotypes, making it unsuitable for bacterial species for which this knowledge is not available. In comparison, our method requires less time and effort to develop a learning model for a new bacterial species, while being suitable for species where the gene clusters associated with serotypes are not known.

When we use classification accuracy to evaluate predictions, we ignore the confidence of the predictions and assume that all the predictions are equally probable [Huang and Ling, 2005]. Providing the confidence of predictions is a desirable feature of a data mining model, increasing the usability of such models. Confidences provide a level of certainty or reliability for the associated predictions, allowing the users to accept the predictions or attempt alternative methods [Bhattacharyya, 2013; Harper, 2005]. Using the percentage of trees that voted for the final prediction as the confidence of the predictions, we demonstrated that the relative confidence levels of the incorrect predictions are lower when compared with the correct predictions. Moreover, we demonstrated that the accuracy of the predictions provided by the data mining approaches can be improved by choosing to provide predictions only when they are associated with confidences higher than a predefined threshold, improving the reliability of our approach.
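The abstention idea can be illustrated as follows. This is a hedged sketch on synthetic data, not the thesis experiment: scikit-learn's averaged per-tree class probabilities stand in for the fraction of trees voting for the winning class, and the 0.7 threshold is arbitrary.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, n_features=20, n_informative=5,
                           n_classes=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# Confidence of a prediction: the probability mass on the winning class
# (scikit-learn averages per-tree class probabilities, which stands in for
# the fraction of trees that voted for that class).
proba = forest.predict_proba(X_te)
confidence = proba.max(axis=1)
prediction = proba.argmax(axis=1)

def accuracy_with_abstention(threshold):
    """Accuracy on confident predictions and the count of isolates not typed."""
    confident = confidence >= threshold
    accuracy = float((prediction[confident] == y_te[confident]).mean())
    return accuracy, int((~confident).sum())

acc_all, _ = accuracy_with_abstention(0.0)
acc_conf, n_abstained = accuracy_with_abstention(0.7)
print(f"all: {acc_all:.3f}, confident only: {acc_conf:.3f}, "
      f"not typed: {n_abstained}")
```

Sweeping the threshold reproduces the kind of trade-off curves shown in Figure 6.9: higher thresholds leave more isolates untyped but raise the accuracy of the predictions that are provided.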

There are some limitations to our study. As discussed in the introduction, the serotyping process can be affected by the cross-reactivity of the antisera used. Moreover, the reading of the results can be subjective. Therefore, as in any dataset, there can be errors in the serologically derived serotypes. However, serotypes derived by that process are the basis of our solutions, where the serotype information of the training datasets is used to train the learning algorithms; this is an unavoidable limitation. Another limitation of our approach is the inability to predict new serotypes. However, the confidence of the predictions could be used as a guide here, helping to identify isolates in need of alternative methods for serotype determination.

As future work, we would like to investigate the issue of population structure, which affects the generalization performance when predicting for an unseen population. It would also be worthwhile to investigate how this approach would work for other bacterial species, especially those in which the relationship between the serotypes and gene clusters is not established.

6.5 summary

In this chapter, we proposed a novel random forest based approach for determining the serotypes of bacterial species using whole genome sequencing data, without using any prior knowledge about the gene clusters associated with serotypes. Using one of the largest collections of publicly available Streptococcus pneumoniae datasets, we demonstrated that our approach accurately predicts the serotypes when samples from all the test populations are included in the training datasets, with performance comparable to the existing state-of-the-art methods, even though those methods use knowledge about the gene clusters associated with the serotypes.

Using the variable importance measures available in random forests, we demonstrated that a high proportion of the important variants for predicting the serotypes lie within the cps gene cluster, which is compatible with the current consensus. These results indicate that our approach would be useful for predicting serotypes of bacterial species where the association between the genotypes and the phenotypes is not known. Moreover, it would be useful for identifying such associations.

Using one or more populations for training random forests and evaluating on another population, we showed that the generalization performance for previously unseen populations can be improved by using samples from a few different populations during training.

Finally, by analyzing the differences in confidence levels between correct and incorrect predictions, we proposed that the accuracy of the learning models could be improved by choosing not to provide predictions when not confident, allowing the users to seek alternative methods for those isolates.


7 CONCLUSIONS

Predicting health-related outcomes is important for developing decision support systems for assisting clinicians and other healthcare workers regularly faced with critical decisions. Such models will save their time, help them to manage healthcare resources and ultimately provide better quality care for patients. These outcomes are now made possible thanks to complex medical data routinely generated at hospitals and laboratories, and developments in data mining methods.

This thesis focused on building better predictive models for health-related data, where the contributions can be categorized into two parts. The first part investigated data mining techniques used for improving the data, such as feature acquisition and transformation, which help to build better prognostic models for health-related outcomes. The second part discussed two applications of data mining models on clinical and biomedical data for predicting health-related outcomes. The focus of this chapter is to summarize the contributions of this thesis and discuss their limitations and potential future work.

7.1 contributions of the presented work

Chapter 3 investigated the problem of active feature acquisition at prediction time. The presence of missing values in training and test instances is commonplace in data mining applications, while possessing more information about the data facilitates data mining models that provide more accurate predictions. Acquiring complete instances can be time consuming, prohibitively costly, or impossible [Melville et al., 2004]. In some settings, it is feasible to acquire additional features at a nominal cost. For example, in medical diagnosis, clinicians can order laboratory tests for a patient that have not yet been performed.

We proposed a novel confidence-based sequential feature acquisition method, TABASCO, for classification tasks at prediction time using a random forest classifier. TABASCO acquires the feature that maximizes the confidence of the prediction provided by the classifier. It improved the classification accuracy of test instances by requesting the most informative missing features sequentially, for one test instance at a time. We demonstrated that TABASCO significantly outperforms a number of baselines over 5 sequential feature acquisitions, using a real clinical dataset of about 22,000 records and 15 features, where only 2 features are available per test instance to begin with.
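A simplified sketch of this acquisition loop is shown below. It is not the thesis implementation of TABASCO: the breast cancer dataset, the median imputation of unobserved features, and the quantile grid of candidate values are all illustrative assumptions; only the core idea (acquire the missing feature expected to raise prediction confidence the most) follows the text.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
medians = np.median(X, axis=0)

def confidence(instance):
    """Probability mass on the winning class for a single instance."""
    return forest.predict_proba(instance.reshape(1, -1)).max()

def next_feature_to_acquire(instance, missing, n_candidates=5):
    """Pick the missing feature whose acquisition is expected to raise the
    prediction confidence the most (candidate values: training quantiles)."""
    base = confidence(instance)
    best_feature, best_gain = None, -np.inf
    for j in missing:
        candidates = np.quantile(X[:, j], np.linspace(0.1, 0.9, n_candidates))
        gains = []
        for value in candidates:
            trial = instance.copy()
            trial[j] = value
            gains.append(confidence(trial) - base)
        if np.mean(gains) > best_gain:
            best_feature, best_gain = j, float(np.mean(gains))
    return best_feature

# A test instance with only the first two features observed; the rest are
# median-imputed until acquired.
test_instance = medians.copy()
test_instance[[0, 1]] = X[0, [0, 1]]
missing = list(range(2, X.shape[1]))
chosen = next_feature_to_acquire(test_instance, missing)
print("acquire feature:", chosen)
```

In a sequential setting, the chosen feature's true value would be acquired, removed from `missing`, and the loop repeated until the confidence is high enough or the acquisition budget is spent.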

Using 8 publicly available benchmark datasets, we showed that when missing values are introduced into the training datasets in a class-dependent manner, our method outperforms the other baselines. This is important because the presence of missing values in datasets is very common in data mining, especially in medical datasets, where the data has not been collected for data mining purposes [Cios and Moore, 2002]. Missing values can also be class dependent in such domains; for example, sicker patients are likely to undergo more laboratory tests.

Chapter 4 explored a non-linear feature transformation technique which has been successful with clinical datasets and proposed an extension to the method. In this chapter, we tried to answer some questions that arise when using density-based logistic regression (DLR) for classification, and extended the method in a number of ways. Firstly, we identified an issue with the robustness of density estimation in DLR, in the context of leave-one-out estimation. To address this we proposed to preserve a completely separate dataset for all density estimations, and we found that a 50:50 split works well. Secondly, we extended the use of DLR to higher dimensional transformations. Our results showed that 2D transformations can help to boost classification performance in several situations, but transformations beyond 2D do not appear to be worthwhile. We also observed that higher order transformations are likely to be more useful when the proportion of positive instances is higher. Thirdly, we demonstrated that the DLR technique can be used successfully in transfer learning situations.
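A minimal sketch of a DLR-style transformation follows, assuming the common variant that maps each feature to the log-ratio of class-conditional kernel density estimates before fitting an ordinary logistic regression. The synthetic dataset and the use of scipy's `gaussian_kde` are illustrative choices, not the thesis's exact procedure; the separate density-estimation split echoes the 50:50 recommendation above.

```python
import numpy as np
from scipy.stats import gaussian_kde
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=6, n_informative=4,
                           random_state=0)

# Keep a completely separate half for density estimation (the 50:50 split);
# the other half trains the classifier on the transformed features.
X_dens, X_fit, y_dens, y_fit = train_test_split(X, y, test_size=0.5,
                                                random_state=0)

def dlr_transform(X_in):
    """Map each feature to the log-ratio of class-conditional KDEs."""
    cols = []
    for j in range(X_in.shape[1]):
        kde_pos = gaussian_kde(X_dens[y_dens == 1, j])
        kde_neg = gaussian_kde(X_dens[y_dens == 0, j])
        cols.append(np.log(kde_pos(X_in[:, j]) + 1e-12)
                    - np.log(kde_neg(X_in[:, j]) + 1e-12))
    return np.column_stack(cols)

Z_fit = dlr_transform(X_fit)
model = LogisticRegression().fit(Z_fit, y_fit)
train_acc = model.score(Z_fit, y_fit)
print(f"training accuracy on DLR features: {train_acc:.3f}")
```

Because the densities are estimated on data the logistic regression never sees, the transformation avoids the leave-one-out robustness issue identified in the chapter.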

Chapters 5 and 6 discussed two applications of data mining techniques for predicting health-related outcomes.

In Chapter 5 we presented a study done in collaboration with the Liver Transplant Unit at Austin Health in Heidelberg, Australia, where we proposed that data mining techniques can be used to predict graft failure after liver transplantation. By comparing data mining approaches with well-known indexes such as the donor risk index, the model for end-stage liver disease score and the survival outcomes following liver transplantation score, we demonstrated that, using donor, transplant and recipient characteristics known at the decision time of a transplant, data mining approaches can achieve high accuracy in matching donors and recipients, potentially providing better graft survival outcomes. This approach may be used as a tool for transplant surgeons to improve organ allocation decisions. The ability to quantify risk may allow for improved confidence in the use of marginal organs and better outcomes following transplantation.

Chapter 6 focused on using data mining techniques to predict the serotypes of a bacterial species from DNA sequence data, without any prior knowledge about the gene cluster associated with the serotypes. In this work, using whole genome sequencing data from four publicly available Streptococcus pneumoniae datasets, we demonstrated that data mining approaches can be used to predict the serotypes of isolates with known serotypes, without any prior knowledge about the gene clusters associated with the serotypes. The results of our experiments also suggest that prediction accuracy improves when the training dataset contains enough records representing all the serotypes. Using the variable importance measures available in random forests, we demonstrated that a high proportion of the important variants for predicting the serotypes lie within the cps gene cluster, which is compatible with the current understanding. These results indicate that our approach would be useful for predicting serotypes of bacterial species where the association between the genotypes and the phenotypes is not known. Moreover, it would be useful for identifying such associations. Using one or more datasets for training random forests and evaluating on another dataset, we showed that the generalization performance for previously unseen datasets can be improved by using samples from a few different populations during training. Finally, by analyzing the differences in confidence levels between correct and incorrect predictions, we proposed that the accuracy of the learning models could be improved by choosing not to provide predictions when not confident, allowing the users to seek alternative methods for those isolates.

7.2 future directions

The studies presented in this thesis and their results open up some potentially useful directions for future work, which this section discusses.

In Chapter 3, we studied the problem of active feature acquisition for classification tasks, which is well suited to data mining models predicting health-related outcomes. In our experiments, we used datasets with numerical features only; however, the algorithm can easily be extended to datasets with categorical features. Our solution works by choosing a set of possible values for each candidate feature and trying each of them, selecting the feature with the highest expected confidence gain in the predictions. As such, the efficiency of this solution could be improved when choosing the next features to request for a set of incoming instances. However, the solution works efficiently for a single test instance at a time, which is the most likely scenario in practice. In our experiments, we considered the setting where one missing feature is acquired at a time. Nonetheless, in practice most laboratory tests are performed in blocks. Therefore, it would be interesting to investigate how our approach could be extended to incorporate such scenarios. We hope that these studies will be useful in the future, when clinical decision support systems become the norm, especially when the learning systems use many features as opposed to a few.

When using datasets for a data mining task, the whole dataset may not be usable for model development, due to incorrect and missing values. This issue is prominent in clinical and biomedical datasets, which are not collected for research purposes. One popular method of data pre-processing is to use the complete cases only, discarding the records with missing values. However, those discarded records contain useful information; if that information could be used directly or indirectly during model development, it would boost the performance of the classifiers. We believe that the density-based transformation technique discussed in Chapter 4 could utilize those records for bandwidth calculations and kernel density estimations.

A limitation of our study presented in Chapter 5 is that the data mining algorithm was derived from a smaller observational database. While the bootstrapping with replacement methodology is well validated for the development of robust predictive data mining models [Breiman, 1996; Austin and Tu, 2004], and our attempts to predict 3 month graft failure on a separate validation dataset look promising, prospective validation for 30 day graft failure would be valuable to confirm the predictive ability. Moreover, due to the limited number of graft failures within 30 days available in the study dataset and the validation dataset, it would be worthwhile to obtain data from other transplant centers and validate the findings in a multi-center study. Furthermore, similar studies could be conducted to predict other useful outcomes such as 3, 6 or 12 month graft failure, early graft dysfunction, acute/chronic rejection, infections, immunosuppression or late biliary strictures. We hope that these studies will contribute towards transforming current healthcare systems to be more personalized and data-driven.

Being able to identify the subgroups of bacterial isolates quickly and accurately is important for public health and epidemiological surveillance. As future work, we would like to investigate the issue of population structure within populations of microorganisms, which affects the generalization performance when predicting for an unseen population. It would also be worthwhile to investigate how this approach would work for other bacterial species, especially those in which the gene clusters associated with the serotypes are not known. Such systems, which can predict the serotypes of bacterial isolates with associated confidences, would enable better utilization of resources by employing alternative wet lab techniques only when the learning models are not confident.

In conclusion, we hope that the contributions of this thesis will lay fruitful foundations for a future where clinical decision support systems become the norm and healthcare systems are more personalized and data-driven.


REFERENCES

M. Achtman, J. Wain, F.-X. Weill, S. Nair, Z. Zhou, V. Sangal, M. G. Krauland, J. L. Hale, H. Harbottle, A. Uesbeck, G. Dougan, L. H. Harrison, S. Brisse, and the S. enterica MLST study group. Multilocus sequence typing as a replacement for serotyping in salmonella enterica. PLOS Pathogens, 8(6):1–19, 06 2012. doi: 10.1371/journal.ppat.1002776. URL https://doi.org/10.1371/journal.ppat.1002776. (Cited on page 97.)

E. Acuna and C. Rodriguez. The treatment of missing values and its effect on classifier accuracy. In Classification, clustering, and data mining applications, pages 639–647. Springer, 2004. (Cited on page 90.)

M. Adya and F. Collopy. How effective are neural networks at forecasting and prediction? A review and evaluation. Journal of Forecasting, 17:481–495, 1998. (Cited on pages 16 and 89.)

M. F. Akay. Support vector machines combined with feature selection for breast cancer diagnosis. Expert Systems with Applications, 36(2):3240–3247, 2009. (Cited on pages 17 and 30.)

J. Alcalá, A. Fernández, J. Luengo, J. Derrac, S. García, L. Sánchez, and F. Herrera. Keel data-mining software tool: Data set repository, integration of algorithms and experimental analysis framework. Journal of Multiple-Valued Logic and Soft Computing, 17:255–287, 2010. (Cited on pages 42 and 66.)

J. Allen, H. M. Davey, D. Broadhurst, J. K. Heald, J. J. Rowland, S. G. Oliver, and D. B. Kell. High-throughput classification of yeast mutants for functional genomics using metabolic footprinting. Nature Biotechnology, 21:692 EP –, 05 2003. URL http://dx.doi.org/10.1038/nbt823. (Cited on page 30.)


P. D. Allison. Missing data. Sage Thousand Oaks, CA, 2012. (Cited on page 27.)

D. Amaratunga, J. Cabrera, and Y.-S. Lee. Enriched random forests. Bioinformatics, 24(18):2010–2014, 2008. ISSN 1367-4803. (Cited on page 89.)

F. Amato, A. López, E. M. Peña-Méndez, P. Vaňhara, A. Hampl, and J. Havel. Artificial neural networks in medical diagnosis, 2013. (Cited on page 15.)

M. G. Amin, M. P. Wolf, J. A. TenBrook, R. B. Freeman, S. J. Cheng, D. S. Pratt, and J. B. Wong. Expanded criteria donor grafts for deceased donor liver transplantation under the meld system: a decision analysis. Liver Transplantation, 10(12):1468–1475, 2004. ISSN 1527-6473. (Cited on page 90.)

A. Anaissi, P. J. Kennedy, M. Goyal, and D. R. Catchpoole. A balanced iterative random forest for gene selection from microarray data. BMC Bioinformatics, 14(1):261, 2013. ISSN 1471-2105. (Cited on page 89.)

M. Anbarasi, E. Anupriya, and N. Iyengar. Enhanced prediction of heart disease with feature subset selection using genetic algorithm. International Journal of Engineering Science and Technology, 2(10):5370–5376, 2010. (Cited on pages 18 and 19.)

P. M. Ashton, S. Nair, T. M. Peters, J. A. Bale, D. G. Powell, A. Painset, R. Tewolde, U. Schaefer, C. Jenkins, T. J. Dallman, E. M. de Pinna, K. A. Grant, and S. W. G. S. I. Group. Identification of salmonella for public health surveillance using whole genome sequencing. PeerJ, 4:e1752, 2016. (Cited on pages 7, 93, 94, 97, and 116.)

P. Athamanolap, V. Parekh, S. I. Fraley, V. Agarwal, D. J. Shin, M. A. Jacobs, T.-H. Wang, and S. Yang. Trainable high resolution melt curve machine learning classifier for large-scale reliable genotyping of sequence variants. PLOS One, 9(10):e109094, 2014. (Cited on page 98.)

P. C. Austin and J. V. Tu. Bootstrap methods for developing predictive models. The American Statistician, 58(2):131–137, 2004. ISSN 0003-1305. (Cited on pages 92 and 125.)


P. C. Austin, J. V. Tu, J. E. Ho, D. Levy, and D. S. Lee. Using methods from the data-mining and machine-learning literature for disease classification and prediction: a case study examining classification of heart failure subtypes. Journal of Clinical Epidemiology, 66(4):398–407, 2013. (Cited on page 15.)

A. Avolio, M. Siciliano, R. Barbarino, E. Nure, B. Annicchiarico, A. Gasbarrini, S. Agnes, and M. Castagneto. Donor risk index and organ patient index as predictors of graft survival after liver transplantation. In Transplantation Proceedings, volume 40, pages 1899–1902. Elsevier, 2008. (Cited on page 90.)

A. W. Avolio, U. Cillo, M. Salizzoni, L. De Carlis, M. Colledan, G. E. Gerunda, V. Mazzaferro, G. Tisone, R. Romagnoli, and L. Caccamo. Balancing donor and recipient risk factors in liver transplantation: The value of d-meld with particular reference to hcv recipients. American Journal of Transplantation, 11(12):2724–2736, 2011. ISSN 1600-6143. (Cited on page 90.)

K. Bache and M. Lichman. UCI machine learning repository, 2013. URL http://archive.ics.uci.edu/ml. (Cited on pages 42 and 66.)

W. G. Baxt. Use of an artificial neural network for the diagnosis of myocardial infarction. Annals of Internal Medicine, 115(11):843–848, 1991. (Cited on pages 15 and 30.)

R. J. Bayardo and R. Agrawal. Data privacy through optimal k-anonymization. In Data Engineering, 2005. ICDE 2005. Proceedings. 21st International Conference on, pages 217–228. IEEE, 2005. (Cited on page 11.)

R. Bellazzi and B. Zupan. Predictive data mining in clinical medicine: current issues and guidelines. International Journal of Medical Informatics, 77(2):81–97, 2008. ISSN 1386-5056. (Cited on pages 1, 3, 11, 13, 17, 22, 78, and 94.)

R. Bellazzi, F. Ferrazzi, and L. Sacchi. Predictive data mining in clinical medicine: a focus on selected methods and applications. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 1(5):416–430, 2011. ISSN 1942-4795. doi: 10.1002/widm.23. URL http://dx.doi.org/10.1002/widm.23. (Cited on pages 2, 11, 12, and 94.)


S. Bhattacharyya. Confidence in predictions from random tree ensembles. Knowledge and Information Systems, 35(2):391–410, 2013. (Cited on pages 106 and 118.)

M. Bilgic and L. Getoor. Voila: Efficient feature-value acquisition for classification. In Proceedings of the National Conference on Artificial Intelligence, volume 22, page 1225. Menlo Park, CA; Cambridge, MA; London; AAAI Press; MIT Press; 1999, 2007. (Cited on page 37.)

M. Bilgic and L. Getoor. Value of information lattice: Exploiting probabilistic independence for effective feature subset acquisition. Journal of Artificial Intelligence Research (JAIR), 41:69–95, 2011. (Cited on page 37.)

B. E. Boser, I. M. Guyon, and V. N. Vapnik. A training algorithm for optimal margin classifiers. In Proceedings of the Fifth Annual Workshop on Computational Learning Theory, pages 144–152. ACM, 1992. (Cited on page 16.)

A. W. Bowman and A. Azzalini. Applied smoothing techniques for data analysis: the kernel approach with S-Plus illustrations. Oxford statistical science series: 18. Oxford: Clarendon Press; New York: Oxford University Press, 1997. ISBN 0198523963. (Cited on page 64.)

A. P. Bradley. The use of the area under the roc curve in the evaluation of machine learning algorithms. Pattern Recognition, 30(7):1145–1159, 1997. ISSN 0031-3203. (Cited on pages 25 and 82.)

A. Brazma and J. Vilo. Gene expression data analysis. FEBS Letters, 480(1):17–24, 2000. (Cited on pages 11 and 95.)

L. Breiman. Out-of-bag estimation. Report, Citeseer, 1996. (Cited on pages 22, 79, 92, 105, and 125.)

L. Breiman. Random forests. Machine Learning, 45(1):5–32, 2001. (Cited on pages 20, 38, 78, and 103.)

L. Breiman, J. Friedman, C. J. Stone, and R. A. Olshen. Classification and regression trees. CRC press, 1984. (Cited on page 19.)


J. Briceño, R. Ciria, and M. de la Mata. Donor-recipient matching: myths and realities. Journal of Hepatology, 58(4):811–820, 2013. (Cited on pages 76 and 90.)

J. Briceño, M. Cruz-Ramírez, M. Prieto, M. Navasa, J. Ortiz de Urbina, R. Orti, M.-Á. Gómez-Bravo, A. Otero, E. Varo, S. Tomé, G. Clemente, R. Bañares, R. Bárcena, V. Cuervas-Mons, G. Solórzano, C. Vinaixa, Á. Rubín, J. Colmenero, A. Valdivieso, R. Ciria, C. Hervás-Martínez, and M. de la Mata. Use of artificial intelligence as an innovative donor-recipient matching model for liver transplantation: Results from a multicenter spanish study. Journal of Hepatology, 61(5):1020–1028, 2014. (Cited on page 89.)

S. Brisse, S. Issenhuth-Jeanjean, and P. A. Grimont. Molecular serotyping of klebsiella species isolates by restriction of the amplified capsular antigen gene cluster. Journal of Clinical Microbiology, 42(8):3388–3398, 2004. (Cited on page 98.)

M. P. Brown, W. N. Grundy, D. Lin, N. Cristianini, C. W. Sugnet, T. S. Furey, M. Ares, and D. Haussler. Knowledge-based analysis of microarray gene expression data by using support vector machines. Proceedings of the National Academy of Sciences, 97(1):262–267, 2000. (Cited on pages 11 and 95.)

A. Bureau, J. Dupuis, K. Falls, K. L. Lunetta, B. Hayward, T. P. Keith, and P. Van Eerdewegh. Identifying snps predictive of phenotype using random forests. Genetic Epidemiology, 28(2):171–182, 2005. ISSN 1098-2272. doi: 10.1002/gepi.20041. URL http://dx.doi.org/10.1002/gepi.20041. (Cited on pages 106 and 114.)

H. B. Burke. Artificial neural networks for cancer research: outcome prediction. In Seminars in Surgical Oncology, volume 10, pages 73–79. Wiley Online Library, 1994. (Cited on pages 15 and 30.)

R. W. Busuttil and K. Tanaka. The utility of marginal donors in liver transplantation. Liver Transplantation, 9(7):651–663, 2003. (Cited on pages 6 and 76.)

L. Cao and C. Zhang. The evolution of kdd: Towards domain-driven data mining. International Journal of Pattern Recognition and Artificial Intelligence, 21(04):677–692, 2007. doi: 10.1142/S0218001407005612. URL http://www.worldscientific.com/doi/abs/10.1142/S0218001407005612. (Cited on page 12.)

N. Cesa-Bianchi and G. Lugosi. Prediction, learning, and games. Cambridge University Press, 2006. (Cited on page 41.)

R. Chattopadhyay, Q. Sun, W. Fan, I. Davidson, S. Panchanathan, and J. Ye. Multisource domain adaptation and its application to early detection of fatigue. ACM Transactions on Knowledge Discovery from Data (TKDD), 6(4):18, 2012. (Cited on page 31.)

N. V. Chawla and D. A. Davis. Bringing big data to personalized healthcare: a patient-centered framework. Journal of General Internal Medicine, 28(3):660–665, 2013. (Cited on page 12.)

W. Chen, Y. Chen, Y. Mao, and B. Guo. Density-based logistic regression. In KDD '13, 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 140–148, 2013a. (Cited on pages 5, 56, and 70.)

Y. Chen, R. J. Carroll, E. R. M. Hinz, A. Shah, A. E. Eyler, J. C. Denny, and H. Xu. Applying active learning to high-throughput phenotyping algorithms for electronic health records data. Journal of the American Medical Informatics Association, 20(e2):e253–e259, 2013b. (Cited on page 27.)

B. Cheng and D. M. Titterington. Neural networks: A review from a statistical perspective. Statistical Science, pages 2–30, 1994. ISSN 0883-4237. (Cited on pages 16 and 89.)

J. Cheng, A. Randall, and P. Baldi. Prediction of protein stability changes for single-site mutations using support vector machines. Proteins: Structure, Function, and Bioinformatics, 62(4):1125–1132, 2006. (Cited on page 17.)

C. Chewapreecha, S. R. Harris, N. J. Croucher, C. Turner, P. Marttinen, L. Cheng, A. Pessia, D. M. Aanensen, A. E. Mather, A. J. Page, S. J. Salter, D. Harris, F. Nosten, D. Goldblatt, J. Corander, J. Parkhill, P. Turner, and S. D. Bentley. Dense genomic sampling identifies highways of pneumococcal recombination. Nature Genetics, 46(3):305–309, 2014a. (Cited on pages 99, 100, and 114.)

C. Chewapreecha, P. Marttinen, N. J. Croucher, S. J. Salter, S. R. Harris, A. E. Mather, W. P. Hanage, D. Goldblatt, F. H. Nosten, C. Turner, P. Turner, S. D. Bentley, and J. Parkhill. Comprehensive identification of single nucleotide polymorphisms associated with beta-lactam resistance within pneumococcal mosaic genes. PLOS Genetics, 10(8):e1004547, 2014b. (Cited on pages 95 and 100.)

T. Churches and P. Christen. Some methods for blindfolded record linkage. BMC Medical Informatics and Decision Making, 4(1):9, 2004. doi: 10.1186/1472-6947-4-9. URL https://doi.org/10.1186/1472-6947-4-9. (Cited on page 10.)

K. J. Cios and G. W. Moore. Uniqueness of medical data mining. Artificial Intelligence in Medicine, 26(1):1–24, 2002. (Cited on pages 3, 12, 13, 26, 53, and 122.)

R. Clarke, H. W. Ressom, A. Wang, J. Xuan, M. C. Liu, E. A. Gehan, and Y. Wang. The properties of high-dimensional data spaces: implications for exploring gene and protein expression data. Nature Reviews Cancer, 8(1):37–49, 2008. (Cited on pages 11 and 95.)

F. F. Costa. Big data in biomedicine. Drug Discovery Today, 19(4):433–440, 2014. (Cited on pages 9 and 12.)

K. Croome, P. Marotta, W. Wall, C. Dale, M. Levstik, N. Chandok, and R. Hernandez-Alejandro. Should a lower quality organ go to the least sick patient? Model for end-stage liver disease score and donor risk index as predictors of early allograft dysfunction. In Transplantation Proceedings, volume 44, pages 1303–1306. Elsevier, 2012. (Cited on page 76.)

N. J. Croucher, J. A. Finkelstein, S. I. Pelton, P. K. Mitchell, G. M. Lee, J. Parkhill, S. D. Bentley, W. P. Hanage, and M. Lipsitch. Population genomics of post-vaccine changes in pneumococcal epidemiology. Nature Genetics, 45(6):656–663, 2013. (Cited on page 99.)


M. Cruz-Ramirez, C. Hervas-Martinez, J. C. Fernandez, J. Briceno, and M. De La Mata. Predicting patient survival after liver transplantation using evolutionary multi-objective artificial neural networks. Artificial Intelligence in Medicine, 58(1):37–49, 2013. (Cited on page 89.)

D. R. Cutler, T. C. Edwards, K. H. Beard, A. Cutler, K. T. Hess, J. Gibson, and J. J. Lawler. Random forests for classification in ecology. Ecology, 88(11):2783–2792, 2007. ISSN 1939-9170. (Cited on pages 21, 89, 90, and 105.)

M. Dash and H. Liu. Feature selection for classification. Intelligent Data Analysis, 1(1-4):131–156, 1997. (Cited on page 44.)

D. Delen, G. Walker, and A. Kadam. Predicting breast cancer survivability: a comparison of three data mining methods. Artificial Intelligence in Medicine, 34(2):113–127, 2005. (Cited on pages 15 and 19.)

J. Demšar. Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research, 7(Jan):1–30, 2006. (Cited on pages 70 and 103.)

N. M. Desai, K. C. Mange, M. D. Crawford, P. L. Abt, A. M. Frank, J. W. Markmann, E. Velidedeoglu, W. C. Chapman, and J. F. Markmann. Predicting outcome after liver transplantation: utility of the model for end-stage liver disease and a newly derived discrimination function. Transplantation, 77(1):99–106, 2004. (Cited on pages 77 and 90.)

M. desJardins, J. MacGlashan, and K. L. Wagstaff. Confidence-based feature acquisition to minimize training and test costs. In Proceedings of the 2010 SIAM International Conference on Data Mining, pages 514–524. SIAM, 2010. (Cited on page 37.)

A. Dhurandhar and K. Sankaranarayanan. Improving classification performance through selective instance completion. Machine Learning, 100(2-3):425–447, 2015. (Cited on page 36.)

S. Dubois, N. Romano, K. Jung, N. Shah, and D. C. Kale. The effectiveness of transfer learning in electronic health records data. 2017. (Cited on page 31.)

T. Duong and M. Hazelton. Plug-in bandwidth matrices for bivariate kernel density estimation. Journal of Nonparametric Statistics, 15(1):17–30, 2003. ISSN 1048-5252. (Cited on page 61.)

T. Duong and M. Hazelton. Cross-validation bandwidth matrices for multivariate kernel density estimation. Scandinavian Journal of Statistics, 32(3):485–506, 2005. ISSN 0303-6898. (Cited on page 61.)

P. Dutkowski, C. E. Oberkofler, K. Slankamenac, M. A. Puhan, E. Schadde, B. Müllhaupt, A. Geier, and P. A. Clavien. Are there better guidelines for allocation in liver transplantation?: A novel score targeting justice and utility in the model for end-stage liver disease era. Annals of Surgery, 254(5):745–754, 2011. ISSN 0003-4932. (Cited on page 90.)

I. Dvorchik, M. Subotin, W. Marsh, J. McMichael, and J. Fung. Performance of multilayer feedforward neural networks to predict liver transplantation outcome. Methods of Information in Medicine, 35(1):12–18, 1996. ISSN 0026-1270. (Cited on page 89.)

B. Efron. Estimating the error rate of a prediction rule: improvement on cross-validation. Journal of the American Statistical Association, 78(382):316–331, 1983. (Cited on page 79.)

B. Efron and R. Tibshirani. Improvements on cross-validation: the .632+ bootstrap method. Journal of the American Statistical Association, 92(438):548–560, 1997. (Cited on page 104.)

N. Esfandiari, M. R. Babavalian, A.-M. E. Moghadam, and V. K. Tabar. Knowledge discovery in medicine: Current issue and future trend. Expert Systems with Applications, 41(9):4434–4463, 2014. (Cited on pages 10, 13, 19, and 28.)

Y. Fan, T. B. Murphy, J. C. Byrne, L. Brennan, J. M. Fitzpatrick, and R. W. G. Watson. Applying random forests to identify biomarker panels in serum 2d-dige data for the detection and staging of prostate cancer. Journal of Proteome Research, 10(3):1361–1373, 2011. (Cited on page 21.)

U. Fayyad and K. Irani. Multi-interval discretization of continuous-valued attributes for classification learning. In Proceedings of the 13th International Joint Conference on Artificial Intelligence, pages 1022–1027, 1993. (Cited on page 44.)

U. M. Fayyad. Data mining and knowledge discovery: Making sense out of data. IEEE Expert: Intelligent Systems and Their Applications, 11(5):20–25, 1996. (Cited on page 77.)

S. Feng, N. Goodrich, J. Bragg-Gresham, D. Dykstra, J. Punch, M. DebRoy, S. Greenstein, and R. Merion. Characteristics associated with liver graft failure: the concept of a donor risk index. American Journal of Transplantation, 6(4):783–790, 2006. (Cited on pages 76, 80, 81, and 91.)

M. Fernández-Delgado, E. Cernadas, S. Barro, and D. Amorim. Do we need hundreds of classifiers to solve real world classification problems? Journal of Machine Learning Research, 15(1):3133–3181, 2014. (Cited on page 89.)

R. G. Fichman, R. Kohli, and R. Krishnan. Editorial overview–the role of information systems in healthcare: current research and future trends. Information Systems Research, 22(3):419–428, 2011. (Cited on page 12.)

J. Friedman, T. Hastie, and R. Tibshirani. Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1):1, 2010. (Cited on page 67.)

M. W. Gardner and S. Dorling. Artificial neural networks (the multilayer perceptron)–a review of applications in the atmospheric sciences. Atmospheric Environment, 32(14):2627–2636, 1998. ISSN 1352-2310. (Cited on pages 16 and 89.)

A. P. Gasch and M. B. Eisen. Exploring the conditional coregulation of yeast gene expression through fuzzy k-means clustering. Genome Biology, 3(11):research0059.1, 2002. (Cited on pages 11 and 95.)

S. B. Gaul, S. Wedel, M. M. Erdman, D. Harris, I. T. Harris, K. E. Ferris, and L. Hoffman. Use of pulsed-field gel electrophoresis of conserved xbai fragments for identification of swine salmonella serotypes. Journal of Clinical Microbiology, 45(2):472–476, 2007. (Cited on page 98.)

X. Geng, T.-Y. Liu, T. Qin, and H. Li. Feature selection for ranking. In Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '07, pages 407–414, New York, NY, USA, 2007. ACM. ISBN 978-1-59593-597-7. doi: 10.1145/1277741.1277811. URL http://doi.acm.org/10.1145/1277741.1277811. (Cited on page 44.)

D. Ghosh and A. M. Chinnaiyan. Classification and selection of biomarkers in genomic data using lasso. BioMed Research International, 2005(2):147–154, 2005. (Cited on page 29.)

J. W. Graham. Missing data analysis: Making it work in the real world. Annual Review of Psychology, 60:549–576, 2009. (Cited on pages 27 and 34.)

K. Gray, P. Aljabar, R. Heckemann, A. Hammers, and D. Rueckert. Random forest-based similarity measures for multi-modal classification of alzheimer's disease. NeuroImage, 65, 2012. (Cited on page 21.)

D. H. Gustafson. Length of stay prediction and explanation. Health Services Research, 3(1):12–34, 1968 Spring. (Cited on page 11.)

I. Guyon and A. Elisseeff. An introduction to variable and feature selection. Journal of Machine Learning Research, 3(Mar):1157–1182, 2003. (Cited on page 44.)

I. Guyon and A. Elisseeff. An Introduction to Feature Extraction, pages 1–25. Springer, 2006. (Cited on page 30.)

I. Guyon, J. Weston, S. Barnhill, and V. Vapnik. Gene selection for cancer classification using support vector machines. Machine Learning, 46(1):389–422, 2002. (Cited on pages 17 and 30.)

J. Halldorson, R. Bakthavatsalam, O. Fix, J. Reyes, and J. Perkins. D-meld, a simple predictor of post liver transplant mortality for optimization of donor/recipient matching. American Journal of Transplantation, 9(2):318–326, 2009. (Cited on pages 76 and 90.)

A. Hapfelmeier, T. Hothorn, K. Ulm, and C. Strobl. A new variable importance measure for random forests with missing data. Statistics and Computing, 24(1):21–34, 2014. ISSN 0960-3174. (Cited on pages 21, 38, 79, and 80.)

P. Harper. A review and comparison of classification algorithms for medical decision making. Health Policy, 71(3):315–331, 2005 Mar. (Cited on pages 3, 13, 106, and 118.)

A.-C. Haury, P. Gestraud, and J.-P. Vert. The influence of feature selection methods on accuracy, stability and interpretability of molecular signatures. PLOS One, 6(12):e28210, 2011. (Cited on pages 28 and 29.)

H. He and E. A. Garcia. Learning from imbalanced data. IEEE Transactions on Knowledge and Data Engineering, 21(9):1263–1284, 2009. (Cited on page 25.)

K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016. (Cited on page 16.)

M. A. Hearst, S. T. Dumais, E. Osuna, J. Platt, and B. Scholkopf. Support vector machines. IEEE Intelligent Systems and Their Applications, 13(4):18–28, 1998. (Cited on pages 17 and 30.)

G. E. Hinton and R. R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006. (Cited on page 30.)

S. C. Hoi, R. Jin, J. Zhu, and M. R. Lyu. Batch mode active learning and its application to medical image classification. In Proceedings of the 23rd International Conference on Machine Learning, pages 417–424. ACM, 2006. (Cited on page 27.)

J. J. Hopfield. Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences, 79(8):2554–2558, 1982. (Cited on page 15.)

J. J. Hopfield. Neurons with graded response have collective computational properties like those of two-state neurons. Proceedings of the National Academy of Sciences, 81(10):3088–3092, 1984. (Cited on page 15.)

T. Hothorn, K. Hornik, C. Strobl, and A. Zeileis. Party: a laboratory for recursive partytioning, 2010. (Cited on page 80.)

J. Huang and C. X. Ling. Using auc and accuracy in evaluating learning algorithms. IEEE Transactions on Knowledge and Data Engineering, 17(3):299–310, 2005. (Cited on pages 103 and 118.)

V. Huddar, B. K. Desiraju, V. Rajan, S. Bhattacharya, S. Roy, and C. K. Reddy. Predicting complications in critical care using heterogeneous clinical data. IEEE Access, 4:7988–8001, 2016. (Cited on page 9.)

J. Iavindrasana, G. Cohen, A. Depeursinge, H. Müller, R. Meyer, and A. Geissbuhler. Clinical data mining: a review. pages 121–133, 2009. (Cited on page 3.)

G. N. Ioannou. Development and validation of a model predicting graft survival after liver transplantation. Liver Transplantation, 12(11):1594–1606, 2006. (Cited on pages 76, 90, and 91.)

S. Janitza, C. Strobl, and A.-L. Boulesteix. An auc-based permutation variable importance measure for random forests. BMC Bioinformatics, 14(1):1, 2013. (Cited on page 80.)

S. Ji and L. Carin. Cost-sensitive feature acquisition and classification. Pattern Recognition, 40(5):1474–1485, 2007. (Cited on page 36.)

T. Joachims. Optimizing search engines using clickthrough data. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 133–142. ACM, 2002. (Cited on page 77.)

K. G. Joensen, A. M. Tetzschner, A. Iguchi, F. M. Aarestrup, and F. Scheutz. Rapid and easy in silico serotyping of escherichia coli using whole genome sequencing (wgs) data. Journal of Clinical Microbiology, pages JCM–00008, 2015. (Cited on page 98.)

A. E. Johnson, T. J. Pollard, L. Shen, H. L. Li-wei, M. Feng, M. Ghassemi, B. Moody, P. Szolovits, L. A. Celi, and R. G. Mark. Mimic-iii, a freely accessible critical care database. Scientific Data, 3:160035, 2016. (Cited on page 53.)

K. A. Jolley and M. C. Maiden. Bigsdb: scalable analysis of bacterial genome variation at the population level. BMC Bioinformatics, 11(1):595, 2010. (Cited on page 95.)

P. S. Kamath and W. Kim. The model for end-stage liver disease (meld). Hepatology, 45(3):797–805, 2007. (Cited on page 77.)

P. S. Kamath, R. H. Wiesner, M. Malinchoc, W. Kremers, T. M. Therneau, C. L. Kosberg, G. D'Amico, E. R. Dickson, and W. Kim. A model to predict survival in patients with end-stage liver disease. Hepatology, 33(2):464–470, 2001. (Cited on page 77.)

P. Kanani and P. Melville. Prediction-time active feature-value acquisition for cost-effective customer targeting. Advances in Neural Information Processing Systems, 2008. (Cited on page 37.)

G. Kapatai, C. L. Sheppard, A. Al-Shahib, D. J. Litt, A. P. Underwood, T. G. Harrison, and N. K. Fry. Whole genome sequencing of streptococcus pneumoniae: development, evaluation and verification of targets for serogroup and serotype prediction using an automated pipeline. PeerJ, 4:e2477, 2016. (Cited on pages 94, 95, 98, 99, 104, 106, 107, and 116.)

A. Kapoor and E. Horvitz. Breaking boundaries: Active information acquisition across learning and diagnosis. Advances in Neural Information Processing Systems, 22:898–906, 2009. (Cited on page 37.)

H. Kaur and S. K. Wasan. Empirical study on applications of data mining techniques in healthcare. Journal of Computer Science, 2(2):194–200, 2006. (Cited on page 10.)

M. Kaur, H. Gulati, and H. Kundra. Data mining in agriculture on crop price prediction: Techniques and applications. International Journal of Computer Applications, 99(12):1–3, 2014. (Cited on page 77.)

M. Khalilia, S. Chakraborty, and M. Popescu. Predicting disease risks from highly imbalanced data using random forest. BMC Medical Informatics and Decision Making, 11(1):51, 2011. (Cited on page 21.)

K. Kira and L. A. Rendell. The feature selection problem: Traditional methods and a new algorithm. In Proceedings of the Tenth National Conference on Artificial Intelligence, AAAI'92, pages 129–134. AAAI Press, 1992. ISBN 0-262-51063-4. URL http://dl.acm.org/citation.cfm?id=1867135.1867155. (Cited on pages 29 and 44.)

D. G. Kleinbaum and M. Klein. Logistic regression: a self-learning text. Springer, 2011. (Cited on pages 14 and 15.)

R. Kohavi. A study of cross-validation and bootstrap for accuracy estimation and model selection. In Proceedings of the 14th International Joint Conference on Artificial Intelligence - Volume 2, IJCAI'95, pages 1137–1143, San Francisco, CA, USA, 1995. Morgan Kaufmann Publishers Inc. ISBN 1-55860-363-8. URL http://dl.acm.org/citation.cfm?id=1643031.1643047. (Cited on page 104.)

M. H. Kollef, K. Heard, Y. Chen, C. Lu, N. Martin, and T. Bailey. Mortality and length of stay trends following implementation of a rapid response system and real-time automated clinical deterioration alerts. American Journal of Medical Quality, 32(1):12–18, 2017. doi: 10.1177/1062860615613841. URL https://doi.org/10.1177/1062860615613841. PMID: 26566998. (Cited on page 11.)

I. Kononenko. Estimating attributes: analysis and extensions of relief. In European Conference on Machine Learning, pages 171–182. Springer, 1994. (Cited on pages 29 and 44.)

I. Kononenko. Machine learning for medical diagnosis: history, state of the art and perspective. Artificial Intelligence in Medicine, 23(1):89–109, 2001 Aug. (Cited on pages 3, 9, 13, 16, and 77.)

S. Kotsiantis, D. Kanellopoulos, and P. Pintelas. Data preprocessing for supervised learning. International Journal of Computer Science, 1(2):111–117, 2006. (Cited on page 28.)

M. Kukar, I. Kononenko, C. Grošelj, K. Kralj, and J. Fettich. Analysing and improving the diagnosis of ischaemic heart disease with machine learning. Artificial Intelligence in Medicine, 16(1):25–50, 1999. (Cited on page 11.)

A. Kusiak. Feature transformation methods in data mining. IEEE Transactions on Electronics Packaging Manufacturing, 24(3):214–221, 2001. (Cited on page 30.)

P. Langley. Machine learning for adaptive user interfaces. In Annual Conference on Artificial Intelligence, pages 53–62. Springer, 1997. (Cited on page 77.)

B. Langmead and S. L. Salzberg. Fast gapped-read alignment with bowtie 2. Nature Methods, 9(4):357–359, 2012. (Cited on page 100.)

Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. Nature, 521(7553):436–444, 2015. (Cited on page 16.)

G. Leroy, T. Miller, G. Rosemblat, and A. Browne. A balanced approach to health information evaluation: A vocabulary-based naïve bayes classifier and readability formulas. Journal of the Association for Information Science and Technology, 59(9):1409–1419, 2008. (Cited on page 18.)

H. Li, B. Handsaker, A. Wysoker, T. Fennell, J. Ruan, N. Homer, G. Marth, G. Abecasis, and R. Durbin. The sequence alignment/map format and samtools. Bioinformatics, 25(16):2078–2079, 2009. (Cited on page 100.)

J.-S. Li, Y. Tian, Y.-F. Liu, T. Shu, and M.-H. Liang. Applying a bp neural network model to predict the length of hospital stay. In Health Information Science, volume 7798, pages 18–29, 2013 Mar. (Cited on page 11.)

A. Liaw and M. Wiener. Classification and regression by randomForest. R News, 2(3):18–22, 2002. (Cited on pages 20, 21, 78, 79, 103, and 105.)

F. Lin. The role of data mining in clinical predictive medicine: a narrative review. In HIC 2006 Bridging the Digital Divide: Clinician, consumer and computer, pages 169–174. Health Informatics Society of Australia Ltd (HISA), 2006. (Cited on pages 12, 21, and 26.)

L. Linares, G. Sanclemente, C. Cervera, I. Hoyo, F. Cofán, M. J. Ricart, F. Pérez-Villa, M. Navasa, M. A. Marcos, A. Antón, T. Pumarola, and A. Moreno. Influence of cytomegalovirus disease in outcome of solid organ transplant patients. In Transplantation Proceedings, volume 43, pages 2145–2148. Elsevier, 2011. (Cited on page 91.)

C. X. Ling, Q. Yang, J. Wang, and S. Zhang. Decision trees with minimal costs. In Proceedings of the 21st International Conference on Machine Learning, pages 544–551, 2004. (Cited on pages 33, 35, 36, and 44.)

C. X. Ling, V. S. Sheng, and Q. Yang. Test strategies for cost-sensitive decision trees. IEEE Transactions on Knowledge and Data Engineering, 18(8):1055–1067, 2006. (Cited on page 35.)

R. J. Little, R. D'Agostino, M. L. Cohen, K. Dickersin, S. S. Emerson, J. T. Farrar, C. Frangakis, J. W. Hogan, G. Molenberghs, S. A. Murphy, J. D. Neaton, A. Rotnitzky, D. Scharfstein, W. J. Shih, J. P. Siegel, and H. Stern. The prevention and treatment of missing data in clinical trials. New England Journal of Medicine, 367(14):1355–1360, 2012. (Cited on page 27.)

H. Liu and H. Motoda. Feature transformation and subset selection. IEEE Intelligent Systems and Their Applications, 13(2):26–28, 1998. (Cited on page 28.)

P. Liu, L. Lei, J. Yin, W. Zhang, W. Naijun, and E. El-Darzi. Healthcare data mining: Prediction inpatient length of stay. In Intelligent Systems, 2006 3rd International IEEE Conference on, pages 832–837, 2006 Sept. (Cited on page 3.)

E. Loekito, J. Bailey, R. Bellomo, G. K. Hart, C. Hegarty, P. Davey, C. Bain, D. Pilcher, and H. Schneider. Common laboratory tests predict imminent medical emergency team calls, intensive care unit admission or death in emergency department patients. Emergency Medicine Australasia, 25(2):132–139, 2013 Apr. (Cited on pages 10 and 15.)

E. Loekito, J. Bailey, R. Bellomo, G. K. Hart, C. Hegarty, P. Davey, C. Bain, D. Pilcher, and H. Schneider. Common laboratory tests predict imminent death in ward patients. Resuscitation, 84(3):280–285, 2013 Mar. (Cited on pages 11, 15, and 42.)

J. Luengo, S. García, and F. Herrera. On the choice of the best imputation methods for missing values considering three groups of classification methods. Knowledge and Information Systems, 32(1):77–108, 2012. (Cited on page 27.)

K. L. Lunetta, L. Hayward, J. Segal, and P. Van Eerdewegh. Screening large-scale association study data: exploiting interactions using random forests. BMC Genetics, 5(1):1, 2004. ISSN 1471-2156. (Cited on pages 43 and 80.)

J. Luo, M. Wu, D. Gopukumar, and Y. Zhao. Big data application in biomedical research and health care: A literature review. Biomedical Informatics Insights, 8:1, 2016. (Cited on pages 9 and 10.)

S. Ma and J. Huang. Penalized feature selection and classification in bioinformatics. Briefings in Bioinformatics, 9(5):392–403, 2008. (Cited on pages 28 and 29.)

Y. Ma, G. Luo, X. Zeng, and A. Chen. Transfer learning for cross-company software defect prediction. Information and Software Technology, 54(3):248–256, 2012. (Cited on page 66.)

L. Mataya, A. Aronsohn, J. R. Thistlethwaite, and L. Friedman Ross. Decision making in liver transplantation–limited application of the liver donor risk index. Liver Transplantation, 20(7):831–837, 2014. (Cited on pages 76 and 90.)

R. Mateo, Y. Cho, G. Singh, M. Stapfer, J. Donovan, J. Kahn, T.-L. Fong, L. Sher, N. Jabbour, S. Aswad, R. R. Selby, and Y. Genyk. Risk factors for graft survival after liver transplantation from donation after cardiac death donors: an analysis of optn/unos data. American Journal of Transplantation, 6(4):791–796, 2006. (Cited on page 91.)

S. Matis, H. Doyle, I. Marino, R. Mural, and E. Uberbacher. Use of neural networks for prediction of graft failure following liver transplantation. In Computer-Based Medical Systems, 1995, Proceedings of the Eighth IEEE Symposium on, pages 133–140. IEEE, 1995. (Cited on page 89.)

A. Mavroidi, D. M. Aanensen, D. Godoy, I. C. Skovsted, M. S. Kaltoft, P. R. Reeves, S. D. Bentley, and B. G. Spratt. Genetic relatedness of the streptococcus pneumoniae capsular biosynthetic loci. Journal of Bacteriology, 189(21):7841–7855, 2007. (Cited on pages 95 and 111.)

P. Melville, M. Saar-Tsechansky, F. Provost, and R. Mooney. Active feature-value acquisition for classifier induction. In Data Mining, 2004. ICDM '04. Fourth IEEE International Conference on, pages 483–486, 2004. (Cited on pages 5, 27, 34, 36, and 121.)

P. Melville, F. Provost, M. Saar-Tsechansky, and R. Mooney. Economical active feature-value acquisition through expected utility estimation. In Proceedings of the 1st International Workshop on Utility-based Data Mining, pages 10–16. ACM, 2005. (Cited on page 36.)

R. M. Merion. When is a patient too well and when is a patient too sick for a liver transplant? Liver Transplantation, 10(S10), 2004. (Cited on pages 6, 75, and 77.)

B. Miljković-Selimović, B. Kocić, T. Babić, and L. Ristić. Bacterial typing methods. Acta Facultatis Medicae Naissensis, 26(4), 2009. (Cited on page 94.)

B. D. Mittelstadt and L. Floridi. The ethics of big data: Current and foreseeable issues in biomedical contexts. In The Ethics of Biomedical Big Data, pages 445–480. Springer, 2016. (Cited on pages 9 and 12.)

D. E. Moore, I. D. Feurer, T. Speroff, D. L. Gorden, J. K. Wright, R. S. Chari, and C. W. Pinson. Impact of donor, technical, and recipient risk factors on survival and quality of life after liver transplantation. Archives of Surgery, 140(3):273–277, 2005. (Cited on page 91.)

F. Nan, J. Wang, and V. Saligrama. Feature-budgeted random forest. In Proceedings of the 32nd International Conference on Machine Learning, ICML'15, pages 1983–1991, 2015. (Cited on page 37.)

C. W. Olanow and W. C. Koller. An algorithm (decision tree) for the management of parkinson's disease treatment guidelines. Neurology, 50(3 Suppl 3):S1–S1, 1998. (Cited on page 19.)

R. Pai, R. E. Gertz, and B. Beall. Sequential multiplex pcr approach for determining capsular serotypes of streptococcus pneumoniae isolates. Journal of Clinical Microbiology, 44(1):124–131, 2006. (Cited on pages 98 and 107.)

S. Palaniappan and R. Awang. Intelligent heart disease prediction system using data mining techniques. In Computer Systems and Applications, 2008. AICCSA 2008. IEEE/ACS International Conference on, pages 108–115. IEEE, 2008. (Cited on page 18.)

S. J. Pan and Q. Yang. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22(10):1345–1359, 2010. (Cited on pages 30, 65, and 96.)

A. Pantanowitz and T. Marwala. Missing data imputation through the use of the Random Forest Algorithm, pages 53–62. Springer, 2009. ISBN 3642031552. (Cited on page 90.)

S. A. Pattekari and A. Parveen. Prediction system for heart disease using naïve bayes. International Journal of Advanced Computer and Mathematical Sciences, 3(3):290–294, 2012. (Cited on page 18.)

M. Pedersen and A. Seetharam. Infections after orthotopic liver transplantation. Journal of Clinical and Experimental Hepatology, 4(4):347–360, 2014. ISSN 0973-6883. (Cited on page 91.)

K. Polat and S. Güneş. Breast cancer diagnosis using least square support vector machine. Digital Signal Processing, 17(4):694–701, 2007. (Cited on page 11.)

F. Provost and T. Fawcett. Analysis and visualization of classifier performance: comparison under imprecise class and cost distributions. In Proceedings of the Third International Conference on Knowledge Discovery and Data Mining, pages 43–48. AAAI Press, 1997. (Cited on page 24.)

J. R. Quinlan. C4.5: Programs for Machine Learning. The Morgan Kaufmann Series in Machine Learning. Morgan Kaufmann, San Mateo, CA, 1993. (Cited on page 19.)

L. M. Raamsdonk, B. Teusink, D. Broadhurst, N. Zhang, A. Hayes, M. C. Walsh, J. A. Berden, K. M. Brindle, D. B. Kell, J. J. Rowland, H. V. Westerhoff, K. van Dam, and S. G. Oliver. A functional genomics strategy that uses metabolome data to reveal the phenotype of silent mutations. Nature Biotechnology, 19:45–50, 2001. URL http://dx.doi.org/10.1038/83496. (Cited on page 30.)

L. E. Raileanu and K. Stoffel. Theoretical comparison between the gini index and information gain criteria. Annals of Mathematics and Artificial Intelligence, 41(1):77–93, 2004. (Cited on pages 29 and 44.)

A. Rana, M. Hardy, K. Halazun, D. Woodland, L. Ratner, B. Samstein, J. Guarrera, R. Brown Jr, and J. Emond. Survival outcomes following liver transplantation (soft) score: a novel method to predict patient survival following liver transplantation. American Journal of Transplantation, 8(12):2537–2546, 2008. (Cited on pages 76, 81, and 90.)

P. Ray, Y. Le Manach, B. Riou, and T. T. Houle. Statistical evaluation of a biomarker. The Journal of the American Society of Anesthesiologists, 112(4):1023–1040, 2010. (Cited on pages 22, 25, 26, and 82.)

C. K. Reddy and C. C. Aggarwal. Healthcare data analytics, volume 36. CRC Press, 2015. (Cited on pages 9 and 10.)

F. Rosenblatt. Principles of neurodynamics: perceptrons and the theory of brain mechanisms. Technical report, Cornell Aeronautical Laboratory, Buffalo, NY, 1961. (Cited on page 15.)

M. Rosenblatt. Remarks on some nonparametric estimates of a density function. The Annals of Mathematical Statistics, pages 832–837, 1956. ISSN 0003-4851. (Cited on page 60.)

C. Rudin. Algorithms for interpretable machine learning. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1519–1519. ACM, 2014. (Cited on page 3.)

D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning internal representations by error propagation. Technical report, Institute for Cognitive Science, University of California, San Diego, La Jolla, 1985. (Cited on page 15.)

M. Saar-Tsechansky, P. Melville, and F. Provost. Active feature-value acquisition. Management Science, 55(4):664–684, 2009. (Cited on pages 5, 34, and 36.)

Y. Saeys, I. Inza, and P. Larrañaga. A review of feature selection techniques in bioinformatics. Bioinformatics, 23(19):2507–2517, 2007 Oct. (Cited on pages 28 and 29.)

T. Saito and M. Rehmsmeier. The precision-recall plot is more informative than the roc plot when evaluating binary classifiers on imbalanced datasets. PLOS One, 10(3):e0118432, 2015. (Cited on page 25.)

J. L. Schafer and J. W. Graham. Missing data: our view of the state of the art. Psychological Methods, 7(2):147, 2002. (Cited on page 90.)

B. Settles. Active learning literature survey. Computer Sciences Technical Report 1648, University of Wisconsin–Madison, 2009. (Cited on pages 27 and 36.)

S. Sheng, C. X. Ling, and Q. Yang. Simple test strategies for cost-sensitive decision trees. In European Conference on Machine Learning, pages 365–376. Springer, 2005. (Cited on page 35.)

V. S. Sheng and C. X. Ling. Feature value acquisition in testing: a sequential batch test algorithm. In Proceedings of the 23rd International Conference on Machine Learning, pages 809–816. ACM, 2006. (Cited on pages 35, 36, and 44.)

M. J. Sigakis, E. A. Bittner, and J. P. Wanderer. Validation of a risk stratification index and risk quantification index for predicting patient outcomes: In-hospital mortality, 30-day mortality, 1-year mortality, and length-of-stay. Anesthesiology, 2013. doi: 10.1097/ALN.0b013e31829ce6e6. (Cited on page 11.)

B. W. Silverman. Density estimation for statistics and data analysis. Monographs on Statistics and Applied Probability 26. Chapman and Hall, 1986. ISBN 0412246201. (Cited on page 60.)

J. M. Smith, E. J. Feil, and N. H. Smith. Population structure and evolutionary dynamics of pathogenic bacteria. BioEssays, 22(12):1115–1122, 2000. ISSN 1521-1878. doi: 10.1002/1521-1878(200012)22:12<1115::AID-BIES9>3.0.CO;2-R. URL http://dx.doi.org/10.1002/1521-1878(200012)22:12<1115::AID-BIES9>3.0.CO;2-R. (Cited on page 96.)

Y. Song, T. Yue, H. Wang, J. Li, and H. Gao. Disease prediction based on transfer learning in individual healthcare. In International Conference of Pioneering Computer Scientists, Engineers and Educators, pages 110–122. Springer, 2017. (Cited on page 31.)

W. W. Soon, M. Hariharan, and M. P. Snyder. High-throughput sequencing for biology and medicine. Molecular Systems Biology, 9(1):640, 2013. (Cited on page 95.)

A. Statnikov, L. Wang, and C. F. Aliferis. A comprehensive comparison of random forests and support vector machines for microarray-based cancer classification. BMC Bioinformatics, 9(1):319, 2008. (Cited on page 17.)

J. Stephan, O. Stegle, and A. Beyer. A random forest approach to capture genetic effects in the presence of population structure. Nature Communications, 6:7432, 2015. (Cited on page 96.)

Z. D. Stephens, S. Y. Lee, F. Faghri, R. H. Campbell, C. Zhai, M. J. Efron, R. Iyer, M. C. Schatz, S. Sinha, and G. E. Robinson. Big data: astronomical or genomical? PLOS Biology, 13(7):e1002195, 2015. (Cited on page 95.)

E. W. Steyerberg, F. E. Harrell, G. J. Borsboom, M. Eijkemans, Y. Vergouwe, and J. D. F. Habbema. Internal validation of predictive models: efficiency of some procedures for logistic regression analysis. Journal of Clinical Epidemiology, 54(8):774–781, 2001. (Cited on page 22.)

E. W. Steyerberg, A. J. Vickers, N. R. Cook, T. Gerds, M. Gonen, N. Obuchowski, M. J. Pencina, and M. W. Kattan. Assessing the performance of prediction models: a framework for some traditional and novel measures. Epidemiology, 21(1):128, 2010. (Cited on pages 22, 25, 45, and 82.)

S. Szymczak, J. M. Biernacka, H. J. Cordell, O. González-Recio, I. R. König, H. Zhang,and Y. V. Sun. Machine learning in genome-wide association studies. Genetic Epi-demiology, 33(S1), 2009. (Cited on pages 11 and 94.)

T. SÃÿrlie, C. M. Perou, R. Tibshirani, T. Aas, S. Geisler, H. Johnsen, T. Hastie, M. B.Eisen, M. van de Rijn, S. S. Jeffrey, T. Thorsen, H. Quist, J. C. Matese, P. O. Brown,D. Botstein, P. E. LÃÿnning, and A.-L. BÃÿrresen-Dale. Gene expression patterns ofbreast carcinomas distinguish tumor subclasses with clinical implications. Proceedingsof the National Academy of Sciences, 98(19):10869–10874, 2001. (Cited on pages 11and 95.)

P. Tamayo, D. Slonim, J. Mesirov, Q. Zhu, S. Kitareewan, E. Dmitrovsky, E. S. Lander,and T. R. Golub. Interpreting patterns of gene expression with self-organizing maps:methods and application to hematopoietic differentiation. Proceedings of the NationalAcademy of Sciences, 96(6):2907–2912, 1999. (Cited on pages 11 and 95.)

J. Tang, S. Alelyani, and H. Liu. Feature selection for classification: A review. Data Classification: Algorithms and Applications, page 37, 2014. (Cited on pages 28 and 29.)

A. J. Tector, R. S. Mangus, P. Chestovich, R. Vianna, J. A. Fridell, M. L. Milgrom, C. Sanders, and P. Y. Kwo. Use of extended criteria livers decreases wait time for liver transplantation without adversely impacting posttransplant survival. Annals of Surgery, 244(3):439–450, 2006. (Cited on pages 6 and 76.)

C. Teddlie and F. Yu. Mixed methods sampling: A typology with examples. Journal of Mixed Methods Research, 1(1):77–100, 2007. (Cited on page 104.)

H. Temurtas, N. Yumusak, and F. Temurtas. A comparative study on diabetes disease diagnosis using neural networks. Expert Systems with Applications, 36(4):8610–8615, 2009. (Cited on page 11.)

F. C. Tenover, R. D. Arbeit, R. V. Goering, P. A. Mickelsen, B. E. Murray, D. H. Persing, and B. Swaminathan. Interpreting chromosomal DNA restriction patterns produced by pulsed-field gel electrophoresis: criteria for bacterial strain typing. Journal of Clinical Microbiology, 33(9):2233, 1995. (Cited on pages 7 and 94.)

M. Thahir, T. Sharma, and M. K. Ganapathiraju. An efficient heuristic method for active feature acquisition and its application to protein-protein interaction prediction. In BMC Proceedings, volume 6, page S2. BioMed Central, 2012. (Cited on page 36.)

R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), pages 267–288, 1996. (Cited on page 64.)

B. Turhan, T. Menzies, A. B. Bener, and J. Di Stefano. On the relative value of cross-company and within-company data for defect prediction. Empirical Software Engineering, 14(5):540–578, 2009. (Cited on page 72.)

A. van Belkum, P. T. Tassios, L. Dijkshoorn, S. Haeggman, B. Cookson, N. K. Fry, V. Fussing, J. Green, E. Feil, P. Gerner-Smidt, S. Brisse, and M. Struelens. Guidelines for the validation and application of typing methods for use in bacterial epidemiology. Clinical Microbiology and Infection, 13:1–46, 2007. doi: 10.1111/j.1469-0691.2007.01786.x. URL http://dx.doi.org/10.1111/j.1469-0691.2007.01786.x. (Cited on pages 7 and 94.)

V. N. Vapnik. The Nature of Statistical Learning Theory. Springer, New York, 1995. (Cited on page 16.)

A. Vellido, J. D. Martín-Guerrero, and P. J. Lisboa. Making machine learning models interpretable. In ESANN, volume 12, pages 163–172. Citeseer, 2012. (Cited on page 3.)

M. L. Volk, M. Roney, and R. M. Merion. Systematic bias in surgeons’ predictions of the donor-specific risk of liver transplant graft failure. Liver Transplantation, 19(9):987–990, 2013. (Cited on page 76.)

M. P. Wand and M. C. Jones. Kernel smoothing. Monographs on statistics and applied probability: 60. Chapman & Hall, 1995. ISBN 0412552701. (Cited on page 60.)

L. Wang, M. Q. Yang, and J. Y. Yang. Prediction of DNA-binding residues from protein sequence information using random forests. BMC Genomics, 10(1):S1, 2009. (Cited on pages 11 and 95.)

K. Weiss, T. M. Khoshgoftaar, and D. Wang. A survey of transfer learning. Journal of Big Data, 3(1):9, 2016. (Cited on pages 30 and 96.)

Y. Weiss, Y. Elovici, and L. Rokach. The CASH algorithm: cost-sensitive attribute selection using histograms. Information Sciences, 222:247–268, 2013. (Cited on page 37.)

J. Wiens, J. Guttag, and E. Horvitz. A study in transfer learning: leveraging data from multiple hospitals to enhance hospital-specific predictions. Journal of the American Medical Informatics Association, 21(4):699–706, 2014. (Cited on page 31.)

I. H. Witten and E. Frank. Data Mining: Practical machine learning tools and techniques.Morgan Kaufmann, 2005. (Cited on pages 1, 21, and 79.)

F. Wolfe, H. A. Smythe, M. B. Yunus, R. M. Bennett, C. Bombardier, D. L. Goldenberg, P. Tugwell, S. M. Campbell, M. Abeles, P. Clark, A. G. Fam, S. J. Farber, J. J. Fiechtner, C. Michael Franklin, R. A. Gatter, D. Hamaty, J. Lessard, A. S. Lichtbroun, A. T. Masi, G. A. Mccain, W. John Reynolds, T. J. Romano, I. Jon Russell, and R. P. Sheon. The American College of Rheumatology 1990 criteria for the classification of fibromyalgia. Arthritis & Rheumatism, 33(2):160–172, 1990. ISSN 1529-0131. doi: 10.1002/art.1780330203. URL http://dx.doi.org/10.1002/art.1780330203. (Cited on page 26.)

A. M. Wood, I. R. White, and S. G. Thompson. Are missing outcome data adequately handled? A review of published randomized controlled trials in major medical journals. Clinical Trials, 1(4):368–376, 2004. (Cited on page 27.)

M. N. Wright and A. Ziegler. ranger: A fast implementation of random forests for high dimensional data in C++ and R. arXiv preprint arXiv:1508.04409, 2015. (Cited on pages 20 and 103.)

M. Xu, K. G. Tantisira, A. Wu, A. A. Litonjua, J.-h. Chu, B. E. Himes, A. Damask, and S. T. Weiss. Genome wide association study to predict severe asthma exacerbations in children using random forests classifiers. BMC Medical Genetics, 12(1):90, 2011. (Cited on page 21.)

T. Xu, T. D. Le, L. Liu, R. Wang, B. Sun, and J. Li. Identifying cancer subtypes from miRNA-TF-mRNA regulatory networks and expression data. PLOS ONE, 11(4):1–20, 2016. doi: 10.1371/journal.pone.0152792. URL https://doi.org/10.1371/journal.pone.0152792. (Cited on page 11.)

Y. Xue, X. Liao, L. Carin, and B. Krishnapuram. Multi-task learning for classification with Dirichlet process priors. Journal of Machine Learning Research, 8(Jan):35–63, 2007. (Cited on page 66.)

C.-S. Yang, C.-P. Wei, C.-C. Yuan, and J.-Y. Schoung. Predicting the length of hospital stay of burn patients: Comparisons of prediction accuracy among different clinical stages. Decision Support Systems, 50(1):325–335, 2010. (Cited on page 11.)

I. Yoo, P. Alafaireet, M. Marinov, K. Pena-Hernandez, R. Gopidi, J.-F. Chang, and L. Hua. Data mining in healthcare and biomedicine: a survey of the literature. Journal of Medical Systems, 36(4):2431–2448, 2012. (Cited on pages 2, 3, 17, and 18.)

M. Yousef, M. Nebozhyn, H. Shatkay, S. Kanterakis, L. C. Showe, and M. K. Showe. Combining multi-species genomic data for microRNA identification using a Naive Bayes classifier. Bioinformatics, 22(11):1325–1334, 2006. (Cited on page 18.)

G. P. Zhang. Neural networks for classification: a survey. IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, 30(4):451–462, 2000. ISSN 1094-6977. (Cited on pages 16 and 89.)

S. Zhang, Y. Yin, M. B. Jones, Z. Zhang, B. L. D. Kaiser, B. A. Dinsmore, C. Fitzgerald, P. I. Fields, and X. Deng. Salmonella serotype determination utilizing high-throughput genome sequencing data. Journal of Clinical Microbiology, 53(5):1685–1692, 2015. (Cited on pages 98 and 116.)

W. Zhang, T. D. Le, L. Liu, Z.-H. Zhou, and J. Li. Mining heterogeneous causal effects for personalized cancer treatment. Bioinformatics, 33(15):2372–2378, 2017. doi: 10.1093/bioinformatics/btx174. URL http://dx.doi.org/10.1093/bioinformatics/btx174. (Cited on page 11.)

X. H. Zhang, K. A. Heller, I. Hefter, C. S. Leslie, and L. A. Chasin. Sequence information for the splicing of human pre-mRNA identified by support vector machine classification. Genome Research, 13(12):2637–2650, 2003. (Cited on pages 11 and 95.)

X. Zheng, H. Chen, and T. Xu. Deep learning for Chinese word segmentation and POS tagging. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 647–657, Seattle, Washington, USA, October 2013. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/D13-1061. (Cited on page 16.)
