Orlando P. Zacarias - DiVA portalsu.diva-portal.org/smash/get/diva2:867936/FULLTEXT01.pdf ·...

115
Mining Mozambique Health Data The Case of Malaria: From Bayesian Incidence Risk to Incidence Case Predictions Orlando P. Zacarias

Transcript of Orlando P. Zacarias - DiVA portalsu.diva-portal.org/smash/get/diva2:867936/FULLTEXT01.pdf ·...

Page 1: Orlando P. Zacarias - DiVA portalsu.diva-portal.org/smash/get/diva2:867936/FULLTEXT01.pdf · Stockholm University ISBN 978-91-7649-304-5 ISSN 1101-8526 DSV Report Series No. 15-020

Mining Mozambique Health DataThe Case of Malaria: From Bayesian Incidence Risk to IncidenceCase Predictions

Orlando P. Zacarias

Page 2: Orlando P. Zacarias - DiVA portalsu.diva-portal.org/smash/get/diva2:867936/FULLTEXT01.pdf · Stockholm University ISBN 978-91-7649-304-5 ISSN 1101-8526 DSV Report Series No. 15-020

ii

Page 3: Orlando P. Zacarias - DiVA portalsu.diva-portal.org/smash/get/diva2:867936/FULLTEXT01.pdf · Stockholm University ISBN 978-91-7649-304-5 ISSN 1101-8526 DSV Report Series No. 15-020

Mining Mozambique Health DataThe Case of Malaria: From Bayesian Incidence Risk to IncidenceCase Predictions

Orlando P. Zacarias

Page 4: Orlando P. Zacarias - DiVA portalsu.diva-portal.org/smash/get/diva2:867936/FULLTEXT01.pdf · Stockholm University ISBN 978-91-7649-304-5 ISSN 1101-8526 DSV Report Series No. 15-020

Stockholm University

ISBN 978-91-7649-304-5

ISSN 1101-8526

DSV Report Series No. 15-020

c© Orlando P. Zacarias

Typeset by the author using LATEX

Distributor: Department of Computer and Systems Sciences (DSV)

Printed in Sweden by Publit, Stockholm 2015

Page 5: Orlando P. Zacarias - DiVA portalsu.diva-portal.org/smash/get/diva2:867936/FULLTEXT01.pdf · Stockholm University ISBN 978-91-7649-304-5 ISSN 1101-8526 DSV Report Series No. 15-020

v

This thesis is dedicated to my late father.You were my greatest inspiration throughout my

entire life ...Que Deus te tenha e guarde junto dele!

Page 6: Orlando P. Zacarias - DiVA portalsu.diva-portal.org/smash/get/diva2:867936/FULLTEXT01.pdf · Stockholm University ISBN 978-91-7649-304-5 ISSN 1101-8526 DSV Report Series No. 15-020

vi

Page 7: Orlando P. Zacarias - DiVA portalsu.diva-portal.org/smash/get/diva2:867936/FULLTEXT01.pdf · Stockholm University ISBN 978-91-7649-304-5 ISSN 1101-8526 DSV Report Series No. 15-020

ABSTRACT

The health sector in Mozambique is piled with data, holding records of ma-jor public health diseases, such as malaria, cholera, etc. The process ofscrutinizing such a mass of health data for useful information is challengingbut essential for the health authorities and professionals. Statistical learningand inferential approaches can be used to provide health decision makerswith appropriate tools for disease diagnosis and assessment, where the anal-ysis is performed using Bayesian predictive techniques and data mining.The purpose of this thesis is to investigate how predictive data mining andBayesian regression methods can be used effectively, so as to extract usefulknowledge from reported malaria health data to support decision makingand management.In summary, effective Bayesian predictive methods based on spatial andspace-time reported cases of malaria have been derived, allowing the ex-traction of the main risk factors for malaria. Predictive models that com-bine consecutive temporal connections within the analysis of the space-timevariations of the disease have been found to be relevant when the explicitmodeling of seasonality is not required or is even unfeasible.Investigation of the most effective ways to derive numerical predictive mod-els was performed using several regression predictive methods. The conclu-sions are that effective numerical prediction of new cases of the disease canbe achieved by training support vector machines using a time-window ap-proach for the choice of different training sets based on a number of yearsand reducing the time towards the test set. The best performance is obtainedfor a smaller time-window. Another contribution of this thesis is the de-termining of the importance of predictors in the prediction of the incidence

Page 8: Orlando P. Zacarias - DiVA portalsu.diva-portal.org/smash/get/diva2:867936/FULLTEXT01.pdf · Stockholm University ISBN 978-91-7649-304-5 ISSN 1101-8526 DSV Report Series No. 15-020

viii

of malaria, performed by adopting the permutation accuracy strategy (fromthe random forests method) using the test set. Also, an additional contribu-tion relates to a significant reduction in the predictive error, which has beenobtained by the employment of a sample correction bias strategy, while test-ing the predictive models in different regions, other than where they wereinitially developed.

Page 9: Orlando P. Zacarias - DiVA portalsu.diva-portal.org/smash/get/diva2:867936/FULLTEXT01.pdf · Stockholm University ISBN 978-91-7649-304-5 ISSN 1101-8526 DSV Report Series No. 15-020

SAMMANFATTNING

Hälso- och sjukvårdssektorn i Mozambique har samlat stora mängder datakring folksjukdomar, såsom malaria och kolera. Processen att analysera så-dana mängder data för att utvinna värdefull information är utmanande menockså av yttersta vikt för hälsovårdsmyndigheter och vårdpersonal. Stati-tistisk inlärning och inferens kan användas för att tillhandahålla beslutsfat-tare inom hälso- och sjukvården verktyg att ställa diagnoser och göra ut-värderingar genom analys baserad på Bayesianska prediktiva metoder ochdataanalys. Syftet med denna avhandling är att undersöka hur prediktiv da-taanalys och Bayesianska regressionsmetoder på ett ändamålsenligt sätt kananvändas för att extrahera användbar kunskap från rapporterade malariafalltill stöd för beslutsfattande. Bayesianska prediktiva metoder har tillämpatsför analys av insamlad data med spridning både i tiden och rummet för attextrahera viktiga riskfaktorer för malaria. Prediktiva modeller som kombi-nerar länkade temporala data med variationer i rum-tiden har visat sig varaanvändbara då explicit modellering av förändringar i samband med årstiderinte behövs eller är möjlig. Vidare har metoder för att generera prediktivaregressionsmodeller undersökts. En slutsats är att träffsäker prediktion avantalet nya sjukdomsfall kan uppnås genom inlärning av stödvektormaski-ner (support vector machines) från data som motsvarar olika långa tidsföns-ter, där det bästa resultatet erhålls från korta tidsfönster. Ett annat av avhand-lingens bidrag är en metod för att bestämma vilka faktorer som är viktigavid prediktion av malaria, vilket görs genom att anamma en permuterings-metod för testmängden, vilken tidigare har använts för s k random forests.Ytterligare ett bidrag i avhandlingen är att visa att en signifikant reduceringav prediktionsfelet kan erhållas genom tillämpning av strategi för att kor-

Page 10: Orlando P. Zacarias - DiVA portalsu.diva-portal.org/smash/get/diva2:867936/FULLTEXT01.pdf · Stockholm University ISBN 978-91-7649-304-5 ISSN 1101-8526 DSV Report Series No. 15-020

x

rigera det fel i urvalet som uppstår då modellen tillämpas för geografiskaområden som den ursprungligen inte var utvecklad för.

Page 11: Orlando P. Zacarias - DiVA portalsu.diva-portal.org/smash/get/diva2:867936/FULLTEXT01.pdf · Stockholm University ISBN 978-91-7649-304-5 ISSN 1101-8526 DSV Report Series No. 15-020

ACKNOWLEDGMENTS

First, I would like to express my deep gratitude to my supervisor ProfessorHenrik Boström, for his tireless efforts, patience, guidance, and commit-ment to make this project a success. Thanks, Henrik, for being supportiveand helping me moving forward. To my co-supervisor Prof. Mats Daniel-son, I express my deepest appreciation for the advice and relevant issuesraised in the final stage of this thesis. I especially thank my late supervi-sor Professor Peter Majlender and co-supervisor Dr. Mikael Andersson fortheir guidance and support throughout the first stage of my research jour-ney. I must also thank Prof. Love Ekenberg for making it possible for meto start my PhD studies in the Department of Computer and Systems Sci-ences (DSV). To the Mozambican institutions that provided the data used inthe studies included in this thesis, namely, the Ministry of Health, the Insti-tute of Meteorology, the National Institute of Statistics, and the Directorateof Geography and Registration, I express my gratitude for your invaluablehelp. My thanks are extended to the DaTe - data & text mining discussiongroup at DSV, especially Aron and Thashmee for their valuable commentsand discussions in our internal seminars.Secondly, I thank the Swedish International Development CooperationAgency (SIDA) for funding my research, in cooperation with the EduardoMondlane University and the International Science Program (ISP - Upp-sala Universitet), you made this thesis project a reality. To the Mozambicanteam in Stockholm and other Ph.D. students at DSV, especially Avelino,Sotomane, Xavier, Juvencio, Lucilio, Condo, Nelson, Javier, Florence, Eve-lyn, Meshari, Ranil, Rueben, Elly, Mturi and Geoffery, many thanks foryour support and the entertaining time we spent together. I also thank Fa-

Page 12: Orlando P. Zacarias - DiVA portalsu.diva-portal.org/smash/get/diva2:867936/FULLTEXT01.pdf · Stockholm University ISBN 978-91-7649-304-5 ISSN 1101-8526 DSV Report Series No. 15-020

xii

tima, Rudolfo, Birgitta, Sören (big elephant), Marihan, and all the staff at thedepartment for their support and help solving various administrative issuesand taking good care of me throughout my studies at DSV. Special thanksgo to my mother, brothers and sisters, all the big family and in-laws for theirsupport and prayers. This work was inspired by your love and support. Atlast, and most, to my wife Leia and our kids Daisy, Denny and Junior. With-out your love, understanding and support this work would not have beenpossible. A very big thank to you. NHI BONGILE!! KHANIMAMBO!!

Page 13: Orlando P. Zacarias - DiVA portalsu.diva-portal.org/smash/get/diva2:867936/FULLTEXT01.pdf · Stockholm University ISBN 978-91-7649-304-5 ISSN 1101-8526 DSV Report Series No. 15-020

xiii

Contents

Abstract vii

Sammanfattning ix

Acknowledgments xi

Abbreviations xvii

List of Figures xix

List of Tables xxi

1 Introduction 11.1 Research Background . . . . . . . . . . . . . . . . . . . . . 11.2 The Problem . . . . . . . . . . . . . . . . . . . . . . . . . . 71.3 Research Question . . . . . . . . . . . . . . . . . . . . . . 101.4 List of Included Publications . . . . . . . . . . . . . . . . . 121.5 Research Contributions . . . . . . . . . . . . . . . . . . . . 131.6 Thesis Outline . . . . . . . . . . . . . . . . . . . . . . . . . 17

2 Theoretical Background 192.1 Spatial Epidemiology . . . . . . . . . . . . . . . . . . . . . 192.2 The Bayesian Paradigm . . . . . . . . . . . . . . . . . . . . 23

2.2.1 Bayesian Inference . . . . . . . . . . . . . . . . . . 232.2.2 Bayesian Hierarchical Modeling . . . . . . . . . . . 25

2.3 Machine Learning . . . . . . . . . . . . . . . . . . . . . . . 282.3.1 Learning Algorithms . . . . . . . . . . . . . . . . . 29

Page 14: Orlando P. Zacarias - DiVA portalsu.diva-portal.org/smash/get/diva2:867936/FULLTEXT01.pdf · Stockholm University ISBN 978-91-7649-304-5 ISSN 1101-8526 DSV Report Series No. 15-020

xiv

2.3.2 Alternative Learning Methods . . . . . . . . . . . . 332.4 Criteria for Model Evaluation . . . . . . . . . . . . . . . . . 34

2.4.1 Bayesian Metrics . . . . . . . . . . . . . . . . . . . 342.4.2 Data Mining Performance Evaluation . . . . . . . . 35

3 Research Approach 373.1 Research Methodology and Philosophies . . . . . . . . . . . 373.2 Research Process and Methods . . . . . . . . . . . . . . . . 393.3 Study Site . . . . . . . . . . . . . . . . . . . . . . . . . . . 423.4 Procedure of Data Collection and Analysis . . . . . . . . . . 42

3.4.1 Literature Review . . . . . . . . . . . . . . . . . . . 423.4.2 Datasets and Analysis Issues . . . . . . . . . . . . . 44

3.5 Research Limitations . . . . . . . . . . . . . . . . . . . . . 583.6 Ethical Considerations . . . . . . . . . . . . . . . . . . . . 59

4 Summary of Results and Findings 614.1 Predictive Bayesian Regression . . . . . . . . . . . . . . . . 61

4.1.1 Paper I . . . . . . . . . . . . . . . . . . . . . . . . 624.1.2 Paper II . . . . . . . . . . . . . . . . . . . . . . . . 634.1.3 Paper III . . . . . . . . . . . . . . . . . . . . . . . 65

4.2 Predictive Data Mining . . . . . . . . . . . . . . . . . . . . 664.2.1 Paper IV . . . . . . . . . . . . . . . . . . . . . . . 674.2.2 Paper V . . . . . . . . . . . . . . . . . . . . . . . . 684.2.3 Paper VI . . . . . . . . . . . . . . . . . . . . . . . 694.2.4 Paper VII . . . . . . . . . . . . . . . . . . . . . . . 72

5 Concluding Remarks 755.1 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . 755.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . 80

References 83

Appendices 94

Paper I 97

Paper II 109

Page 15: Orlando P. Zacarias - DiVA portalsu.diva-portal.org/smash/get/diva2:867936/FULLTEXT01.pdf · Stockholm University ISBN 978-91-7649-304-5 ISSN 1101-8526 DSV Report Series No. 15-020

xv

Paper III 121

Paper IV 131

Paper V 137

Paper VI 145

Paper VII 151

Page 16: Orlando P. Zacarias - DiVA portalsu.diva-portal.org/smash/get/diva2:867936/FULLTEXT01.pdf · Stockholm University ISBN 978-91-7649-304-5 ISSN 1101-8526 DSV Report Series No. 15-020

xvi

Page 17: Orlando P. Zacarias - DiVA portalsu.diva-portal.org/smash/get/diva2:867936/FULLTEXT01.pdf · Stockholm University ISBN 978-91-7649-304-5 ISSN 1101-8526 DSV Report Series No. 15-020

xvii

Abbreviations

AIC Akaike Information Criteria

ARIMA Autoregressive Integrated Moving Average

CAR Conditional Autoregressive

DIC Deviance Information Criteria

DM Data Mining

DT Decision Trees

HIS Health Information System

INAM Instituto Nacional de Metereologia

INE Instituto Nacional de Estatistica

IRS Indoor-Residual Spraying

ITNs Insecticide-treated mosquito nets

KDD Knowledge Discovery in Database

Marr Marracuene

Matut Matutuine

MCMC Monte Carlo Markov Chain

MoH Ministry Of Health

MSE Mean Squared Error

Nam Namaacha

NDVI Normalized Difference Vegetation Index

Page 18: Orlando P. Zacarias - DiVA portalsu.diva-portal.org/smash/get/diva2:867936/FULLTEXT01.pdf · Stockholm University ISBN 978-91-7649-304-5 ISSN 1101-8526 DSV Report Series No. 15-020

xviii

NHS National Health System

PHI Public Health and Informatics

RBF Radial Basis Function

RFs Random Forests

RW(1) First Order Random Walk Process

SMR Standard Mortality Rate

SVMs Support Vector Machines

WHO World Health Organization

Page 19: Orlando P. Zacarias - DiVA portalsu.diva-portal.org/smash/get/diva2:867936/FULLTEXT01.pdf · Stockholm University ISBN 978-91-7649-304-5 ISSN 1101-8526 DSV Report Series No. 15-020

xix

List of Figures

1.1 Year with highest raw malaria cases. . . . . . . . . . . . . . 31.2 Year with lowest raw malaria cases. . . . . . . . . . . . . . 41.3 Framework connecting the research question and process,

and the interrelationships between studies. . . . . . . . . . . 11

2.1 Snapshot for Hierarchical Abstraction Object Structure ofDisease (malaria) Process for Maputo province . . . . . . . 22

2.2 Example of a hierarchical model used in paper I . . . . . . . 272.3 Hyperplanes able to linearly separate the data . . . . . . . . 322.4 Maximum-margin hyperplane . . . . . . . . . . . . . . . . 322.5 Linear non-separable hyperplane. . . . . . . . . . . . . . . . 33

3.1 Study area - Maputo province. . . . . . . . . . . . . . . . . 433.2 Time trend of malaria cases (first highest yearly district mean),

temperature and rainfall. . . . . . . . . . . . . . . . . . . . 503.3 Time trend of malaria cases (second highest yearly district

mean), temperature and rainfall. . . . . . . . . . . . . . . . 503.4 Time trend of malaria cases (lowest yearly district mean),

temperature and rainfall. . . . . . . . . . . . . . . . . . . . 513.5 Contribution of each district to monthly malaria trends, year

2007. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 533.6 Contribution of each district to monthly malaria trends, in

the year 2008. . . . . . . . . . . . . . . . . . . . . . . . . . 53

4.1 Original versus Predicted using test set 2009. . . . . . . . . 70

Page 20: Orlando P. Zacarias - DiVA portalsu.diva-portal.org/smash/get/diva2:867936/FULLTEXT01.pdf · Stockholm University ISBN 978-91-7649-304-5 ISSN 1101-8526 DSV Report Series No. 15-020

xx

Page 21: Orlando P. Zacarias - DiVA portalsu.diva-portal.org/smash/get/diva2:867936/FULLTEXT01.pdf · Stockholm University ISBN 978-91-7649-304-5 ISSN 1101-8526 DSV Report Series No. 15-020

xxi

List of Tables

3.1 Connection of research studies, designs, strategies and ques-tion. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41

3.2 Mean of raw malaria events in each district per year. . . . . . 48

Page 22: Orlando P. Zacarias - DiVA portalsu.diva-portal.org/smash/get/diva2:867936/FULLTEXT01.pdf · Stockholm University ISBN 978-91-7649-304-5 ISSN 1101-8526 DSV Report Series No. 15-020

xxii

Page 23: Orlando P. Zacarias - DiVA portalsu.diva-portal.org/smash/get/diva2:867936/FULLTEXT01.pdf · Stockholm University ISBN 978-91-7649-304-5 ISSN 1101-8526 DSV Report Series No. 15-020

CHAPTER 1

INTRODUCTION

This thesis investigates the application of statistical learning and inferencein the health sector of Mozambique. We use data mining and Bayesianmethods to design predictive models from health data comprising reportedcases of malaria. The first chapter presents the background, the problemarea, and the research question. In addition, a brief description of relatedand included publications is given, followed by a summary of the researchcontribution. The chapter concludes with the presentation of the thesis struc-ture.

1.1 RESEARCH BACKGROUND

The Mozambican national health system (NHS) provides and ensures pub-lic primary health-care services to all citizens. The amount of disease anddisease-related data generated by the NHS is enormous. It is introducedinto the health information system (HIS) and arranged primarily as weeklyepidemiological data (through the Weekly Epidemiological Bulletin 1), col-lated from all provinces in the country [1]. The number of reported cases(episodes) of a particular disease are summed and basic statistics are pro-duced. They are provided as time series showing the geographic variation of

1Boletim Epidemiológico Semanal

Page 24: Orlando P. Zacarias - DiVA portalsu.diva-portal.org/smash/get/diva2:867936/FULLTEXT01.pdf · Stockholm University ISBN 978-91-7649-304-5 ISSN 1101-8526 DSV Report Series No. 15-020

2 CHAPTER 1

notified cases (as for instance the malaria incidence2 cases), which are pub-lished in quarterly and annual reports. The HIS was entirely paper-baseduntil the end of 1992, when the government introduced a computer-baseddatabase and reporting system at the provincial and national levels [3]. Itcovers some health programs, such as, family planning, malaria, motherand child, etc. Due to this limitation, several computer-based informationsystems have been implemented within the NHS. Some originated withinthe corresponding health program whereas others are supported by interna-tional partners, leading to a situation referred to as “spaghetti” [4]. On theother hand, under the Lubombo Spatial Development Initiative3 there is acomputerized malaria information system storing malaria case data, includ-ing also a spatial component through a geographical information system [5].The information is collected during the spraying campaigns4 and introducedinto their malaria information system. It plays a key role in the monitoringand planning of new spray activities. Unfortunately, in Mozambique, theproject is operational only in Maputo province and a small number of dis-tricts in the Gaza province.

Following the definition of the World Health Organization (WHO), malariais considered a life-threatening disease caused by a parasite [7]. The dis-ease is mostly encountered in tropical and subtropical countries of Africa,Asia, Central and South America [6; 8]. It is transmitted to other human be-ings by the bite of an infected female Anopheles mosquito [6], thus calledthe disease vector. Every year it affects around 500 million individuals andhas a very high death rate [6; 9]. In Mozambique, malaria has remainedat the top of the country’s disease list for many years now, showing highlevels of mortality, particularly for children [9]. Official records from the

2Incidence is a disease measure defined as the number of new cases per popu-lation at risk in a given time [2].

3LSDI is a project that brings together three countries, South Africa, Mozam-bique and Swaziland, and operates in provinces where the countries share commonborders.

4Primary vector control intervention through indoor-residual spray (IRS), asrecommended by the WHO [6]. Dichloro-Diphenyl-Trichloroethane (known asDDT) is an insecticide used in the spraying of inside dwellings to kill the malariavector (mosquitoes).

Page 25: Orlando P. Zacarias - DiVA portalsu.diva-portal.org/smash/get/diva2:867936/FULLTEXT01.pdf · Stockholm University ISBN 978-91-7649-304-5 ISSN 1101-8526 DSV Report Series No. 15-020

INTRODUCTION 3

Ministry of Health (MoH) report that about a million children are sufferingfrom malaria in the country. The level of incidence becomes more troublingfor far remote locations away from the major cities and provincial capitals,in rural areas, where patients tend to travel kilometers to reach the nearesthealth facility. In Figures 1.1 and 1.2 the distribution of malaria raw data inthe years with the highest and lowest rates of reported cases in the rural dis-tricts of Maputo province is shown, for each corresponding month. It shouldbe noted that the administrative area of Matola is considered as a district atthe level of collection and analysis of data from the MoH, notwithstandingthat it possesses the status of a municipality. It should be noted that the inci-dence of malaria dropped significantly between 2000 and 2012, this mainlydue to the change in the therapy used (introduction of artemisinin) and massutilization of mosquito nets. However, these factors are not captured andsystematized in the health information system.

Figure 1.1: Year with highest raw malaria cases.

Page 26: Orlando P. Zacarias - DiVA portalsu.diva-portal.org/smash/get/diva2:867936/FULLTEXT01.pdf · Stockholm University ISBN 978-91-7649-304-5 ISSN 1101-8526 DSV Report Series No. 15-020

4 CHAPTER 1

Figure 1.2: Year with lowest raw malaria cases.

The amount and level of information flowing into the health sector repre-sents a major challenge for the health system managers, employees, and ser-vice providers at various levels, especially regarding the making of timelydecisions about, for example, the distribution of anti-malarial drugs or othernecessary resources. However, such a large volume of data can be analyzedin order to generate useful information (knowledge) [10]. It is crucial thatan analytical approach based on reported malaria cases take into account theintrinsic way that some important features may be reflected in the availabledata. Such features (patterns, relationships) are related to the geographicalregion (the location where the disease occurs), time (the time interval con-taining the observation), and affected populations. These dimensions forma triad: space, time, and personal relationship.

The extraction of such information and knowledge can be an important con-tribution to explaining the transmission and variation of malaria. To achievethis goal, some methods and techniques that can contribute to the analysisof data and knowledge extraction can be applied [11–13]. Among these

Page 27: Orlando P. Zacarias - DiVA portalsu.diva-portal.org/smash/get/diva2:867936/FULLTEXT01.pdf · Stockholm University ISBN 978-91-7649-304-5 ISSN 1101-8526 DSV Report Series No. 15-020

INTRODUCTION 5

approaches are predictive data mining and Bayesian regression-based infer-ence. They allow the extraction of useful and interesting knowledge andregularities from data. Through the use of data mining (DM), it is possibleto discover patterns5 in the data. The process of extracting previously un-known information may be automatic or semi-automatic. This informationshould be valid and able to generate useful actions. Also, the discoveredpatterns should have a significant advantage for strategic decision making[12; 13]. DM is a step in a larger process known as Knowledge Discov-ery in Database (KDD) [12]. As such, it relates to the use of algorithmsthrough which the patterns may be extracted. KDD is a process of non-trivial extraction of implicit unknown and potentially useful information byautomatically scanning the databases [10]. Additionally, multiple regres-sion hierarchical models of reported malaria cases, considering independentvariables assumed to explain the disease outcome, can be set up within theBayesian statistical framework [14–16]. In this setting, the prior knowledgeregarding the hypothesis being tested together with the information from theanalyzed dataset are integrated. Consequently, estimates of posterior distri-bution of the Bayesian inference for any parameter of the model are obtainedthrough simulation, using computational techniques such as Markov ChainMonte Carlo (MCMC). They represent the updated information of the pri-ors considering the information from the observed dataset [15; 17]. Also,the Bayesian analysis allows determining the uncertainty associated to theestimated parameters. This uncertainty is relevant as it will be used in the as-sessment of the model’s performance. More discussion of these approacheswill be presented in Chapter 2.

Furthermore, this useful knowledge and information can be integrated intothe health information system to, for instance, reduce the time needed fordecision making and increase the objectivity of the decisions. In fact, healthinformation systems have been developed to optimize the flow of relevantinformation, triggering a process of knowledge and decision-making, plan-ning and intervention activities [18; 19]. However, malaria incidence risk6 is

5A pattern is any physical or symbolic entity that describes a subset of the dataor a model applicable to the subset [10].

6Defined as the probability of a malaria disease event occurring during a speci-

Page 28: Orlando P. Zacarias - DiVA portalsu.diva-portal.org/smash/get/diva2:867936/FULLTEXT01.pdf · Stockholm University ISBN 978-91-7649-304-5 ISSN 1101-8526 DSV Report Series No. 15-020

6 CHAPTER 1

largely dependent on the interactions between the host (human), the mosquitovector, and the environment [20]. This relationship is known as the epidemi-ological triad of the disease [20]. For example, malaria-related interventionslike treatment therapy, the use of ITNs, and case-control management, arenot captured by the HIS. Consequently, critical and valuable information islost, which may render worthless all efforts to reduce and perhaps eliminatethe disease. Actually, the incorporation of these non-climatic factors andothers, such as land usage and irrigation, population movement, drug resis-tance, availability of health services, etc., may provide better estimations oflocal variations of malaria incidence [21–23]. Still, integrating all possiblefactors that may influence malaria incidence variation is costly and perhapsunfeasible. Also, this could be impractical if considering the situation ofscarcity in the Mozambique health sector, with an insufficient number ofhealth professionals, training needs, weak integration with other sectors,etc.

The work of this thesis overlaps three distinct areas: data mining, Bayesianstatistics, and public health. From the data mining perspective, a comparisonof different supervised machine learning methods to predict the number ofnew disease cases is proposed. Since malaria affects the vast majority of theMozambican population, it is considered to be a public health problem witha clear identification of the at-risk communities and populations [1]. TheBayesian approach suggests the employment of probabilistic models thatmay allow making inferences using a malaria response variable and variousfactors assumed to influence it [15; 17]. These factors are then assessed andclassified according to their relative importance, i.e., significance. Subse-quently, the three focus areas and the solutions proposed later in this thesis,are linked to the scientific area of public health and informatics (PHI). Thedefinition given in [24; 25] emphasizes this link by relating PHI to the sys-tematic application of information and computer science technologies in theinvestigation of public health problems and learning. Hence, the contribu-tions of this thesis are considered to be in the field of public health andinformatics.

fied period of time [2].

Page 29: Orlando P. Zacarias - DiVA portalsu.diva-portal.org/smash/get/diva2:867936/FULLTEXT01.pdf · Stockholm University ISBN 978-91-7649-304-5 ISSN 1101-8526 DSV Report Series No. 15-020

INTRODUCTION 7

1.2 THE PROBLEM

Statistical learning techniques employed to extract knowledge about thehealth of the population using related factors are quite attractive approaches,as they can allow establishing practical and relevant dependencies. Datamining methods have been successfully applied in several areas of the healthsector, from the detection of health fraud and abuse to predictive modelingand disease diagnosis. In medicine, the methods of artificial neural net-works have been employed in studies of osteoporosis [26; 27], pneumology[28], paediatric surgery [29], kidney failure [30], heart diseases [31], cancerrisk detection [32; 33], and diabetes [34] to predict the corresponding classlabels. These applications are used as decision support tools for the dailyactivities of clinicians. Methods of artificial neural networks, nonlinear re-gression, and least squares support vector machines have been used to inves-tigate models for the prediction of future outbreaks of dengue hemorrhagicfever (DHF) in [35–37], which take climatic factors such as temperature(minimal, mean and maximum), relative humidity, and rainfall as the inputvariables. In contrast, [36; 37] only use rainfall as an input factor, which isa major disadvantage if one considers the link of the disease vector and theremain climatic variables. Besides that, these forecasting studies have greatpotential for application to malaria data knowledge extraction, as they notonly employ similar factors, but are also transmitted by the same diseasevector.

However, knowledge extraction from malaria datasets performed using pre-dictive data mining are not only scarce, but they are mostly applied to theanalysis of the biological systems of this disease. Investigations for genemalaria selection [38], mining of the genome database [39] and malariatranscriptome [40] are some examples of reported studies. Studies predict-ing malaria infection outcomes are reported in [41; 42] using hypercube andartificial neural networks methods. Also using the hypercube method, [43]identifies the association of the explanatory variables to increased malariarisk cases. Recently, a study that applies the approach for group method ofdata handling (GMDH) polynomial neural network 7 to predict malaria out-

7This is a group of inductive self-organizing methods to solve more complex

Page 30: Orlando P. Zacarias - DiVA portalsu.diva-portal.org/smash/get/diva2:867936/FULLTEXT01.pdf · Stockholm University ISBN 978-91-7649-304-5 ISSN 1101-8526 DSV Report Series No. 15-020

8 CHAPTER 1

breaks has been reported in [45]. It allows the use of a polynomial referencefunction, where simple polynomials are initially created to approximate thestudied systems. This approach is similar to artificial neural networks wherethe neurons are activated by a polynomial function. Consequently, a GMDHpolynomial network may also follow into similar artificial neural networkstraps of either local minima or to overfitting. Additionally, these studies donot predict continuous values. Alternative methods to predict continuousvalues using malaria datasets have been developed. The developed numeri-cal malaria predictive (forecasting) models employ approaches such as au-toregressive integrated moving average (ARIMA) [46–48] and Poisson re-gression with polynomial finite distributed lag [21; 49]. These applicationsinclude as input variables temperature and rainfall. But, [46] also includesthe normalized difference vegetation index (NDVI).

Within the Bayesian framework, multiple regression prediction models us-ing count data usually assume a negative Binomial or Poisson likelihood,depending on whether the data has a fixed over-dispersion constant or anunknown magnitude, respectively. The Poisson likelihood is the most com-mon choice. The Bayesian approach enables hierarchical data modelingin which it is possible to incorporate spatio-temporal structures. In fact, theuse of methods for predictive modeling that employ space-time data sets hasbeen investigated for a long time [14; 50; 51]. When the spatial sample sizeis not large enough for a regression, the combination of spatial and temporaldimensions can help increase the regression samples used for the analysis.Several variants of space-time modeling are possible, where either space-time interactions and (or) temporal effects are introduced. A linear timetrend has been used in [52–54], with the [53] proposing an inter-annual dis-ease variation. Studies [55; 56] use an autoregressive random process AR(1)to model space and time interactions, where the the temporal correlation be-tween consecutive time lags at the spatial level is defined as either monthlyor annual.

The predictive models established by some of the research here reviewedcan also be applied to other diseases, in particular to the study of malaria.However, several shortcomings can be identified. Models derived using data

problems of handling observed data samples [44].

Page 31: Orlando P. Zacarias - DiVA portalsu.diva-portal.org/smash/get/diva2:867936/FULLTEXT01.pdf · Stockholm University ISBN 978-91-7649-304-5 ISSN 1101-8526 DSV Report Series No. 15-020

INTRODUCTION 9

mining methods mostly employ a classification as their data analysis task tofind prediction levels of future disease outbreak risks and diagnosis. Thisdoes not provide a numerical prediction, though some of the methods canbe used to perform continuous values prediction. In addition, the numericalpredictive modeling of new episodes of diseases from the perspective of datamining is apparently unavailable. ARIMA based models are weak in out-lier detection, with the possibility for parameter re-estimation and changesin the order of the ARIMA model [45; 48]. For the models obtained usinga polynomial distributed lag, the selection of parameters for maximum laglength and degree of the polynomial are the main constraints [57]. Anotherlimitation of the methods used for numerical prediction of disease forecastrelates to the lack of endorsement of the time-window (frame) approach intotheir predictive modeling framework. In this setting, the training set can besub-divided and structured into different time lags, yielding a number ofnew training datasets. This is important if bearing in mind that the Toblerlaw of geography [58] relating spatial dependencies between near objectscan also be applied to the time dimension. Thus, we propose numericalpredictive models that are enhanced using such a time-window strategy foran effective choice of the required historical data from which to develop themodels. Although the existing Bayesian-based models are derived under theconjugate space and time domains, the possibility of using several varyingindexes in the temporal dimension is also lacking. In fact, explicitly allow-ing the dependency between the response and independent variables to varywithin years and months of the same and the following year is crucial. Thiscan allow capturing temporal correlation effects linking climatic seasons.Therefore, we extend the space-time interactions described in the predic-tion models of [55; 56] to simultaneously accommodate monthly and an-nual variation while finding significant predictive relationships. Moreover,investigations of sub-regional predictive modeling to validate and reproduceprevious findings on malaria transmission and variation, interactions, andpredictive relationships, are fundamental [59].

Hence, in this dissertation, we devise predictive models based on methodsand techniques from the fields of statistical learning and inference, aimingat reducing the knowledge gap identified above. The models will be usedto extract invaluable knowledge from health datasets employing the malaria

Page 32: Orlando P. Zacarias - DiVA portalsu.diva-portal.org/smash/get/diva2:867936/FULLTEXT01.pdf · Stockholm University ISBN 978-91-7649-304-5 ISSN 1101-8526 DSV Report Series No. 15-020

10 CHAPTER 1

data of routinely reported cases from administrative districts of Mozam-bique. In the Bayesian framework, the models attempt to extend their pre-dictive performance from the spatial to include a temporal dimension whileaccounting for previous temporal random effects. Within predictive datamining, the challenge for the effective learning of the number of new dis-ease episodes is considered in each region and time. With the evidence-based results obtained by both methods of learning, we hope to contributeto a more informed decision making and management process on the part ofhealth decision-makers. A further discussion of Bayesian and data miningapproaches can be found in Chapter 2.

1.3 RESEARCH QUESTION

Based on the problem description, the following research question is formu-lated,

How can one effectively extract useful knowledge from malaria health datato support decision-making and management, using predictive data miningand Bayesian regression methods?

The posed research question is answered in seven developed empirical stud-ies, using historical malaria and climatic datasets. They are summarizedin Figure 1.3 below. The figure outlines the categorization on predictiveBayesian regression and data mining approaches and interdependencies (in-cluding the used datasets) between the conducted studies. Also, the figureestablishes the connections between the research question, the process lead-ing to the extraction of knowledge, and the corresponding studies. Althoughin both groups of studies we model continuous-valued functions through re-gression where the predictive models analyze the data looking for specificpatterns or trends, in the Bayesian studies, the associations between the tar-get variable (malaria incidence cases) and predictors are investigated. But inthe predictive data mining studies, the forecasting of (unknown) new diseaseepisodes using similar predictors including the analysis of possible most im-portant factors is sought.

Page 33: Orlando P. Zacarias - DiVA portalsu.diva-portal.org/smash/get/diva2:867936/FULLTEXT01.pdf · Stockholm University ISBN 978-91-7649-304-5 ISSN 1101-8526 DSV Report Series No. 15-020

INTRODUCTION 11

Figure 1.3: Framework connecting the research question and process, and theinterrelationships between studies.

Page 34: Orlando P. Zacarias - DiVA portalsu.diva-portal.org/smash/get/diva2:867936/FULLTEXT01.pdf · Stockholm University ISBN 978-91-7649-304-5 ISSN 1101-8526 DSV Report Series No. 15-020

12 CHAPTER 1

1.4 LIST OF INCLUDED PUBLICATIONS

This thesis consists of a list of scientific contributions, detailed below. Anextensive summary of the included papers is provided in Chapter 4 and theirfull versions in Part II of the thesis.

PAPER I: Mapping malaria incidence distribution that accounts forenvironmental factors in Maputo province - MozambiqueOrlando P. Zacarias and Mikael Andersson, Malaria Journal,9:79, 2010.DOI: 10.1186/1475-2875-9-79

PAPER II: Spatial and temporal patterns of malaria incidence in Mozam-biqueOrlando P. Zacarias and Mikael Andersson, Malaria Journal,10:189, 2011.DOI: 10.1186/1475-2875-10-189

PAPER III: Comparison of infant malaria incidence in districts ofMaputo province, MozambiqueOrlando P. Zacarias and Peter Majlender, Malaria Journal,10:93, 2011.DOI: 10.1186/1475-2875-10-93

PAPER IV: Predicting the Incidence of Malaria Cases in MozambiqueUsing Regression Trees and ForestsOrlando P. Zacarias and Henrik Boström, International Jour-nal of Computer Science and Electronics Engineering(IJCSEE), Volume 1, Issue 1, pages: 50–54, 2013.ISSN: 2320-4028

PAPER V: Strengthening the Health Information System in Mozam-bique through Malaria Incidence PredictionOrlando P. Zacarias and Henrik Boström, IST-Africa 2013Conference Proceedings, International Information

Page 35: Orlando P. Zacarias - DiVA portalsu.diva-portal.org/smash/get/diva2:867936/FULLTEXT01.pdf · Stockholm University ISBN 978-91-7649-304-5 ISSN 1101-8526 DSV Report Series No. 15-020

INTRODUCTION 13

Management Corporation (IIMC), 29–31 May, 2013.ISBN: 978-1-905824-38-0

PAPER VI: Comparing Support Vector Regression and Random ForestsModeling for Predicting Malaria Incidence in MozambiqueOrlando P. Zacarias and Henrik Boström, InternationalConference on Advances in ICT for Emerging Regions(ICTer): 217–221, 2013 IEEE. ISBN: 978-1-4799-1276-6

PAPER VII: Generalization of Malaria Incidence Prediction Modelsby Correcting Sample Selection BiasOrlando P. Zacarias and Henrik Boström,ADMA 2013 Conference Proceedings,M. Yao et al. (Eds.): ADMA 2013, Part II, LNAI 8347,pp. 189–200, 2013. Springer-Verlag. Berlin, 2013.

1.5 RESEARCH CONTRIBUTIONS

What follows is a brief presentation of each empirical study and paper, in-cluding its contribution to answering the posed research question.

Paper I: This very first study was designed to identify and establish thespatial distribution of malaria incidence, where we explore a Bayesian hi-erarchical regression approach to making inferences using two risk factors.The contribution of this paper relates to finding predictive methods for theextraction of spatial and seasonal malaria risk factors and estimating its in-cidence. The methods set a way to establish predictive relationships and es-timates of seasonal malaria variations in the districts of Mozambique. Theseresults may be useful for planning control activities and intervention takinginto consideration that the predictors, temperature and rainfall change, in-fluence not only the incidence of disease, but also the variation of the spatialpatterns of its distribution. Therefore, it could be concluded that the em-ployed methods with seasonal grouping of malaria incidence cases providean effective framework for the analysis of this type of data. The limitationof the predictive models developed is their inability to explicitly capture

Page 36: Orlando P. Zacarias - DiVA portalsu.diva-portal.org/smash/get/diva2:867936/FULLTEXT01.pdf · Stockholm University ISBN 978-91-7649-304-5 ISSN 1101-8526 DSV Report Series No. 15-020

14 CHAPTER 1

temporal predictive relationships and variations in disease transmission.

In Paper II, a combination of geographical and time-trend approaches ofBayesian inference learning considering the data of reported cases of malaria,sampled from the entire population, was adopted. It employs a similarmethod to that of paper I, where the extension relates to the inclusion oftime variation in the analysis. Investigation of a combined usage of condi-tional autoregressive (CAR) process and temporal dimension structures wasperformed, with an adjustment of unobserved spatio-temporal interactions.Effective predictive methods based on space-time reported cases varying formonths of the same and the following year in each spatial unit was the maincontribution in this study. The methods allowed the spatial dependency to begiven by a conditional auto-regressive (CAR) process with a weighting ma-trix using a binary structure (1 for neighbors and 0 otherwise) or the lengthof the border between the neighboring districts. Also, the approach usedfor modeling the time variation allowed avoiding direct grouping of monthsinto the seasons of the year. This was beneficial, as, for example, the sum-mer season extends from the last few months of each year to the first monthsof the next year. These results can benefit the decision-making process onthe management of malaria control activities, highlighting the need to takeinto special consideration variations of the independent variables, maximumtemperature and humidity in the region. The lack of additional malaria riskfactors and data at a smaller scale are the main limitations of this study.

Paper III: The combination of the space-time perspective of the previousstudy hinted at the need to use a similar technique, however for a differentdemographic population group: infants (children less than five years old).A prediction model built using Bayesian regression methods with no spatialdependency of the malaria data was more effective. Therefore, it could beconcluded that a very low spatial variation of reported cases was observed.This can be relevant for the design of disease management strategies andpolicy implementation, focused on this population group. The small samplesize of the malaria data was a major drawback to this study.

The need to have models capable of detecting patterns to anticipate situa-tions of disease outbreak and capable of routinely forecasting new disease

Page 37: Orlando P. Zacarias - DiVA portalsu.diva-portal.org/smash/get/diva2:867936/FULLTEXT01.pdf · Stockholm University ISBN 978-91-7649-304-5 ISSN 1101-8526 DSV Report Series No. 15-020

INTRODUCTION 15

events based on historical data, led to the design of the studies reported inPapers IV-VII. They attempt to answer the research question by providingpredictive models for the number of new episodes of malaria. These ac-curate estimates of new disease cases may allow health decision makers toidentify areas with high exposure to epidemics in a timely manner, leadingto directing resources to the most affected regions. It is expected that theadoption of the obtained tools may lead to reducing the malaria burden.In Paper IV, regression trees and random forests methods are adopted andcompared while forecasting the number of new malaria incidence cases inthe space and time domains. A sample of reported cases of malaria from allage groups (the entire population) is used. Nine different models followinga time window approach were generated. Based on the adoption of a strat-egy for the choice of different training sets based on the number of years,and reducing the time towards the test set, the developed models performedbetter than with a larger time window. The findings show that a more accu-rate predictive model was obtained using the random forests method basedon a two-year time window strategy. The permutation accuracy within therandom forests approach was studied to investigate the relative importanceof the predictors included. The contribution of this investigation to answer-ing the posed research question lies in the capabilities of the random forestsapproach in providing effective and accurate estimates of the number of newmalaria episodes, and finding the input variables with the most significantpredictive relationships regarding malaria incidence forecasting. A limita-tion of this study relates to the monthly aggregation of the data, given thatin real life, weather forecasts are provided on a daily and weekly basis.

The investigation performed in Paper V is based on the support vectormachines approach, where we derive patterns of new malaria cases usingsamples from the infant population (children under five years of age). Theresults show that after optimizing the required parameters, the predictivemodel based on a radial basis kernel acquired the best predictive perfor-mance. To derive the knowledge of the relative importance of the predic-tors, the adoption of a random forests strategy was performed with the pre-dictors being randomly permuted using the test dataset. Therefore, it couldbe concluded that support vector machine methods using a radial basis ker-nel are more effective than those employing other kernels for malaria infant

Page 38: Orlando P. Zacarias - DiVA portalsu.diva-portal.org/smash/get/diva2:867936/FULLTEXT01.pdf · Stockholm University ISBN 978-91-7649-304-5 ISSN 1101-8526 DSV Report Series No. 15-020

16 CHAPTER 1

data. Also, the use of the random forests approach within the support vectormachines environment allowed the analysis of each predictor’s contribution(importance) to malaria forecasting estimation. As in Paper III, this inves-tigation had the limitation of the reduced size of the datasets used in theanalysis.

The goal of Paper VI is to design and access predictive models for the num-ber of new malaria incidence cases using a dataset sampled from the entirepopulation (all age groups) and methods for numerical predictions. It is anextension of Paper IV, but we introduce a more robust method for the ex-perimental comparison, i.e., random forests and support vector machinesapproaches are compared. The decision trees predictive technique was usedas the baseline since our goal was to find the best overall predictive mod-eling method. The results show that the SVMs model using a radial basiskernel and based on a two-year time window strategy obtained the mosteffective predictive accuracy. Investigation of the most important predic-tors was conducted following the adoption of the random forests strategyof variable importance within the SVMs model and performing a randompermutation of the predictors using the test set. In both (RFs and SVMs)predictive methods, indoor-residual spray vector control interventions wereranked as the most valuable predictor towards malaria incidence forecast-ing. In addition, the reduction of the time-window for the historical dataimproved the predictive performance, as in Paper IV. The study has thelimitation of using a higher level of aggregate data (monthly) rather than afiner aggregation (e.g., weekly), considering the availability of forecasts ofclimatic data.

Paper VII evaluates the most effective predictive models derived in PapersV and VI. The goal is to determine their ability to generalize to datasets fromother regions and different (new) times other than those where the modelswere originally developed. The predicted errors were significantly reducedfollowing the application of a sample correction bias strategy compared tothe use of original models. It could be concluded that the obtained predictiveperformance can effectively be increased by using the method of sampleselection bias. Therefore, if the test data can be expected to be drawn froma different distribution than that of the training data set, the adoption of a

Page 39: Orlando P. Zacarias - DiVA portalsu.diva-portal.org/smash/get/diva2:867936/FULLTEXT01.pdf · Stockholm University ISBN 978-91-7649-304-5 ISSN 1101-8526 DSV Report Series No. 15-020

INTRODUCTION 17

sample correction bias technique is relevant. This is useful, as it extends thecapabilities of predictive models derived previously, by trying to cover thewhole national malaria program in the country.

1.6 THESIS OUTLINE

In Chapter 2, we give a description of the theoretical background of theBayesian inference and data mining approaches, which are the basis of theresearch presented in this dissertation. As to the Bayesian perspective, thechapter focus on its application to spatial statistics and epidemiology. Thenit presents a description of machine learning techniques used within the datamining approach of the thesis.

Chapter 3 presents the methodological approach pursued in the thesis. Itspecifies the research settings and philosophical assumptions, the criteriafor site selection, and the methods of data collection. The chapter closeswith a discussion of the design used in the performed studies included in thethesis.

Chapter 4 summarizes and discusses the main findings of this research de-rived from the undertaken studies. It closes with a discussion of the limita-tions of this research.

The main conclusions and recommendations for future work are given inChapter 5.

Page 40: Orlando P. Zacarias - DiVA portalsu.diva-portal.org/smash/get/diva2:867936/FULLTEXT01.pdf · Stockholm University ISBN 978-91-7649-304-5 ISSN 1101-8526 DSV Report Series No. 15-020
Page 41: Orlando P. Zacarias - DiVA portalsu.diva-portal.org/smash/get/diva2:867936/FULLTEXT01.pdf · Stockholm University ISBN 978-91-7649-304-5 ISSN 1101-8526 DSV Report Series No. 15-020

CHAPTER 2

THEORETICAL BACKGROUND

Every day we make decisions that in one way or another affect our environ-ment: at the social level, where we live, work, in relation to our families,etc. These decisions are usually based on inferences and relationships em-bodied in quantities normally observable that contribute to the formation ofmodels of the phenomenon under study. The models thus obtained are usedto capture trends in the observed data that can then be used to predict newevents.This chapter starts by first presenting an overview of spatial epidemiology.It then follows the outline of Bayesian inference including its terminologyand links related to this research. The third section covers machine learn-ing, followed by the data mining methods and techniques employed in thisthesis, given in Sections four and five. Section 6 discusses some applicationissues. Lastly, a presentation of the criteria used in this thesis to evaluate thepredictive models is given in Section seven.

2.1 SPATIAL EPIDEMIOLOGY

The investigation of the spatial variation in disease risk has been made possi-ble thanks to advances in areas such as computer science, geographic infor-mation systems, statistical applications, and the ready access to geograph-ically indexed health and population data [60]. Spatial epidemiology thusconcerns the analysis of variations in disease incidence at the geographical

Page 42: Orlando P. Zacarias - DiVA portalsu.diva-portal.org/smash/get/diva2:867936/FULLTEXT01.pdf · Stockholm University ISBN 978-91-7649-304-5 ISSN 1101-8526 DSV Report Series No. 15-020

20 CHAPTER 2

level. According to the definition given by [61], spatial epidemiology is partof the field of ecological studies that, through the explanation of the distri-bution of diseases in different locations, provides a better understanding ofthe causes and origins of diseases. An analysis of the incidence of diseasesand their geographic distribution in order to determine their relationship topotential factors is crucial, especially in public health and other related areasof epidemiological application. Hence, in epidemiology, an essential step ina disease study is the precise description of its incidence in a population[60–62]. This description may be categorized in various domains, such as:

• Temporal.

• Spatial.

• Distribution according to personal attributes.

The main goal of this categorization is to identify the general incidencepatterns and the relative groups at risk. Depending on the purpose and typeof spatial analysis, four groups of studies can be listed [62]:

1. Disease mapping: Performs the presentation in a map of the spatialand spatio-temporal patterns of disease risk.

2. Geographical correlation studies: Measures health outcomes by ex-ploring the spatial variations in exposure to environmental variables(e.g., air pollution) and lifestyle factors (such as diet, alcohol con-sumption, etc.). Geographical correlation studies and disease map-ping use similar statistical models, but have distinct aims. Diseasemapping is descriptive in nature and geographical correlation studiesare concentrated on aetiology questions.

3. Risk assessment related to a point or line source: This type ofstudy is more adapted to situations when the risk is close to its sourceor when the source is considered a potential environmental hazard.The risk may be a linear source (powerline, road, etc.) or point source(dam, nuclear power plant, etc.).

4. Cluster detection and disease clustering: This category of studiesis conducted to provide early detection of increased incidence whenthe specific disease aetiology is not available.

Page 43: Orlando P. Zacarias - DiVA portalsu.diva-portal.org/smash/get/diva2:867936/FULLTEXT01.pdf · Stockholm University ISBN 978-91-7649-304-5 ISSN 1101-8526 DSV Report Series No. 15-020

THEORETICAL BACKGROUND 21

The above division into groups is superficial. For example, the mapping ofdiseases depends on the scale, and can provide information about diseaseclustering or, with more details, about individual clusters of diseases. Asstressed above, geographic correlation studies have much in common withdisease mapping studies, and the statistical models they use are often simi-lar. A point source of exposure may lead to an excessive number of localizedcases of a disease that can be detected on a normal disease map. However,disease dynamics are driven by many factors, which in the case of malariainclude, among others, the spatial distribution of the vector (mosquitoes),the type of therapy used and the vector’s resistance to some medicines, en-vironmental and climatic variables, etc. Figure 2.1 presents some of theactors, factors, and their hierarchical relationships. The diagram depicts themalaria process abstraction with an indication of three related layers:

• Environmental and climatic;

• Spatial and population;

• Populations, from which to extract the disease affected populationsample.

Climatic and other factors affect the vector density and the local popu-lation within the eleven administrative provinces comprising the country.However, the process is driven mainly by heavy rainfalls that brings moremosquitoes, and this can often lead to an increase in the population of in-fected mosquitoes. Consequently, more people become infected, leading tomore hospital visits. In Figure 2.1, we concentrate our study on the Maputoprovince and its at-risk population, considering the malaria vector popu-lation that survives local conditions and takes blood meals and multiplies.Some of the mosquitoes (vector population) die naturally. Persons in thestudy region may or may not acquire malaria, as a function of several factors,such as their relative immunity. The diseased persons can be successfullytreated and thus recover; while others die. From the spatial epidemiologyperspective, this thesis performs a disease risk mapping and thus deals withexplicit spatial or geographical data [63]. Maps of malaria incidence risk[54; 56; 64; 65] present a retrospective analysis of the spatio-temporal dy-namic spread rate, thus being useful for targeting malaria prevention andtreatment in a given geographical area and time period.

Page 44: Orlando P. Zacarias - DiVA portalsu.diva-portal.org/smash/get/diva2:867936/FULLTEXT01.pdf · Stockholm University ISBN 978-91-7649-304-5 ISSN 1101-8526 DSV Report Series No. 15-020

22 CHAPTER 2

Figure 2.1: Snapshot for Hierarchical Abstraction Object Structure of Disease(malaria) Process for Maputo province

Page 45: Orlando P. Zacarias - DiVA portalsu.diva-portal.org/smash/get/diva2:867936/FULLTEXT01.pdf · Stockholm University ISBN 978-91-7649-304-5 ISSN 1101-8526 DSV Report Series No. 15-020

THEORETICAL BACKGROUND 23

2.2 THE BAYESIAN PARADIGM

For most of the last century, non-Bayesian approaches to inference domi-nated statistical theory and practice. However, the last decades have wit-nessed the reappearance of the Bayesian inference approach. This processwas mainly driven by advances in computing power and techniques, and theimprovement of estimation methods using iterative sampling [17; 66]. Itsnumerous applications cover areas such as economics, the social and phys-ical sciences, etc. Bayesian data analysis is a practical method for mak-ing inferences from data by using probability models through an inductivelearning process via Bayes’ rule [17; 67].

2.2.1 BAYESIAN INFERENCE

The philosophy of classical statistics (also known as frequentist statistics),is built on the fact that all the statistical information about a random phe-nomenon that one wants to infer is based only on what is observed, i.e., onthe data. As such, the classical statistician usually assumes that the param-eters of the model hereafter referred to as θ , are fixed quantities.

However, in the Bayesian approach, the parameters of interest are regardedas random quantities. As a result, a Bayesian does not distinguish be-tween quantities that have been observed and the parameters of the model.Therefore, while constructing a Bayesian model, probability distributionsfor the parameters must be supplied. They represent uncertainty about theparameters θ and are modeled by a probability distribution π. Moreover,the Bayesian statistician, unlike the classical one, considers not only the in-formation coming from the data but also the information that the researcherhas about the phenomenon under study. This information is combined intoa prior knowledge and is represented by the prior distribution with den-sity π(θ). The prior π(θ) does not depend on the data and holds all theinformation the researcher may have about the parameters before the data isavailable or known. Two types of priors may be defined [17]:

1. informative: when the researcher has some knowledge about the pa-rameter distribution;

Page 46: Orlando P. Zacarias - DiVA portalsu.diva-portal.org/smash/get/diva2:867936/FULLTEXT01.pdf · Stockholm University ISBN 978-91-7649-304-5 ISSN 1101-8526 DSV Report Series No. 15-020

24 CHAPTER 2

2. non-informative: when the researcher has no knowledge about thedistribution of the parameters before the sampling process.1

The process of choosing the parameter θ from π(θ) is implicitly connectedto the data generation mechanism in such a way that this mechanism gener-ates the dataset Y from some unknown function f (y,θ), usually written as aconditional density f (y|θ) [68]. The goal is then to learn about the parame-ter value θ based on the data Y; i.e., Bayesian statistical inference is aboutlearning from data. Bayesian inference is based on Bayes’ rule [15; 17].Therefore, considering the random nature of θ and using the joint distribu-tion of θ and Y, the Bayesian approach determines the conditional densityof θ given the data Y [15; 17; 68] as

π(θ |Y = y) =π(θ) f (y|θ)

g(y)(2.1)

Equation 2.1 is called the posterior density of θ and will be the basis ofany inference about θ that is deemed to be interesting. The density in thedenominator g(y) =

∫π(θ) f (y|θ)dθ in case of continuous θ (while in the

discrete case, g(y) = ∑θ π(θ) f (y|θ)) is called the data marginal density.If omitting the factor g(y) in Equation (2.1) for a fixed y, and in fact itdoes not depend on θ , then one obtains the unnormalized posterior density[15; 17; 68]:

π(θ |Y = y) ∝ π(θ) f (y|θ) (2.2)

The expressions 2.1 and 2.2 are the main core of Bayesian inference, wherethe primary task of any application is first to develop the model f (y,θ) re-sponsible for data generation (the likelihood density). Then, the requiredcalculations for the summarization of the posterior density π(θ |y) are car-ried out. Within the Bayesian inference framework, the posterior distribu-tion describes the probability of the information available for a parameterof interest [69]. When the density function of this distribution has an un-known form or is too complex, intensive computational methods for its ap-proximation can be used. Through statistical simulations, we estimate the

1This leads to natural differences in researchers’ perceptions of the uncertaintylevels in relation to θ , which will in turn result in different model specifications.

Page 47: Orlando P. Zacarias - DiVA portalsu.diva-portal.org/smash/get/diva2:867936/FULLTEXT01.pdf · Stockholm University ISBN 978-91-7649-304-5 ISSN 1101-8526 DSV Report Series No. 15-020

THEORETICAL BACKGROUND 25

characteristics of probabilistic models that cannot be computed analytically[70]. When dealing with highly complex problems, due to the high numberof parameters involved, in general, Markov Chain Monte Carlo (MCMC)methods are used. The basic idea of MCMC methods is to build a Markovchain whose equilibrium distribution equals the distribution of interest. Af-ter a finite number of simulations of this chain, it is expected to reach theequilibrium distribution, and from this point give rise to a sample distri-bution of interest. There are several ways to construct Markov chains. Themost widely used approach is the Gibbs sampler proposed by Geman (1984)[71] and popularized by Gelfand and Smith (1990) [72]. Another similartechnique is the Metropolis–Hastings method, proposed by Metropolis etal. (1953) [73] and Hastings (1970) [74].

The inferential process used in this thesis regarding the methods for Bayesianregression to derive predictive models of malaria incidence (papers I–III),was implemented in WinBUGS2 software version 1.4.1. The inferences areprovided as the estimated value of the central tendency (mean) with a confi-dence level. WinBUGS is a statistical package running under the Windowsenvironment and based on the Bayesian inference Using Gibbs Sampling(also known as BUGS) project [75]. It is used for the Bayesian analysis ofcomplex statistical models and the parameters are estimated via the MCMC.

2.2.2 BAYESIAN HIERARCHICAL MODELING

Bayesian models typically hold a structure suitable for hierarchical catego-rization of their processes [15; 76]. The hierarchical structure is introducedinto the modeling by considering the prior distribution h(θ |φ) of the modelθ with prior parameters φ as one level of the hierarchy and the likelihood asthe final stage. In practice, one starts by defining the distribution thought toexplain the process that generates the data. Using the Bayes theorem, thisresults in the posterior distribution given as h(θ |y)∝ h(y|θ)h(θ ;φ). In orderto capture some complex structure of the data, the prior is usually subdividedinto a number of conditional distributions called hierarchical stages of theprior distribution [15]. The Bayesian hierarchical model is entirely definedwhen the prior parameters φ associated with the likelihood parameters θ are

2Available at: http://www.mrc-bsu.cam.ac.uk/bugs/welcome.shtml

Page 48: Orlando P. Zacarias - DiVA portalsu.diva-portal.org/smash/get/diva2:867936/FULLTEXT01.pdf · Stockholm University ISBN 978-91-7649-304-5 ISSN 1101-8526 DSV Report Series No. 15-020

26 CHAPTER 2

assigned a prior distribution. The posterior is obtained as [15],

h(θ |y) ∝ h(y|θ)h(θ ;φ)h(φ ;γ)

∝ h(y|θ)h(θ |φ)h(φ |γ). (2.3)

Two levels of hierarchy of the prior distribution can be distinguished in themodel formulation of 2.3:

• At the first level h(θ |φ)

• h(φ |γ) in the second level. These prior distributions are called hyper-priors and the corresponding parameters γ are called hyper-parameters.

Extensions to the structure can be incorporated by adding more levels ofhierarchy. Many different types of Bayesian hierarchical models exist, de-pending on the specific form of the model and the estimation strategy used[15; 17]. The most popular are models such as the random effects, the vari-ance components, the mixed and the generalized linear mixed models [77].A multiple regression Poisson approach with several hierarchies is appliedwithin the Bayesian framework to evaluate the climatic predictors of malariaincidence patterns in papers I–III of this thesis. The Poisson assumption forthe likelihood of the counted reported malaria cases, instead of the nega-tive Binomial, is due to the unavailability of the magnitude of the observedcount cases, as required by the latter method. Figure 2.2 shows a Bayesianmodel with three hierarchies developed in one of the studies presented inthis thesis. The model is in fact a sub-model of a far more complex globalmalaria abstraction model presented above in Figure 2.1. In this graphicalmodel, the observed malaria data O[i] is a stochastic node with mean mu[i],following a Poisson distribution. The node mu[i] is a logical node (equiv-alent to instantiation in most programming languages) with the right-handside shown in the Figure 2.2, under the category value. Loops in the graphare represented by a “plate,” as illustrated in Figure 2.2 for the range (i in1:m). The quantities Al pha, Beta[1], Beta[2] are parameters, while Tau.uand Tau.v are hyper-parameters of the model.

Page 49: Orlando P. Zacarias - DiVA portalsu.diva-portal.org/smash/get/diva2:867936/FULLTEXT01.pdf · Stockholm University ISBN 978-91-7649-304-5 ISSN 1101-8526 DSV Report Series No. 15-020

THEORETICAL BACKGROUND 27

Figure 2.2: Example of a hierarchical model used in paper I

Page 50: Orlando P. Zacarias - DiVA portalsu.diva-portal.org/smash/get/diva2:867936/FULLTEXT01.pdf · Stockholm University ISBN 978-91-7649-304-5 ISSN 1101-8526 DSV Report Series No. 15-020

28 CHAPTER 2

2.3 MACHINE LEARNING

Being a branch of artificial intelligence, machine learning applies increasinglevels of automation to the knowledge engineering process [78]. Throughthis, it is possible to replace time-consuming human activities with auto-matic techniques that may improve their accuracy or efficiency by findingand exploiting regularities in the training data. Learning problems can bebroadly divided into three categories [79]:

• Supervised learning (the outcome variable is provided to guide thelearning process)

• Reinforcement learning (the learner acts on the environment to findactions that yield the most reward)

• Unsupervised learning (features are provided and no outcome variableis given)

Machine learning algorithms are the basis of the data mining process. Datamining can be defined as a semi-automatic process for discovering patternsin data [13]. The results obtained should be of such importance to the ap-plied field or business as to bring additional value and advantages. Mostof the research fields found in data mining are based on machine learning.However, major differences exist between the approaches used in artificialintelligence and those in database disciplines [12]. This may explain why alarge part of the research in machine learning has been focused on learninginstead of on creating useful information for the user. Moreover, machinelearning studies the development of learning methods that can mimic hu-man behavior, whereas data mining does not attempt to replace humans. Itis driven by the goal of discovering relevant information that can be usedto provide knowledge to humans [12; 13]. Data mining falls, conceptually,under the umbrella of Knowledge Discovery in Database (KDD). As such,it is part of an extensive process of knowledge discovery. It involves datacleaning, data integration, data mining, and, lastly, the presentation of use-ful knowledge to the user. From the point of view of its functionality, datamining can be categorized into:

• descriptive (e.g., concept description, association);

Page 51: Orlando P. Zacarias - DiVA portalsu.diva-portal.org/smash/get/diva2:867936/FULLTEXT01.pdf · Stockholm University ISBN 978-91-7649-304-5 ISSN 1101-8526 DSV Report Series No. 15-020

THEORETICAL BACKGROUND 29

• predictive (e.g., classification, prediction).

Data mining is also assumed to be an academic area at the intersection ofmachine learning and statistics [80], where our main objective in this thesisis to deliver predictive models by exploiting several data mining (machinelearning 3) learning algorithms.

2.3.1 LEARNING ALGORITHMS

I - Decision Trees and Ensemble of Trees

Decision trees (DT) are classifiers based on a simple divide-and-conqueralgorithm. They predict the class labels for data items [12; 13; 81]. Theprocess of decision tree learning follows an inductive learning approach,i.e., the tree is induced from a set of data instances. Its creation begins byfirst assigning to the root node, all the examples in the training set. If theexamples belong to the same class, no decision is taken for their furtherpartition: the solution has been found. In case they belong to one or moreclasses, a test is performed at this node that results in a split. This processis recursively repeated for each new intermediate node until each branchreaches a leaf node. The nodes are split according to a criterion consideringtheir degree of impurity and using entropy or the Gini index [13; 81; 82].When using the DT, it is important to limit the complexity of the learnedtrees so that they do not overfit the training examples. Thus, while or afterbuilding the trees, two strategies are used for pruning, so as to minimize thetrue misclassification error estimate [13; 81]:

• post-pruning (also known as backward pruning) - performed after thetree is completely built.

• pre-pruning (called also forward pruning) - decision for stopping thedevelopment of subtrees is made while on the process of tree-building.

The post-pruning strategy is chosen by most implementations of DT. It isbased on two distinct approaches: subtree replacement and subtree raising.Although the error estimation is based on a weak statistical reasoning [13],there is a general recognition that works well enough in practice.

3In this thesis we use both terms synonymously

Page 52: Orlando P. Zacarias - DiVA portalsu.diva-portal.org/smash/get/diva2:867936/FULLTEXT01.pdf · Stockholm University ISBN 978-91-7649-304-5 ISSN 1101-8526 DSV Report Series No. 15-020

30 CHAPTER 2

Albeit single decision trees are considered to be relatively good classifiers(predictors), often, accuracy may be increased by combining the results ofa collection of decision trees [83–85]. In fact, this simulates a way of com-bining opinions from different experts to come to a decision, rather than justconsidering only that of a solitary adviser. Strategies for combining decisiontrees include [13]:

• Bagging (Bootstrap aggregation)

• Boosting

• Stacking (Stacked generalization)

Bagging and boosting combine the means of different decision models ofthe same type, integrating their outputs into a single prediction. In caseof a classification problem, this is accomplished through a weighted vote;whereas for numerical prediction, a weighted average is used [13; 85]. Inboosting, the weights are used to provide higher significance to more suc-cessful models, while in bagging, equal weights are given to the models.Random forests (RFs) uses a bagging strategy where the decision trees aregrown by a randomized tree-building algorithm [83]. A sampling with re-placement procedure is employed on the training set examples to producea modified training set of the same size as the original, where some train-ing instances are incorporated more than once. At each node, a number offeatures are randomly selected and the best split on them is used for split-ting the node. Thus, in each run a slightly different tree may be produced.The predictions obtained by the process of ensemble of decision trees arecombined by taking the most common forecasting or by determining theiraverage.

The integration of multiple models can also be achieved through stacking[13]. This strategy is less used than its rivals boosting and bagging, due tothe recognized difficulties around the theoretical analysis and existence ofmany different variations of its applications [13]. Stacking normally uses acombination of different models, i.e., the models are built by different learn-ing algorithms. Other approaches for learning ensemble classifiers are therandom subspace method [86] - which is similar to the RF strategy, however,

Page 53: Orlando P. Zacarias - DiVA portalsu.diva-portal.org/smash/get/diva2:867936/FULLTEXT01.pdf · Stockholm University ISBN 978-91-7649-304-5 ISSN 1101-8526 DSV Report Series No. 15-020

THEORETICAL BACKGROUND 31

with no bagging; and the rotation forests method introduced by [87]. Therandom subspace is mostly applied to learn the ensembles of the nearest-neighbor algorithm [88].

II - Support Vector Machines (SVMs)

SVMs are a set of algorithms, comprising techniques of mathematical pro-gramming developed by [89]. They are increasingly used due to their attrac-tive features (e.g., the ability to hold large feature spaces and avoid over-fitting) and promising empirical performance. They employ the principleof structural risk minimization, where the classifier is created based on theconcept of decision planes to define decision boundaries [90]. The decisionplane separates a set of objects into different class memberships.

In a simple binary (two classes) classification problem, the goal would be toseparate the observed data in two groups. The hyperplane is then positionedin a way that separates the examples of both classes to fall into each of itsside. An example of a fictitious data set separable by linear hyperplanes isgiven in Figure 2.3, where there exist many linear hyperplanes separatingthe data, but only one maximizing the margin (such that it maximizes thedistance between the hyperplane and the nearest point to each class, i.e., themaximum-margin hyperplane). In this case, we obtain a linear hyperplane,i.e., a linear model [13], with a set of N observations xi = (xi1, . . . ,xiN) andtwo classes yi ∈ [−1,1]. Three hyperplanes are determined as shown in Fig-ure 2.4, where the classes are represented by open and filled circles. Hence,the obtained maximum-margin hyperplane H0 gives the largest separationbetween the two classes. The points defining the hyperplanes H1 and H2 arecalled support vectors (examples that are closest to the decision boundaryH0), and their linear combination represents a solution to the support vec-tor problem [91]. That is to say that given the support vectors, it is easy toconstruct the maximum-margin hyperplane and the remaining examples areirrelevant.

When a linear boundary is inappropriate (i.e., the data cannot be separatedin the original feature space), SVMs can use a separating hyperplane in atransformed feature space [91; 92]. This is achieved by mapping the in-put vector x into a high dimensional feature space z. The most commonly

Page 54: Orlando P. Zacarias - DiVA portalsu.diva-portal.org/smash/get/diva2:867936/FULLTEXT01.pdf · Stockholm University ISBN 978-91-7649-304-5 ISSN 1101-8526 DSV Report Series No. 15-020

32 CHAPTER 2

Figure 2.3: Hyperplanes able to linearly separate the data.4

Figure 2.4: Maximum-margin hyperplane.4

Page 55: Orlando P. Zacarias - DiVA portalsu.diva-portal.org/smash/get/diva2:867936/FULLTEXT01.pdf · Stockholm University ISBN 978-91-7649-304-5 ISSN 1101-8526 DSV Report Series No. 15-020

THEORETICAL BACKGROUND 33

used kernel functions to perform this mapping are polynomial, radial basisfunction (RBF) and certain sigmoid [93]. Figure 2.5 illustrates an exampleof a linear non-separable support vector problem to which feature transfor-mation can be applied. SVMs can also be applied to regression problems

Figure 2.5: Linear non-separable hyperplane.4.

simply by introducing an alternative loss function [94]. There are four avail-able loss functions: Quadratic, Laplace, Huber and ε-insensitive. The mostcommonly used is the ε-insensitive band [94], where ε is a user defined pa-rameter. An extensive discussion of support vector machines can be foundin [89].

2.3.2 ALTERNATIVE LEARNING METHODS

Within the Bayesian statistical framework, it is possible to predict new ob-servations by using a prior predictive density (based on unknown observed)

4Adapted from Andrew Moore tutorials. Http://www.cs.cmu.edu/˜awm

Page 56: Orlando P. Zacarias - DiVA portalsu.diva-portal.org/smash/get/diva2:867936/FULLTEXT01.pdf · Stockholm University ISBN 978-91-7649-304-5 ISSN 1101-8526 DSV Report Series No. 15-020

34 CHAPTER 2

[15; 17; 68]. Although this path was followed initially, it was eventuallyabandoned in favor of predictive data mining approaches; considering theintegration of developed predictive models into the current health informa-tion system.

Several algorithms for a learning linear classifier, such as, least-squares lin-ear regression, linear regression, and logistic regression, can be upgraded tolearn nonlinear decision boundaries [13], and perhaps be potential learnersof new disease patterns. This can be achieved by applying the kernel trick(replace the dot product by a kernel function) to the corresponding linearlearner. However, none of the obtained nonlinear classifiers outperformsthe SVMs. In fact, ridge regression may be applied when few data pointsare available, in which case noise in the observations may induce a highvariance. But, kernel ridge regression lacks sparseness in the vector of coef-ficients [13; 95] compared to SVMs, and only has an advantage when thereare few training instances and a larger number of attributes. Also, it requiresmuch memory and uses more computational time than SVMs due to the lackof reduction of the learning data (i.e., irrelevant learning samples are not dis-carded) [95]. Similarly, a drawback in the sparseness of a solution is foundin advanced linear and kernel logistic regression methods [13; 95]. Anotherapproach that may be used to introduce a nonlinear classifier is through amultilayer perceptron. Still, the performance of this method is poor, as itcan lead to a local minimum.

2.4 CRITERIA FOR MODEL EVALUATION

The evaluation has been conducted using statistical measures of perfor-mance applied to the Bayesian and data mining predictive modeling frame-works. In the Bayesian format, a model information criteria measurementwas applied. In the data mining predictive modeling, performance estimateshave been summarized using the mean squared errors measurement.

2.4.1 BAYESIAN METRICS

The Bayes information criteria introduced [96], and the Akaike informationcriteria (AIC) [97], are the two most used criteria for model selection in the

Page 57: Orlando P. Zacarias - DiVA portalsu.diva-portal.org/smash/get/diva2:867936/FULLTEXT01.pdf · Stockholm University ISBN 978-91-7649-304-5 ISSN 1101-8526 DSV Report Series No. 15-020

THEORETICAL BACKGROUND 35

Bayesian framework. Spiegelhalter et al. [98] introduced the deviance in-formation criteria (DIC), which is an extension of AIC and used for compar-isons and adequacy while fitting Bayesian predictive models in this thesis.DIC provides a measure of the relative goodness of fit of a model. If defin-ing hierarchical models (as used in the studies of this scientific work), DICis more efficient, as it employs the number of effective parameters, whileAIC uses the actual number of parameters. The DIC of a model m is definedas [98],

DIC(m) = 2D(θm,m)−D(θm,m) = D(θm,m)+2pm (2.4)

where D(θm,m) is the deviance measure, defined as the negative of twicethe log-likelihood,

D(θm,m) =−2log( f (y|θm,m)) (2.5)

and D(θm,m) is the posterior mean. The quantity pm is the effective numberof model parameters given by

pm =D(θm,m)−D(θm,m) (2.6)

and θm is the posterior mean of the parameters used in model m. SmallerDIC values indicate the model that better fits the data. The WinBugs toolused in the Bayesian studies included in the thesis employs the DIC as itsmetric for model adequacy and comparison.

2.4.2 DATA MINING PERFORMANCE EVALUATION

The prediction performance of a machine learning scheme can be evaluatedby using an independent test set, a cross-validation technique, or both. Asthe test set consists of different examples from those used to build the clas-sifier, it is only used for evaluation purposes, i.e., not for training. When thedata is insufficient, it is reasonable to use cross-validation to avoid wastingthe data, which could be valuable as a way to obtain more reliable perfor-mance estimates by increasing the number of test instances. Thus, all dataare used to train the classifier so as to test its performance. However, theapplication of internal validation techniques such as cross-validation is not

Page 58: Orlando P. Zacarias - DiVA portalsu.diva-portal.org/smash/get/diva2:867936/FULLTEXT01.pdf · Stockholm University ISBN 978-91-7649-304-5 ISSN 1101-8526 DSV Report Series No. 15-020

36 CHAPTER 2

often sufficient [99]. As a matter of fact, when applied to new samples, theperformance of predictive models is generally lower than what is observedwhen making predictions using the original sample, even when excludingtraining examples [100]. The predictive problems investigated in this thesisconcern the application of regression techniques, such as, regression trees,random forests and support vector regression. Several error measures arecommonly used as model evaluation statistics [13]: mean-absolute error,mean-squared error, root-mean-squared error, etc. These measurements areinvaluable since they indicate error in the units (or squared units) of the tar-get (response) variable, in this way amplifying the error in the positive direc-tion independently of the sign present in the initial error score and avoidingthe issue of errors canceling each other. In the predictive studies performedin this thesis, the mean-squared error (MSE) is the adopted metric. It isgiven by

MSE =N

∑i=1

(Pi−Ai)

N(2.7)

with P1 · · ·PN being the predicted values on the test instances and A1 · · ·AN

the actual observed values. Moreover, the choice of the MSE estimator isguided by comparative purposes of the developed predictive models, aimingat selecting the models that better explain the observed cases of this diseaseand accounting for unbiasedness and minimal variance. Considering thedata transformation of the response and predictor attributes to a commonscale in the pre-processing phase, the obtained mean squared errors corre-spond to normalized errors.

Page 59: Orlando P. Zacarias - DiVA portalsu.diva-portal.org/smash/get/diva2:867936/FULLTEXT01.pdf · Stockholm University ISBN 978-91-7649-304-5 ISSN 1101-8526 DSV Report Series No. 15-020

CHAPTER 3

RESEARCH APPROACH

This chapter discusses the choice of methods and strategies for addressingthe research question. To start with, a discussion of the adopted method-ology and the proposed framework is given. Subsequently, a detailed de-scription and justification for the choice of study site is provided. Then theprocess for setting up each empirical study as well as the data collection andanalysis is presented. The chapter closes by describing the limitations of thestudies and by summarizing the ethical issues in relation to the empiricalstudies included in the thesis.

3.1 RESEARCH METHODOLOGY AND PHILOSOPHIES

According to [101], research is defined as the description of a careful “inves-tigation into and study of materials and sources in order to establish facts andreach new conclusions.” Methodology is the theory of how research shouldbe conducted, including the theoretical and philosophical assumptions uponwhich research is based and the implications for the adopted methods [102].Thus, the role of methodology is to show how to walk along the steppingstones of scientific research, helping to reflect and encourage a new lookat the world. Within the philosophy of science, different paradigms maybe used to guide methodologies for scientific research. They include posi-tivist vs. interpretive, empiricist vs. rationalist, qualitative vs. quantitative,etc. [103]. Another well known paradigm is behavioral vs. design science,

Page 60: Orlando P. Zacarias - DiVA portalsu.diva-portal.org/smash/get/diva2:867936/FULLTEXT01.pdf · Stockholm University ISBN 978-91-7649-304-5 ISSN 1101-8526 DSV Report Series No. 15-020

38 CHAPTER 3

which was introduced by [104] and addresses information systems researchproblems.

As stated in [105], philosophical paradigms can be broken down along threemain dimensions. Ontology specifies the form and nature of reality; Epis-temology refers to what is considered acceptable knowledge (i.e., what isthe nature of knowledge and what can be known); Methodology pertains tothe process of studying the phenomena (i.e., how to go about obtaining theknowledge).

In this thesis, our interest lies in the approaches of data mining and Bayesianregression for effectively extracting useful knowledge from the predictivemodels built from malaria health data in order to support decision-makingand management. This implies comparing several predictive Bayesian re-gression and data mining methods. In this process, relationships are formedbetween the concepts, through the testing of theories and hypotheses. Thissetting certainly has a predominance of positivist elements. In fact, the logicbasis of the hypothetico-deduction of predictive models tends towards a pos-itivist paradigm [106]. Positivist research assumes that reality is objectivelygiven, involves testing theories, and is described by measurable propertiesthat are independent of the researcher [107; 108]. Its knowledge generationprocess is supported by quantitative methods, which are adopted in this re-search. In contrast, the interpretative paradigm assumes that the knowledgeand meaning are driven by interpretation, where the employed methods aremainly qualitative [108]. A common distinction is also made between in-ductive and deductive research approaches [102]. Inductive research in-volves obtaining generalized conclusions from observations, progressingfrom empirical studies to the induction of new theories. In deductive re-search, the researcher deduces, using well-established theories, one or morehypotheses that are subjected to empirical analysis. The Epistemologicalassumptions of objectivism guide the positivist approach [109]. Objectiviststry to identify independent causes that can lead to effects, predict events,where testing theories and hypotheses are either verified or refuted by theobserved effects [107]. The subjectivist epistemology category from the in-terpretive approach focuses on the meaning of phenomena, rather than theirmeasurement: i.e., the object of study does not contribute to the generation

Page 61: Orlando P. Zacarias - DiVA portalsu.diva-portal.org/smash/get/diva2:867936/FULLTEXT01.pdf · Stockholm University ISBN 978-91-7649-304-5 ISSN 1101-8526 DSV Report Series No. 15-020

RESEARCH APPROACH 39

of this meaning [110]. We adopt a combination of deductive and objec-tivist approaches within the quantitative methods of knowledge generationto guide the research conducted in this thesis.

3.2 RESEARCH PROCESS AND METHODS

The research focus of this thesis involves three distinct but overlappingareas: data mining, Bayesian statistics, and public health. In Chapter 1,we identified a gap in existing methods for extracting useful knowledgefrom malaria health data to support decision-making and management. Thisknowledge can be obtained by constructing effective predictive models builtfrom reported malaria cases and other complementary data. Therefore, sev-eral predictive models and methods are compared. They attempt to discoversignificant malaria predictors, their predictive relationships to malaria in-cidence and forecast the number of new malaria episodes. Based on theanalysis provided in the previous section, the deductive research processappears to be aligned with the research question posed in this research.

Following the identification of our research approach, the next step is tofind appropriate research methods and strategies that could be applied inthe thesis. The research in this thesis follows a quantitative research ap-proach, justified by studying a realistic and objective problem embodied inthe positivist paradigm. From the point of view of Bayesian regression, theuse of this method tries to explain concepts (through inference) rather thanjust describing them. As for the perspective of predictive data mining, thenumerical prediction of new malaria episodes is sought. However, quantita-tive methods may fail under various circumstances, among which are: whenthere is a need for in-depth exploration of the problem at hand; if the prob-lem is too complex, then an in-depth qualitative study (e.g., a case study)is more appropriate and when there is a need to develop hypotheses andtheories. When the research problem requires both determining causalityand the meaning of concepts, one could resort to a mixed-methods design,where a combination of quantitative and qualitative methods is employed[102; 111].

In order to ensure better results in this investigation when applying the se-

Page 62: Orlando P. Zacarias - DiVA portalsu.diva-portal.org/smash/get/diva2:867936/FULLTEXT01.pdf · Stockholm University ISBN 978-91-7649-304-5 ISSN 1101-8526 DSV Report Series No. 15-020

40 CHAPTER 3

lected method, causal [102; 112] and correlational [112] approaches of re-search design have been used. We have sought to find an empirical asso-ciation between the independent and dependent variables. Predictive rela-tionships are found from models of multiple Bayesian regression, regres-sion trees, and forests, and support vector regression based techniques. Thechoice of causal approach is justified by the fact that we seek to establishcause–effect relationships. The correlational approach aims to determinethe existence of predictive relationships. The selection of both the causaland correlational approaches does not allow the researcher to take controlof or manipulate the independent variables, as opposed to the experimen-tal approach[112]. Also, we employ an archival research research strat-egy [102]. The adoption of this strategy can be explained by the fact thatin this investigation we use historical administrative malaria and climaticdatasets. The data limitations and the applied design strategies of each em-pirical study are described in Chapter 3.4.2. The criteria and metrics forthe performance evaluation of the predictive models is discussed in Chapter2.4. The connections between these research studies, research methods andstrategies, and the research question is given in Table 3.1.

Page 63: Orlando P. Zacarias - DiVA portalsu.diva-portal.org/smash/get/diva2:867936/FULLTEXT01.pdf · Stockholm University ISBN 978-91-7649-304-5 ISSN 1101-8526 DSV Report Series No. 15-020

RESEARCH APPROACH 41

Stud

yR

esea

rch

Des

ign

Res

earc

hSt

rate

gyPa

rts

Add

ress

edby

Res

earc

hQ

ues-

tion

Ica

usal

&co

rrel

atio

nal

arch

ival

pred

ictiv

ere

latio

nshi

psII

caus

al&

corr

elat

iona

lar

chiv

alpr

edic

tive

rela

tions

hips

III

caus

al&

corr

elat

iona

lar

chiv

alpr

edic

tive

rela

tions

hips

IVca

usal

&co

rrel

atio

nal

arch

ival

pred

ictiv

ere

latio

nshi

psan

dnu

mbe

rof

new

mal

aria

epis

odes

Vca

usal

&co

rrel

atio

nal

arch

ival

pred

ictiv

ere

latio

nshi

psan

dnu

mbe

rof

new

mal

aria

epis

odes

VI

caus

al&

corr

elat

iona

lar

chiv

alpr

edic

tive

rela

tions

hips

and

num

ber

ofne

wm

alar

iaep

isod

esV

IIco

rrel

atio

nal

arch

ival

pred

ictiv

enu

mbe

rof

new

mal

aria

epis

odes

Tabl

e3.

1:C

onne

ctio

nof

rese

arch

stud

ies,

desi

gns,

stra

tegi

esan

dqu

estio

n.

Page 64: Orlando P. Zacarias - DiVA portalsu.diva-portal.org/smash/get/diva2:867936/FULLTEXT01.pdf · Stockholm University ISBN 978-91-7649-304-5 ISSN 1101-8526 DSV Report Series No. 15-020

42 CHAPTER 3

The process of deductive research may be further enhanced by allowing forthe generalization of the resulted artifacts [102]. These artifacts are mostlyrelated to the design science paradigm [113]. Many stages are involved intheir design and development. This thesis employs only the developmentand (theoretical) evaluation phases. As such, the artifacts are representedby the predictive models developed from historical cases that arose fromdifferent population samples: children (aged 0–4 years) and all age groups(the entire population, which also includes children). They have been testedfor their ability to generalize to further regions of the country.

3.3 STUDY SITE

Mozambique is a tropical country located in Southern Africa. The countryis mainly divided into eleven administrative provinces. The study was con-ducted in Maputo province, the south-most province. A map of the studyarea is given in Figure 3.1. The province is divided into eight administrativedistricts. Its boundaries are South Africa to the south, Gaza province to thenorth, the Indian Ocean to the east, and Swaziland to the west. The studyarea was selected following the criteria of the availability of malaria casedata as well as coverage by an acceptable weather network in the regionwhen the study was launched.

3.4 PROCEDURE OF DATA COLLECTION AND ANAL-YSIS

3.4.1 LITERATURE REVIEW

As a first step, preceding the process of collecting the data, a review of theliterature on previous studies of malaria incidence patterns was conducted.This aimed at acquiring a thorough and systematic view of the malaria prob-lem in the affected regions and Mozambique in particular. Therefore, vari-ous bibliographic sources were consulted, such as:

• Reports and publications on malaria from the Mozambique Ministry

Page 65: Orlando P. Zacarias - DiVA portalsu.diva-portal.org/smash/get/diva2:867936/FULLTEXT01.pdf · Stockholm University ISBN 978-91-7649-304-5 ISSN 1101-8526 DSV Report Series No. 15-020

RESEARCH APPROACH 43

Figure 3.1: Study area - Maputo province.

Page 66: Orlando P. Zacarias - DiVA portalsu.diva-portal.org/smash/get/diva2:867936/FULLTEXT01.pdf · Stockholm University ISBN 978-91-7649-304-5 ISSN 1101-8526 DSV Report Series No. 15-020

44 CHAPTER 3

of Health1.

• Publications on population and health demographic issues from INEincluding access to its web page2 [114].

• Articles from malaria on-line journals3 employing Bayesian and otherspatial statistical techniques in relation to climatic and other factors4.

• Articles from the Journal of Statistics in Medicine5

• Articles from the International Journal of Tropical Medicine relatedto malaria and environmental factors6

Webster and Watson in [115] have discussed the matter of literature review,pointing out the need and its importance for being complete, covering rel-evant issues on the topic and also not being limited to a particular researchmethodology, a set of journals, or even a geographic region. This approachto a literature review was adopted in this thesis, with the reviewed literaturecovering very wide tropical geographical regions. Consequently, a betterunderstanding of the research area and a subsequent clarification of the pur-pose and objectives of this investigation was achieved. Along with that,it was possible to achieve a clear view of the required data, which helpeddefine the steps needed for their acquisition.

3.4.2 DATASETS AND ANALYSIS ISSUES

Data can primarily be collected by means of observation and (or) mea-surement methods. When the observations involve non-numerical data, weare dealing with qualitative data and consequently a qualitative researchapproach is used, whereas, in the presence of numerical information, wethen refer to it as quantitative data and apply quantitative research methods[102; 116]. But, it is possible in some cases to use qualitative data collection

1http://www.misau.gov.mz2http://www.ine.gov.mz/3http://www.malariajournal.com/4http://www.oxfordjournals.org/our_journals/aje/about.html5http://onlinelibrary.wiley.com/journal/10.1002/%28ISSN%291097-02586http://www.hindawi.com/journals/jtm/

Page 67: Orlando P. Zacarias - DiVA portalsu.diva-portal.org/smash/get/diva2:867936/FULLTEXT01.pdf · Stockholm University ISBN 978-91-7649-304-5 ISSN 1101-8526 DSV Report Series No. 15-020

RESEARCH APPROACH 45

procedures while gathering quantitative data. Likewise, data collection tech-niques such as questionnaires and interviews can also be applied. However,it is impractical to pursue such techniques, especially due to the wider timerange of data collected for this thesis and the financial constraints relatedto studies. Thus, this thesis gets its original quantitative data from sourceinstitutions, namely MoH, INAM and INE. This data was already collectedfor other purposes in relation to the needs and goals of each institution andis usually termed secondary data [102].

Although the strategy of archival research is usually associated with theanalysis of secondary data [102], in the present research this was not thecase. Actually, the administrative records of malaria data were initially col-lected with the purpose of analyzing and presenting the averages of malariaincidence mortality determined by the Standardized Mortality Ratio (SMR)[117]. SMR is defined as θ = Yi/Ei, where Yi is the counted number ofdeaths due to malaria and Ei is the expected count of malaria deaths basedon population size, both in the region i [117; 118]. They can be used tomap estimates of disease related death risk in a given geographical region.Because SMRs are based on ratio estimators, they have the weakness ofyielding large changes in estimates when a relatively small expected valueis given. Additionally, no variation in the expected counts is observed forzero valued of the SMRs. Moreover, SMR is considered a saturated estimateof the relative death risk, being thus non-parsimonious [117].

The population sample of this study consists of persons treated for malariain each district of Maputo province. These malaria records are composed ofrandom disease events that occurred in a given geographical administrativedistrict for each week. The weekly counting of disease episodes consti-tutes an aggregation of everyday registered cases. They were acquired fromthe MoH—the provincial health statistics departments. These departmentsbring together the records of positive malaria tests and also unconfirmedcases with symptoms similar to malaria diagnosed by health-trained person-nel (the so-called clinical malaria cases [1]). The counted data are initiallyassembled in various district health departments. Weekly malaria cases wereaggregated monthly in order to agree with a monthly aggregation of climaticvariables performed at the data source provider institution. Health interven-

Page 68: Orlando P. Zacarias - DiVA portalsu.diva-portal.org/smash/get/diva2:867936/FULLTEXT01.pdf · Stockholm University ISBN 978-91-7649-304-5 ISSN 1101-8526 DSV Report Series No. 15-020

46 CHAPTER 3

tion IRS and the climatic environmental factors (air temperature, precipita-tion, and humidity) were chosen as the independent variables. The climaticvariables were provided by INAM and were collected at weather monitorstations located in the administrative districts of Maputo province. The pop-ulation data were provided by INE and based on the national censuses 1997and 2007. All studies conducted in this thesis employ secondary data fromthe three institutions above indicated. Some use malaria cases of the chil-dren demographic group (0–4 years), while others use cases collected fromthe general population (i.e., the former is a subset of the later dataset). Mod-els for children were developed using two years of data (2007–2008), wherein 2007 the lowest raw malaria cases recorded in the general population isshown in Figure 1.2. The models for all ages (general population) were de-rived based on ten years of malaria records, 1999–2008. Although 33% ofall malaria cases recorded in 2007 are from the children’s group, only 31%of the cases recorded in 2008 were observed in the children’s group. Whatfollows is a description of the data analysis process and research approachesapplied to each study in the research process of this thesis, and is dividedinto predictive Bayesian regression and predictive data mining studies.

Predictive Bayesian Regression Studies

Paper IThis study investigates the role of weather (temperature and rainfall) vari-ables in being predictors for the variation of malaria incidence transmissionin Maputo province, in the spatial (geographical) domain. To analyze theirrelationship, a predictive Bayesian hierarchical model was derived. Themodel is based on a Poisson multiple regression of two years of malariacounts aggregated at the district level. The malaria and climatic datasetspassed the test for consistency coding errors check, and both showed nomissing values. The disease counts were annually aggregated for each dis-trict. The environmental factors were aggregated monthly for the wholeprovince. This induced a stratification of the disease cases following theseasonal breakdown of months according to the winter (dry) season April toSeptember and the summer (wet) season from October to March [119]. The

Page 69: Orlando P. Zacarias - DiVA portalsu.diva-portal.org/smash/get/diva2:867936/FULLTEXT01.pdf · Stockholm University ISBN 978-91-7649-304-5 ISSN 1101-8526 DSV Report Series No. 15-020

RESEARCH APPROACH 47

Bayesian model was extended by including the environmental explanatorycovariates (independent variables) of temperature and rainfall.

Two spatial independent random effects, correlated and uncorrelated effects,were introduced. The spatial component of uncorrelated heterogeneity (non-structurally dependent) was modeled through a set of exchangeable priorswith zero mean and variance τ2

ν , given by νi ∼ N(0,τ2ν). The correlated het-

erogeneity, reflecting the local spatial and structural influence of neighbor-ing geographic administrative areas (districts), was modeled using a Condi-tional Autoregressive (CAR) approach [120], with contiguity matrix Wi j = 1for adjacent districts i and j (i.e., neighbors) and Wi j = 0 otherwise.Four predictive models were compared in the analysis with random effectsincluded:

• with covariates;

• without covariates (simplest model);

• with either one of the covariates.

The parameter estimation and inference was performed using MCMC simu-lation techniques based on Poisson variation to draw posterior samples. Themodel comparison was performed employing the DIC. The models were fit-ted in WinBugs, and GeoBugs was used for mapping the resulting posteriordistribution of estimated malaria relative risk.

Paper IIThe main objective of study II was to investigate spatial and temporal pat-terns of malaria incidence risk in their relation to climatic predictors, suchas temperature, humidity, and rainfall. The raw provincial malaria data wasprovided as weekly counts of disease episodes for ten-year periods and weresubjected to the following operations:

1. Checking of data consistency by looking at possible illegitimate dis-ease codes (e.g., alphanumeric case numbers).

2. Extraction of district week malaria counts and subsequent combina-tion into monthly time aggregation.

Page 70: Orlando P. Zacarias - DiVA portalsu.diva-portal.org/smash/get/diva2:867936/FULLTEXT01.pdf · Stockholm University ISBN 978-91-7649-304-5 ISSN 1101-8526 DSV Report Series No. 15-020

48 CHAPTER 3

3. Tabulation to summarize and display data for each district and year.

This process was performed using the MS-Excel package. Table 1 showsthe means of the raw malaria cases data per year in each district of Maputoprovince, where a decline of reported malaria cases can be observed in mostof the districts specially from year 2003. This situation may be due to theimplementation of special measures such as distribution of mosquitoes netto the affected population.

Year Boane Magude Manhiça Moamba Marr Matola Matut Nam1999 58 696 47209 60811 50070 25076 135108 20640 143412000 67672 53553 70720 51188 24551 165678 22398 213822001 55586 37788 85399 47876 23198 176662 14004 179192002 41968 36506 90475 47124 25049 158340 14194 222182003 47396 38628 94631 30207 22052 138079 12661 217292004 42704 35357 86281 42429 19198 121008 5774 106622005 9861 11693 97516 10344 9026 160487 968 38712006 8277 9626 68296 10141 10569 80745 1295 51672007 4062 4026 33054 2403 3166 43334 738 24312008 4536 2864 35933 2329 3662 46111 419 995

Table 3.2: Mean of raw malaria events in each district per year.

Data of the climatic factors containing monthly temperature (minimum, av-erage, and maximum), humidity, and rainfall, are initially collected on adaily basis and subsequently compiled at the hosting institution, INAM,through monthly aggregation for the corresponding weather station. Uponcollection, they were checked for missing values and yielded 16.7% of miss-ing observations. Three reasons for missing data records were identified:break-down of measurement instruments, maintenance interventions, andinvalid readings due to equipment malfunctions. Therefore, missing at ran-dom mechanism [121] was assumed in the analysis of the environmentaldatasets. To overcome this problem, a multiple imputation procedure pro-ducing the complete dataset including the imputed values was employed.Five such datasets were obtained, i.e., m = 5, such scores for every miss-ing value, reflecting the uncertainty related to the missing observation. Thehigher the number m of imputed values, the more unbiased estimates areobtained for the parameters of interest and their variances [122]. A single

Page 71: Orlando P. Zacarias - DiVA portalsu.diva-portal.org/smash/get/diva2:867936/FULLTEXT01.pdf · Stockholm University ISBN 978-91-7649-304-5 ISSN 1101-8526 DSV Report Series No. 15-020

RESEARCH APPROACH 49

imputed estimate was then obtained by averaging the scores of each missingclimatic factor instance.

A preliminary bi-variate analysis of malaria case data and climatic covari-ates (temperature: minimal, mean and maximal, rainfall and humidity) wasperformed using the statistical package R [123]. All the variables showeda significant relationship to malaria incidence, with p-values smaller than0.001. However, malaria data cases are available at the district (areal) level,while the environmental data is provided as points, i.e., at the meteorolog-ical stations. This analytical mismatch situation is known as the misalign-ment problem [124–126]. To deal with the misalignment problem, the wholeprovince was interpolated through a Gaussian process [65], by overlaying a90-point grid over the study region map using Arc-view geographical infor-mation systems. The interpolation is then performed by applying Bayesiankriging.Projections of population data from the 1997 and 2007 general censusesprovided by the INE were used as the expected malaria cases in each admin-istrative district, and require no further analysis. All datasets were combinedto yield a ten-year time-series. Figures 3.2–3.4 illustrate the time trend ofthe raw malaria cases and the two climatic factors maximal temperature andrainfall in three different districts (Matola, Manhiça and Matutuine).Malaria observed counts are modeled assuming to follow a Poisson distri-bution with mean µrit . Random effects were introduced into the model asfollows:

• θi - spatial effect

• ϕrt - year and monthly temporal random effect

• δrit - space and time interaction term

The spatial dependence was defined by a conditional autoregressive processand implemented using two different specifications of the weighting matrixW :

1. Binary structure with wi j = 1 for neighboring districts i and j, andwi j = 0 otherwise.

Page 72: Orlando P. Zacarias - DiVA portalsu.diva-portal.org/smash/get/diva2:867936/FULLTEXT01.pdf · Stockholm University ISBN 978-91-7649-304-5 ISSN 1101-8526 DSV Report Series No. 15-020

50 CHAPTER 3

Figure 3.2: Time trend of malaria cases (first highest yearly district mean),temperature and rainfall.

Figure 3.3: Time trend of malaria cases (second highest yearly district mean),temperature and rainfall.

Page 73: Orlando P. Zacarias - DiVA portalsu.diva-portal.org/smash/get/diva2:867936/FULLTEXT01.pdf · Stockholm University ISBN 978-91-7649-304-5 ISSN 1101-8526 DSV Report Series No. 15-020

RESEARCH APPROACH 51

Figure 3.4: Time trend of malaria cases (lowest yearly district mean), temper-ature and rainfall.

2. Weighted by the length of the boundary of each district pair (i, j)given in kilometers (km), and wi j = 0 when the districts i and j hadno shared boundary.

A first order random walk prior allowing year independence was given tothe temporal trends ϕrt relating year and month. It is represented by a one-dimensional version of a CAR normal prior distribution, i.e.,

ϕrt ∼CARNormal(Q,σ2ϕ) and parameterized by σ

2ϕ . (3.1)

This allows the definition of the temporal neighbors of month t as beingt−1 and t+1 for t = 2, ...,11; with the first and last months having singularneighbors.

The space-time interaction terms δrit capture departures from space and timemain effects which may highlight space-time clusters of malaria incidence.In this study they are assumed to be independent for every year and monthwith a constant variance over time. An auto-regressive AR(1) prior processis chosen to represent this random effect which is parameterized by tem-poral variance allowing for correlation between consecutive months withinthe same district, i.e., assuming that malaria cases in month t are influenced

Page 74: Orlando P. Zacarias - DiVA portalsu.diva-portal.org/smash/get/diva2:867936/FULLTEXT01.pdf · Stockholm University ISBN 978-91-7649-304-5 ISSN 1101-8526 DSV Report Series No. 15-020

52 CHAPTER 3

by cases from the previous month t − 1. This relationship holds not onlyfor a month of the same year but also connecting December–January ofconsecutive years, except for January of the first and December of the lastyears. In this, it attempts to also reflect a possible year to year trend. Dueits high correlation with the maximum and minimum temperature, the meantemperature was replaced as a covariate by the temperature variation (dif-ference between maximum and minimum temperature). Also, this was im-portant to avert the use of the average temperature, since this quantity is inmost cases considered to hold less information than the temperature varia-tion for the development of malaria vectors and their survival [127]. Thehumidity, minimal, and maximal temperatures have non-linear relationshipsto log(Orit), the natural logarithm of the number of observed malaria cases.They were then converted to categorical variables. To isolate outliers andallow the observation of linear relationships between the variables of themodel, the plotting of the log-transformed climatic covariates against the ra-tio of malaria prevalence and the population was performed. This resulted inthe creation of cut-off points and ten coefficients for different threshold in-tervals. Seven different models were initially considered, and a comparisonof their suitability was performed. As a result, three models for subsequentanalysis were chosen, with a full model using a weight structure based onbinary adjacency being the best-fitting model.

Paper IIIThis investigation relates to the analysis of pediatric malaria data arisingfrom districts of Maputo province in two consecutive years: 2007–2008.Malaria data in the initial pre-processing phase comprised case records ofweekly counts of disease episodes from 1999 to 2008. After data consis-tency error checking and cleaning, only the data for 2007–2008 containedcomplete malaria case episodes for the population group of interest. There-fore, only the malaria data from these two years were aggregated to monthlyaverages for further analysis. The data in the remaining years was mixedwith the entire population malaria reported cases. Figures 3.5 and 3.6show the variations in the raw cases of malaria in the districts involved inthe study for both years. Environmental factors including rainfall, tem-perature (minimal, mean, and maximal), and humidity were provided as amonthly average by the source and required no further analysis. Statistical

Page 75: Orlando P. Zacarias - DiVA portalsu.diva-portal.org/smash/get/diva2:867936/FULLTEXT01.pdf · Stockholm University ISBN 978-91-7649-304-5 ISSN 1101-8526 DSV Report Series No. 15-020

RESEARCH APPROACH 53

Figure 3.5: Contribution of each district to monthly malaria trends, year 2007.

Figure 3.6: Contribution of each district to monthly malaria trends, in the year2008.

Page 76: Orlando P. Zacarias - DiVA portalsu.diva-portal.org/smash/get/diva2:867936/FULLTEXT01.pdf · Stockholm University ISBN 978-91-7649-304-5 ISSN 1101-8526 DSV Report Series No. 15-020

54 CHAPTER 3

population data of the region taken from the official 1997 and 2007 na-tional censuses was employed to determine inter-census population growthat the district level for each month. The combination of all datasets resultedin a time-series with 24 datapoints. Space–time predictive Bayesian mod-els based on Poisson multiple regression were designed and fitted with theinclusion of spatial and temporal correlations. The expected cases were de-termined as the product of the number of children in each district per monthand a fixed rate of malaria incidence per 10,000 children population year.The model at the log-level is given by

log(µit) = log(Eit)+α +βT Xit +θi +ϕt +δit (3.2)

The main parameters are ϕt , representing the temporal monthly effect mod-eled as a first order random walk process (RW(1)) and θi, capturing spatialvariability. This spatial component is introduced through a conditional au-toregressive process with a binary weighted matrix wi j = 1 for neighboringdistricts and wi j = 0 otherwise. The space-time component δit highlightsspatio-temporal clusters of disease risk and is modeled by an exchangeablehierarchical structure δit ∼ N(0,τ2

δ) with a constant variance. Four differ-

ent models were fitted and compared to determine the model with the bestperformance (smallest DIC value). The model with no spatial effects wasfound to fit the data best.

Predictive Data Mining Studies

In this set of studies, we investigated predictive models for forecasting thenumber of new malaria cases considered in the research question discussedin Chapter 1.3. The datasets used in most of the studies are similar, withstudies II, IV and VI employing all ages (entire population) malaria data;while studies III and V use malaria case records from the population of chil-dren. Therefore, we avoid repeating their analysis here, except in relationto the health vector intervention control strategy represented by the variableIRS, which was not included in the previous studies. IRS is given in per-centage, representing the number of houses sprayed for the malaria vector(i.e., the anopheles mosquito) control in each administrative district. Everyhouse can undergo up to two rounds of spraying per year [1; 6], depending

Page 77: Orlando P. Zacarias - DiVA portalsu.diva-portal.org/smash/get/diva2:867936/FULLTEXT01.pdf · Stockholm University ISBN 978-91-7649-304-5 ISSN 1101-8526 DSV Report Series No. 15-020

RESEARCH APPROACH 55

on the available resources. In Mozambique, these vector control activitiesare carried out in the period before the rainy season and towards the mid-dle of the dry season. The IRS data was considered clean after consistencychecking for coding errors, and integrated with the malaria and climate datasets for further investigation. What follows is a presentation of the methodsapplied in each predictive data mining study.

Paper IVData preprocessing using Weka 3-7-5 data mining tool was performed inthree steps:

• Replacement of around 16.7% of missing predictor variables by themean value of each attribute.

• Conversion of nominal data into binary performed by employing Wekafilters. This was useful for a better investigation of each particular cat-egorical variable.

• Data reduction to a similar scale through normalization of all numeri-cal variables to fall into the interval [−1.0,1.0]. This was achieved byadopting the min-max linear transformation on data attributes.

The tuning of the random forests parameters, the number of best possiblesplits on the attributes (m), and the number of trees (ntrees) was performedemploying two strategies:

1. Based on multiples of m =√

p, where p− number of features;

2. Using the default ( m =√

p), half (m/2)and twice (2m) the default.

The best combination of the number of trees and the splitting attributes wasthen chosen. The dataset was divided into a training set, comprising nineyears (1999–2007) and a test set of one year (2008). A time-window strategywas adopted with the training set sub-divided into nine subsets for trainingto investigate the accuracy of each derived predictive model based on thisgrouping of the historical data. The predictive performance of the modelswas assessed by analyzing their normalized mean squared errors. Permuta-tion accuracy from random forests was used to measure the contribution ofeach predictor to improving the predictive performance of the best model.

Page 78: Orlando P. Zacarias - DiVA portalsu.diva-portal.org/smash/get/diva2:867936/FULLTEXT01.pdf · Stockholm University ISBN 978-91-7649-304-5 ISSN 1101-8526 DSV Report Series No. 15-020

56 CHAPTER 3

Paper VThe data was collected from the eight administrative districts of Maputoprovince and monthly averaged. There were two categorical variables rep-resenting the district names and the month of the year, the others were nu-merical. The temperature variable included the minimum and maximummeasurements. Temperature variation was defined as the difference betweenthe maximum and minimum temperatures, and was introduced to replacethe average temperature. A data transformation was then performed for usewithin the SVMs as follows:

(a) Numeric variables were normalized to be in the interval [0.0,1.0]. Thiswas performed by applying the min-max linear transformation on theattributes of the original dataset.

(b) Non-numeric multi-valued variables were converted to binary, with theapplication of Weka filters.

The dataset was then sub-divided into a training set based on two yearsof observations (2007–2008) and a testing set comprising the observationsfrom the subsequent year (2009). A grid optimization technique was em-ployed for parameter tuning, which resulted in selecting the radial basisfunction (RBF) kernel and the remaining relevant support vector machineparameters. A strategy for assessing the importance of each variable, fromthe random forests approach, was adopted to measure the predictors’ signif-icance.

Paper VIThe data was preprocessed to meet the SVMs’ requirements in three stages:

• Replacement of missing values in training set (amounted to 16.7% ofall predictors’ measurements).

• Transformation of each numerical variable by scaling to the interval[−1.0,1.0], performed using min-max linear normalization on dataattributes.

• Transformation of two nominal variables (administrative districts andmonth of the year) into binary, by employing Weka filters.

Page 79: Orlando P. Zacarias - DiVA portalsu.diva-portal.org/smash/get/diva2:867936/FULLTEXT01.pdf · Stockholm University ISBN 978-91-7649-304-5 ISSN 1101-8526 DSV Report Series No. 15-020

RESEARCH APPROACH 57

The problem of missing values is addressed in two ways:

1. Replacing the missing values by the mean of each attribute.

2. In the second approach, the the multiple imputation techniques ap-plied in [65] were used.

The selection of the parameters and the required kernel was performed usingthe two training sets after replacing the missing values and employing a grid-search procedure. This resulted in choosing the radial basis kernel for bothdatasets with corresponding cost and gamma values for further analysis.Breiman [83] introduced and implemented the concept of the importancepermutation accuracy level to estimate the influence of predictor variablesin the random forests learning method. However, the SVMs do not includethe determination of importance factors in their modeling prediction pro-cess [91]. To accommodate this, we proposed a modification of the randomforests approach [83] for estimating the importance of each variable. Thisis achieved by randomly permuting the values of one feature at a time andmeasuring the effect on the predictive performance using the test set.

Paper VIIThe aim of this study was to evaluate the ability of the predictive models de-veloped in studies four and six, to generalize to new datasets. Actually, wetested the hypothesis of the usefulness and reliability of the derived monthlypredictive models of the number of malaria incidence episodes forecastingon new datasets. Hence, we seek to determine the applicability of the devel-oped predictive models to other regions of Mozambique. That is, to regionsdifferent from where the initial models were derived, and at different (fu-ture) time periods (years).

Two datasets of reported malaria cases, with one containing records of sam-ples from children less than five years old and the other from all age groupsof the population, were collected. The data collection process was per-formed in two distinct provinces from the central and northern regions of thecountry, and also included datasets of intervention control activities (indoor-residual spray) and climatic factors. The dataset of malaria cases held aweekly aggregation of everyday disease episodes. Following a check for

Page 80: Orlando P. Zacarias - DiVA portalsu.diva-portal.org/smash/get/diva2:867936/FULLTEXT01.pdf · Stockholm University ISBN 978-91-7649-304-5 ISSN 1101-8526 DSV Report Series No. 15-020

58 CHAPTER 3

their consistency and for possible coding errors, they were extracted andfurther aggregated, yielding monthly district malaria cases. Environmentalclimatic factors data were originally aggregated by month at the data sourceinstitution. Microsoft Excel was first used for preliminary data analysis andpreparation. However, 2.35% of the values of the climatic factors were miss-ing and were replaced by the mean of each attribute. The indoor-residualspraying needed no further checking. All datasets were then combined intoa single data file and prepared for additional investigation within the derivedmodels.

Since the datasets used for the process of testing the models on unseendatasets were not chosen at random, this raised the issue of sample selectionbias [128], also known as the covariate shift problem [129]. As a matter offact, the fundamental principle of most machine learning schemes was vio-lated since the training and testing samples were not drawn from the sameindependently identical distribution (iid). Unbiased estimates of testing er-rors can be determined by employing importance weighting [130], whereexamples of the training set are weighted using an estimate of the ratio be-tween the testing and training sets through feature densities. A Gaussiankernel was used to estimate the training and testing distributions through akernel density estimation, where a data-driven bandwidth selection processwas firstly performed.

3.5 RESEARCH LIMITATIONS

One of the limitations of this research relates to the quality of the providedmalaria cases data, being presented as a mix (cocktail) of clinically definedcases and parasite detection rates. This has the potential of introducing aconfounding factor, as clinical cases are based on patients’ febrile condi-tions 7, which was not addressed in this thesis. Secondly, the climatic dataavailable for this study was aggregated on a monthly basis while the his-torical malaria cases were aggregated on a weekly basis. This hindered ourability to analyze the data at a more refined scale. Besides the weather dataused in this investigation, there are other potential sources (such as early

7See [1] for definition and explanation.

Page 81: Orlando P. Zacarias - DiVA portalsu.diva-portal.org/smash/get/diva2:867936/FULLTEXT01.pdf · Stockholm University ISBN 978-91-7649-304-5 ISSN 1101-8526 DSV Report Series No. 15-020

RESEARCH APPROACH 59

disease diagnosis, timely and appropriate treatment, drug resistance, socio-demographic information, vegetation index, etc.) that can explain the vari-ation in malaria transmission, but were not measured and consequently notused. The inclusion of these into the modeling could perhaps have enrichedand improved the analysis, and possibly have led to reducing the uncertaintyof the estimates.In our research, the learning methods for building the regression modelswere chosen from among: decision trees, random forests [83], and supportvector machines [89]. The choice of the last two learning algorithms wasbased on their ability to build predictive models with higher accuracy, whilethe first was used as a benchmark. The use of a Bayesian inference pro-cedure was motivated by its flexibility, capacity to handle highly complexpredictive models, and its incorporation of prior information [15; 17].

3.6 ETHICAL CONSIDERATIONS

In the research process, we are usually confronted with the question ofethics. Ethics refers to our behavior in relation to the rights of those whobecome the subject of our research or are in any way affected by it [102].Therefore, research ethics include issues such as the formulation and clari-fication of our research questions and topic, the collecting, processing andstorage of the data, the analyses of the data, and the presentation of ourfindings in a responsible and respectfull manner [102; 116].

In view of using the archival research strategy in this research and the methodapplied for collecting secondary data described in the previous section, eth-ical concerns were considered in the following manner. As the providedhealth data are aggregates of counted malaria cases, it was not possibleto distinguish individual patients suffering from the disease and thus ac-cess to individual health records was not possible. The climatic data werenot affected by ethical considerations at the individual level. However,the datasets may pose ethical issues in relation to the appropriateness oftheir use and correct acknowledgment. Therefore, letters from the main re-searcher (the author of this thesis) were sent to the Mozambican Ministryof Health and the National Institute of Meteorology ensuring the adequate

Page 82: Orlando P. Zacarias - DiVA portalsu.diva-portal.org/smash/get/diva2:867936/FULLTEXT01.pdf · Stockholm University ISBN 978-91-7649-304-5 ISSN 1101-8526 DSV Report Series No. 15-020

60 CHAPTER 3

use of the released malaria and climatic data from all involved researchers.It was also stressed that the provided data would be used only within thecurrent project.

Page 83: Orlando P. Zacarias - DiVA portalsu.diva-portal.org/smash/get/diva2:867936/FULLTEXT01.pdf · Stockholm University ISBN 978-91-7649-304-5 ISSN 1101-8526 DSV Report Series No. 15-020

CHAPTER 4

SUMMARY OF RESULTS AND

FINDINGS

This chapter is divided into two main parts: the predictive Bayesian re-gression and the predictive data mining. Each contains a presentation of thefindings of the empirical studies performed. In all the studies, I was the mainauthor of the papers and contributed in research activities such as: study set-ting and design, programming, data analysis, drafting of articles and also inimproving the papers, based on peer-reviewers comments.

4.1 PREDICTIVE BAYESIAN REGRESSION

In this section, the Bayesian application is sub-divided into three studiesbased on a hierarchical Poisson regression approach. These empirical stud-ies allowed finding methods for knowledge extraction presented as predic-tive relationships of significant malaria incidence predictors, estimates of re-gional trends of malaria incidence risk, and their mapping in space and time.This extraction process is carried out after comparing the predictive perfor-mance of several prediction models developed using Bayesian spatial meth-ods. Hence, only prediction models showing lower error levels, i.e., modelswith best predictive performance are used in this process. The studies do notinclude the time-lag perspective in their modeling to find significant links todifferent predictors. However, the prediction models were able to detect

Page 84: Orlando P. Zacarias - DiVA portalsu.diva-portal.org/smash/get/diva2:867936/FULLTEXT01.pdf · Stockholm University ISBN 978-91-7649-304-5 ISSN 1101-8526 DSV Report Series No. 15-020

62 CHAPTER 4

and describe malaria transmission dependencies on the variations of someof the independent variables used (possible risk factors). Additionally, theyemploy the so-called single lag of zero months combined with non-linearmodeling of fixed effects (input variables) such as temperature and humid-ity [131]. The presentation of the maps of the malaria relative risk aids inillustrating areas of high risk, where further interventions may be required.This is especially important considering that for an effective application ofcontrol measures and a timely targeting of interventions, an accurate assess-ment and identification of the areas and months of risk is required to evaluatemonthly and seasonal malaria transmission at the district level. Followingsimilar studies [54; 132; 133] of country-specific malaria transmission mapsand its iteration to environmental factors, the results obtained in this thesiswere somehow expected, with some refinement on temperature variability,introduced by the temperature variation predictor in study II. Further spe-cific results are given under each study in the corresponding paper.

4.1.1 PAPER I

The study reported in Paper I is titled: Mapping malaria incidence distribu-tion that accounts for environmental factors in Maputo province - Mozam-bique. It is authored by Orlando P. Zacarias and Mikael Andersson.

This study served as the starting point to explore possible correlations of twobasic weather parameters and malaria spatial incidence variation, outliningdisease risk at the district level in Mozambique. In order to extract valuableinformation relating the weather and malaria datasets, Bayesian Poisson re-gression models were fitted by the MCMC procedure. The climatic predic-tors used were the maximum temperature and rainfall in Maputo province.The study covers two years, 2001–2002, and combined the eight admin-istrative districts of the province. The main outcome of this study is thedevelopment of methods for the extraction of spatial malaria risk factorsusing a seasonal approach: the summer (wet, rainy) and winter (dry, cold)seasons.

Likewise, significant relationships of malaria health data to the predictorsused were effectively extracted after comparing the performance of sev-

Page 85: Orlando P. Zacarias - DiVA portalsu.diva-portal.org/smash/get/diva2:867936/FULLTEXT01.pdf · Stockholm University ISBN 978-91-7649-304-5 ISSN 1101-8526 DSV Report Series No. 15-020

SUMMARY OF RESULTS AND FINDINGS 63

eral predictive models developed using the obtained Bayesian spatial meth-ods. The study also indicates that useful knowledge revealed as seasonalestimates of malaria predictors and incidence risk can be extracted from asimple model of spatial prediction. These estimates disclose some ratherintuitive geographical association between malaria incidence variation andclimatic predictors in the region. Thus, the conclusion was drawn that theemployed methods with seasonal grouping of malaria incidence cases pro-vide an effective framework for the analysis of this type of data at the spatiallevel.

Compared to similar studies, these results show that seasonal climate vari-ation influences the transmission of malaria in Mozambique, although notsignificantly altering the spatial patterns of the disease, unlike [134], thatindicates no pattern modification over the seasons of the year. Further, thequantification of the regional unstructured random effects term was essen-tial to capturing some of the remaining over-dispersion not explained byclimatic factors; contrary to previous research [56], where these estimateswere not measured. Evidence from a similar study in Africa [135] foundrainfall to be associated to malaria in the summer season, whilst this inves-tigation established no such indication.

4.1.2 PAPER II

Paper II reports the second performed study and is titled: Spatial and tem-poral patterns of malaria incidence in Mozambique. With a list of authors,Orlando P. Zacarias and Mikael Andersson.

The aim of this study was to find effective predictive methods based on spa-tial and temporal trends of malaria incidence and determine the influencethat climatic predictors, such as temperature, rainfall, and humidity, have onmalaria variation in Maputo province, Mozambique. The developed space-time Bayesian predictive models use datasets of the ten years 1999–2008aggregated for each month. The full model with binary spatial structurewas found to be the most effective prediction model, with relative perfor-mance 10711.8 compared to 10714.8 of the border length prediction model.Knowledge of predictive relationships was extracted by analyzing the sig-

Page 86: Orlando P. Zacarias - DiVA portalsu.diva-portal.org/smash/get/diva2:867936/FULLTEXT01.pdf · Stockholm University ISBN 978-91-7649-304-5 ISSN 1101-8526 DSV Report Series No. 15-020

64 CHAPTER 4

nificance of the malaria predictors. The maximum temperature and relativehumidity were the most influential predictors. The outcome of the analysesalso show that rainfall, minimal temperature below 16.4oC and maximumtemperature in the range 24.5− 27oC, marginally contributed significantlyfor malaria escalation and consequent variation of the disease incidence risk.These findings are consistent with the [127] in relation to the predictor mini-mal temperature in the indicated range. However, the malaria risk incidencemay have been triggered by a combination of the minimum temperaturevariable’s being within 17− 21.1oC (although not significant), the maxi-mum temperature’s being within 28−35oC, and the relative humidity’s be-ing within 52.4−83%. Moreover, an increase of 1oC in the maximum tem-perature has the potential to lead to a higher increase of malaria incidence,whereas an increase of 1% in the humidity generates a variation from lowerto higher malaria incidence. Although with marginally significant effect, thepredictor monthly temperature variation also influenced the transmission ofdisease in the studied period. This result is in fact in line with that of [127]on the way temperature variation drives malaria transmission. The trends ofincidence in the districts of Matutuine and Namaacha are generally lowerthan in other districts, except for August, when this trend increases in theNamaacha district. The methods used lead us to conclude that the effec-tive model derived allows not only learning the variation of the incidence ofmalaria considering all months of the same and the following year, but alsoextracting predictive relationships based on significant predictors.

The approach used in this study to quantify the space-time seasonal malariavariation between months and connecting to the following year differs fromthat of previous research. The investigations [55; 56] specify only betweenmonths or years connections respectively, whereas [136–138] follow a monthlyclassification as being either suitable or not for malaria transmission. More-over, our approach naturally enables visualizing the distribution of the ex-tracted malaria seasonal incidence without explicitly modeling the seasons.This is beneficial, as, for instance, the summer season is extended from thelast few months of each year to the first months of the next year.

The research conducted in this Paper II combining space and time perspec-tives in the analysis of the data shows the usefulness of the extracted knowl-

Page 87: Orlando P. Zacarias - DiVA portalsu.diva-portal.org/smash/get/diva2:867936/FULLTEXT01.pdf · Stockholm University ISBN 978-91-7649-304-5 ISSN 1101-8526 DSV Report Series No. 15-020

SUMMARY OF RESULTS AND FINDINGS 65

edge, underlying the relationships between the variation in the incidence ofmalaria and the independent climate variables used. Presenting this previ-ously unknown knowledge to health policy makers in an accessible way cansubsequently lead to making more informed decisions.

4.1.3 PAPER III

The third investigation is reported in paper III with the title: Comparisonof infants malaria incidence in districts of Maputo province, Mozambique.The authors are Orlando P. Zacarias and Peter Maijlender.

This study proposes space and time Bayesian predictive models based onPoisson regression, using malaria data from the most vulnerable populationgroup in Mozambique. The model is built out of the Maputo province pe-diatric data (children less than five years of age), covering 2007–2008. Itincludes environmental factors, such as temperature, humidity, and rainfall,as the main malaria predictors.

The results determined as the most effective the prediction model with nospatial dependence of malaria incidence within the districts studied. Themajor conclusion that was drawn was that very low spatial variation waspresent in the reported malaria cases. Using the derived predictive mod-els, it was possible to determine that the independent variables rainfall andhumidity were the most significant predictors of malaria variation in theregion. Thus, both possess a positive predictive relationship to malaria inci-dence. The temperature covariate was found to be a less important malariarisk factor for the children’s demographic group in the region. Moreover,every additional 1% humidity level and 1 mm rain could lead to an increaseof 5% and 1% in the malaria incidence risk, respectively. Therefore, relativehumidity appear to play a substantial role in disease variation and transmis-sion. Space and time interactions are allowed to vary monthly and con-necting two consecutive months. This avoids directly modeling the seasons.Coincidentally, the malaria incidence is high from October to March, whichhappens to be the wet season.

The Bayesian space-time inference problem of this study is intended to finduseful knowledge of risk factors and disease estimates that influence malaria

Page 88: Orlando P. Zacarias - DiVA portalsu.diva-portal.org/smash/get/diva2:867936/FULLTEXT01.pdf · Stockholm University ISBN 978-91-7649-304-5 ISSN 1101-8526 DSV Report Series No. 15-020

66 CHAPTER 4

incidence in the population of children. The availability of these estimatescan be of support to health decision making by providing explicit infor-mation on the relationships between climatic input variables and malariatransmission and variation. A further supportive tool is given by the mapsof monthly space-time disease risk estimates. They highlight a particularinsight for health decision makers, by suggesting the need for special mea-sures and programs to be in place for children in the rainy period and dis-tricts.

Contrasting with [53; 55; 56], the study III found no regional spatial cor-relation in the studied period. A high incidence risk was found in the wetseason, as in [54; 55; 133]. The specified time effect quantified the estimatedrisk by allowing for the connection of every consecutive month except thefirst and last, while in [53], a linear dependence structure of some months isemployed. However, both studies were able to establish spatially smoothedtemporal trends of high incidence risk clusters in some regions.

4.2 PREDICTIVE DATA MINING

This part of the research findings is sub-divided into four studies investi-gating methods to derive predictive models for forecasting the number ofnew malaria incidence cases. As such, it is the first study in Mozambique(perhaps in Southern Africa) performing numerical predictive estimates ofmalaria incidence cases from the data mining perspective. Before present-ing the findings of each predictive data mining study, here we summarizethe most general results. The studies present several methodological con-tributions in the application of data mining and computer science to healthand variation of malaria transmission. Thus, in this part of our research, wepropose methods that are computationally more effective in predicting thenew incidence cases of malaria, which includes:

• Adoption of the time-window strategy in selecting the appropriatemalaria datasets for training the learner.

• Adoption of the random forests strategy of evaluation of the impor-tance of each variable: ranking the malaria predictors using the sup-

Page 89: Orlando P. Zacarias - DiVA portalsu.diva-portal.org/smash/get/diva2:867936/FULLTEXT01.pdf · Stockholm University ISBN 978-91-7649-304-5 ISSN 1101-8526 DSV Report Series No. 15-020

SUMMARY OF RESULTS AND FINDINGS 67

port vector machine approach for regression problems.

• Adoption of an importance weighting strategy to overcome the sam-ple selection bias problem while generalizing the derived predictivemodels to new malaria episodes. This is relevant if we expect thattest data can be drawn from a different distribution than that of thetraining set.

Data analysis methods used in previous research such as [41–43; 45] per-form a class label prediction. Thus, they lack numerical forecasting, i.e.,there is no prediction of continuous values. The obtained predictive modelscan be seen as a supportive tool for health decision makers, where valuableinformation about relationships between malaria independent variables andforecasting of new disease episodes is given in a very explicit way. Whatfollows is a description of the results obtained in each of the predictive datamining studies.

4.2.1 PAPER IV

Paper IV summarizes the results obtained in this study. It is titled: Predict-ing the Incidence of Malaria Cases in Mozambique Using Regression Treesand Forests. The authors are, Orlando P. Zacarias and Henrik Boström.

The study investigates predictive models for the number of new malariacases in districts of Maputo province employing regression trees and randomforests methods. The used data includes administrative districts, the numberof reported cases of malaria, indoor-residual spray activities performed, andthe climatic variables temperature, rainfall, and humidity.

Nine models employing decision trees and random forests techniques gener-ated from different time windows predicted monthly malaria cases episodesin the studied region. More effective predictions were derived by the predic-tive model based on a two-year time window. Therefore, building modelsbased on methods where a reduction of the time window as to what histori-cal data to take into account, may substantially increase their predictive per-formance. A comparison of the predicted results showed 70% correlation tothe actual malaria cases from the testing set. After conducting a paired t-test

Page 90: Orlando P. Zacarias - DiVA portalsu.diva-portal.org/smash/get/diva2:867936/FULLTEXT01.pdf · Stockholm University ISBN 978-91-7649-304-5 ISSN 1101-8526 DSV Report Series No. 15-020

68 CHAPTER 4

comparison of the decision trees and Random Forests best predictive mod-els, the results show a statistically significant difference. Permutation accu-racy importance was chosen to determine the importance of the variables ofthe malaria predictors. The scores of the predictors’ relative importance forthe two-year random forests model ranked the indoor-residual campaign asthe most important predictor. Thus, these campaigns have a greater contri-bution towards malaria incidence reduction in the province. In the group ofenvironmental variables, minimal temperature and rainfall are the most im-portant predictors. From a regional (administrative district) perspective, thedistricts subjected to high levels of malaria incidence are Manhiça, Matolaand Marracuene. The results of temporal (month) variability show that inJanuary the number of cases reached their peak. This was followed by themonths of July, June, September, August and October.

The findings of this study show the usefulness of the knowledge extractionperformed by the random forests approach in providing accurate estimatesof future numbers of malaria episodes. This is extended by including theresults regarding the significant predictive relationships of the independentvariables and malaria incidence forecasting.

4.2.2 PAPER V

This study is reported in paper V which is titled: Strengthening the HealthInformation System in Mozambique through Malaria Incidence Prediction,by Orlando P. Zacarias and Henrik Boström.

The study aims at developing a predictive model to estimate the number ofmalaria cases for children up to four years old in Maputo province, Mozam-bique. The model is based on support vector machines, employing supportvector regression with the ε-insensitive loss function. It makes use of his-torical data of reported malaria cases, climatic factors (temperature, humid-ity and rainfall) and the vector control strategy of indoor-residual spraying,covering three years: 2007–2009.

After optimizing the required parameters, the results show that the predic-tive model based on a radial basis kernel was the most effective. There-fore, the conclusion can be drawn that support vector machines methods

Page 91: Orlando P. Zacarias - DiVA portalsu.diva-portal.org/smash/get/diva2:867936/FULLTEXT01.pdf · Stockholm University ISBN 978-91-7649-304-5 ISSN 1101-8526 DSV Report Series No. 15-020

SUMMARY OF RESULTS AND FINDINGS 69

using radial basis kernels are more effective than employing other kernelsfor malaria pediatric data, and this model is used to accomplish forecastingof new pediatric malaria cases. The support vector machines method doesnot provide a way to perform a ranking of the relative importance of thepredictors. Consequently, the random forests permutation accuracy strat-egy was adopted with predictors being randomly permuted using the testdataset. This allowed extracting information regarding the relative impor-tance of the independent variables used. The predictor temperature variationscored the topmost importance value, thus providing the highest predictivepower. From the spatial perspective, the districts of Matutuine, Namaacha,Moamba and Marracuene were ranked highly. These districts are mostprone to reporting an increased number of new malaria incidence cases.As for the time variable, the months April, February, September and Au-gust were ranked first, whereas March, June and October were ranked withrelatively less importance in their impact on malaria incidence escalation.The predictive performance of the infants model on the test set is good withnormalized mean squared error of 0.007102731. This is illustrated in Figure4.1, where the strength of the prediction model is assessed by contrastingthe observed test set data of 2009 and the predicted estimates of this testingprocedure. The x-axis represents the space–time (district–month) relation,while in the y-axis, we have the number of malaria episodes (original andpredicted malaria cases).

This investigation relates to the research question by providing knowledgeabout estimates of new malaria episodes using support vector machinestechnique and also information on the relative importance of the includedinput variables in the prediction of these estimates.

4.2.3 PAPER VI

The study VI is disclosed in paper VI and titled: Comparing Support VectorMachines and Random Forests Modeling for Predicting Malaria Incidencein Mozambique. Authors are Orlando P. Zacarias and Henrik Boström.

The focus of this study is to derive and compare malaria predictive modelsdeveloped on the basis of support vector machines and random forests tech-

Page 92: Orlando P. Zacarias - DiVA portalsu.diva-portal.org/smash/get/diva2:867936/FULLTEXT01.pdf · Stockholm University ISBN 978-91-7649-304-5 ISSN 1101-8526 DSV Report Series No. 15-020

70 CHAPTER 4

Figure 4.1: Original versus Predicted using test set 2009.

Page 93: Orlando P. Zacarias - DiVA portalsu.diva-portal.org/smash/get/diva2:867936/FULLTEXT01.pdf · Stockholm University ISBN 978-91-7649-304-5 ISSN 1101-8526 DSV Report Series No. 15-020

SUMMARY OF RESULTS AND FINDINGS 71

niques, using historical datasets from Maputo province. The SVM methodemployed is realized through one of its sub-class, the support vector re-gression, and was used with the ε-insensitive loss function. A dataset withrecords of malaria cases covering 1999–2008 was used to evaluate the pre-dictive models on the last year when developing models from one up to nineyears of historical data.

The DT method was used for a baseline comparison. Missing climatic inputvalues were replaced using two strategies: multiple imputation and attributemean replacement. The two-year time window predictive models generatedby the SVMs using both strategies to handle missing values accomplishedthe best predictive performance compared to the random forests model. Thelowest obtained mean squared error 0.0032699 was far better than the RFs’relative performance of around 0.0171. However, the model based on thedataset with missing values replaced by the mean of each attribute derivedusing SVMs approach obtained the most effective performance. This couldbe explained by the low variance level of the corresponding missing at-tributes, so that replacement with the “most common value” was more thanenough. In fact, predictive models using SVMs were more effective thanboth random forests and decision tree derived models, thus showing theirsuitability in application such as mining malaria data. The accuracy of thepredictive models using random forests is higher than the baseline results(decision tree) for the first seven cases (i.e., from nine- to three-year timeframe models). Moreover, the best predictive model show a statistical sig-nificance difference in relation to the random forests derived model, withvery low p-value.

As in Paper V, there was a need to overcome the weakness of SVMs inrelation to the evaluation of the relative importance of the predictors. Therandom forests permutation accuracy strategy was adopted with the predic-tors being randomly permuted using the test dataset. The knowledge ex-tracted from this variable importance analysis gave the first rank to the pre-dictor indoor-residual spray in both the support vector machines and randomforests approaches. They also ranked high the importance of the adminis-trative districts of Manhiça and Matola from the spatial perspective, and themonth of January from the temporal perspective. The derived SVMs predic-

Page 94: Orlando P. Zacarias - DiVA portalsu.diva-portal.org/smash/get/diva2:867936/FULLTEXT01.pdf · Stockholm University ISBN 978-91-7649-304-5 ISSN 1101-8526 DSV Report Series No. 15-020

72 CHAPTER 4

tive model was able to determine that the months of February, March, Mayand June are influential in raising the number of new malaria episodes. Also,climatic factors such as humidity, minimal temperature, and temperaturevariation (the difference between the maximum and the minimum temper-atures) are of the greatest influence regarding malaria case increases. Lessinfluential temporal malaria predictors are the months from July to Decem-ber, and April as well.

One of the methods used in this paper is similar to the one in Paper IV,but with a larger dataset. The paper contributes to the research question bycomparing the two most used numerical predictive methods: support vectormachines and random forests. Useful and unexpected information extractedfrom this investigation was that some variables were equally highly rankedby both methods. The use of the time frame approach for model devel-opment is apparently of significant importance since not all historical dataseem to contribute for obtaining an effective model. Additionally, the useof sophisticated techniques of data replacement like multiple imputation didnot lead to greater improvements of model predictive performance com-pared to using a simple attribute mean replacement strategy. Afterwards,the predictive model derived using the support vector machine regressionapproach was able to relate its results to the effect of seasonality. This canbe seen by the relative importance ranking of the months January–March,where the environmental conditions are adequate for triggering most malariacases in Mozambique, as this coincides with the wet season (summer or therainy season).

4.2.4 PAPER VII

This investigation is reported in paper VII and is titled: Generalization ofMalaria Incidence Prediction Models by Correcting Sample Selection Bias.The paper is authored by Orlando P. Zacarias and Henrik Boström.

The study evaluates the generalization ability of two predictive models basedon support vector machines to forecast the number of malaria incidenceepisodes, developed taking datasets from the Maputo province districts andcertain time periods in Mozambique. The research is extended by including

Page 95: Orlando P. Zacarias - DiVA portalsu.diva-portal.org/smash/get/diva2:867936/FULLTEXT01.pdf · Stockholm University ISBN 978-91-7649-304-5 ISSN 1101-8526 DSV Report Series No. 15-020

SUMMARY OF RESULTS AND FINDINGS 73

the investigation of the sample selection bias problem. The models were de-rived in papers V and VI, where the application of support vector machinesin the model construction was based on the support vector regression tech-nique. The predictive performance was evaluated using two-holdout sets,one for children up to four years old and the other considering malaria datafrom all age groups (the entire population).

This investigation was motivated by the need to evaluate the learner per-formance ability on the data from novel regions and different time peri-ods. That is, on previously unseen data sets, we perform a generalizationtest, after the models have experienced learning from their correspondingdatasets and before their practical application. Therefore, the malaria pre-dictive models are tested for generalization using data from two differentprovinces (regions) and time periods, other than those where the modelswere originally developed. This scenario led to the violation of the tradi-tional machine learning assumption where training and test data are inde-pendently and identically drawn from the same distribution, consequentlyraising the problem of bias on the sample selection. The correction of sam-ple selection bias is accomplished by employing the importance weightingstrategy, where a Gaussian kernel density estimation technique was appliedto automatically generate a corrected distribution of training features ad-justed to the distribution of the testing set.

The results of the generalization test using the children model show that themodel prediction performance increased after applying the sample selec-tion bias correction strategy, unlike the direct use of the original developedmodel. Literally, using the model derived from the children dataset, signifi-cant reductions of the mean squared error rates were achieved in the districtsof Zambézia province, while in the generalization test of the model for allage groups following the bias correction, a reduction of 25.0% in MSE is ob-served in some districts of Zambézia and 8.0% for Cabo Delgado provinces.However, a slight increase (less than 4.0%) in MSE is found in the remain-ing districts of the same provinces. Although some fluctuation is observedin these results, this empirical study demonstrates that the application of thesample selection bias approach can substantially improve the performanceof malaria incidence predictive models in the presence of nonrandom data

Page 96: Orlando P. Zacarias - DiVA portalsu.diva-portal.org/smash/get/diva2:867936/FULLTEXT01.pdf · Stockholm University ISBN 978-91-7649-304-5 ISSN 1101-8526 DSV Report Series No. 15-020

74 CHAPTER 4

sampling relating training and test sets.

The study attempts to answer the research question by firstly recognizingthe usefulness of the previously derived predictive models. This recogni-tion is given by attempts to use the models in other regions of the country,which triggers the sample selection bias problem. The solution to this prob-lem brings us knowledge that leads to improving the effectiveness of thedeveloped predictive models.

Page 97: Orlando P. Zacarias - DiVA portalsu.diva-portal.org/smash/get/diva2:867936/FULLTEXT01.pdf · Stockholm University ISBN 978-91-7649-304-5 ISSN 1101-8526 DSV Report Series No. 15-020

CHAPTER 5

CONCLUDING REMARKS

The focus of this thesis has been on the investigation of effective predictivemodels to extract useful knowledge from malaria reported case data, usingBayesian regression and predictive data mining methods. Several empiricalstudies were conducted, each of which contributes its share to answering theresearch question.

5.1 CONCLUSIONS

In this thesis, we developed predictive models of malaria disease that arecorrelated with a set of predictors, using two frameworks to find the mosteffective predictive methods. The predictive methods use the estimatedand assessed predictors of malaria incidence risk and the number of newmalaria episodes in spatial and temporal dimension. However, the influenceof climatic factors on the variation of malaria has long been established.Several investigations [21; 46; 139–142] using similar climatic independentvariables have identified different predictive relationships and predictors ofmalaria transmission. This brings up the issue of the complexity involved inthe iteration of these input variables with the disease vector. Furthermore,these results indicate the need for local evaluation of all possible iterationsusing appropriate correlation forms in most of the tropical regions affectedby this disease [59]. Therefore, the use of the statistical learning approach toinvestigate the extraction of knowledge from malaria datasets, including the

Page 98: Orlando P. Zacarias - DiVA portalsu.diva-portal.org/smash/get/diva2:867936/FULLTEXT01.pdf · Stockholm University ISBN 978-91-7649-304-5 ISSN 1101-8526 DSV Report Series No. 15-020

76 CHAPTER 5

provision of forecasts for the number of new malaria episodes in Mozam-bique, was an open issue. A strategy to allow the variation of the temporaldimension in two connecting indexes delay was effectively adopted duringthe development of predictive models based on Bayesian regression. Hence,this work does not use explicit lagging when designing predictive models.However, while forecasting the number of new malaria cases, the monthlypredictions cover the required time lag for disease onset as reported in [21].Seven publications that describe the most effective methods have been ad-dressed to answering the research question,How can one effectively extract useful knowledge from malaria health datato support decision-making and management, using predictive data miningand Bayesian regression methods?

In our approach to tackle the research question from the Bayesian regressionperspective, several spatial and temporal predictive models based on Pois-son multiple regression methods were developed and evaluated to yield thebest fitting model according to their deviance information criteria measures.The full model (including all random effects and covariates) using malariadata from all age groups (i.e., the entire population) with CAR binary spa-tial correlation was chosen. Its results allowed us to establish predictiverelationships relating the relative humidity and maximum temperature over27.0oC as being the most significant predictors affecting the incidence pat-terns of malaria risk in Maputo province. Although a minimal temperaturein the range [17.0,21.1] degrees Celsius, and temperature variation, showeda predictive relationship to malaria incidence risk pattern, their impact wasnot significant. Hence, their contribution can be neglected. Our improvedapproach of quantifying space-time malaria correlations between monthsand years instead of only between months [55] or years [56] was potentiallyrelevant and extends the work of these authors. Its usefulness can also beseen in its ability to easily visualize the seasonal malaria incidence distri-bution, as the correlation dependencies for every following month withinthe study period except for the first and last (month and year respectively)are established. Moreover, this modeling approach can be of great benefitwhen explicitly modeling the seasonality (i.e., grouping by months) is eitherunfeasible or not required.However, in the study using the Bayesian regression approach to find pre-

Page 99: Orlando P. Zacarias - DiVA portalsu.diva-portal.org/smash/get/diva2:867936/FULLTEXT01.pdf · Stockholm University ISBN 978-91-7649-304-5 ISSN 1101-8526 DSV Report Series No. 15-020

CONCLUDING REMARKS 77

dictors for malaria incidence patterns from a distinct population group—children less the five years of age—the predictive model with no spatiallycorrelated heterogeneity between districts was the best fitting model. Thisresulted in learning that humidity and rainfall were the main predictors af-fecting malaria variation in this population group. Although apparently con-tradictory to the exclusion of temperature in the current study, while rainfallwas excluded in the previous, the results may be explained by the size of thepediatric malaria data used for the analysis. Furthermore, a very low spatialvariation of reported cases was observed in study III. The patterns of malariaincidence show a significant variation in space and time in some districts(time is given by month and year combination for study II and month onlyin study III) compared to standard mortality rates. Regarding the minimumtemperature and the artificially created input variable temperature variationin study II, this thesis confirms the weak iteration [127] of these predictorsin the process of malaria transmission and variation. Relative risk estimatesof malaria incidence were extracted from the developed Bayesian predic-tive models and used for mapping the regional risk of the disease at thecorresponding time. This may be useful to improve the decision makingand planning of malaria intervention activities for each district, taking intoconsideration an increase in regional malaria outbreaks.

Within predictive data mining, the group method of data handling poly-nomial neural network is a useful technique employed to predict malariaoutbreaks [45]. Its drawback however relates to the possibility of fallinginto the similar traps of local minima or over-fitting as the artificial neuralnetwork approach. Hence, in order to accomplish the objective of find-ing useful knowledge from malaria data in the form of predictive patternsof new malaria incidence cases, several predictive models from differentapproaches were compared. Their modeling development was based onsupport vector machines and random forests methods, comprising historicaldata of malaria cases, climatic variables, and vector control activities real-ized as indoor-residual spraying. They included the ability to establish themost valuable (important) malaria predictors. The model design following atime-window approach from nine to one years of malaria historical data forall ages, while evaluating predictive performance on the last year, provedto be practical for determining the best dataset from which to develop pre-

Page 100: Orlando P. Zacarias - DiVA portalsu.diva-portal.org/smash/get/diva2:867936/FULLTEXT01.pdf · Stockholm University ISBN 978-91-7649-304-5 ISSN 1101-8526 DSV Report Series No. 15-020

78 CHAPTER 5

dictive models that extract useful knowledge from malaria data. Moreover,the performance of the RFs method is not substantially different from thatobtained by the SVMs algorithm in this investigation, although methods us-ing SVMs are generally considered more efficient. This may be due to thespace-dimension where the number of features used in this investigation issmall - less than fifty.

Adoption of the random forests strategy to overcome the problem of variableimportance estimations of model predictors within support vector machines,was useful and allowed comparing the most valuable predictors obtainedfrom both approaches. Models for all ages based on both strategies deter-mined indoor-residual spraying as the variable providing the highest predic-tive power. Also, the districts of Manhiça and Matola, the climatic variableshumidity and minimal temperature, and the month of January were foundto hold the highest predictive relationships in the prediction of the numberof malaria episodes for both support vector machines and random forestsapproaches. In spite of that, while predicting new malaria cases, the modelbased on support vector machines approach was more effective. The in-vestigation of predictive malaria transmission assuming climatic and IRSfactors for the children’s population, i.e., children under five years of age,enabled estimating possible fresh expected numbers of malaria cases in thishigh-risk and susceptible population. Designing this study was appropriateconsidering the lack of immunity to the disease within this population group.The developed predictive models extend the work of [45] by first perform-ing numerical prediction and employing different techniques to forecast newmalaria episodes.

The process for testing the predictive capabilities of the derived models togeneralize to unseen sets was also investigated. Because the default as-sumption in machine learning, where the training and test sets are drawnfrom independently and identically distributed variables, did not hold, wefaced the sample selection bias problem. This problem was overcome byemploying a weighting importance strategy to approximate the distributionsof the unseen testing set to the training sets used for deriving the predictivemodels. This solution is apparently a useful alternative in the modeling andestimation of the number of new malaria episodes. When the type of bias

Page 101: Orlando P. Zacarias - DiVA portalsu.diva-portal.org/smash/get/diva2:867936/FULLTEXT01.pdf · Stockholm University ISBN 978-91-7649-304-5 ISSN 1101-8526 DSV Report Series No. 15-020

CONCLUDING REMARKS 79

is either feature or class based, adjusting for sample bias is considered inmost cases effective [143]. However, in cases where the type of bias is saidto be a complete bias, no further reduction is possible [143]. Numericalprediction results obtained after correcting for sample selection bias in boththe children’s model and the all ages model show a substantial performanceimprovement. The predictive performance of the children’s model got anoverall improvement of 24% in its error reduction rate, while the all agespredictive model got an error reduction rate improvement of around 12% inZambézia province and 2.5% in Cabo Delgado province.

To sum up, a contribution to the extraction of useful knowledge from malariahealth data from the Bayesian perspective may be achieved by combiningspace-time interactions and allowing for a continuous correlation connect-ing every month and year instead of a direct seasonal division. This allowedalso the extraction of predictive relationships based on the significance ofthe predictors. Under predictive data mining, the adoption of a time-windowstrategy and the random forests approach to obtain variable importance es-timation scores both employed within SVMs, are contributions of the thesisto the extraction of useful knowledge from malaria health data.

The outcomes of this thesis lead us to conclude that variations of meteoro-logical conditions which impact malaria parasite development and vector,as well as vector control activities, have substantial predictive relationshipswith malaria transmission in Mozambique. The assessment of these linksenhanced our understanding of the strength of the malaria predictors, asmany potential factors affecting malaria transmission are hard to separable.This situation turns out to be the biggest challenge in modeling the incidencepatterns of malaria in space and time. On the other hand, the identification ofthe principal predictors and the design of predictive models for the numberof new malaria cases may provide support to the decision making processfrom planning to the design of control interventions. The predictive modelsmay also contribute to reduce the subjectivity and time for decision making.Moreover, it is impractical to develop a model based on most of districts ofMozambique, and then use the remaining set of districts for testing, mainlydue to the lack of the required data. Having said that, the sample selectionbias problem will eventually be present in most of the cases. Therefore,

Page 102: Orlando P. Zacarias - DiVA portalsu.diva-portal.org/smash/get/diva2:867936/FULLTEXT01.pdf · Stockholm University ISBN 978-91-7649-304-5 ISSN 1101-8526 DSV Report Series No. 15-020

80 CHAPTER 5

the approach pursued in this thesis seems to be practical and appropriateto tackle this problem. To the best knowledge of the author, this researchrepresents one of the first studies to develop regional predictive models toestimate the relative risk of malaria transmission, its mapping, the identi-fication of potential predictor variables, and predictions of new number ofmalaria cases in Mozambique, thus casting light on a fresh and recurringtheme of public health and informatics, the importance of which is increas-ingly growing in the country.

5.2 FUTURE WORK

This dissertation encompasses the application of methods for learning fromBayesian regression, decision trees, random forests, and support vector ma-chines based on health and climatic datasets. For that purpose, predictivemodeling finding estimates of predictors for malaria transmission and fore-casting of new numbers of disease episodes were derived. The knowledgeextracted by these models can provide directions for improving the man-agement and decision-making process in the health sector. However, furtherrefinement of the developed predictive models is possible as more relevantdata become available. Also, the possibility of applying similar modelingapproaches to the study of other tropical diseases or related areas is ex-tremely important and valid. Thus, many other paths may be pursued tocontinue with issues that remained opened in this dissertation:

1. Refine the time scale by reducing it to weekly datapoint observations.

2. Investigate the application of other strategies for correcting samplebias.

3. Consider the design of a malaria early warning health decision sys-tem so as to contribute to the management of all activities within themalaria program unit at the Ministry of Health.

4. Use the predictive variable importance approach from random foreststo generate hypotheses based on selected predictors that could bemore rigorously analyzed in the Bayesian framework.

Page 103: Orlando P. Zacarias - DiVA portalsu.diva-portal.org/smash/get/diva2:867936/FULLTEXT01.pdf · Stockholm University ISBN 978-91-7649-304-5 ISSN 1101-8526 DSV Report Series No. 15-020

CONCLUDING REMARKS 81

5. Perform studies to determine the usefulness of the different modelsdeveloped in this research from the public health standpoint.

6. Investigate the possibility of applying similar approaches to predict-ing new episodes of other diseases, such as dengue, cholera, menin-gitis, etc.

Page 104: Orlando P. Zacarias - DiVA portalsu.diva-portal.org/smash/get/diva2:867936/FULLTEXT01.pdf · Stockholm University ISBN 978-91-7649-304-5 ISSN 1101-8526 DSV Report Series No. 15-020
Page 105: Orlando P. Zacarias - DiVA portalsu.diva-portal.org/smash/get/diva2:867936/FULLTEXT01.pdf · Stockholm University ISBN 978-91-7649-304-5 ISSN 1101-8526 DSV Report Series No. 15-020

REFERENCES 83

References

[1] S. MABUNDA, G. MATHE, E. STREAT, S. NERY, AND A. KILIAN. National Malaria In-dicator Survey (MIS-2007). http://www.malariasurveys.org/documents/MIS MalariaSurvey2007.pdf, 2007. Accessed: 2013-09-27. 1, 6, 45, 54, 58

[2] KENNETH J. ROTHMAN, SANDER GREENLAND, AND TIMOTHY L. LASH. Modern Epidemi-ology. Lippincott Williams & Wilkins, Philadelphia, USA, third edition, 2008. 2, 6

[3] J. BRAA, E. MACOME, J. C. MAVIMBE, J. L. NHAMPOSSA, J. L. DA COSTA, AND B. JOSS.Study of the Actual and Potential Usage of Information and Communication Technology atDistrict and Provincial Levels in Mozambique with Focus on the Health Sector. ElectronicJournal of Information Systems in Developing Countries (EJISDC), 5(2):1–29, 2001. 2

[4] H. KIMARO AND J. L. NHAMPOSSA. Analyzing the problem of unsustainable health infor-mation systems in less-developed economies: case studies from Tanzania and Mozambique.Information Technology for Development, 11:273–298, 2005. 2

[5] LUBOMBO SPATIAL DEVELOPMENT INITIATIVE. Lubombo Spatial Development Initiative,Mozambique. Report 2009. 2

[6] THE WORLD HEALTH ORGANIZATION. Using Climate to Predict Infectious Disease Out-breaks: A Review, 2004. 2, 54

[7] WORLD HEALTH ORGANIZATION. http://www.who.int/mediacentre/factsheets/fs094/en/. Ac-cessed: September 2013. 2

[8] L. M. BARAT, N. PALMER, S. BASU, E. WORRALL, K. HANSON, AND ANNE MILLS. DoMalaria Control Interventions Reach the Poor? A View Through the Equity Lens. Ameri-can Journal of Tropical Medicine and Hygiene, 71 (Suppl 2):174–178, 2004. 2

[9] WHO. World Malaria Report 2009. World Health Organization, 2009. 2

[10] FAYYAD USAMA, G. PIATETSKY-SHAPIRO, AND PADHRAIC SMYTH. From Data Mining toKnowledge Discovery in Databases. AI Magazine, 17(3):37–54, 1997. 4, 5

[11] WILLIAM J. FRAWLEY, G. PIATETSKY-SHAPIRO, AND CHRISTOPHER J. MATHEUS. Knowl-edge Discovery in Databases: An Overview. AI Magazine, 13(3):57–70, 1992. 4

[12] MARGARET H. DUNHAM. Data Mining: Introductory and Advanced Topics. Prentice Hall.Pearson Education, Inc., Upper Saddle River New Jersey, 2002. 5, 28, 29

Page 106: Orlando P. Zacarias - DiVA portalsu.diva-portal.org/smash/get/diva2:867936/FULLTEXT01.pdf · Stockholm University ISBN 978-91-7649-304-5 ISSN 1101-8526 DSV Report Series No. 15-020

84 REFERENCES

[13] IAN H. WITTEN, EIBE FRANK, AND MARK A. HALL. Data Mining: Practical Machine Learn-ing Tools and Techniques. Morgan Kaufmann Publishers, San Francisco, USA, third edition,2011. 4, 5, 28, 29, 30, 31, 34, 36

[14] L BERNARDINELLI, D CLAYTON, C PASCUTO, C MONTOMOLI, M GHISLANDI, ANDM SONGONI. Bayesian analysis of space-time variation in disease risk. Statistics in Medicine,14:2433–2443, 1995. 5, 8

[15] IOANNIS NTZOUFRAS. Bayesian Modeling Using WinBugs. John Wiley & Sons, Inc., NewJersey, USA, 2009. 5, 6, 24, 25, 26, 34, 59

[16] A.C.A. ET. AL CLEMENTS. Bayesian spatial analysis and disease mapping: tool to enhanceplanning and implementation of a schistosamiasis control programme in Tanzania. TropicalMedicine and International Health, 11:490–503, 2006. 5

[17] ANDREW GELMAN, JOHN B. CARLIN, HAL S. STERN, AND DONALD B. RUBIN. BayesianData Analysis. Chapman & Hall/CRC. Text Series in Statiscal Science, London, UK, secondedition, 2004. 5, 6, 23, 24, 26, 34, 59

[18] A. CANDELIERI, G. DOLCE, F. RIGANELLO, AND W. G. SANNITA. Data Mining in Neu-rology. In Knowledge-Oriented Applications in Data Mining, InTech, pages 261–276, 2011.5

[19] C. H. CHOO. The knowing Organization: How Organizations Use Information to ConstructMeaning, Create Knowledge and Make Decisions. International Journal of Information Man-agement, 16 (5):478–486, 1996. 5

[20] J. KEVIN BAIRD, MICHAEL J. BANGS, JASON D. MAGUIRE, AND MAZIE J. BARCUS. Epi-demiological Measures of Risk of Malaria. Methods in Molecular Medicine: Malaria Methodsand Protocols, 72:13–22, 2002. 6

[21] H. TEKLEHAIMANOT, A. LIPSITCH, M.AND TEKLEHAIMANOT, AND J. SCHWARTZ.Weather-based prediction of Plasmodium falciparum malaria in epidemic-prone regionsof Ethiopia I. Patterns of lagged weather effects reflect biological mechanism. MalariaJournal, 3:41, 2004. 6, 8, 75, 76

[22] G. ZHOU, N. MINAKAWA, A. K. GITHEKO, AND G. YAN. Association between climate vari-ability and malaria epidemics in the East African highlands. In Proceedings from NationalAcademic Science, USA, 101, pages 2375–2380, 2004.

[23] J. COX AND T. A. ABEKU. Early warning systems for malaria in Africa: from blueprint topractice. Trends in Parasitology, 23:243–246, 2007. 6

[24] A. FRIEDE, H.L. BLUM, AND M. MCDONALD. Public Health Informatics: How Infor-mation Age Technology Can Strengthen Public Health. Annual Review of Public Health,16:239–252, 1995. 6

[25] International Journal of Environmental Research and Public Health.http://www.mdpi.com/journal/ijerph/special_issues/pubhealth-informatics. Acessed: May,2014. 6

Page 107: Orlando P. Zacarias - DiVA portalsu.diva-portal.org/smash/get/diva2:867936/FULLTEXT01.pdf · Stockholm University ISBN 978-91-7649-304-5 ISSN 1101-8526 DSV Report Series No. 15-020

REFERENCES 85

[26] J. S. CHIU, Y. C. LI, F. C. YU, AND Y. F. WANG. Applying an artificial neural network topredict osteoporosis in the elderly. Studies in Health Technology and Informatics, 124:609–614, 2006. 7

[27] G. LEMINEUR, R. HARBA, N. KILIC, O. N. UCAN, O. OSMAN, AND L. BENHAMOU. Effi-cient estimation of osteoporosis using artificial neural networks. In 33rd Annual Conf. IEEEIndustrial Electronics Society (IECON), pages 3039–3044, Taipei, Taiwan, 2007. 7

[28] M. MUNLEY, J. LO, G. SIBLEY, G. BENTEL, M. ANCHER, AND L. MARKS. A neural net-work to predict symptomatic lung injury. Journal Physics in Medicine and Biology, 44:2241–2249, 1999. 7

[29] D. MANTZARIS, G. ANASTASSOUPOULOS, A. ADAMOPOUOS, I. STEPHANAKIS, K. KAM-BOURI, AND S. GARDIKIS. Abdominal pain estimation in childhood based on artificial neu-ral classification. In Proc. the 10th Int. Conf. on Engineering Applications of Neural Networks,pages 129–134, Thessaloniki, 2007. 7

[30] A. O. OSOFISAN, O. O. ADEYEMO, B. A. SAWYERR, AND O. EWEJE. Prediction of kidneyfailure using artificial neural networks. European Journal of Scientific Research, 61(4):487–492, 2011. 7

[31] S. B PATIL AND Y. S. KUMARASWAMY. Intelligent and effective heart attack predictionsystem using data mining and artificial neural network. European Journal of Scientific Re-search, 31(4):642–656, 2009. 7

[32] K. GORYNSKI, I. SAFIAN, W. GRADZKI, M. P. MARSZA, J. KRYSINSKI, S. GORYNSKI,A. BITNER, J. ROMASZKO, AND A. BUCINSKI. Artificial neural networks approach toearly lung cancer detection. Central European Journal of Medicine, 9(5):632–641, 2014.Doi=10.2478/s11536-013-0327-6. 7

[33] P. RAMACHANDRAN, N. GIRIJA, AND T. BHUVANESWARI. Early detection and preventionof cancer using data mining techniques. International Journal of Computer Applications,97(13), july 2009. 7

[34] M. A. SAPON, K. ISMAIL, AND S. ZAINUDIN. Prediction of diabetes by using artificialneural network. International conference on circuits, systems and simulation (IPCSIT), 7, 2011.7

[35] N. RACHATA, P. CHAROENKWAN, T. YOOYATIVONG, K. CHAMNONGTHAL, C. LURSINSAP,AND K. HIGUCHI. Automatic Prediction System of Dengue Haemorrhagic-Fever OutbreakRisk by Using Entropy and Artificial Neural Network. In Proceedings of International Sym-posium in Information Technology (ITSim 2008), pages 210–214, 2008. 7

[36] Y. YUSOF AND Z. MUSTAFFA. Dengue outbreak prediction: A least squares support vectormachines approach. International Journal of Computer Theory and Engineering, 3(4), august2011. 7

[37] N. A. HUSIN, N. SALIM, AND A. R. AHMAD. Modeling of dengue outbreak predictionin Malaysia: A comparison of Neural Network and Nonlinear Regression Model. In Pro-ceedings of International Symposium in Information Technology (ITSim 2008), pages 1–4, 2008.7

Page 108: Orlando P. Zacarias - DiVA portalsu.diva-portal.org/smash/get/diva2:867936/FULLTEXT01.pdf · Stockholm University ISBN 978-91-7649-304-5 ISSN 1101-8526 DSV Report Series No. 15-020

86 REFERENCES

[38] NICKI TIFFIN, JANET F. KELSO, ALAN R. POWELL, HONG PAN, VLADIMIR B. BAJIC, ANDWINSTON A. HIDE. Integration of text- and data-mining using ontologies successfully se-lects disease gene candidates. Nucleic Acids Research, 33(5):1544–1552, 2005. 7

[39] D. S. ROSS, M. J. CRAWFORD, R. G. DONALD, M. FRAUNHOLZ, O. S HARB, C. Y. HE,J. C. KISSINGER, M. K. SHAW, , AND B. STRIEPEN. Mining plasmodium genome databaseto define organellar function: what does the apicoplast do? Phil. Transactions of RoyalSociety Biological Sciences, 357, 2002. 7

[40] M LLINÁS AND H. A. DEL PORTILLO. Mining the malaria transcriptome. Trends in para-sitology, 21:350–352, 2005. 7

[41] C. LOUCOUBAR, R. PAUL, A. BAR-HEN, A. HURET, A. TALL, C. SOKHNA, J-F TRAPE,A. B. LY, J. FAYE, A. BADIANE, G. DIAKHABY, F. D. SARR, A. DIOP, A. SAKUNTABHAI,AND J-F BUREAU. An Exhaustive, Non-Euclidean, Non-parametric Data Mining Tool forUnraveling the Complexity of Biological Systems - Novel Insights into Malaria. PLoS Med6(9): e24085, 2011. doi=10.1371/journal.pone.0024085. 7, 67

[42] O. OLUGBENGA, O. UZOAMAKA, AND O. NWINYI. A knowledge-based data mining systemfor diagnosing malaria related cases in healthcare management. Egyptian Computer ScienceJournal, 34(4), may 2010. 7

[43] C. LOUCOUBAR, L. GRANGE, R. PAUL, A. HURET, A. TALL, O. TELLE, C. ROUSSIL-HON, J. FAYE, F. DIENE-SARR, J-F TRAPE, O. MERCEREAU-PUIJALON, A. SAKUNTABHAI,AND J-F BUREAU. High number of previous plasmodium falciparum clinical episodes in-crease risk of future episodes in a sub-group of individuals. PLoS Med 8(2): e55666, 2013.doi=10.1371/journal.pone.0055666. 7, 67

[44] GMDH page. http://www.gmdh.net/GMDH_toc.htm. Accessed: May, 2014. 8

[45] A. ARIFIANTO, A. M. BARMAWI, AND A. T. WIBOWO. Malaria incidence forecasting fromincidence record and weather pattern using polynomial neural network. International Jour-nal of Future Computing and Communication, 3(1):60–65, 2014. 8, 9, 67, 77, 78

[46] A. GOMEZ-ELIPE, A. OTERO, MV. HERP, AND A. AGUIRRE-JAIME. Forecasting malariaincidence based on monthly reports and environmental factors in Karuzi, Burundi, 1997-2003. Malaria Journal, 6:129, 2007. 8, 75

[47] OLIVER J. T. BRIËT, P. VOUNATSOU, D. M. GUNAWARDENA, G. N. L. GALAPPATHTHY,AND PRIYANIE H. AMERASINGHE. Models for short term malaria prediction in Sri Lanka.Malaria Journal, 7:76, 2008.

[48] ESKINDIR LOHA AND BERNT LINDTJØRN. Model variations in predicting incidence ofPlasmodium falciparium malaria using 1998-2007 morbidity and meteorological data fromsouth Ethiopia. Malaria Journal, 10:166, 2010. 8, 9

[49] H. TEKLEHAIMANOT, J. SCHWARTZ, A. TEKLEHAIMANOT, AND M. LIPSITCH. Weather-based prediction of Plasmodium falciparum malaria in epidemic-prone regions of EthiopiaII. Weather-based prediction systems perform comparably to early detection systems inidentifying times for interventions. Malaria Journal, 3:44, 2004. 8

Page 109: Orlando P. Zacarias - DiVA portalsu.diva-portal.org/smash/get/diva2:867936/FULLTEXT01.pdf · Stockholm University ISBN 978-91-7649-304-5 ISSN 1101-8526 DSV Report Series No. 15-020

REFERENCES 87

[50] LA WALLER, BP CARLIN, H XIA, AND AE GELFAND. Hierarchical spatio-temporal map-ping of disease rates. Journal of the American Statistical Association, 92:607–617, 1997. 8

[51] L KNORR-HELD. Bayesian modeling of inseparable space-time variation in disease risk.Statistics in Medicine, 19:2555–2567, 2000. 8

[52] B. H. MANH, A. C. A. CLEMENTS, N. Q. THIEU, N. M. HUNG, LE X. HUNG, S. I. HAY,T. T. HIEN, F. L. WERTHEIM, R. W. SNOW, AND P. HORBY. Social and environmentaldeterminants of malaria in space and time in Viet Nam. International Journal of Parasitology,41(1):109–11, 2010. 8

[53] ARCHIE C.A. CLEMENTS, ADRIAN G. BARNETT, ZHANG WEI CHENG, ROBERT W. SNOW,AND HOM NING ZHOU. Space-time variation of malaria incidence in Yunnan province,China. Malaria Journal, 8(180), 2009. doi=10.1186/1475-2875-8-180. 8, 66

[54] I KLEINSCHMIDT, B SHARP, I MUELLER, AND P VOUNATSOU. Rise in malaria incidencerates in South Africa: a small-area spatial analysis of variation in time trends. AmericanJournal of Epidemiology, 155:257–264, 2002. 8, 21, 62, 66

[55] M.L.H. MABASO, M. CRAIG, P. VOUNATSOU, AND T. SMITH. Towards empirical descrip-tion of malaria seasonality in southern Africa: the example of Zimbabwe. Tropical Medicineand International Health, 10:909–918, 2005. 8, 9, 64, 66, 76

[56] MUSAWENKOI L.H. MABASO, PENELOPE VOUNATSOU, STANELY MIDZI, JOAQUIMDA SILVA, AND THOMAS SMITH. Spatio-temporal analysis of the role of climate in theinter-annual variation of malaria incidence in Zimbabwe. International Journal of HealthGeographics, 5(20), 2006. 8, 9, 21, 63, 64, 66, 76

[57] M. PAGANO AND M. J. HARTLEY. On fitting distributed lag models subject to polynomialrestrictions. Journal of Econometrics, 16(2):171–198, 1981. ISSN: 0304-4076. 9

[58] W. TOBLER. A computer movie simulating urban growth in Detroit region. EconomicGeography, 46:234–240, 1970. 9

[59] S. F. RUMISHA, T. A. SMITH, H. MASANJA, S. ABDULLA, AND P. VOUNATSOU. Relathion-ships between child survival and malaria transmission: an analysis of the malaria trans-mission intensity and mortality burden across Africa (MTIMBA) project data in Rufiji de-mographic surveillance system, Tanzania. Malaria Journal, 13:124, 2014. Doi:10.1186/1475-2875-13-124. 9, 75

[60] PAUL ELLIOTT AND DANIEL WARTENBERG. Spatial Epidemiology:Current Approachesand Future Challenges. Environmental Health Perspectives, 112(9):998–1004, june 2004. 19,20

[61] RICHARD DOLL. The Epidemiology of Cancer. Cancer, 45(10):2475–2485, may 1980. 20

[62] P. ELLIOTT, J. C. WAKEFIELD, N. G. BEST, AND D. J. BRIGGS. Spatial Epidemiology:Methods and Applications. In : Spatial Epidemiology: Methods and Applications (Elliott, P.and Wakefield, J. C. and Best, N. G. and Briggs, D. J., eds ). Oxford University Press Inc., pages3–14, 2000. 20

Page 110: Orlando P. Zacarias - DiVA portalsu.diva-portal.org/smash/get/diva2:867936/FULLTEXT01.pdf · Stockholm University ISBN 978-91-7649-304-5 ISSN 1101-8526 DSV Report Series No. 15-020

88 REFERENCES

[63] RICHARD S. OSTFELD, GREGORY E. GLASS, AND FELICIA KEESING. Spatial epidemiology:an emerging (or re-emerging) discipline. Trends in ecology and Evolution, 20(6):328–336,june 2005. 21

[64] A GEMPERLI, P VOUNATSOU, I KLEINSCHMIDT, M BAGAYIKO, C LENGELER, ANDT SMITH. Spatial Patterns of infant mortality in Mali: the effects of malaria endemicity.American Journal of Epidemiology, 59:64–72, 2004. 21

[65] ORLANDO P. ZACARIAS AND MIKAEL ANDERSON. Spatial and temporal patterns ofmalaria incidence in Mozambique. Malaria Journal, 10:189, 2011. 21, 49, 57

[66] PETER CONGDON. Applied Bayesian Modelling. John Wiley & Sons Ltd, Wiley Series inProbability and Statistics, England, London, 2003. 23

[67] PETER D. HOFF. A First Course in Bayesian Statistical Methods. Springer Texts in Statistics,New York, USA, first edition, 2009. 23

[68] DENNIS D. BOOS AND L. A. STEFANSKI. Essential Statistical Inference: Theory and Methods.Springer Texts in Statistics, New York, USA, 2013. 24, 34

[69] H.S. MIGON AND D. GAMERMAN. Statistical Inference - an integrated approach. Arnold:Hodder Headline Group, London, UK, 1999. 24

[70] SHELDON M. ROSS. Introduction to Probability Models. Academic Press, San Diego, USA,eight edition, 2003. 25

[71] S. GEMAN AND D. GEMAN. Stochastic Relaxation, Gibbs Distributions and the BayesianRestoration of Images. IEEE Transactions on Pattern Analysis and Machine Intelligence,6:721–741, 1984. 25

[72] A.E. GELFAND AND A.F.M. SMITH. Sampling-Based Approaches to Calculating MarginalDensities. American Statistical Association, 85:398–409, 1990. 25

[73] N. METROPOLIS, A.W. ROSENBULTH, M.N. ROSENBULTH, A.H. TELLER, AND E. TELLER.Equation of State Calculations by Fast Computing Machine. Chemical Physics, 21:1089–1091, 1953. 25

[74] W.K. HASTINGS. Monte Carlo Sampling Methods Using Markov Chains and Their Appli-cations. Biometrika, 57:97–109, 1970. 25

[75] D.J. LUNN, A. THOMAS, N. BEST, AND D. SPIEGELHALTER. WinBUGS – a Bayesian mod-eling framework: concepts, structure, and extensibility. Statistics and Computing, 10:325–337, 2000. 25

[76] STEPHEN W. RAUDENBUSH AND ANTHONY S. BRYK. Hierarchical Linear Models: Appli-cations and Data Analysis Methods. Sage Publications, Inc., California, USA, second edition,2002. 25

[77] ALAN M. ZASLAVSKY. Hierarchical Bayesian Modeling. In Subjective and ObjectiveBayesian Statistics: Principles, Models and Applications, pages 336–358. Wiley Inter-science.A John Wiley & Sons, Inc., Publication, New Jersey, USA, 2010. 26

Page 111: Orlando P. Zacarias - DiVA portalsu.diva-portal.org/smash/get/diva2:867936/FULLTEXT01.pdf · Stockholm University ISBN 978-91-7649-304-5 ISSN 1101-8526 DSV Report Series No. 15-020

REFERENCES 89

[78] P. LANGLEY AND H. SIMON. Applications of Machine Learning and Rule Induction. Com-munications of the ACM, pages 55–65, nov 1995. 28

[79] ETHEM ALPAYDIN. Introduction to Machine Learning. The MIT Press, Cambridge, Mas-sachusetts, USA, second edition, 2010. 28

[80] PAWEL CICHOSZ. Data Mining Algorithms: Explained Using R. 29

[81] CHIDANAND APTÉ AND SHOLOM WEISS. Data mining with decision trees and decisionrules. Journal of Future Generation Computer Systems, 13:197–210, 1997. 29

[82] L. BREIMAN, J. FRIEDMAN, R. OLSHEN, AND C. STONE. Classification and RegressionTrees, 1984. Wadsworth International Group. 29

[83] LEO BREIMAN. Random Forests. Machine Learning, 45:5–32, 2001. 30, 57, 59

[84] D. HEATH, S. KASIF, AND S. SALZBERG. Committees of decision trees. In Cognitive Tech-nology: In Search of a Human Interface (eds. Gorayska, B. & Mey, J.), pages 305–317, 1996.Elsevier Science, Amsterdam, The Netherlands.

[85] R. E. SCHAPIRE. The boosting approach to machine learning: an overview. In NonlinearEstimation and Classification (eds. Denison, D. D. and Hansen, M. H. and Holmes, C. C. andMallick, B. & Yu, B.), pages 141–171, 2003. Springer, New York. 30

[86] T. K. HO. The random subspace method for constructing decision forests. IEEE Transac-tions on Pattern Analysis and Machine Learning, 20:832–844, 1998. 30

[87] J. J. RODRIGUEZ, L. I. KUNCHEVA, AND C. J. ALONSO. Rotation forests: A new classifierensemble method. IEEE Transactions on Pattern Analysis and Machine Learning, 28:1619–1630, 2006. 31

[88] S. D. BAY. Nearest neighbor classification from multiple feature subsets. Intelligence DataAnalysis, 3:191–209, 1999. 31

[89] V. N. VAPNIK. Statistical learning theory. John Wiley & Sons, New York, USA, 1998. 31, 33,59

[90] VOJISLAV KECMAN. Support Vector Machines Basics. Technical report, University of Auck-land, School of Engineering, New Zealand, april 2004. 31

[91] DAVID BARBELLA, SAMI BENZAID, JANARA CHRISTENSEN, BRET JACKSON, X. VICTORQIN, AND DAVID MUSICANT. Understanding Support Vector Machine Classifications viaa Recommender System-Like Approach. In Proceedings of the International Conference onData Mining (DMIN), Las Vegas, USA, july, 13-16 2009. 31, 57

[92] RICHARD O. DUDA, PETER E. HART, AND DAVID G. STORK. Pattern Classification. JohnWiley & Sons, Inc., New York, 2nd edition, 2001. 31

[93] S. R. GUNN. Support Vector Machine for classification and regression, 1998. TechnicalReport. University of Southampton. Faculty of Engineering, Science and Mathematics. Schoolof Electronics ad Computer Science. 33

Page 112: Orlando P. Zacarias - DiVA portalsu.diva-portal.org/smash/get/diva2:867936/FULLTEXT01.pdf · Stockholm University ISBN 978-91-7649-304-5 ISSN 1101-8526 DSV Report Series No. 15-020

90 REFERENCES

[94] C. CORTES AND V. VAPNIK. Support-vector networks. Machine Learning, 20:273–297, 1995.33

[95] T. HASTIE, R. TIBSHIRANI, AND J. FRIEDMAN. The Elements of Statistical Learning: DataMining, Inference, and Prediction. Springer Series in Statistics, Stanford, California, 2nd edition,aug 2009. 34

[96] D. SCHWARZ. Estimating the dimension of a model. Annals of Statistics, 6(2):461–464, 1978.34

[97] H. AKAIKE. Information theory and an extension of the maximum likelihood principle.In B. Petrov and F. Csaki, eds., Proceedings of 2nd International Symposium on InformationTheory, Academia Kiado. Budapest, pages 267–281, 1973. 34

[98] D. SPIEGELHALTER, N. BEST, B. CARLIN, AND A. VAN DER LINDE. Bayesian measuresof model complexity and fit (with discussion). Journal of the Royal Statistical Society, B64:583–639, 2002. 35

[99] KAREL G. M. MOONS, ANDRE P. KENGNE, DIEDERICK E. GROBBEE, PATRICK ROYSRON,YVONNE VERGOUWE, DOUGLAS G. ALTMAN, AND MARK WOODWARD. Risk predictionmodels: II. External validation, model updating, and impact assessment. Biomedical HeartJournal, march 2012. doi=10.1136/heartjnl-2011-301247. 36

[100] ORLANDO P. ZACARIAS AND HENRIK BOSTRÖM. Generalization of Malaria Incidence Pre-diction Models by Correcting Sample Selection Bias. In International Conference on Ad-vanced Data Mining and Applications. M. Yao et al. (Eds.): ADMA 2013, Part II, LNAI 8347,pp. 189-200, c©Springer-Verlag Berlin Heidelberg, 2013. 36

[101] Oxford, "research". Oxford Dictionaries. http://www.oxforddictionaries.com. Acessed: Nov,2014. 37

[102] MARK SAUNDERS, PHILIP LEWIS, AND ADRIAN THORNHILL. Research Methods for BusinessStudents. Prentice Hal, Pearson Education Limitedl, England, UK, fifth edition, 2009. 37, 38,39, 40, 42, 44, 45, 59

[103] DAVID L. MORGAN. Paradigms Lost and Pragmatism Regained: Methodological Impli-cations of Combining Qualitative and Quantitative Methods. Journal of Mixed MethodsResearch, 1(1):48–76, 2007. Doi: 10.1177/2345678906292462. 37

[104] A. R. HEVNER, S. T. MARCH, J. PARK, AND S. RAM. Design Science in Information Sys-tems Research. MIS Quarterly, 28(1):75–105, march 2004. 38

[105] E. G. GUBA AND Y. S. LINCOLN. Competing paradigms in qualitative research. In Y. S.DENZIN, N. K. & LINCOLN, editor, Handbook of qualitative research, pages 105–117. Sage,London, 1994. 38

[106] DANIEL A. BERKOWITZ. Culture Meanings of News. Sage Publications, Iowa, USA, 2011. 38

[107] D. COHEN AND B. CRABTREE. Qualitative Research Guidelines Project. July 2006. Ac-cessed: November, 2014. 38

Page 113: Orlando P. Zacarias - DiVA portalsu.diva-portal.org/smash/get/diva2:867936/FULLTEXT01.pdf · Stockholm University ISBN 978-91-7649-304-5 ISSN 1101-8526 DSV Report Series No. 15-020

REFERENCES 91

[108] R. P. GEPHART. Paradigms and Research Methods. Academy of Management: researchmethods division, 4, 1999. 38

[109] EGON G. GUBA. The Paradigm Dialog. Theory, Culture and Society. SAGE Publications, 1990.38

[110] MICHAEL CROTTY. The Foundations of Social Research: Meaning and Perspective in the Re-search Process. Sage Publications, London, 1998. 39

[111] K. YIN ROBERT. Case Study Research: Design and Methods. Sage Publications, California,USA, fourth edition, 2009. 39

[112] T. J. ELLIS AND Y. LEVY. Towards a guide for novice researchers on research methodology:Review and proposed methods. Issues in Information Science and Information Technology,6:323–337, 2009. 40

[113] ALAN HEVNER AND SAMIR CHATTERJEE. Design Research in Information Systems: Theoryand Practice. Springer, USA, 2010. 42

[114] INE. Publicações Periódicas de Indicadores de Saùde. Instituto Nacional de Estatistica, 2002.44

[115] JANE WEBSTER AND RICHARD T. WATSON. Analyzing the past to prepare for the future:Writing a literature review. MIS Quarterly, 26(2):xiii–xxiii, june 1995. 44

[116] JOHN W. CRESWELL. Research Design: Qualitative, Quantitative and Mixed Methods Ap-proaches. Sage Publications, Inc., California, USA, second edition, 2003. 44, 59

[117] ANDREW LAWSON, WILLIAM J. BROWNE, AND CARMEN L. VIDAL RODEIRO. DiseaseMapping with WinBUGS and MLwiN. John Wiley & Sons Ltd, England, UK, 2003. 45

[118] ALAN KELLY. Case Studies in Bayesian Disease Mapping for Health and Health ServiceResearch in Ireland. In : Disease Mapping and Risk Assessment for Public Health (Lawson,Andrew and Biggeri, Annibale and Böhning, Dankmar and Lesaffre, Emmanuel and Viel, Jean-François and Bertollini, Roberto, eds). John Weley & Sons, Ltd, pages 349–363, 1999. 45

[119] ORLANDO P. ZACARIAS AND MIKAEL ANDERSSON. Mapping malaria incidence that ac-counts for environmental factors in Maputo province-Mozambique. Malaria Journal, 9:79,2010. 46

[120] JULIAN BESAG, JEREMY YORK, AND ANNIE MOLLIÉ. Bayesian Image Restoration with twoApplications in Spatial Statistics. Annals of the Institute of Statistical Mathematics, 43(1):1–59, 1991. 47

[121] RL LITTLE AND DB RUBIN. Statistical analysis with missing data. John Wiley & Sons, NewYork, 1987. 48

[122] CHANGYU SHEN. Application of Multiple Imputation to Data from Two-phase Sampling:Estimation of the Incidence Rate of Cognitive Impairment. Journal of Data Science, 5:513–518, 2007. 48

Page 114: Orlando P. Zacarias - DiVA portalsu.diva-portal.org/smash/get/diva2:867936/FULLTEXT01.pdf · Stockholm University ISBN 978-91-7649-304-5 ISSN 1101-8526 DSV Report Series No. 15-020

92 REFERENCES

[123] R-Statistical tool for data analysis. http://CRAN.R-project.org/. Accessed-September 16,2012. 49

[124] L ZHU, B. P. CARLIN, AND A. E. GELFAND. Hierarchical regression with misaligned spatialdata: relating ambient ozone and paediatric asthma ER visits in Atlanta. Environmetrics,14:537–557, 2003. 49

[125] A. E. GELFAND, L ZHU, AND B. P. CARLIN. On the change of support problem for spatio-temporal data. Biostatistics, 2:31–45, 2001.

[126] S BANERJEE, BP CARLIN, AND AE GELFAND. Hierarchical modeling and analysis for spatialdata. Chapman and Hall, New York, 2004. 49

[127] J. I. BLANFORD, S. BLANFORD, R. G. CRANE, K. P. MANN, M. E. PAAIJMANS, K. V.SCHREIBER, AND M. B. THOMAS. Implications of temperature variation for malaria par-asite development across Africa. Scientific Reports, 3(1300), 2013. doi=10.1038/srep01300.52, 64, 77

[128] JAMES J. HECKMAN. Sample selection bias as a specification error. Econometrica,47(1):153–161, 1979. 58

[129] HIDETOSHI SHIMODAIRA. Improving predictive inference under covariate shift by weight-ing the log-likelihood function. Journal of Statistical Planning and Inference, 90:227–244,2000. 58

[130] JIAYUAN HUANG, ALEXANDER J. SMOLA, ARTHUR GRETTON, KARSTEN M. BORGWARDT,AND BERNHARD SCHÖLKOPF. Correcting Sample Selection Bias by Unlabeled Data. InNIPS, 2007. 58

[131] XING ZHAO, FEI CHEN, ZIJIAN FENG, XIAOSONG LI, AND XIAO-HUA ZHOU. The temporallagged association between meteorological factors and malaria in 30 counties in south-westChina: a multilevel distributed lag non-linear analysis. Malaria Journal, 13:57, 2014. 62

[132] J.M.C. RIBEIRO, F. SEULU, T. ABOSE, G. KIDANE, AND A TEKLEHAIMANOT. Temporaland spatial distribution of anopheline mosquitos in an Ethiopian village: Implications formalaria control strategies. Bulletin of the World Health Organization, 74:299–305, 1996. 62

[133] I KLEINSCHMIDT, BL SHARP, GPY CLARKE, B CURTIS, AND C FRASER. Use of gener-alized linear mixed models in the spatial analysis of small area malaria incidence rates inKwaZulu Natal, South Africa. American Journal of Epidemiology, 153:1213–1221, 2001b.62, 66

[134] R. ABELLANA, C. ASCASO, F. SAUTE, D. NHALUNGO, A. DELINO, AND P. ALONSO.Spatio-seasonal modelling of the incidence rate of malaria in Mozambique. Malaria Journal,7: 228, 2008. 63

[135] M. H. CRAIG, B. L. SHARP, M. L. H. MABASO, AND I. KLEINSCHMIDT. Developing aspatial-statistical model and map of historical malaria prevalence in Botswana using astaged variable selection procedure. International Journal of Health Geographics, 6:44, 2007.63

Page 115: Orlando P. Zacarias - DiVA portalsu.diva-portal.org/smash/get/diva2:867936/FULLTEXT01.pdf · Stockholm University ISBN 978-91-7649-304-5 ISSN 1101-8526 DSV Report Series No. 15-020

REFERENCES 93

[136] M. C. THOMSON, S. J. CONNOR, U. D’ALESSANDRO, B. ROWLINGSON, P. DIGGLE,M. CRESSWELL, AND B. GREENWOOD. Predicting malaria infection in Gambian childrenfrom satellite data and bed net use surveys: The importance of spatial correlation in theinterpretation of results. American Journal of Tropical Medicine and Hygiene, 6:2–8, 1999. 64

[137] M. C. THOMSON, S. J. CONNOR, P. MILLIGAN, AND S. P . FLASSE. Mapping Malaria riskin Africa: what can satellite data contribute. Parasitology Today, 13, 313, 1997.

[138] S. I. HAY, R. SNOW, AND D. J. ROGERS. Predicting malaria seasons in Kenya using multi-temporal meteorological satellite sensor data. Transactions of the Royal Society of TropicalMedicine and Hygiene, 92:12–20, 1998. 64

[139] M .H. CRAIG, I. KLEINSCHMIDT, J. B. NAWN, D. LE SUEUR, AND B. L. SHARP. Exploringthirty Years of malaria case data in KwaZulu-Natal, South Africa: Part I. The impact ofnon-climatic factors. Tropical Medicine and International Health, 9:1247–1257, 2004. 75

[140] A. K. GITHEKO AND W. NDEGWA. Predicting malaria epidemics in the Kenyan highlandsusing climatic data: a tool for decision makers. Global Change and Human Health, 2(1),2001.

[141] J. A. PATZ, M. HULME, C. ROSENZWEIG, T. D. MITCHELL, R. A. GOLDBERG, A. K.GITHEKO, S. LELE, A. J. MCMCHAEL, AND D. LE SUEUR. Climate change (Communi-cation arising): Regional warning and malaria resurgence. Nature, 420:627–628, 2002.

[142] T. A. ABEKU, G. J. VAN OORTMARSSEN, G. BORSBOOM, S. J. DE VLAS, AND J. D. F.HABBEMA. Spatial and temporal variations of malaria epidemic risk in Ethiopia: factorsinvolved and implications. Acta Tropical, 87:331–340, 2003. 75

[143] WEI FAN AND IAN DAVIDSON. On Sample Selection Bias and Its Efficient Correction viaModel Averaging and Unlabeled Examples. In Proceedings of the Seventh SIAM InternationalConference on Data Mining, Minneapolis, Minnesota, USA, april 2007. 79