ISSEL - Data & Knowledge Engineering · 2016-02-04 · Fig. 1. Outline of ASSIST's architectural...

19
Applying semantic technologies in cervical cancer research Christos Maramis a,b, , Manolis Falelakis a,b,c , Irini Lekka d , Christos Diou a,b , Pericles Mitkas a,b , Anastasios Delopoulos a,b a Informatics and Telematics Institute, Centre for Research and Technology Hellas, Greece b Information Processing Laboratory, Department of Electrical and Computer Engineering, Aristotle University of Thessaloniki, Greece c Department of Computing, Goldsmiths, University of London, UK d Lab of Medical Informatics, Medical School, Aristotle University of Thessaloniki, Greece article info abstract Article history: Received 12 January 2011 Received in revised form 18 February 2013 Accepted 18 February 2013 Available online 13 March 2013 In this paper we present a research system that follows a semantic approach to facilitate medical association studies in the area of cervical cancer. Our system, named ASSIST and developed as an EU research project, assists in cervical cancer research by unifying multiple patient record repositories, physically located in different medical centers or hospitals. Semantic modeling of medical data and rules for inferring domain-specific information allow the system to (i) homogenize the information contained in the isolated repositories by translating it into the terms of a unified semantic representation, (ii) extract diagnostic information not explicitly stored in the individual repositories, and (iii) automate the process of evaluating medical hypotheses by performing casecontrol association studies, which is the ultimate goal of the system. © 2013 Elsevier B.V. All rights reserved. Keywords: Semantic unification Heterogeneous medical data Association studies Cervical cancer ontology Medical inference 1. Introduction Cervical cancer (CxCa) is one of the most common cancers worldwide [1]. Infection by the human papillomavirus (HPV) is accepted as the central risk factor for CxCa [2]; however, it is unlikely to be the sole cause for developing cancer. Findings indicate that other factors in addition to HPV infection are likely to be important determinants in cervical carcinogenesis. Ongoing research includes investigating the role of specific genetic and environmental factors in determining HPV-persistence and subsequent progression of disease [3]. Genetic association studies [4] constitute a significant scientific approach that employs specific statistical analysis methods to investigate the association among genetic characteristics, environmental agents and virus characteristics so as to suggest pathogenetic mechanisms that will provide new markers of risk, diagnosis and prognosis, and possibly treatment. However, these association studies require large sets of patient phenotypic (e.g., virus characteristics, clinical tests, patient lifestyle) and genotypic data (e.g., polymorphisms of a gene) all provided in a structured format. ASSIST (abbreviation for Association Studies Assisted by Inference and Semantic Technologies) is a research project whose overall objective is to provide medical researchers of CxCa with an integrated environment that unifies multiple patient record repositories, physically located at different laboratories, clinics and/or hospitals. This environment enables researchers to combine phenotypic and genotypic data, utilize existing patient records from past research studies in several clinics, and, eventually, perform biomedical research in a low-cost and time-efficient way. Furthermore, ASSIST incorporates the statistical analysis tools that are required in order to automate the process of evaluating medical hypotheses of the type used in genetic casecontrol association studies. Data & Knowledge Engineering 86 (2013) 160178 Corresponding author at: Information Processing Laboratory, Dept. Electrical and Computer Engineering, Aristotle University of Thessaloniki, 54124, Greece. Tel.: +30 2310 996 359; fax: +30 2310 996398. E-mail address: [email protected] (C. Maramis). 0169-023X/$ see front matter © 2013 Elsevier B.V. All rights reserved. http://dx.doi.org/10.1016/j.datak.2013.02.003 Contents lists available at SciVerse ScienceDirect Data & Knowledge Engineering journal homepage: www.elsevier.com/locate/datak

Transcript of ISSEL - Data & Knowledge Engineering · 2016-02-04 · Fig. 1. Outline of ASSIST's architectural...

Page 1: ISSEL - Data & Knowledge Engineering · 2016-02-04 · Fig. 1. Outline of ASSIST's architectural components (system modules and configuration files). Semantic modules are indicated

Data & Knowledge Engineering 86 (2013) 160–178

Contents lists available at SciVerse ScienceDirect

Data & Knowledge Engineering

j ourna l homepage: www.e lsev ie r .com/ locate /datak

Applying semantic technologies in cervical cancer research

Christos Maramis a,b,⁎, Manolis Falelakis a,b,c, Irini Lekka d, Christos Diou a,b,Pericles Mitkas a,b, Anastasios Delopoulos a,b

a Informatics and Telematics Institute, Centre for Research and Technology Hellas, Greeceb Information Processing Laboratory, Department of Electrical and Computer Engineering, Aristotle University of Thessaloniki, Greecec Department of Computing, Goldsmiths, University of London, UKd Lab of Medical Informatics, Medical School, Aristotle University of Thessaloniki, Greece

a r t i c l e i n f o

⁎ Corresponding author at: Information ProcessingTel.: +30 2310 996 359; fax: +30 2310 996398.

E-mail address: [email protected] (C. M

0169-023X/$ – see front matter © 2013 Elsevier B.V.http://dx.doi.org/10.1016/j.datak.2013.02.003

a b s t r a c t

Article history:Received 12 January 2011Received in revised form 18 February 2013Accepted 18 February 2013Available online 13 March 2013

In this paper we present a research system that follows a semantic approach to facilitatemedical association studies in the area of cervical cancer. Our system, named ASSIST anddeveloped as an EU research project, assists in cervical cancer research by unifying multiplepatient record repositories, physically located in different medical centers or hospitals.Semantic modeling of medical data and rules for inferring domain-specific information allowthe system to (i) homogenize the information contained in the isolated repositories bytranslating it into the terms of a unified semantic representation, (ii) extract diagnosticinformation not explicitly stored in the individual repositories, and (iii) automate the processof evaluating medical hypotheses by performing case–control association studies, which is theultimate goal of the system.

© 2013 Elsevier B.V. All rights reserved.

Keywords:Semantic unificationHeterogeneous medical dataAssociation studiesCervical cancer ontologyMedical inference

1. Introduction

Cervical cancer (CxCa) is one of the most common cancers worldwide [1]. Infection by the human papillomavirus (HPV) isaccepted as the central risk factor for CxCa [2]; however, it is unlikely to be the sole cause for developing cancer. Findings indicatethat other factors in addition to HPV infection are likely to be important determinants in cervical carcinogenesis. Ongoingresearch includes investigating the role of specific genetic and environmental factors in determining HPV-persistence andsubsequent progression of disease [3]. Genetic association studies [4] constitute a significant scientific approach that employsspecific statistical analysis methods to investigate the association among genetic characteristics, environmental agents and viruscharacteristics so as to suggest pathogenetic mechanisms that will provide new markers of risk, diagnosis and prognosis, andpossibly treatment. However, these association studies require large sets of patient phenotypic (e.g., virus characteristics, clinicaltests, patient lifestyle) and genotypic data (e.g., polymorphisms of a gene) all provided in a structured format.

ASSIST (abbreviation for Association Studies Assisted by Inference and Semantic Technologies) is a research project whoseoverall objective is to provide medical researchers of CxCa with an integrated environment that unifies multiple patient recordrepositories, physically located at different laboratories, clinics and/or hospitals. This environment enables researchers tocombine phenotypic and genotypic data, utilize existing patient records from past research studies in several clinics, and,eventually, perform biomedical research in a low-cost and time-efficient way. Furthermore, ASSIST incorporates the statisticalanalysis tools that are required in order to automate the process of evaluating medical hypotheses of the type used in geneticcase–control association studies.

Laboratory, Dept. Electrical and Computer Engineering, Aristotle University of Thessaloniki, 54124, Greece.

aramis).

All rights reserved.

Page 2: ISSEL - Data & Knowledge Engineering · 2016-02-04 · Fig. 1. Outline of ASSIST's architectural components (system modules and configuration files). Semantic modules are indicated

161C. Maramis et al. / Data & Knowledge Engineering 86 (2013) 160–178

In order to provide the desired functionality, ASSIST is founded on semantic web technologies. More specifically, an ontologicalmodel is used to describe medical concepts, medical data types, relations between them and medical rules. When needed, thismodel can be updated by knowledge engineers – in collaboration with CxCa expert researchers – through a graphical userinterface (GUI) application. Inference and Description Logics are utilized to homogenize multi-center patient data and inferdiagnostic information, resulting in the construction of a unified patient record repository that adheres to the aforementionedontological model. The system includes a GUI for its users to graphically form semantic queries and medical assumptions and alsoa powerful core that is able to respond to the queries and statistically validate the assumptions using data from the unified patientrecord repository.

The rest of this paper is structured as follows. Section 2 outlines the system specifications of ASSIST and then presents theconceptual architecture of the system emphasizing the semantic web technologies that have been used and their interactions.Section 3 presents two – real – usage scenarios of ASSIST and evaluates the usability and response speed of the system. Theextensibility of ASSIST in multiple directions is discussed in Section 4. The state-of-the-art on related semantics-enabledbiomedical systems is reviewed in Section 5. Finally, Section 6 draws the conclusion of this work and provides insights for thefuture of semantic web technologies in medical research.

2. System architecture

In this section we present the overall architecture of ASSIST from a knowledge engineering perspective, i.e., emphasizing thesystem's semantic modules, their role in the system, and their interactions; the term “semantic” refers to the ability of a module toprocess information defined through an ontology with RDF or OWL annotations. ASSIST consists of three main communicatingsubsystems, namely the System Core, the User Interface (UI), and the Medical Repositories & Associated Interfaces, which arecomplemented by two standalone applications, namely the Medical Knowledge Base (MKB) Editor and the UI View Extractor. Anoutline of ASSIST's architectural components is provided in Fig. 1.

Before proceeding with the detailed description of system modules, we present in the following section the “big picture” ofASSIST: The motivation behind ASSIST project, the project's objectives and associated system specifications, the issues that had tobe addressed and how all these are mapped to the modules of the final system.

Medical

Repository

#3

Medical

Repository

#2

Medical

Repository

#1

Repository

Schema

Mapping

Inference

Engine

Medical

Knowledge

Base

Association

Study

Analysis

Semantic

Query

Unified Medical RepositoryUser

Interface

UI View

Extractor

MKB

Editor

Repository

Schema

Mappings

ASSIST

Ontology

Medical

Implication

RulesFile

Index &

UI Tree UI Path

File

Medical Repositories &

Associated InterfacesSystem

Core

User

Interface

Fig. 1. Outline of ASSIST's architectural components (systemmodules and configuration files). Semantic modules are indicated in blue color. System configurationfiles are colored green.

Page 3: ISSEL - Data & Knowledge Engineering · 2016-02-04 · Fig. 1. Outline of ASSIST's architectural components (system modules and configuration files). Semantic modules are indicated

162 C. Maramis et al. / Data & Knowledge Engineering 86 (2013) 160–178

2.1. CxCa research context and system specifications

Currently, research centers expertized in the CxCa domain work in isolation: Imagine a number of diverse medical repositoriesincluding CxCa-related data that are geographically distributed and moreover contain syntactically and semanticallyheterogeneous information. An example of syntactic data heterogeneity between two research center databases regarding thenotions of HPV infection andMethylenetetrahydrofolate reductase (MTHFR) gene polymorphism – two commonly considered factorsin CxCa association studies – is given in Table 1. To make things even worse, some research centers do not encode their medicaldata in any structured format (e.g., XML or RDBs) but rather store it in plain text documents.

Obviously, these practices hinder the progress of medical research in the domain, since the number of records contained ineach isolated repository enables the conduction of association studies for a relatively small number of study factors. In order toinfer statistically significant multi-factor associations, data from several medical centers must be integrated into a larger dataset.However, the lack of structured/formatted data storage in a medical center sets a hardly surmountable barrier in reusing andintegrating its data.

Passing over the issue of plain text storage, ASSIST provides the aforementioned integrated dataset by unifying isolatedstructured/formatted medical repositories (legacy repositories from now on) into its Unified Medical Repository (UMR). To achievethis unification, it has to overcome the previously discussed heterogeneity issues. The first step to this direction is the definition ofa “language” to express the information contained in the UMR; this is the ASSIST Ontology [5]. The mapping of the schema of eachlegacy repository to the ASSIST Ontology resolves syntactic heterogeneity and it is undertaken by the Repository Schema Mappingmodule.

Regarding semantic heterogeneity, this stems mainly from the use of different classification schemes for the results of varioustypes of diagnostic examinations. For example, with respect to the result of the cytology examination, some clinics employ thePapanikolaou classification [6], while others employ the Bethesda classification [7]. To this direction, ASSIST defines its ownclassification scheme [8] which is employed for all examination types. This scheme distills the knowledge of several experts in thefield of CxCa. For this reason, it is medically sound and, concurrently, its granularity is the coarser that suffices for the intendedcase–control association studies. In order to import the results of every intervention that is stored in the legacy repositories intoASSIST's classification scheme, a set of medical implication rules is defined and appropriate inference mechanisms are employed.The former is stored in the Medical Knowledge Base (MKB) module – this is also the place where the ASSIST Ontology resides –

while the latter are provided by the Inference Engine module.ASSIST exploits the achieved repository unification to provide medical researchers with an integrated environment for

conducting association studies on the expanded CxCa repository. For this reason, it includes a User Interface (UI) module whichallows its users to form association hypotheses by transparently generating corresponding semantic queries and also to view theproduced statistical results. However, semantic web tools (e.g., OWL ontologies, semantic query languages) can be overwhelmingfor medical researchers, which are the exclusive users of ASSIST. For this reason, the UI presents to its users a simplified view ofthe ASSIST Ontology including only familiar medical concepts; moreover, it provides a simple mechanism for graphical queryconstruction. System usability is ensured with the help of the UI View Extractor module. For a given association hypothesisformulated by a user, the Semantic Index & Querymodule poses the associated queries and retrieves the corresponding result-sets,while the required statistical analysis is performed by the Association Study Analysis module.

One of the most important parameters when performing case–control association studies in the CxCa domain is thedetermination of a patient's severity index with respect to the risk of malignant progression. However, this information, which inmedical practice is a conclusion drawn by medical experts after considering the results of all patient's examinations, is not alwaysstored in the individual repositories. In order to expand its set of supported association studies, ASSIST infers the disease severityinformation by invoking the Inference Enginemodule to apply appropriate medical implication rules on the patient's examinationdata. As we have already mentioned, these rules reside in the MKB, which holds the knowledge of the CxCa domain expressed interms of the ASSIST Ontology. In cases where the knowledge has to be updated, this is accomplished with the help of the MKBEditor module.

2.2. Medical Knowledge Base (MKB)

The MKB contains all the medical knowledge in the domain of CxCa that is required to support the intended case–controlassociation studies and, therefore, it is considered the cornerstone of the system. The MKB has been developed in collaboration

Table 1An example of syntactically heterogeneous information encoding in two clinics.

Clinic 1 Clinic 2

Genetic_Info.MTHFR = “C/C” Polymorphism-MTHFR = “C/C”HPV_Result = “16, 18” HPV_16 = yes

HPV_18 = yesHPV_31 = no⋮

Page 4: ISSEL - Data & Knowledge Engineering · 2016-02-04 · Fig. 1. Outline of ASSIST's architectural components (system modules and configuration files). Semantic modules are indicated

163C. Maramis et al. / Data & Knowledge Engineering 86 (2013) 160–178

with medical experts from three different research centers to accurately capture the essence of the domain knowledge. It consistsof the ASSIST Ontology and the Medical Implication Rules.

2.2.1. ASSIST OntologyThe ASSIST Ontology1 lies at the core of all the semantic technologies that have been employed by ASSIST. It models the

domain of cervical cancer – to the best of our knowledge, the first ontology to attempt this – having a steady orientation towardsthe facilitation of case–control association studies. Thus, it is considered to be a hybrid between domain and application ontologyaccording to the classification of [9]. The ontology comprises 170 original classes and 21 relations between them, while thelanguage used to describe this knowledge is OWL-DL [10].

The use of an Upper Ontology for reality representation has been identified as a basic harmonization feature [11]. The ASSISTOntology is based on the Basic Formal Ontology (BFO) [12], providing a clear distinction between the entities that endure (knownas “continuants”) and those that happen, unfold or develop in time (known as “occurents”). An effort has also been made towardscompliance with the organization of other BFO-based ontologies, wherever that was possible. To this end, the ASSIST Ontologyhas certain structural overlap with the ACGT Master Ontology (ACGT-MO) [13] (e.g., Diagnostic Process,2 Magnitude,Toxicity Level), the Ontology for General Medical Science (OGMS) [14] (e.g., Constitutional Genetic Disorder), theOntology for Biomedical Investigation (OBI) [15] (e.g., Biological Process) and the Experimental Factors Ontology (EFO) [16](e.g., Smoking Behavior). However, the ASSIST ontology certainly delineates from these efforts, as it focuses on Cervical Cancer,providing specific entities for performing association studies in this domain.

Further design principles that have dictated the ontology organization include: (1) a clear distinction between use andmention, (2) accurate representation of things in reality rather than their representations, (3) enforcement of a strict subsumptionhierarchy, (4) avoidance of multiple inheritance, and (5) avoidance of unnecessary “UnknownX” classes. A detailed description ofthese principles can be found in [13].

The conceptual schema of the ontology is based on the notion of Case. A Case can be defined as “a collection of data about apatient concerning a certain period of time, during which a diagnosis of a disease is medically meaningful”. Thus, a Person can beassociated – through the hasCase object property – with more than one Case instances, which, in turn, may include one or moreVisits during which Medical Interventions (Diagnostic Interventions, Therapeutic Interventions, Vaccinations,etc.) may take place. The main Diagnostic Interventions of interest are the Cytology examination, the Colposcopy examination,the Histology examination and the tHPV Test. Each Diagnostic Intervention is associated with a unified Result. Each Case isalso associated with a degree/index which constitutes an assessment of the patient's risk regarding the development of cervicalcancer, namely the CxCa Severity. However, for the purpose of performing an association study only one Case has to beconsidered per Person; for this reason, the hasWorstCase object property associates each Person with its Case with the worstCxCa Severity. The classes and properties mentioned above constitute the “core” of the ontology.

As it can be observed by investigation of the ASSIST Ontology, the classes contained therein can be classified into twocategories:

Generic Medical Entities. These entities are generic enough to be employed in almost every medical domain and are commonlystored in typical medical repositories. Examples of this category include the following classes: Case, Visit, MedicalIntervention, Diagnostic Intervention, Therapeutic Intervention, Result, etc.

Domain Specific Entities. These entities are strongly associated with CxCa, its stages, the related interventions, genotypic andphenotypic factors. Moreover, some of these entities do not directly correspond to the information stored in typical medicalrepositories. Examples of this category include the following classes: Colposcopy, Cytology, Histology, HPV Test, CxCa Severity, etc.

The aforementioned class categorization stems from a design principle and, even though not explicit (i.e., there are no abstractclasses that serve as ancestors for all the classes of each category), it facilitates the potential exploitation of ASSIST Ontology in othermedical domains as well. Indeed, in such a case, the classes of the first category can be reused as-is and only those of the secondcategory will have to be replaced according to the needs of the new domain. The same goal is also facilitated by ASSIST Ontology'scompliance with BFO and its partial overlapping with established ontologies in related domains (ACGT-MO, OBI, OGMS, and EFO).

2.2.2. Medical Implication RulesThe Medical Implication Rules of MKB have been defined with guidance from CxCa experts in order to be used for inferring

medical information not explicitly stored in the legacy repositories and they are divided into two discrete types. The first typeincludes rules that can be expressed via Description Logics [17], while the second type includes the rest of the rules. All theemployed rules infer information that, according to the medical experts, plays effective role in the conduction of the supportedcase–control association studies.

The Medical Implication Rules of the first type (type-I rules from now on) are actually OWL-DL (Equivalent) Class Axioms3 andare mainly employed in two cases: First, to semantically homogenize the results of the Diagnostic Interventions conducted indifferent research centers and, second, to infer for each Case its associated CxCa Severity index. Regarding semantic

1 The most recent version of the ASSIST Ontology can be accessed at http://mug.ee.auth.gr/assist/CxCaOntology.owl.2 The names of the ontology classes and properties are typed in monospace fonts in their first occurrence in the paper.3 Technically, these axioms are part of the OWL file of the ASSIST Ontology.

Page 5: ISSEL - Data & Knowledge Engineering · 2016-02-04 · Fig. 1. Outline of ASSIST's architectural components (system modules and configuration files). Semantic modules are indicated

164 C. Maramis et al. / Data & Knowledge Engineering 86 (2013) 160–178

homogenization, we have already mentioned that ASSIST supports for each examination type all the established classificationschemes and also defines its own four-stage classification scheme – this is summarized in Table 2 – that serves as the unifyingclassification into which all the other classifications are eventually mapped. Let us provide an example: The classification schemessupported by ASSIST for the cytology examination are the Munich [18], the Papanikolaou [6], and the Bethesda schemes [7]. Themedical implication rule that translates the results coded in these schemes into the ASSIST classification scheme for the case of theASSIST3 index (ASSIST3 Cytology Result class) can be expressed in terms of the ASSIST Ontology (classes, object and datatypeproperties) by the following formula:

ASSI

ðð∃ h

∃ hað∃ hað∃ hað∃ hað∃ hað∃ hað

4 How

ST3 Cytology Result≡Cytology Result ⊓ð∃ isResultOfExam ðCytology ⊓asCytologyPapanikolaouResult Class Vf gÞ ⊔sCytologyMunichResult Vf gÞ ⊔sCytologyBethesdaResult SCCf gÞ ⊔sCytologyBethesdaResult adenocarcinomaf gÞ ⊔sCytologyBethesdaResult adenocarcinomaEndocervicalf gÞ ⊔sCytologyBethesdaResult adenocarcinomaNOSf gÞ ⊔sResultDataType ASSIST3f gÞÞÞÞ:

In a similar manner, medical implication rules that assess the CxCa Severity index of each Case according to the ASSISTclassification scheme have been defined with the help of DL. These rules consider the aggregate results of three distinctexamination types (Cytology, Colposcopy, and Histology) conducted within the context of the Case.

The other type of implication rules (type-II rules from now on) lies beyond the expressive abilities of DL and thus the rules ofthis type are expressed as construct queries in the SeRQL4 semantic query language [20]. The employed SeRQL queries areessentially graph constructing rules that infer medical information in two cases. In the first case, these rules extract for each Casean aggregate (or collective) result per examination type in terms of the ASSIST classification scheme, by taking into account all theexamination results of the specific type. More specifically, for each Case, a collective result of all the Cytologies that have beenconducted within the Case is extracted (Collective Cytology Result class), and the same holds for the other three types ofDiagnostic Intervention (Colposcopy, Histology, and HPV Test).

The second case includes the rules that extract, among all the Cases that belong to a patient, the worst one with respect to theirCxCa Severity indices according to the ASSIST classification scheme; as we have already seen, this is expressed in ontology termsvia the hasWorstCase object property. As an example, the medical implication rule that infers the patients whose worst CaseCxCa Severity index is ASSIST2, under the assumption that the patients with ASSIST3 index of worst Case have been alreadyinferred, is expressed by the SeRQL query that is given in Listing 1.

Listing 1. The SeRQL query that expresses the medical implication rule for ASSIST2 index of worst Case CxCa Severity.

CONSTRUCT {Person} asNS : hasWorstCase {wCase}

FROM {Person} asNS : hasCase {wCase},

{wCase} asNS : diagnoses {wSeverity},

{wSeverity} rdf : type {asNS : ASSIST2_Severity}

WHERE wCase = ALL (

SELECT Cas

FROM {Person2} asNS : hasCase {Cas},

{Cas} asNS : diagnoses {Severity},

{Severity} rdf : type {asNS : ASSIST2_Severity}

WHERE Person = Person2

LIMIT 1)

MINUS

CONSTRUCT {Person} asNS : hasWorstCase {wCase}

FROM {Person} asNS : hasWorstCase {},

{Person} asNS : hasCase {wCase},

{wCase} asNS : diagnoses {wSeverity},

{wSeverity} rdf : type {asNS : ASSIST2_Severity}

USING NAMESPACE asNS = bhttp://assist.iti.gr/assist_ontology.owl#>

2.3. Inference Engine

The Inference Engine is the module that provides all the required mechanisms for inferring medical information. It isclosely associated with the MKB, since its role is to apply the medical implication rules of the MKB on the contents of the

ever, other semantic query languages (e.g., the W3C recommendation SPARQL [19]) could just as well serve for expressing type-II rules.

Page 6: ISSEL - Data & Knowledge Engineering · 2016-02-04 · Fig. 1. Outline of ASSIST's architectural components (system modules and configuration files). Semantic modules are indicated

Table 2The unifying classification scheme employed by ASSIST. It is applied on allexamination types and also on the overall CxCa Severity index of a Case.

Class Description

ASSIST0 Normal or within normal limitsASSIST1 Low grade abnormal findingsASSIST2 High grade abnormal findingsASSIST3 Findings conclusive for cervical cancer

165C. Maramis et al. / Data & Knowledge Engineering 86 (2013) 160–178

legacy repositories that have previously been syntactically unified (see Section 4). The purpose of the Inference Engine istwofold:

1. to semantically homogenize the legacy repository data by translating all the examination results to the ASSIST classification(Section 1), and

2. to infer all the CxCa-related diagnostic information that is required for the conduction of the intended case–control association studies.

The output of the Inference Engine is the final unifying CxCa repository of ASSIST, ready to be employed for conductingassociation studies (see Section 5). Due to the existence of two types of medical implication rules, two discrete inferencemechanisms have to be employed.

The first mechanism utilizes the medical implication rules that are expressed in OWL-DL (type-I rules) and it can be applied byan established DL reasoner (e.g., Fact++ [21], Pellet [22], RACER [23]). The reasoner is based on type-I equivalent class axioms tocompute the inferred types of all the individuals in the ontology. The first inference mechanism is implemented by the built-in DLreasoner of OWLIM [24], which is the reasoning-enabled semantic data repository of Sesame [25], and it is mainly employed to(1) semantically homogenize the examination results, and (2) to extract the CxCa Severity Index of each Case (see Section 2).

The second inference mechanism applies the rules of type-II. Since these rules are expressed as SeRQL queries, an appropriatesemantic query engine suffices to accomplish this inference task. For this reason, ASSIST employs the query engine of SesameServer [25], which is able to pose SeRQL queries to an ontology-based repository (see Section 2.5) and retrieve the correspondingresult-sets in the form of RDF graphs; the retrieved RDF graph is used to incrementally update the semantic repository with theinferred information. This inference mechanism is mainly used to infer (1) for each Case the collective results of everyexamination type, and (2) the worst Case of a patient with respect to CxCa Severity index (see Section 2).

The two inference mechanisms described above need to be combined in order for the inference procedure of the system to becompleted. More specifically, multiple inference layers of both mechanisms have to be applied sequentially on the medical data sothat the odd layers employ the first mechanism, while the even layers employ the second mechanism. Each inference layer needsto employ the conclusions of the previous layer in order to reach its own conclusions; this explains the need for interleaving thetwo inference mechanisms.

In the following we shortly describe, with the help of an example, the complete inference procedure for a certain person Pwith3 Cases (C1, C2, and C3). The sequence of the applied inference layers is also outlined in Fig. 2.

Layer 1 (OWLIM). Assume that the patient has been subject to a cytology examination within the context of Case C1 and theresult of this examination (instance R1) has been Class_V in the Papanikolaou classification. Then, based on the equivalent classaxiom presented in Section 2.2, the DL reasoner will infer that R1 rdf:type ASSIST3_Cytology_Result. Similar conclusions will bedrawn for all the examinations conducted by P.

Layer 2 (SeRQL). A set of appropriate construct queries will be sequentially applied in order to infer the collective result (R2) forthe cytologies belonging to Case C1. Obviously, since there is a cytology with ASSIST3 Severity (R1), the outcome of thesequeries will be the triple R2 rdf:type ASSIST3_Collective_Cytology_Result. This procedure is repeated for all the examinationtypes of Cases C1, C2, and C3.

Repositories 1 − N

syntactically unified

Layer 1:

semantically

homogenize

Layer 2:

infer collective

results

Layer 3:

infer severity

index

Layer 4:

infer

worst case

ASSIST’s UMR

Fig. 2. The sequence of inference layers applied by ASSIST's Inference Engine. The layers employing the 1st inference mechanism (OWLIM) are indicated in bluecolor, while those employing the 2nd inference mechanism (SeRQL) are colored red.

Page 7: ISSEL - Data & Knowledge Engineering · 2016-02-04 · Fig. 1. Outline of ASSIST's architectural components (system modules and configuration files). Semantic modules are indicated

166 C. Maramis et al. / Data & Knowledge Engineering 86 (2013) 160–178

Layer 3 (OWLIM). Given the collective result of the previous step, the DL reasoner will apply the appropriate equivalent classaxioms to compute the overall severity of each case (Si for Ci, where i = 1,2,3). For Case C1, if we assume that no colposcopy orhistology examinations were conducted, then the reasoner will conclude that S1

Layer 4 (SeRQL). In this step, the Inference Engine has to identify the worst Case of P in terms of CxCa severity. This isaccomplished by a series of construct queries. Assuming that Cases C2 and C3 have been diagnosed with lower CxCa severitiesthan ASSIST3, then the present inference layer will generate the triple P asNS:hasWorstCase C1.

2.4. Repository Schema Mapping

This module constitutes ASSIST's interface to its legacy repositories. At this point, we should mention that ASSIST strictlyconforms to all up-to-date ethical, security, and privacy recommendations when dealing with medical data. To this direction, thewritten approval from the Ethical Review Board of each participating center for using the data was ensured. Apart from this,advanced authentication and authorization procedures are applied regarding both its end-users (LDAP-based passwords) andalso the information exchange between legacy repositories and the system core (VPN protocol).

However, the most important aspect of the aforementioned recommendations has to do with the anonymity of the sensitivemedical data. This has been achieved by applying two layers of anonymization. The first layer is at the site of the participating centersand involves the application of privacy enhancement algorithms (reduction of indirect re-identifiability, outlier suppression,replacement of dates by timestamps) and pseudonymization (replacement of nominative information by pseudonyms andsubsequent encryption of nominative information). For each participating repository, the first anonymization layer produces ade-identified repositorywhich is eventuallymade available to ASSIST (in fact, this is the legacy repository depicted in Fig. 1). A similaranonymization layer is applied at the Repository Schema Mapping module as an additional security measure.

Getting back to the description of the Repository Schema Mapping, the main purpose of the present module is to map theschema elements of the legacy repositories to entities of the ASSIST Ontology. The definition of these mappings requirescollaboration between technical administrators of the legacy repositories and knowledge engineers of ASSIST. As an example, theMTHFR-related schema elements of the two clinics in Table 1 are both mapped onto the MTHFR C677T Polymorphism Ala/Ala

class of the ASSIST Ontology. The outcome of this mapping procedure is the syntactic unification of the legacy repositories.The aforementioned mappings are defined manually through the mapping language of the D2RQ platform [26] and are stored

in an XML document (Repository Schema Mappings file in Fig. 1). The D2RQ Engine is employed to query the legacy repositoriesand transform the retrieved information into RDF triples that instantiate entities of the ASSIST Ontology.

2.5. Unified Medical Repository (UMR)

The purpose of this module is to unify the medical information contained in the diverse legacy repositories into a singlecontainer that can be accessed by ASSIST's users in a transparent and seamless manner. It is essentially a large repository thatemploys the ASSIST Ontology to express the heterogeneous CxCa-related information of the participating research centers.

The construction of the UMR is achieved with the help of the Inference Engine and the Repository Schema Mapping modules.First, the latter module undertakes the task of populating the UMR. I.e., the Repository Schema Mapping module producesinstances of ASSIST Ontology classes and literals that correspond to elements of the participating legacy repositories and relatesthem via the ASSIST Ontology properties. As we have discussed in the previous subsection, the repository that results from thisprocedure contains syntactically unified data.

Next, the Inference Engine takes over. The semantic data heterogeneity that stems from the use of diverse classificationschemes is tackled by applying on the UMR the first layer of the Inference Engine (see Section 2.3). Following the syntactic andsemantic homogenization of the contained data, the final step is the inference of the required diagnostic information that is notexplicitly stored in the legacy repositories. This is accomplished by applying layers 2 to 4 of the Inference Engine (see Section 2.3).

With respect to the implementation details, the OWLIM repository of the Sesame Server has been selected to materialize theUMR of ASSIST. This selection has been based mainly on Sesame's ability to handle large populated ontologies and on thesimplicity of its accompanying semantic query language (SeRQL). However, other semantic repositories and/or query languages(e.g., SPARQL) could be used as well for the implementation of the present module.

2.6. UI View Extractor

In ASSIST, a typical semantic system, the formulation of medical hypotheses and queries requires the utilization of concepts/terms from the ASSIST Ontology. However, as it has been discussed in Section 2.1, the presentation of the entire ASSIST Ontology(i.e., a very complex graph) to the system users, which are medical researchers, would make ASSIST very difficult to use. Thus, toensure system usability, we decided to “hide” the complexity of the knowledge representation from ASSIST users. This has beenaccomplished by extracting two files from the ASSIST Ontology in order to simplify query construction.

UI Tree File. This tree contains exactly those classes of the ASSIST Ontology that can be employed in the supported semanticqueries. Each node of this tree receives an expert-defined name that is familiar to medical researchers and is accompanied byadditional information extracted from the ontology. This information comprises the datatype properties that are connected

Page 8: ISSEL - Data & Knowledge Engineering · 2016-02-04 · Fig. 1. Outline of ASSIST's architectural components (system modules and configuration files). Semantic modules are indicated

167C. Maramis et al. / Data & Knowledge Engineering 86 (2013) 160–178

with the node-class and the allowed values of these properties. The UI Tree file is stored in XML format and, in practice,incorporates only a small portion of ASSIST Ontology classes.

UI Path File. For each class that is contained in the previous file, the present file includes the shortest path through the graph ofASSIST Ontology from a root class to the given class; the subsumption relations (is–a) are not taken into account in pathdefinition. The class Person has been selected as the ontology root for the purpose of path definition. A sample path is given inFig. 3. Owing to the design of the ASSIST Ontology, it can be proved that there is a unique shortest path connecting the classPerson with any given ontology class, as long as the hasCase property is omitted.

Among the previous files, only the UI Tree file is presented to the system users. The users can select nodes from this tree toformulate their hypotheses and, whenever they do, their selections are complemented by the information contained in the UIPath file –which is not presented to the user – in order to reconstruct powerful semantic queries in terms of the ASSIST Ontology.This way, the users are perfectly capable of constructing all the hypotheses and queries that are supported by the system withoutbeing overwhelmed by the complexity of the ASSIST Ontology.

The two files under discussion are generated with the help of the UI View Extractor module. This is essentially a standalonesoftware application utilized for building the “view” of the ASSIST Ontology that is eventually presented to the users. Theapplication first reads the ASSIST Ontology with the help of Jena API [27], and allows its operators, which are CxCa researchexperts, to determine the tree of “exploitable” ontology classes and also assign to each class a name familiar to medicalprofessionals. This hierarchy is stored in the UI Tree file. Then, the application automatically extracts from the ontology the pathsto the nodes of the UI Tree and stores them in the UI Path file. As it is obvious from Fig. 1, the UI View Extractor does notcommunicate directly with the subsystems of ASSIST.

2.7. User Interface (UI)

The User Interface module/subsystem constitutes the front-end of ASSIST to its users. It encapsulates the means to expose thefunctionality of ASSIST, allowing the users to uniformly and transparently access the unified biomedical repository for cervicalcancer and perform case–control association studies in a flexible and comprehensive way.

The UI is implemented using Java and Javascript as a web-based interface and it incorporates the two files that areproduced by the UI View Extractor. While the information of the UI Path file is not revealed to the user, the UI Tree file isemployed to dynamically generate and display a tree of medical concepts. This tree, whose top level is depicted in Fig. 4, isfundamental for the functionalities of the system. By dragging and dropping the nodes of the aforementioned tree, the user isable to

formulate queries for data retrieval: By selecting nodes from the displayed tree (i.e., ontology classes) and by subsequentlysetting constraints on the associated properties (i.e., ontology object and datatype properties), the user specifies the elementsof an intended query addressed to the UMR. Then, the UI retrieves from the UI Path file the ontology paths to the selectedclasses and combines them with the user selections to form a complex semantic query, ready for posing it to the UMR.

formulate association study hypotheses: In a two-step process the user defines the case and control groups of an intendedassociation study by selecting nodes from the displayed tree and by subsequently setting constraints on their associatedproperties. In a third step, the user specifies the factors of the study through the standard drag-and-drop procedure (always onthe same display tree). Then, the UI retrieves from the UI Path file the ontology paths to the selected classes and combinesthem with user selections to form, this time, two semantic queries: one for the case group and one for the control group. Bothqueries also request information related to the selected study factors.

Of course, the UI is perfectly capable of displaying the result-sets of the formulated queries (tables of raw data) and visualizingthe statistical results of the requested association studies (pie-charts, histograms, etc.).

Due to the requirement for subsystem interoperability, the communication of the UI with the System Core is carried out by theexchange of XML documents. Thus, the queries that are constructed at the UI are not expressed directly in SeRQL, which has beenselected as the official query language of ASSIST. Instead, ASSIST has defined an XML Query Schema that is mapped to theexploitable subset of SeRQL operators. This schema is employed by the UI to translate the user selections into XML queries thatcan later be transformed into meaningful SeRQL queries (see Section 2.8).

Person Case Visit Colposcopy

hasWorstCase caseHasPart hasIntervention

Fig. 3. Path through the ASSIST Ontology from class Person to class Colposcopy.

Page 9: ISSEL - Data & Knowledge Engineering · 2016-02-04 · Fig. 1. Outline of ASSIST's architectural components (system modules and configuration files). Semantic modules are indicated

Fig. 4. The graphical UI of ASSIST.

168 C. Maramis et al. / Data & Knowledge Engineering 86 (2013) 160–178

2.8. Semantic Index and Query

The Semantic Index and Query module is charged with the orchestration of the semantic functionalities of ASSIST. The presentmodule is critical for the interoperability of other system modules since it is responsible for the circulation of queries andresult-sets within the system and also for their transformation from one form to another.

The Semantic Index and Query module receives the XML-formatted queries sent from the UI and translates them into valid andmeaningful SeRQL strings based on its built-in knowledge of the 1–1 mapping between the XML Query Schema and the employedSeRQL operators. After this translation, the module queries UMR with the resulting SeRQL strings and retrieves the correspondingresult-sets. Then, it transforms the result-sets into an XML hierarchy of tabular structure. The resulting XML document is senteither to the Association Study Analysis module (see Section 2.9) for statistical processing, or to the UI for displaying. TheSemantic Index and Query module is also responsible for receiving the XML-formatted statistical results that are produced by theAssociation Study Analysis module and for propagating them to the UI.

2.9. Association Study Analysis

This module performs all the statistical computations and tests that are required for the conduction of a case–control geneticassociation study. The Rserve module of R (http://www.r-project.org/) is employed to establish connection with a dedicated Rserver that undertakes the aforementioned computations and tests. The module can also connect to dbSNP [28] – the NCBIdatabase of single nucleotide polymorphisms – in order to retrieve frequency and count information regarding the genotypes andalleles of the investigated polymorphisms. The Association Study Analysis module receives statistical analysis parameters and rawdata from the Semantic Index & Query module (Section 2.8) and, after carrying out the requested analysis, it returns the statisticalresults to the same module. This is its exclusive means of communicating with the rest of the system.

One of the functionalities of the present module is to provide a pre-hoc power analysis of an intended association study, bycomputing the required dataset size for a given value of statistical power and effect size. Moreover, when an associationhypothesis has been formulated, the Association Study Analysis module calculates the allele and genotype frequencies in the caseand control groups and it checks whether they are compliant with those retrieved from dbSNP. Finally, the module tests theformulated association hypothesis, by conducting chi-square and log-linear tests, by calculating odd-ratios and relative risks, etc.

Page 10: ISSEL - Data & Knowledge Engineering · 2016-02-04 · Fig. 1. Outline of ASSIST's architectural components (system modules and configuration files). Semantic modules are indicated

169C. Maramis et al. / Data & Knowledge Engineering 86 (2013) 160–178

2.10. MKB Editor

This module is a standalone software application that allows the modification of the system's MKB (ASSIST Ontology andMedical Implication Rules) whenever new scientific developments (e.g., new medical discoveries or guidelines) occur and/or theneeds of the system change. The rationale for building this application has been to develop an ontology and semantic query editorthat (1) provides exactly those editing capabilities that are necessary in the context of ASSIST, and (2) enforces a set of domainknowledge restrictions (“medical axioms”) in the editing process.

In other words, instead of employing a full-functionality ontology editor such as Protégé (http://protege.stanford.edu/) forupdating its ontology, ASSIST has developed a custom-made ontology editor that is lightweight but sufficient and also makes surethat the MKB always conforms to a set of ground-truth medical axioms. These axioms/restrictions are a safety mechanism toavoid overriding established medical knowledge and they apply also on editing type-II implication rules (semantic queries).

The MKB Editor allows knowledge engineers – always under the guidance of CxCa experts – to add, edit and delete ASSISTOntology concepts and medical implication rules of type-I (OWL-DL Class Axioms) and type-II (SeRQL Queries) in a mixture oftextual and graphical fashion. The application uses the Jena API [27] to load, modify and store ontology models. It is not directlyconnected with the MKB but can operate on the MKB files, i.e., the ASSIST Ontology and the Medical Implication Rules document(see Fig. 1). This way, necessary changes are carried out off-line, and the updated MKB files are submitted for approval to acommittee of CxCa experts. Once the knowledge modifications have been deemed medically correct, the updated files can besecurely uploaded to the MKB. The application's window for editing OWL-DL class axioms is shown in Fig. 5.

3. System usage

This section presents a prototype installation of ASSIST that is made available for CxCa researchers to use. Then, two completescenarios of ASSIST's usage are provided. The section concludes with a discussion regarding the system evaluation.

3.1. A prototype installation

A prototype installation of ASSIST can be accessed via the Internet at http://mug.ee.auth.gr/assist. This prototype incorporatesdata from three pilot research centers, namely:

• The 1st Department of Obstetrics and Gynecology, Papageorgiou Hospital (Thessaloniki, Greece).• The Laboratory of Gynecological Tumor Immunology, Charite Hospital (Berlin, Germany).• The Departments of Genetics, Obstetrics & Gynecology, and AnatomoPathology, University of Ghent (Ghent, Belgium).

Fig. 5. MKB Editor's window for editing OWL-DL class axioms.

Page 11: ISSEL - Data & Knowledge Engineering · 2016-02-04 · Fig. 1. Outline of ASSIST's architectural components (system modules and configuration files). Semantic modules are indicated

170 C. Maramis et al. / Data & Knowledge Engineering 86 (2013) 160–178

Some statistics may help the user anticipate the size of the system. The UMR resulting from the contribution of the three pilotsincorporates 3209 Persons, 3223 Colposcopies, 5311 Cytologies, 1742 Histologies, and 4100 HPV Tests. These are associated with5986 Cases, which are diagnosed with ASSIST0 (3283), ASSIST1 (1152), ASSIST2 (799) and ASSIST3 (270), while the CxCa severityof 482 Cases is undetermined. The UMR incorporates genetic information concerning 5 CxCa-suspicious polymorphisms: CYP1A1(2018 persons), CYP2E1 (1974 persons), GSTM1 (1995 persons), GSTT1 (1995 persons), MTHFR (1981 persons), and p53Codon72 (1973 persons).

3.2. Usage scenarios

The usage scenarios to be discussed constitute two sample case–control association studies that have been actually conductedvia ASSIST on its extended CxCa repository (UMR). The purpose of these scenarios is twofold: First, to illustrate the operation ofthe system and to explain the role of the semantic modules within ASSIST's framework. Second, to prove the utility and value ofthe system, since the presented studies are examples of actual contribution to CxCa research.

3.2.1. Scenario 1Imagine the following case: Dr. X, who is a medical researcher in the CxCa domain, would like to use ASSIST in

order to test whether there is an association between the presence of the MTHFR polymorphism and the onset of CxCa.The cases for this study will be patients with CxCa (ASSIST3), while the controls will be healthy individuals (ASSIST0).Dr. X intends to perform the study on subjects from a single clinic (Papageorgiou Hospital). The scenario evolves asfollows:

1. Dr. X defines the parameters of the intended study through the system's graphical UI in a wizard-style fashion. First, shedetermines the inclusion criteria for the case and control groups by selecting nodes from the displayed tree of medicalconcepts. In the same way, she defines the study factor(s) of the study. The described procedure has been captured in Fig. 4.Then, the UI module uses the XML Query Schema to generate two ontology-compliant XML queries based on the userselections.

2. The generated queries are sent by the UI to the Semantic Index & Query module along with the defined study factor(s). Thelatter module transforms the received XML queries into the corresponding SeRQL strings and consequently queries the UMR.TheSeRQL query for the case group is given in Listing 2. The Semantic Index & Query module retrieves the result-sets andgenerates a “tabular” XML document that incorporates the retrieved information of the case and control groups. The generatedXML document is then passed to the UI.

3. At the UI side, Dr. X can see the retrieved result-sets for the two groups as well as their distribution with respect to theselected study factor(s). Then, she realizes that the case group includes only 4 patients and therefore no reliable ortrustworthy statistical analysis can be performed on the formulated association hypothesis. In order to increase the size ofthe case group, Dr. X decides to re-formulate the inclusion criteria in the two groups by considering patients from allthree participating clinics. Steps 1 and 2 are revisited to revise the hypothesis and retrieve the new case and controlgroups.

4. The size of the revised case group is satisfactory (176 patients) and Dr. X decides to proceed with the association analysis.The Semantic Index & Query module sends the study factor(s) and the XML-formatted case and control group data to theAssociation Study Analysis module. The latter module performs the request statistical analysis and replies with the results.The statistical results are forwarded to the UI, which displays them, thus completing the conduction of the associationstudy.

The performed analysis indicates that there is no statistically significant association between MTHFR and the onset ofCxCa (ASSIST3). However, the present usage scenario demonstrates how the testing of a meaningful association hypothesisthat cannot be carried out in a single clinic (Papageorgiou Hospital) because of insufficient data is made possible owning toASSIST.

Listing 2. The SeRQL query for the original case group of Scenario 1.

SELECT DISTINCT Person, MTHFR_Subclass

FROM {Person} rdf: type {asNS : Person};

asNS : hasWorstCase {aCase};

asNS : hasPoymorphism {MTHFR_C677T_Polymorphism},

{aCase} asNS : diagnoses {ASSIST3_Severity};

asNS : caseHasPart {Visit},

{ASSIST3_Severity} rdf : type {asNS : ASSIST3_Severity},

{MTHFR_C677T_Polymorphism} rdf : type {MTHFR_SubClass},

{MTHFR_SubClass} serql : directSubClassOf {asNS : MTHFR_C677T_Polymorphism},

{Visit} asNS : madeAtClinic {Clinic},

{Clinic} rdf : type {asNS : Thessaloniki_Clinic}

WHERE (NOT (isBNode (SubClass)))

USING NAMESPACE asNS = bhttp://assist.iti.gr/assist_ontology.owl#>

Page 12: ISSEL - Data & Knowledge Engineering · 2016-02-04 · Fig. 1. Outline of ASSIST's architectural components (system modules and configuration files). Semantic modules are indicated

3.2.2. Scenario 2

Imagine now that Dr. X would like to investigate whether the presence of the MTHFR polymorphism has an effect on High

Grade Cervical Intraepithelial Neoplasia (ASSIST2). Healthy individuals (ASSIST0) will be used as controls. Dr. X intends to employsubjects from one clinic only (Charite Hospital) for her study. The scenario evolves as follows:

1. Dr. X defines the parameters of the intended study exactly as in the first step of the previous scenario. When the parameterselection procedure is completed, the UI module generates the XML queries for the case and control groups.

2. This step is the same as Step 2 of Scenario 1. The generated queries are transformed into SeRQL queries and posed to the UMR.The result-sets are transformed into a “tabular” XML format and sent back to the UI.

3. The size of both the case and the control group are satisfactory (239 and 263 subjects, respectively). Dr. X decides to proceedwith the statistical analysis, which is undertaken by the Association Study Analysis module. The results indicate that MTHFR isnot significantly associated with the case group subjects (ASSIST2).

4. However, Dr X. wishes to improve the statistical power and size effect of her analysis. Thus, she initiates a new associationstudy by slightly modifying her initial hypothesis: Now, subjects can be selected from all the clinics participating in ASSIST.Steps 1, 2, and 3 of the present scenario are revisited in order to put through the new study.

Both the case and control group of the second association study are significantly increased in size (484 and 825 subjects,respectively). The results of the revised study, which are displayed in Fig. 6, are differentiated from those of the initial study: Thissecond study indicates that there is a statistically significant association between the MTHFR polymorphism and High GradeCervical Intraepithelial Neoplasia (ASSIST2). This is a real-life example of how CxCa research can be benefited from ASSIST.Testing the association hypothesis on the rich, unified repository of ASSIST has revealed an important association betweenMTHFRand ASSIST2, which has been ignored by the – scientifically sound – association study on the single-clinic dataset.

3.3. System evaluation

The utter goal of ASSIST is to allowmedical researchers to conduct case–control association studies on an extended (i.e., unified)CxCa dataset by means of a single, easy to use user interface. Themedically significant association studies of Section 2 as well as thefeedback of several CxCa researchers from their daily usage of the ASSIST prototype prove that the aforementioned goal has beenachieved. Since the great technical challenge has been to build ASSIST around semantic web technologies, the fact that the systemis perfectly functional demonstrates that the aforementioned technologies have been successfully and seamlessly integrated intoASSIST.

171C. Maramis et al. / Data & Knowledge Engineering 86 (2013) 160–178

Fig. 6. Screen capture of ASSIST's UI displaying the results of the revised association study of Usage Scenario 2.

Page 13: ISSEL - Data & Knowledge Engineering · 2016-02-04 · Fig. 1. Outline of ASSIST's architectural components (system modules and configuration files). Semantic modules are indicated

172 C. Maramis et al. / Data & Knowledge Engineering 86 (2013) 160–178

Owing to the wizard-style design of UI, the conduction of a complete association study requires only a few clicks anddrag-and-drop operations, which are performed in a matter of seconds. The system response to the requests of the user is equallyfast. The usability of the system has been evaluated with the help of the widely-used System Usability Scale questionnaire [29].The score of the questionnaire falls in the range [0, 100], with higher values indicating better usability. Eight users were asked tocarry out the scenarios of Section 2 and fill in the questionnaire. The results were quite satisfactory: The acquired scores rangedfrom 62.5 to 87.5 with a mean score of 74.7 – which is considered above average [30] – and standard deviation of 8.71.

Finally, the response speed of ASSIST to its user requests was evaluated. For this purpose the scenarios of Section 2 wereexecuted by the ASSIST prototype, which is installed on a PC equipped with an Intel® P4 @3 GHz processor and 1 GB of RAM. Theexecution times that were measured for the tasks involved in the scenarios are given in Table 3. These clearly indicate that thesystem response to user requests is essentially instant.

4. System extension

The extensibility of ASSIST is discussed in the present section by focusing on three possible extension directions. For eachdirection, we describe the modifications that need to be made to ASSIST in order to implement the extension, and we also specifythe system components (modules and configuration files) that are going to be affected. The extension directions are presented inorder of increasing complexity with respect to the necessary modifications.

4.1. Incorporation of new repositories

In case a newmedical center decides to join ASSIST, its repository has to be integrated into the system's UMR. For this purpose, theschema of the new repository needs to bemapped to elements of the ASSIST Ontology. Since ASSIST does not providemechanisms forautomatic syntactic integration, this mapping has to be defined manually. Fortunately though, the system incorporates valuableresources for the semantic integration of new CxCa repositories. These aremainly the introduced classification scheme alongwith themapping of all established CxCa-related diagnostic classification schemes to the ASSIST classification (see Sections 1 and 2). This waythe semantic integration of a new repository that adheres to established classification schemes is performed automatically, withoutany need for modifications in the MKB.

Based on the previous analysis, when a new repository has to be incorporated into ASSIST, the Repository Schema Mappings fileis updated manually by adding the schema mappings of the new repository. Then, the Repository Schema Mapping moduleemploys the updated file to generate an extended UMR that incorporates the contents of the new repository.

4.2. Incorporation of new CxCa knowledge

As scientific research continually progresses, it is expected that the present medical knowledge in the CxCa domain will haveto be updated (e.g., new relations betweenmedical concepts might emerge, newmedical rules might be added). Obviously, in thiscase, the MKB of ASSIST (both the ASSIST ontology and the Medical Implication Rules) will have to be modified accordingly. As aconsequence, the mappings of the participating repositories to the modified ASSIST Ontology will also need to be revised.

In order to extend ASSIST by adding new CxCa knowledge, we will start by modifying MKB with the help of the MKB Editor.Then, the Repository Schema Mappings file will be updated manually in accordance with the ontology modifications and theRepository Schema Mapping module will generate the updated UMR. Finally, the UI View Extractor will be employed to extractappropriately modified versions of the UI Tree and UI Path files.

4.3. Incorporation of new medical domains

The extension of ASSIST to new medical domains (e.g., other cancer types) is very similar, in implementation terms, with theprevious extension direction (addition of new CxCa knowledge). In this case too, the system's MKB will have to be updated, thusleading to the same sequence of module and configuration file modifications as in the previous extension direction. The only

Table 3Execution times measured by the ASSIST prototype for the tasks involved in the usage scenarios ofSection 2. For both scenarios, the 1st retrieval of case and control groups refers to the original study,while the 2nd groups retrieval refers to the revised one. The testing hypothesis time obviously refers tothe revised study.

Time (s)

Scenario 1 Groups retrieval (1) b2Groups retrieval (2) b2Hypothesis testing b3

Scenario 2 Groups retrieval (1) b3Groups retrieval (2) b3Hypothesis testing b3

Page 14: ISSEL - Data & Knowledge Engineering · 2016-02-04 · Fig. 1. Outline of ASSIST's architectural components (system modules and configuration files). Semantic modules are indicated

173C. Maramis et al. / Data & Knowledge Engineering 86 (2013) 160–178

difference has to do with the supported association study analysis capabilities. More specifically, in order for the system to be ableto perform case–control association studies in other medical domains as well, the Association Study Analysis module needs to beupgraded with new domain-specific analysis operations (e.g., database connectivity to retrieve information about domain-specific study factors — see Section 2.9).

According to the previous analysis, when a new medical domain has to be incorporated into ASSIST, the MKB is the firstmodule to be modified; this could theoretically be performed by the MKB editor. However, given the extent of the ontologyadditions that would be probably needed in such a case, it might be wise to seek for alternative ontology developmentapproaches. Therefore, state-of-the-art methodologies for large-scale ontology development and their associated tools can beconsidered (e.g., concurrent [31] or modular ontology development [32]). After the MKB has been modified, the RepositorySchema Mappings file is manually updated and the Repository Schema Mapping module generates the updated UMR. Next, the UIView Extractor produces appropriately modified versions of the UI Tree and UI Path files. Finally, the Association Study Analysismodule is modified so as to incorporate analysis operations required by the new domain.

The system module and configuration file modifications that are associated with the three extension directions aresummarized in Table 4.

5. Related work

When trying to specify the position of the presented system in the universe of relevant scientific and/or technological efforts,we must recall what ASSIST really is: A materialized system that employs existing semantic web technologies in order to integrateheterogeneous medical data sources; the achieved data integration results in a large-scale repository of medical data that isnecessary for a specific medical research task – in our case, the conduction of association studies in the CxCa domain. Keeping thisshort description of ASSIST in mind, we must seek for related work in large-scale collaborative efforts – mostly internationalresearch or R&D projects – that attempt to integrate heterogeneous biomedical data by means of established semantic webtechnologies (mainly involving the use of ontologies). These projects should aim at delivering well-sized pools of integrated datato be used for performing specific medical research or healthcare tasks.

The works that belong in the context defined above are differentiated mostly with respect to the three following characteristics:

Semantic Architecture & Implementation. While all of them employ semantic web technologies, each system is able toincorporate a different combination of semantic technologies in its architecture and to seek for the preferred implementationof the employed semantic technologies among the available semantic tools.

Medical Application Domain. The medical domain of interest of each project determines the type, amount, and diversity of datathat need to be integrated in the system, potentially setting additional challenges on the attempted data integration procedure.

Medical Application Task. This is the task that the system has been designed for (e.g., decision support services or data mining)and usually indicates the semantic technologies that have to be employed by the designed system. For instance, a decisionsupport system most often has to possess reasoning capabilities.

In the rest of this section we briefly present the state-of-the-art systems that are related to ASSIST, focusing –when possible –

on the three aforementioned characteristics. Regarding the connection of these systems with ASSIST, we must note that theyshould by no means be considered competing approaches. Instead, ASSIST moves in parallel course with the presented efforts,being identified by its own unique triple of Semantic Architecture & Implementation, Medical Application Domain, and MedicalApplication Task characteristics.

The @neurIST (Integrated biomedical informatics for the management of cerebral aneurysm) project (http://www.aneurist.org/) has developed a grid infrastructure based on web services to support the research, diagnosis and treatment of cerebralaneurysms and subarachnoid hemorrhage [33]. The @neuroIST platform brings together heterogeneous data from theaforementioned medical domain to provide decision support services (risk assessment, treatment planning) and also knowledgediscovery regarding aneurysm. To facilitate data integration, the @neurIST Ontology for Intracranial Aneurysms [34] has beendeveloped in OWL-DL; it is built upon the DOLCE upper ontology and partially includes parts of the FMA ontology and UMLSmetathesaurus. Using the entities of the @neurIST Ontology (global schema), a data mediation service semantically annotates the

Table 4System component (modules and configuration files) modifications that areassociated with the presented extension directions. The following correspon-dences of system components to indices apply: (1) MKB module, (2) SchemaRepository Mappings file, (3) UMR module, (4) UI Tree file, (5) UI Path file, (6)Association Study Analysis module.

Extension direction Required modifications

New repository (2)–(3)New CxCa knowledge (1)–(2)–(3)–(4)–(5)New medical domain (1)–(2)–(3)–(4)–(5)–(6)

Page 15: ISSEL - Data & Knowledge Engineering · 2016-02-04 · Fig. 1. Outline of ASSIST's architectural components (system modules and configuration files). Semantic modules are indicated

174 C. Maramis et al. / Data & Knowledge Engineering 86 (2013) 160–178

schemata of the heterogeneous data sources to produce mappings between them and the global schema, eventually integratingthem into a single virtual data source. Pellet [22] is utilized for the ontology reasoning that is associated with the knowledgediscovery task of @neurIST.

The DebugIT (Detecting and Eliminating Bacteria UsinG Information Technology) project (http://www.debugit.eu/) hasemployed clinical and operational information regarding bacteria and antibiotics from Clinical Information Systems (CIS) acrossEurope to build a virtual large-scale Clinical Data Repository (CDR). Data mining techniques on both patient and population levelhave been applied on the CDR to discover new knowledge, which is used by a decision support and monitoring tool in the clinicalenvironment [35]. The CDR schema is defined by the DebugIT Core Ontology (DCO), which focuses on patients, diseases,pathogens, their analyses and medications [36]. DCO extends the BioTop mid-level ontology [37] and re-uses several establishedbiomedical vocabularies and ontologies. For each CIS a wrapper extracts and transforms the contained data into the customizedphysical schema of a local CDR; this schema serves as the local CDR's external interface to the virtual integrated CDR. Hermit(http://hermit-reasoner.com/) is DebugIT's reasoner of choice.

The ACGT (Advancing Clinico-Genomic Trials on Cancer) project (http://eu-acgt.org/) has created and tested a unified technologicalinfrastructure for cancer-related clinical trials that facilitates the seamless access and analysis of multi-level clinico-genomic dataenriched with high-performing knowledge discovery operations and services [38]. More specifically, ACGT has focused on trials in thedomain of breast cancer and pediatric nephroblastoma. For the task of data integration the project has developed the ACGT MasterOntology [13]which is a – quite generic – commondomain OWL-DL ontology for cancer research andmanagement, built upon BFO [12].Data integration is achieved by means of a single virtual repository with the help of a subset of the Master Ontology (serving as globalschema) and a common mediator-wrapper architecture. The main results and lesson learned from the project are outlined in [39].

The Neuroweb (Integration and sharing of information and knowledge in neurology and neurosciences) project (http://nuke.neurowebkc.eu/) has shared its Medical Application Task with ASSIST. Neuroweb has developed a web-based platform that supportsassociation studies in the field of cerebrovascular disease. The shared medical task also explains the many similarities of the twoprojects regarding the Semantic Architecture and Implementation characteristic. Neuroweb has built from scratch the NeurowebReference Ontology [40], an OWL-DL ontology that is oriented towards the conduction of association studies in the aforementionedmedical domain. The reference ontology provides the virtual – in contrast to ASSIST – global schema that is required to semanticallyreconcile (i.e., integrate) the repositories of four clinical centers of excellence in cerebrovascular disease, by exploiting mappingsbetween the reference ontology elements and those of the local repositories, which are saved as SQL views of the local schemata.Apart from clinical information, the Neuroweb Reference Ontology is extended to include biomedical entities from external sources,e.g., by linking the Neuroweb biomolecular processes with Gene Ontology (GO) functional annotations.

The following two research efforts lean more towards the semantic annotation of integrated biomedical data rather than thesemantic integration of biomedical data. The HeC (Health-e-Child) project (http://www.health-e-child.org/) has focused onvertical integration – i.e., from the cellular to patient level – of biomedical data in order to provide European pediatrics with anintegrated healthcare platform with exemplars in pediatric heart diseases, inflammatory diseases and brain tumors. Thedeveloped platform provides decision support, knowledge discovery and disease modeling applications that access data inhospitals of three EU countries. The data integration in the HeC project relies on the separation of encoded information into data,metadata and semantics [41]. While the integration is achieved by populating the first two information components, it is theinstantiation of the semantics component that improves the querying, visualization, and reasoning of the integrated data [42].This is a – post-integration – semantic annotation procedure that employs terms from established biomedical knowledge sources,such as the Foundational Model of Anatomy (FMA) ontology. End-user access to the integrated data model of HeC has beensimplified via an ontology-assisted query reformulation methodology [43,44].

The main goal of the i2b2 (Informatics for Integrating Biology & the Bedside) initiative (https://www.i2b2.org/) is to provideclinical investigators with the software tools necessary to collect and manage clinical research data for a series – currently seven –

of disease-based driving biology projects by integrating medical record data and associated genomic data. The software suite isnot tied to specific types of data but can be extended for new and unanticipated data types as well as functionalities. For thispurpose, i2b2 uses a framework called “Hive”, which comprises various Cells, each providing different functionality and/or data.The Hive framework is based on web services through which the Cells export their functionality [45]. The system uses a relationaldata model to store data (the Data Repository Cell) and some unprescribed ontology (e.g., an established vocabulary) tosemantically annotate the terms and concepts of the aforementioned schema (the Ontology Cell), thus allowing data from severalrepositories to come together [46]. The translational impact of the project worldwide is quite impressive [47].

Two projects with a looser connection to ASSIST are HEARTFAID and SmartHEALTH. Both projects have employed semantictechnologies in a certain medical domain but not for the purpose of medical repository integration as did the previous projects. Morespecifically, HEARTFAID (http://lis.irb.hr/heartfaid/) has developed a knowledge-based patient-centric platform which providesdecision support and knowledge discovery services to improve prognosis, diagnosis and therapy provision in the domain ofcardiovascular diseases, with specialization in heart failure [48]. The objective of the Heart Failure Ontology, which has beendeveloped within the HEARTFAID project, is to set the ground for the reasoning tasks that are required for the decision supportservices of the platform [49]. On the other hand, SmartHEALTH (http://www.smarthealthip.com/) has aimed at integrating intohealthcare systems a series of intelligent diagnostic technologies regarding three key applications in cancer diagnostics (breast,cervical, colorectal cancer). For this purpose,medical devices have been equippedwith a semantic infrastructure, i.e., a set of semantictechnology tools (e.g., the novel μOR reasoner for resource-constrained environments [50]) to ensure the interoperability amongthese devices, clinical information systems, and laboratory information systems [51]. In this framework, a microfluidic multisensorplatform for cancer diagnostics has been developed [52].

Page 16: ISSEL - Data & Knowledge Engineering · 2016-02-04 · Fig. 1. Outline of ASSIST's architectural components (system modules and configuration files). Semantic modules are indicated

175C. Maramis et al. / Data & Knowledge Engineering 86 (2013) 160–178

Two ongoing efforts that are also worth mentioning are the PONTE (Efficient Patient Recruitment for Innovative Clinical Trialsof Existing Drugs to Other Indications) and the VPH-Share project. PONTE (http://www.ponte-project.eu/) aims at effectivelyguiding researchers through clinical trial protocol writing by allowing them to perform intelligent queries to various sources andby offering automatic selection of individuals eligible to participate in clinical trials [53,54]. To achieve this, the systemincorporates a core ontology, which has several sub-layers and is implemented in OWL, while semantic querying is based onSPARQL. The mapping of the patient data from various coding systems to ontology terms is made by a dedicated module; theSemantic Mapper. The decision support sub-system incorporates a dedicated reasoning engine which operates on top of if–thenrules. VPH-Share (http://www.vph-share.eu/), which has recently been funded in the context of the Virtual Physiological Human(VPH) Initiative, will enable, by means of Cloud Technologies, the sharing of multiple VPH-relevant datasets in order to facilitatethe construction and operation of new VPH workflows [55,56]. VPH-Share will provide the required infrastructure and services tosupport the full life-cycle of VPH data, including semantic data annotation, integration and access. Semantic data inference,knowledge discovery and management services will be indispensable components of the intended VPH-Share framework.

Furthermore, we should mention two research efforts that, in a manner, escape the pattern of the previous works. BioDB [57] is arecentmedium-scale integration effort whose rationale is lying close to that of HeC and i2b2. BioDB is an information system wheread-hoc heterogeneous biomedical data can be imported and semantically annotated so that they can be queried in a federatedmanner. For the task of semantic annotation, BioDB provides an ontology framework which includes a repository of uploadedontologies and a mechanism to (1) associate schema elements of an imported data source with an uploaded ontology, and(2) establish inter-ontologymappings as sets of axioms. The Neuroscience Information Framework (NIF at http://www.neuinfo.org/)is an application system built upon the BioDB infrastructure [58] and driven by a set of standardized neuroscience-related ontologies[59]. On the other hand, LinkHub [60] is a semantic web system, focused on the field of proteomics, which builds an RDF graph ofrelationships among unique identifiers of individual entities (i.e., documents) residing in separate web-accessible biologicaldatabases (e.g., UniProt, PFAM, Gene Ontology); the resulting graph is employed to produce a “hub of hubs” organization of thecontents of the aforementioned biological databases in order to provide improved and federated information retrieval from theidentified documents residing therein (e.g., the primary sequence of a protein).

Finally, from an ontology design point of view, the ASSIST Ontology, as mentioned in Section 2, bears structural similaritieswith other efforts. To this end, ACGT-MO [13] has been a great source of inspiration, due to its high quality and its closely relateddomain. A number of entities have also been borrowed by OBI [15], which contains a very rich set of concepts related to all aspectsof clinical and biological trials, OGMS [14], which includes more generic medical terms, and EFO [16], which provides an extensivedescription of experimental biomedical variables. In the same time, numerous domain-specific medical ontologies designed tofacilitate semantic data integration – and thus similar in purpose with the ASSIST Ontology – keep being reported (e.g., an ECGdata ontology [61], the Bone Dysplasia Ontology [62], the Disease Ontology [63], and an ontology for sharing Biobank data [64]).

6. Conclusion

The adopted semantic paradigm has greatly benefited ASSIST. Apart from efficiently resolving the issues of data homogenizationand diagnostic information extraction as described in Section 2, the employed semantic technologies have significant contribution tocertain virtues of the system, namely usability, response speed (Section 3) and extendibility (Section 4).

Semantic technologies present a powerful and yet untapped potential in the field of medical informatics. The ability to notonly integrate data from different sources, but also to perform inference operations on that data can find multiple uses in differentmedical domains. While ASSIST focuses on cervical cancer, the same technologies and principles can be used in other fields ofmedical research as well.

ASSIST is a step in this direction, which clearly demonstrates the benefits of semantic web technologies as well as how theymight be put into use more effectively. Further research is still needed in the areas of syntactic data integration (e.g., employingnatural language processing technologies for integrating unstructured medical archives and similarity-based methods forfacilitating syntactic matching) andmultiple user roles (e.g., handling multiple views of the ASSIST Ontology). The lessons learnedin the development of ASSIST, however, indicate that the implementation of a general-purpose (e.g., covering more than onemedical domain) medical research tool that utilizes semantic technologies to conduct genetic association studies is entirelyfeasible.

Acknowledgments

This work was partially supported by the IST-2004-027510 project, entitled “Association Studies Assisted by Inference andSemantic Technologies (ASSIST)”, funded by the Commission of the European Community (CEC).

References

[1] S. Landis, T. Murray, S. Bolden, P. Wingo, Cancer statistics, 1999, CA: A Cancer Journal for Clinicians 49 (1999) 8.[2] J. Walboomers, M. Jacobs, M. Manos, F. Bosch, J. Kummer, K. Shah, P. Snijders, J. Peto, C. Meijer, N. Munoz, Human papillomavirus is a necessary cause of

invasive cervical cancer worldwide, The Journal of Pathology 189 (1999) 12–19.[3] T. Agorastos, K. Dinas, B. Lloveras, S. de Sanjose, J. Kornegay, H. Bonti, F. Bosch, T. Constantinidis, J. Bontis, Human papillomavirus testing for primary

screening in women at low risk of developing cervical cancer. The Greek experience, Gynecologic Oncology 96 (2005) 714–720.[4] H. Cordell, D. Clayton, Genetic association studies, The Lancet 366 (2005) 1121–1131.

Page 17: ISSEL - Data & Knowledge Engineering · 2016-02-04 · Fig. 1. Outline of ASSIST's architectural components (system modules and configuration files). Semantic modules are indicated

176 C. Maramis et al. / Data & Knowledge Engineering 86 (2013) 160–178

[5] M. Falelakis, C. Maramis, I. Lekka, P. Mitkas, A. Delopoulos, An ontology for supporting clinical research on cervical cancer, Proc. of the InternationalConference on Knowledge Engineering and Ontology Development (KEOD), INSTICC Press, October 2009, pp. 103–108.

[6] G. Papanicolaou, Atlas of exfoliative cytology, volume 6, Harvard University Press, Cambridge, Mass., 1954.[7] R. Kurman, D. Solomon, The Bethesda System for reporting cervical/vaginal cytologic diagnoses, International Journal of Gynecologic Pathology 14 (1995) 189.[8] T. Agorastos, V. Koutkias, M. Falelakis, I. Lekka, T. Mikos, A. Delopoulos, P. Mitkas, A. Tantsis, S. Weyers, P. Coorevits, et al., Semantic integration of cervical

cancer data repositories to facilitate multicenter association studies: the ASSIST approach, Cancer Informatics 8 (2009) 31–44.[9] N. Guarino, Semantic matching: formal ontological distinctions for information organization, extraction, and integration, Information Extraction A

Multidisciplinary Approach to an Emerging Information Technology, Springer, Berlin Heidelberg, 1997, pp. 139–170.[10] M. Smith, C. Welty, D. McGuinness, OWL Web ontology language guide, [Online] http://www.w3.org/TR/owl-guide, (last access February 2013).[11] B. Smith, M. Brochhausen, Establishing and harmonizing ontologies in an interdisciplinary health care and clinical research environment, Studies in Health

Technology and Informatics 134 (2008) 219–233.[12] Basic formal ontology, [Online] http://www.ifomis.org/bfo, (last access February 2013).[13] M. Brochhausen, A. Spear, C. Cocos, G. Weiler, L. Martín, A. Anguita, H. Stenzhorn, E. Daskalaki, F. Schera, U. Schwarz, et al., The ACGT Master Ontology and its

applications — towards an ontology-driven cancer research and management system, Journal of Biomedical Informatics 44 (2011) 8–25.[14] R. Scheuermann,W. Ceusters, B. Smith, Toward an ontological treatment of disease and diagnosis, Summit on Translational Bioinformatics 2009 (2009) 116–120.[15] R. Brinkman, M. Courtot, D. Derom, J. Fostel, Y. He, P. Lord, J. Malone, H. Parkinson, B. Peters, P. Rocca-Serra, et al., Modeling biomedical experimental

processes with OBI, Journal of Biomedical Semantics 1 (2010) S7.[16] J. Malone, E. Holloway, T. Adamusiak, M. Kapushesky, J. Zheng, N. Kolesnikov, A. Zhukova, A. Brazma, H. Parkinson, Modeling sample variables with an

Experimental Factor Ontology, Bioinformatics 26 (2010) 1112–1118.[17] F. Baader, D. Calvanese, D. McGuinness, P. Patel-Schneider, D. Nardi, The Description Logic Handbook: Theory, Implementation, and Applications, Cambridge

University Press, 2003.[18] U. Schenck, H. Soost, Nomenklatur und Befundwiedergabe in der gynäkologischen Zytologie, der Referatband 10 (1989) 306–314.[19] E. Prud'Hommeaux, A. Seaborne, SPARQL query language for RDF, W3C recommendation, [Online] http://www.w3.org/TR/rdf-sparql-query, (last access

February 2013).[20] J. Broekstra, A. Kampman, SeRQL: a second generation RDF query language, Proc. of SWAD-Europe Workshop on Semantic Web Storage and Retrieval,

November 2003, pp. 13–14.[21] D. Tsarkov, I. Horrocks, FaCT++ description logic reasoner: system description, Automated Reasoning, Springer, Berlin Heidelberg, 2006, pp. 292–297.[22] E. Sirin, B. Parsia, B. Grau, A. Kalyanpur, Y. Katz, Pellet: a practical OWL-DL reasoner, Web Semantics: Science, Services and Agents on the World Wide Web 5

(2007) 51–53.[23] V. Haarslev, R. Müller, RACER system description, Automated Reasoning, Springer, Berlin Heidelberg, 2001, pp. 701–705.[24] A. Kiryakov, D. Ognyanov, D. Manov, OWLIM — a pragmatic semantic repository for OWL, Web Information Systems Engineering WISE 2005 Workshops,

Springer, Berlin Heidelberg, 2005, pp. 182–192.[25] J. Broekstra, A. Kampman, F. Harmelen, Sesame: a generic architecture for storing and querying RDF and RDF schema, The Semantic Web — ISWC 2002,

Springer, Berlin Heidelberg, 2002, pp. 54–68.[26] C. Bizer, A. Seaborne, D2RQ-treating non-RDF databases as virtual RDF graphs, Proc. of the 3rd International Semantic Web Conference (ISWC),

Springer-Verlag, November 2004, pp. 26–27.[27] B. McBride, Jena: a semantic web toolkit, IEEE Internet Computing (2002) 55–59.[28] S. Sherry, M. Ward, M. Kholodov, J. Baker, L. Phan, E. Smigielski, K. Sirotkin, dbSNP: the NCBI database of genetic variation, Nucleic Acids Research 29 (2001) 308–311.[29] J. Brooke, SUS: a “quick and dirty” usability scale, Usability Evaluation in Industry, volume 189, Taylor & Francis Groups, 1996, p. 194.[30] J. Sauro, Measuring usability with the system usability scale (SUS), [Online] http://www.measuringusability.com/sus.php, (last access February 2013).[31] E. Jiménez Ruiz, B. Grau, I. Horrocks, R. Berlanga, Supporting concurrent ontology development: framework, algorithms and tool, Data & Knowledge

Engineering 70 (2011) 146–164.[32] T. Özacar, Ö. Öztürk, M. Ünalır, ANEMONE: an environment for modular ontology development, Data & Knowledge Engineering 70 (2011) 504–526.[33] S. Benkner, A. Arbona, G. Berti, A. Chiarini, R. Dunlop, G. Engelbrecht, A. Frangi, C. Friedrich, S. Hanser, P. Hasselmeyer, et al., @neurIST: Infrastructure for

advanced disease management through integration of heterogeneous data, computing, and complex processing services, IEEE Transactions on InformationTechnology in Biomedicine 14 (2010) 1365–1377.

[34] M. Boeker, H. Stenzhorn, K. Kumpf, P. Bijlenga, S. Schulz, S. Hanser, The @neurIST ontology of intracranial aneurysms: providing terminological services foran integrated IT infrastructure, AMIA Annual Symposium Proc. American Medical Informatics Association, November 2007, pp. 56–60.

[35] C. Lovis, D. Colaert, V. Stroetmann, DebugIT for patient safety improving the treatment with antibiotics through multimedia data mining of heterogeneousclinical data, Studies in Health Technology and Informatics 136 (2008) 641–646.

[36] D. Schober, M. Boeker, J. Bullenkamp, C. Huszka, K. Depraetere, D. Teodoro, N. Nadah, R. Choquet, C. Daniel, S. Schulz, The DebugIT core ontology: semanticintegration of antibiotics resistance patterns, Studies in Health Technology and Informatics 160 (2010) 1060–1064.

[37] S. Schulz, E. Beisswanger, L. Van Den Hoek, O. Bodenreider, E. Van Mulligen, Alignment of the UMLS semantic network with BioTop: methodology andassessment, Bioinformatics 25 (2009) i69–i76.

[38] M. Tsiknakis, M. Brochhausen, J. Nabrzyski, J. Pucacki, S. Sfakianakis, G. Potamias, C. Desmedt, D. Kafetzopoulos, A semantic grid infrastructure enablingintegrated access and analysis of multilevel biomedical data in support of postgenomic clinical trials on cancer, IEEE Transactions on Information Technologyin Biomedicine 12 (2008) 205–217.

[39] A. Bucur, S. Ruping, T. Sengstag, S. Sfakianakis, M. Tsiknakis, D. Wegener, The ACGT project in retrospect: lessons learned and future outlook, ProcediaComputer Science 4 (2011) 1119–1128.

[40] G. Colombo, D. Merico, G. Boncoraglio, F. De Paoli, J. Ellul, G. Frisoni, Z. Nagy, A. van der Lugt, I. Vassányi, M. Antoniotti, An ontological modeling approach tocerebrovascular disease studies: the NEUROWEB case, Journal of Biomedical Informatics 43 (2010) 469–484.

[41] A. Branson, T. Hauer, R. McClatchey, D. Rogulin, J. Shamdasani, A data model for integrating heterogeneous medical data in the Health-e-Child project,Studies in Health Technology and Informatics 138 (2008) 13.

[42] T. Hauer, D. Rogulin, S. Zillner, A. Branson, J. Shamdasani, A. Tsymbal, M. Huber, T. Solomonides, R. Mcclatchey, An architecture for semantic navigation andreasoning with patient data-experiences of the Health-e-Child project, Proc. of the 7th International Conference on The Semantic Web (ISWC),Springer-Verlag, October 2008, pp. 737–750.

[43] K. Munir, M. Odeh, R. McClatchey, Ontology assisted query reformulation using the semantic and assertion capabilities of OWL-DL ontologies, Proc. of 12thInternational Symposium on Database Engineering & Applications (IDEAS), ACM, September 2008, pp. 81–90.

[44] K. Munir, M. Odeh, R. McClatchey, Managing the mappings between domain ontologies and database schemas when formulating relational queries, Proc. ofthe 13th International Symposium on Database Engineering & Applications (IDEAS), ACM, September 2009, pp. 131–141.

[45] S. Murphy, M. Mendis, K. Hackett, R. Kuttan, W. Pan, L. Phillips, V. Gainer, D. Berkowicz, J. Glaser, I. Kohane, et al., Architecture of the open-source clinicalresearch chart from Informatics for Integrating Biology and the Bedside, AMIA Annual Symposium Proc. American Medical Informatics Association,November 2007, pp. 548–552.

[46] V. Deshmukh, S. Meystre, J. Mitchell, Evaluating the informatics for integrating biology and the bedside system for clinical research, BMC Medical ResearchMethodology 9 (2009) 70.

[47] I. Kohane, S. Churchill, S. Murphy, A translational engine at the national scale: informatics for integrating biology and the bedside, Journal of the AmericanMedical Informatics Association 19 (2012) 181–185.

[48] D. Conforti, D. Costanzo, F. Perticone, G. Parati, K. Kawecka-Jaszcz, A. Marsh, C. Biniaris, M. Stratakis, R. Fontanelli, D. Guerri, et al., HEARTFAID: a knowledgebased platform of services for supporting medical-clinical management of heart failure within elderly population, Studies in Health Technology andInformatics 121 (2006) 108–125.

Page 18: ISSEL - Data & Knowledge Engineering · 2016-02-04 · Fig. 1. Outline of ASSIST's architectural components (system modules and configuration files). Semantic modules are indicated

177C. Maramis et al. / Data & Knowledge Engineering 86 (2013) 160–178

[49] M. Prcela, D. Gamberger, A. Jovic, Semantic web ontology utilization for heart failure expert system design, Studies in Health Technology and Informatics 136(2008) 851–856.

[50] S. Ali, S. Kiefer, μOR — a Micro OWL DL reasoner for ambient intelligent devices, Advances in Grid and Pervasive Computing (2009) 305–316.[51] S. Ali, S. Kiefer, Semantic medical devices space: an infrastructure for the interoperability of ambient intelligent medical devices, Proc. of the 6th

International Conference on Information Technology and Applications in Biomedicine (ITAB), IEEE, October 2006.[52] C. Gärtner, H. Becker, C. Carstens, F. von Germar, K. Drese, A. Fragoso, R. Gransee, et al., SmartHEALTH: a microfluidic multisensor platform for POC cancer

diagnostics, Progress in Biomedical Optics and Imaging 10 (2009), (73130B.1-73130B.8).[53] V. Andronikou, E. Karanastasis, E. Chondrogiannis, K. Tserpes, T. Varvarigou, Semantically-enabled intelligent patient recruitment in clinical trials, Proc. of

the 5th International Conference on P2P, Parallel, Grid, Cloud and Internet Computing (3PGCIC), IEEE, November 2010, pp. 326–331.[54] G. Tsatsaronis, K. Mourtzoukos, V. Andronikou, T. Tagaris, I. Varlamis, M. Schroeder, T. Varvarigou, D. Koutsouris, N. Matskanis, PONTE: a context-aware

approach for automated clinical trial protocol design, Proc. of the 6th International Workshop on Personalized Access, Profile Management, and ContextAwareness in Databases (PersDB), Microsoft Research, August 2012.

[55] S. Benkner, J. Bisbal, G. Engelbrecht, R. Hose, Y. Kaniovskyi, M. Koehler, C. Pedrinaci, S. Wood, Towards collaborative data management in the VPH-shareproject, Euro-Par 2011: Parallel Processing Workshops, Springer, Berlin Heidelberg, 2012, pp. 54–63.

[56] M. Koehler, R. Knight, S. Benkner, Y. Kaniovskyi, S. Wood, The VPH-share data management platform: enabling collaborative data management for thevirtual physiological human community, Proc. of the 8th International Conference on Semantics, Knowledge and Grids (SKG), IEEE, October 2012, pp. 80–87.

[57] A. Gupta, C. Condit, X. Qian, BioDB: an ontology-enhanced information system for heterogeneous biological information, Data & Knowledge Engineering 69(2010) 1084–1102.

[58] F. Imam, S. Larson, A. Bandrowski, J. Grethe, A. Gupta, M. Martone, Maturation of neuroscience information framework: an ontology driven informationsystem for neuroscience, Proc. of the 7th International Conference on Formal Ontology in Information Systems (FOIS), IOS Press, July 2012, pp. 15–28.

[59] F. Imam, S. Larson, A. Bandrowski, J. Grethe, A. Gupta, M. Martone, Development and use of ontologies inside the neuroscience information framework: apractical approach, Frontiers in Genetics 3 (2012).

[60] A. Smith, K. Cheung, K. Yip, M. Schultz, M. Gerstein, LinkHub: a Semantic Web system that facilitates cross-database queries and information retrieval inproteomics, BMC Bioinformatics 8 (2007) S5.

[61] B. Gonçalves, G. Guizzardi, J. Pereira Filho, Using an ECG reference ontology for semantic interoperability of ECG data, Journal of Biomedical Informatics 44(2011) 126–136.

[62] T. Groza, J. Hunter, A. Zankl, The Bone Dysplasia Ontology: integrating genotype and phenotype information in the skeletal dysplasia domain, BMCBioinformatics 13 (2012) 50.

[63] L. Schriml, C. Arze, S. Nadendla, Y. Chang, M. Mazaitis, V. Felix, G. Feng, W. Kibbe, Disease Ontology: a backbone for disease semantic integration, NucleicAcids Research 40 (2012) D940–D946.

[64] M. Brochhausen, M. Fransson, M. Eriksson, R. Merino-Martinez, L. Norlin, S. Kjellqvist, M. Anderberg, U. Topaloglu, W. Hogan, J. Litton, Developing anontology for sharing Biobank data based on the BBMRI minimum data set MIABIS, Proc. of the 4th Workshop on Ontologies in Biomedicine and Life Sciences(OBML), September 2012.

hristos Maramis received his Diploma and PhD in Electrical and Computer Engineering from the Aristotle University of Thessaloniki,reece in 2005 and 2011, respectively. He is currently a postdoctoral research associate at the Lab of Medical Informatics, School ofedicine, Aristotle University of Thessaloniki, while in the past he has worked as research associate of the Information Technologiesstitute, Centre for Research and Technologies Hellas. His main research interests lie in the area of computerized molecular diagnosisnd automatic analysis of molecular biology data. In 2006 Dr. Maramis received a three year research scholarship from the Greek Statecholarships Foundation. He is a member of the Technical Chamber of Greece and the IEEE.

CGMInaS

anolis Falelakis received his Diploma in Electrical and Computer Engineering and his Ph.D in Multimedia Analysis from the Aristotleniversity of Thessaloniki, Greece in 2003 and 2010, respectively. He is currently a Research Fellow at the Department of Computingt Goldsmiths, University of London. His research interests include Knowledge Representation and Reasoning in the fields of Video-ediated Communication, Multimedia Retrieval and Medicine as well as (feature- and model-) Matching in Computer Vision. He is aember of the Technical Chamber of Greece and the IEEE.

MUamM

ini Lekka holds an MSc in Electrical Engineering and a BSc in Physics. She is a senior scientist at the Lab of Medical Informatics, at theristotle University of Thessaloniki, Greece, where she is working on the development of personal health applications and services forhronic disease management. Her research interests comprise user-centered design and evaluation of healthcare information systemss well as secondary uses of clinical data. Prior to joining the Aristotle University Irini worked in software development for the medicalaging industry.

IrAcaim

Page 19: ISSEL - Data & Knowledge Engineering · 2016-02-04 · Fig. 1. Outline of ASSIST's architectural components (system modules and configuration files). Semantic modules are indicated

hristos Diou obtained his PhD in Electrical and Computer Engineering from the Aristotle University of Thessaloniki, Greece in 2010.e is currently a postdoctoral research associate in the Information Technologies Institute, Centre for Research and Technologiesellas. His main research interests lie in the areas of weakly supervised machine learning and learning under uncertainty, concept-ased multimedia retrieval as well as automatic image and video annotation. In 2005 Dr. Diou received a three year scholarship frome Greek State Scholarships Foundation and in 2012 he was awarded an Aristotle University of Thessaloniki "Excellence" scholarshipr postdoctoral research. He is a member of the Technical Chamber of Greece and of IEEE.

178 C. Maramis et al. / Data & Knowledge Engineering 86 (2013) 160–178

CHHbthfo

rof. Pericles A. Mitkas received his Diploma of Electrical Engineering from the Aristotle University of Thessaloniki, Greece in 1985nd an MSc and PhD in Computer Engineering from Syracuse University, USA, in 1987 and 1990, respectively. Between 1990 and 2000e was a faculty member with the Department of Electrical and Computer Engineering at Colorado State University in USA. Currently,r. Mitkas is Professor and Head of the Dept. of Electrical and Computer Engineering at Aristotle University of Thessaloniki, Greece,nd Director of the Information Processing Laboratory. Prof. Mitkas teaches Databases, Software Engineering, and Data Mining coursesnd his research interests include databases and knowledge bases, data mining, software agents, enviromatics and bioinformatics. Hisork has been published in over 200 papers, book chapters, and conference publications. He is the co-author of a book on Agenttelligence through Data Mining.

PahDaawIn

nastasios N. Delopoulos received the Diploma in Electrical Engineering and PhD degree from the Department of Electricalngineering, National Technical University of Athens (NTUA), Athens, Greece, in 1987 and in 1993, respectively, and the MSc degreeom the University of Virginia, Charlottesville, in 1990. He is currently an Associate Professor in the Department of Electrical andomputer Engineering, Aristotle University of Thessaloniki, Thessaloniki, Greece. He is the author of more than 75 journal andonference scientific papers. He has participated in 21 European and National Research and Development projects related topplication of signal, image, video, and information processing to entertainment, culture, education, and health sectors. His currentesearch interests include multimedia data understanding, computer vision and biomedical signal processing.

AEfrCcar