
SemanticCT: A Semantically-Enabled System for Clinical Trials

Zhisheng Huang, Annette ten Teije, and Frank van Harmelen

Department of Computer Science, VU University Amsterdam, The Netherlands

{huang,annette,Frank.van.Harmelen}@cs.vu.nl

Abstract. In this paper, we propose an approach to semantically enabled systems for clinical trials. The goals are not only to achieve interoperability through the semantic integration of heterogeneous data in clinical trials, but also to facilitate automatic reasoning and data processing services for decision support systems in various clinical trial settings. We have implemented the proposed approach in a system called SemanticCT. SemanticCT is built on top of LarKC (Large Knowledge Collider), a platform for scalable semantic data processing. SemanticCT has been integrated with large-scale trial data and patient data, and provides various automatic services for clinical trials, including an automatic patient recruitment service (i.e., identifying eligible patients for a trial) and a trial finding service (i.e., finding suitable trials for a patient).

1 Introduction

Clinical trials provide tests which generate safety and efficacy data for health interventions. Clinical trials usually involve large-scale and heterogeneous data. The lack of integration and of semantic interoperability between clinical trial systems and patient data systems, i.e., electronic health records (EHRs) and clinical medical records (CMRs), is the main source of inefficiency in clinical trial systems. Thus, many procedures in clinical trials, such as patient recruitment (i.e., identifying eligible patients for a trial) and trial finding (i.e., finding suitable trials for a patient), are considered laborious.

Enhancing clinical trial systems with semantic technology to achieve semantic interoperability of large-scale and heterogeneous data would improve the performance of clinical trials significantly. Such semantically enabled systems would provide efficient and effective reasoning and data processing services in various clinical trial settings.

In this paper, we propose an approach to semantically enabled systems for clinical trials. The proposed approach has been implemented in a system called SemanticCT¹. The system provides semantic integration of various data in clinical trials. The system is designed to be a semantically enabled decision support system for various scenarios in medical applications. SemanticCT has been semantically integrated with various data, including trial documents with semantically annotated eligibility criteria and a large amount of patient data with structured EHRs and clinical medical records. Well-known medical terminologies and ontologies, such as SNOMED and LOINC, have been used for semantic interoperability.

1 http://wasp.cs.vu.nl/sct

SemanticCT is built on top of LarKC (Large Knowledge Collider), a platform for scalable semantic data processing². With LarKC's built-in reasoning support for large-scale RDF/OWL data, SemanticCT is able to provide various reasoning and data processing services for clinical trials, including faster identification of eligible patients for recruitment and efficient identification of eligible trials for patients.

The contributions of this paper are: (1) a framework that enables semantic technologies for medical tasks in the domain of clinical trials; (2) a proof of concept of the framework, SemanticCT, with a focus on three tasks: (i) semantic search over clinical trials and patient data, (ii) trial finding for patients, and (iii) identifying patients for a trial.

This paper is organized as follows: Section 2 presents the general ideas of semantically enabled systems for clinical trials. Section 3 focuses on three tasks in the clinical trial domain: search over clinical trials and patient data, trial finding for patients, and identifying eligible patients for trials. Section 4 describes a formalization of eligibility criteria of clinical trials. Section 5 presents the architecture of SemanticCT and describes the services and interfaces of the system. Section 6 discusses related work and concludes.

2 Approach

The goal of SemanticCT is to exploit semantic techniques in the domain of medical trials so that tasks such as trial finding and identifying eligible patients for trials can be supported. In this section we describe the semantic data integration and the platform. Note that we use existing semantic technologies, available medical ontologies, data sources, and semantic annotators.

2.1 Semantic Data Integration

Semantic integration of the various data in clinical trials is a basic step in building a semantically enabled system for clinical trials. Existing trial data are usually represented as XML data with standard fields. For example, the clinical trial service of the U.S. National Institutes of Health³ provides XML-encoded trial data with the 20 structured CDISC fields. We can convert those XML data into standard semantic data, such as RDF NTriple data annotated with medical ontologies or terminologies like SNOMED, LOINC, MESH, and others. Those ontologies can be used individually, or as a group together with the ontology alignments provided by the BioPortal ontology service⁴ or other alignment tools. LinkedCT⁵ provides large-scale semantic data of clinical trials in the standard formats of Linked Open Data in the Semantic Web.

2 http://www.larkc.eu
3 http://www.clinicaltrials.gov

The semantic annotations of clinical trials can be obtained by using one of the many semantic annotation tools and systems developed by the Semantic Web community. BioPortal and MetaMap⁶ provide satisfactory services for semantic annotation with biomedical ontologies. Those annotation data are also represented as XML, and it is similarly easy to use XSLT to convert them into RDF NTriples. This means that, for semantic interoperability, we can either exploit the available mappings among ontologies or load our own alignment as RDF NTriples.

Some EHR prototype systems have been developed to support certain kinds of semantics-enriched patient data. Those patient data can be accessed via the servers provided by those systems. However, real patient data are usually protected and not open to public access, for legal and privacy reasons. We have developed a knowledge-based patient data generator which synthesizes the required patient data for testing purposes, using domain knowledge to control the data generation and make the generated data look realistic [6].

In this paper we consider three tasks: search over clinical trials or patient records, trial finding for patients, and identifying eligible patients for trials. For our feasibility tests we use clinical trials of breast cancer, and we integrated the following data in SemanticCT:

Clinical Trials. We obtained the XML-encoded data of 4665 clinical trials of breast cancer from the official NCT website www.clinicaltrials.gov, and used XSLT to convert the XML-encoded data into RDF NTriple data, which consist of 1,200,565 triples and 335,507 entities.
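The conversion itself was done with XSLT. As a rough illustration of the same XML-to-NTriple mapping, the following Python sketch (a swapped-in technique, not the authors' stylesheet) turns a few fields of one trial record into triples. The namespace, the flat XML layout, and the element names are hypothetical; only the property names NCTID, EligibilityGender, and OverallStatus are taken from the queries in this paper.

```python
import xml.etree.ElementTree as ET

SCT = "http://example.org/sct#"  # hypothetical namespace, not the real one
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# Hypothetical mapping from XML element names to sct property names.
FIELDS = {
    "nct_id": "NCTID",
    "gender": "EligibilityGender",
    "overall_status": "OverallStatus",
}

def escape_literal(text):
    """Escape backslashes, quotes, and line breaks for an N-Triples literal."""
    return (text.replace("\\", "\\\\").replace('"', '\\"')
                .replace("\n", "\\n").replace("\r", "\\r"))

def trial_to_ntriples(xml_text):
    """Convert one (simplified) trial record into a list of N-Triples lines."""
    root = ET.fromstring(xml_text)
    subject = f"<{SCT}{root.findtext('nct_id')}>"
    lines = [f"{subject} <{RDF_TYPE}> <{SCT}ClinicalTrial> ."]
    for tag, prop in FIELDS.items():
        value = root.findtext(tag)
        if value is not None:
            lines.append(f'{subject} <{SCT}{prop}> "{escape_literal(value.strip())}" .')
    return lines

# Simplified, made-up record layout for illustration only.
SAMPLE = """<clinical_study>
  <nct_id>NCT00002720</nct_id>
  <gender>Female</gender>
  <overall_status>Recruiting</overall_status>
</clinical_study>"""

for line in trial_to_ntriples(SAMPLE):
    print(line)
```

Each field becomes one triple with the trial as subject, which is why a few thousand trial records expand into over a million triples.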

Medical ontologies. We obtained the latest release of the SNOMED terminologies and converted them into RDF NTriple data. The concepts and definitions of the converted SNOMED consist of 4,048,457 triples, which correspond to 2,046,810 entities.

Semantic annotations of clinical trials. We used the semantic annotation server of BioPortal to obtain the XML-encoded semantic annotations of the 4665 clinical trials with medical terminologies/ontologies such as SNOMED, LOINC, HL7, MESH, RxNorm, and EVS. We converted the semantic annotation data into RDF NTriples. The total data size is about 3.0 GB. For the experiment, we loaded only the semantic annotation data with SNOMED concepts. This part of the data consists of 106,334 triples (454 MB of data).

Patient Data. We used APDG (Advanced Patient Data Generator), a knowledge-based patient data generator, to create data for 10,000 breast cancer patients, covering the main properties of female breast cancer patients, such as demographic data (e.g., gender and age), diagnosis, TNM stage (T for primary tumor, N for regional lymph nodes, and M for distant metastasis), and hormone receptor status, e.g., the status of ER (Estrogen Receptor), PR (Progesterone Receptor), and HER2 (Human Epidermal Growth Factor Receptor 2). We collected domain knowledge from medical literature (such as PubMed) and web pages (such as Wikipedia) and encoded it to control the generation of patient data and make the generated patient data look realistic [6]. The generated patient data set consists of 660,000 triples.

4 http://bioportal.bioontology.org/
5 http://linkedct.org/
6 http://metamap.nlm.nih.gov/
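To make the idea of knowledge-controlled generation concrete, here is a minimal Python sketch of a synthetic patient generator. It is not APDG: the positive rates, the age range, and the age thresholds for menopausal status are placeholder values chosen for illustration, not figures from the paper or from medical literature.

```python
import random

# Placeholder "domain knowledge"; these rates are illustrative assumptions only.
RECEPTOR_POSITIVE_RATE = {"er": 0.7, "pr": 0.6, "her2": 0.2}

def menopausal_status(age):
    """Derive a status from age using made-up thresholds (assumption)."""
    if age < 45:
        return "premeno"
    if age < 55:
        return "perimeno"
    return "postmeno"

def generate_patient(patient_id, rng):
    """Create one synthetic patient as a dictionary of property-value pairs."""
    age = rng.randint(25, 90)
    patient = {
        "id": patient_id,
        "gender": "female",
        "age": age,
        "stage": rng.choice([0, 1, 2, 3, 4]),
        # The dependent property is derived, not sampled, so records stay consistent.
        "menopausal_status": menopausal_status(age),
    }
    for receptor, rate in RECEPTOR_POSITIVE_RATE.items():
        patient[receptor] = "positive" if rng.random() < rate else "negative"
    return patient

def generate_patients(count, seed=0):
    """Generate a reproducible list of synthetic patients."""
    rng = random.Random(seed)
    return [generate_patient(1000001 + i, rng) for i in range(count)]

print(generate_patients(2))
```

The point of the sketch is the division of labour: random sampling supplies variety, while encoded rules (here, the age-to-status rule) keep each record internally plausible.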

Thus, the loaded RDF NTriple data total over 6 million triples. This is sufficient for a demonstration prototype, which runs on an ordinary laptop (dual core and 4 GB memory) with extremely good performance. Most SPARQL queries in SemanticCT finish within one second, so time performance is not a major issue. Our main concern is whether such an approach can be used to support clinical trial tasks by developing a trial finding service and a patient recruitment service.

2.2 Semantic Platform

There are several well-developed triple stores which can serve as a semantic platform for building SPARQL endpoints that offer query services over large-scale semantic data. Well-known triple stores include OWLIM⁷ and Virtuoso⁸. Those triple stores usually support basic RDFS reasoning over semantic data.

LarKC is a platform for scalable semantic data processing. OWLIM is used as the basic data layer of LarKC. LarKC fulfills the needs of sectors that depend on massive heterogeneous information sources, such as telecommunication services, biomedical research, and drug discovery [4]. The platform has a pluggable architecture in which it is possible to exploit techniques and heuristics from diverse areas such as databases, machine learning, cognitive science, the Semantic Web, and others. LarKC provides a number of pluggable components: retrieval, abstraction, selection, reasoning, and deciding. In LarKC, massive, distributed, and necessarily incomplete reasoning is performed over web-scale knowledge sources [10]. One of our clinical trial tasks requires a new reasoning component (see Section 4) which can be plugged into the LarKC platform.

3 Tasks in clinical trial domain

There are a large number of tasks in the domain of clinical trials. In this paper we focus on three of them: search, trial finding for patients, and identifying eligible patients for trials. The main question is whether the approach of semantically enabled systems for clinical trials can support those knowledge-intensive tasks.

3.1 Search

SemanticCT provides various search services over large-scale integrated data: clinical trials, medical ontologies, and patient data (see Section 2.1). The semantic integration is realized through several available medical ontologies and the mappings between those ontologies from BioPortal. We also provide a service for browsing semantically annotated eligibility criteria of trials, and search services for patient data browsing and specific patient finding, for example, showing all triple-negative breast cancer patients. These search facilities are all realized by applying semantic technologies to the domain of clinical trials.

7 http://www.ontotext.com/owlim
8 http://virtuoso.openlinksw.com/

3.2 Trial Finding for Patients

The trial finding service searches for suitable trials for a given patient. That is, the system checks the requirements of clinical trials against the patient data to see whether a trial can be considered a candidate for further deliberation by the patient and the clinician. Some requirements (such as gender and age) are already structured in the original XML data. Others are stated in the eligibility criteria (i.e., inclusion criteria or exclusion criteria), which are expressed in natural language text. There are different approaches to dealing with the information in text. We can use SPARQL queries with regular expressions over the eligibility criteria, SPARQL queries directly over the semantic annotations of the eligibility criteria, or a formalization of the text that yields structured eligibility criteria.

Given the data of a patient, it would seem ideal to check whether all the properties of the patient meet the requirements of a trial. However, we have found that this is not necessary, because checking a few properties is sufficient to eliminate a significant number of candidate trials and leave a small number of trials for further deliberation.

For the experiment, we select just a small set of checking items, consisting of some structured fields, such as demographic data (gender and age), and some unstructured data (i.e., those in the text of the eligibility criteria), such as stage, menopausal status, and hormone receptor status. The latter can be checked by using regular expressions with filters in SPARQL queries. Of course, we are interested in trials which are currently recruiting, rather than those which have been completed. Thus, the initial SPARQL query for finding trials for a female patient aged 40 at stage 2 can be written as follows:

    PREFIX ...
    select distinct ?ctid ?summary ?criteria
    where {
      ?ct rdf:type sct:ClinicalTrial.
      ?ct sct:NCTID ?ctid.
      ?ct sct:EligibilityGender 'Female'.
      ?ct sct:OverallStatus "Recruiting".
      ?ct sct:EligibilityMinAge ?minage.
      ?ct sct:EligibilityMaxAge ?maxage.
      ?ct sct:BriefSummary ?summary.
      ?ct sct:EligibilityCriteriaTextblock ?criteria.
      FILTER(?minage <= '40 Years' && ?maxage >= '40 Years').
      FILTER regex(str(?criteria), 'stage 2').}
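The same filtering logic can be sketched outside SPARQL. The Python example below applies the gender, status, age, and stage checks to in-memory trial records. The record layout and trial IDs are hypothetical. One deliberate difference from the query above: ages such as '40 Years' are parsed into integers before comparison, because comparing the raw strings is lexicographic ('5 Years' sorts after '40 Years'). Treating an unparsable bound as "no limit" and assuming all ages are given in years are further assumptions of this sketch.

```python
import re

def parse_age(text):
    """Return the leading integer of a string like '40 Years', or None."""
    match = re.match(r"\s*(\d+)", text or "")
    return int(match.group(1)) if match else None

def candidate_trials(trials, age, stage_pattern):
    """Return IDs of recruiting, female-eligible trials matching age and stage."""
    found = []
    for trial in trials:
        if trial["gender"] != "Female" or trial["status"] != "Recruiting":
            continue
        min_age = parse_age(trial["min_age"])
        max_age = parse_age(trial["max_age"])
        # Assumption: a missing or unparsable bound means "no limit".
        if min_age is not None and age < min_age:
            continue
        if max_age is not None and age > max_age:
            continue
        if re.search(stage_pattern, trial["criteria"]):
            found.append(trial["nctid"])
    return found

# Made-up records for illustration.
TRIALS = [
    {"nctid": "T1", "gender": "Female", "status": "Recruiting",
     "min_age": "18 Years", "max_age": "70 Years", "criteria": "stage 2 breast cancer"},
    {"nctid": "T2", "gender": "Female", "status": "Recruiting",
     "min_age": "5 Years", "max_age": "N/A", "criteria": "stage 2 or stage 3"},
    {"nctid": "T3", "gender": "Female", "status": "Completed",
     "min_age": "18 Years", "max_age": "70 Years", "criteria": "stage 2"},
    {"nctid": "T4", "gender": "Female", "status": "Recruiting",
     "min_age": "50 Years", "max_age": "80 Years", "criteria": "stage 2"},
]

print(candidate_trials(TRIALS, 40, r"stage 2"))
```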


In the query above, the regex 'stage 2' is used to match the stage in the eligibility criteria. This kind of text matching is not sufficient to find all the targeted information. We can extend the regular expressions to cover more of the expressions used for the stage in natural language text, but clearly we cannot exhaust them all. Furthermore, the query cannot distinguish between text that appears in the inclusion criteria and text that appears in the exclusion criteria, unless we introduce more complex regular expressions which can detect the beginning and the end of those criteria.
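The inclusion/exclusion problem can be illustrated with a small Python sketch that splits a criteria text at an "Exclusion Criteria" heading before matching. This assumes the heading is present and spelled that way, which real trial texts do not guarantee; the function names and sample texts are hypothetical.

```python
import re

def split_criteria(text):
    """Split criteria text into (inclusion, exclusion) at an 'Exclusion Criteria' heading."""
    match = re.search(r"exclusion criteria\s*:?", text, flags=re.IGNORECASE)
    if match is None:
        # No heading found: treat everything as inclusion text (assumption).
        return text, ""
    return text[:match.start()], text[match.end():]

def mentioned_in_inclusion(text, pattern):
    """Check whether the pattern occurs in the inclusion part only."""
    inclusion, _exclusion = split_criteria(text)
    return re.search(pattern, inclusion, flags=re.IGNORECASE) is not None

TEXT_A = "Inclusion Criteria: stage 2 breast cancer. Exclusion Criteria: prior chemotherapy."
TEXT_B = "Inclusion Criteria: invasive breast cancer. Exclusion Criteria: stage 2 disease."

print(mentioned_in_inclusion(TEXT_A, r"stage 2"))
print(mentioned_in_inclusion(TEXT_B, r"stage 2"))
```

A plain regex over the whole text would accept both sample texts, although the second one names stage 2 only as a reason for exclusion.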

We then add checks on more patient properties, such as menopausal status and hormone receptor status. This further reduces the number of candidate trials, which is very useful for clinicians. Table 1 summarizes the results of trial finding with those selected properties for 11 randomly selected tests. Each test represents a type of patient with the corresponding properties. The table shows that checking just a few properties eliminates a significant number of candidate trials and leaves only a few trials for further decision. The maximal number of candidate trials is 28 and the minimal number is 3. We have also found that some checks by regular expressions cannot deal with negation correctly, in particular when it appears in the exclusion criteria. For example, when checking 'hormone receptor status', four trials were mistakenly identified.

Patient ID Age    Stage  Found   Menopausal  RT    HR Status         RT   FF      EF    Precision
                         Trials  Status                                   Trials        (%)
1000001    40     0      19      premeno     1     ER+, PR-, HER2+   2    16      1     93.75
1000302    67     2      16      postmeno    0     ER-, PR-, HER2+   2    14      0     100
1001422    61     1      11      perimeno    0     ER+, PR+, HER2-   0    11      0     100
1001548    64     1      11      postmeno    0     ER+, PR-, HER2-   0    11      0     100
1002017    52     2      18      perimeno    2     ER+, PR-, HER2+   1    15      0     100
1003862    69     0      32      postmeno    0     ER-, PR+, HER2+   4    28      0     100
1004121    42     1      17      perimeno    3     ER-, PR+, HER2-   4    10      0     100
1005035    41     0      19      premeno     1     ER-, PR+, HER2+   2    16      1     93.75
1006125    47     0      19      perimeno    1     ER-, PR-, HER2+   2    16      1     93.75
1007321    75     3      26      postmeno    0     ER-, PR-, HER2-   23   3       0     100
1009934    64     3      27      postmeno    0     ER+, PR-, HER2-   4    23      1     95.65
Average    56.55         19.55               0.73                    4.4  14.82   0.36  97.90

Table 1. Trial Finding for Patient by SPARQL Queries with Regular Expressions. RT: Reduced Trials, HR: Hormone Receptor, FF Trials: Finally Found Trials, EF: Error Found
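The precision column of Table 1 appears consistent with the formula (FF − EF) / FF, i.e., the share of finally found trials that were not identified in error. This is an inference from the table values, not a definition stated in the text. The short Python sketch below recomputes the column under that assumption, using the FF and EF values from the table.

```python
# (patient ID, finally found trials, errors found), copied from Table 1.
ROWS = [
    ("1000001", 16, 1), ("1000302", 14, 0), ("1001422", 11, 0),
    ("1001548", 11, 0), ("1002017", 15, 0), ("1003862", 28, 0),
    ("1004121", 10, 0), ("1005035", 16, 1), ("1006125", 16, 1),
    ("1007321", 3, 0), ("1009934", 23, 1),
]

def precision(found, errors):
    """Percentage of finally found trials that are correct (assumed definition)."""
    return 100.0 * (found - errors) / found

per_patient = [precision(found, errors) for _pid, found, errors in ROWS]
average = sum(per_patient) / len(per_patient)

for (pid, _found, _errors), value in zip(ROWS, per_patient):
    print(pid, round(value, 2))
print("Average", round(average, 2))
```

Under this reading, the four errors reported in the text correspond to the four rows with EF = 1.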

This feasibility test shows that SPARQL queries with regular expressions are a useful and promising way to select trials for a specific patient.


3.3 Identifying Eligible Patients for Trials

Another task is to provide a faster identification service of eligible patients for clinical trials. This requires the formalization of eligibility criteria, so that patient data can be matched against the formalized criteria automatically. In [5] we propose a rule-based formalization of eligibility criteria, which is briefly discussed in Section 4.

We picked 10 clinical trials at random and formalized their eligibility criteria using the rule-based formalization. We then tested the system on automatically identifying eligible patients for those selected trials. The system finds between 241 and 750 eligible patients out of the 10,000 patients for each trial in less than five seconds, running on an ordinary laptop (dual core and 4 GB memory) [5]. This formalization is also useful for the trial finding service, because it provides exact matching on the data without relying on exhaustive regular expression patterns. This feasibility test shows that rule-based formalization of eligibility criteria for identifying eligible patients for trials is feasible in an effective and efficient way. Clearly, the next step is to set up an experiment with real patient data and to validate the results with a clinician.

4 Rule-based Reasoning

For reasoning over the various semantic data for clinical trials, SPARQL queries are not always powerful and flexible enough to specify the complex requirements of eligibility criteria. In the experiments with automatic identification of eligible patients, we observed that SPARQL queries with regular expressions are not always sufficient for checking eligibility criteria.

For example, to check whether the eligibility criteria require a patient with stage 2 breast cancer, we have to use a regular expression that covers the various expressions for the stage in natural language text, like this:

    FILTER regex(str(?criteria),
      'stage 2|stage II |stage 0, 1, 2|stage I, II|stage IIa|stage IIb')

As discussed before, we clearly cannot exhaust all the expressions used for the stage in natural language text. As a result, some eligibility criteria would be uncheckable at run time (i.e., at query time). We have developed a rule-based formalization of eligibility criteria for clinical trials [5], so that eligibility criteria in natural language text can be processed offline, i.e., when their formalizations are generated.
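One way to picture such offline processing is to normalize stage mentions once, ahead of query time, instead of enumerating alternatives in every query. The Python sketch below extracts the set of stage numbers mentioned in a text, covering Arabic and Roman numerals, sub-stages such as IIa, and simple lists. It is only an illustration with a hypothetical function name; it is not the formalization of [5], and it still misses many phrasings.

```python
import re

# One stage token: Roman or Arabic numeral, optionally followed by a sub-stage letter.
TOKEN = r"(?:iv|iii|ii|i|[0-4])[abc]?"
STAGE_RE = re.compile(
    rf"\bstage\s+({TOKEN}(?:\s*(?:,|/|or|and)\s*{TOKEN})*)\b",
    re.IGNORECASE,
)
VALUE = {"0": 0, "1": 1, "2": 2, "3": 3, "4": 4, "i": 1, "ii": 2, "iii": 3, "iv": 4}

def stages_mentioned(text):
    """Return the set of stage numbers mentioned after the word 'stage'."""
    stages = set()
    for match in STAGE_RE.finditer(text):
        for token in re.findall(TOKEN, match.group(1), flags=re.IGNORECASE):
            base = token.lower().rstrip("abc")  # drop the sub-stage letter
            stages.add(VALUE[base])
    return stages

print(stages_mentioned("Histologically proven stage I, II breast cancer"))
print(stages_mentioned("stage IIa or stage IIb"))
```

The result can be stored as structured data next to the trial, so that a later check reduces to a set-membership test rather than a text match.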

Compared with existing formalizations, the rule-based formalization is more efficient and effective, because of its declarative form, easy maintenance, reusability, and expressivity [5].

Various rule languages can be used for the formalization of eligibility criteria. In artificial intelligence research, logic programming languages, like Prolog, are well-known and popular rule-based languages. Several rule-based languages, like SWRL⁹ and RIF¹⁰, have been proposed as semantics-enabled rule languages. In the biomedical domain, the Arden syntax¹¹ has been developed to formalize rule-like medical knowledge. However, compared with the logic programming language Prolog, SWRL, RIF, and the Arden syntax all have very limited functionality for data processing.

In SemanticCT, the rule-based formalization is based on the logic programming language Prolog. We selected SWI-Prolog¹² as the base language for the rule-based formalization of eligibility criteria because of its Semantic Web support and powerful processing facilities [8,9].

We formalize the knowledge rules for the specification of eligibility criteria of clinical trials at the following levels of knowledge: trial-specific knowledge, domain-specific knowledge, and common knowledge.

Trial-specific Knowledge. Trial-specific knowledge consists of the rules which specify the concrete details of the eligibility criteria of a specific clinical trial. Those criteria differ from one trial to another.

Given a patient ID, we suppose that we can obtain the patient data through the common knowledge of the interface with SPARQL endpoints and its internal data storage. Thus, in order to check whether a patient meets an inclusion criterion, we check whether the patient data meet the criterion.

Furthermore, we do not expect to check all the criteria against the patient data, because some of the required data may be missing from the patient data. We introduce a special predicate getNotYetCheckedItems to collect those criteria which have not yet been formalized for the trial.

For example, the inclusion criteria of the trial NCT00002720 can be formalized as follows:

    meetInclusionCriteria(_PatientID, PatientData, CT,
            NotYetCheckedItems):-
        CT = 'nct00002720',
        breast_cancer_stage(PatientData, '1'),
        invasive_breast_cancer(PatientData),
        er_positive(PatientData),
        known_pr_status(PatientData),
        age_between(PatientData, 65, 80),
        postmenopausal(PatientData),
        getNotYetCheckedItems(CT, NotYetCheckedItems).

This rule states that the inclusion criteria include: i) histologically proven stage I, invasive breast cancer; ii) hormone receptor status: estrogen receptor positive, and progesterone receptor positive or negative; iii) age: 65 to 80; and iv) menopausal status: postmenopausal.
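For readers less familiar with Prolog, the structure of this trial-specific rule can be mirrored in Python as a list of named checks over a patient record. This is only an analogy: the dictionary keys and value encodings are hypothetical, and the sketch additionally reports which criteria failed, which the Prolog rule does not do.

```python
# Each criterion is a (label, test) pair; keys and encodings are assumptions.
CRITERIA_NCT00002720 = [
    ("stage I", lambda p: p.get("stage") == "1"),
    ("invasive breast cancer", lambda p: p.get("invasive") is True),
    ("ER positive", lambda p: p.get("er") == "positive"),
    ("PR status known", lambda p: p.get("pr") in ("positive", "negative")),
    ("age 65 to 80", lambda p: p.get("age") is not None and 65 <= p["age"] <= 80),
    ("postmenopausal", lambda p: p.get("menopausal_status") == "postmenopausal"),
]

def check_inclusion(patient, criteria):
    """Return (eligible, failed_labels) for one patient against one trial."""
    failed = [label for label, test in criteria if not test(patient)]
    return len(failed) == 0, failed

PATIENT = {
    "stage": "1", "invasive": True, "er": "positive", "pr": "negative",
    "age": 70, "menopausal_status": "postmenopausal",
}

print(check_inclusion(PATIENT, CRITERIA_NCT00002720))
print(check_inclusion({**PATIENT, "age": 50}, CRITERIA_NCT00002720))
```

As in the Prolog rule, missing data simply makes a check fail; a fuller treatment would distinguish "not met" from "unknown".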

Domain-specific Knowledge. The trial-specific rules above may involve knowledge which is domain relevant but trial independent, i.e., domain knowledge. We formalize this part of the knowledge in libraries of domain-specific knowledge. For example, for clinical trials of breast cancer, we formalize the knowledge of breast cancer in the breast cancer knowledge base, a domain-specific library of rules.

9 http://www.w3.org/Submission/SWRL/
10 http://www.w3.org/TR/rif-overview/
11 http://www.hl7.org/special/Committees/arden/index.cfm
12 http://www.swi-prolog.org/

An example of this type of knowledge is that a breast cancer patient is triple negative if the patient has estrogen receptor negative, progesterone receptor negative, and HER2 protein negative status. This can be formalized in Prolog as follows:

    triple_negative(Patient):- er_negative(Patient),
        pr_negative(Patient),
        her2_negative(Patient).

We consider patient data as a set of property-value pairs. A general format of patient data, called the PrologCMR format, is designed as a list of property-value pairs. This general format is flexible enough to represent data from different CMR formats, because we can design a CMR-specific interface to obtain the corresponding data from different data servers, which can be a SPARQL endpoint, an internal data storage server, or a database server [5]. We then convert the patient data into the PrologCMR format. We introduce the general predicate getItem(PatientData, Property, Value) to get the value of a property from the patient data.

For example, the receptor status can be formalized straightforwardly as follows:

    er_positive(PatientData):- getItem(PatientData, er, ER),
        ER = 'positive'.
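The property-value view of patient data and the domain rules built on it can be mirrored in a few lines of Python. The sketch below is an analogy to getItem, er_positive, and triple_negative, not the PrologCMR implementation; the property names pr and her2 and the list-of-pairs encoding are assumptions.

```python
def get_item(patient_data, prop):
    """Return the value of the first pair whose property matches, else None."""
    for key, value in patient_data:
        if key == prop:
            return value
    return None

def er_positive(patient_data):
    """Analogue of the er_positive rule above."""
    return get_item(patient_data, "er") == "positive"

def triple_negative(patient_data):
    """Analogue of triple_negative: ER, PR, and HER2 are all negative."""
    return all(get_item(patient_data, receptor) == "negative"
               for receptor in ("er", "pr", "her2"))

# Patient data as a list of property-value pairs (hypothetical encoding).
PATIENT_DATA = [("age", 47), ("er", "negative"), ("pr", "negative"), ("her2", "negative")]

print(er_positive(PATIENT_DATA))
print(triple_negative(PATIENT_DATA))
```

Because every rule reads the data only through get_item, the storage behind the pairs can change without touching the domain rules.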

Common Knowledge. The specification of the eligibility criteria may involve knowledge which is domain independent, such as knowledge for temporal reasoning and knowledge for manipulating semantic data and interacting with data servers, e.g., how to obtain data from SPARQL endpoints. We formalize this knowledge in several rule libraries, which can be reused in different applications.

An example of this type of knowledge is temporal reasoning with constructs like last-month:

    % Any month after January: the previous month is one less.
    lastmonth(LastMonth):- today(Today),
        Today = date(_Year, ThisMonth, _Date),
        ThisMonth > 1,
        LastMonth is ThisMonth - 1.
    % January: wrap around to December.
    lastmonth(LastMonth):- today(Today),
        Today = date(_Year, ThisMonth, _Date),
        ThisMonth =:= 1,
        LastMonth is 12.
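The wrap-around logic of this rule can be checked with a small Python analogue. The first function mirrors the rule and returns only the month number. The second is an added variant, not part of the paper's rule library, which also returns the year, since in January the previous month belongs to the previous year.

```python
import datetime

def last_month(today):
    """Month number of the previous month, wrapping January to December."""
    return 12 if today.month == 1 else today.month - 1

def last_month_with_year(today):
    """Variant (an addition for illustration) that also adjusts the year."""
    if today.month == 1:
        return today.year - 1, 12
    return today.year, today.month - 1

print(last_month(datetime.date(2013, 6, 28)))
print(last_month_with_year(datetime.date(2013, 1, 15)))
```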


Based on SWI-Prolog's Web libraries, we can develop the interface with SPARQL endpoints to obtain semantic data (e.g., semantics-enabled patient data and medical ontologies) for the rule-based formalization of eligibility criteria.

Fig. 1. The architecture of SemanticCT.

This reasoning component is developed as a LarKC component for the task of identifying eligible patients for trials. The rule-based reasoning component is also useful for the trial finding service, because it provides exact matching on the data without relying on exhaustive regular expression patterns. In the future we want to use this component for trial finding for patients. This requires that all eligibility criteria of the trials be modeled in this rule-based approach.

5 System

5.1 Architecture

The architecture of SemanticCT is shown in Figure 1. SemanticCT Management plays a central role in the system. It launches a web server which serves as the application interface of SemanticCT, so that users can use a web browser to access the system locally (i.e., from the localhost) or remotely (i.e., via the Web). SemanticCT Management manages SPARQL endpoints which are built as SemanticCT workflows. A generic reasoning plug-in in LarKC provides the basic reasoning service over large-scale semantic data, like RDF/RDFS/OWL data. SemanticCT Management interacts with the SemanticCT Prolog component, which provides the rule-based reasoning [5,3].


Fig. 2. The GUI of SemanticCT.

LarKC, which consists of the LarKC core for plug-in and workflow management and the LarKC data layer, serves as the infrastructure of SemanticCT for semantic data management. The LarKC data layer manages the semantic data repositories of SemanticCT. Those repositories consist of i) biomedical terminologies or ontologies, such as SNOMED CT, LOINC, MeSH, and RxNorm; ii) semantic data of clinical trials, like those from LinkedCT, or semantic data converted from the original XML-encoded data of clinical trials; iii) semantic annotation data of trials, generated by the biomedical semantic annotation servers; and iv) patient data, which can be semantic data obtained from EHR systems or created by the knowledge-based patient data generator [6]. Those semantic data repositories can be local or distributed.

Fig. 3. The interface of semantic search.


Fig. 4. Semantic Annotation

5.2 Interface and Service

For the demonstration prototype of SemanticCT, we merge the interfaces for the various groups of users into a single one in a Web browser, without considering data protection issues such as access authorization and password checking. A screenshot of the interface of the demonstration prototype is shown in Figure 2. Notice the several tabs that are available for the various services and the different types of users; they are discussed below.

– Semantic search: Figure 3 shows the interface of the semantic search, with a SPARQL example which searches for all recruiting phase 3 trials for female patients aged between 70 and 75. We provide a set of SPARQL query templates, so that users can select a template and change some of its parameters to make their own queries (see SPARQL examples).

– Keyword search: We provide ordinary keyword search over the eligibility criteria or the summaries of clinical trials. The extended keyword search supports complex keyword searches with Boolean operators.

– Eligibility criteria: the eligibility criteria of the trial are shown.

– Annotated criteria: the service for browsing semantically annotated eligibility criteria of trials, see Figure 4.

– For patients: One of the main services in SemanticCT is the trial finding service. Currently, we provide the trial finding service by using SPARQL queries with regular expressions. The interface of patient services is shown in Figure 5. Notice that the SPARQL query is not visible to the user, but sits behind the button "show the CTs for this patient".

– For clinicians: SemanticCT provides several services for clinicians (see Figure 6). Those services include i) patient data browsing, ii) specific patient finding, such as showing all triple-negative breast cancer patients, and iii) patient recruitment for the selected clinical trial. The interface of the clinician service for patient recruitment is shown in Figure 6. Notice that the patient recruitment service is based on the rule-based formalization of the eligibility criteria.

– For researchers: Semantic search for patient recruitment is one of the main services here.

Notice as well that the user can select an ontology from a list. In Figure 2, SNOMED is selected.
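The template mechanism of the semantic search service can be sketched as follows. This is a minimal illustration in Python, not the implementation used in SemanticCT: the `ex:` namespace, all predicate names, and the function `build_trial_query` are hypothetical placeholders, and the actual trial data (e.g., from LinkedCT) uses its own vocabulary.

```python
from string import Template

# Hypothetical vocabulary: the ex: namespace and every predicate below are
# placeholders, not the schema actually used by SemanticCT or LinkedCT.
TRIAL_SEARCH_TEMPLATE = Template("""\
PREFIX ex: <http://example.org/ct#>
SELECT DISTINCT ?trial WHERE {
  ?trial ex:overallStatus "$status" ;
         ex:phase "$phase" ;
         ex:gender ?gender ;
         ex:minimumAge ?minAge ;
         ex:maximumAge ?maxAge .
  FILTER (?gender IN ("$gender", "Both"))
  FILTER (?minAge <= $age_from && ?maxAge >= $age_to)
}
""")


def build_trial_query(status: str, phase: str, gender: str,
                      age_from: int, age_to: int) -> str:
    """Fill the template; reject values that could break out of a literal."""
    for value in (status, phase, gender):
        if '"' in value or "\\" in value:
            raise ValueError(f"unsafe literal: {value!r}")
    if age_from > age_to:
        raise ValueError("age_from must not exceed age_to")
    # The trial's age range is required to cover the whole requested range.
    return TRIAL_SEARCH_TEMPLATE.substitute(
        status=status, phase=phase, gender=gender,
        age_from=int(age_from), age_to=int(age_to))


# The example from the text: recruiting phase 3 trials for female
# patients aged between 70 and 75.
query = build_trial_query("Recruiting", "Phase 3", "Female", 70, 75)
```

Keeping the query structure fixed and exposing only a few parameters lets users who do not know SPARQL adapt a query safely; the resulting string would then be sent to one of the SPARQL endpoints managed by the system.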


Fig. 5. Patient Service view and Trial Finding

Fig. 6. Clinician services view and Rule-based formalization for Eligible Patient Identification

6 Discussion and Conclusion

6.1 Related Work

One of the obstacles to automating a clinical task such as cohort selection for clinical trials is the need to bridge the semantic gap between raw patient data, such as laboratory tests or specific medications, and the way a clinician interprets this data. The authors of [7] presented a feasibility study for an ontology-based approach to matching patient records to clinical trials. This is in line with SemanticCT, which also bridges this semantic gap by exploiting ontologies.

The work in [1] also focuses on enabling semantic interoperability between clinical research and clinical practice. Their approach is service-oriented (SOA) and combined with the exploitation of ontologies, which form an "intelligence" layer for interpreting and analyzing existing data that is dispersed, heterogeneous, and to a great extent publicly available. In [2] the authors present a method, entirely based on standard semantic web technologies and tools, that allows the automatic recruitment of a patient to the available clinical trials. They use a domain-specific ontology to represent data from patients' health records and use SWRL to verify the eligibility of patients for clinical trials. Although we propose an even more expressive language (e.g., with support for temporal reasoning) for modeling the eligibility criteria, their work is in the same spirit as our approach. Furthermore, we use a general framework for specifying the eligibility criteria in three types of knowledge which can be reused.

6.2 Discussion

In this paper, we have presented a semantically-enabled system for clinical trials. We have proposed the architecture of SemanticCT, which has been designed to build on top of LarKC, a platform for scalable semantic data processing. The logic programming language Prolog has been used for a rule-based formalization of eligibility criteria for clinical trials. SemanticCT has been semantically integrated with large-scale and heterogeneous data.

We have conducted several experiments on reasoning and data processing services over SemanticCT. The experiment on the trial finding service shows that SPARQL queries with regular expressions are useful for dealing with information that can be easily obtained by such processing (such as menopausal status and hormone receptor status). The experiment on the rule-based formalization shows that it is an efficient and effective approach for quickly identifying eligible patients. What we have implemented and tested is just a prototype of SemanticCT. Thus, it is only a first step towards developing semantically enabled systems for clinical trials.
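To make concrete what kind of information regular expressions can pick up easily, the following minimal Python sketch extracts menopausal status and hormone receptor status from a criteria text. The patterns and function names are hypothetical illustrations, not the expressions used in the SemanticCT queries, and real eligibility criteria are worded far more variably.

```python
import re
from typing import Dict, List

# Hypothetical patterns for illustration only.
_MENOPAUSAL = re.compile(r"\b(pre|peri|post)[-\s]?menopausal\b", re.IGNORECASE)
_RECEPTOR = re.compile(r"\b(ER|PR)[-\s]?(positive|negative)\b", re.IGNORECASE)


def menopausal_statuses(criteria_text: str) -> List[str]:
    """Return the distinct menopausal statuses mentioned, sorted."""
    return sorted({m.group(1).lower() + "menopausal"
                   for m in _MENOPAUSAL.finditer(criteria_text)})


def hormone_receptor_status(criteria_text: str) -> Dict[str, str]:
    """Map each mentioned receptor (ER, PR) to 'positive' or 'negative'.

    If a receptor is mentioned more than once, the last mention wins.
    """
    return {m.group(1).upper(): m.group(2).lower()
            for m in _RECEPTOR.finditer(criteria_text)}


text = "Postmenopausal women with ER-positive, PR negative breast cancer"
statuses = menopausal_statuses(text)
receptors = hormone_receptor_status(text)
```

Such patterns work well for short, stereotyped phrases, but they need one pattern per wording variant, which is the limitation that the rule-based formalization is meant to avoid.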

6.3 Future work

There are many interesting issues for future work on SemanticCT, including trial finding by rule-based reasoning, more comprehensive workflow processing for decision support procedures, deeper reasoning with biomedical ontologies, and personalized information services for patients, clinicians, and researchers. We are going to provide more extended services for clinicians, which include finding relevant and recent literature, such as that from PubMed, for the selected patient, and showing the prognosis for selected patients. The existing prognosis service in SemanticCT is quite simple, for it shows only the 5-year survival rate, based on the TNM stage of patients. A comprehensive prognosis service would analyze all the relevant patient data to find the most relevant clinical evidence for the prognosis analysis. We will continue the development of SemanticCT and deploy it in real application scenarios.

Acknowledgments This work is partially supported by the European Commission under the 7th framework programme EURECA Project (FP7-ICT-2011-7, Grant 288048).

References

1. V. Andronikou, E. Karanastasis, E. Chondrogiannis, K. Tserpes, and T. A. Varvarigou. Semantically-enabled intelligent patient recruitment in clinical trials. In Proceedings of International Conference on P2P, Parallel, Grid, Cloud and Internet Computing, pages 326–331, 2010.

2. P. Besana, M. Cuggia, O. Zekri, A. Bourde, and A. Burgun. Using semantic web technologies for clinical trial recruitment. In International Semantic Web Conference, pages 34–49, 2010.

3. A. Bucur, A. ten Teije, F. van Harmelen, G. Tagni, H. Kondylakis, J. van Leeuwen, K. D. Schepper, and Z. Huang. Formalization of eligibility conditions of CT and a patient recruitment method, D6.1. Technical report, EURECA Project, 2012.

4. D. Fensel, F. van Harmelen, B. Andersson, P. Brennan, H. Cunningham, E. Della Valle, F. Fischer, Z. Huang, A. Kiryakov, T. Lee, L. School, V. Tresp, S. Wesner, M. Witbrock, and N. Zhong. Towards LarKC: a platform for web-scale reasoning. In Proceedings of the IEEE International Conference on Semantic Computing (ICSC 2008). IEEE Computer Society Press, CA, USA, 2008.

5. Z. Huang, A. ten Teije, and F. van Harmelen. Rule-based formalization of eligibility criteria for clinical trials. In Proceedings of the 14th Conference on Artificial Intelligence in Medicine (AIME 2013), 2013.

6. Z. Huang, F. van Harmelen, A. ten Teije, and K. Dentler. Knowledge-based patient data generation. In R. Lenz, S. Miksch, M. Peleg, M. Reichert, D. Riaño, and A. ten Teije, editors, Process Support and Knowledge Representation in Health Care. Springer LNAI, 2013.

7. C. Patel, J. J. Cimino, J. Dolby, A. Fokoue, A. Kalyanpur, A. Kershenbaum, L. Ma, E. Schonberg, and K. Srinivas. Matching patient records to clinical trials using ontologies. In Proceedings of the International Semantic Web Conference, pages 816–829, 2007.

8. J. Wielemaker, Z. Huang, and L. van der Meij. SWI-Prolog and the web. Journal of Theory and Practice of Logic Programming, (3):363–392, 2008.

9. J. Wielemaker, T. Schrijvers, M. Triska, and T. Lager. SWI-Prolog. Journal of Theory and Practice of Logic Programming, (1-2):67–96, 2012.

10. M. Witbrock, B. Fortuna, L. Bradesko, M. Kerrigan, B. Bishop, F. van Harmelen, A. ten Teije, E. Oren, V. Momtchev, A. Tenschert, A. Cheptsov, S. Roller, and G. Gallizo. D5.3.1 - requirements analysis and report on lessons learned during prototyping. LarKC project deliverable, June 2009.