Associating Clinical Archetypes Through UMLS Metathesaurus Term Clusters



J Med Syst (2012) 36:1249–1258
DOI 10.1007/s10916-010-9586-9

ORIGINAL PAPER

Associating Clinical Archetypes Through UMLS Metathesaurus Term Clusters

Leonardo Lezcano · Salvador Sánchez-Alonso · Miguel-Angel Sicilia

L. Lezcano · S. Sánchez-Alonso · M.-A. Sicilia (✉)
Information Engineering Research Unit, Computer Science Department, University of Alcalá, Alcalá, Spain
e-mail: [email protected]

L. Lezcano
e-mail: [email protected]

S. Sánchez-Alonso
e-mail: [email protected]

Received: 29 May 2010 / Accepted: 24 August 2010 / Published online: 9 September 2010
© Springer Science+Business Media, LLC 2010

Abstract Clinical archetypes are modular definitions of clinical data, expressed using standard or open constraint-based data models such as CEN EN13606 and openEHR. Increasing archetype specification activity raises the need for techniques that associate archetypes, to support better management and user navigation in archetype repositories. This paper reports on a computational technique that generates tentative archetype associations by mapping archetypes through term clusters obtained from the UMLS Metathesaurus. The terms are used to build a bipartite graph model, and graph connectivity measures can then be used to derive associations.

Keywords Clinical archetypes · UMLS · Graphs

Introduction

The archetypes philosophy is a two-level approach to address the lack of interoperability between health information systems. Under the openEHR¹ two-level model, a stable reference information model constitutes the first level of modeling, while formal definitions of clinical content in the form of archetypes constitute the second. Only the first level (the Reference Model) is implemented in software, significantly reducing the dependency of deployed systems and data on variable content definitions. The only other parts of the model universe implemented in software are highly stable languages/models of representation. As a consequence, systems can be far smaller and more maintainable than single-level systems. The openEHR approach, as well as concepts like interoperability, two-level modeling and the formal language for the distributed definition of archetypes (ADL), are explained in the “Background” section.

Such an environment allows archetypes to be developed by disparate groups that work independently and eventually publish their results in archetype repositories. Nevertheless, as this approach becomes widely accepted, the number of available archetypes will certainly become very large and hard to manage. Moreover, while one of the greatest advantages of two-level modeling is that archetype definitions are developed through a decentralized process, that same process is exposed to overlapping content and limits the scope of normalization.

This paper addresses how to provide better integration and management of archetypes by performing their semantic classification and clustering. Such a framework could then support navigation across archetype repositories by providing similarities between definitions. To accomplish those tasks, this research reports on a computational approach that allows ADL definitions to be categorized together with the existing archetypes by means of the Unified Medical Language System.² The UMLS has been used for many different purposes [1–5] and its structure is also described in the Background section.

¹ http://www.openehr.org/

Along with the theoretical issues, this paper describes a first case study built upon 40 archetype samples. The ADL instances were taken from the openEHR repository, which contains well-known clinical statements like the Heart Rate OBSERVATION,³ the Pregnancy EVALUATION and the Transfusion ACTION. The outcome of the case study reveals uncovered areas inside the domain of the input archetypes, as well as saturated fields. The semantic nature of the associations obtained goes beyond the static classification provided by the Reference Model, connecting archetypes that are subsumed by different Reference Model entries. The case study has also detected isolated archetypes such as Intravenous fluid administration.

The rest of this paper is structured as follows. The next section reviews related research, including similarities to and differences from this approach. Section “Background” then gives a background overview, introducing the standards and technologies involved, such as the ADL and the UMLS Knowledge Source Server. Section “Mapping archetype terms to UMLS concepts” describes a method to map archetype local terms onto UMLS term clusters. Section “Analyzing archetypes intersection” describes how archetype cohesion can be measured using graph techniques, recommending in particular the m-slices method. The paper continues with Section “Working the Metathesaurus relations”, where the associations obtained so far are enriched with Metathesaurus Relations. Section “Conclusions and further work” closes the article with conclusions and further work.

Related work

Qamar and Rector [6] discussed the principles and methods that allow for semantic mapping of data in the Archetype model, amongst others, to formal biomedical terminologies. The first step of the current approach is similar to their first stage, as both end by offering mappings between ADL terms and terminology codes on the basis of context and non-context methods, including lexical and semantic techniques. However, they treat those results as candidate mappings that are then processed by filtering mechanisms based on certain description logic axioms, as well as by the intervention of a clinical modeler, in order to ensure the accuracy and correctness of the mapping. The main goal of [6] is thus to assist experts during archetype modeling, that is to say, it is a pre-definition aid. In contrast, our final objective is to find relevant intersections and similarities between already defined archetypes. It is a post-definition approach whose input consists of existing ADL files. The methodology presented below relies on graph theory patterns and network analysis techniques that make it possible to handle large repositories of archetypes rapidly and automatically.

² http://umlsinfo.nlm.nih.gov
³ Elements in the openEHR Reference Model are in Courier font from here on.

The research by Bisbal and Berry [7] is designed to find alignments between archetypes about the same clinical concept, but defined in heterogeneous two-level approaches. Matching archetypes from different sources facilitates the interoperability of the three major players in this domain: CEN 13606, HL7’s CDA RIM and openEHR. Therefore the objective of [7] does not overlap with our goals either. We are instead concerned with the management, categorization and semantic browsing of archetypes within a given repository. Moreover, their methods are also different, as their mapping step relies on the bindings to biomedical terminologies provided by the archetype’s ontology section. Subontologies of the biomedical terminologies are then aligned to complete the match of archetypes.

Related work also includes the master’s thesis [8], which elucidates methods for connecting digital publications in the biomedical literature. It is presented as an automated approach to biological knowledge discovery from PubMed abstracts that involves the systematic application of a storytelling algorithm followed by a series of filtering and compression operations over the mined stories. According to the author, the automatic integration of information across multiple publications is a key task for gaining insight into the functioning of biological systems as a whole.

Background

This section briefly covers the underlying technologies supporting the results described in this paper. Subsection “The openEHR RM, AOM & ADL” describes the basics of the openEHR Reference Model, the Archetype Object Model and the Archetype Definition Language. Subsection “The Unified Medical Language System (UMLS)” then briefly presents the UMLS structure. The section ends by introducing the JUNG framework and the Pajek software.

The openEHR RM, AOM & ADL

A number of approaches to the interoperability of health systems have been proposed in recent years. Among them, the archetype paradigm has brought to the field a new way to define the models of interoperable health records and to normalize the information transfer between heterogeneous healthcare systems, thus improving their interoperability (IOp). Health systems IOp is defined in the SemanticHEALTH report [9] as “the ability, facilitated by ICT applications and systems, to exchange, understand and act on citizens/patients and other health-related information and knowledge among linguistically and culturally disparate health professionals, patients and other actors and organizations within and across health system jurisdictions in a collaborative manner”.

The openEHR Foundation is an international, not-for-profit company and online community supporting the development of specifications and tools for EHR interoperability that follow the archetype methodology. Another initiative supporting archetypes is the European Committee for Standardization (CEN) Technical Committee 251 (CEN/TC 251), which has produced CEN 13606, an electronic health record (EHR) extract standard based on archetypes. In this paper, we consider openEHR instead of CEN 13606, as the former can be considered a superset of the latter [10], thus providing richer built-in semantics.

The contribution of archetypes to this general purpose comes from the paradigm of “two-level modeling” [11], by which an application, once implemented according to a formal language syntax, requires no further substantial adjustments in order to handle new clinical statements. The two levels can be described as follows. As the basic framework sustaining the approach, a generic Reference Model (RM) [12] defines a logical information architecture for the interoperability of Electronic Health Record (EHR) systems. This Reference Model includes a flexible syntax and some generic types of clinical information, such as observations or evaluations. Then, instances or specializations of that Reference Model are devised in the form of constraints expressed through more concrete “archetypes” that serve as a shared language for common and specialized clinical concepts. In other words, the Reference Model encloses stable features, like the set of classes that make up the building blocks of an EHR and the basic syntax of statements, while archetypes allow for sharing a wide variety of combinations of those classes, corresponding to EHR fragments created in specific clinical situations. Observation concepts like “blood pressure”, evaluations like “Airway assessment anaesthesiology”, instructions like “medication order” or actions like “transfusion” are clinical statements that have already been specified as archetypes, so they can be used or refined as reference data structures for the interchange of clinical data. One of the greatest advantages of the philosophy of two-level modeling [13] with archetypes is that it allows archetype expressions to be defined and shared through a decentralized process, that is, a process where large repositories of archetypes are updated and maintained by a variety of cooperating groups of experts working in the same domain.

An ADL⁴ file starts with a header section followed by a definition section and an ontology section. The header section uniquely identifies the archetype and the clinical concept involved. The definition section contains constraints in a tree-like structure created from the reference information model. Finally, the ontology section of the archetype states the codes representing the meanings of nodes and constraints on text or terms, as well as bindings to terminologies such as SNOMED-CT.⁵ However, these bindings are optional and are not available in most archetypes openly published on the Web nowadays. Extracts from a typical ADL file⁶ are shown in Fig. 1.

Archetypes are themselves instances of the openEHR Archetype Object Model (AOM), which specifies the formalisms in which to define them. The AOM is an object model of the semantics of archetypes.⁴ In our approach, the ADL definitions are parsed in order to gather the archetypes’ local terms (of the form [atXXXX]). This functionality is provided by the openEHR Java reference implementation project [14].

    archetype (adl_version=1.4)
        openEHR-EHR-OBSERVATION.heart_rate.v1
    concept
        [at0000] -- Heart rate
    ...
    definition
        OBSERVATION[at0000] matches { -- Heart rate
            data matches {
                ...
                ELEMENT[at0004] occurrences matches {0..1} matches { -- Rate
                    value matches {
                        C_DV_QUANTITY <
                            property = <[openehr::382]>
                            list = <
                                ["1"] = <
                                    units = <"/min">
                                    magnitude = <|>=0.0|>
                                    precision = <|0|>
                                >
                ...
                ELEMENT[at0005] occurrences matches {0..1} matches { -- Rhythm
                    value matches {
                        DV_CODED_TEXT matches {
                            defining_code matches {
                                [local::at0006, -- Regular
                                at0007, -- Irregular
                                at0008] -- Irregularly irregular
                            }
                ...
    ontology
        ...
        term_definitions = <
            ["en"] = <
                items = <
                    ["at0000"] = <
                        description = <"The rate the heart is beating - either mechanically or electrically">
                        text = <"Heart rate">
                    >
                    ...

Fig. 1 A fragment from the Heart rate ADL source code

⁴ http://www.openehr.org/releases/1.0.2/
⁵ http://www.ihtsdo.org/snomed-ct/
⁶ Full ADL files for all the archetypes cited in this paper are available at http://openehr.org/knowledge/.

The Unified Medical Language System (UMLS)

The UMLS is based on three knowledge sources, which are distributed with several tools that facilitate their use [15, 16]:⁷

• Metathesaurus: The Metathesaurus is a large, multi-purpose, and multi-lingual vocabulary database that is organized by concepts. The current release comprises more than 1.5 million biomedical terms from over 100 sources. Synonymous terms are clustered together to form a concept; for example, the Heart Rate and Cardiac Rate statements belong to the same UMLS concept, whose concept unique identifier (CUI) is C0018810. Concepts are linked to other concepts by means of various types of relationships, resulting in a rich graph. Inter-concept relationships are either inherited from the structure of the source vocabularies or generated specifically by the editors of the Metathesaurus, giving it additional semantic structure.

• Semantic Network: The Semantic Network provides a consistent categorization of all concepts represented in the Metathesaurus, as well as information about the set of basic Semantic Types, or categories, which may be assigned to those concepts. The Network contains 133 Semantic Types and 54 relationships. There are major groupings of Semantic Types for organisms, anatomical structures, biologic function, chemicals, events, physical objects, and concepts or ideas. The current scope of the UMLS Semantic Types is quite broad, allowing for the semantic categorization of a wide range of terminology in multiple domains.

• SPECIALIST Lexicon and Lexical Tools: The SPECIALIST Lexicon is a general English lexicon that includes many biomedical terms, and the Lexical Tools are designed to address the high degree of variability in natural language. Words often have several inflected forms which would properly be considered instances of the same word.

• UMLS Knowledge Source Server (UMLSKS): The UMLSKS⁸ [17], developed at the U.S. National Library of Medicine (NLM), is the set of machines, programs and Application Programmer Interfaces (APIs), written in Java, located and maintained by staff at the NLM, that gives access to the UMLSKS services. The UMLSKS provides two mechanisms for external entities to interface with it. The first is through a web server running on a local machine at the NLM. The software developed for this work uses the second mechanism, an API that connects the application to the UMLSKS. Almost all requests to the UMLSKS are sent via the public class UMLSKSServiceClient, which implements several search methods to find concepts and relations.

⁷ Further details can be consulted in the UMLS Reference Manual, http://www.nlm.nih.gov/research/umls/documentation.html.

The UMLSKS normalized search

The ADL-to-UMLS mapping step, described in Section “Mapping archetype terms to UMLS concepts” as the first stage of our approach, is based on the Normalized Search service provided by the UMLSKS. Since archetypes are defined through a decentralized process, this search tool is very useful for the automatic mapping and alignment of archetypes because it standardizes the terms at two levels:

1. First, the Normalized Search applies a morphological normalization to the input that includes genitive removal and uninflection, among other modifications. This functionality is provided by the SPECIALIST Lexical Tools [18–20].

2. Second, synonymous terms are replaced by a unique concept, defined in the UMLS Metathesaurus as the representative of the synonym cluster. This merging step allows archetypes from different sources to be interpreted semantically, regardless of the terminology or language they are based on. As important as synonym detection is the treatment of the common case where archetype local terms are not specific enough to identify a single UMLS concept. The Normalized Search deals with such ambiguity by returning all the CUIs that the input may refer to. For example, the search for “blood pressure” returns four possible mappings: Blood Pressure [C0005823], Blood Pressure Determination [C0005824], Blood Pressure Finding [C1271104] and Systemic Arterial Pressure [C1272641].

⁸ http://umlsks.nlm.nih.gov/
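The two normalization levels can be illustrated with a minimal Python sketch. It does not call the UMLSKS API: `TOY_INDEX`, `normalize` and `normalized_search` are hypothetical names, the normalization is far cruder than the SPECIALIST Lexical Tools, and the lookup table holds only the CUIs quoted in the text.

```python
import re

# Toy stand-in for the Metathesaurus: normalized string -> candidate CUIs.
# The CUIs are the ones quoted in the text; the table itself is illustrative.
TOY_INDEX = {
    "heart rate": ["C0018810"],
    "cardiac rate": ["C0018810"],
    "blood pressure": ["C0005823", "C0005824", "C1271104", "C1272641"],
}

def normalize(term):
    """Crude morphological normalization (assumption, not the SPECIALIST tools)."""
    term = term.lower()
    term = re.sub(r"'s\b", "", term)           # genitive removal
    term = re.sub(r"[^a-z0-9 ]+", " ", term)   # strip punctuation
    return " ".join(term.split())              # collapse whitespace

def normalized_search(term, index=TOY_INDEX):
    """Return every candidate CUI for a term, or an empty list if unmapped."""
    return list(index.get(normalize(term), []))
```

Synonyms such as "Heart Rate" and "Cardiac Rate" resolve to the same CUI, while an ambiguous term such as "blood pressure" keeps all four candidates for later filtering.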

JUNG and Pajek graphs

Mappings between archetype terms and UMLS concepts are represented in this paper by means of a bipartite graph⁹ that has been built, edited and drawn with the JUNG framework¹⁰ [22]. The Java Universal Network/Graph Framework (JUNG) is an open source graph modeling and visualization framework written in Java. The framework implements a number of layout algorithms, like the one depicted in Fig. 2 of Section “Analyzing archetypes intersection”, as well as analysis algorithms such as graph clustering and metrics for node centrality.

At a later step, the Pajek software¹¹ is used to transform the bipartite graph into a one-mode network and to generate Figs. 3 and 4. Pajek [21, 24], which means spider in Slovenian, is a program for the analysis and visualization of large networks that is free for noncommercial use.

Mapping archetype terms to UMLS concepts

Our approach starts by mapping the archetypes’ local terms to UMLS concepts by means of the UMLSKS Normalized Search detailed in Subsection “The UMLSKS normalized search”. By local terms, we refer to every concept in the ADL file that has been assigned an atXXXX code. Taking the archetype fragment in Fig. 1 as an example, the following terms are searched in the UMLSKS: Heart Rate, Rate, Rhythm, Regular, Irregular and Irregularly irregular. Note that, as explained below, some of these terms are filtered out beforehand in order to improve performance.
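As a rough illustration of gathering local terms, the sketch below pulls atXXXX codes and their labels out of the `-- label` comments of an ADL fragment with a regular expression. This is an assumption made for brevity: the paper parses ADL with the openEHR Java reference implementation, and the names `AT_LABEL`, `local_terms` and `FRAGMENT` are hypothetical.

```python
import re

# An at-code followed, later on the same line, by an ADL "-- label" comment.
AT_LABEL = re.compile(r"(at\d{4})\b[^\n]*?--\s*(.+)")

def local_terms(adl_text):
    """Map each atXXXX code to the first label found on the same line."""
    terms = {}
    for line in adl_text.splitlines():
        match = AT_LABEL.search(line)
        if match:
            terms.setdefault(match.group(1), match.group(2).strip())
    return terms

# Lines taken from the Heart rate fragment in Fig. 1.
FRAGMENT = """\
OBSERVATION[at0000] matches { -- Heart rate
ELEMENT[at0004] occurrences matches {0..1} matches { -- Rate
ELEMENT[at0005] occurrences matches {0..1} matches { -- Rhythm
[local::at0006, -- Regular
at0007, -- Irregular
at0008] -- Irregularly irregular
"""

print(sorted(local_terms(FRAGMENT).values()))
```

A production parser would read the `term_definitions` block of the ontology section instead of relying on comments, which are not guaranteed to be present.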

A fully automatic approach that rapidly handles large numbers of archetypes requires an automatic filtering method that helps to avoid ambiguity and minimizes incorrect behavior of the intersection process. Several implementations of this procedure have been tested, leading to the empirical conclusion that terms whose name matches a token from the ADL syntax (e.g. History and Events) are meaningless outside the Archetype Model and should therefore be filtered out of the input to the mapping process.

⁹ A bipartite graph, bigraph or two-mode network is a network where vertices are divided into two sets and vertices can only be related to vertices in the other set [21].
¹⁰ http://jung.sourceforge.net/
¹¹ http://pajek.imfm.si/doku.php

Furthermore, a context method that reduces ambiguity is to concatenate each local term with upper-level terms from the archetype hierarchy in order to generate more specific statements. For instance, the Blood Pressure archetype defines the systolic ELEMENT which, if taken as is, maps to Systole [C0039155]. According to the archetype semantics, that mapping is less precise than Systolic Pressure [C0871470], which is returned if the ELEMENT is first concatenated to the upper level (the OBSERVATION node) so that the input string is “Blood Pressure systolic”. These two steps make up the first stage of the filtering method, which is completed later using the search results (see Algorithm 1). Less frequent causes of failure to map archetype terms include unusually qualified terms and local abbreviations (e.g. Total Triiodothyronine (Total T3)), as well as word concatenations like DateTime instead of Date Time.
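The two pre-mapping steps just described can be sketched as follows. This is not Algorithm 1 itself: the token set contains only the two examples named in the text, and the function names are hypothetical.

```python
# Assumed token list; the text names only History and Events as examples.
ADL_TOKENS = {"history", "events"}

def pre_map_filter(terms):
    """Drop local terms whose name equals a token from the ADL syntax."""
    return [t for t in terms if t.strip().lower() not in ADL_TOKENS]

def with_context(parent, term):
    """Prefix a local term with its upper-level term to get a more specific query."""
    return f"{parent} {term}".strip()

queries = [with_context("Blood Pressure", t)
           for t in pre_map_filter(["History", "systolic", "Events"])]
print(queries)  # ['Blood Pressure systolic']
```

Only the string “Blood Pressure systolic” would then be sent to the Normalized Search, which is the query the text reports as mapping to Systolic Pressure.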

The overall algorithm, presented in Algorithm 2, continues with the construction of a bipartite graph that stores the connections between the archetypes and the mappings found. The post-map filtering method is then executed to remove insignificant nodes and edges from the bigraph. Finally, the data is analyzed by applying certain graph theory techniques. Each of these tasks is discussed in the following sections.

Analyzing archetypes intersection

Figure 2 shows a representation of the mappings found as a bipartite graph, where the edges connect members of a set of archetypes (big blue nodes) to members of a set of UMLS concepts (small green nodes). Every time the Normalized Search outputs a CUI for a local term, a new edge between the archetype that owns the term and the CUI node is added to the bigraph. Consequently, the degree of a CUI equals the number of archetypes related to that concept. Once all archetypes have been imported sequentially, we obtain a sparse bigraph that provides not only the conclusions of this section's analysis but also the skeleton for further semantic traversal patterns.
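A minimal sketch of this construction, using plain dictionaries rather than JUNG, is given below. The function names are hypothetical, and apart from C0018810 the CUI labels in the usage example are placeholders, not real identifiers.

```python
def build_bigraph(mappings):
    """mappings: archetype name -> iterable of CUIs found for its local terms.

    Returns two views of the same bipartite graph:
    archetype -> set of CUIs, and CUI -> set of archetypes.
    """
    arch_to_cuis = {arch: set() for arch in mappings}
    cui_to_archs = {}
    for arch, cuis in mappings.items():
        for cui in cuis:
            arch_to_cuis[arch].add(cui)
            cui_to_archs.setdefault(cui, set()).add(arch)
    return arch_to_cuis, cui_to_archs

def cui_degree(cui_to_archs, cui):
    """Degree of a CUI node, i.e. the number of archetypes linked to it."""
    return len(cui_to_archs.get(cui, ()))

# Placeholder CUIs except C0018810 (Heart Rate), which is quoted in the text.
arch_to_cuis, cui_to_archs = build_bigraph({
    "Heart rate": ["C0018810", "CUI_X"],
    "Pulse": ["C0018810"],
    "Body mass index": [],
})
print(cui_degree(cui_to_archs, "C0018810"))  # 2
```

Storing edges as sets means that several local terms of one archetype mapping to the same CUI still count as a single edge, which keeps the degree equal to the number of distinct archetypes.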

Despite the first filtering stage, some terms such as Device or Method, which are very ambiguous as isolated concepts, are likely to survive. They inflate the results with so many relations between archetypes that useful links are obscured. The second stage of the filtering method is designed to remove such noisy CUIs from the graph. Mathematically speaking, this is achieved by removing the CUI nodes whose degree approaches the cardinality of the archetype set, as detailed in Algorithm 3.
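One possible reading of this degree-based filter is sketched below. The text does not state how close to the cardinality of the archetype set a degree must be, so the `ratio` parameter and its default of 0.9 are assumptions, as are the function name and the toy data.

```python
def drop_noisy_cuis(cui_to_archs, n_archetypes, ratio=0.9):
    """Remove CUIs linked to at least ratio * n_archetypes archetypes.

    ratio is an assumed threshold; the source only says the degree is
    "almost reaching" the number of archetypes.
    """
    limit = ratio * n_archetypes
    return {cui: archs for cui, archs in cui_to_archs.items()
            if len(archs) < limit}

toy = {
    "CUI_METHOD": {"A", "B", "C", "D"},   # linked to every archetype: noisy
    "CUI_HEART": {"A", "B"},              # informative link
}
print(sorted(drop_noisy_cuis(toy, n_archetypes=4)))  # ['CUI_HEART']
```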

The graph shown in Fig. 2 was obtained by applying the process explained above to a set of 40 archetypes. The Fruchterman–Reingold algorithm for node layout [23] suggests preliminary results about the cohesion between archetypes.

Fig. 2 Fruchterman–Reingold layout for the bigraph between archetypes and CUIs

As can be seen, assessing the level of closeness between archetypes from Fig. 2 is difficult, especially as the number of analyzed archetypes increases. A helpful technique for the analysis of this kind of data is that of m-slices. An m-slice (also known as an m-core) is a maximal sub-network containing the lines with a multiplicity equal to or higher than m and the vertices incident with these lines [21]. In other words, vertices in an m-slice are connected by lines of multiplicity m or higher to at least one other vertex. Let $G = (V, E)$, where $V$ is the set of vertices and $E$ the set of edges of an undirected graph, and let $v_i, v_j \in V$:

$$\langle v_i, v_j \rangle \in m\_\mathrm{slice} \iff \operatorname{multiplicity}(\langle v_i, v_j \rangle) \geq m$$

Before this technique can be applied, however, the bipartite graph (also called a two-mode network) has to be converted into a one-mode network, that is, a network where each vertex can be related to any other vertex. Two (valued) one-mode networks can be derived from the original bipartite graph: one representing the intersection among archetypes, and another representing common references between UMLS concepts. We focus on the former, where two archetypes connected by an edge of multiplicity 10 share ten CUIs in their definitions once the mapping is complete. When testing with different archetype sets of the same size, the connectivity and density of the resulting graph depend on the volume of knowledge covered: the wider the field covered, the sparser the graph. Returning to our case study, the two-mode network in Fig. 2 is transformed into the one-mode network depicted in Fig. 3.
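The projection onto archetypes can be sketched in a few lines of Python. The paper performs this step with Pajek; the function below is only an illustration of the same idea, with hypothetical names and placeholder CUIs.

```python
from collections import Counter
from itertools import combinations

def project_archetypes(cui_to_archs):
    """Project the bigraph onto archetypes.

    Returns a Counter mapping a sorted pair (a, b) of archetypes to the
    number of CUIs they share, i.e. the multiplicity of the line a-b.
    """
    multiplicity = Counter()
    for archs in cui_to_archs.values():
        for a, b in combinations(sorted(archs), 2):
            multiplicity[(a, b)] += 1
    return multiplicity

# Placeholder CUIs: two shared by A and B, one of them also shared with C.
lines = project_archetypes({"cui1": {"A", "B"}, "cui2": {"A", "B", "C"}})
print(dict(lines))  # {('A', 'B'): 2, ('A', 'C'): 1, ('B', 'C'): 1}
```

Each CUI of degree d contributes d(d-1)/2 increments, which is why very high-degree CUIs left unfiltered would inflate the multiplicities of many lines at once.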

The m-slice model can be used to identify archetypes that are highly related at a given intensity, according to edge multiplicity. High multiplicity values are considered more important because they are less likely to be incidental and, therefore, the information they provide about archetype proximity is more reliable. From this point of view, we are interested in finding connected archetypes inside the top m-slices. Note that the archetypes in the same m-slice may form several connected components instead of one. For instance, the Heart rate archetype and the Thyroid function tests archetype both belong to the 40-slice because their strongest ties are 45 and 43, respectively, which are higher than 40; however, the direct line between them (multiplicity = 3) is far weaker. Thus, once the weakest lines are filtered out, the stronger compartments emerge from the network, revealing that Heart rate, Pulse and Fetal heart rate compose a complete strong triad (a complete subnetwork consisting of three vertices), as do Thyroid function tests, Liver function tests and Lipid studies. Another significant clique, detected when analyzing the 30-slice, is the one formed by Movement, Movement of a joint and Movement of the spine.

Fig. 3 A derived one-mode network that binds archetypes. The edge width is proportional to edge multiplicity
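The following sketch shows how an m-slice and its connected components could be extracted from the valued one-mode network. The function names are hypothetical, and the weights in `EXAMPLE` are illustrative: only the multiplicity 3 between Heart rate and Thyroid function tests is taken from the text, and the other values were chosen so that both triads survive at m = 40.

```python
def m_slice(multiplicity, m):
    """Keep lines with multiplicity >= m; return the slice as an adjacency dict."""
    adj = {}
    for (a, b), weight in multiplicity.items():
        if weight >= m:
            adj.setdefault(a, set()).add(b)
            adj.setdefault(b, set()).add(a)
    return adj

def components(adj):
    """Connected components of an adjacency dict, found by depth-first search."""
    seen, result = set(), []
    for start in sorted(adj):
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:
            v = stack.pop()
            if v not in comp:
                comp.add(v)
                stack.extend(adj[v] - comp)
        seen |= comp
        result.append(comp)
    return result

# Illustrative weights; only the value 3 is reported in the text.
EXAMPLE = {
    ("Fetal heart rate", "Heart rate"): 45,
    ("Heart rate", "Pulse"): 42,
    ("Fetal heart rate", "Pulse"): 41,
    ("Lipid studies", "Thyroid function tests"): 43,
    ("Liver function tests", "Thyroid function tests"): 41,
    ("Lipid studies", "Liver function tests"): 40,
    ("Heart rate", "Thyroid function tests"): 3,
}
print(len(components(m_slice(EXAMPLE, 40))))  # 2
```

At m = 40 the weak line of multiplicity 3 is dropped and the two triads appear as separate components; lowering m to 3 merges them into a single component.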

Working the Metathesaurus relations

In addition to the cliques mentioned above, new archetype cores can be discovered by decreasing the value of m until we reach the m-slice at the bottom of the network. Even though almost all nodes in Fig. 3 can be processed this way, a few completely unconnected archetypes remain that can never be reached. Liver stigmata and Body mass index are examples of such isolated nodes, since their degree is zero. Unfortunately, the fact that an archetype shares no CUI with the others does not guarantee that it has nothing in common with them (e.g., the unconnected Body mass index and Body weight are clearly close concepts).

To overcome this drawback, we recommend taking advantage of the Metathesaurus Relations. Given that each UMLS concept in the initial bipartite graph (Fig. 2) describes a piece of the archetype it is linked to, we may consider that any relationship pointing to one of those CUIs can be traced back to that archetype. Therefore, in order to integrate detached archetypes, we look for relationships between every unshared CUI (i.e. one linked to only one archetype) and the rest of the CUIs. Note that an archetype stays isolated when all of its mapped CUIs are unshared, so they will all be included in the Metathesaurus Relations search. As bipartite graphs do not support bindings between elements in the same set, one of the two CUIs involved in every relationship found has to be replaced by the archetype(s) it is linked to, generating a new bigraph-compatible edge. Algorithm 4 is an improved version of the process that integrates the Metathesaurus relationships step in line 5.

Fig. 4 Improving connectivity with RO Relationships

This raises the question of which type of relationship to look for. Several tests on the case study have shown that the Sibling (SIB) relationship, as well as hierarchical relationships like Broader (RB), Narrower (RN), Parent (PAR) and Child (CHD), are so abundant among the working concepts that including them tends to blur the possible conclusions. On the other hand, the RO relationship,¹² which comprises relationships other than synonymous, narrower or broader, is more appropriate for linking isolated archetypes. Hence, the tested algorithm that adds the RO relationships is listed in Algorithm 5.

Figure 4 illustrates the results after adding the corresponding RO relations to the bipartite graph and rebuilding the one-mode network.

Essential improvements can be recognized by analyzing the new connectivity status. For example, three new bindings were established for the previously isolated Liver stigmata, an archetype designed to record symptoms found in diseases affecting the liver. Significantly, one of these relations points to the Liver function tests that detect such malfunctions. Also note that some links go beyond the static classification of the Reference Model, achieving semantic associations such as the one between the Measurement of chest and expansion and the Examination of the chest, even though the former is an OBSERVATION instance whereas the latter is a CLUSTER.

¹² http://www.nlm.nih.gov/research/umls/knowledge_sources/metathesaurus/release/abbreviations.html

Results

The performance of the mapping algorithm is satisfactory, taking into account that 60% of all processed terms were successfully mapped. Each of these archetype terms was mapped to 2.4 UMLS concepts on average. Moreover, the mean number of UMLS mappings per archetype was found to be 39.3, providing a strong set of semantic bindings for the next stages of this research. Note that the meaningless mappings described in Section “Mapping archetype terms to UMLS concepts” were already discarded when calculating these averages.

[Figure 5 plot. Horizontal axis: m-slices (0, 1, 7, 11, 16, 22, 31, 37, 43, >43). Vertical axis: distribution of archetypes in m-slices (0% to 100%). Series: including Metathesaurus RO Relations; without Metathesaurus RO Relations.]

Fig. 5 Distribution of archetypes according to the intensity of their semantic connections to the rest of the repository. The red areas illustrate the percentage of archetypes that increase their intensity (m-slice) when the Metathesaurus Relationships are added

The connection statistics, generated from Section “Analyzing archetypes intersection” onwards, reveal that 12.5% of the archetypes failed to be semantically connected before considering the Metathesaurus Relationships. Adding the relationships reduced the share of isolated archetypes to only 2.5%. Figure 5 highlights that improvement in red. Note that the Metathesaurus Relationships have a higher impact on lower m-slices, thus reducing the lack of connectivity.
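The kind of statistic reported here can be computed as in the following sketch. It assumes that the intensity of an archetype is the highest line multiplicity it reaches in the one-mode network (0 for an isolated archetype), which is one reading of the m-slice axis of Fig. 5; the function names and the toy mappings are hypothetical and do not reproduce the reported percentages.

```python
from collections import Counter
from itertools import combinations

def intensity(archetype_cuis):
    """Highest number of CUIs each archetype shares with any other one."""
    best = {a: 0 for a in archetype_cuis}
    for a, b in combinations(archetype_cuis, 2):
        shared = len(archetype_cuis[a] & archetype_cuis[b])
        best[a] = max(best[a], shared)
        best[b] = max(best[b], shared)
    return best

def distribution(archetype_cuis):
    """Number of archetypes per intensity level (level 0 = isolated)."""
    return Counter(intensity(archetype_cuis).values())

def isolated_share(archetype_cuis):
    """Fraction of archetypes with no semantic connection at all."""
    levels = intensity(archetype_cuis)
    return sum(1 for m in levels.values() if m == 0) / len(levels)

# Toy mappings before and after adding relation-derived edges.
before = {
    "arch_A": {"CUI_1", "CUI_2"},
    "arch_B": {"CUI_1", "CUI_2", "CUI_3"},
    "arch_C": {"CUI_3"},
    "arch_D": {"CUI_4"},
}
after = dict(before, arch_D={"CUI_4", "CUI_3"})

print(isolated_share(before), isolated_share(after))
print(sorted(distribution(before).items()), sorted(distribution(after).items()))
```

Comparing the two distributions shows the effect described in the text: the added edges move archetypes out of the lowest level, while the higher levels are left unchanged in this toy case.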

Conclusions and further work

The UMLS clusters and semantics have proven to be useful for the categorization and semantic browsing of openEHR archetypes, finding different levels of association between them. Hence, this computational approach represents a management enhancement that allows ADL file repositories to be organized. It helps to understand the archetype domain as a whole because even distant concepts are linked through chains of closely related concepts. Note that the openEHR repository has grown to 273 archetypes, so it is becoming increasingly necessary to offer new mechanisms, like the one presented here, in order to avoid overlapping and to handle this large number of definitions.

Direct applications of our approach include the generation of archetype maps to visually represent the semantic distance between concepts. This provides a means to offer graphical interfaces that facilitate navigation across repositories. Considering that such repositories are continuously growing with the addition of new archetype definitions, a hyperbolic browser [25] can be developed on top of our research in order to handle those large hierarchies. The number of archetypes that can coherently be displayed on a computer screen can dramatically affect the ease of interacting with a large information structure. Hyperbolic techniques propose a smoothly varying focus+context view that always includes several generations of parents, siblings, and children, making it easy for the user to explore the hierarchy without getting lost. It would be difficult to build such functionality without relying on this or a similar approach, because the current repository structure presents archetypes as a flat set of ADL files which are semantically unconnected.

Further work will adjust our approach to cover the new rules for building ADL Node Identifiers (“at” codes), published in the latest ADL revision (ADL 1.5),¹³ which allow specialised versions of nodes to be defined in specialised archetypes.¹⁴

Besides the “at” codes, there is a second kind of code (the “ac” codes) that is used within the ontology section of archetypes (see Section “The openEHR RM, AOM & ADL”) as a placeholder for constraints on the meaning of terms that refer to external sources (e.g. SNOMED-CT). Although they are not yet extensively used in the openEHR repository, these “ac” codes will also be added to the mapping step in order to better specify the semantics of the archetype.

¹³ http://www.openehr.org/svn/specification/TRUNK/publishing/architecture/am/adl1.5.pdf

¹⁴ Specialised archetypes use a differential style of declaration (i.e. the contents of a specialised entity are expressed as differences with respect to the parent). The .adls standard file extension has been introduced for differential ADL files while the .adl files are retained for standalone archetypes.

As different levels of the Reference Model hierarchy lead to different weights in the semantics of the archetype, future improvements of the context methods will include weighted mappings according to the depth of the “at” codes in the archetype hierarchy. However, it is quite common to find broad and unspecific names at high levels (e.g. “any event”), so the above improvement should be complemented by also modifying the weights according to the frequency of the archetype terms. In the current implementation, the terms with a very high frequency in the repository are simply discarded (see the post-map filtering in Section “Analyzing archetypes intersection”).
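One possible shape for such a combined weight is sketched below. The text does not fix a formula, so everything here is an assumption made for illustration: the depth factor 1/(1 + depth), the logarithmic frequency factor (borrowed from the inverse document frequency used in information retrieval), and the function name `term_weight` are hypothetical choices, not the planned implementation.

```python
import math

def term_weight(depth, doc_freq, n_archetypes):
    """Hypothetical weight for the mapping of one archetype term.

    depth: depth of the term's "at" code in the archetype hierarchy (root = 0)
    doc_freq: number of archetypes in which the term occurs (>= 1)
    n_archetypes: size of the repository
    """
    # Assumed: terms closer to the root contribute more to the semantics.
    depth_factor = 1.0 / (1 + depth)
    # Assumed: IDF-style factor, zero for a term found in every archetype.
    rarity_factor = math.log(n_archetypes / doc_freq)
    return depth_factor * rarity_factor

# 273 is the repository size quoted in the text; the other values are toy inputs.
print(term_weight(depth=1, doc_freq=200, n_archetypes=273))  # broad, frequent term
print(term_weight(depth=1, doc_freq=3, n_archetypes=273))    # specific term
```

With a factor of this kind, a broad name at a high level is down-weighted gradually as its frequency grows instead of being discarded outright by a frequency threshold.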

With all these improvements, future research will focus on widely testing the developed application by categorizing all the archetypes currently available in open repositories. This will provide a means to create a semantic navigation interface accessible to clinical modelers and end users.

Acknowledgements This work has been supported by the project “Historia Clínica Inteligente para la seguridad del Paciente/Intelligent Clinical Records for Patient Safety” (CISEP), code FIT-350301-2007-18, funded by the Spanish Ministry of Science and Technology.

References

1. Bertaud, V., Lasbleiz, J., Mougin, F., Burgun, A., and Duvauferrier, R., A unified representation of findings in clinical radiology using the UMLS and DICOM. Int. J. Med. Inform. 77(9):621–629, 2008.

2. Suebnukarn, S., Haddawy, P., and Rhienmora, P., A collaborative medical case authoring environment based on the UMLS. Journal of Biomedical Informatics 41(2):318–326, 2008.

3. Marquet, G., Mosser, J., and Burgun, A., A method exploiting syntactic patterns and the UMLS semantics for aligning biomedical ontologies: The case of OBO disease ontologies. Int. J. Med. Inform. 76(3):S353–S361, 2007.

4. Woods, J. W., Sneiderman, C. A., Hameed, K., Ackerman, M. J., and Hatton, C., Using UMLS metathesaurus concepts to describe medical images: Dermatology vocabulary. Comput. Biol. Med. 36(1):89–100, 2006.

5. Brennan, P. F., and Aronson, A. R., Towards linking patients and clinical information: Detecting UMLS concepts in e-mail. Journal of Biomedical Informatics 36(4–5):334–341, 2003.

6. Qamar, R., and Rector, A., Semantic mapping of clinical model data to biomedical terminologies to facilitate data interoperability. Presented at Healthcare Computing 2007 Conference, Harrogate, UK, 2007. Available: http://www.cs.man.ac.uk/~qamarr/papers/HealthcareComputing2007_Qamar.pdf. Accessed 9 August 2010.

7. Bisbal, J., and Berry, D., Archetype alignment: a two-level driven semantic matching approach to interoperability in the clinical domain. Presented at International Conference on Health Informatics (HEALTHINF 2009), Porto, Portugal, 2009.

8. Gresock, J. A., Finding combinatorial connections between concepts in the biomedical literature. M.S. thesis, Department of Computer Science, Faculty of the Virginia Polytechnic Institute and State University, Blacksburg, Virginia, U.S., 2007.

9. Stroetmann, V. N., Kalra, D., Lewalle, P., Rodrigues, J. M., Stroetmann, K. A., Surjan, G., Ustun, B., Virtanen, M., and Zanstra, P. E., Semantic interoperability for better health and safer healthcare. Project report, European Commission. Office for Official Publications of the European Communities: Luxembourg, 2009.

10. Schloeffel, P., Beale, T., Hayworth, G., Heard, S., and Leslie, H., The relationship between CEN 13606, HL7, and openEHR. In: Proceedings of HIC 2006 Bridging the Digital Divide: Clinician, Consumer and Computer. Seattle, Washington, USA, 2006.

11. Michelsen, L., Pedersen, S. S., Tilma, H. B., and Andersen, S. K., Comparing different approaches to two-level modelling of electronic health records. Stud. Health Technol. Inform. 116:113–118, 2005.

12. Beale, T., Heard, S., Kalra, D., and Lloyd, D., The openEHR EHR information model, The openEHR Reference Model, The openEHR Foundation, rev. 5.1.0, 2007.

13. Beale, T., Archetypes, constraint-based domain models for future-proof information systems. In: Proceedings of the OOPSLA 2002 Conference. pp. 16–32. Boston, Seattle, Washington, USA, 2002.

14. Chen, R., and Klein, G., The openEHR Java reference implementation project. Stud. Health Technol. Inform. 129(pt. 1):58–62, 2007.

15. Bodenreider, O., The Unified Medical Language System (UMLS): Integrating biomedical terminology. Nucleic Acids Res. 32(database issue):D267–D270, 2004.

16. Lindberg, D. A., Humphreys, B. L., and McCray, A. T., The unified medical language system. Methods Inf. Med. 32(4):281–291, 1993.

17. Bangalore, A., Thorn, K. E., Tilley, C., and Peters, L., The UMLS knowledge source server: An object model for delivering UMLS data. In: Proceedings of the AMIA 2003 Symposium. pp. 51–55, 2003.

18. McCray, A. T., Srinivasan, S., and Browne, A. C., Lexical methods for managing variation in biomedical terminologies. In: Proceedings of the Annual Symposium on Computer Applications in Medical Care. pp. 235–239, 1994.

19. Browne, A. C., Divita, G., Lu, C., McCreedy, L., and Nace, D., Lexical systems: A report to the Board of Scientific Counselors, Lister Hill National Center for Biomedical Communications, National Library of Medicine. Tech. Rep. LHNCBC-TR-2003-003, 2003.

20. Browne, A. C., McCray, A. T., and Srinivasan, S., The SPECIALIST LEXICON, Lister Hill National Center for Biomedical Communications, National Library of Medicine: Bethesda, MD, 2000.

21. de Nooy, W., Mrvar, A., and Batagelj, V., Exploratory social network analysis with Pajek. New York: Cambridge University Press, 2005.

22. O’Madadhain, J., Fisher, D., White, S., and Boey, Y. B., The JUNG (Java Universal Network/Graph) Framework, University of California, Irvine, CA. Tech. Rep. UCI-ICS 03-17, 2003.

23. Fruchterman, T., and Reingold, E., Graph drawing by force-directed placement. Softw. Pract. Exp. 21(11):1129–1164, 1991.

24. Batagelj, V., and Mrvar, A., Pajek: Analysis and visualization of large networks. In: Graph Drawing Software. pp. 77–103. Springer, 2003.

25. Lamping, J., and Rao, R., The hyperbolic browser: A focus+context technique for visualizing large hierarchies. J. Vis. Lang. Comput. 6(4), 1995.