
    Expert Systems With Applications 44 (2016) 386-399

    Contents lists available at ScienceDirect

    Expert Systems With Applications

    journal homepage: www.elsevier.com/locate/eswa

    Evaluation of semantic similarity metrics applied to the automatic retrieval of medical documents: An UMLS approach

    Israel Alonso, David Contreras

    Department of Telematics and Computer Science, Comillas Pontifical University, C/ Alberto Aguilera, 25, 28015 Madrid, Spain

    Article info

    Keywords:

    Semantic similarity

    Information retrieval

    Electronic Health Record

    UMLS

    Abstract

    One promise of current information retrieval systems is the capability to identify risk groups for certain diseases and pathologies based on the automatic analysis of vast Electronic Medical Record repositories. However, the complexity and degree of specialization of the language used by experts in this context make this task challenging. In this work, we introduce a novel experimental study to evaluate the performance of two semantic similarity metrics (Path and Intrinsic IC-Path, both widely accepted in the literature) in a real-life information retrieval situation. To achieve this goal, and given the lack of methodologies for this context in the literature, we propose a straightforward information retrieval system for the biomedical field based on the UMLS Metathesaurus and on semantic similarity metrics. In contrast with previous studies, which focus on testbeds with limited and controlled sets of concepts, we use a large amount of information (101,712 medical documents extracted from TREC Medical Records Track 2011). Our results show that in real-life cases both metrics display similar performance: Path (F-Measure = 0.430) and Intrinsic IC-Path (F-Measure = 0.427). We therefore suggest that the use of Intrinsic IC-Path is not justified in real scenarios.

    © 2015 Elsevier Ltd. All rights reserved.

    1. Introduction

    The exponential growth, in recent times, of the amount of biomedical information stored on purely electronic supports (Electronic Health Records, or EHR, spring promptly to mind) has turned such records into an element of undeniable relevance to the most diverse fields of scientific research (Hoffman, 2010; Prokosch & Ganslandt, 2009).

    One of these fields is that of Information Retrieval, with its traditional challenge of identifying those records which most efficiently answer a user's immediate needs for information. For this task to be accomplished, it is critical to first recognize patterns in medical histories which would ultimately permit the early detection of epidemic outbreaks, the prevention of disease, or the identification of cohort groups (Roque et al., 2011). The main difficulty in undertaking this task arises from Natural Language Processing, as natural language is not only complex, but also highly context-sensitive. In a broad field such as that of the English language, for instance, it becomes necessary to draw upon resources and ontologies like WordNet to aid representation (Fellbaum, 1998).

    Corresponding author. Tel.: +34 915422800; fax: +34 91 559 65 69.

    E-mail addresses: [email protected] (I. Alonso), [email protected]

    (D. Contreras).

    Unfortunately, these tools are of limited use to more specialized disciplines, such as biomedicine, whose technical jargon is often as complex as it is ambiguous; the parsing of biomedical information calls for very specific terminology (Friedman, Kra, & Rzhetsky, 2002) and, hence, for new search strategies designed from the outset for the particular demands of this branch of science (Alpi, 2005). In such cases, one must resort to specialist resources, such as dictionaries and thesauri like UMLS (McCray et al., 1993), to give a semantic value to relevant information.

    Our present work aims to bridge this gap by helping information retrieval systems based on Electronic Health Records to work according to the semantic content of those records; in a nutshell, to interpret the information needs of any given query and consequently select those medical documents most relevant in terms of semantic proximity. This endeavor is, we believe, much needed for the correct identification of patients in cohort studies, given the complexity, variability, and lack of structure of the information traditionally contained in such records. It requires defining and representing, through biomedical concepts, the information contained in both health records and medical queries, in order to establish the semantic proximity between them. The use, in this fashion, of semantic relationships between said concepts closely emulates the analogous process by which the human mind establishes similarity between two given terms (Miller & Charles, 1991; Rubenstein & Goodenough, 1965). It should be pointed out beforehand that previous works have

    http://dx.doi.org/10.1016/j.eswa.2015.09.028

    0957-4174/© 2015 Elsevier Ltd. All rights reserved.


    shown interest in establishing metrics for determining the degree of semantic similarity between two terms (Collins & Loftus, 1975) in a more general context like the English language, based on the WordNet infrastructure (Meng, Huang, & Gu, 2013). Unfortunately, these approaches do not always yield satisfactory results when applied to the biomedical domain, since WordNet's coverage of this domain is rather limited (Burgun & Bodenreider, 2001). Later works have attempted to solve this by incorporating specific resources and ontologies (MeSH and SNOMED CT) in the study of similarity metrics in the field of biomedicine, always in a theoretical context and a controlled environment (Al-Mubaid & Nguyen, 2006; Batet, Sánchez, & Valls, 2011; Caviedes & Cimino, 2004; Nguyen & Al-Mubaid, 2006; Pedersen, Pakhomov, Patwardhan, & Chute, 2007). These works show that it is necessary to resort to a specialised infrastructure, namely UMLS, if we are to determine the similarity between two concepts in the field of biomedicine with the degree of precision that a human expert would expect to achieve.

    In this work we propose an experimental study to evaluate the performance of two semantic similarity metrics (Path and Intrinsic IC-Path, both widely accepted in the literature) in a real-life information retrieval context. Moreover, to perform this assessment, we deploy a straightforward information retrieval system for the biomedical field based on the UMLS Metathesaurus and on semantic similarity metrics, due to the lack of methodologies for this context in the literature.

    Our paper will be structured as follows:

    In Section 2, we describe the main components and characteristics of UMLS. In Section 3, we offer an outline of the current state of the art, focusing on the tools and strategies used nowadays in the retrieval of biomedical information, as well as the metrics used in calculating the semantic similarity between two concepts in this particular field. Then, in Section 4, we define our proposal, along with the materials used in our work. In Section 5, we study the inner workings of the different sources and relationships contained in UMLS, and how they are reflected in the results obtained by semantic similarity metrics in a purely theoretical context; we later use, as our reference, the two main metrics based on the Intrinsic IC and Path finding approaches for their study and their application to a real-life context. Section 6 describes the procedures involved in our proposal for an ad-hoc and straightforward concept-based medical document retrieval system, and evaluates the efficacy of the two main semantic similarity metrics when applied to a real-life context (reflected in TREC 2011). Section 7 covers the analysis and interpretation of the results obtained. Lastly, Section 8 comments on the conclusions derived from all conducted tests, as well as the contributions obtained from their results and the future lines of research that would give continuity to our work.

    2. UMLS

    UMLS¹ (Unified Medical Language System) is an ongoing project started in 1986 by the National Library of Medicine. It was envisioned as a common environment for the access and treatment of biomedical information (Bodenreider, 2004; Humphreys, Lindberg, Schoolman, & Barnett, 1998; Lindberg, Humphreys, & McCray, 1993). To this end, it structures said information as a series of concepts, with a set of relationships between them. At its core, UMLS is made up of three components, all of which undergo regular updates and revision: a Metathesaurus, a Semantic Network, and a Specialist Lexicon (lexical information and tools for natural language processing). Of these elements, the Metathesaurus and the Semantic Network are of particular interest to our work: the former for the concepts, sources and relationships it contains, and the latter for the semantic types it offers.

    1

    http://www.nlm.nih.gov/research/umls/.

    Table 1
    Representation structure of UMLS concept C0018787.

    CUI       LUI       SUI       AUI        Source  String
    C0018787  L0018787  S0047194  A0066368   MeSH    Heart
    C0018787  L0018787  S0047194  A16757661  NCI     Heart
    C0018787  L0018787  S0047194  A2882201   SNOMED  Heart
    C0018787  L0018787  S0375948  A16766657  NCI     HEART
    C0018787  L0018787  S0419735  A0480532   CSP     heart
    C0018787  L0018787  S0419735  A18628913  CHV     heart
    C0018787  L0248647  S0324326  A12802806  NCI     Cardiac
    C0018787  L0248647  S1344787  A1304355   CSP     Cardiac
    C0018787  L0248647  S1344787  A18647556  CHV     Cardiac

    The Metathesaurus is, in essence, a vast multipurpose and multi-language database covering more than one million concepts, all of them represented under a common framework and drawn from over a hundred different sources. Said sources reflect several distinct perspectives of the biomedical environment, such as scientific information (MeSH-CRISP), clinical terminology (SNOMED-CT), administrative terminology (ICD-9-CM, CPT-4), or data exchange (HL7, LOINC), as well as general or specific thesauri including anatomy (UWDA, NeuroNames), drugs (RxNorm, First Data Bank), medical devices (UMD, SPN), nursing (NIC, NOC, NANDA), oncology (PDG), adverse reactions (COSTART, WHO) or gene products (Gene Ontology, GO), to name a few.

    The data compiled in these various sources is organized in the Metathesaurus following a unique identifier structure, with a hierarchy of four significance levels: Concepts, Terms, Strings, and Atoms. In this order:

    CUI (Concept Unique Identifier): Each concept represents a distinct meaning, which encompasses, within a unique code, all its synonym terms.

    LUI (Lexical Unique Identifier): Identifies each of the known lexical variations or terms for any given concept.

    SUI (String Unique Identifier): Represents each descriptive string associated with a given term. One of them is designated as its name, or preferred term. All predicted variations in the character sequence of the string (upper and lower case, punctuation) are covered by separate identifiers.

    AUI (Atom Unique Identifier): Corresponds to each individual occurrence of a given string in a specific source.
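The four identifier levels can be pictured as a nested mapping. The following is a minimal sketch, assuming the rows of Table 1 as input; the `Atom` type and the `build_hierarchy` helper are illustrative names introduced here and are not part of UMLS or of any UMLS tool.

```python
from collections import defaultdict
from typing import NamedTuple


class Atom(NamedTuple):
    """One row of Table 1: a string as it occurs in one source (illustrative type)."""
    cui: str     # concept identifier
    lui: str     # term (lexical variant) identifier
    sui: str     # string identifier
    aui: str     # atom identifier
    source: str
    string: str


# Rows copied from Table 1 (concept C0018787).
ATOMS = [
    Atom("C0018787", "L0018787", "S0047194", "A0066368", "MeSH", "Heart"),
    Atom("C0018787", "L0018787", "S0047194", "A16757661", "NCI", "Heart"),
    Atom("C0018787", "L0018787", "S0047194", "A2882201", "SNOMED", "Heart"),
    Atom("C0018787", "L0018787", "S0375948", "A16766657", "NCI", "HEART"),
    Atom("C0018787", "L0018787", "S0419735", "A0480532", "CSP", "heart"),
    Atom("C0018787", "L0018787", "S0419735", "A18628913", "CHV", "heart"),
    Atom("C0018787", "L0248647", "S0324326", "A12802806", "NCI", "Cardiac"),
    Atom("C0018787", "L0248647", "S1344787", "A1304355", "CSP", "Cardiac"),
    Atom("C0018787", "L0248647", "S1344787", "A18647556", "CHV", "Cardiac"),
]


def build_hierarchy(atoms):
    """Nest atoms as CUI -> LUI -> SUI -> list of (AUI, source)."""
    tree = defaultdict(lambda: defaultdict(lambda: defaultdict(list)))
    for a in atoms:
        tree[a.cui][a.lui][a.sui].append((a.aui, a.source))
    return tree


tree = build_hierarchy(ATOMS)
concept = tree["C0018787"]
print(len(concept))                     # number of terms (LUIs): 2
print(len(concept["L0018787"]))         # strings (SUIs) under the first term: 3
print(concept["L0018787"]["S0047194"])  # atoms (AUI, source) of the string "Heart"
```

Reading the structure top-down mirrors the order given above: one concept, its lexical variants, the strings of each variant, and finally the per-source atoms.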

    Hence, for instance, the concept C0018787, which represents the muscular organ that keeps blood circulating, groups a number of descriptive strings, a few of which are shown by way of example in Table 1.

    We must keep in mind that a given descriptive string (SUI) may be referenced by one or many concept identifiers (CUIs). For example, the string Heart identifies the preferred term for concept C0018787, but it is also one of the synonym terms for concept C1281570, Entire heart. The descriptive strings for both these concepts, as well as the semantic type they belong to, are as follows:

    CUI: C0018787

    SUI (Preferred term): Heart

    Other SUIs (string terms): Hearts; Cardiac; coronary; cardiac structure; heart structure; structure of heart, unspecified; corazón; estructura cardiaca; Cuore; herzen; Hart; etc.

    Semantic Type: (bpoc) - Body Part, Organ, or Organ Component.

    CUI: C1281570

    SUI (Preferred term): Entire heart

    Other SUIs (string terms): Heart; Entire heart (body structure); corazón; etc.

    Semantic Type: (bpoc) - Body Part, Organ, or Organ Component.


    The Metathesaurus includes different kinds of relationships between concepts, whether they are found in the same source (intra-source) or in two different ones (inter-source). These relationships may be hierarchical or non-hierarchical (Burgun & Bodenreider, 2001). Hierarchical relationships cover either direct relations of the Parent/Child (PAR/CHD) type or indirect ones of the Broader/Narrower (RB/RN) type. In turn, non-hierarchical relationships may belong to the Siblings (SIB), Other (RO), Similar (RL), Source Asserted Synonymy (SY), Possible Synonymy (RQ), Allowed Qualifier (AQ), or Can Be Qualifier (QB) types. One advantage of the UMLS Metathesaurus is that it brings together a wide range of biomedical sources, which can then be used as a whole or independently.

    As for the Semantic Network, it categorizes all concepts in UMLS into 133 semantic types, between which 54 relationships are established. This is done through a tree structure stemming from two main hierarchies: Entity and Event (Bodenreider, 2001; Bodenreider & McCray, 2003; Erdogan, Erdem, & Bodenreider, 2010). Each concept (CUI) in the Metathesaurus belongs to at least one semantic type; the deeper in the structure these types lie (that is, the closer to the tree leaves), the more specific they are. Thus, the concept C0497327, associated with the term Dementia, belongs to the semantic type Mental or Behavioral Dysfunction (mobd), which in turn is encompassed by the semantic group Disorders (DISO), which is finally contained in the hierarchy Event.

    The representation thus achieved of the biomedical knowledge contained in UMLS is the basis for its application in a variety of strategies and tools for natural language processing and for the computation of semantic similarity between concepts.

    3. Related work

    Information Retrieval Systems (IRS) based on natural language processing have been thoroughly studied in the context of biomedical literature and, more recently, in that of clinical documentation. This interdisciplinary research field, dubbed biomedical informatics (Jiang et al., 2013), is among the fastest-growing in recent years.

    3.1. Information Retrieval Systems (IRS) in the biomedical environment

    IRS attempt to solve the information needs that users set out through queries. These queries intrinsically contain their search objectives, whose sensitivity proves to be a crucial factor in the development of search algorithms and the retrieval of their results (Rose & Levinson, 2004). The main difficulty in reaching said results is the lack of precision in the queries themselves, a problem only made worse by the inherent complexity of the language and by the kind of information at hand. To address this problem by completing and improving the information presented in the query, query expansion techniques were developed (Efthimiadis, 1996). These techniques, widely used in IRS to improve performance, focus on the addition of new terms to the original query to narrow down results. In a broad categorization, we could separate query expansion techniques into two general approaches.

    A first group of techniques is based on the analysis of vast collections of documents, grouping them through co-occurrence vectors (Xu, Zhu, Zhang, Hu, & Song, 2006; Zhu, Wu, Carterette, & Liu, 2014) and probabilistic models (Qi & Laquerre, 2012) of the most relevant terms. More recent works employ semantic distribution models on linguistic elements in said collections, in order to automatically extract synonyms and abbreviations (Henriksson, Moen, Skeppstedt, Daudaravicius, & Duneld, 2014; Zeng, Redd, Rindflesch, & Nebeker, 2012). Lastly, other works in this group analyze the application of specific semantic similarity metrics on structures defined beforehand as containing the elements to be evaluated, such as the Vector Space Model (Turney & Pantel, 2010) and the comparison of histogram distance or cross-bin distances (Kurtz, Beaulieu, Napel, & Rubin, 2014).

    A second group instead focuses on the use and analysis of structures based on existing knowledge of the field of biomedicine, such as UMLS. The use of these resources calls for the disambiguation of the original query's terms, so that they point at unique concepts within the ontology (Bhogal, Macfarlane, & Smith, 2007; Voorhees, 1994). In this manner, those concepts which are related to the original search terms are used to expand the query.
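The ontology-based expansion just described can be sketched in a few lines. This is only an illustration under stated assumptions: the two lookup tables are hand-made stand-ins for a concept mapper and for ontology relationships, the link between C0018787 and C1281570 is assumed for the example, and `expand_query` is a hypothetical helper, not the interface of any tool cited here.

```python
# Hypothetical, hand-made data: term -> CUI, and CUI -> related CUIs.
TERM_TO_CUI = {"heart": "C0018787", "dementia": "C0497327"}
RELATED = {"C0018787": ["C1281570"]}  # "Entire heart", assumed related for the example


def expand_query(terms, term_to_cui, related):
    """Map query terms to concepts, then add each concept's related concepts."""
    expanded = []
    for term in terms:
        cui = term_to_cui.get(term.lower())
        if cui is None:
            continue  # unrecognised term; a real system would need disambiguation here
        for candidate in [cui, *related.get(cui, [])]:
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded


print(expand_query(["Heart", "failure"], TERM_TO_CUI, RELATED))
```

The list keeps insertion order, so the concepts named in the query come before the ones added by expansion.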

    One tool that, alongside the UMLS Metathesaurus, allows us to identify the concepts referred to in a given text is Metamap (Aronson, 2001; Aronson & Lang, 2010). This tool provides the foundation for the development of various query expansion techniques (Aronson & Rindflesch, 1997) through the exploitation of the semantic relations contained in UMLS. In this approach, different efforts use defined UMLS structures to develop solutions stemming from, for instance: the representation of texts from clinical documents via semantic graphs based on concepts and relationships (Plaza & Díaz, 2010); query expansion through random walks based on the UMLS structure (Martinez, Otegi, Soroa, & Agirre, 2014); query expansion through the creation of an ontology of the query itself, associated with closely related concepts (Babashzadeh, Huang, & Daoud, 2013); and the use of relationships between concepts to reflect the semantic distance between patients from stored information (Melton et al., 2006).

    As a direct consequence of the need to improve techniques for exploiting the semantics of various biomedical sources, several works have arisen that focus on evaluating the semantic similarity between any given concepts in the field of biomedicine.

    3.2. Semantic similarity

    Over time, a large variety of metrics have been defined, analyzed, and implemented for the computation of semantic similarity between concepts contained in biomedical sources such as SNOMED-CT, MeSH, UMLS, etc. These metrics can be categorized according to two major strategies:

    Path finding: based on the estimation of the semantic similarity between two terms on account of the distance between the links relating them within the ontology.

    Information Content: based on the semantic similarity between two concepts according to the information they contain.

    3.2.1. Path finding similarity measures on taxonomical structure

    These metrics attempt to measure semantic information across concepts, based on the hierarchical relationships defined between them in biomedical sources. The most important among these are explained below.

    The first metric defines the semantic similarity (sim) between two concepts as the shortest path between them (sp), according to their interrelationships. This metric, known as Concept Distance (CDist), is defined by Rada, Mili, Bicknell, and Blettner (1989) as the number of nodes in the shortest path between two concepts, c1 and c2, and is applied with RB/RN relationships on the MeSH vocabulary. Caviedes and Cimino (2004) later evaluated it with PAR/CHD relationships on MeSH, SNMI, and ICD9-CM resources in the field of biomedicine.

    $$ sim_{CDist}(c_1, c_2) = sp(c_1, c_2), \quad \text{where } sp \text{ is the shortest path between } c_1 \text{ and } c_2 \tag{1} $$

    A later variation, called Path Measure or Path Length (Path), was defined by Pedersen et al. (2007) and applied to is-a type relationships in SNOMED-CT. This variation corresponds to the inverse of the distance between two concepts (CDist), hence normalising the similarity result to a value ranging from 0 to 1.

    $$ sim_{Path}(c_1, c_2) = \frac{1}{sp(c_1, c_2)} \tag{2} $$
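As an illustration of Eqs. (1) and (2), the sketch below counts the nodes on the shortest path with a breadth-first search over a toy is-a taxonomy. The taxonomy, the concept names, and the helper functions are assumptions made for the example; they are not UMLS data nor the implementation used by the tools cited in this paper.

```python
from collections import deque

# Toy is-a taxonomy (child -> parents); names are illustrative, not UMLS data.
PARENTS = {
    "organ": ["body_part"],
    "heart": ["organ"],
    "lung": ["organ"],
    "left_ventricle": ["heart"],
}


def neighbours(graph):
    """Build an undirected adjacency map from child -> parents links."""
    adj = {}
    for child, parents in graph.items():
        for parent in parents:
            adj.setdefault(child, set()).add(parent)
            adj.setdefault(parent, set()).add(child)
    return adj


def shortest_path_nodes(adj, c1, c2):
    """Number of nodes on the shortest path between c1 and c2 (1 if c1 == c2)."""
    seen, queue = {c1}, deque([(c1, 1)])
    while queue:
        node, n_nodes = queue.popleft()
        if node == c2:
            return n_nodes
        for nxt in adj.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, n_nodes + 1))
    return None  # the two concepts are not connected


def sim_path(adj, c1, c2):
    """Eq. (2): inverse of the node count of the shortest path."""
    sp = shortest_path_nodes(adj, c1, c2)
    return 0.0 if sp is None else 1.0 / sp


adj = neighbours(PARENTS)
print(shortest_path_nodes(adj, "left_ventricle", "lung"))  # 4 nodes on the path
print(sim_path(adj, "left_ventricle", "lung"))             # 0.25
```

Counting nodes rather than edges means that identical concepts have a path of length 1 and a similarity of 1; returning 0 for unconnected concepts is a convention chosen for this sketch.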

    Later metrics introduce certain characteristics associated with the structure of the taxonomy which had not been explored before, such as its depth or size, or the location of different concepts within it (Fig. 1).


    Fig. 1. Example of hierarchical relationships between concepts in UMLS Metathesaurus. The terms depth, path, and LCS are represented.

    For instance, Leacock and Chodorow (1998) (lch) consider the shortest path (sp) between two concepts (c1, c2), scaling it logarithmically to the total depth of the taxonomy (D). Thus, the deeper the taxonomy (that is, the more complex and thorough), the larger the relative value of semantic proximity between two terms. A proposal for the normalization of this metric to the unit interval can be found in Garla and Brandt (2012).

    $$ sim_{lch}(c_1, c_2) = 1 - \frac{\log(sp)}{\log(2D)} \tag{3} $$
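A small numeric sketch of Eq. (3) follows, reading it as 1 - log(sp) / log(2D) with sp counted in nodes. The function name and the input checks are choices made for this example only.

```python
import math


def sim_lch(sp, depth):
    """Eq. (3), read as 1 - log(sp) / log(2 * D); sp is a node count (sp >= 1)."""
    if sp < 1 or depth < 1:
        raise ValueError("sp and depth must both be at least 1")
    return 1.0 - math.log(sp) / math.log(2 * depth)


# Identical concepts (sp = 1) reach the maximum similarity of 1.
print(sim_lch(1, 5))
# A path as long as twice the taxonomy depth reaches the minimum of 0.
print(sim_lch(10, 5))
# For the same path length, a deeper taxonomy yields a larger similarity.
print(sim_lch(4, 5) < sim_lch(4, 8))
```

The last line illustrates the remark above that, for a fixed path, a deeper taxonomy increases the relative proximity of the two concepts.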

    Other approaches introduce a new element in the hierarchy of both concepts, corresponding to their closest common ancestor. The depth of both concepts is established according to the depth of their Least Common Ancestor (LCA), also known as Least Common Subsumer (LCS) (Fig. 1).

    Wu and Palmer (1994) (wup) apply a measurement of the similarity between two concepts obtained by scaling the depth (depth) of their Least Common Subsumer (LCS) to the depth of each of the two concepts from the root of the taxonomy by way of their LCS. Garla and Brandt (2012) introduce a change that includes the shortest path (sp) in the definition, to avoid the case (c1 = c2), which would result in simwup(c1, c2) [...] > 0 and [...] > 0 are contribution factors of two features and k is a constant.

    3.2.2. Information Content (IC) similarity measures

    The following approaches to calculating semantic similarity are based on Shannon's Information Theory (Shannon, 2001), by which

    2

    http://search.cpan.org/dist/UMLS-Similarity/.


    the similarity between two concepts must be measured according to the amount of information content they provide.

    Information Content (IC) may be obtained from the distribution of a concept within a text corpus alongside a taxonomy (Corpus IC) (Resnik, 1995), or from the structure of a taxonomy alone (Intrinsic IC) (Sánchez, Batet, & Isern, 2011; Seco, Veale, & Hayes, 2004; Zhou, Wang, & Gu, 2008).

    The Corpus IC of any concept c is defined as the negative of the logarithm of the concept's frequency, where this frequency is the probability of said concept occurring in a given corpus C, fq(c, C). The number of times that the child concepts (cs) of c appear within the corpus is also added, so that the more frequently the concept occurs, the lower its information content will be (Resnik, 1995).

    $$ IC_{Corpus}(c) = -\log(fq(c)), \qquad fq(c) = fq(c, C) + \sum_{c_s \in children(c)} fq(c_s) \tag{8} $$
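The cumulative frequency in Eq. (8) can be sketched as a recursion over a toy taxonomy. The taxonomy, the counts, and the normalisation of the frequency by the root's cumulative count (so that it behaves as a probability) are assumptions made for this example; the sketch also assumes a tree, since a concept with several parents would be counted more than once.

```python
import math

# Toy taxonomy (parent -> children) and raw corpus counts; all values are invented.
CHILDREN = {"organ": ["heart", "lung"], "heart": ["left_ventricle"]}
COUNTS = {"organ": 2, "heart": 5, "lung": 2, "left_ventricle": 1}


def cumulative_freq(concept, children, counts):
    """fq(c): the concept's own count plus the cumulative counts of its children."""
    total = counts.get(concept, 0)
    for child in children.get(concept, []):
        total += cumulative_freq(child, children, counts)
    return total


def ic_corpus(concept, root, children, counts):
    """-log of the cumulative frequency, normalised by the root (assumption)."""
    p = cumulative_freq(concept, children, counts) / cumulative_freq(root, children, counts)
    return -math.log(p)


print(cumulative_freq("organ", CHILDREN, COUNTS))  # 10 occurrences in total
for name in ("organ", "heart", "left_ventricle"):
    print(name, ic_corpus(name, "organ", CHILDREN, COUNTS))
```

The general concept at the root carries no information, while the rare, specific concept at the bottom has the highest IC.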

    On the other hand, the Intrinsic IC of a concept c is a proposal defined in various works (Seco et al., 2004; Zhou et al., 2008) and later adapted to the biomedical context (Batet et al., 2011). The Intrinsic IC of a concept c is defined from the ratio of the number of its terminal concepts (leaves(c)) to the number of its associated ancestors (subsumers(c)) (Sánchez & Batet, 2011). This ratio is then normalized to the interval [0, 1] by the total number of leaves in the taxonomy (max_leaves). Thus, the more terminal elements a concept has relative to its number of ancestors, the lower its information content will be.

    $$ IC_{Intrinsic}(c) = -\log\left( \frac{\frac{|leaves(c)|}{|subsumers(c)|} + 1}{max\_leaves + 1} \right) \tag{9} $$
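Eq. (9) needs only the taxonomy itself, as the following sketch on a toy tree shows. Two conventions are assumed here and should be checked against the cited works: leaves(c) contains the leaf concepts strictly below c (so it is empty for a leaf), and subsumers(c) contains c together with its ancestors. The taxonomy and the function names are invented for the example.

```python
import math

# Toy taxonomy (parent -> children); names are illustrative, not UMLS data.
CHILDREN = {
    "body_part": ["organ"],
    "organ": ["heart", "lung"],
    "heart": ["left_ventricle", "right_ventricle"],
}


def leaves(c, children):
    """Leaf concepts strictly below c (empty set if c is itself a leaf)."""
    found = set()
    for child in children.get(c, []):
        below = leaves(child, children)
        found |= below if below else {child}
    return found


def subsumers(c, children):
    """c together with all of its ancestors (assumed convention)."""
    result = {c}
    for parent, kids in children.items():
        if c in kids:
            result |= subsumers(parent, children)
    return result


def ic_intrinsic(c, root, children):
    """Eq. (9): -log((|leaves| / |subsumers| + 1) / (max_leaves + 1))."""
    max_leaves = len(leaves(root, children))
    ratio = len(leaves(c, children)) / len(subsumers(c, children))
    return -math.log((ratio + 1) / (max_leaves + 1))


for name in ("body_part", "heart", "left_ventricle"):
    print(name, ic_intrinsic(name, "body_part", CHILDREN))
```

Under these conventions the root has an IC of 0 and every leaf has the maximum value, log(max_leaves + 1).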

    These Information Content approaches lead to the semantic similarity measures defined by Lin (1998) and Jiang and Conrath (1997). Lin's definition proposes a ratio of the common information content of a given pair of concepts, IC(LCS(c1, c2)), to the information content that describes each concept separately, IC(ci). In this approach, the higher the IC value of the LCS (that is, the more specific it is), the greater the similarity between the concepts thus compared.

    $$ sim_{Lin}(c_1, c_2) = \frac{2 \cdot IC(LCS(c_1, c_2))}{IC(c_1) + IC(c_2)} \tag{10} $$

    Jiang and Conrath, in turn, propose an analogous measure (opposite to similarity) based on the distance between two concepts, which is evaluated as the difference between the information content of the two concepts, IC(ci), and that of their common ancestor, IC(LCS(c1, c2)).

    $$ Dist_{JC}(c_1, c_2) = IC(c_1) + IC(c_2) - 2 \cdot IC(LCS(c_1, c_2)) \tag{11} $$
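A short worked example of Eqs. (10) and (11), taking the three IC values as already computed inputs. The numbers are invented, and returning 0 when both concepts have zero IC is a convention chosen for this sketch rather than something stated in the cited works.

```python
def sim_lin(ic1, ic2, ic_lcs):
    """Eq. (10): shared information content relative to the two concepts' own IC."""
    denom = ic1 + ic2
    # Both concepts carry no information (e.g. both are the root): sketch convention.
    return 0.0 if denom == 0 else 2.0 * ic_lcs / denom


def dist_jc(ic1, ic2, ic_lcs):
    """Eq. (11): information not shared through the least common subsumer."""
    return ic1 + ic2 - 2.0 * ic_lcs


# Invented IC values for two concepts and their least common subsumer.
ic_c1, ic_c2, ic_common = 1.2, 1.0, 0.8
print(sim_lin(ic_c1, ic_c2, ic_common))  # about 0.727
print(dist_jc(ic_c1, ic_c2, ic_common))  # about 0.6
```

When the two concepts coincide, the LCS is the concept itself, so the Lin similarity is 1 and the Jiang and Conrath distance is 0.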

    Metrics based on the Path finding approach have been redefined (Batet et al., 2011) in terms of Information Content, and implemented for their evaluation (Garla & Brandt, 2012). To this end, the shortest path (sp) between two concepts is redefined as the semantic distance (as proposed by Jiang and Conrath), and the maximum depth as the maximum Information Content of any concept (icmax).

    All told, the metric (lch) based on Intrinsic IC (Intrinsic IC-lch) is redefined as:

    $$ sim_{Intrinsic\ IC\text{-}lch}(c_1, c_2) = 1 - \frac{\log(Dist_{JC}(c_1, c_2) + 1)}{\log(2 \cdot ic_{max} + 1)} \tag{12} $$

    and the metric (Path) based on Intrinsic IC (Intrinsic IC-Path) as:

    $$ sim_{Intrinsic\ IC\text{-}Path}(c_1, c_2) = \frac{1}{Dist_{JC}(c_1, c_2) + 1} \tag{13} $$
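Both redefinitions reduce to one-line functions once the Jiang and Conrath distance is known. The sketch below takes that distance and icmax as inputs; the function names and the sample values are chosen for the example.

```python
import math


def sim_ic_lch(dist_jc, ic_max):
    """Eq. (12): 1 - log(Dist_JC + 1) / log(2 * ic_max + 1); requires ic_max > 0."""
    return 1.0 - math.log(dist_jc + 1.0) / math.log(2.0 * ic_max + 1.0)


def sim_ic_path(dist_jc):
    """Eq. (13): inverse of the distance, shifted by 1 to avoid division by zero."""
    return 1.0 / (dist_jc + 1.0)


# Identical concepts (distance 0) score 1 under both metrics.
print(sim_ic_lch(0.0, 3.0), sim_ic_path(0.0))
# The largest possible distance, 2 * ic_max, drives Intrinsic IC-lch to 0.
print(sim_ic_lch(6.0, 3.0))
print(sim_ic_path(1.0))  # 0.5
```

The +1 terms play the role that counting nodes plays in the path-based versions: a distance of zero maps to a similarity of exactly 1.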

    These metrics have been evaluated in various works and on different test benchmarks (Batet et al., 2011; Garla & Brandt, 2012; Pedersen et al., 2007). Said works reveal a better performance of metrics based on Intrinsic IC over those based on Path finding.

    4. Proposal and materials

    As described in the previous section, metrics based on Intrinsic IC perform better than those based on Path finding (Batet et al., 2011; Garla & Brandt, 2012) when working on testbeds with limited and controlled sets. For this reason, our experimental study focuses on assessing the performance, in a real-life context, of the Intrinsic IC-Path metric itself and of the simplest of the distance-based metrics (chosen for its lower computational cost), Path. In order to perform this assessment, we have deployed an information retrieval system for the biomedical field based on the UMLS Metathesaurus and on semantic similarity metrics.

    As we covered in the previous section, some earlier works focused on defining retrieval systems and language processing supported by the UMLS resource (McCray et al., 1993), and others on the application of semantic similarity metrics to defined structures independent from UMLS, such as the comparison of histogram distance (Kurtz et al., 2014). None of these works integrates the use of the UMLS resource with semantic similarity metrics into information retrieval systems for the context of biomedical information.

    Later on, in Section 5, we will assess the performance of the UMLS Metathesaurus at calculating the semantic similarity between concepts from a theoretical perspective. To this end, we will use previous works as reference (Batet et al., 2011; Garla & Brandt, 2012; McInnes et al., 2009; Pedersen et al., 2007), and compare their results with the ones attained in our own work, in order to validate our framework. Said works evaluate the semantic similarity of several lists of paired concepts, using different metrics, and compare the results to those proposed by a team of medical coders and physicians. We will also analyze the impact on those results of using different versions of UMLS and new types of relationships. Lastly, we will highlight the great diversity observed in the results of previous studies, which is due to the lack of a single correlation coefficient.

    In Section 6, we will analyze the results of the Path and Intrinsic IC-Path metrics in a real-life information retrieval context based on semantic similarity. In this part of our paper, we will use the test dataset from the 2011 Text Retrieval Conference (TREC) (Voorhees & Tong, 2011). This test dataset is made up of three elements: a corpus of 101,712 de-identified documents or health records, comprising 17,265 visits or medical episodes of various patients (each visit can have between 1 and 415 documents or reports); 35 queries representing information needs or inclusion criteria that must be fulfilled by the retrieval of the most relevant visits or episodes; and lastly, a series of relevance judgements defined by a team of experts, in which each individual visit is deemed relevant or not relevant according to the information needs of each search query in a real-life context.
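Relevance judgements of this kind make it possible to score a retrieval run per query. The sketch below computes precision, recall, and a balanced F-measure for one query; the visit identifiers are invented, and the balanced (F1) form is an assumption, since the exact F-measure variant is not stated in this part of the paper.

```python
def precision_recall_f1(retrieved, relevant):
    """Set-based precision, recall and balanced F-measure for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    true_positives = len(retrieved & relevant)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Invented example: four visits retrieved, three judged relevant, two in common.
retrieved_visits = ["v1", "v2", "v3", "v4"]
relevant_visits = ["v2", "v4", "v5"]
p, r, f = precision_recall_f1(retrieved_visits, relevant_visits)
print(p, r, f)  # 0.5, about 0.667, about 0.571
```

Averaging such per-query scores over all queries would give a single figure for each similarity metric.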

    For the development of the solution proposed in this paper, we have used a range of different tools: the UMLS Metathesaurus, versions 2010AB and 2011AB, as the base for medical knowledge; Metamap 2013³ for concept-based representation (this version of Metamap allows for the identification of negative statements, and the classification of concepts by any semantic type they may possess); and two open-source tools for the computation of semantic similarity between concepts: the first (McInnes et al., 2009) is a framework composed of

    3 http://metamap.nlm.nih.gov/Docs/MM_2013_ReleaseNotes.pdf.


    Table 2
    Semantic similarity correlation values using the Path metric, with PAR/CHD relationships, for
    SNOMED-CT (UMLS versions 2008AB and 2010AB). Spearman correlations based on minimum rank values
    (results reproduced from Pedersen and McInnes (McInnes et al., 2009; Pedersen et al., 2007) for
    version 2008AB) and on average values (used in this work).

                  SNOMED-CT 2008AB                  SNOMED-CT 2010AB
                  Minimum values   Average values   Minimum values   Average values
    Physicians    0.3500           0.3170           0.3134           0.2744
    Coders        0.5000           0.4500           0.4596           0.4160

    two packages (UMLS-Similarity and UMLS-Interface; footnotes 4 and 5) based on Perl modules available
    in CPAN (the Comprehensive Perl Archive Network), and the second (Garla & Brandt, 2012) is one of the
    components of the Ytex framework (footnote 6).
    Finally, it is worth noting that the health records processed by the system are a series of XML
    files. Although each document is hence structured in XML, the label containing the most important
    information (the document itself) is written in natural language.

    5. Evaluation of UMLS

    Firstly, we will analyze the characteristics of the main tool used in this work, the UMLS
    Metathesaurus, that can improve semantic similarity computation. These are the sources evaluated
    and the existing types of relationship between concepts present in the different UMLS versions used.

    5.1. Versions of UMLS Metathesaurus

    UMLS compiles the knowledge of the biomedical domain, and is thus undergoing constant evolution and
    improvement. Small changes between versions, affecting concepts or relationships, can have a
    noticeable impact on the results obtained in a tightly defined context, such as Pedersen's benchmark
    (29 pairs of concepts) (McInnes et al., 2009; Pedersen et al., 2007).
    To reflect this, we have reproduced the results obtained by Pedersen et al. (2007) and McInnes et al.
    (2009) on the source SNOMED-CT of UMLS version 2008AB (used in their work), compared to those of
    version 2010AB (Table 2).

    This table shows Spearman's rank correlation coefficients for the Path metric, compared to the
    estimates of physicians and medical coders. On one side, we show the correlation coefficients based
    on the minimum rank values for groups of repeated (tied) similarity values (as used in McInnes et
    al., 2009; Pedersen et al., 2007); on the other, the correlation coefficients based on the average
    values of said ranks (employed in this paper, as we consider it to be the most adequate approach in
    this context).
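    The difference between the two tie-handling conventions can be sketched in a few lines. This is
    only an illustration, not the code used in the study: the function names (ranks, pearson,
    spearman) and the sample scores are hypothetical, and Spearman's coefficient is computed here as
    the Pearson correlation of the two rank vectors.

```python
from math import sqrt

def ranks(values, ties="average"):
    """Ascending 1-based ranks; tied values share their minimum or average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        first, last = start + 1, end + 1  # ranks covered by this group of ties
        shared = first if ties == "min" else (first + last) / 2
        for k in range(start, end + 1):
            result[order[k]] = float(shared)
        start = end + 1
    return result

def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))

def spearman(x, y, ties="average"):
    """Spearman's coefficient as the Pearson correlation of the rank vectors."""
    return pearson(ranks(x, ties), ranks(y, ties))

# Hypothetical scores: the metric repeats values, the expert ratings do not.
metric = [1.0, 0.5, 0.5, 0.5, 0.25, 0.25]
expert = [4.0, 3.5, 2.0, 3.0, 1.0, 1.5]
print(round(spearman(metric, expert, ties="min"), 4))
print(round(spearman(metric, expert, ties="average"), 4))
```

    With many repeated similarity values, as is typical of path-based metrics, the two conventions
    produce different coefficients for the same data.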

    As we can see, the obtained correlations (using Spearman's coefficient) vary significantly (from 6%
    to 13%) between versions of SNOMED-CT. These results show some refinement of the relationships
    between concepts in version 2010AB, which leads to lower values (similarity relationships found in
    the earlier version are not found anymore). We must, then, bear this in mind when we compare the
    results of different studies, since many of them may be comparing the performance of metrics run on
    different versions of the UMLS Metathesaurus.

    In these results, and in others obtained throughout this work, we

    can observe that the metrics are better adjusted to the similarity cri-

    teria defined by medical coders than to those set by physicians.

    4 http://search.cpan.org/dist/UMLS-Similarity/.
    5 http://search.cpan.org/dist/UMLS-Interface/.
    6 https://code.google.com/p/ytex/.

    Table 3
    Semantic similarity correlation values, using Spearman and Pearson, for a number of metrics based
    on Path finding with PAR/CHD relationships, for SNOMED-CT with UMLS 2010AB.

                            Path     lch      wup      nam
    Spearman   Physicians   0.2744   0.2744   0.3377   0.4063
               Coders       0.4160   0.4156   0.4190   0.5578
    Pearson    Physicians   0.5451   0.3348   0.3372   0.4301
               Coders       0.7170   0.4566   0.3840   0.4456

    5.2. Impact of correlations used

    Analyzing the results obtained, and comparing them with the results of previous works, we observe
    the lack of a standard criterion for the coefficient used. For instance, some works use Spearman's
    correlation coefficient (Garla & Brandt, 2012; Pedersen et al., 2007) while others use Pearson's
    linear coefficient (Batet et al., 2011). This led us to study and interpret both kinds of
    correlation in the analysis of various semantic similarity metrics. Pedersen himself, who uses
    Spearman's coefficient in his results (Pedersen et al., 2007), points to a maximum Pearson
    correlation of 0.85 between the estimates of the evaluating experts (medical coders and physicians).
    For this reason, and for the sake of a better interpretation, we calculate similarity for the 29
    paired concepts (McInnes et al., 2009) with the main metrics based on Path finding, and observe that
    there is significant variation in the results depending on the correlation coefficient used (Table 3).

    As in previous works (McInnes et al., 2009; Nguyen & Al-Mubaid, 2006), the nam metric (Nguyen &
    Al-Mubaid), applied to SNOMED-CT sources, reflects better correlation values for the Spearman
    coefficient. Pearson's correlation, however, offers better results for the Path metric (Table 3).

    Far from joining the discussion over the kind of correlation that should be used (Pearson correlates
    similarity values, while Spearman correlates their order), our study reveals that the results of
    various works are simply not comparable with each other, as was already pointed out by Garla and
    Brandt (2012). For this reason, and to further clarify the matter, we now show the results obtained
    with both correlation coefficients.
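    The remark that Pearson correlates values while Spearman correlates their order can be illustrated
    with a small sketch. It assumes hypothetical path lengths and expert ratings, and compares two
    scores derived from the same path lengths: an inverse path length and a logarithmic rescaling
    (the constant DEPTH is an arbitrary assumption). Both scores decrease strictly with path length, so
    they rank the pairs identically, yet they are not linearly related.

```python
from math import log, sqrt

def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))

def average_ranks(values):
    ordered = sorted(values)
    # A value's first 1-based rank is index + 1 and its last is index + count;
    # tied values share the mean of the two.
    return [ordered.index(v) + (1 + ordered.count(v)) / 2 for v in values]

def spearman(x, y):
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical shortest-path lengths (in nodes) for six concept pairs, plus expert ratings.
path_lengths = [1, 2, 3, 4, 6, 8]
expert = [4.0, 3.0, 3.5, 2.0, 1.5, 1.0]
DEPTH = 10  # assumed taxonomy depth, used only to scale the logarithm

inverse = [1 / p for p in path_lengths]
log_scaled = [-log(p / (2 * DEPTH)) for p in path_lengths]

for name, scores in (("inverse", inverse), ("log-scaled", log_scaled)):
    print(name, round(pearson(scores, expert), 4), round(spearman(scores, expert), 4))
```

    The Spearman value is the same for both scores, while the Pearson value changes, which is the
    pattern visible for the Path and lch columns of Table 3 in the physicians' rows.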

    5.3. Study of UMLS relationships and resources

    The semantic similarity calculations in the previous works by Pedersen et al. (2007) and McInnes et
    al. (2009) were done through direct hierarchical relationships (PAR/CHD), also defined as "is-a"
    semantic relationships, on a single source. Later works, such as Garla and Brandt (2012) and Batet et
    al. (2011), do not specify the kind of relationships used in the calculation of semantic similarity,
    so it is not possible to determine the implications of their results.

    For this reason, in the first part of our work, we have also evaluated the impact that different
    kinds of relationships between concepts can have on the calculation of semantic similarity. The
    kinds of relationships we evaluate are: direct hierarchical relationships (PAR/CHD), indirect
    hierarchical relationships (RB/RN), and the non-hierarchical relationships existing in the UMLS
    Metathesaurus (SIB, RO, RL, SY, RQ, AQ, and QB).

    Firstly, we will run the similarity calculations for Pedersen's benchmark, using the Path metric
    applied to the sources and relationships contained in UMLS 2010AB. As shown in Table 4, there is a
    significant improvement in the correlation coefficients for hierarchical relationships. However, the
    combined use of all relationships (both hierarchical and non-hierarchical) degrades the results
    considerably. This is due to the fact that these non-hierarchical relationships generate cycles that
    do not represent parent/child or sibling relationships between concepts (synonymy) (Bodenreider,
    2001; Erdogan et al., 2010); we do not, then, recommend using them, as they add


    Table 4
    Semantic similarity correlations for the Path metric, with relationships PAR/CHD, PAR/CHD+RB/RN,
    and ALL relationships for sources existing in UMLS 2010AB.

                            PAR/CHD   PAR/CHD + RB/RN   ALL
    Spearman   Phys.        0.6382    0.5761            0.4788
               Coders       0.6422    0.6495            0.4338
    Pearson    Phys.        0.7059    0.6740            0.6168
               Coders       0.7982    0.8012            0.7046

    Table 5
    Table summarising results, extracted from Garla's work for UMLS release 2011AB concept graph
    (Garla & Brandt, 2012).

                                  Knowledge Based
                                  Path Finding            Intrinsic IC
    Benchmarks                    wup      Path / lch     Path
    Pedersen Combined N = 29      0.70     0.61           0.70
    Mayo N = 101                  0.38     0.30           0.41
    UMN relatedness N = 430       0.33     0.36           0.36
    UMN similarity N = 566        0.39     0.40           0.43
    UMN relatedness N = 587       0.32     0.34           0.35

    noise to the results. We can also observe how the application of the

    entirety of the knowledge offered by the sources within the UMLS

    Metathesaurus improves results as well.

    The previous tests were also conducted on version 2011AB of the

    UMLS Metathesaurus, obtaining similar results.
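    The effect of the relationship families on a path-based score can be sketched on a toy graph. The
    concept names and edges below are hypothetical, and the score is assumed to be the inverse of the
    number of nodes on the shortest path, which matches the values 1.0000, 0.5000 and 0.3333 reported
    for the Path metric in Tables 9 and 10. It is an illustration only, not the UMLS-Similarity or
    Ytex implementation.

```python
from collections import deque

# Hypothetical toy edges: (concept_a, concept_b, relationship family).
EDGES = [
    ("hearing_loss", "ear_disorder", "PAR/CHD"),
    ("ear_disorder", "disorder", "PAR/CHD"),
    ("anemia", "blood_disorder", "PAR/CHD"),
    ("blood_disorder", "disorder", "PAR/CHD"),
    ("hearing_loss", "anemia", "RO"),  # non-hierarchical "other" link
]

def path_similarity(a, b, allowed):
    """1 / (number of nodes on the shortest path), using only the allowed families."""
    graph = {}
    for u, v, rel in EDGES:
        if rel in allowed:
            graph.setdefault(u, set()).add(v)
            graph.setdefault(v, set()).add(u)
    seen, queue = {a}, deque([(a, 1)])
    while queue:  # breadth-first search, so the first hit is a shortest path
        node, n_nodes = queue.popleft()
        if node == b:
            return 1.0 / n_nodes
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, n_nodes + 1))
    return 0.0  # no path under the allowed relationships

print(path_similarity("hearing_loss", "anemia", {"PAR/CHD"}))
print(path_similarity("hearing_loss", "anemia", {"PAR/CHD", "RO"}))
```

    In this toy case a single non-hierarchical edge shortens the path from five nodes to two, so two
    taxonomically distant concepts suddenly look similar. This is the kind of noise described above.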

    5.4. Comparison of Path and Intrinsic IC-Path metrics

    Although many works have evaluated the performance of metrics with different sets of paired
    concepts, Garla offers a definitive view of the various existing frameworks and the various metrics
    defined (Garla & Brandt, 2012). As we can see in Table 5 (a summary of the results reached by Garla),
    the best overall results are given by the Intrinsic IC-Path metric.
    For this reason, our work will evaluate the performance of the metric yielding the best results
    (Intrinsic IC-Path) and the computationally simplest metric (Path) in a real information retrieval
    scenario, that is, working on large volumes of information.

    6. Evaluation of metrics in a real information retrieval context

    Now that we have shown the importance of using the latest versions of the UMLS Metathesaurus (for an
    updated knowledge of the biomedical domain) and of applying the right relationships (to reduce
    noise), we will focus, in this section, on the application of these conclusions to a real
    environment. Also, in contrast with earlier works, we will evaluate the impact of using the Path and
    Intrinsic IC-Path metrics in this real environment, namely the set of electronic medical records
    found in the TREC Medical Records Track 2011 (Voorhees & Tong, 2011).

    In order to perform this evaluation, each of the medical reports making up each visit for a given
    patient, along with the search topics, will be represented via concepts contained within UMLS. This
    representation will allow us to relate the topic's concepts semantically with the contents of each
    report; the semantic similarity between these will determine the relevance of each visit.
    For the calculation of metrics of semantic similarity between concepts, we have used Ytex, developed
    by Garla and Brandt (Garla & Brandt, 2012).

    6.1. Processing the information to be used

    In order to represent, treat, and evaluate the semantic similarity described above, we must extract
    UMLS concepts from the search topic, as well as from the report. That done, the semantic similarity
    between the concepts extracted from both is calculated. Lastly, these results will be aggregated
    into a single similarity value, which will determine the relevance (or irrelevance) of the document
    for a given search topic. We will now detail the process.
    Pre-processing of reports and search topics: reports taken from the Text Retrieval Conference (TREC)
    are in XML format, and contain a series of headers, footers, codes, and labels that must be removed
    before processing. Hence, in this stage, we remove the document's XML tags, as well as any
    information that is not relevant to this study, such as the report's checksum code (which identifies
    the visit it belongs to), its signatures, and its ICD-9 codes. The result is a plain-text version of
    the report, written in natural language with no codes or labels.
    Topics, on the other hand, require no such processing, as they are already a mere text string.
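    A minimal sketch of this pre-processing step is shown below, using Python's standard XML parser.
    The element names in the sample (report, checksum, type, discharge_diagnosis, report_text) and the
    function name plain_text are assumptions made for illustration; the actual layout of the TREC files
    may differ, and the removal of signatures embedded in the narrative is not covered.

```python
import xml.etree.ElementTree as ET

# Hypothetical report layout; the element names are assumptions.
SAMPLE = """<report>
  <checksum>20060201ER-abc-123</checksum>
  <type>ER</type>
  <discharge_diagnosis>389.9</discharge_diagnosis>
  <report_text>HARD OF HEARING.
     IRON DEFICIENCY ANEMIA.</report_text>
</report>"""

def plain_text(xml_string, text_tag="report_text"):
    """Return only the narrative text, dropping tags, codes and other fields."""
    root = ET.fromstring(xml_string)
    node = next(root.iter(text_tag), None)  # first element with that tag, at any depth
    if node is None or node.text is None:
        return ""
    return " ".join(node.text.split())  # collapse line breaks and repeated spaces

print(plain_text(SAMPLE))
```

    Keeping a single narrative element and discarding everything else is the simplest way to leave out
    the checksum and the diagnosis codes in this sketch.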

    Processing of search topics: the topics are processed using the tool Metamap, breaking them into
    simple strings termed "phrases", which represent symptoms, parts of the body, illnesses, etc. After
    this, we obtain the CUIs of each of these resulting phrases. Some of these phrases or strings may
    generate more than one CUI (as was described in Section 2), in which case we combine these CUIs,
    giving each phrase a number of "sub-phrases", and hence expanding the query.

    As an example of this method, we will now describe the processing of Topic 104, which defines the
    search criteria "Patients diagnosed with localized prostate cancer and treated with robotic
    surgery". The strings or phrases that make up this topic are:

    1. Patients.

    2. diagnosed with localized prostate cancer.

    3. treated with robotic surgery.

    Following this, we extract the UMLS concepts (CUIs) associated to each topic phrase, obtaining the
    11 sub-phrases shown in Table 6. For instance, phrase 3 ("treated with robotic surgery") generates
    sub-phrases 1009, 1010, and 1011, while phrase 1 ("Patients") generates only sub-phrase 1001.

    When processing a single search criterion, for example Topic 101 ("Patients with hearing loss"),
    only one phrase will be generated, with the concept sub-phrases 1001, 1002, and 1003, as seen in
    Table 7.
    In both examples, we can see how phrases given more than one sub-phrase implicitly expand the
    original query, through the variations in the concepts (CUIs) they contain; each of them carries a
    meaning that is unique, but common to that of the original query.
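    The expansion of Topic 104 into the 11 sub-phrases of Table 6 can be reproduced with a short
    sketch. The data structure is an assumption made for illustration: each phrase holds one or more
    alternative mappings, each mapping is a list of slots, and each slot lists its candidate CUIs. The
    function name expand is hypothetical, and the sketch does not call Metamap itself.

```python
from itertools import product

# phrase id -> alternative mappings -> slots -> candidate CUIs (values taken from Table 6).
TOPIC_104 = {
    1: [[["C0030705"]]],
    2: [
        [["C0011900"], ["C0796563", "C1334407"], ["C0033572", "C1278980"]],
        [["C0011900"], ["C0392752"], ["C0376358", "C0600139", "C2984325"]],
    ],
    3: [[["C0332293"], ["C0035785"], ["C0038894", "C0038895", "C0543467"]]],
}

def expand(topic, first_id=1001):
    """Return (sub-phrase id, phrase id, tuple of CUIs) for every CUI combination."""
    rows = []
    next_id = first_id
    for phrase_id in sorted(topic):
        for mapping in topic[phrase_id]:
            for combo in product(*mapping):  # one CUI per slot
                rows.append((next_id, phrase_id, combo))
                next_id += 1
    return rows

for row in expand(TOPIC_104):
    print(row)
```

    Phrase 1 yields one sub-phrase, phrase 2 yields seven (four from the first mapping and three from
    the second), and phrase 3 yields three, in the same order as sub-phrases 1001 to 1011 in Table 6.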

    Processing of medical reports: reports are processed in a similar fashion to topics, identifying the
    UMLS concepts corresponding to each phrase in the document, and generating all the possible
    sub-phrases from the combination of CUIs of its different contextual phrases.
    As an example, we show a brief excerpt from a report (Fig. 2), after being pre-processed in this
    stage to generate the corresponding phrases. These phrases are expanded into different sub-phrases
    through variations in the concepts (CUIs) that represent them (Table 8). This way, we will be able to
    combine and match them with each of the sub-phrases defining the topic, and obtain the maximum
    semantic proximity between topic and report (as will be explained in detail in Section 6.3). It is
    also worth noting that those phrases containing a negation (assigned code 1) will be eliminated from
    the similarity calculation process. In both cases, topic and report have been conceptually expanded
    from the sub-phrases generated in both processes.
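    The removal of negated phrases can be sketched as a simple filter over rows shaped like those of
    Table 8. The Row type and the function name drop_negated are hypothetical; the values are the last
    rows of Table 8.

```python
from collections import namedtuple

# Hypothetical row type mirroring the columns of Table 8.
Row = namedtuple("Row", "subphrase phrase negated cui semtype")

ROWS = [
    Row(199, 257, 0, "C0442726", "fndg"),  # DETECTED ON CT
    Row(200, 258, 1, "C0205234", "spco"),  # NO CHANGE IN ... (negated)
    Row(200, 258, 1, "C1511605", "fndg"),
    Row(200, 258, 1, "C0678594", "spco"),
    Row(201, 259, 0, "C0018772", "fndg"),  # HARD OF HEARING
    Row(202, 259, 0, "C1384666", "fndg"),
    Row(203, 260, 0, "C0162316", "dsyn"),  # IRON DEFICIENCY ANEMIA
]

def drop_negated(rows):
    """Keep only the concepts whose phrase carries no negation flag."""
    return [r for r in rows if r.negated == 0]

kept = drop_negated(ROWS)
print(sorted({r.phrase for r in kept}))
```

    Only phrase 258 carries the negation code, so its three concepts never reach the similarity
    calculation.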


    CONGESTIVE HEART FAILURE. CYSTIC STRUCTURE AT THE POSTERIOR LEFT SIDE OF THE URINARY BLADDER WHICH
    CAUSES MASS EFFECT ON THE URINARY BLADDER AND ADJACENT TO UTERUS, DETECTED ON CT OF THE ABDOMEN.
    NO CHANGE IN 7 X 8 CM FOCAL CYSTIC STRUCTURE. HARD OF HEARING. IRON DEFICIENCY ANEMIA.

    Fig. 2. Example pre-processed excerpt of a report (Report 90230).

    Table 6
    Phrase table (Topic 104).
    Topic 104: "Patients diagnosed with localized prostate cancer and treated with robotic surgery"

    SUBPHRASE  PHRASE  CUIs
    1001       1       CUI1 = (C0030705) : podg : "Patients"
    1002       2       CUI1 = (C0011900) : fndg : "Diagnosis"
                       CUI2 = (C0796563) : neop : "Localized Malignant Neoplasm"
                       CUI3 = (C0033572) : bpoc : "Prostate"
    1003       2       CUI1 = (C0011900) : fndg : "Diagnosis"
                       CUI2 = (C0796563) : neop : "Localized Malignant Neoplasm"
                       CUI3 = (C1278980) : bpoc : "Entire prostate"
    1004       2       CUI1 = (C0011900) : fndg : "Diagnosis"
                       CUI2 = (C1334407) : neop : "Localized Carcinoma"
                       CUI3 = (C0033572) : bpoc : "Prostate"
    1005       2       CUI1 = (C0011900) : fndg : "Diagnosis"
                       CUI2 = (C1334407) : neop : "Localized Carcinoma"
                       CUI3 = (C1278980) : bpoc : "Entire prostate"
    1006       2       CUI1 = (C0011900) : fndg : "Diagnosis"
                       CUI2 = (C0392752) : spco : "Localized"
                       CUI3 = (C0376358) : neop : "Malignant neoplasm of prostate"
    1007       2       CUI1 = (C0011900) : fndg : "Diagnosis"
                       CUI2 = (C0392752) : spco : "Localized"
                       CUI3 = (C0600139) : neop : "Prostate carcinoma"
    1008       2       CUI1 = (C0011900) : fndg : "Diagnosis"
                       CUI2 = (C0392752) : spco : "Localized"
                       CUI3 = (C2984325) : ftcn : "Prostate Cancer Pathway"
    1009       3       CUI1 = (C0332293) : topp : "Treated with"
                       CUI2 = (C0035785) : ocdi : "Robotics"
                       CUI3 = (C0038894) : bmod : "Surgery specialty"
    1010       3       CUI1 = (C0332293) : topp : "Treated with"
                       CUI2 = (C0035785) : ocdi : "Robotics"
                       CUI3 = (C0038895) : ftcn : "Surgical aspects"
    1011       3       CUI1 = (C0332293) : topp : "Treated with"
                       CUI2 = (C0035785) : ocdi : "Robotics"
                       CUI3 = (C0543467) : diap : "Operative Surgical Procedures"

    Table 7
    Phrase table (Topic 101).
    Topic 101: "Patients with hearing loss"

    SUBPHRASE  PHRASE  CUIs
    1001       1       CUI1 = (C0030705) : podg : "Patients"
                       CUI2 = (C0011053) : dsyn : "Deafness"
    1002       1       CUI1 = (C0030705) : podg : "Patients"
                       CUI2 = (C0018772) : fndg : "Hearing Loss, Partial"
    1003       1       CUI1 = (C0030705) : podg : "Patients"
                       CUI2 = (C1384666) : fndg : "hearing impairment"

    6.2. Filtering by topic semantic types

    The query expansion conducted in the previous point enhances the information retrieval process, as
    it unveils new relationships between concepts. Still, this expansion may generate relationships
    between concepts belonging to semantic types with little semantic specialization or specificity
    (Bodenreider, 2001; Bodenreider & McCray, 2003; Erdogan et al., 2010; Plaza & Díaz, 2010). These
    relationships may skew the accuracy of similarity results for those concepts of greater semantic
    relevance to our current context.
    For this reason, we have gone on to classify semantic types by their importance, dividing them into
    "generic" and "specific" types. Specific semantic types group concepts that carry more importance in
    the biomedical domain, such as diseases, symptoms, procedures, and

    Table 8

    Example processed excerpt of a report (Report 90230).

    SUBPHRASE PHRASE Negation Excerpt Report 90230

    190 254 0 C0018802 dsyn CONGESTIVE HEART

    FAILURE.

    191 255 0 C0010709 dsyn CYSTIC STRUCTURE AT

    THE POSTERIOR LEFT SIDE

    191 255 0 C0678594 spco CYSTIC STRUCTURE AT

    THE POSTERIOR LEFT SIDE

    191 255 0 C0456856 spco CYSTIC STRUCTURE AT

    THE POSTERIOR LEFT SIDE

    191 255 0 C0441987 spco CYSTIC STRUCTURE AT

    THE POSTERIOR LEFT SIDE

    192 255 0 C0010709 dsyn CYSTIC STRUCTURE AT

    THE POSTERIOR LEFT SIDE

    192 255 0 C0678594 spco CYSTIC STRUCTURE AT

    THE POSTERIOR LEFT SIDE

    192 255 0 C0205095 spco CYSTIC STRUCTURE AT

    THE POSTERIOR LEFT SIDE

    192 255 0 C0205091 spco CYSTIC STRUCTURE AT

    THE POSTERIOR LEFT SIDE

    193 256 0 C0577559 fndg MASS EFFECT ON THE

    URINARY BLADDER

    193 256 0 C1280500 qlco MASS EFFECT ON THE

    URINARY BLADDER

    193 256 0 C0005682 bpoc MASS EFFECT ON THE

    URINARY BLADDER

    194 256 0 C0577559 fndg MASS EFFECT ON THE

    URINARY BLADDER

    194 256 0 C2348382 qlco MASS EFFECT ON THE

    URINARY BLADDER

    194 256 0 C0005682 bpoc MASS EFFECT ON THE

    URINARY BLADDER

    195 256 0 C1280500 qlco MASS EFFECT ON THE

    URINARY BLADDER

    195 256 0 C1280500 qlco MASS EFFECT ON THE

    URINARY BLADDER

    195 256 0 C0238775 fndg MASS EFFECT ON THE

    URINARY BLADDER

    196 256 0 C1280500 qlco MASS EFFECT ON THE

    URINARY BLADDER

    196 256 0 C1524119 qlco MASS EFFECT ON THE

    URINARY BLADDER

    196 256 0 C0238775 fndg MASS EFFECT ON THE

    URINARY BLADDER

    197 256 0 C2348382 qlco MASS EFFECT ON THE

    URINARY BLADDER

    197 256 0 C0042027 bpoc MASS EFFECT ON THE

    URINARY BLADDER

    197 256 0 C0238775 fndg MASS EFFECT ON THE

    URINARY BLADDER

    198 256 0 C2348382 qlco MASS EFFECT ON THE

    URINARY BLADDER

    198 256 0 C1524119 qlco MASS EFFECT ON THE

    URINARY BLADDER

    198 256 0 C0238775 fndg MASS EFFECT ON THE

    URINARY BLADDER

    199 257 0 C0442726 fndg DETECTED ON CT

    200 258 1 C0205234 spco NO CHANGE IN 7 X 8 CM

    FOCAL CYSTIC STRUCTURE

    200 258 1 C1511605 fndg NO CHANGE IN 7 X 8 CM

    FOCAL CYSTIC STRUCTURE

    200 258 1 C0678594 spco NO CHANGE IN 7 X 8 CM

    FOCAL CYSTIC STRUCTURE

    201 259 0 C0018772 fndg HARD OF HEARING

    202 259 0 C1384666 fndg HARD OF HEARING

    203 260 0 C0162316 dsyn IRON DEFICIENCY

    ANEMIA.


    medication. We now show, as an example, the semantic types which

    appear in Topics 104 and 101:

    - Generic
        spco - Spatial Concept (CONC)
        podg - Patient or Disabled Group (LIVB)
        ftcn - Functional Concept (CONC)
    - Specific
        dsyn - Disease or Syndrome (DISO)
        diap - Diagnostic Procedure (PROC)
        neop - Neoplastic Process (DISO)
        fndg - Finding (DISO)
        bpoc - Body Part, Organ, or Organ Component (ANAT)
        topp - Therapeutic or Preventive Procedure (PROC)
        bmod - Biomedical Occupation or Discipline (OCCU)
        ocdi - Occupation or Discipline (OCCU)

    Concepts associated to generic semantic types will be eliminated from the phrase table generated in
    the previous step. For example, in Tables 6 and 7, we show eliminated concepts in grey, for Topics
    104 and 101. This way, we can identify the concepts that are not relevant in the context of the
    specific phrase.
    Note how, from sub-phrase 1008, the concepts "Localized" and "Prostate Cancer Pathway" are
    eliminated, as they belong to generic types spco and ftcn respectively. We will also note how in
    Topic 104, phrase 1 will be completely eliminated.
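    The filtering step can be sketched as follows. The set GENERIC contains only the three generic
    types named above for Topics 104 and 101, not a complete classification, and the function name
    filter_generic and the dictionary layout are assumptions made for illustration.

```python
# Generic semantic types named in the text for Topics 104 and 101 (not a complete list).
GENERIC = {"spco", "podg", "ftcn"}

# sub-phrase id -> (phrase id, [(CUI, semantic type), ...]); values taken from Table 6.
SUBPHRASES = {
    1001: (1, [("C0030705", "podg")]),
    1008: (2, [("C0011900", "fndg"), ("C0392752", "spco"), ("C2984325", "ftcn")]),
    1009: (3, [("C0332293", "topp"), ("C0035785", "ocdi"), ("C0038894", "bmod")]),
}

def filter_generic(subphrases, generic=GENERIC):
    """Drop concepts of generic types; sub-phrases left empty disappear entirely."""
    kept = {}
    for sub_id, (phrase_id, concepts) in subphrases.items():
        specific = [(cui, st) for cui, st in concepts if st not in generic]
        if specific:
            kept[sub_id] = (phrase_id, specific)
    return kept

for sub_id, value in filter_generic(SUBPHRASES).items():
    print(sub_id, value)
```

    Sub-phrase 1001 vanishes because its only concept is of type podg, sub-phrase 1008 keeps only
    "Diagnosis", and sub-phrase 1009 is left untouched.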

    6.3. Maximum semantic similarity matrix (topic vs report) and

    relevance computation

    In order to assess the semantic similarity between each topic and each report, we perform a
    similarity evaluation at several aggregation levels: CUIs, sub-phrases, and phrases. Similarity
    computation (Sim) is achieved by a matrix that pairs the topic's CUIs with the report's CUIs, for
    both of our chosen metrics: Path and Intrinsic IC-Path. Afterwards, we select, for every CUI in the
    topic sub-phrase, the paired concepts (topic-report) with the highest similarity value. This process
    is then repeated for every topic sub-phrase within a phrase.

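    $$\mathrm{Sim\_subphrase}_{subphrase_i,\,cui_j} = \max_{k}\left(\mathrm{Sim}\left(\mathrm{CUI}_{subphrase_{ij}},\ \mathrm{CUI}_{report_k}\right)\right) \tag{15}$$

    where i is each of the topic sub-phrases, j each of the sub-phrase's CUIs, and k each of the
    report's CUIs.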

    Later, for each individual phrase, we select the maximum similarity value of each CUI present in
    its topic sub-phrases. In the individual case of Topic 104 and phrase 2 (Table 6) we will obtain the
    maximum similarity value of CUI1, CUI2, and CUI3.

    $$\mathrm{Sim\_max\_phrase}_{cui_j} = \max_{i}\left(\mathrm{Sim\_subphrase}_{subphrase_i,\,cui_j}\right) \tag{16}$$

    This done, we average their values, obtaining a single similarity value per phrase. In our example
    for Topic 104 and phrase 2, Sim_avg_phrase = (CUI1 + CUI2 + CUI3)/3.

    $$\mathrm{Sim\_avg\_phrase}_{i} = \sum_{i=0}^{\mathrm{num\_cuis\_phrase}} \left(\frac{\mathrm{Sim\_max\_phrase}_{i}}{\mathrm{num\_cuis\_phrase}}\right) \tag{17}$$

    Lastly, we average the maximum similarity values of all the phrases in the search, which yields the
    final relevance of the report with respect to the topic. In the case of Topic 104,
    Sim_topicvsreport = (Sim_avg_phrase1 + Sim_avg_phrase2 + Sim_avg_phrase3)/3.

    It is interesting to point out that, in the particular case of Topic 104, the final relevance value
    is determined by the average similarity value of the last two phrases. Phrase 1 ("Patients") is
    completely eliminated from the result, since all the concepts (CUIs) that make it up are associated
    to generic semantic types (podg).

    $$\mathrm{Sim\_topicvsreport} = \sum_{i=0}^{\mathrm{num\_phrases}} \left(\frac{\mathrm{Sim\_avg\_phrase}_{i}}{\mathrm{num\_phrases}}\right) \tag{18}$$

    We can then say that the final value (Sim_topicvsreport) of the maximum similarity matrix of a
    report in relation to a topic will determine whether or not it is relevant for the terms defined by
    said topic. The lower extreme (value 0) indicates maximum non-relevance, and the upper extreme
    (value 1) indicates maximum relevance.
    In order to compare the final value obtained by the semantic similarity matrix to the relevance
    criteria offered by experts in each case, it will be necessary to establish a cut-off value (within
    the range [0,1]), which will determine whether a certain report is relevant or not to a given topic.
    This will be studied and defined in the next section.

    Since a medical visit may be made up of more than one report, the visit's relevance will be
    determined by the maximum similarity value of its reports.
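    The aggregation described by Eqs. (15) to (18), followed by the visit-level maximum, can be sketched
    as below. This is an illustrative reading rather than the authors' code: the function names, the
    tuple layout of topic_rows, and the toy similarity function are assumptions, and the final average
    is taken over the phrases that remain after semantic type filtering, following the remark on
    Topic 104. The data reproduce the Path values of Table 9 for Topic 101 after the podg concepts
    have been removed.

```python
from collections import defaultdict

def report_relevance(topic_rows, report_cuis, sim):
    """topic_rows: (phrase id, sub-phrase id, CUI position, CUI) tuples left after filtering."""
    best = defaultdict(float)
    for phrase_id, _sub_id, position, cui in topic_rows:
        # Eq. (15): best match of this topic CUI against every report CUI.
        sub_max = max((sim(cui, r) for r in report_cuis), default=0.0)
        # Eq. (16): best value for this CUI position over the phrase's sub-phrases.
        key = (phrase_id, position)
        best[key] = max(best[key], sub_max)
    by_phrase = defaultdict(list)
    for (phrase_id, _position), value in best.items():
        by_phrase[phrase_id].append(value)
    # Eq. (17): mean per phrase; Eq. (18): mean over the remaining phrases.
    phrase_avgs = [sum(v) / len(v) for v in by_phrase.values()]
    return sum(phrase_avgs) / len(phrase_avgs) if phrase_avgs else 0.0

def visit_relevance(topic_rows, reports, sim):
    """A visit takes the highest relevance found among its reports."""
    return max((report_relevance(topic_rows, cuis, sim) for cuis in reports), default=0.0)

# Toy similarity: identical CUIs score 1.0; one pair takes its Path value from Table 9.
PATH_SIM = {("C0011053", "C0018772"): 0.5}

def toy_sim(a, b):
    return 1.0 if a == b else PATH_SIM.get((a, b), 0.0)

TOPIC_101 = [(1, 1001, 2, "C0011053"), (1, 1002, 2, "C0018772"), (1, 1003, 2, "C1384666")]
REPORT_90230 = ["C0018772", "C1384666", "C0162316"]

print(report_relevance(TOPIC_101, REPORT_90230, toy_sim))  # prints 1.0
```

    Even though "Deafness" only reaches 0.5 against the report, the other two sub-phrases match
    exactly, so the phrase (and therefore the topic) obtains the maximum value.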

    This method tries to preserve the informational uniqueness and completeness of the query (topic) for
    its automated treatment, without any input needed from the user. For this, it is necessary to
    include each of the topic components by a process of aggregation of the average of the maximum
    similarity values of the different phrases. In this way, each sub-phrase, which is expanded from the
    phrases that make up the topic, is measured with the same precision when the aggregation of their
    averages takes place. However, what will determine, in the end, the relevance of each component,
    will be the maximum semantic similarity of the topic concepts in relation to the report, along with
    the semantic type they belong to.

    Through this straightforward example (Table 9), we can observe the importance of concept-based
    expansion, both of the topic and of the report, between the concepts of which we can establish
    maximum similarity relationships, even when the terms or strings are different in themselves. So,
    for example, the terms associated to the CUIs of the topic ("Deafness"; "Hearing Loss, Partial";
    "hearing impairment") are different from the terms associated to the CUIs of the report ("Hard of
    Hearing"), and yet we obtain the maximum possible similarity.
    Tables 9 and 10 are composed of the following elements: the first two columns (topic and report)
    contain the sub-phrase id, phrase id, CUI, semantic type, and phrase string of the topic and the
    report respectively. The last two columns correspond to the maximum similarity for each metric
    between pairs of topic-report CUIs.

    7. Result analysis

    In this section, we will analyze the results obtained after evaluating topics matched to reports,
    by the procedure described in the previous section.

    In order to contrast the relevance criteria set by the experts with the results of the retrieval
    system we propose in this paper, we have generated a histogram (Fig. 3) which reflects the
    similarity of each visit (the report with the highest similarity value in each) to a search topic.
    These reports are distributed along the X axis according to their degree of relevance (0 being "Not
    relevant", and 1 "Relevant"). Lastly, to ease the understanding of the histogram, we highlight in
    black those reports which were deemed "Relevant" by the experts, and in ochre those deemed "Not
    relevant".

    7.1. Justification of topic semantic type filtering

    Firstly, we have carried out a series of experiments to validate filtering by concepts associated
    to specific topic semantic types. Thus, in Fig. 3, we show the results of evaluating the reports
    matched to Topic 107 ("Patients with ductal carcinoma in situ (DCIS)"), both filtered by semantic
    types (Fig. 3b) and unfiltered (Fig. 3a). We can easily see


    Table 9
    Example maximum similarity values matrix for each sub-phrase (Sim_subphrase) from Topic 101 - Report 90230.

    Topic 101                                       Report 90230                                          Path      IC-Path
                                                                                                          Max. Sim  Max. Sim
    1001  1  C0030705  podg  Patients               168  52  C0030705  podg  the patient on consultation  1.0000    1.0000
    1001  1  C0011053  dsyn  Deafness               201  73  C0018772  fndg  HARD OF HEARING.             0.5000    0.8042
    1002  1  C0030705  podg  Patients               113  11  C0030705  podg  the patient in consultation  1.0000    1.0000
    1002  1  C0018772  fndg  Hearing Loss, Partial  201  73  C0018772  fndg  HARD OF HEARING.             1.0000    1.0000
    1003  1  C0030705  podg  Patients               111  9   C0030705  podg  The patient apparently       1.0000    1.0000
    1003  1  C1384666  fndg  hearing impairment     202  73  C1384666  fndg  HARD OF HEARING.             1.0000    1.0000

    Table 10
    Example maximum similarity values matrix of each sub-phrase (Sim_subphrase) from Topic 104 - Report 51139.

    Topic 104                                                Report 51139                                                                   Path      IC-Path
                                                                                                                                            Max. Sim  Max. Sim
    1001  1  C0030705  podg  Patients                        43   15   C0030705  podg  he patient                                           1.0000    1.0000
    1002  2  C0011900  fndg  Diagnosis                       75   28   C0543467  diap  DESCRIPTION OF OPERATION                             0.3333    0.7172
    1002  2  C0796563  neop  Localized Malignant Neoplasm    28   12   C0796563  neop  LOCALIZED PROSTATE CANCER.                           1.0000    1.0000
    1002  2  C0033572  bpoc  Prostate                        61   18   C0033572  bpoc  now for removal of his prostate                      1.0000    1.0000
    1003  2  C0011900  fndg  Diagnosis                       40   14   C0376358  neop  LOCALIZED PROSTATE CANCER.                           0.3333    0.7172
    1003  2  C0796563  neop  Localized Malignant Neoplasm    28   12   C0796563  neop  LOCALIZED PROSTATE CANCER.                           1.0000    1.0000
    1003  2  C1278980  bpoc  Entire prostate                 380  19   C1278980  bpoc  The prostate                                         1.0000    1.0000
    1004  2  C0011900  fndg  Diagnosis                       40   14   C0376358  neop  LOCALIZED PROSTATE CANCER.                           0.3333    0.7172
    1004  2  C1334407  neop  Localized Carcinoma             30   12   C1334407  neop  LOCALIZED PROSTATE CANCER.                           1.0000    1.0000
    1004  2  C0033572  bpoc  Prostate                        171  75   C0033572  bpoc  at the prostate.                                     1.0000    1.0000
    1005  2  C0011900  fndg  Diagnosis                       63   19   C0184661  diap  benefits of the procedure                            0.3333    0.7172
    1005  2  C1334407  neop  Localized Carcinoma             30   12   C1334407  neop  LOCALIZED PROSTATE CANCER.                           1.0000    1.0000
    1005  2  C1278980  bpoc  Entire prostate                 236  113  C1278980  bpoc  sharp dissection until the prostate                  1.0000    1.0000
    1006  2  C0011900  fndg  Diagnosis                       15   7    C0184661  diap  PROCEDURE                                            0.3333    0.7172
    1006  2  C0392752  spco  Localized                       53   16   C0392752  spco  50s-year-old male with localized adenocarcinoma of   1.0000    1.0000
    1006  2  C0376358  neop  Malignant neoplasm of prostate  32   12   C0376358  neop  LOCALIZED PROSTATE CANCER.                           1.0000    1.0000
    1007  2  C0011900  fndg  Diagnosis                       14   6    C0543467  diap  SURGERY DATE                                         0.3333    0.7172
    1007  2  C0392752  spco  Localized                       44   16   C0392752  spco  50s-year-old male with localized adenocarcinoma of   1.0000    1.0000
    1007  2  C0600139  neop  Prostate carcinoma              33   12   C0600139  neop  LOCALIZED PROSTATE CANCER.                           1.0000    1.0000
    1008  2  C0011900  fndg  Diagnosis                       32   12   C0376358  neop  LOCALIZED PROSTATE CANCER.                           0.3333    0.7172
    1008  2  C0392752  spco  Localized                       44   16   C0392752  spco  50s-year-old male with localized adenocarcinoma of   1.0000    1.0000
    1008  2  C2984325  ftcn  Prostate Cancer Pathway         42   14   C2984325  ftcn  LOCALIZED PROSTATE CANCER.                           1.0000    1.0000
    1009  3  C0332293  topp  Treated with                    523  24   C0444667  qnco  present for the entire procedure.                    0.0000    0.0000
    1009  3  C0035785  ocdi  Robotics                        17   8    C0035785  ocdi  ROBOTIC-ASSISTED LAPAROSCOPIC RADICAL PROSTATECTOMY  1.0000    1.0000
    1009  3  C0038894  bmod  Surgery specialty               9    6    C0038894  bmod  SURGERY DATE                                         1.0000    1.0000
    1010  3  C0332293  topp  Treated with                    522  24   C0450011  topp  present for the entire procedure.                    0.0000    0.0000
    1010  3  C0035785  ocdi  Robotics                        19   8    C0035785  ocdi  ROBOTIC-ASSISTED LAPAROSCOPIC RADICAL PROSTATECTOMY  1.0000    1.0000
    1010  3  C0038895  ftcn  Surgical aspects                11   6    C0038895  ftcn  SURGERY DATE                                         1.0000    1.0000
    1011  3  C0332293  topp  Treated with                    522  24   C0450011  topp  present for the entire procedure.                    0.0000    0.0000
    1011  3  C0035785  ocdi  Robotics                        19   8    C0035785  ocdi  ROBOTIC-ASSISTED LAPAROSCOPIC RADICAL PROSTATECTOMY  1.0000    1.0000
    1011  3  C0543467  diap  Operative Surgical Procedures   75   28   C0543467  diap  DESCRIPTION OF OPERATION                             1.0000    1.0000

    how, after filtering, the most significant reports deemed Not relevant (ochre) and Relevant (black) are displaced towards areas of lower and higher relevance respectively.

    These results highlight the need to restrict query expansion to specific semantic types only, which yields more accurate results at a lower computational cost (since similarity no longer has to be calculated for generic semantic types).

    7.2. Behavior of Path and Intrinsic IC-Path metrics

    To compare the performance (in terms of semantic similarity) of the Path and Intrinsic IC-Path metrics in a real-life context, we present a preliminary experiment on two search criteria. One is a simple topic, Topic 101 (Patients with hearing loss), applied to 4073 reports grouped in 249 visits. The other is a complex topic, Topic 104 (Patients diagnosed with localized prostate cancer and treated with robotic surgery), applied to 3439 reports grouped in 196 visits.

    The results obtained from applying the Path metric to a simple topic (Topic 101) show a discrete distribution, which follows from the metric's definition as the inverse of the distance (Fig. 4a). This creates zones of uncertainty, since some reports are located at similarity values between 0.45 and 0.50 (27 non-relevant reports and 9 relevant).
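
The discrete character follows directly from the definition. As a minimal sketch, assuming that the Path similarity of a pair of concepts is the reciprocal of a node-counting distance (distance 1 for the same concept, 2 for a synonym, as used in Section 7.3), the only attainable values are 1, 1/2, 1/3 and so on. The function name is illustrative and not part of the original system:

```python
def path_similarity(distance: int) -> float:
    """Path similarity as the inverse of the distance between two concepts.

    Assumption: the distance counts nodes, so 1 means "same concept".
    """
    if distance < 1:
        raise ValueError("distance must be a positive integer")
    return 1.0 / distance


# The attainable values form a sparse, discrete set.
values = [round(path_similarity(d), 4) for d in range(1, 6)]
print(values)  # [1.0, 0.5, 0.3333, 0.25, 0.2]
```

Under this assumption no single pair of concepts can score strictly between 0.5 and 1.0, which matches the 0.5000 and 0.3333 entries in the Path column of Tables 9 and 10.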

    In the case of the Intrinsic IC-Path metric, the internal nature of its calculation does away with this discrete character (Fig. 4b). The global results are similar to those of the Path metric, but they are distributed more smoothly and more evenly towards both extremes.

    Conversely, when processing complex topics (with multiple phrases) such as Topic 104, calculations based on aggregated averages of the maximum similarity values obtained (Section 6.3) counter the discrete character of the Path metric. Also, for both metrics, the similarity values of the reports tend to follow a normal distribution (Fig. 5a and b), which removes the previously mentioned discrepancies.

    7.3. Choosing the cut-off value

    From the report similarity distributions generated for each search criterion, as shown in the previous section (Figs. 4 and 5), we must establish a cut-off value to determine whether a report is relevant. Reports with an estimated similarity greater than or equal to that value are deemed relevant by the system, and the rest not relevant. By doing this with reports that experts have already assessed as Relevant or Not relevant for each topic, we can estimate the accuracy of the retrieval system proposed in this work.


    Fig. 3. (a) Histogram for Topic 107 without semantic type filtering. (b) Histogram for Topic 107 with semantic type filtering.

    Fig. 4. Path vs Intrinsic IC-Path for a simple search topic (Topic 101).

    Fig. 5. Path vs Intrinsic IC-Path for a complex search topic (Topic 104).

    As the previous section shows, it is easy to determine the cut-off point for simple topics, because their values are distributed towards the extremes. For complex searches, however, the decision is harder and also more critical for the performance of the system. To define the cut-off value, we therefore adhere to the following premises:

    - The value must be common to both metrics and lie between 0 and 1. It must be greater than 0.5, as this value represents a synonymy relationship between concepts under the Path metric, but is not sufficient in itself to establish relevance in complex searches.


    Fig. 6. Documents evaluated by the proposed system for Topic 104, using Path and Intrinsic IC-Path metrics.

    Table 11

    Final relevance values for the examples in Section 6.3.

    Path    Intrinsic IC-Path

    Sim_topic101vsreport90230 = 1.000 1.000

    Sim_topic104vsreport51139 = 0.7149 0.7783

    - It must show a balance in classifying documents by relevance; that is, the higher the cut-off value is, the more documents it will classify as not relevant, to the detriment of relevant results.

    From the stated premises, and for a simple maximum similarity matrix of a mere two concepts, a report would only be deemed relevant to a topic if the two concepts had a similarity value of 1.0 (distance equals 1, that is, the same concept) or 0.5 (distance equals 2, that is, a synonym). In a real-life context, with complex phrases made of multiple pairs of concepts, averaging all similarities introduces variance errors that distort the final results. For this reason, two additional requirements are needed to ensure the proper application of the average value: at least one of the pairs of concepts must have a similarity value of 1.0, and at most one of the pairs may have a value lower than 0.5 (values lower than 0.5 represent a distant synonymy between concepts). If these two additional criteria are not met, the report is deemed Not Relevant.

    Once this correction was applied, a test group of 1000 reports

    showed that all Relevant documents presented values equal to or

    greater than 0.6.

    For this reason, we have established a cut-off value of 0.6, the minimum value that meets all the requirements above.

    Hence, the final result of the maximum similarity matrix (Sim_topicvsreport) reflects the relevance of a report in relation to a topic in the following manner:

    - If the value of Sim_topicvsreport is within the range [0.0; 0.6), the report is Not Relevant to the topic.

    - If the value of Sim_topicvsreport is within the range [0.6; 1.0], the report is Relevant to the topic.
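
The decision rule, together with the two additional requirements stated earlier in this section, can be sketched as follows. This is a hedged illustration rather than the authors' implementation: the function names are hypothetical, the aggregate is assumed to be a plain arithmetic mean (the actual aggregation is the one defined in Section 6.3 and may differ), and the checks are applied to a single list of concept-pair similarities because the text does not state at which level (sub-phrase, phrase or report) they are enforced.

```python
CUT_OFF = 0.6  # cut-off value established in the text


def meets_additional_requirements(pair_sims):
    """At least one pair at 1.0, and at most one pair below 0.5."""
    has_exact_match = any(s == 1.0 for s in pair_sims)
    low_pairs = sum(1 for s in pair_sims if s < 0.5)
    return has_exact_match and low_pairs <= 1


def classify(pair_sims, cut_off=CUT_OFF):
    """Label one set of concept-pair similarities.

    Assumption: the aggregate is a plain mean of the pair similarities.
    """
    if not pair_sims or not meets_additional_requirements(pair_sims):
        return "Not Relevant"
    mean_sim = sum(pair_sims) / len(pair_sims)
    return "Relevant" if mean_sim >= cut_off else "Not Relevant"


print(classify([0.3333, 1.0, 1.0]))     # Relevant
print(classify([0.3333, 0.3333, 1.0]))  # Not Relevant: two pairs below 0.5
print(classify([0.5, 0.5, 0.5]))        # Not Relevant: no pair at 1.0
```

The first call uses the Path values of one sub-phrase from Table 10; its mean is about 0.78, which falls in the range [0.6; 1.0].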

    In this way, the examples shown in Section 6.3 (Tables 9 and 10) correspond to two reports that were deemed Relevant to the topics they were evaluated for under both metrics (Table 11).

    Using Topic 104 as an example, for the Path metric with the proposed cut-off value, we can observe (Fig. 6) how 9 reports assessed by experts as not relevant are tagged as relevant by the system, while 1 deemed relevant by the experts turns out as not relevant. In the case of the Intrinsic IC-Path metric, 4 reports tagged not relevant by experts are seen as relevant by the system, and 2 relevant as not relevant.

    Table 12

    Aggregated results.

    Path    Intrinsic IC-Path

    Recall 0.753 0.639

    Precision 0.364 0.392

    F-Measure 0.430 0.427

    All told, in the specific case of Topic 104, the results obtained for the Path metric are Precision = 44.4%, Recall = 88.9%, and F-Measure = 59.3%; for the Intrinsic IC-Path metric they are Precision = 63.6%, Recall = 77.8%, and F-Measure = 70.0%. In this case, the Intrinsic IC-Path metric performs better than the Path metric.
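
As an illustration of how such figures are derived from the counts of misclassified reports, the following sketch recomputes the Intrinsic IC-Path values for Topic 104. The counts of 4 false positives and 2 false negatives come from the text; the count of 7 true positives is not stated and is inferred here from the reported Recall of 77.8% (7 of 9), so it should be read as an assumption.

```python
def precision_recall_f(tp: int, fp: int, fn: int):
    """Standard Precision, Recall and F-Measure from confusion counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


# Intrinsic IC-Path on Topic 104: fp=4 and fn=2 from the text, tp=7 inferred.
p, r, f = precision_recall_f(tp=7, fp=4, fn=2)
print(f"{p:.1%} {r:.1%} {f:.1%}")  # 63.6% 77.8% 70.0%
```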

    7.4. Evaluation of Path and Intrinsic IC-Path metrics with the TREC dataset

    In this section, we evaluate the performance of the two metrics analyzed in our work (Path and Intrinsic IC-Path) in a real-life information retrieval scenario. To do so, we use the 35 topics (or search criteria) proposed in TREC 2011, with an information source of 101,712 reports (grouped into 17,265 visits).

    The metrics used in this evaluation are the standard ones in the

    field of information retrieval: Precision, Recall, and F-Measure. The

    latter is best at reflecting a balance between the first two, since it is

    defined as:

    $$\text{F-Measure} = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \tag{19}$$

    Table 12 shows the average of all the results obtained in the retrieval of relevant reports for each of the proposed search topics. As we can see, the F-Measure values of both metrics are very similar (Path = 0.430, Intrinsic IC-Path = 0.427), with a slight edge for the Path metric. Although these results suggest that Path is the superior metric in terms of Recall, with Intrinsic IC-Path having the upper hand in Precision, we cannot consider them conclusive, as both indicators are complementary.
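
One point worth keeping in mind when reading Table 12 is that an F-Measure averaged over topics is generally not the F-Measure of the averaged Precision and Recall. The sketch below shows the difference; the per-topic values are hypothetical and chosen only for illustration, while the last line uses the Path averages from Table 12. That the table averages per-topic F-Measures is an interpretation suggested by the numbers, not something the text states explicitly.

```python
from statistics import mean


def f_measure(precision: float, recall: float) -> float:
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


# Hypothetical per-topic (precision, recall) pairs, for illustration only.
per_topic = [(0.9, 0.9), (0.1, 0.9), (0.2, 0.4)]

# Average of the per-topic F-Measures (macro-average).
macro_f = mean(f_measure(p, r) for p, r in per_topic)

# F-Measure computed from the averaged Precision and Recall.
f_of_means = f_measure(
    mean(p for p, _ in per_topic),
    mean(r for _, r in per_topic),
)
print(round(macro_f, 3), round(f_of_means, 3))

# Path averages from Table 12: the F-Measure of the averaged Precision and
# Recall differs from the reported 0.430.
print(round(f_measure(0.364, 0.753), 3))  # 0.491
```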

    Digging deeper into the results, and analyzing their dispersal pattern, Figs. 7 and 8 show the detailed values of the previous indicators for all the search topics, and for both metrics studied in this work. These figures reveal the complexity of a number of topics (such as 116, 123, 124, 125, 126, 130, 133, or 134) for which the results, in terms of F-Measure, lie below 20% for both metrics; good examples illustrating the level of complexity of these topics are Topic 123 (Diabetic patients who received diabetic education in the hospital) or Topic 133 (Patients admitted for care who take herbal products for osteoarthritis).

    Fig. 7. Results using Path for the 35 topics. Recall, Precision, and F-Measure shown.

    Fig. 8. Results using Intrinsic IC-Path for the 35 topics. Recall, Precision, and F-Measure shown.

    Topics 123 and 134 produce a completely anomalous result, due to an error detected in the UMLS relationships for two particular concepts. These concepts, C0241863 Diabetic for Topic 123 and C1148454 Seizure activity for Topic 134, offer no similarity distance, and are particularly important for said topics.

    8. Conclusions

    The extraction of information through natural language processing in biomedical documents is both important and complex enough to deserve very particular attention. For this reason, many works have been published that address the matter by dealing with similarity metrics in a theoretical context, using the UMLS resource; however, none of them manages to fulfil the actual need for information retrieval from medical documents.

    It is for this reason that, in this paper, we have proposed a novel experimental study for assessing the performance of the Intrinsic IC-Path and Path metrics in a real-life context, that is, with real medical reports. Also, in order to perform that study, we have deployed an ad hoc framework to formalize the use of the UMLS Metathesaurus for the retrieval of medical information from these actual reports (TREC Medical Records Track 2011) through maximum semantic similarity matrices.

    The conclusion drawn from our work is that, in a real-life context, both assessed metrics display similar performance: Path (F-Measure = 0.430) and Intrinsic IC-Path (F-Measure = 0.427). Therefore, the variations in performance obtained in theoretical contexts disappear when the amount of data is increased and real visits and reports are used. These results do not justify the use of more complex metrics (with their associated high computational cost) such as the variations of the Path metric, Intrinsic IC-Path in this case. The explanation for these results lies in the fact that, unlike the comparison between isolated pairs of concepts conducted in previous works, the information contained within a report or topic is interrelated, extensive, and expressed in natural language.

    The results of this work are applicable to any similarity search process conducted on biomedical documents (patient histories, clinical reports, diagnostic tests like CT scans, X-rays, etc.) as long as they are contained in text files.

    Having determined that the improved performance of these similarity metrics has no impact in a real-life context, future work should improve the straightforward retrieval system we have proposed to perform this assessment. In this sense, it may prove beneficial to eliminate those sub-phrases within a topic which, although syntactically correct, are not semantically related to its meaning. Furthermore, the reports dealt with are frequently ambiguous, as they refer to disparate subjects (symptoms or illnesses) for the same patient, which makes automatic retrieval more difficult. It would be appropriate to filter or separate these documents so that each report covers one subject exclusively. By relating the report's subject more closely to the search topic, we could exclude secondary subjects from the results, which merely add noise and increase the computational cost of the query.

    References

    Alpi, K. M. (2005). Expert searching in public health. Journal of the Medical Library As-sociation, 93(1), 97103.

    Al-Mubaid, H., & Nguyen, H. (2006). A cluster-based approach for semantic similar-ity in the biomedical domain. In Engineering in Medicine and Biology Society, 2006.EMBS06. 28th Annual International Conference of the IEEE(pp. 27132717).

    Aronson, A. R. (2001). Effective mapping of biomedical text to the UMLS Metathe-saurus: The MetaMap program. In Proceedings of the American Medical Informatics

    Association Symposium 2001(pp. 1721).Aronson, A. R., & Lang, F. M. (2010). An overview of MetaMap: Historical perspective

    and recent advances.Journal of the American Medical Informatics Association, 17(3),229236.

    Aronson, A. R., & Rindflesch, T. C. (1997). Query expansion using the UMLS Metathe-saurus. In Proceedings of the American Medical Informatics Association Annual FallSymposium(p. 485).

    Babashzadeh, A., Huang, J., & Daoud, M. (2013, July). Exploiting semantics for improv-ing clinical information retrieval. In Proceedingsof the 36thinternational Association

    for Computing Machinerys Special Interest Group on Information Retrieval Confer-ence on Research and development in information retrieval (pp. 801804). ACM SIGIR2013.

    Batet, M., Snchez, D., & Valls, A. (2011). An ontology-based measure to compute se-mantic similarity in biomedicine. Journal of biomedical informatics, 44(1), 118125.

    Bhogal, J., Macfarlane, A., & Smith, P. (2007). A review of ontology based query expan-sion.Information processing & management, 43(4), 866886.

    Bodenreider, O. (2001). Circular hierarchical relationships in the UMLS: Etiology, di-agnosis, treatment, complications and prevention. In Proceedings of the AmericanMedical Informatics Association Symposium(p. 57).

    Bodenreider, O. (2004). The unified medical language system (UMLS): Integratingbiomedical terminology.Nucleic acids research, 32(suppl 1), D267D270.

    Bodenreider, O., & McCray, A. T. (2003). Exploring semantic groups through visual ap-proaches.Journal of biomedical informatics, 36(6), 414432.

    Burgun, A., & Bodenreider, O. (2001). Comparing terms, concepts and semantic classesin WordNet and the Unified Medical Language System. InProceedings of the North

    American Chapter of th e Association for Computational Linguistics 2001; WorkshopWordNet and Other Lexical Resources: Applications, Extensions and Customiza-tions (pp. 7782).

    Caviedes,J. E., & Cimino, J. J. (2004).Towards thedevelopment of a conceptual distance

    metric for the UMLS.Journal of biomedical informatics, 37(2), 7785.
