
Dependency Parsing for Relation Extraction in Biomedical Literature

Master Thesis in Computer Science

presented by Nicola Colic
Zurich, Switzerland

Immatriculation Number: 09-716-572

to the Institute of Computational Linguistics,
Department of Informatics at the University of Zurich

Supervisor: Prof. Dr. Martin Volk
Instructor: Dr. Fabio Rinaldi

submitted on the 20th of March, 2016


Abstract

This thesis describes the development of a system for the extraction of entities in biomedical literature, as well as their relationships with each other. We leverage efficient dependency parsers to provide fast relation extraction, in order for the system to be potentially able to process large collections of publications (such as PubMed) in useful time. The main contributions are the finding and integration of a suitable dependency parser, and the development of a system for creating and executing rules to find relations. For the evaluation of the system, a previously annotated corpus was further refined, and insights for the further development of this and similar systems are drawn.


Acknowledgements

I would like to thank Prof. Martin Volk for supervising the writing of this thesis, and especially my direct instructor Dr. Fabio Rinaldi for his never-ceasing help and motivation.

Contents

1 Introduction
  1.1 The Need for Biomedical Text Mining
  1.2 Related Work
    1.2.1 Named Entity Recognition
    1.2.2 Relation Extraction
  1.3 Beyond Automated Curation
  1.4 Importance of PubMed
  1.5 This Thesis

2 python-ontogene Pipeline
  2.1 OntoGene Pipeline
  2.2 python-ontogene
    2.2.1 Architecture of the System
    2.2.2 Configuration
    2.2.3 Backwards Compatibility
  2.3 Usage
  2.4 Module: Article
    2.4.1 Implementation
    2.4.2 Usage
    2.4.3 Export
  2.5 Module: File Import and Accessing PubMed
    2.5.1 Updating the PubMed Dump
    2.5.2 Downloading via the API
    2.5.3 Dealing with the large number of files
    2.5.4 Usage
  2.6 Module: Text Processing
    2.6.1 Usage
  2.7 Module: Entity Recognition
    2.7.1 Usage
  2.8 Evaluation
    2.8.1 Speed
    2.8.2 Accuracy
  2.9 Summary

3 Parsing
  3.1 Selection Process
    3.1.1 spaCy
    3.1.2 spaCy + Stanford POS tagger
    3.1.3 Stanford Parser
    3.1.4 Charniak-Johnson
    3.1.5 Malt Parser
  3.2 Evaluation
    3.2.1 Ease of Use and Documentation
    3.2.2 Evaluation of Speed
    3.2.3 Evaluation of Accuracy
    3.2.4 Prospective Benefits
    3.2.5 Selection
  3.3 Summary

4 Rule-Based Relation Extraction
  4.1 Design Considerations
  4.2 Implementation
    4.2.1 stanford_pos_to_db
    4.2.2 Database
    4.2.3 query_helper
    4.2.4 browse_db
  4.3 Data Set
    4.3.1 Conversion
    4.3.2 Categorization
    4.3.3 Development and Test Subsets
  4.4 Queries
    4.4.1 HYPHEN queries
    4.4.2 ACTIVE queries
    4.4.3 DEVELOP queries
  4.5 Summary
    4.5.1 Arity of Relations
    4.5.2 Query Development Insights
    4.5.3 Augmented Corpus

5 Evaluation
  5.1 Evaluation of epythemeus
    5.1.1 Query Evaluation
    5.1.2 Speed Evaluation and Effect of Indices
  5.2 Processing PubMed
    5.2.1 Test Set
    5.2.2 Timing
    5.2.3 Downloading PubMed
    5.2.4 Tagging and Parsing
    5.2.5 Running Queries
    5.2.6 Results
  5.3 Summary
    5.3.1 epythemeus
    5.3.2 Processing PubMed

6 Conclusion
  6.1 Our Contributions
    6.1.1 python-ontogene
    6.1.2 Parser Evaluation
    6.1.3 epythemeus
    6.1.4 Fragments
    6.1.5 Corpus
  6.2 Future Work
    6.2.1 Improving spaCy POS tagging
    6.2.2 Integration of spaCy and python-ontogene
    6.2.3 Improvements for epythemeus
    6.2.4 Evaluation Methods
  6.3 Processing PubMed

Chapter 1

Introduction

1.1 The Need for Biomedical Text Mining

One of the defining factors of our time is an unprecedented growth in information, and resulting from this is the challenge of information overload both in the personal and professional space.

Independent of the respective domain, recent years have seen a shift in focus from information retrieval to information extraction. That means, rather than attempting to bring the right document containing relevant information to the user, research is now concerned with processing and extracting specific information contained in unstructured text [1].

This holds particularly true in the biomedical domain, where the rate at which biomedical papers are published is ever increasing, leading to what Hunter and Cohen call literature overload [16]. In their 2006 paper, they show that the number of articles published on PubMed, the largest collection of biomedical publications, is growing at a double-exponential rate.

Because of this, biology researchers need to rely on manually curated databases that list information relevant to their research in order to stay up-to-date. From PubMed, information is manually curated, that is, human experts compile key facts of publications into dedicated databases. This process of curation is expensive and labor intensive, and causes a substantial time lag between publication and appearance of its key information in the respective database [31]. Curators, too, struggle to cope with the number of papers published, and thus need to turn to automated processing, that is, biomedical text mining.



However, the field of biomedical text mining is not limited to aiding or automating the curation of databases. It covers a variety of applications ranging from simpler information extraction to question answering to literature-based discovery. Generally speaking, it is concerned with the discovery of facts, as well as the associations between them, in unstructured text. These associations can be explicit or implicit. As Simpson [33] notes, advances in biomedical text mining can help prevent or alter the course of many diseases, and are thus not only of relevance to professional researchers, but also benefit the general public. Furthermore, they rely on the combined efforts of both experts in the biomedical domain and computational linguists.

We describe the different applications of biomedical text mining below.

1.2 Related Work

Simpson [33] attributes much of the development in the field to community-wide evaluations and shared tasks such as BioCreative [15] and BioNLP [18]. Such shared tasks focus on different aspects of biomedical text mining: named entity recognition (NER) and relation extraction are the main tasks, which are briefly discussed here.

1.2.1 Named Entity Recognition

In NER, biological and medical terms are identified and marked in an unstructured text. Examples of such entities include proteins, drugs or diseases, or any other semantically well-defined data. This task is often coupled with assigning each found entity a unique identifier, called entity normalization.

Named entity recognition is particularly difficult in the biomedical domain given the constant discovery of new concepts and entities. Because of this, approaches that utilize a dictionary containing known entities need to take extraordinary measures to keep their dictionaries up to date with current research, mirroring the problem of database curation described above. In spite of this, dictionary-based methods can achieve favorable results [22]. In particular, dictionaries can automatically be generated from pre-existing ontologies [11], making them easier to maintain and keep up-to-date.

Other approaches to NER are rule-based, exploiting patterns in protein names [12], for example, or statistical, in which features such as word sequences or part-of-speech tags are used by machine learning algorithms to infer occurrences of a named entity [14].

The related task of entity normalization is made difficult by the fact that there is often no universal accord on the preferred name of a specific entity. Particularly with protein and gene names, variations can come down to the authors' personal preference. Another complication is abbreviations, which are largely context-dependent: the same abbreviation can refer to very different entities in different contexts. However, as Zweigenbaum et al. note, problems such as this can essentially be considered solved [42].

1.2.2 Relation Extraction

The goal of relation extraction is to extract interactions between entities. In the biomedical domain, extracting drug-drug interactions [32], chemical-disease relations (CDR) [39] or protein-protein interactions (PPI) [26] are particularly relevant examples. However, these are highly specialized problems, and require specialized methods of relation extraction.

Simpson [33] distinguishes between relation extraction and event extraction: relation extraction is defined as finding binary associations between entities, whereas event extraction is concerned with more complex associations between an arbitrary number of entities.

The simplest approach to extracting relations relies on statistical evaluation of the co-occurrence of entities. A second class of approaches is rule-based. The rules used by these approaches are either created manually by experts, or stem from automated analysis of annotated texts. Simpson [33] notes that co-occurrence approaches commonly exhibit high recall and low precision, while rule-based approaches typically demonstrate high precision and low recall. A third class of approaches uses machine learning to directly identify relations, using a variety of features. These approaches can be used for both relation and event extraction. For both rule-based and machine learning approaches, syntactic information is an invaluable feature [37] [2]. In particular, dependency-based representations of the syntactic structure of a sentence have proven to be particularly useful for text mining purposes.
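To make this concrete, the following is an illustrative sketch, not the system developed in this thesis, of how a dependency parse supports rule-based extraction. It is written against a recent spaCy release; the model name and the simple subject-verb-object rule are our own assumptions:

    # Extract simple subject-verb-object triples from a dependency parse.
    import spacy

    nlp = spacy.load('en_core_web_sm')  # assumes this model is installed

    def svo_triples(sentence):
        doc = nlp(sentence)
        triples = []
        for token in doc:
            if token.pos_ == 'VERB':
                subjects = [c for c in token.children if c.dep_ in ('nsubj', 'nsubjpass')]
                objects = [c for c in token.children if c.dep_ == 'dobj']
                for subj in subjects:
                    for obj in objects:
                        triples.append((subj.text, token.lemma_, obj.text))
        return triples

    print(svo_triples('Presenilin controls kinesin-1 and dynein function.'))

A rule-based system generalizes this idea: instead of one hard-coded pattern, it matches many hand-written patterns over the dependency graph.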

Given that the approach used in this thesis is rule-based and capable of extracting both binary and more complex relations, we will not adhere to the distinction between relation and event extraction, and use the term relation extraction for both problems.


1.3 Beyond Automated Curation

More complex applications of biomedical text mining are summarization, question answering and literature-based discovery.

In summarization, the goal is to extract important facts or passages from a single document or a collection of documents, and represent them in a concise fashion. This is particularly relevant in the light of the aforementioned literature overload. The approaches employed here either identify representative components of the articles using statistical methods, or extract important facts and use them to generate summaries.

Question answering aims at providing precise answers, rather than relevant documents, to natural language queries. This relies on natural language processing of the user-supplied queries on the one hand, and on the processing of a collection of documents potentially containing the answer on the other. For the latter, named entity recognition and relation extraction are of major importance.

The problems described above are all concerned with the finding of facts explicitly stated in biomedical literature. Literature-based discovery, however, aims at revealing implicit facts, namely, relations that have not previously been discovered. Swanson was one of the first to explore this research field. The following is a simplified account of his definition from 1991: given two related scientific communities that do not communicate, where one community establishes a relationship between A and B, and the other the relationship between B and C, infer the relation of A and C [35]. As Simpson [33] explains, recent systems use semantic information to uncover such A-C relations, and thus build heavily on relation extraction.

With the rapid growth of publications, the aspect of literature-based discovery becomes more important. The body of publications is quickly becoming too vast to be manually processed, and research communities find it impossible to keep up-to-date with their peers, leading to disjoint scientific communities. Literature-based discovery thus holds the promise of leveraging the unprecedented scale of publications and contributing to advancements of biomedical science that would not otherwise be possible.


1.4 Importance of PubMed

MEDLINE is the largest database containing articles from the biomedical domain, and is maintained by the US National Library of Medicine (NLM). Currently, it contains more than 25 million articles published from 1946^1, and as Hunter and Cohen note, it is growing at an extraordinary pace [16]. Between 2011 and today, the number of articles it contains has more than doubled.

The abstracts of MEDLINE can be freely accessed and downloaded via PubMed^2, making it one of the most important resources for biomedical text mining [42] [33]. Furthermore, thanks to the National Institutes of Health (NIH)-issued Policy on Enhancing Public Access to Archived Publications Resulting From NIH-Funded Research from 2005, more than 3.8 million full-text articles can now be freely downloaded from PubMedCentral^3. The goal of this endeavor, as stated by the NLM, is as follows: "To integrate the literature with a variety of other information resources such as sequence databases and other factual databases that are available to scientists, clinicians and everyone else interested in the life sciences. The intentional and serendipitous discoveries that such links might foster excite us and stimulate us to move forward" [16].

In the course of this thesis, we will focus on the article abstracts available via PubMed. Given its importance and size, we conduct our efforts with the processing of the entire PubMed database in mind.

1.5 This Thesis

With such a large corpus of freely available biomedical texts, the efficiency of biomedical text mining becomes increasingly more important: text mining systems need to be able to cope with rapidly growing collections of text, and in order to be relevant and timely, need to do so in an efficient manner.

The goal of this thesis is to explore how relation extraction can be efficiently performed using dependency parsing. Recent technological advances make dependency parsing computationally cheap, and as explained in Sections 1.2.2 and 1.3, it lies at the core of many other aspects of biomedical text mining. We explore how to leverage this availability of efficient dependency parsing, especially in regard to processing the entire PubMed.

We first describe our pipeline for named entity recognition and part-of-speech tagging in Chapter 2. We expand on this previous work by finding an accurate and efficient dependency parser in Chapter 3. These results are then used to develop a new, independent system aimed at exploiting dependency parse information to find relations using manually written rules (Chapter 4). This system uses a novel way of creating rules, and is evaluated against a manually annotated corpus in Chapter 5. Furthermore, we give an estimate of the time it would take to process the entire PubMed and search it for relations using our approach. In Chapter 6 we draw some conclusions from the results of this research.

^1 https://www.nlm.nih.gov/pubs/factsheets/medline.html
^2 https://www.ncbi.nlm.nih.gov/pubmed/
^3 http://www.ncbi.nlm.nih.gov/pmc/

Chapter 2

python-ontogene Pipeline

This chapter describes the development of a new text processing pipeline that performs tokenization, tagging and named entity recognition, building on previous work by Rinaldi et al. [28] [29] [30], and their OntoGene pipeline in particular.

2.1 OntoGene Pipeline

The OntoGene system is a pipeline that is patched together from different modules written in different programming languages, which communicate with each other via files. Each module takes a file as input, and produces a file, typically in a predefined XML format (called OntoGene XML). The subsequent module will then read the files produced by the antecedent modules. These different modules are coordinated by bash scripts. This is inefficient for two reasons:

1. Every module needs to parse the preceding module's output. The repeated accessing of the disk for reading and writing considerably slows down processing.

2. Usage of the pipeline is not easy for new users, since the different modules are written in different languages, and since there is no centralized documentation.

The low processing speed described in point 1 makes it impossible to process larger collections of text, such as PubMed. Because of that, there is demand for a streamlined pipeline.



2.2 python-ontogene

Consequently, the existing OntoGene pipeline was rewritten in python3, with particular focus placed on reducing communication between modules via files. This accelerates processing, and makes the processing of the entire PubMed possible. Furthermore, the new pipeline has consistent documentation, and is hence easier for the user to understand.

The pipeline is currently developed up to the point of entity recognition, and can be found online^1 or in the python-ontogene directory that accompanies this thesis.

2.2.1 Architecture of the System

The python-ontogene pipeline is composed of several independent modules, which are coordinated by a control script. The main mode of communication between the modules is via objects of a custom Article class, which mimics an XML structure. All modules read and return objects of this class, which ensures independence of the modules.

The modules are coordinated via a control script written in python, which passes the various Article objects produced by the modules to the subsequent modules.
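As an illustration of this control flow, consider the following sketch. The function signature is our own assumption; tokenize() and recognize_entities() are the interfaces described later in this chapter:

    # Hedged sketch of the control-script pattern: each module consumes and
    # returns Article objects, so no intermediate files are written.
    def run_pipeline(pmid, importer, tokenizer, entity_recognizer):
        article = importer(pmid)                       # e.g. pubmed_import
        article.tokenize(tokenizer=tokenizer)          # see Section 2.6
        article.recognize_entities(entity_recognizer)  # see Section 2.7
        return article                                 # stays in memory throughout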

2.2.2 Configuration

All variables relevant to the pipeline (such as input files, output directories and bookkeeping parameters) are stored in a single file, which is read by the control script. The control script then supplies the relevant arguments read from the configuration file to the individual modules. This ensures that the user only has to edit a single file, while at the same time keeping the modules independent.
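A minimal sketch of this pattern follows; the file name and keys are illustrative assumptions, not the pipeline's actual configuration format:

    # control.py -- read one central configuration file and pass values on
    import configparser

    config = configparser.ConfigParser()
    config.read('pipeline.ini')  # hypothetical file name

    input_directory = config['paths']['input_directory']
    output_directory = config['paths']['output_directory']
    # ... hand these values to the individual modules as arguments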

2.2.3 Backwards Compatibility

In order to preserve compatibility with the existing pipeline (see Section 2.1), the Article objects can be exported to the OntoGene XML format at various stages of processing.

^1 https://gitlab.cl.uzh.ch/colic/python-ontogene


Figure 2.1: The architecture of the python-ontogene pipeline

2.3 Usage

The exact usage of the individual modules is described in the subsequent sections. Furthermore, the in-file documentation in the example.py file can serve to provide a more concrete idea of how to use the pipeline.

2.4 Module: Article

The article module is a collection of various classes, such as Token, Sentence and Section. The classes are hierarchically organized (e.g. an Article has Sections), but kept flexible to allow for future variations in the structure. Each class offers methods particularly suited to dealing with its contents, such as writing to file or performing further processing.


However, in order to keep the pipeline flexible, the article class relies on other modules to perform tasks such as tokenization or entity recognition. While this leads to coupling between the modules, it also allows for easy replacement of modules. For example, if the tokenizer that is currently used needs replacing, it is easy to simply supply a new tokenization module to the Article object to perform tokenization.

2.4.1 Implementation

Currently, there are the following classes, all of which implement an abstract Unit class: Article, Section, Sentence, Token and Term. Each of these classes has a subelements list, which contains objects of other classes. In this fashion, a tree-like structure is built, in which an Article object has a subelements list of Sections, which each have a subelements list of Sentences, and so on.

The abstract Unit class implements, amongst others, the get_subelement() function, which will traverse the object's subelements list recursively until the elements of the type given as argument have been found. In this fashion, the data structure is kept flexible for future changes. For example, Articles may be gathered in Collections, or Sections might contain Paragraphs.
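A minimal sketch of this traversal; only the class, list and function names follow the text, the implementation details are our own assumption:

    # Recursive lookup of all subelements of a given type.
    class Unit:
        def __init__(self):
            self.subelements = []

        def get_subelement(self, cls):
            found = []
            for element in self.subelements:
                if isinstance(element, cls):
                    found.append(element)
                else:
                    found.extend(element.get_subelement(cls))
            return found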

As for tokenization, the Article class expects the tokenize() function to be called with a tokenizer object as argument. This tokenizer object needs to implement the following two functions: tokenize_sentences() and tokenize_words(). The first function is expected to return a list of strings; the second one to return a list of tuples, which store the token text as well as its start and end position in the text.
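Any object satisfying this small protocol can therefore be plugged in. A naive illustration of the protocol (the real pipeline wraps NLTK for this, see Section 2.6):

    # A toy tokenizer implementing the two required functions.
    class NaiveTokenizer:
        def tokenize_sentences(self, text):
            return [s.strip() + '.' for s in text.split('.') if s.strip()]

        def tokenize_words(self, text):
            tokens, position = [], 0
            for word in text.split():
                start = text.index(word, position)
                end = start + len(word)
                tokens.append((word, start, end))  # (text, start, end)
                position = end
            return tokens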

Finally, the Article class implements functions such as add_section() and add_term(), which internally create the corresponding objects. This is done so that other modules only need to import the Article class, which in turn will take care of accessing and creating other classes.

2.4.2 Usage

The example below will create an Article object, manually add a Section with some text, tokenize it, and print it to console and file.

    import article

    my_article = article.Article('12345678')  # constructor needs ID
    my_article.add_section('S1', 'abstract', 'this is an example text')
    my_article.tokenize()
    print(my_article)
    my_article.print_xml('path/to/output_file.xml')

2.4.3 Export

At the time of writing, the Article class implements a print_xml() function, which allows exporting the data structure to a file. This function in turn recursively calls an xml() function on the elements of the data structure. This way, it is the responsibility of the respective class to implement the xml() function.

The goal of this function is to export the Article object in its current state of processing. For example, if no tokenization has yet taken place, it will not try to export tokens. This, however, requires much processing logic. Because of this, this function and the related functions need to be updated as the pipeline is updated.
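The recursive scheme might look roughly as follows; the ElementTree usage is our own assumption, only the class and method names follow the text:

    # Sketch of the recursive XML export: every Unit subclass builds its own
    # element and delegates to its subelements.
    import xml.etree.ElementTree as ET

    class Unit:
        tag = 'unit'  # overridden by subclasses, e.g. 'article', 'section'

        def __init__(self):
            self.subelements = []

        def xml(self):
            node = ET.Element(self.tag)
            for sub in self.subelements:
                node.append(sub.xml())  # each class implements its own xml()
            return node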

Pickling

To store and load Article objects without exporting them to a specific format, the Article class implements the pickle() and unpickle() functions. These allow dumping the current Article object as a pickle file, and restoring a previously pickled Article object.

    import article

    # create Article object
    my_article = article.Article('12345678')

    my_article.pickle('path/to/pickle')
    new_article = article.Article.unpickle('path/to/pickle')


Exporting Entities

The Article class implements a print_entities_xml() function, which exports the found entities to an XML file. As with the general export function, the XML file is built recursively by calling an entity_xml() function on the Entity objects that are linked to the Article.

2.5 Module: File Import and Accessing PubMed

This module allows importing texts from files, or downloading them from PubMed, and converts them into the Article format discussed above. From there, they can be handed to the other modules and exported to XML.

There are three ways in which PubMed can be accessed:

• PubMed dump. After applying for a free licence, the whole of PubMed can be downloaded as a collection of around 700 .xml.gz files, each of which contains about 30,000 PubMed articles. This dump is updated once a year (in November / December).

• API. This allows the individual downloading of PubMed articles given their ID. If the entrez library is used, PubMed returns XML; if the BioPython library is used, PubMed returns python objects. However, PubMed enforces a throttling of download speed in order to prevent overloading their systems: if more than three articles are downloaded per second, the user risks being denied further access to PubMed via the API (see the download sketch after this list).

• BioC. For the BioCreative V: Task 3 challenge, participants are supplied with data in BioC format. BioC is an XML format tailored towards representing annotations in the biomedical domain [8].
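To respect the rate limit mentioned above, a downloader can simply pause between requests. The following is a hedged sketch: the pacing logic and the placeholder e-mail address are our own, while Bio.Entrez is the Biopython interface mentioned in the text:

    # Throttled downloading of abstracts via the Entrez API (illustrative only).
    import time
    from Bio import Entrez

    Entrez.email = 'you@example.org'  # placeholder; NCBI requires a contact address

    def fetch_abstracts(pmids, per_second=3):
        delay = 1.0 / per_second
        records = []
        for pmid in pmids:
            handle = Entrez.efetch(db='pubmed', id=str(pmid), retmode='xml')
            records.append(handle.read())
            handle.close()
            time.sleep(delay)  # stay below three requests per second
        return records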

2.5.1 Updating the PubMed Dump

Since the PubMed dump is only updated once per year, additional articles published throughout the year need to be downloaded manually using the API.

This takes substantial effort: between the last publication of the PubMed dump in December 2014 and August 1st 2015, 800,000 new articles were published. Given the aforementioned limitations on download speed, this takes about 3 days to download using the API (at three articles per second, 800,000 articles amount to roughly 267,000 seconds of downloading).

2.5.2 Downloading via the API

In order to prevent repeated downloads from PubMed, the module keeps a copy of downloaded articles as python pickle objects.

2.5.3 Dealing with the large number of files

Since the pipeline operates on the basis of single articles, the PubMed dump was converted into multiple files, each of which corresponds to one article. However, most file systems, such as FAT32 and ext2, cannot cope with 25 million files in one directory. Because of this, the following structure was chosen:

Every article has a PubMed ID with up to 8 digits. If shorter, the IDs are padded from the left with zeros. All articles are then grouped by their first 4 digits into directories, resulting in up to 10,000 folders with up to 10,000 files each. For example, the file with ID 12345678 would reside in the directory 1234.
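A direct transcription of this rule into python; the root directory name and file extension are assumptions for illustration:

    import os

    def article_path(pmid, root='pubmed_dump'):
        padded = str(pmid).zfill(8)  # pad to 8 digits from the left
        return os.path.join(root, padded[:4], padded + '.xml')

    print(article_path(12345678))  # pubmed_dump/1234/12345678.xml
    print(article_path(123456))    # pubmed_dump/0012/00123456.xml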

However, different solutions for efficiently dealing with the large number of files could be explored in the continuation of this project. Databases in particular, such as NoSQL stores, which are inherently suited to large data sets, seem promising.

2.5.4 Usage

The following code snippet demonstrates how to import from file and from PubMed. The import_file module allows specifying a directory rather than a path. In that case, it will load all files in the directory and convert them to Article objects.

    from text_import.pubmed_import import pubmed_import

    article = pubmed_import('12345678', '[email protected]')
    # email can be omitted if file has already been downloaded
    # if file was downloaded before, the module will load it from the local dump_directory
    article.print_xml('path/to/file')

    from text_import.file_import import import_file

    articles = import_file('/path/to/directory/or/file.txt')
    # always returns a list
    for article in articles:
        print(article)

2.6 Module: Text Processing

This module wraps around the NLTK library to make sentence splitting, tokenization and part-of-speech tagging usable by the Article class. This module can be swapped out for a different one in the future, provided the functions tokenize_sentences(), tokenize_words() and pos_tag() are implemented.

2.6.1 Usage

Since NLTK offers several tokenizers based on different modules, and allows training your own models, this wrapper requires you to specify which model you want to use. The config module gives convenient ways to do this.

    from config.config import Configuration
    from text_processing.text_processing import Text_processing as tp

    my_config = Configuration()
    my_tp = tp(word_tokenizer=my_config.word_tokenizer_object,
               sentence_tokenizer=my_config.sentence_tokenizer_object)

    for pmid, article in pubmed_articles.items():
        article.tokenize(tokenizer=my_tp)


2.7 Module: Entity Recognition

This module implements a dictionary-based entity recognition algorithm. In this approach, a list of known entities is used to find entities in a text. This approach is not without limitations: notably, considerable effort must be undertaken to keep the dictionary up-to-date in order to find newly discovered entities, and entities not previously described cannot be found [41].

We alleviate this problem by using an approach put forth by Ellendorf et al. [11]. Here, a dictionary is automatically generated drawing from a variety of different ontologies. Their approach also helps to take into consideration the problem of homonymy as described by [21], by mapping every term to an internal concept ID and to the ID of the respective origin databases.

We opted for this approach in order to deliver a fast solution able to cope with large amounts of data. This aspect has so far received little attention in the field.
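To make the matching step concrete, here is a sketch of dictionary-based matching under our own assumptions; it is not the actual OntoGene term matcher. Note that recording the longest match at every start position yields nested matches such as rheumatoid arthritis and arthritis, a behavior discussed in Section 2.8.2:

    # Dictionary lookup over token sequences; term_dict maps lower-cased token
    # tuples to concept IDs, tokens are (text, start, end) triples.
    def find_entities(tokens, term_dict, max_len=5):
        entities = []
        for i in range(len(tokens)):
            for n in range(min(max_len, len(tokens) - i), 0, -1):
                candidate = tuple(t[0].lower() for t in tokens[i:i + n])
                if candidate in term_dict:
                    start, end = tokens[i][1], tokens[i + n - 1][2]
                    entities.append((start, end, term_dict[candidate]))
                    break  # keep only the longest match starting at i
        return entities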

2.7.1 Usage

The user first needs to instantiate an Entity Recognition object, which will hold the entity list in memory. This object is then passed to the recognize_entities() function of the Article object, which will then use the Entity Recognition object to find entities. While this is slightly convoluted, it ensures that different entity recognition approaches can be used in conjunction with the Article class.

When creating the Entity Recognition object, the user needs to supply an entity list as discussed above, and a Tokenizer object. The Tokenizer object is used to tokenize multi-word entries in the entity list. The tokenization applied here should be the same as the one used to tokenize the articles; the config module ensures this.

    from config.config import Configuration
    from text_processing.text_processing import Text_processing as tp
    from entity_recognition.entity_recognition import Entity_recognition as er

    my_config = Configuration()
    my_tp = tp(word_tokenizer=my_config.word_tokenizer_object,
               sentence_tokenizer=my_config.sentence_tokenizer_object)
    my_er = er(my_config.termlist_file_absolute,
               my_config.termlist_format, word_tokenizer=my_tp)

    # create tokenised Article object

    my_article.recognize_entities(my_er)
    my_article.print_entities_xml('output/file/path', pretty_print=True)

2.8 Evaluation

Two factors have been evaluated: speed and accuracy of named entity recognition.

2.8.1 Speed

Both the existing OntoGene pipeline and the new python-ontogene pipeline were run on the same machine on the same data set, and their running time was measured using the Unix time command. The test data set consists of 9559 randomly selected text files, each containing the abstract of a PubMed article. References to the test set can be found in the data/pythonontogene_comparison directory.

The Unix time command returns three values: real, user and system. real time refers to the so-called wall clock time, that is, the time that has actually passed during the execution of the command. user and system refer to the period of time during which the CPU was engaged in the respective mode. For example, system calls will add to the system time, but normal user mode programs to the user time. Table 2.1 lists the measured results.

Table 2.1: Speed evaluation for OntoGene and python-ontogene pipelines

    pipeline          real         user + system           s / article
    OntoGene          37m5.153s    59 323s (16.5 hours)    6.206
    python-ontogene   21m22.359s   1 280s (0.36 hours)     0.133

Note that the OntoGene pipeline is explicitly parallelized: because of this, its real time is relatively low. The python-ontogene pipeline is not explicitly parallelized. However, this could be the subject of future development, resulting in an even faster real time.

2.8.2 Accuracy

To compare the results of named entity recognition of both pipelines, testing was done on the same test data set of 9559 files as above. The data set contains both chemical named entities as well as diseases, which are listed separately in the evaluation below.

A testing script compares the entities found by one pipeline against a gold standard. Here, we used the output of the old OntoGene pipeline as the gold standard. The test script requires the input to be in BioC format; because of this, the output of both pipelines was first converted to BioC format. The test scripts can be found in the accompanying data/pythonontogene_comparison directory.
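In essence, such a comparison reduces to set operations over annotated spans. A minimal sketch under our own assumptions (not the actual test script):

    # Document-level scoring: entities are compared as (start, end, type) spans.
    def score(gold, predicted):
        tp = len(gold & predicted)
        fp = len(predicted - gold)
        fn = len(gold - predicted)
        precision = tp / (tp + fp) if predicted else 0.0
        recall = tp / (tp + fn) if gold else 0.0
        return tp, fp, fn, precision, recall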

The script calculates TP, FP and FN, as well as precision and recall values, on a per-document basis, as well as average values for the entire data set. Table 2.2 lists the results returned by the evaluation script:

Table 2.2: Evaluation of python-ontogene against OntoGene NER

    Entity Type   Precision   Recall   F-Score
    Chemical      0.835       0.865    0.850
    Disease       0.946       0.826    0.882
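For reference, the F-score reported here is the harmonic mean of precision and recall, F = 2PR / (P + R); for the chemical entities, for instance, 2 · 0.835 · 0.865 / (0.835 + 0.865) ≈ 0.850.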

Note that precision and recall are measured against the output of the OntoGene pipeline. This means that true positives found by the new python-ontogene pipeline that the old pipeline did not find are treated as false positives by the evaluation script. In table 2.3 we list some examples of differences between what the two pipelines produce.

As example 15552512 in table 2.3 shows, the new pipeline lists many entities several times, due to them having several entries with different IDs in the term list. While this can be useful, future development should allow for this behavior to be optional.

Simpson [33] reports that community-wide evaluations have demonstrated that NER systems are typically capable of achieving favorable results. While the values obtained above cannot be directly compared, systems in the BioCreative gene mention recognition tasks were able to obtain F-scores between 0.83 and 0.87 [34].


2.9 Summary

In this chapter, we presented an efficient pipeline for tokenization, POS tagging and named entity recognition, which focuses on modularity and well-documented code. In the rest of this dissertation, we describe the finding of a dependency parser to be included as a module. The modular nature of the python-ontogene pipeline should make the inclusion of new modules easy, as well as facilitate the use of different modules for POS tagging, for example.

We especially note the considerable improvements in speed as shown in table 2.1. The new python-ontogene pipeline runs approximately 46 times faster than the old OntoGene pipeline, making it a promising starting point for future developments.


Table 2.3: Differences in NER between the two pipelines

PMID 15552511
Original text: "These indices include 3 types of measures, which are derived from a health professional [joint counts, global]; a laboratory [erythrocyte sedimentation rate (ESR) and C-reactive protein (CRP)]; or a patient questionnaire [physical function, pain, global]."
Comment: Here the new pipeline doesn't mark C-reactive protein as an entity, but the old one does (False Negative). This is probably due to different tokenization in regards to parentheses.

PMID 15552512
Original text: "Patient-derived measures have been increasingly recognized as a valuable means for monitoring patients with rheumatoid arthritis."
Comment: The new pipeline lists both rheumatoid arthritis as well as arthritis as entities in separate entries. This behavior is quite common: the new pipeline will try to match as many entities as possible. Other examples include tumor necrosis and necrosis (in article 15552517). This behavior makes the python-ontogene pipeline more robust.

PMID 15552518
Original text: "It is now accepted that rheumatoid arthritis is not a benign disease."
Comment: Here, the old pipeline marks not a as an entity, and lists it with the preferred form 1,4,7-triazacyclononane-N,N',N''-triacetic acid. This is obviously a mistake, which the new pipeline does not make, attributed to the quality of the dictionary used.

Chapter 3

Parsing

This chapter describes the process of finding a suitable parser to be integrated into the python-ontogene pipeline, and to be used as the basis for relation extraction. We evaluate a set of different dependency parsers in terms of speed, ease of use, and accuracy, and select the most promising parser.

3.1 Selection Process

The 2009 BioNLP shared task was concerned with event extraction, and Kim et al. list the parsers that were used in this challenge [17]. Building on this work, Kong et al. recently evaluated 8 parsers for speed and accuracy [20]. To our knowledge, this is the most recent and substantiated evaluation of parsers. Based on their findings, we selected a set of parsers for our own evaluation. We included only parsers for which a readily available implementation exists, and which performed above average in the respective evaluation above.

Recall that the python-ontogene pipeline is entirely written in python3 and aims at reducing time lost reading and writing to disk by keeping as much communication between modules in memory as possible. In trying to maintain this advantage, we narrow our selection further by only choosing parsers that are either written in python or have already been interfaced for python.

Given the considerations described above, the following dependency parsers were selected for further evaluation.

• Stanford parser, as it was described as the state-of-the-art parser by Kim et al. as well as Kong et al., and has recently been updated.

• Charniak-Johnson (also known as BLLIP or Brown reranking parser), as it was the most accurate parser in Kong et al.'s study mentioned above.

• Malt parser, as it performed fastest in the above evaluation when using its Stack-Projective algorithm.

Furthermore, we also include spaCy, a dependency parser written entirely in python3 that, to our knowledge, has not yet been the subject of scientific evaluation.

Except for spaCy, all parsers mentioned above are written in languages other than python, but claim to offer python interfaces.

3.1.1 spaCy

spaCy^1 is a library including a dependency parser written entirely in python3, with a focus on good documentation and use in production systems; it is published under the MIT license. To our knowledge, there are no publications that evaluate its performance; however, the developer self-reports on the project's website^2 that the parser outperforms the Stanford parser in terms of accuracy and speed. For our tests, we used version v0.100.

spaCy attempts to achieve high performance through the fact that the user interfaces are written in python, but the actual algorithms are written in cython. cython is a programming language and a compiler that aims at providing C's optimized performance and python's ease of use simultaneously [3].

spaCy also provides tokenization and POS tagging models trained on the OntoNotes 5 corpus^3. The Universal POS Tag set the tagger maps to is described in [25], and the dependency parsing annotation scheme in [7].

3.1.2 spaCy + Stanford POS tagger

Our preliminary evaluation, however, showed that the POS tagger the spaCy library provides does not perform well on biomedical texts, and thus affects the accuracy of the dependency parser. We found that the results of spaCy's dependency parser can be improved when it is used in conjunction with a more accurate POS tagger. For part-of-speech tagging, we thus employed the widely-used Stanford POS tagger 3.6.0^4 with the pre-trained english-left3words-distsim.tagger model, which is the model recommended by the developers^5. The results obtained by combining spaCy and the Stanford POS tagger are included in the evaluation below.

^1 https://spacy.io/
^2 https://spacy.io/blog/parsing-english-in-python
^3 https://catalog.ldc.upenn.edu/LDC2013T19

3.1.3 Stanford Parser

The Stanford parsing suite^6 is a collection of different parsers written in Java. The parsers annotate according to the Universal Dependency scheme^7 or to the older Stanford dependencies described in [9].

In our tests, we used version 3.5.2. It was tested using the englishPCFG parser (see [19]), which is the default setting.

3.1.4 Charniak-Johnson

The most recent release (4.12.2015) of the implementation of the Charniak-Johnson parser^8 was originally described in [6]. The parser is written in C++, and suffers from two major shortcomings:

1. It does not compile under OS X

2. It does not perform sentence splitting, but requires the input to be already split into sentences.

Because of point 1, we conducted our tests for this parser on a 2.6GHz Intel Xeon E5-2670 machine running Ubuntu 14.04.3 LTS. Note that all other parsers were tested on a different machine running OS X. Given this difference, and because all other parsers perform sentence splitting themselves, the results obtained for the Charniak-Johnson parser cannot be directly compared.

^4 http://nlp.stanford.edu/software/tagger.shtml
^5 http://nlp.stanford.edu/software/pos-tagger-faq.shtml#h
^6 http://nlp.stanford.edu/software/lex-parser.shtml
^7 http://universaldependencies.github.io/docs/
^8 https://github.com/BLLIP/bllip-parser


3.1.5 Malt Parser

The MaltParser was first described in [24] and is written in Java. Version 1.8.1 of the MaltParser^9 requires the input to be already tagged with the Penn Treebank PoS set in order to work. As in the case of spaCy, we prepared the test set using the Stanford POS Tagger 3.6.0, using the pre-trained english-left3words-distsim.tagger model.

3.2 Evaluation

Following a preliminary assessment of ease of use and quality of documentation, the parsers were first tested in their native environment (e.g. Java or python) for speed. In a second step, the fastest parsers were then manually evaluated in terms of accuracy.

3.2.1 Ease of Use and Documentation

• spaCy offers centralized documentation^10 and tutorials. Furthermore, being written entirely in python3, it suffers little from difficulties that arise in cross-platform use.

• The Stanford parser has an extensive FAQ^11, but documentation is spread across several files as well as JavaDocs. There is no centralized documentation: the user is dependent on sample files and in-code documentation. However, the code is well-documented. There is a wealth of options, most of which can be applied via the command line, making the software very easy to use.

• The Charniak-Johnson parser offers little documentation on how to use it, and being written in C++, it is not trivial to use across different platforms.

• The Malt parser offers centralized documentation^12; however, it focuses mostly on training a custom model and offers little help on using pre-trained models. The need for tagged data as input is a major shortcoming, necessitating additional steps in order to use it.

^9 http://www.maltparser.org/index.html
^10 https://spacy.io/docs
^11 http://nlp.stanford.edu/software/parser-faq.shtml
^12 http://www.maltparser.org/optiondesc.html

Table 3.1 summarizes these results.

    parser        cross-platform use   documentation                                       further comments
    spaCy         easy (python)        centralized documentation, tutorials                inferior POS tagger
    Stanford      easy (Java)          extensive FAQ, well-documented code, sample files
    Charniak-J.   difficult (C++)      little documentation                                requires sentence-split input
    Malt          easy (Java)          centralized documentation                           requires tagged input

Table 3.1: Summary of assessment of ease of use for different parsers

3.2.2 Evaluation of Speed

The parsers were compared on a test set consisting of 1000 randomly selected text files containing abstracts from PubMed articles, averaging 1277 characters each. The test set as well as intermediary results can be found in the data/parser_evaluation directory accompanying this thesis. The tests were run on a 3.5 GHz Intel Core i5 machine with 8GB RAM.

Table 3.2 lists the various processing speeds measured using the Unix time command (the characters-per-second figures follow from dividing the roughly 1000 × 1277 = 1,277,000 characters of the test set by the measured time). In reading the table, bear in mind the following points:

• The spaCy library takes considerable time to load, but then processes documents comparably fast. To demonstrate this, we list separately the time for processing the test set including loading of the library (loading in the table) and excluding loading time (loaded). We do so since the overhead that loading the library presents will diminish in significance with increasing size of the data to be processed.

• We also take separate note of spaCy's performance when using plain text files as input and applying its own part-of-speech tagger (plain text in the table), and when provided with previously tagged text (tagged text). In the latter case, a small parsing step takes place to extract tags and tokens from the output produced by the Stanford POS tagger.

• The evaluation of the Charniak-Johnson parser should not be directly compared to the other two, since it was performed on a different machine (see 3.1.4).

    parser                         time           characters / s
    Stanford POS tagger (SPT)      29.126s        43 840
    spaCy (plain text, loading)    49.236s        25 933
    spaCy (plain text, loaded)     26.113s        48 896
    spaCy (tagged text, loading)   48.342s        26 413
    spaCy (tagged text, loaded)    23.662s        53 962
    spaCy + SPT (loading)          77.468s        12 482
    Stanford                       2 430.141s     525
    Charniak-Johnson               6 069.198s     210
    Malt                           52 509.288s    24
    Malt + SPT                     52 538.414s    24

Table 3.2: Processing time for different parsers

Discussion

Table 3.2 shows that the simple parsing step needed to make Stanford POS tagger output usable by spaCy, together with the loading of the tags thus provided, takes spaCy approximately the same amount of time as relying on its internal tagger. Furthermore, the time to load the spaCy library is substantial, although negligible in absolute terms.

In relative terms, the combination of spaCy + Stanford POS tagger significantly slows down spaCy's performance. However, as we shall show in Section 3.2.3, it is practically inevitable given the poor accuracy of spaCy's part-of-speech tagger.

Apart from algorithmic differences, the big gap in speed between the parsers is probably due to the fact that a new Java virtual machine is invoked for the processing of every document for the Stanford and Malt parsers. This could be amended by configuring the parsers in such a way that the Java virtual machine acts as a server that processes requests. However, this is beyond the scope of this work.

3.2.3 Evaluation of Accuracy

Ten sentences from the test set were selected in order to evaluate the output of the parsers by hand, visualized as parse trees. The parses were converted into the CoNLL 2006 format [4], and then visualized using the Whatswrong visualizer^13. Of the 10 sentences, the first five are considered easy sentences to parse, while the latter five are more difficult. We do not provide a quantitative evaluation, but the qualitative evaluation below gives a good indication of the individual parsers' performance. We only present the parse trees relevant for the discussion below; for a complete list and higher-resolution images, refer to the additional material^14 that accompanies this dissertation.
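For reference, the CoNLL 2006 (CoNLL-X) format represents one token per line in ten tab-separated columns: ID, form, lemma, coarse POS tag, fine POS tag, features, head index, dependency relation, and two projective variants of the last two. The following fragment is a hand-made illustration, not taken from the actual converted data:

    1   Presenilin   _   NOUN   NNP   _   2   nsubj   _   _
    2   controls     _   VERB   VBZ   _   0   root    _   _
    3   kinesin-1    _   NOUN   NN    _   2   dobj    _   _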

Only the parses of spaCy, the Stanford Parser and the Malt Parser are considered, as well as the parses produced by the combination of spaCy + Stanford POS tagger. Given the Charniak-Johnson parser's lack of ease of use and its difficulty in producing parse trees, it is omitted from this evaluation.

The parse trees below highlight how poorly the spaCy parser performs using its own tagger (for example in sentence 8), often yielding parses that would make a meaningful extraction of relations impossible. The Malt parser never yields parses that are superior to Stanford's, and sometimes makes mistakes that the Stanford parser does not make (for example in sentence 5). However, using spaCy + Stanford POS tagger, results comparable to the Stanford parser are achieved, with the exception of minor mistakes (see sentences 3 and 6, for example).

^13 https://code.google.com/p/whatswrong/
^14 data/parser_evaluation/accuracy_evaluation/parse_tree


Sentence 1

Neurons and other cells require intracellular transport of essential components for viability and function.

All three parsers accurately mark require as the root of the sentence and the phrase neurons and other cells as its subject. None of the parsers accurately depicts the dependency of the phrase for viability and function on require, assigning it to either transport or components.

Figure 3.1: spaCy parser

Figure 3.2: Malt parser


Sentence 2

Strikingly, PS deficiency has no effect on an unrelated cargo vesicle class containing synaptotagmin, which is powered by a different kinesin motor.

All three parsers correctly identify root and subject. Noticeably, they also all correctly recognize PS deficiency as a compound. Unlike spaCy and the Stanford parser, the Malt parser here incorrectly indicates a dependency between synaptotagmin and effect (rather than between effect and class (containing synaptotagmin)). Without expert knowledge it is not possible to decide whether the dependency between synaptotagmin and the relative clause which is powered by a different kinesin motor is correct, or if the relative clause depends on class (containing synaptotagmin).


Figure 3.3: spaCy parser

Figure 3.4: Malt parser


Sentence 3

However, it is unclear how mutations in NLGN4X result in neurodevelopmental defects.

The Stanford and Malt parsers deal with the sentence similarly. spaCy, however, incorrectly marks NLGN4X result as a compound (as opposed to mutations in NLGN4X). This error seems to be caused by result being tagged as NN (noun) rather than VBP (verb, non-3rd person singular present) by the spaCy tagger. Indeed, providing spaCy with tags from the Stanford tagger helps to ameliorate this problem, but it still incorrectly marks NLGN4X result as a compound.

Figure 3.5: spaCy parser

Figure 3.6: Stanford tagger, spaCy parser

Figure 3.7: Stanford parser


Sentence 4

Diurnal and seasonal cues play critical and conserved roles in behavior, physiology, and reproduction in diverse animals.

spaCy has a peculiar way of describing the dependencies between the phrase diurnal and seasonal cues and play. This is likely to be caused by diurnal being tagged as NNP (proper noun). Indeed, this problem is solved by providing the Stanford tagger's tags to spaCy. Furthermore, all three parsers assign different dependencies to the phrase in diverse animals: spaCy marks it as dependent on play, Stanford on behavior, physiology, and reproduction, and Malt on reproduction only. Without expert knowledge, it is hard to decide which is the most correct way; but Stanford's assessment seems most plausible, while spaCy's is the most simplistic one.


Figure 3.8: spaCy parser

Figure 3.9: Stanford tagger, spaCy parser


Figure 3.10: Stanford parser

Figure 3.11: Malt parser


Sentence 5

The nine articles contained within this issue address aspects of circadian signaling in diverse taxa, utilize wide-ranging approaches, and collectively provide thought-provoking discussion of future directions in circadian research.

None of the parsers identify address as the root of the sentence. Both spaCy and the Malt parser mark utilize as the root of the phrase, and consider the actual root address either as the root of an adverbial clause, or as part of a composite noun (this) issue address aspects. Stanford also fails to mark address as the root, but captures the dependencies between address, utilize and provide appropriately. spaCy only captures the dependency between utilize and provide, while the Malt parser falsely identifies a dependency between approaches and provide.

The phrase aspects of circadian signaling is correctly parsed by the Malt parser, while spaCy and Stanford both mark signaling to be an ACI of aspects of circadian. Utilizing the Stanford tagger in conjunction with the spaCy parser yields the best results: while the true root address of the sentence is still not found, it parses the phrase aspects of circadian signaling in diverse taxa correctly, and accurately describes it as the object of address.


Figure 3.12: spaCy parser

Figure 3.13: Stanford tagger, spaCy parser

Figure 3.14: Stanford parser

Figure 3.15: Malt parser


Sentence 6

Thus, perturbations of APP/PS transport could contribute to early neuropathology observed in AD, and highlight a potential novel therapeutic pathway for early intervention, prior to neuronal loss and clinical manifestation of disease.

spaCy, unlike Stanford and Malt, fails to correctly identify the dependency of observed in AD on neuropathology. It also does not accurately mark highlight as dependent on contribute, which Stanford and Malt do. This is not fixed by providing it with the Stanford tagger's tags, but doing so results in a slight improvement in marking novel as an adverbial modifier of pathway.

Without expert knowledge it cannot be established whether spaCy's established dependency of prior to neuronal loss ... on highlight is correct, or whether Stanford's and Malt's attachment to intervention is.


Figure 3.16: spaCy parser

Figure 3.17: Stanford tagger, spaCy parser

Figure 3.18: Stanford parser


Sentence 7

Presenilin controls kinesin-1 and dynein function during APP-vesicle transport in vivo.

All parsers parse this sentence correctly. Without expert knowledge, it cannot be decided whether it is more correct to mark the phrase during APP-vesicle transport in vivo as dependent on function (as spaCy and Malt do) or on controls (as Stanford does).

Figure 3.19: spaCy parser

Figure 3.20: Stanford parser

Figure 3.21: Malt parser


Sentence 8

Log EuroSCORE I of octogenarians was significantly higher (30±5 17 vs 20 ±5 16, P < 0.001).

spaCy does not recognize Log EuroSCORE I of octogenarians as one phrase. In fact, the I is tagged as a personal pronoun. Stanford and Malt do recognize it correctly, and Malt in particular identifies the I as a cardinal number. Consequently, providing the spaCy parser with Stanford tags yields much better results.

The phrase in parentheses is marked differently by all parsers: spaCy marks it as an attribute of was, while Stanford and Malt mark it as an unclassified dependency on was higher. Within the parentheses, only Stanford recognizes vs as the root of the phrase, and P < 0.001 as an apposition. Using the spaCy parser in conjunction with the Stanford tagger also improves on the parse provided.

Figure 3.22: spaCy parser

Figure 3.23: Stanford tagger, spaCy parser

Figure 3.24: Stanford parser

Figure 3.25: Malt parser

Sentence 9

Introduction to the symposium–keeping time during evolution: conservation and innovation of the circadian clock.

spaCy incorrectly marks symposium-keeping time as one phrase. It correctly parses this phrase once using the Stanford tagger's tags. In that setting, it also parses the phrase keeping time during evolution as an ACI that depends on symposium, while Stanford marks keeping time during evolution as an unclassified dependency of introduction, and Malt as an adjectival modifier. Syntactically, the ACI is the most accurate interpretation, but in this special constellation Stanford or Malt parser's results may be more accurate.

All parsers, however, deal well with the segmentation of syntactically independent phrases by the colon, marking it as an apposition to introduction (spaCy), or as an unclassified dependency of time (Stanford) or of introduction (Malt).

Figure 3.26: spaCy parser

Figure 3.27: Stanford tagger, spaCy parser

Figure 3.28: Stanford parser

Figure 3.29: Malt parser

Sentence 10

Genetic mutations in NLGN4X (neuroligin 4), including point mutations and copy number variants (CNVs), have been associated with susceptibility to autism spectrum disorders (ASDs).

This sentence is parsed surprisingly well by all parsers. However, spaCy marks (neuroligin 4) as an adverbial modifier of including rather than an apposition of NLGN4X (as Stanford does). Using Stanford tags, though, it marks it as an apposition of mutations in NLGN4X, which is not as correct as the Stanford parser's results, but an improvement over its default usage.

Figure 3.30: spaCy parser

Figure 3.31: Stanford tagger, spaCy parser

Figure 3.32: Stanford parser

Figure 3.33: Malt parser

3.2.4 Prospective Benefits

spaCy's POS tagger can be trained on user-supplied data. While this is beyond the scope of this work, spaCy's part-of-speech tagger could be trained on data tagged with the Stanford POS tagger, hopefully yielding better results than its default model. It could then be used instead of the Stanford tagger in the pipeline. This would greatly increase performance for two reasons:

1. Switching environments (python3 and Java) relies on reading and writing to file. As table 3.2 shows, the small parsing step introduced by having to make the Stanford POS tagger output available to spaCy further slows down processing. If tagging and parsing can both be done in python3, this would make disk access and conversion superfluous and further speed up the pipeline.

2. spaCy's tagger itself seems comparably fast. If retraining does not impact its performance, it could yield a further increase in speed.

3.2.5 Selection

The combination of spaCy + Stanford POS tagger outperforms the other parsers by at least two orders of magnitude in terms of speed, and maintains comparable accuracy. Because of this, and taking the prospective benefits described in 3.2.4 into account, we opt to use spaCy in conjunction with the Stanford POS tagger in the course of this dissertation.

Given the modular nature and loose coupling with the part-of-speech tagger in the python-ontogene pipeline, integrating the retrained spaCy POS tagger should be easy, and would hopefully yield a further increase in processing speed.

3.3 Summary

In this chapter we described the selection process of a suitable dependency parser for the python-ontogene pipeline. We evaluated a series of different parsers, and decided to use the spaCy parser in conjunction with the Stanford POS tagger. Not only does this approach outperform the other parsers in terms of speed, it also offers the potential for further improvement. Namely, if the spaCy POS tagger is trained using the output of the Stanford POS tagger, or another means is found to improve on the spaCy POS tagger's performance, we presume that accuracy and speed can be increased dramatically.

Chapter 4

Rule-Based Relation Extraction

In this chapter, we explain our approach to relation extraction based on hand-written rules. Building on the methods of parsing described in Chapter 3, we created an independent system, which we call epythemeus. It allows searching a corpus of parsed texts for specific relations defined by rules provided by the user.

We first discuss fundamental design decisions made for the epythemeus system in Section 4.1. The system and its components are described in Section 4.2. A brief account of the data set used for development and evaluation follows in Section 4.3. We present a set of manually created rules aimed at finding a large portion of relations in a specific domain of medical literature to demonstrate the functionality of our system (Section 4.4), and conclude with a summary in Section 4.5. The system is evaluated in Chapter 5.

All modules and queries described in this chapter can be found in the python-ontogene/epythemeus directory that accompanies this dissertation.

4.1 Design Considerations

While rule-based approaches usually perform well, Simpson et al. explain that the manual generation of ... rules is a time-consuming process [33]. Considerable effort has been taken to facilitate the writing of rules, and thus reduce development time. We attempt to make these efforts benefit a wider audience by converting rules into queries of a common, widely used format. We opted for the Structured Query Language (SQL), the most widely used query language for relational databases. This dictates the architecture of our system, described at the beginning of Section 4.2.

The epythemeus system builds solely on the syntactic information produced by dependency parsing as described in Chapter 3, and explicitly does not yet take named entity recognition into account. While we point out that including NER information can improve results, allowing epythemeus to utilize such information only at a future stage of development offers the following advantages:

1. Systems utilizing different approaches in a sequential manner can be subject to cascading errors. In the case at hand, this means that a relation may not be found if the system does not detect the corresponding named entity in a previous step. Postponing the inclusion of named entity recognition prevents such cascading errors from occurring as a consequence of the system architecture.

2. Given our focus on aiding the query development process, limiting the features available for phrasing rules allows us to explain our approach with greater clarity and conciseness.

3. By developing the epythemeus system not as a component of the python-ontogene pipeline, but as an independent system, we ensure that our contributions can be of use to a greater audience.

Especially in regard to point 3, we attempt to keep epythemeus as independent as possible, allowing it to be used with different parsers and allowing further features for rules to be included easily.

4.2 Implementation

The epythemeus system consists of three python modules (stanford_pos_to_db, browse_db, query_helper) and a database. The stanford_pos_to_db module populates the database given a previously tagged corpus as input. The database can be accessed either via the browse_db module, or through third-party software. The query_helper module facilitates the creation of queries used by either browse_db or the third-party software to extract relations from the database.


Figure 4.1: Schematic overview of the architecture of the epythemeus system.

4.2.1 stanford_pos_to_db

This module uses spaCy to parse a previously POS tagged text. While spaCy offers POS tagging functionality, we found that parsing quality is increased when using a different tagger (see Chapter 3). In the implementation at hand, the module expects as input a directory of plain text articles containing tokens and tags as produced by the Stanford POS tagger. The module will take the input, create a new spaCy object for every article, use spaCy to parse the articles, and commit both the spaCy objects as well as all dependencies to the database.

Within_IN the_DT last_JJ several_JJ years_NNS ,_, previously_RB rare_JJ liver_NN tumors_NNS have_VBP been_VBN seen_VBN in_IN young_JJ women_NNS using_VBG oral_JJ contraceptive_JJ steroids_NNS ._.

Listing 4.1: Example of the format expected by the stanford_pos_to_db module.

Note that this module can be swapped out to convert the output of a different parser into the database without affecting the remainder of the system.
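To make this concrete, the following sketch shows how such a loading step could look: it recovers plain tokens from the token_TAG format of Listing 4.1, parses them with spaCy, and writes dependency tuples to the database. It is a simplification rather than the actual module: spaCy re-tags the text itself here (the real module injects the Stanford tags into the parser), and the model name and function name are assumptions.

import sqlite3
import spacy

nlp = spacy.load('en')  # model name is an assumption

def load_tagged_article(db, article_id, tagged_text):
    # recover plain tokens from the token_TAG format of Listing 4.1
    tokens = [pair.rsplit('_', 1)[0] for pair in tagged_text.split()]
    doc = nlp(' '.join(tokens))
    rows = [(article_id, sent_id, tok.dep_, tok.head.i, tok.head.text,
             tok.text, tok.i)
            for sent_id, sent in enumerate(doc.sents) for tok in sent]
    db.executemany(
        'INSERT INTO dependency (article_id, sentence_id, dependency_type, '
        'head_id, head_token, dependent_token, dependent_id) '
        'VALUES (?, ?, ?, ?, ?, ?, ?)', rows)
    db.commit()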


4.2.2 Database

The database is implemented using SQLite1, which was chosen for two reasons:

1. Its python3 interface, called sqlite3, allows for easy integration with the rest of the epythemeus system and with the python-ontogene pipeline.

2. It can potentially cope with large amounts of data (up to 140 TB2). While the system has only been tested with comparably small data sets (see 5.1), this allows epythemeus to be used with much larger data sets such as PubMed in the future.

Schema

The database has two tables: dependency and article. While the dependency table stores dependency tuples generated by the stanford_pos_to_db module, the article table contains serialized python objects generated by the spaCy library.

This approach was chosen to make use of the highly optimized search algorithms employed by SQLite in order to find articles or sentences containing a relation given a certain pattern. At the same time, we maintain the ability to load the corresponding python object, which contains additional information such as part-of-speech tags, dependency trees and lemmata for further analysis and processing.

The tuples saved in the dependency table have the following format:

dependency(article_id, sentence_id, dependency_type, head_id, head_token, dependent_token, dependent_id, dependency_id)
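A minimal sketch of this schema in python3's sqlite3 module follows; the column types, and the use of a BLOB for the serialized spaCy object, are assumptions, as the thesis does not specify them.

import sqlite3

db = sqlite3.connect('dependency_db.sqlite')
db.executescript('''
CREATE TABLE IF NOT EXISTS article (
    article_id INTEGER PRIMARY KEY,
    doc        BLOB                -- serialized spaCy object
);
CREATE TABLE IF NOT EXISTS dependency (
    article_id      INTEGER,
    sentence_id     INTEGER,
    dependency_type TEXT,
    head_id         INTEGER,
    head_token      TEXT,
    dependent_token TEXT,
    dependent_id    INTEGER,
    dependency_id   INTEGER PRIMARY KEY
);
''')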

To demonstrate the relation between database entries and dependency parses, consider the following sample sentence, the related parse tree (figure 4.2) and the corresponding set of tuples (table 4.1).

The ventricular arrhythmias responded to intravenous administration of lidocaine and to direct current electric shock ...3

1 https://www.sqlite.org/
2 https://www.sqlite.org/whentouse.html
3 2004 | 6 in the development set


Figure 4.2: Parse tree of a sample sentence.

aid   sid  type   hid  head_token      dep_token       did  id
2004  6    amod   115  arrhythmias     ventricular     114  56733
2004  6    nsubj  116  responded       arrhythmias     115  56734
2004  6    ccomp  132  required        responded       116  56735
2004  6    prep   116  responded       to              117  56736
2004  6    amod   119  administration  intravenous     118  56737
2004  6    pobj   117  to              administration  119  56738
2004  6    prep   119  administration  of              120  56739
2004  6    pobj   120  of              lidocaine       121  56740
2004  6    cc     117  to              and             122  56741
2004  6    aux    124  direct          to              123  56742
2004  6    xcomp  116  responded       direct          124  56743
2004  6    amod   127  shock           current         125  56744
2004  6    amod   127  shock           electric        126  56745
2004  6    dobj   124  direct          shock           127  56746

Table 4.1: Dependency tuples for a sample sentence (abbreviated header names).

Indices

Indices are used to increase the efficiency of querying the database. Finding relations relies heavily on joining on the dependent_id and head_id columns, while maintaining that a relationship cannot extend over several articles. This forces all potential constituents of a relation to have the same article_id. In order to maintain an easy mapping between the position of the tokens in the article and the dependent_id or head_id, respectively, the dependent_id and head_id are not unique across the database, but rather commence at 0 for every new article_id.

Because of this, so-called compound indices, which allow easy joining on several columns, are created on the column pairs (article_id, head_id) and (article_id, dependent_id). The effects of these compound indices on query performance are described in Section 5.1.2.
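Expressed in sqlite3, these two compound indices could be created as follows (the index names are assumptions):

import sqlite3

db = sqlite3.connect('dependency_db.sqlite')
db.execute('CREATE INDEX IF NOT EXISTS idx_article_head '
           'ON dependency (article_id, head_id)')
db.execute('CREATE INDEX IF NOT EXISTS idx_article_dependent '
           'ON dependency (article_id, dependent_id)')
db.commit()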

4.2.3 query_helper

This module aids with the creation of complex SQL queries for extracting relations. The key idea is that relation patterns can be split into fragments, which are then combined in various ways. The module is thus particularly useful in automatically generating queries that exhaust all possible combinations of fragments.

For example, a relationship might be expressed in the pattern X causes Y or X induces Y, which are equivalent in terms of their dependency pattern. Another way to express the same relation is X appears to cause Y, or X seems to cause Y. In this example, six different queries are needed to capture all possibilities. This highlights the usefulness of a tool that automatically generates all possible queries given a minimal set of building blocks.

We have thus created our own short-hand notation for such fragments, which are parsed by the query_helper module. The module then in turn offers functions that generate queries based on the user-supplied fragments.

Fragments

Fragments represent conditions that apply to dependencies in the database, and that can be chained together. Example 4.2 is a comparably simple fragment that would match the phrase result in. It is used here to explain the notation of fragments used by the query_helper module. Fragments are saved in plain text files, and a single text file can contain multiple fragments. Having multiple fragments in a single file allows similar fragments to reside in the same file, and thus helps organization.

1 // result in
2 d1.head_id, d1.head_id
3 d1.head_token LIKE 'result%'
4 d1.dependent_token LIKE 'in'

Listing 4.2: Simple fragment matching the phrase result in.

The first two lines in every fragment carry special meaning. Line 1 is the title line, and contains the name of the fragment so that it can later be referred to. The title line is marked by being prefixed with //. It marks the beginning of a new fragment: every subsequent line that does not begin with // is considered part of the fragment.

Line 2 is the joining elements line, the use of which will be explained below.

The remaining lines contain conditions. Every fragment is defined by a set of conditions that apply to a set of dependency tuples as they are stored in the database (see Section 4.2.2). A single dependency tuple is referred to by a name such as d1 within a fragment. Inspired by SQL notation, the elements of a tuple are referred to by the notation dependency_name.element_name. Conditions on the elements can be expressed either by = or by LIKE, which is the same operator as in SQL. Namely, it allows the right-hand operand to contain the wild-card %, which represents missing letters, and also allows for the matching to be case-indifferent. The condition d1.head_token LIKE 'result%' thus applies to all dependency tuples in which the head_token begins with result, including results, resulted and resulting.

Using different names for tuples allows a fragment to describe patterns that extend over several tuples, as example 4.3 shows. The condition d1.dependent_id = d2.head_id indicates the connection between the dependency tuples. If no condition specifies the relation between the two tuples, the system will merely assume that the two dependency tuples have to be in the same sentence.

1 // be responsible
2 d1.head_id, d2.head_id
3 d1.dependency_type = 'acomp'
4 d1.dependent_id = d2.head_id
5 d2.head_token LIKE 'responsible%'

Listing 4.3: Fragment involving multiple dependency tuples, matching phrases such as is responsible or be responsible.

The following sample sentence contains such a phrase. Again, a simplified parse tree and dependency tuples are provided below.

CHAPTER 4. RULE-BASED RELATION EXTRACTION 56

... different anatomical or pathophysiological substrates may be responsible for the generation of parkinsonian 'off' signs and dyskinesias4

Figure 4.3: Simplified parse tree of a sample sentence containing the be responsible fragment.

aid       sid  type   hid  head_token   dep_token           did  id
11099450  5    amod   227  substrates   different           223  7319
11099450  5    amod   227  substrates   anatomical          224  7320
11099450  5    cc     224  anatomical   or                  225  7321
11099450  5    conj   224  anatomical   pathophysiological  226  7322
11099450  5    nsubj  229  be           substrates          227  7323
11099450  5    aux    229  be           may                 228  7324
11099450  5    acomp  229  be           responsible         230  7326
11099450  5    prep   230  responsible  for                 231  7327
11099450  5    det    233  generation   the                 232  7328
11099450  5    pobj   231  for          generation          233  7329
11099450  5    prep   233  generation   of                  234  7330
11099450  5    amod   239  signs        parkinsonian        235  7331
11099450  5    punct  239  signs        ‘                   236  7332
11099450  5    amod   239  signs        off                 237  7333
11099450  5    punct  239  signs        ’                   238  7334
11099450  5    pobj   234  of           signs               239  7335

Table 4.2: Dependency tuples corresponding to the sample sentence containing the be responsible fragment.

4 11099450 | 5 in the development set


Joining Fragments

In both examples 4.2 and 4.3, the second line does not represent a condition. The first non-empty line to follow the title line describes the left and right elements of the fragment. These specify which elements of the fragment to use when several fragments are joined together using the join_fragments(left_fragment, right_fragment) function of the query_helper module. Consider fragment 4.4 below, and the results produced by calling join_fragments(subj, result in) (4.5) and join_fragments(subj, be responsible) (4.6), respectively.

// subj
d1.head_id, d1.head_id
d1.dependency_type = 'nsubj'

Listing 4.4: subj fragment matching any phrase containing a subject.

d1.head_id, d1.head_id
d1.dependency_type = 'nsubj'
d1.head_id = d2.head_id
d2.head_token LIKE 'result%'
d2.dependent_token LIKE 'in'

Listing 4.5: Result of joining fragments 4.4 (subj) and 4.2 (result in), which will match phrases in which the word result is governed by a subject.

d1.head_id, d3.head_id
d1.dependency_type = 'nsubj'
d1.head_id = d2.head_id
d2.dependency_type = 'acomp'
d2.dependent_id = d3.head_id
d3.head_token LIKE 'responsible%'

Listing 4.6: Result of joining fragments 4.4 (subj) and 4.3 (be responsible), which will match phrases in which a subphrase like is responsible or be responsible is governed by a subject.

Note that in example 4.5 (and analogously 4.6), the resulting fragment is much more specific than just matching phrases in which a subject exists and which contain the word result. Specifying left and right joining elements in the fragment definition allows the join_fragments() function to connect the fragments in a more meaningful manner, and truly chain fragments together.


As can be seen, the join_fragments() function also automatically renames the tuple identifiers, and adds a condition equating the left-hand tuple's right element with the right-hand tuple's left element. In this fashion, the relevant elements for joining fragments can be defined as part of the fragment definition. This allows for the automated joining of fragments, instead of having to specify the element on which to join individually for every join.

Setting the option cooccur=True when calling the join_fragments() function disables this behavior, and will merely rename tuples and merge conditions.
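The following is a minimal sketch of how a function like join_fragments() can behave, assuming a fragment is represented as a pair of its joining elements and a list of SQL condition strings; it illustrates the renaming-and-equating behavior described above and is not the actual query_helper code.

import re

def rename(text, offset):
    # shift tuple identifiers: d1 -> d(1+offset), d2 -> d(2+offset), ...
    return re.sub(r'\bd(\d+)\.',
                  lambda m: 'd%d.' % (int(m.group(1)) + offset), text)

def join_fragments(left, right, cooccur=False):
    (l_left, l_right), l_conds = left
    (r_left, r_right), r_conds = right
    # highest tuple number used by the left fragment
    offset = max((int(n) for c in l_conds
                  for n in re.findall(r'\bd(\d+)\.', c)), default=1)
    conds = l_conds + [rename(c, offset) for c in r_conds]
    if not cooccur:
        # equate the left fragment's right element with the (renamed)
        # right fragment's left element
        conds.append('%s = %s' % (l_right, rename(r_left, offset)))
    return ((l_left, rename(r_right, offset)), conds)

Applied to the subj and be responsible fragments, this reproduces the conditions of listing 4.6 up to their order.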

Alternatives

The fragment notation allows for an effortless listing of alternatives. Consider example 4.7, which describes phrases such as lead to or leads to. There are several verbs that behave like lead, such as attribute or relate. In order to easily account for such structurally equivalent verbs, the notation of example 4.8 can be used.

// lead to
d1.head_id, d1.head_id
d1.head_token LIKE 'lead%'
d1.dependent_token LIKE 'to'

Listing 4.7: Fragment matching the phrase lead to or leads to.

// to
d1.head_id, d1.head_id
d1.head_token LIKE 'attribute%'
|| led%
|| lead%
|| relate%
d1.dependent_token LIKE 'to'

Listing 4.8: Fragment matching several phrases similar to lead to.

Every line beginning with || refers to the closest previous line that is not preceded by ||, and describes an alternative to that line's right-hand operand.
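A sketch of a parser for this notation follows. The function name and the internal representation (joining elements plus SQL-ready condition strings, with || alternatives expanded into an OR group) are assumptions, not the actual query_helper implementation.

def parse_fragments(path):
    fragments, name, join, conds = {}, None, None, []
    with open(path) as fh:
        for raw in fh:
            line = raw.strip()
            if not line:
                continue
            if line.startswith('//'):              # title line
                if name:
                    fragments[name] = (join, conds)
                name, join, conds = line[2:].strip(), None, []
            elif join is None:                     # joining-elements line
                join = tuple(p.strip() for p in line.split(','))
            elif line.startswith('||'):            # alternative right-hand side
                prev = conds.pop()
                # reuse the lhs and operator of the previous condition
                head = prev.lstrip('(').split("'")[0].strip()
                alt = "%s '%s'" % (head, line[2:].strip())
                if not prev.startswith('('):
                    prev = '(' + prev
                conds.append('%s OR %s)' % (prev.rstrip(')'), alt))
            else:
                conds.append(line)                 # ordinary condition
    if name:
        fragments[name] = (join, conds)
    return fragments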


From Fragments to SQL Queries

Fragments can be directly translated into SQL queries, or first joined as many times as necessary before being turned into queries that can be used by the browse_db module, using the querify() function. This function ensures that all dependency tuples have the same article_id as well as sentence_id.

For example, calling querify(join_fragments(subj, to)), using the fragments from examples 4.4 and 4.8, results in the following SQL query:

SELECT d1.article_id, d1.sentence_id
FROM dependency AS d1, dependency AS d2
WHERE d1.article_id = d2.article_id AND d1.sentence_id = d2.sentence_id
  AND d1.dependency_type = 'nsubj'
  AND (d2.head_token LIKE 'attribute%'
       OR d2.head_token LIKE 'led%'
       OR d2.head_token LIKE 'lead%'
       OR d2.head_token LIKE 'relate%')
  AND d2.dependent_token LIKE 'to'
  AND d1.head_id = d2.head_id

Listing 4.9: Query generated from the result of joining the fragments subj and to.
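A querify()-style function can then be sketched as follows, again assuming the fragment representation used in the sketches above: it collects the tuple aliases occurring in the conditions, constrains all of them to the same article and sentence, and emits the SQL.

import re

def querify(fragment):
    _, conds = fragment
    # collect all tuple names (d1, d2, ...) used in the conditions
    nums = sorted({int(n) for c in conds
                   for n in re.findall(r'\bd(\d+)\.', c)})
    names = ['d%d' % n for n in nums]
    first = names[0]
    same_sentence = ['%s.article_id = %s.article_id AND '
                     '%s.sentence_id = %s.sentence_id' % (n, first, n, first)
                     for n in names[1:]]
    return ('SELECT %s.article_id, %s.sentence_id\nFROM %s\nWHERE %s'
            % (first, first,
               ', '.join('dependency AS %s' % n for n in names),
               '\n  AND '.join(same_sentence + conds)))

For the joined subj and to fragments, this yields a query equivalent to listing 4.9, up to the order of the WHERE clauses.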

Automated Joining of Fragments

The introduction of this section highlighted the importance of automatically generating all possible combinations of fragments. The function active() in the query_helper module gives an example of how a large set of queries can be generated from a very limited set of fragments. The data/example_generate directory that accompanies this thesis contains both the fragments as well as the generated queries, showcasing the usefulness of the query_helper module.

4.2.4 browse_db

The browse_db module is a shell-like environment that serves three purposes:

• the execution of custom queries


• the easy execution of predefined queries on the database

• the execution of related queries in one batch

When calling the browse_db module, the argument -d 'path/to/file' can be used to access a custom database. This is particularly useful when the same queries need to be executed on different data sets.

After calling the browse_db module, it will present a new command-line prompt composed of the database name and the $ sign, waiting for the user to input one of the commands explained below.

user_shell$ python3 browse_db.py
dependency_db.sqlite$

Custom Queries

The database can be queried from the browse_db environment using the q command followed by the SQL query in quotes. For example, a simple search for a specific token can be performed as follows:

$ python3 browse_db.py
dependency_db.sqlite$ q "SELECT * FROM dependency WHERE article_id = 2004 AND head_id = 5"
2004,0,pobj,5,in,patients,6,56625

Predefined Queries

Predefined queries written in SQL are saved in plain text files, which are loaded by browse_db. Every file contains one query, and can be called from within the browse_db environment using the q command and its file name. For example, a query stored in the file x_causes_y.sql can be executed as follows:

$ python3 browse_db.py
dependency_db.sqlite$ q x_causes_y

Several predefined queries are described in Section 4.4. More queries can easily be added by creating a new file containing the new query in either the predefined_queries directory, or in a custom directory, and calling browse_db.py as follows:

$ python3 browse_db.py -q path/to/custom/directory


Specialized Queries and Helper Functions

browse_db furthermore offers helper functions that perform subtree traversal (subtree()), negation detection (is_negated()) and relative clause resolution (relative_clause()), given an article_id and a token_id. These functions can be used in user mode as follows:

dependency_db.sqlite$ subtree 2004 5
in patients receiving psychotropic drugs

Furthermore, browse_db allows for specialized functions that will not only execute the query, but also perform further analysis of the results. These functions need to be specifically written, and can utilize the helper functions described above. For example, such a specialized function has been written for the query x_cause_y; the listing below highlights the difference in output between custom queries and queries with specialized functions:

dependency_db.sqlite$ q "SELECT * FROM dependency WHERE article_id = 2004"
2004,0,amod,1,changes,Electrocardiographic,0,56619
...

dependency_db.sqlite$ q x_cause_y
ID 6504332: that (55) cause disorders (59)
-> subj: that
-> obj: movement disorders
...

These specialized functions will be used automatically if their function name and the query name coincide. This allows the user to easily add further specialized functions to the browse_db module.
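This name-based dispatch can be as simple as the following sketch; the lookup mechanism and function names are assumptions about how browse_db could do it, not its actual code.

def run_query(name, sql, db):
    # use a specialized function if one with the query's name exists,
    # e.g. x_cause_y(db); otherwise execute the plain SQL query
    handler = globals().get(name)
    if callable(handler):
        return handler(db)
    return db.execute(sql).fetchall()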

Categories of Queries

When loading predefined queries from a directory, the browse_db module will also keep the names of the directory and all the queries it contains, considering subdirectories as well. This allows for the organization of related queries into directories, which can then be called using the name of the directory. For example, if the predefined_queries directory or the directory provided to browse_db contains a sub-directory example, all the queries contained in the example directory can be executed in one command as follows:


dependency_db.sqlite$ q example
Running query 'x_cause_y' from category 'example'...
10539815,47,it,50,cause,function,53
10728962,31,they,32,cause,vasodilation,33
...

Again, the system will check if any specialized function has been written that matches the name of any of the queries supplied in the directory. If so, it will use that function rather than the query provided.

Command-Line Mode

The execution of all queries in one category can also be initiated without entering the shell-like mode. For this, the argument --ids is used when calling the module. In that case, the system will not consider any specialized functions and only execute the queries as they are provided as text files as described above. It will also only return the first two fields of every row, which are assumed to always be article_id and sentence_id.

$ python3 browse_db.py --ids ACTIVE_GENERATED
9041081 | 10
2004 | 2
2886572 | 10
...

This way of using the module is suited for subsequent automatic use, especially when the output of the module is redirected to a file (using python3 browse_db --ids category > output.txt).

4.3 Data Set

In their foundational book Mining Text Data, Aggarwal et al. [1] describe the wealth of annotated corpora for domain-independent text mining. However, all these data sets draw on broadcast news, newspaper and newswire data (as in the case of ACE [10]), on the Wall Street Journal (MUC) [13], or on the Reuters corpus (English CoNLL-2003 [36]).

However, as Simpson et al. explain, a major obstacle to the advancement of biomedical text mining is the shortage of quality annotated corpora for this specific domain [33]. Neves [23], for example, gives an overview of 36 annotated corpora in the biomedical domain, most of which, however, do not offer annotations of relations between entities. The study points to the quality of the corpora released in conjunction with the BioCreative challenges5, which organize evaluations of text mining and information extraction systems applied to the biological domain, and release annotated corpora for evaluation.

For the development of predefined queries as well as for the evaluation of our epythemeus system, we use the annotated corpus originally provided for the BioCreative V challenge [39]. It contains 1500 PubMed article abstracts that have been manually annotated for chemicals and diseases, as well as for Chemical-Disease Relations (CDRs). It is split into three data sets (development, training, testing), each containing 500 documents. The data is presented both in BioC format, an XML standard for biomedical text mining [8], and in PubTator format, a special tab-delimited text format used by the PubTator annotation tool [38].

One major shortcoming of the data set, however, is that CDR annotations are made on document level, not on mention level. This means that for every document, the annotations note which relations are found in the entire document, but offer no further information on which occurrence of an entity is an argument of the relation and where it is found within the document. The PubTator annotation tool highlights named entities as shown in figure 4.4, but it does not provide out-of-the-box visualization for relations, and hence is not fit for our purpose.

Based on the BioCreative V corpus, we automatically extracted candidate sentences which are likely to contain a relation (Section 4.3.1). These sentences were then manually categorized according to the pattern that contains the relation (Section 4.3.2), in order to develop queries that match the patterns and to be able to evaluate the effectiveness of the epythemeus system.

Note that we chose a corpus containing CDR annotations not because the epythemeus system is specific to that subdomain, but due to the scarcity of high-quality annotated corpora in the biomedical domain. In fact, our system is just as suitable for relation extraction in any other subdomain.

5 http://www.biocreative.org/about/background/description/


Figure 4.4: Exemplary view of named entity highlighting on PubTator.

4.3.1 Conversion

For the development of queries and for evaluation, we only consider relations that are confined within a single sentence. While the epythemeus system is technically able to deal with relations that transcend sentence boundaries, this is beyond the scope of this work. We thus converted the documents of the corpus6 as follows:

The document is split into sentences using spaCy, and only sentences that contain both entities of an annotated relation are maintained. These sentences are printed out separately, and the entities participating in the annotated relationship are capitalized to facilitate human evaluation.

804391|t|Light chain proteinuria and cellular mediated immunity in rifampin treated patients with tuberculosis.
804391|a|Light chain proteinuria was found in 9 of 17 tuberculosis patients treated with rifampin. ...
804391    12    23    proteinuria    Disease    D011507
804391    58    66    rifampin    Chemical    D012293
...
804391    CID    D012293    D011507

Listing 4.10: Example of the PubTator format.

6 using the script python-ontogene/converters/pubtator_to_relations.py


804391 | 0 | Light chain PROTEINURIA and cellular mediated immunity in RIFAMPIN treated patients with tuberculosis.
804391 | 1 | Light chain PROTEINURIA was found in 9 of 17 tuberculosis patients treated with RIFAMPIN

Listing 4.11: Extracted relations after conversion.
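The conversion can be sketched as follows; the function name is hypothetical, and the entity and relation records are assumed to have been read from the PubTator file already.

import spacy

nlp = spacy.load('en')  # model name is an assumption

def candidate_sentences(article_id, text, relations):
    # relations: iterable of (entity_a, entity_b) surface-string pairs
    doc = nlp(text)
    for sent_id, sent in enumerate(doc.sents):
        lowered = sent.text.lower()
        for a, b in relations:
            # keep only sentences containing both entities of a relation
            if a.lower() in lowered and b.lower() in lowered:
                marked = sent.text.replace(a, a.upper()).replace(b, b.upper())
                print('%s | %s | %s' % (article_id, sent_id, marked))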

Table 4.3 lists the number of sentences containing a probable mention of an annotated relationship extracted from the respective subset of the corpus.

subset        articles in subset   sentences extracted
development   500                  623
training      500                  581
test          500                  604

Table 4.3: Sentences extracted per subset.

Note that table 4.3 also lists the training data set for the sake of completeness and comparison. However, that set is not used in the course of this work.

4.3.2 Categorization

From the manual analysis of the sentences in the development subset, a set of 8 categories was derived, and each of the sentences was manually assigned to one of these categories. The categories describe the structure of the sentence pointing towards the relation it contains. Following this, the sentences in the test set were each assigned to the same set of categories.

Below, we describe the categories and the criteria that determine the association of a sentence with the respective category. While the categories could apply to other domains too, they have been developed from sentences containing chemical-disease relations, and thus their precise definition is specific to the CDR domain.

ACTIVE

This category involves active sentences in the form of X causes Y (or X cause Y). Included are constructions with modal verbs such as X may cause Y or X did cause Y, as well as extended patterns such as X appears to cause Y. The following sentence stands as an example for this category.

This is the first documentation that METOCLOPRAMIDE provokes TORSADE DE POINTES clinically.7

A collection of verbs that establish relations in the development subset has been compiled:

• accompany
• associate
• attenuate
• attribute
• cause
• decrease
• elicit
• enhance
• increase
• induce
• invoke
• kindle
• lead to
• precipitate
• produce
• provoke
• recur on
• relate
• reflect
• be responsible
• resolve
• result in
• suppress
• use

DEVELOP

A common setting for establishing relationships between chemicals and diseases is to expose a subject to a chemical X and observe a subsequent case of disease Y [27]. This category captures sentences that express such cases. It is the broadest category, including a vast variety of patterns. An example of a simple pattern is X in _ on Y, where X is a disease, _ represents an entity, usually patient, and Y is the chemical. More complicated patterns are case of X within _ receiving Y or X in _ admitted to using Y. Many of these patterns also contain a temporal component, such as development of X following Y treatment or X within _ of administration of Y, where _ represents some time period.

The sentence below is a typical example of this category.

Five patients with cancer who developed ACUTE RENAL FAILURE that followed treatment with CIPROFLOXACIN are described ...8

7 11858397 | 6 in the development set
8 8494478 | 2 in the development set


DUE

This simple category captures sentences in the form of X due to Y and related variants that contain the word due. An example of such a sentence is listed beneath:

Fatal APLASTIC ANEMIA due to INDOMETHACIN–lymphocyte transformation tests in vitro.9

HYPHEN

A large proportion of annotated relations were found in the pattern X-induced Y, such as APOMORPHINE-induced HYPERACTIVITY10. The category also includes more complicated variations of the pattern such as KETAMINE- or diazepam-induced NARCOSIS11 or PILOCARPINE (416 mg/kg, s.c.)-induced limbic motor SEIZURES12. It also extends to the same pattern using different words, namely:

• associate
• attribute
• induce
• kindle
• mediate
• relate

NOUN

This category revolves around nouns that can express relations in patterns such as the action of X on Y or the X action of Y. For example, a sentence containing the dual action of MELATONIN on pharmacological NARCOSIS seems ...13 is considered to belong to this category. Nouns that have been found to express relations in this sense in the development subset are:

• action
• association
• case
• cause
• complication
• effect
• enhancement
• factor
• induction
• marker
• pathogenesis
• relationship
• role

9 7263204 | 0 in the development set
10 6293644 | 2 in the development set
11 11226639 | 6 in the development set
12 9121607 | 2 in the development set
13 11226639 | 7 in the development set


NOUN+VERB

This category extends the previous one in that it applies to sentences in which one of the nouns of the NOUN category is used in conjunction with a verb to express a relation. The pattern X plays role in Y, as expressed in the sentence below, is a prime example of this category.

ERGOT preparations continue to play a major role in MIGRAINE therapy14

PASSIVE

Sentences in the form of X associated with Y or X is associated with Y belong to this category. This includes all tenses (X was associated with Y), sentences of the pattern X appears to be associated with Y, as well as the rare case of X associated by Y. The same set of verbs as used in the ACTIVE category applies here. For example, the following sentences are assigned to this category.

The HYPERACTIVITY induced by NOMIFENSINE in mice remained ...15

Symptomatic VISUAL FIELD CONSTRICTION thought to be associated with VIGABATRIN ...16

NO CATEGORY

Sentences that did not match any of the previously mentioned categories were assigned the NO CATEGORY label. Note that the sentences extracted from the development subset of the original corpus do not necessarily express the annotated relation, even though both entities of the relation appear in the sentence. A large proportion of sentences were assigned to this category for that reason. For example, the following sentence does not establish any relation between the two entities sirolimus and capillary leak:

Systemic toxicity following administration of SIROLIMUS (formerly rapamycin) for psoriasis: association of CAPILLARY LEAK syndrome with apoptosis of lesional lymphocytes.17

14 3300918 | 4 in the development set
15 2576810 | 3 in the development set
16 11077455 | 1 in the development set
17 10328196 | 0 in the development set


Another minor reason for attribution to this category are entities with short names that also occur in natural language, and are thus extracted falsely by the system. An extraction process more elaborate than the one described in 4.3.1, for example involving tokenization or even entity normalization, could ameliorate this shortcoming, but lies beyond the scope of this work.

4.3.3 Development and Test Subsets

Tables 4.4 and 4.5 list the number of sentences for every category in the development subset and test subset, respectively, as well as their percentages.

category      sentences   percentage
ACTIVE        49          7.865%
DEVELOP       146         23.43%
DUE           7           1.124%
HYPHEN        181         29.05%
NO CATEGORY   109         17.5%
NOUN          23          3.692%
NOUN+VERB     11          1.766%
PASSIVE       97          15.57%
Total         623         100%

Table 4.4: Categorization of sentences extracted from the development subset.

category      sentences   percentage
ACTIVE        47          8.09%
DEVELOP       128         22.03%
DUE           6           1.033%
HYPHEN        150         25.82%
NO CATEGORY   122         21%
NOUN          21          3.614%
NOUN+VERB     22          3.787%
PASSIVE       85          14.63%
Total         581         100%

Table 4.5: Categorization of sentences extracted from the test subset.

CHAPTER 4. RULE-BASED RELATION EXTRACTION 70

The annotated corpora, the extracted sentences and their categorization, as well as related material, can be found in the data/manual_corpus directory that accompanies this thesis.

4.4 Queries

This section describes the development of query sets, which should provide the reader with a fair notion of how to use the epythemeus system. Based on the development set and using the query_helper module, a set of queries was developed for three categories:

• the trivial case of the HYPHEN category

• the ACTIVE category, considered relatively simple

• the complex DEVELOP category

The query sets were aimed at having near-perfect recall for their respective category on the development set, while generalizing as much as possible. The fragments and generated queries for each query set can be found in the data/query_set directory that accompanies this work.

4.4.1 HYPHEN queries

Queries for this category are trivially easy to make. The following single fragment produces a query that achieves almost perfect recall on the development set:

// hyphen
d1.dependent_id, d1.head_id
d1.dependent_token LIKE '%-induced'
|| %-associated
|| %-attributed
|| %-kindled
|| %-mediated
|| %-related

Note that it is the dependent_token where we expect words such as levodopa-induced to occur. This is because most commonly, phrases in the pattern X-induced Y are parsed as an amod-dependency, where Y will be the head_token and X-induced the dependent_token. Table 4.6 below shows the corresponding dependency tuples.

aid       sid  type  hid  head_token   dep_token         did  id
10091616  0    prep  0    Worsening    of                1    631
10091616  0    amod  3    dyskinesias  levodopa-induced  2    632
10091616  0    pobj  1    of           dyskinesias       3    633

Table 4.6: Dependency tuples representing a HYPHEN relation.

4.4.2 ACTIVE queries

In order to maximize generalization, a set of minimal fragments was determined that would cover as many sentences from the development set as possible, and then all possible combinations of these were automatically generated. This required manual analysis of every sentence's structure and key words. We found that sentences in the ACTIVE category are made up of up to three sets of fragments:

A first set of fragments describes a set of verbs that express a direct relationship between two entities. These words may either take direct objects (such as to cause), or require a preposition (such as to associate with). We also added to this set of fragments the case of to be responsible for. This set also includes variations involving modal verbs (may cause), different grammatical numbers (X causes Y and X and Y cause Z) as well as tenses (X causes Y and X caused Y). Below is an example of fragments in this set. For a full account of such fragments, refer to the data/example_generate directory.

// with
d1.head_id, d1.head_id
d1.head_token LIKE 'associate%'
|| co-occur%
|| coincide%
d1.dependent_token LIKE 'with'

// active
d1.head_id, d1.head_id
d1.head_token LIKE 'accompan%'
|| associate%
|| attenuate%
|| cause%
...
|| use%
d1.dependency_type = 'dobj'

// be responsible
d1.head_id, d2.head_id
d1.dependency_type = 'acomp'
d1.dependent_id = d2.head_id
d2.head_token LIKE 'responsible%'

Figure 4.5 and table 4.7 show the parse tree for a typical sentence in this category, as well as the corresponding dependency tuples.

Figure 4.5: Typical parse tree for an ACTIVE sentence.

aid       sid  type      hid  head_token  dep_token       did  id
11858397  6    nsubj     132  provokes    metoclopramide  131  13197
11858397  6    dobj      132  provokes    pointes         135  13201
11858397  6    advmod    132  provokes    clinically      136  13202
11858397  6    compound  135  pointes     torsade         133  13199
11858397  6    nsubj     135  pointes     de              134  13200

Table 4.7: Dependency tuples for a typical ACTIVE sentence.

A second set captures cases in the pattern of X verb_a and verb_b Y, where verb_b expresses the relation in question. An example of such a case is the following sentence, where the relation enhances(oral hydrocortisone, pressor responsiveness) is captured by this pattern. Note that because of the way the sentence is parsed, this relation would not be discovered without this fragment (see figure 4.6).

Oral hydrocortisone increases blood pressure and enhances pressor responsiveness in normal human subjects.18

Figure 4.6: Parse tree of a sentence in the pattern X verb_a and verb_b Y.

// conj
d1.head_id, d1.dependent_id
d1.dependency_type = 'conj'

A third set entails structures like X appears to cause Y or X seems to cause Y.

// appears
d1.head_id, d1.dependent_id
d1.head_token LIKE 'appear%'
|| seem%
d1.dependency_type = 'xcomp'

Note that such patterns can be combined in various ways: for example, the verb to cause can occur in the patterns X causes Y, X appears to cause Y, X some_verb and causes Y, X appears to some_verb and cause Y, and X some_verb and appears to cause Y. Queries that match the latter case, however, are not generated, as there are no such sentences in the development set.

From these fragments, a set of 29 queries was automatically generated using the query_helper module. The set of generated queries can be found in the data/example_generate directory that accompanies this thesis.

18 2722224 | 1 in the original development set of the BioCreative corpus


4.4.3 DEVELOP queries

The patterns of sentences in the DEVELOP category are certainly the most varied. Recall that sentences in the DEVELOP category describe a situation where a chemical is administered to a recipient, and a disease is observed in that recipient. Every pattern must thus have a part describing the disease, and one describing the chemical.

Since the database does not store entity recognition information, the epythemeus system needs to rely on parsing patterns to identify diseases and chemicals, respectively. The fact that the administration of the chemical as well as the observation of the disease need to be described in the sentence makes it possible to identify the elements of a chemical-disease relationship.

While the fragments presented here do not cover all the cases in the development set, they give an idea of how more complicated relations can be found.

Chemicals

The patterns that identify chemicals revolve around the administration of the chemical, which can manifest in a variety of ways. Below we give an example of the kind of structures that can express the administration of a chemical. The fragment titles should give a sufficient description of the patterns the fragments describe.

// X therapy
d1.head_id, d1.head_id
d1.head_token = 'therapy'
|| injection%
d1.dependency_type = 'amod'

// therapy with X
d1.head_id, d2.dependent_id
d1.head_token LIKE 'therap%'
|| injection%
d1.dependent_id = d2.head_id
d2.head_token = 'with'
d2.dependency_type = 'pobj'

// injection of X
d1.head_id, d2.dependent_id
d1.head_token LIKE 'injection%'
|| administration
|| dose%
d1.dependent_id = d2.head_id
d2.head_token = 'of'
d2.dependency_type = 'pobj'

A particular case was often encountered where the chemical administration is not explicitly described, such as in the following sentence:

... effects were ... VOMITING in the FLUMAZENIL group.19

Here, the chemical administration is only implicitly indicated as a quality of the recipient. The fragment below describes the pattern X group, but there are many other sentences in this sense, such as women on ORAL CONTRACEPTIVES20 or occurrence of SEIZURES and neurotoxicity in D2R -/- mice treated with the cholinergic agonist PILOCARPINE21.

// X group
d1.head_id, d1.head_id
d1.dependency_type = 'compound'
d1.head_token LIKE 'group'

Diseases

A simple example of the description of the occurrence of a disease follows:

The development of CARDIAC HYPERTROPHY was studied ...22

In fact, such constructions involving nouns similar to those in the NOUN category are quite common, and it might be fruitful to explore possible synergies between the queries for the two categories. The fragments below exemplify how such constructions can be represented as fragments:

19 1286498 | 10 in the development set
20 839274 | 0 in the development set
21 11860278 | 4 in the development set
22 6203632 | 1 in the development set


// development of X
d1.head_id, d1.head_id
d1.head_token LIKE 'development%'
d1.dependent_id = d2.head_id
d2.dependency_type = 'pobj'

// effects of X
d1.head_id, d2.dependent_id
d1.dependent_token LIKE 'effect%'
d1.dependency_type = 'nsubj'
d1.head_id = d2.head_id

Chemical Disease Relation

The patterns that actually capture the structures representing a relation between a disease and a chemical are very varied. We have identified three ways of finding them:

1. The subject exposed to the chemical can be used to establish the connection between the disease and the chemical.

2. A time word establishes a temporal relation between administration ofa chemical and disease onset.

3. A preposition is used instead of a verb.

The first case is the most straightforward approach given our system. However, such sentences are surprisingly rare. While the sentence below is a good example of the kind of sentences that can be found with this approach, we found that the second case is far more fruitful.

... NICOTINE-treated rats develop LOCOMOTOR HYPERACTIVITY ...23

In fact, it seems that time words such as after are often used when describing chemical administration, which can be exploited to create more robust queries. The following fragments can be joined to fragments describing chemical administration as described above.

23 3220106 | 8 in the development set


// after X
d1.head_id, d1.dependent_id
d1.dependent_token = 'after'
d1.dependency_type = 'prep'

// following X
d1.head_id, d1.dependent_id
d1.head_token = 'following'
d1.dependency_type = 'dobj'

The resulting fragment, called time word+chemical for the purposes of this discussion, can then be joined directly to the occurrence of a disease, which allows for the finding of sentences such as the one below:

Delayed asystolic CARDIAC ARREST after DILTIAZEM overdose; resuscitation with high dose intravenous calcium.24

It could also be joined to a verb expressing disease occurrence to find sentences such as the following:

A 42-year-old white man developed acute hypertension with severe HEADACHE and vomiting 2 hours after the first doses of amisulpride 100 mg and TIAPRIDE 100 mg.25

It seems, however, that not joining the time word+chemical fragment to anything directly, but rather creating a query that merely checks for the co-occurrence of the pattern expressed by the time word+chemical fragment and a disease, yields good results with few false positives. While this claim needs further substantiation, we suggest that this is because the time word+chemical fragment is almost exclusively used in sentences expressing a chemical-disease relation.

A third way that is often used to express relations in the development set is to rely on prepositions rather than verbs. While this is especially common in titles, the example below shows how this can also be the case in normal text.

24 12101159 | 0 in the development set
25 15811908 | 2 in the development set


Two cases of postinfarction ventricular SEPTAL RUPTURE in patients on long-term STEROID therapy are presented ...26

However, since prepositions such as in in the example above are so common, it is very difficult to write queries that will only return sentences that use them to express a chemical-disease relation.

4.5 Summary

In summary, we created a system capable of extracting relations of any kind, and introduced the concept of fragments, which aids the process of writing queries. Both are domain-independent, and while we developed them with biomedical text mining in mind, they are just as applicable to other fields.

4.5.1 Arity of Relations

Note that no constraints are put on the number of entities participating in a relation. The distinction between relation and event extraction, as suggested by Simpson et al. [33], for example (see Section 1.3), thus has little meaning.

Currently, the query_helper module will generate queries that return the identifier of the sentence in which the relation is found. Using specialized queries, and making use of the subtree() function as described in Section 4.2.4, however, the epythemeus system can be adapted to return the individual entities participating in relationships of arbitrary complexity.

In fact, the queries developed for the ACTIVE and HYPHEN categories return relations consisting of three entities each, where the verb expresses the quality of the relation. For example, relations of the patterns X increases Y and X suppresses Y are currently both part of the ACTIVE category, but could be assigned to different categories to allow for a more differentiated extraction of relations.

In the same spirit, queries in the DEVELOP category in particular can extract relations consisting of various entities: capturing the dosage of drug administration, for example, is made quite easy using fragments.

26 9105126 | 1 in the development set


4.5.2 Query Development Insights

The examples above showcase how the concept of fragments greatly facilitates the creation of queries, especially in cases where many possible combinations of similar structural patterns occur. However, the writing of queries could be facilitated further if the dependency tuples also stored lemmata (forgoing the need for the LIKE operator and allowing for more concise queries), and if word lists could be supplied for alternatives, rather than listing every word individually. This might be especially useful to increase re-use: for example, queries for the PASSIVE category are very likely to use the same verbs as those used in the ACTIVE category.

While the manual creation of queries requires a good understanding of the annotation scheme used by the parser, the automatic generation of possible variations allows the system to cover a large proportion of relations. The use of the system has been demonstrated using the example of chemical-disease relation extraction, and the queries written for the demonstration are specific to that domain.

Note that the fragments and queries developed are specific to the dependency scheme employed by the parser. While efforts are being made to establish universally accepted standards such as the Universal Dependencies scheme27, these are not yet widely used, limiting the re-use of existing fragments and queries.

4.5.3 Augmented Corpus

In order to evaluate the system, and building on a previously annotated corpus, we manually categorized over 1000 sentences extracted from PubMed articles according to the pattern that defines the relation they contain. While this categorization does not follow any particular standard such as the ones laid out by Willibur [40], and in particular offers no measure of inter-annotator agreement, we hope that it will help the reader understand how to use the epythemeus system, and may be useful for other related research.

27 http://universaldependencies.github.io/docs/

Chapter 5

Evaluation

In this chapter we evaluate the epythemeus system against the test corpus described in Chapter 4. Furthermore, in order to give an estimate of the effort required to process the entirety of PubMed using the approach described in this thesis, we apply our approach to a small test set and extrapolate the measured results.

5.1 Evaluation of epythemeus

5.1.1 Query Evaluation

We evaluate the query sets developed for the HYPHEN and ACTIVE categories (described in Sections 4.4.1 and 4.4.2, respectively). While we also list the results for the queries written for the DEVELOP category (Section 4.4.3), that set of queries only served to exemplify how queries for more complicated patterns can be obtained. The results for this query set thus do not give any account of the efficacy of the epythemeus system, but serve only to further demonstrate the query development process.

For evaluation, the query sets were executed on the development set and the test set. The queries return article_id and sentence_id for every sentence in which a relation is found. From the manually categorized sentences of the development set, the article_id and sentence_id of every sentence belonging to the category in question are extracted using the categories.py script. The article_id and sentence_id pairs extracted in this fashion are taken as the gold standard.


The gold standard and the output produced by the query sets are then compared using the evaluate.py script. All scripts used for evaluation, as well as intermediate results, can be found in the data/manual_corpus directory.
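The comparison itself reduces to set operations over (article_id, sentence_id) pairs; the following sketch shows the computation evaluate.py performs, assuming a file format of one pair per line.

def load_ids(path):
    with open(path) as fh:
        return {tuple(p.strip() for p in line.split('|'))
                for line in fh if line.strip()}

def scores(gold, predicted):
    tp = len(gold & predicted)
    fp = len(predicted - gold)
    fn = len(gold - predicted)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return recall, precision, f1, tp, fp, fn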

HYPHEN queries

Table 5.1 shows that the query set has very high recall on the development set, and comparable recall on the test set.

set           recall   precision   F1 measure   TP    FP   FN
development   0.961    0.316       0.476        174   7    376
test          0.953    0.294       0.449        142   7    341

Table 5.1: Results of the HYPHEN query set executed on the development and test sets.

The false positives (FP) on the development set are all sentences in which the hyphen connects an element in parentheses, such as the sentence below:

decreased the incidence of LIDOCAINE (50 mg/kg)-induced CONVULSIONS.1

Such cases cannot easily be covered, given that in such situations the spaCy parser treats the hyphen as an individual token and produces a considerably more complex parse tree.

Of the seven FNs on the test set, five are due to the same issue. The remaining two are explained by a spelling mistake in the original text (the appearance of these LEVODOPA-induce OCULAR DYSKINESIAS2), and by the use of a word not previously encountered (rats with LITHIUM-treated DIABETES INSIPIDUS3).

The huge number of false positives (FP) warrants a more thorough discussion: While a systematic evaluation of these is beyond the scope of this work, ten randomly selected FPs were manually evaluated.

In one sentence4 marked as FP, the relation between amiodarone and pulmonary lesion should have been found. However, the sentence only contains

1 11243580 | 5 in the development set
2 11835460 | 3 in the test set
3 6321816 | 2 in the test set
4 18801087 | 3 in the test set


the words amiodarone-related lesion, and thus was not extracted as one of the sentences for the gold standard, but was found by the query. Similar problems arise with abbreviations: In the example below, the relation between streptozotocin and nephropathy is not recognised:

... STZ-induced diabetic nephropathy in ... mice.5

However, in the original annotation, this abbreviation STZ is annotated and given the same identifier as streptozotocin. In fact, it is the pubtator_to_relations.py converter that fails to resolve this correctly.

Another problem with the same converter, which led to two FPs, is that sentences are not extracted from the original PubTator file (see Section 4.3.1) if the participants of the relation occur in the text with their starting letters capitalized. This occurs occasionally in titles, and is a trivial bug to fix. However, fixing this bug would partially invalidate the results obtained so far. Two of the ten randomly selected FPs are attributed to this fault.

In one case6, a possible relation (glucocorticoid-induced glaucoma) that had not been annotated was returned. However, without expert knowledge, it is not possible to decide whether this is an oversight of the original annotation, or correctly classified as an FP.

The remaining three sentences are indeed false positives, returning phrases such as amphetamine-induced group7 or 5HT-related behaviors8.

Table 5.2 below summarizes the findings from the manual evaluation of the randomly selected sample of false positives.

reason                         number of sentences   projected
upper case in conversion       3                     102.3
different names for entities   3                     102.3
correct FPs                    3                     102.3
requires expert knowledge      1                     34.1

Table 5.2: Summary of reasons for a random sample of FPs in the test set, and projected numbers for the entire test set

5 20682692 | 7 in the test set
6 24691439 | 3 in the test set
7 24739405 | 8 in the test set
8 24114426 | 3 in the test set


ACTIVE queries

Table 5.3 lists the results of the ACTIVE query set on the development and test sets, respectively.

set           recall   precision   F1 measure   TP    FP    FN
development   0.939    0.121       0.215        46    333   3
test          0.596    0.065       0.117        28    403   19

Table 5.3: Results of the ACTIVE query set executed on the development and test sets

All three FNs on the development set are due to incorrect parses. For example, in the sentence below, the spaCy parser considers isoniazid increase a compound noun.

High doses of ISONIAZID increase HYPOTENSION induced by vasodilators9

Again, ten randomly selected sentences from the FPs on the test set were manually evaluated. In contrast to the HYPHEN query set, the FPs for this set seem to fall into one of two categories. In six cases, the sentence returned did seem to contain a relation, but not the one that was annotated. Without expert knowledge, it is not possible to make a definite assessment, but the example below shows that it is plausible that a relation was indeed found, and illustrates the complexity of sentences that are still recognized by the query set.

Application of an irreversible inhibitor of GABA transaminase, gamma-vinyl-GABA (D,L-4-amino-hex-5-enoic acid), 5 micrograms, into the SNR, bilaterally, suppressed the appearance of electrographic and behavioral seizures produced by pilocarpine10

Note that the above sentence was categorized as PASSIVE (for the relation between pilocarpine and seizures, which are the annotated entities). However, the relation between gamma-vinyl-GABA and seizures, which caused the sentence to be returned by the ACTIVE query set, was not annotated in the original corpus.

9 9915601 | 1 in the development set
10 3708328 | 7 in the test set


Three sentences are correctly marked as FPs. For example, the sentence we used Poisson regression11 is found, indicating that the word use may be too ambiguous to be used in ACTIVE queries without other structures. The sentence below is a correct FP, which could, however, hint at a possible relation if the reference each drug could be resolved.

Administration of each drug and their combinations did not produce any effect on locomotor activity.12

One exception is the following sentence, in which an incorrect parse causes it to be found.

Naloxone (0.3, 1, 3, and 10 mg/kg) injected prior to training attenuated the retention deficit with a peak of activity at 3 mg/kg.13

Table 5.4 below summarizes the random sample evaluation of false positives.

reason                      number of sentences   projected
correct FPs                 3                     120.9
requires expert knowledge   6                     241.8
incorrect parse             1                     40.3

Table 5.4: Summary of reasons for a random sample of FPs

DEVELOP queries

As explained above, the DEVELOP query set is intended to demonstrate the query creation process, and does not aim at high performance. Table 5.5 lists its results to convey a notion of what a few fragments can achieve.

set           recall   precision   F1 measure   TP    FP    FN
development   0.233    0.362       0.283        34    60    112
test          0.102    0.157       0.123        13    70    115

Table 5.5: Results of the DEVELOP query set executed on the development and test sets

11 25907210 | 5 in the test set
12 15991002 | 12 in the test set
13 3088653 | 3 in the test set


Query Evaluation Discussion

The results of the HYPHEN and ACTIVE query sets indicate that the epythemeus system is capable of delivering useful results.

The biggest obstacle to favorable performance is the low precision (0.294 on the test set for the HYPHEN queries, 0.065 for the ACTIVE queries, and 0.157 for the DEVELOP queries). Systems in the BioNLP '09 shared task achieved F1 measures of up to 0.52 [17], and thus surpass our results (F1 measures of 0.449 for the HYPHEN queries, 0.117 for the ACTIVE queries, and 0.123 for the DEVELOP queries) by far.

As the discussion above shows, these values are partially due to the inferior quality of the gold standard used for evaluation: It is very possible that our queries find relations that experts would consider as such, but which are not annotated in our reference corpus.

Besides that, further action needs to be taken to prune false positives. As stated earlier, we deliberately do not include named entity information in our current approach. However, future versions of the epythemeus system could use NER information to reduce the number of FPs, and thus increase the F1 measure. For example, all FPs returned by the HYPHEN query set on the test set could have been identified as such if NER information had been made use of.
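A minimal sketch of such a pruning step is given below. The data structures (hits, entities) are hypothetical; the entity information could, for instance, come from the python-ontogene entity recognition module.

# Sketch of the proposed NER-based pruning (not implemented in epythemeus).
# `hits` are (article_id, sentence_id, arg1, arg2) tuples returned by a query;
# `entities` maps (article_id, sentence_id) to the set of entity surface
# forms recognised by a NER component.
def prune_with_ner(hits, entities):
    kept = []
    for article_id, sentence_id, arg1, arg2 in hits:
        recognised = entities.get((article_id, sentence_id), set())
        # keep a hit only if both relation arguments are known entities
        if arg1 in recognised and arg2 in recognised:
            kept.append((article_id, sentence_id, arg1, arg2))
    return kept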

While we suggest here that NER information be used to prune the results returned by queries based solely on syntactic information, it is certainly more common to reverse the order of these approaches. As described earlier, however, this introduces the problem of cascading errors. It would thus be interesting to compare the outcomes of systems that use NER to prune previously obtained results with those that use NER as the basis for further refinement.

5.1.2 Speed Evaluation and Effect of Indices

The effect of the compound indices described in Section 4.2.2 on query execution time was evaluated using three sample queries:

Q1  X-induced Y
Q2  X causes Y
Q3  X is caused by Y

Additionally, one meaningless query (Q4) was created that uses a larger number of self-joins. The queries were executed in two different databases:


D1, containing 323 004 actual dependencies, and D2, containing 1 000 000 randomly generated entries, using the command-line tool of SQLite. The queries were slightly modified to match the random nature of the data in D2.14

Since the creation of D1 using the stanford_pos_to_db module involves other processing, such as the extraction of dependencies from spaCy objects, we only take note of the different creation times for D2 with and without indices. As table 5.6 below shows, adding new entries to the database took about 5 times longer when using indices. However, the database, once created, is not expected to change frequently. These numbers thus have little relevance compared to the increase in query speed displayed in tables 5.7 and 5.8. As these tables show, query times can be reduced by a factor of about 1.98 to 12.93, depending on the number of self-joins.

indexing          total creation time   time per entry
without indices   82.368 s              0.0824 ms
with indices      401.685 s             0.402 ms

Table 5.6: Table and entry creation speeds, with and without indexing

query   self-joins in query   without index   with index
Q1      0                     17 ms           17 ms
Q2      1                     29 ms           21 ms
Q3      2                     194 ms          15 ms
Q4      5                     1549 ms         781 ms

Table 5.7: Querying times for D1

query   self-joins in query   without index   with index
Q1      0                     1588 ms         1579 ms
Q2      1                     4641 ms         14811 ms
Q3      2                     53281 ms        19553 ms

Table 5.8: Querying times for D2

The execution of Q4 was interrupted after 600 s, and is thus not listed in table 5.8. Note that in table 5.8, Q2 takes about 3.19 times longer

14 The materials and data used to generate the numbers described in this section can be found in the data/db_indices directory that accompanies this thesis.


for the indexed D2 than for D2 without the index. We attribute this to the random nature of the data, and to the fact that the index cannot fully unfold its potential for queries that contain only one self-join.
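For reference, the sketch below shows how such an experiment can be set up with Python's built-in sqlite3 module. The table layout and index columns are assumptions for illustration; the actual schema of Section 4.2.2 may differ, and a real query would also join on the head token's position rather than its surface form.

import sqlite3
import time

conn = sqlite3.connect("deps.db")
conn.execute(
    """CREATE TABLE IF NOT EXISTS dependencies
       (article_id INTEGER, sentence_id INTEGER,
        head TEXT, dependent TEXT, relation TEXT)"""
)
# A compound index over the columns used in the join and filter conditions:
conn.execute(
    """CREATE INDEX IF NOT EXISTS idx_art_sent_rel
       ON dependencies (article_id, sentence_id, relation)"""
)

# A Q2-style query ("X causes Y") with one self-join: the subject and the
# object of the same verb within the same sentence.
q2 = """
SELECT a.article_id, a.sentence_id, a.dependent AS x, b.dependent AS y
FROM dependencies a
JOIN dependencies b
  ON a.article_id = b.article_id AND a.sentence_id = b.sentence_id
WHERE a.relation = 'nsubj' AND b.relation = 'dobj'
  AND a.head = 'causes' AND b.head = 'causes'
"""
start = time.perf_counter()
rows = conn.execute(q2).fetchall()
print(f"{len(rows)} matches in {time.perf_counter() - start:.3f}s")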

5.2 Processing PubMed

As Section 1.4 explains, processing the entire PubMed database of over 25 million articles is considered the ultimate goal of our research. In this section, we thus attempt to estimate the time it would take to process PubMed in its entirety using the approaches described in this thesis. The test set used for this evaluation, as well as other intermediary results, can be found in the data/pubmed_projection directory that accompanies this thesis.

5.2.1 Test Set

We selected a random set of 1000 article abstracts from PubMed. The test set has an average document length of 828 characters.

5.2.2 Timing

We measured the processing time for the individual stages using the Unix time command (taking the sum of the user and system values). This means that the times noted below are in terms of processor time for a single core15, and do not take into account that this task can easily be parallelized.

5.2.3 Downloading PubMed

As described in Section 2.5, there are several ways to access PubMed: Downloading the complete PubMed dump published on a yearly basis is certainly the most efficient, but it needs to be updated to include more recent publications. Because of this variability, we do not include the time it takes to prepare the PubMed article abstracts in our calculations.

15 1.8 GHz Intel Core i7


5.2.4 Tagging and Parsing

We used the Stanford POS tagger as described in Chapter 3 with the english-left3words-distsim.tagger model for POS tagging, and the stanford_pos_to_db.py module as described in Section 4.2.1 for database conversion.
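For readers wishing to reproduce this step, the tagger can be invoked from Python roughly as follows. Paths, memory settings and file names are illustrative assumptions; the flags follow the tagger's documented command-line interface.

import subprocess

# Tag one file of abstracts with the Stanford POS tagger (Java).
subprocess.run(
    [
        "java", "-mx1g",
        "-classpath", "stanford-postagger.jar",
        "edu.stanford.nlp.tagger.maxent.MaxentTagger",
        "-model", "models/english-left3words-distsim.tagger",
        "-textFile", "abstracts.txt",
        "-outputFile", "abstracts.tagged",
    ],
    check=True,
)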

5.2.5 Running Queries

The queries for the ACTIVE, HYPHEN and DEVELOP categories described in Section 4.4 are executed using the browse_db module with the --ids all argument.

Note that the queries written as part of this thesis do not cover all possible relations, and that they are specific to the CDR task. In order to generalize, we thus make the following assumptions:

• Query sets for applications other than CDR behave similarly in terms of execution time.

• The execution time of a query set is proportional to the number of relations it is intended to find.

Based on these assumptions, we note that the queries written as part of this thesis aim to cover the ACTIVE and HYPHEN categories completely, and achieve 23.3% recall on the DEVELOP category, thus covering about 42.374% of all relations. We therefore also note an extrapolated processing time (running queries* in table 5.9), which multiplies the actual processing time by a factor of 2.36.
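The arithmetic behind this extrapolation can be reproduced in a few lines; the values below are the measured times reported in this section.

SCALE = 25_000_000 / 1000   # PubMed articles / test-set size
COVERAGE = 0.42374          # share of all relations the query sets cover
FACTOR = 1 / COVERAGE       # ~2.36

measured = {                # seconds, measured on the 1000-abstract test set
    "POS tagging": 23.832,
    "database conversion": 48.882,
    "running queries*": 30.584 * FACTOR,
}
projected = {step: t * SCALE for step, t in measured.items()}
total_days = sum(projected.values()) / 86_400
print(projected, f"total: {total_days:.1f} days")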

5.2.6 Results

Table 5.9 below lists measured and estimated processing times. While the total projected time is an estimate that relies on many assumptions, it also shows how the systems presented in this thesis are indeed capable of processing the entire PubMed in reasonable time, given appropriate infrastructure.


step                  measured time   projected for PubMed
POS tagging           23.832 s        595 800 s (6 days, 22 h)
database conversion   48.882 s        1 222 100 s (14 days, 3 h)
running queries       30.584 s        764 600 s (8 days, 20 h)
running queries*      72.176 s        1 804 401 s (20 days, 21 h)
TOTAL                 144.89 s        3 622 301 s (41 days, 22 h)

Table 5.9: Estimated processing time for the entire PubMed

5.3 Summary

5.3.1 epythemeus

While the performance of the epythemeus system is inferior to current state-of-the-art systems, our evaluation points at the validity of our approach. We identify several key factors that could unlock performance gains, such as the inclusion of NER and lemmatization information, the use of a more suitable evaluation corpus, and the further development of queries.

However, changing the database to store such information would not compromise its independence. Lemmata and named entity information could be provided by the python-ontogene pipeline, as well as by other systems. This information could be used directly by fragments and queries alike to improve precision, without necessitating further development of the system.

More complex means of improving the performance of the epythemeus system could include the pruning of dependency trees, as suggested by [5]. This could make queries more robust to variations in parsing.

While the manually categorized sentences proved very useful both for query development and evaluation, the gold standard against which the queries were evaluated could have been improved. Partially, this is an extension of a shortcoming of the original BioCreative V corpus, in which relations are not annotated on the mention level, but rather on a document basis. This prompted the need for an error-prone extraction process, and led to lower precision in the evaluation. Given that the epythemeus system is not specific to chemical-disease relation extraction, however, other corpora could be used to obtain more reliable results.


5.3.2 Processing PubMed

As table 5.9 indicates, we estimate a total of almost 42 days of processing time to process PubMed and run a hypothetical set of queries to extract relations. This estimate assumes that query processing time is linear in database size. While such a number may seem daunting, recall that this measure is in terms of processing time for a single core, and that the tests were performed on a general-purpose home machine. Using a dedicated infrastructure with several cores, the goal of processing PubMed seems within reach.

Chapter 6

Conclusion

In this thesis, we have explored efficient rule-based relation extraction, and presented a set of systems as well as a novel way to facilitate the process of generating hand-written rules. We recapitulate our contributions briefly in Section 6.1. Special attention is devoted to processing speed: The final objective of this research is the extraction of relations from the entire set of 25 million article abstracts that PubMed contains. This has not been possible so far, but our results put such an endeavor within reach. In this short chapter, we conclude by assessing shortcomings and highlighting potential for future research.

6.1 Our Contributions

In order to extract biomedical relations from unstructured text, three systems are used:

1. The python-ontogene pipeline
2. The combination of the Stanford POS tagger and spaCy
3. The epythemeus system

6.1.1 python-ontogene

The python-ontogene pipeline revolves around a custom Article class, which is well-suited to storing biomedical articles in memory at various stages of processing. Special care was taken to keep this class flexible for various applications. The pipeline currently uses NLTK to provide tokenization and POS tagging, but was developed with modularity in mind, allowing the NLTK



library to be replaced by other tokenizers and POS taggers. Dictionary-based named entity recognition is used to extract named entities. By avoiding file-based communication between modules, the python-ontogene pipeline outperforms existing systems by far in terms of speed, while maintaining comparable levels of accuracy.

6.1.2 Parser Evaluation

To our knowledge, the spaCy parser included in our evaluation had not previously been the subject of scientific evaluation. We evaluated it together with three state-of-the-art parsers in terms of accuracy and speed. spaCy by far outperforms the other parsers in terms of speed, but does not yield satisfying accuracy. We show how this shortcoming can be overcome by using the spaCy parser in conjunction with the Stanford POS tagger.

6.1.3 epythemeus

The epythemeus system builds on the work described above, but is an independent system: It takes Stanford-POS-tagged files as input, dependency-parses them using spaCy, and saves the results in a database; however, other approaches can also be used to populate the database. The database can then be queried using manually created rules, either interactively or programmatically.

6.1.4 Fragments

The main contribution of the epythemeus system, however, lies in a new approach to phrasing rules and turning them into executable queries. A special shorthand notation was developed for so-called fragments, which represent the building blocks of rules. These fragments can be programmatically combined to create a set of queries that generalizes well, which greatly aids the development of rules. The fragments are converted into SQL queries, allowing the concept of fragments to be useful for other systems as well.
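As an illustration of the principle (not the actual shorthand notation, which is described in Chapter 4), two fragments can be combined into a family of executable SQL queries as follows; all names below are invented for this sketch.

# Fragments as SQL building blocks (illustrative only).
SUBJECT_FRAGMENT = "a.relation = 'nsubj' AND a.head = '{verb}'"
OBJECT_FRAGMENT = "b.relation = 'dobj' AND b.head = '{verb}'"

QUERY_TEMPLATE = (
    "SELECT a.article_id, a.sentence_id "
    "FROM dependencies a JOIN dependencies b "
    "ON a.article_id = b.article_id AND a.sentence_id = b.sentence_id "
    "WHERE " + SUBJECT_FRAGMENT + " AND " + OBJECT_FRAGMENT
)

def expand(verbs):
    """Combine the two fragments into one executable query per verb."""
    return [QUERY_TEMPLATE.format(verb=v) for v in verbs]

queries = expand(["causes", "induces", "produces"])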

6.1.5 Corpus

In order to develop queries and to evaluate them, a set of over 1000 sentences containing chemical-disease relations was manually categorized


according to the structure that points to the relation. We hope that this categorization can be useful in similar research.

6.2 Future Work

6.2.1 Improving spaCy POS tagging

While using the spaCy parser and the Stanford POS tagger together yields good results, the switching of environments (python3 and Java) considerably slows down processing. Given spaCy's ability to train POS tagging models, its own POS tagger could be improved. In particular, training spaCy's POS tagger on the output of the Stanford POS tagger would allow spaCy to deliver high-quality parses while forgoing the need to leave the python3 environment.

6.2.2 Integration of spaCy and python-ontogene

The development of python-ontogene preceded the evaluation of parsers described in Chapter 3. Using the spaCy library in the fashion described above would allow it to be integrated easily into the pipeline for POS tagging.

Building on that, a mapping between the spaCy objects containing dependency parses and the above-mentioned Article objects would allow the python-ontogene pipeline to also include dependency parsing, again forgoing the need for file-based communication between modules and repeated parsing.

6.2.3 Improvements for epythemeus

While the performance of the epythemeus system depends largely on the quality of the queries, the system itself has two shortcomings: since the database stores neither lemmatization nor named entity information, precision cannot easily be improved. Named entity information in particular would allow queries to be more robust and to yield much more satisfactory results. Again, the integration of the systems would alleviate this problem.

6.2.4 Evaluation Methods

The test set used for the evaluation described in Chapter 4 suffers from errors in the software that generated it. While this does not jeopardize the quality of


the epythemeus system, a more reliable evaluation could be performed.

6.3 Processing PubMed

As we explain in Section 5.2, the ultimate goal of processing the entire PubMed is put within reach, owing to the special attention we paid to efficiency when developing the aforementioned systems.

Bibliography

[1] Charu C Aggarwal and ChengXiang Zhai. Mining text data. Springer Science & Business Media, 2012.

[2] Sophia Ananiadou, Sampo Pyysalo, Jun'ichi Tsujii, and Douglas B Kell. Event extraction for systems biology by text mining the literature. Trends in biotechnology, 28(7):381–390, 2010.

[3] Stefan Behnel, Robert Bradshaw, Craig Citro, Lisandro Dalcin, Dag Sverre Seljebotn, and Kurt Smith. Cython: The best of both worlds. Computing in Science & Engineering, 13(2):31–39, 2011.

[4] Sabine Buchholz and Erwin Marsi. CoNLL-X shared task on multilingual dependency parsing. In Proceedings of the Tenth Conference on Computational Natural Language Learning, pages 149–164. Association for Computational Linguistics, 2006.

[5] Ekaterina Buyko, Erik Faessler, Joachim Wermter, and Udo Hahn. Event extraction from trimmed dependency graphs. In Proceedings of the Workshop on Current Trends in Biomedical Natural Language Processing: Shared Task, pages 19–27. Association for Computational Linguistics, 2009.

[6] Eugene Charniak. A maximum-entropy-inspired parser. In Proceedings of the 1st North American chapter of the Association for Computational Linguistics conference, pages 132–139. Association for Computational Linguistics, 2000.

[7] Jinho D Choi and Martha Palmer. Guidelines for the CLEAR style constituent to dependency conversion. Technical Report 01-12, University of Colorado at Boulder, 2012.

[8] Donald C Comeau, Rezarta Islamaj Doğan, Paolo Ciccarese, Kevin Bretonnel Cohen, Martin Krallinger, Florian Leitner, Zhiyong Lu, Yifan Peng, Fabio Rinaldi, Manabu Torii, et al. BioC: a minimalist approach to interoperability for biomedical text processing. Database, 2013:bat064, 2013.

[9] Marie-Catherine De Marneffe and Christopher D Manning. Stanford typed dependencies manual. Technical report, Stanford University, 2008.

[10] George R Doddington, Alexis Mitchell, Mark A Przybocki, Lance A Ramshaw, Stephanie Strassel, and Ralph M Weischedel. The automatic content extraction (ACE) program: tasks, data, and evaluation. In LREC, volume 2, page 1, 2004.

[11] Tilia Renate Ellendorff, Adrian van der Lek, Lenz Furrer, and Fabio Rinaldi. A combined resource of biomedical terminology and its statistics. In Proceedings of the conference Terminology and Artificial Intelligence (Granada, Spain), 2015.

[12] Ken-ichiro Fukuda, Tatsuhiko Tsunoda, Ayuchi Tamura, Toshihisa Takagi, et al. Toward information extraction: identifying protein names from biological papers. In Pac Symp Biocomput, volume 707, pages 707–718. Citeseer, 1998.

[13] Ralph Grishman and Beth Sundheim. Message understanding conference-6: A brief history. In COLING, volume 96, pages 466–471, 1996.

[14] Jörg Hakenberg, Steffen Bickel, Conrad Plake, Ulf Brefeld, Hagen Zahn, Lukas Faulstich, Ulf Leser, and Tobias Scheffer. Systematic feature evaluation for gene name recognition. BMC bioinformatics, 6(1):1, 2005.

[15] Lynette Hirschman, Alexander Yeh, Christian Blaschke, and Alfonso Valencia. Overview of BioCreative: critical assessment of information extraction for biology. BMC bioinformatics, 6(Suppl 1):S1, 2005.

[16] Lawrence Hunter and K Bretonnel Cohen. Biomedical language processing: what's beyond PubMed? Molecular cell, 21(5):589–594, 2006.

[17] Jin-Dong Kim, Tomoko Ohta, Sampo Pyysalo, Yoshinobu Kano, and Jun'ichi Tsujii. Overview of BioNLP'09 shared task on event extraction. In Proceedings of the Workshop on Current Trends in Biomedical Natural Language Processing: Shared Task, pages 1–9. Association for Computational Linguistics, 2009.

[18] Jin-Dong Kim, Sampo Pyysalo, Tomoko Ohta, Robert Bossy, Ngan Nguyen, and Jun'ichi Tsujii. Overview of BioNLP shared task 2011. In Proceedings of the BioNLP Shared Task 2011 Workshop, pages 1–6. Association for Computational Linguistics, 2011.

[19] Dan Klein and Christopher D Manning. Accurate unlexicalized parsing. In Proceedings of the 41st Annual Meeting on Association for Computational Linguistics - Volume 1, pages 423–430. Association for Computational Linguistics, 2003.

[20] Lingpeng Kong and Noah A Smith. An empirical comparison of parsing methods for stanford dependencies. arXiv preprint arXiv:1404.4314, 2014.

[21] Michael Krauthammer and Goran Nenadic. Term identification in the biomedical literature. Journal of biomedical informatics, 37(6):512–526, 2004.

[22] Ulf Leser and Jörg Hakenberg. What makes a gene name? Named entity recognition in the biomedical literature. Briefings in bioinformatics, 6(4):357–369, 2005.

[23] Mariana Neves. An analysis on the entity annotations in biological corpora. F1000Research, 3, 2014.

[24] Joakim Nivre. An efficient algorithm for projective dependency parsing. In Proceedings of the 8th International Workshop on Parsing Technologies (IWPT). Citeseer, 2003.

[25] Slav Petrov, Dipanjan Das, and Ryan McDonald. A universal part-of-speech tagset. arXiv preprint arXiv:1104.2086, 2011.

[26] Longhua Qian and Guodong Zhou. Tree kernel-based protein–protein interaction extraction from biomedical literature. Journal of biomedical informatics, 45(3):535–543, 2012.

[27] W Scott Richardson, Mark C Wilson, Jim Nishikawa, and Robert S Hayward. The well-built clinical question: a key to evidence-based decisions. ACP J Club, 123(3):A12–3, 1995.

[28] Fabio Rinaldi, Thomas Kappeler, Kaarel Kaljurand, Gerold Schneider, Manfred Klenner, Simon Clematide, Michael Hess, Jean-Marc Von Allmen, Pierre Parisot, Martin Romacker, et al. OntoGene in BioCreative II. Genome Biology, 9(Suppl 2):S13, 2008.

[29] Fabio Rinaldi, Gerold Schneider, and Simon Clematide. Relation mining experiments in the pharmacogenomics domain. Journal of Biomedical Informatics, 45(5):851–861, 2012.

[30] Fabio Rinaldi, Gerold Schneider, Kaarel Kaljurand, Simon Clematide, Therese Vachon, and Martin Romacker. OntoGene in BioCreative II.5. IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB), 7(3):472–480, 2010.

[31] Fabio Rinaldi, Gerold Schneider, Kaarel Kaljurand, Michael Hess, and Martin Romacker. An environment for relation mining over richly annotated corpora: the case of GENIA. BMC bioinformatics, 7(Suppl 3):S3, 2006.

[32] Isabel Segura Bedmar, Paloma Martínez, and María Herrero Zazo. SemEval-2013 task 9: Extraction of drug-drug interactions from biomedical texts (DDIExtraction 2013). Association for Computational Linguistics, 2013.

[33] Matthew S Simpson and Dina Demner-Fushman. Biomedical text mining: A survey of recent progress. In Mining Text Data, pages 465–517. Springer, 2012.

[34] Larry Smith, Lorraine K Tanabe, Rie Johnson nee Ando, Cheng-Ju Kuo, I-Fang Chung, Chun-Nan Hsu, Yu-Shi Lin, Roman Klinger, Christoph M Friedrich, Kuzman Ganchev, et al. Overview of BioCreative II gene mention recognition. Genome biology, 9(Suppl 2):1–19, 2008.

[35] Don R Swanson. Complementary structures in disjoint science literatures. In Proceedings of the 14th annual international ACM SIGIR conference on Research and development in information retrieval, pages 280–289. ACM, 1991.

[36] Erik F Tjong Kim Sang and Fien De Meulder. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In Proceedings of the seventh conference on Natural language learning at HLT-NAACL 2003 - Volume 4, pages 142–147. Association for Computational Linguistics, 2003.

[37] Tuangthong Wattarujeekrit, Parantu K Shah, and Nigel Collier. PASBio: predicate-argument structures for event extraction in molecular biology. BMC bioinformatics, 5(1):155, 2004.

[38] Chih-Hsuan Wei, Hung-Yu Kao, and Zhiyong Lu. PubTator: a web-based text mining tool for assisting biocuration. Nucleic Acids Research, 41, 2013.

[39] Chih-Hsuan Wei, Yifan Peng, Robert Leaman, Allan Peter Davis, Carolyn J Mattingly, Jiao Li, Thomas C Wiegers, and Zhiyong Lu. Overview of the BioCreative V chemical disease relation (CDR) task. In Proceedings of the fifth BioCreative challenge evaluation workshop, Sevilla, Spain, 2015.

[40] W John Wilbur, Andrey Rzhetsky, and Hagit Shatkay. New directions in biomedical text annotation: definitions, guidelines and corpus construction. BMC bioinformatics, 7(1):1, 2006.

[41] Alexander Yeh, Alexander Morgan, Marc Colosimo, and Lynette Hirschman. BioCreative task 1A: gene mention finding evaluation. BMC bioinformatics, 6(Suppl 1):S2, 2005.

[42] Pierre Zweigenbaum, Dina Demner-Fushman, Hong Yu, and Kevin B Cohen. Frontiers of biomedical text mining: current progress. Briefings in bioinformatics, 8(5):358–375, 2007.