Anemone: a Visual Semantic Graph

Degree Project in Computer Science and Engineering, Second Cycle, 30 Credits
Stockholm, Sweden 2018

JOAN FICAPAL VILA

KTH Royal Institute of Technology
School of Electrical Engineering and Computer Science

Master in Data Science
Date: March 15, 2019
Supervisor: Olof Görnerup
Examiner: Henrik Boström
School of Electrical Engineering and Computer Science


Abstract

Semantic graphs have been used for optimizing various natural language processing tasks as well as augmenting search and information retrieval tasks. In most cases these semantic graphs have been constructed through supervised machine learning methodologies that depend on manually curated ontologies such as Wikipedia or similar. In this thesis, which consists of two parts, we explore in the first part the possibility to automatically populate a semantic graph from an ad hoc data set of 50 000 newspaper articles in a completely unsupervised manner. The utility of the visual representation of the resulting graph is tested on 14 human subjects performing basic information retrieval tasks on a subset of the articles. Our study shows that, for entity finding and document similarity, our feature engineering is viable and the map produced by our artifact is visually useful. In the second part, we explore the possibility of identifying entity relationships in an unsupervised fashion by employing abstractive deep learning methods for sentence reformulation. The reformulated sentence structures are qualitatively assessed with respect to grammatical correctness and meaningfulness as perceived by 14 test subjects. We evaluate the outcomes of this second part negatively, as they have not been good enough to draw any definitive conclusion, but they have instead opened new doors to explore.

Keywords: Neo4j, Topic Modelling, Semantic Graph, Latent Dirichlet Allocation (LDA), NER, Sentence Reformulation.


Sammanfattning

Semantic graphs have been used to optimize various natural language processing tasks and to improve search and information retrieval tasks. In most cases such semantic graphs have been constructed through supervised machine learning methods that presuppose manually curated ontologies such as Wikipedia or similar. In this thesis, which consists of two parts, we investigate in the first part the possibility of automatically generating a semantic graph from an ad hoc dataset of 50 000 newspaper articles in a completely unsupervised way. The usability of the visual representation of the resulting graph is tested on 14 subjects performing basic information retrieval tasks on a subset of the articles. Our study shows that our feature engineering is viable for entity finding and document similarity, and that the visual map produced by our artifact is visually useful. In the second part we explore the possibility of identifying entity relationships in an unsupervised way by using abstractive deep learning methods for sentence reformulation. The reformulated sentences are evaluated qualitatively with respect to grammatical correctness and meaningfulness as perceived by 14 test subjects. We evaluate the results of this second part negatively, since they have not been good enough to reach any definitive conclusion, but they have instead opened new doors to explore.


Acknowledgements

I thank Fredrik Olsson, Nicolas Espinoza and Fredrik Espinoza for the very gentle and attentive tutoring they have provided me at Gavagai. As far as the university side is concerned, I would also like to express my gratitude to Barbara Pernici and Olof Görnerup for being my supervisors at my entry and exit universities, and to Henrik Boström for helping me to shape my thesis and for being my examiner.

Contents

1 Introduction
  1.1 Background
  1.2 Problem
    1.2.1 First Problem
    1.2.2 Second Problem
  1.3 Purpose
  1.4 Goal
    1.4.1 Benefits, Ethics and Sustainability
  1.5 Methodology
  1.6 Outline

2 Extended Background
  2.1 Building the Graph
    2.1.1 Not-only Structured Query Language (NoSQL) Databases
    2.1.2 Named Entity Recognition
    2.1.3 Topic Model - Latent Dirichlet Allocation
  2.2 Identifying Entity Relationships through Sentence Reformulation
    2.2.1 Abstractive Deep Learning Methods

3 Methodology
  3.1 Evaluating the Usability of the Semantic Graph
  3.2 Evaluating the Quality of the Sentence Reformulations

4 Unsupervised Semantic Graph Generation
  4.1 Motivation: From SQL to NoSQL Databases
  4.2 Motivation: Latent Dirichlet Allocation
  4.3 Graph Generation
    4.3.1 Basic Architecture
    4.3.2 Performance and Approximate Searches

5 Sentence Reformulations
  5.1 Recurrent Neural Networks
    5.1.1 Time Series
    5.1.2 Recurrent Neural Networks
    5.1.3 Long Short-Term Memory Cells
  5.2 From LSTM Cells to Sentence Reformulation
  5.3 Sentence Reformulation
  5.4 Relationships between Entities

6 Evaluation of the Results
  6.1 Evaluation of Usability
  6.2 Evaluation of Entity Relationships

7 Conclusions
  7.1 Usability of the Semantic Graph
  7.2 Entity Relationships
  7.3 Future Work

List of Figures

4.1 LDA diagram
4.2 Topicality extraction from query
4.3 Appears-in relationship example
4.4 Entity node example
4.5 Appears-in expansion of the entity Apple
4.6 Similar-to expansion of the entity Apple
4.7 Multiple query showcase 1
4.8 Multiple query showcase 2
5.1 Sequence-to-sequence translation
5.2 Word space example 1
5.3 Word space example 2
5.4 Abstractor model diagram
6.1 Evaluation of the graph generation
6.2 Qualitative analysis of the model


Acronyms

ACID Atomicity, Consistency, Isolation, Durability
LDA Latent Dirichlet Allocation
LSTM Long Short-Term Memory
NER Named Entity Recognizer
NoSQL Not-only Structured Query Language
RNN Recurrent Neural Network
SQL Structured Query Language


Chapter 1

Introduction

1.1 Background

Suppose you wanted to find out which cities are closest to a particular location. How would you want the results of your search to be presented: as a mere ranking that orders the cities by distance, or as a map that also expresses in which direction the cities lie and the distance between them? Presumably you would prefer the second solution, since it provides more information by incorporating not only the results but also the relationships between the results.

Today, this type of search result is commonplace, mainly due to Google's introduction of its so-called Knowledge Graph in 2012 [25]. The Knowledge Graph enables search engine features that complement ordinary string matching results with additional information, such as the birthplace, profession, years in office, and so on, when searching for instance for "Nikola Tesla".

Google's Knowledge Graph is a type of semantic graph. A semantic graph refers to a network that represents semantic relationships between concepts. It is a directed or undirected graph consisting of vertices, which represent concepts, and edges, which represent semantic relations between the concepts.

A semantic relationship is a logical binding between objects. The incorporation of certain semantic relationships into DBMS designs is commonly referred to as data abstraction: a simplified description, or specification, of a system that emphasizes some of its details or properties while suppressing others [52] [4].

The first semantic graph was introduced in 1956 by Richard H. Richens for the task of machine translation of natural language [45] [46]. Since then, much work has gone under the epithet of "semantic graph", most notably the SYNTHEX project [43] and MultiNet [20]. Over time, the variations of the structures referred to as semantic graphs have increased, which has made the standard definition looser [60].

Information retrieval consists in finding material of an unstructured nature that satisfies an information need from within large collections. This material usually comes from documents stored on computers [27] [7] [51] [48] [24]. There are many ways to retrieve this data. This work focuses on the visual presentation of information. For this reason, we pay special attention to building a visually useful semantic graph [58] [15] [26]. Making a semantic graph visually useful consists in representing the optimal amount of features so that the summary is as meaningful as possible in a limited space [13].

The loose definition of what a semantic graph is means that one can be built with several technologies, as long as it contains a logical binding between objects [4]. To populate it automatically, in this thesis we focus on machine learning and natural language processing.

Using supervised methods to populate the semantic graph requires large amounts of training data [2], because the wider the scope of the handled relationship types, the larger the example set has to be. This is not a scalable approach when facing large sets of new data that need to be processed on an ad hoc basis, because the data would need to be labelled, which requires a large amount of time.

Therefore it is of great interest to explore whether a semantic graph can be generated in a wholly unsupervised manner from a large set of previously unseen data.

In this thesis we explore a set of unsupervised methodologies for building a semantic graph that is generated automatically from large amounts of unstructured text data. The usefulness of such a graph depends on whether the automatically populated graph contains meaningful and accurate data, and on whether an actual user can gain insights from it.

In this thesis, which consists of two parts, we explore in the first part the possibility to automatically populate a semantic graph from an ad hoc data set of 50 000 newspaper articles in a completely unsupervised manner. The utility of the visual representation of the resulting graph is tested on 14 human subjects performing basic information retrieval tasks on a subset of the articles. In the second part, we explore the possibility to garner entity relationships in an unsupervised fashion by employing abstractive deep learning methods for sentence reformulation. The reformulated sentence structures are qualitatively assessed with respect to grammatical correctness and meaningfulness as perceived by 14 test subjects.

1.2 Problem

In this thesis we primarily consider two problems. The first concerns the way visually useful semantic graphs are generated, or rather the methods used to generate a visually useful graph; the second concerns the more specific problem of how the relationships between entities in a graph can be automatically identified and represented.

1.2.1 First Problem

Semantic graphs are a powerful way of summarizing large amounts of data by presenting a visual map that makes relational insights visually apparent. If the aim is to reduce the load on internal memory, these are two winning features compared to other forms such as tables [26]. Graphs help visualization and cognition by grouping information that is used together, thereby reducing search times and the demand on working memory [32]. We often see graphical tools that provide a wide range of design features, many of which serve only an aesthetic purpose; this does not necessarily yield a visually useful tool. According to the work of Jacques Bertin [13], the optimal design is the one in which the answer to a question can be perceived in a single image. It is for this reason that the feature engineering of a semantic graph can be a determining factor. Adding irrelevant nodes or relationships can make the summary redundant, and the important information can be lost. Hence, choosing which features and technologies to include in a semantic graph is an important aspect that needs to be carefully addressed.

During this thesis' literature study, we saw a tendency in the research community to contribute to the field of semantic graphs by regarding them as a structure to query or as a preliminary layer for a complementary algorithm. Unfortunately, previous work studying them as a visual organization to be directly interpreted by a human is more limited.

As explained in Section 1.1, [45] [46] [43] [20] marked a beginning in the field of semantic graphs. That said, one could argue whether they are directly relevant to our aim, as they do not constitute a product-shaped tool intended to be used directly by people.

One of the first systems of this kind to emerge was KnowItAll in 2006 [14]. Its purpose is to automate, in an unsupervised manner, the extraction of large collections of facts from the web in a domain-independent and scalable way. The most fundamental part of the algorithm is composed of two modules, the Extractor and the Assessor. The former is in charge of creating a query from keywords and sending it to a web search engine, then downloading relevant information through regular expressions based on part-of-speech tags. The Assessor performs a sanity check before adding the extraction to the knowledge base. The result is a knowledge base of entities that are interconnected through relationship tags.
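To give a flavour of the Extractor idea, the sketch below harvests "X such as Y" instances with a regular expression. The pattern and sample text are invented for illustration, and the sketch works over plain words rather than the part-of-speech tags the real Extractor uses:

```python
import re

# Toy "class such as Instance" pattern; KnowItAll's real patterns are
# defined over part-of-speech tags, not raw words.
PATTERN = re.compile(r"(\w+) such as ([A-Z]\w+)")

text = ("The report covers cities such as Stockholm and "
        "companies such as Ericsson.")

facts = [(cls, inst) for cls, inst in PATTERN.findall(text)]
print(facts)  # [('cities', 'Stockholm'), ('companies', 'Ericsson')]
```

An Assessor-style sanity check would then filter these candidate facts before they enter the knowledge base.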

Similarly, Christian Bizer et al. contributed to the field with DBpedia [3], a platform that extracts structured information from Wikipedia. The knowledge base interconnects a large number of entities which contain rich descriptions and references to other resources. They make use of other publicly available tools to improve their artifact, as they are interlinked with other data sources on the Web, and also serve as a preliminary layer for other applications.

One of the resources that DBpedia uses is WordNet [41]. Differing from the other tools listed in this section, WordNet is a manually compiled common-sense and lexical knowledge base. Despite its impressive size and completeness, it lacks extensional knowledge about individual entities of this world and their relationships, and limits its coverage to hyponymy (subclass-of) and meronymy (part-of).

The most recent work we found is that of Johannes Hoffart et al., who released YAGO2 [21]. YAGO2 is a knowledge base that improves on its predecessor YAGO [53] [54] by anchoring entities, facts, and events in both time and space. The content is extracted automatically from a set of online public sources such as Wikipedia, GeoNames, and WordNet. In the cited source, they contribute their extraction methodology, the integration of the spatio-temporal dimension, and a novel way of knowledge representation that they call SPOTL, where "SPO" stands for subject-property-object triples, "T" for temporal and "L" for location.

In the previously exposed literature, we find studies that measure the success of their methodologies according to the accuracy or precision of the represented features. These are certainly important metrics, but if the tool is intended to be used by a human, we believe that some metric of human usefulness should be introduced as well. By this we mean that it should not be taken for granted that a very complete and accurate knowledge base will result in an optimal resource to consult when searching for information. Such a precise graph will surely be better than an inaccurate version of the same source, but if the relevant information is lost in the middle of the remaining majority, it is still questionable to assume that the artifact is more useful than a rawer version of itself. For this reason, in this thesis we contribute to the research area by exploring semantic graphs from another perspective: visual usefulness. In order to do so, our metrics also comprise the time to find an answer.

1.2.2 Second Problem

A graph is created by establishing links between documents and entities. A link between two documents could be that they mention the same entity, that they are about the same topic, or that they are about the same relationships. For some specific application we could find that if two documents mention "Obama" then they are linked with respect to the entity [Person: Obama, Barack]. And if the two documents are about Putin visiting the White House, then they are linked with respect to that topic.

Links between entities, or entity relationships, are more difficult to identify. Take for instance the sentence "President Obama expressed great admiration for President Putin"; here we would like to identify the structure Person -> Feeling -> Person, i.e. Obama would be linked to Putin through admiration.
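The desired Person -> Feeling -> Person triple can be sketched with a toy pattern matcher. The person and feeling lexicons below are illustrative assumptions, not part of the thesis' pipeline, which aims to learn such structures without hand-built lists:

```python
import re

# Toy lexicons for illustration only.
PERSONS = {"Obama", "Putin"}
FEELINGS = {"admiration", "respect", "contempt"}

def extract_feeling_triple(sentence):
    """Return a (Person, Feeling, Person) triple when the sentence
    mentions two known persons and a known feeling word."""
    tokens = re.findall(r"[A-Za-z]+", sentence)
    persons = [t for t in tokens if t in PERSONS]
    feelings = [t for t in tokens if t in FEELINGS]
    if len(persons) >= 2 and feelings:
        return (persons[0], feelings[0], persons[1])
    return None

triple = extract_feeling_triple(
    "President Obama expressed great admiration for President Putin")
print(triple)  # ('Obama', 'admiration', 'Putin')
```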

Although it is possible to employ supervised Natural Language Processing and Machine Learning techniques to build a semantic graph, these algorithms must be trained to perform well in any specific domain. The current state of the art has focused on solutions that are able to extract elements of a graph only within the very specific topics on which they have been trained, or that use distant supervision and/or rely on syntactical rules.

Supervised algorithms perform well if there is training data available. Given a certain number of examples in which a relationship linking two entities appears, a supervised algorithm is able to identify the same pattern in future unseen examples. The limitation of such an algorithm is that it requires examples for each relationship tag that we want it to be able to identify.

One example of supervised learning for semantic relation classification is the work of Bryan Rink and Sanda Harabagiu [47], which explores multi-way classification of semantic relations between nominals using various linguistic resources. Their algorithm determines the relation and its direction with two different supervised classifiers that exploit 47 types of features, which can be grouped into the following sets: lexical features, hypernyms, dependency parse, predicates and nominal similarity.

Also exploiting dependency parsing, Kun Xu et al. [63] propose to learn robust relation representations from shortest dependency paths through a convolutional neural network.

From the unsupervised machine learning perspective, previous work can fundamentally be split in two: approaches that use distant supervision with a semantic database to deduce relationships, and, on the other hand, approaches that do not use machine learning but exclusively linguistic rule-based natural language processing techniques.

Distant supervision methods [29] use an external semantic database as ground truth to complement the algorithm, finding a middle ground between a supervised and an unsupervised algorithm.

In [37], Mike Mintz et al. propose a paradigm that skips the step of labelling the corpora by exploiting a semantic database of several thousand relations to provide distant supervision. Their algorithm relies on being supervised by a database, rather than by labelled text, and for this reason it does not suffer from the problems of overfitting and domain dependence that occur in supervised systems. But it still requires manual labelling to tag each relationship name in the external semantic database.
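The distant-supervision labelling heuristic can be sketched as follows; the mini knowledge base and sentences here are invented for illustration, standing in for the large relation database such systems actually use:

```python
# Hypothetical mini knowledge base of (entity1, entity2) -> relation facts.
KB = {("Berlin", "Germany"): "capital_of",
      ("Stockholm", "Sweden"): "capital_of"}

sentences = [
    "Berlin is the largest city of Germany.",
    "Stockholm hosts the government of Sweden.",
    "Oslo lies on a fjord.",
]

def distant_label(sentence):
    """Label a sentence with a KB relation whenever it mentions both
    entities of a known fact (the distant-supervision heuristic)."""
    for (e1, e2), rel in KB.items():
        if e1 in sentence and e2 in sentence:
            return (e1, rel, e2)
    return None

labels = [distant_label(s) for s in sentences]
print(labels)
```

Note how the second sentence is labelled "capital_of" even though it does not actually express that relation; this is exactly the noisy-labelling weakness of distant supervision discussed below.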

Kumar and Manocha follow another approach that takes a word embedding space as the external database. This study [29] relies on the assumption that the distance between two terms appearing in a word embedding space is meaningful and can be found, at least approximately, between other term pairs. When the same distance is found between two pairs of terms, this indicates that they share the same relationship. For example, the offset between Berlin and Germany, and analogously between Stockholm and Sweden, would correspond to "capital of". This word embedding space must be trained on large amounts of textual data in order to obtain accurate coordinates for each of the terms, which makes this method unable to infer quality relationship tags that are specific to a single document.
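The analogy assumption can be illustrated with hand-crafted toy vectors; real embeddings would of course be trained on a large corpus, and the equality below would hold only approximately:

```python
# Toy 2-d embeddings, hand-crafted so that the Berlin->Germany offset
# equals the Stockholm->Sweden offset (the "capital of" direction).
vec = {
    "Berlin":    (1.0, 3.0),
    "Germany":   (2.0, 5.0),
    "Stockholm": (4.0, 1.0),
    "Sweden":    (5.0, 3.0),
}

def offset(a, b):
    """Vector difference between two embedded terms."""
    (ax, ay), (bx, by) = vec[a], vec[b]
    return (bx - ax, by - ay)

# Equal offsets suggest the same relationship between both pairs.
print(offset("Berlin", "Germany"))    # (1.0, 2.0)
print(offset("Stockholm", "Sweden"))  # (1.0, 2.0)
```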

Another work that fits into this section is that of Hoffmann et al., who present a weakly supervised approach [22] for multi-instance learning with overlapping relations, combining a sentence-level extraction model with a simple corpus-level component for aggregating the individual facts.

In order to tackle the problem of multi-instance multi-label learning, Mihai Surdeanu et al. introduce a challenging learning scenario [55] in which the relation expressed by a pair of entities found in a sentence is unknown. Their algorithm jointly models all the instances of a pair of entities in text and all their labels using a graphical model with latent variables.

Finally, Yankai Lin et al. propose a sentence-level attention-based model for relation extraction [35]. By employing convolutional neural networks to embed the semantics of sentences and building sentence-level attention over multiple instances, they attempt to reduce the weight of the noisy wrong-labelling problem that is present in distant supervision methods.

As for methods employing syntactical rules, we observe that they have strong dependencies and lack a true understanding of word meanings beyond the parsing trees. Relying on syntactical rules often makes them identify low-quality tags, although their results are arguably more predictable.

For instance, in [39] Sergio Oramas et al. combine the output of a named entity recognizer with a syntactical rule-based component to extract relationships from unstructured music text sources. Candidate relations are obtained by traversing the dependency parse tree of each sentence in the text with at least two entities identified by the NER.
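A much-simplified sketch of this idea, using the surface text between entity mentions in place of a real dependency-parse traversal; the entity list and sentence are invented for illustration:

```python
# Surface-pattern stand-in for dependency-tree traversal: take the
# tokens between the first two recognized entities as a candidate
# relation. Real systems walk the parse tree instead.
ENTITIES = {"Miles Davis", "John Coltrane"}

def candidate_relation(sentence):
    """Return (entity1, relation_words, entity2) from the text
    between the first two entity mentions, or None."""
    found = sorted(
        (sentence.find(e), e) for e in ENTITIES if e in sentence)
    if len(found) < 2:
        return None
    (i1, e1), (i2, e2) = found[0], found[1]
    between = sentence[i1 + len(e1):i2].strip(" .,")
    return (e1, between, e2)

print(candidate_relation("Miles Davis hired John Coltrane in 1955."))
# ('Miles Davis', 'hired', 'John Coltrane')
```

Traversing the actual dependency path, as in [39], yields cleaner candidates than this surface heuristic, which breaks down as soon as modifiers intervene between the entities.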

The limitations of the algorithms employed in the previously mentioned studies are that they require labelled training data, and that they produce low-quality or incomprehensible relationship tags between extracted entities in the graph. Most importantly, there is a wide range of use cases where it is crucial that the algorithm responsible for extracting and structuring the information of the semantic graph adapts easily to new kinds of data, and this is difficult to accomplish when using algorithms that require labelling.

1.3 Purpose

The purpose of this thesis is, firstly, to test the claim that our set of technologies can automatically generate a semantic graph in an entirely unsupervised fashion that is an effective visual tool for increasing the speed of human information retrieval tasks on a large set of unstructured text data. We contribute to the research area by studying the success of a semantic graph from the visual usefulness perspective, introducing the time to find a correct answer into the test metrics, thereby addressing the first problem (1.2.1). We state in advance that the algorithms used to find and link the entities are not intended to be the best performing ones but only viable options. Since we use them to test the format of the overall tool, we consider that they occupy a placeholder that could also be filled by another algorithm with the same functionality and requirements.

Secondly, the thesis explores the possibility to identify not only entity and topic links, but also entity relationships, in an unsupervised fashion by employing abstractive deep learning methods for sentence reformulation, which is directly related to the second problem (1.2.2).

1.4 Goal

The purpose of the thesis will be achieved by fulfilling three interconnected objectives. The first is to deal with the practical concerns, primarily those pertaining to scalability, of designing and constructing a visualizable semantic graph by employing unsupervised machine learning techniques commonly used in the field of semantic graphs and related areas [1] [9] [5]. The outcome is an interactive visual representation of a semantic graph which has been automatically generated from 50 000 newspaper articles as input.


It has previously been demonstrated that graphical visualization of data is useful for finding patterns that are not otherwise attainable through ordinary human cognition [12]. Although it is highly plausible that this also applies to data visualized through a semantic graph, we are not aware of previous work that assesses the usability of a visual representation of a semantic graph.

Since there is no established criterion or objective measure for what constitutes a well-functioning semantic graph, the second objective is to qualitatively assess the usability of the implemented semantic graph through user testing [49] [58] [26] [15]. These first two objectives combined specifically address the First Problem mentioned in 1.2.1.

The third objective is to implement a novel approach for entity relationship identification by employing abstractive deep learning methods for sentence reformulation. The method reformulates natural language sentences into a NODE-RELATIONSHIP-NODE structure, in first-order relationship cases. The reformulated sentence structures are qualitatively assessed with respect to grammatical correctness and meaningfulness as perceived by 14 test subjects. This third objective addresses the Second Problem in 1.2.2.

1.4.1 Benefits, Ethics and Sustainability

The benefits of the thesis consist in making the tested approach available. The conclusions obtained throughout the process of reaching the goals of this thesis are intended to contribute to other studies with similar aims.

From a sustainability perspective, our policy is one of openness rather than secrecy: we make all of our methodology publicly available in this report. Indeed, our intent is to enrich the world's knowledge of the subject, a cause we believe we share with a considerable number of researchers and enterprises. There are many domains in which a semantic graph can bring added value and improve a process.

As regards the risks, all of our test individuals were given an explanation of the whole purpose of the project and an introduction to the study. The subjects participated voluntarily and, in order to respect their privacy, their names have been hidden from this report, although the collected data was not sensitive.

We would also like to state that, as we are testing the format of an artifact in the first phase of our work, this study might not prove resourceful for readers who use algorithms with the same functionality but different inputs or features for their artifacts, as they might not be able to completely adapt to our format.

For the latter phase of our work, whilst the sentence reformulation approach might not be deemed entirely successful, it constitutes a starting point for continued research. Since the research part rests on a hypothesis, there is a chance that we do not succeed in making it work. In that unfortunate but possible situation, we would then show that the methodology we followed does not improve on the current benchmarks, and our conclusions would serve as a starting point for other researchers, showing that the applied techniques do not solve the problem, and for which reasons.

Neither the results of the thesis nor the methods applied duringthe thesis have had any negative ethical impact, as far as the author isaware.

1.5 Methodology

We implement the semantic graph algorithm through an exploratory approach, utilizing readily available methods and technologies when possible and appropriate, with a focus on maintaining a scalable model. With the algorithm in place, we devise the sentence reformulation algorithm utilizing abstractive deep learning methods. Both the usability of the graph and the quality of the reformulated sentences are then empirically tested with human subjects. In order to build the graph, we use a NoSQL database that stores the information extracted from a dataset of 50 000 news articles. This information is composed of documents and relationships. We use two different methods to evaluate the outcomes: on the one hand, our method to generate a visually useful semantic graph, and on the other, our way of reformulating sentences into valid relationship tags. The first is evaluated according to the methodology explained in Section 3.1; the second follows the methodology explained in Section 3.2.

1.6 Outline

In Chapter 4 we explicate the motivations behind the technological choices that underlie our semantic graph approach. In Chapter 5 we elaborate on how we can reformulate sentences using abstractive deep learning methods. Last, in Chapter 6 we present the user results, where both the graph and the reformulations are evaluated.

Chapter 2

Extended Background

2.1 Building the Graph

2.1.1 NoSQL Databases

Until recently, rigid and highly structured relational databases were the norm. This situation has changed with the arrival of big data: internet-scale data sizes, frequent schema changes, and high read-write loads caused unmanageable spikes in web-based applications [33]. This new situation has necessitated a new kind of structure that is able to scale and manage massive amounts of information. Not-only Structured Query Language (NoSQL) databases do this by compromising reliability for better performance: they handle a high throughput and improve horizontal scalability [19]. In this thesis, we focus on graph databases [11] [1], a kind of NoSQL structure that adds the relationship dimension to the more basic (key, value) model and allows objects to be linked together. The benefits of using this design include a better understanding of the whole distribution and organization of the data just by looking at it, faster relationship-based queries, and a naturally adaptive model.
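The property-graph model behind databases such as Neo4j can be sketched in memory as follows; the node labels and relationship types used here are illustrative assumptions rather than the thesis' actual schema:

```python
# Minimal in-memory sketch of a property graph: nodes with properties
# plus typed, directed relationships between them.
class Graph:
    def __init__(self):
        self.nodes = {}   # node_id -> properties dict
        self.edges = []   # (source_id, relationship_type, target_id)

    def add_node(self, node_id, **props):
        self.nodes[node_id] = props

    def relate(self, src, rel_type, dst):
        self.edges.append((src, rel_type, dst))

    def neighbours(self, node_id, rel_type):
        """Relationship-based query: follow edges of one type."""
        return [d for s, r, d in self.edges
                if s == node_id and r == rel_type]

g = Graph()
g.add_node("Apple", kind="Entity")
g.add_node("doc-42", kind="Document", title="Apple unveils new iPhone")
g.relate("Apple", "APPEARS_IN", "doc-42")
print(g.neighbours("Apple", "APPEARS_IN"))  # ['doc-42']
```

A real graph database persists and indexes this structure, but the query shape, following typed relationships outward from a node, is the same.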

2.1.2 Named Entity Recognition

Named Entity Recognition is the task of identifying the named entities that appear in a text and, often, their type or predefined category. The first Named Entity Recognizer (NER) task was described by Grishman and Sundheim in 1996 [17]. At the time, NER systems were based on handcrafted rules, lexicons, orthographic features and ontologies. Today the process starts with the tokenization of the sentences and words of the documents, and continues with lemmatizing and stemming the terms to reduce inflectional forms, and sometimes derivationally related forms, of a word to a common base form. Part-of-Speech tagging is then placed on top of the stack to provide grammatical tags to a Dependency Parser that outputs the syntactic dependency tag of each word, which finally serves, together with some customized method, to identify whether a word is a named entity or not. During the last decade, this process has been improved by deep learning and time series models. See [64] for an extensive survey.
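To make the final identification step above concrete, the following is a deliberately naive sketch in the spirit of the early rule-based systems: a capitalization heuristic with a small stop-list stands in for the full tokenize, tag, parse and classify stack. It is not the SpaCy pipeline used later in this thesis, and the stop-list is an assumption for illustration only.

```python
import re

def tokenize(sentence):
    """Split a sentence into word tokens (a crude stand-in for real tokenization)."""
    return re.findall(r"[A-Za-z']+", sentence)

def naive_ner(sentence):
    """Flag capitalized tokens, excluding a toy stop-list, as entity candidates."""
    stop = {"The", "A", "In", "On"}
    return [t for t in tokenize(sentence) if t[0].isupper() and t not in stop]

print(naive_ner("Leonardo was born in the beautiful city of Venice"))
# -> ['Leonardo', 'Venice']
```

A statistical NER model replaces the capitalization heuristic with features learned from the tags and dependency structure, but the input/output contract is the same: a sentence in, a list of entity mentions out.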

2.1.3 Topic Model - Latent Dirichlet Allocation

A topic model is a type of statistical model designed to capture the topics that are present in a collection of documents by discovering latent semantic structures. These topics are in some cases abstract, in the sense that they do not need to be directly understandable or readable by a human. Latent Dirichlet Allocation (LDA) [5] is a generative statistical model built as a three-level hierarchical Bayesian model. LDA is able to cluster a set of documents and identify the words that are most likely to determine whether a text belongs to a specific topic or not. Its goal is to find short descriptions of the members of a collection that enable efficient processing of large collections while preserving the essential statistical relationships that are useful for basic tasks. The idea behind the algorithm is that documents are represented as random mixtures over latent topics, where each topic is characterized by a distribution over words. Its results are understandable by a human, and it is efficient to query.

2.2 Identifying Entity Relationships through Sentence Reformulation

In order to generate a visual representation of a semantic graph that is meaningful to a human, the unsupervised process must generate relationship links between the identified entities. To relate the entities, we will explore a method to reformulate the sentences in a way that follows the syntax: entity1-relationship-entity2.

One constraint we impose on these relationships is that they be deduced exclusively from the text. For example, representations of lexicons such as the ones reviewed by O. Medelyan et al. in [36] do not apply to our case, because their domain is general. Some work has already been done training supervised algorithms to mine this kind of node link, but to our knowledge, only techniques based on syntactical rules, without deep learning, have tried to solve this problem in an unsupervised manner.

We will focus on unsupervised methods for generating new versions of a given sentence by using abstractive deep learning methods with long short-term memory cells. This will allow us to capture the overall meaning of what is written and possibly express it with words that do not appear in the input corpus. What we will strive to do is reformulate sentences so that they can be displayed in the format: "NODE 1" – [RELATIONSHIP] –> "NODE 2". For example, transforming "Leonardo was born in the beautiful city of Venice" into the format: "Leonardo" – [BIRTHPLACE] –> "Venice".

We will approach this challenge by employing the same method that is used to perform abstractive language translations, but instead of translating from one language to another, we will translate a sentence in English into another sentence in English. The kind of model that we use could be seen as a chatbot plugged into a time series autoencoder that outputs words that are syntactically correct but also preserve the meaning of the sentence to be translated. We plan to display a tree of generated translations, and then pick only those that have the desired length and that start and end with the respective named entities. In the previous example, we would only pick the sentences that have length 3 and that start and end with Leonardo and Venice respectively.
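The final filtering step above can be sketched directly: given candidate reformulations from the generator, keep only those of length 3 that start and end with the two target entities. The candidate sentences below are hand-made assumptions standing in for real model output.

```python
def filter_reformulations(candidates, entity1, entity2, length=3):
    """Keep candidates shaped like: entity1 - relationship - entity2."""
    kept = []
    for sentence in candidates:
        tokens = sentence.split()
        if len(tokens) == length and tokens[0] == entity1 and tokens[-1] == entity2:
            kept.append(sentence)
    return kept

# Toy candidate translations (assumed, not real model output).
candidates = [
    "Leonardo BIRTHPLACE Venice",
    "Leonardo was born in Venice",
    "Venice BIRTHPLACE Leonardo",
]
print(filter_reformulations(candidates, "Leonardo", "Venice"))
# -> ['Leonardo BIRTHPLACE Venice']
```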

2.2.1 Abstractive Deep Learning Methods

Abstractive deep learning methods were introduced by Chopra et al. [10] for text summarization. The idea was to use a time series model that keeps adding information to its state as it reads text. This information is stored in a vector of fixed dimension that acts as the context and captures, in some way, the meaning of what has been said up to that moment. Every word of the sequence updates the state of the model and, once the sequence is finished, the internal state serves as an input feature and initial state to train another component of the model that tries to match the target label using this information. In other words, the machine learning algorithm is forced to capture the essential information of a sentence in the vector of fixed dimension, because this vector will be the main feature available to minimize the loss function. Additionally, the output of the model will depend only on the vocabulary or number of output classes that it has been given a priori; it will not depend on the input sequence, because the sequence has been encoded. Finally, the model of Chopra et al. uses long short-term memory cells because they perform outstandingly on the time series sequence-to-sequence task, being able to coherently capture information through time and keep track of hard-to-handle words such as adverbs or quantifiers. Since then, significant research [8] [38] [16] [56] has been conducted on abstractive methods for reformulating or summarizing text.

Chapter 3

Methodology

In this chapter, we discuss the choice of research method for generating a visually useful graph and the validation of the outcomes. As explained in section 1.4, it is among our aims to validate the viability of a set of technologies from the visual usability perspective.

In the first problem (1.2.1), we want to test the advantages of using a stack of layers that incrementally cooperate to form a visually useful tool. It follows from this statement that an environment able to test the whole system working together is needed. For this reason, we chose to build an artifact made out of the selected technologies. This artifact enables us to empirically test our feature engineering and hypothesis through observation and experience.

In the second problem (1.2.2), the purpose shifts towards proving whether a single technique is able to provide good quality sentence reformulations independently of the other layers' behavior. For this reason, we will leave this component outside of the artifact and test it individually.

The results are characterized by a large degree of freedom, flexibility and even subjectivity. Two different answer formats could both be correct, and it is the ability to present a dataset and make it understandable that should be evaluated. For this reason, a qualitative approach will be followed.

In the following sections, we describe the methodology we followed to implement the semantic graph algorithm through an exploratory approach, utilizing readily available methods and technologies when possible and appropriate, with a focus on maintaining a scalable model. With the artifact and algorithms in place, we devise the sentence reformulation algorithm utilizing abstractive deep learning methods. Both the usability of the graph and the quality of the reformulated sentences are then empirically tested with human subjects.

3.1 Evaluating the Usability of the Semantic Graph

To evaluate the visual usefulness of the result of the work, we partially follow the methodology of Renshaw [26]. In that study, the usability of two graphical formats is compared. The usability of each format is evaluated by the measures of success rates, time to completion, and user satisfaction, together with more recently established eye-movement-based tests. For our purposes, we omitted the eye tracking measures and focused only on the usability part. Our visual usefulness evaluation consists of testing volunteer participants on the tasks of judging relatedness and identifying entities over a set of texts. In order to do so, the subjects were presented with three simple questions to answer about 15 articles from The New York Times (2016-2017), which they were able to reference throughout the task.

We compared the time the subjects needed to answer the questions, and their accuracy, with and without the help of the semantic graph. Finally, the subjects were also asked which method was more satisfactory to them. 14 volunteer participants (9 male, 5 female) took part in the experiment. Their ages ranged from 21 to 57.

The questions were of the following structure:

1. Does the named entity X appear in the article Y?

2. Take the article Y, which of the other 14 articles is the most similarto it?

3. Did the software help you answer questions 1 and 2?

3.2 Evaluating the Quality of the Sentence Reformulations

To generate the solutions to evaluate (reformulated sentences), we take the top 5 most probable chains of tokens, understanding as a chain of tokens a sequence from its beginning until the <eos> token. To compute the probability of a whole sentence, we multiply the conditional probabilities at each step of the prediction. After obtaining the samples, we evaluate the performance of the model in a qualitative way. We do not evaluate quantitatively, because it is very difficult to automatically assess the results of a summarization/reformulation task that is obtained via meaning abstraction.
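The chain scoring above can be sketched as follows: the probability of a candidate chain is the product of its per-step conditional probabilities (accumulated in log space for numerical stability), and the top-k chains are kept. The per-step probabilities below are toy assumptions, not real decoder output.

```python
import heapq
import math

def chain_log_prob(step_probs):
    """Log probability of a chain: sum of log conditional probabilities per step."""
    return sum(math.log(p) for p in step_probs)

# Toy candidates: each chain mapped to its assumed per-step conditional probabilities.
candidates = {
    "Leonardo BIRTHPLACE Venice <eos>": [0.6, 0.5, 0.7, 0.9],
    "Leonardo LIVES Venice <eos>": [0.6, 0.2, 0.7, 0.9],
    "Leonardo was born <eos>": [0.6, 0.3, 0.4, 0.9],
}

# Keep the top-k most probable chains (here k=2; the thesis uses k=5).
top = heapq.nlargest(2, candidates, key=lambda s: chain_log_prob(candidates[s]))
print(top)
# -> ['Leonardo BIRTHPLACE Venice <eos>', 'Leonardo LIVES Venice <eos>']
```

Summing logs rather than multiplying raw probabilities avoids underflow on long chains, while preserving the ranking.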

These algorithms use words that are not necessarily present in the input sentence to express the meaning they encode in their internal state, and frequently the words used in the output are completely different from the ones seen in the input. Furthermore, two different outputs can both be right while using completely different words, as happens with paraphrases. This introduces a big problem, especially for sentence reformulation, because we want to list as many sentences with the same meaning as possible.

In [42], Toutanova et al. studied a similar problem and proposed different metrics to evaluate this task. We will use their human judgement method, which qualitatively evaluates the obtained sentences in terms of grammatical correctness and meaning preservation. We will ask 14 judges (raters) to grammatically evaluate each reformulation on a scale of 1 (major errors, disfluent) through 3 (fluent), and analogously for meaning preservation on a scale from 1 (bad) through 3 (most important meaning preserved). The raters will also have the option to rate a reformulation as 0 if it has nothing to do with the original or is completely wrong in terms of syntax. As quality control, we set the condition that the judges must be self-reported fluent English speakers.

Chapter 4

Unsupervised Semantic Graph Generation

4.1 Motivation: From SQL to NoSQL Databases

In relational databases, information is highly structured in tables where each row represents an entry and each column stores a specific type of information. The relationship between tables and types is called the schema and must be specified before any information is added, which means that if we want to change it after it has been defined, the whole database needs to be updated.

This rigidity maintains the Atomicity, Consistency, Isolation, Durability (ACID) properties, but it also slows down procedures when scaling out. This does not happen with NoSQL databases where, instead of tables, a document-oriented approach is used. Documents can contain many fields which do not need to be pre-specified as in SQL databases. This change requires an increase in processing effort and storage, but offers increased flexibility, an intuitive structure, and simpler horizontal scaling. The usefulness of this approach has increased with the arrival of big data: modern web-based applications caused spikes because of internet-scale data sizes, high read-write rates, and frequent schema changes.

For this reason, a new kind of database was created, the so-called NoSQL (Not Only SQL) database, which was designed to provide:

1. Reduced complexity


2. Higher throughput

3. Horizontal scalability and running on commodity hardware

4. Lesser reliability in exchange for better performance

These databases store the data in new formats, such as (key, value) pairs or with relationships. The latter is the case for graph data models, which add the relationship dimension, allowing objects and documents to be linked together. In this thesis we are going to focus only on graph databases, but there are several kinds of NoSQL databases:

1. Column store: stores the data by column instead of by row, which often increases query performance and scalability on big data sets.

2. Key-value model: stores data in a schemaless way by indexing the information using the classical (key, value) pair.

3. Document oriented: adds complexity to the key-value model, transforming the value into a document that can store many fields.

4. Graph database: very similar to document-oriented databases, but adds another layer, the relationship, which allows documents to be linked for fast traversal.

The reasons why we have chosen a graph database are that our data representation perfectly fits the node-relationship-node paradigm, and that we want to efficiently retrieve the nodes that are connected to others.

As we mentioned, in relational databases the information is only accessed through its primary key attributes, imposing the need to compute the matching keys at every query if we want to perform join operations, which are the equivalent of a connected representation; this results in memory-intensive and computationally expensive operations. Graph databases instead are designed to satisfy these kinds of queries by explicitly adding the relationships to the fields of the node itself, as if the join operations were already computed and just had to be retrieved from the node information.
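The contrast can be sketched with plain Python dictionaries as a toy "graph store": each node record carries its own relationship lists, so retrieving neighbors is a direct lookup rather than a computed join. The node fields and relationship names below are illustrative assumptions, not the actual Neo4j schema.

```python
# Toy node store: each document node materializes its own relationships.
nodes = {
    "doc1": {"title": "Elections", "SIMILAR_TO": ["doc2"], "HAS_ENTITY": ["Apple"]},
    "doc2": {"title": "Economy", "SIMILAR_TO": ["doc1"], "HAS_ENTITY": []},
}

def neighbors(node_id, relationship):
    """Retrieve linked nodes directly from the node record: no join computed."""
    return nodes[node_id][relationship]

print(neighbors("doc1", "SIMILAR_TO"))
# -> ['doc2']
```

In a relational layout, the same query would scan a link table matching foreign keys at query time; here the "join" was paid once, at insertion.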


4.2 Motivation: Latent Dirichlet Allocation

Latent Dirichlet Allocation is an algorithm to automatically discover topics from a set of composites or documents. As explained in [6] and [5], LDA is a generative probabilistic model for collections of discrete data such as text corpora. It is a three-level hierarchical Bayesian model in which each item of a collection is modeled as a finite mixture over an underlying set of topics. Each topic is, in turn, modeled as an infinite mixture over an underlying set of topic probabilities. From a text modeling perspective, the topic probabilities provide an explicit representation of a document. Each document is generated as a mixture of topics, where the continuous-valued mixture proportions are distributed as a latent Dirichlet random variable.

We define a topic as a distribution over a fixed vocabulary, such that a hypothetical topic about "The Universe" has words such as "star", "black hole" or "earth" with high probability, and analogously, a topic about "Food" has words such as "kiwi", "meatballs" or "lettuce" with high probability. The topics will not carry a unique tag such as "The Universe" or "Food"; each of them will instead be a group of words that appeared in similar documents during the training of the algorithm.

LDA technically assumes that the topics are generated before the documents; this means that we select a fixed number of topics before any data has been generated. These topics will be the corners of a simplex space in which we place the documents, attributing to them in this way a soft topical pertinence that can also be regarded as coordinates in this space, which makes the model easy to infer. The only observable features that the model actually sees are the words, which determine the remaining hidden or latent parameters. For this reason, LDA is very sensitive to the preprocessing of the data. Filtering out stop words or punctuation signs that do not help determine any meaningful topic can be critical, since the whole generative model starts exclusively from the text.

To train, the algorithm builds its topic model from a collection of documents made up of words; the model consists of two matrices: one expresses the probability of selecting a word when sampling a particular topic, and the other expresses the probability of selecting a topic when sampling a document. Additionally, the model introduces two parameters, β and θ, to control the distribution of words per topic and the mixture of topics respectively. Coming back to how the probability matrices are obtained: once a fixed number of topics is set, the algorithm iterates through each document in the collection, generating the words in a two-stage process:

1. Randomly initialize the distribution of words over topics.

2. For each word in the document:

(a) Randomly choose a topic from the distribution over topics in step 1. This corresponds to the per-document distribution over topics P(topic t | document d), the proportion of words in document d that are currently assigned to topic t.

(b) Randomly choose a word from the corresponding distribution over the vocabulary. This will be P(word w | topic t), the proportion of assignments to topic t, over all documents, that come from this word w. The word w will now be assigned to topic t with probability P(topic t | document d) * P(word w | topic t).
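The two-stage generative process can be simulated directly. The distributions below are hand-set assumptions for illustration; real LDA infers them from the corpus rather than being given them.

```python
import random

random.seed(0)  # make the toy simulation reproducible

# Assumed P(word | topic): per-topic distributions over a tiny vocabulary.
topics = {
    "universe": {"star": 0.5, "earth": 0.3, "hole": 0.2},
    "food": {"kiwi": 0.4, "lettuce": 0.4, "meatballs": 0.2},
}
# Assumed P(topic | document): the document's mixture over topics.
doc_topic_mix = {"universe": 0.7, "food": 0.3}

def sample(dist):
    """Draw one key from a {outcome: probability} distribution."""
    return random.choices(list(dist), weights=list(dist.values()), k=1)[0]

def generate_document(n_words):
    words = []
    for _ in range(n_words):
        topic = sample(doc_topic_mix)        # stage (a): choose a topic
        words.append(sample(topics[topic]))  # stage (b): choose a word from it
    return words

print(generate_document(8))
```

Training runs this logic in reverse: given only the observed words, it infers the per-document topic mixtures and per-topic word distributions.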

We formally describe LDA with the joint distribution (4.1), where each topic β_i is expressed as a distribution over the vocabulary, each topic proportion θ_d for a document is a distribution over the topics of that document, the topic assignments for a document are represented by z, and the observed words are w:

p(β_{1:K}, θ_{1:D}, z_{1:D}, w_{1:D}) = ∏_{i=1}^{K} p(β_i) ∏_{d=1}^{D} p(θ_d) ( ∏_{n=1}^{N} p(z_{d,n} | θ_d) p(w_{d,n} | β_{1:K}, z_{d,n}) )    (4.1)

Alternatively, we can also see the main idea of LDA in a more schematic manner in figure 4.1, where the boxes group the operations of parameters. The topic structure, which is composed of the hidden variables (topic proportions, assignments and topics), appears unshaded, and the words of the documents, which are the observed variables, appear shaded.

Figure 4.1: LDA diagram

The motivation for utilizing LDA as the topic model is that it is scalable, unsupervised, and not a black box. Latent Dirichlet Allocation is a robust generative probabilistic model that allows us to optimally infer on it and add new information without having to retrain.

When the data we are dealing with reaches the order of thousands or millions of documents and needs to incorporate new information periodically, it is important to keep an eye on the problems this can lead to. By obtaining a topical probabilistic composition of a document, we reduce the features needed to represent it in a space. As we explained in the previous sections, we can transform any document to its topical pertinence from its own words using the word-to-topic table. Considering that the number of words exceeds by far the number of topics needed to represent a set of documents, this is an important benefit.
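The dimensionality reduction described above can be sketched as follows: a document's words are mapped through an assumed word-to-topic table into a small topic-pertinence vector. The table values are toy assumptions; in the real system they come from the trained LDA model.

```python
from collections import Counter

# Assumed P(topic | word) table with toy values (real LDA learns these).
word_to_topic = {
    "star":  {"universe": 0.9, "food": 0.1},
    "earth": {"universe": 0.8, "food": 0.2},
    "kiwi":  {"universe": 0.1, "food": 0.9},
}

def topic_vector(words, topics=("universe", "food")):
    """Reduce a bag of words to a normalized topic-pertinence vector."""
    totals = Counter()
    for w in words:
        for t, p in word_to_topic.get(w, {}).items():
            totals[t] += p
    norm = sum(totals.values()) or 1.0
    return {t: totals[t] / norm for t in topics}

print(topic_vector(["star", "earth", "kiwi"]))
# -> {'universe': 0.6, 'food': 0.4}
```

A vocabulary-sized representation collapses to one value per topic, which is what makes storing and comparing millions of documents tractable.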

Furthermore, examining a document at the topic level instead of at the word level fits the task of discovering structure in the documents. And finally, in contrast to other algorithms, with LDA we are able to see what is happening just by checking the word probability distribution over topics: the words that have high pertinence to a topic explain what that topic is about.

4.3 Graph Generation

4.3.1 Basic Architecture

Entities that appear in the text are captured through Named Entity Recognition (NER). The NER is provided by SpaCy, which operates in 27 languages with state-of-the-art speed [9]. Documents that share common entities are linked. Documents are also linked based on similarity, which is derived from the topicality of the documents. For this we employed the Latent Dirichlet Allocation implementation from Gensim [44], which is a robust open source vector space and topic modelling toolkit for Python. In order to store all of the document links, we used the Neo4j [18] graph database, which is open source and conveniently includes a browser visualization tool.
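The linking logic described above can be sketched end to end with toy data standing in for the SpaCy and Gensim outputs: documents sharing an entity get linked, and each document is linked to its nearest neighbors in topic space. Cosine similarity and the two-topic vectors are illustrative assumptions; the actual system uses LDA topic vectors over 200 topics.

```python
from itertools import combinations
import math

# Toy per-document outputs: extracted entities plus a topic vector.
docs = {
    "doc1": {"entities": {"Apple"}, "topics": [0.9, 0.1]},
    "doc2": {"entities": {"Apple", "Tim Cook"}, "topics": [0.8, 0.2]},
    "doc3": {"entities": {"Venice"}, "topics": [0.1, 0.9]},
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# "Appears in"-style links: documents sharing at least one entity.
entity_links = [(d1, d2) for d1, d2 in combinations(docs, 2)
                if docs[d1]["entities"] & docs[d2]["entities"]]

def similar_to(doc_id, k=1):
    """'Similar To' links: top-k topical neighbors of a document."""
    others = [d for d in docs if d != doc_id]
    others.sort(key=lambda d: cosine(docs[doc_id]["topics"], docs[d]["topics"]),
                reverse=True)
    return others[:k]

print(entity_links)        # -> [('doc1', 'doc2')]
print(similar_to("doc1"))  # -> ['doc2']
```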

4.3.2 Performance and Approximate Searches

Having combined NER with LDA and stored the connections in a NoSQL database (in this case Neo4j), we were in a position to plot a graph from unstructured text and make entities and documents appear as nodes.

We were now also able to relate documents with a percentage-weighted "Similar To" relationship and to indicate where the entities appear with the "Appears In" relationship.

Initially, it was taking considerably long to process and add the data to the graph: to give an approximate number, plotting a dataset of 50,000 news articles from the New York Times took us two days, even limiting the number of "Similar To" neighbours to 20.

Secondly, the graph was interesting to see and explore, but it was really difficult to find something specific that had been defined before plotting it. In a graph of more than one million nodes, it is practically impossible to check the nodes one by one until finding the one you are looking for. The data was there, semantically organized, but extremely difficult to access.

The Cypher query language is able to find nodes that exactly match the properties we declare, but it is not able to find similar ones. We can look for an existing article by its name, "The cat was having fun in the pool", but we cannot ask the graph about articles where cats are having fun. This fact evidenced the need for a query system able to satisfy these needs by retrieving approximate suggestions.

To solve the first problem, a deeper understanding of Neo4j was needed. After carefully analyzing our algorithm to find the bottleneck, we discovered that the database was fast at performing MATCH queries, which roughly correspond to "finding" a structure in the database, but slow to create or add information. We tried to solve this by setting indexes on the nodes' most important fields (those most used to access information) to reduce their lookup time, which instead increased the total time and made it evident that the problem was not going to be solved this way. This practice is useful for finding information by its hash, but it does not help the process of adding information, because there is extra information in the node to create each time.

Finally, we found out that there were other methods to import data into the graph. What we were doing up to that moment was querying the database through one-by-one Cypher queries, built by a Python function that filled in the blanks using the passed parameters (e.g. name, type, ...) and output a string with the desired syntax. Although the query structure was correct, this slowed down the overall process because the queries were performed individually. Fortunately, Neo4j has a CSV import option which allows setting a periodic commit and optimally manages the interaction with the database. Using this feature, we decreased the time required to insert the same data by a factor of more than 100.
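The batch-import preparation can be sketched as follows: instead of issuing one Cypher CREATE per record, the node records are written out as CSV so they can be bulk loaded (e.g. with Neo4j's LOAD CSV and a periodic commit). The field names and the in-memory buffer are illustrative; the real pipeline writes actual files.

```python
import csv
import io

# Toy document records to export (the real pipeline serializes 50,000 articles).
documents = [
    {"id": "doc1", "title": "Elections", "type": "Document"},
    {"id": "doc2", "title": "Economy", "type": "Document"},
]

buffer = io.StringIO()  # stands in for a file such as documents.csv
writer = csv.DictWriter(buffer, fieldnames=["id", "title", "type"])
writer.writeheader()
writer.writerows(documents)

print(buffer.getvalue())
```

The payoff is that the database engine batches its commits over the whole file instead of paying transaction overhead per node.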

To solve the problem of approximate searches, we added a functionality that created a new node every time a search query was given. Its properties and relationships were the same as those of our existing nodes (:Document types) and included LDA similarities and entities as well, which made the system "understand" natural language queries. By understanding we mean understanding in the sense of topicality, which is what LDA provides, and not the true meaning of the sentence; see figure 4.2.

Figure 4.2: Topicality extraction from query.

Finally, we decided to add a session feature to enhance usability. With it, the user is able to perform many queries in a row that organize the data by tensing it towards the searches, as if they were vertices with gravity. If a user looks up three statements, then when plotting the nodes and relationships that connect two or more search queries, the user will quickly realize which existing documents are a good match for one or more queries at a time.

To recap, after completing the previous phases of the thesis, we find ourselves in a position where we have a scalable graph that is able to express the semantic similarity relationships among the documents and show the different entities that appear in each of them. We find this map of what happens in the text corpora useful because it reveals insights into where things come from; discovering the connections gives an idea of why something is important and why to choose one source instead of another.

To show the outcomes, we will first perform a data exploration on a dataset of 50,000 news articles from the New York Times (2016-2017).

The following explanation will show how each layer of the stack has improved and complemented its predecessor, with a brief explanation and visual support. A qualitative analysis to evaluate the sentence reformulation task will be done in section 6.2.

The first step is to extract the entities from the documents and plot them to the graph. In figure 4.3 below, some random documents are represented with a big red node, and the entities appear as smaller bubbles of different colors, each color representing a type. The entities are linked to each document they appear in with an "Appears In" relationship, which can be multiple and relate many documents.

Additionally, it is also possible to navigate through the documents that mention a specific entity by expanding its relationships. For example, if we take the entity Apple of type Organization, we can expand it and learn in which other articles it was also mentioned (figures 4.3 and 4.5).

In order to relate documents and navigate through them, we added a "Similar To" tag, which adds connections to the top k neighbors of each document. In this case, we added k=20 neighbors to each document, the documents having previously been clustered in a space created by LDA with 200 topics. As with the entities, we can also interactively expand or contract the relationships of a document (figure 4.6).

Figure 4.3: Appears-in relationship example.

Figure 4.4: Entity node example.

The query system we have incorporated works through the terminal, retrieving at each search statement a ranking of the top 1000 most similar documents using their titles (in less than 0.5 s). Multiple queries can be made, which can later be represented in the Neo4j browser graph and give the best recommendation by looking at the intersection of their neighbors. For example, in figure 4.7 below, we looked up "International politics and global economy", "American politics" and "Italian politics" and set the limit of displayed elements to 50. This procedure narrowed down the 50,000 articles to a few just by specifying three statements. Furthermore, if we look in more detail, we can also see that there are three articles that relate to "International politics and global economy" and "Italian politics" but not to "American politics", which is useful if we want to explore the different paths and kinds of similarity that lie under our data. Finally, another example including the named entities is displayed in figure 4.8, which shows the full functionality of Anemone.


Figure 4.5: Appears-in expansion of the Entity Apple.


Figure 4.6: Similar-to expansion of the Entity Apple.


Figure 4.7: Multiple query showcase 1.


Figure 4.8: Multiple query showcase 2.

Chapter 5

Sentence Reformulations

5.1 Recurrent Neural Networks

5.1.1 Time Series

A time series is defined as "an observation on a stochastic process" [40], or in other terms, a set of points organized in chronological order and expressing the evolution of a certain value, state or figure over a certain time span. A time series is said to be multivariate when the input channels whose variation is recorded over time are multiple (e.g. rain and humidity levels over a certain period of time). Different approaches are available to tackle the time series forecasting problem, among which are statistical, probabilistic and analytic approaches such as auto-regressive integrated moving average (ARIMA) models, diffusion models and modeling through Markov processes. In recent years, machine learning techniques have shown their potential when applied to time series forecasting and, in particular, Artificial Neural Networks have proven to be an extremely powerful tool when it comes to analyzing a time series.

5.1.2 Recurrent Neural Networks

Recurrent Neural Networks (RNN) are a strict superset of artificial neural networks, disposing of memory cells and with neurons connected in directed cycles. They were first introduced by Hopfield in 1982 [23]; representative of this class are Elman [34] and Jordan networks [28]. The possibility of including stored information in the computation makes these architectures particularly suitable for multiple inputs interpreted as a sequence, such as the word embeddings of each token of a sentence. These architectures are generally trained through a technique that is considered an adaptation of standard back-propagation, called back-propagation through time [59]. These networks have shown empirical evidence of being particularly suitable for time series forecasting, thanks to their ability to extend the input-output relationship to the whole sequence. The main limitation of this model is that when the network is unfolded for a large number of steps, the gradient tends to be amplified or suppressed, causing the exploding or vanishing gradient phenomena respectively.
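The unfolding problem can be illustrated numerically. In a linear recurrence h_t = w * h_{t-1}, the gradient of h_T with respect to h_0 is w**T: one factor of the recurrent weight per unfolded step, so it explodes for |w| > 1 and vanishes for |w| < 1. This toy scalar case is a simplification of the matrix case, where the same behavior is governed by the eigenvalues of the recurrent weight matrix.

```python
def gradient_through_time(w, steps):
    """Gradient of h_T w.r.t. h_0 in the linear recurrence h_t = w * h_{t-1}."""
    grad = 1.0
    for _ in range(steps):
        grad *= w  # one factor of the recurrent weight per unfolded step
    return grad

print(gradient_through_time(1.1, 50))  # amplified: exploding gradient
print(gradient_through_time(0.9, 50))  # suppressed: vanishing gradient
```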

5.1.3 Long Short-Term Memory cells

Long short-term memory [50] is a recurrent architecture that "is designed to overcome these error back-flow problems". This model solves the problems of RNNs and of other proposed solutions for learning temporal structure "by enforcing constant error flows through constant error carousels within special units called cells". A Long Short-Term Memory (LSTM) cell "contains a node with a self-connected recurrent edge of weight 1, ensuring that the gradient can pass across many time steps without vanishing or exploding". Empirical evaluation has shown that LSTMs outperform traditional RNNs in their capability to learn long-term dependencies.
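A minimal forward step of an LSTM cell can be written out from the standard gated formulation, using scalar weights for readability (real cells use weight matrices, and biases are omitted here). This is an illustrative sketch, not the architecture of [50] verbatim.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, p):
    """One LSTM step; p is a dict of scalar weights (biases omitted)."""
    f = sigmoid(p["wf"] * x + p["uf"] * h_prev)          # forget gate
    i = sigmoid(p["wi"] * x + p["ui"] * h_prev)          # input gate
    o = sigmoid(p["wo"] * x + p["uo"] * h_prev)          # output gate
    c_tilde = math.tanh(p["wc"] * x + p["uc"] * h_prev)  # candidate cell value
    c = f * c_prev + i * c_tilde  # cell state: the "constant error carousel"
    h = o * math.tanh(c)          # hidden state exposed to the next layer
    return h, c

# Arbitrary toy parameters; a trained network would learn these.
params = {k: 0.5 for k in ("wf", "uf", "wi", "ui", "wo", "uo", "wc", "uc")}
h, c = 0.0, 0.0
for x in [1.0, -1.0, 1.0]:  # feed a short input sequence
    h, c = lstm_step(x, h, c, params)
print(h, c)
```

The additive update of c (rather than a repeated multiplication, as in a plain RNN) is what lets the gradient pass across many steps without vanishing or exploding.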

5.2 From LSTM Cells to Sentence Reformulation

Long Short-Term Memory cells have proven able to capture long-term dependencies using a special channel and gates that make it very easy for information to flow along it without changing the state. This takes the RNN to a new level, allowing it to capture more information that can then be used to decide which outputs to choose. These interesting properties make LSTMs useful in the field of Natural Language Processing when combined with Word Embeddings.

In Sequence to Sequence Learning with Neural Networks [56], O. Vinyals et al. show great performance on an English-to-French translation task using a multi-layered Long Short-Term Memory to map the input sequence to a vector of fixed dimensionality, and then another deep LSTM to decode the target sequence from the vector. Additionally, they discovered that reversing the order of the words in all source sentences improved the LSTM’s performance because it introduces many short-term dependencies between the source and the target sentence.

This encoder-decoder structure allows them to split the tasks to be performed, so that the encoder is in charge of converting a sequence of embedded words into a vector of fixed dimension. To make the explanation easier, this vector can be seen as an embedding of the whole sentence: it is the internal state of the model that will be fed to the decoder. The decoder is another set of layers trained to output syntactically correct sentences that also fit the target translation, which is why the internal state is used as well. More specifically, the decoder learns to output each token one by one, given the previous tokens and the state produced by the encoder.

Again, to simplify the explanation, it is as if we evaluated this component on the task of outputting a precise label using the state generated by the previous network.

The point of the overall structure is to abstract the meaning of an input sentence and turn it into a vector of numbers that is not completely language independent, but at least captures the concept of what has been entered into the machine, allowing the decoder to produce tags in another language. We say that the vector is probably not language independent because each language has a different compositional structure, and the model will “take into consideration” different aspects of the sentence, resulting in a different optimal sentence representation to be interpreted by the following layer. At the end of the day, the model will have to be retrained and all its parameters retuned for each language we want to translate.

The work of O. Vinyals et al. is interesting for us not only for its outstanding performance but also for the way the model performs inference. We mentioned before that this structure tries to guess each token given the previous ones. This is done by performing a maximum-argument operation on the probabilities that the decoder predicts over the available classes, which in this case are words. It is fair to mention that there are also other, more sophisticated methods such as beam search by S. Wiseman and Alexander M. Rush [62] that improve the optimization process and have proved to work on sequence-to-sequence tasks, but since we are not going to use them we will not go into detail. Returning to the inference model, as shown in figure 5.1, we choose each token in the sequence, and this one conditions the next, and so on.

Figure 5.1: Sequence to sequence translation

We see in the previous example that we input one by one the words of the sentence “I am big” followed by an <eos> symbol, which stands for end of sentence. Once this symbol is received, the inference phase starts, and the machine produces the French translation: “Je suis grand”. As we mentioned, the output sentence tries to express the meaning of the state while remaining syntactically correct. This also means that if, at a given moment during inference, we choose a token other than the one with the highest probability, it will probably generate a whole new tree of outputs.

This is the basis of our hypothesis for this section: we plan to generate different versions by choosing diverse words in the probability ranking. The structure will be similar to the one explained in the work of O. Vinyals et al., but we will simplify it as a matter of computational resources.

Additionally, we will use an English-to-English instead of an English-to-French dataset, which might be a bit confusing, but further details will be explained in the following chapter.

5.3 Sentence Reformulation

We recall that we are interested in reformulating sentences in order to make their structure fit the schema node-relationship-node.


We discovered that several works have tried to tackle this problem using supervised or distantly supervised machine learning methods, but few that work purely unsupervised. We will not go into detail on these cases, but there is a survey [30] by Shantanu Kumar that exposes the current state of the art for supervised methods.

It is difficult to find metrics to evaluate abstractive compression of sentences. As explained in [57], it is very hard to evaluate the new version of a sentence that comes out of a vector of fixed length, because there can be multiple valid versions, sometimes leaving a qualitative analysis as the best evaluation method.

From the unsupervised machine learning perspective, fewer techniques have been tested, and they fundamentally split in two: on the one hand, approaches that use a word embedding space to deduce relationships from distances between words using analogies [29], and on the other hand, those that do not use machine learning but exclusively Natural Language Processing.

Before explaining the algorithm of this thesis, we will give a general idea of these paths in order to better justify the engineering decisions we have taken. To explain how we extract analogies from word embedding spaces, we will use as graphical support a 2-dimensional approximation of a partial word space example in figure 5.2.

Figure 5.2: Word space example 1

The coordinates of these spaces are usually tuned on large volumes of data so that the position of each word is meaningful and two words that are close in the space are also close in meaning. This is important because it is the basis of the next point: here we are representing just two dimensions, but word spaces usually have many more in order to capture different kinds of relationships.

It depends on the algorithm that lies behind the word space, but a reasonable amount would be at least 100 dimensions. For us humans, the sense of each of these axes is hard to understand because it is optimized for machines; symbolically, maybe one dimension could represent sentiment by capturing whether it is positive or negative, and another one size and color together.

The important thing is to understand that the coordinates of each word have a meaning and are not randomly set. When looking for analogies, not only the position but also the distance between words is taken into account.

It can be appreciated in figure 5.3 that the distance between Woman-Queen and Man-King is the same, suggesting that a “king is to queen what a man is to woman”, and similarly, a “Cat is to Cats what a Dog is to Dogs”. This can help us deduce that the distance between Cat-Cats and Dog-Dogs means singular/plural, depending on the direction, which is a very sophisticated deduction coming from an unsupervised algorithm. It means that you can take a text database in a dead language, tokenize it and train your word embedding algorithm on it to relate the tokens without really knowing the meaning they have.
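The King/Queen and Cat/Cats analogies can be reproduced with a toy, hand-made 2-dimensional word space; the coordinates below are illustrative placeholders, not real trained embeddings:

```python
import numpy as np

# Toy 2-D "word space" with hand-picked coordinates (illustrative only):
# parallel offsets between pairs of words encode analogies.
vecs = {
    "man":   np.array([1.0, 1.0]),
    "woman": np.array([1.0, 3.0]),
    "king":  np.array([5.0, 1.0]),
    "queen": np.array([5.0, 3.0]),
    "cat":   np.array([2.0, 0.0]),
    "cats":  np.array([2.0, 2.0]),
    "dog":   np.array([3.0, 0.0]),
    "dogs":  np.array([3.0, 2.0]),
}

def closest(target, exclude):
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return max((w for w in vecs if w not in exclude),
               key=lambda w: cos(vecs[w], target))

# "king is to ? what man is to woman": king - man + woman lands on queen.
answer = closest(vecs["king"] - vecs["man"] + vecs["woman"],
                 exclude={"king", "man", "woman"})
print(answer)  # queen

# The Cat-Cats and Dog-Dogs offsets coincide: a singular/plural direction.
print(vecs["cats"] - vecs["cat"], vecs["dogs"] - vecs["dog"])
```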

The downside of this technique, at least for our purposes, lies in the corpus this word space is trained on. The relationship between Cat-Cats seems evident, but what about the relationship between Man-University? There are many possible relationships here, such as student, professor, graduate, and so on. Word embeddings are too focused on general purpose because they need large amounts of text to produce quality coordinates, and we lose the context of the entities of the document. If we used this technique, we would not be able to express what happened or what the relationship between two named entities was in a specific document. To sum up, this method would be useful to show relationships that are always true, such as capitals of countries, but it would be bad at deducing what happened between Donald J. Trump and Barack Obama in a newspaper article.

Figure 5.3: Word space example 2

To make the algorithm learn only from the document that is being read at the moment, other works have focused on paying attention to the syntactical structure of sentences using Natural Language Processing. For example, if there is a verb between two entities, it becomes the relationship they have. Take the example of “This morning the president of Russia talked to the president of United States of America”; it would be transformed into something like (the president of Russia) – [talked to] –> (the president of United States of America).
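A naive sketch of this purely NLP-based extraction might look as follows; the hand-made verb and preposition lists here are illustrative stand-ins for the part-of-speech tagging or dependency parsing a real system would use:

```python
# Naive relation extraction: if a verb occurs between two known entity
# mentions, the verb (with its preposition) becomes the relationship tag.
# The tiny word lists below stand in for real part-of-speech tagging.
VERBS = {"talked", "met", "visited"}
PREPS = {"to", "with"}

def extract_triple(sentence, entity1, entity2):
    between = sentence[sentence.index(entity1) + len(entity1):
                       sentence.index(entity2)].split()
    if not any(w in VERBS for w in between):
        return None
    relation = " ".join(w for w in between if w in VERBS or w in PREPS)
    return (entity1, relation, entity2)

sentence = ("This morning the president of Russia talked to "
            "the president of United States of America")
triple = extract_triple(sentence, "the president of Russia",
                        "the president of United States of America")
print(triple)
```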

These techniques have also been extended to take adverbial modifiers into account, adding them to the relationship tag. And although these techniques explain what happened in that particular situation, the sentences often get too complicated to extract meaningful, concise and readable tags. When a sentence with the following shape appears, “John does not like going to ski”, the task becomes tougher. First of all, there are two verbs, but even assuming that we are able to group the information in the right way and take “going to ski” as a concept, we would have (John) [does not like] (going to ski), which is not optimal to represent in a semantic graph. A representation such as (John) [dislikes] (ski) or (John) [dislikes] (going to ski) would be more synthetic and preferable in terms of visual usefulness, which motivates a sentence reformulation model.


5.4 Relationships between entities

By adding this dimension we want the graph to show what kind of interaction two entities have in a document. To do this we mimic a language translator, but instead of translating from one language to another, we make it translate to the same language as the input. During inference, we will explore the results and analyze them qualitatively to validate our hypothesis.

For this reason, we implemented a simplified version of the deep learning translator used by [56] in Python and Keras with TensorFlow as backend. Due to the nature of the training and inference models, we realized that our model could not be built as a linear stack of layers, so we used the Keras Model class API, which allows creating models that interact with each other.

The basis of our model (figure 5.4) is an encoder-decoder structure, where each component loads the GloVe word embeddings [42] to convert the word tokens to numerical values. These vectors are then fed into a bidirectional LSTM of 128 * 2 cells to produce the states and outputs of each of the two components. The state of the encoder after prediction is meant to be input to the decoder LSTM as initial state, which will output the values that, through a dense layer of size equal to the total vocabulary, provide the probability ranking for each output class and decide the best choice. Additionally, we made some modifications to this skeleton to obtain two different models, one for training and the other for inference. This is because our ambition is to interact very deeply with the model during prediction, and we did not want it to output the whole sentence at once. Now that the general skeleton of our model has been stated, we will proceed to explain how the data flows through our two kinds of models.

Both the encoder and decoder pipelines work in parallel without exchanging information until the encoder passes its internal state to the decoder. This state is used as initial state for the decoder, and is updated every time a new token is predicted. Both inputs have previously been encoded so that each token is replaced by an identifier using a conventional word2idx dictionary, hence their size is number of files * max sentence length. This information is then transformed in the embedding layer, where each token identifier is changed to its GloVe embedding of 300 dimensions, leaving a new tensor of size number of files * max sentence length * number of dimensions.
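The preprocessing just described can be sketched as follows; the sentences, the word2idx names and the random stand-in for the GloVe table are our own illustration, only the shapes follow the description in the text:

```python
import numpy as np

# Map each token to an integer id (word2idx), pad sentences to a common
# length, then look ids up in an embedding table: the result has shape
# (num_sentences, max_len, 300), as described in the text.
sentences = [["i", "am", "big"], ["je", "suis", "grand"]]
word2idx = {"<pad>": 0}
for sent in sentences:
    for tok in sent:
        word2idx.setdefault(tok, len(word2idx))

max_len = max(len(s) for s in sentences)
ids = np.zeros((len(sentences), max_len), dtype=int)
for i, sent in enumerate(sentences):
    ids[i, :len(sent)] = [word2idx[t] for t in sent]

# Random stand-in for the (vocab_size, 300) GloVe table.
embeddings = np.random.default_rng(0).normal(size=(len(word2idx), 300))
tensor = embeddings[ids]   # (num_sentences, max_len, 300)
print(tensor.shape)
```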


Figure 5.4: Abstractor model diagram.

Since the exchange of information between the two models is often complicated in situations of this nature, to make training converge faster and be more stable we used a technique called teacher forcing [61] [31]. It consists in replacing the output that would have been fed directly from the encoder to the decoder with a ground truth, while still feeding the obtained states of the encoder to the decoder. We use the encoder states but not the output.

Teacher forcing brings a point of reference so that the model is less likely to converge towards wrong paths if the encoder is not performing well enough. It is difficult to explain its meaning in a translator that translates English-to-English, but if it were translating English-to-German, the ground truth would be the German translation delayed by one at each step, because it is the token that the whole model should have predicted in the previous round. Once the information passes through the decoder, the features are used by the dense layer to predict the output, which is then employed to compute the error between the prediction and the labels with a categorical cross-entropy function.
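In code, teacher forcing amounts to shifting the target sequence right by one position and prefixing a start symbol, so that at every step the decoder is conditioned on the ground-truth previous token rather than its own prediction. A minimal sketch, with illustrative token names:

```python
# Teacher forcing: pair each decoder input token (ground truth, shifted
# right and prefixed with <sos>) with the token the decoder must predict.
def teacher_forcing_pairs(target_tokens):
    decoder_input = ["<sos>"] + target_tokens[:-1]
    decoder_target = target_tokens
    return list(zip(decoder_input, decoder_target))

pairs = teacher_forcing_pairs(["je", "suis", "grand", "<eos>"])
print(pairs)
# [('<sos>', 'je'), ('je', 'suis'), ('suis', 'grand'), ('grand', '<eos>')]
```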

The purpose of the model is to predict. We want to give it an initial state and then proceed to infer multiple times with the chosen tokens. We are interested in listing the ranking of probabilities of each class (words), and choosing the one to include in our sequence. In this component, instead of using teacher forcing, we directly use the previous output of the decoder, recurring in this way over both the states and the outputs. In order to accomplish this behavior, we leave the first component as in the training model, but we use the second one in the following way:

1. Encode the input sentence to get the initial decoder state.

2. Repeat until we find an <eos> token:

(a) Predict with the decoder (add the <sos> symbol if it is the first iteration).

(b) Append the predicted word to the sequence and update the internal state.
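The procedure above can be sketched as a greedy decoding loop. The stub decoder below stands in for the trained Keras decoder and simply emits a fixed vocabulary in order; only the control flow (predict, take the argmax, append, repeat until <eos>) reflects the steps described:

```python
# Greedy inference loop: repeatedly predict a probability ranking over
# the vocabulary, append the argmax token, and stop at <eos>.
VOCAB = ["je", "suis", "grand", "<eos>"]

def decode_step(token, state):
    # Stub standing in for the trained decoder: puts all probability
    # mass on one vocabulary entry per call and advances the state.
    probs = [1.0 if i == state else 0.0 for i in range(len(VOCAB))]
    return probs, state + 1

def infer(initial_state, max_len=20):
    sequence, state, token = [], initial_state, "<sos>"
    while len(sequence) < max_len:
        probs, state = decode_step(token, state)
        token = VOCAB[max(range(len(probs)), key=probs.__getitem__)]
        if token == "<eos>":
            break
        sequence.append(token)
    return sequence

print(infer(0))  # ['je', 'suis', 'grand']
```

Choosing a lower-ranked token instead of the argmax at any step is exactly the branching point the hypothesis of section 5.2 exploits.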

Chapter 6

Evaluation of the Results

6.1 Evaluation of Usability

Among the newspaper articles that were to be analysed, two were clearly more related to each other than the others, with the purpose of including them in the questions. These are the 15 headers of the newspaper articles; the last two are the highly related ones:

• House Republicans Fret About Winning Their Health Care Suit.
• Rift Between Officers and Residents as Killings Persist in South Bronx.
• Tyrus Wong, ‘Bambi’ Artist Thwarted by Racial Bias, Dies at 106.
• Among Deaths in 2016, a Heavy Toll in Pop Music.
• Kim Jong-un Says North Korea Is Preparing to Test Long-Range Missile.
• Sick With a Cold, Queen Elizabeth Misses New Year’s Service.
• Taiwan’s President Accuses China of Renewed Intimidation.
• After ‘The Biggest Loser,’ Their Bodies Fought to Regain Weight
• First, a Mixtape. Then a Romance.
• Calling on Angels While Enduring the Trials of Job.
• Weak Federal Powers Could Limit Trump’s Climate-Policy Rollback
• Can Carbon Capture Technology Prosper Under Trump?
• Mar-a-Lago, the Future Winter White House and Home of the Calmer Trump
• Italy still without government after Renzi Resignation
• Italy in limbo as Renzi designation delayed



We created two variations of the questions and distributed them to our volunteers in such a way that a randomly chosen variation was filled in with the help of the artifact, and then the other one without it. These are the two variations:

First variation:

1. Does the named entity Sergio Mattarella appear in the article "Italy still without government after Renzi Resignation"? Correct answer: YES

2. Take the article "Italy still without government after Renzi Resignation"; which of the other 14 articles is the most similar to it? Correct answer: "Italy in limbo as Renzi designation delayed"

3. Did the artifact help you answer questions 1 and 2?

Second variation:

1. Does the named entity Sicily appear in the article "Italy in limbo as Renzi designation delayed"? Correct answer: NO

2. Take the article "Italy in limbo as Renzi designation delayed"; which of the other 14 articles is the most similar to it? Correct answer: "Italy still without government after Renzi Resignation"

3. Did the artifact help you answer questions 1 and 2?

Figure 6.1: Evaluation of the graph generation.

As can be observed in figure 6.1, the time to completion, accuracy and satisfaction are favorable when the subjects make use of the visualized semantic graph.


This brings positive conclusions and shows that, at least for the sample of subjects taken, our combination of techniques results in a visually useful artifact. That said, we are aware that the number of testers was not abundant and might not be as representative as we would have liked, especially for the error rates, where the results have been tighter.

Finally, this table does not consider the accuracy of the Named Entity Recognizer, because both variations of the questions are about entities that we know in advance the classifier predicted properly.

6.2 Evaluation of Entity Relationships

We evaluate each sentence reformulation in terms of grammatical correctness (G.C) and meaning preservation (M.P) in relation to the original one, considering the lowest score as the most undesired one.

We can already reveal that the experiments did not perform as well as we would have liked. After the first round of testing (validation accuracy 0.98), where the model had been trained to output the same input it was given, we observed that the majority of words in the position ranking were valid substitutes for the original word, but that the final reformulation was not good enough.

First of all, the model struggled to deal with new sentences. Additionally, we saw undesired behaviors that tended to follow the structure of the input too closely. For example, we experienced a word count dependency between input and output: the model was adding unrelated words at the end of a meaningful and properly reformulated sentence to make the input and target sequence lengths match. It is reasonable that this happens, since the vector of fixed dimension tries to capture all the information that is useful to obtain a higher accuracy, and learning that the length of a sentence is a particular number can provide a boost in accuracy. As an attempt to overcome this problem and give more weight to the meaning, we tried a denoiser approach to introduce more uncertainty among the solutions in the ranking and make the algorithm pay more attention to the sense rather than the structure.
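One way to implement such a denoiser is to drop a random word from each input sequence while leaving the target sequence intact, so that input and output lengths no longer match trivially. A minimal illustration of this corruption step (our own sketch, not the thesis code):

```python
import random

# Denoising corruption: delete one randomly chosen word from the input
# sequence; the target sequence is kept intact elsewhere in the pipeline.
def corrupt(tokens, rng):
    drop = rng.randrange(len(tokens))
    return tokens[:drop] + tokens[drop + 1:]

rng = random.Random(42)
target = ["the", "man", "is", "running", "downhill"]
noisy_input = corrupt(target, rng)
print(noisy_input, "->", target)
```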

To do so, we took the same dataset as before but randomly deleted a word from each input sequence, keeping the target sequence intact. In the table below you can find the results (validation accuracy 0.95), where G.C stands for the mean score the judges gave to Grammatical Correctness and M.P stands for the mean score the judges gave to Meaning Preservation.

Figure 6.2: Qualitative analysis of the model

As can be observed in figure 6.2, the algorithm is capable of using new words to formulate sentences that are close in meaning to the original one. This task is not trivial, since the algorithm is trained in a purely unsupervised way, using as features the word embeddings, which are also obtained using unsupervised algorithms. This indicates that the direction of this study is, at least, not completely wrong.

On the other hand, we will not be able to directly use this model to generate candidates with the structure Entity1-relationship-Entity2. The intention was to generate many sentences from a given one, and pick those that started with Entity1 as the first word and ended with Entity2, also taking into account that the length of the sentence was not excessive, for example 45 words maximum.
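The intended candidate filter can be expressed in a few lines; the generated sentences below are invented examples, not model output:

```python
# Keep generated reformulations that start with the first entity, end
# with the second, and stay under a length cap (45 words in the text).
def is_candidate(sentence, entity1, entity2, max_words=45):
    words = sentence.split()
    return (sentence.startswith(entity1)
            and sentence.endswith(entity2)
            and len(words) <= max_words)

generated = [
    "Trump met Obama",
    "Obama was criticised by Trump",
    "Trump strongly criticised president Obama",
]
picked = [s for s in generated if is_candidate(s, "Trump", "Obama")]
print(picked)
```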


The model is not stable enough to provide good quality results, because they depend a lot on the specific case the model is dealing with, and the task we are trying to accomplish is too sensitive to meaning preservation mistakes. For example, for the sentence “The man is running downhill” we see that there is at least one sentence ranked better than it that does not preserve enough meaning. Although disappointing, these results have made us understand the limitations of the approach and will help us choose the upcoming paths. First of all, we have understood that this model is able to understand all the words present in the word embedding space that we used: even if the model has not seen a word during the training phase, it will still be able to get its meaning, because we use its embedding vector, which is just coordinates, instead of the tag. This does not happen with the output, where we can arguably put as many words as we want, but they will act as classes, and some examples will be required for each of them in order to be properly chosen.

Furthermore, a large output vocabulary will slow down the training process considerably. Another approach that we have seen other works consider is to use characters instead of words as output tokens, reducing the number of output classes in the dense layer to fewer than one hundred. We will try this method in our future work.

The second observation we have made is that the internal state, or the vector of fixed dimension that is meant to capture the meaning of the input sentence, is essential to achieve good results. This gives us hope when we consider that the sentence embedding techniques we have been using are outdated and far from the state of the art. Better performing sentence embedding algorithms exist, on which we will focus our attention in the future.

Finally, we have also noticed that grammatical correctness is a good metric for sentence reformulation, but not if we want to transform a sentence into a node-relationship-node structure. For this reason, we will also work on finding a more suitable metric in the future.

Chapter 7

Conclusions

This section concludes the thesis with a brief walkthrough of the outcomes. We will evaluate, positively or negatively, each phase of the thesis and state possible future directions to take.

7.1 Usability of the Semantic Graph

We were able to create a system that uses reliable techniques and is scalable due to its native graph database and lack of labelling dependencies. The results of the qualitative analysis performed show that, for entity finding and document similarity, our feature design is promising and the visual map produced by our artifact is visually useful.

We proved that our setting can handle a dataset of at least 50,000 news articles, and that the query system can retrieve 1000 similar documents in less than 0.5 s. We leave the choice of whether this is enough to any reader willing to adopt our approach for their study, as we understand that it will depend on the dimensions of the targeted data. Furthermore, we did not attempt to produce the most accurate version of a semantic graph, and it should not be considered that way.

Finally, our test subjects took considerably less time to find a correct answer using our artifact in the diverse scenarios they were presented with. For this reason, we consider that we demonstrated that the format we presented is viable and better than the rawer plain text format.



7.2 Entity Relationships

In this thesis we have tried to obtain relationship tags relating entities using a novel unsupervised sentence reformulation algorithm. Our qualitative analysis shows that, in some cases, our model builds grammatically correct and meaning-preserving versions of a given sentence, but it is not reliable. It seems that the sentence reformulation task can somehow be accomplished unsupervisedly through deep learning translators, but there is more work to do. Our results have not been good enough to reach any definitive conclusion, but have instead opened new doors to explore.

7.3 Future Work

As previously mentioned, our first aim was to evaluate the format of our artifact in terms of visual usability and scalability. However, our study does not guarantee that our set of technologies represents state-of-the-art techniques. In order to advance towards building a better artifact, we should perform a careful literature study of the fields of topic similarity and named entity recognition. This research should culminate in a consistent choice of technologies to take part in our tool. Additionally, we believe that another discussion should be put on the table: is topic similarity the best way to relate documents? We are aware that some technologies are capable of embedding the meaning of documents and relating them through a deeper bond. This point, and the eventual addition of new features, would require in any case another qualitative analysis ensuring that the visual usability of the current artifact is not deteriorated.

From the entity relationships side, our analysis has given us hints and introduced two main thoughts to take into account in future work. Firstly, auto-encoding sentences in order to generate a sentence embedding does not perform well enough for this task, and a more powerful method to embed the meaning of the input sentence should be tested. Secondly, a large number of output classes, or vocabulary, requires a vast dataset of examples in order to perform well and also takes significantly more time to train; for this reason, we will experiment with predicting characters instead of words in the future.

Bibliography

[1] Renzo Angles and Claudio Gutierrez. “Survey of graph database models”. In: ACM Computing Surveys 40.1 (Feb. 2008), pp. 1–39. ISSN: 0360-0300. DOI: 10.1145/1322432.1322433. URL: http://portal.acm.org/citation.cfm?doid=1322432.1322433 (visited on 10/01/2018).

[2] Christopher M. Bishop. Pattern recognition and machine learning. Information Science and Statistics. New York: Springer, 2006. ISBN: 978-0-387-31073-2.

[3] Christian Bizer et al. “DBpedia - A crystallization point for the Web of Data”. In: Journal of Web Semantics 7.3 (Sept. 2009), pp. 154–165. ISSN: 1570-8268. DOI: 10.1016/j.websem.2009.07.002. URL: https://linkinghub.elsevier.com/retrieve/pii/S1570826809000225 (visited on 03/07/2019).

[4] Michael R. Blaha, William J. Premerlani, and James E. Rumbaugh. “Relational database design using an object-oriented methodology”. In: Communications of the ACM 31.4 (Apr. 1988), pp. 414–427. ISSN: 0001-0782. DOI: 10.1145/42404.42407. URL: http://portal.acm.org/citation.cfm?doid=42404.42407 (visited on 12/21/2018).

[5] David M. Blei. “Latent Dirichlet Allocation”. In: Journal of Machine Learning Research 3 (2003), p. 30.

[6] David M. Blei. “Probabilistic topic models”. In: Communications of the ACM 55.4 (Apr. 2012), p. 77. ISSN: 0001-0782. DOI: 10.1145/2133806.2133826. URL: http://dl.acm.org/citation.cfm?doid=2133806.2133826 (visited on 05/25/2018).

[7] Boolean retrieval. Cambridge University Press, Apr. 1, 2009.



[8] Yen-Chun Chen and Mohit Bansal. “Fast Abstractive Summarization with Reinforce-Selected Sentence Rewriting”. In: arXiv:1805.11080 [cs] (May 2018). URL: http://arxiv.org/abs/1805.11080 (visited on 10/02/2018).

[9] Jinho D. Choi, Joel Tetreault, and Amanda Stent. “It Depends: Dependency Parser Comparison Using A Web-based Evaluation Tool”. In: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Beijing, China: Association for Computational Linguistics, July 2015, pp. 387–396. URL: http://www.aclweb.org/anthology/P15-1038 (visited on 05/25/2018).

[10] Sumit Chopra, Michael Auli, and Alexander M. Rush. “Abstractive Sentence Summarization with Attentive Recurrent Neural Networks”. In: Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. San Diego, California: Association for Computational Linguistics, June 2016, pp. 93–98. URL: http://www.aclweb.org/anthology/N16-1012 (visited on 05/25/2018).

[11] E. F. Codd. “A Relational Model of Data for Large Shared Data Banks”. In: Communications of the ACM 13.6 (1970), p. 11.

[12] Margrethe Ellstrøm. “Benefits of Presenting Information Visually and Guidelines on How to Do It”. p. 14.

[13] Esri Press | Semiology of Graphics | Diagrams, Networks, Maps. URL: https://esripress.esri.com/display/index.cfm?fuseaction=display&websiteID=190&moduleID=0 (visited on 01/07/2019).

[14] Oren Etzioni et al. “Unsupervised named-entity extraction from the Web: An experimental study”. In: Artificial Intelligence 165.1 (June 2005), pp. 91–134. ISSN: 0004-3702. DOI: 10.1016/j.artint.2005.03.001. URL: https://linkinghub.elsevier.com/retrieve/pii/S0004370205000366 (visited on 03/08/2019).

[15] Stephen Few. “Tapping the Power of Visual Perception”. p. 8.


[16] Katja Filippova et al. “Sentence Compression by Deletion with LSTMs”. In: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. Lisbon, Portugal: Association for Computational Linguistics, 2015, pp. 360–368. DOI: 10.18653/v1/D15-1042. URL: http://aclweb.org/anthology/D15-1042 (visited on 10/02/2018).

[17] Ralph Grishman and Beth Sundheim. “Message Understanding Conference-6: A Brief History”. p. 6.

[18] José Guia, Valéria Gonçalves Soares, and Jorge Bernardino. “Graph Databases: Neo4j Analysis”. In: Proceedings of the 19th International Conference on Enterprise Information Systems. Porto, Portugal: SCITEPRESS - Science and Technology Publications, 2017, pp. 351–356. ISBN: 978-989-758-247-9. DOI: 10.5220/0006356003510356. URL: http://www.scitepress.org/DigitalLibrary/Link.aspx?doi=10.5220/0006356003510356 (visited on 11/05/2018).

[19] Jing Han et al. “Survey on NoSQL database”. In: 2011 6th International Conference on Pervasive Computing and Applications. Oct. 2011, pp. 363–366. DOI: 10.1109/ICPCA.2011.6106531.

[20] Hermann Helbig. Knowledge Representation and the Semantics of Natural Language. Cognitive Technologies. Berlin/Heidelberg: Springer-Verlag, 2006. ISBN: 978-3-540-24461-5. DOI: 10.1007/3-540-29966-1. (Visited on 11/01/2018).

[21] Johannes Hoffart et al. “YAGO2: A spatially and temporally enhanced knowledge base from Wikipedia”. In: Artificial Intelligence 194 (Jan. 2013), pp. 28–61. ISSN: 0004-3702. (Visited on 03/06/2019).

[22] Raphael Hoffmann et al. “Knowledge-Based Weak Supervision for Information Extraction of Overlapping Relations”. In: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies. Portland, Oregon, USA: Association for Computational Linguistics, June 2011, pp. 541–550. URL: http://www.aclweb.org/anthology/P11-1055 (visited on 10/01/2018).


[23] J. J. Hopfield. “Neural networks and physical systems with emer-gent collective computational abilities”. en. In: Proceedings of theNational Academy of Sciences 79.8 (Apr. 1982), pp. 2554–2558. ISSN:0027-8424, 1091-6490. DOI: 10.1073/pnas.79.8.2554. URL:http://www.pnas.org/content/79/8/2554 (visited on05/25/2018).

[24] Information retrieval. en. Page Version ID: 876877334. Jan. 2019.URL: https://en.wikipedia.org/w/index.php?title=Information_retrieval&oldid=876877334 (visited on01/08/2019).

[25] “Introducing the Knowledge Graph: things, not strings”. In: ().

[26] J. A. Renshaw et al. “Designing for Visual Influence: an Eye Tracking Study of the Usability of Graphical Management Information”. In: IOS Press, (c) IFIP, 2003, pp. 144–151.

[27] Bernard J. Jansen and Soo Young Rieh. “The seventeen theoretical constructs of information searching and information retrieval”. en. In: Journal of the American Society for Information Science and Technology (2010). ISSN: 1532-2882, 1532-2890. DOI: 10.1002/asi.21358. URL: http://doi.wiley.com/10.1002/asi.21358 (visited on 01/08/2019).

[28] M. I. Jordan. Serial order: a parallel distributed processing approach. Technical report, June 1985–March 1986. en. Tech. rep. AD-A-173989/5/XAB; ICS-8604. California Univ., San Diego, La Jolla (USA). Inst. for Cognitive Science, May 1986. URL: https://www.osti.gov/biblio/6910294 (visited on 05/25/2018).

[29] Kundan Kumar and Siddhant Manocha. “Constructing knowledge graph from unstructured text”. en. In: (), p. 17.

[30] Shantanu Kumar. “A Survey of Deep Learning Methods for Relation Extraction”. In: arXiv:1705.03645 [cs] (May 2017). arXiv: 1705.03645. URL: http://arxiv.org/abs/1705.03645 (visited on 05/25/2018).

[31] Alex M. Lamb, Anirudh Goyal, et al. “Professor Forcing: A New Algorithm for Training Recurrent Networks”. en. In: (), p. 9.


[32] Jill H. Larkin and Herbert A. Simon. “Why a Diagram is (Sometimes) Worth Ten Thousand Words”. en. In: Cognitive Science 11.1 (Jan. 1987), pp. 65–100. ISSN: 0364-0213. DOI: 10.1111/j.1551-6708.1987.tb00863.x. URL: http://doi.wiley.com/10.1111/j.1551-6708.1987.tb00863.x (visited on 01/07/2019).

[33] Neal Leavitt. “Will NoSQL Databases Live Up to Their Promise?” en. In: Computer 43.2 (Feb. 2010), pp. 12–14. ISSN: 0018-9162. DOI: 10.1109/MC.2010.58. URL: http://ieeexplore.ieee.org/document/5410700/ (visited on 10/01/2018).

[34] Jeffrey L. Elman. “Finding Structure in Time”. In: Cognitive Science 14.2 (1990), pp. 179–211. URL: https://www.sciencedirect.com/science/article/pii/036402139090002E (visited on 05/25/2018).

[35] Yankai Lin et al. “Neural Relation Extraction with Selective Attention over Instances”. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Berlin, Germany: Association for Computational Linguistics, Aug. 2016, pp. 2124–2133. URL: http://www.aclweb.org/anthology/P16-1200 (visited on 10/01/2018).

[36] Olena Medelyan et al. “Automatic construction of lexicons, taxonomies, ontologies, and other knowledge structures”. en. In: Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 3.4 (July 2013), pp. 257–279. ISSN: 1942-4787. DOI: 10.1002/widm.1097. URL: http://doi.wiley.com/10.1002/widm.1097 (visited on 05/25/2018).

[37] Mike Mintz et al. “Distant supervision for relation extraction without labeled data”. In: Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP. Suntec, Singapore: Association for Computational Linguistics, Aug. 2009, pp. 1003–1011. URL: http://www.aclweb.org/anthology/P/P09/P09-1113 (visited on 10/01/2018).

[38] Ramesh Nallapati et al. “Abstractive Text Summarization using Sequence-to-sequence RNNs and Beyond”. In: Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning. Berlin, Germany: Association for Computational Linguistics, Aug. 2016, pp. 280–290. URL: http://www.aclweb.org/anthology/K16-1028 (visited on 10/02/2018).

[39] Sergio Oramas, Mohamed Sordo, and Luis Espinosa-Anke. “A Rule-Based Approach to Extracting Relations from Music Tidbits”. In: Proceedings of the 24th International Conference on World Wide Web. WWW ’15 Companion. New York, NY, USA: ACM, 2015, pp. 661–666. ISBN: 978-1-4503-3473-0. DOI: 10.1145/2740908.2741709. URL: http://doi.acm.org/10.1145/2740908.2741709 (visited on 03/06/2019).

[40] E. Parzen. “A survey of time series analysis”. Applied Mathematics and Statistics Labs, Stanford University. In: (1960).

[41] WordNet: An Electronic Lexical Database. en. URL: https://www.researchgate.net/publication/307972585_WordNet_An_Electronic_Lexical_Database (visited on 03/08/2019).

[42] Jeffrey Pennington, Richard Socher, and Christopher Manning. “Glove: Global Vectors for Word Representation”. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). Doha, Qatar: Association for Computational Linguistics, Oct. 2014, pp. 1532–1543. URL: http://www.aclweb.org/anthology/D14-1162 (visited on 06/05/2018).

[43] Ross Quillian. A notation for representing conceptual information: an application to semantics and mechanical English paraphrasing. Santa Monica, Calif.: Systems Development Corp., 1963.

[44] Radim Rehurek and Petr Sojka. “Software Framework for Topic Modelling with Large Corpora”. en. In: Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks. Valletta, Malta: ELRA, May 2010, pp. 45–50. URL: http://is.muni.cz/publication/884893/en.

[45] R. H. Richens. “Preprogramming for mechanical translation”. en. In: Mechanical Translation 3.1 (July 1956), pp. 20–25.

[46] R. H. Richens. “Report on research”. en. In: Mechanical Translation 3.2 (Nov. 1956), pp. 36–37.


[47] Bryan Rink and Sanda Harabagiu. “UTD: Classifying Semantic Relations by Combining Lexical and Semantic Resources”. In: Proceedings of the 5th International Workshop on Semantic Evaluation. Uppsala, Sweden: Association for Computational Linguistics, July 2010, pp. 256–259. URL: http://www.aclweb.org/anthology/S10-1057 (visited on 10/01/2018).

[48] M. Sanderson and W. B. Croft. “The History of Information Retrieval Research”. In: Proceedings of the IEEE 100.Special Centennial Issue (May 2012), pp. 1444–1451. ISSN: 0018-9219, 1558-2256. DOI: 10.1109/JPROC.2012.2189916. URL: http://ieeexplore.ieee.org/document/6182576/ (visited on 01/08/2019).

[49] Tania Schlatter and Deborah Levinson. Visual Usability: Principles and Practices for Designing Digital Applications. en. Newnes, Mar. 2013. ISBN: 978-0-12-401713-9.

[50] Sepp Hochreiter and Jürgen Schmidhuber. “Long Short-Term Memory”. In: Neural Computation 9.8 (1997), pp. 1735–1780. URL: https://www.mitpressjournals.org/doi/abs/10.1162/neco.1997.9.8.1735 (visited on 05/25/2018).

[51] Amit Singhal. “Modern Information Retrieval: A Brief Overview”. en. In: (), p. 9.

[52] Veda C. Storey. “Understanding semantic relationships”. en. In: The VLDB Journal 2.4 (Oct. 1993), pp. 455–488. ISSN: 1066-8888, 0949-877X. DOI: 10.1007/BF01263048. URL: http://link.springer.com/10.1007/BF01263048 (visited on 12/21/2018).

[53] Fabian M. Suchanek, Gjergji Kasneci, and Gerhard Weikum. “YAGO: A Core of Semantic Knowledge Unifying WordNet and Wikipedia”. en. In: Semantic Web (2007), p. 10.

[54] Fabian M. Suchanek, Gjergji Kasneci, and Gerhard Weikum. “YAGO: A Large Ontology from Wikipedia and WordNet”. In: Web Semant. 6.3 (Sept. 2008), pp. 203–217. ISSN: 1570-8268. (Visited on 03/07/2019).

[55] Mihai Surdeanu et al. “Multi-instance Multi-label Learning for Relation Extraction”. In: Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning. Jeju Island, Korea: Association for Computational Linguistics, July 2012, pp. 455–465. URL: http://www.aclweb.org/anthology/D12-1042 (visited on 10/01/2018).

[56] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. “Sequence to Sequence Learning with Neural Networks”. In: arXiv:1409.3215 [cs] (Sept. 2014). arXiv: 1409.3215. URL: http://arxiv.org/abs/1409.3215 (visited on 05/25/2018).

[57] Kristina Toutanova et al. “A Dataset and Evaluation Metrics for Abstractive Compression of Sentences and Short Paragraphs”. In: Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. Austin, Texas: Association for Computational Linguistics, Nov. 2016, pp. 340–350. URL: https://aclweb.org/anthology/D16-1033 (visited on 05/25/2018).

[58] Colin Ware. Information visualization: perception for design. en. 2nd ed. (reprint). The Morgan Kaufmann Series in Interactive Technologies. Amsterdam: Elsevier, 2009. ISBN: 978-1-55860-819-1.

[59] P. J. Werbos. “Backpropagation through time: what it does and how to do it”. In: Proceedings of the IEEE 78.10 (Oct. 1990), pp. 1550–1560. ISSN: 0018-9219. DOI: 10.1109/5.58337.

[60] What is a Knowledge Graph? URL: https://www.authorea.com/users/6341/articles/107281-what-is-a-knowledge-graph/_show_article (visited on 11/01/2018).

[61] R. J. Williams and D. Zipser. “A Learning Algorithm for Continually Running Fully Recurrent Neural Networks”. In: Neural Computation 1.2 (June 1989), pp. 270–280. ISSN: 0899-7667. DOI: 10.1162/neco.1989.1.2.270.

[62] Sam Wiseman and Alexander M. Rush. “Sequence-to-Sequence Learning as Beam-Search Optimization”. In: arXiv:1606.02960 [cs, stat] (June 2016). arXiv: 1606.02960. URL: http://arxiv.org/abs/1606.02960 (visited on 05/25/2018).

[63] Kun Xu et al. “Semantic Relation Classification via Convolutional Neural Networks with Simple Negative Sampling”. In: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. Lisbon, Portugal: Association for Computational Linguistics, Sept. 2015, pp. 536–540. URL: http://aclweb.org/anthology/D15-1062 (visited on 10/01/2018).


[64] Vikas Yadav and Steven Bethard. “A Survey on Recent Advances in Named Entity Recognition from Deep Learning models”. In: Proceedings of the 27th International Conference on Computational Linguistics. Santa Fe, New Mexico, USA: Association for Computational Linguistics, Aug. 2018, pp. 2145–2158. URL: http://www.aclweb.org/anthology/C18-1182 (visited on 10/01/2018).

TRITA-EECS-EX-2019:81
