
LAW XI 2017

11th Linguistic Annotation Workshop

Proceedings of the Workshop

EACL Workshop
April 3, 2017

Valencia, Spain


©2017 The Association for Computational Linguistics

Order copies of this and other ACL proceedings from:

Association for Computational Linguistics (ACL)
209 N. Eighth Street
Stroudsburg, PA 18360
USA
Tel: +1-570-476-8006
Fax: [email protected]

ISBN 978-1-945626-39-5


Preface

The Linguistic Annotation Workshop (The LAW) is organized annually by the Association for Computational Linguistics Special Interest Group for Annotation (ACL SIGANN). It provides a forum to facilitate the exchange and propagation of research results concerned with the annotation, manipulation, and exploitation of corpora; work towards harmonization and interoperability from the perspective of the increasingly large number of tools and frameworks for annotated language resources; and work towards a consensus on all issues crucial to the advancement of the field of corpus annotation.

Last fall, when workshop proposals were solicited, we were asked for a tagline that would telegraph the essence of LAW. Naturally, this prompted a healthy dose of wordsmithing on the part of the organizing committee. The initial suggestions (including “Annotation schemers unite!” and “Don’t just annoTATE—annoGREAT!”) were deemed too corny. Then, playing on the abbreviation “LAW”, various legal puns emerged: “the fine print of linguistic annotation”; “the letter and spirit of linguistic annotation”; “LAW, the authority on linguistic annotation”; “LAW, where language is made to behave”; and so forth. “Annotation schemers on parole” took the punning to the extreme (as students of Saussure will recognize).

In the end, we settled on “LAW: Due process for linguistic annotation”. The concept of “due process” underscores the care required not just to annotate, but to annotate well. To produce a high-quality linguistic resource, diligence is required in all phases: assembling the source data; designing and refining the annotation scheme and guidelines; choosing or developing appropriate annotation software and data formats; applying automatic tools for preprocessing and provisional annotation; selecting and training human annotators; implementing quality control procedures; and documenting and distributing the completed resource.

The 14 papers in this year’s workshop study methods for annotation in the domains of emotion and attitude; conversations and discourse structure; events and causality; semantic roles; and translation (among others). Compared to previous years, syntax plays a much smaller role: indeed, this may be the first ever LAW where no paper has the word “treebank” in the title. (We leave it to the reader to speculate whether this reflects broader trends in the field or has an innocuous explanation.) Also groundbreaking in this year’s LAW will be a best paper award, to be announced at the workshop.

LAW XI would not have been possible without the fine contributions of the authors; the remarkably thorough and thoughtful reviews from the program committee; and the sage guidance of the organizing committee. Two invited talks will add multilingual perspective to the program, Deniz Zeyrek and Johan Bos having generously agreed to share their wisdom. We thank our publicity chair, Marc Verhagen, as well as those who have worked to coordinate the various aspects of EACL workshops, including logistics and publications.

We hope that after reading the collected wisdom in this volume, you will be empowered to give the linguistic annotation process its due.

Nathan Schneider and Nianwen Xue, program co-chairs


Program Co-chairs:

Nathan Schneider, Georgetown University
Nianwen Xue, Brandeis University

Publicity Chair:

Marc Verhagen, Brandeis University

Program Committee:

Adam Meyers, New York University
Alexis Palmer, University of North Texas
Amália Mendes, University of Lisbon
Amir Zeldes, Georgetown University
Andrew Gargett, University of Birmingham
Annemarie Friedrich, Saarland University
Antonio Pareja-Lora, Universidad Complutense de Madrid / ATLAS, UNED
Archna Bhatia, IHMC
Benoît Sagot, Université Paris Diderot
Bonnie Webber, University of Edinburgh
Collin Baker, ICSI Berkeley
Dirk Hovy, University of Copenhagen
Djamé Seddah, Paris-Sorbonne University
Els Lefever, Ghent University
Heike Zinsmeister, University of Hamburg
Ines Rehbein, Leibniz Science Campus, Institute for German Language and Heidelberg University
Joel Tetreault, Grammarly
John S. Y. Lee, City University of Hong Kong
Josef Ruppenhofer, Leibniz Science Campus, Institute for German Language and Heidelberg University
Katrin Tomanek, OpenTable
Kemal Oflazer, Carnegie Mellon University—Qatar
Kilian Evang, University of Groningen
Kim Gerdes, Sorbonne Nouvelle
Kiril Simov, Bulgarian Academy of Sciences
Lori Levin, Carnegie Mellon University
Manfred Stede, University of Potsdam
Marie Candito, Université Paris Diderot
Markus Dickinson, Indiana University
Martha Palmer, University of Colorado at Boulder
Massimo Poesio, University of Essex
Nancy Ide, Vassar College
Nicoletta Calzolari, CNR-ILC
Nizar Habash, New York University Abu Dhabi
Özlem Çetinoğlu, University of Stuttgart
Pablo Faria, State University of Campinas
Ron Artstein, Institute for Creative Technologies, USC
Sandra Kübler, Indiana University
Stefanie Dipper, Ruhr University Bochum
Tomaž Erjavec, Jožef Stefan Institute


Udo Hahn, Jena University
Valia Kordoni, Humboldt University of Berlin

SIGANN Organizing Committee:

Stefanie Dipper, Ruhr University Bochum
Annemarie Friedrich, Saarland University
Chu-Ren Huang, The Hong Kong Polytechnic University
Nancy Ide, Vassar College
Lori Levin, Carnegie Mellon University
Adam Meyers, New York University
Antonio Pareja-Lora, Universidad Complutense de Madrid / ATLAS, UNED
Massimo Poesio, University of Essex
Sameer Pradhan, Boulder Learning, Inc.
Ines Rehbein, Leibniz Science Campus, Institute for German Language and Heidelberg University
Manfred Stede, University of Potsdam
Katrin Tomanek, OpenTable
Fei Xia, University of Washington
Heike Zinsmeister, University of Hamburg


Table of Contents

Readers vs. Writers vs. Texts: Coping with Different Perspectives of Text Understanding in Emotion Annotation
    Sven Buechel and Udo Hahn (page 1)

Finding Good Conversations Online: The Yahoo News Annotated Comments Corpus
    Courtney Napoles, Joel Tetreault, Aasish Pappu, Enrica Rosato and Brian Provenzale (page 13)

Crowdsourcing discourse interpretations: On the influence of context and the reliability of a connective insertion task
    Merel Scholman and Vera Demberg (page 24)

A Code-Switching Corpus of Turkish-German Conversations
    Özlem Çetinoğlu (page 34)

Annotating omission in statement pairs
    Héctor Martínez Alonso, Amaury Delamaire and Benoît Sagot (page 41)

Annotating Speech, Attitude and Perception Reports
    Corien Bary, Leopold Hess, Kees Thijs, Peter Berck and Iris Hendrickx (page 46)

Consistent Classification of Translation Revisions: A Case Study of English-Japanese Student Translations
    Atsushi Fujita, Kikuko Tanabe, Chiho Toyoshima, Mayuka Yamamoto, Kyo Kageura and Anthony Hartley (page 57)

Representation and Interchange of Linguistic Annotation: An In-Depth, Side-by-Side Comparison of Three Designs
    Richard Eckart de Castilho, Nancy Ide, Emanuele Lapponi, Stephan Oepen, Keith Suderman, Erik Velldal and Marc Verhagen (page 67)

TDB 1.1: Extensions on Turkish Discourse Bank
    Deniz Zeyrek and Murathan Kurfalı (page 76)

Two Layers of Annotation for Representing Event Mentions in News Stories
    Maria Pia di Buono, Martin Tutek, Jan Šnajder, Goran Glavaš, Bojana Dalbelo Bašić and Nataša Milić-Frayling (page 82)

Word Similarity Datasets for Indian Languages: Annotation and Baseline Systems
    Syed Sarfaraz Akhtar, Arihant Gupta, Avijit Vajpayee, Arjit Srivastava and Manish Shrivastava (page 91)

The BECauSE Corpus 2.0: Annotating Causality and Overlapping Relations
    Jesse Dunietz, Lori Levin and Jaime Carbonell (page 95)

Catching the Common Cause: Extraction and Annotation of Causal Relations and their Participants
    Ines Rehbein and Josef Ruppenhofer (page 105)

Assessing SRL Frameworks with Automatic Training Data Expansion
    Silvana Hartmann, Éva Mújdricza-Maydt, Ilia Kuznetsov, Iryna Gurevych and Anette Frank (page 115)


Workshop Program

Monday, April 3, 2017

9:30–9:40 Welcome

9:40–10:40 Invited Talk I: The TED-Multilingual Discourse Bank
Deniz Zeyrek

10:40–11:00 Emotion

10:40–11:00 Readers vs. Writers vs. Texts: Coping with Different Perspectives of Text Understanding in Emotion Annotation
Sven Buechel and Udo Hahn

11:00–11:30 Coffee Break

11:30–12:10 Conversations & Discourse

11:30–11:50 Finding Good Conversations Online: The Yahoo News Annotated Comments Corpus
Courtney Napoles, Joel Tetreault, Aasish Pappu, Enrica Rosato and Brian Provenzale

11:50–12:10 Crowdsourcing discourse interpretations: On the influence of context and the reliability of a connective insertion task
Merel Scholman and Vera Demberg


Monday, April 3, 2017 (continued)

12:10–13:00 Posters

A Code-Switching Corpus of Turkish-German Conversations
Özlem Çetinoğlu

Annotating omission in statement pairs
Héctor Martínez Alonso, Amaury Delamaire and Benoît Sagot

Annotating Speech, Attitude and Perception Reports
Corien Bary, Leopold Hess, Kees Thijs, Peter Berck and Iris Hendrickx

Consistent Classification of Translation Revisions: A Case Study of English-Japanese Student Translations
Atsushi Fujita, Kikuko Tanabe, Chiho Toyoshima, Mayuka Yamamoto, Kyo Kageura and Anthony Hartley

Representation and Interchange of Linguistic Annotation: An In-Depth, Side-by-Side Comparison of Three Designs
Richard Eckart de Castilho, Nancy Ide, Emanuele Lapponi, Stephan Oepen, Keith Suderman, Erik Velldal and Marc Verhagen

TDB 1.1: Extensions on Turkish Discourse Bank
Deniz Zeyrek and Murathan Kurfalı

Two Layers of Annotation for Representing Event Mentions in News Stories
Maria Pia di Buono, Martin Tutek, Jan Šnajder, Goran Glavaš, Bojana Dalbelo Bašić and Nataša Milić-Frayling

Word Similarity Datasets for Indian Languages: Annotation and Baseline Systems
Syed Sarfaraz Akhtar, Arihant Gupta, Avijit Vajpayee, Arjit Srivastava and Manish Shrivastava

13:00–14:30 Lunch


Monday, April 3, 2017 (continued)

14:30–15:00 Posters (contd.)

15:00–16:00 Causality & Semantic Roles

15:00–15:20 The BECauSE Corpus 2.0: Annotating Causality and Overlapping Relations
Jesse Dunietz, Lori Levin and Jaime Carbonell

15:20–15:40 Catching the Common Cause: Extraction and Annotation of Causal Relations and their Participants
Ines Rehbein and Josef Ruppenhofer

15:40–16:00 Assessing SRL Frameworks with Automatic Training Data Expansion
Silvana Hartmann, Éva Mújdricza-Maydt, Ilia Kuznetsov, Iryna Gurevych and Anette Frank

16:00–16:30 Coffee Break

16:30–17:30 Invited Talk II: Cross-lingual Semantic Annotation
Johan Bos

17:30–18:00 Discussion and Wrap-up


Proceedings of the 11th Linguistic Annotation Workshop, pages 1–12, Valencia, Spain, April 3, 2017. ©2017 Association for Computational Linguistics

Readers vs. Writers vs. Texts: Coping with Different Perspectives of Text Understanding in Emotion Annotation

Sven Buechel and Udo Hahn
Jena University Language & Information Engineering (JULIE) Lab
Friedrich-Schiller-Universität Jena, Jena, Germany
{sven.buechel,udo.hahn}@uni-jena.de

http://www.julielab.de

Abstract

We here examine how different perspectives of understanding written discourse, like the reader’s, the writer’s or the text’s point of view, affect the quality of emotion annotations. We conducted a series of annotation experiments on two corpora, a popular movie review corpus and a genre- and domain-balanced corpus of standard English. We found statistical evidence that the writer’s perspective yields superior annotation quality overall. However, the quality one perspective yields compared to the other(s) seems to depend on the domain the utterance originates from. Our data further suggest that the popular movie review data set suffers from an atypical bimodal distribution which may decrease model performance when used as a training resource.

1 Introduction

In the past years, the analysis of subjective language has become one of the most popular areas in computational linguistics. In the early days, a simple classification according to the semantic polarity (positiveness, negativeness or neutralness) of a document was predominant, whereas in the meantime, research activities have shifted towards a more sophisticated modeling of sentiments. This includes the extension from only few basic to more varied emotional classes sometimes even assigning real-valued scores (Strapparava and Mihalcea, 2007), the aggregation of multiple aspects of an opinion item into a composite opinion statement for the whole item (Schouten and Frasincar, 2016), and sentiment compositionality on sentence level (Socher et al., 2013).

There is also an increasing awareness of different perspectives one may take to interpret written discourse in the process of text comprehension. A typical distinction which mirrors different points of view is the one between the writer and the reader(s) of a document, as exemplified by utterance (1) below (taken from Katz et al. (2007)):

(1) Italy defeats France in World Cup Final

The emotion of the writer, presumably a professional journalist, can be expected to be more or less neutral, but French or Italian readers may show rather strong (and most likely opposing) emotional reactions when reading this news headline. Consequently, such finer-grained emotional distinctions must also be considered when formulating instructions for an annotation task.

NLP researchers are aware of this multi-perspectival understanding of emotion as contributions often target either one or the other form of emotion expression or mention it as a subject of future work (Mukherjee and Joshi, 2014; Lin and Chen, 2008; Calvo and Mac Kim, 2013). However, contributions aiming at quantifying the effect of altering perspectives are rare (see Section 2). This is especially true for work examining differences in annotation results relative to these perspectives. Although this is obviously a crucial design decision for gold standards for emotion analytics, we know of only one such contribution (Mohammad and Turney, 2013).

In this paper, we systematically examine differences in the quality of emotion annotation regarding different understanding perspectives. Apart from inter-annotator agreement (IAA), we will also look at other quality criteria such as how well the resulting annotations cover the space of possible ratings and check for the representativeness of the rating distribution. We performed a series of annotation experiments with varying instructions and domains of raw text, making this the first study ever to address the impact of text understanding perspective on sentence-level emotion annotation. The results we achieved directly influenced the design and creation of EMOBANK, a novel large-scale gold standard for emotion analysis employing the VAD model for affect representation (Buechel and Hahn, 2017).

2 Related Work

Representation Schemes for Emotion. Due to the multi-disciplinary nature of research on emotions, different representation schemes and models have emerged, hampering comparison across different approaches (Buechel and Hahn, 2016).

In NLP-oriented sentiment and emotion analysis, the most popular representation scheme is based on semantic polarity, the positiveness or negativeness of a word or a sentence, while slightly more sophisticated schemes include a neutral class or even rely on a multi-point polarity scale (Pang and Lee, 2008).

Despite their popularity, these bi- or tri-polar schemes have only loose connections to emotion models currently prevailing in psychology (Sander and Scherer, 2009). From an NLP point of view, those can be broadly subdivided into categorical and dimensional models (Calvo and Mac Kim, 2013). Categorical models assume a small number of distinct emotional classes (such as Anger, Fear or Joy) that all human beings are supposed to share. In NLP, the most popular of those models are the six Basic Emotions by Ekman (1992) or the 8-category scheme of the Wheel of Emotion by Plutchik (1980).

Dimensional models, on the other hand, are centered around the notion of compositionality. They assume that emotional states can be best described as a combination of several fundamental factors, i.e., emotional dimensions. One of the most popular dimensional models is the Valence-Arousal-Dominance (VAD; Bradley and Lang (1994)) model which postulates three orthogonal dimensions, namely Valence (corresponding to the concept of polarity), Arousal (a calm-excited scale) and Dominance (perceived degree of control in a (social) situation); see Figure 1 for an illustration. An even more wide-spread version of this model uses only the Valence and Arousal dimensions, the VA model (Russell, 1980).
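To make the dimensional format concrete, here is a minimal Python sketch of how a VAD annotation can be represented; the class and the concrete numbers are purely illustrative and not taken from any resource described here.

```python
from typing import NamedTuple

class VAD(NamedTuple):
    """A point in Valence-Arousal-Dominance space, here on a 9-point SAM-style scale."""
    valence: float    # negative .. positive (polarity)
    arousal: float    # calm .. excited
    dominance: float  # being controlled .. being in control

# Invented example: a rather positive, calm, moderately dominant annotation.
example = VAD(valence=7.0, arousal=3.5, dominance=6.0)
print(example)
```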

Figure 1: The emotional space spanned by the Valence-Arousal-Dominance model. For illustration, the positions of Ekman’s six Basic Emotions are included (as determined by Russell and Mehrabian (1977)).

For a long time, categorical models were predominant in emotion analysis (Ovesdotter Alm et al., 2005; Strapparava and Mihalcea, 2007; Balahur et al., 2012). Only recently, the VA(D) model found increasing recognition (Paltoglou et al., 2013; Yu et al., 2015; Buechel and Hahn, 2016; Wang et al., 2016). When one of these dimensional models is selected, the task of emotion analysis is most often interpreted as a regression problem (predicting real-valued scores for each of the dimensions), so that another set of metrics must be taken into account than those typically applied in NLP (see Section 3).

Despite its growing popularity, the first large-scale gold standard for dimensional models has only very recently been developed as a follow-up to this contribution (EMOBANK; Buechel and Hahn (2017)). The results we obtained here were crucial for the design of EMOBANK regarding the choice of annotation perspective and the domain the raw data were taken from. However, our results are not only applicable to VA(D) but also to semantic polarity (as Valence is equivalent to this representation format) and may probably generalize over other models of emotion, as well.

Resources and Annotation Methods. For the VAD model, the Self-Assessment Manikin (SAM; Bradley and Lang (1994)) is the most important and to our knowledge only standardized instrument for acquiring emotion ratings based on human self-perception in behavioral psychology (Sander and Scherer, 2009). SAM iconically displays differences in Valence, Arousal and Dominance by a set of anthropomorphic cartoons on a multi-point scale (see Figure 2). Subjects refer to one of these figures per VAD dimension to rate their feelings as a response to a stimulus.

SAM and derivatives therefrom have been used for annotating a wide range of resources for word-emotion associations in psychology (such as Warriner et al. (2013), Stadthagen-Gonzalez et al. (2016), Yao et al. (2016) and Schmidtke et al. (2014)), as well as VAD-annotated corpora in NLP; Preotiuc-Pietro et al. (2016) developed a corpus of 2,895 English Facebook posts (but they rely on only two annotators). Yu et al. (2016) generated a corpus of 2,009 Chinese sentences from different genres of online text.

A possible alternative to SAM is Best-Worst Scaling (BWS; Louviere et al. (2015)), a method only recently introduced into NLP by Kiritchenko and Mohammad (2016). This annotation method exploits the fact that humans are typically more consistent when comparing two items relative to each other with respect to a given scale rather than attributing numerical ratings to the items directly. For example, deciding whether one sentence is more positive than the other is easier than scoring them (say) as 8 and 6 on a 9-point scale.
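As a sketch of how such comparative judgments are typically turned into real-valued scores, the simple counting procedure commonly used with BWS assigns each item the share of times it was chosen as best minus the share of times it was chosen as worst; the tuples and choices below are hypothetical.

```python
from collections import defaultdict

def bws_scores(judgments):
    """judgments: list of (items_in_tuple, best_item, worst_item).
    Score of an item = (#best - #worst) / #appearances."""
    best, worst, seen = defaultdict(int), defaultdict(int), defaultdict(int)
    for items, b, w in judgments:
        for item in items:
            seen[item] += 1
        best[b] += 1
        worst[w] += 1
    return {item: (best[item] - worst[item]) / seen[item] for item in seen}

# Hypothetical 4-tuples of sentences with best/worst choices by annotators.
judgments = [
    (("s1", "s2", "s3", "s4"), "s1", "s4"),
    (("s1", "s2", "s3", "s5"), "s2", "s5"),
    (("s2", "s3", "s4", "s5"), "s2", "s4"),
]
print(bws_scores(judgments))
```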

Although BWS provided promising results for polarity (Kiritchenko and Mohammad, 2016), in this paper, we will use SAM scales. First, with this decision, there are far more studies to compare our results with and, second, the adequacy of BWS for emotional dimensions other than Valence (polarity) remains to be shown.

Perspectival Understanding of Emotions. As stated above, research on the linkage of different annotation perspectives (typically reader vs. writer) is rare. Tang and Chen (2012) examine the relation between the sentiment of microblog posts and the sentiment of their comments (as a proxy for reader emotion) using a positive-negative scheme. They examine which linguistic features are predictive for certain emotion transitions (combinations of an initial writer and a responsive reader emotion). Liu et al. (2013) model the emotion of a news reader jointly with the emotion of a comment writer using a co-training approach. This contribution was followed up by Li et al. (2016) who criticized that important assumptions underlying co-training, viz. sufficiency and independence of the two views, had actually been violated in that work. Instead, they propose a two-view label propagation approach.

Various (knowledge) representation formalisms have been suggested for inferring sentiment or opinions by either readers, writers or both from a piece of text. Reschke and Anand (2011) propose the concept of predicate-specific evaluativity functors which allow for inferring the writers’ evaluation of a proposition based on the evaluation of the arguments of the predicate. Using description logics as modeling language, Klenner (2016) advocates the concept of polarity frames to capture polarity constraints verbs impose on their complements as well as polarity implications they project on them. Deng and Wiebe (2015) employ probabilistic soft logic for entity and event-based opinion inference from the viewpoint of the author or intra-textual entities. Rashkin et al. (2016) introduce connotation frames of (verb) predicates as a comprehensive formalism for modeling various evaluative relationships (being positive, negative or neutral) between the arguments of the predicate as well as the reader’s and author’s view on them. However, up until now, the power of this formalism is still restricted by assuming that author and reader evaluate the arguments in the same way.

In summary, different from our contribution, this line of work tends to focus less on the reader’s perspective and also addresses cognitive evaluations (opinions) rather than instantaneous affective reactions. Although these two concepts are closely related, they are yet different and in fact their relationship has been the subject of a long lasting and still unresolved debate in psychology (Davidson et al., 2003) (e.g., are we afraid of something because we evaluate it as dangerous, or do we evaluate something as dangerous because we are afraid?).

To the best of our knowledge, only Mohammad and Turney (2013) investigated the effects of different perspectives on annotation quality. They conducted an experiment on how to formulate the emotion annotation question and found that asking whether a term is associated with an emotion actually resulted in higher IAA than asking whether a term evokes a certain emotion. Arguably, the former phrasing is rather unrelated to either writer or reader emotion, while the latter clearly targets the emotion of the reader. Their work renders evidence for the importance of the perspective of text comprehension for annotation quality. Note that they focused on word emotion rather than sentence emotion.

Figure 2: The icons of the 9-point Self-Assessment Manikin (SAM). Dimensions (Valence, Arousal and Dominance; VAD) in rows, rating scores (1–9) in columns. Comprised in PXLab, an open source toolkit for psychological experiments (http://irtel.uni-mannheim.de/pxlab/index.html).

3 Methods

Inter-Annotator Agreement. Annotating emotion on numerical scales demands a different statistical tool set than the one that is common in NLP. Well-known metrics such as the κ-coefficient should not be applied for measuring IAA because these are designed for nominal-scaled variables, i.e., ones whose possible values do not have any intrinsic order (such as part-of-speech tags as compared to (say) a multi-point sentiment scale).

In the literature, there is no consensus on what metrics for IAA should be used instead. However, there is a set of repeatedly used approaches which are typically only described verbally. In the following, we offer comprehensive formal definitions and a discussion of them.

First, we describe a leave-one-out framework for IAA where the ratings of an individual annotator are compared against the average of the remaining ratings. Among the first to use it, Strapparava and Mihalcea (2007) described it verbally; it was later taken up by Yu et al. (2016) and Preotiuc-Pietro et al. (2016).

Let $X := (x_{ij}) \in \mathbb{R}^{m \times n}$ be a matrix where $m$ corresponds to the number of items and $n$ corresponds to the number of annotators. $X$ stores all the individual ratings of the $m$ items (organized in rows) and $n$ annotators (organized in columns), so that $x_{ij}$ represents the rating of the $i$-th item by the $j$-th annotator. Since we use the three-dimensional VAD model, in practice, we will have one such matrix for each VAD dimension.

Let $b_j$ denote $(x_{1j}, x_{2j}, \dots, x_{mj})$, the vector composed out of the $j$-th column of the matrix, and let $f : \mathbb{R}^m \times \mathbb{R}^m \to \mathbb{R}$ be an arbitrary metric for comparing two data series. Then $\mathrm{L1O}_f(X)$, the leave-one-out IAA for the rating matrix $X$ relative to the metric $f$, is defined as

$$\mathrm{L1O}_f(X) := \frac{1}{n} \sum_{j=1}^{n} f\bigl(b_j, b_j^{\emptyset}\bigr) \qquad (1)$$

where $b_j^{\emptyset}$ is the average annotation vector of the remaining raters:

$$b_j^{\emptyset} := \frac{1}{n-1} \sum_{k \in \{1,\dots,n\} \setminus \{j\}} b_k \qquad (2)$$

For our experiments, we will use three different metrics specifying the function $f$, namely r, MAE and RMSE.

In general, the Pearson correlation coefficient r captures the linear dependence between two data series, $x = x_1, x_2, \dots, x_m$ and $y = y_1, y_2, \dots, y_m$. In our case, $x, y$ correspond to the rating vector of an individual annotator and the aggregated rating vector of the remaining annotators, respectively.

$$r(x, y) := \frac{\sum_{i=1}^{m} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{m} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{m} (y_i - \bar{y})^2}} \qquad (3)$$

where $\bar{x}, \bar{y}$ denote the mean values of $x$ and $y$, respectively.

When comparing a model’s prediction to the actual data, it can be very important not only to take correlation-based metrics like r into account, but also error-based metrics (Buechel and Hahn, 2016). This is so because a model may produce very accurate predictions in terms of correlation, while at the same time it may perform poorly when taking errors into account (for instance, when the predicted values range in a much smaller interval than the actual values).

To be able to compare a system’s performance more directly to the human ceiling, we also apply error-based metrics within this leave-one-out framework. The most popular ones for emotion analysis are Mean Absolute Error (MAE) and Root Mean Square Error (RMSE) (Paltoglou et al., 2013; Yu et al., 2016; Wang et al., 2016):

$$\mathrm{MAE}(x, y) := \frac{1}{m} \sum_{i=1}^{m} |x_i - y_i| \qquad (4)$$

$$\mathrm{RMSE}(x, y) := \sqrt{\frac{1}{m} \sum_{i=1}^{m} (x_i - y_i)^2} \qquad (5)$$

One of the drawbacks of this framework is that each $x_{ij}$ from matrix $X$ has to be known in order to calculate the IAA. An alternative method was verbally described by Buechel and Hahn (2016) which can be computed out of mean and SD values for each item alone (a format often available from psychological papers). Let $X$ be defined as above and let $a_i$ denote the mean value for the $i$-th item. Then, the Average Annotation Standard Deviation (AASD) is defined as

$$\mathrm{AASD}(X) := \frac{1}{m} \sum_{i=1}^{m} \sqrt{\frac{1}{n} \sum_{j=1}^{n} (x_{ij} - a_i)^2} \qquad (6)$$
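To make the leave-one-out framework concrete, the following Python sketch implements Equations (1)–(6) for a complete m × n rating matrix (one matrix per VAD dimension). The function names and the toy ratings are ours, for illustration only.

```python
import numpy as np


def leave_one_out_iaa(X, metric):
    """Eq. (1): average, over annotators j, of metric(b_j, mean of the other columns)."""
    n = X.shape[1]
    scores = []
    for j in range(n):
        b_j = X[:, j]
        b_rest = np.delete(X, j, axis=1).mean(axis=1)   # Eq. (2)
        scores.append(metric(b_j, b_rest))
    return float(np.mean(scores))


# Metrics plugged in for f, Eqs. (3)-(5).
def r(x, y):
    return np.corrcoef(x, y)[0, 1]

def mae(x, y):
    return float(np.mean(np.abs(x - y)))

def rmse(x, y):
    return float(np.sqrt(np.mean((x - y) ** 2)))


def aasd(X):
    """Eq. (6): mean over items of the per-item (population) standard deviation."""
    a = X.mean(axis=1, keepdims=True)
    return float(np.mean(np.sqrt(np.mean((X - a) ** 2, axis=1))))


if __name__ == "__main__":
    # Toy example: 5 items rated by 4 annotators on a 9-point scale (one VAD dimension).
    X = np.array([[5, 6, 5, 7],
                  [2, 3, 2, 2],
                  [8, 7, 9, 8],
                  [5, 5, 4, 6],
                  [3, 2, 3, 3]], dtype=float)
    for name, f in [("r", r), ("MAE", mae), ("RMSE", rmse)]:
        print(name, round(leave_one_out_iaa(X, f), 3))
    print("AASD", round(aasd(X), 3))
```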

Emotionality. While IAA is indubitably the most important quality criterion for emotion annotation, we argue that there is at least one additional criterion that is not covered by prior research: When using numerical scales (especially ones with a large number of rating points, e.g., the 9-point scales we will use in our experiments), annotations where only neutral ratings are used will be unfavorable for future applications (e.g., training models). Therefore, it is important that the annotations are properly distributed over the full range of the scale. This issue is especially relevant in our setting as different perspectives may very well differ in the extremity of their reactions, as evident from Example (1). We call this desirable property the emotionality (EMO) of the annotations.

For the EMO metric, we first derive aggregated ratings from the individual rating decisions of the annotators, i.e., the ratings that would later form the final ratings of a corpus. For that, we aggregate the rating matrix $X$ from Equation 1 into the vector $y$ consisting of the respective row means $y_i$.

$$y_i := \frac{1}{n} \sum_{j=1}^{n} x_{ij} \qquad (7)$$

$$y := (y_1, \dots, y_i, \dots, y_m) \qquad (8)$$

Since we use the VAD model, we will have one such aggregated vector per VAD dimension. We denote them $y^1$, $y^2$ and $y^3$. Let the matrix $Y = (y_i^j) \in \mathbb{R}^{m \times 3}$ hold the aggregated rating of item $i$ for dimension $j$, and let $N$ denote the neutral rating (e.g., 5 on a 9-point scale). Then,

$$\mathrm{EMO}(Y) := \frac{1}{3 \times m} \sum_{j=1}^{3} \sum_{i=1}^{m} \bigl| y_i^j - N \bigr| \qquad (9)$$
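A corresponding sketch of the emotionality score in Equation (9); the aggregation of Equations (7)–(8) is simply the row mean of each dimension's rating matrix. The array below is a toy example, not data from the study.

```python
import numpy as np

def emo(Y, neutral=5.0):
    """Eq. (9): mean absolute deviation of the aggregated ratings from the neutral
    point, averaged over all m items and the 3 VAD dimensions.
    Y has shape (m, 3); column j holds the aggregated ratings y^j (Eqs. (7)-(8))."""
    return float(np.mean(np.abs(Y - neutral)))

# Toy aggregated ratings for 4 items (columns: Valence, Arousal, Dominance), 9-point scale.
Y = np.array([[6.2, 4.8, 5.5],
              [2.9, 6.1, 4.0],
              [5.0, 5.0, 5.0],
              [7.4, 6.8, 6.1]])
print(round(emo(Y), 3))
```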

Representative Distribution. A closely related quality indicator relates to the representativeness of the resulting rating distribution. For large sets of stimuli (words as well as sentences), numerous studies consistently report that when using SAM-like scales, typically the emotion ratings closely resemble a normal distribution, i.e., the density plot displays a Gaussian, “bell-shaped” curve (see Figure 3b) (Preotiuc-Pietro et al., 2016; Warriner et al., 2013; Stadthagen-Gonzalez et al., 2016; Montefinese et al., 2014).

Intuitively, it makes sense that most of the sentences under annotation should be rather neutral, while only few of them carry extreme emotions. Therefore, we argue that ideally the resulting aggregated ratings for an emotion annotation task should be normally distributed. Otherwise, it must be seriously called into question to what extent the respective data set can be considered representative, possibly reducing the performance of models trained thereon. Consequently, we will also take the density plot of the ratings into account when comparing different set-ups.
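As a sketch of how such a distributional check can be carried out in practice (this is not the authors' code), one can inspect a kernel density estimate of the aggregated ratings and count its interior peaks; more than one peak would flag the kind of bimodality discussed in Section 5. The ratings array is hypothetical.

```python
import numpy as np
from scipy.stats import gaussian_kde

# Hypothetical aggregated Valence ratings on a 9-point scale.
ratings = np.array([2.1, 2.4, 2.8, 3.0, 3.3, 6.9, 7.2, 7.5, 7.8, 5.1, 4.9, 5.4])

kde = gaussian_kde(ratings)                 # smooth density estimate
grid = np.linspace(1, 9, 161)
density = kde(grid)

# Crude bimodality indicator: number of interior local maxima of the density curve.
peaks = sum(1 for i in range(1, len(grid) - 1)
            if density[i] > density[i - 1] and density[i] > density[i + 1])
print("estimated number of density peaks:", peaks)
```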

4 Experiments

Perspectives to Distinguish. Considering Example (1) and our literature review from Section 2, it is obvious that at least the perspective of the writer and the reader of an utterance must be distinguished. Accordingly, writer emotion refers to how someone feels while producing an utterance, whereas reader emotion relates to how someone feels right after reading or hearing this utterance.

Also taking into account the finding by Mohammad and Turney (2013) that agreement among annotators is higher when asking whether a word is associated with an emotion rather than asking whether it evokes this emotion, we propose to extend the common writer-reader framework by a third category, the text perspective, where no actual person is specified as perceiving an emotion. Rather, we assume for this perspective that emotion is an intrinsic property of a sentence (or an alternative linguistic unit like a phrase or the entire text). In the following, we will use the terms WRITER, TEXT and READER to concisely refer to the respective perspectives.

Data Sets. We collected two data sets, a movie review data set highly popular in sentiment analysis and a balanced corpus of general English. In this way, we can estimate the annotation quality resulting from different perspectives, also covering interactions regarding different domains.

The first data set builds upon the corpus originally introduced by Pang and Lee (2005). It consists of about 10k snippets from movie reviews by professional critics collected from the website rottentomatoes.com. The data was further enriched by Socher et al. (2013) who annotated individual nodes in the constituency parse trees according to a 5-point polarity scale, forming the Stanford Sentiment Treebank (SST) which contains 11,855 sentences.

Upon closer inspection, we noticed that the SST data have some encoding issues (e.g., Absorbing character study by AndrÃ© Turpin .) that are not present in the original Rotten Tomatoes data set. So we decided to replicate the creation of the SST data from the original snippets. Furthermore, we filtered out fragmentary sentences automatically (e.g., beginning with comma, dashes, lower case, etc.) as well as manually excluded grammatically incomplete and therefore incomprehensible sentences, e.g., “Or a profit” or “Over age 15?”. Subsequently, a total of 10,987 sentences could be mapped back to SST IDs, forming the basis for our experiments (the SST* collection).

To complement our review language data set, a domain heavily focused on in sentiment analysis (Liu, 2015), for our second data set, we decided to rely on a genre-balanced corpus. We chose the Manually Annotated Sub-Corpus (MASC) of the American National Corpus which is already annotated for various linguistic levels (Ide et al., 2008; Ide et al., 2010). We excluded registers containing spoken, mainly dialogic or non-standard language, e.g., telephone conversations, movie scripts and tweets. To further enrich this collection of raw data for potential emotion analysis applications, we additionally included the corpus of the SEMEVAL-2007 Task 14 focusing on Affective Text (SE07; Strapparava and Mihalcea (2007)), one of the most important data sets in emotion analysis. This data set already bears annotations according to Ekman’s six Basic Emotions (see Section 2), so that the gold standard we ultimately supply already contains a bi-representational part (being annotated according to a dimensional and a categorical model of emotion). Such a double encoding will easily allow for research on automatically mapping between different emotion formats (Buechel and Hahn, 2017).

In order to identify individual sentences in MASC, we relied on the already available annotations. We noticed, however, that a considerable portion of the sentence boundary annotations were duplicates, which we consequently removed (about 5% of the preselected data). This left us with a total of 18,290 sentences from MASC and 1,250 headlines from SE07. Together, they form our second data set, MASC*.

Study Design. We pulled a random sample of 40 sentences from MASC* and SST*, respectively. For each of the three perspectives WRITER, READER and TEXT, we prepared a separate set of instructions. Those instructions are identical, except for the exact phrasing of what a participant should annotate: For WRITER, it was consistently asked “what emotion is expressed by the author”, while TEXT and READER queried “what emotion is conveyed” by and “how do you [the participant of the survey] feel after reading” an individual sentence, respectively.

After reviewing numerous studies from NLP and psychology that had created emotion annotations (e.g., Katz et al. (2007), Strapparava and Mihalcea (2007), Mohammad and Turney (2013), Pinheiro et al. (2016), Warriner et al. (2013)), we largely relied on the instructions used by Bradley and Lang (1999), as this is one of the first and probably the most influential resource from psychology which also greatly influenced work in NLP (Yu et al., 2016; Preotiuc-Pietro et al., 2016).

The instructions were structured as follows. After a general description of the study, the individual scales of SAM were explained to the participants. After that, they performed three trial ratings to familiarize themselves with the usage of the SAM scales before proceeding to judge the actual 40 sentences of interest. The study was implemented as a web survey using Google Forms.[1] The sentences were presented in randomized order, i.e., they were shuffled for each participant individually.

For each of the six resulting surveys (one for each combination of perspective and data set), we recruited 80 participants via the crowdsourcing platform crowdflower.com (CF). The number was chosen so that the differences in IAA may reach statistical significance (according to the leave-one-out evaluation (see Section 3), the number of cases is equal to the number of raters). The surveys went online one after the other, so that as few participants as possible would do more than one of the surveys. The task was available from within the UK, the US, Ireland, Canada, Australia and New Zealand.

We preferred using an external survey over running the task directly via the CF platform because this set-up offers more design options, such as randomization, which is impossible via CF; there, the data is only shuffled once and will then be presented in the same order to each participant. The drawback of this approach is that we cannot rely on CF’s quality control mechanisms.

In order to still be able to exclude malicious raters, we introduced an algorithmic filtering process where we summed up the absolute error the participants made on the trial questions—those were asking them to indicate the VAD values for a verbally described emotion so that the correct answers were evident from the instructions. Raters whose absolute error was above a certain threshold were excluded.
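A minimal sketch of such a filter, under the assumption that each participant's trial answers and the intended ("gold") trial values are available as arrays; the helper name and the toy numbers are ours, and only the threshold of 20 used below is taken from the text that follows.

```python
import numpy as np

def keep_rater(trial_answers, trial_gold, threshold=20.0):
    """Sum the absolute error over all trial ratings (trial sentences x VAD dimensions)
    and keep the rater only if the total stays at or below the threshold."""
    total_error = np.sum(np.abs(np.asarray(trial_answers, dtype=float)
                                - np.asarray(trial_gold, dtype=float)))
    return total_error <= threshold

# Toy trial ratings of one participant (rows: 3 trial items; columns: V, A, D).
answers = [[7, 6, 5], [2, 3, 4], [5, 5, 5]]
gold    = [[8, 7, 5], [1, 2, 4], [5, 5, 5]]
print(keep_rater(answers, gold))   # True: the total absolute error is 4
```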

We set this parameter to 20 (removing about a third of the responses) because this was approximately the ratio of raters which struck us as unreliable when manually inspecting the data while, at the same time, leaving us with a reasonable number of cases to perform statistical analysis. The results of this analysis are presented in the following section. Our two small-sized yet multi-perspectival data sets are publicly available for further analysis.[2]

[1] https://forms.google.com/
[2] https://github.com/JULIELab/EmoBank

Table 1: IAA values obtained on the SST* and the MASC* data sets. r, MAE and RMSE refer to the respective leave-one-out metric (see Section 3).

         Perspective     r     MAE    RMSE    AASD
SST*     WRITER        .53    1.41    1.70    1.73
         TEXT          .41    1.73    2.03    2.10
         READER        .40    1.66    1.96    2.02
MASC*    WRITER        .43    1.56    1.88    1.95
         TEXT          .43    1.49    1.81    1.89
         READER        .36    1.58    1.89    1.98

5 Results

In this section, we compare the three annotation perspectives (WRITER, READER and TEXT) on two different data sets (SST* and MASC*; see Section 4), according to three criteria for annotation quality: IAA, emotionality and distribution (see Section 3).

Inter-Annotator Agreement. Since there is no consensus on a fixed set of metrics for numerical emotion values, we compare IAA according to a range of measures. We use r, MAE and RMSE in the leave-one-out framework, as well as AASD (see Section 3). Table 1 displays our results for the SST* and MASC* data sets. We calculated IAA individually for Valence, Arousal and Dominance. However, to keep the number of comparisons feasible, we restrict ourselves to presenting the respective mean values (averaged over VAD) only. The relative ordering between the VAD dimensions is overall consistent with prior work, in that Valence shows better IAA than Arousal or Dominance (in line with findings from Warriner et al. (2013) and Schmidtke et al. (2014)).

We find that on the review-style SST* data, WRITER displays the best IAA according to all of the four metrics (p < 0.05 using a two-tailed t-test, respectively). Note that MAE, RMSE and AASD are error-based, so the smaller the value, the better the agreement. Concerning the ordering of the remaining perspectives, TEXT is marginally better regarding r, while the results from the three error-based metrics are clearly in favor of READER. Consequently, for IAA on the SST* data set, WRITER yields the best performance, while the order of the other perspectives is not so clear.

Table 2: Emotionality results for the SST* and the MASC* data sets.

         Perspective     EMO
SST*     WRITER         1.09
         TEXT           1.04
         READER         0.91
MASC*    WRITER         0.75
         TEXT           0.70
         READER         0.63

Surprisingly, the results look markedly different on the MASC* data. Here, regarding r, WRITER and TEXT are on par with each other. This contrasts with the results from the error-based metrics. There, TEXT shows the best value, while WRITER, in turn, improves upon READER only by a small margin. Most importantly, for neither of the four metrics do we obtain statistical significance between the best and the second best perspective (p ≥ 0.05 using a two-tailed t-test, respectively). Thus, concerning IAA on the MASC* sample, the results remain rather opaque.
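The significance checks reported above can be reproduced in outline as follows: a sketch assuming that the per-rater leave-one-out scores of two perspectives are available as arrays and that the rater samples are independent (the numbers are hypothetical, not the study's data).

```python
import numpy as np
from scipy.stats import ttest_ind

# Hypothetical per-rater leave-one-out MAE scores for two annotation perspectives.
writer_scores = np.array([1.35, 1.42, 1.50, 1.38, 1.47, 1.40, 1.44, 1.36])
text_scores   = np.array([1.70, 1.65, 1.80, 1.74, 1.69, 1.77, 1.72, 1.66])

t_stat, p_value = ttest_ind(writer_scores, text_scores)   # two-tailed by default
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```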

The fact that, in contrast, the results on SST* are conclusive and statistically significant strongly suggests that the resulting annotation quality is not only dependent on the annotation perspective. Instead, there seem to be considerable dependencies and interactions concerning the domain of the raw data, as well.

Interestingly, on both corpora correlation- and error-based sets of metrics behave inconsistently, which we interpret as a piece of evidence for using both types of metrics in parallel (Buechel and Hahn, 2016; Wang et al., 2016).

Emotionality. For emotionality, we rely on the EMO metric which we defined in Section 3 (see Table 2 for our results). For both corpora, the ordering of the perspectives according to the EMO score is consistent: WRITER yields the most emotional ratings, followed by TEXT and READER (p < 0.05 for each of the pairs using a two-tailed t-test). These unanimous and statistically significant results further underpin the advantage of the TEXT and especially the WRITER perspective, as already suggested by our findings for IAA.

Distribution. We also looked at the distribution of the resulting aggregated annotations relative to the chosen data sets and the three perspectives by examining the respective density plots. In Figure 3, we give six examples of these plots, displaying the Valence density curve for both corpora, SST* and MASC*, as well as the three perspectives. For Arousal and Dominance, the plots show the same characteristics, although slightly less pronounced.

Figure 3: Density plots of the aggregated Valence ratings for the two data sets (a: SST*, b: MASC*) and the three perspectives (Writer, Text, Reader); x-axis: rating, y-axis: probability.

The left density plots, for the SST*, display a bimodal distribution (having two local maxima), whereas the MASC* plots are much closer to a normal distribution. This second shape has been consistently reported by many contributions (see Section 3), whereas we know of no other study reporting a bimodal emotion distribution. This highly atypical finding for SST* might be an artifact of the website from which the original movie review snippets were collected—there, movies are classified into either fresh (positive) or rotten (negative). Consequently, this binary classification scheme might have influenced the selection of snippets from full-scale reviews (as performed by the website) so that these snippets are either clearly positive or negative.

Thus, our findings seriously call into question to what extent the movie review corpus by Pang and Lee (2005)—one of the most popular data sets in sentiment analysis—can be considered representative for review language or general English. Ultimately, this may result in a reduced performance of models trained on such skewed data.

6 Discussion

Overall, we interpret our data as suggesting the WRITER perspective to be superior to TEXT and READER: Considering IAA, it is significantly better on one data set (SST*), while it is on par with or only marginally worse than the best perspective on the other data set (MASC*). Regarding emotionality of the aggregated ratings (EMO), the superiority of this perspective is even more obvious.


The relative order of TEXT and READER, on the other hand, is not so clear. Regarding IAA, TEXT is better on MASC*, while for SST* READER seems to be slightly better (almost on par regarding r but markedly better relative to the error measures we propose here). However, regarding the emotionality of the ratings, TEXT clearly surpasses READER.

Our data suggest that the results of Mohammad and Turney (2013) (the only comparable study so far, though considering emotion on the word rather than the sentence level) may also be true for sentences in most of the cases. However, our data indicate that the validity of their findings may depend on the domain the raw data originate from. They found that phrasing the emotion annotation task relative to the TEXT perspective yields higher IAA than relating to the READER perspective. However, more importantly, our data complement their results by presenting evidence that WRITER seems to be even better than either of the two perspectives they took into account.

7 Conclusion

This contribution presented a series of annotation experiments examining which annotation perspective (WRITER, TEXT or READER) yields the best IAA, also taking domain differences into account—the first study of this kind for sentence-level emotion annotation. We began by reviewing different popular representation schemes for emotion before (formally) defining various metrics for annotation quality—for the VAD scheme we use, this task was so far neglected in the literature.

Our findings strongly suggest that WRITER is overall the superior perspective. However, the exact ordering of the perspectives strongly depends on the domain the data originate from. Our results are thus mainly consistent with, but substantially go beyond, the only comparable study so far (Mohammad and Turney, 2013). Furthermore, our data provide strong evidence that the movie review corpus by Pang and Lee (2005)—one of the most popular ones for sentiment analysis—may not be representative in terms of its rating distribution, potentially casting doubt on the quality of models trained on this data.

For the subsequent creation of EMOBANK, a large-scale VAD gold standard, we took the following decisions in the light of these not fully conclusive outcomes. First, we decided to annotate a 10k-sentence subset of the MASC* corpus, considering the atypical rating distribution in the SST* data set. Furthermore, we decided to annotate the whole corpus bi-perspectivally (according to the WRITER and READER viewpoints), as we hope that the resulting resource helps clarify which factors exactly influence emotion annotation quality. This freely available resource is further described in Buechel and Hahn (2017).

References

A. Balahur, J. M. Hermida, and A. Montoyo. 2012. Building and exploiting EmotiNet, a knowledge base for emotion detection based on the appraisal theory model. IEEE Transactions on Affective Computing, 3(1):88–101.

Margaret M. Bradley and Peter J. Lang. 1994. Measuring emotion: The self-assessment manikin and the semantic differential. Journal of Behavior Therapy and Experimental Psychiatry, 25(1):49–59.

Margaret M. Bradley and Peter J. Lang. 1999. Affective norms for English words (ANEW): Stimuli, instruction manual and affective ratings. Technical Report C-1, The Center for Research in Psychophysiology, University of Florida, Gainesville, FL.

Sven Buechel and Udo Hahn. 2016. Emotion analysis as a regression problem: Dimensional models and their implications on emotion representation and metrical evaluation. In Gal A. Kaminka, Maria Fox, Paolo Bouquet, Eyke Hüllermeier, Virginia Dignum, Frank Dignum, and Frank van Harmelen, editors, ECAI 2016 — Proceedings of the 22nd European Conference on Artificial Intelligence. The Hague, The Netherlands, August 29 – September 2, 2016, pages 1114–1122.

Sven Buechel and Udo Hahn. 2017. EMOBANK: Studying the impact of annotation perspective and representation format on dimensional emotion analysis. In EACL 2017 — Proceedings of the 15th Annual Meeting of the European Chapter of the Association for Computational Linguistics. Valencia, Spain, April 3–7, 2017.

Rafael A. Calvo and Sunghwan Mac Kim. 2013. Emotions in text: Dimensional and categorical models. Computational Intelligence, 29(3):527–543.

Richard J. Davidson, Klaus R. Scherer, and H. Hill Goldsmith. 2003. Handbook of Affective Sciences. Oxford University Press, Oxford, New York, NY.

Lingjia Deng and Janyce Wiebe. 2015. Joint prediction for entity/event-level sentiment analysis using probabilistic soft logic models. In Lluís Màrquez, Chris Callison-Burch, and Jian Su, editors, EMNLP 2015 — Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. Lisbon, Portugal, September 17–21, 2015, pages 179–189.

Paul Ekman. 1992. An argument for basic emotions. Cognition & Emotion, 6(3-4):169–200.

Nancy C. Ide, Collin F. Baker, Christiane Fellbaum, Charles J. Fillmore, and Rebecca J. Passonneau. 2008. MASC: The Manually Annotated Sub-Corpus of American English. In Nicoletta Calzolari, Khalid Choukri, Bente Maegaard, Joseph Mariani, Jan E. J. M. Odijk, Stelios Piperidis, and Daniel Tapias, editors, LREC 2008 — Proceedings of the 6th International Conference on Language Resources and Evaluation. Marrakech, Morocco, May 26 – June 1, 2008, pages 2455–2461.

Nancy C. Ide, Collin F. Baker, Christiane Fellbaum, and Rebecca J. Passonneau. 2010. The Manually Annotated Sub-Corpus: A community resource for and by the people. In Jan Hajič, M. Sandra Carberry, and Stephen Clark, editors, ACL 2010 — Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics. Uppsala, Sweden, July 11–16, 2010, volume 2: Short Papers, pages 68–73.

Phil Katz, Matthew Singleton, and Richard Wicentowski. 2007. SWAT-MP: The SemEval-2007 systems for Task 5 and Task 14. In Eneko Agirre, Lluís Màrquez, and Richard Wicentowski, editors, SemEval-2007 — Proceedings of the 4th International Workshop on Semantic Evaluations @ ACL 2007. Prague, Czech Republic, June 23–24, 2007, pages 308–313.

Svetlana Kiritchenko and Saif M. Mohammad. 2016. Capturing reliable fine-grained sentiment associations by crowdsourcing and best-worst scaling. In Kevin Knight, Ani Nenkova, and Owen Rambow, editors, NAACL-HLT 2016 — Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. San Diego, California, USA, June 12–17, 2016, pages 811–817.

Manfred Klenner. 2016. A model for multi-perspective opinion inferences. In Larry Birnbaum, Octavian Popescu, and Carlo Strapparava, editors, Proceedings of IJCAI 2016 Workshop Natural Language Meets Journalism, New York, USA, July 10, 2016, pages 6–11.

Shoushan Li, Jian Xu, Dong Zhang, and Guodong Zhou. 2016. Two-view label propagation to semi-supervised reader emotion classification. In Nicoletta Calzolari, Yuji Matsumoto, and Rashmi Prasad, editors, COLING 2016 — Proceedings of the 26th International Conference on Computational Linguistics. Osaka, Japan, December 11–16, 2016, volume Technical Papers, pages 2647–2655.

Hsin-Yih Kevin Lin and Hsin-Hsi Chen. 2008. Ranking reader emotions using pairwise loss minimization and emotional distribution regression. In EMNLP 2008 — Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing. Honolulu, Hawaii, October 25–27, 2008, pages 136–144.

Huanhuan Liu, Shoushan Li, Guodong Zhou, Chu-Ren Huang, and Peifeng Li. 2013. Joint modeling of news reader’s and comment writer’s emotions. In Hinrich Schütze, Pascale Fung, and Massimo Poesio, editors, ACL 2013 — Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics. Sofia, Bulgaria, August 4–9, 2013, volume 2: Short Papers, pages 511–515.

Bing Liu. 2015. Sentiment Analysis: Mining Opinions, Sentiments, and Emotions. Cambridge University Press, New York.

Jordan J. Louviere, Terry N. Flynn, and A. A. J. Marley. 2015. Best-Worst Scaling: Theory, Methods and Applications. Cambridge University Press, Cambridge.

Saif M. Mohammad and Peter D. Turney. 2013. Crowdsourcing a word-emotion association lexicon. Computational Intelligence, 29(3):436–465.

Maria Montefinese, Ettore Ambrosini, Beth Fairfield, and Nicola Mammarella. 2014. The adaptation of the Affective Norms for English Words (ANEW) for Italian. Behavior Research Methods, 46(3):887–903.

Subhabrata Mukherjee and Sachindra Joshi. 2014. Author-specific sentiment aggregation for polarity prediction of reviews. In Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Hrafn Loftsson, Bente Maegaard, Joseph Mariani, Asunción Moreno, Jan Odijk, and Stelios Piperidis, editors, LREC 2014 — Proceedings of the 9th International Conference on Language Resources and Evaluation. Reykjavik, Iceland, May 26–31, 2014, pages 3092–3099.

Cecilia Ovesdotter Alm, Dan Roth, and Richard Sproat. 2005. Emotions from text: Machine learning for text-based emotion prediction. In Raymond J. Mooney, Christopher Brew, Lee-Feng Chien, and Katrin Kirchhoff, editors, HLT-EMNLP 2005 — Proceedings of the Human Language Technology Conference & 2005 Conference on Empirical Methods in Natural Language Processing. Vancouver, British Columbia, Canada, 6–8 October 2005, pages 579–586.

G. Paltoglou, M. Theunis, A. Kappas, and M. Thelwall. 2013. Predicting emotional responses to long informal text. IEEE Transactions on Affective Computing, 4(1):106–115.

Bo Pang and Lillian Lee. 2005. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In Kevin Knight, Hwee Tou Ng, and Kemal Oflazer, editors, ACL 2005 — Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics. Ann Arbor, Michigan, USA, June 25–30, 2005, pages 115–124.

Bo Pang and Lillian Lee. 2008. Opinion mining and sentiment analysis. Foundations and Trends in Information Retrieval, 2(1-2):1–135.

Ana P. Pinheiro, Marcelo Dias, João Pedrosa, and Ana P. Soares. 2016. Minho Affective Sentences (MAS): Probing the roles of sex, mood, and empathy in affective ratings of verbal stimuli. Behavior Research Methods. Online First Publication.

Robert Plutchik. 1980. A general psychoevolutionary theory of emotion. Emotion: Theory, Research and Experience, 1(3):3–33.

Daniel Preotiuc-Pietro, Hansen Andrew Schwartz, Gregory Park, Johannes C. Eichstaedt, Margaret L. Kern, Lyle H. Ungar, and Elizabeth P. Shulman. 2016. Modelling valence and arousal in Facebook posts. In Alexandra Balahur, Erik van der Goot, Piek Vossen, and Andres Montoyo, editors, WASSA 2016 — Proceedings of the 7th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis @ NAACL-HLT 2016. San Diego, California, USA, June 16, 2016, pages 9–15.

Hannah Rashkin, Sameer Singh, and Yejin Choi. 2016. Connotation frames: A data-driven investigation. In Antal van den Bosch, Katrin Erk, and Noah A. Smith, editors, ACL 2016 — Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics. Berlin, Germany, August 7–12, 2016, volume 1: Long Papers, pages 311–321.

Kevin Reschke and Pranav Anand. 2011. Extracting contextual evaluativity. In Johan Bos and Stephen Pulman, editors, IWCS 2011 — Proceedings of the 9th International Conference on Computational Semantics. Oxford, UK, January 12–14, 2011, pages 370–374.

James A. Russell and Albert Mehrabian. 1977. Evidence for a three-factor theory of emotions. Journal of Research in Personality, 11(3):273–294.

James A. Russell. 1980. A circumplex model of affect. Journal of Personality and Social Psychology, 39(6):1161–1178.

David Sander and Klaus R. Scherer, editors. 2009. The Oxford Companion to Emotion and the Affective Sciences. Oxford University Press, Oxford, New York.

David S. Schmidtke, Tobias Schröder, Arthur M. Jacobs, and Markus Conrad. 2014. ANGST: Affective norms for German sentiment terms, derived from the affective norms for English words. Behavior Research Methods, 46(4):1108–1118.

Kim Schouten and Flavius Frasincar. 2016. Survey on aspect-level sentiment analysis. IEEE Transactions on Knowledge and Data Engineering, 28(3):813–830.

Richard Socher, Alex Perelygin, Jean Y. Wu, Jason Chuang, Christopher D. Manning, Andrew Y. Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Timothy Baldwin and Anna Korhonen, editors, EMNLP 2013 — Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing. Seattle, Washington, USA, 18-21 October 2013, pages 1631–1642.

Hans Stadthagen-Gonzalez, Constance Imbault, Miguel A. Perez Sanchez, and Marc Brysbaert. 2016. Norms of valence and arousal for 14,031 Spanish words. Behavior Research Methods. Online First Publication.

Carlo Strapparava and Rada Mihalcea. 2007. SemEval-2007 Task 14: Affective text. In Eneko Agirre, Lluís Màrquez, and Richard Wicentowski, editors, SemEval-2007 — Proceedings of the 4th International Workshop on Semantic Evaluations @ ACL 2007. Prague, Czech Republic, June 23-24, 2007, pages 70–74.

Yi-jie Tang and Hsin-Hsi Chen. 2012. Mining sentiment words from microblogs for predicting writer-reader emotion transition. In Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Mehmet Ugur Dogan, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan E. J. M. Odijk, and Stelios Piperidis, editors, LREC 2012 — Proceedings of the 8th International Conference on Language Resources and Evaluation. Istanbul, Turkey, May 21-27, 2012, pages 1226–1229.

Jin Wang, Liang-Chih Yu, K. Robert Lai, and Xuejie Zhang. 2016. Dimensional sentiment analysis using a regional CNN-LSTM model. In Antal van den Bosch, Katrin Erk, and Noah A. Smith, editors, ACL 2016 — Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics. Berlin, Germany, August 7–12, 2016, volume 2: Short Papers, pages 225–230.

Amy Beth Warriner, Victor Kuperman, and Marc Brysbaert. 2013. Norms of valence, arousal, and dominance for 13,915 English lemmas. Behavior Research Methods, 45(4):1191–1207.

Zhao Yao, Jia Wu, Yanyan Zhang, and Zhendong Wang. 2016. Norms of valence, arousal, concreteness, familiarity, imageability, and context availability for 1,100 Chinese words. Behavior Research Methods. Online First Publication.

Liang-Chih Yu, Jin Wang, K. Robert Lai, and Xuejie Zhang. 2015. Predicting valence-arousal ratings of words using a weighted graph method. In Yuji Matsumoto, Chengqing Zong, and Michael Strube, editors, ACL-IJCNLP 2015 — Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics & 7th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing. Beijing, China, July 26–31, 2015, volume 2: Short Papers, pages 788–793.


Liang-Chih Yu, Lung-Hao Lee, Shuai Hao, Jin Wang, Yunchao He, Jun Hu, K. Robert Lai, and Xuejie Zhang. 2016. Building Chinese affective resources in valence-arousal dimensions. In Kevin C. Knight, Ani Nenkova, and Owen Rambow, editors, NAACL-HLT 2016 — Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. San Diego, California, USA, June 12–17, 2016, pages 540–545.


Proceedings of the 11th Linguistic Annotation Workshop, pages 13–23, Valencia, Spain, April 3, 2017. ©2017 Association for Computational Linguistics

Finding Good Conversations Online: The Yahoo News Annotated Comments Corpus

Courtney Napoles,1 Joel Tetreault,2 Enrica Rosato,3 Brian Provenzale,4 and Aasish Pappu3

1 Johns Hopkins University, [email protected]
2 Grammarly, [email protected]
3 Yahoo, {aasishkp|enricar}@yahoo-inc.com
4 Accellion, [email protected]

Abstract

This work presents a dataset and annotation scheme for the new task of identifying "good" conversations that occur online, which we call ERICs: Engaging, Respectful, and/or Informative Conversations. We develop a taxonomy to reflect features of entire threads and individual comments which we believe contribute to identifying ERICs; code a novel dataset of Yahoo News comment threads (2.4k threads and 10k comments) and 1k threads from the Internet Argument Corpus; and analyze the features characteristic of ERICs. This is one of the largest annotated corpora of online human dialogues, with the most detailed set of annotations. It will be valuable for identifying ERICs and other aspects of argumentation, dialogue, and discourse.

1 Introduction

Automatically curating online comments has been a large focus in recent NLP and social media work, as popular news outlets can receive millions of comments on their articles each month (Warzel, 2012). Comment threads often range from vacuous to hateful, but good discussions do occur online, with people expressing different viewpoints and attempting to inform, convince, or better understand the other side; these discussions can get lost among the multitude of unconstructive comments. We hypothesize that identifying and promoting these types of conversations (ERICs) will cultivate a more civil and constructive atmosphere in online communities and potentially encourage participation from more users.

ERICs are characterized by:
• A respectful exchange of ideas, opinions, and/or information in response to a given topic(s).
• Opinions expressed as an attempt to elicit a dialogue or persuade.
• Comments that seek to contribute some new information or perspective on the relevant topic.

ERICs have no single identifying attribute: for instance, an exchange where communicants are in total agreement throughout can be an ERIC, as can an exchange with heated disagreement. Figures 1 and 2 contain two threads that are characterized by continual disagreement, but one is an ERIC and the other is not. We have developed a new coding scheme to label ERICs and identify six dimensions of comments and three dimensions of threads that are frequently seen in the comments section. Many of these labels are for characteristics of online conversations not captured by traditional argumentation or dialogue features. Some of the labels we collect have been annotated in previous work (§2), but this is the first time they are aggregated in a single corpus at the dialogue level.

In this paper, we present the Yahoo News Annotated Comments Corpus (YNACC), which contains 2.4k threads and 10k comments from the comments sections of Yahoo News articles. We additionally collect annotations on 1k threads from the Internet Argument Corpus (Abbott et al., 2016), representing another domain of online debates. We contrast annotations of Yahoo and IAC threads, explore ways in which threads perceived to be ERICs differ in these two venues, and identify some unanticipated characteristics of ERICs.

This is the first exploration of how characteristics of individual comments contribute to the dialogue-level classification of an exchange. YNACC will facilitate research to understand ERICs and other aspects of dialogue. The corpus and annotations will be available at https://github.com/cnap/ynacc.


[Figure 1 shows an annotated Yahoo comment thread responding to the headline "Allergan CEO: Feds blindsided us on Pfizer deal". A legend marks each comment with its agreement (agree, disagree, adjunct opinion), audience (broadcast, reply), persuasiveness, sentiment (mixed, neutral, negative, positive), topic (off-topic with article or conversation), and tone labels.]

Figure 1: An ERIC that is labeled argumentative, positive/respectful, and having continual disagreement.

[Figure 2 shows an annotated Yahoo comment thread responding to the headline "'The Daily Show' Nailed How Islamophobia Hurts the Sikh Community Too", with the same per-comment labels as Figure 1.]

Figure 2: A non-ERIC that is labeled argumentative and off-topic with continual disagreement.

2 Related work

Recent work has focused on the analysis of user-generated text in various online venues, including labeling certain qualities of individual comments, comment pairs, or the roles of individual commenters. The largest and most extensively annotated corpus predating this work is the Internet Argument Corpus (IAC), which contains approximately 480k comments in 16.5k threads from online forums in which users debate contentious issues. The IAC has been coded for topic (3k threads), stance (2k authors), and agreement, sarcasm, and hostility (10k comment pairs) (Abbott et al., 2016; Walker et al., 2012). Comments from online news articles are annotated in the SENSEI corpus, which contains human-authored summaries of 1.8k comments posted on Guardian articles (Barker et al., 2016). Participants described each comment with short, free-form text labels and then wrote a 150–250-word comment summary with these labels. Barker et al. (2016) recognized that comments have diverse qualities, many of which are coded in this work (§3), but did not explicitly collect labels of them.

Previous work surveys how editors and readers perceive the quality of comments posted in online news publications (Diakopoulos and Naaman, 2011) and reviews the criteria professional editors use to curate comments (Diakopoulos, 2015). The latter identifies 15 criteria for curating user-generated responses, from online and radio comments to letters to the editor. Our annotation scheme overlaps with those criteria but also diverges, as we wish for the labels to reflect the nature of all comments posted on online articles instead of just the qualities sought in editorially curated comments. ERICs can take many forms and may not reflect the formal tone or intent that editors in traditional news outlets seek.

Our coding scheme intersects with attributes examined in several different areas of research. Some of the most recent and relevant discourse corpora from online sources related to this work include the following: Concepts related to persuasiveness have been studied, including annotations for "convincingness" in debate forums (Habernal and Gurevych, 2016), influencers in discussions from blogs and Wikipedia (Biran et al., 2012), and user relations as a proxy of persuasion in reddit (Tan et al., 2016; Wei et al., 2016). Politeness was labeled and identified in Stack Exchange and Wikipedia discussions (Danescu-Niculescu-Mizil et al., 2013). Some previous work focused on detecting agreement has considered blog and Wikipedia discussions (Andreas et al., 2012) and debate forums (Skeppstedt et al., 2016). Sarcasm has been identified in a corpus of microblogs identified with the hashtag #sarcasm on Twitter (Gonzalez-Ibanez et al., 2011; Davidov et al., 2010) and in online forums (Oraby et al., 2016). Sentiment has been studied widely, often in the context of reviews (Pang and Lee, 2005), and in the context of user-generated exchanges, positive and negative attitudes have been identified in Usenet discussions (Hassan et al., 2010). Other qualities of user-generated text that are not covered in this work but have been investigated before include metaphor (Jang et al., 2014) and tolerance (Mukherjee et al., 2013) in online discussion threads, "dogmatism" of reddit users (Fast and Horvitz, 2016), and argumentation units in discussions related to technology (Ghosh et al., 2014).

3 Annotation scheme

This section outlines our coding scheme for identifying ERICs, with labels for comment threads and each comment contained therein.

Starting with the annotation categories from the IAC and the curation criteria of Diakopoulos (2015), we adapted these schemes and identified new characteristics that have broad coverage over the 100 comment threads (§4) that we manually examined.

Annotations are made at the thread level and the comment level. Thread-level annotations capture the qualities of a thread on the whole, while comment-level annotations reflect the characteristics of each comment. The labels for each dimension are described below. Only one label per dimension is allowed unless otherwise specified.

3.1 Thread labels

Agreement The overall agreement present in a thread.
• Agreement throughout
• Continual disagreement
• Agreement → disagreement: Begins with agreement which turns into disagreement.
• Disagreement → agreement: Starts with disagreement that converges into agreement.

Constructiveness A binary label indicating when a conversation is an ERIC, or has a clear exchange of ideas, opinions, and/or information done so somewhat respectfully.1
• Constructive
• Not constructive

1 Note that this definition of constructive differs from that of Niculae and Danescu-Niculescu-Mizil (2016), who use the term to denote discrete progress made towards identifying a point on a map. Our definition draws from the more traditional meaning when used in the context of conversations as "intended to be useful or helpful" (Macmillan, 2017).

Type The overall type or tone of the conversation, describing the majority of comments. Two labels can be chosen if conversations exhibit more than one dominant feature.
• Argumentative: Contains a lot of "back and forth" between participants that does not necessarily reach a conclusion.
• Flamewar: Contains insults, users "yelling" at each other, and no information exchanged.
• Off-topic/digression: Comments are completely irrelevant to the article or each other, or the conversation starts on topic but veers off into another direction.
• Personal stories: Participants exchange personal anecdotes.
• Positive/respectful: Consists primarily of comments expressing opinions in a respectful, potentially empathetic manner.
• Snarky/humorous: Participants engage with each other using humor rather than argue or sympathize. May be on- or off-topic.

3.2 Comment labels

Agreement Agreement expressed with explicit phrasing (e.g., I disagree...) or implicitly, such as in Figure 2. Annotating the target of (dis)agreement is left to future work due to the number of other codes the annotators need to attend to. Multiple labels can be chosen per comment, since a comment can express agreement with one statement and disagreement with another.
• Agreement with another commenter
• Disagreement with another commenter
• Adjunct opinion: Contains a perspective that has not yet been articulated in the thread.

Audience The target audience of a comment.
• Reply to specific commenter: Can be explicit (i.e., @HANDLE) or implicit (not directly naming the commenter). The target of a reply is not coded.
• Broadcast message: Is not directed to a specific person(s).

Persuasiveness A binary label indicating whether a comment contains persuasive language or an intent to persuade.
• Persuasive
• Not persuasive

Sentiment The overall sentiment of a comment, considering how the user feels with respect to what information they are trying to convey.
• Negative
• Neutral
• Positive
• Mixed: Contains both positive and negative sentiments.

Tone These qualities describe the overall tone of a comment, and more than one can apply.
• Controversial: Puts forward a strong opinion that will most likely cause disagreement.
• Funny: Expresses or intends to express humor.
• Informative: Contributes new information to the discussion.
• Mean: The purpose of the comment is to be rude, mean, or hateful.
• Sarcastic: Uses sarcasm with either intent to humor (overlaps with Funny) or offend.
• Sympathetic: A warm, friendly comment that expresses positive emotion or sympathy.

Topic The topic addressed in a comment, and more than one label can be chosen. Comments are on-topic unless either Off-topic label is selected.
• Off-topic with the article
• Off-topic with the conversation: A digression from the conversation.
• Personal story: Describes the user's personal experience with the topic.
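To make the shape of these annotations concrete, the following is a minimal sketch, in Python, of how a single thread and its comments could be represented with the dimensions defined above. The class and field names are illustrative assumptions, not the format in which YNACC is released.

    # Illustrative representation of the Section 3 coding scheme (not the YNACC release format).
    from dataclasses import dataclass, field
    from typing import List

    THREAD_TYPES = ["argumentative", "flamewar", "off-topic/digression",
                    "personal stories", "positive/respectful", "snarky/humorous"]

    @dataclass
    class CommentAnnotation:
        agreement: List[str]       # multi-label, e.g. ["disagreement with commenter"]
        audience: str              # "reply to a specific commenter" or "broadcast message"
        persuasive: bool           # binary persuasiveness label
        sentiment: str             # "negative", "neutral", "positive", or "mixed"
        tone: List[str]            # multi-label, e.g. ["controversial", "sarcastic"]
        topic: List[str] = field(default_factory=list)  # empty list = on-topic

    @dataclass
    class ThreadAnnotation:
        agreement: str             # one of the four thread-level agreement labels
        constructive: bool         # True marks an ERIC
        thread_type: List[str]     # at most two entries from THREAD_TYPES
        comments: List[CommentAnnotation] = field(default_factory=list)

Single-label dimensions hold one value and multi-label dimensions (comment agreement, tone, topic, and thread type) hold lists, mirroring the "only one label per dimension unless otherwise specified" rule.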

4 Corpus collection

With the taxonomy described above, we coded comments from two separate domains: online news articles and debate forums.

Threads from online news articles YNACC contains threads from the "comments section" of Yahoo News articles from April 2016.2 Yahoo filters comments containing hate speech (Nobata et al., 2016) and abusive language using a combination of manual review and automatic algorithms, and these comments are not included in our corpus. From the remaining comments, we identified threads, which contain an initial comment and at least one comment posted in reply. Yahoo threads have a single level of embedding, meaning that users can only post replies under a top-level comment. In total, we collected 521,608 comments in 137,620 threads on 4,714 articles on topics including finance, sports, entertainment, and lifestyle. We also collected the following metadata for each comment: unique user ID, time posted, headline, URL, category, and the number of thumbs up and thumbs down received. We included comments posted on a thread regardless of how much time had elapsed since the initial comment because the vast majority of comments were posted in close sequence: 48% in the first hour after an initial comment, 67% within the first three hours, and 92% within the first 24 hours.

We randomly selected 2,300 threads to annotate, oversampling longer threads since the average Yahoo thread has only 3.8 comments.

2 Excluding comments labeled non-English by LangID, a high-accuracy tool for identifying languages in multiple domains (Lui and Baldwin, 2012).
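As a rough illustration of the non-English filtering mentioned in footnote 2 (a sketch of the kind of step involved, not the authors' exact pipeline), the off-the-shelf langid.py tool can be applied to each comment; the comments below are illustrative stand-ins.

    # Keep only comments that langid.py labels as English.
    import langid

    raw_comments = [
        "your solution is?",
        "votre argument ne tient pas debout",   # hypothetical non-English comment
    ]

    english_comments = []
    for text in raw_comments:
        lang, score = langid.classify(text)     # (language code, confidence score)
        if lang == "en":
            english_comments.append(text)

    print(len(english_comments), "comments kept")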


                    IAC             Yahoo
# Threads           1,000           2,400
# Comments          16,555          9,160
Thread length       29 ± 55         4 ± 3
Comment length      568 ± 583       232 ± 538
Trained             0               1,400 threads / 9,160 comments
Untrained           1,000 threads   1,300 threads

Table 1: Description of the threads and comments annotated in this work and the number coded by trained and untrained annotators. Thread length is in comments, comment length in characters.

The distribution of thread lengths is 20% with 2–4 comments, 60% with 5–8, and 20% with 9–15. For a held-out test set, we collected an additional 100 threads from Yahoo articles posted in July 2016, with the same length distribution. Those threads are not included in the analysis performed herein.
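The oversampling can be viewed as stratified sampling over thread-length bins. The sketch below is one way to implement the stated 20/60/20 split; the function and its inputs are assumptions for illustration, not the authors' selection code.

    # Sample threads so that ~20% have 2-4 comments, ~60% have 5-8, and ~20% have 9-15.
    import random

    def sample_threads(threads, n_total=2300, seed=0):
        """threads: list of (thread_id, n_comments) pairs."""
        rng = random.Random(seed)
        strata = {(2, 4): int(0.2 * n_total),
                  (5, 8): int(0.6 * n_total),
                  (9, 15): int(0.2 * n_total)}
        sample = []
        for (lo, hi), k in strata.items():
            pool = [tid for tid, n in threads if lo <= n <= hi]
            sample.extend(rng.sample(pool, min(k, len(pool))))
        return sample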

Threads from web debate forums To test this annotation scheme on a different domain, we also code online debates from the IAC 2.0 (Abbott et al., 2016). IAC threads are categorically different from Yahoo ones in terms of their stated purpose (debate on a particular topic) and length. The mean IAC thread has 29 comments and each comment has 102 tokens, compared to Yahoo threads, which have 4 comments with 51 tokens each. Because significant attention is demanded to code the numerous attributes, we only consider IAC threads with 15 comments or fewer for annotation, but do not limit the comment length. In total, we selected 1,000 IAC threads to annotate, specifically: 474 threads from 4forums that were coded in the IAC, all 23 threads from CreateDebate, and 503 randomly selected threads from ConvinceMe.

4.1 Annotation

The corpus was coded by two groups of annotators: professional trained editors and untrained crowdsourced workers. Three separate annotators coded each thread. The trained editors were paid contractors who received two 30–45-minute training sessions, editorial guidelines (a 2,000-word document), and two sample annotated threads. The training sessions were recorded and available to the annotators during annotation, as were the guidelines. They could communicate their questions to the trainers, who were two authors of this paper, and receive feedback during the training and annotation phases.

Because training is expensive and time consuming, we also collected annotations from untrained coders on Amazon Mechanical Turk (AMT). To simplify the task for AMT, we only solicited thread-level labels, paying $0.75 per thread. For quality assurance, only workers located in the United States or Canada with a minimum HIT acceptance rate of 95% could participate, and the annotations were spot-checked by the authors. Trained annotators coded 1,300 Yahoo threads and the 100-thread test set on the comment and thread levels; untrained annotators coded thread-level labels of 1,300 Yahoo threads (300 of which overlapped with the trained annotations) and 1,000 IAC threads (Table 1). In total, 26 trained and 495 untrained annotators worked on this task.

4.2 Confidence

To assess the difficulty of the task, we also collected a rating for each thread from the trained annotators describing how confident they were with their judgments of each thread and the comments it comprises. Ratings were made on a 5-level Likert scale, with 1 being not at all confident and 5 fully confident. The levels of confidence were high (3.9 ± 0.7), indicating that coders were able to distinguish the thread and comment codes with relative ease.

4.3 Agreement levels

We measure inter-annotator agreement with Krippendorff's alpha (Krippendorff, 2004) and find that, over all labels, there are substantial levels of agreement within groups of annotators: α = 0.79 for trained annotators and α = 0.71 and 0.72 for untrained annotators on the Yahoo and IAC threads, respectively. However, there is lower agreement on thread labels than comment labels (Table 2). The agreement on thread type is 25% higher for the Yahoo threads than the IAC (0.62–0.64 compared to 0.48). The less subjective comment labels (i.e., agreement, audience, and topic) have higher agreement than persuasiveness, sentiment, and tone. While some of the labels have only moderate agreement (0.5 < α < 0.6), we find these results satisfactory, as the agreement levels are higher than those reported for similarly subjective discourse annotation tasks (e.g., Walker et al. (2012)).
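For reference, Krippendorff's alpha for one label category can be computed with NLTK's agreement module, assuming the judgments are available as (annotator, item, label) triples; the triples below are hypothetical.

    # Nominal Krippendorff's alpha over (coder, item, label) triples.
    from nltk.metrics.agreement import AnnotationTask

    triples = [
        ("coder1", "thread42", "constructive"),
        ("coder2", "thread42", "constructive"),
        ("coder3", "thread42", "not constructive"),
        ("coder1", "thread43", "not constructive"),
        ("coder2", "thread43", "not constructive"),
        ("coder3", "thread43", "not constructive"),
    ]

    task = AnnotationTask(data=triples)   # default distance treats labels as nominal
    print("Krippendorff's alpha:", task.alpha())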

To evaluate the untrained annotators, we compare the thread-level annotations made on 300 Yahoo threads by both trained and untrained coders, by taking the majority label per item from each group of annotators and calculating the percent of exact matches (Table 3).


                    Yahoo                  IAC
Thread label        Trained   Untrained    Untrained
  Agreement         0.52      0.50         0.53
  Constructive      0.48      0.52         0.63
  Type              0.62      0.64         0.48
Comment label
  Agreement         0.80      –            –
  Audience          0.74      –            –
  Persuasiveness    0.48      –            –
  Sentiment         0.50      –            –
  Tone              0.63      –            –
  Topic             0.82      –            –

Table 2: Agreement levels found for each label category within trained and untrained groups of annotators, measured by Krippendorff's alpha.

Category             Label              Matches
Constructive class   –                  0.61
Agreement            –                  0.62
Thread type          Overall            0.81
                     Argumentative      0.72
                     Flamewar           0.80
                     Off-topic          0.82
                     Personal stories   0.94
                     Respectful         0.81
                     Snarky/humorous    0.85

Table 3: Percentage of threads (out of 300) for which the majority label of the trained annotators matched that of the untrained annotators.

When classifying the thread type, multiple labels are allowed for each thread, so we convert each option into a boolean and analyze them separately. Only 8% of the threads have no majority constructive label in the trained and/or untrained annotations, and 20% have no majority agreement label. Within both annotation groups, there are majority labels for all of the thread type labels. The category with the lowest agreement is constructive class, with only 61% of the majority labels matching, followed closely by agreement (only 62% matching). A very high percentage of the thread type labels match (81%). The strong agreement levels between trained and untrained annotators suggest that crowdsourcing is reliable for coding thread-level characteristics.
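A sketch of this comparison, assuming the judgments of each group are stored as a dictionary from thread ID to the list of labels assigned by that group (an assumed format, not the released one):

    # Percent of threads whose majority label matches across annotator groups.
    from collections import Counter

    def majority(labels):
        """Return the most frequent label, or None if the top two are tied."""
        ranked = Counter(labels).most_common()
        if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
            return None
        return ranked[0][0]

    def match_rate(trained, untrained):
        shared = [t for t in trained if t in untrained]
        pairs = [(majority(trained[t]), majority(untrained[t])) for t in shared]
        usable = [(a, b) for a, b in pairs if a is not None and b is not None]
        return sum(a == b for a, b in usable) / len(usable) if usable else 0.0

    # hypothetical input: three judgments per thread from each group
    trained = {"t1": ["constructive", "constructive", "not constructive"]}
    untrained = {"t1": ["constructive", "constructive", "constructive"]}
    print(match_rate(trained, untrained))   # -> 1.0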

5 Annotation analysis

To understand what makes a thread constructive, we explore the following research questions:

1. How does the overall thread categorization differ between ERICs and non-ERICs? (§5.1)
2. What types of comments make up ERICs compared to non-ERICs? (§5.2)
3. Are social signals related to whether a thread is an ERIC? (§5.3)

5.1 Thread-level annotations

Before examining what types of threads are ERICs, we first compare the threads coded by different sets of annotators (trained or untrained) and from different sources (IAC or Yahoo). We measure the significance of annotation group for each label with a test of equal proportions for binary categories (constructiveness and each thread type) and a chi-squared test of independence for the agreement label. Overall, annotations by the trained and untrained annotators on Yahoo threads are very similar, with significant differences only between some of the thread type labels (Figure 3). We posit that the discrepancies between the trained and untrained annotators are due to the former's training sessions and ability to communicate with the authors, which could have swayed annotators to make inferences into the coding scheme that were not overtly stated in the instructions.
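The two tests can be run, for instance, with statsmodels and SciPy; the counts below are hypothetical placeholders rather than the corpus figures.

    # Test of equal proportions for a binary label and chi-squared test for the agreement label.
    import numpy as np
    from scipy.stats import chi2_contingency
    from statsmodels.stats.proportion import proportions_ztest

    # threads labeled constructive out of threads annotated, trained vs. untrained (hypothetical)
    count = np.array([160, 150])
    nobs = np.array([300, 300])
    z, p_prop = proportions_ztest(count, nobs)

    # counts of the four agreement labels per annotator group (rows: groups; hypothetical)
    table = np.array([[90, 150, 30, 30],
                      [95, 140, 35, 30]])
    chi2, p_chi2, dof, expected = chi2_contingency(table)

    print(f"equal proportions: p = {p_prop:.3f}; chi-squared: p = {p_chi2:.3f}")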

The differences between Yahoo and IAC threads are more pronounced. The only label for which there is no significant difference is personal stories (p = 0.41, between the IAC and trained Yahoo labels). All other IAC labels are significantly different from both trained and untrained Yahoo labels (p < 0.001). ERICs are more prevalent in the IAC, with 70% of threads labeled constructive, compared to roughly half of Yahoo threads. On the whole, threads from the IAC are more concordant and positive than those from Yahoo: they have more agreement and less disagreement, more than twice as many positive/respectful threads, and fewer than half the flamewars.

For Yahoo threads, there is no significant difference between trained and untrained coders for constructiveness (p = 0.11) and the argumentative thread type (p = 0.07; all other thread types are significant with p < 10⁻⁵). There is no significant difference between the agreement labels, either (p = 1.00). Untrained coders are more likely than trained to classify threads using emotional labels like snarky, flamewar, and positive/respectful, while trained annotators more frequently recognize off-topic threads. These differences should be taken into consideration for evaluating the IAC codes, and for future efforts collecting subjective annotations through crowdsourcing.


[Figure 3 is a bar chart showing, for each thread-level label (the four agreement labels, constructive/not constructive, and the six thread types), the proportion of threads assigned that label by IAC untrained, Yahoo untrained, and Yahoo trained annotators.]

Figure 3: % threads assigned labels by annotator type (trained, untrained) and source (Yahoo, IAC).

We measure the strength of relationships between labels with the phi coefficient (Figure 4). There is a positive association between ERICs and all agreement labels in both Yahoo (trained) and IAC threads, which indicates that concord is not necessary for threads to be constructive. The example in Figure 1 is a constructive thread that is argumentative and contains disagreement. Thread types associated with non-ERICs are flamewars, off-topic digressions, and snarky/humorous exchanges, which is consistent across data sources. The labels from untrained annotators show a stronger correlation between flamewars and not constructive compared to the trained annotators, but the former also identified more flamewars. Some correlations are expected: across all annotating groups, there is a positive correlation between threads labeled with agreement throughout and positive/respectful, and disagreement throughout is correlated with argumentative (Figures 1 and 2) and, to a lesser degree, flamewar.
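For two binary labels, phi is simply the Pearson correlation of their 0/1 indicator vectors, as in this sketch with hypothetical per-thread indicators:

    # Phi coefficient between two binary thread labels.
    import numpy as np

    constructive = np.array([1, 1, 0, 1, 0, 0, 1, 1])
    flamewar     = np.array([0, 0, 1, 0, 1, 1, 0, 0])

    phi = np.corrcoef(constructive, flamewar)[0, 1]
    print(f"phi(constructive, flamewar) = {phi:.2f}")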

The greatest difference between the IAC and Yahoo is the thread types associated with ERICs. In the IAC, the positive/respectful label has a much stronger positive relationship with constructive than in the trained Yahoo labels, but this could be due to the difference between trained and untrained coders. Argumentative has a positive correlation with constructive in the Yahoo threads, but a weak negative relationship is found in the IAC. In both domains, threads characterized as off-topic, snarky, or flamewars are more likely to be non-ERICs. Threads with some level of agreement characterized as positive/respectful are commonly ERICs. A two-tailed z-test shows a significant difference between the number of ERICs and non-ERICs in Yahoo articles in the Arts & Entertainment, Finance, and Lifestyle categories (p < 0.005; Figure 5).

5.2 Comment annotations

We next consider the codes assigned by trained annotators to Yahoo comments (Figure 6). The majority of comments are not persuasive, reply to a previous comment, express disagreement, or have negative sentiment. More than three times as many comments express disagreement as agreement, and comments are labeled negative seven times as frequently as positive. Approximately half of the comments express disagreement or a negative sentiment. Very few comments are funny, positive, sympathetic, or contain a personal story (< 10%). Encouragingly, only 6% of comments are off-topic with the conversation, suggesting that participants are attuned to and respectful of the topic. Only 20% of comments are informative, indicating that participants infrequently introduce new information to complement the article or discussion.

The only strong correlations are between the binary labels, but the moderate correlations provide insight into the Yahoo threads (Figure 7). Some relationships accord with intuition. For instance, participants tend to go off-topic with the article when they are responding to others and not during broadcast messages; comments expressing disagreement with a commenter are frequently posted in a reply to a commenter; comments expressing agreement tend to be sympathetic and have positive sentiment; and mean comments correlate with negative sentiment. Commenters in this domain also express disagreement without particular nastiness, since there is no correlation between disagreement and mean or sarcastic comments. The informative label is moderately correlated with persuasiveness, suggesting that comments containing facts and new information are more convincing than those without.

The correlation between comment and thread labels is shown in Figure 7. Many of the relationships are unsurprising: off-topic threads tend to have off-topic comments; personal-story threads have personal-story comments; thread agreement levels correlate with comment-level agreements; and flamewars are correlated with mean comments.


[Figure 4 consists of three matrices of pairwise phi correlations between the thread labels (the four agreement labels, constructive/not constructive, and the six thread types), one each for the Yahoo trained, Yahoo untrained, and IAC untrained annotations; cell values range from −1 to 1.]

Figure 4: Correlation between thread labels, measured by the phi coefficient (φ).

[Figure 5 is a bar chart of the number of ERIC and non-ERIC threads in each Yahoo article category: Arts & Entertainment, Celebrities, Current Events, Finance, Lifestyle, Politics, Science & Technology, Society & Culture, and Sports.]

Figure 5: Number of threads by article category.

[Figure 6 is a bar chart of the percentage of Yahoo comments assigned each comment-level label, grouped by dimension (agreement, audience, persuasion, sentiment, tone, and topic).]

Figure 6: % Yahoo comments assigned each label.


In accord with our definition of ERICs, constructiveness is positively correlated with informative and persuasive comments and negatively correlated with negative and mean comments. From these correlations one can infer that argumentative threads are generally respectful because, while they are strongly correlated with comments that are controversial or express disagreement or a mixed sentiment, there is no correlation with mean and very little with negative sentiment. More surprising is the positive correlation between controversial comments and constructive threads. Controversial comments are more associated with ERICs, not non-ERICs, even though the controversial label also positively correlates with flamewars, which are negatively correlated with constructiveness. The examples in Figures 1–2 both have controversial comments expressing disagreement, but comments in the second half of the non-ERIC veer off-topic and are not persuasive, whereas the ERIC stays on-topic and persuasive.

5.3 The relationship with social signals

Previous work has taken social signals to be a proxy for thread quality, using some function of the total number of votes received by comments within a thread (e.g., Lee et al. (2014)). Because earlier research has indicated that user votes are not completely independent or objective (Sipos et al., 2014; Danescu-Niculescu-Mizil et al., 2009), we take the use of votes as a proxy for quality skeptically and perform our own exploration of the relationship between social signals and the presence of ERICs. On Yahoo, users reacted to comments with a thumbs up or thumbs down, and we collected the total number of such reactions for each comment in our corpus.


[Figure 7 consists of two matrices of phi correlations: one between pairs of comment labels (agreement, audience, persuasion, sentiment, tone, and topic) and one between comment labels and thread labels (agreement, constructiveness, and type); cell values range from −1 to 1.]

Figure 7: Correlation between comment labels (left) and comment labels and thread labels (right).

First, we compare the total number of thumbs up (TU) and thumbs down (TD) received by comments in a thread to the coded labels to determine whether there are any relationships between social signals and thread qualities. We calculate the relationship between the labels in each category and TU and TD with Pearson's coefficient for the binary labels and a one-way ANOVA for the agreement category. The strongest correlation is between TD and untrained annotators' perception of flamewars (r = 0.21), and there is very weak to no correlation (positive or negative) between the other labels and TU, TD, or TU−TD. There is moderate correlation between TU and TD (r = 0.46), suggesting that threads that elicit reactions tend to receive both thumbs up and down.
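Both statistics are standard; the following is a sketch with SciPy, where the arrays are hypothetical stand-ins for the corpus columns.

    # Pearson's r between a binary thread label and thumbs-up counts,
    # and a one-way ANOVA of thumbs-up counts across the four agreement labels.
    import numpy as np
    from scipy.stats import pearsonr, f_oneway

    thumbs_up = np.array([3, 10, 0, 7, 2, 25, 1, 4])
    is_flamewar = np.array([0, 1, 0, 0, 0, 1, 0, 0])
    r, p = pearsonr(is_flamewar, thumbs_up)

    # thumbs-up totals grouped by the thread-level agreement label (hypothetical groups)
    tu_agreement     = [4, 6, 9]
    tu_disagreement  = [2, 3, 10, 1]
    tu_agr_to_disagr = [5, 7]
    tu_disagr_to_agr = [8, 2]
    f_stat, p_anova = f_oneway(tu_agreement, tu_disagreement, tu_agr_to_disagr, tu_disagr_to_agr)

    print(f"r = {r:.2f} (p = {p:.3f}); ANOVA F = {f_stat:.2f} (p = {p_anova:.3f})")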

The correlation between TU and TD received by each comment is weaker (r = 0.23). Comparing the comment labels to the TU and TD received by each comment also shows little correlation. Comments that reply to a specific commenter are negatively correlated with TU, TD, and TU−TD (r = −0.30, −0.25, and −0.22, respectively). The only other label with a non-negligible correlation is disagreement with a commenter, which negatively correlates with TU (r = −0.21). There is no correlation between social signals and the presence of ERICs or non-ERICs. These results support the findings of previous work and indicate that thumbs up or thumbs down alone (and, presumably, up/down votes) are inappropriate proxies for quality measurements of comments or threads in this domain.

6 Conclusion

We have developed a coding scheme for labeling "good" online conversations (ERICs) and created the Yahoo News Annotated Comments Corpus, a new corpus of 2.4k coded comment threads posted in response to Yahoo News articles. Additionally, we have annotated 1k debate threads from the IAC. These annotations reflect several different characteristics of comments and threads, and we have explored their relationships with each other. ERICs are characterized by argumentative, respectful exchanges containing persuasive, informative, and/or sympathetic comments. They tend to stay on topic with the original article and not to contain funny, mean, or sarcastic comments. We found differences between the distributions of annotations made by trained and untrained annotators, but high levels of agreement within each group, suggesting that crowdsourcing annotations for this task is reliable. YNACC will be a valuable resource for researchers in multiple areas of discourse analysis.

Acknowledgments

We are grateful to Danielle Lottridge, Smaranda Muresan, and Amanda Stent for their valuable input. We also wish to thank the anonymous reviewers for their feedback.


References

Rob Abbott, Brian Ecker, Pranav Anand, and Marilyn Walker. 2016. Internet Argument Corpus 2.0: An SQL schema for dialogic social media and the corpora to go with it. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), pages 4445–4452, Paris, France, May. European Language Resources Association (ELRA).

Jacob Andreas, Sara Rosenthal, and Kathleen McKeown. 2012. Annotating agreement and disagreement in threaded discussion. In Proceedings of the Eight International Conference on Language Resources and Evaluation (LREC'12), pages 818–822, Istanbul, Turkey, May. European Language Resources Association (ELRA).

Emma Barker, Monica Lestari Paramita, Ahmet Aker, Emina Kurtic, Mark Hepple, and Robert Gaizauskas. 2016. The SENSEI annotated corpus: Human summaries of reader comment conversations in on-line news. In Proceedings of the 17th Annual Meeting of the Special Interest Group on Discourse and Dialogue, pages 42–52, Los Angeles, September. Association for Computational Linguistics.

Or Biran, Sara Rosenthal, Jacob Andreas, Kathleen McKeown, and Owen Rambow. 2012. Detecting influencers in written online conversations. In Proceedings of the Second Workshop on Language in Social Media, pages 37–45, Montreal, Canada, June. Association for Computational Linguistics.

Cristian Danescu-Niculescu-Mizil, Gueorgi Kossinets, Jon Kleinberg, and Lillian Lee. 2009. How opinions are received by online communities: A case study on amazon.com helpfulness votes. In Proceedings of the 18th International Conference on World Wide Web, WWW '09, pages 141–150, New York. ACM.

Cristian Danescu-Niculescu-Mizil, Moritz Sudhof, Dan Jurafsky, Jure Leskovec, and Christopher Potts. 2013. A computational approach to politeness with application to social factors. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 250–259, Sofia, Bulgaria, August. Association for Computational Linguistics.

Dmitry Davidov, Oren Tsur, and Ari Rappoport. 2010. Semi-supervised recognition of sarcasm in Twitter and Amazon. In Proceedings of the Fourteenth Conference on Computational Natural Language Learning, pages 107–116, Uppsala, Sweden, July. Association for Computational Linguistics.

Nicholas Diakopoulos and Mor Naaman. 2011. Towards quality discourse in online news comments. In Proceedings of the ACM 2011 Conference on Computer Supported Cooperative Work, CSCW '11, pages 133–142, New York. ACM.

Nicholas Diakopoulos. 2015. Picking the NYT picks: Editorial criteria and automation in the curation of online news comments. ISOJ Journal, 5(1):147–166.

Ethan Fast and Eric Horvitz. 2016. Identifying dogmatism in social media: Signals and models. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 690–699, Austin, Texas, November. Association for Computational Linguistics.

Debanjan Ghosh, Smaranda Muresan, Nina Wacholder, Mark Aakhus, and Matthew Mitsui. 2014. Analyzing argumentative discourse units in online interactions. In Proceedings of the First Workshop on Argumentation Mining, pages 39–48, Baltimore, Maryland, June. Association for Computational Linguistics.

Roberto Gonzalez-Ibanez, Smaranda Muresan, and Nina Wacholder. 2011. Identifying sarcasm in Twitter: A closer look. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: Short Papers-Volume 2, pages 581–586. Association for Computational Linguistics.

Ivan Habernal and Iryna Gurevych. 2016. Which argument is more convincing? Analyzing and predicting convincingness of web arguments using bidirectional LSTM. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1589–1599, Berlin, Germany, August. Association for Computational Linguistics.

Ahmed Hassan, Vahed Qazvinian, and Dragomir Radev. 2010. What's with the attitude? Identifying sentences with attitude in online discussions. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 1245–1255, Cambridge, MA, October. Association for Computational Linguistics.

Hyeju Jang, Mario Piergallini, Miaomiao Wen, and Carolyn Rose. 2014. Conversational metaphors in use: Exploring the contrast between technical and everyday notions of metaphor. In Proceedings of the Second Workshop on Metaphor in NLP, pages 1–10, Baltimore, MD, June. Association for Computational Linguistics.

Klaus Krippendorff. 2004. Content analysis: An introduction to its methodology. Sage Publications, Thousand Oaks, CA, 2nd edition.

Jung-Tae Lee, Min-Chul Yang, and Hae-Chang Rim. 2014. Discovering high-quality threaded discussions in online forums. Journal of Computer Science and Technology, 29(3):519–531.

Macmillan Publishers Ltd. 2009. The online English dictionary: Definition of constructive. http://www.macmillandictionary.com/dictionary/american/constructive. Accessed January 20, 2017.


Marco Lui and Timothy Baldwin. 2012. langid.py: An off-the-shelf language identification tool. In Proceedings of the ACL 2012 System Demonstrations, pages 25–30, Jeju Island, Korea, July. Association for Computational Linguistics.

Arjun Mukherjee, Vivek Venkataraman, Bing Liu, and Sharon Meraz. 2013. Public dialogue: Analysis of tolerance in online discussions. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1680–1690, Sofia, Bulgaria, August. Association for Computational Linguistics.

Vlad Niculae and Cristian Danescu-Niculescu-Mizil. 2016. Conversational markers of constructive discussions. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 568–578, San Diego, California, June. Association for Computational Linguistics.

Chikashi Nobata, Joel Tetreault, Achint Thomas, Yashar Mehdad, and Yi Chang. 2016. Abusive language detection in online user content. In Proceedings of the 25th International Conference on World Wide Web, pages 145–153. International World Wide Web Conferences Steering Committee.

Shereen Oraby, Vrindavan Harrison, Lena Reed, Ernesto Hernandez, Ellen Riloff, and Marilyn Walker. 2016. Creating and characterizing a diverse corpus of sarcasm in dialogue. In Proceedings of the 17th Annual Meeting of the Special Interest Group on Discourse and Dialogue, pages 31–41, Los Angeles, September. Association for Computational Linguistics.

Bo Pang and Lillian Lee. 2005. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics, pages 115–124. Association for Computational Linguistics.

Ruben Sipos, Arpita Ghosh, and Thorsten Joachims. 2014. Was this review helpful to you?: It depends! Context and voting patterns in online content. In Proceedings of the 23rd International Conference on World Wide Web, pages 337–348. ACM.

Maria Skeppstedt, Magnus Kerren, Carita Sahlgren, and Andreas Paradis. 2016. Unshared task: (Dis)agreement in online debates. In 3rd Workshop on Argument Mining (ArgMining'16), Berlin, Germany, August 7-12, 2016, pages 154–159. Association for Computational Linguistics.

Chenhao Tan, Vlad Niculae, Cristian Danescu-Niculescu-Mizil, and Lillian Lee. 2016. Winning arguments: Interaction dynamics and persuasion strategies in good-faith online discussions. In Proceedings of the 25th International Conference on World Wide Web, pages 613–624. International World Wide Web Conferences Steering Committee.

Marilyn Walker, Jean Fox Tree, Pranav Anand, Rob Abbott, and Joseph King. 2012. A corpus for research on deliberation and debate. In Proceedings of the Eight International Conference on Language Resources and Evaluation (LREC'12), Istanbul, Turkey, May. European Language Resources Association (ELRA).

Charlie Warzel. 2012. Everything in moderation. Adweek, June 18. http://www.adweek.com/digital/everything-moderation-141163/. Accessed February 20, 2017.

Zhongyu Wei, Yang Liu, and Yi Li. 2016. Is this post persuasive? Ranking argumentative comments in online forum. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 195–200, Berlin, Germany, August. Association for Computational Linguistics.


Proceedings of the 11th Linguistic Annotation Workshop, pages 24–33, Valencia, Spain, April 3, 2017. ©2017 Association for Computational Linguistics

Crowdsourcing discourse interpretations: On the influence of context and the reliability of a connective insertion task

Merel C.J. Scholman
Language Science and Technology
Saarland University
Saarbrücken, Germany
{m.c.j.scholman,vera}@coli.uni-saarland.de

Vera Demberg
Computer Science
Saarland University
Saarbrücken, Germany

Abstract

Traditional discourse annotation tasks are considered costly and time-consuming, and the reliability and validity of these tasks is in question. In this paper, we investigate whether crowdsourcing can be used to obtain reliable discourse relation annotations. We also examine the influence of context on the reliability of the data. The results of the crowdsourced connective insertion task showed that the majority of the inserted connectives converged with the original label. Further, the distribution of inserted connectives revealed that multiple senses can often be inferred for a single relation. Regarding the presence of context, the results show no significant difference in distributions of insertions between conditions overall. However, a by-item comparison revealed several characteristics of segments that determine whether the presence of context makes a difference in annotations. The findings discussed in this paper can be taken as preliminary evidence that crowdsourcing can be used as a valuable method to obtain insights into the sense(s) of relations.

1 Introduction

In order to study discourse coherence, researchers need large amounts of discourse-annotated data, and these data need to be reliable and valid. However, manually coding coherence relations is a difficult task that is prone to individual variation (Spooren and Degand, 2010). Because the task requires a large amount of time and resources, researchers try to find a balance between obtaining reliable data and sparing resources. This has led to the standard practice of using two trained, expert annotators to code data.

Not only is this procedure time-consuming and therefore costly, it also raises questions regarding the reliability and validity of the data. Trained, expert annotators may agree because they share implicit knowledge and know the purpose of the research well, rather than because they are carefully following instructions (Artstein and Poesio, 2008; Riezler, 2014). Krippendorff (2004) therefore notes that the more annotators participate in the process and the less expert they are, the more likely they can ensure the reliability of the data.

In this paper, we investigate how useful crowdsourcing can be in obtaining discourse annotations. We present an experiment in which subjects were asked to insert ("drag and drop") a connecting phrase from a pre-defined list between the two segments of coherence relations. By employing non-trained, non-expert (also referred to as naïve) subjects to code the data, large amounts of data can be coded in a short period of time, and it is ensured that the obtained annotations are independent and do not rely on implicit expert knowledge. Instead, the task allows us to tap into the naïve subjects' interpretations directly.

However, crowdsourcing has rarely been used to obtain discourse relation annotations. This could be due to the nature of crowdsourcing: Typically, crowdsourced tasks are small and intuitive tasks. Under these conditions, crowdsourced annotators – unlike expert annotators or in-lab naïve annotators – cannot be asked to code according to a specific framework because this would require them to study manuals. Therefore, rather than asking for relation labels, we ask them to insert a connective from a predefined list. In order to ensure that these connectives are not ambiguous (Asr and Demberg, 2013), we chose connectives based on a classification of connective substitutability by Knott and Dale (1994). We investigate how reliable the obtained annotations are by comparing them to expert annotations from two existing corpora.



Moreover, we examine the effect of the design of the task on the reliability of the data. Researchers agree that discourse relations should be supplied with linguistic context in order to be annotated reliably, but there are no clear guidelines for how much context is needed. The current contribution experimentally examines the influence of context on the interpretation of a discourse relation, with a specific focus on whether there is an interaction between characteristics of the segment and the presence of context.

The contributions of this paper include the following:

• We evaluate a new crowdsourcing method to elicit discourse interpretations and obtain discourse annotations, showing that such a task has the potential to function as a reliable alternative to traditional annotation methods.

• The distributions of inserted connectives per item reveal that, often, annotators converged on two or three dominant interpretations, rather than one single interpretation. We also found that this distribution is replicable with high reliability. This is evidence that relations can have multiple senses.

• We show that the presence of context led to higher annotator agreement when (i) the first segment of a relation refers to an entity or event in the context, or introduces important background information; (ii) the first segment consists of a deranked subordinate clause attaching to the context; or (iii) the context sentence following the relation expands on the second argument of the relation. This knowledge can be used in the design of discourse relation annotation tasks.

2 Background

In recent years, several researchers have set out to investigate whether naïve coders can also be employed to annotate data. Working with such annotators has the practical advantage that they are easier to come by, and it is therefore also easier to employ a larger number of annotators, which decreases the effect of annotator bias (Artstein and Poesio, 2005; Artstein and Poesio, 2008). Studies employing naïve annotators have found high agreement between these annotators and expert annotators for Natural Language tasks (e.g., Snow et al., 2008). Classifying coherence relations, however, is considered to be a different and especially difficult type of task due to the complex semantic interpretations of relations and the fact that textual coherence does not reside in the verbal material, but rather in the readers' mental representation (Spooren and Degand, 2010). Nevertheless, naïve annotators have recently also been employed successfully in coherence relation annotation tasks (Kawahara et al., 2014; Scholman et al., 2016) and connective insertion tasks (Rohde et al., 2015, 2016) similar to the one reported in this paper.

Rohde et al. (2016) showed that readers can infer an additional reading for a discourse relation connected by an adverbial. By obtaining many observations for a single fragment rather than only two, they were able to identify patterns of co-occurring relations; for example, readers can often infer an additional causal reading for a relation marked by otherwise. These results highlight a problem with double-coded data: Without a substantial number of observations, differences in annotations might be written off as annotator error or disagreement. In reality, there might be multiple interpretations for a relation, without there being a single correct interpretation. The connective insertion method used by Rohde et al. (2016) is therefore more sensitive to the possibility that relations can have multiple readings.

The current study uses a similar method as Rohde et al. (2016), but applies it to answer a different type of question. Rohde et al. (2016) investigated whether readers can infer an additional sense for a pair of sentences already marked by an adverbial. They did not have any expectations on whether there was a correct answer; rather, they set out to identify specific patterns of connective insertions. In the current study, we investigate whether crowdsourcing can be used to obtain annotated data that is similar in quality to data annotated by experts. Crucially, we assume that there is a correct answer, namely the original label that was assigned by expert annotators. We therefore compare the results from the current study to the original annotations in order to evaluate the usability of the connective insertion method for discourse annotation.

The design of the current study also differs from other connective-based annotation approaches such as Rohde et al. (2016) and the Penn Discourse Treebank (PDTB, Prasad et al., 2008) in that the connectives were selected to unambiguously mark a specific type of relation. Certain connectives are known to mark different types of relations, such as but, which can mark CONTRAST, CONCESSION and ALTERNATIVE relations. In the current study, we excluded such ambiguous connectives in order to be able to derive relation types from the insertions. For example, the connecting phrase AS AN ILLUSTRATION is taken to be a dominant marker for INSTANTIATION relations. The procedure for selecting phrases will be explained in Section 3.

Given the limited amount of research into using naïve subjects for discourse relation annotation, it is important to investigate how this task should be designed. One aspect of this design is the inclusion of context. The benefits of context are widely acknowledged in the field of discourse analysis. Context is necessary to ground the discourse being constructed (Cornish, 2009), and the interpretation of any sentence other than the first in a discourse is therefore constrained by the preceding context (Song, 2010). This preceding context has significant effects on essential parts of discourse annotation, such as determining the rhetorical role each sentence plays in the discourse, and the temporal relations between the events described (Lascarides et al., 1992; Spooren and Degand, 2010). The knowledge of context is therefore assumed to be a requirement for discourse analysis.

Although researchers agree that relations should be supplied with linguistic context in order to be annotated reliably, there are no clear guidelines for how much context is needed. As a result, studies have diverged in their methodology. For some annotation experiments, coders annotate the entire text (e.g., Rehbein et al., 2016; Zufferey et al., 2012). In these cases, they automatically take the context of the relation at hand into account when they annotate a text linearly. By contrast, in experiments where the entire text does not have to be annotated, or the task is split into smaller tasks for crowdsourcing purposes, the relations (or connectives) are often presented with a certain amount of context preceding and following the segments under investigation (e.g., Hoek and Zufferey, 2015; Scholman et al., 2016).

Knowing how much context is minimally needed to be able to reliably annotate data will save resources; after all, the less context annotators have to read, the less time they need to spend on the task. The goal of the current experiment is therefore to test the reliability of crowdsourced discourse annotations compared to original corpus annotations, as well as the effect of context on the reliability of the task.

3 Method

Participants were asked to insert connectives into coherence relations. The items were divided into several batches. Each batch contained items with context or without context, but these two types were not mixed.

3.1 Participants

167 native English speakers completed one or more batches of this experiment. They were recruited via Prolific Academic and reimbursed for their participation (2 GBP per batch with context; 1.5 GBP per batch without context). Their education level ranged between an undergraduate degree and a doctorate degree.

3.2 Materials

The experimental passages consisted of 192 implicit and 42 explicit relations from Wall Street Journal texts. These relations are part of both the Penn Discourse Treebank (PDTB, Prasad et al., 2008) and the Rhetorical Structure Theory Discourse Treebank (RST-DT, Carlson et al., 2003), and therefore carry labels that were assigned by the respective expert annotators at the time of the creation of the corpora. The following types of relations were included: 24 CAUSE, 24 CONJUNCTION (additive), 36 CONCESSION, 36 CONTRAST, 54 INSTANTIATION and 60 SPECIFICATION relations. For the first four relation types, the PDTB and RST-DT annotators were in agreement on the label. The latter two types were chosen to accommodate a related experiment, and for most of these, the PDTB and RST-DT annotators were not in agreement. Lower agreement on these relations is therefore also expected in the current experiment.

The 234 items were divided into 12 batches, with 2 CAUSE, 2 CONJUNCTION, 3 CONCESSION, 3 CONTRAST, 4 or 5 INSTANTIATION and 5 SPECIFICATION items per batch. The order of presentation of the items per batch was randomized to prevent order effects. Subjects were allowed to complete more than one batch, but saw every item only once. Average completion time per batch was 16 minutes with context and 12 minutes without context. Due to presentation errors in one CONJUNCTION, two CAUSE, and two CONCESSION items, the final dataset for analysis consists of 229 items.

Connecting phrases – Subjects were presented with a list of connectives and asked to insert the connective that best expresses the relation holding between the textual spans. The connectives were chosen to distinguish between different relation types as unambiguously as possible, based on an investigation of connective substitutability by Knott and Dale (1994). The list of connecting phrases consisted of: because, as a result, in addition, even though, nevertheless, by contrast, as an illustration and more specifically.

3.3 Procedure

The experiment was hosted on LingoTurk (Pusse et al., 2016). Participants were presented with a box with predefined connectives followed by the text passage. In the context condition, the passage consisted of black and grey sentences. The black sentences were the two arguments of the coherence relation, and the grey sentences functioned as context (two sentences preceding and one following the relation). Subjects were instructed to choose the connecting phrase that best reflected the meaning between the black text elements, but to take the grey text into account. In the no-context condition, the grey sentences were not presented or mentioned.

Punctuation markers following the first argument of the relation were replaced by a double slash (//, cf. Rohde et al., 2015) to prevent participants from being influenced by the original punctuation markers.

In between the two arguments of the coherence relation was a box. Participants were instructed to "drag and drop" the connecting phrase that "best reflected the meaning of the connection between the arguments" (cf. Rohde et al., 2015) into this green box. Participants could also choose two connecting phrases using the option "add another connective". Moreover, they could manually insert a connecting phrase by clicking "none of these".

Participants were allowed to complete more than one batch, but they were never able to complete the same batch in both conditions.

4 Results

Prior to analysis, 5 participants from the context condition and 4 participants from the no-context condition were removed from the analysis because they had very short completion times (<10 minutes for 20 passages of 5 sentences each; <5 minutes for 20 passages of 2 sentences each) and showed high disagreement compared to other participants. The following analyses do not take the responses of these participants into consideration. In total, each list was completed by 12 to 14 participants.

As with any discourse annotation task, some variation in the distribution of insertions can be expected. We are therefore interested in larger shifts in the distribution of insertions. To evaluate these distributions, we report percentages of agreement (cf. De Kuthy et al., 2016). Typically, annotation tasks are evaluated using Cohen's or Fleiss' Kappa (Cohen, 1960; Fleiss, 1971). However, Kappa is not suitable for the current task because it assumes that all coders annotate all fragments.
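As a concrete illustration of how such agreement percentages can be computed when not every coder annotates every item, the following sketch (our own, not the authors' analysis script; the record layout is an assumption) derives per-item pairwise percentage agreement from (item, participant, inserted class) records.

```python
from collections import defaultdict
from itertools import combinations

def pairwise_agreement(records):
    """records: iterable of (item_id, participant_id, inserted_class) tuples.
    Returns {item_id: % of annotator pairs for that item that inserted the same class}."""
    by_item = defaultdict(list)
    for item_id, _participant, cls in records:
        by_item[item_id].append(cls)
    agreement = {}
    for item_id, classes in by_item.items():
        pairs = list(combinations(classes, 2))
        if not pairs:
            continue  # item seen by fewer than two participants
        same = sum(1 for a, b in pairs if a == b)
        agreement[item_id] = 100.0 * same / len(pairs)
    return agreement

# Hypothetical records for two items:
records = [("item1", "p1", "CAUSE"), ("item1", "p2", "CAUSE"), ("item1", "p3", "CONCESSION"),
           ("item2", "p1", "CONJUNCTION"), ("item2", "p4", "CONJUNCTION")]
print(pairwise_agreement(records))  # {'item1': 33.3..., 'item2': 100.0}
```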

Participants were given the option of inserting two connecting phrases if they thought that both phrases reflected the meaning of the relation. 3.4% of all answers consisted of two connecting phrases. For most items that received a double insertion, only one answer consisted of a double insertion. The data on multiple insertions therefore does not allow us to draw any strong conclusions. This will be elaborated on in the discussion.

2% of all insertions were manual answers. There was no clear pattern in these manual answers: Only a few items received manual answers, and these items usually received at most two manual answers. An additional 1% of the data consisted of 'blank insertions': Subjects used the 'manual answer' option to not insert anything. As with the manual answers, there was no clear pattern. We aggregate the classes 'manual answer' and 'no answer' for our analyses.

We also aggregated frequencies of the connectives that fell into the same class: because and as a result were aggregated as causal connectives, and even though and nevertheless were aggregated as concessive connectives.
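This aggregation amounts to a lookup from inserted phrases to relation classes; the mapping below is our own sketch (the class names and the catch-all bucket for manual and blank answers are chosen for illustration), with the eight connecting phrases taken from Section 3.

```python
# Collapse inserted connecting phrases to the relation class they are taken to mark.
CONNECTIVE_TO_CLASS = {
    "because": "CAUSE", "as a result": "CAUSE",                 # causal connectives
    "even though": "CONCESSION", "nevertheless": "CONCESSION",  # concessive connectives
    "in addition": "CONJUNCTION",
    "by contrast": "CONTRAST",
    "as an illustration": "INSTANTIATION",
    "more specifically": "SPECIFICATION",
}

def to_class(insertion: str) -> str:
    """Manual answers and blank insertions fall into a single aggregated bucket."""
    return CONNECTIVE_TO_CLASS.get(insertion.lower().strip(), "MANUAL_OR_NONE")
```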

In the next section, we first show evidence that the method is reliable. We then turn to the reliability of the no-context condition in comparison to the context condition to be able to determine whether the presence of context led to higher agreement on the sense(s) of items. Finally, we look at the entropy per item and per condition.

4.1 Overall reliability

The results showed that the method is successful: The connectives inserted by the participants are consistent with the original annotation. This is shown in Figure 1a, with the bars reflecting the inserted connective per original class and condition. Figure 1b shows this distribution in more detail by displaying the percentage of inserted connectives per item for the context condition. The distribution for the no-context condition is not included since it is almost identical to the distribution of the context condition. Every stacked bar on the x-axis represents an item; the colours on the bars represent the inserted connective.

These visualizations reveal several trends. First, for CAUSE and CONCESSION relations, the insertions often converge with the original label. 78% of the inserted connectives in items with a causal original label were causal connectives, and 67% of the inserted connectives in concessive items were concessive connectives. For both classes, the second most frequent category of inserted connectives was the other class: For CAUSE, the second most frequent category was CONCESSION (10%), and for CONCESSION, the second most frequent category was CAUSE (15%). On closer inspection of the items, we find that the disagreement between crowdsourced annotations and original annotations can be traced back to difficulties with specific items, and not to unreliability of the workers: The main cause for the confusion of causal and concessive relations can be attributed to the lack of context and/or background knowledge, especially for items with economic topics. For these topics, it can be very hard to judge whether a situation mentioned in one segment is a consequence of the other segment, or a denied expectation.

The second pattern that Figure 1a reveals concerns the classes CONJUNCTION and CONTRAST. The distributions of inserted connectives for these classes look similar: The expected marker is used most often (40% and 44%, respectively), with the related causal and concessive connectives, respectively, as the second most frequent insertion type (27% causal insertions for CONJUNCTION and 32% concessive insertions for CONTRAST). A closer look at the annotations for items in these classes reveals that this is due to genuine ambiguity of the relation. For relations originally annotated as additive, we find that oftentimes a causal relation can also be inferred. The same explanation holds for CONTRAST relations: Relations from this class that often receive concessive insertions are characterized by the reference to contrasting expectations. Some confusion between these relations is expected, as it is known that concessive and contrastive relations are relatively difficult to distinguish even for trained annotators (see, for example, Robaldo and Miltsakaki, 2014; Zufferey and Degand, 2013).

Finally, looking at INSTANTIATION and SPECIFICATION relations, we can see that there is more variety in terms of which connective participants inserted. This was expected, as these relations were chosen because the original PDTB and RST annotators did not agree on them.

Looking at the no-context condition in Figure 1a, we find a near-perfect replication of the insertions in the context condition. This is further evidence for the reliability of the task. On average, agreement with the original label differed between the conditions by only 3.7%. Fisher exact tests showed no significant difference in the distribution of responses between conditions for any of the original classes (Cause: p = .61; Conjunction: p = .62; Concession: p = .98; Contrast: p = .88; Instantiation: p = .93; Specification: p = .85).
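As a rough sketch of this kind of test, the code below collapses each response to whether it matches the original label and runs a 2x2 Fisher exact test between conditions with scipy. This is our own simplification: the comparison reported above considers the full distribution of responses over all inserted classes, not just match versus mismatch.

```python
from scipy.stats import fisher_exact

def context_effect_test(context_responses, no_context_responses, original_class):
    """Build a 2x2 table (condition x match/mismatch with the original label)
    and test whether the two conditions differ. Collapsing to match/mismatch is
    a simplification of the full per-class comparison."""
    table = [
        [sum(r == original_class for r in context_responses),
         sum(r != original_class for r in context_responses)],
        [sum(r == original_class for r in no_context_responses),
         sum(r != original_class for r in no_context_responses)],
    ]
    return fisher_exact(table)  # odds ratio and p-value

# Hypothetical response counts for one original class:
odds, p = context_effect_test(["CAUSE"] * 78 + ["OTHER"] * 22,
                              ["CAUSE"] * 74 + ["OTHER"] * 26, "CAUSE")
```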

Another notable pattern, shown in Figure 1b, is that items often did not receive only one type of inserted connective; rather, they received multiple types of insertions. For INSTANTIATION and SPECIFICATION items, for example, participants often converged on two senses: both the originally annotated sense and a causal reading. This indicates that multiple interpretations are possible for a single relation.

Another way to analyse the data is to assign to each relation the label corresponding to the connective that was inserted most frequently by our participants (in Figure 1b, this corresponds to the largest bar per item). We can then calculate agreement between the dominant response per item and the original label. These results are reported in Table 1.
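A minimal sketch of this dominant-response analysis (the data layout and function names are ours) takes the most frequent inserted class per item and reports, per original class, the percentage of items whose dominant response matches the original label, along the lines of Table 1.

```python
from collections import Counter, defaultdict

def dominant_response(insertions_per_item):
    """insertions_per_item: {item_id: list of inserted classes}.
    Returns {item_id: most frequent class}; ties are broken arbitrarily."""
    return {item: Counter(classes).most_common(1)[0][0]
            for item, classes in insertions_per_item.items()}

def agreement_with_original(dominant, original_labels):
    """Percentage of items per original class whose dominant response equals the label."""
    hits, totals = defaultdict(int), defaultdict(int)
    for item, label in original_labels.items():
        totals[label] += 1
        hits[label] += int(dominant.get(item) == label)
    return {label: 100.0 * hits[label] / totals[label] for label in totals}
```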

(a) Distributions (%) of inserted connectives per original class. For every type of insertion, darker colours represent the context condition and lighter colours represent the no-context condition.

(b) By-item distributions (%) for the context condition. Every bar represents a single item; the colours on the bars represent the inserted connective. Plots are arranged according to the amount of dominant insertions corresponding to the original label.

Figure 1: Distributions (%) of inserted connectives per original class.

Original class   Context   No context
CAUSE               91         95
CONJUNCTION         52         35
CONCESSION          85         79
CONTRAST            53         58
INSTANTIATION       54         46
SPECIFICATION       25         20

Table 1: Percentage agreement between the original label and the dominant response per condition.

Table 1 shows that the dominant response converges with the original label often for CAUSE and CONCESSION relations, and a majority of the time for CONJUNCTION (in the context condition), CONTRAST and INSTANTIATION relations (in the context condition). The dominant response for SPECIFICATION items hardly converges with their original classification. This is as expected, as PDTB and RST-DT annotators also showed little agreement on SPECIFICATION relations.

Looking at the effect of context, we see that agreement between the dominant response and the original label is higher when context is present for four of the six relation types: slightly higher for CONCESSION, INSTANTIATION and SPECIFICATION relations, and even 17% higher for CONJUNCTION relations in the context condition compared to the no-context condition. These results suggest that the presence of context does have an influence on the subjects' interpretations of the relations. In the next sections, we will look at the distribution of individual items in more detail.

4.2 Effect of context: Dominant response per item

For 9% of the items, the dominant response shifts from one category to another depending on the presence of context. Manual inspection of these items revealed several characteristics that they have in common. First, it was found that often the topic is introduced in the context, and the (lack of) knowledge of the topic influenced the subject's interpretation of the relation. This is illustrated using the following CONJUNCTION example:

(1) Quite the contrary – it results from years of work by members of the National Council on the Handicapped, all appointed by President Reagan. You depict the bill as something Democratic leaders "hoodwinked" the administration into endorsing.
Arg1: The opposite is true: It's the product of many meetings with administration officials, Senate staffers, advocates, and business and transportation officials //
Arg2: many congressmen are citing the compromise on the Americans With Disabilities Act of 1989 as a model for bipartisan deliberations.
Most National Council members are themselves disabled or are parents of children with disabilities. (wsj 694)

In Example 1, the context introduces the topic. The first argument (Arg1) then presents one argument for the claim that the bill results from years of hard work (as mentioned in the context), and the second argument (Arg2) is another argument for this claim. However, without the context, Arg2 can be taken as a result of Arg1. While this interpretation might be true, it does not seem to be the intended purpose of the relation. In the context condition, subjects interpreted the relation as a CONJUNCTION relation (58% of insertions were in addition). In the no-context condition, however, the dominant response was causal (58% of insertions), and the conjunctive in addition only accounted for 17% of all insertions.

Another common characteristic of items for which the presence or absence of context changes the dominant response is that the context sentence following the relation expands on Arg2, thereby changing the probability distribution of that relation. This is common in INSTANTIATION and SPECIFICATION relations, where the second argument provides an example or specification of Arg1. Often, the sentence following Arg2 also provides an example or further specification, which emphasizes the INSTANTIATION/SPECIFICATION sense of the relation between Arg1 and Arg2. However, in relations for which Arg2 can also be seen as evidence for Arg1, the following context sentence can also function to emphasize the causal sense of the relation by expanding on the argument in Arg2. Consider Example 2, taken from the class SPECIFICATION.

(2) Like Lebanon, and however unfairly, Israel is regarded by the Arab world as a colonial aberration. Its best hope of acceptance by its neighbours lies in reaching a settlement with the Palestinians.
Arg1: Like Lebanon, Israel is being remade by demography //
Arg2: in Greater Israel more than half the children under six are Muslims.
Within 25 years Jews will probably be the minority. (wsj 1141)

In this example, the context sentence following Arg2 expands on Arg2. Together, they convey the information that although Jews are the majority now, within 25 years Muslims will be the majority. Without the context, one could imagine that the text would go on to list more instances of how the demography is changing. Subjects in the no-context condition indeed seem to have interpreted it this way: 75% of the inserted connective phrases were as an illustration, and the remaining insertions were even though and because. By contrast, in the context condition subjects mainly interpreted a causal relation (64% of insertions), together with the specification sense (17%). The marker as an illustration only accounted for 7% of completions. Hence, with context present, subjects interpreted Arg2 as providing evidence for Arg1, but without context it was interpreted as an INSTANTIATION relation.


4.3 Effect of context: Entropy per item

Another way of analyzing the influence of the presence of context on the participants' responses is to look at the entropy of the distribution of insertions. In the context of the current study, entropy is defined as a measure of the consistency of connective insertions. When the majority of insertions for a certain item are the same, the entropy will be low, but when a certain item receives many different types of insertions, the entropy will be high.

For every item, we calculated Shannon's entropy. We then compared the conditions to determine whether the entropy of an item increased or decreased depending on the presence of the context. Here we discuss items that have a difference of at least 1 bit of entropy between the conditions. This set consists of 18 items. Interestingly, the presence of context only leads to lower entropy (higher agreement) in 10 items. For the other 8 items, subjects showed more agreement when the context was not presented.
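For illustration, the per-item entropy and the between-condition difference can be computed as in the sketch below (our own code, assuming the insertions for each item and condition are available as lists of class labels).

```python
import math
from collections import Counter

def shannon_entropy(insertions):
    """Entropy (in bits) of the distribution of inserted classes for one item."""
    counts = Counter(insertions)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def entropy_difference(context_insertions, no_context_insertions):
    """Positive values mean the item was answered more consistently with context present."""
    return shannon_entropy(no_context_insertions) - shannon_entropy(context_insertions)

# Items whose entropy differs by at least 1 bit between conditions (cf. the 18 items above):
# selected = {i for i in items if abs(entropy_difference(ctx[i], no_ctx[i])) >= 1.0}
```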

When context is beneficial  An analysis of items for which the presence of context led to higher agreement has revealed two common characteristics. First, similar to what we found in the previous section, the presence of context is helpful when the context introduces important background information, or when the first argument refers to an entity or event in the context.

Second, we observed that agreement was higher in the context condition when Arg1 consists of a subordinate clause that attaches to another clause in the context. In these cases, the dependency of Arg1 on the context possibly hinders a correct interpretation of Arg1. Consider the following SPECIFICATION relation:

(3) The spun-off concern "clearly will be one of the dominant real estate development companies with a prime portfolio," he said. For the last year, Santa Fe Pacific has redirected its real estate operations toward longer-term development of its properties,
Arg1: hurting profits that the parent had generated in the past from periodic sales from its portfolio //
Arg2: real estate operating income for the first nine months fell to $71.9 million from $143 million a year earlier.
In a statement late yesterday, Santa Fe Pacific's chairman, Robert D. Krebs, said that Santa Fe Pacific Realty would repay more than $500 million in debt owed to the parent before the planned spinoff. (wsj 1330)

In this example, Arg1 is a deranked subordinate clause, which cannot be used as an independent clause. All subjects in the context condition inserted a causal connective. However, in the no-context condition only 58% inserted a causal connective, and 33% of inserted connectives were in addition. Hence, the dominant response remained the same, but the amount of agreement decreased when the context was absent.

When context is disadvantageous  Of the 8 items for which the absence of context led to more agreement, 7 had a common characteristic: The relation between the context and Arg1 is not strong, for example because Arg1 is also the start of a new paragraph, or because there is a topic change. It is likely that in these cases, the presence of context took the focus away from the relation.

5 Discussion

The annotations obtained using the connective insertion task have the potential to better reflect the average readers' interpretations because the naïve annotators don't rely on implicit expert knowledge. Moreover, it is easier, more affordable and faster to obtain many annotations for the same item via crowdsourcing than via traditional annotation methods. Collecting a large number of annotations for the same item furthermore reveals a probability distribution over relation senses. This can give researchers more insight into the readings of ambiguous relations, and into how dominant each sense is for a specific relation.

The procedures of traditional annotation methods often lead to implicit annotation biases that are implemented to achieve inter-annotator agreement (see, for example, Rehbein et al., 2016). However, annotations that contain biases are less useful from a linguistic or machine learning perspective, as relevant information about a second or third interpretation is obscured. Asking a single, trained annotator to annotate several senses also does not solve this issue: The annotations would still depend on expert knowledge and the annotation process would take more time. In this paper, we have shown that crowdsourcing can be a solution.

However, it should be noted that the design of the experiment was somewhat simplified compared to traditional annotation tasks, largely due to two factors. First, all items were known to be related to one of the six senses under investigation; that is, participants were not presented with items that did not actually contain a relation (similar to PDTB's NOREL), or that belonged to a different class from those under investigation (for example, TEMPORAL relations). A second constraint on the current study is that participants were presented with tokens that only marked the six classes under investigation. Including more classes and therefore also more connectives in an annotation study could result in lower agreement between the coders. Future research will therefore focus on whether other relations (including NOREL) can also be annotated reliably by naïve coders.

Crowdsourcing the data also presents possible confounding factors for the design of an annotation study. More specifically, one has to be aware of the effect of motivation on the results. For example, we found that the participants rarely inserted multiple connectives for the same relation. It is possible that motivation played a role in this. Participants were only required to insert one connecting phrase; the second one was optional. Since inserting a second phrase takes more time, participants might have neglected to do so, even if they interpreted multiple readings for some relations. For future experiments, this effect can be avoided by asking subjects to explicitly indicate that they don't see a second reading.

Regarding the influence of context, the findings from our experiment do not support the general consensus that the presence of context is a necessary requirement for discourse annotation. The lack of a clear positive effect of context on agreement could be due to the general ambiguity of language. As Spooren and Degand (2010) note, "establishing a coherence relation in a particular instance requires the use of contextual information, which in itself can be interpreted in multiple ways and hence is a source of disagreement." Nevertheless, we do suggest including context in discourse annotation tasks if time and resources permit it. Generally, context does not lead to worse annotations when the fragments are presented in their original formatting, and the presence of context might facilitate the inference of the intended relation.

6 Conclusion

The current paper addresses the question of whether a crowdsourced connective insertion task can be used to obtain reliable discourse annotations, and whether the presence of context influences the reliability of the data.

Regarding the influence of context, the results showed that the presence of context influenced the annotations when the fragments contained at least one of the following characteristics: (i) the context introduces the topic; (ii) the context sentence following the relation expands on the second argument of the relation; or (iii) the first argument of the relation is a subordinate clause that attaches to the context. The presence of context led to less agreement when the connection between the context and the first argument was not strong due to a paragraph break or a topic change.

Regarding the reliability of the task, we found that the method is reliable for acquiring discourse annotations: The majority of inserted connectives converged with the original label, and this convergence was almost perfectly replicable, in the sense that a similar pattern was found in both conditions. The results also showed that subjects often converged on two types of insertions. This indicates that multiple interpretations are possible for a single relation. Based on these results, we argue that annotation by many (more than 2) annotators is necessary, because it provides researchers with a probability distribution of all the senses of a relation. This probability distribution reflects the true meaning of the relation better than a single label assigned by an annotator according to a specific framework.

Acknowledgements

This research was funded by the German Research Foundation (DFG) as part of SFB 1102 "Information Density and Linguistic Encoding" and the Cluster of Excellence "Multimodal Computing and Interaction" (EXC 284).

References

Ron Artstein and Massimo Poesio. 2005. Bias decreases in proportion to the number of annotators. Proceedings of the Conference on Formal Grammar and Mathematics of Language (FG-MoL), pages 141–150.

Ron Artstein and Massimo Poesio. 2008. Inter-coder agreement for computational linguistics. Computational Linguistics, 34(4):555–596.

Fatemeh Torabi Asr and Vera Demberg. 2013. On the information conveyed by discourse markers. In Proceedings of the Fourth Annual Workshop on Cognitive Modeling and Computational Linguistics, pages 84–93.

Lynn Carlson, Daniel Marcu, and Mary Ellen Okurowski. 2003. Building a discourse-tagged corpus in the framework of Rhetorical Structure Theory. In Current and new directions in discourse and dialogue, pages 85–112. Springer.

Jacob Cohen. 1960. A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1):37–46.

Francis Cornish. 2009. "Text" and "discourse" as context. Working Papers in Functional Discourse Grammar (WP-FDG-82): The London Papers I, 2009, pages 97–115.

Kordula De Kuthy, Ramon Ziai, and Detmar Meurers. 2016. Focus annotation of task-based data: Establishing the quality of crowd annotation. In Proceedings of the Linguistic Annotation Workshop (LAW X), pages 110–119.

Joseph L. Fleiss. 1971. Measuring nominal scale agreement among many raters. Psychological Bulletin, 76(5):378–382.

Jet Hoek and Sandrine Zufferey. 2015. Factors influencing the implicitation of discourse relations across languages. In Proceedings of the 11th Joint ACL-ISO Workshop on Interoperable Semantic Annotation (isa-11), pages 39–45. TiCC, Tilburg center for Cognition and Communication.

Daisuke Kawahara, Yuichiro Machida, Tomohide Shibata, Sadao Kurohashi, Hayato Kobayashi, and Manabu Sassano. 2014. Rapid development of a corpus with discourse annotations using two-stage crowdsourcing. In Proceedings of the International Conference on Computational Linguistics (COLING), pages 269–278.

Alistair Knott and Robert Dale. 1994. Using linguistic phenomena to motivate a set of coherence relations. Discourse Processes, 18(1):35–62.

Klaus Krippendorff. 2004. Reliability in content analysis. Human Communication Research, 30(3):411–433.

Alex Lascarides, Nicholas Asher, and Jon Oberlander. 1992. Inferring discourse relations in context. In Proceedings of the 30th Annual Meeting of the Association for Computational Linguistics, pages 1–8. Association for Computational Linguistics.

Rashmi Prasad, Nikhil Dinesh, Alan Lee, Eleni Miltsakaki, Livio Robaldo, Aravind K. Joshi, and Bonnie Webber. 2008. The Penn Discourse TreeBank 2.0. In Proceedings of the International Conference on Language Resources and Evaluation (LREC).

Florian Pusse, Asad Sayeed, and Vera Demberg. 2016. LingoTurk: Managing crowdsourced tasks for psycholinguistics. In Proceedings of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT).

Ines Rehbein, Merel C. J. Scholman, and Vera Demberg. 2016. Annotating discourse relations in spoken language: A comparison of the PDTB and CCR frameworks. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC), Portorož, Slovenia. European Language Resources Association (ELRA).

Stefan Riezler. 2014. On the problem of theoretical terms in empirical computational linguistics. Computational Linguistics, 40(1):235–245.

Livio Robaldo and Eleni Miltsakaki. 2014. Corpus-driven semantics of concession: Where do expectations come from? Dialogue & Discourse, 5(1):1–36.

Hannah Rohde, Anna Dickinson, Nathan Schneider, Christopher Clark, Annie Louis, and Bonnie Webber. 2016. Filling in the blanks in understanding discourse adverbials: Consistency, conflict, and context-dependence in a crowdsourced elicitation task. In Proceedings of the Linguistic Annotation Workshop (LAW X), pages 49–58.

Merel C. J. Scholman, Jacqueline Evers-Vermeul, and Ted J. M. Sanders. 2016. Categories of coherence relations in discourse annotation: Towards a reliable categorization of coherence relations. Dialogue & Discourse, 7(2):1–28.

Rion Snow, Brendan O'Connor, Daniel Jurafsky, and Andrew Y. Ng. 2008. Cheap and fast—but is it good?: Evaluating non-expert annotations for natural language tasks. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 254–263. Association for Computational Linguistics.

Lichao Song. 2010. The role of context in discourse analysis. Journal of Language Teaching and Research, 1(6):876–879.

Wilbert P. M. S. Spooren and Liesbeth Degand. 2010. Coding coherence relations: Reliability and validity. Corpus Linguistics and Linguistic Theory, 6(2):241–266.

Sandrine Zufferey and Liesbeth Degand. 2013. Annotating the meaning of discourse connectives in multilingual corpora. Corpus Linguistics and Linguistic Theory, 1:1–24.

Sandrine Zufferey, Liesbeth Degand, Andrei Popescu-Belis, and Ted J. M. Sanders. 2012. Empirical validations of multilingual annotation schemes for discourse relations. In Proceedings of the 8th Joint ISO-ACL/SIGSEM Workshop on Interoperable Semantic Annotation, pages 77–84.


Proceedings of the 11th Linguistic Annotation Workshop, pages 34–40, Valencia, Spain, April 3, 2017. ©2017 Association for Computational Linguistics

A Code-Switching Corpus of Turkish-German Conversations

Özlem Çetinoglu
IMS, University of Stuttgart

[email protected]

Abstract

We present a code-switching corpus of Turkish-German that is collected by recording conversations of bilinguals. The recordings are then transcribed in two layers following speech and orthography conventions, and annotated with sentence boundaries and intersentential, intrasentential, and intra-word switch points. The total amount of data is 5 hours of speech, which corresponds to 3614 sentences. The corpus aims at serving as a resource for speech or text analysis, as well as a collection for linguistic inquiries.

1 Introduction

Code-switching (CS) is mixing two (or more) languages in spoken and written communication (Myers-Scotton, 1993; Poplack, 2001; Toribio and Bullock, 2012) and is quite common in multilingual communities (Auer and Wei, 2007). With the increase in multilingual speakers worldwide, CS becomes more prominent.

In parallel, the interest in processing mixed language is on the rise in the Computational Linguistics community. Researchers work on core tasks such as normalisation, language identification, language modelling, part-of-speech tagging as well as downstream ones such as automatic speech recognition and sentiment analysis (Çetinoglu et al., 2016). The majority of the corpora used in these tasks come from social media (Nguyen and Dogruöz, 2013; Barman et al., 2014; Vyas et al., 2014; Solorio et al., 2014; Choudhury et al., 2014; Jamatia et al., 2015; Samih and Maier, 2016; Vilares et al., 2016; Molina et al., 2016).

Social media has the advantage of containing a vast amount of data and easy access. Depending on the medium, however, limitations might arise. For instance, Twitter, the most popular source so far, allows the distribution of tweet IDs rather than the tweets themselves, which can be deleted. Hence it is hard to use the full resource, reproduce previous results or compare to them. Moreover, the character limit and idiosyncratic language of social media bring extra challenges of processing, in addition to the challenges coming from code-switching.

Spoken data has also been a popular source in computational CS research (Solorio and Liu, 2008; Lyu and Lyu, 2008; Chan et al., 2009; Shen et al., 2011; Li et al., 2012; Lyu et al., 2015; Yilmaz et al., 2016). There are no limitations on the length of sentences, and idiosyncrasies are less pronounced. Despite such advantages, it is almost solely used in speech analysis. To our knowledge, only Solorio and Liu (2008) have used transcriptions of CS speech in text analysis. One reason that researchers processing CS text prefer social media could be that it is already text-based, and it requires much less time and effort than speech collection and transcription. For the existing speech corpora, discrepancies between the speech transcriptions and the input that text processing tools expect could be a drawback. For instance, the SEAME corpus (Lyu et al., 2015) does not use punctuation, capitalisation, or sentence boundaries in transcriptions, yet standard text processing tools (POS taggers, morphological analysers, parsers) are trained on edited text and hence make use of orthographic cues.

In this paper, we introduce a Turkish-German code-switching corpus of conversations and their two layers of transcriptions following speech and orthography conventions. The data is annotated with sentence boundaries and intersentential, intrasentential, and intra-word switch points. Our aim is to provide a resource that could be used by researchers from different backgrounds, e.g., for speech recognition and language identification in speech, for language identification and predicting CS points in text, and as a corpus of empirical evidence for linguistically interesting structures.

2 Related Work

Creating code-switching corpora for speech analysis started with reading designed text rather than recording spontaneous speech. Lyu and Lyu (2008) use a Mandarin-Taiwanese test set for their language identification system that consists of 4.8 hours of speech corresponding to 4600 utterances. The set is designed to have Mandarin as the main language, with one or two Taiwanese words replaced with their Mandarin counterparts. Chan et al. (2009) introduce a Cantonese-English corpus of read speech of 3167 manually designed sentences. English is inserted into Cantonese as segments of one or more words. Another read speech corpus is created by Shen et al. (2011) for Mandarin-English and consists of 6650 utterances. Li et al. (2012) collected 5 hours of code-switched Mandarin-English speech from conversational and project meetings. Intersentential and intrasentential switches add up to 1068 in total.

Lyu et al. (2015) present the largest CS speech resource, the SEAME corpus, which has 192 hours of transcribed Mandarin-English interviews and conversations in the latest version.1 The code-switching points naturally occur in the text, as both languages are written in their own scripts. A recent corpus of 18.5 hours is introduced by Yilmaz et al. (2016) on Frisian-Dutch broadcasts. CS points are marked in the transcriptions but not on the audio level.

Solorio and Liu (2008) recorded a conversation of 40 minutes among Spanish-English bilinguals. The transcribed speech contains 922 sentences with 239 switch points among them. The authors used this data to train machine learning algorithms that predict CS points of an incrementally given input.

1 https://catalog.ldc.upenn.edu/LDC2015S04

Speech collections have always been the primary source in sociolinguistic and psycholinguistic research. We list some of these spoken corpora that employ code-switching instances of Turkish and German, mixed with other languages or with each other. The "Emigranto" corpus (Eppler, 2003) documents conversations with Jewish refugees settled in London in the 1930s, who mix Austrian German with British English. In this corpus, Eppler (2011) looks into mixed dependencies, where a dependent and its head are from different languages. She observes that dependents with a mixed head have on average longer dependencies than ones with a monolingual head.

In a similar fashion, Tracy and Lattey (2009) present more than 50 hours of recordings of elderly German immigrants in the U.S. The data is fully transcribed and annotated, yet each session of recordings is transcribed as a single file with no alignment between transcript utterances and their corresponding audio parts, and the annotations use Microsoft Word markings, e.g. bold, italic, underline, or different font sizes, and thus require format conversions to be processed by automatic tools that accept text-based inputs.

Kallmeyer and Keim (2003) investigate the communication characteristics between young girls in Mannheim, mostly of Turkish origin, and show that with peers, they employ a mixed form of Turkish and German. Rehbein et al. (2009) and Herkenrath (2012) study the language acquisition of Turkish-German bilingual children. On the same data, Özdil (2010) analyses the reasons for code-switching decisions. The Kiezdeutsch corpus (Rehbein et al., 2014) consists of conversations among native German adolescents with a multiethnic background, including Turkish. As a result, it also contains a small number of Turkish-German mixed sentences.

3 Data

The data collection and annotation processes are handled by a team of five Computational Linguistics and Linguistics students. In the following sections we give the details of these processes.

3.1 Collection

The data collection is done by the annotators as conversation recordings. We asked the annotators to approach Turkish-German bilinguals from their circle for an informal setting, assuming this might increase the frequency of code-switching. Similarly, we recommended that the annotators open topics that might induce code-switching, such as work and studies (typically German-speaking environments) if a dialogue started in Turkish, or Turkish food and holidays in Turkey (hence Turkish-specific words) in a German-dominated conversation.


28 participants (20 female, 8 male) took part in the recordings. The majority of the speakers are university students. Their ages range from 9 to 39, with an average of 24 and a mode of 26. We also asked the participants to assign a score from 1 to 10 for their proficiency in Turkish and German. 18 of the participants think their German is better, 5 of them think their Turkish is better, and the remaining 5 assigned an equal score. The average score for German is 8.2, and for Turkish 7.5.2

3.2 Annotation

The annotation and transcriptions are done using Praat.3 We created six tiers for each audio file: spk1_verbal, spk1_norm, spk2_verbal, spk2_norm, lang, codesw. The first four tiers contain the verbal and normalised transcription of speakers 1 and 2. The tier lang corresponds to the language of intervals and can have TR for Turkish, DE for German, and LANG3 for utterances in other languages. The first five tiers are intervals, while the last one is a point tier that denotes sentence and code-switching boundaries. The labels on the boundaries are SB when both sides of the boundary are in the same language, SCS when the language changes from one sentence to the next (intersentential), and WCS when the switch is between words within a sentence (intrasentential). Figure 1 shows a Praat screenshot that demonstrates the tiers and exemplifies SCS and WCS boundaries.

Since Turkish is agglutinative and case markers determine the function of NPs, non-Turkish common and proper nouns with Turkish suffixes are commonly observed in CS conversations. We mark such words in the codesw tier as an intra-word switch and use the symbol § following Çetinoglu (2016). Example (1) depicts the representation of a mixed word where the German compound Studentenwohnheim 'student accommodation' is followed by the Turkish locative case marker -da (in bold).

(1) Studentenwohnheim § da
    student accommodation § Loc
    'in the student accommodation'

For many proper names, Turkish and German orthography are identical. Here, having the speech data in parallel becomes an advantage, and the language is decided according to the pronunciation. If the proper name is pronounced in German and followed by a Turkish suffix, a § switch point is inserted. Otherwise it follows Turkish orthography.

2 The metadata is also available in the CMDI format at the IMS Clarin repository.
3 www.fon.hum.uva.nl/praat

3.3 Transcription

For speech analysis it is important to transcribe utterances close to how they are pronounced. In some transcription guidelines, capitalisation and punctuation are omitted (e.g. in the SEAME corpus (Lyu et al., 2015)4), in some others they are used to mark speech information (e.g. in the Kiezdeutsch corpus (Rehbein et al., 2014)5). Text analysis, on the other hand, generally relies on standard orthography. This raises a conflict between the two tasks on how to transcribe speech. To avoid this problem, we introduced two tiers of transcription. The verbal tier follows the speech conventions. If a speaker uses a contraction, the word is transcribed as contracted. Acronyms are written as separate characters. Numbers are spelled out. Recurring characters are represented with the single character followed by a colon. The normalised tier follows the edited text conventions. Words obey the orthographic rules of standard Turkish and German, e.g. characters of acronyms are merged back. Punctuation is added to the text, obeying the tokenisation standards (i.e. separated from the preceding and following tokens with a space).

Example (2) gives a sentence showing the verbal and normalised tiers for a Turkish sentence. The r sound in the progressive tense suffix -yor is not pronounced, hence omitted in the verbal tier. The vowel of the interjection ya is extended during speech, and the colon representation is used to reflect it in the verbal tier, yet the normalised tier has the standard form. Also, the question mark is present in the normalised tier.

(2) verbal: ne   diyosun        ya:
    norm:   Ne   diyorsun       ya    ?
            What say.Prog.2PSg  Intj.
    'What do you say?'

If a made-up word is uttered, it is preceded with an asterisk mark in the transcription. Note that dialectal pronunciation or using a valid word in the wrong context is not considered within this class. Partial words are marked with two hyphens instead of the common use of one hyphen, as the latter is used in German to denote the initial part of a compound when two compounds share a common part and the first compound is written only as the unshared part (e.g. Wohn- und Schlafzimmer 'living room and bedroom').

4 https://catalog.ldc.upenn.edu/docs/LDC2015S04/SEAME.V4.0.pdf
5 http://www.kiezdeutschkorpus.de/files/kidko/downloads/KiDKo-Transkriptionsrichtlinien.pdf

Figure 1: A screenshot example from Praat annotations. It shows part of a Turkish sentence and a full mixed sentence from speaker 1, and part of a Turkish sentence from speaker 2.

We also marked [silence], [laugh], [cough], [breathe], [noise], and put the remaining sounds into the [other] category. Overlaps usually occur when one speaker is talking and the other is uttering backchannel signals and words of acknowledgement. There are also cases where both speakers tend to speak at the same time. In all such cases, both voices are transcribed, one speaker is chosen to be the main speaker, and an [overlap] marker is inserted into the secondary speaker's verbal and normalised tiers. The codesw and lang tiers are decided according to the main speaker's transcription.

3.4 Quality Control

Once the Praat annotation is completed, its output files are converted to a simpler text format for easier access from existing tools and for easier human readability.6 We ran simple quality control scripts that check whether all the tiers are present and non-empty, whether the lang and codesw tiers have values from their label sets, and whether the lang and codesw labels are consistent: for instance, if there are TR labels on both sides of a SCS (intersentential CS) boundary, either the boundary should be corrected to SB or one of the language labels should be DE or LANG3. Any mistakes are corrected by the annotators in a second pass.

6 The format of the text files is given with an example at http://www.ims.uni-stuttgart.de/institut/mitarbeiter/ozlem/LAW2017.html. The script that converts Praat .TextGrid files to that format is also provided.
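To make the boundary check concrete, here is a minimal sketch in Python, assuming the lang and codesw annotations have been exported as (language to the left, boundary label, language to the right) triples; this triple layout and the function name are our own assumption, not the released conversion format.

```python
LANGS = {"TR", "DE", "LANG3"}
BOUNDARIES = {"SB", "SCS", "WCS"}

def check_boundary(left_lang, boundary, right_lang):
    """Return a list of problems for one (language, boundary, language) triple."""
    problems = []
    if left_lang not in LANGS or right_lang not in LANGS:
        problems.append("unknown language label")
    if boundary not in BOUNDARIES:
        problems.append("unknown boundary label")
    elif boundary == "SCS" and left_lang == right_lang:
        # same language on both sides: either the boundary should be SB,
        # or one of the language labels is wrong
        problems.append("SCS between identical languages")
    elif boundary == "SB" and left_lang != right_lang:
        problems.append("SB between different languages")
    return problems
```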

For the quality control of the transcriptions we employed Turkish and German morphological analysers (Oflazer, 1994; Schmid et al., 2004) and analysed all the tokens in the normalised tier according to their languages. We then created a list of tokens unknown to the analysers, which are potentially misspelled words. The annotators went through the list and corrected their mistakes in both the verbal and normalised tiers. The remaining list also gives us the words unknown to the morphological analysers.

4 Statistics and Observations

The durations of recordings range from 20 seconds to 16 minutes. There are 47 transcribed files with a total of 5 hours. Each file is accompanied by a metadata file that contains speaker information, which could be used to filter the corpus according to age intervals, education levels, language proficiency, etc.

Table 1 gives the basic statistics on the normalised version of the transcriptions. The token count includes punctuation and interjections, and excludes paralinguistic markers and overlaps.


sentences                         3614
tokens                           41056
average sent. length             11.36
sentence boundaries (SB)          2166
intersentential switches (SCS)    1448
intrasentential switches (WCS)    2113
intra-word switches (§)            257
switches in total                 3818
sent. with at least one WCS       1108

Table 1: Basic statistics about the data.

Switch  Language Pair        #      %
SB      DE → DE           1356  62.60
        TR → TR            809  37.35
        LANG3 → LANG3        1   0.05
SCS     TR → DE            754  52.07
        DE → TR            671  46.34
        LANG3 → TR           7   0.48
        LANG3 → DE           6   0.41
        DE → LANG3           5   0.35
        TR → LANG3           5   0.35
WCS     TR → DE           1082  51.20
        DE → TR            914  43.26
        DE → LANG3          34   1.61
        TR → LANG3          31   1.47
        LANG3 → DE          28   1.33
        LANG3 → TR          24   1.14
§       DE → TR            246  95.72
        LANG3 → TR          11   4.28

Table 2: Breakdown of switches from one language to another, and their percentages within their switch type.

Switch points split mixed tokens into two in the transcriptions for representational purposes, but they are counted as one token in the statistics.

The majority of the switches are intrasentential; they correspond to 55.3% of all switches. The language of the conversation changes when moving from one sentence to another 40% of the time; these intersentential switches make up 38% of all switches, and the remaining 6.7% of switches occur within a word. Table 2 shows the breakdown of switches. There are 614 overlaps and 648 paralinguistic markers.7

7 laugh: 279, noise: 148, silence: 113, breath: 74, other: 25, cough: 9.
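For illustration, a Table 2-style breakdown can be derived from the same kind of (left language, label, right language) triples as in the quality-control sketch above with a few lines of counting; the code below is our own and treats '§' as the label for intra-word switch points.

```python
from collections import Counter

def switch_breakdown(boundaries):
    """boundaries: iterable of (left_lang, label, right_lang) triples, where label is
    one of SB, SCS, WCS, or '§'. Returns {label: Counter of 'LEFT->RIGHT' pairs}."""
    breakdown = {}
    for left, label, right in boundaries:
        breakdown.setdefault(label, Counter())[f"{left}->{right}"] += 1
    return breakdown

def percentages(counter):
    """Percentage of each language pair within one switch type (as in Table 2)."""
    total = sum(counter.values())
    return {pair: 100.0 * n / total for pair, n in counter.items()}
```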

We have observed that many CS instances fall into the categories mentioned in Çetinoglu (2016), like German verbs coupled with the Turkish light verbs etmek 'do' or yapmak 'make'; Turkish lexicalised expressions and vocatives in German sentences, and vice versa; and subordinate clauses and conjunctions in one language while the remainder of the sentence is in the other language. One category we have seen more prominently in the speech data is non-standard syntactic constructions, perhaps due to spontaneity. For instance, Example (3), which is also given as Figure 1, is a question with two verbs (Turkish in bold). Both German hast du and Turkish var mı correspond to 'do you have'.

(3) Hast du  auch so   BWL               gibi derslerin       var   mı   so?
    Have you also like business studies  like class.Poss2Sg   exist Ques like
    'Do you also have classes like business studies?'

5 Conclusion

We present a corpus collected from Turkish-German bilingual speakers, and annotated with sentence and code-switching boundaries in audio files and their corresponding transcriptions, which are carried out as both verbal and normalised tiers. In total, it is 5 hours of speech and 3614 sentences.

Transcriptions are available for academic research purposes.8 The licence agreement can be found at http://www.ims.uni-stuttgart.de/institut/mitarbeiter/ozlem/LAW2017.html along with transcription examples. Audio files will be manipulated before distribution in order to conceal speakers' identity, to comply with the German data privacy laws.9

8 If parts of data contain information that can reveal the identity of a specific person, they are anonymised.
9 National: https://www.datenschutz-wiki.de/Bundesdatenschutzgesetz, State: https://www.baden-wuerttemberg.datenschutz.de/landesdatenschutzgesetz-inhaltsverzeichnis/

Acknowledgements

We thank Sevde Ceylan, Ugur Kostak, Hasret el Sanhoury, Esra Soydogan, and Cansu Turgut for the data collection and annotation. We also thank the anonymous reviewers for their insightful comments. This work was funded by the Deutsche Forschungsgemeinschaft (DFG) via SFB 732, project D2.

ReferencesPeter Auer and Li Wei. 2007. Handbook of multilin-

gualism and multilingual communication, volume 5.Walter de Gruyter.

Utsab Barman, Amitava Das, Joachim Wagner, andJennifer Foster. 2014. Code mixing: A challengefor language identification in the language of so-cial media. In Proceedings of the First Workshopon Computational Approaches to Code Switching,

8If parts of data contain information that can reveal theidentity of a specific person, they are anonymised.

9National: https://www.datenschutz-wiki.de/Bundesdatenschutzgesetz, State: https://www.baden-wuerttemberg.datenschutz.de/landesdatenschutzgesetz-inhaltsverzeichnis/


Proceedings of the 11th Linguistic Annotation Workshop, pages 41–45, Valencia, Spain, April 3, 2017. ©2017 Association for Computational Linguistics

Annotating omission in statement pairs

Hector Martínez Alonso (1), Amaury Delamaire (2,3), Benoît Sagot (1)
1. Inria (ALMAnaCH), 2 rue Simone Iff, 75012 Paris, France
2. Ecole des Mines de Saint-Etienne, 158 cours Fauriel, 42000 Saint-Etienne, France
3. Storyzy (Trooclick), 130 rue de Lourmel, 75015 Paris, France
{hector.martinez-alonso,benoit.sagot}@inria.fr
[email protected]

Abstract

In this piece of industrial application, we focus on the identification of omission in statement pairs for an online news platform. We compare three annotation schemes, namely two crowdsourcing schemes and an expert annotation. The simplest of the two crowdsourcing approaches yields a better annotation quality than the more complex one. We use a dedicated classifier to assess whether the annotators' behaviour can be explained by straightforward linguistic features. However, for our task, we argue that expert and not crowdsourcing-based annotation is the best compromise between cost and quality.

1 Introduction

In a user survey, the news aggregator Storyzy1 found out that the two main obstacles for user satisfaction when accessing their site's content were redundancy of news items and missing information, respectively. Indeed, in the journalistic genre that is characteristic of online news, editors make frequent use of citations as prominent information; yet these citations are not always in full. The reasons for leaving information out are often motivated by the political leaning of the news platform.

Existing approaches to the detection of political bias rely on bag-of-words models (Zhitomirsky-Geffet et al., 2016) that examine the words present in the writings. Our goal is to go beyond such approaches, which focus on what is said, by instead focusing on what is omitted. Thus, this method requires a pair of statements: an original one, and a shortened version with some deleted words or spans.

1 http://storyzy.com

The task is then to determine whether the information left out in the second statement conveys substantial additional information. If so, the pair presents an omission; cf. Table 1.

Omission detection in sentence pairs constitutes a new task, which is different from the recognition of textual entailment—cf. (Dagan et al., 2006)—because in our case we are certain that the longer text entails the short one. What we want to estimate is whether the information not present in the shorter statement is relevant. To tackle this question, we used a supervised classification framework, for which we require a dataset of manually annotated sentence pairs.

We conducted an annotation task on a sample of the corpus used by the news platform (Section 3). In this corpus, reference statements extracted from news articles are used as long 'reference' statements, whereas their short 'target' counterparts were selected by string and date matching.

We then examined which features help identify cases of omission (Section 4). In addition to straightforward measures of word overlap (the Dice coefficient), we also found that a good deal of lexical information determines whether there is an omission. This work is, to the best of our knowledge, the first empirical study on omission identification in statement pairs.2

2 Related work

To the best of our knowledge, no work has been published about omission detection as such. However, our work is related to a variety of questions of interest that pertain both to linguistics and NLP.

Segment deletion is one of the most immediate forms of paraphrase, cf. Vila et al. (2014) for a survey.

2 All data and annotations are freely available at github.com/hectormartinez/verdidata.


Another phenomenon that also presents the notion of segment deletion, although in a very different setting, is ellipsis. In the case of an ellipsis, the deleted segment can be reconstructed given a discourse antecedent in the same document, be it observed or idealized (Asher et al., 2001; Merchant, 2016). In the case of omission, a reference and a target version of a statement are involved, the deleted segment in one version having an antecedent in the other version of the statement, in another document, as a result of editorial choices.

Our task is similar to the problem of omission detection in translations, but the bilingual setting allows for word-alignment-based approaches (Melamed, 1996; Russell, 1999), which we cannot use in our setup. Omission detection is also related to hedge detection, which can be achieved using specific lexical triggers such as vagueness markers (Szarvas et al., 2012; Vincze, 2013).

3 Annotation Task

The goal of the annotation task is to provide each reference–target pair with a label: Omission, if the target statement leaves out substantial information, or Same if there is no information loss.

Corpus We obtained our examples from a corpus of English web newswire. The corpus is made up of aligned reference–target statement pairs; cf. Table 1 for examples. These statements were aligned automatically by means of word overlap metrics, as well as a series of heuristics such as comparing the alleged speaker and date of the statement given the article content, and a series of text normalization steps. We selected 500 pairs for annotation. Instead of selecting 500 random pairs, we selected a contiguous section from a random starting point. We did so in order to obtain a more natural proportion of reference-to-target statements, given that reference statements can be associated with more than one target.3

Annotation setup Our first manual annotation strategy relies on the AMT crowdsourcing platform. We refer to AMT annotators as turkers. For each statement pair, we presented the turkers with a display like the one in Figure 1.

We used two different annotation schemes, namely OMp, where the option to mark an omission is "Text B leaves out some substantial information", and OMe, where it is "Text B leaves out something substantial, such as time, place, cause, people involved or important event information."

3 The full distribution of the corpus documentation shall provide more details on the extraction process.

The OMp scheme aims to represent a naive user intuition of the relevance of a difference between statements, akin to the intuition of the users mentioned in Section 1, whereas OMe aims at capturing our intuition that relevant omissions relate to missing key news elements describable in terms of the 5-W questions (Parton et al., 2009; Das et al., 2012). We ran the AMT task twice, once for each scheme. For each scheme, we assigned 5 turkers per instance, and we required that the annotators be Categorization Masters according to the AMT scoring. We paid $0.05 per instance.

Moreover, in order to choose between OMp and OMe, two experts (two of the authors of this article) annotated the same 100 examples from the corpus, yielding the OE annotation set.

Figure 1: Annotation scheme for OMp

Annotation results The first column in Table 2 shows the agreement of the annotation tasks in terms of Krippendorff's α coefficient. A score of e.g. 0.52 is not a very high value, but is well within what can be expected for crowdsourced semantic annotations. Note, however, that the chance correction that the calculation of α applies to a skewed binary distribution is very aggressive (Passonneau and Carpenter, 2014). The conservativeness of the chance-corrected coefficient can be assessed if we compare the raw agreement between experts (0.86) with the α of 0.67. OMe causes agreement to descend slightly, and damages the agreement of Same, while Omission remains largely constant. Moreover, disagreement is not evenly distributed across annotated instances, i.e. some instances show perfect agreement, while other instances have maximal disagreement.
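As an illustration of how such agreement figures are obtained, the following is a minimal sketch of nominal Krippendorff's α for a multiply-annotated binary task; the function name and the toy votes are purely illustrative and not taken from the annotated data.

from collections import Counter

def krippendorff_alpha_nominal(units):
    """Nominal Krippendorff's alpha.

    units: list of label lists, one inner list per annotated item
    (e.g. the five turker votes for one statement pair). Items with
    fewer than two labels are not pairable and are skipped.
    """
    units = [u for u in units if len(u) >= 2]
    label_totals = Counter()
    observed = 0.0  # sum over items of disagreeing ordered pairs / (m - 1)
    for u in units:
        m = len(u)
        counts = Counter(u)
        label_totals.update(counts)
        disagreeing_pairs = m * m - sum(c * c for c in counts.values())
        observed += disagreeing_pairs / (m - 1)
    n = sum(label_totals.values())
    # expected disagreement from the pooled label distribution
    expected = n * n - sum(c * c for c in label_totals.values())
    return 1.0 - (n - 1) * observed / expected

# Toy example: four statement pairs, five votes each (Omission vs. Same).
votes = [
    ["Om", "Om", "Om", "Om", "Om"],
    ["Om", "Om", "Om", "Om", "Same"],
    ["Same", "Same", "Om", "Om", "Om"],
    ["Same", "Same", "Same", "Same", "Om"],
]
print(round(krippendorff_alpha_nominal(votes), 3))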

We also measured the median annotation time per instance for all three methods; OMe is almost twice as slow as OMp (42s vs. 22s), while the expert annotation time in OE is 16s.


Example 1 (OMp 0, OMe 1, OE 1): Interior Minister Chaudhry Nisar Ali Khan on Friday said no Pakistani can remain silent over the atrocities being committed against the people of the occupied Kashmir by the Indian forces.

Example 2 (OMp .8, OMe .2, OE 0): I don't feel guilty. I cannot tell you how humiliated I feel. "I feel robbed emotionally. But we're coming from the east (eastern Europe), we're too close to Russia .."

Example 3 (OMp .6, OMe .4, OE .5): The tusks resemble the prehistoric sabre-tooth tiger, but of course, they are not related. It could make wildlife watching in Sabah more interesting. The rare elephant's reversed tusks might create some problems when it comes to jostling with other elephants. The tusks resemble the prehistoric sabre-tooth tiger, but of course, they are not related

Table 1: Examples of annotated instances. The 'Instance' text contains the full reference statement, with the elements not present in the target statement marked in italics in the original layout. The three values per example display the proportion of Omission labels provided by the three annotation setups.

Dataset    α     t    % Om.   Vote   MACE
Full OMp   0.52  22   61.72   .65    .63
Full OMe   0.49  41   63.48   .69    .61
100 OMp    0.52  22   62.42   .64    .62
100 OMe    0.54  42   60.00   .61    .58
100 OE     0.67  16   70.87   —      .62

Table 2: Dataset, Krippendorff's α, median annotation time, raw proportion of Omission, and label distribution using voting and MACE.

The large time difference between OMp and OMe indicates that changing the annotation guidelines indeed has an effect on annotation behavior, and that the agreement variation is not purely a result of the expectable annotation noise in crowdsourcing.

The fourth and fifth columns in Table 2 show the label distribution after adjudication. While the distribution of Omission–Same labels is very similar after applying simple majority voting, we observe that the distribution of the agreement does change. In OMp, approx. 80% of the Same-label instances are assigned with high agreement (at least four out of five votes), whereas only a third of the Same instances in OMe have such high agreement. Both experts have a similar perception of omission, albeit with a different threshold: in the 14 instances where they disagree, one of the annotators shows a systematic preference for the Omission label.

We also use MACE to evaluate the stability of the annotations. Using an unsupervised expectation-maximization model, MACE assigns confidence values to annotators, which are used to estimate the resulting annotations (Hovy et al., 2013). While we do not use the label assignments from MACE for the classification experiments in Section 4, we use them to measure how much the proportion of omission changes with regard to simple majority voting. The more complex OMe scheme has, parallel to its lower agreement, a much higher fluctuation—both in relative and absolute terms—with regard to OMp, which also indicates that the former scheme provides annotations that are more subject to individual variation. While this difference is arguably a result of genuine linguistic reflection, it also indicates that the data obtained by this method is less reliable as such.

To sum up, while the label distribution is similar across schemes, the Same class drops in overall agreement, but the Omission class does not.

In spite of the variation suggested by their α coefficient, the two AMT annotated datasets are very similar. They are 85% identical after label assignment by majority voting. However, the cosine similarity between the example-wise proportions of omission labels is 0.92. This difference is a consequence of the uncertainty in low-agreement examples. The similarity with OE is 0.89 for OMp and 0.86 for OMe; OMp is more similar to the expert judgment. This might be related to the fact that the OMe instructions prime turkers to favor named entities, leading them to pay less attention to other types of substantial information such as modality markers. We shall come back to the more general role of lexical clues in Section 4.
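The two comparison measures used here (label identity after majority voting and cosine similarity between example-wise Omission proportions) can be sketched as follows; the proportions in the example are made up for illustration and do not come from the dataset.

import numpy as np

# Hypothetical per-example proportions of Omission votes under two schemes.
p_omp = np.array([0.0, 0.8, 0.6, 1.0, 0.2])
p_ome = np.array([0.2, 0.6, 0.4, 1.0, 0.0])

# Agreement of the adjudicated labels (majority vote of five annotators).
same_label = np.mean((p_omp >= 0.5) == (p_ome >= 0.5))

# Cosine similarity between the two proportion vectors.
cosine = p_omp @ p_ome / (np.linalg.norm(p_omp) * np.linalg.norm(p_ome))

print(f"identical majority labels: {same_label:.2f}, cosine: {cosine:.2f}")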

Given that it is more internally consistent and matches better with OE, we use the OMp dataset for the rest of the work described in this article.

4 Classification experiments

Once the manually annotated corpus is built, we can assess the learnability of the Omission–Same decision problem, which constitutes a binary classification task. We aimed at measuring whether the annotators' behavior can be explained by simple proxy linguistic properties like word overlap or length of the statements and/or lexical properties.

Features: For a reference statement r, a target statement t and a set M of the words that only appear in r, we generate the following feature sets (a small feature-extraction sketch follows the list):

1. Dice (Fa): the Dice coefficient between r and t.
2. Length (Fb): the length of r, the length of t, and their difference.
3. BoW (Fc): a bag of words (BoW) of M.
4. DWR (Fd): a dense word representation of M, built from the average word vector of all words in M. We use the representations from GloVe (Pennington et al., 2014).
5. Stop proportion (Fe): the proportion of stop words and punctuation in M.
6. Entities (Ff): the number of entities in M predicted by the 4-class Stanford Named Entity Recognizer (Finkel et al., 2005).
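The following is a minimal sketch of how some of these feature sets could be computed for a single statement pair; the tokenizer, the toy stop-word list and the stand-in vector table are our own simplifications, and the BoW (Fc) and entity (Ff) counts would in practice come from a vectorizer and the Stanford NER tagger.

import string

import numpy as np

STOPWORDS = {"the", "a", "of", "and", "to", "in", "by"}  # toy stop list

def tokens(text):
    cleaned = (t.strip(string.punctuation).lower() for t in text.split())
    return [t for t in cleaned if t]

def features(reference, target, vectors, dim=50):
    r, t = tokens(reference), tokens(target)
    m = [w for w in r if w not in t]          # words only in the reference
    r_set, t_set = set(r), set(t)

    # Fa: Dice coefficient between reference and target vocabularies.
    dice = 2 * len(r_set & t_set) / (len(r_set) + len(t_set))

    # Fb: lengths and their difference.
    lengths = (len(r), len(t), len(r) - len(t))

    # Fd (simplified): average word vector of M, zeros for unknown words.
    vecs = [vectors.get(w, np.zeros(dim)) for w in m] or [np.zeros(dim)]
    dwr = np.mean(vecs, axis=0)

    # Fe: proportion of stop words in M (punctuation is already stripped).
    stop_prop = sum(w in STOPWORDS for w in m) / len(m) if m else 0.0

    return dice, lengths, dwr, stop_prop

# Usage with a toy vector table standing in for GloVe.
glove = {"atrocities": np.ones(50), "kashmir": np.ones(50) * 0.5}
print(features("The minister condemned the atrocities in Kashmir .",
               "The minister condemned the atrocities .", glove)[0])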

Table 3 shows the classification results. We use all exhaustive combinations of these feature sets to train a discriminative classifier, namely a logistic regression classifier, to obtain a best feature combination. We consider a feature combination to be the best when it outperforms the others in both accuracy and F1 for the Omission label. We compare all systems against the most frequent label (MFL) baseline. We evaluate each feature set twice, namely using five-fold cross validation (CV-5 OMp), and in a split scenario where we test on the 100 examples of OE after training with the remaining 400 examples from OMp (Test OE). The three best systems (i.e. non-significantly different from each other when tested on OMp) are shown in the lower section of the table. We test for significance using Student's two-tailed test and p < 0.05.
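A rough sketch of this evaluation setup is given below, assuming a feature matrix X with one row per statement pair and binary labels y (here 1 standing for Omission); the random stand-in data and the single feature combination are illustrative only, with scikit-learn's LogisticRegression and cross_val_score used as generic stand-ins for the classifier and the five-fold evaluation.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(0)

# Stand-in data: 500 pairs, 4 numeric features (e.g. Fa + Fb), binary labels.
X = rng.rand(500, 4)
y = rng.randint(0, 2, size=500)

clf = LogisticRegression(solver="liblinear")

# Five-fold cross-validation on the 500 adjudicated pairs.
acc = cross_val_score(clf, X, y, cv=5, scoring="accuracy")
f1 = cross_val_score(clf, X, y, cv=5, scoring="f1")
print(f"CV-5 accuracy {acc.mean():.2f}, F1 {f1.mean():.2f}")

# Train/test split mirroring 'Test OE': train on 400 pairs, test on 100.
clf.fit(X[100:], y[100:])
print(f"Test accuracy {clf.score(X[:100], y[:100]):.2f}")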

As expected, the overlap (Fa) and length (Fb) metrics make the most competitive standalone features. However, we want to measure how much of the labeling of omission is determined by which words are left out, and not just by how many.

The system trained on BoW outperforms the system on DWR. However, BoW features contain a proxy for statement length, i.e. if n words are different between ref and target, then n features will fire, and thus approximate the size of M. A distributional semantic model such as GloVe is however made up of non-sparse, real-valued vectors, and does not contain such a proxy for word density.

           CV-5 OMp        Test OE
           acc.   F1       acc.   F1
MFL        .69    .81      .73    .84
Fa         .79    .81      .76    .83
Fb         .80    .85      .74    .82
Fc         .76    .83      .76    .82
Fd         .74    .84      .76    .84
Fe         .69    .81      .73    .84
Ff         .69    .81      .73    .84
Fabe       .83    .87      .74    .81
Fbf        .83    .85      .79    .85
Fcdf       .81    .86      .82    .88

Table 3: Accuracy and F1 for the Omission label for all feature groups, plus for the best feature combination in both evaluation methods. Systems significantly under baseline are marked in grey.

If we examine the contribution of using Fd as a feature model, we see that, while it falls short of its BoW counterpart, it beats the baseline by a margin of 5-10 points. In other words, regardless of the size of M, there is lexical information that explains the choice to consider an omission.

5 Conclusion

We have presented an application-oriented effort to detect omissions between statement pairs. We have assessed two different AMT annotation schemes, and also compared them with expert annotations. The extended crowdsourcing scheme is defined closer to the expert intuition, but has lower agreement, and we use the plain scheme instead. Moreover, if we examine the time needed for annotation, our conclusion is that it is in fact detrimental to use crowdsourcing for this annotation task with respect to expert annotation. Chiefly, we also show that simple linguistic clues allow a classifier to reach satisfying classification results (0.86–0.88 F1), which are better than when relying solely on the straightforward features of length difference and word overlap.

Further work includes analyzing whether the changes in the omission examples also contain changes of uncertainty class (Szarvas et al., 2012) or bias type (Recasens et al., 2013), as well as expanding the notion of omission to the detection of the loss of detail in paraphrases. Moreover, we want to explore how to identify the most omission-prone news types, in a style similar to the characterization of unreliable users in Wei et al. (2013).


References

Nicholas Asher, Daniel Hardt, and Joan Busquets. 2001. Discourse parallelism, ellipsis, and ambiguity. Journal of Semantics, 18(1):1–25.

Ido Dagan, Oren Glickman, and Bernardo Magnini. 2006. The PASCAL recognising textual entailment challenge. In Machine Learning Challenges. Evaluating Predictive Uncertainty, Visual Object Classification, and Recognising Textual Entailment, pages 177–190. Springer.

Amitava Das, Sivaji Bandyopadhyay, and Björn Gambäck. 2012. The 5W structure for sentiment summarization-visualization-tracking. In International Conference on Intelligent Text Processing and Computational Linguistics, pages 540–555. Springer.

Jenny Rose Finkel, Trond Grenager, and Christopher Manning. 2005. Incorporating non-local information into information extraction systems by Gibbs sampling. In Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, pages 363–370. Association for Computational Linguistics.

Dirk Hovy, Taylor Berg-Kirkpatrick, Ashish Vaswani, and Eduard Hovy. 2013. Learning whom to trust with MACE. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1120–1130.

I. Dan Melamed. 1996. Automatic detection of omissions in translations. In Proceedings of the 16th Conference on Computational Linguistics - Volume 2, COLING '96, pages 764–769, Copenhagen, Denmark.

Jason Merchant. 2016. Ellipsis: A survey of analytical approaches. http://home.uchicago.edu/merchant/pubs/ellipsis.revised.pdf. Manuscript for Jeroen van Craenenbroeck and Tanja Temmerman (eds.), Handbook of Ellipsis, Oxford University Press: Oxford, United Kingdom.

Kristen Parton, Kathleen R. McKeown, Bob Coyne, Mona T. Diab, Ralph Grishman, Dilek Hakkani-Tur, Mary Harper, Heng Ji, Wei Yun Ma, Adam Meyers, et al. 2009. Who, what, when, where, why? Comparing multiple approaches to the cross-lingual 5W task. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, pages 423–431.

Rebecca J. Passonneau and Bob Carpenter. 2014. The benefits of a model of annotation. Transactions of the Association for Computational Linguistics, 2:311–326.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global Vectors for Word Representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, Doha, Qatar.

Marta Recasens, Cristian Danescu-Niculescu-Mizil, and Dan Jurafsky. 2013. Linguistic models for analyzing and detecting biased language. In Proceedings of the 51st Annual Meeting of the ACL, pages 1650–1659, Sofia, Bulgaria.

Graham Russell. 1999. Errors of omission in translation. In Proceedings of the 8th International Conference on Theoretical and Methodological Issues in Machine Translation (TMI 99), pages 128–138, University College, Chester, England.

György Szarvas, Veronika Vincze, Richárd Farkas, György Móra, and Iryna Gurevych. 2012. Cross-genre and cross-domain detection of semantic uncertainty. Computational Linguistics, 38(2):335–367.

Marta Vila, M. Antònia Martí, Horacio Rodríguez, et al. 2014. Is this a paraphrase? What kind? Paraphrase boundaries and typology. Volume 4, page 205. Scientific Research Publishing.

Veronika Vincze. 2013. Weasels, hedges and peacocks: Discourse-level uncertainty in Wikipedia articles. In Proceedings of the International Joint Conference on Natural Language Processing, pages 383–391, Nagoya, Japan.

Zhongyu Wei, Junwen Chen, Wei Gao, Binyang Li, Lanjun Zhou, Yulan He, and Kam-Fai Wong. 2013. An empirical study on uncertainty identification in social media context. In Proceedings of the 51st Annual Meeting of the ACL, pages 58–62, Sofia, Bulgaria.

Maayan Zhitomirsky-Geffet, Esther David, Moshe Koppel, Hodaya Uzan, and G.E. Gorman. 2016. Utilizing overtly political texts for fully automatic evaluation of political leaning of online news websites. Online Information Review, 40(3).


Proceedings of the 11th Linguistic Annotation Workshop, pages 46–56, Valencia, Spain, April 3, 2017. ©2017 Association for Computational Linguistics

Annotating Speech, Attitude and Perception Reports

Corien Bary, Leopold Hess, Kees Thijs
Radboud University Nijmegen
Department of Philosophy, Theology and Religious Studies
Nijmegen, the Netherlands
{l.hess,k.thijs,c.bary}@ftr.ru.nl

Peter Berck and Iris Hendrickx
Radboud University Nijmegen
Centre for Language & Speech Technology / Centre for Language Studies
Nijmegen, the Netherlands
{p.berck,i.hendrickx}@let.ru.nl

Abstract

We present REPORTS, an annotation scheme for the annotation of speech, attitude and perception reports. The scheme makes it possible to annotate the various text elements involved in such reports (e.g. embedding entity, complement, complement head) and their relations in a uniform way, which in turn facilitates the automatic extraction of information on, for example, complementation and vocabulary distribution. We also present the Ancient Greek corpus RAG (Thucydides' History of the Peloponnesian War), to which we have applied this scheme using the annotation tool BRAT. We discuss some of the issues, both theoretical and practical, that we encountered, show how the corpus helps in answering specific questions about narrative perspective and the relation between report type and complement type, and conclude that REPORTS fitted in well with our needs.

1 Introduction

Both in our daily communication and in narratives we often refer to what other people said, thought and perceived. Take as an example (1), which has a speech, attitude and perception report in the first, second and third sentence, respectively:

(1) John came to Mary, bent on his knees, and asked her 'Will you marry me?' He was afraid that Mary didn't like him enough. He didn't look at her face.

Notice that not only does the type of the report differ (speech, attitude, perception), we also see different kinds of complements: a direct complement 'Will you marry me?', an indirect complement that Mary didn't like him enough and an NP complement her face. (Throughout this paper, by 'reports' we understand reports of speech acts, attitudes and perceptions - i.e., such things that can in principle have propositional contents, even if in a given case the report complement is only an NP. John came to Mary is not a report in this sense.)

The relation between the report type and the complement type (direct, indirect (further divided into e.g. complementizer + finite verb, participle, infinitive), NP) has been a major topic of research in semantics (Portner, 1992; Verspoor, 1990), syntax (Bresnan, 1970; Haumann, 1997), and language typology (Givon, 1980; Cristofaro, 2003; Cristofaro, 2008) alike.

A corpus annotated for speech, attitude and perception reports is a convenient tool to study this relation, since it makes it possible to extract relevant information automatically. For a dead language like Ancient Greek - for which we developed our annotation scheme REPORTS in the first place - such a corpus is even more important, as the research is corpus-based by necessity.

In addition to the linguistic question of understanding the relation between report type and complement type, a corpus annotated for speech, attitude and perception reports is also of great use for questions of a more narratological nature. Narratology is the study of narratives, and one of the big topics here is that of narrative perspective, the phenomenon whereby literary texts often present events through the eyes of one of the characters in the story. Such a perspective emerges from the intricate interplay of all kinds of linguistic expressions, with an important role for speech, attitude and perception reports (whose thoughts and perceptions we read and what form they have).

In order to ultimately understand how narrative perspective works, a first step is a corpus annotated for speech, attitude and perception reports. Within the Perspective project,1 we created such a corpus for Ancient Greek, RAG (Reports in Ancient Greek).

1 www.ru.nl/ncs/perspective


Ancient Greek authors often create shifts to the perspectives of characters. Thucydides, for example, whose History of the Peloponnesian War (books 6 and 7) is the first corpus we annotated, was already in ancient times famous for this (Plutarch, De Gloria 3). How these authors achieved these perspective shifts is however an unsolved puzzle. We aim to shed light on these issues using the RAG corpus. Both the annotation guidelines and the annotated corpus are publicly available.2

RAG makes it possible to extract certain information about reports automatically, which will contribute to answering questions at both the linguistic and the narratological side. Although we developed the annotation scheme primarily with Ancient Greek in mind, we expect that it can be used for corpora in many other languages as well (see the evaluation below). A corpus annotated according to this scheme makes it for example easy to see which report types occur with which complement types. Here we distinguish not only between reports of speech, attitude and perception, but also annotate certain further subdivisions, such as that between normal and manipulative speech reports (as in English tell that vs. tell to), which can be expected to be relevant.

In addition to extracting information about the combinations of report types and complement types, RAG (and other corpora that use the scheme) also makes it possible to search for certain words in report complements only. An interesting class here is for example that of attitudinal particles such as Ancient Greek δή, μήν, and που. These small words express the attitude of the actual speaker towards his utterance (e.g. (un)certainty, confirming expectations, countering the assumptions of the addressee (Thijs, 2017)). In reports, however, they can also be anchored to the reported speaker. Both in light of a better understanding of these notoriously elusive words themselves (e.g. which layer of meaning they target), and in light of their role in creating a certain narrative perspective (whose point of view they express) (Eckardt, 2012), the behavior of particles in reports deserves special attention (Doring, 2013), the study of which is strongly facilitated by a corpus annotated for reports.

In parallel with RAG, we developed an Ancient Greek lemmatizer, GLEM, and POS tagger. This combination increases the possibilities for data extraction considerably, as we will see. An interesting application of the lemmatizer for the narratological question lies in determining the vocabulary distribution of a certain text. Are there for example significant differences between the words the narrator uses when speaking for himself and those in the complements of other people's speeches, attitudes and perceptions (and how does this differ for the different kinds of reports and complements)? The lemmatizer makes it possible to extract this information at the level of the lemma rather than the individual word form. If we apply the scheme REPORTS to other authors, we can also study differences between authors in this respect, for example, whether Herodotus has a stronger distinction between vocabularies, while in Thucydides this is more blurred. This could then explain why it is especially Thucydides that stands out as the author who creates especially sophisticated narrative effects.

2 https://github.com/GreekPerspective

A characteristic feature of Ancient Greek speech reports is that they are often quite long. Even indirect reports seem to easily extend to several sentences (rather than just clauses). RAG is also useful for a better linguistic understanding of these constructions. We can for example search for clitic words that are taken to come exclusively at the peninitial position within a sentence, e.g. connective particles such as γάρ (Goldstein, 2016), to see whether it is indeed justified to speak about 'complements' consisting of more than one sentence (and hence, in the case of infinitive complements, of sentences without a finite verb!).

In this paper we discuss related annotation work (section 2), and describe the annotation tool BRAT which we used (section 3) and our annotation scheme REPORTS (section 4). In section 4 we also discuss some choices we made regarding the implementation of REPORTS in our corpus RAG. The corpus is further described in section 5, where we also discuss the application of the lemmatizer and POS tagger and present the results of a small experiment testing inter-annotator agreement. We evaluate BRAT and REPORTS in section 6, including a discussion of the extendability of REPORTS to other languages. Section 7 concludes with final remarks.


2 Related work

Previous attempts at corpus annotation for related topics include the annotation of committed belief for English (Diab et al., 2009) and the annotation of direct and indirect speech in Portuguese (Freitas et al., 2016). Our project differs from the former in its focus on complementation (rather than information retrieval) and from the latter in its broader scope (reports in general rather than only speech).

Also related are the annotation schemes for modality such as (McShane et al., 2004; Hendrickx et al., 2012). These schemes aim to grasp the attitude of the actual speaker towards the proposition and label such attitudes as for example belief or obligation. In contrast to modality annotation, which focuses on the attitude of the actual speaker, we are interested in speech, attitude and perception ascriptions in general, including ascriptions to other people than the actual speaker. Another difference is our focus on the linguistic constructions used. In that respect our scheme also differs from (Wiebe et al., 2005), which, like RAG, annotates what we call reports, but without differentiating between e.g. different kinds of complements.

3 BRAT rapid annotation tool

BRAT is an open source web-based tool for text annotation (Stenetorp et al., 2012)3 and is an extension of stav, a visualization tool that was designed initially for complex semantic annotations for information extraction in the bio-medical domain, including entities, events and their relations (Ohta et al., 2012; Neves et al., 2012). BRAT has been used in many different linguistic annotation projects that require complex annotation, such as ellipsis (Anand and McCloskey, 2015), co-reference resolution (Kilicoglu and Demner-Fushman, 2016), and syntactic chunks (Savkov et al., 2016).

As BRAT uses a server-based web interface, annotators can access it in a web browser on their own computer without the need for further installation of software. All annotations are conveniently stored on the server.

We considered several other possible annotation tools for our project, such as MMAX2 (Müller and Strube, 2006), GATE Teamware (Bontcheva et al., 2013) and Arethusa4. The main reasons for selecting BRAT as the tool for the implementation of our annotation scheme were its web interface and its flexibility: BRAT accommodates the annotation of discontinuous spans as one entity and supports different types of relations and attributes.

3 http://brat.nlplab.org
4 http://www.perseids.org/tools/arethusa/app/#/

Furthermore, BRAT offers a simple search interface and contains a tool for comparison of different versions of annotations on the same source text. BRAT also includes conversion scripts to convert several input formats such as the CoNLL shared task format, MALT XML5 for parsing and the BIO format (Ramshaw and Marcus, 1995).

BRAT stores the annotation in a rather simple plain text standoff format that is merely a list of character spans and their assigned labels and relations, but that can easily be converted to other formats for further exploitation or search. We plan to port the annotated corpus to the ANNIS search tool (Krause and Zeldes, 2016) at a later stage to carry out more complex search queries.
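To give an impression of this format, the following shows a hypothetical standoff (.ann) fragment for the English toy sentence John confessed that he was in love from section 4, together with a tiny parser for its text-bound lines; the entity labels, role names and offsets are illustrative choices of ours and not a verbatim excerpt from RAG.

# Hypothetical brat standoff (.ann) fragment; labels, roles and offsets
# are illustrative, not taken verbatim from the RAG annotations.
ANN = """\
T1\tSpeech 5 14\tconfessed
T2\tComplement 15 34\tthat he was in love
T3\tChunk 15 34\tthat he was in love
T4\tHead 23 26\twas
E1\tSpeech:T1 Complement:T2
R1\tchunk-of Arg1:T3 Arg2:T2
R2\thead-of Arg1:T4 Arg2:T3
A1\tNormal T1
"""

def text_bound(ann):
    """Yield (id, label, spans, surface) for the T* lines of an .ann file."""
    for line in ann.splitlines():
        if not line.startswith("T"):
            continue
        ann_id, span_part, surface = line.split("\t")
        label, offsets = span_part.split(" ", 1)
        # Discontinuous spans are ';'-separated "start end" fragments.
        spans = [tuple(map(int, frag.split())) for frag in offsets.split(";")]
        yield ann_id, label, spans, surface

for entry in text_bound(ANN):
    print(entry)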

4 REPORTS: an annotation scheme for speech, attitude and perception reports

4.1 The scheme
The annotation scheme REPORTS consists of entities, events and attributes of both.

Entities are (possibly discontinuous) spans of text. Let's start with two central ones: the attitude/speech/perception embedding entity, like confessed in (2), and the report complement, here that he was in love.

(2) John confessed that he was in love.

The attitude/speech/perception embedding entity is most typically a verb form, as in (2), but may also be a noun phrase (e.g. the hope that).6 The embedding entity and the complement stand in the two-place relation report, which we implemented as an event in BRAT.

5 https://stp.lingfil.uu.se/~nivre/research/treebank.xsd.txt
6 Hence the term embedding entity, rather than just verb.

Because this complement is internally complex in some cases – consisting of a series of connected complement clauses – we use as a third entity the complement chunk. Chunks are all of the individual complement clauses that are syntactically dependent upon one and the same embedding entity. In (3) we have one complement that he was in love and had not slept for three nights, which consists of two chunks, that he was in love and and had not slept for three nights:

(3) John confessed that he was in love and had not slept for three nights.

The complement chunks stand in the chunk-of relation to the complement they are part of. Complement chunks have a head, the final entity we annotate. Heads are always verbs. It is the verb that is directly dependent on the embedding entity and can be either a finite verb, an infinitive or a participle, depending on the specific subordinating construction used. In (3) the heads are was and had. As one would expect, they stand in the head-of relation to the chunk. Table 1 lists all the entities and events.

The table also shows the attributes assigned within each class. The attributes of the embedding entities concern their semantic type. Within the class of speech reports we distinguish (i) normal speech, involving neutral expressions such as say, answer, report; (ii) manner of speech, which is restricted to entities that refer to the physical properties of the speech act (e.g. scream, cry, whisper); (iii) manipulative speech, which is reserved for speech entities that are meant to lead to future actions of the addressee, such as order/persuade/beg someone to. The attitude embedding entities (which cover a broadly construed range of propositional attitudes) are further subdivided into (i) knowledge (e.g. know, understand that), (ii) belief (e.g. think, believe, assume that), (iii) voluntative (e.g. want, intend, hope, fear to) and (iv) other (mostly emotional attitudes such as be ashamed, be grieved, rejoice). Entities of perception (e.g. see, hear) do not have a further subdivision.

The complement type is also specified by means of an attribute. Here, there are five options: (i) direct, (ii) indirect, (iii) mixed, (iv) NP and (v) preposed NP. The mixed category is used for those cases where a combination of direct and indirect speech is used – embedding constructions in Ancient Greek sometimes shift or slip from one construction into the other (Maier, 2015). The NP-category covers instances of complements which do not have a propositional (clausal) character, but only consist of an NP-object. An English example would be he expects a Spartan victory.

The category of preposed NPs is typical of Ancient Greek. In case of finite complement clauses, a constituent that semantically belongs to this complement is sometimes placed in a position preceding the complementizer, i.e. syntactically outside of the complement clause. This happens for reasons of information structure – Ancient Greek is a discourse-configurational language, in which word order is determined mainly by information-structural concepts like topic and focus (Allan, 2014; Goldstein, 2016). It may even happen that this constituent is syntactically marked as a main clause argument – this phenomenon is called prolepsis in the literature (Panhuis, 1984). As a whole, constructions like these are annotated as containing two complements – a preposed NP and an indirect one – as well as two report relations.

Let's consider some Ancient Greek examples from RAG.

(4) οἱ δὲ ἄλλοι ἐψηφίσαντό τε ξυμμαχίαν τοῖς Αθηναίοις καὶ τό ἄλλο στράτευμα ἐκέλευον ἐκ ῾Ρηγίου κομίζειν
    the.NOM PRT others.NOM vote.PST.3PL PRT alliance.ACC the.DAT Athenians.DAT and the.ACC other.ACC army.ACC invite.PST.3PL from Rhegium.GEN fetch.INF
    'the others voted for an alliance with the Athenians and invited them to fetch the rest of their army from Rhegium.' (Thuc. 6.51.2)

Figure 1 shows the visualization of (4) (with some context) as it is annotated in BRAT. Here, we have a manipulative speech verb (ἐκέλευον) that governs a discontinuous infinitival complement that consists of one chunk (τὸ ἄλλο στράτευμα ... ἐκ ῾Ρηγίου κομίζειν); its head is the infinitive κομίζειν.

Our second example is more complicated. The annotations in BRAT are shown in Figure 2, again with some context.

(5) [A ship went from Sicily to the Peloponnesus with ambassadors,]
    οἵπερ τά τε σφέτερα φράσουσιν ὅτι ἐν ἐλπίσιν εἰσὶ καὶ τὸν ἐκεῖ πόλεμον ἔτι μᾶλλον ἐποτρυνοῦσι γίγνεσθαι
    who.REL.NOM the.ACC PRT own.affairs.ACC tell.FUT.3PL that in hopes.DAT be.PRS.3PL and the.ACC there war.ACC even more incite.PRS.3PL become.INF
    'who should tell that their own affairs were hopeful, and should incite [the Peloponnesians] to prosecute the war there even more actively.' (Thuc. 7.25.1)


Entities
  embedding entity (a)
    speech: normal, manner of speech, manipulative
    attitude: knowledge, belief, voluntative, other
    perception
  complement: direct, indirect, mixed, noun phrase, preposed noun phrase
  complement chunk
  head of complement chunk: finite (not optative), finite (optative), infinitive, participle

Events (relations)
  report
  chunk-of
  head-of

Table 1: RAG's entities, events and their attributes
(a) In the implementation in BRAT there actually is no such entity as an underspecified embedding entity; instead we go straight to the speech, attitude and perception embedded entities. The reason is that BRAT does not allow attributes of attributes, which we would otherwise need for the attributes normal etc.

Figure 1: visualization of annotation in BRAT of the sentence in (4) with some preceding context

Here τά σφέτερα is a preposed NP, the complement clause being marked by the complementizer ὅτι 'that'. In the second part of the example, however, we find an infinitive construction instead of a finite clause with a complementizer, and the whole complement clause is annotated as one (discontinuous) complement span again, like in (4).

4.2 Choices that we made
This basic scheme can be felicitously used for a great deal of the actual data in our corpus, but we also encountered on the one hand practical and on the other hand more complex issues that asked for additional annotation rules. The issues were discussed in the test phase and the rules were spelled out in an elaborate annotation manual. Some examples of the choices we made are the following.


Figure 2: visualization of annotation in BRAT of the sentence in (5)

NPs We only made annotations when a complement is explicitly present. In other words, speech or attitude verbs used in an absolute sense (he spoke for a long time) are left out. We did include, however, NP-complements that have a prepositional form, as in περὶ τῆς ἀρχῆς εἰπεῖν (to speak about the empire) or ἐς Συρακοσίους δέος (fear of the Syracusans). With regard to NP-complements in general, we excluded instances of indefinite and demonstrative pronominal NP-objects (e.g. he expects something/this), since they are not interesting for our present research goals due to their lack of meaningful content.

chunks and heads As follows from the definition of a chunk as a subordinated clause, we did not annotate chunks in the case of NP and direct complements (nor heads, since a head is always the head of a chunk).

attributes of the head We did not manually annotate the attributes of the complement head, i.e. whether it is an indicative, optative, infinitive or participle form. Instead, we used the output from the independently trained POS tagger (see section 5).

UID As mentioned in the introduction, Ancient Greek reports, even indirect ones, can be very long. Quite frequently we find what is called Unembedded Indirect Discourse (Bary and Maier, 2014), as in (6).

(6) [A general sends messengers to his allies,]
    ὅπως μὴ διαφρήσωσι τοὺς πολεμίους ἀλλὰ ξυστραφέντες κωλύσωσι διελθεῖν. ἄλλῃ γὰρ αὐτοὺς οὐδὲ πειράσειν.
    in.order.that not let.through.SBJV.3PL the.ACC enemies.ACC but combine.PTCP.PASS prevent.SBJV.3PL pass.INF elsewhere for them.ACC not.even try.INF.FUT
    'in order that they would not let the enemies through, but would combine themselves and prevent them from passing; for [he said] elsewhere they would not even attempt it.' (Thuc. 7.32.1)

UID has a form that is usually associated with a dependent construction (infinitive or the Ancient Greek reportative mood called the optative), but in cases like the second sentence in (6) there is no embedding verb it is syntactically dependent on. As the clause with the infinitive or optative expresses the content of the report, we do annotate it as a complement (although the term complement may be misleading in this case).

parenthetical reports We made a different choice in the case of parenthetical report constructions, Xerxes builds a bridge, as it is said (ὡς λέγεται in Greek). Although here we do annotate the parenthetical verbs (since they have an important narrative function – the narrator attributes a thought or story to someone other than himself), we do not annotate the main clause Xerxes builds a bridge as a complement because there is no report morphology (infinitive or optative). In such cases the boundaries of the complement are also often very vague. Thus, while UID is annotated as a report complement without an embedding entity (or report relation), a parenthetical verb is annotated as an embedding entity without a complement.

defaults for ambiguous cases Some of the embedding entities have multiple meanings, which belong to different semantic categories in our classification of attributes. In some of these cases the choice of the attribute depends on the construction used for the complement clause. Just as English tell, mentioned in the introduction, εἶπον with a bare infinitive means tell someone to and is classified as a manipulative speech verb, whereas εἶπον with a subordinated that-clause means say that and belongs to the category of normal speech. In case of speech verbs governing an accusative constituent and an infinitive, however, there may still be an ambiguity in interpretation between the so-called Accusative plus Infinitive construction (He told that Xerxes builds a bridge), where the accusative constituent functions exclusively as the subject of the complement infinitive clause, and a construction with an accusative object and a bare infinitive (He told Xerxes to build a bridge) (cf. (Rijksbaron, 2002)). Usually a decision can be easily made by looking at the surrounding context (as is the case in (4) above).

In other cases, the semantics of embedding entities is truly ambiguous between two categories, irrespective of the complement construction. Perception verbs like see, for instance, can easily mean understand or know. For verbs like these, we have made default classification rules such as the following: 'a perception entity is annotated as such by default; only if an interpretation of direct physical perception is not possible in the given discourse context is it annotated as an attitude knowledge entity.' Moreover, a list was made of all embedding entities and their (default) classification.

personal passive report constructions If the embedding entity is a passive verb of speech or thought, as in Xerxes is said to build a bridge, its subject is coreferential with the subject of the complement clause. (This is the so-called Nominative plus Infinitive construction (Rijksbaron, 2002).) What is reported here, of course, is the fact that Xerxes builds a bridge. However, we have decided not to include the subject constituent within the annotated complement in these cases, mainly to warrant consistency with other constructions with coreferential subjects for which it is more natural to exclude the subject from the complement (as in Xerxes promised to build a bridge/that he would build a bridge). There is a similar rule for constructions like δοκεῖ μοι 'X seems to me' and φαίνομαι 'to appear'.

complement boundaries In complex, multi-clause report complements, which are not rare in Ancient Greek, it is sometimes difficult to tell which parts actually belong to the report and which are interjections by the reporting speaker. As a default rule, we only treat material within the span of a report complement as an interjection (i.e. not annotate it as part of the complement) if it is a syntactically independent clause. Thus, for instance, relative clauses in non-final positions always belong to the span of the complement.

These and similar choices that we made in the process of fine-tuning our annotation were motivated primarily by practical considerations, but they already led to a better conceptualization of some substantial questions, such as complement boundaries or relevant kinds of syntactic and semantic ambiguities.

5 RAG: a Greek corpus annotated for reports

So far we have annotated Thucydides' History of the Peloponnesian War, books 6 and 7, which consists of 807 sentences and 30891 words. In addition to Thucydides, we are also currently working on Herodotus' Histories.

The Thucydides digital text is provided by the Perseus Digital Library (Crane, 2016). As it was in betacode, we converted it into unicode (utf8) using a converter created by the Classical Language Toolkit (Johnson and others, 2016).7

As mentioned before, we combine the manual reports annotation with automated POS tagging and lemmatization (Bary et al., 2017), which we developed independently and which is open source.

The POS tagger made life easier for the annotator. We only annotated what is the head of the complement chunk and left it to the POS tagger to decide automatically whether this head is e.g. an infinitive, participle or finite verb and, if finite, whether it has indicative mood or for example the reportative optative mood.

The lemmatizer enables us to discover whether a specific verb (e.g. all forms of λέγω 'to say') occurs with, say, a complement which contains the particle μήν or a complement with an oblique optative, without having to specify the (first, second, third person etc.) forms of the verb manually.

For Herodotus, we can also adapt the manual annotations (including syntactic dependencies) made in the PROIEL project (Haug and Jøhndal, 2008; Haug et al., 2009),8 whose text we use.

All of the corpus has been annotated by two annotators (PhD students with an MA in Classics) working independently. An inventory of differences has been made for every chapter by a student assistant (partly extracted from BRAT automatically using the built-in comparison tool). All the errors and differences were then reviewed by the annotators (the most difficult issues were discussed in project meetings) to arrive at a common and final version. Most often differences between annotators concerned two types of issues, where clear-cut criteria are impossible to define: categorization of embedding verbs and syntactic structure ambiguities. The former issue involved verbs which could, depending on interpretation, be annotated with two or more different attributes. For example, ἐλπίζω may be a 'voluntative' verb ('to hope') or a 'belief' verb ('to expect') (cf. discussion of εἶπον above); some verbs are ambiguous between factive ('knowledge' attitude entity category) and non-factive ('belief' category) senses, etc. Even with the use of the more specific rules in the manual, different readings were often possible. The latter issue involved many kinds of ambiguities, most typically concerning the relation between the complement clause and other subordinate and coordinate clauses. For example, a final relative clause whose antecedent is within the scope of the complement may, depending on interpretation, belong to the complement as well (its content is part of the content of the reported speech act or attitude) or be an external comment. (Purpose and conditional clauses give rise to similar issues.)

7 https://github.com/cltk/cltk/blob/master/cltk/corpus/greek/beta_to_unicode.py
8 http://www.hf.uio.no/ifikk/english/research/projects/proiel/

A small selection of the results is listed in Table 2. Here we see for example that γάρ, which is taken to come exclusively at the second position within a sentence, quite frequently occurs within a non-direct complement, suggesting that in these cases we have to do with main clauses rather than dependent clauses. Likewise we can easily search for the particle δή within complements to investigate whose perspective it expresses.
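A query of this kind can be sketched as follows, assuming token records of the form (character offset, lemma) from the lemmatizer and complement spans of the form (start, end, type) exported from the standoff annotation; both data structures and the toy data are simplified stand-ins for the actual RAG/GLEM output.

def lemma_in_complements(tokens, complements, lemma, exclude_types=("direct",)):
    """Count tokens with the given lemma that fall inside complement spans.

    tokens:      iterable of (offset, lemma) pairs from the lemmatizer
    complements: iterable of (start, end, comp_type) spans from the annotation
    """
    spans = [(s, e) for s, e, ctype in complements if ctype not in exclude_types]
    return sum(
        1
        for offset, lem in tokens
        if lem == lemma and any(s <= offset < e for s, e in spans)
    )

# Toy data: offsets are character positions, lemmas are already normalized.
toks = [(0, "λέγω"), (12, "ὅτι"), (18, "γάρ"), (40, "γάρ")]
comps = [(10, 35, "indirect"), (50, 80, "direct")]
print(lemma_in_complements(toks, comps, "γάρ"))  # -> 1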

Inter-annotator agreement
We performed a small experiment to measure the inter-annotator agreement for the main labels in this annotation task. We compared the span annotations of the following sample: book one of Thucydides, chapters 138-146, which contain 1932 words and 56 sentences. We counted the main labels (complement, complement chunk, head of chunk, attitude, speech and perception entities). We applied a strict form of agreement: both the span length and the span labels had to match to count as agreement. One annotator labeled 192 spans while the other labeled 182 spans, leading to an inter-annotator agreement of 83.4% mutual F-score (Hripcsak and Rothschild, 2005).
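The strict exact-match agreement reported above corresponds to a mutual F-score along the following lines; the (label, start, end) span representation and the toy spans are ours and purely illustrative.

def mutual_f_score(spans_a, spans_b):
    """Exact-match mutual F-score between two annotators' span sets.

    Each span is a (label, start, end) tuple; both the offsets and the
    label must be identical to count as a match.
    """
    a, b = set(spans_a), set(spans_b)
    matches = len(a & b)
    if not matches:
        return 0.0
    precision = matches / len(b)   # treating annotator A as the reference
    recall = matches / len(a)
    return 2 * precision * recall / (precision + recall)

ann_a = {("complement", 15, 34), ("speech", 5, 14), ("head", 23, 26)}
ann_b = {("complement", 15, 34), ("speech", 5, 14), ("head", 20, 26)}
print(round(mutual_f_score(ann_a, ann_b), 3))  # -> 0.667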

# embedding entities                               670
    speech                                         189
    attitude                                       441
    perception                                      40
# complements                                      702
    indirect                                       543
    direct                                          15
    NP                                             138
    preposed NP                                     19
    with speech embedding entities                 186
    with attitude embedding entities               460
    with perception embedding entities              39
    unembedded                                      17
# δή/δὴ in non-direct complements                   10
# multisentential complements                        9
# γάρ/γὰρ after sentence-boundary in
    non-direct complements                          12
total # words in complements (a)                17,836
average # of words per complement                25.41
    indirect                                     14.25
    direct                                      630.60
    NP                                            4.09
    preposed NP                                   3.89

Table 2: Some numbers for RAG
(a) Embedded ones counted twice.

6 Evaluation

In this section we evaluate both the BRAT tool and the REPORTS scheme with respect to their convenience and usefulness.

BRAT is a convenient annotation tool, offering perspicuous visualization, and it is easy to use without any prior training or IT skills (although such skills are needed, of course, to set up an annotation scheme in BRAT). It does not even require typing any commands - after selecting a span of text, a window opens from which the annotation can be chosen with a click of the mouse. However, it has its limitations. The following remarks can be seen as suggestions for future versions or extensions of the program.

For example, with complex annotations involv-ing multiple entities and relations (where often onereport is embedded in another) the visualizationceases to be easily legible. In this respect, it seemsthat an annotation scheme of the complexity of


REPORTS reaches the limits of BRAT's usefulness. Also, since it is currently impossible to assign attributes to attributes, we could not have speech, attitude and perception as attributes of the category embedding entities (see the footnote in Table 1). As a result we need to query the conjunction of speech, attitude and perception entities if we want to draw conclusions about this class in general.

Deleting and correcting complex annotations is not straightforward. Crossing and overlapping spans frequently give rise to errors, which are then impossible to repair from the level of BRAT's interface and require manual access to source files.
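For readers unfamiliar with those source files, BRAT stores each document's annotations as standoff .ann files of tab-separated lines; the short Python sketch below parses the text-bound ("T") and attribute ("A") lines of a made-up example, whose entity and attribute names are ours and not the actual REPORTS configuration.

SAMPLE_ANN = (
    "T1\tEmbeddingEntity 0 6\telpizo\n"                 # text-bound annotation: label, start, end, covered text
    "A1\tCategory T1 attitude\n"                        # attribute on T1 (names invented for illustration)
    "T2\tComplement 7 32\thoti-clause of the report\n"
)

def parse_ann(text):
    entities, attributes = {}, []
    for line in text.splitlines():
        fields = line.split("\t")
        if line.startswith("T"):
            label, start, end = fields[1].split()[:3]
            entities[fields[0]] = (label, int(start), int(end), fields[2])
        elif line.startswith("A"):
            name, target, value = fields[1].split()
            attributes.append((target, name, value))
    return entities, attributes

entities, attributes = parse_ann(SAMPLE_ANN)
print(entities["T1"])  # ('EmbeddingEntity', 0, 6, 'elpizo')
print(attributes)      # [('T1', 'Category', 'attitude')]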

A useful function would be that of creating different annotation layers that could be switched on and off in the visual representation, which is possible in e.g. MMAX2 (Müller and Strube, 2006) and would be helpful in this project for the annotation of POS and lemma information.

Finally, it would have been convenient if it had been possible to formulate default features, such as the attribute 'normal' in the case of speech embedding entities.

As for the annotation scheme itself, it involves a relatively small number of entities, relations and attributes, but its application is not straightforward and it necessitated the creation of additional documents (described above): a manual containing explicit rules for annotation and a list of embedding verbs in the different categories. Both documents have been extended and amended in the course of work on the annotation. Annotators also required time to get accustomed to the scheme. Nonetheless, after the initial period it was possible to achieve a good level of inter-annotator agreement, as shown by the experiment mentioned above.

The annotation scheme is easily extendable to other languages which share the same typology of complements (direct vs. indirect vs. NP) in speech, attitude and perception reports (that includes at least all major European languages). The categorization of embedding verbs should be universally applicable. Some simplifications are possible in many languages, e.g. removing the category of preposed NP complements or the additional layer of complement chunks (which may not be as useful for many languages as it is for Ancient Greek, where complements often contain several clauses of different types). For modern literary languages it would probably be necessary to create a category for Free Indirect Discourse (but perhaps this would not require more than adding a new attribute of complement entities; the scheme already supports unembedded reports). More substantial changes would be needed for languages which have different typologies of reports (e.g. with no strict distinction between direct and indirect reports) or use other constructions besides embedding verbs to convey reports (e.g. evidential morphemes).

7 Conclusion

BRAT, despite some limitations, is a useful annotation tool that made it possible to implement an annotation scheme which covers all the categories and distinctions that we had wanted to include.

Our annotation scheme REPORTS serves its purpose well, as it makes it possible to easily extract from the corpus information that is relevant to a variety of research questions, concerning e.g. relations between semantics of embedding entities and syntax of complement clauses, factive presuppositions, distribution of vocabulary (including special subsets such as discourse particles, evaluative expressions, deictic elements) in different types of report complements and outside of them, narrative perspective and focalization, etc. Both the corpus and the annotation scheme, which are made publicly available, can therefore be a valuable resource for both linguists and literary scholars.

Acknowledgments

This research is supported by the EU under FP7, ERC Starting Grant 338421-Perspective. We thank Anke Kersten, Celine Teeuwen and Delano van Luik for their help.

References

Rutger Allan. 2014. Changing the topic: Topic position in Ancient Greek word order. Mnemosyne, 67(2):181–213.

Pranav Anand and Jim McCloskey. 2015. Annotating the implicit content of sluices. In The 9th Linguistic Annotation Workshop held in conjunction with NAACL 2015, pages 178–187.

Corien Bary and Emar Maier. 2014. Unembedded Indirect Discourse. Proceedings of Sinn und Bedeutung, 18:77–94.

Corien Bary, Peter Berck, and Iris Hendrickx. 2017. A memory-based lemmatizer for Ancient Greek. Manuscript.


Kalina Bontcheva, Hamish Cunningham, Ian Roberts, Angus Roberts, Valentin Tablan, Niraj Aswani, and Genevieve Gorrell. 2013. GATE Teamware: a web-based, collaborative text annotation framework. Language Resources and Evaluation, 47(4):1007–1029.

Joan W. Bresnan. 1970. On Complementizers: Toward a Syntactic Theory of Complement Types. Springer, Dordrecht.

Gregory R. Crane. 2016. Perseus Digital Library. http://www.perseus.tufts.edu. [Online; accessed Dec 16, 2016].

Sonia Cristofaro. 2003. Subordination. Oxford University Press, Oxford.

Sonia Cristofaro. 2008. A constructionist approach to complementation: Evidence from Ancient Greek. Linguistics, 46(3):571–606.

Mona T. Diab, Lori S. Levin, Teruko Mitamura, Owen Rambow, Vinodkumar Prabhakaran, and Weiwei Guo. 2009. Committed Belief Annotation and Tagging. In Third Linguistic Annotation Workshop, pages 68–73, Singapore. The Association for Computer Linguistics.

Sophia Döring. 2013. Modal particles and context shift. In Beyond expressives: Explorations in use-conditional meaning, pages 95–123, Leiden. Brill.

Regine Eckardt. 2012. Particles as speaker indexicals in Free Indirect Discourse. Sprache und Datenverarbeitung, 36(1):1–21.

Claudia Freitas, Bianca Freitas, and Diana Santos. 2016. Quemdisse? Reported speech in Portuguese. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), pages 4410–4416, Paris. European Language Resources Association (ELRA).

Talmy Givón. 1980. The binding hierarchy and the typology of complements. Studies in Language, 4(3):333–377.

David Goldstein. 2016. Classical Greek Syntax: Wackernagel's Law in Herodotus. Brill, Leiden.

Dag Haug and Marius Jøhndal. 2008. Creating a parallel treebank of the old Indo-European Bible translations. In Proceedings of the Language Technology for Cultural Heritage Data Workshop (LaTeCH 2008), Marrakech, Morocco, 1st June 2008, pages 27–34.

Dag Haug, Marius Jøhndal, Hanne Eckhoff, Eirik Welo, Mari Hertzenberg, and Angelika Muth. 2009. Computational and linguistic issues in designing a syntactically annotated parallel corpus of Indo-European languages. Traitement Automatique des Langues (TAL), 50(2):17–45.

Dagmar Haumann. 1997. The Syntax of Subordination. Max Niemeyer, Tübingen.

Iris Hendrickx, Amalia Mendes, and Silvia Mencarelli. 2012. Modality in Text: a Proposal for Corpus Annotation. In LREC'2012 – Eighth International Conference on Language Resources and Evaluation, pages 1805–1812, Istanbul, Turkey, May. European Language Resources Association (ELRA).

George Hripcsak and Adam S. Rothschild. 2005. Agreement, the f-measure, and reliability in information retrieval. Journal of the American Medical Informatics Association (JAMIA), 12(3):296–298.

Kyle P. Johnson et al. 2016. CLTK: The Classical Languages Toolkit. https://github.com/cltk/cltk. [Online; accessed Nov 12, 2016].

Halil Kilicoglu and Dina Demner-Fushman. 2016. Bio-SCoRes: A Smorgasbord Architecture for Coreference Resolution in Biomedical Text. PLoS ONE, 11(3).

Thomas Krause and Amir Zeldes. 2016. ANNIS3: A new architecture for generic corpus query and visualization. Digital Scholarship in the Humanities, 31(1):118–139.

Emar Maier. 2015. Reported speech in the transition from orality to literacy. Glotta: Zeitschrift für griechische und lateinische Sprache, 91E(1):152–170.

Marjorie McShane, Sergei Nirenburg, and Ron Zacharski. 2004. Mood and modality: out of theory and into the fray. Natural Language Engineering, 10(01):57–89.

Christoph Müller and Michael Strube. 2006. Multi-level annotation of linguistic data with MMAX2. In Sabine Braun, Kurt Kohn, and Joybrato Mukherjee, editors, Corpus Technology and Language Pedagogy: New Resources, New Tools, New Methods, pages 197–214. Peter Lang, Frankfurt a.M.

Mariana Neves, Alexander Damaschun, Andreas Kurtz, and Ulf Leser. 2012. Annotating and evaluating text for stem cell research. In Third Workshop on Building and Evaluation Resources for Biomedical Text Mining (BioTxtM 2012) at Language Resources and Evaluation (LREC).

Tomoko Ohta, Sampo Pyysalo, Jun'ichi Tsujii, and Sophia Ananiadou. 2012. Open-domain anatomical entity mention detection. In Proceedings of the Workshop on Detecting Structure in Scholarly Discourse, pages 27–36. Association for Computational Linguistics.

Dirk Panhuis. 1984. Prolepsis in Greek as a discourse strategy. Glotta: Zeitschrift für griechische und lateinische Sprache, 62:26–39.

Paul Portner. 1992. Situation theory and the semantics of propositional expressions. Ph.D. Dissertation, University of Massachusetts, Amherst.


L.A. Ramshaw and M.P. Marcus. 1995. Text chunking using transformation-based learning. In Proceedings of the Third Workshop on Very Large Corpora, pages 82–94.

Albert Rijksbaron. 2002. The Syntax and Semantics of the Verb in Classical Greek: An Introduction. Gieben, Amsterdam.

A. Savkov, J. Carroll, R. Koeling, and J. Cassell. 2016. Annotating patient clinical records with syntactic chunks and named entities: the Harvey Corpus. Language Resources and Evaluation, 50:523–548.

Pontus Stenetorp, Sampo Pyysalo, Goran Topić, Tomoko Ohta, Sophia Ananiadou, and Jun'ichi Tsujii. 2012. BRAT: a web-based tool for NLP-assisted text annotation. In Proceedings of the Demonstrations at the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 102–107.

Kees Thijs. 2017. The Attic particle μήν: intersubjectivity, contrast and polysemy. Journal of Greek Linguistics.

Marjolijn Verspoor. 1990. Semantic criteria in English complement selection. Ph.D. Dissertation, Rijksuniversiteit Leiden.

Janyce Wiebe, Theresa Wilson, and Claire Cardie. 2005. Annotating expressions of opinions and emotions in language. Language Resources and Evaluation, 39(2):165–210.


Proceedings of the 11th Linguistic Annotation Workshop, pages 57–66, Valencia, Spain, April 3, 2017. ©2017 Association for Computational Linguistics

Consistent Classification of Translation Revisions: A Case Study of English-Japanese Student Translations

Atsushi Fujita, NICT ([email protected])
Kikuko Tanabe, Kobe College ([email protected])
Chiho Toyoshima, Kansai Gaidai University ([email protected])
Mayuka Yamamoto, Honyaku Center Inc. (yamamoto.mayuka@honyakuctr.co.jp)
Kyo Kageura, University of Tokyo ([email protected])
Anthony Hartley, Rikkyo University ([email protected])

Abstract

Consistency is a crucial requirement in text annotation. It is especially important in educational applications, as lack of consistency directly affects learners' motivation and learning performance. This paper presents a quality assessment scheme for English-to-Japanese translations produced by learner translators at university. We constructed a revision typology and a decision tree manually through an application of the OntoNotes method, i.e., an iteration of assessing learners' translations and hypothesizing the conditions for consistent decision making, as well as reorganizing the typology. Intrinsic evaluation of the created scheme confirmed its potential contribution to the consistent classification of identified erroneous text spans, achieving visibly higher Cohen's κ values, up to 0.831, than previous work. This paper also describes an application of our scheme to an English-to-Japanese translation exercise course for undergraduate students at a university in Japan.

1 Introduction

Assessing and assuring translation quality is one of the main concerns for translation services, machine translation (MT) industries, and translation teaching institutions.1 The assessment process for a given pair of source document (SD) and its translation, i.e., target document (TD), consists of two tasks. The first task is to identify erroneous text spans in the TD. In professional settings, when assessors consider a text span in a TD as erroneous,

1 These include both private companies and translation-related departments in colleges and universities.

they generally suggest a particular revision proposal (Mossop, 2014). For instance, in example (1), a transliteration error is corrected.

(1) SD: Mark Potok is a senior fellow at the Southern Poverty Law Center.

TD: マーク・ポッドック (⇒ポトク)氏は南部貧困法律センターの上級研究員だ。(Podok ⇒ Potok)

Henceforth, we refer to a marked text span reflecting the identification of a particular error or deficiency as an issue. The second task is to classify each identified issue into an abstract issue type, such as "omission" or "misspelling."

An inherent problem concerning translation quality assessment is that it inevitably involves human judgments, and thus is subjective.2 The first task, i.e., identifying issues in TDs, relies heavily on assessors' translation and linguistic competence, as may the subsequent step of making a revision proposal for them, depending on the subtlety of the issue. It therefore seems impractical to create an annotation scheme that enables even inexperienced translators to perform this task at a comparable level to mature translators.

For regulating the second task, several typologies, such as those reviewed in Secara (2005), the Multilingual e-Learning in Language Engineering (MeLLANGE) error typology (Castagnoli et al., 2006), and Multidimensional Quality Metrics (MQM),3 have been proposed. Existing issue typologies show diversity in their granularity and their organization of issue types, owing to the fact that the scope and granularity of issues depend

2 While automated metrics for MT quality evaluation are often presented as objective, many, including BLEU (Papineni et al., 2002), rely on comparison with one or more human reference translations whose quality and subjectivity are merely assumed and not independently validated.

3 http://www.qt21.eu/launchpad/content/multidimensional-quality-metrics


on the purpose of translations and the aim of human assessments (e.g., formative or summative). However, the typology alone does not necessarily guarantee consistent human assessments (Lommel et al., 2015). For instance, while one may classify the issue in (1) as an "incorrect translation of term," it could also be regarded as a "misspelling."

In this paper, we focus on the quality assessment of learners' translations. Motivated by the increasing demand for translation, translation teaching institutions have been incorporating best practices of professionals into their curricula. When teaching the revision and review processes in such institutions, the assessor's revision proposal is normally not provided, in order to prevent learners believing that it is the only correct solution (Klaudy, 1996). Thus, issue type plays a crucial role in conveying the assessors' intention to learners, and its consistency is especially important, since lack of consistency directly affects learners' motivation and learning performance. Besides the consistency, the applicability of an assessment tool to a wide range of translations is also important. To the best of our knowledge, however, none of the existing typologies have been validated for translations between languages whose structures are radically different, such as English and Japanese. Neither has their applicability to translations produced by less advanced learners, such as undergraduate students, been fully examined.

Aiming at (i) a consistent human assessment, (ii) of English-to-Japanese translations, (iii) produced by learner translators, we manually constructed a scheme for classifying identified issues. We first collected English-to-Japanese translations from learners in order to assure and validate the applicability of our scheme (§3). We then manually created an issue typology and a decision tree through an application of the OntoNotes method (Hovy et al., 2006), i.e., an iteration of assessing learners' translations and updating the typology and decision tree (§4). We adopted an existing typology, that of MNH-TT (Babych et al., 2012), as the starting point, because its origin (Castagnoli et al., 2006) was tailored to assessing university student learners' translations and its applicability across several European languages had been demonstrated. We evaluated our scheme with inter-assessor agreement, employing four assessors and an undergraduate learner translator (§5).

We also implemented our scheme in an English-to-Japanese translation exercise course for undergraduate students at a university in Japan, and observed tendencies among absolute novices (§6).

2 Previous Work

To the best of our knowledge, the error typology in the Multilingual e-Learning in Language Engineering (MeLLANGE) project (Castagnoli et al., 2006) was the first tool tailored to assessing learners' translations. It had been proved applicable to learners' translations across several European languages, including English, German, Spanish, French, and Italian. The MeLLANGE typology distinguished more than 30 types of issues, grouped into Transfer (TR) issues, whose diagnosis requires reference to both SD and TD, and Language (LA) issues, which relate to violations of target language norms. This distinction underlies the widespread distinction between adequacy and fluency, the principal editing and revision strategies advocated by Mossop (2014), and the differentiation between (bilingual) revision and (monolingual) reviewing specified in ISO/TC27 (2015). Designed for offering formative assessment by experienced instructors to university learner translators, it provided a fine-grained discrimination seen also in, for instance, the framework of the American Translators Association (ATA) with 23 categories.4

The MeLLANGE typology was simplified by Babych et al. (2012), who conflated various subcategories and reduced the number of issue types to 16 for their translation training environment, MNH-TT, which differs from MeLLANGE in two respects. First, it is designed for feedback from peer learners acting as revisers and/or reviewers, whose ability to make subtle distinctions is reduced. Second, it is embedded in a project-oriented translation scenario that simulates professional practice and where more coarse-grained, summative schemes prevail.5,6 In our pilot test, however, we found that even the MNH-TT typology did not necessarily guarantee consistent human assessments. When we identified 40 issues

4 http://www.atanet.org/certification/aboutexams_error.php

5 SAE J2450, the standard for the automotive industry, has only seven categories. http://standards.sae.org/j2450_200508/

6 The latest MQM (as of February 21, 2017) has eight top-level issue types (dimensions) and more than 100 leaf nodes. http://www.qt21.eu/mqm-definition/


Level 1: Incompleteness. Translation is not finished.
Level 2: Semantic errors. The contents of the SD are not properly transferred.
Level 3: TD linguistic issues. The contents of the SD are transferred, but there are some linguistic issues in the TD.
Level 4: TD felicity issues. The TD is meaning-preserving and has no linguistic issues, but has some flaws.
Level 5: TD register issues. The TD is a good translation, but not suitable for the assumed text type.

Table 1: Priority of coarse-grained issue types for translation training for novices.

in an English-to-Japanese translation by a learner and two of the authors separately classified them, only 17 of them (43%) resulted in agreement on the classification, achieving Cohen's κ (Cohen, 1960) of 0.36. This highlighted the necessity of a navigation tool, such as a decision tree, for consistent human decision making, especially given that the issue type serves as feedback to learners.
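For readers who want to reproduce such figures, the following Python sketch computes Cohen's κ from two assessors' label sequences; the eight toy labels below are invented for illustration and are not the pilot-test data.

from collections import Counter

def cohens_kappa(labels_a, labels_b):
    # Observed agreement corrected for the agreement expected by chance,
    # estimated from each assessor's marginal label distribution.
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[l] * freq_b[l] for l in set(labels_a) | set(labels_b)) / n ** 2
    return (observed - expected) / (1 - expected)

assessor_1 = ["X3", "X4b", "X1", "X3", "X7", "X3", "X15", "X3"]
assessor_2 = ["X3", "X3", "X1", "X3", "X7", "X4b", "X15", "X3"]
print(round(cohens_kappa(assessor_1, assessor_2), 3))  # 0.636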

Multidimensional Quality Metrics (MQM) has been widely used in the MT community and translation industries. However, the consistency of classifying issues had not been guaranteed when only its issue typology was used. Lommel et al. (2014) measured the inter-assessor agreement of identifying erroneous text spans in MT outputs and classifying them, using the MQM typology comprising 20 types. Having obtained low Cohen's κ values, 0.18 to 0.36, and observed several types of ambiguities, they pointed out the lack of a decision making tool. Motivated by this study, MQM later established a decision tree (Burchardt and Lommel, 2014). Nevertheless, MQM has not been validated as applicable to learners' translations, especially those between distant languages.

3 Collecting English-to-Japanese Translations by Learners

Our study began with collecting English-to-Japanese translations produced by learner translators. Assuming novice learner translators and the very basic competences to teach, we selected journalistic articles as the text type. As the SDs in English, 18 articles with similar conceptual and linguistic difficulties were sampled from the column page of a news program "Democracy Now!"7 by a professional translator, who also had significant experience in teaching English-to-Japanese translation at universities. The average number of words in the SDs was 781. Then, 30 students (12 undergraduate and 18 graduate students) were employed to translate one of the SDs into Japanese.8

7 http://www.democracynow.org/blog/category/weekly_column/

8 Some of the SDs were separately translated by more than one student.

All these participants were native Japanese speakers and had attended translation classes at a university in Japan. They were asked to produce a TD that served as a Japanese version of the original article.

The collected pairs of SD and TD were divided into the following three partitions.

Development: Three randomly selected document pairs were used to develop our scheme (§4).

Validation 1: Another 17 sampled document pairs were used to gauge the inter-assessor agreement of the issue classification task, given the identified issues (§5.1).

Validation 2: The remaining ten document pairs were used to examine the stability of identifying erroneous text spans in the TDs, as well as the inter-assessor agreement between a learner and an experienced assessor (§5.2).

4 Development of an Issue Classification Scheme

To alleviate potential inconsistencies, we structured the issue types in the MNH-TT typology (Babych et al., 2012), introducing a decision tree. We chose a decision tree as a navigation tool for human decision making, as in MQM (Burchardt and Lommel, 2014), because the resulting issues will be used not only by the instructors in order to evaluate the translation quality but also by learners in order to understand the diagnoses. We also considered that explicit explanation for decisions is critical in such scenarios.

We first determined the priorities of issue types through in-depth interviews with two professional translators, who also had ample experience in teaching English-to-Japanese translation at universities. These priorities were based on both the workflow of the professionals and the nature of the issues they found in grading learners' translations. Table 1 shows the coarse-grained figure resulting from the two translators' agreement. Obvious incompleteness of translations is captured


Level 1: Incompleteness
X4a Content-SD-intrusion-untranslated: The TD contains elements of the SD left untranslated in error. [MeLLANGE: TR-SI-UT; MQM: Untranslated]
X6 Content-indecision: The TD contains alternative choices left unresolved by the translator. [MeLLANGE: TR-IN; MQM: n/a]

Level 2: Semantic errors
X7 Lexis-incorrect-term: Item is a non-term, incorrect, inconsistent with the glossary or inconsistent within the TD. [MeLLANGE: LA-TL-{IN,NT,IG,IT}; MQM: Terminology]
X1 Content-omission: Content present in the SD is wrongly omitted in the TD. [MeLLANGE: TR-OM; MQM: Omission]
X2 Content-addition: Content not present in the SD is wrongly added to the TD. [MeLLANGE: TR-AD; MQM: Addition]
X3 Content-distortion: Content present in the SD is misrepresented in the TD. [MeLLANGE: TR-DI, TR-TI-∗, TR-TL-IN; MQM: Mistranslation]

Level 3: TD linguistic issues
X8 Lexis-inappropriate-collocation: Item is not a usual collocate of a neighbor it governs or is governed by. [MeLLANGE: LA-TL-IC; MQM: n/a]
X10 Grammar-preposition/particle: Incorrect preposition or (Japanese) particle. [MeLLANGE: LA-PR; MQM: Grammar]
X11 Grammar-inflection: Incorrect inflection or agreement for tense, aspect, number, case, or gender. [MeLLANGE: LA-IA-∗; MQM: Grammar]
X12 Grammar-spelling: Incorrect spelling. [MeLLANGE: LA-HY-SP; MQM: Spelling]
X13 Grammar-punctuation: Incorrect punctuation. [MeLLANGE: LA-HY-PU; MQM: Punctuation]
X9 Grammar-others: Other grammatical and syntactic issues in the TD. [MeLLANGE: n/a; MQM: Grammar]

Level 4: TD felicity issues
X16 Text-incohesive: Inappropriate use or non-use of anaphoric expressions, or wrong ordering of given and new elements of information. [MeLLANGE: n/a; MQM: n/a]
X4b Content-SD-intrusion-too-literal: The TD contains elements of the SD that are translated too literally. [MeLLANGE: TR-SI-TL; MQM: Overly literal]
X15 Text-clumsy: Lexical choice or phrasing is clumsy, tautologous, or unnecessarily verbose. [MeLLANGE: LA-ST-∗; MQM: Awkward]

Level 5: TD register issues
X14 Text-TD-inappropriate-register: Lexical choice, phrasing, or style is inappropriate for the intended text type of the TD. [MeLLANGE: LA-RE-∗; MQM: Style (except "Awkward"), Local convention, Grammatical register]

Table 2: Our issue typology, with prefixes (content, lexis, grammar, and text) indicating their coarse-grained classification in the MNH-TT typology (Babych et al., 2012). The bracketed annotations after each entry give the corresponding issue types in the MeLLANGE typology (Castagnoli et al., 2006) and in MQM (Lommel et al., 2015), respectively, where "n/a" indicates issue types that are not covered explicitly.

at Level 1. While Level 2 covers issues related to misunderstandings of the SD, Levels 3 and 4 highlight issues in the language of the TD. Level 5 deals with violation of various requirements imposed by the text type of the translated document.

Regarding Table 1 as a strict constraint for the shape of the decision tree, and the 16 issue types in the MNH-TT typology as the initial issue types, we developed our issue classification scheme, using the OntoNotes method (Hovy et al., 2006). In other words, we performed the following iteration(s).

Step 1. Annotate issues in the TDs for development, using the latest scheme.

Step 2. Terminate the iteration if we meet a satisfactory agreement ratio (90%, as in Hovy et al. (2006)).

Step 3. Collect disagreed issues among assessors, including those newly found, and discuss the factors of consistent decision making.

Step 4. Update the scheme, including the definition of each issue type, the conditions for decision making, and their organization in the form of a decision tree. Record marginal examples in the example list.

Step 5. Go back to Step 1.

Three of the authors conducted the above process using the three document pairs (see §3), which resulted in the issue typology in Table 2 and the decision tree in Table 3. A total of 52 typical and marginal examples were also collected.

It is noteworthy that our issue typology preserves almost perfectly the top-level distinction of the MeLLANGE typology, i.e., the TR (transfer)


ID   Question                                                                   True                  False
Q1a  Is it an unjustified copy of the SD element?                               X4a                   Q1b
Q1b  Do multiple options remain in the TD?                                      X6                    Q2a
Q2a  Is all content in the SD translated in proper quantities in a proper way?  Q3a                   Q2b
Q2b  Is the error related to a term in the given glossary?                      X7                    X1/X2/X3
Q3a  Is it a grammatical issue?                                                 Q3b                   Q4a
Q3b  Is it a predefined specific type?                                          X8/X10/X11/X12/X13    X9
Q4a  Does it hurt cohesiveness of the TD?                                       X16                   Q4b
Q4b  Does it hurt fluency?                                                      Q4c                   Q5a
Q4c  Is it too literal?                                                         X4b                   X15
Q5a  Is it unsuitable for the intended text type of the TD?                     X14                   Q6a
Q6a  Is it anyways problematic?                                                 "Other issue"         "Not an issue"

Table 3: Our decision tree for classifying a given issue (the "True" and "False" columns give the determined type or the next question): we do not produce questions for distinguishing X1/X2/X3, and X8/X10/X11/X12/X13, considering that their definitions are clear enough.
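To illustrate how Table 3 can be operationalized, here is a small Python sketch of the cascade (our own rendering, not part of the published scheme); the dictionary of yes/no answers and the pooled labels for Q2b and Q3b are simplifications, since the scheme itself leaves the choice among X1/X2/X3 and among X8/X10/X11/X12/X13 to the assessor.

def classify(answers):
    # `answers` maps a question ID from Table 3 to True/False.
    if answers["Q1a"]:
        return "X4a"
    if answers["Q1b"]:
        return "X6"
    if not answers["Q2a"]:                 # content of the SD not properly transferred
        return "X7" if answers["Q2b"] else "X1/X2/X3"
    if answers["Q3a"]:                     # a grammatical issue
        return "X8/X10/X11/X12/X13" if answers["Q3b"] else "X9"
    if answers["Q4a"]:
        return "X16"
    if answers["Q4b"]:                     # hurts fluency
        return "X4b" if answers["Q4c"] else "X15"
    if answers["Q5a"]:
        return "X14"
    return "Other issue" if answers["Q6a"] else "Not an issue"

# Example: an issue whose content is not transferred and is not a glossary term.
print(classify({"Q1a": False, "Q1b": False, "Q2a": False, "Q2b": False}))  # X1/X2/X3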

and LA (language) issues, described in §2. The priority of the former over the latter, implicitly assumed in the MeLLANGE typology, is also largely preserved; the sole exceptions are X7 (incorrect translations of terms) and X4b (too literal). Table 2 also shows that our typology includes the following three issue types that are not covered by MQM.

• X6 (indecision) captures a student habit of offering more than one translation for a given text, which is not observed in professional translators.

• X8 (collocation) employs a more specific, linguistic terminology for diagnosing one subtype of X15 (clumsy).

• X16 (incohesive), which is also absent from the MeLLANGE typology but present in the ATA framework, appears not applicable in the common (commercial) situation where sentences are translated without reference to their context.

During the development process, we decided to identify and classify only the first occurrence of identical issues in a single TD. For instance, other incorrect translations of "ポッドック" for "Potok" in the same TD as example (1) will not be annotated repeatedly. This is because annotations are made and used by humans, i.e., assessors and learners, and persistent indications of identical issues may waste the time of assessors and discourage learners. This practice differs from ordinary linguistic annotation, especially that aiming to develop training data for machine learning methods, which requires exhaustive annotation of the phenomena of interest within given documents. Although there have been several studies on the use of partial/incomplete annotation, e.g., Tsuboi et al. (2008), our procedure is nevertheless different from these in the sense that we leave issues "unannotated" only when identical ones are already annotated.

5 Intrinsic Evaluation of the Scheme

It is hard to make a fair and unbiased comparison between different annotation schemes that target the same phenomena, employing the same assessors. We thus evaluated whether our issue classification scheme leads to a sufficiently high level of inter-assessor agreement, regarding those poor results described in §2 as baselines, and analyzed the tendencies of disagreements and the distribution of issues.

5.1 Validation 1: Classification of Identified Issues

5.1.1 Inter-Assessor Agreement

We gauged the consistency of classifying identified issues by the inter-assessor agreement.

First, three of the authors who developed our scheme identified erroneous text spans in the 17 TDs (see §3) and made a revision proposal for each, through discussion. Then, four assessors were independently asked to classify each of the resulting 575 issues into one of the 16 issue types, "other issue," and "not an issue," following our decision tree in Table 3. Two of them were anonymous paid workers (A and B), while the others (C and D) were two of the above three authors. All four assessors were native Japanese speakers with a strong command of English and an understanding of our scheme and translation-related notions. While they were asked to adhere to our decision


ID  Background                                          Agreement ratio [%]            Cohen's κ
                                                        vs A   vs B   vs C   vs D     vs A    vs B    vs C    vs D
A   Bachelor of Engineering (now translation editor)      -    67.7   63.3   57.9       -     0.613   0.554   0.490
B   Master of Japanese Pedagogy (now translator)        67.7     -    67.1   61.4     0.613     -     0.592   0.523
C   Master of Translation Studies                       63.3   67.1     -    86.6     0.554   0.592     -     0.831
D   Ph.D in Computational Linguistics                   57.9   61.4   86.6     -      0.490   0.523   0.831     -

Table 4: Inter-assessor agreement on the identified 575 issues.

tree, no dictionary or glossary was provided.

Table 4 summarizes the agreement ratio and Cohen's κ (Cohen, 1960). The most consistent pair was C and D, who agreed on 86.6% (498/575) of the issues and achieved almost perfect agreement, κ = 0.831, although it is indisputable that they had some advantages, having been engaged in developing the scheme and identifying the issues. Both of the two measures draw a clear distinction between the anonymous and identified assessors. As our analysis below illustrates, the anonymous workers made many careless mistakes, presumably because the human resource agency did not offer substantial incentive to pursue accurate and consistent annotations. Nevertheless, even the lowest κ value in our experiment, 0.490, was visibly higher than those achieved using the typologies with the same level of granularity but without a tool for consistent decision making (see §2).

Table 5 shows the most frequent disagreement patterns between each anonymous worker and the two authors (C and D) on the 498 issues about which the authors agreed. The most typical disagreement was between X3 (distortion) and X4b (too literal). For instance, "has passed" in example (2) was mistakenly translated into "通過した ([bill] passed [legislature])," resulting in two exclusive subjects marked with the nominative case marker "が," i.e., "各州政府 (state after state)" and "農業口封じ法 (Ag-Gag laws)."

(2) SD: State after state has passed so-called Ag-Gag laws.

TD: 各州政府がいわゆる農業口封じ法が通過した (⇒を可決した)。([bill] passed [legislature] ⇒ [legislature] passed [bill])

As the TD does not convey the original meaning in the SD, both C and D classified this issue into X3 (distortion). In contrast, both A and B regarded it as X4b (too literal), presumably considering that both the original translation "通過した" and the revision proposal "可決した" were appropriate lexical translations for "has passed" when

A, B C&D A BX4b (Level 4) X3 (Level 2) 37 8X3 (Level 2) X4b (Level 4) 11 24X1 (Level 2) X3 (Level 2) 13 10X1 (Level 2) X4b (Level 4) 6 5X1 (Level 2) X16 (Level 4) 6 4X1 (Level 2) X7 (Level 2) 5 4

Table 5: Frequent disagreements between anony-mous workers (A and B) and two of the authors (Cand D) among the 498 identified issues that C andD classified consistently.

separated from the context. The above results, and the fact that X3 (distortion) and X4b (too literal) also produced the most frequent disagreements between C and D (11 out of 77 disagreements), suggested that question Q2a in Table 3 should be defined more clearly. We plan to make this precise in our future work.

The other frequent disagreements concerned the issues classified as X1 (omission) by A and B, whereas C and D classified them as other types. For instance, both C and D classified the issue in (3) as X3 (distortion) since the original word "sailors" was incorrectly translated as "soldiers," and the issue in (4) as X7 (incorrect translation of terms) since named entities compose a typical subclass of term.

(3) SD: We have filed a class action for approximately a hundred sailors.

TD: およそ 100人の兵士 (⇒海兵兵士)のための集団訴訟を起こした。(soldiers ⇒ sailors)

(4) SD: President Ronald Reagan vetoed the bill, but, . . .

TD: レーガン大統領 (⇒ロナルド・レーガン大統領)はその法案を拒否したが、. . . (President Reagan ⇒ President Ronald Reagan)

These disagreements imply that the anonymous workers might not have strictly adhered to our decision tree, and classified them as X1 after merely


                            undergrad. (6)       grad. (11)
Issue type          n       avg.     s.d.        avg.     s.d.
Level 1   X4a       3       0.04     0.11        0.01     0.04
          X6        0       0.00     0.00        0.00     0.00
Level 2   X7       33       0.39     0.18        0.26     0.31
          X1       53       0.73     0.54        0.34     0.29
          X2       28       0.33     0.26        0.26     0.32
          X3      240       2.67     1.26        2.24     1.41
Level 3   X8       16       0.19     0.23        0.16     0.24
          X10      22       0.27     0.24        0.21     0.32
          X11      10       0.13     0.11        0.07     0.11
          X12       8       0.11     0.15        0.07     0.11
          X13      18       0.21     0.15        0.17     0.20
          X9       10       0.08     0.10        0.12     0.12
Level 4   X16      18       0.20     0.23        0.14     0.14
          X4b      92       0.87     0.69        1.05     0.93
          X15      14       0.09     0.07        0.21     0.25
Level 5   X14      28       0.34     0.14        0.25     0.32
Total             593       6.63     2.35        5.58     3.35

Table 6: Total frequency and relative frequency of each issue type (macro average and standard deviation over TDs).

comparing the marked text span with the revision proposal at the surface level.

5.1.2 Coverage of the Issue Typology

Through a discussion, the disagreements between C and D on the 77 issues were resolved and 18 newly identified issues were also classified. We then calculated the relative frequency RF of each issue type, t, in each TD, d, as follows:

RF(t, d) = (frequency of t in d) / ((# of words in the SD of d) / 100)
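In other words, RF normalizes the raw count of an issue type to issues per 100 source-document words; a minimal Python sketch with invented numbers (21 issues in a 781-word SD, the average SD length reported in §3):

def relative_frequency(issue_count, sd_word_count):
    # Issues of one type per 100 words of the source document.
    return issue_count / (sd_word_count / 100)

print(round(relative_frequency(21, 781), 2))  # 2.69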

Table 6 summarizes the frequency of each issue type; the "n" column shows their total frequency across all TDs and the remaining columns compare the macro average and standard deviation of the relative frequencies over TDs produced by each group of students. All the identified issues were classified into one of the 16 issue types in our typology, confirming that the MNH-TT typology had also covered the various types of issues appearing in English-to-Japanese translations produced by learners. As reviewed in §2 and §4, both our typology and the MNH-TT typology cover a broader range of issues than the MeLLANGE typology. Thus, we can even argue that our scheme is applicable to translations between the several European languages that Castagnoli et al. (2006) have investigated. In our preliminary experiments on assessing English-to-Chinese and Japanese-to-Korean translations using our scheme, we have not observed any novel types of issues.

X3 (distortion) occurred significantly more frequently than the others. This is consistent with the previous investigation based on the MeLLANGE typology (Castagnoli et al., 2006), considering that X3 (distortion) in our typology corresponds to parts of the most frequent type, LA-TL-IN (Language, terminology and lexis, incorrect), and the second-ranked TR-DI (Transfer, distortion). The other frequent types were X4b (too literal) and X1 (omission), which are both listed in the two existing typologies in Table 2, and also frequently observed in the learners' translations between European languages (Castagnoli et al., 2006).

The annotation results revealed that the graduate students produced issues at Level 2 less frequently than the undergraduate students, while producing more Level 4 issues. Although the relative frequencies of issues vary greatly between individuals, we speculate that less experienced students are more likely to struggle at Level 2, i.e., properly understanding content in SDs.

5.2 Validation 2: Annotation by a Novice Learner Translator

We also evaluated our issue classification scheme in a more realistic setting: the comparison of an undergraduate learner translator with an experienced assessor.

The learner involved in this experiment, referred to as assessor E, was also a native Japanese speaker and had attended some translation classes at a university. The other assessor was D, who had participated in the first experiment. The two assessors separately identified erroneous text spans in the ten TDs (see §3) with a revision proposal, and classified them following our decision tree.

As a result, D and E respectively annotated 561 and 406 issues. Among these, 340 were for identical text spans, with not necessarily identical but similar revision proposals. They consistently classified 289 issues out of 340 (85.0%), achieving a substantial and notably high agreement, κ = 0.794. These are substantially higher than those achieved by the anonymous workers A and B (see Table 4), although they worked on different TDs. This fact indicates that the identified assessors in the first experiment (C and D) did not necessarily have an advantage. More importantly, this experiment verified the understandability of our scheme by actual learner translators. We expect that learner translators would be able


D                  E                  # of issues
X4b (Level 4)      X3 (Level 2)       6
X3 (Level 2)       X15 (Level 4)      6
X4b (Level 4)      X15 (Level 4)      5
X1 (Level 2)       X3 (Level 2)       3

Table 7: Frequent disagreements between a learner translator (E) and one of the authors (D) among the 340 issues they identified consistently.

to perform peer reviewing of their draft translations, once they have acquired a certain level of understanding of our scheme. Consequently, as Kiraly (2000) mentions, they would be able to effectively develop their translation skills through playing various roles in the translation workflow, including that of assessor.

Typical disagreement patterns are shown in Table 7. Similarly to the first experiment, disagreement between X3 (distortion) and X4b (too literal) was frequently observed. E also classified as X15 (clumsy) 11 issues which D classified as X3 (distortion) or X4b (too literal). To answer question Q4c consistently, the literalness needs to be confirmed, for instance, by using dictionaries.

There were 221 and 66 issues identified only by D or E, respectively; 171 and 41 out of these were ignored by the other, including missed issues and accepted translations, reflecting the different levels of sensitivity of the assessors. The other 38 and 14 mismatches suggested the necessity of a guideline to consistently annotate single issues. For instance, E identified one X3 (distortion) issue in (5), while D annotated two issues there: "情報が豊富な (with rich information)" as X2 (addition) and "お天気アプリ (weather application)" as X3 (distortion).

(5) SD: I put the question to Jeff Masters, co-founder at Weather Underground, an Internet weather information service.

TD: 情報量が豊富なお天気アプリ (⇒ 気象情報を提供するウェブサービス)、ウェザー・アンダーグラウンドの共同設立者であるジェフ・マスターズ氏に質問を投げかけた。(a weather application with rich information ⇒ a Web service which provides weather information)

6 Translation Exercise at a University

Having created and validated an annotation scheme, we should ultimately verify its usefulness in actual practice. We implemented our scheme in an English-to-Japanese translation exercise course for undergraduate students at a university in Japan.

6.1 Course Design

Two different types of English texts were used: travel guides from "Travellerspoint"9 (henceforth, "Travel") and columns from "Democracy Now!" as in §3 (henceforth, "Column"). For each text type, the instructor of the course sampled three documents with similar conceptual and linguistic difficulties, and excerpted roughly the first 550 words of each document as SDs.

A total of 27 undergraduate students participated in the course held over 15 weeks, from April to July 2015. All of them were native Japanese speakers; eight had attended translation classes at a university, while the other 19 were absolute novices. Each student selected one of the sampled SDs for each text type. Before starting translation, they prepared a glossary and collected background information by themselves, and the instructor added any missing information. Each student first translated a "Travel" SD into Japanese over six weeks, referring to the corresponding glossary and background information, and then a "Column" SD in the same manner.

During the process of translating one SD, students' translations were assessed every two weeks (three times per SD); a teaching assistant identified erroneous text spans with a revision proposal, and classified them following our decision tree; and then the instructor double-checked them. While the identified erroneous text spans and the assigned issue types were fed back to the students, revision proposals were not shown (Klaudy, 1996). When the instructor fed back the assessment results to the students, she also explained our issue typology (Table 2) and decision tree (Table 3), also using the examples collected during the development process.

6.2 Observations

Through the course, 54 TDs were annotated with 1,707 issues, all of which fell into one of the 16 types in our issue typology. Table 8 summarizes the relative frequency of each issue type. X3 (distortion) occurred significantly more frequently than the others in translating both types of SDs, as in the results in Table 6 and previous work (Castagnoli et al., 2006).

9 http://www.travellerspoint.com/


                       Travel              Column
Issue type          avg.     s.d.        avg.     s.d.
Level 1   X4a       0.03     0.07        0.03     0.07
          X6        0.00     0.00        0.00     0.00
Level 2   X7        1.14     0.79        0.53     0.39
          X1        0.47     0.42        0.49     0.31
          X2        0.09     0.13        0.16     0.19
          X3        2.20     0.95        2.91     1.03
Level 3   X8        0.11     0.11        0.13     0.18
          X10       0.19     0.24        0.18     0.27
          X11       0.07     0.09        0.12     0.15
          X12       0.15     0.16        0.11     0.11
          X13       0.09     0.16        0.17     0.21
          X9        0.22     0.34        0.03     0.08
Level 4   X16       0.07     0.11        0.03     0.07
          X4b       0.53     0.53        0.24     0.22
          X15       0.10     0.18        0.08     0.13
Level 5   X14       0.30     0.25        0.45     0.31
Total               5.76     2.17        5.65     1.84

Table 8: Relative frequency of each issue type (macro average and standard deviation over TDs).

In other words, transferring content of the given SD is the principal issue for learner translators in general.

Table 8 also highlights that the relative frequencies of X7 (incorrect translations of terms) and X4b (too literal) are drastically different for the "Travel" and "Column" SDs. A student-wise comparison of the relative frequencies in Figure 1 revealed that the students who made these two types of issues more frequently in translating "Travel" SDs (shown on the right-hand side of the figure) produced these types of issues significantly less frequently when translating "Column" SDs. Due to the difference in text types, we cannot claim that this demonstrates students' growth in learning to translate and that this has been promoted by our scheme. Nevertheless, our scheme is clearly useful for quantifying the characteristics of such students.

7 Conclusion

To consistently assess human translations, especially focusing on English-to-Japanese translations produced by learners, we manually created an improved issue typology accompanied by a decision tree through an application of the OntoNotes method. Two annotation experiments, involving four assessors and an actual learner translator, confirmed the potential contribution of our scheme to making consistent classification of identified issues, achieving Cohen's κ values from 0.490 (moderate) to 0.831 (almost perfect). We also used our scheme in a translation exercise course at a university in order to assess

[Figure 1: scatter plot of per-student relative frequencies of X7 and X4b issues; x-axis: relative frequency of issues (Travel), 0 to 3; y-axis: relative frequency of issues (Column), 0 to 1.5.]

Figure 1: Student-wise comparison of relative frequencies of X7 (incorrect translations of terms) and X4b (too literal).

learners' translations. The predefined 16 issue types in our typology covered all the issues that appeared in English-to-Japanese translations produced by undergraduate students, supporting the applicability of our issue typology to real-world translation training scenarios.

Our plans for future work include further improvements of our issue classification scheme, such as clarifying questions in the decision tree and establishing a guideline for annotating single issues. Its applicability will further be validated using other text types and other language pairs. From the pedagogical point of view, monitoring the effects of assessment is also important (Orozco and Hurtado Albir, 2002). Given the high agreement ratio in our second experiment (§5.2), we are also interested in the feasibility of peer reviewing (Kiraly, 2000). Last but not least, with a view to efficient assessment with less human labor, we will also study automatic identification and classification of erroneous text spans, referring to recent advances in the field of word- and phrase-level quality estimation for MT outputs.10

Acknowledgments

We are deeply grateful to anonymous reviewers for their valuable comments. This work was partly supported by JSPS KAKENHI Grant-in-Aid for Scientific Research (A) 25240051.

10 http://www.statmt.org/wmt16/quality-estimation-task.html

References

Bogdan Babych, Anthony Hartley, Kyo Kageura, Martin Thomas, and Masao Utiyama. 2012. MNH-TT: a collaborative platform for translator training. In Proceedings of Translating and the Computer 34.


Aljoscha Burchardt and Arle Lommel. 2014. QTLaunchPad supplement 1: Practical guidelines for the use of MQM in scientific research on translation quality. http://www.qt21.eu/downloads/MQM-usage-guidelines.pdf.

Sara Castagnoli, Dragos Ciobanu, Kerstin Kunz, Natalie Kübler, and Alexandra Volanschi. 2006. Designing a learner translator corpus for training purpose. In Proceedings of the 7th International Conference on Teaching and Language Corpora.

Jacob Cohen. 1960. A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1):37–46.

Eduard Hovy, Mitchell Marcus, Martha Palmer, Lance Ramshaw, and Ralph Weischedel. 2006. OntoNotes: The 90% solution. In Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics (HLT-NAACL) Short Papers, pages 57–60.

ISO/TC27. 2015. ISO 17100:2015 translation services: Requirements for translation services.

Donald Kiraly. 2000. A Social Constructivist Approach to Translator Education: Empowerment from Theory to Practice. Routledge.

Kinga Klaudy. 1996. Quality assessment in school vs professional translation. In Cay Dollerup and Vibeke Appel, editors, Teaching Translation and Interpreting 3: New Horizons: Papers from the Third Language International Conference, pages 197–203. John Benjamins.

Arle Lommel, Maja Popović, and Aljoscha Burchardt. 2014. Assessing inter-annotator agreement for translation error annotation. In Proceedings of the LREC MTE Workshop on Automatic and Manual Metrics for Operational Translation Evaluation.

Arle Lommel, Attila Görög, Alan Melby, Hans Uszkoreit, Aljoscha Burchardt, and Maja Popović. 2015. QT21 deliverable 3.1: Harmonised metric. http://www.qt21.eu/wp-content/uploads/2015/11/QT21-D3-1.pdf.

Brian Mossop. 2014. Revising and Editing for Translators (3rd Edition). Routledge.

Mariana Orozco and Amparo Hurtado Albir. 2002. Measuring translation competence acquisition. Translators' Journal, 47(3):375–402.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), pages 311–318.

Alina Secara. 2005. Translation evaluation: A state of the art survey. In Proceedings of the eCoLoRe/MeLLANGE Workshop, pages 39–44.

Yuta Tsuboi, Hisashi Kashima, Shinsuke Mori, Hiroki Oda, and Yuji Matsumoto. 2008. Training conditional random fields using incomplete annotations. In Proceedings of the 22nd International Conference on Computational Linguistics (COLING), pages 897–904.


Proceedings of the 11th Linguistic Annotation Workshop, pages 67–75, Valencia, Spain, April 3, 2017. ©2017 Association for Computational Linguistics

Representation and Interchange of Linguistic Annotation: An In-Depth, Side-by-Side Comparison of Three Designs

Richard Eckart de Castilho♣, Nancy Ide♠, Emanuele Lapponi♥, Stephan Oepen♥, Keith Suderman♠, Erik Velldal♥, and Marc Verhagen♦

♣ Technische Universität Darmstadt, Department of Computer Science
♠ Vassar College, Department of Computer Science
♥ University of Oslo, Department of Informatics
♦ Brandeis University, Linguistics and Computational Linguistics

Abstract

For decades, most self-respecting linguistic engineering initiatives have designed and implemented custom representations for various layers of, for example, morphological, syntactic, and semantic analysis. Despite occasional efforts at harmonization or even standardization, our field today is blessed with a multitude of ways of encoding and exchanging linguistic annotations of these types, both at the levels of 'abstract syntax', naming choices, and of course file formats. To a large degree, it is possible to work within and across design plurality by conversion, and often there may be good reasons for divergent design reflecting differences in use. However, it is likely that some abstract commonalities across choices of representation are obscured by more superficial differences, and conversely there is no obvious procedure to tease apart what actually constitute contentful vs. mere technical divergences. In this study, we seek to conceptually align three representations for common types of morpho-syntactic analysis, pinpoint what in our view constitute contentful differences, and reflect on the underlying principles and specific requirements that led to individual choices. We expect that a more in-depth understanding of these choices across designs may lead to increased harmonization, or at least to more informed design of future representations.

1 Background & Goals

This study is grounded in an informal collaboration among three frameworks for 'basic' natural language processing, where workflows can combine the outputs of processing tools from different developer communities (i.e. software repositories), for example a sentence splitter, tokenizer, lemmatizer, tagger, and parser, for morpho-syntactic analysis of running text. In large part owing to divergences in input and output representations for such tools, it tends to be difficult to connect tools from different sources: the lack of interface standardization thus severely limits interoperability.

The frameworks surveyed in this work address interoperability by means of a common representation, a uniform framework-internal convention, with mappings from tool-specific input and output formats. Specifically, we will take an in-depth look at how the results of morpho-syntactic analysis are represented in (a) the DKPro Core component collection1 (Eckart de Castilho and Gurevych, 2014), (b) the Language Analysis Portal2 (LAP; Lapponi et al. (2014)), and (c) the Language Application (LAPPS) Grid3 (Ide et al., 2014a). These three systems all share the common goal of facilitating the creation of complex NLP workflows, allowing users to combine tools that would otherwise need input and output format conversion in order to be made compatible. While the programmatic interface of DKPro Core targets more technically inclined users, LAP and LAPPS are realized as web applications with a point-and-click graphical interface. All three have been under active development for the past several years and have, in contemporaneous, parallel work, designed and implemented framework-specific representations. These designs are rooted in related but interestingly different traditions; hence, our side-by-side discussion of these particular frameworks provides a good initial sample of observable commonalities and divergences.4

2 Terminological Definitions

A number of closely interrelated concepts apply to the discussion of design choices in the representations

1 https://dkpro.github.io/dkpro-core
2 https://lap.clarino.uio.no
3 https://www.lappsgrid.org
4 There are, of course, additional designs and workflow frameworks that we would ultimately hope to include in this comparison, as for example the representations used by CONCRETE, WebLicht, and FoLiA (Ferraro et al., 2014; Heid et al., 2010; van Gompel and Reynaert, 2013), to name just a few. However, some of these frameworks are at least abstractly very similar to representatives in our current sample, and also for reasons of space we need to restrict this in-depth comparison to a relatively small selection.


[Figure 1 shows the running example with five layers of annotation.
 Tokens:  The  Olympic  Committee  does  n't  regret  choosing  China  .
 POS:     DT   NNP      NNP        VBZ   RB   VB      VBG       NNP    .
 Lemmas:  the  Olympic  Committee  do    not  regret  choose    China  .
 Named entities: ORGANIZATION (Olympic Committee), COUNTRY (China)
 Syntactic dependency labels: det, nn, nsubj, aux, neg, xcomp, dobj, root
 Semantic dependency labels: BV, compound, ARG1, ARG2, top, neg]

Figure 1: Running example, in 'conventional' visualization, with five layers of annotation: syntactic dependencies and parts of speech, above; and lemmata, named entities, and semantic dependencies, below.

for linguistic annotations. Albeit to some degree intuitive, there is substantial terminological variation and vagueness, which in turn reflects some of the differences in overall annotation scheme design across projects and systems. Therefore, with due acknowledgement that no single, definitive view exists, we provide informal definitions of relevant terms as used in the sections that follow in order to ground our discussion.

Annotations  For the purposes of the current exercise we focus on annotations of language data in textual form, and exclude consideration of other media such as speech signals, images, and video. An annotation associates linguistic information such as morpho-syntactic tags, syntactic roles, and a wide range of semantic information with one or more spans in a text. Low-level annotations typically segment an entire text into contiguous spans that serve as the base units of analysis; in text, these units are typically sentences and tokens. The association of an annotation to spans may be direct or indirect, as an annotation can itself be treated as an object to which other (higher level) annotations may be applied.
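To make the direct vs. indirect anchoring of annotations concrete, here is a toy Python sketch of our own; it is not the data model of any of the three frameworks, and the token offsets simply follow the running example of Figure 1.

from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Annotation:
    # An annotation carries a label and arbitrary features, and is anchored
    # either directly to character offsets of the primary text or indirectly
    # to another annotation.
    label: str
    start: Optional[int] = None
    end: Optional[int] = None
    target: Optional["Annotation"] = None
    features: dict = field(default_factory=dict)

text = "The Olympic Committee does n't regret choosing China ."
token = Annotation("Token", start=4, end=11)                      # "Olympic"
pos = Annotation("POS", target=token, features={"value": "NNP"})  # annotation on an annotation
print(text[token.start:token.end], pos.features["value"])         # Olympic NNP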

Vocabulary The vocabulary provides an inventory of semantic entities (concepts) that form the building blocks of the annotations and the relations (links) that may exist between them (e.g. constituent or dependency relations, coreference, and others). The CLARIN Data Concept Registry [5] (formerly ISOcat) is an example of a vocabulary for linguistic annotations.

[5] https://openskos.meertens.knaw.nl/ccr/browser/

Schema A schema provides an abstract specification (as opposed to a concrete realization) of the structure of annotation information, by identifying the allowable relations among entities from the vocabulary that may be expressed in an annotation. A schema is often expressed using a diagrammatic representation such as a UML diagram or entity-relationship model, in which entities label nodes in the diagram and relations label the edges between them. Note that the vocabulary and the schema that uses it are often defined together, thus blurring the distinction between them, as for example in the LAPPS Web Service Exchange Vocabulary (see Section 3) or any UIMA CAS type system.

Serialization A serialization of the annotations is used for storage and exchange. Annotations following a given schema can be serialized in a variety of formats such as (basic) XML, database (column-based) formats, compact binary formats, JSON, ISO LAF/GrAF (Ide and Suderman, 2014; ISO, 2012), etc.

3 A Simple Example

In the following sections, we will walk through a simple English example with multiple layers of linguistic annotations. Figure 1 shows a rendering of our running example in a compact, graphical visualization commonly used in academic writing. However, even for this simple sentence, we will point out some hidden complexity and information left implicit at this informal level of representation.

For example, there are mutual dependencies among the various layers of annotation: parts of speech and lemmatization both encode aspects of token-level morphological analysis; syntactic dependencies, in turn, are plausibly interpreted as building on top of the morphological information; and finally, the semantic dependencies will typically be based on some or maybe all layers of morpho-syntactic analysis.


[Figure 2 content: DKPro Core feature structures for the tokens Olympic (4:11) and Committee (12:21), each with Token, PoS (PROPN), and Lemma annotations anchored via begin/end offsets and linked through pos and lemma features; an Organization annotation (value ORG, identifier http://dbpedia.org/page/International_Olympic_Committee); and two Dependency annotations (dependencyType NN, flavor basic; dependencyType compound, flavor enhanced) with governor and dependent links to the tokens.]

Figure 2: DKPro Core Zoom

In practice, on the other hand, various layers of annotation are often computed by separate tools, which may or may not take into account information from ‘lower’ analysis layers. In this respect, the visualization in Figure 1 gives the impression of a single, ‘holistic’ representation, even though it need not always hold that all analysis layers have been computed to be mutually consistent with each other. In Section 4 below, we will observe that a desire to make explicit the provenance of separate annotation layers can provide an important design constraint.

DKPro Core Type System The DKPro Core Type System extends the type system that is built into the UIMA [6] framework (Ferrucci and Lally, 2004). It provides types for many layers of linguistic analysis, such as segmentation, morphology, syntax, discourse, semantics, etc. Additionally, there are several types that carry metadata about the document being processed, about tagsets, etc.

[6] When talking about UIMA, we refer to the Apache UIMA implementation: http://uima.apache.org.

UIMA represents data using the Common Analysis System (CAS) (Götz and Suhre, 2004).

The CAS consists of typed feature structures organized in a type system that supports single inheritance. There are various serialization formats for the CAS. The most prominent is based on the XML Metadata Interchange specification (XMI) (OMG, 2002). However, there are also various binary serializations of the CAS with specific advantages, e.g. built-in compression for efficient network transfer.

The built-in types of UIMA are basic ones, such as TOP (a generic feature structure), Annotation (a feature structure anchored on text via start/end offsets), or SofA (short for ‘subject of analysis’, the signal that annotations are anchored on). A single CAS can accommodate multiple SofAs, which is useful in many cases: holding multiple translations of a document; holding a markup and a plain-text version; holding an audio signal and its transcription; etc. Annotation and all types inheriting from it are always anchored on a single SofA, but they may refer to annotations anchored on a different SofA, e.g. to model word alignment between translated texts. The CAS also allows for feature structures inheriting from TOP that are not anchored to a particular SofA.


Figure 2 shows a zoom on the sub-string Olympic Committee from Figure 1. All of the shown annotations are anchored on the text using offsets. The Token, PoS (PROPN), and Lemma annotations are each anchored on a single word. The named entity (Organization) annotation spans two words. The Dependency relation annotations are anchored by convention on the span of the dependent Token. Syntactic and semantic dependencies are distinguished via the flavor feature. All annotations (except Token) have a feature which contains the original output(s) of the annotation tool (value, posValue, coarseValue, dependencyType). The possible values for these original outputs are not specified in the DKPro Core type system. However, for the convenience of the user, DKPro Core supports so-called elevated types which are part of the type system and function as universal tags; for PoS tags, DKPro Core uses the Universal Dependencies PoS categories. Using mappings, the elevated type is derived from the original tool output; the Penn Treebank tag NNP, for example, is mapped to the type PROPN.
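To illustrate the elevation mechanism, the following minimal Python sketch shows the kind of tag-to-category mapping involved; the dictionary and function names are invented for exposition and do not correspond to DKPro Core's actual (Java-based) mapping resources.

```python
# Illustrative sketch only: DKPro Core realizes this mapping via its own
# resources and type system, not via this hypothetical Python code.
# Maps a few Penn Treebank PoS tags (tool output) to the coarse Universal
# Dependencies categories that an 'elevated' type such as PROPN would carry.
PTB_TO_UD = {
    "NNP": "PROPN",   # proper noun, singular
    "NNPS": "PROPN",  # proper noun, plural
    "NN": "NOUN",
    "VBZ": "VERB",
    "RB": "ADV",
    "DT": "DET",
}

def elevate(pos_value: str) -> str:
    """Return the coarse category for a tool-specific tag, or X if unknown."""
    return PTB_TO_UD.get(pos_value, "X")

if __name__ == "__main__":
    print(elevate("NNP"))  # PROPN
```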

LAP eXchange Format (LXF) Closely following the ISO LAF guidelines (Ide and Suderman, 2014), LXF represents annotations as a directed graph that references pieces of text; elements comprising the annotation graph and the base segmentation of the text are explicitly represented in LXF with a set of node, edge, and region elements. Regions describe text segmentation in terms of character offsets, while nodes contain (sets of) annotations (and optionally direct links to regions). Edges record relations between nodes and are optionally annotated with (non-linguistic) structural information. Nodes in LXF are typed, ranging for example over sentence, token, morphology, chunk, or dependency types. There are some technical properties common to all nodes (e.g. a unique identifier, sequential index, as well as its rank and receipt, as discussed below), and each node further provides a feature structure with linguistic annotations, where the particular range of features is determined by the node type.

Consider the example of Figure 1. Assuming sentence segmentation prior to word tokenization, the corresponding LXF graph comprises ten regions, one for the full sentence and one for each of the tokens. Figure 3 pictures the LXF version of Figure 2 described in the previous section. For sentence segmentation, LXF includes one node of type sentence, which contains links (represented as dashed edges in Figure 3) to the corresponding regions; similarly, token-typed nodes also directly link to their respective region, effectively treating sentence segmentation and word tokenization equally. If these two annotation types are obtained sequentially as part of an annotation workflow (i.e. the word tokenizer segments one sentence at a time), the LXF graph includes one directed edge from each token node to its sentence node, thus ensuring that the provenance of the annotation is explicitly modeled in the output representation.

Moving upwards to parts of speech, LXF for the example sentence includes one node (of type morphology) per PoS, paired with one edge pointing to the token node it annotates. Similarly to the sentence–token relationship, separate morphology nodes report the result of lemmatization, with direct edges to PoS nodes when lemmatization requires tagged input (or to token nodes otherwise). In general, running a new tool results in new LXF (nodes and) edges linking incoming nodes to the highest (i.e. topologically farthest from segmentation) annotation consumed. This holds true also for the nodes of type dependency in Figure 3; here each dependency arc from Figure 1 is ‘reified’ as a full LXF node, with an incoming and an outgoing edge each recording the directionality of the head–dependent relation. The named entity, in turn, is represented as a node of type chunk, with edges pointing to the nodes for PoS tags, reflecting that the named entity recognizer operated on tokenized and tagged input.
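As a rough illustration of this graph organization, the sketch below encodes a token, its PoS annotation, and the connecting edge as plain Python dictionaries; the field names are assumptions chosen to mirror Figure 3 and are not the actual LXF serialization used by LAP.

```python
# Schematic sketch of LXF-style graph elements for the token "Olympic",
# using plain Python dictionaries. Field names and values are illustrative
# assumptions, not the exact LXF records stored by LAP.
region = {"id": "r1", "start": 4, "end": 11}                      # character span

token_node = {
    "id": "n1", "type": "token", "receipt": "R3", "index": 1, "rank": 0,
    "features": {"form": "Olympic"},
    "links": ["r1"],                                              # direct link to its region
}

pos_node = {
    "id": "n2", "type": "morphology", "receipt": "R4", "index": 1, "rank": 0,
    "features": {"pos": "NNP"},
}

# Edge from the PoS node to the token node it annotates; edges carry only
# structural (non-linguistic) information.
edge = {"id": "e1", "source": "n2", "target": "n1"}

graph = {"regions": [region], "nodes": [token_node, pos_node], "edges": [edge]}
print(len(graph["nodes"]))  # 2
```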

LXF graph elements (including the annotated media and regions) are in principle serialization-agnostic, and currently implemented in LAP as a multitude of individual, atomic records in a NoSQL database. A specific annotation (sub-)graph, i.e. a collection of interconnected nodes and edges, in this approach is identified by a so-called receipt, essentially a mechanism for group formation. Each step in a LAP workflow consumes one or more receipts as input and returns a new receipt comprising additional annotations. Thus, each receipt uniquely identifies the set of annotations contributed by one tool, as reflected in the receipt properties on the nodes of Figure 3. LAP imposes a strict principle of monotonicity, meaning that existing annotations are never modified by later processing; rather, each tool adds its own, new layer of annotations (which could in principle ‘overlay’ or ‘shadow’ information from other layers).


[Figure 3 content: a region for the full sentence (0:52) and regions for the tokens Olympic (4:11) and Committee (12:21); a sentence node (string: The … China., R2@0#0); token nodes (form: Olympic, R3@1#0; form: Committee, R3@2#0); morphology nodes for parts of speech (pos: NNP, R4@1#0 and R4@2#0) and lemmata (lemma: Olympic, R6@1#0; lemma: Committee, R6@2#0); a chunk node (class: ORGANIZATION, R5@0#0); and dependency nodes (relation: nn, R7@2#0; relation: compound, R8@1#0).]

Figure 3: Excerpt from the LXF graph for the running example in Figure 1, zooming in on Olympic Committee. Segmentation regions are shown as dashed boxes, with character offsets for the two tokens and the full sentence, respectively. Nodes display their type (e.g. morphology), part of the feature structure containing the linguistic annotation (e.g. the part of speech), and their receipt, index, and rank properties (‘R4’, ‘@1’ and ‘#0’, respectively).

Therefore, for example, parallel runs of the same (type of) tool can output graph elements that co-exist in the same LXF graph but can each be addressed by their own receipt (see Section 4 below).

LAPPS Interchange Format (LIF) The LAPPS Grid exchanges annotations across web services using LIF (Verhagen et al., 2016), an instantiation of JSON-LD (JavaScript Object Notation for Linked Data), a format for transporting Linked Data using JSON, a lightweight, text-based, language-independent data interchange format for the portable representation of structured data. Because it is based on the W3C Resource Description Framework (RDF), JSON-LD is trivially mappable to and from other graph-based formats such as ISO LAF and UIMA CAS, as well as a growing number of formats implementing the same data model. JSON-LD extends JSON by enabling references to annotation categories and definitions in semantic-web vocabularies and ontologies, or any suitably defined concept identified by a URI. This allows for referencing linguistic terms in annotations and their definitions at a readily accessible canonical web location, and helps ensure consistent term usage across projects and applications.

For this purpose, the LAPPS Grid project provides a Web Service Exchange Vocabulary (WSEV; Ide et al. (2014b)), which defines a schema comprising an inventory of web-addressable entities and relations. [7]

Figure 4 shows the LIF equivalent of Figures 2 and 3 in the previous sections. Annotations in LIF are organized into views, each of which provides information about the annotation types it contains and what tool created the annotations. Views are similar to annotation ‘layers’ or ‘tasks’ as defined by several mainstream annotation tools and frameworks. For the full example in Figure 1, a view could be created for each annotation type in the order it was produced, yielding six consecutive views containing sentence boundaries, tokens, parts of speech and lemmas, named entities, syntactic dependencies, and semantic dependencies. [8] In Figure 4, a slightly simplified graph is shown with only three views, where token and part of speech information is bundled in one view and where lemmas and semantic relations are ignored.

[7] The WSEV links terms in the inventory to equivalent or similar terms defined elsewhere on the web.

[8] The last three views could be in a different order, depending on the sequence in which the corresponding tools were applied.


[Figure 4 content: a Token view (id=v0) with Token tk_1 (start 4, end 11, pos NNP) and Token tk_2 (start 12, end 21, pos NNP) anchored on Olympic (4:11) and Committee (12:21); a NamedEntity view (id=v1) with NamedEntity ne_0 (start 4, end 21, label Organization); and a Dependency view (id=v2) with a DependencyStructure (start 0, end 52, dependencies [dep_3]) and Dependency dep_3 (governor v0:tk_2, dependent v0:tk_1, label nn).]

Figure 4: Excerpt from the LIF graph for the running example in Figure 1, zooming in on Olympic Committee. Views are shown as dashed boxes and annotations as regular boxes with annotation types in bold and with other attributes as appropriate for the type. Arrows follow references to other annotations or to the source data.

A view is typically created by one processing component and will often contain all information added by that component. All annotations are in standoff form; an annotation may therefore reference a span (region) in the primary data, using character offsets, or it may refer to annotations in another view by providing the relevant ID or IDs. In the example, a named entity annotation in the Named Entity view refers to character offsets in the primary data, and the dependency annotation in the Dependency view refers to tokens in the Token view, using the Token view ID and the annotation IDs as defined in the Token view. [9]
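The following Python sketch mirrors the simplified view structure of Figure 4 as a plain dictionary; key names such as "views", "start", and "governor" are illustrative assumptions and do not reproduce the exact LIF JSON-LD schema or the WSEV vocabulary URIs.

```python
import json

# Illustrative, schematic LIF-like document mirroring Figure 4. The structure
# (standoff annotations grouped into views, cross-view references by
# view-qualified IDs) follows the description above; the key names are
# assumptions made for exposition.
lif_doc = {
    "text": "The Olympic Committee does n't regret choosing China .",
    "views": [
        {   # Token view: annotations anchored on character offsets
            "id": "v0",
            "metadata": {"producer": "some-tokenizer-and-tagger"},
            "annotations": [
                {"id": "tk_1", "type": "Token", "start": 4, "end": 11, "pos": "NNP"},
                {"id": "tk_2", "type": "Token", "start": 12, "end": 21, "pos": "NNP"},
            ],
        },
        {   # Dependency view: refers to tokens in view v0 by view-qualified IDs
            "id": "v2",
            "metadata": {"producer": "some-dependency-parser"},
            "annotations": [
                {"id": "dep_3", "type": "Dependency", "label": "nn",
                 "governor": "v0:tk_2", "dependent": "v0:tk_1"},
            ],
        },
    ],
}

print(json.dumps(lif_doc, indent=2)[:80])
```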

Preliminary Observations Each of the three representations has been created based on a different background and with a different focus. For example, LIF is coupled with JSON-LD as its serialization format and uses views to carry along the full analysis history of a document, including per-view provenance data. LXF uses explicit relations between annotations to model provenance and has strong support for multiple concurrent annotations and for managing annotation data persisted in a database. DKPro Core is optimized for ease of use and processing efficiency within analysis workflows and has rather limited support for concurrent annotations, provenance, and stable IDs.

Some technical differences between the three designs actually remain hidden in the abbreviated, diagrammatic representations of Figures 2 to 4.

[9] Note that multiple Token views can co-exist in the annotation set.

Abstractly, all three are directed graphs, but the relations between nodes in DKPro Core and LIF are established by having the identifier or reference of the target node as a feature value on a source node, whereas LXF (in a faithful rendering of the ISO LAF standard) actually realizes each edge as a separate, structured object. In principle, separate edge objects afford added flexibility in that they could bear annotations of their own—for example, it would be possible to represent a binary syntactic or semantic dependency as just one edge (instead of reifying the dependency as a separate node, connected to other nodes by two additional edges). However, Ide and Suderman (2014) recommend restricting edge annotations to non-linguistic information, and LXF in its current development status at least heeds that advice. Hence, the DKPro Core and LIF representations are arguably more compact (in the number of objects involved).
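The contrast can be made concrete with a small, hypothetical sketch: a dependency stored as feature values on a single object (as in DKPro Core or LIF) versus the same relation reified as a node plus two structural edges (as in LXF). The edge orientation shown is one possible convention, not necessarily the one LXF actually uses.

```python
# Hypothetical sketch contrasting the two encodings of one dependency
# ("compound" between tokens tk_1 and tk_2). Not actual LIF or LXF code.

# Feature-valued encoding: the relation is a single annotation object whose
# features point at the two tokens.
lif_dep = {"id": "dep_3", "label": "compound", "governor": "tk_2", "dependent": "tk_1"}

def reify(dep):
    """Turn a feature-valued dependency into a reified node plus two
    structural edges, as an LXF-style graph would represent it."""
    node = {"id": dep["id"], "type": "dependency",
            "features": {"relation": dep["label"]}}
    edges = [
        {"source": dep["governor"], "target": node["id"]},   # head -> relation node
        {"source": node["id"], "target": dep["dependent"]},  # relation node -> dependent
    ]
    return node, edges

node, edges = reify(lif_dep)
print(node["features"]["relation"], len(edges))  # compound 2 (one node, two edges)
```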

A broad view of the three approaches shows that at what we may regard as the ‘micro-level’, that is, the representation of individual annotations, differences in the schemas applied are largely irrelevant, as these are trivially mappable based on a common underlying (graph-based) model. At a higher level, however, different goals have led to divergences in the content and organization of the information that is sent from one tool to another in a workflow chain. In the following section, we consider these differences.

4 Pushing a Little Farther

While our above side-by-side discussion of ‘basic’ layers of morpho-syntactic annotations may seem to highlight more abstract similarity than divergence, in the following we discuss a few more intricate aspects of specific annotation design.


We expect that further study of such ‘corner cases’ may shed more light on inherent degrees of flexibility in a particular design, as well as on its scalability in annotation complexity.

Media–Tokenization Mismatches Tokenizers may apply transformations to the original input text that introduce character offset mismatches with the normalized output. For example, some Penn Treebank–compliant tokenizers normalize different conventions for quotation marks (which may be rendered as straight ‘typewriter’ quotes or in multi-character LaTeX-style encodings, e.g. " or ``) into opening (left) and closing (right) Unicode glyphs (Dridan and Oepen, 2012). To make such normalization accessible to downstream processing, it is insufficient to represent tokens as only a region (sub-string) of the underlying linguistic signal.

In LXF, the string output of tokenizers is recorded in the annotations encapsulated with each token node, which is in turn linked to a region recording its character offsets in the original media. LIF (which is largely inspired by ISO LAF, much like LXF) also records the token string and its character offsets in the original medium; LIF supports this via the word property on tokens. DKPro Core has also recently started introducing a TokenForm annotation optionally attached to Token feature structures to support this.
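Schematically, a token record therefore needs to carry both its anchoring in the raw signal and the normalized string, as in this illustrative sketch (plain Python; the field names are assumptions, not any framework's actual attribute names):

```python
# Illustrative sketch (not framework code): a token record keeping both the
# character offsets into the raw text and the tokenizer's normalized output,
# so that downstream tools can use the normalized form without losing anchoring.
raw_text = 'He said "hello" .'

token = {
    "start": 8,            # offsets of the opening straight quote in the raw text
    "end": 9,
    "form": "\u201c",      # normalized opening (left) Unicode quotation mark
}

assert raw_text[token["start"]:token["end"]] == '"'   # region in the signal
print(token["form"])                                   # what the tokenizer emitted
```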

Tokenizers may also return more than one token for the same region. Consider the Italian word del, which combines the preposition di and the definite article il. With both tokens anchored to the same character offsets, systems require more than reasoning over sub-string character spans to represent the linear order of tokens. LXF and LIF encode the ordering of annotations in an index property on nodes, trivializing this kind of annotation. DKPro Core presently does not support this.

Alternative Annotations and Ambiguity While relatively uncommon in the manual construction of annotated corpora, it may be desirable in a complex workflow to allow multiple annotation layers of the same type, or to record in the annotation graph more than the one-best hypothesis from a particular tool. Annotating text with different segmenters, for example, may result in diverging base units, effectively yielding parallel sets of segments. In our running example, the contraction don’t is conventionally tokenized as 〈do, n’t〉, but a linguistically less informed tokenization regime might also lead to the three-token sequence 〈don, ’, t〉 or just the full contraction as a single token.

In LXF, diverging segmentations originating from different annotators co-exist in the same annotation graph. The same is true for LIF, where the output of each tokenizer (if more than one is applied) exists in its own view with unique IDs on each token, which can be referenced by annotations in views added later. Correspondingly, alternative annotations (i.e. annotations of the same type produced by different tools) are represented with their own set of nodes and edges in LXF and their own views in LIF. The DKPro Core type system does not link tokens explicitly to the sentence but relies on span offsets to infer the relation. Hence, it is not possible to represent multiple segmentations on a single SofA. However, it is possible to have multiple SofAs with the same text and different segmentations within a single CAS.

A set of alternative annotations may also be produced by a single tool, for instance in the form of an n-best list of annotations with different confidence scores. In LXF, this kind of ambiguous analysis translates to a set of graph elements sharing the same receipt identifier, with increasing values of the rank property for each alternative interpretation. Again, the DKPro Core type system largely relies on span offsets to relate annotations to tokens (e.g. named entities). Some layers, such as dependency relations, also point directly to tokens. However, it is still not possible to maintain multiple sets of dependency relations in DKPro Core because each relation exists on its own and there is presently nothing that ties them together. The views in LIF are the output produced by any single run of a given tool over the data; therefore, in this case all the variants would be contained in a single view, and the alternatives would appear in a list of values associated with the corresponding feature (e.g. a list of PoS–confidence score pairs). Additionally, LIF provides a DependencyStructure, which can bind multiple dependency relations together and thus supports multiple parallel dependency structures even within a single LIF view.

Parallel Annotation At times it is necessary to have multiple versions of a text or multiple parallel texts during processing, e.g. when correcting mistakes, removing markup, or aligning translations. DKPro Core inherits from the UIMA CAS the ability to maintain multiple SofAs in parallel.


This feature of UIMA is for example used in the DKPro Core normalization framework, where SofaChangeAnnotations can be created on one view, stating that text should be inserted, removed, or replaced. These annotations can then be applied to the text using a dedicated component, which creates a new SofA that contains the modified text. Further processing can then happen on the modified text without any need for the DKPro Core type system or DKPro Core components to be aware of the fact that they operate on a derived text. The alignment information between the original text and the derived text is maintained such that the annotations created on the derived text can be transferred back to the original text.

The LXF and LIF designs support multiple layers or views with annotations, but both assume a single base text. In these frameworks, text-level edits or normalizations would have to be represented as ‘overlay’ annotations, largely analogous to the discussion of token-level normalization above.

Provenance Metadata describing the software used to produce annotations, as well as the rules and/or annotation scheme—e.g. tokenization rules, part-of-speech tagset—may be included with the annotation output. This information can be used to validate the compatibility of input/output requirements for tool sequences in a pipeline or workflow.

LIF provides all of this information in metadata appearing at the beginning of each view, consisting of a URI pointing to the producing software, the tagset or scheme used, and accompanying rules for identifying the annotation objects (where applicable).

The LXF principle of monotonicity in accumulating annotation layers is key to its approach to provenance. For our running example in Section 3 above, we assume that PoS tagging and lemmatization were applied as separate steps; hence, there are separate nodes for these (all of type morphology) and two distinct receipts. Conversely, if a combined tagger–lemmatizer had been used, its output would be recorded as a single layer of morphology nodes—yielding a different (albeit equivalent in linguistic content) graph structure.

The provenance support in DKPro Core is presently rather limited but also distinctly different from LXF or LIF. For a given type of annotation, e.g. PoS tags, the name of one creator component can be stored. This assumes that every type of annotation is produced by at most one component.

Additionally, whenever possible, DKPro Core extracts tagsets from the models provided with taggers, parsers, and similar components and stores these along with the model language, version, and name in a TagsetDescription. This even lists tags not output by the component.

5 Conclusions & Outlook

We have surveyed the differences and commonalities among three workflow analysis systems in order to move toward identifying what is needed to achieve greater interoperability among workflow systems for NLP. This preliminary analysis shows that while some basic elements have common models and are therefore easily usable by other systems, the handling of alternative annotations and the representation of provenance are among the primary differences in approach. This suggests that future work aimed at interoperability needs to address this level of representation as we attempt to move toward means to represent linguistically annotated data and achieve universal interoperability and accessibility.

In ongoing work, we will seek to overcome remaining limitations through (a) incremental refinement (working across the developer communities involved) that seeks to eliminate unnecessary, superficial differences (e.g. in vocabulary naming choices) and (b) further exploration of the relationships between distinct designs via the implementation of a bidirectional converter suite. Information-preserving round-trip conversion, on this view, would be a strong indicator of abstract design equivalence, whereas conversion errors or information loss in round-trip conversion might point either to contentful divergences or to room for improvement in the converter suite.

Acknowledgments

This work was funded in part by the European Union’s Horizon 2020 research and innovation programme (H2020-EINFRA-2014-2) under grant agreement No. 654021. It reflects only the authors’ views and the EU is not liable for any use that may be made of the information contained therein. It was further supported in part by the German Federal Ministry of Education and Research (BMBF) under the promotional reference 01UG1416B (CEDIFOR) and by the U.S. National Science Foundation grants NSF-ACI 1147944 and NSF-ACI 1147912. LAP development is supported by the Norwegian Research Council through the national CLARINO initiative, as well as by the Norwegian national computing and storage e-infrastructures, and the Department of Informatics at the University of Oslo.


We are grateful to everyone involved, in particular all taxpayers.

References

Rebecca Dridan and Stephan Oepen. 2012. Tokenization. Returning to a long solved problem. A survey, contrastive experiment, recommendations, and toolkit. In Proceedings of the 50th Meeting of the Association for Computational Linguistics, pages 378–382, Jeju, Republic of Korea, July.

Richard Eckart de Castilho and Iryna Gurevych. 2014. A broad-coverage collection of portable NLP components for building shareable analysis pipelines. In Proceedings of the Workshop on Open Infrastructures and Analysis Frameworks for HLT, pages 1–11, Dublin, Ireland.

Francis Ferraro, Max Thomas, Matthew R. Gormley, Travis Wolfe, Craig Harman, and Benjamin Van Durme. 2014. Concretely annotated corpora. In Proceedings of the AKBC Workshop at NIPS 2014, Montreal, Canada, December.

David Ferrucci and Adam Lally. 2004. UIMA. An architectural approach to unstructured information processing in the corporate research environment. Natural Language Engineering, 10(3-4):327–348, September.

Maarten van Gompel and Martin Reynaert. 2013. FoLiA. A practical XML format for linguistic annotation. A descriptive and comparative study. Computational Linguistics in the Netherlands Journal, 3:63–81.

Thilo Götz and Oliver Suhre. 2004. Design and implementation of the UIMA Common Analysis System. IBM Systems Journal, 43(3):476–489.

Ulrich Heid, Helmut Schmid, Kerstin Eckart, and Erhard W. Hinrichs. 2010. A corpus representation format for linguistic web services. The D-SPIN Text Corpus Format and its relationship with ISO standards. In Proceedings of the 7th International Conference on Language Resources and Evaluation, pages 494–499, Valletta, Malta.

Nancy Ide and Keith Suderman. 2014. The Linguistic Annotation Framework. A standard for annotation interchange and merging. Language Resources and Evaluation, 48(3):395–418.

Nancy Ide, James Pustejovsky, Christopher Cieri, Eric Nyberg, Denise DiPersio, Chunqi Shi, Keith Suderman, Marc Verhagen, Di Wang, and Jonathan Wright. 2014a. The Language Application Grid. In Proceedings of the 9th International Conference on Language Resources and Evaluation, Reykjavik, Iceland.

Nancy Ide, James Pustejovsky, Keith Suderman, and Marc Verhagen. 2014b. The Language Application Grid Web Service Exchange Vocabulary. In Proceedings of the Workshop on Open Infrastructures and Analysis Frameworks for HLT, Dublin, Ireland.

ISO. 2012. Language Resource Management. Linguistic Annotation Framework. ISO 24612.

Emanuele Lapponi, Erik Velldal, Stephan Oepen, and Rune Lain Knudsen. 2014. Off-road LAF: Encoding and processing annotations in NLP workflows. In Proceedings of the 9th International Conference on Language Resources and Evaluation, pages 4578–4583, Reykjavik, Iceland.

OMG. 2002. OMG XML metadata interchange (XMI) specification. Technical report, Object Management Group, Inc., January.

Marc Verhagen, Keith Suderman, Di Wang, Nancy Ide, Chunqi Shi, Jonathan Wright, and James Pustejovsky. 2016. The LAPPS Interchange Format. In Revised Selected Papers of the Second International Workshop on Worldwide Language Service Infrastructure - Volume 9442, WLSI 2015, pages 33–47, New York, NY, USA. Springer-Verlag New York, Inc.


TDB 1.1: Extensions on Turkish Discourse Bank

Deniz Zeyrek
Middle East Technical University

Ankara, [email protected]

Murathan Kurfalı
Middle East Technical University

Ankara, [email protected]

Abstract

In this paper we present the recent developments on Turkish Discourse Bank (TDB). We first summarize the resource and present an evaluation. Then, we describe TDB 1.1, i.e. enrichments on 10% of the corpus (namely, added senses for explicit discourse connectives and new annotations for implicit relations, entity relations and alternative lexicalizations). We explain the method of annotation and evaluate the data.

1 Introduction

The annotation of linguistic corpora has recently extended its scope from morphological or syntactic tagging to discourse-level annotation. Discourse annotation, however, is known to be highly challenging due to the multiple factors that make up texts (anaphors, discourse relations, topics, etc.). The challenge may become even more heightened depending on the type of text to be annotated, e.g. spoken vs. written, or texts belonging to different genres. Yet, discourse-level information is highly important for language technology, and it is more so for languages such as Turkish that are relatively less resource-rich when compared to European languages.

Given that systematically and consistently annotated corpora would help advance state-of-the-art discourse-level annotation, this paper aims to describe the methodology of enriching Turkish Discourse Bank, a multi-genre, 400,000-word corpus of written texts containing annotations for discourse relations in the PDTB style. Thus, the motivation of this paper is to contribute to the empirical analysis of Turkish at the level of discourse relations and to enable further LT applications on the corpus. The corpus can also be used by linguists, applied linguists and translators interested in Turkish or Turkic languages in general.

The rest of the paper proceeds as follows. §2 provides an overview of Turkish Discourse Bank, summarizes the linguistic decisions underlying the corpus and presents an evaluation of the corpus. §3 introduces TDB 1.1, explains the added annotations and how the data are evaluated. §4 shows the distribution of discourse relation types and presents a preliminary cross-linguistic comparison with similarly annotated corpora. Finally, §5 summarizes the study and draws some conclusions.

2 An Overview of Turkish Discourse Bank (TDB)

The current release of Turkish Discourse Bank, or TDB 1.0, annotates discourse relations, i.e. semantic relations that hold between text segments (expansion, contrast, contingency, etc.). Discourse relations (DRs) may be expressed by explicit devices or may be conveyed implicitly. Explicit discourse connecting devices (but, because, however) make a DR explicit. These will be referred to as discourse connectives in this paper. Even when a DR lacks an explicit connective, the sense can be inferred. In these cases, native speakers can add an explicit discourse connective to the text to support their inference. These have been known as implicit (discourse) relations. However, TDB 1.0 only annotates DRs with an explicit connective.

While sharing the goals and annotation principles of PDTB [1], TDB takes the linguistic characteristics of Turkish into account. Here we briefly review some of these characteristics, which have an impact on the annotation decisions (see §2.1 for more principles that guide the annotation procedure).

Turkish belongs to the Altaic language family, with SOV as the dominant word order, though it exhibits all other possible word orders. It is an agglutinating language with rich morphology.

[1] https://www.seas.upenn.edu/~pdtb/


Two of its characteristics are particularly relevant for this paper. Firstly, it is characterized (a) by clause-final function words, such as postpositions that select a verb with nominalization and/or case suffixes, and (b) by simple suffixes attached to the verb stem (termed converbs). These are referred to as complex and simplex subordinators, respectively (Zeyrek and Webber, 2008). Both types of subordinators largely correspond to subordinating conjunctions in English (see Ex. 1 for a complex subordinator için ‘for/in order to’ and the accompanying suffixes on the verb, and Ex. 2 for a converb, -yunca ‘when’, underlined). Only the independent part of the complex subordinators has been annotated so far.

(1) Gör-me-si için Ankara’ya gel-dik.
    see-NOM-ACC to to-Ankara came-we
    ‘For him to see her, we came to Ankara.’

(2) Kuru-yunca fırçala-yacağ-ım.
    dry-when brush-will-I
    ‘I will brush it when it dries.’

Secondly, Turkish is a null-subject language; the subject of a tensed clause is null as long as the text continues to talk about the same topic (Ex. 3).

(3) Ali her gün koşar. Sağlıklı yiyecekler yer.
    ‘Ali jogs every day. (He) maintains a healthy diet.’

We take postpositions (and converbs) as potential explicit discourse connectives and consider the null-subject property of the language as a signal for possible entity relations.

TDB adopts PDTB’s lexical approach to discourse as an annotation principle, which means that all discourse relations are grounded on a lexical element (Prasad et al., 2014). The lexically grounded approach applies not only to explicitly marked discourse relations but also to implicit ones; i.e., it necessitates annotating implicit DRs by supplying an explicit connective that would make the sense of the DR explicit, as in Ex. 4.

(4) ... bu çocuğun sınırsız bir düş gücü var. [IMP=bu yüzden] Sen bunu okulundan mahrum etme.
    ‘... the child has a vivid imagination. [IMP=for this reason] Don’t stop him from going to school.’

2.1 Principles that Guide Annotation

In TDB 1.0, explicit discourse connectives (DCs) are selected from three major lexical classes. This is motivated by the need to start from well-defined syntactic classes known to function as discourse connectives: (a) complex subordinators (postpositions, e.g. rağmen ‘despite’, and similar clause-final elements, such as yerine ‘instead of’), (b) coordinating conjunctions (ve ‘and’, ama ‘but’), and (c) adverbials (ayrıca ‘in addition’). TDB 1.0 also annotates phrasal expressions; these are devices that contain a postposition or a similar clause-final element taking a deictic item as an argument, e.g. buna rağmen ‘despite this’, as in Ex. 5 below. This group of connectives is morphologically and syntactically well-formed but not lexically frozen. Moreover, due to the presence of the deictic element in their composition, they are processed anaphorically. For these reasons, phrasal expressions, which are annotated separately in TDB 1.0, are merged with alternative lexicalizations in TDB 1.1 (see §3).

It is important to note that connectives may have a DC use as well as a non-DC use. The criterion to distinguish the DC/non-DC use is Asher’s (2012) notion of abstract objects (AOs) (events, activities, states, etc.). We take a lexical signal as a DC to the extent it relates text segments with an AO interpretation. The DC is referred to as the head of a DR; the text segments it relates are termed the arguments. We also adhere to the minimality principle (MP) of PDTB, a principle that applies to the length of text spans related by a DC. It means that annotators are required to choose an argument span that is minimally necessary for the sense of the relation (Prasad et al., 2014).

With the MP and the AO criterion in mind, the annotators went through the whole corpus searching for predetermined connectives one by one in each file, determining and annotating their DC use and leaving the non-DC use unannotated. Here, to annotate means that (explicit) DCs and phrasal expressions are tagged mainly for their predicate-argument structure, i.e. for their head (Conn) and two arguments (Arg1, Arg2), as well as the material that supplements them (Supp1, Supp2). [2]

[2] Following the PDTB principles, Arg2 is taken as the text segment that syntactically hosts the discourse connective; the other text segment is Arg1. The clause order of sentences with complex subordinators is Arg2-Arg1, while the other relations have the Arg1-Arg2 order. Supp1 and Supp2 stand for text segments that support the interpretation of an argument.


In the examples in the rest of the paper, Arg2 is shown in bold, Arg1 is rendered in italics, and the DC itself is underlined. Any null subjects are shown by parenthesized pronouns in the glosses.

(5) Çalışması gerekiyordu. Buna rağmen, üniversiteyi bırakmadı.
    ‘(She) had to work. Despite this, (she) did not quit university.’

2.2 Evaluation of TDB 1.0

TDB 1.0 has a total of 8483 annotations on 77 Conn types and 147 tokens, including coordinating conjunctions, complex subordinators, and discourse adverbials. However, it does not contain sense annotations; it does not annotate implicit DRs or entity relations; neither does it annotate alternative lexicalizations as conceived by the PDTB. The addition of these relations and their senses would enhance the quality of the corpus. Thus, this study describes an effort that involves the addition of new annotations to TDB 1.0, part of which involves sense-tagging of pre-annotated explicit DCs.

Before explaining the details about the enrichment of the corpus, we provide an evaluation of TDB 1.0. In earlier work, we reported the annotation procedure and the annotation scheme (Zeyrek et al., 2010) and provided inter-annotator agreement for complex subordinators and phrasal expressions (Zeyrek et al., 2013), but a complete evaluation of the corpus has not been provided. Table 1 presents inter-annotator agreement (IAA) for the connectives by syntactic type. We measured IAA by Fleiss’ Kappa (Fleiss, 1971) using words as the boundaries of the text spans selected by the annotators, as explained in Zeyrek et al. (2013).
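For reference, the statistic itself can be computed as in the following sketch; this is the textbook formulation of Fleiss' kappa over a subjects-by-categories count matrix, it abstracts away from the word-based span matching described above, and the toy counts are invented for illustration.

```python
# Textbook Fleiss' kappa (Fleiss, 1971) over an N-subjects x k-categories matrix
# of rating counts, with the same number of raters per subject. This sketch only
# illustrates the agreement statistic; it is not the span-based procedure used
# for the TDB evaluation.
def fleiss_kappa(counts):
    n_subjects = len(counts)
    n_raters = sum(counts[0])                       # raters per subject
    n_total = n_subjects * n_raters
    k = len(counts[0])
    # Proportion of all assignments falling into each category.
    p_j = [sum(row[j] for row in counts) / n_total for j in range(k)]
    # Per-subject observed agreement.
    p_i = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
           for row in counts]
    p_bar = sum(p_i) / n_subjects                   # mean observed agreement
    p_e = sum(p * p for p in p_j)                   # chance agreement
    return (p_bar - p_e) / (1 - p_e)

# Toy example: 4 items rated by 3 annotators into two categories (DC, non-DC).
ratings = [[3, 0], [2, 1], [0, 3], [3, 0]]
print(round(fleiss_kappa(ratings), 3))  # 0.625
```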

The agreement statistics for argument spans are important because they show how much the annotators agreed on the AO interpretation of a text span. Table 1 shows that overall, IAA for both arguments is > 0.7. Although this is below the commonly accepted threshold of 0.8, we take it as satisfactory for discourse-level annotation, which is highly challenging due to the ambiguity of coherence relations (Spooren and Degand, 2010).

[3] Some phrasal expressions are retrieved by the same search token as subordinators; thus, ‘Subord.’ indicates IAA for subordinators and phrasal expressions calculated jointly.

[4] ‘Subtotal’ represents the total of connectives for which IAA could be calculated; ‘IAA not avl.’ (not available) means IAA could not be calculated.

Conn. Syn. Type   DC     Non-DC   Arg1   Arg2
Coord.            3609   6947     0.78   0.83
Subord. [3]       3439   5154     0.75   0.80
Disc. Adv.        698    223      0.74   0.83
Subtotal [4]      7746   12324    0.76   0.82
IAA not avl.      737    903      -      -
TOTAL             8483   13227

Table 1: DC/Non-DC counts of connective types in TDB 1.0 (coordinators, complex subordinators, adverbials) and Fleiss’ Kappa IAA results for argument spans (Sevdik-Callı, 2015)

3 Creating TDB 1.1

Due to lack of resources, we built TDB 1.1 on 10% of TDB (40,000 words). We used the PDTB 2.0 annotation guidelines and the sense hierarchy therein (see fn. 1).

Four part-time working graduate students annotated the corpus in pairs. We trained them by going over the PDTB guidelines and the linguistic principles provided in §2.1. Each pair annotated 50% of the corpus using an annotation tool developed by Aktas et al. (2010). The annotation task took approximately three months, including adjudication meetings where we discussed the annotations and revised and/or corrected them where necessary.

3.1 Annotation Procedure

The PDTB sense hierarchy is based on four top-level (or level-1) senses (TEMPORAL, CONTINGENCY, COMPARISON, EXPANSION) and their second- and third-level senses. The annotation procedure involved two rounds. First, we asked the annotators to add senses to the pre-annotated explicit DCs and phrasal expressions. The annotators implemented this task by going through each file. In this way, they fully familiarized themselves with the predicate-argument structure of DCs in TDB 1.0, as well as the PDTB 2.0 sense hierarchy.

In the second round, the annotators first tagged alternative lexicalizations (AltLexs) independently of all other DRs in each file. Given that phrasal expressions could be considered a subset of PDTB-style AltLexs, this step ensured that TDB 1.1 not only includes phrasal expressions but various subtypes of AltLexs as well. Finally, the annotators identified and annotated implicit DRs and entity relations (EntRels) simultaneously in each file by searching for them within paragraphs and between adjacent sentences delimited by a full stop, a colon, a semicolon or a question mark.


Alternative Lexicalizations: This refers to cases which could be taken as evidence for the lexicalization of a relation. The evidence may be a phrasal expression (Ex. 5), or a verb phrase, as in Ex. 6:

(6) ... genç Marx, Paris’de Avrupa’nın en devrimci işçi sınıfı ile tanışır. Bu, onun düşüncesinin oluşmasında en önemli kilometre taşlarından birini teşkil eder.
    ‘... in Paris, young Marx meets Europe’s most revolutionary working class. This constitutes one of the most important milestones that shapes his thoughts.’

Entity Relations: In entity relations, the inferred relation between two text segments is based on an entity, where Arg1 mentions an entity and Arg2 describes it further. As mentioned in §2, a null subject in Arg2 (or in both Arg1 and Arg2) is often a sign of an EntRel (Ex. 7).

(7) Kerem ter içindeydi. “Kurtulamamışım demek,” diye mırıldandı.
    ‘Kerem was all sweaty. “So I was not set free,” (he) muttered.’

Implicit DRs: For the annotation of implicit DRs, we provided the annotators with an example explicit DC or a phrasal expression (in Turkish) for each level of the PDTB 2.0 sense hierarchy. We told the annotators to insert the example connective (or another connective of their choice if needed) between two sentences where they inferred an implicit DR (Ex. 4 above). While EntRels were only annotated for their arguments, AltLexs and implicit DRs required senses as well. While annotating the senses, the annotators were free to choose multiple senses where necessary.

3.2 Additional Sense Tags

To capture some senses we came across in Turkish, we added three level-2 senses to the top-level senses COMPARISON and EXPANSION.

COMPARISON: Degree. This sense tag captures the cases where one eventuality is compared to the other in terms of the degree to which it is similar to or different from the other eventuality. The label seemed necessary particularly to capture the sense conveyed by the complex subordinator kadar, which can be translated into English as ‘as ADJ/ADV as’ or ‘so ADJ/ADV that’. When kadar is used to compare two eventualities in terms of how they differ, Arg2 is a negative clause (Ex. 8). So far, this label has only been used to annotate explicit DRs.

(8) Tanınmayacak kadar değişmişti.
    ‘(He) changed so much that (he) could not be recognized.’

EXPANSION: Manner. This tag indicates the manner by which an eventuality takes place. [5] It was particularly needed to capture the sense of the pre-annotated complex subordinator gibi ‘as’, and the simplex subordinator -erek ‘by’, which we aim to annotate. So far, the Manner tag has only been used to annotate explicit DRs.

(9) Dediği gibi yaptı.
    ‘(S/he) did as (s/he) said (s/he) would.’

EXPANSION: Correction. The Correction tag is meant to capture the relations where an incorrect judgement or opinion gets corrected or rectified in the other clause. So far, the majority of Correction relations in TDB 1.1 are implicit. There are polysemous tokens (Ex. 10), as well as single-sense tokens (Ex. 11). These do not convey the PDTB chosen alternative sense (the sense where one of the alternatives replaces the other). For example, to insert onun yerine ‘instead of this’ in Ex. 11 would be odd (though this connective would fit Ex. 10). Although further research is needed, we predict that Correction relations are characterized by the negative marker of nominal constituents, değil (underlined), in Arg1.

(10) Ben yere bakmazdım. (IMP=ama ‘but’) Gözüne bakardım insanların. (Chosen alternative; Correction)
     ‘I wouldn’t look down. (I) would look into people’s eyes.’

(11) O olayları yaşayan ben değilim. (IMP=bilakis ‘to the contrary’) Benim yaşamım bambaşka. (Correction)
     ‘I am not the one who went through those events. My life is completely different.’

[5] The PDTB-3 sense hierarchy (Webber et al., 2016) introduces Expansion:Manner and Comparison:Similarity, among other sense tags. The PDTB Manner label conveys the same sense we wanted to capture. On the other hand, the PDTB label ‘Similarity’ is similar to Degree only to the extent that it conveys how two eventualities are similar. To the best of our knowledge, the Similarity label does not indicate comparison on the basis of how two things differ. Finally, we became aware of the revised PDTB sense hierarchy after we had started our annotation effort. We decided to continue with PDTB 2.0 labels (plus our new labels) for consistency.


3.3 Annotation Evaluation

TDB 1.1 was doubly annotated by annotators who were blind to each other’s annotations. To determine the disagreements, we calculated IAA regularly by the exact match method (Miltsakaki et al., 2004). At regular adjudication meetings involving all the annotators and the project leader, we discussed the disagreements and created an agreed set of annotations with a unanimous decision.

We measured two types of IAA: type agreement (the extent to which annotators agree on a certain DR type) and sense agreement (agreement/disagreement on sense identity for each token). For the senses added to the pre-annotated explicit DCs and phrasal expressions, we only calculated sense agreement. For the new relations, we measured both type agreement and sense agreement. This was done in two steps. Following Forbes-Riley et al. (2016), in the first step we measured type agreement. Type agreement is defined as the number of common DRs over the number of unique relations, where all discourse relations are of the same type. For example, assume annotator1 produced 12 implicit discourse relations for a certain text whereas annotator2 produced 13, where the total number of unique discourse relations was 15 and the common annotations 11. In this case, type agreement is 73.3%. Then, we calculated sense agreement among the common annotations using the exact match method [6] (see Table 2 and Table 3 below).
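The type-agreement measure can be summarized in a few lines of Python; the sketch below simply divides the number of common annotations by the number of unique relations, with the worked numbers taken from the example in the text (the relation identifiers in the second function are hypothetical).

```python
# Minimal sketch of the type-agreement measure (Forbes-Riley et al., 2016) as
# described above: common discourse relations divided by unique relations.
def type_agreement(n_common, n_unique):
    return n_common / n_unique

# Worked example from the text: 11 common annotations over 15 unique relations.
print(round(type_agreement(11, 15) * 100, 1))  # 73.3

# With full annotation sets, the same measure can be computed from sets of
# relation identifiers (hypothetical keys, e.g. argument-span signatures).
def type_agreement_sets(rels1, rels2):
    return len(set(rels1) & set(rels2)) / len(set(rels1) | set(rels2))
```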

Relation   Type Agreement
Implicit   33.4%
AltLex     72.6%
EntRel     79.5%

Table 2: IAA results for type agreement in TDB 1.1

Sense     Explicit   Implicit   AltLex
Level-1   88.4%      85.7%      93.9%
Level-2   79.8%      78.8%      79.5%
Level-3   75.9%      73.1%      73.4%

Table 3: IAA results for sense agreement in TDB 1.1

According to Table 2, the type agreement for AltLexs and EntRels is satisfactory (> 0.7), but implicit DRs display too low a type agreement. Due to this low score, we evaluated the reliability of the gold standard implicit relations: one year after TDB 1.1 was created, we asked one of our four annotators to annotate the implicit DRs (both for type and sense) by going through the 50% of the corpus he had not annotated before.

[6] Since no sense tag is assigned to EntRels, for them only type agreement is calculated.

Type       TDB 1.1        PDTB 2.0         Hindi DRB
Explicit   800 (43.1%)    18459 (45.4%)    189 (31.4%)
Implicit   407 (21.9%)    16224 (39.9%)    185 (30.7%)
AltLex     108 (5.8%)     624 (1.5%)       37 (6.15%)
EntRel     541 (29.1%)    5210 (12.8%)     140 (23.2%)
NoRel      -              254 (0.6%)       51 (8.4%)
TOTAL      1,856          40,600           602

Table 4: Cross-linguistic comparison of DR types. The numbers in parentheses indicate the ratio of DR tokens.

He searched for and annotated implicit DRs between adjacent sentences within paragraphs, skipping other kinds of relations. This procedure is different from the earlier one, where we asked the annotators to annotate EntRels and implicit DRs simultaneously in each file. We also told the annotator to pay attention to the easily confused implicit EXPANSION:Restatement:specification relations and EntRels. (We stressed that in the former, one should detect an eventuality being further talked about, rather than an entity as in the latter.)

Then, we assessed intra-rater agreement between the annotator’s annotations and the gold standard data. In this way, we reached a score of 72.9% for type agreement on implicit DRs. [7] This result shows that implicit DRs have been consistently detected in the corpus; in addition, it suggests that annotating implicit DRs independently of EntRels is a helpful annotation procedure.

Table 3 shows that for explicit DCs, the IAA results for all sense levels are > 0.7, indicating that the senses were detected consistently. Similarly, the sense agreement results for implicit DRs and AltLexs for all sense levels are > 0.7, corroborating the reliability of the guidelines.

4 Distribution of Discourse Relation Types

This section offers a preliminary cross-linguistic comparison. It presents the distribution of discourse relation types in TDB 1.1 and compares them with PDTB 2.0 (Prasad et al., 2014) and the Hindi Discourse Relation Bank (Oza et al., 2009), which also follows the PDTB principles (Table 4).

[7] Intra-rater agreement between the implicit relation sense annotations of the annotator and the gold standard data is also satisfactory, i.e. > 0.7 for all sense levels (Level-1: 87.5%, Level-2: 79.3%, Level-3: 74.6%). We calculated sense agreement in the same way as explained throughout the current section.


It is known that implicit relations abound in texts; thus, it is important to reveal the extent of implicitation in discourse-annotated corpora. Table 4 indicates that in TDB 1.1, explicit DRs are highest in number, followed by EntRels and implicit DRs. The ratio of explicit DRs to implicit DRs is 1.96. This ratio is 1.13 for PDTB 2.0 and 1.02 for Hindi DRB. That is, among the corpora represented in the table, TDB displays the largest difference in terms of the explicit–implicit split. However, it is not possible at this stage to generalize the results of this cross-linguistic comparison to tendencies at the discourse level. TDB 1.1 does not annotate simplex subordinators and leaves implicit VP conjunctions out of scope. Thus, when these are annotated, the ratio of explicit to implicit DRs would change. Issues related to the distribution of explicit and implicit relations across genres also need to be investigated. We leave these matters for further research.
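The ratios reported here follow directly from the counts in Table 4, as the short sketch below verifies (values are shown unrounded to three decimals; the text truncates them to two).

```python
# Explicit-to-implicit ratios computed from the counts in Table 4.
counts = {                      # (explicit DRs, implicit DRs)
    "TDB 1.1":   (800, 407),
    "PDTB 2.0":  (18459, 16224),
    "Hindi DRB": (189, 185),
}
for corpus, (explicit, implicit) in counts.items():
    print(corpus, round(explicit / implicit, 3))
# TDB 1.1 1.966, PDTB 2.0 1.138, Hindi DRB 1.022
```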

5 Conclusion

We presented an annotation effort on 10% of Turkish Discourse Bank 1.0, resulting in an enriched corpus called TDB 1.1. We described how PDTB principles were implemented or adapted, and presented a complete evaluation of TDB 1.1 as well as TDB 1.0, which had not been provided before. The evaluation procedure of TDB 1.1 involved measuring inter-annotator agreement for all relations and assessing intra-annotator agreement for implicit relations. The agreement statistics are overall satisfactory. While inter-annotator agreement measurements show the reliability of the annotations (and hence the re-usability of the annotation guidelines), intra-rater agreement results indicate the reproducibility of gold standard annotations by an experienced annotator. Using the same methodology, we aim to annotate a larger part of the TDB, including attribution and no relations, in the future.

Acknowledgements We would like to thank our anonymous reviewers for their useful comments. We also thank METU Project Funds (BAP-07-04-2015-004) for their support.

References

Berfin Aktas, Cem Bozsahin, and Deniz Zeyrek. 2010. Discourse relation configurations in Turkish and an annotation environment. In Proc. of the 4th Linguistic Annotation Workshop, pages 202–206. ACL.

Nicholas Asher. 2012. Reference to abstract objects in discourse, volume 50. Springer Science & Business Media.

Joseph L. Fleiss. 1971. Measuring nominal scale agreement among many raters. Psychological Bulletin, 76(5):378.

Kate Forbes-Riley, Fan Zhang, and Diane Litman. 2016. Extracting PDTB discourse relations from student essays. In Proc. of SIGDIAL, pages 117–127.

Eleni Miltsakaki, Rashmi Prasad, Aravind K. Joshi, and Bonnie L. Webber. 2004. The Penn Discourse Treebank. In LREC.

Umangi Oza, Rashmi Prasad, Sudheer Kolachina, Dipti Misra Sharma, and Aravind Joshi. 2009. The Hindi Discourse Relation Bank. In Proc. of the 3rd Linguistic Annotation Workshop, pages 158–161. Association for Computational Linguistics.

Rashmi Prasad, Bonnie Webber, and Aravind Joshi. 2014. Reflections on the Penn Discourse Treebank, comparable corpora, and complementary annotation. Computational Linguistics.

Ayısıgı Sevdik-Callı. 2015. Assessment of the Turkish Discourse Bank and a Cascaded Model to Automatically Identify Discursive Phrasal Expressions in Turkish. Ph.D. thesis, Middle East Technical University.

Wilbert Spooren and Liesbeth Degand. 2010. Coding coherence relations: Reliability and validity. Corpus Linguistics and Linguistic Theory, 6(2):241–266.

Bonnie Webber, Rashmi Prasad, Alan Lee, and Aravind Joshi. 2016. A discourse-annotated corpus of conjoined VPs. In Proc. of the 10th Linguistic Annotation Workshop, pages 22–31.

Deniz Zeyrek and Bonnie L. Webber. 2008. A discourse resource for Turkish: Annotating discourse connectives in the METU corpus. In IJCNLP, pages 65–72.

Deniz Zeyrek, Isın Demirsahin, Ayısıgı Sevdik-Callı, Hale Ogel Balaban, Ihsan Yalcinkaya, and Umit Deniz Turan. 2010. The annotation scheme of the Turkish Discourse Bank and an evaluation of inconsistent annotations. In Proc. of the 4th Linguistic Annotation Workshop, pages 282–289. Association for Computational Linguistics.

Deniz Zeyrek, Isın Demirsahin, Ayısıgı Sevdik-Callı, and Ruket Cakıcı. 2013. Turkish Discourse Bank: Porting a discourse annotation style to a morphologically rich language. Dialogue and Discourse, 4(2):174–184.


Two Layers of Annotation for Representing Event Mentions in News Stories

Maria Pia di Buono1, Martin Tutek1, Jan Snajder1, Goran Glavas2, Bojana Dalbelo Basic1, Natasa Milic-Frayling3

1 TakeLab, Faculty of Electrical Engineering and Computing, University of Zagreb, Croatia
2 Data and Web Science Group, University of Mannheim
3 School of Computer Science, University of Nottingham, United Kingdom

Abstract

In this paper, we describe our preliminary study of methods for annotating event mentions as part of our research on high-precision models for event extraction from news. We propose a two-layer annotation scheme, designed to capture the functional and the conceptual aspects of event mentions separately. We hypothesize that the precision can be improved by modeling and extracting the different aspects of news events separately, and then combining the extracted information by leveraging the complementarities of the models. We carry out a preliminary annotation using the proposed scheme and analyze the annotation quality in terms of inter-annotator agreement.

1 Introduction

The task of representing events in news stories and the way in which they are formalized, namely their linguistic expressions (event mentions), is interesting from both a theoretical and a practical perspective. Event mentions can be analyzed from various aspects; two aspects that emerge as particularly interesting are the linguistic aspect and the more practical information extraction (IE) aspect.

As far as the linguistic aspect is concerned, news reporting is characterized by specific mechanisms and requires a specific descriptive structure. Generally speaking, such mechanisms convey non-linear temporal information that complies with news values rather than narrative norms (Setzer and Gaizauskas, 2000). In fact, unlike traditional storytelling, news writing follows the "inverted pyramid" mechanism, which consists of introducing the main information at the beginning of an article and pushing other elements to the margin, as shown in Table 1 (Ingram and Henshall, 2008). Besides, news texts use a mechanism of gradual specification of event-related information, entailing a widespread use of coreference relations among the textual elements.

On the other hand, the IE aspect is concerned with the information that can be automatically acquired from news story texts, to allow for more efficient processing, retrieval, and analysis of the massive news data nowadays available in digital form.

In this paper, we describe our preliminary study on annotating event mention representations in news stories. Our work rests on two main assumptions. The first assumption is that events in news substantially differ from events in other texts, which warrants the use of a specific annotation scheme for news events. The second assumption is that, because news events can be analyzed from different aspects, it also makes sense to use different annotation layers for the different aspects. To this end, in this paper we propose a two-layer annotation scheme, designed to capture the functional and the conceptual aspects of event mentions separately. In addition, we carry out a preliminary annotation using the proposed scheme and analyze the annotation quality in terms of inter-annotator agreement.

The study presented in this paper is part of our research on high-precision models for event extraction from news. We hypothesize that the precision can be improved by modeling and extracting the different aspects of news events separately, and then combining the extracted information by leveraging the complementarities of the models.


Narrative: When electricians wired the home of Mrs Mary Ume in Hohola, Port Moresby, some years ago they neglected to install sufficient insulation at a point in the laundry where a number of wires crossed. A short-circuit occurred early this morning. Contact between the wires is thought to have created a spark, which ignited the walls of the house. The flames quickly spread through the entire house. Mrs Ume, her daughter Peni (aged ten) and her son Jonah (aged five months) were asleep in a rear bedroom. They had no way of escape and all perished.

News: A Port Moresby woman and her two children died in a house fire in Hohola today. Mrs Mary Ume, her ten-year-old daughter Peni and baby son Jonah were trapped in a rear bedroom as flames swept through the house. The fire started in the laundry, where it is believed faulty electrical wiring caused a short-circuit. The family were asleep at the time. The flames quickly spread and soon the entire house was blazing.

Table 1: An example of narrative and news styles (Ingram and Henshall, 2008).

As a first step towards that goal, in this paper we carry out a preliminary comparative analysis of the proposed annotation layers.

The rest of the paper is structured as follows. In the next section we briefly describe the related work on representing and annotating events. In Section 3 we present the annotation methodology. In Section 4 we describe the annotation task, while in Section 5 we discuss the results. In Section 6 we describe the comparative analysis. Section 7 concludes the paper.

2 Related Work

Several definitions of events have been proposed in the literature, including that from the Topic Detection and Tracking (TDT) community: "a TDT event is defined as a particular thing that happens at a specific time and place, along with all necessary preconditions and unavoidable consequences" (TDT, 2004). On the other hand, the ISO TimeML Working Group (Pustejovsky et al., 2003) defines an event as "something that can be said to obtain or hold true, to happen or to occur."

On the basis of such definitions, different approaches have been developed to represent and extract events and those aspects considered representative of event factuality.

In recent years, several communities have proposed shared tasks aimed at evaluating event annotation systems, mainly devoted to recognizing event factuality or specific aspects related to factuality representation (e.g., temporal annotation), or devoted to annotating events in a specific language, e.g., the Event Factuality Annotation Task presented at EVALITA 2016, the first evaluation exercise for factuality profiling of events in Italian (Minard et al., 2016b).

Among the communities working in this field, the TimeML community provides a rich specification language for event and temporal expressions aiming to capture different phenomena in event descriptions, namely "aspectual predication, modal subordination, and an initial treatment of lexical and constructional causation in text" (Pustejovsky et al., 2003).

Besides the work at these shared tasks, several authors have proposed different schemes for event annotation, considering both the linguistic level and the conceptual one. The NewsReader Project (Vossen et al., 2016; Rospocher et al., 2016; Agerri and Rigau, 2016) is an initiative focused on extracting information about what happened to whom, when, and where, processing a large volume of financial and economic data. Within this project, in addition to description schemes (e.g., ECB+ (Cybulska and Vossen, 2014a)) and a multilingual semantically annotated corpus of Wikinews articles (Minard et al., 2016a), van Son et al. (2016) propose a framework for annotating perspectives in texts using four different layers, i.e., events, attribution, factuality, and opinion. In the NewsReader Project the annotation is based on guidelines for detecting and annotating markables and relations among markables (Speranza and Minard, 2014). In the detection and annotation of markables, the authors distinguish between entities and entity mentions in order to "handle both the annotation of single mentions and of the coreference chains that link several mentions to the same entity in a text" (Losch and Nikitina, 2009). Entities and entity mentions are then connected by the REFER TO link.

Another strand of research consists of the conceptual schemes, rooted in formal ontologies. Several upper ontologies for annotating events have been developed, e.g., the EVENT Model F (Scherp et al., 2009). This ontology represents events by solving two competency questions1 about the participants in the events and the previous events that caused the event in question. EVENT Model F is based on the foundational ontology DOLCE+DnS Ultralite (DUL) (Gangemi et al., 2002) and it focuses on the participants involved in the event and on mereological, causal, and correlative relationships between events.

Most of the proposed ontologies are tailored to financial or economic domains. A case in point is the newsEvents ontology, a conceptual scheme for describing business events (Losch and Nikitina, 2009).

3 Methodology

Our methodology arises from the idea that events in news call for a representation that is different from event representations in other texts. We believe that a coherent and consistent description and, subsequently, extraction of event mentions in news stories should not only convey temporal information (When), but also distinguish other information related to the action (What), the participants (Who), the location (Where), the motivation (Why) and the manner in which the event happened (How). This means that a meaningful news/event description should cover the proverbial 5Ws and one H, regarded as basic in information gathering, providing a factual answer for all these aspects.

The above assumption implies that events cannot be considered black boxes or monolithic blocks describable merely by means of the temporal chain description. Instead, it is necessary to capture the functional and conceptual aspects of event mentions. Indeed, as previously claimed, the language used in news stories is characterized by mechanisms that differ from narrative ones. Such differences may manifest themselves both in the syntactic structures and in the patterns of discursive features that affect the sentence structure.

In line with the above, our approach aims at accomplishing a fine-grained description of event mentions in news stories by applying a two-layer annotation scheme. The first layer conveys the different syntactic structures of sentences, accounting for the functional aspects and the components in events on the basis of their role.

1 Competency questions refer to natural language sentences that express the scope and content of an ontology. Answering these questions represents a functional requirement for an ontology (Uschold and Gruninger, 1996).

As noted by Papafragou (2015), "information about individual event components (e.g., the person being affected by an action) or relationships between event components that determine whether an event is coherent can be extracted rapidly by human viewers". On the other hand, the second layer is also suitable for recognizing the general topic or theme that underlies a news story, since this layer concerns conceptual aspects. This theme can be described as a "semantic macro-proposition", namely a proposition composed of the sequences of propositions retrievable in the text (Van Dijk, 1991). Thus, the conceptual scheme makes it possible to recognize these structures, reducing the complexity of the information and guaranteeing a summarization process that is closer to users' representation.

3.1 Functional Layer

Following the previously mentioned broad definition of an event in news as something that happens or occurs, in the functional annotation layer we focus on the lower-level representation of events, closer to the linguistic level.

We represent each event with an event action and a variable number of arguments of different (sub)categories. The event action is most commonly the verb associated with the event (e.g., "destroyed", "awarded"); however, it can also be other parts of speech (e.g., "explosion") or a multiword expression (e.g., "give up"). The action defines the focus of the event, answers the "What happened" question, and is the main part of an event mention.

Along with the event action, we define four main categories of event arguments to be annotated, which are then split into fine-grained subcategories, as shown in Table 2. We subcategorize the standard Participant category into the AGENT, PATIENT, and OTHERPARTICIPANT subcategories. We further divide each of the aforementioned subcategories into HUMAN and NONHUMAN subcategories. The AGENT subcategory pertains to the entities that perform an action either deliberately (usually in the case of human agents) or mindlessly (natural causes, such as earthquakes or hurricanes). The PATIENT is the entity that undergoes the action and, as a result of the action, changes its state. The TIME and LOCATION categories serve to further specify the event.


Category          Subcategories
PARTICIPANT       AGENT (HUMAN, NONHUMAN); PATIENT (HUMAN, NONHUMAN); OTHERPARTICIPANT (HUMAN, NONHUMAN)
LOCATION          GEOLOCATION; OTHER
TIME              -
OTHERARGUMENT     -

Table 2: Functional categories and subcategories.

Finally, the OTHERARGUMENT category covers themes and instruments (Baker, 1997; Jackendoff, 1985) of an action. Table 3 gives an example of the sentence "Barcelona defeated Real Madrid yesterday at Camp Nou" annotated using the functional layer.

The action's arguments focus on the specifics of the event that occurred. We depart from the standard arguments that can be found in schemes like ECB+ (Cybulska and Vossen, 2014a) or TimeML (Pustejovsky et al., 2003) in that we included the OTHERARGUMENT category. Furthermore, in TimeML, predicates related to states or circumstances are considered as events, while in the scope of this work, sentences describing a state, e.g., "They live in Maine", are not annotated. In fact, we argue that they do not represent the focus in news, but merely describe the situation surrounding the event.

Our functional annotation differs from PropBank (Palmer et al., 2005) definitions of semantic roles in that we do not delineate our functional roles through a verb-by-verb analysis. More concretely, PropBank adds predicate-argument relations to the syntactic trees of the Penn Treebank, representing these relations as framesets, which describe the different sets of roles required for the different meanings of the verb. In contrast, our analysis aims to describe the focus of an event mention by means of identifying actions, which can involve other lexical elements in addition to the verb. This is easily demonstrated through the example "fire broke out" from Figure 2a, where we annotate "fire broke out" as an action, since it fully specifies the nature of the event, characterizing the action more specifically than the verb alone.

Text span       Label
Barcelona       AGENT
defeated        ACTION
Real Madrid     PATIENT
yesterday       TIME
at Camp Nou     GEOLOCATION

Table 3: Sample of functional event annotation.
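To make the functional layer concrete, the following sketch encodes the Table 3 example as a small Python structure. The class and field names are ours, chosen for illustration; they are not part of the released scheme or its tooling.

```python
from dataclasses import dataclass, field

@dataclass
class EventMention:
    """Functional-layer annotation: one event action plus typed argument spans."""
    action: str                                     # answers "What happened?"
    arguments: dict = field(default_factory=dict)   # label -> text span

# "Barcelona defeated Real Madrid yesterday at Camp Nou" (cf. Table 3)
mention = EventMention(
    action="defeated",
    arguments={
        "AGENT": "Barcelona",
        "PATIENT": "Real Madrid",
        "TIME": "yesterday",
        "GEOLOCATION": "at Camp Nou",
    },
)

print(mention.action, mention.arguments)
```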

Entity classes: PERSON, ORGANIZATION, ANIMAL, OBJECT, PLACE, TIME, MANIFESTATION

Properties: IDENTITY, MOVEMENT, ASSOCIATION, LOCATION, PARTICIPATION, CAUSE, OCCURRENCE, PERFORMANCE, ATTRIBUTION, INFLUENCE, CONSTRUCTION, SUPPORT, CREATION, PURPOSE, MODIFICATION, CLASSIFICATION, DECLARATION, DEATH

Table 4: Main entity classes and properties.

3.2 Conceptual Layer

In order to represent semantically meaningful event mentions and, consequently, to develop an ontology of the considered domain, we also define a second layer of annotation, namely a conceptual model for news stories. This model, putting forward a classification of the main concepts retrievable in news stories, defines seven entity classes, six entity subclasses, and eighteen properties (Table 4).

Entities and properties. Entity classes are defined in order to represent sets of different individuals sharing common characteristics. Thus, being representative of concepts in the domain, entities may be identified by noun phrases. On the other hand, properties describe the relations that link entity classes to each other and can be represented by the verb phrase. For this reason, each property is associated with association rules that specify the constraints related to its syntactic behavior and to the pertinence and the intension of the property itself. In other words, these association rules contribute to the description of the way in which entity classes can be combined through properties in sentence contexts. To formalize such rules in the form of a set of axioms, we take into consideration the possibility of combining semantic and lexical behaviors, suitable for identifying specific event patterns. Thus, for instance, the property MOVEMENT may connect the entity class PERSON and the entity classes PLACE and TIME, but the same property cannot be used to describe the relation between MANIFESTATION and PLACE. The definition of these rules, and the corresponding axioms, relies on word combination principles that may occur in a language, derived from an analysis of the work of Harris (1988), and on conceptual considerations related to the domain.

Factuality. To represent the factuality in event descriptions, we specify three attributes for each property: polarity, speculation, and passive markers. The polarity refers to the presence of an explicit negation of the verb phrase or the property itself. The speculation attribute identifies something that is characterized by speculation or uncertainty. Such an attribute is associated with the presence of some verbal aspects (e.g., the passive applied to specific verbs, as in they were thought to be), some specific constructions/verbs (e.g., to suggest, to suppose, to hypothesize, to propose) or modality verbs. According to Hodge and Kress (1988), "modality refers to the status, authority and reliability of a message, to its ontological status, or to its value as truth or fact". Finally, we use an attribute for a passive marker due to the fact that passive voice is used mainly to indicate a process and can be applied to infer factual information. Note that, although the time marker is typically considered to be indicative of factuality, we prefer to avoid annotating time markers in our schema. Thus, we infer the temporal chain in event mentions by means of both temporal references in the sentence, e.g., the presence of adverbs of time, and the syntactic tense of the verb.

Coreference. To account for the coreference phenomenon among entities, we introduce a symmetric-transitive relation taking two entity classes as arguments. This allows for the annotation of two types of coreference, identity and apposition, and can be used at the inter-sentence level to annotate single or multiple mentions of the same entity; an example is shown in Table 5.
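Because the relation is symmetric and transitive, pairwise coreference links can be collapsed into chains (equivalence classes) with a standard union-find pass. The sketch below is our own illustration of that step, not part of the annotation tooling.

```python
def coreference_chains(mentions, links):
    """Collapse symmetric-transitive pairwise links into coreference chains.

    mentions: list of mention ids; links: iterable of (id_a, id_b) pairs.
    """
    parent = {m: m for m in mentions}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    for a, b in links:
        parent[find(a)] = find(b)           # union the two chains

    chains = {}
    for m in mentions:
        chains.setdefault(find(m), []).append(m)
    return list(chains.values())

# e.g., the two coreferring MANMADEEVENT mentions from Table 5
print(coreference_chains(["Blasts", "which", "TAK"], [("Blasts", "which")]))
# -> [['Blasts', 'which'], ['TAK']]
```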

Complex events. In the description of event mentions in news stories we often encounter sentence structures expressing complex events, i.e., events characterized by the presence of more than one binary relation among their elements.

Text span    Label
Blasts       MANMADEEVENT*
which        MANMADEEVENT*
killed       DEATH
38           COLLECTIVE
by stadium   PLACE
claimed by   DECLARATION (passive)
TAK          ORGANIZATION

Table 5: Sample of coreference and attribute annotation (* denotes coreferring elements).

Due to the fact that properties generally express binary relationships between two entity classes, we introduce N-ary relations, namely reified relations, in order to describe these complex structures. The reified relations allow for the description of complex events composed of more than two entities and one property. According to the recommendation of the W3C,2 these additional elements, which contribute to constituting complex events, can be formalized as a value of the property or as other arguments (entity classes) occurring in the sentence. In our scheme, we decided to handle some of these reified relations by creating three additional entity classes – MANNER, SCOPE, and INSTRUMENT – which may hold heterogeneous elements. Nevertheless, these elements present a shared intensive property defined by the main property they refer to.
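As an illustration only, the sketch below shows one way such a reified relation could be serialized, in the spirit of the W3C n-ary relation pattern; the node structure and key names are ours, not the schema's file format.

```python
# A complex event is reified as a node of its own, so that more than two
# entities (plus the MANNER / SCOPE / INSTRUMENT slots) can hang off a
# single property together with its factuality attributes.
reified_event = {
    "id": "event_1",
    "property": "DECLARATION",        # the main binary property being reified
    "attributes": {"polarity": "positive", "speculation": False, "passive": True},
    "arguments": {
        "ORGANIZATION": "TAK",
        "MANMADEEVENT": "Blasts",
    },
    "MANNER": None,                   # optional reified slots
    "SCOPE": None,
    "INSTRUMENT": None,
}

print(reified_event["property"], reified_event["arguments"])
```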

4 Annotation Task

To calibrate the two annotation schemes, we performed two rounds of annotation on a set of news stories in English. We hired four annotators to work on each layer separately, to avoid interference between the layers.

We set up the annotation task as follows. First, we collected a corpus of news documents. Secondly, we gave each of the four annotators per schema the same document set to annotate, along with the guidelines for that schema. We then examined the inter-annotator agreement for the documents, and discussed the major disagreements in person with the annotators. After the discussion, we revised the guidelines and again gave the annotators the same set of documents.

2 We refer to W3C protocols for representing these structures to warrant the compliance of our schema with Semantic Web languages. More information can be found at https://www.w3.org/TR/swbp-n-aryRelations/.


For the annotation tool, we used Brat (Stenetorp et al., 2012).

We collected the documents by compiling a list of recent events and then querying the web to find news articles about those events. We drew the articles from various sources so as to be invariant to the writing style of specific news sites. We aimed for approximately the same article length to keep the scale of agreement errors comparable. For this annotator calibration step, we used a set of five news documents, each approximately 20 sentences long.

We computed the inter-annotator agreement on the documents at the sentence level in order to determine the sources of annotator disagreement. We then organized discussion meetings with all of the annotators for each schema to determine whether the disagreement stems from the ambiguity of the source text or from the incomprehensiveness of the annotation schema.

After the meetings, we revised and refined the guidelines in a process which mostly included smaller changes, such as adding explanatory samples of annotation for borderline cases as well as rephrasing and clarifying the text of the guidelines. However, we also made a couple of more substantial revisions, such as adding label classes and determining what should or should not be included in the text spans for particular labels.

5 Inter-Annotator Agreement

We use two different metrics for calculating the inter-annotator agreement (IAA), namely Cohen's kappa coefficient (Cohen, 1960) and the F1-score (van Rijsbergen, 1979). The former has been used in prior work on event annotations, e.g., in (Cybulska and Vossen, 2014b). On the other hand, the F1-score is routinely used for evaluating annotations that involve variable-length text spans, e.g., named entity annotations (Tjong Kim Sang and De Meulder, 2003) used in named entity recognition (NER) tasks. In line with NER evaluations, we consider two F1-score calculations: strict F1-score (both the labels and the text spans have to match perfectly) and lenient F1-score (labels have to match, but text spans may only partially overlap). In both cases, we calculate the macro F1-score by averaging the F1-scores computed for each label.
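For concreteness, the sketch below shows one way to compute the strict and lenient macro F1 between two annotators' span sets; it is our own simplification (greedy one-to-one matching per label), not the authors' evaluation code.

```python
def span_f1(gold, pred, lenient=False):
    """Macro-averaged F1 over labels for two annotators' span sets.

    gold, pred: lists of (label, start, end) character spans.
    Strict: spans must match exactly; lenient: any overlap with the same label counts.
    """
    def match(a, b):
        if a[0] != b[0]:
            return False
        return a[1] < b[2] and b[1] < a[2] if lenient else a[1:] == b[1:]

    f1s = []
    for lab in {s[0] for s in gold} | {s[0] for s in pred}:
        g = [s for s in gold if s[0] == lab]
        p = [s for s in pred if s[0] == lab]
        used, tp = set(), 0
        for s in p:                              # greedy one-to-one matching
            for i, t in enumerate(g):
                if i not in used and match(s, t):
                    used.add(i)
                    tp += 1
                    break
        prec = tp / len(p) if p else 0.0
        rec = tp / len(g) if g else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s) if f1s else 0.0

gold = [("AGENT", 0, 9), ("ACTION", 10, 18)]
pred = [("AGENT", 0, 9), ("ACTION", 10, 22)]
print(span_f1(gold, pred), span_f1(gold, pred, lenient=True))   # 0.5 1.0
```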

The motivation for using the F1-score along with Cohen's kappa coefficient lies in the fact that Cohen's kappa treats the untagged tokens as true negatives.

                    Kappa          F1-strict      F1-lenient
Functional layer
  Round 1           0.428 ± 0.08   0.383 ± 0.04   0.671 ± 0.10
  Round 2           0.409 ± 0.07   0.424 ± 0.04   0.621 ± 0.07
Conceptual layer
  Round 1           0.280 ± 0.08   0.350 ± 0.06   0.680 ± 0.15
  Round 2           0.476 ± 0.07   0.475 ± 0.03   0.778 ± 0.06

Table 6: Inter-annotator agreement scores for the two annotation layers and two annotation rounds, averaged across annotator pairs and documents.

If the majority of tokens is untagged, the agreement values will be inflated, as demonstrated by Cybulska and Vossen (2014b). In contrast, the F1-score disregards the untagged tokens, and is therefore a more suitable measure for sequence labeling tasks. In our case, the ratio of untagged vs. tagged tokens was less skewed (6:4 and 1:2 for the functional and conceptual layer, respectively), i.e., for both annotation layers a fair portion of the text is covered by annotated text spans, which means that the discrepancy between kappa values and F1-scores is expected to be lower.

We compute the IAA across all annotator pairs working on the same document, separately for the same round of annotation, and separately for each annotation layer. We then calculate the IAA averaged across the five documents, along with standard deviations. Table 6 shows the IAA scores.

For the functional layer, Cohen's kappa coefficient is above 0.4, which, according to Landis and Koch (1977), is considered the borderline between fair and moderate agreement. Interestingly enough, the kappa agreement dropped between the first and the second round. We attribute this to the fact that the set of labels was refined (extended) between the two rounds, based on the discussion we had with the annotators after the first round of annotations. Apparently, the refinement made the annotation task more difficult, or we failed to cater for it in the guidelines. Conversely, for the conceptual layer, the agreement in the first round was lower, but increased to a moderate level in the second round. The same observations hold for the F1-strict and F1-lenient measures. Furthermore, the IAA scores for the second round for the conceptual layer are higher than for the functional layer. A number of factors could be at play here: the annotators working on the conceptual layer were perhaps more skilled, the guidelines were more comprehensive, or the task is inherently less difficult or perhaps more intuitive.

While the IAA scores may seem moderate at first, one has to bear in mind the total number of different labels, which is 17 and 28 for the functional and conceptual layer, respectively. In view of this, and considering also the fact that this is a preliminary study, we consider the moderate agreement scores to be very satisfactory. Nonetheless, we believe the scores could be improved even further with additional calibration rounds.

6 Comparative Analysis

In this section, we provide examples of a couple of sentences annotated in both layers, along with a brief discussion of why we believe that each layer compensates for the shortcomings of the other.

Fig. 1 provides an example of a sentence annotated in the functional and the conceptual layer. We observe that the last part of the sentence, "reading: Bye all!!!", is not annotated in the functional layer (Fig. 1a). This is due to the fact that the last part is a modifier of the patient, and not of the action. Even though we could argue that in this case the information provided by the modifier is unimportant for the event, we could conceive of note contents that would indeed be important. Along with that, any modifier of the event arguments that is not directly linked to the arguments is not annotated in the functional layer, leading to information loss. We argue that in such cases the conceptual layer (Fig. 1b) is better suited to gathering the full picture of the event along with all the descriptions.

Fig. 2a exemplifies the case where, in the functional layer, the action is a noun phrase. Such cases are intentionally meant to be labeled as actions, as they change the meaning of the verb itself. In the conceptual case (Fig. 2b), we label "broke out" as the occurrence, a phrase that, although clear, gives no indication of the true nature of the event; the conceptual layer thus relies on the "natural event" argument for the full understanding of the event. We argue that having a noun phrase as part of the action, as in the functional layer, is a more natural representation of an event, as it fully answers the "What" question. We also argue that making a distinction between "fire broke out" and "broke out" as actions is beneficial for the training of the event extraction model, as it emphasizes the distinction between a verb and an action.

7 Conclusions

We have presented a two-layered scheme for the annotation of event mentions in news, conveying different information aspects: the functional aspect and the conceptual aspect. The first one deals with a more general analysis of sentence structures in news and the lexical elements involved in events. The conceptual layer aims at describing event mentions in news focusing on the "semantic macro-propositions", which compose the theme of the news story.

Our approach to event mentions in news is a part of a research project on high-precision news event extraction models. The main hypothesis, leading the development of our system, is that the precision of models can be improved by modeling and extracting separately the different aspects of news events, and then combining the extracted information by leveraging the complementarities of the models. As part of this examination, we have also presented a preliminary analysis of the inter-annotator agreement.

Acknowledgments

This work has been funded by the Unity Through Knowledge Fund of the Croatian Science Foundation, under the grant 19/15: "EVEnt Retrieval Based on semantically Enriched Structures for Interactive user Tasks (EVERBEST)".

References

Rodrigo Agerri and German Rigau. 2016. Robust multilingual named entity recognition with shallow semi-supervised features. Artificial Intelligence, 238:63–82.

Mark C. Baker. 1997. Thematic roles and syntactic structure. In Elements of Grammar, pages 73–137. Springer.

Jacob Cohen. 1960. A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1):37–46.

Agata Cybulska and Piek Vossen. 2014a. Guidelines for ECB+ annotation of events and their coreference. Technical Report NWR-2014-1, VU University Amsterdam.

Agata Cybulska and Piek Vossen. 2014b. Using a sledgehammer to crack a nut? Lexical diversity and event coreference resolution. In LREC, pages 4545–4552.


Figure 1: Example of sentence annotated in the (a) functional and the (b) conceptual layer.

Figure 2: Example of sentence annotated in the (a) functional and the (b) conceptual layer.

Aldo Gangemi, Nicola Guarino, Claudio Masolo, Alessandro Oltramari, and Luc Schneider. 2002. Sweetening ontologies with DOLCE. In International Conference on Knowledge Engineering and Knowledge Management, pages 166–181. Springer.

Zellig Harris. 1988. Language and Information. Columbia University Press.

Robert Hodge and Gunther R. Kress. 1988. Social Semiotics. Cornell University Press.

D. Ingram and P. Henshall. 2008. The News Manual.

Ray Jackendoff. 1985. Semantic structure and conceptual structure. Semantics and Cognition, pages 3–22.

J. Richard Landis and Gary G. Koch. 1977. The measurement of observer agreement for categorical data. Biometrics, pages 159–174.

Uta Losch and Nadejda Nikitina. 2009. The newsEvents ontology: An ontology for describing business events. In Proceedings of the 2009 International Conference on Ontology Patterns, Volume 516, pages 187–193. CEUR-WS.org.

A.-L. Minard, Manuela Speranza, Ruben Urizar, Begona Altuna, M. G. J. van Erp, A. M. Schoen, C. M. van Son, et al. 2016a. MEANTIME, the NewsReader multilingual event and time corpus.

Anne-Lyse Minard, Manuela Speranza, Tommaso Caselli, and Fondazione Bruno Kessler. 2016b. The EVALITA 2016 event factuality annotation task (FactA). In Proceedings of the Third Italian Conference on Computational Linguistics (CLiC-it 2016) & Fifth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian, Final Workshop (EVALITA 2016).

Martha Palmer, Daniel Gildea, and Paul Kingsbury. 2005. The Proposition Bank: An annotated corpus of semantic roles. Computational Linguistics, 31(1):71–106.

Anna Papafragou. 2015. The representation of events in language and cognition. The Conceptual Mind: New Directions in the Study of Concepts, page 327.

James Pustejovsky, Jose M. Castano, Robert Ingria, Roser Sauri, Robert J. Gaizauskas, Andrea Setzer, Graham Katz, and Dragomir R. Radev. 2003. TimeML: Robust specification of event and temporal expressions in text. New Directions in Question Answering, 3:28–34.

Marco Rospocher, Marieke van Erp, Piek Vossen, Antske Fokkens, Itziar Aldabe, German Rigau, Aitor Soroa, Thomas Ploeger, and Tessel Bogaard. 2016. Building event-centric knowledge graphs from news. Web Semantics: Science, Services and Agents on the World Wide Web, 37:132–151.

Ansgar Scherp, Thomas Franz, Carsten Saathoff, and Steffen Staab. 2009. F – a model of events based on the foundational ontology DOLCE+DnS Ultralite. In Proceedings of the Fifth International Conference on Knowledge Capture, pages 137–144. ACM.

Andrea Setzer and Robert J. Gaizauskas. 2000. Annotating events and temporal information in newswire texts. In LREC, volume 2000, pages 1287–1294.

Manuela Speranza and Anne-Lyse Minard. 2014. NewsReader guidelines for cross-document annotation, NWR-2014-9.

Pontus Stenetorp, Sampo Pyysalo, Goran Topic, Tomoko Ohta, Sophia Ananiadou, and Jun'ichi Tsujii. 2012. BRAT: A web-based tool for NLP-assisted text annotation. In Proceedings of the Demonstrations at the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 102–107. Association for Computational Linguistics.

TDT. 2004. Annotation manual version 1.2. From knowledge accumulation to accommodation: cycles of collective cognition in work groups.

Erik F. Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, Volume 4, pages 142–147. Association for Computational Linguistics.

Mike Uschold and Michael Gruninger. 1996. Ontologies: Principles, methods and applications. The Knowledge Engineering Review, 11(02):93–136.

Teun A. Van Dijk. 1991. Media contents: The interdisciplinary study of news as discourse. In A Handbook of Qualitative Methodologies for Mass Communication Research.

C. J. van Rijsbergen. 1979. Information Retrieval. Butterworths.

Chantal van Son, Tommaso Caselli, Antske Fokkens, Isa Maks, Roser Morante, Lora Aroyo, and Piek Vossen. 2016. GRaSP: A multilayered annotation scheme for perspectives. In Proceedings of the 10th Edition of the Language Resources and Evaluation Conference (LREC).

Piek Vossen, Rodrigo Agerri, Itziar Aldabe, Agata Cybulska, Marieke van Erp, Antske Fokkens, Egoitz Laparra, Anne-Lyse Minard, Alessio Palmero Aprosio, German Rigau, et al. 2016. NewsReader: Using knowledge resources in a cross-lingual reading machine to generate more knowledge from massive streams of news. Knowledge-Based Systems, 110:60–85.


Proceedings of the 11th Linguistic Annotation Workshop, pages 91–94, Valencia, Spain, April 3, 2017. ©2017 Association for Computational Linguistics

Word Similarity Datasets for Indian Languages: Annotation and Baseline Systems

Syed S. Akhtar, Arihant Gupta, Avijit Vajpayee, Arjit Srivastava, M. Shrivastava
International Institute of Information Technology, Hyderabad, Telangana, India
{syed.akhtar, arihant.gupta, arjit.srivastava}@research.iiit.ac.in

Abstract

With the advent of word representations, word similarity tasks are becoming increasingly popular as an evaluation metric for the quality of the representations. In this paper, we present manually annotated monolingual word similarity datasets for six Indian languages – Urdu, Telugu, Marathi, Punjabi, Tamil and Gujarati. These are the most spoken Indian languages worldwide after Hindi and Bengali. For the construction of these datasets, our approach relies on translation and re-annotation of English word similarity datasets. We also present baseline scores for word representation models using state-of-the-art techniques for Urdu, Telugu and Marathi by evaluating them on the newly created word similarity datasets.

1 Introduction

Word representations are becoming increasingly popular in various areas of natural language processing, such as dependency parsing (Bansal et al., 2014), named entity recognition (Miller et al., 2004) and parsing (Socher et al., 2013). The word similarity task is one of the most popular benchmarks for the evaluation of word representations. Applications of word similarity include Word Sense Disambiguation (Patwardhan et al., 2005), Machine Translation Evaluation (Lavie and Denkowski, 2009), Question Answering (Mohler et al., 2011), and Lexical Substitution (McCarthy and Navigli, 2009).

Word Similarity task is a computationally effi-cient method to evaluate the quality of word vec-tors. It relies on finding correlation between hu-man assigned semantic similarity (between words)and corresponding word vectors. We have used

Spearman’s Rho for calculating correlation. Un-fortunately, most of the word similarity tasks havebeen majorly limited to English language becauseof availability of well annotated different wordsimilarity test datasets and large corpora for learn-ing good word representations, where as for In-dian languages like Marathi, Punjabi, Telugu etc.– which even though are widely spoken by signif-icant number of people, are still computationallyresource poor languages. Even if there are mod-els trained for these languages, word similaritydatasets to test reliability of corresponding learnedword representations do not exist.

Hence, the primary motivation for the creation of these six word similarity datasets has been to provide the necessary evaluation resources for all current and future work on word representations in these six Indian languages – all ranked among the top 25 most spoken languages in the world – since no word similarity datasets for them have previously been made publicly available.

The main contribution of this paper is the set of newly created word similarity datasets, which allow for fast and efficient comparison between word representations. Word similarity is one of the most important evaluation metrics for word representations, and hence, as an evaluation metric, these datasets would promote the development of better techniques that employ word representations for these languages. We also present baseline scores using state-of-the-art techniques, which were evaluated using these datasets.

The paper is structured as follows. We first discuss the corpus and techniques used for training our models in Section 2, which are later used for evaluation. We then talk about relevant related work with respect to word similarity datasets in Section 3. We then explain how these datasets have been created in Section 4, followed by our evaluation criteria and the experimental results of various models evaluated on these datasets in Section 5. Finally, we analyze and explain the results in Section 6 and finish this paper with how we plan to extend our work in Section 7.

2 Datasets

For all the models trained in this paper, we have used the Skip-gram, CBOW (Mikolov et al., 2013a) and FastText (Bojanowski et al., 2016) algorithms. The dimensionality has been fixed at 300 with a minimum count of 5, along with negative sampling.
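The paper does not name the implementation used for training; as an illustration of an equivalent setup, the sketch below uses the gensim 4.x API (our assumption), with the stated dimensionality and minimum count.

```python
from gensim.models import Word2Vec, FastText

def train_models(corpus):
    """corpus: an iterable of tokenized sentences, e.g. [["a", "sentence"], ...]"""
    shared = dict(vector_size=300, min_count=5, negative=5, workers=4)
    cbow = Word2Vec(corpus, sg=0, **shared)       # CBOW
    skipgram = Word2Vec(corpus, sg=1, **shared)   # Skip-gram
    fasttext = FastText(corpus, sg=1, **shared)   # subword-aware Skip-gram
    return cbow, skipgram, fasttext
```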

As the training set for Marathi, we use the monolingual corpus created by IIT-Bombay. This data contains 27 million tokens. For Urdu, we use the untagged corpus released by Jawaid et al. (2014), containing 95 million tokens. For Telugu, we use the Telugu wikidump available at https://archive.org/details/tewiki-20150305, which has 11 million tokens.

For testing, we use the newly created datasets. The word similarity datasets for Urdu, Marathi, Telugu, Punjabi, Gujarati and Tamil contain 100, 104, 111, 143, 163 and 97 word pairs respectively.

For the rest of the paper, we have calculated the Spearman ρ (multiplied by 100) between the human-assigned similarity and the cosine similarity of our word embeddings for the word pairs. Any word that is not found in the vocabulary is assigned a zero vector.
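A minimal sketch of this evaluation loop, assuming a gensim-style keyed-vector interface (function and variable names are ours):

```python
import numpy as np
from scipy.stats import spearmanr

def evaluate(vectors, word_pairs, human_scores):
    """Spearman rho (x100) between human similarity and cosine similarity."""
    def vec(w):
        # unseen words are assigned a zero vector, as described above
        return vectors[w] if w in vectors else np.zeros(vectors.vector_size)

    def cos(u, v):
        denom = np.linalg.norm(u) * np.linalg.norm(v)
        return float(u @ v / denom) if denom else 0.0

    sims = [cos(vec(a), vec(b)) for a, b in word_pairs]
    return 100 * spearmanr(sims, human_scores).correlation
```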

In order to learn initial representations of the words, we train word embeddings (word2vec) using the parameters described above on the training set.

3 Related Work

A multitude of word similarity datasets have been created for English, like WordSim-353 (Finkelstein et al., 2002), MC-30 (Miller and Charles, 1991), SimLex-999 (Hill et al., 2016), RG-65 (Rubenstein and Goodenough, 2006), etc. RG-65 is one of the oldest and most popular datasets, being used as a standard benchmark for measuring the reliability of word representations.

RG-65 has also acted as the base for various other word similarity datasets created in different languages: French (Joubarne and Inkpen, 2011), German (Zesch and Gurevych, 2006), Portuguese (Granada et al., 2014), Spanish and Farsi (Camacho-Collados et al., 2015). While German and Portuguese reported IAA (Inter-Annotator Agreement) of 0.81 and 0.71 respectively, no IAA was calculated for French. For Spanish and Farsi, inter-annotator agreement of 0.83 and 0.88 respectively was reported. Our datasets were created using RG-65 and WordSim-353 as base, and their respective IAA(s) are mentioned later in the paper.

4 Construction of Monolingual Word Similarity Datasets

4.1 Translation

English RG-65 and WordSim-353 were used as the base for creating all six of our word similarity datasets. Translation of the English datasets into the target language (one of the six languages) was done manually by a set of three annotators who are native speakers of the target language and fluent in English. Initially, translations were provided by two of them, and in case of disparity, the third annotator was used as a tie breaker.

Finally, all three annotators agreed on a final set of translated word pairs in the target language, ensuring that there were no repeated word pairs. This approach was followed by Camacho-Collados et al. (2015), who created word similarity datasets for Spanish and Farsi in a similar manner.

4.2 Scoring

For each of the six languages, 8 native speakers were asked to manually evaluate each word similarity dataset individually. They were instructed to indicate, for each pair, their opinion of how similar in meaning the two words are on a scale of 0-10, with 10 for words that mean the same thing, and 0 for words that mean completely different things. The guidelines provided to the annotators were based on the SemEval task on Cross-Level Semantic Similarity (Jurgens et al., 2014), which provides clear indications for distinguishing similarity from relatedness.

The results were averaged over the 8 responses for each word similarity dataset, and each dataset saw good agreement amongst the evaluators, except for Tamil, which saw relatively weaker agreement with respect to the other languages (see Table 1).

5 Evaluation

5.1 Inter-Annotator Agreement (IAA)

The meaning of a sentence and its words can be interpreted in different ways by different readers. This subjectivity can also be reflected in the annotation of sentences of a language, despite the annotation guidelines being well defined. Therefore, inter-annotator agreement is calculated to give a measure of how well the annotators can make the same annotation decision for a certain category.

Language    Inter-Annotator Agreement
Urdu        0.887
Punjabi     0.821
Marathi     0.808
Tamil       0.756
Telugu      0.866
Gujarati    0.867

Table 1: Inter-Annotator Agreement (Fleiss' Kappa) scores for the word similarity datasets created for the six languages.

5.1.1 Fleiss' Kappa

Fleiss' kappa is a statistical measure for assessing the reliability of agreement between a fixed number of raters when assigning categorical ratings to a number of items or classifying items. This contrasts with other kappas such as Cohen's kappa, which only works when assessing the agreement between no more than two raters, or the intra-rater reliability for one appraiser versus himself. The measure calculates the degree of agreement in classification over that which would be expected by chance (Wikipedia contributors, 2017).

We have calculated Fleiss' kappa for all our word similarity datasets (see Table 1).
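For reference, a self-contained sketch of the Fleiss' kappa computation over an items-by-categories count matrix (our own illustration; the paper does not publish its scripts):

```python
import numpy as np

def fleiss_kappa(counts):
    """counts: (n_items, n_categories) matrix; each row sums to the number of raters."""
    counts = np.asarray(counts, dtype=float)
    n_items, _ = counts.shape
    n_raters = counts[0].sum()
    p_j = counts.sum(axis=0) / (n_items * n_raters)     # per-category proportions
    P_i = (np.square(counts).sum(axis=1) - n_raters) / (n_raters * (n_raters - 1))
    P_bar, P_e = P_i.mean(), np.square(p_j).sum()
    return (P_bar - P_e) / (1 - P_e)

# toy example: 3 word pairs rated by 8 annotators into 3 coarse similarity bins
print(round(fleiss_kappa([[8, 0, 0], [0, 7, 1], [1, 1, 6]]), 3))
```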

6 Results and Analysis

System            Score   OOV   Vocab
CBOW              28.30   19    130K
SG                34.40   19    130K
FastText          34.61   19    130K
FastText w/ OOV   45.47   14    -

Table 2: Results for Urdu

We present baseline scores using state-of-the-art techniques – CBOW and Skip-gram (Mikolov et al., 2013a) and FastText-SG (Bojanowski et al., 2016) – evaluated using our word similarity datasets in Tables 2, 3 and 4. As we can see, the trained models encountered unseen word pairs when evaluated on their corresponding word similarity datasets.

System            Score   OOV   Vocab
CBOW              36.16   3     194K
SG                41.22   3     194K
FastText          33.68   3     194K
FastText w/ OOV   38.66   0     -

Table 3: Results for Marathi

System            Score   OOV   Vocab
CBOW              26.01   14    174K
SG                27.04   14    174K
FastText          34.29   14    174K
FastText w/ OOV   46.02   0     -

Table 4: Results for Telugu

This goes to show that not all word pairs in our word similarity sets are common; the sets also contain word pairs with some rarity.

We see that FastText w/ OOV (Out of Vocabulary) performed better than plain FastText in all the experiments: because it is character-based, it can handle unseen words by generating embeddings for the missing words from their character-level representation instead of falling back to a zero vector.
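Concretely, the "w/ OOV" variant corresponds to querying the character-based model directly for unseen words rather than substituting a zero vector; with gensim's FastText (our assumed toolkit, as above) that looks like:

```python
import numpy as np

def similarity_with_oov(ft_model, w1, w2):
    """FastText composes a vector for an unseen word from its character
    n-grams, so no zero-vector fallback is needed."""
    u, v = ft_model.wv[w1], ft_model.wv[w2]   # lookup works even for OOV words
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
```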

7 Future Work

There are many Indian languages that are still computationally resource-poor even though they are widely spoken by a significant number of people. Our work is a small step towards generating resources to further research involving word representations for Indian languages.

To extend our work further, we will create rare-word word similarity datasets for the six languages we worked on in this paper, and create word similarity datasets for other major Indian languages as well.

We will also work on improving word representations for the languages we worked on, and hence improve the baseline scores that we present here. This will require us to build new corpora to train models for the three languages for which we could not provide baseline scores – Punjabi, Tamil and Gujarati – and to build larger corpora for Urdu, Telugu and Marathi to train better word embeddings.

References

Alon Lavie and Michael J. Denkowski. 2009. The METEOR metric for automatic evaluation of machine translation. Machine Translation, 23(2-3):105-115.

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2016. Enriching word vectors with subword information. arXiv preprint arXiv:1607.04606.

Bushra Jawaid, Amir Kamran, and Ondrej Bojar. 2014. A Tagged Corpus and a Tagger for Urdu. In LREC 2014.

Jose Camacho-Collados, Mohammad Taher Pilehvar, and Roberto Navigli. 2015. A Framework for the Construction of Monolingual and Cross-lingual Word Similarity Datasets. In ACL (2), pages 1-7.

Colette Joubarne and Diana Inkpen. 2011. Comparison of semantic similarity for different languages using the Google n-gram corpus and second-order co-occurrence measures. In Advances in Artificial Intelligence – 24th Canadian Conference on Artificial Intelligence, pages 216-221.

David Jurgens, Mohammad Taher Pilehvar, and Roberto Navigli. 2014. SemEval-2014 Task 3: Cross-level semantic similarity. In Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014), in conjunction with COLING 2014.

Diana McCarthy and Roberto Navigli. 2009. The English lexical substitution task. Language Resources and Evaluation, 43(2):139-159.

Felix Hill, Roi Reichart, and Anna Korhonen. 2016. SimLex-999: Evaluating semantic models with (genuine) similarity estimation. Computational Linguistics.

George A. Miller and Walter G. Charles. 1991. Contextual correlates of semantic similarity. Language and Cognitive Processes, 6(1):1-28.

Herbert Rubenstein and John B. Goodenough. 2006. Contextual correlates of synonymy. Communications of the ACM, 8(10):627-633.

Lev Finkelstein, Evgeniy Gabrilovich, Yossi Matias, Ehud Rivlin, Zach Solan, Gadi Wolfman, and Eytan Ruppin. 2002. Placing search in context: The concept revisited. In Proceedings of the 10th International Conference on World Wide Web, pages 406-414.

Michael Mohler, Razvan Bunescu, and Rada Mihalcea. 2011. Learning to grade short answer questions using semantic similarity measures and dependency graph alignments. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Volume 1, pages 752-762. Association for Computational Linguistics.

Mohit Bansal, Kevin Gimpel, and Karen Livescu. 2014. Tailoring continuous word representations for dependency parsing. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, ACL 2014, June 22-27, Baltimore, MD, USA, Volume 2: Short Papers, pages 809-815.

Richard Socher, John Bauer, Christopher D. Manning, and Andrew Y. Ng. 2013. Parsing with Compositional Vector Grammars. In ACL, pages 455-465.

Roger Granada, Cassia Trojahn, and Renata Vieira. 2014. Comparing semantic relatedness between word pairs in Portuguese using Wikipedia. In International Conference on Computational Processing of the Portuguese Language. Springer International Publishing.

Scott Miller, Jethran Guinness, and Alex Zamanian. 2004. Name tagging with word clusters and discriminative training. In Proceedings of HLT-NAACL, volume 4, pages 337-342.

Siddharth Patwardhan, Satanjeev Banerjee, and Ted Pedersen. 2005. SenseRelate::TargetWord: A generalized framework for word sense disambiguation. In Proceedings of the ACL 2005 Interactive Poster and Demonstration Sessions, pages 73-76. Association for Computational Linguistics.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeff Dean. 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig. 2013. Linguistic regularities in continuous space word representations. In Proceedings of HLT-NAACL, volume 13, pages 746-751.

Tomas Mikolov, Quoc V. Le, and Ilya Sutskever. 2013. Exploiting similarities among languages for machine translation. arXiv preprint arXiv:1309.4168.

Torsten Zesch and Iryna Gurevych. 2006. Automatically creating datasets for measures of semantic relatedness. In Proceedings of the Workshop on Linguistic Distances, pages 16-24.

Wikipedia contributors. 2017. "Fleiss' kappa." Wikipedia, The Free Encyclopedia. 6 Feb. 2017. Web. 16 Feb. 2017.


Proceedings of the 11th Linguistic Annotation Workshop, pages 95–104, Valencia, Spain, April 3, 2017. ©2017 Association for Computational Linguistics

The BECauSE Corpus 2.0: Annotating Causality and Overlapping Relations

Jesse Dunietz
Computer Science Department, Carnegie Mellon University, Pittsburgh, PA 15213, USA

Lori Levin and Jaime Carbonell
Language Technologies Institute, Carnegie Mellon University, Pittsburgh, PA 15213, USA
{lsl,jgc}@cs.cmu.edu

Abstract

Language of cause and effect captures an essential component of the semantics of a text. However, causal language is also intertwined with other semantic relations, such as temporal precedence and correlation. This makes it difficult to determine when causation is the primary intended meaning. This paper presents BECauSE 2.0, a new version of the BECauSE corpus with exhaustively annotated expressions of causal language, but also seven semantic relations that are frequently co-present with causation. The new corpus shows high inter-annotator agreement, and yields insights both about the linguistic expressions of causation and about the process of annotating co-present semantic relations.

1 Introduction

We understand our world in terms of causal networks – phenomena causing, enabling, or preventing others. Accordingly, the language we use is full of references to cause and effect. In the Penn Discourse Treebank (PDTB; Prasad et al., 2008), for example, over 12% of explicit discourse connectives are marked as causal, as are nearly 26% of implicit discourse relationships. Recognizing causal assertions is thus invaluable for semantics-oriented applications, particularly in domains such as finance and biology where interpreting these assertions can help drive decision-making.

In addition to being ubiquitous, causation is often co-present with related meanings such as temporal order (cause precedes effect) and hypotheticals (the if causes the then). This paper presents the Bank of Effects and Causes Stated Explicitly (BECauSE) 2.0, which offers insight into these overlaps. As in BECauSE 1.0 (Dunietz et al., 2015, in press), the corpus contains annotations for causal language. It also includes annotations for seven commonly co-present meanings when they are expressed using constructions shared with causality.

To deal with the wide variation in linguistic expressions of causation (see Neeleman and Van de Koot, 2012; Dunietz et al., 2015), BECauSE draws on the principles of Construction Grammar (CxG; Fillmore et al., 1988; Goldberg, 1995). CxG posits that the fundamental units of language are constructions – pairings of meanings with arbitrarily simple or complex linguistic forms, from morphemes to structured lexico-syntactic patterns.

Accordingly, BECauSE admits arbitrary constructions as the bearers of causal relationships. As long as there is at least one fixed word, any conventionalized expression of causation can be annotated. By focusing on causal language – conventionalized expressions of causation – rather than real-world causation, BECauSE largely sidesteps the philosophical question of what is truly causal. It is not concerned, for instance, with whether there is a real-world causal relationship within flu virus (virus causes flu) or delicious bacon pizza (bacon causes deliciousness); neither is annotated.

Nonetheless, some of the same overlaps and ambiguities that make real-world causation so hard to circumscribe seep into the linguistic domain, as well. Consider the following examples (with causal constructions in bold, CAUSES in small caps, and effects in italics):

(1) After I DRANK SOME WATER, I felt much better.
(2) As VOTERS GET TO KNOW MR. ROMNEY BETTER, his poll numbers will rise.
(3) THE MORE HE COMPLAINED, the less his captors fed him.
(4) THE RUN ON BEAR STERNS created a crisis.
(5) THE IRAQI GOVERNMENT will let Representative Hall visit next week.

Each sentence conveys a causal relation, but piggybacks it on a related relation type. (1) uses a temporal relationship to suggest causality. (3) employs a correlative construction, and (2) contains elements of both time and correlation in addition to causation. (4), meanwhile, is framed as bringing something into existence, and (5) suggests both permission and enablement.

Most semantic annotation schemes have required that each token be assigned just one meaning. BECauSE 1.0 followed this policy as well, but this resulted in inconsistent handling of cases like those above. For example, the meaning of let varies from "allow to happen" (clearly causal) to "verbalize permission" (not causal) to shades of both. These overlaps made it difficult for annotators to decide when to annotate such cases as causal.

The contributions of this paper are threefold. First, we present a new version of the BECauSE corpus, which offers several improvements over the original. Most importantly, the updated corpus includes annotations for seven different relation types that overlap with causality: temporal, correlation, hypothetical, obligation/permission, creation/termination, extremity/sufficiency, and context. Overlapping relations are tagged for any construction that can also be used to express a causal relationship. The improved scheme yields high inter-annotator agreement. Second, using the new corpus, we derive intriguing evidence about how meanings compete for linguistic machinery. Finally, we discuss the issues that the annotation approach does and does not solve. Our observations suggest lessons for future annotation projects in semantic domains with fuzzy boundaries between categories.

2 Related Work

Several annotation schemes have addressed elements of causal language. Verb resources such as VerbNet (Schuler, 2005) and PropBank (Palmer et al., 2005) include verbs of causation. Likewise, preposition schemes (e.g., Schneider et al., 2015, 2016) include some purpose- and explanation-related senses. None of these, however, unifies all linguistic realizations of causation into one framework; they are concerned with specific classes of words, rather than the semantics of causality.

FrameNet (Ruppenhofer et al., 2016) is closer in spirit to BECauSE, in that it starts from meanings and catalogs/annotates a wide variety of lexical items that can express those meanings. Our work differs in several ways. First, FrameNet represents causal relationships through a variety of unrelated frames (e.g., CAUSATION and THWARTING) and frame roles (e.g., PURPOSE and EXPLANATION). As with other schemes, this makes it difficult to treat causality in a uniform way. (The ASFALDA French FrameNet project recently proposed a reorganized frame hierarchy for causality, along with more complete coverage of French causal lexical units [Vieu et al., 2016]. Merging their framework into mainline FrameNet would mitigate this issue.) Second, FrameNet does not allow a lexical unit to evoke more than one frame at a time (although SALSA [Burchardt et al., 2006], the German FrameNet, does allow this).

The Penn Discourse Treebank includes causality under its hierarchy of contingency relations. Notably, PDTB does allow annotators to mark discourse relations as both causal and something else. However, it is restricted to discourse relations; it excludes other realizations of causal relationships (e.g., verbs and many prepositions), as well as PURPOSE relations, which are not expressed as discourse connectives. BECauSE 2.0 can be thought of as an adaptation of PDTB's multiple-annotation approach. Instead of focusing on a particular type of construction (discourse relations) and annotating all the meanings it can convey, we start from a particular meaning (causality), find all constructions that express it, and annotate each instance in the text with all the meanings it expresses.

Other projects have attempted to address causality more narrowly. For example, a small corpus of event pairs conjoined with and has been tagged as causal or not causal (Bethard et al., 2008). The CaTeRS annotation scheme (Mostafazadeh et al., 2016), based on TimeML, also includes causal relations, but from a commonsense reasoning standpoint rather than a linguistic one. Similarly, Richer Event Description (O'Gorman et al., 2016) integrates real-world temporal and causal relations between events into a unified framework. A broader-coverage linguistic approach was taken by Mirza and Tonelli (2014), who enriched TimeML to include causal links and their lexical triggers. Their work differs from ours in that it requires arguments to be TimeML events; it requires causal connectives to be contiguous; and its guidelines define causality less precisely, relying on intuitive notions of causing, preventing, and enabling.

2.1 BECauSE 1.0

Our work is of course most closely based on BECauSE 1.0. Its underlying philosophy is to annotate any form of causal language – conventionalized linguistic mechanisms used to appeal to cause and effect. Thus, the scheme is not concerned with what real-world causal relationships hold, but rather with what relationships are presented in the text. It defines causal language as "any construction which presents one event, state, action, or entity as promoting or hindering another, and which includes at least one lexical trigger." Each annotation consists of a cause span; an effect span; and a causal connective, the possibly discontinuous lexical items that express the causal relationship (e.g., because of or opens the way for).

3 Extensions and Refactored Guidelines in BECauSE 2.0

This update to BECauSE improves on the original in several ways. Most importantly, as mentioned above, the original scheme precluded multiple co-present relations. Tagging a connective as causal was taken to mean that it was primarily expressing causation, and not temporal sequence or permission. (In fact, temporal expressions that were intended to suggest causation were explicitly excluded.) Based on the new annotations, there were 210 instances in the original corpus where multiple relations were present and annotators had to make an arbitrary decision.1 The new scheme extends the previous one to include these overlapping relations.

Second, although the first version neatly handled many different kinds of connectives, adjectives and nouns were treated in a less general way. Verbs, adverbs, conjunctions, prepositions, and complex constructions typically have two natural slots in the construction. For example, the because construction can be schematized as 〈effect〉 because 〈cause〉, and the causative construction present in so loud it hurt as 〈so cause〉 (that) 〈effect〉.

Adjective and noun connectives, however, do not offer such natural positions for 〈cause〉 and 〈effect〉. In the following example, BECauSE 1.0 would annotate the connective as marked in bold: the cause of her illness was dehydration. But this is an unparsimonious account of the causal construction: the copula and preposition do not contribute to the causal meaning, and other language could be used to tie the connective to the arguments. For instance, it would be equally valid to say her illness' cause was dehydration, or even the chart listed her illness' cause as dehydration. The new corpus addresses this by annotating just the noun or adjective as the connective (e.g., cause), and letting the remaining argument realization language vary. A number of connectives were similarly refactored to make them simpler and more consistent.

1 This is the total number, in the new corpus, of instances that are annotated with both causal and overlapping relations and which would have been ambiguous under the 1.0 guidelines – i.e., the guidelines did not either explicitly exclude them or deem them always causal.

Finally, version 1.0 struggled with the distinction between the causing event and the causing agent. For example, in I caused a commotion by shattering a glass, either the agent (I) or the agent's action (shattering a glass) could plausibly be annotated as the cause. The guidelines for version 1.0 suggested that the true cause is the action, so the agent should be annotated as the cause only when no action is described. (In such cases, the agent would be considered metonymic for the action.) However, given the scheme's focus on constructions, it seems odd to say that the arguments to the construction change when a by clause is added.

The new scheme solves this by labeling the agent as the cause in both cases, but adding a MEANS argument for cases where both an agent and their action are specified.2

4 BECauSE 2.0 Annotation Scheme

4.1 Basic Features of Annotations

The second version of the BECauSE corpus retains the philosophy and most of the provisions of the first, with the aforementioned changes.

To circumscribe the scope of the annotations, we follow BECauSE 1.0 in excluding causal relationships with no lexical trigger (e.g., He left. He wasn't feeling well.); connectives that lexicalize the means or result of the causation (e.g., kill or convince); and connectives that underspecify the nature of the causal relationship (e.g., linked to).

2 Another possibility would have been to divvy up causes into CAUSE and AGENT arguments. Although FrameNet follows this route in some of its frames, we found that this distinction was difficult to make in practice. For example, a non-agentive cause might still be presented with a separate means clause, as in inflammation triggers depression by altering immune responses. In contrast, MEANS are relatively easy to identify when present, and tend to exhibit more consistent behavior with respect to what constructions introduce them.


As in BECauSE 1.0, the centerpiece of each instance of causal language is the causal connective. The connective is not synonymous with the causal construction; rather, it is a lexical proxy indicating the presence of the construction. It consists of all words present in every use of the construction. For example, the bolded words in enough money for us to get by would be marked as the connective. Annotators' choices of what to include as connectives were guided by a constructicon, a catalog of constructions specified to a human-interpretable level of precision (but not precise enough to be machine-interpretable). The constructicon was updated as needed throughout the annotation process.

In addition to the connective, each instance includes cause and effect spans. (Either the cause or the effect may be absent, as in a passive or infinitive.) BECauSE 2.0 also introduces the means argument, as mentioned above. Means arguments are annotated when an agent is given as the cause, but the action taken by that agent is also explicitly described, or would be but for a passive or infinitive. They are marked only when expressed as a by or via clause, a dependent clause (e.g., Singing loudly, she caused winces all down the street), or a handful of other conventional devices. If any of an instance's arguments consists of a bare pronoun, including a relativizing pronoun such as that, a coreference link is added back to its antecedent (assuming there is one in the same sentence).

The new scheme distinguishes three types of causation, each of which has slightly different semantics: CONSEQUENCE, in which the cause naturally leads to the effect; MOTIVATION, in which some agent perceives the cause, and therefore consciously thinks, feels, or chooses something; and PURPOSE, in which an agent chooses the effect because they desire to make the cause true. Unlike BECauSE 1.0, the new scheme does not include evidentiary uses of causal language, such as She met him previously, because she recognized him yesterday. These were formerly tagged as INFERENCE. We eliminated them because unlike other categories of causation, they are not strictly causal, and unlike other overlapping relations, they never also express true causation; they constitute a different sense of because.

The scheme also distinguishes positive causation (FACILITATE) from inhibitory causation (INHIBIT); see Dunietz et al. (2015) for full details.

Examples demonstrating all of these categories are shown in Table 1.

4.2 Annotating Overlapping Relations

The constructions used to express causation overlap with many other semantic domains. For example, the if/then language of hypotheticals and the so 〈adjective〉 construction of extremity have become conventionalized ways of expressing causation, usually in addition to their other meanings. In this corpus, we annotate the presence of these overlapping relations, as well.

A connective is annotated as an instance of either causal language or a non-causal overlapping relation whenever it is used in a sense and construction that can carry causal meaning. The operational test for this is whether the word sense and linguistic structure allow it to be coerced into a causal interpretation, and the meaning is either causal or one of the relation types below.

Consider, for example, the connective without. It is annotated in cases like without your support, the campaign will fail. However, annotators ignored uses like we left without saying goodbye, because in this linguistic context, without cannot be coerced into a causal meaning. Likewise, we include if as a HYPOTHETICAL connective, but not suppose that, because the latter cannot indicate causality.

All overlapping relations are understood to hold between an ARGC and an ARGE. When annotating a causal instance, ARGC and ARGE refer to the cause and effect, respectively. When annotating a non-causal instance, ARGC and ARGE refer to the arguments that would be cause and effect if the instance were causal. For example, in a TEMPORAL relation, ARGC would be the earlier argument and ARGE would be the later one.

The following overlapping relation types are annotated:

• TEMPORAL: when the causal construction explicitly foregrounds a temporal order between two arguments (e.g., once, after) or simultaneity (e.g., as, during).

• CORRELATION: when the core meaning of the causal construction is that ARGC and ARGE vary together (e.g., as, the more... the more...).

• HYPOTHETICAL: when the causal construction explicitly imagines that a questionable premise is true, then establishes what would hold in the world where it is (e.g., if... then...).

• OBLIGATION/PERMISSION: when ARGE (effect) is an agent's action, and ARGC (cause) is presented as some norm, rule, or entity with power that is requiring, permitting, or forbidding ARGE to be performed (e.g., require in the legal sense, permit).

• CREATION/TERMINATION: when the construction frames the relationship as an entity or circumstance being brought into existence or terminated (e.g., generate, eliminate).

• EXTREMITY/SUFFICIENCY: when the causal construction also expresses an extreme or sufficient/insufficient position of some value on a scale (e.g., so... that..., sufficient... to...).

• CONTEXT: when the construction clarifies the conditions under which the effect occurs (e.g., with, without, when in non-temporal uses). For instance, With supplies running low, we didn't even make a fire that night.

                        FACILITATE                                                 INHIBIT
CONSEQUENCE   We are in serious economic trouble because of            THE NEW REGULATIONS should prevent
              INADEQUATE REGULATION.                                   future crises.
MOTIVATION    WE DON'T HAVE MUCH TIME, so let's move quickly.          THE COLD kept me from going outside.
PURPOSE       Coach them in handling complaints so that THEY CAN       (Not possible)
              RESOLVE PROBLEMS IMMEDIATELY.

Table 1: Examples of every allowed combination of the three types of causal language and the two degrees of causation (with connectives in bold, CAUSES in small caps, and effects in italics).

All relation types present in the instance are marked. For example, so offensive that I left would be annotated as both causal (MOTIVATION) and EXTREMITY/SUFFICIENCY. When causality is not present in a use of a sometimes-causal construction, the instance is annotated as NON-CAUSAL, and the overlapping relations present are marked.
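To make the pieces of the scheme concrete, an instance as described in §4.1 and §4.2 bundles a connective with its argument spans, a causation type, a degree, and any overlapping relations. The following Python sketch is purely illustrative (the field names and span encoding are ours, not the released corpus format), but it shows how the example so offensive that I left could be recorded:

    from dataclasses import dataclass, field
    from typing import List, Optional, Tuple

    Span = Tuple[int, int]  # (start, end) token offsets; an encoding assumed only for this sketch

    @dataclass
    class CausalInstance:
        """Illustrative record for one BECauSE 2.0 annotation instance."""
        connective: List[Span]                  # possibly discontinuous lexical trigger
        cause: Optional[Span] = None            # ARGC; may be absent (e.g., passives, infinitives)
        effect: Optional[Span] = None           # ARGE
        means: Optional[Span] = None            # only when an agent-cause's action is also given
        coref: List[Tuple[Span, Span]] = field(default_factory=list)  # pronoun -> antecedent links
        causation_type: Optional[str] = None    # CONSEQUENCE, MOTIVATION, PURPOSE; None if NON-CAUSAL
        degree: Optional[str] = None            # FACILITATE or INHIBIT
        overlapping: List[str] = field(default_factory=list)          # e.g., TEMPORAL, HYPOTHETICAL

    # "so offensive that I left": tokens so(0) offensive(1) that(2) I(3) left(4).
    # Only "so" goes into the connective, since the optional "that" is not present in every use.
    ex = CausalInstance(connective=[(0, 1)], cause=(1, 2), effect=(3, 5),
                        causation_type="MOTIVATION", degree="FACILITATE",
                        overlapping=["EXTREMITY/SUFFICIENCY"])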

It can be difficult to determine when language that expresses one of these relationships was also intended to convey a causal relationship. Annotators used a variety of questions to assess an ambiguous instance, largely based on Grivaz (2010):

• The "why" test: After reading the sentence, could a reader reasonably be expected to answer a "why" question about the potential effect argument? If not, it is not causal.

• The temporal order test: Is the cause asserted to precede the effect? If not, it is not causal.

• The counterfactuality test: Would the effect have been just as probable to occur or not occur had the cause not happened? If so, it is not causal.

• The ontological asymmetry test: Could you just as easily claim the cause and effect are reversed? If so, it is not causal.

• The linguistic test: Can the sentence be rephrased as "It is because (of) X that Y" or "X causes Y"? If so, it is likely to be causal.

Figure 1 showcases several fully-annotated sentences that highlight the key features of the new BECauSE scheme, including examples of overlapping relations.

5 BECauSE 2.0 Corpus

5.1 Data

The BECauSE 2.0 corpus3 is an expanded version of the dataset from BECauSE 1.0. It consists of:

• 59 randomly selected articles from the year 2007 in the Washington section of the New York Times corpus (Sandhaus, 2008)

• 47 documents randomly selected4 from sections 2-23 of the Penn Treebank (Marcus et al., 1994)

• 679 sentences5 transcribed from Congress' Dodd-Frank hearings, taken from the NLP Unshared Task in PoliInformatics 2014 (Smith et al., 2014)

• 10 newspaper documents (Wall Street Journal and New York Times articles, totalling 547 sentences) and 2 journal documents (82 sentences) from the Manually Annotated Sub-Corpus (MASC; Ide et al., 2010)

The first three sets of documents are the same dataset that was annotated for BECauSE 1.0.

5.2 Inter-Annotator Agreement

Inter-annotator agreement was calculated between the two primary annotators on a sample of 260 sentences, containing 98 causal instances and 82 instances of overlapping relations (per the first author's annotations). Statistics appear in Table 2.

3 Publicly available, along with the constructicon, at https://github.com/duncanka/BECauSE.

4 We excluded WSJ documents that were either earnings reports or corporate leadership/structure announcements, as both tended to be merely short lists of names/numbers.

5 The remainder of the document was not annotated due to constraints on available annotation effort.


[Figure 1 here: BRAT annotation screenshots of three example sentences –
"18 months after the congress approved private enterprise, district authorities allowed Mr. Chan to resume work.";
"The U.S. is required to notify foreign dictators if it knows of coup plans likely to endanger their lives.";
"I approve of any bill that makes regulation more efficient by allowing common-sense exceptions."]

Figure 1: Several example sentences annotated in BRAT (Stenetorp et al., 2012). The question mark indicates a hypothetical, the clock symbol indicates a temporal relation, and the thick exclamation point indicates obligation/permission.

                          Causal    Overlapping
Connective spans (F1)      0.77       0.89
Relation types (κ)         0.70       0.91
Degrees (κ)                0.92       (n/a)
Cause/ARGC spans (%)       0.89       0.96
Cause/ARGC spans (J)       0.92       0.97
Cause/ARGC heads (%)       0.92       0.96
Effect/ARGE spans (%)      0.86       0.84
Effect/ARGE spans (J)      0.93       0.92
Effect/ARGE heads (%)      0.95       0.89

Table 2: Inter-annotator agreement for the new version of BECauSE. κ indicates Cohen's kappa; J indicates the average Jaccard index, a measure of span overlap; and % indicates percent agreement of exact matches. Each κ and argument score was calculated only for instances with matching connectives.

An argument's head was determined automatically by parsing the sentence with version 3.5.2 of the Stanford Parser (Klein and Manning, 2003) and taking the highest dependency node in the argument span.

Means arguments were not included in this evaluation, as they are quite rare – there were only two in the IAA dataset, one of which was missed by one annotator and the other of which was missed by both. Both annotators agreed with these two means arguments once they were pointed out.
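Table 2 mixes several kinds of scores: F1 over connective spans, Cohen's κ over labels, and two span-level measures, exact-match percentage (%) and average Jaccard overlap (J). As a rough illustration of how the span measures could be computed once instances have been paired up by matching connectives, here is a small Python sketch; the span representation (sets of token indices) is our simplification, not the authors' evaluation code.

    def jaccard(span_a, span_b):
        """Jaccard overlap between two spans, each given as a set of token indices."""
        a, b = set(span_a), set(span_b)
        if not a and not b:
            return 1.0  # two empty spans count as identical
        return len(a & b) / len(a | b)

    def span_agreement(pairs):
        """pairs: (annotator1_span, annotator2_span) for instances with matching connectives.
        Returns (exact-match percentage, average Jaccard index)."""
        exact = sum(set(a) == set(b) for a, b in pairs) / len(pairs)
        avg_j = sum(jaccard(a, b) for a, b in pairs) / len(pairs)
        return exact, avg_j

    # Toy usage: one identical cause span and one partial overlap
    print(span_agreement([({3, 4, 5}, {3, 4, 5}), ({7, 8}, {7, 8, 9})]))  # -> (0.5, 0.833...)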


Overall, the results show substantially improved connective agreement. F1 for causal connectives is up to 0.77, compared to 0.70 in BECauSE 1.0. (The documents were drawn from similar sources and contained connectives of largely similar complexity to those in the previous IAA set.) The improvement suggests that the clearer guidelines and the overlapping relations made decisions less ambiguous, although some of the difference may be due to chance differences in the IAA datasets. Agreement on causal relation types is somewhat lower than in version 1.0 – 0.7 instead of 0.8 (possibly because more instances are annotated in the new scheme, which tends to reduce κ) – but it is still high. Unsurprisingly, most of the disagreements are between CONSEQUENCE and MOTIVATION. Degrees are close to full agreement; the only disagreement appears to have been a careless error. Agreement on argument spans is likewise quite good.

For overlapping relations, only agreement on ARGEs is lower than for causal relations; all other metrics are significantly higher. The connective F1 score of 0.89 is especially promising, given the apparent difficulty of deciding which uses of connectives like with or when could plausibly be coerced to a causal meaning.

5.3 Corpus Statistics and Analysis

The corpus contains a total of 5380 sentences, among which are 1803 labeled instances of causal language. 1634 of these, or 90.7%, include both cause and effect arguments. 587 – about a third – involve overlapping relations. The corpus also includes 583 non-causal overlapping relation annotations. The frequency of both causal and overlapping relation types is shown in Table 3.

A few caveats about these statistics: first, PURPOSE annotations do not overlap with any of the categories we analyzed. However, this should not be interpreted as evidence that they have no overlaps. Rather, they seem to inhabit a different part of the semantic space. PURPOSE does share some language with origin/destination relationships (e.g., toward the goal of, in order to achieve my goals), both diachronically and synchronically; see §7.


                         CONSEQUENCE   MOTIVATION   PURPOSE   All causal   NON-CAUSAL   Total
None                          625          319         272        1216            -      1216
TEMPORAL                      120          135           -         255          463       718
CORRELATION                     9            3           -          12            5        17
HYPOTHETICAL                   73           48           -         121           24       145
OBLIGATION/PERMISSION          67            5           -          72           27        99
CREATION/TERMINATION           37            4           -          41           43        84
EXTREMITY/SUFFICIENCY          53            9           -          62            -        62
CONTEXT                        17           15           -          32           25        57
Total                         994          537         272        1803          583      2386

Table 3: Statistics of various combinations of relation types. Note that there are 9 instances of TEMPORAL+CORRELATION and 3 instances of TEMPORAL+CONTEXT. This makes the bottom totals less than the sum of the rows.

Second, the numbers do not reflect all constructions that express, e.g., temporal or correlative relationships – only those that can be used to express causality. Thus, it would be improper to conclude that over a third of temporals are causal; many kinds of temporal language simply were not included. Similarly, the fact that all annotated EXTREMITY/SUFFICIENCY instances are causal is an artifact of only annotating uses with a complement clause, such as so loud I felt it; so loud on its own could never be coerced to a causal interpretation.

Several conclusions and hypotheses do emerge from the relation statistics. Most notably, causality has thoroughly seeped into the temporal and hypothetical domains. Over 14% of causal expressions are piggybacked on temporal relations, and nearly 7% are expressed as hypotheticals. This is consistent with the close semantic ties between these domains: temporal order is a precondition for a causal relationship, and often hypotheticals are interesting specifically because of the consequences of the hypothesized condition. The extent of these overlaps speaks to the importance of capturing overlapping relations for causality and other domains with blurry boundaries.

Another takeaway is that most hypotheticals that are expressed as conditionals are causal. Not all hypotheticals are included in BECauSE (e.g., suppose that is not), but all conditional hypotheticals are6: any conditional could express a causal relationship in addition to a hypothetical one. In principle, non-causal hypotheticals could be more common, such as if he comes, he'll bring his wife or if we must cry, let them be tears of laughter. It appears, however, that the majority of conditional hypotheticals (84%) in fact carry causal meaning.

6 We did not annotate even if as a hypothetical, since it seems to be a specialized concessive form of the construction. However, this choice does not substantially change the conclusion: even including instances of even if, 77% of conditional hypotheticals would still be causal.

Finally, the data exhibit a surprisingly strong preference for framing causal relations in terms of agents' motivations: nearly 45% of causal instances are expressed as MOTIVATION or PURPOSE. Of course, the data could be biased towards events involving human agents; many of the documents are about politics and economics. Still, it is intriguing that many of the explicit causal relationships are not just about, say, politicians' economic decisions having consequences, but about why the agents made the choices they did. It is worth investigating further to determine whether there really is a preference for appeals to motivation even when they are not strictly necessary.
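The proportions quoted in this section follow directly from the counts in Table 3; the short Python snippet below simply re-derives them (the variable names are ours).

    # Counts from the "All causal" column of Table 3 (1803 causal instances in total)
    all_causal = 1803
    temporal, hypothetical, motivation, purpose = 255, 121, 537, 272

    print(round(100 * temporal / all_causal, 1))                # 14.1 -> "over 14%" are also temporal
    print(round(100 * hypothetical / all_causal, 1))            # 6.7  -> "nearly 7%" are hypotheticals
    print(round(100 * (motivation + purpose) / all_causal, 1))  # 44.9 -> "nearly 45%" are MOTIVATION or PURPOSE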

6 Lessons Learned

Our experience suggests several lessons about annotating multiple overlapping relations. First, it clearly indicates that a secondary meaning can be evoked without losing any of the original meaning. In terms of the model of prototypes and radial categories (Lewandowska-Tomaszczyk, 2007), the conventional model for blurriness between categories, an instance can simultaneously be prototypical for one type of relation and radial for another. For instance, the ruling allows the police to enter your home is a prototypical example of a permission relationship. However, it is also a radial example of enablement (a form of causation): prototypical enablement involves a physical barrier being removed, whereas allow indicates the removal of a normative barrier.

A second lesson: even when including overlapping semantic domains in an annotation project, it may still be necessary to declare some overlapping domains out of scope. In particular, some adjacent domains will have their own overlaps with meanings that are far afield from the target domain. It would be impractical to simply pull all of these second-order domains into the annotation scheme; the project would quickly grow to encompass the entire language. If possible, the best solution is to dissect the overlapping domain into a more detailed typology, and only include the parts that directly overlap with the target domain. If this is not doable, the domain may need to be excluded altogether.

For example, we attempted to introduce a TOPIC relation type to cover cases like The President is fuming over recent media reports or They're angry about the equipment we broke (both clearly causal). Unfortunately, opening up the entire domain of topic relations turned out to be too broad and confusing. For example, it is hard to tell which of the following are even describing the same kind of topic relationship, never mind which ones can also be causal: fought over his bad behavior (behavior caused fighting); fought over a teddy bear (fought for physical control); worried about being late; worried that I might be late; I'm skeptical regarding the code's robustness. We ultimately determined that teasing apart this domain would have to be out of scope for this work.

7 Contributions and Lingering Difficulties

Our approach leaves open several questions about how to annotate causal relations and other semantically blurry relations.

First, it does not eliminate the need for binary choices about whether a given relation is present; our annotators must still mark each instance as either indicating causation or not. Likewise for each of the overlapping relations. Yet some cases suggest overtones of causality or correlation, but are not prototypically causal or correlative. These cases still necessitate making a semi-arbitrary call.

The ideal solution would somehow acknowledge the continuous nature of meaning – that an expression can indicate a relationship that is not causal, entirely causal, slightly causal, or anywhere in between. But it is hard to imagine how such a continuous representation would be annotated in practice.

Second, some edge cases remain a challenge for our new scheme. Most notably, we did not examine every semantic domain sharing some overlap with causality. Relations we did not address include:

• Origin/destination (as mentioned in §5.3; e.g., the sparks from the fire, toward that goal)

• Topic (see §6)

• Componential relationships (e.g., As part of the building's liquidation, other major tenants will also vacate the premises)

• Evidentiary basis (e.g., We went to war based on bad intelligence)

• Having a role (e.g., As an American citizen, I do not want to see the President fail)

• Placing in a position (e.g., This move puts the American people at risk)

These relations were omitted due to the time and effort it would have taken to determine whether and when to classify them as causal. We leave untangling their complexities for future work.

Other cases proved difficult because they seem to imply a causal relationship in each direction. The class of constructions indicating necessary preconditions was particularly troublesome. These constructions are typified by the sentence (For us) to succeed, we all have to cooperate. (Other variants use different language to express the modality of obligation, such as require or necessary.) On the one hand, the sentence indicates that cooperation enables success. On the other hand, it also suggests that the desire for success necessitates the cooperation.7 We generally take the enablement relationship to be the primary meaning, but this is not an entirely satisfying account of the semantics.

Despite the need for further investigation of these issues, our attempt at extending causal language annotations to adjacent semantic domains was largely a success. We have demonstrated that it is practical and sometimes helpful to annotate all linguistic expressions of a semantic relationship, even when they overlap with other semantic relations. We were able to achieve high inter-annotator agreement and to extract insights about how different meanings compete for constructions. We hope that the new corpus, our annotation methodology and the lessons it provides, and the observations about linguistic competition will all prove useful to the research community.

7 Necessary precondition constructions are thus similar to constructions of PURPOSE, such as in order to. As spelled out in Dunietz et al. (2015), a PURPOSE connective contains a similar duality of causations in opposing directions: it indicates that a desire for an outcome causes an agent to act, and hints that the action may in fact produce the desired outcome. However, in PURPOSE instances, it is clearer which relationship is primary: the desired outcome may not obtain, whereas the agent is certainly acting on their motivation. In precondition constructions, however, both the precondition and the result are imagined, making it harder to tell which of the two causal relationships is primary.


References

Steven Bethard, William J. Corvey, Sara Klingenstein, and James H. Martin. 2008. Building a corpus of temporal-causal structure. In Proceedings of LREC 2008, pages 908–915.

Aljoscha Burchardt, Katrin Erk, Anette Frank, Andrea Kowalski, Sebastian Padó, and Manfred Pinkal. 2006. The SALSA corpus: a German corpus resource for lexical semantics. In Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC-2006). Association for Computational Linguistics.

Jesse Dunietz, Lori Levin, and Jaime Carbonell. 2015. Annotating causal language using corpus lexicography of constructions. In The 9th Linguistic Annotation Workshop held in conjunction with NAACL 2015, pages 188–196.

Jesse Dunietz, Lori Levin, and Jaime Carbonell. In press. Automatically tagging constructions of causation and their slot-fillers. Transactions of the Association for Computational Linguistics.

Charles J. Fillmore, Paul Kay, and Mary Catherine O'Connor. 1988. Regularity and idiomaticity in grammatical constructions: The case of let alone. Language, 64(3):501–538.

Adele Goldberg. 1995. Constructions: A Construction Grammar Approach to Argument Structure. Chicago University Press.

Cécile Grivaz. 2010. Human judgements on causation in French texts. In Proceedings of LREC 2010. European Language Resources Association (ELRA).

Nancy Ide, Christiane Fellbaum, Collin Baker, and Rebecca Passonneau. 2010. The manually annotated sub-corpus: A community resource for and by the people. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 68–73. Association for Computational Linguistics.

Dan Klein and Christopher D. Manning. 2003. Accurate unlexicalized parsing. In Proceedings of the 41st Annual Meeting on Association for Computational Linguistics - Volume 1, pages 423–430. Association for Computational Linguistics.

Barbara Lewandowska-Tomaszczyk. 2007. Polysemy, prototypes, and radial categories. The Oxford Handbook of Cognitive Linguistics, pages 139–169.

Mitchell Marcus, Grace Kim, Mary Ann Marcinkiewicz, Robert MacIntyre, Ann Bies, Mark Ferguson, Karen Katz, and Britta Schasberger. 1994. The Penn Treebank: Annotating predicate argument structure. In Proceedings of the Workshop on Human Language Technology, HLT '94, pages 114–119. Association for Computational Linguistics, Stroudsburg, PA, USA.

Paramita Mirza and Sara Tonelli. 2014. An analysis of causality between events and its relation to temporal information. In Proceedings of COLING, pages 2097–2106.

Nasrin Mostafazadeh, Alyson Grealish, Nathanael Chambers, James Allen, and Lucy Vanderwende. 2016. CaTeRS: Causal and temporal relation scheme for semantic annotation of event structures. In Proceedings of the 4th Workshop on Events: Definition, Detection, Coreference, and Representation, pages 51–61. Association for Computational Linguistics.

Ad Neeleman and Hans Van de Koot. 2012. The Theta System: Argument Structure at the Interface, chapter The Linguistic Expression of Causation, pages 20–51. Oxford University Press.

Tim O'Gorman, Kristin Wright-Bettner, and Martha Palmer. 2016. Richer Event Description: Integrating event coreference with temporal, causal and bridging annotation. Computing News Storylines, page 47.

Martha Palmer, Daniel Gildea, and Paul Kingsbury. 2005. The Proposition Bank: An annotated corpus of semantic roles. Computational Linguistics, 31(1):71–106.

Rashmi Prasad, Nikhil Dinesh, Alan Lee, Eleni Miltsakaki, Livio Robaldo, Aravind Joshi, and Bonnie Webber. 2008. The Penn Discourse TreeBank 2.0. In Nicoletta Calzolari, Khalid Choukri, Bente Maegaard, Joseph Mariani, Jan Odijk, Stelios Piperidis, and Daniel Tapias, editors, Proceedings of LREC 2008. European Language Resources Association (ELRA), Marrakech, Morocco.

Josef Ruppenhofer, Michael Ellsworth, Miriam R. L. Petruck, Christopher R. Johnson, Jan Scheffczyk, and Collin F. Baker. 2016. FrameNet II: Extended theory and practice.

Evan Sandhaus. 2008. The New York Times annotated corpus. Linguistic Data Consortium, Philadelphia.

Nathan Schneider, Jena D. Hwang, Vivek Srikumar, Meredith Green, Abhijit Suresh, Kathryn Conger, Tim O'Gorman, and Martha Palmer. 2016. A corpus of preposition supersenses. In Proceedings of the 10th Linguistic Annotation Workshop held in conjunction with ACL 2016 (LAW-X 2016), pages 99–109. Association for Computational Linguistics.

Nathan Schneider, Vivek Srikumar, Jena D. Hwang, and Martha Palmer. 2015. A hierarchy with, of, and for preposition supersenses. In Proceedings of The 9th Linguistic Annotation Workshop, pages 112–123. Association for Computational Linguistics, Denver, Colorado, USA.

Karin K. Schuler. 2005. VerbNet: A Broad-Coverage, Comprehensive Verb Lexicon. Ph.D. thesis, University of Pennsylvania, Philadelphia, PA. AAI3179808.

Noah A. Smith, Claire Cardie, Anne L. Washington, and John Wilkerson. 2014. Overview of the 2014 NLP unshared task in PoliInformatics. In Proceedings of the ACL 2014 Workshop on Language Technologies and Computational Social Science.

Pontus Stenetorp, Sampo Pyysalo, Goran Topić, Tomoko Ohta, Sophia Ananiadou, and Jun'ichi Tsujii. 2012. BRAT: a web-based tool for NLP-assisted text annotation. In Proceedings of the Demonstrations at the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 102–107. Association for Computational Linguistics.

Laure Vieu, Philippe Muller, Marie Candito, and Marianne Djemaa. 2016. A general framework for the annotation of causality based on FrameNet. In Nicoletta Calzolari (Conference Chair), Khalid Choukri, Thierry Declerck, Sara Goggi, Marko Grobelnik, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asunción Moreno, Jan Odijk, and Stelios Piperidis, editors, Proceedings of LREC 2016. European Language Resources Association (ELRA), Paris, France.


Proceedings of the 11th Linguistic Annotation Workshop, pages 105–114, Valencia, Spain, April 3, 2017. ©2017 Association for Computational Linguistics

Catching the Common Cause: Extraction and Annotation of Causal Relations and their Participants

Ines Rehbein, Josef Ruppenhofer
IDS Mannheim / University of Heidelberg, Germany
Leibniz Science Campus "Empirical Linguistics and Computational Language Modeling"
{rehbein, ruppenhofer}@cl.uni-heidelberg.de

Abstract

In this paper, we present a simple, yet effective method for the automatic identification and extraction of causal relations from text, based on a large English-German parallel corpus. The goal of this effort is to create a lexical resource for German causal relations. The resource will consist of a lexicon that describes constructions that trigger causality as well as the participants of the causal event, and will be augmented by a corpus with annotated instances for each entry, that can be used as training data to develop a system for automatic classification of causal relations. Focusing on verbs, our method harvested a set of 100 different lexical triggers of causality, including support verb constructions. At the moment, our corpus includes over 1,000 annotated instances. The lexicon and the annotated data will be made available to the research community.

1 Introduction

Causality is an important concept that helps us to make sense of the world around us. This is exemplified by the Causality-by-default hypothesis (Sanders, 2005) that has shown that humans, when presented with two consecutive sentences expressing a relation that is ambiguous between a causal and an additive reading, commonly interpret the relation as causal.

Despite, or maybe because of, its pervasive nature, causality is a concept that has proven to be notoriously difficult to define. Proposals have been made that describe causality from a philosophical point of view, such as the Counterfactual Theory of causation (Lewis, 1973), theories of probabilistic causation (Suppes, 1970; Pearl, 1988), and production theories like the Dynamic Force Model (Talmy, 1988).

Counterfactual Theory tries to explain causality between two events C and E in terms of conditionals such as "If C had not occurred, E would not have occurred". However, psychological studies have shown that this does not always coincide with how humans understand and draw causal inferences (Byrne, 2005). Probabilistic theories, on the other hand, try to explain causality based on the underlying probability of an event to take place in the world. The theory that has had the greatest impact on linguistic annotation of causality is probably Talmy's Dynamic Force Model which provides a framework that tries to distinguish weak and strong causal forces, and captures different types of causality such as "letting", "hindering", "helping" or "intending".

While each of these theories manages to explain some aspects of causality, none of them seems to provide a completely satisfying account of the phenomenon under consideration. The problem of capturing and specifying the concept of causality is also reflected in linguistic annotation efforts. Human annotators often show only moderate or even poor agreement when annotating causal phenomena (Grivaz, 2010; Gastel et al., 2011). Some annotation efforts abstain from reporting inter-annotator agreement at all.

A notable exception is Dunietz et al. (2015) who take a lexical approach and aim at building a constructicon for English causal language. By constructicon they mean "a list of English constructions that conventionally express causality" (Dunietz et al., 2015). They show that their approach dramatically increases agreement between the annotators and thus the quality of the annotations (for details see section 2). We adapt their approach of framing the annotation task as a lexicon creation process and present first steps towards building a causal constructicon for German. Our annotation scheme is based on the one of Dunietz et al. (2015), but with some crucial changes (section 3).

The resource under construction contains a lexicon component with entries for lexical units (individual words and multiword expressions) for different parts of speech, augmented with annotations for each entry that can be used to develop a system for the automatic identification of causal language.

The contributions of this paper are as follows.

1. We present a bootstrapping method to identify and extract causal relations and their participants from text, based on parallel corpora.

2. We present the first version of a German causal constructicon, containing 100 entries for causal verbal expressions.

3. We provide over 1,000 annotated causal instances (and growing) for the lexical triggers, augmented by a set of negative instances to be used as training data.

The remainder of the paper is structured as follows. First, we review related work on annotating causal language (section 2). In section 3, we describe our annotation scheme and the data we use in our experiments. Sections 4, 5 and 6 present our approach and the results, and we conclude and outline future work in section 7.

2 Related Work

Two strands of research are relevant to our work, a) work on automatic detection of causal relations in text, and b) annotation studies that discuss the description and disambiguation of causal phenomena in natural language. As we are still in the process of building our resource and collecting training data, we will for now set aside work on automatic classification of causality such as (Mirza and Tonelli, 2014; Dunietz et al., In press) as well as the rich literature on shallow discourse parsing, and focus on annotation and identification of causal phenomena.

Early work on identification and extraction of causal relations from text heavily relied on knowledge bases (Kaplan and Berry-Rogghe, 1991; Girju, 2003). Girju (2003) identifies instances of noun-verb-noun causal relations in WordNet glosses, such as starvation[N1] causes bonyness[N2]. She then uses the extracted noun pairs to search a large corpus for verbs that link one of the noun pairs from the list, and collects these verbs. Many of the verbs are, however, ambiguous. Based on the extracted verb list, Girju selects sentences from a large corpus that contain such an ambiguous verb, and manually disambiguates the sentences to be included in a training set. She then uses the annotated data to train a decision tree classifier that can be used to classify new instances.

Our approach is similar to hers in that we also use the English verb cause as a seed to identify transitive causal verbs. In contrast to Girju's WordNet-based approach, we use parallel data and project the English tokens to their German counterparts.

Ours is not the first work that exploits parallel or comparable corpora for causality detection. Hidey and McKeown (2016) work with monolingual comparable corpora, English Wikipedia and Simple Wikipedia. They use explicit discourse connectives from the PDTB (Prasad et al., 2008) as seed data and identify alternative lexicalizations for causal discourse relations. Versley (2010) classifies German explicit discourse relations without German training data, solely based on the English annotations projected to German via word-aligned parallel text. He also presents a bootstrapping approach for a connective dictionary that relies on distribution-based heuristics on word-aligned German-English text.

Like Versley (2010), most work on identifying causal language for German has been focusing on discourse connectives. Stede et al. (1998; 2002) have developed a lexicon of German discourse markers that has been augmented with semantic relations (Scheffler and Stede, 2016). Another resource for German is the TüBa-D/Z that includes annotations for selected discourse connectives, with a small number of causal connectives (Gastel et al., 2011). Bögel et al. (2014) present a rule-based system for identifying eight causal German connectors in spoken multilogs, and the causal relations REASON, RESULT expressed by them.

To the best of our knowledge, ours is the first effort to describe causality in German on a broader scale, not limited to discourse connectives.


3 Annotation Scheme

Our annotation aims at providing a description of causal events and their participants, similar to FrameNet-style annotations (Ruppenhofer et al., 2006), but at a more coarse-grained level. In FrameNet, we have a high number of different causal frames with detailed descriptions of the actors, agents and entities involved in the event.1 For instance, FrameNet captures details such as the intentionality of the triggering force, to express whether or not the action was performed volitionally.

In contrast, we target a more generic representation that captures different types of causality, and that allows us to generalize over the different participants and thus makes it feasible to train an automatic system by abstracting away from individual lexical triggers. The advantage of such an approach is greater generalizability and thus higher coverage; the success, however, remains to be proven. Our annotation scheme includes the following four participant roles:

1. CAUSE – a force, process, event or action that produces an effect

2. EFFECT – the result of the process, event or action

3. ACTOR – an entity that, volitionally or not, triggers the effect

4. AFFECTED – an entity that is affected by the results of the cause

Our role set is different from Dunietz et al. (2015) who restrict the annotation of causal arguments to CAUSE and EFFECT. Our motivation for extending the label set is twofold. First, different verbal causal triggers show strong selectional preferences for specific participant roles. Compare, for instance, examples (1) and (2). The two argument slots for the verbal triggers erzeugen (produce) and erleiden (suffer) are filled with different roles. The subject slot for erzeugen expresses either CAUSE or ACTOR and the direct object encodes the EFFECT. For erleiden, on the other hand, the subject typically realises the role of the AFFECTED entity, and we often have the CAUSE or ACTOR encoded as the prepositional object of a durch (by) PP.

1 Also see Vieu et al. (2016) for a revised and improved treatment of causality in FrameNet.

(1) Elektromagnetische Felder[Cause] können Krebs[Effect] erzeugen.
    electromagnetic fields can cancer produce
    "Electromagnetic fields can cause cancer."

(2) Länder wie Irland[Affected] werden durch die Reform[Cause] massive Nachteile[Effect] erleiden.
    countries like Ireland will by the reform massive disadvantages suffer
    "Countries like Ireland will sustain massive disadvantages because of the reform."

Given that there are systematic differences between prototypical properties of the participants (e.g. an ACTOR is usually animate and a sentient being), and also in the way they combine with and select their predicates, we would like to preserve this information and see if we can exploit it when training an automatic system.

In addition to the participants of a causal event, we follow Dunietz et al. (2015) and distinguish four different types of causation (CONSEQUENCE, MOTIVATION, PURPOSE, INFERENCE), and two degrees (FACILITATE, INHIBIT). The degree distinctions are inspired by Wolff et al. (2005) who see causality as a continuum from total prevention to total entailment, and describe this continuum with three categories, namely CAUSE, ENABLE and PREVENT. Dunietz et al. (2015) further reduce this inventory to a polar distinction between a positive causal relation (e.g. cause) and a negative one (e.g. prevent), as they observed that human coders were not able to reliably apply the more fine-grained inventories.2 The examples below illustrate the different types of causation.

(3) Cancer[Cause] is second only to accidents[Cause] as a cause of death[Effect] in children[Affected]   CONSEQUENCE

(4) I would like to say a few words in order to highlight two points   PURPOSE

(5) She must be home[Effect] because the light is on[Cause]   INFERENCE

(6) The decision is made[Cause] so let us leave the matter there[Effect]   MOTIVATION

Epistemic uses of causality are covered by the INFERENCE class while we annotate instances of speech-act causality (7) as MOTIVATION (see Sweetser (1990) for an in-depth discussion on that matter). This is also different from Dunietz et al. (2015) who only deal with causal language, not with causality in the world. We, instead, are also interested in relations that are interpreted as causal by humans, even if they are not strictly expressed as causal by a lexical marker, such as temporal relations or speech-act causality.

2 For the polar distinction, they report perfect agreement.

(7) And if you want to say no, say no[Effect]
    'Cause there's a million ways to go[Cause]   MOTIVATION

A final point that needs to be mentioned is that Dunietz et al. (2015) exclude items such as kill or persuade that incorporate the result (e.g. death) or means (e.g. talk) of causation as part of their meaning. Again, we follow Dunietz et al. and also exclude such cases from our lexicon.

In this work, we focus on verbal triggers of causality. Due to our extraction method (section 4), we are mostly dealing with verbal triggers that are instances of the type CONSEQUENCE. Therefore we cannot say much about the applicability of the different annotation types at this point but will leave this to future work.

4 Knowledge-lean extraction of causal relations and their participants

We now describe our method for automatically identifying new causal triggers from text, based on parallel corpora. Using English-German parallel data has the advantage that it allows us to use existing lexical resources for English such as WordNet (Miller, 1995) or FrameNet (Ruppenhofer et al., 2006) as seed data for extracting German causal relations. In this work, however, we focus on a knowledge-lean approach where we refrain from using preexisting resources and try to find out how far we can get if we rely on parallel text only. As a trigger, we use the English verb to cause that always has a causal meaning.

4.1 Data

The data we use in our experiments come from the English-German part of the Europarl corpus (Koehn, 2005). The corpus is aligned on the sentence level and contains more than 1.9 million English-German parallel sentences. We tokenised and parsed the text to obtain dependency trees, using the Stanford parser (Chen and Manning, 2014) for English and the RBG parser (Lei et al., 2014) for German. We then applied the Berkeley Aligner (DeNero and Klein, 2007) to obtain word alignments for all aligned sentences. This allows us to map the dependency trees onto each other and to project (most of) the tokens from English to German and vice versa.3

[Figure 1 here: a parallel dependency tree pair for the English sentence "Gentrification causes social problems" (with nsubj and dobj arcs) and its word-aligned German translation "Gentrifizierung führt zu sozialen Problemen".]

Figure 1: Parallel tree with English cause and aligned German noun pair

4.2 Method

Step 1  First, we select all sentences in the corpus that contain a form of the English verb cause. We then restrict our set of candidates to instances of cause where both the subject and the direct object are realised as nouns, as illustrated in example (8).

(8) Alcohol[nsubj] causes 17 000 needless deaths[dobj] on the road a year.

Starting from these sentences, we filter our candidate set and only keep those sentences that also have German nouns aligned to the English subject and object position. Please note that we do not require that the grammatical functions of the German counterparts also be subject and object, only that they are aligned to the English core arguments. We then extract the aligned German noun pairs and use them as seed data for step 2 of the extraction process.

For Figure 1, for example, we would first identify the English subject (gentrification) and direct object (problems), project them to their German nominal counterparts (Gentrifizierung, Problemen), the first one also filling the subject slot but the second one being realised as a prepositional object. We would thus extract the lemma forms for the German noun pair (Gentrifizierung → Problem) and use it for the extraction of causal triggers in step 2 (see Algorithm 1).

3 Some tokens did not receive an alignment and are thus ignored in our experiments.
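As a sketch of what Step 1 amounts to in code: given an English dependency parse, POS tags, lemmas, and a token-level alignment into the German side, we keep the German lemma pair aligned to the nsubj and dobj of cause. The sentence/token interface below (en_tokens, dependents, aligned_german, and so on) is invented for illustration only; it is not the implementation used here.

    def extract_noun_pairs(parallel_sentences):
        """Step 1: German (N1, N2) lemma pairs aligned to the English subject
        and direct object of the verb 'cause'."""
        pairs = set()
        for sent in parallel_sentences:
            for tok in sent.en_tokens:
                if tok.lemma != "cause" or tok.pos != "VERB":
                    continue
                subj = next((d for d in tok.dependents if d.rel == "nsubj"), None)
                dobj = next((d for d in tok.dependents if d.rel == "dobj"), None)
                if not (subj and dobj and subj.pos == "NOUN" and dobj.pos == "NOUN"):
                    continue
                de_subj = sent.aligned_german(subj)   # project via the word alignment
                de_dobj = sent.aligned_german(dobj)
                if de_subj and de_dobj and de_subj.pos == "NOUN" and de_dobj.pos == "NOUN":
                    pairs.add((de_subj.lemma, de_dobj.lemma))
        return pairs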


Data: Europarl (En-Ge)
Input: seed word: cause (En)
Output: list of causal triggers (Ge)

STEP 1:
if seed in sentence then
    if cause linked to subj, dobj (En) then
        if subj, dobj == noun then
            if subj, dobj aligned with nouns (Ge) then
                extract noun pair (Ge);
            end
        end
    end
end

STEP 2:
for n1, n2 in noun pairs (Ge) do
    if n1, n2 in sentence then
        if common ancestor ca(n1, n2) then
            if dist(ca, n1) == 1 & dist(ca, n2) <= 3 then
                extract ancestor as trigger;
            end
        end
    end
end

Algorithm 1: Extraction of causal triggers from parallel text (Step 1: extraction of noun pairs; Step 2: extraction of causal triggers)

Step 2  We now have a set of noun pairs that we use to search the monolingual German part of the data and extract all sentences that include one of these noun pairs. We test two settings, the first one being rather restrictive while the second one allows for more variation and thus will probably also extract more noise. We refer to the two settings as strict (setting 1) and loose (setting 2).

In setting 1, we require that the two nouns of each noun pair fill the subject and direct object slot of the same verb.4 In the second setting, we extract all sentences that include one of the noun pairs, with the restriction that the two nouns have a common ancestor in the dependency tree that is a direct parent of the first noun5 and not further away from the second noun than three steps up in the tree.

This means that the tree in Figure 1 would be ignored in the first setting, but not for setting 2.

4 Please note that in this step of the extraction we do not condition the candidates on being aligned to an English sentence containing the English verb cause.

5 As word order in German is more flexible than in English, the first noun is not defined by linear order but is the one that fills a subject slot in the parse tree.

Here we would extract the direct head of the first noun, which will give us the verb führen (lead), and extract up to three ancestors for the second noun. As the second noun, Problem, is attached to the preposition zu (to) (distance 1) which is in turn attached to the verb führen (distance 2), we would consider the example a true positive and extract the verb führen as linking our two nouns.
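Read procedurally, the loose setting just described is a small tree walk over the German dependency parse: the candidate trigger must be the direct head of the first noun and at most three steps above the second, exactly as in Algorithm 1. One possible rendering in Python, reusing the illustrative token interface from the Step 1 sketch (again, not the actual implementation):

    def ancestors(token, max_steps):
        """Yield (ancestor, distance) pairs walking up the dependency tree."""
        node, dist = token.head, 1
        while node is not None and dist <= max_steps:
            yield node, dist
            node, dist = node.head, dist + 1

    def extract_triggers_loose(german_sentences, noun_pairs, max_dist_n2=3):
        """Step 2, loose setting: trigger = direct parent of n1, at most
        max_dist_n2 steps above n2 (e.g. 'führen' for Figure 1)."""
        triggers = set()
        for sent in german_sentences:
            by_lemma = {tok.lemma: tok for tok in sent.tokens}  # one token per lemma is enough for a sketch
            for n1, n2 in noun_pairs:
                if n1 not in by_lemma or n2 not in by_lemma:
                    continue
                parent = by_lemma[n1].head  # distance 1 from the first noun
                if parent is not None and any(anc is parent
                                              for anc, _ in ancestors(by_lemma[n2], max_dist_n2)):
                    triggers.add(parent.lemma)
        return triggers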

While the first setting is heavily biased towards transitive verbs that are causal triggers, setting 2 will also detect instances where the causal trigger is a noun, as in (9).

(9) Gentrifizierung[Cause] ist die Ursache von sozialen Problemen[Effect].
    gentrification is the reason of social problems
    "Gentrification causes social problems."

In addition, we are also able to find support verb constructions that trigger causality, as in (10).

(10) Die gemeinsame Agrarpolitik[Cause] gibt stets Anlass zu hitzigen Debatten[Effect].
     the common agricultural-policy gives always rise to heated debates
     "The common agricultural policy always gives rise to heated debates."

As both the word alignments and the dependency parses have been created automatically, we can expect a certain amount of noise in the data. Furthermore, we also have to deal with translation shifts, i.e. sentences that have a causal meaning in English but not in the German translation. A case in point is example (11) where the English cause has been translated into German by the non-causal stattfinden (take place) (12).

(11) [...] that none of the upheaval[nsubj] would have been caused [...]

(12) [...] dass diese Umwälzungen nicht stattgefunden hätten [...]
     that this upheaval not take-place had
     "[...] that none of the upheaval would have taken place [...]"


Using the approach outlined above, we want to identify new causal triggers to populate the lexicon. We also want to identify causal instances for these triggers for annotation, to be included in our resource. To pursue this goal and to minimize human annotation effort, we are interested in i) how many German causal verbs can be identified using this method, and ii) how many false positives are extracted, i.e. instances that cannot have a causal reading. Both questions have to be evaluated on the type level. In addition, we want to know iii) how many of the extracted candidate sentences are causal instances. This has to be decided on the token level, for each candidate sentence individually.

5 Results for extracting causal relations from parallel text

Step 1  Using the approach described in section 4.2, we extracted all German noun pairs from Europarl that were linked to two nouns in the English part of the corpus that filled the argument slots of the verb cause. Most of the noun pairs appeared only once, 12 pairs appeared twice, 3 pairs occurred 3 times, and the noun pair Hochwasser (floodwater) – Schaden (damage) was the most frequent one with 6 occurrences. In total, we extracted 343 unique German noun pairs from Europarl that we used as seed data to identify causal triggers in step 2.

We found 45 different verb types that linked these noun pairs, the most frequent one being, unsurprisingly, verursachen (cause) with 147 instances. Also frequent were other direct translations of cause, namely hervorrufen (induce) and auslösen (trigger), both with 31 instances, and anrichten (wreak) with 21 instances. We also found highly ambiguous translations like bringen (bring, 18 instances) and verbs that often appear in support verb constructions, like haben (have, 11 instances), as illustrated below (examples (13), (14)).

(13) Fundamentalismus[N1] bringt in Gesellschaften gravierende Probleme[N2] mit sich
     fundamentalism brings in societies serious problems with itself
     "Fundamentalism causes many problems within societies"

(14) Nun weiß man aber, daß die Müllverbrennung[N1] die Emission[N2] von Substanzen zur Folge hat
     now know one but that the incineration of waste the emission of substances to the result has
     "It is well known that the incineration of waste causes emissions of substances"

Please note that at this point we ignore the verbs and only keep the noun pairs, to be used as seed data for the extraction of causal triggers in step 2. From examples (13) and (14) above, we extract the following two noun pairs:

     N1                    N2
1.   Fundamentalismus   ⇒  Problem      (fundamentalism ⇒ problem)
2.   Müllverbrennung    ⇒  Emission     (incineration of waste ⇒ emission)

Step 2 Using the 343 noun pairs extracted in step 1, we now search the monolingual part of the corpus and extract all sentences that include one of these noun pairs as arguments of the same verb. As a result, we get a list of verbal triggers that potentially have a causal reading. We now report results for the two different settings, strict and loose.
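A hedged sketch of this harvesting step, reusing the linking_verb helper from the first sketch (the loose setting corresponds to max_up=3; the strict setting would additionally require both nouns to be direct arguments of the verb):

```python
def candidate_triggers(parsed_sentences, seed_pairs, max_up=3):
    """Harvest verbs that link a seed noun pair in monolingual German text.

    parsed_sentences: iterable of dependency parses in the token-dict format
    sketched above; seed_pairs: iterable of (cause_lemma, effect_lemma) tuples.
    Returns a dict mapping candidate verb lemmas to their supporting sentences.
    """
    triggers = {}
    for tokens in parsed_sentences:
        noun_ids = {t["lemma"]: t["id"] for t in tokens if t["pos"] == "NOUN"}
        for n1, n2 in seed_pairs:
            if n1 in noun_ids and n2 in noun_ids:
                verb = linking_verb(tokens, noun_ids[n1], noun_ids[n2], max_up)
                if verb is not None:
                    triggers.setdefault(verb, []).append(tokens)
    return triggers
```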

For setting 1, we harvest a list of 68 verb types. We manually filter the list and remove instances that did not have a causal reading, amongst them most of the instances that occurred only once, such as spielen (play), schweigen (be silent), zugeben (admit), nehmen (take), and finden (find).

Some of the false positives are in fact instances of causal particle verbs. In German, the verb particle can be separated from the verb stem. We did consider this for the extraction and contracted verb particles with their corresponding verb stem. However, sometimes the parser failed to assign the correct POS label to the verb particle, which is why we find instances e.g. of richten (rather than anrichten, wreak), stellen (darstellen, pose), and treten (auftreten, occur) in the list of false positives.

After manual filtering, we end up with a rather short list of 22 transitive German verbs with a causal reading for the first setting.

For setting 2, we loosen the constraints for the extraction and obtain a much larger list of 406 unique trigger types. As expected, the list also includes more noise, but is still manageable for manual revision in a short period of time.


Page 123: LAW XI 2017 - sigann.github.io

step 1: noun pairs                  343
step 2: causal trigger types
        setting 1 (strict)           22
        setting 2 (loose)            79
        setting 3 (boost)           100

Table 1: No. of causal triggers extracted in different settings (Europarl, German–English)

As shown in Table 1, after filtering we obtain a final list of 79 causal triggers, out of which 48 follow the transitive pattern <N1[subj] causes N2[dobj]>, where the subject expresses the cause and the direct object the effect. There seem to be no restrictions on what grammatical function can be expressed by what causal role, but we find strong selectional preferences for the individual triggers, at least for the core arguments (Table 2). The verb verursachen (cause), for example, expresses CAUSE/ACTOR as the subject and EFFECT as the direct object, while abhängen (depend) puts the EFFECT in the subject slot and realises the CAUSE

as an indirect object. Often, additional roles are expressed by a PP or a clausal complement. While many triggers accept either CAUSE or ACTOR to be expressed interchangeably by the same grammatical function, there also exist some triggers that are restricted to one of the roles. Zu Grunde liegen (be at the bottom of), for example, does not accept an ACTOR role as subject. These restrictions will be encoded in the lexicon, to support the annotation.
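The paper does not specify the lexicon format; purely as an illustration, the restrictions from Table 2 could be encoded along the following lines, where every key and value is hypothetical:

```python
# Hypothetical lexicon entries encoding role-to-function mappings (cf. Table 2);
# the format is ours, not the authors'.
LEXICON = {
    "verursachen": {
        "gloss": "cause",
        "roles": {"Cause/Actor": "subj", "Effect": "dobj", "Affected": "for-PP"},
    },
    "zu Grunde liegen": {
        "gloss": "be at the bottom of",
        "roles": {"Cause": "subj", "Effect": "iobj"},
        "constraints": ["subject must not be an Actor"],
    },
}
```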

5.1 Annotation and inter-annotator agreement

From our extraction experiments based on parallel corpora (setting 2), we obtained a list of 79 causal triggers to be included in the lexicon. As we also want to have annotated training data to accompany the lexicon, we sampled the data and randomly selected N = 50 sentences for each trigger.6

We then started to manually annotate the data. The annotation process includes the following two subtasks:

1. Given a trigger in context, does it convey a causal meaning?

6 In this work we focused on verbal triggers, thus the unit of analysis is the clause. This will not be the case for triggers that evoke a discourse relation between two abstract objects, where an abstract object can be realized by one or more sentences, or only by a part of a sentence.

example trigger                            Cause/Actor   Effect   Affected
verursachen (cause)                        subj          dobj     for-PP
abhängen (depend)                          iobj          subj     for-PP
zwingen (force)                            subj          to-PP    dobj
zu Grunde liegen (be at the bottom of)     subj          iobj
aus der Welt schaffen                      subj          dobj
  (to dispose of once and for all)

Table 2: Examples of causal triggers and their roles

2. Given a causal sentence, which roles are expressed within the sentence?7

What remains to be done is the annotation of the causal type of the instance. As noted above, the reason for postponing this annotation step is that we first wanted to create the lexicon and be confident about the annotation scheme. A complete lexicon entry for each trigger specifying the type (or types and/or constraints) will crucially support the annotation and make it not only more consistent, but also much faster.

So far, we computed inter-annotator agreement on a subsample of our data with 427 instances (and 22 different triggers), to get a first idea of the feasibility of the annotation task. The two annotators are experts in linguistic annotation (the two authors of the paper), but could not use the lexicon to guide their decisions, as this was still under construction at the time of the annotation.

We report agreement for the following two subtasks. The first task concerns the decision whether or not a given trigger is causal. Here the two annotators obtained a percentage agreement of 94.4% and a Fleiss' κ of 0.78.

An error analysis reveals that the first annotator had a stricter interpretation of causality than annotator 2. Both annotators agreed on 352 instances being causal and 51 being non-causal. However, annotator 1 also judged 24 instances as non-causal that had been rated as causal by annotator 2. Many of the disagreements concerned the two verbs bringen (bring) and bedeuten (mean) and were systematic differences that could easily be resolved and documented in the lexicon and annotation guidelines, e.g. the frequent support verb construction in example (15).
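As a sanity check, the reported figures can be recomputed from these confusion counts; the snippet below is our own verification (with two annotators, Fleiss' κ reduces to Scott's π) and is not part of the paper's pipeline.

```python
# Counts reported above: 352 causal for both annotators, 51 non-causal for
# both, 24 causal only for annotator 2 (none causal only for annotator 1).
both_causal, both_noncausal, only_a2, only_a1 = 352, 51, 24, 0
n = both_causal + both_noncausal + only_a2 + only_a1  # 427 instances

observed = (both_causal + both_noncausal) / n  # percentage agreement

# Pooled category proportion over both annotators (two judgements per instance)
p_causal = (2 * both_causal + only_a1 + only_a2) / (2 * n)
expected = p_causal ** 2 + (1 - p_causal) ** 2

kappa = (observed - expected) / (1 - expected)
print(f"agreement = {observed:.3f}, kappa = {kappa:.2f}")  # 0.944 and 0.78
```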

7 We do not annotate causal participants across sentence borders, even though this is a plausible scenario. See, e.g., the annotation of implicit roles in SRL (Ruppenhofer et al., 2013).


Page 124: LAW XI 2017 - sigann.github.io

                  no.   % agr.   κ
task 1   causal   427   94.4     0.78
task 2   N1       352   94.9     0.74
         N2       352   99.1     0.95

Table 3: Annotation of causal transitive verbs: number of instances and IAA (percentage agreement and Fleiss' κ) for a subset of the data (427 sentences, 352 instances annotated as causal by both annotators)

(15) zum Ausdruck bringen
     to the expression bring
     "to express something"

For the second task, assigning role labels to the first (N1) and the second noun (N2), it became obvious that annotating the role of the first noun is markedly more difficult than for the second noun (Table 3). The reason for this is that the Actor-Cause distinction that is relevant to the first noun is not always a trivial one. Here we also observed systematic differences in the annotations that were easy to resolve, mostly concerning the question whether organisations such as the European Union, a member state or a commission are to be interpreted as an actor or rather as a cause.

We think that our preliminary results are promising and confirm the findings of Dunietz et al. (2015), and we expect an even higher agreement for the next round of annotations, where we can also make use of the lexicon.

5.2 Discussion

Section 4 has shown the potential of our method for identifying and extracting causal relations from text. The advantage of our approach is that we do not depend on the existence of precompiled knowledge bases, but rely on automatically preprocessed parallel text only. Our method is able to detect causal patterns across different parts of speech. Using a strong causal trigger and further constraints for the extraction, such as restricting the candidate set to sentences with a subject and direct object NP linked to the target predicate, we are able to guide the extraction towards instances that, to a large degree, are in fact causal. In comparison, Girju (2003) reported a ratio of 0.32 causal sentences (2,101 out of 6,523 instances), while our method yields a ratio of 0.74 (787 causal instances out of 1,069). Unfortunately, this also reduces the variation in trigger types and

Unsicherheit (uncertainty)                      cos
Verunsicherung       uncertainty                0.87
Unsicherheiten       insecurities               0.80
Unzufriedenheit      dissatisfaction            0.78
Frustration          frustration                0.78
Nervosität           nervousness                0.75
Ungewissheit         incertitude                0.74
Unruhe               concern                    0.74
Ratlosigkeit         perplexity                 0.74
Überforderung        excessive demands          0.73

Table 4: The 10 most similar nouns for Unsicherheit (insecurity), based on cosine similarity and word2vec embeddings.

is thus not a suitable method for creating a representative training set. We address this problem by loosening the constraints for the extraction, which allows us to detect a wide variety of causal expressions at a reasonable cost.

Our approach, using bilingual data, provides us with a natural environment for bootstrapping. We can now use the already known noun pairs as seed data, extract similar nouns to expand our seed set, and use the expanded set to find new causal expressions. We will explore this in our final experiment.

6 Bootstrapping causal relations

In this section, we want to generalise over the noun pairs that we extracted in the first step of the extraction process. For instance, given the noun pair {smoking, cancer}, we would also like to search for noun pairs expressing a similar relation, such as {alcohol, health problems} or {drugs, suffering}. Accordingly, we call this third setting boost. Sticking to our knowledge-lean approach, we do not make use of resources such as WordNet or FrameNet, but instead use word embeddings to identify similar words.8 For each noun pair in our list, we compute cosine similarity to all words in the embeddings and extract the 10 most similar words for each noun of the pair. We use a lemma dictionary extracted from the TüBa-D/Z treebank (release 10.0) (Telljohann et al., 2015) to look up the lemma forms for each word, and ignore all words that are not listed as a noun in our dictionary.

Table 4 shows the 10 words in the embedding file that have the highest similarity to the target noun Unsicherheit (uncertainty). To minimise noise, we also set a threshold of 0.75 and exclude

8 We use the pre-trained word2vec embeddings provided by Reimers et al. (2014), with a dimension of 100.


Page 125: LAW XI 2017 - sigann.github.io

all words with a cosine similarity below that score. Having expanded our list, we now create new noun pairs by combining noun N1 with all similar words for N2, and N2 with all similar words for N1.9 We then proceed as usual and use the new, expanded noun pair list to extract new causal triggers the same way as in the loose setting. As we want to find new triggers that have not already been included in the lexicon, we discard all verb types that are already listed.
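The expansion step might look roughly as follows; this sketch uses gensim's KeyedVectors interface for the word2vec embeddings, and the file name and the noun_lemma dictionary (standing in for the TüBa-D/Z lemma lookup) are placeholders, not the authors' actual resources.

```python
from gensim.models import KeyedVectors

# Placeholder path to the pre-trained German embeddings.
vectors = KeyedVectors.load_word2vec_format("german_embeddings.txt")

def similar_nouns(word, noun_lemma, topn=10, threshold=0.75):
    """Lemmas of the topn most similar words above the cosine threshold,
    keeping only words listed as nouns in the lemma dictionary."""
    if word not in vectors:
        return []
    out = []
    for cand, score in vectors.most_similar(word, topn=topn):
        if score < threshold:
            break  # results are sorted by decreasing similarity
        lemma = noun_lemma.get(cand)  # None if not listed as a noun
        if lemma is not None:
            out.append(lemma)
    return out

def expand_pairs(seed_pairs, noun_lemma):
    """Combine each seed noun with the similar nouns of its partner;
    newly found nouns are never combined with each other (footnote 9)."""
    expanded = set()
    for n1, n2 in seed_pairs:
        expanded.update((n1, sim) for sim in similar_nouns(n2, noun_lemma))
        expanded.update((sim, n2) for sim in similar_nouns(n1, noun_lemma))
    return expanded
```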

Using our expanded noun pair list for extracting causal triggers, we obtain 131 candidate instances for manual inspection. As before, we remove false positives due to translation shifts and noise, and are able to identify 21 new causal triggers, resulting in a total of 100 German verbal triggers to be included in the lexicon (Table 1).

7 Conclusions and Future Work

We have presented a first effort to create a resource for describing German causal language, including a lexicon as well as an annotated training suite. We use a simple yet highly efficient method to detect new causal triggers, based on English-German parallel data. Our approach is knowledge-lean and succeeded in identifying and extracting 100 different types of causal verbal triggers, with only a small amount of human supervision.

Our approach offers several avenues for future work. One straightforward extension is to use other English causal triggers like nouns, prepositions, discourse connectives or causal multiword expressions, to detect German causal triggers with different parts of speech. We would also like to further exploit the bootstrapping setting, by projecting the German triggers back to English, extracting new noun pairs, and going back to German again. Another interesting setup is triangulation, where we would include a third language as a pivot to harvest new causal triggers. The intuition behind this approach is that if a causal trigger in the source language is aligned to a word in the pivot language, and that again is aligned to a word in the target language, then it is likely that the aligned token in the target language is also causal. Such a setting gives us grounds for generalisations while, at the same time, offering the opportunity to formulate constraints and filter out noise.

9 To avoid noise, we are conservative and do not combine the newly extracted nouns with each other.

Once we have a sufficient amount of training data, we plan to develop an automatic system for tagging causality in German texts. To prove the benefits of such a tool, we would like to apply our system in the context of argumentation mining.

Acknowledgments

This research has been conducted within the Leibniz Science Campus "Empirical Linguistics and Computational Modeling", funded by the Leibniz Association under grant no. SAS-2015-IDS-LWC and by the Ministry of Science, Research, and Art (MWK) of the state of Baden-Württemberg.

References

Tina Bögel, Annette Hautli-Janisz, Sebastian Sulger, and Miriam Butt. 2014. Automatic detection of causal relations in German multilogs. In Proceedings of the EACL 2014 Workshop on Computational Approaches to Causality in Language, CAtoCL, Gothenburg, Sweden.

Ruth M.J. Byrne. 2005. The rational imagination: How people create counterfactual alternatives to reality. Behavioral and Brain Sciences, 30:439–453.

Danqi Chen and Christopher Manning. 2014. A fast and accurate dependency parser using neural networks. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP, Doha, Qatar.

John DeNero and Dan Klein. 2007. Tailoring word alignments to syntactic machine translation. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, ACL'07, Prague, Czech Republic, June.

Jesse Dunietz, Lori Levin, and Jaime Carbonell. 2015. Annotating causal language using corpus lexicography of constructions. In Proceedings of The 9th Linguistic Annotation Workshop, LAW IX, Denver, Colorado, USA.

Jesse Dunietz, Lori Levin, and Jaime Carbonell. In press. Automatically tagging constructions of causation and their slot-fillers. In Transactions of the Association for Computational Linguistics.

Anna Gastel, Sabrina Schulze, Yannick Versley, and Erhard Hinrichs. 2011. Annotation of explicit and implicit discourse relations in the TüBa-D/Z treebank. In Proceedings of the Conference of the German Society for Computational Linguistics and Language Technology, GSCL'11, Hamburg, Germany.

Roxana Girju. 2003. Automatic detection of causal relations for question answering. In Proceedings of the ACL 2003 Workshop on Multilingual Summarization and Question Answering, Sapporo, Japan.


Page 126: LAW XI 2017 - sigann.github.io

Cécile Grivaz. 2010. Human judgements on causation in French texts. In Proceedings of the Seventh International Conference on Language Resources and Evaluation, LREC'10, Valletta, Malta.

Christopher Hidey and Kathy McKeown. 2016. Identifying causal relations using parallel Wikipedia articles. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, ACL'16, Berlin, Germany.

Randy M. Kaplan and Genevieve Berry-Rogghe. 1991. Knowledge-based acquisition of causal relationships in text. Knowledge Acquisition, 3(3):317–337.

Philipp Koehn. 2005. Europarl: A parallel corpus for statistical machine translation. In Proceedings of the 10th Machine Translation Summit, Phuket, Thailand.

Tao Lei, Yu Xin, Yuan Zhang, Regina Barzilay, and Tommi Jaakkola. 2014. Low-rank tensors for scoring dependency structures. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, ACL'14, Baltimore, Maryland.

David K. Lewis. 1973. Counterfactuals. Blackwell.

George A. Miller. 1995. WordNet: A lexical database for English. Communications of the ACM, 38(11):39–41.

Paramita Mirza and Sara Tonelli. 2014. An analysis of causality between events and its relation to temporal information. In Proceedings of the 25th International Conference on Computational Linguistics, COLING'14.

Judea Pearl. 1988. Probabilistic Reasoning in Intelligent Systems. San Francisco: Morgan Kaufman.

Rashmi Prasad, Nikhil Dinesh, Alan Lee, Eleni Miltsakaki, Livio Robaldo, Aravind K. Joshi, and Bonnie L. Webber. 2008. The Penn Discourse TreeBank 2.0. In Language Resources and Evaluation, LREC'08, Marrakech, Morocco.

Nils Reimers, Judith Eckle-Kohler, Carsten Schnober, Jungi Kim, and Iryna Gurevych. 2014. GermEval-2014: Nested Named Entity Recognition with neural networks. In Workshop Proceedings of the 12th Edition of the KONVENS Conference, Hildesheim, Germany.

Josef Ruppenhofer, Michael Ellsworth, Miriam R.L. Petruck, Christopher R. Johnson, and Jan Scheffczyk. 2006. FrameNet II: Extended Theory and Practice. International Computer Science Institute, Berkeley, California. Distributed with the FrameNet data.

Josef Ruppenhofer, Russell Lee-Goldman, Caroline Sporleder, and Roser Morante. 2013. Beyond sentence-level semantic role labeling: linking argument structures in discourse. Language Resources and Evaluation, 47(3):695–721.

Ted J.M. Sanders. 2005. Coherence, causality and cognitive complexity in discourse. In First International Symposium on the Exploration and Modelling of Meaning, SEM-05, Toulouse, France.

Tatjana Scheffler and Manfred Stede. 2016. Adding semantic relations to a large-coverage connective lexicon of German. In Proceedings of the 10th International Conference on Language Resources and Evaluation, LREC'16, Portorož, Slovenia.

Manfred Stede and Carla Umbach. 1998. DiMLex: A lexicon of discourse markers for text generation and understanding. In The 17th International Conference on Computational Linguistics, COLING'98.

Manfred Stede. 2002. DiMLex: A lexical approach to discourse markers. In Exploring the Lexicon – Theory and Computation. Alessandria (Italy): Edizioni dell'Orso.

Patrick Suppes. 1970. A Probabilistic Theory of Causality. Amsterdam: North-Holland Publishing Company.

Eve Sweetser. 1990. From etymology to pragmatics: metaphorical and cultural aspects of semantic structure. Cambridge; New York: Cambridge University Press.

Leonard Talmy. 1988. Force dynamics in language and cognition. Cognitive Science, 12(1):49–100.

Heike Telljohann, Erhard Hinrichs, Sandra Kübler, Heike Zinsmeister, and Katrin Beck. 2015. Stylebook for the Tübingen Treebank of Written German (TüBa-D/Z). Revised Version. Technical report, Seminar für Sprachwissenschaft, Universität Tübingen.

Yannick Versley. 2010. Discovery of ambiguous and unambiguous discourse connectives via annotation projection. In Workshop on the Annotation and Exploitation of Parallel Corpora, AEPC'10, Tartu, Estonia.

Laure Vieu, Philippe Muller, Marie Candito, and Marianne Djemaa. 2016. A general framework for the annotation of causality based on FrameNet. In Proceedings of the 10th International Conference on Language Resources and Evaluation, LREC'16, Portorož, Slovenia.

Phillip Wolff, Bianca Klettke, Tatyana Ventura, and Grace Song. 2005. Expressing causation in English and other languages. In Woo-kyoung Ahn, Robert Goldstone, Bradley C. Love, Arthur B. Markman, and Phillip Wolff, editors, Categorization inside and outside the laboratory: Essays in honor of Douglas Medin, pages 29–48. American Psychological Association, Washington, DC, US.


Page 127: LAW XI 2017 - sigann.github.io

Proceedings of the 11th Linguistic Annotation Workshop, pages 115–121, Valencia, Spain, April 3, 2017. ©2017 Association for Computational Linguistics

Assessing SRL Frameworks with Automatic Training Data Expansion

Silvana Hartmann†‡   Éva Mújdricza-Maydt∗‡   Ilia Kuznetsov†

Iryna Gurevych†‡ Anette Frank∗‡

†Ubiquitous Knowledge Processing Lab, Department of Computer Science, Technische Universität Darmstadt
∗Department of Computational Linguistics, Heidelberg University
‡Research Training School AIPHES

{hartmann,kuznetsov,gurevych}@ukp.informatik.tu-darmstadt.de
{mujdricza,frank}@cl.uni-heidelberg.de

Abstract

We present the first experiment-based study that explicitly contrasts the three major semantic role labeling frameworks. As a prerequisite, we create a dataset labeled with parallel FrameNet-, PropBank-, and VerbNet-style labels for German. We train a state-of-the-art SRL tool for German for the different annotation styles and provide a comparative analysis across frameworks. We further explore the behavior of the frameworks with automatic training data generation. VerbNet provides larger semantic expressivity than PropBank, and we find that its generalization capacity approaches PropBank in SRL training, but it benefits less from training data expansion than the sparse-data affected FrameNet.

1 Introduction

We present the first study that explicitly contrasts the three popular theoretical frameworks for semantic role labeling (SRL) – FrameNet, PropBank, and VerbNet1 – in a comparative experimental setup, i.e., using the same training and test sets annotated with predicate and role labels from the different frameworks and applying the same conditions and criteria for training and testing.

Previous work comparing these frameworks either provides theoretical investigations, for instance for the pair PropBank–FrameNet (Ellsworth et al., 2004), or presents experimental investigations for the pair PropBank–VerbNet (Zapirain et al., 2008; Merlo and van der Plas, 2009). Theoretical analyses contrast the richness of the semantic model of FrameNet with efficient annotation of PropBank labels and their suitability for system training. VerbNet

1 See Fillmore et al. (2003), Palmer et al. (2005) and Kipper-Schuler (2005), respectively.

is considered to range between them on both scales: it fulfills the need for semantically meaningful role labels; also, since the role labels are shared across predicate senses, it is expected to generalize better to unseen predicates than FrameNet, which suffers from data sparsity due to a fine-grained sense-specific role inventory. Yet, unlike PropBank and FrameNet, VerbNet has been neglected in recent work on SRL, partially due to the lack of training and evaluation data, whereas PropBank and FrameNet were popularized in shared tasks. As a result, the three frameworks have not been compared under equal experimental conditions.

This motivates our contrastive analysis of all three frameworks for German. We harness existing datasets for German (Burchardt et al., 2006; Hajič et al., 2009; Mújdricza-Maydt et al., 2016) to create SR3de (Semantic Role Triple Dataset for German), the first benchmark dataset labeled with FrameNet, VerbNet and PropBank roles in parallel.

Our motivation for working on German is that – as for many languages besides English – sufficient amounts of training data are not available. This clearly applies to our German dataset, which contains about 3,000 annotated predicates. In such a scenario, methods to extend training data automatically or to make efficient use of generalization across predicates (i.e., being able to apply role labels to unseen predicates) are particularly desirable. We assume that SRL frameworks that generalize better across predicates gain more from automatic training data generation, and lend themselves better to cross-predicate SRL. System performance also needs to be correlated with the semantic expressiveness of frameworks: with the ever-growing expectations in semantic NLP applications, SRL frameworks also need to be judged with regard to their contribution to advanced applications where expressiveness may play a role, such as question answering or summarization.


Page 128: LAW XI 2017 - sigann.github.io

Our work explores the generalization properties of three SRL frameworks in a contrastive setup, assessing SRL performance when training and evaluating on a dataset with parallel annotations for each framework in a uniform SRL system architecture. We also explore to what extent the frameworks benefit from training data generation via annotation projection (Fürstenau and Lapata, 2012).

Since all three frameworks have been applied to several languages,2 we expect our findings for German to generalize to other languages as well.

Our contributions are (i) novel resources: parallel German datasets for the three frameworks, including automatically acquired training data; and (ii) empirical comparison of the labeling performance and generalization capabilities of the three frameworks, which we discuss in view of their respective semantic expressiveness.

2 Related Work

Overview of SRL frameworks FrameNet defines frame-specific roles that are shared among predicates evoking the same frame. Thus, generalization across predicates is possible for predicates that belong to the same existing frame, but labeling predicates for unseen frames is not possible. Given the high number of frames – FrameNet covers about 1,200 frames and 10K role labels – large training datasets are required for system training.

PropBank offers a small role inventory of five core roles (A0 to A4) for obligatory arguments and around 18 roles for optional ones. The core roles closely follow syntactic structures and receive a predicate-specific interpretation, except for the Agent-like A0 and Patient-like A1 that implement Dowty's proto-roles theory (Dowty, 1991).

VerbNet defines about 35 semantically defined thematic roles that are not specific to predicates or predicate senses. Predicates are labeled with Levin-type semantic classes (Levin, 1993). VerbNet is typically assumed to range between FrameNet, with respect to its rich semantic representation, and PropBank, with its small, coarse-grained role inventory.

Comparison of SRL frameworks Previous experimental work compares VerbNet and PropBank: Zapirain et al. (2008) find that PropBank SRL is more robust than VerbNet SRL, generalizing better to unseen or rare predicates, and relying less on predicate sense. Still, they aspire to use more meaningful

2 Cf. Hajič et al. (2009), Sun et al. (2010), Boas (2009).

VerbNet roles in NLP tasks and thus propose using automatic PropBank SRL for core role identification and then converting the PropBank roles heuristically into VerbNet roles, which appears more robust in cross-domain experiments compared to training on VerbNet data. Merlo and van der Plas (2009) also confirm that PropBank roles are easier to assign than VerbNet roles, while the latter provide better semantic generalization. To our knowledge, there is no experimental work that compares all three major SRL frameworks.

German SRL frameworks and data sets The SALSA project (Burchardt et al., 2006; Rehbein et al., 2012) created a corpus annotated with over 24,200 predicate argument structures, using English FrameNet frames as a basis, but creating new frames for German predicates where required.

About 18,500 of the manual SALSA annotations were converted semi-automatically to PropBank-style annotations for the CoNLL 2009 shared task on syntactic and semantic dependency labeling (Hajič et al., 2009). Thus, the CoNLL dataset shares a subset of the SALSA annotations. To create PropBank-style annotations, the predicate senses were numbered such that different frame annotations for a predicate lemma indicate different senses. The SALSA role labels were converted to PropBank-style roles using labels A0 and A1 for Agent- and Patient-like roles, and continuing up to A9 for other arguments. Instead of spans, arguments were defined by their dependency heads for CoNLL. The resulting dataset was used as a benchmark dataset in the CoNLL 2009 shared task.

For VerbNet, Mújdricza-Maydt et al. (2016) recently published a small subset of the CoNLL shared task corpus with VerbNet-style roles. It contains 3,500 predicate instances for 275 predicate lemma types. Since there is no taxonomy of verb classes for German corresponding to the original VerbNet classes, they used GermaNet (Hamp and Feldweg, 1997) to label predicate senses. GermaNet provides a fine-grained sense inventory similar to the English WordNet (Fellbaum, 1998).

Automatic SRL systems for German State-of-the-art SRL systems for German are only available for PropBank labels: Björkelund et al. (2009) developed mate-tools; Roth and Woodsend (2014) and Roth and Lapata (2015) improved on mate-tools SRL with their mateplus system. We base our experiments on the mateplus system.


Page 129: LAW XI 2017 - sigann.github.io

     Der Umsatz   stieg                          um 14 %      auf 1,9 Milliarden.
     'Sales rose by 14% to 1.9 billion'

PB   A1           steigen.1                      A2           A3
VN   Patient      steigen-3                      Extent       Goal
FN   Item         Change position on a scale     Difference   Final value

Figure 1: Parallel annotation example from SR3de for the predicate steigen ('rise, increase').

Training data generation In this work, we use a corpus-based, monolingual approach to training data expansion. Fürstenau and Lapata (2012) propose monolingual annotation projection for lower-resourced languages: they create data labeled with FrameNet frames and roles based on a small set of labeled seed sentences in the target language. We apply their approach to the different SRL frameworks, and for the first time to VerbNet-style labels.

Other approaches apply cross-lingual projection (Akbik and Li, 2016) or paraphrasing, replacing FrameNet predicates (Pavlick et al., 2015) or PropBank arguments (Woodsend and Lapata, 2014) in labeled texts. We do not employ these approaches, because they assume large role-labeled corpora.

3 Datasets and Data Expansion Method

SR3de: a German parallel SRL dataset The VerbNet-style dataset by Mújdricza-Maydt et al. (2016) covers a subset of the PropBank-style CoNLL 2009 annotations, which are based on the German FrameNet-style SALSA corpus. This allowed us to create SR3de, the first corpus with parallel sense and role labels from SALSA, PropBank, and GermaNet/VerbNet, which we henceforth abbreviate as FN, PB, and VN respectively. Figure 1 displays an example with parallel annotations.

The data statistics in Table 1 show that, with almost 3,000 predicate instances, the corpus is fairly small. The distribution of role types across frameworks highlights their respective role granularity, ranging from 10 for PB to 30 for VN and 278 for FN. The corpus offers 2,196 training predicates and covers the CoNLL 2009 development and test sets; thus it is a suitable base for comparing the three frameworks. We use SR3de for the contrastive analysis of the different SRL frameworks below.

Training data expansion To overcome the data scarcity of our corpus, we use monolingual annotation projection (Fürstenau and Lapata, 2012) to generate additional training data. Given a set of labeled seed sentences and a set of unlabeled

Corpus       train                  dev                    test
             type    token          type    token          type    token
predicate    198     2196           121     250            152     520

             sense   role   role    sense   role   role    sense   role   role
SR3de-PB     506     10     4,293   162     6      444     221     8      1,022
SR3de-VN     448     30     4,307   133     23     466     216     25     1,025
SR3de-FN     346     278    4,283   133     145    456     176     165    1,017

Table 1: Data statistics for SR3de (PB, VN, FN).

expansion sentences, we select suitable expansions based on the predicate lemma and align dependency graphs of seeds and expansions based on lexical similarity of the graph nodes and syntactic similarity of the edges. The alignment is then used to map predicate and role labels from the seed sentences to the expansion sentences. For each seed instance, the k best-scoring expansions are selected. Given a seed set of size n and the maximal number of expansions per seed k, we get up to n · k additional training instances. Lexical and syntactic similarity are balanced using the weight parameter α.
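The selection can be sketched as follows. The convex combination with α, the node and edge similarity functions, and the alignment enumeration are our simplifying assumptions standing in for the measures of Fürstenau and Lapata (2012); the graph-alignment search itself is omitted.

```python
import heapq

def alignment_score(alignment, lex_sim, syn_sim, alpha):
    """Score one candidate alignment between a seed and an expansion graph.

    alignment: (node_pairs, edge_pairs) of aligned seed/expansion elements;
    lex_sim / syn_sim: similarity functions for nodes and edges;
    alpha: weight balancing lexical against syntactic similarity.
    """
    node_pairs, edge_pairs = alignment
    lexical = sum(lex_sim(s, e) for s, e in node_pairs)
    syntactic = sum(syn_sim(s, e) for s, e in edge_pairs)
    return alpha * lexical + (1 - alpha) * syntactic

def best_expansions(seed, candidates, align, lex_sim, syn_sim, alpha, k):
    """Return the k best-scoring expansion sentences for one seed instance.

    align(seed, candidate) is assumed to enumerate candidate graph alignments
    for sentences sharing the seed's predicate lemma; labels are afterwards
    projected along the best alignment of each selected candidate.
    """
    scored = []
    for cand in candidates:
        best = max((alignment_score(a, lex_sim, syn_sim, alpha)
                    for a in align(seed, cand)), default=float("-inf"))
        scored.append((best, cand))
    return heapq.nlargest(k, scored, key=lambda pair: pair[0])
```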

Our adjusted re-implementation uses the mate-tools dependency parser (Bohnet, 2010) and word2vec embeddings (Mikolov et al., 2013) trained on deWAC (Baroni et al., 2009) for word similarity calculation. We tune the parameter α via intrinsic evaluation on the SR3de dev set. We project the seed set SR3de-train directly to SR3de-dev and compare the labels from the k=1 best seeds for a dev sentence to the gold label, measuring F1 for all projections. Then we use the best-scoring α value for each framework to project annotations from the SR3de training set to deWAC for predicate lemmas occurring at least 10 times. We vary the number of expansions k, selecting k from {1, 3, 5, 10, 20}. Using larger k values is justified because a) projecting to a huge corpus is likely to generate many high-quality expansions, and b) we expect a higher variance in the generated data when also selecting lower-scoring expansions.

Intrinsic evaluation on the dev set provides an estimate of the projection quality: we observe an F1 score of 0.73 for PB and VN, and of 0.53 for FN. The lower score for FN is due to data sparsity in the intrinsic setting and is expected to improve when projecting on a large corpus.

4 Experiments

Experiment setup We perform extrinsic evaluation on SR3de with parallel annotations for the three frameworks, using the same SRL system for each framework, to a) compare the labeling


Page 130: LAW XI 2017 - sigann.github.io

performance of the learned models, and b) explore their behavior in response to expanded training data.

We employ the following settings (cf. Table 2):

#BL: Baseline We train on SR3de train, which is small, but comparable across frameworks.

#FB: Full baseline We train on the full CoNLL training sections for PropBank and SALSA, to compare to state-of-the-art results and contrast the low-resource #BL to full resources.3

#EX: Expanded We train on data expanded via annotation projection.

We train mateplus using the reranker option and the default featureset for German,4 excluding word embedding features.5 We explore the following role labeling tasks: predicate sense prediction (pd in mateplus), argument identification (ai) and role labeling (ac) for predicted predicate sense (pd+ai+ac) and oracle predicate sense (ai+ac). We report F1 scores for all three role labeling tasks.

We ensure equivalent treatment of all three SRL frameworks in mateplus and train the systems only on the given training data without any framework-specific information. Specifically, we do not exploit constraints on predicate senses for PB in mateplus (i.e., selecting sense.1 as default sense), nor constraints for licensed roles (or role sets) for a given sense (i.e., encoding the FN lexicon). Thus, mateplus learns predicate senses and role sets only from training instances.

Experiment results for the different SRL frameworks are summarized in Table 2.6 Below, we discuss the results for the different settings.

#BL: For role labeling with oracle senses (ai+ac), PB performs best, VN is around 5 percentage points (pp.) lower, and FN again 5 pp. lower. With predicate sense prediction (pd+ai+ac), performance only slightly decreases for VN and PB, while FN suffers strongly: F1 is 17 pp. lower than for VN, despite the fact that its predicate labeling F1 is similar to PB and higher than VN. This indicates that generalization across senses works much better for VN and PB roles. By contrast, FN, with its sense-dependent role labels, is lacking generalization capacity, and thus suffers from data sparsity.

3 Both #FB training sets contain ≈ 17,000 predicate instances. There is no additional labeled training data for VN.

4 https://github.com/microth/mateplus/tree/master/featuresets/ger

5 Given only small differences in mateplus performance when using word embeddings, we report results without them.

6 Significance is computed using approximate randomization, i.e., SIGF (Padó, 2006), two-tailed, 10k iterations.

no     train         sense    sense+role    role only
                     (pd)     (pd+ai+ac)    (ai+ac)

#BL: SR3de training corpora
(1)    #BL-PB        58.84    73.70         74.76
(2)    #BL-VN        55.19    69.66         69.86
(3)    #BL-FN        58.26    52.76         64.72

#FB: CoNLL training sections
(4)    #FB-CoNLL     82.88    84.01         86.26
(5)    #FB-SALSA     84.03    78.03         84.34

#EX: SR3de train with data expansion
(1)    #BL-PB        58.84    73.70         74.76
(6)    #EX-k=1       58.65    75.09*        76.65**
(7)    #EX-k=3       58.65    75.43         77.71**
(8)    #EX-k=5       59.03    76.30*        78.27**
(9)    #EX-k=10      59.03    74.65         77.95**
(10)   #EX-k=20      59.42    74.36         78.15**

(2)    #BL-VN        55.19    69.66         69.86
(11)   #EX-k=1       55.00    68.75         68.86
(12)   #EX-k=3       55.19    69.14         69.02
(13)   #EX-k=5       55.19    68.49         68.57
(14)   #EX-k=10      55.19    66.34**       66.84**
(15)   #EX-k=20      55.38    65.70**       66.91**

(3)    #BL-FN        58.26    52.76         64.72
(16)   #EX-k=1       57.88    55.47**       69.18**
(17)   #EX-k=3       58.65    54.13         69.37**
(18)   #EX-k=5       57.88    54.54**       70.41**
(19)   #EX-k=10      58.26    53.97         69.15**
(20)   #EX-k=20      58.84    54.43         70.19**

Table 2: F1 scores for predicate sense and role labeling on the SR3de test set; pd: predicate sense labeling; pd+ai+ac: sense and role labeling (cf. official CoNLL scores); ai+ac: role labeling with oracle predicate sense. We report statistical significance of role labeling F1 with expanded data #EX to the respective #BL (*: p < 0.05; **: p < 0.01).

#FB: The full baselines #FB show that a larger training data set considerably improves SRL performance compared to the small #BL training sets. One reason is the extended sense coverage in the #FB datasets, indicating the need for a larger training set. Still, FN scores are 6 pp. lower than PB (pd+ai+ac).

#EX: Automatically expanding the training set for PB leads to performance improvements of around 3 pp. over #BL for k=5 (pd+ai+ac and ai+ac), but the scores do not reach those of #FB. A similar gain is achieved for FN with k=1. Contrary to initial expectations, annotation projection tends to create instances similar to the seen ones, but at the same time, it also introduces noise. Thus, larger k (k>5) results in decreased role labeling performance compared to smaller k for all frameworks.

FN benefits most from training data expansion, with a performance increase of 5 pp. over #BL, reaching


Page 131: LAW XI 2017 - sigann.github.io

role labeling scores similar to those of VN in the oracle sense setting. For predicted senses, the performance increase is distinctly smaller, highlighting that the sparse-data problems for FN senses are not solved by training data expansion. Performance improvements are significant for FN and PB for both role labeling settings. Against expectation, we do not observe improved role labeling performance for VN. We believe this is due to its more complex label set compared to PB and perform an analysis supporting this hypothesis below.

Analysis: complexity of the frameworks We estimate the role labeling complexity of the frameworks by computing C(d), the average ambiguity of the role instances in the dataset d, d ∈ {PB, VN, FN}. C(d) is the normalized sum s over the number n of role candidates licensed for each role instance in d by the predicate sense label; for role instances with unseen senses, n is the number of distinct roles in the framework. The sum s is then divided by the number of role instances in dataset d.
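In formula form (our reading of the description above), with R_d the set of role instances in dataset d and n(r) the number of role candidates licensed by the predicate sense of instance r (or the number of distinct roles in the framework if the sense is unseen):

$$C(d) = \frac{1}{|R_d|} \sum_{r \in R_d} n(r)$$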

The results are C(PB)=4.3, C(VN)=9.7, and C(FN)=60. C(d) is inversely correlated with the expected performance of each framework, and thus predicts the role labeling performance for #BL (pd+ai+ac).

When considering only seen training instances, complexity is 1.67 for both PB and VN, and 1.79 for FN. This indicates a greater difficulty for FN, but does not explain the difference between VN and PB. Yet, next to role ambiguity, the number of instances seen in training for individual role types is a decisive factor for role labeling performance, and thus the coarser-grained PB inventory has a clear advantage over VN and FN.

The sense labeling performance is lower for VN systems compared to FN and PB. This correlates with the fact that the GermaNet senses used with VN are more fine-grained than those in FN, but more abstract than the numbered PB senses. Still, we observe high role labeling performance independently of the predicate sense label for both VN and PB. This indicates high generalization capabilities of their respective role sets.7

The 5 pp. gap between the VN and PB systems is small, but not negligible. We expect that a suitable sense inventory for German VN, analogous to VerbNet's classes, will further enhance VN role

7 This is confirmed when replacing the predicate sense label with the lemma for training: the role labeling results are fairly close for PB (74.34%) and VN (68.90%), but much lower for FN (54.26%).

labeling performance. Overall, we conclude that the higher complexity of the FrameNet role inventory causes data sparsity; thus FN benefits most from the training data expansion for seen predicates. For the other two frameworks, cross-predicate projection could be a promising way to extend the training data coverage to previously unseen predicates.

5 Discussion and Conclusion

We perform the first experimental comparison of all three major SRL frameworks on a small German dataset with parallel annotations. The experiment settings ensure comparability across frameworks.

Our baseline experiments confirm that the generalization capabilities of the frameworks follow the hypothesized order of FrameNet < VerbNet < PropBank. Comparative analysis shows that PropBank and VerbNet roles generalize well, also beyond predicates. Taking into account the semantic expressiveness of VerbNet, these results showcase the potential of VerbNet as an alternative to PropBank. By contrast, FrameNet's role labeling performance suffers from data sparsity in the small-data setting, given that its role inventory does not easily generalize across predicates.

While VerbNet generalizes better than FrameNet, it does not benefit from our automatic training data generation setup. Currently, annotation projection only applies to lemmas seen in training. Thus, the generalization capacities of VerbNet – and PropBank – are not fully exploited. Relaxing constraints in annotation projection, e.g., projecting across predicates, could benefit both frameworks.

FrameNet suffers most from sparse-data problems and thus benefits most from automatic training data expansion for seen predicates, yet sense labeling remains its performance bottleneck.

In future work we plan to a) further evaluate the cross-predicate generalization capabilities of VerbNet and PropBank in cross-predicate annotation projection and role labeling, b) explore semi-supervised methods and constrained learning (Akbik and Li, 2016), and c) explore alternative sense inventories for the German VerbNet-style dataset.

We publish our benchmark dataset with strictly parallel annotations for the three frameworks to facilitate further research.8

8 http://projects.cl.uni-heidelberg.de/SR3de


Page 132: LAW XI 2017 - sigann.github.io

Acknowledgements The present work was partially funded by a German BMBF grant to CLARIN-D and the DFG-funded Research Training School "Adaptive Preparation of Information from Heterogeneous Sources" (GRK 1994/1). We thank the anonymous reviewers for their valuable comments and suggestions.

References

Alan Akbik and Yunyao Li. 2016. K-SRL: Instance-based Learning for Semantic Role Labeling. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pages 599–608, Osaka, Japan, December.

Marco Baroni, Silvia Bernardini, Adriano Ferraresi, and Eros Zanchetta. 2009. The WaCky wide web: a collection of very large linguistically processed web-crawled corpora. Language Resources and Evaluation, 43(3):209–226.

Anders Björkelund, Love Hafdell, and Pierre Nugues. 2009. Multilingual semantic role labeling. In Proceedings of the Thirteenth Conference on Computational Natural Language Learning (CoNLL 2009): Shared Task, pages 43–48, Boulder, CO, USA.

Hans C. Boas. 2009. Multilingual FrameNets in computational lexicography: methods and applications, volume 200. Walter de Gruyter.

Bernd Bohnet. 2010. Top accuracy and fast dependency parsing is not a contradiction. In Chu-Ren Huang and Dan Jurafsky, editors, Proceedings of Coling 2010, pages 89–97.

Aljoscha Burchardt, Kathrin Erk, Anette Frank, Andrea Kowalski, Sebastian Padó, and Manfred Pinkal. 2006. The SALSA Corpus: a German Corpus Resource for Lexical Semantics. In Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC 2006), pages 969–974, Genoa, Italy.

David Dowty. 1991. Thematic Proto-Roles and Argument Selection. Language, 67(3):547–619.

Michael Ellsworth, Katrin Erk, Paul Kingsbury, and Sebastian Padó. 2004. PropBank, SALSA, and FrameNet: How design determines product. In Proceedings of the Workshop on Building Lexical Resources From Semantically Annotated Corpora, LREC-2004, Lisbon, Portugal.

Christiane Fellbaum. 1998. WordNet: an electronic lexical database. The MIT Press, Cambridge, Massachusetts.

Charles J. Fillmore, Christopher R. Johnson, and Miriam R.L. Petruck. 2003. Background to FrameNet. International Journal of Lexicography, 16(3):235–250.

Hagen Fürstenau and Mirella Lapata. 2012. Semi-Supervised Semantic Role Labeling via Structural Alignment. Computational Linguistics, 38(1):135–171.

Jan Hajič, Massimiliano Ciaramita, Richard Johansson, Daisuke Kawahara, Maria Antònia Martí, Lluís Màrquez, Adam Meyers, Joakim Nivre, Sebastian Padó, Jan Štěpánek, Pavel Straňák, Mihai Surdeanu, Nianwen Xue, and Yi Zhang. 2009. The CoNLL-2009 Shared Task: Syntactic and Semantic Dependencies in Multiple Languages. In Proceedings of the Thirteenth Conference on Computational Natural Language Learning (CoNLL): Shared Task, pages 1–18, Boulder, CO, USA.

Birgit Hamp and Helmut Feldweg. 1997. GermaNet – a Lexical-Semantic Net for German. In Proceedings of the ACL workshop Automatic Information Extraction and Building of Lexical Semantic Resources for NLP Applications, pages 9–15, Madrid, Spain.

Karin Kipper-Schuler. 2005. VerbNet: A broad-coverage, comprehensive verb lexicon. Ph.D. thesis, University of Pennsylvania, Philadelphia, PA.

Beth Levin. 1993. English verb classes and alternations: A preliminary investigation. University of Chicago Press.

Paola Merlo and Lonneke van der Plas. 2009. Abstraction and Generalisation in Semantic Role Labels: PropBank, VerbNet or both? In Proceedings of the 47th Annual Meeting of the ACL and the 4th IJCNLP of the AFNLP, pages 288–296, Suntec, Singapore.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 3111–3119. Curran Associates, Inc.

Éva Mújdricza-Maydt, Silvana Hartmann, Iryna Gurevych, and Anette Frank. 2016. Combining Semantic Annotation of Word Sense & Semantic Roles: A Novel Annotation Scheme for VerbNet Roles on German Language Data. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), pages 3031–3038, Portorož, Slovenia.

Sebastian Padó. 2006. User's guide to sigf: Significance testing by approximate randomisation.

Martha Palmer, Daniel Gildea, and Paul Kingsbury. 2005. The Proposition Bank: An annotated corpus of semantic roles. Computational Linguistics, 31(1):71–106.

Ellie Pavlick, Travis Wolfe, Pushpendre Rastogi, Chris Callison-Burch, Mark Dredze, and Benjamin


Page 133: LAW XI 2017 - sigann.github.io

Van Durme. 2015. FrameNet+: Fast Paraphrastic Tripling of FrameNet. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pages 408–413, Beijing, China.

Ines Rehbein, Josef Ruppenhofer, Caroline Sporleder, and Manfred Pinkal. 2012. Adding nominal spice to SALSA – Frame-semantic annotation of German nouns and verbs. In Proceedings of the 11th Conference on Natural Language Processing (KONVENS'12), pages 89–97, Vienna, Austria.

Michael Roth and Mirella Lapata. 2015. Context-aware frame-semantic role labeling. Transactions of the Association for Computational Linguistics (TACL), 3:449–460.

Michael Roth and Kristian Woodsend. 2014. Composition of word representations improves semantic role labelling. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 407–413, Doha, Qatar.

Lin Sun, Thierry Poibeau, Anna Korhonen, and Cédric Messiant. 2010. Investigating the cross-linguistic potential of VerbNet-style classification. In Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010), pages 1056–1064.

Kristian Woodsend and Mirella Lapata. 2014. Text rewriting improves semantic role labeling. Journal of Artificial Intelligence Research, 51:133–164.

Beñat Zapirain, Eneko Agirre, and Lluís Màrquez. 2008. Robustness and generalization of role sets: PropBank vs. VerbNet. In Proceedings of ACL-08: HLT, pages 550–558, Columbus, OH, USA.


Page 134: LAW XI 2017 - sigann.github.io
Page 135: LAW XI 2017 - sigann.github.io

Author Index

Akhtar, Syed Sarfaraz, 91

Bary, Corien, 46
Berck, Peter, 46
Buechel, Sven, 1

Carbonell, Jaime, 95
Çetinoğlu, Özlem, 34

Dalbelo Bašić, Bojana, 82
Delamaire, Amaury, 41
Demberg, Vera, 24
di Buono, Maria Pia, 82
Dunietz, Jesse, 95

Eckart de Castilho, Richard, 67

Frank, Anette, 115
Fujita, Atsushi, 57

Glavaš, Goran, 82
Gupta, Arihant, 91
Gurevych, Iryna, 115

Hahn, Udo, 1
Hartley, Anthony, 57
Hartmann, Silvana, 115
Hendrickx, Iris, 46
Hess, Leopold, 46

Ide, Nancy, 67

Kageura, Kyo, 57
Kurfalı, Murathan, 76
Kuznetsov, Ilia, 115

Lapponi, Emanuele, 67
Levin, Lori, 95

Martínez Alonso, Héctor, 41
Milic-Frayling, Natasa, 82
Mújdricza-Maydt, Éva, 115

Napoles, Courtney, 13

Oepen, Stephan, 67

Pappu, Aasish, 13

Provenzale, Brian, 13

Rehbein, Ines, 105
Rosato, Enrica, 13
Ruppenhofer, Josef, 105

Sagot, Benoît, 41
Scholman, Merel, 24
Shrivastava, Manish, 91
Šnajder, Jan, 82
Srivastava, Arjit, 91
Suderman, Keith, 67

Tanabe, Kikuko, 57
Tetreault, Joel, 13
Thijs, Kees, 46
Toyoshima, Chiho, 57
Tutek, Martin, 82

Vajpayee, Avijit, 91
Velldal, Erik, 67
Verhagen, Marc, 67

Yamamoto, Mayuka, 57

Zeyrek, Deniz, 76
