
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1156–1167, Copenhagen, Denmark, September 7–11, 2017. © 2017 Association for Computational Linguistics

Neural Sequence Learning Models for Word Sense Disambiguation

Alessandro Raganato, Claudio Delli Bovi and Roberto Navigli
Department of Computer Science

Sapienza University of Rome
{raganato,dellibovi,navigli}@di.uniroma1.it

Abstract

Word Sense Disambiguation models exist in many flavors. Even though supervised ones tend to perform best in terms of accuracy, they often lose ground to more flexible knowledge-based solutions, which do not require training by a word expert for every disambiguation target. To bridge this gap we adopt a different perspective and rely on sequence learning to frame the disambiguation problem: we propose and study in depth a series of end-to-end neural architectures directly tailored to the task, from bidirectional Long Short-Term Memory to encoder-decoder models. Our extensive evaluation over standard benchmarks and in multiple languages shows that sequence learning enables more versatile all-words models that consistently lead to state-of-the-art results, even against word experts with engineered features.

1 Introduction

As one of the long-standing challenges in Natural Language Processing (NLP), Word Sense Disambiguation (Navigli, 2009, WSD) has received considerable attention over recent years. Indeed, by dealing with lexical ambiguity an effective WSD model brings numerous benefits to a variety of downstream tasks and applications, from Information Retrieval and Extraction (Zhong and Ng, 2012; Delli Bovi et al., 2015) to Machine Translation (Carpuat and Wu, 2007; Xiong and Zhang, 2014; Neale et al., 2016). Recently, WSD has also been leveraged to build continuous vector representations for word senses (Chen et al., 2014; Iacobacci et al., 2015; Flekova and Gurevych, 2016).

Inasmuch as WSD is described as the task of associating words in context with the most suitable entries in a pre-defined sense inventory, the majority of WSD approaches to date can be grouped into two main categories: supervised (or semi-supervised) and knowledge-based. Supervised models have been shown to consistently outperform knowledge-based ones in all standard benchmarks (Raganato et al., 2017), at the expense, however, of harder training and limited flexibility. First of all, obtaining reliable sense-annotated corpora is highly expensive and especially difficult when non-expert annotators are involved (de Lacalle and Agirre, 2015), and as a consequence approaches based on unlabeled data and semi-supervised learning are emerging (Taghipour and Ng, 2015b; Baskaya and Jurgens, 2016; Yuan et al., 2016; Pasini and Navigli, 2017).

Apart from the shortage of training data, a crucial limitation of current supervised approaches is that a dedicated classifier (word expert) needs to be trained for every target lemma, making them less flexible and hampering their use within end-to-end applications. In contrast, knowledge-based systems do not require sense-annotated data and often draw upon the structural properties of lexico-semantic resources (Agirre et al., 2014; Moro et al., 2014; Weissenborn et al., 2015). Such systems construct a model based only on the underlying resource, which is then able to handle multiple target words at the same time and disambiguate them jointly, whereas word experts are forced to treat each disambiguation target in isolation.

In this paper our focus is on supervised WSD, but we depart from previous approaches and adopt a different perspective on the task: instead of framing a separate classification problem for each given word, we aim at modeling the joint disambiguation of the target text as a whole in terms of a sequence labeling problem. From this standpoint, WSD amounts to translating a sequence of words into a sequence of potentially sense-tagged tokens.


With this in mind, we design, analyze and compare experimentally various neural architectures of different complexities, ranging from a single bidirectional Long Short-Term Memory (Graves and Schmidhuber, 2005, LSTM) to a sequence-to-sequence approach (Sutskever et al., 2014). Each architecture reflects a particular way of modeling the disambiguation problem, but they all share some key features that set them apart from previous supervised approaches to WSD: they are trained end-to-end from sense-annotated text to sense labels, and learn a single all-words model from the training data, without fine tuning or explicit engineering of local features.

The contributions of this paper are twofold. First, we show that neural sequence learning represents a novel and effective alternative to the traditional way of modeling supervised WSD, enabling a single all-words model to compete with a pool of word experts and achieve state-of-the-art results, while also being easier to train, arguably more versatile to use within downstream applications, and directly adaptable to different languages without requiring additional sense-annotated data (as we show in Section 6.2); second, we carry out an extensive experimental evaluation where we compare various neural architectures designed for the task (and somewhat underinvestigated in previous literature), exploring different configurations and training procedures, and analyzing their strengths and weaknesses on all the standard benchmarks for all-words WSD.

2 Related Work

The literature on WSD is broad and comprehensive (Agirre and Edmonds, 2007; Navigli, 2009): new models are continuously being developed (Yuan et al., 2016; Tripodi and Pelillo, 2017; Butnaru et al., 2017) and tested over a wide variety of standard benchmarks (Edmonds and Cotton, 2001; Snyder and Palmer, 2004; Pradhan et al., 2007; Navigli et al., 2007, 2013; Moro and Navigli, 2015). Moreover, the field has been explored in depth from different angles by means of extensive empirical studies and evaluation frameworks (Pilehvar and Navigli, 2014; Iacobacci et al., 2016; McCarthy et al., 2016; Raganato et al., 2017).

As regards supervised WSD, traditional approaches are generally based on extracting local features from the words surrounding the target, and then training a classifier (Zhong and Ng, 2010; Shen et al., 2013) for each target lemma. In their latest developments, these models include more complex features based on word embeddings (Taghipour and Ng, 2015b; Rothe and Schutze, 2015; Iacobacci et al., 2016).

The recent upsurge of neural networks has also contributed to fueling WSD research: Yuan et al. (2016) rely on a powerful neural language model to obtain a latent representation for the whole sentence containing a target word w; their instance-based system then compares that representation with those of example sentences annotated with the candidate meanings of w. Similarly, Context2Vec (Melamud et al., 2016) makes use of a bidirectional LSTM architecture trained on an unlabeled corpus and learns a context vector for each sense annotation in the training data. Finally, Kageback and Salomonsson (2016) present a supervised classifier based on bidirectional LSTM for the lexical sample task (Kilgarriff, 2001; Mihalcea et al., 2004). All these contributions have shown that supervised neural models can achieve state-of-the-art performance without taking advantage of external resources or language-specific features. However, they all consider each target word as a separate classification problem and, to the best of our knowledge, very few attempts have been made to disambiguate a text jointly using sequence learning (Ciaramita and Altun, 2006).

Sequence learning, especially using LSTM (Hochreiter and Schmidhuber, 1997; Graves and Schmidhuber, 2005; Graves, 2013), has become a well-established standard in numerous NLP tasks (Zhou and Xu, 2015; Ma and Hovy, 2016; Wang and Chang, 2016). In particular, sequence-to-sequence models (Sutskever et al., 2014) have grown increasingly popular and are used extensively in, e.g., Machine Translation (Cho et al., 2014; Bahdanau et al., 2015), Sentence Representation (Kiros et al., 2015), Syntactic Parsing (Vinyals et al., 2015), Conversation Modeling (Vinyals and Le, 2015), Morphological Inflection (Faruqui et al., 2016) and Text Summarization (Gu et al., 2016). In line with this trend, we focus on the (so far unexplored) context of supervised WSD, and investigate state-of-the-art all-words approaches that are based on neural sequence learning and capable of disambiguating all target content words within an input text, a key feature in several knowledge-based approaches.


Figure 1: Bidirectional LSTM sequence labeling architecture for WSD (2 hidden layers). We use the notation of Navigli (2009) for word senses: w^i_p is the i-th sense of w with part of speech p.

3 Sequence Learning for Word Sense Disambiguation

In this section we define WSD in terms of a sequence learning problem. While in its classical formulation (Navigli, 2009) WSD is viewed as a classification problem for a given word w in context, with word senses of w being the class labels, here we consider a variable-length sequence of input symbols ~x = ⟨x_1, ..., x_T⟩ and we aim at predicting a sequence of output symbols ~y = ⟨y_1, ..., y_T′⟩.[1] Input symbols are word tokens drawn from a given vocabulary V.[2] Output symbols are either drawn from a pre-defined sense inventory S (if the corresponding input symbols are open-class content words, i.e., nouns, verbs, adjectives or adverbs), or from the same input vocabulary V (e.g., if the corresponding input symbols are function words, like prepositions or determiners). Hence, we can define a WSD model in terms of a function that maps sequences of symbols x_i ∈ V into sequences of symbols y_j ∈ O = S ∪ V.
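To make the label space concrete, the following minimal Python sketch (ours, not from the paper) shows how the output vocabulary O = S ∪ V could be assembled from a sense-annotated corpus; the corpus format and the sense-label strings are illustrative assumptions.

```python
# Minimal sketch (not from the paper): build the output vocabulary O = S ∪ V
# from a sense-annotated corpus given as lists of (token, sense_or_None) pairs.
from typing import List, Optional, Tuple

Sentence = List[Tuple[str, Optional[str]]]   # e.g. [("he", None), ("checked", "check^1_v"), ...]

def build_vocabularies(corpus: List[Sentence]):
    V, S = set(), set()
    for sentence in corpus:
        for token, sense in sentence:
            V.add(token)          # every token belongs to the input vocabulary V
            if sense is not None:
                S.add(sense)      # annotated content words contribute their sense labels to S
    return V, S, S | V            # O = S ∪ V: senses for content words, plain tokens otherwise

corpus = [[("he", None), ("checked", "check^1_v"), ("the", None), ("report", "report^3_n")]]
V, S, O = build_vocabularies(corpus)
```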

Here all-words WSD is no longer broken down into a series of distinct and separate classification tasks (one per target word) but rather treated directly at the sequence level, with a single model handling all disambiguation decisions. In what follows, we describe three different models for accomplishing this: a traditional LSTM-based model (Section 3.1), a variant that incorporates an attention mechanism (Section 3.2), and an encoder-decoder architecture (Section 3.3).

[1] In general ~x and ~y might have different lengths, e.g., if ~x contains a multi-word expression (European Union) which is mapped to a unique sense identifier (European_Union^1_n).

[2] V generalizes traditional vocabularies used in WSD and includes both word lemmas and inflected forms.

3.1 Bidirectional LSTM Tagger

The most straightforward way of modeling WSD as formulated in Section 3 is that of considering a sequence labeling architecture that tags each symbol x_i ∈ V in the input sequence with a label y_j ∈ O. Even though the formulation is rather general, previous contributions (Melamud et al., 2016; Kageback and Salomonsson, 2016) have already shown the effectiveness of recurrent neural networks for WSD. We follow the same line and employ a bidirectional LSTM architecture: in fact, important clues for disambiguating a target word could be located anywhere in the context (not necessarily before the target), and for a model to be effective it is crucial that it exploits information from the whole input sequence at every time step.

Architecture. A sketch of our bidirectional LSTM tagger is shown in Figure 1 (a code sketch follows the list below). It consists of:

• An embedding layer that converts each word x_i ∈ ~x into a real-valued d-dimensional vector via the embedding matrix W ∈ R^{d×|V|};

• One or more stacked layers of bidirectional LSTM (Graves and Schmidhuber, 2005). The hidden state vectors h_i and output vectors o_i at the i-th time step are then obtained as the concatenations of the forward and backward pass vectors →h_i, →o_i and ←h_i, ←o_i;

• A fully-connected layer with softmax activation that turns the output vector o_i at the i-th time step into a probability distribution over the output vocabulary O.
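A minimal PyTorch sketch of such a tagger follows (ours, not the authors' released code; the 400-dimensional embeddings and 1024 units per direction follow Section 5, everything else is an illustrative assumption).

```python
import torch
import torch.nn as nn

class BLSTMTagger(nn.Module):
    """Sketch of the bidirectional LSTM tagger: embeddings -> stacked BiLSTM -> softmax over O."""
    def __init__(self, vocab_size, output_size, emb_dim=400, hidden_size=1024, num_layers=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)            # W in R^{d x |V|}
        self.bilstm = nn.LSTM(emb_dim, hidden_size, num_layers=num_layers,
                              bidirectional=True, batch_first=True)   # stacked BiLSTM layers
        self.out = nn.Linear(2 * hidden_size, output_size)            # fully-connected layer over O

    def forward(self, token_ids):                 # token_ids: (batch, T)
        embedded = self.embedding(token_ids)      # (batch, T, d)
        outputs, _ = self.bilstm(embedded)        # (batch, T, 2*hidden): forward/backward concatenated
        logits = self.out(outputs)                # (batch, T, |O|)
        return torch.log_softmax(logits, dim=-1)  # per-token distribution over the output vocabulary O
```

With 2 stacked layers and 1024 units per direction, this mirrors the 2048-unit configuration reported in Section 5.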

Training. The tagger is trained on a dataset of N labeled sequences {(~x_k, ~y_k)}, k = 1, ..., N, directly obtained from the sentences of a sense-annotated corpus, where each ~x_k is a sequence of word tokens, and each ~y_k is a sequence containing both word tokens and sense labels. Ideally ~y_k is a copy of ~x_k where each content word is sense-tagged. This is, however, not the case in many real-world datasets, where only a subset of the content words is annotated; hence the architecture is designed to deal with both fully and partially annotated sentences. Apart from sentence splitting and tokenization, no preprocessing is required on the training data.
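As a concrete illustration (ours, not part of the paper), a training pair (~x_k, ~y_k) can be derived from a partially annotated sentence by copying unannotated tokens to the output side:

```python
# Hypothetical sketch: turn a (possibly partially) sense-annotated sentence into a
# training pair (x_k, y_k) for the tagger; unannotated tokens are simply copied.
def make_training_pair(sentence):
    """sentence: list of (token, sense_or_None) tuples."""
    x = [token for token, _ in sentence]
    y = [sense if sense is not None else token for token, sense in sentence]
    return x, y

x, y = make_training_pair(
    [("he", None), ("later", None), ("checked", "check^1_v"), ("the", None), ("report", "report^3_n")]
)
# x = ['he', 'later', 'checked', 'the', 'report']
# y = ['he', 'later', 'check^1_v', 'the', 'report^3_n']
```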


3.2 Attentive Bidirectional LSTM Tagger

The bidirectional LSTM tagger of Section 3.1 exploits information from the whole input sequence ~x, which is encoded in the hidden state h_i. However, certain elements of ~x might be more discriminative than others in predicting the output label at a given time step (e.g., the syntactic subject and object when predicting the sense label of a verb).

We model this hunch by introducing an attention mechanism, already proven to be effective in other NLP tasks (Bahdanau et al., 2015; Vinyals et al., 2015), into the sequence labeling architecture of Section 3.1. The resulting attentive bidirectional LSTM tagger augments the original architecture with an attention layer, where a context vector c is computed from all the hidden states h_1, ..., h_T of the bidirectional LSTM. The attentive tagger first reads the entire input sequence ~x to construct c, and then exploits c to predict the output label y_j at each time step, by concatenating it with the output vector o_j of the bidirectional LSTM (Figure 2).

We follow previous work (Vinyals et al., 2015; Zhou et al., 2016) and compute c as the weighted sum of the hidden state vectors h_1, ..., h_T. Formally, let H ∈ R^{n×T} be the matrix of hidden state vectors [h_1, ..., h_T], where n is the hidden state dimension and T is the input sequence length (cf. Section 3). c is obtained as follows:

    u = ω^T tanh(H)
    a = softmax(u)
    c = H a^T        (1)

where ω ∈ R^n is a parameter vector and a ∈ R^T is the vector of normalized attention weights.
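A minimal PyTorch rendering of Equation (1) (our illustration; tensor shapes and the 2048-dimensional hidden size are assumptions based on Section 5) could look like this:

```python
import torch
import torch.nn as nn

class AttentionContext(nn.Module):
    """Sketch of the attention layer of Eq. (1): c = H a^T with a = softmax(w^T tanh(H))."""
    def __init__(self, hidden_dim):
        super().__init__()
        self.w = nn.Parameter(torch.randn(hidden_dim))   # parameter vector ω ∈ R^n

    def forward(self, H):            # H: (n, T) matrix of hidden states [h_1, ..., h_T]
        u = self.w @ torch.tanh(H)   # unnormalized scores, shape (T,)
        a = torch.softmax(u, dim=0)  # normalized attention weights, shape (T,)
        c = H @ a                    # context vector c ∈ R^n (weighted sum of hidden states)
        return c, a

# Usage sketch: n = 2048 (concatenated BiLSTM states), T = 7 input tokens
H = torch.randn(2048, 7)
context, weights = AttentionContext(2048)(H)
```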

3.3 Sequence-to-Sequence Model

The attentive tagger of Section 3.2 performs a two-pass procedure by first reading the input sequence ~x to construct the context vector c, and then predicting an output label y_j for each element in ~x. In this respect, the attentive architecture can effectively be viewed as an encoder for ~x. A further generalization of this model would then be a complete encoder-decoder architecture (Sutskever et al., 2014) where WSD is treated as a sequence-to-sequence mapping (sequence-to-sequence WSD), i.e., as the "translation" of word sequences into sequences of potentially sense-tagged tokens.

Figure 2: Attentive bidirectional LSTM sequence labeling architecture for WSD (2 hidden layers).

In the sequence-to-sequence framework, a variable-length sequence of input symbols ~x is represented as a sequence of vectors ⟨x_1, ..., x_T⟩ by converting each symbol x_i ∈ ~x into a real-valued vector via an embedding layer, and then fed to an encoder, which generates a fixed-dimensional vector representation of the sequence. Traditionally, the encoder function is a Recurrent Neural Network (RNN) such that:

    h_t = f(h_{t−1}, x_t)
    c = q({h_1, ..., h_T})        (2)

where h_t ∈ R^n is the n-dimensional hidden state vector at time t, c ∈ R^n is a vector generated from the whole sequence of input states, and f and q are non-linear functions.[3] A decoder is then trained to predict the next output symbol y_t given the encoded input vector c and all the previously predicted output symbols ⟨y_1, ..., y_{t−1}⟩. More formally, the decoder defines a probability over the output sequence ~y = ⟨y_1, ..., y_T′⟩ by decomposing the joint probability into ordered conditionals:

    p(~y | ~x) = ∏_{t=1}^{T′} p(y_t | c, ⟨y_1, ..., y_{t−1}⟩)        (3)

Typically a decoder RNN defines the hidden state at time t as s_t = g(s_{t−1}, {c, y_{t−1}}) and then feeds s_t to a softmax layer in order to obtain a conditional probability over output symbols.
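As an illustration of this generic decoder (our sketch, not the authors' implementation: a unidirectional LSTM cell stands in for g, and the dimensions and <GO> handling are assumptions), a greedy decoding loop realizing Equation (3) could look like this:

```python
import torch
import torch.nn as nn

class GreedyDecoder(nn.Module):
    """Sketch of a decoder step s_t = g(s_{t-1}, {c, y_{t-1}}) followed by a softmax over O."""
    def __init__(self, output_size, emb_dim=400, ctx_dim=2048, hidden_dim=1024):
        super().__init__()
        self.embed = nn.Embedding(output_size, emb_dim)
        self.cell = nn.LSTMCell(emb_dim + ctx_dim, hidden_dim)   # g: consumes [y_{t-1}; c]
        self.out = nn.Linear(hidden_dim, output_size)

    def forward(self, c, max_len, go_id):
        """c: context vector (batch, ctx_dim); go_id: index of the <GO> start symbol."""
        batch = c.size(0)
        y_prev = torch.full((batch,), go_id, dtype=torch.long)
        s = (torch.zeros(batch, self.cell.hidden_size), torch.zeros(batch, self.cell.hidden_size))
        outputs = []
        for _ in range(max_len):
            inp = torch.cat([self.embed(y_prev), c], dim=-1)     # condition every step on c
            s = self.cell(inp, s)                                # new hidden and cell states
            probs = torch.softmax(self.out(s[0]), dim=-1)        # p(y_t | c, y_<t)
            y_prev = probs.argmax(dim=-1)                        # greedy choice of the next symbol
            outputs.append(y_prev)
        return torch.stack(outputs, dim=1)                       # (batch, max_len)
```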

[3] For instance, Sutskever et al. (2014) used an LSTM as f, and q({h_1, ..., h_T}) = h_T.


Figure 3: Encoder-decoder architecture for sequence-to-sequence WSD, with 2 bidirectional LSTM layers and an attention layer.

In the context of WSD framed as a sequence learning problem, a sequence-to-sequence model takes as input a training set of labeled sequences (cf. Section 3.1) and learns to replicate an input sequence ~x while replacing each content word with its most suitable word sense from S. In other words, sequence-to-sequence WSD can be viewed as the combination of two sub-tasks:

• A memorization task, where the model learns to replicate the input sequence token by token at decoding time;

• The actual disambiguation task, where the model learns to replace content words across the input sequence with their most suitable senses from the sense inventory S.

In the latter stage, multi-word expressions (such as nominal entity mentions or phrasal verbs) are replaced by their sense identifiers, hence yielding an output sequence that might have a different length than ~x.

Architecture. The encoder-decoder architecture generalizes over both the models in Sections 3.1 and 3.2. In particular, we include one or more bidirectional LSTM layers at the core of both the encoder and the decoder modules. The encoder utilizes an embedding layer (cf. Section 3.1) to convert input symbols into embedded representations, feeds them to the bidirectional LSTM layer, and then constructs the context vector c, either by simply letting c = h_T (i.e., the hidden state of the bidirectional LSTM layer after reading the whole input sequence), or by computing the weighted sum described in Section 3.2 (if an attention mechanism is employed). In either case, the context vector c is passed over to the decoder, which generates the output symbols sequentially based on c and the current hidden state s_t, using one or more bidirectional LSTM layers as in the encoder module. Instead of feeding c to the decoder only at the first time step (Sutskever et al., 2014; Vinyals and Le, 2015), we condition each output symbol y_t on c, allowing the decoder to peek into the input at every step, as in Cho et al. (2014). Finally, a fully-connected layer with softmax activation converts the current output vector of the last LSTM layer into a probability distribution over the output vocabulary O. The complete encoder-decoder architecture (including the attention mechanism) is shown in Figure 3.

4 Multitask Learning with Multiple Auxiliary Losses

Several recent contributions (Søgaard and Goldberg, 2016; Bjerva et al., 2016; Plank et al., 2016; Luong et al., 2016) have shown the effectiveness of multitask learning (Caruana, 1997, MTL) in a sequence learning scenario. In MTL the idea is that of improving generalization performance by leveraging training signals contained in related tasks, in order to exploit their commonalities and differences. MTL is typically carried out by training a single architecture using multiple loss functions and a shared representation, with the underlying intention of improving a main task by incorporating joint learning of one or more related auxiliary tasks. From a practical point of view, MTL works by including one task-specific output layer per additional task, usually at the outermost level of the architecture, while keeping the remaining hidden layers common across all tasks.

In line with previous approaches, and guided by the intuition that WSD is strongly linked to other NLP tasks at various levels, we also design and study experimentally a multitask augmentation of the models described in Section 3.


In particular, we consider two auxiliary tasks:

• Part-of-speech (POS) tagging, a standard auxiliary task extensively studied in previous work (Søgaard and Goldberg, 2016; Plank et al., 2016). Predicting the part-of-speech tag for a given token can also be informative for word senses, and help in dealing with cross-POS lexical ambiguities (e.g., book a flight vs. reading a good book);

• Coarse-grained semantic labels (LEX) based on the WordNet (Miller et al., 1990) lexicographer files,[4] i.e., 45 coarse-grained semantic categories manually associated with all the synsets in WordNet on the basis of both syntactic and logical groupings (e.g., noun.location or verb.motion). These very coarse semantic labels, recently employed in a multitask setting by Alonso and Plank (2017), group together related senses and help the model to generalize, especially over senses less covered at training time.

We follow previous work (Plank et al., 2016; Alonso and Plank, 2017) and define an auxiliary loss function for each additional task. The overall loss is then computed by summing the main loss (i.e., the one associated with word sense labels) and all the auxiliary losses taken into account.

As regards the architecture, we consider both the models described in Sections 3.2 and 3.3 and modify them by adding two softmax layers in addition to the one in the original architecture. Figure 4 illustrates this for the attentive tagger of Section 3.2, considering both POS and LEX as auxiliary tasks. At the j-th time step the model predicts a sense label y_j together with a part-of-speech tag POS_j and a coarse semantic label LEX_j.[5]
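A minimal sketch of this multitask augmentation (our illustration, assuming a shared encoder that already produces one feature vector per token) is:

```python
import torch
import torch.nn as nn

class MultitaskHeads(nn.Module):
    """Sketch of the MTL augmentation: one shared representation, three softmax heads (WSD, POS, LEX)."""
    def __init__(self, feat_dim, n_senses, n_pos, n_lex):
        super().__init__()
        self.wsd_head = nn.Linear(feat_dim, n_senses)   # main task: sense / token labels over O
        self.pos_head = nn.Linear(feat_dim, n_pos)      # auxiliary task 1: POS tags
        self.lex_head = nn.Linear(feat_dim, n_lex)      # auxiliary task 2: lexicographer categories

    def forward(self, features):                        # features: (batch, T, feat_dim) from the BiLSTM
        return self.wsd_head(features), self.pos_head(features), self.lex_head(features)

def multitask_loss(logits, targets):
    """Overall loss = main WSD loss + sum of the auxiliary losses (Section 4)."""
    ce = nn.CrossEntropyLoss()
    wsd_logits, pos_logits, lex_logits = logits
    wsd_y, pos_y, lex_y = targets
    return (ce(wsd_logits.flatten(0, 1), wsd_y.flatten())
            + ce(pos_logits.flatten(0, 1), pos_y.flatten())
            + ce(lex_logits.flatten(0, 1), lex_y.flatten()))
```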

5 Experimental Setup

In this section we detail the setup of our experimental evaluation. We first describe the training corpus and all the standard benchmarks for all-words WSD; we then report technical details on the architecture and on the training process for all the models described throughout Section 3 and their multitask augmentations (Section 4).

[4] https://wordnet.princeton.edu/man/lexnames.5WN.html

[5] We use a dummy LEX label (other) for punctuation and function words.

Figure 4: Multitask augmentation (with both POS and LEX as auxiliary tasks) for the attentive bidirectional LSTM tagger of Section 3.2.

Evaluation Benchmarks. We evaluated our models on the English all-words WSD task, considering both the fine-grained and coarse-grained benchmarks (Section 6.1). As regards fine-grained WSD, we relied on the evaluation framework of Raganato et al. (2017), which includes five standardized test sets from the Senseval/SemEval series: Senseval-2 (Edmonds and Cotton, 2001, SE2), Senseval-3 (Snyder and Palmer, 2004, SE3), SemEval-2007 (Pradhan et al., 2007, SE07), SemEval-2013 (Navigli et al., 2013, SE13) and SemEval-2015 (Moro and Navigli, 2015, SE15). Due to the lack of a reasonably large development set for our setup, we considered the smallest among these test sets, i.e., SE07, as development set and excluded it from the evaluation of Section 6.1. As for coarse-grained WSD, we used the SemEval-2007 task 7 test set (Navigli et al., 2007), which is not included in the standardized framework, and mapped the original sense inventory from WordNet 2.1 to WordNet 3.0.[6] Finally, we carried out an experiment on multilingual WSD using the Italian, German, French and Spanish data of SE13. For these benchmarks we relied on BabelNet (Navigli and Ponzetto, 2012)[7] as unified sense inventory.

[6] We utilized the original sense-key mappings available at http://wordnetcode.princeton.edu/3.0 for nouns and verbs, and the automatic mappings by Daude et al. (2003) for the remaining parts of speech (not available in the original mappings).

[7] http://babelnet.org


System                        Dev     Test Datasets                    Concatenation of All Test Datasets
                              SE07    SE2     SE3     SE13    SE15     Nouns   Verbs   Adj.    Adv.    All
BLSTM                         61.8    71.4    68.8    65.6    69.2     70.2    56.3    75.2    84.4    68.9
BLSTM + att.                  62.4    71.4    70.2    66.4    70.8     71.0    58.4    75.2    83.5    69.7
BLSTM + att. + LEX            63.7    72.0    69.4    66.4    72.4     71.6    57.1    75.6    83.2    69.9
BLSTM + att. + LEX + POS      64.8    72.0    69.1    66.9    71.5     71.5    57.5    75.0    83.8    69.9
Seq2Seq                       60.9    68.5    67.9    65.3    67.0     68.7    54.5    74.0    81.2    67.3
Seq2Seq + att.                62.9    69.9    69.6    65.6    67.7     69.5    57.2    74.5    81.8    68.4
Seq2Seq + att. + LEX          64.6    70.6    67.8    66.5    68.7     70.4    55.7    73.3    82.9    68.5
Seq2Seq + att. + LEX + POS    63.1    70.1    68.5    66.5    69.2     70.1    55.2    75.1    84.4    68.6
IMS                           61.3    70.9    69.3    65.3    69.5     70.5    55.8    75.6    82.9    68.9
IMS+emb                       62.6    72.2    70.4    65.9    71.5     71.9    56.6    75.9    84.7    70.1
Context2Vec                   61.3    71.8    69.1    65.6    71.9     71.2    57.4    75.2    82.7    69.6
Lesk_ext+emb                  *56.7   63.0    63.7    66.2    64.6     70.0    51.1    51.7    80.6    64.2
UKB_gloss w2w                 42.9    63.5    55.4    *62.9   63.3     64.9    41.4    69.5    69.7    61.1
Babelfy                       51.6    *67.0   63.5    66.4    70.3     68.9    50.7    73.2    79.8    66.4
MFS                           54.5    65.6    *66.0   63.8    *67.1    67.7    49.8    73.1    80.5    65.5

Table 1: F-scores (%) for English all-words fine-grained WSD on the test sets in the framework of Raganato et al. (2017) (including the development set SE07). The first system with a statistically significant difference from our best models is marked with * (unpaired t-test, p < 0.05).

At testing time, given a target word w, our models used the probability distribution over O, computed by the softmax layer at the corresponding time step, to rank the candidate senses of w; we then simply selected the top-ranking candidate as output of the model.
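A small sketch of this prediction step (ours; the candidate-sense lookup is a hypothetical helper outside the snippet), restricting the ranking to the candidate senses of the target word:

```python
import torch

def predict_sense(probs, candidate_ids):
    """probs: softmax distribution over O at the target time step, shape (|O|,);
    candidate_ids: indices in O of the candidate senses of the target word w.
    Returns the top-ranking candidate, as described at the start of Section 6."""
    candidate_ids = torch.tensor(candidate_ids)
    best = candidate_ids[probs[candidate_ids].argmax()]   # rank candidates by their probability
    return int(best)

# Usage sketch: probs comes from the softmax layer, candidate_ids from a sense-inventory lookup
probs = torch.softmax(torch.randn(50000), dim=0)
print(predict_sense(probs, candidate_ids=[104, 982, 33107]))
```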

Architecture Details. To set a level playing field with comparison systems on English all-words WSD, we followed Raganato et al. (2017) and, for all our models, we used a layer of word embeddings pre-trained[8] on the English ukWaC corpus (Baroni et al., 2009) as initialization, and kept them fixed during the training process. For all architectures we then employed 2 layers of bidirectional LSTM with 2048 hidden units (1024 units per direction).

As regards multilingual all-words WSD (Section 6.2), we experimented, instead, with two different configurations of the embedding layer: the pre-trained bilingual embeddings by Mrksic et al. (2017) for all the language pairs of interest (EN-IT, EN-FR, EN-DE, and EN-ES), and the pre-trained multilingual 512-dimensional embeddings for 12 languages by Ammar et al. (2016).

[8] We followed Iacobacci et al. (2016) and used the Word2Vec (Mikolov et al., 2013) skip-gram model with 400 dimensions, 10 negative samples and a window size of 10.

Training. We used SemCor 3.0 (Miller et al., 1993) as training corpus for all our experiments. Widely known and utilized in the WSD literature, SemCor is one of the largest corpora annotated manually with word senses from the sense inventory of WordNet (Miller et al., 1990) for all open-class parts of speech. We used the standardized version of SemCor as provided in the evaluation framework,[9] which also includes coarse-grained POS tags from the universal tagset. All models were trained for a fixed number of epochs E = 40 using Adadelta (Zeiler, 2012) with learning rate 1.0 and batch size 32. After each epoch we evaluated our models on the development set, and then compared the best iterations (E*) on the development set with the reported state of the art in each benchmark.
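In PyTorch terms, the training regime described above would look roughly like this (our sketch; the model class comes from the Section 3.1 sketch, and the data loader, evaluation helper and vocabulary sizes are illustrative stand-ins):

```python
import torch

# Sketch of the training regime from Section 5: frozen pre-trained embeddings,
# Adadelta with learning rate 1.0, batch size 32, E = 40 epochs, dev evaluation per epoch.
model = BLSTMTagger(vocab_size=80000, output_size=120000)          # from the Section 3.1 sketch (sizes assumed)
model.embedding.weight.requires_grad_(False)                       # keep pre-trained embeddings fixed
optimizer = torch.optim.Adadelta(
    (p for p in model.parameters() if p.requires_grad), lr=1.0)
criterion = torch.nn.NLLLoss()                                     # the model outputs log-probabilities

for epoch in range(40):                                            # E = 40
    for x_batch, y_batch in train_loader:                          # assumed loader yielding batches of 32 sentences
        optimizer.zero_grad()
        log_probs = model(x_batch)                                 # (batch, T, |O|)
        loss = criterion(log_probs.flatten(0, 1), y_batch.flatten())
        loss.backward()
        optimizer.step()
    dev_f1 = evaluate(model, dev_set)                              # hypothetical helper; used to pick the best epoch E*
```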

6 Experimental Results

Throughout this section we identify the models based on the LSTM tagger (Sections 3.1-3.2) by the label BLSTM, and the sequence-to-sequence models (Section 3.3) by the label Seq2Seq.

6.1 English All-words WSD

Table 1 shows the performance of our models on the standardized benchmarks for all-words fine-grained WSD. We report the F1-score on each individual test set, as well as the F1-score obtained on the concatenation of all four test sets, divided by part-of-speech tag.

[9] http://lcl.uniroma1.it/wsdeval


SemEval-2007 task 7
BLSTM + att. + LEX            83.0        IMS                   81.9
BLSTM + att. + LEX + POS      83.1        Chen et al. (2014)    82.6
Seq2Seq + att. + LEX          82.3        Yuan et al. (2016)    82.8
Seq2Seq + att. + LEX + POS    81.6        UKB w2w               80.1

Table 2: F-scores (%) for coarse-grained WSD.


We compared against the best supervised and knowledge-based systems evaluated on the same framework. As supervised systems, we considered Context2Vec (Melamud et al., 2016) and It Makes Sense (Zhong and Ng, 2010, IMS), both in its original implementation and in the best configuration reported by Iacobacci et al. (2016, IMS+emb), which also integrates word embeddings using exponential decay.[10] All these supervised systems were trained on the standardized version of SemCor. As knowledge-based systems we considered the embeddings-enhanced version of Lesk by Basile et al. (2014, Lesk_ext+emb), UKB (Agirre et al., 2014) (UKB_gloss w2w), and Babelfy (Moro et al., 2014). All these systems relied on the Most Frequent Sense (MFS) baseline as back-off strategy.[11] Overall, both BLSTM and Seq2Seq achieved results that are either state-of-the-art or statistically equivalent (unpaired t-test, p < 0.05) to the best supervised system in each benchmark, performing on par with word experts tuned over explicitly engineered features (Iacobacci et al., 2016). Interestingly enough, BLSTM models tended consistently to outperform their Seq2Seq counterparts, suggesting that an encoder-decoder architecture, despite being more powerful, might be suboptimal for WSD. Furthermore, introducing LEX (cf. Section 4) as auxiliary task was generally helpful; on the other hand, POS did not seem to help, corroborating previous findings (Alonso and Plank, 2017; Bingel and Søgaard, 2017).
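The paper does not spell out the significance-testing procedure beyond naming an unpaired t-test at p < 0.05; a plausible sketch over per-instance correctness indicators (our assumption, illustrative data only) would be:

```python
# Hedged sketch: unpaired t-test between two systems' per-instance correctness (1 = correct, 0 = wrong).
# The exact inputs used in the paper are not specified; this only illustrates the test itself.
from scipy.stats import ttest_ind

system_a = [1, 0, 1, 1, 1, 0, 1, 1]   # toy correctness indicators for system A
system_b = [1, 0, 0, 1, 0, 0, 1, 1]   # toy correctness indicators for system B

res = ttest_ind(system_a, system_b)
significant = res.pvalue < 0.05       # threshold used in Table 1
print(res.statistic, res.pvalue, significant)
```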

The overall performance by part of speech was consistent with the above analysis, showing that our models outperformed all knowledge-based systems, while obtaining results that are superior or equivalent to the best supervised models.

[10] We are not including Yuan et al. (2016), as their models are not available and not replicable on the standardized test sets, being based on proprietary data.

[11] Since each system always outputs an answer, F-score equals both precision and recall, and statistical significance can be expressed with respect to any of these measures.

SemEval-2013 task 12
System                    IT      FR      DE      ES
BLSTM (bilingual)         61.6    55.2    69.2    65.0
BLSTM (multilingual)      62.0    55.5    69.2    66.4
UMCC-DLSI                 65.8    60.5    62.1    71.0
DAEBAK!                   61.3    53.8    59.1    60.0
MFS                       57.5    45.3    67.4    64.5

Table 3: F-scores (%) for multilingual WSD.

It is worth noting that RNN-based architectures outperformed classical supervised approaches (Zhong and Ng, 2010; Iacobacci et al., 2016) when dealing with verbs, which are shown to be highly ambiguous (Raganato et al., 2017).

The performance on coarse-grained WSD followed the same trend (Table 2). Both BLSTM and Seq2Seq outperformed UKB (Agirre et al., 2014) and IMS trained on SemCor (Taghipour and Ng, 2015a), as well as recent supervised approaches based on distributional semantics and neural architectures (Chen et al., 2014; Yuan et al., 2016).

6.2 Multilingual All-words WSD

All the neural architectures described in this paper can be readily adapted to work with different languages without adding sense-annotated data in the target language. In fact, as long as the first layer (cf. Figures 1-3) is equipped with bilingual or multilingual embeddings where word vectors in the training and target language are defined in the same space, the training process can be left unchanged, even if based only on English data. The underlying assumption is that words that are translations of each other (e.g., house in English and casa in Italian) are mapped to word embeddings that are as close as possible in the vector space.
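Concretely, the adaptation amounts to swapping the embedding matrix and nothing else; a minimal sketch (ours, assuming the tagger from the Section 3.1 sketch and cross-lingual vectors already loaded as a tensor aligned with a shared vocabulary):

```python
import torch
import torch.nn as nn

# Sketch: reuse the Section 3.1 tagger for other languages by plugging in cross-lingual
# embeddings (bilingual or multilingual), kept frozen; the rest of the model is unchanged
# and still trained only on English sense-annotated data.
crosslingual_vectors = torch.randn(250000, 512)   # stand-in for pre-trained shared-space vectors (512-d)

model = BLSTMTagger(vocab_size=250000, output_size=120000, emb_dim=512)   # sizes assumed for illustration
model.embedding = nn.Embedding.from_pretrained(crosslingual_vectors, freeze=True)
# Training proceeds exactly as in the monolingual setting; at test time, tokens of the target
# language are looked up in the same shared embedding space.
```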

In order to assess this, we considered one of our best models (BLSTM+att.+LEX) and replaced the monolingual embeddings with bilingual and multilingual embeddings (as specified in Section 5), leaving the rest of the architecture unchanged. We then trained these architectures on the same English training data, and ran the resulting models on the multilingual benchmarks of SemEval-2013 for Italian, French, German and Spanish. While doing this, we exploited BabelNet's inter-resource mappings to convert WordNet sense labels (used at training time) into BabelNet synsets compliant with the sense inventory of the task.

F-score figures (Table 3) show that bilingual and multilingual models, despite being trained only on English data, consistently outperformed the MFS baseline and achieved results that are competitive with the best participating systems in the task. We also note that the overall F-score performance did not change substantially (and slightly improved) when moving from bilingual to multilingual models, despite the increase in the number of target languages treated simultaneously.

6.3 Discussion and Error Analysis

All the neural models evaluated in Section 6.1 utilized the MFS back-off strategy for instances unseen at training time, which amounted to 9.4% overall for fine-grained WSD and 10.5% for coarse-grained WSD. Back-off strategy aside, 85% of the time the top candidate sense for a target instance lay within the 10 most probable entries in the probability distribution over O computed by the softmax layer.[12] In fact, our sequence models learned, on the one hand, to associate a target word with its candidate senses (something word experts are not required to learn, as they only deal with a single word type at a time); on the other, they tended to generate softmax distributions reflecting the semantics of the surrounding context. For example, in the sentence:

(a) The two justices have been attending federalist society events for years,

our model correctly disambiguated justices with the WordNet sense justice^3_n (public official) rather than justice^1_n (the quality of being just), and the corresponding softmax distribution was heavily biased towards words and senses related to persons or groups (commissioners, defendants, jury, cabinet, directors). On the other hand, in the sentence:

(b) Xavi Hernandez, the player of Barcelona, has 106 matches,

the same model disambiguated matches with the wrong WordNet sense match^1_n (tool for starting a fire). This suggests that the signal carried by discriminative words like player vanishes rather quickly. In order to enforce global coherence further, recent contributions have proposed more sophisticated models where recurrent architectures are combined with Conditional Random Fields (Huang et al., 2015; Ma and Hovy, 2016). Finally, a number of errors were connected to shorter sentences with limited context for disambiguation: in fact, we noted that the average precision of our model, without MFS back-off, increased by 6.2% (from 74.6% to 80.8%) on sentences with more than 20 word tokens.

[12] We refer here to the same model considered in Section 6.2 (i.e., BLSTM+att.+LEX).

7 Conclusion

In this paper we adopted a new perspective on supervised WSD, so far typically viewed as a classification problem at the word level, and framed it using neural sequence learning. To this aim we defined, analyzed and compared experimentally different end-to-end models of varying complexity, including augmentations based on an attention mechanism and multitask learning.

Unlike previous supervised approaches, where a dedicated model needs to be trained for every content word and each disambiguation target is treated in isolation, sequence learning approaches learn a single model in one pass from the training data, and then disambiguate jointly all target words within an input text. The resulting models consistently achieved state-of-the-art (or statistically equivalent) figures in all benchmarks for all-words WSD, both fine-grained and coarse-grained, effectively demonstrating that we can overcome the so far undisputed and long-standing word-expert assumption of supervised WSD, while retaining the accuracy of supervised word experts.

Furthermore, these models are sufficiently flexible to allow them, for the first time in WSD, to be readily adapted to languages different from the one used at training time, and still achieve competitive results (as shown in Section 6.2). This crucial feature could potentially pave the way for cross-lingual supervised WSD, and overcome the shortage of sense-annotated data in multiple languages that, to date, has prevented the development of supervised models for languages other than English.

As future work, we plan to extend our evaluation to larger sense-annotated corpora (Raganato et al., 2016) as well as to different sense inventories and different languages. We also plan to exploit the flexibility of our models by integrating them into downstream applications, such as Machine Translation and Information Extraction.

Acknowledgments

The authors gratefully acknowledge the support of the ERC Consolidator Grant MOUSSE No. 726487.


References

Eneko Agirre and Philip Edmonds. 2007. Word sense disambiguation: Algorithms and applications, volume 33. Springer Science & Business Media.

Eneko Agirre, Oier Lopez de Lacalle, and Aitor Soroa. 2014. Random Walks for Knowledge-Based Word Sense Disambiguation. Comput. Linguist., 40(1):57–84.

Hector Martinez Alonso and Barbara Plank. 2017. When is Multitask Learning Effective? Semantic Sequence Prediction under Varying Data Conditions. In Proc. of ACL, pages 44–53.

Waleed Ammar, George Mulcaire, Yulia Tsvetkov, Guillaume Lample, Chris Dyer, and Noah A. Smith. 2016. Massively Multilingual Word Embeddings. CoRR, abs/1602.01925.

Osman Baskaya and David Jurgens. 2016. Semi-supervised Learning with Induced Word Senses for State of the Art Word Sense Disambiguation. J. Artif. Int. Res., 55(1):1025–1058.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural Machine Translation by Jointly Learning to Align and Translate. In ICLR Workshop.

Marco Baroni, Silvia Bernardini, Adriano Ferraresi, and Eros Zanchetta. 2009. The WaCky wide web: a collection of very large linguistically processed web-crawled corpora. Language Resources and Evaluation, 43(3):209–226.

Pierpaolo Basile, Annalina Caputo, and Giovanni Semeraro. 2014. An Enhanced Lesk Word Sense Disambiguation Algorithm through a Distributional Semantic Model. In Proc. of COLING, pages 1591–1600.

Joachim Bingel and Anders Søgaard. 2017. Identifying Beneficial Task Relations for Multi-task Learning in Deep Neural Networks. In Proc. of EACL, pages 164–169.

Johannes Bjerva, Barbara Plank, and Johan Bos. 2016. Semantic Tagging with Deep Residual Networks. In Proc. of COLING, pages 3531–3541.

Andrei Butnaru, Radu Tudor Ionescu, and Florentina Hristea. 2017. ShotgunWSD: an Unsupervised Algorithm for Global Word Sense Disambiguation Inspired by DNA Sequencing. In Proc. of EACL, pages 916–926.

Marine Carpuat and Dekai Wu. 2007. Improving Statistical Machine Translation using Word Sense Disambiguation. In Proc. of EMNLP, pages 61–72.

Rich Caruana. 1997. Multitask learning. Machine Learning, 28(1):41–75.

Xinxiong Chen, Zhiyuan Liu, and Maosong Sun. 2014. A Unified Model for Word Sense Representation and Disambiguation. In Proc. of EMNLP, pages 1025–1035.

Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation. In Proc. of EMNLP, pages 1724–1734.

Massimiliano Ciaramita and Yasemin Altun. 2006. Broad-coverage Sense Disambiguation and Information Extraction with a Supersense Sequence Tagger. In Proc. of EMNLP, pages 594–602.

Jordi Daude, Lluis Padro, and German Rigau. 2003. Validation and Tuning of WordNet Mapping Techniques. In Proc. of RANLP, pages 117–123.

Oier Lopez de Lacalle and Eneko Agirre. 2015. A Methodology for Word Sense Disambiguation at 90% based on large-scale Crowd Sourcing. In Proc. of SEM, pages 61–70.

Claudio Delli Bovi, Luca Telesca, and Roberto Navigli. 2015. Large-Scale Information Extraction from Textual Definitions through Deep Syntactic and Semantic Analysis. TACL, 3:529–543.

Philip Edmonds and Scott Cotton. 2001. Senseval-2: Overview. In Proc. of SENSEVAL, pages 1–5.

Manaal Faruqui, Yulia Tsvetkov, Graham Neubig, and Chris Dyer. 2016. Morphological Inflection Generation Using Character Sequence to Sequence Learning. In Proc. of NAACL-HLT, pages 634–643.

Lucie Flekova and Iryna Gurevych. 2016. Supersense embeddings: A unified model for supersense interpretation, prediction, and utilization. In Proc. of ACL, pages 2029–2041.

Alex Graves. 2013. Generating Sequences With Recurrent Neural Networks. CoRR, abs/1308.0850.

Alex Graves and Jurgen Schmidhuber. 2005. Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Networks, 18(5-6):602–610.

Jiatao Gu, Zhengdong Lu, Hang Li, and Victor O.K. Li. 2016. Incorporating Copying Mechanism in Sequence-to-Sequence Learning. In Proc. of ACL, pages 1631–1640.

Sepp Hochreiter and Jurgen Schmidhuber. 1997. Long Short-Term Memory. Neural Comput., 9(8):1735–1780.

Zhiheng Huang, Wei Xu, and Kai Yu. 2015. Bidirectional LSTM-CRF Models for Sequence Tagging. CoRR, abs/1508.01991.

Ignacio Iacobacci, Mohammad Taher Pilehvar, and Roberto Navigli. 2015. SensEmbed: Learning Sense Embeddings for Word and Relational Similarity. In Proc. of ACL, pages 95–105.


Ignacio Iacobacci, Mohammad Taher Pilehvar, and Roberto Navigli. 2016. Embeddings for Word Sense Disambiguation: An Evaluation Study. In Proc. of ACL, pages 897–907.

Mikael Kageback and Hans Salomonsson. 2016. Word Sense Disambiguation using a Bidirectional LSTM. In Proc. of CogALex, pages 51–56.

Adam Kilgarriff. 2001. English Lexical Sample Task Description. In Proc. of SENSEVAL, pages 17–20.

Ryan Kiros, Yukun Zhu, Ruslan R. Salakhutdinov, Richard Zemel, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Skip-thought vectors. In Proc. of NIPS, pages 3294–3302.

Thang Luong, Quoc V. Le, Ilya Sutskever, Oriol Vinyals, and Lukasz Kaiser. 2016. Multi-task sequence to sequence learning. In ICLR Workshop.

Xuezhe Ma and Eduard Hovy. 2016. End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF. In Proc. of ACL, pages 1064–1074.

Diana McCarthy, Marianna Apidianaki, and Katrin Erk. 2016. Word Sense Clustering and Clusterability. Comput. Linguist., 42(2):245–275.

Oren Melamud, Jacob Goldberger, and Ido Dagan. 2016. context2vec: Learning Generic Context Embedding with Bidirectional LSTM. In Proc. of CoNLL, pages 51–61.

Rada Mihalcea, Timothy Chklovski, and Adam Kilgarriff. 2004. The Senseval-3 English Lexical Sample Task. In Proc. of SENSEVAL, pages 25–28.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. In ICLR Workshop.

George A. Miller, R.T. Beckwith, Christiane D. Fellbaum, D. Gross, and K. Miller. 1990. WordNet: an online lexical database. International Journal of Lexicography, 3(4):235–244.

George A. Miller, Claudia Leacock, Randee Tengi, and Ross T. Bunker. 1993. A semantic concordance. In Proc. of HLT, pages 303–308.

Andrea Moro and Roberto Navigli. 2015. SemEval-2015 Task 13: Multilingual All-Words Sense Disambiguation and Entity Linking. In Proc. of SemEval, pages 288–297.

Andrea Moro, Alessandro Raganato, and Roberto Navigli. 2014. Entity Linking meets Word Sense Disambiguation: a Unified Approach. TACL, 2:231–244.

Nikola Mrksic, Ivan Vulic, Diarmuid O Seaghdha, Roi Reichart, Ira Leviant, Milica Gasic, Anna Korhonen, and Steve Young. 2017. Semantic Specialisation of Distributional Word Vector Spaces using Monolingual and Cross-Lingual Constraints. TACL, 5.

Roberto Navigli. 2009. Word sense disambiguation: A survey. ACM Computing Surveys, 41(2):10.

Roberto Navigli, David Jurgens, and Daniele Vannella. 2013. SemEval-2013 Task 12: Multilingual Word Sense Disambiguation. In Proc. of SemEval, volume 2, pages 222–231.

Roberto Navigli, Kenneth C. Litkowski, and Orin Hargraves. 2007. SemEval-2007 Task 07: Coarse-grained English All-words Task. In Proc. of SemEval, pages 30–35.

Roberto Navigli and Simone Paolo Ponzetto. 2012. BabelNet: The Automatic Construction, Evaluation and Application of a Wide-Coverage Multilingual Semantic Network. Artificial Intelligence, 193:217–250.

Steven Neale, Luis Gomes, Eneko Agirre, Oier Lopez de Lacalle, and Antonio Branco. 2016. Word Sense-Aware Machine Translation: Including Senses as Contextual Features for Improved Translation Models. In Proc. of LREC, pages 2777–2783.

Tommaso Pasini and Roberto Navigli. 2017. Train-O-Matic: Large-Scale Supervised Word Sense Disambiguation in Multiple Languages without Manual Training Data. In Proc. of EMNLP.

Mohammad Taher Pilehvar and Roberto Navigli. 2014. A Large-scale Pseudoword-based Evaluation Framework for State-of-the-art Word Sense Disambiguation. Comput. Linguist., 40(4):837–881.

Barbara Plank, Anders Søgaard, and Yoav Goldberg. 2016. Multilingual Part-of-Speech Tagging with Bidirectional Long Short-Term Memory Models and Auxiliary Loss. In Proc. of ACL, pages 412–418.

Sameer S. Pradhan, Edward Loper, Dmitriy Dligach, and Martha Palmer. 2007. SemEval-2007 Task 17: English Lexical Sample, SRL and All Words. In Proc. of SemEval-2007, pages 87–92.

Alessandro Raganato, Claudio Delli Bovi, and Roberto Navigli. 2016. Automatic Construction and Evaluation of a Large Semantically Enriched Wikipedia. In Proc. of IJCAI, pages 2894–2900.

Alessandro Raganato, Jose Camacho-Collados, and Roberto Navigli. 2017. Word Sense Disambiguation: A Unified Evaluation Framework and Empirical Comparison. In Proc. of EACL, pages 99–110.

Sascha Rothe and Hinrich Schutze. 2015. AutoExtend: Extending Word Embeddings to Embeddings for Synsets and Lexemes. In Proc. of ACL, pages 1793–1803.

Hui Shen, Razvan Bunescu, and Rada Mihalcea. 2013. Coarse to Fine Grained Sense Disambiguation in Wikipedia. In Proc. of SEM, pages 22–31.

Benjamin Snyder and Martha Palmer. 2004. The English all-words task. In Proc. of Senseval-3, pages 41–43.


Anders Søgaard and Yoav Goldberg. 2016. Deep multi-task learning with low level tasks supervised at lower layers. In Proc. of ACL, pages 231–235.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to Sequence Learning with Neural Networks. In Proc. of NIPS, pages 3104–3112.

Kaveh Taghipour and Hwee Tou Ng. 2015a. One Million Sense-Tagged Instances for Word Sense Disambiguation and Induction. In Proc. of CoNLL, pages 338–344.

Kaveh Taghipour and Hwee Tou Ng. 2015b. Semi-Supervised Word Sense Disambiguation Using Word Embeddings in General and Specific Domains. In Proc. of NAACL-HLT, pages 314–323.

Rocco Tripodi and Marcello Pelillo. 2017. A Game-Theoretic Approach to Word Sense Disambiguation. Comput. Linguist., 43(1):31–70.

Oriol Vinyals, Lukasz Kaiser, Terry Koo, Slav Petrov, Ilya Sutskever, and Geoffrey Hinton. 2015. Grammar as a foreign language. In Proc. of NIPS, pages 2773–2781.

Oriol Vinyals and Quoc V. Le. 2015. A Neural Conversational Model. In Proc. of ICML, JMLR: W&CP, volume 37.

Wenhui Wang and Baobao Chang. 2016. Graph-based Dependency Parsing with Bidirectional LSTM. In Proc. of ACL, pages 2306–2315.

Dirk Weissenborn, Leonhard Hennig, Feiyu Xu, and Hans Uszkoreit. 2015. Multi-Objective Optimization for the Joint Disambiguation of Nouns and Named Entities. In Proc. of ACL, pages 596–605.

Deyi Xiong and Min Zhang. 2014. A Sense-Based Translation Model for Statistical Machine Translation. In Proc. of ACL, pages 1459–1469.

Dayu Yuan, Ryan Doherty, Julian Richardson, Colin Evans, and Eric Altendorf. 2016. Semi-supervised Word Sense Disambiguation with Neural Models. In Proc. of COLING, pages 1374–1385.

Matthew D. Zeiler. 2012. ADADELTA: An Adaptive Learning Rate Method. CoRR, abs/1212.5701.

Zhi Zhong and Hwee Tou Ng. 2010. It Makes Sense: A Wide-Coverage Word Sense Disambiguation System for Free Text. In Proc. of ACL: System Demonstrations, pages 78–83.

Zhi Zhong and Hwee Tou Ng. 2012. Word Sense Disambiguation Improves Information Retrieval. In Proc. of ACL, pages 273–282.

Jie Zhou and Wei Xu. 2015. End-to-End Learning of Semantic Role Labeling using Recurrent Neural Networks. In Proc. of ACL, pages 1127–1137.

Peng Zhou, Wei Shi, Jun Tian, Zhenyu Qi, Bingchen Li, Hongwei Hao, and Bo Xu. 2016. Attention-Based Bidirectional Long Short-Term Memory Networks for Relation Classification. In Proc. of ACL, pages 207–212.
