University of Nebraska - LincolnDigitalCommons@University of Nebraska - Lincoln
Library Philosophy and Practice (e-journal) Libraries at University of Nebraska-Lincoln
2015
Metadata And Linked Data in Word SenseDisambiguationMatthew CorsmeierSan Jose State University, [email protected]
Follow this and additional works at: http://digitalcommons.unl.edu/libphilprac
Part of the Artificial Intelligence and Robotics Commons, Computational Linguistics Commons,and the Library and Information Science Commons
Corsmeier, Matthew, "Metadata And Linked Data in Word Sense Disambiguation" (2015). Library Philosophy and Practice (e-journal).Paper 1227.http://digitalcommons.unl.edu/libphilprac/1227
METADATA AND LINKED DATA IN WORD SENSE DISAMBIGUATION 1
Introduction
Word Sense Disambiguation (WSD) is referred to as an “AI-complete” problem
(Mallery, 1998), i.e., a task that is relatively easy for people, but considerably more difficult for
machines. If someone makes a query for a polysemous word (e.g., “plant,” “bass,” “mercury,”
etc…), how is an information retrieval system to understand which sense of the word is
intended? There exist tried-and-tested methods, such as just using the most predominant sense
of the word (McCarthy, Koeling, Weeds, & Carroll, 2004); or looking at the words next to the
query term to determine the statistically most likely meaning (Jurafsky & Martin, 2009; Manning
& Schütze, 1999); but these methods often produce less-than-satisfactory results [often around
70%] (Navigli, 2009). Furthermore, these methods have been heavily dependent on the manual
creation of knowledge sources (Edmonds, 2000), which are expensive to create and subject to
change, thus creating what is termed a knowledge acquisition bottleneck (Gale, Church, &
Yarowsky, 1992). Linked Data technologies (Berners-Lee, 2006), however, allow us to utilize
existing ontologies and lexica, which can then be exploited to improve the automatic semantic
understanding of the word. This paper will examine several systems that purport to
disambiguate words by using Linked Data, and some of the models these systems use to ensure
interoperability.
Literature Review
The most complete treatment of the subject of WSD is arguably Agirre & Edmonds [ed.]
(2007), which presents a detailed definition of the problem, along with a history thereof, and
numerous algorithms which are used in practice. Kwong (2013) offers slightly more recent
coverage, along with predictions as to how WSD methods will evolve in the near future.
METADATA AND LINKED DATA IN WORD SENSE DISAMBIGUATION 2
Generalists might find sufficient the survey from Navigli (2009), or the chapters covering WSD
in either Jurafsky & Martin (2009) or Manning & Schütze (1999). SemEval [which was
originally named Senseval (Kilgarriff, 1998)] is an ongoing evaluation project which is used as a
baseline to assess various WSD methods, including many which will be examined in this paper.
Linked Linguistic Open Data (LLOD) is heavily dependent on metadata, and any
consideration thereof would require an examination of its standards. A brief history of the topic
of linguistic annotation can be found in Palmer & Xue (2013). Bird & Simons (2003a) and Ide,
Romary, & de la Clergerie (2004) proposed sets of best practices for linguistic annotations, while
Simons, Bird, & Spanne (2008) offered a more recent set of recommendations that specifically
suggested language codes from ISO 639-31 be used in metadata. Ide & Pustejovsky (2010)
suggested a list of best practices for language technology metadata, focusing heavily on the work
of the OLAC and European Languages Resource Association (ELRA). Gracia, Montiel-
Ponsoda, Cimiano, Gómez-Pérez. Buitelaar, & McCrae (2012) considered the issue of Linked
Data being stored in different languages, and suggested that techniques such as ontology
localization, ontology mapping, and cross-lingual ontology-based information access and
presentation would help prevent information from being locked up in linguistic data silos. Gayo,
Kontokostas, & Auer (2013) presented a set of best practices for multilingual linked open data,
and point out that SPARQL queries can be improved if tags are identified by language. Reviews
of specific linguistic annotation schemes include: the Open Languages Archives Community
[OLAC] metadata set (Bird & Simons,2003b); the General Ontology for Linguistic Description
[GOLD] (Farrar & Langendoen; 2003); ISOcat, a Data Category Registry (DCR) for the ISO TC
37 (terminology and other language and content resources) registry (Kemps-Snijders,
1 http://www-01.sil.org/iso639-3/
METADATA AND LINKED DATA IN WORD SENSE DISAMBIGUATION 3
Windhouwer, Wittenburg, & Wright (2009); the ISO/TC 37/SC 4 standard (Lee & Romary,
2010); the lemon (LExicon Model for ONtologies) model (McCrae, Aguado-de-Cea, Buitelaar,
Cimiano, Declerck, Gómez-Pérez, ... & Wunner, 2012); and Lexical Markup Framework [LMF]
(Francopoulo, 2013), which had a strong influence on the lemon model.
A number of papers detail projects that utilized the schemes listed above. Montiel-
Ponsoda, Gracia del Río, Aguado de Cea, & Gómez-Pérez (2011) showed how the lemon model
can be extended using a metamodel in OWL, which would allow translation to be represented on
a separate layer. Buitelaar, Cimiano, Haase, & Sintek (2009) advocated using ontologies beyond
those of RDFS, OWL, and SKOS, and presented a model called LexInfo, which combines
aspects of older models. Chiarcos, Dipper, Götze, Leser, Lüdeling, Ritz, & Stede (2008) treated
the Ontology of Linguistic Annotation, which is especially useful for corpora that have been
annotated a number of different times in a number of different methods.
Two of the most commonly used linguistic tools on the Semantic Web are the general-
purpose lexical ontologies WordNet (Fellbaum, 1998) and FrameNet (Baker, Fillmore, & Lowe,
1998). Although Ide (2014) argued that FrameNet was the “ideal resource for representation as
linked data” (18), the majority of the projects covered later in this paper utilized WordNet, and
thus this tool will be examined in more detail. Both FrameNet and WordNet are often used in
Linked Data projects to automatically annotate texts with semantic metadata. Projects that have
used these databases include Huang (2007), wherein WordNet files were converted to be
presented in OWL to assist in machine comprehension of metaphor; and BabelNet (Navigli,
2012), a resource which will be reviewed later in this paper. Ehrmann, Cecconi, Vannella,
McCrae, Cimiano, & Navigli (2014) converted BabelNet into Linked Data via the lemon model;
and Moro, Navigli, Tucci & Passonneau (2014) used BabelNet to automatically annotate the
METADATA AND LINKED DATA IN WORD SENSE DISAMBIGUATION 4
Manually Annotated Sub-Corpus 3.0 (MASC) and therewith were able to perform automatic
WSD with an accuracy of 70%, an impressive figure; but still too low to see much practical
adoption.
Other examples of linguistic tools used with Semantic Web technologies include
Krizhanovsky & Smirnov (2013), wherein Wiktionary was utilized to automatically create a
general-purpose lexical ontology; Hellmann, Brekle, & Auer (2013) described a similar project
wherein Wiktionary extractors made use of DBpedia to create RDF triples; de Melo (2014a)
introduced lexvo.org, a system which automatically creates URIs for each word and sense, thus
guaranteeing a constant reference; Mendes, Jakob, García-Silva, & Bizer (2011), introduced
DBpedia Spotlight, an open-source program that automatically annotates texts to the Linked
Open Data cloud by using the URIs in DBpedia; and Sérasset (2014) described the extraction of
multilingual lexical data from Wiktionary, the importation thereof into DBNary, and the final
conversion of the data into MLLOD (Multilingual Lexical Linked Open Data) via the lemon
model.
A number of very different methods of using Semantic Web technologies to disambiguate
word senses have been attempted and analyzed. Elbedweihy, Wrigley, Ciravegna, & Zhang
(2013) used a combination of WordNet, BabelNet, and Wikipedia to help generate SPARQL
queries, which would subsequently resolve ambiguities in the original queries with a success rate
of 76%; Fragos (2013) also used WordNet - in this case the extended glosses of WordNet - to
train WSD systems; McCarthy et al. (2004) used the WordNet similarity package and raw textual
corpora to solve WSD by using the predominant sense of the word, a method which achieved a
success rate of 64%; Ide (2006) treated the problem of polysemy by mapping FrameNet sets to
WordNet.
METADATA AND LINKED DATA IN WORD SENSE DISAMBIGUATION 5
Some case study reviews show the strengths and weaknesses of more general models:
Haase (2004) looked at tags for digital images to argue that semantic metadata can help alleviate
some of the issues of precision caused by selecting overly narrow terms; de Melo & Weikum
(2008) argued that “language-related knowledge” forms the backbone of the semantic web, and
presented ways in which linguistic items such as languages, scripts, and terms can
unambiguously be linked with URIs, and from whence new links can automatically be formed;
and Tagarelli, Longo, & Greco (2009) showed how notions of sense relatedness can be
calculated by examining overlaps between dictionary glosses and measuring distances for
ontology paths.
The rest of the paper will cover in more detail several tools and models that feature
prominently in the use of metadata to disambiguate word senses. As WordNet comes up so often
in this paper, a rudimentary comprehension of the workings of the lexicon will be necessary to
fully understand how other tools covered in this paper work. The subsequent sections will look
at lexvo.org; the lemon model; BabelNet; WordNet++; and finally tag disambiguation with
TAGora Sense Repository. This paper will conclude with a brief examination of some of the
issues of relying on meta- and linked data for Word Sense Disambiguation.
METADATA AND LINKED DATA IN WORD SENSE DISAMBIGUATION 6
A figure showing linked resources available in the Linguistic Linked Open Data Cloud.
Taken from http://linguistics.okfn.org/llod on November 19, 2014.
METADATA AND LINKED DATA IN WORD SENSE DISAMBIGUATION 7
WordNet
WordNet (Fellbaum, 1998) is a general-purpose lexical ontology that features prominently in the
web of linked data. In WordNet, words are organized according to Synset, which include not
only synonyms, but also hypernyms (broader categories), hyponyms (narrower categories),
meronyms (part-to-whole relationships), and more. Classifying words by hyper- and hyponyms
allows entries to be nested hierarchically, which can assist in determining the correct sense of a
word. WordNet is free to use and fully queriable, and below are screenshots showing the results
for a query of the word play, a highly polysemous word which will again be considered in detail
when we examine BabelNet.
Basic results of a query for the word “play” at WordNet. As one can easily see, “play” is highly polysemous, and most senses of the word are listed with their accompanying Synsets. Accessed November 19, 2014 at http://bit.ly/1uSKI0e
Drilling down on the first sense of the word “play,” we can see a list of hyponyms (narrower forms), as well as a detailed hierarchy showing the hypernyms this sense of “play” is nested in, going all the way to “entity”, which is the listing for all nouns in WordNet. Accessed November 19, 2014 at http://bit.ly/1xQxMrV
METADATA AND LINKED DATA IN WORD SENSE DISAMBIGUATION 8
In addition to organizing words in “Synsets,” WordNet assigns a “sense key” to each
sense of a word, a fact that will be exploited to great effect in BabelNet. For example, the sense
of “play” indicating a dramatic performance would be listed as play1n, with the superscript “1”
indicating that this is the first sense of “play” listed in WordNet, and the subscript “n” indicating
that this instance of “play” is a noun. Play1n would be in the same Synset as drama1
n and
dramatic play1n, which are also the first senses listed for their respective terms, and nouns as
well. The sense of the word “play” referring to children’s games would be listed as play8n, and
in the same Synset as child’s play2n, as they are respectively the eighth and second sense of
their terms (Navigli & Ponzetto, 2012a). Besides being used in the projects listed below,
WordNet is also used by Wikipedia, as all pages at Wikipedia have WordNet senses
automatically associated with them (Ponzetto & Navigli, 2010). However, one criticism of
WordNet is that many senses listed are too similar, which may limit the usefulness of WordNet
in WSD tasks (Ide, 2006).
Various metadata schemes have been used to link the information contained within the
WordNet ontology with other databases and ontologies. There is a nearly complete XML
version of WordNet2, and an RDF version
3 which was structured according to the lemon model
(cf. below). There exists as well a mapping of WordNet entries to schema.org terms4. The World
Wide Web Consortium also maintains extensive documentation regarding RDF/OWL
representations of WordNet (van Assem, Gangemi, & Schreiber, 2006). Finally, entries in
2 http://wordnet.princeton.edu/wordnet/download/standoff/
3 http://wordnet-rdf.princeton.edu/
4 http://schema.rdfs.org/mappings/schemaorg_wn.owl
METADATA AND LINKED DATA IN WORD SENSE DISAMBIGUATION 9
WordNet are semantically linked to a number of other LLOD resources, including lexvo.org (cf.
below), and VerbNet.
Lexvo.org
Lexvo.org (de Melo, 2014a) is an easy-to-use service which provides URIs to identify any term
in a given language. These URIs “serve as interchange URIs that can easily be created from any
word-segmented natural language text” (de Melo, 2014a, 3). Unambiguous reference to a
specific language is achieved by using ISO639-3 codes, and lexvo.org records also give the parts
of speech for the various senses, and provide links to translations in other languages. By
referring to what language the term is from, lexvo.org can help assist in multilingual
disambiguation, which can
The results for the English word “play.” Accessed November 21, 2014 at http://www.lexvo.org/page/term/eng/play
help prevent some accidental, embarrassing results5. In addition to linking individual senses of a
word to WordNet synsets (copies of which are hosted at lexvo.org), records at lexvo.org are
5 Problems can occur when words in foreign languages have the same spellings but different meanings.
E.g. the German word Mist is considered a faux ami (false friend), as its meaning is very different from the
orthographically identical word in English – der Mist would be probably be translated as “crap” in English.
METADATA AND LINKED DATA IN WORD SENSE DISAMBIGUATION 10
linked to Library of Congress Subject Headings; records at OpenCyc6; and linked internally at
lexvo.org to translations for the various senses of the word (e.g., here is the URI for the word jeu,
and here is the IRI7 for pièce de théâtre, two French words which capture two of the senses
connoted by the English play). However, it should be pointed out that when I retrieved the
record for “play,” a number of URIs linked to it had expired – a problem with linked data that I
will explore again at the conclusion of this paper. Lexvo.org uses sources such as Wikipedia,
Wiktionary, and the Unicode CLDR (Common Locale Data Repository)8 to supply descriptions
of all the languages it covers. In 2008, Lexvo.org became the first web site to publish Linked
Data based on Wiktionary on the Web (de Melo, 2014a), and it also utilizes the URIs created for
words in the WordNet lexical RDF/OWL Representation of WordNet (van Assem, Gangemi, &
Schreiber, 2006). Lexvo.org can help free data from “data silos” by providing unambiguous
URIs for individual languages, terms within, and even senses of these terms. Lexvo.org links are
used by the British Library in their British National Bibliography data; the Spanish National
Library; Sudoc – the French academic catalog; LOCAH Linked Archives Hub project, and
DBpedia Spotlight (de Melo, 2014b)
Lemon
Lemon (LExicon Model for Ontologies) “is designed to represent lexical information about
words and terms relative to an ontology on the Web” (McCrae et al., 2012, 703), and is what is
referred to as an ontology-lexicon, which shows how the classes of the ontology are realized
6 http://www.cyc.com/platform/opencyc
7 Technically, URIs cannot contain non-ASCII characters (Cyganiak, Wood, & Lanthaler, 2014), and since
pièce de théâtre contains diacritics, the URL for it would be considered an IRI.
8 http://cldr.unicode.org/
METADATA AND LINKED DATA IN WORD SENSE DISAMBIGUATION 11
linguistically (Buitelaar, 2010). Lemon follows a principle referred to as semantics by reference,
in that “the (lexical) meaning of the entries in the lexicon is assumed to be expressed exclusively
in the ontology and the lexicon merely points to the appropriate concepts” (McCrae et al., 2012,
703). Lemon is similar to the SKOS project (Miles and Bechhofer, 2009), but differs in that it is
“an independent and external model, intended to be published with arbitrary ontology-based
conceptualisations [sic] … in order to provide a richer description of the knowledge captured in
those resources in one or several natural languages” (McCrae et al., 2012, 703).
Lemon is an RDF-native form, which allows it to exploit existing Semantic Web technologies.
Drilling down on one of the results of a query for “play”. Accessed November 19, 2014 at http://lemon-model.net/lexica/uby/wn/WN_LexicalEntry_46245
The intention of the lemon model is not to be a semantic model, but instead it references existing
resources (e.g., ontologies) which gives it the capability to represent semantics. The model does
this through its “(lexical) sense” object, which avoids using the concept of a word sense that is
commonly found in many existing models, a practice that Kilgariff (1997) criticized in his paper
METADATA AND LINKED DATA IN WORD SENSE DISAMBIGUATION 12
about the issues created by word senses (McCrae et al., 2014). The lemon model is used by a
great number of linked data initiatives, including BabelNet, which will be considered later in this
paper (Ehrmann et al., 2014).
A diagram of the lemon model
from Ehrmann et al., 2014, 404
METADATA AND LINKED DATA IN WORD SENSE DISAMBIGUATION 13
BabelNet
Simply put, BabelNet9 (Navigli & Ponzetto, 2012a) is a combination of the structure of
Wikipedia augmented with Synsets from WordNet. The two databases were integrated by an
automatic mapping, and any lexical gaps in resource-poor languages were plugged by using
machine translation. The resulting “encyclopedic dictionary” provides lexical information on
concepts and named entities in many languages, and the end result is a semantically-linked
ontology. From version 2.0 onwards, the database was also linked with the Open Multilingual
WordNet [OMWN] (Bond & Foster, 2013), a collection of wordnets in different languages, and
OmegaWiki10
, a collaborative dictionary that is available in a number of different languages.
The 2.0 version also had more than 9 million Babel synsets (i.e. entries linked in semantic
networks) in over 50 languages (Ehrmann et al., 2014; Navigli, 2014). Currently, version 2.5 is
in beta mode.
The knowledge contained within these Babel synsets can be used to perform knowledge-
rich (Navigli & Velardi, 2005), graph-based (Bird & Liberman, 2001) Word Sense
Disambiguation in both monolingual and multilingual settings. By utilizing the partial structure
of a typical Wikipedia page, BabelNet is able to extract semantic information regarding an entry.
For example, by using Wikipedia’s redirect and disambiguation pages, internal and multilingual
links, and category designations as well, BabelNet is able to automatically glean significant
semantic information regarding an entry [cf. graph below] (Navigli & Ponzetto, 2012a; Navigli
& Ponzetto, 2012b).
9 babelnet.org
10 http://www.omegawiki.org/Meta:Main_Page
METADATA AND LINKED DATA IN WORD SENSE DISAMBIGUATION 14
From Navigli & Ponzatto, 2012, 220
The connections in WordNet reimagined as a graph. In this graph, the node would be the Synset, and the edges the lexical and semantic relations between terms. Remark the superscript numbers revealing the sense of a polysemous word, and the subscript letter indicating the part of speech.
The connections in Wikipedia reimagined as a graph. Here the edges are the hyperlinks between entries. For the sake of conciseness only a small portion of the graph is shown, but notice the inclusion of “tragedy” in both graphs; BabelNet was able to deduce these two entries referred to the same sense by calculating the percentage of words in common.
BabelNet uses a mapping algorithm to determine which sense of a polysemous WordNet entry
should be paired with which disambiguated Wikipedia entry. In the example illustrated above,
WordNet’s play1n would be paired with Wikipedia’s PLAY(THEATRE) entry since the graphs
for two entries share more words in common than the graphs for the other choices. Concerning
Word Sense Disambiguation, the gains in precision over rival methods may be quite small, but
the gains in recall are fairly high. Furthermore, BabelNet can also take advantage of the
multilingual links in Wikipedia to assist even in monolingual WSD, as can be seen in the figure
below (Navigli & Ponzetto, 2012a). There also exists an RDF version of BabelNet 2.0 which
contains about 1.1 billion triples, among which there are over 9 million SKOS concepts, and
nearly 100 million lemon lexical senses (Ehrmann et al., 2014). BabelNet can currently be used
METADATA AND LINKED DATA IN WORD SENSE DISAMBIGUATION 15
as a stand-alone resource with its Java API, a SPARQL endpoint, or as a Linked Data interface
as part of the Linguistic Linked Open Data (LLOD) cloud (Navigli, 2014). A wiki on converting
BabelNet as Linguistic Linked is also maintained by the World Wide Web Consortium11
.
From Navigli & Ponsetto, 2012, 221
BabelNet takes the linked data from Wikipedia and WordNet to create a Babel Synset, which would also include translations of the word in various languages. Having access to such translations can even be used to assist in monolingual WSD.
11
http://www.w3.org/community/bpmlod/wiki/Converting_BabelNet_as_Linguistic_Linked_Data
METADATA AND LINKED DATA IN WORD SENSE DISAMBIGUATION 16
Results of a query for the word “play” at BabelNet 2.5, which is currently in beta mode. Accessed November 20, 2014 at http://babelnet.org/exploreResult?word=play&lang=EN
Drilling down to the first result of the query above. Accessed November 20, 2014 at http://babelnet.org/search?word=bn:00028604n&details=1&orig=play&lang=EN
Continuation of the web page displayed on the left, showing WordNet senses and definitions, as well as redirections and definitions from Wikipedia
METADATA AND LINKED DATA IN WORD SENSE DISAMBIGUATION 17
WordNet++
WordNet++ (Ponzetto & Navigli, 2010) is an English-only subset of BabelNet in which entries
from Wikipedia are mapped to the corresponding senses in WordNet. For example, by looking
at the disambiguation links for the Wikipedia entry SODA (SOFT DRINK), we can construct a
disambiguation context which includes the words: soft, drink, cola, sugar. The possible matches
at WordNet include soda1n, which includes words like salt, acetate, chlorate, and benzoate.
The context for soda2n includes soft, drink, cola, bitter, etc…. Having the largest intersection
between them, WordNet++ would match Wikipedia’s SODA(SOFT DRINK) with WordNet’s
soda2n. Furthermore, WordNet++ can use the additional links in Wikipedia (e.g. SODA(SOFT
DRINK) is linked to the entry SYRUP to create even larger Synsets. This method generated
consistently high results on several Semeval tasks (Ponzetto & Navigli, 2010).
Tag Disambiguation with DBpedia
Delicious12
allows users to assign text descriptions to user-contributed tags, but these can’t
readily be used by machines (Garcia, Szomszor, Alani, & Corcho, 2009). Garcia et al.
developed an approach that uses the TAGora Sense Repository (TSR)13
, a linked data resource
that provides metadata about tags and their possible senses, to disambiguate the mostly like sense
of a word. The TSR is ultimately linked to DBpedia (Morsey, Lehmann, Auer, Stadler, &
Hellmann, 2012), a representation in the RDF model of a portion of the information in
Wikipedia. This approach faced one difficultly that many other WSD systems do not face,
12
https://delicious.com/
13 Currently at http://www.tagora-project.eu/ The URL listed in Garcia et al. 2009 did not work at the time
of writing.
METADATA AND LINKED DATA IN WORD SENSE DISAMBIGUATION 18
namely tags do not occur in sentences, and therefore this approach could not examine the word
in its context to disambiguate it. However, tags seldom occur singularly, and this method was
able to look at the other tags in order to try to determine the most likely sense. Garcia et al.
(2009) created an algorithm which represented the tags and the context in vectors, and then used
similarity measures to choose the most likely sense. The authors stated that this approach was
designed to be used for tag disambiguation, but could be used for other functions as well.
Conclusion
Like many other activities in the realm of artificial intelligence, Word Sense
Disambiguation is a task that is notoriously difficult for machines, but relatively simple for
humans. Successful methods for disambiguating polysemous words automatically have been
developed, but these methods are heavily reliant on marked-up corpora, which require significant
investments of time and money (Edmonds, 2000). Linked Open Data, however, is allowing
computers to exploit semantically marked-up data in an attempt to share resources, and thus we
can use already discovered knowledge to solve hitherto unseen problems. This is only possible,
of course, because of a realization of the importance of interoperability and a commitment to
shared standards.
But a dependence on interconnected data has created a new set of challenges. One huge
problem with Linked Data is the quality of links. The ever-changing nature of information
means that websites with static information are at a disadvantage, for if the information
contained within the page ever changes, it has to be changed manually to remain current. While
this problem is obviated by using Linked Data, a new problem arises if the link disappears
without notice. In my cursory explorations of several tools examined above, I came across
METADATA AND LINKED DATA IN WORD SENSE DISAMBIGUATION 19
several dead links, which naturally caused me to question the practicality of these tools. De
Melo (2014), in introducing lexvo.org, explained that he was reluctant to use dynamic
information from Wiktionary, as the site changes frequently. It seems entirely possible that even
a slight structural change at a site like Wiktionary could wreak havoc with the resources linked to
it, and it is not certain that the administrators at Wiktionary will take these consequences into
account when debating alterations. Furthermore, there presently seems to be no mechanism to
prune and replace dead links, and such a deficit seems to be a liability for any service relying on
linked data.
Another concern is that none of these standards or tools will gain purchase, and we could
end up with a multitude of systems competing with and failing to operate with each other. While
a model like Lemon seems to have gained wide use, BabelNet is a relatively young project, and
may drift towards obsolescence like the projects WiSeNEt (Moro & Navigli, 2012) and MENTA
(de Melo & Weikum, 2010), two recent similar projects that are scarcely mentioned today.
Nonetheless, the exponentially increasing amount of data available through Linked Open
Data seems to suggest that WSD systems will come to increasingly rely on it, and on the
metadata they will use to locate the sought-after information. Naturally, such linking would not
be possible without a commitment to shared standards, and the success of these linked data tools
can be considered another victory for the virtues of interoperability. However, in our rush to
connect everything, it would be advisable to regularly examine the quality of the data we are
linking.
METADATA AND LINKED DATA IN WORD SENSE DISAMBIGUATION 20
References
Agirre, E., & Edmonds, P. G. (ed.). (2007). Word Sense Disambiguation: Algorithms and
Applications. Berlin: Springer.
Baker, C. F., Fillmore, C. J., & Lowe, J. B. (1998). The Berkeley Framenet project. In
Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics
and 17th International Conference on Computational Linguistics-Volume 1 (pp. 86-90).
Association for Computational Linguistics.
Bird, S., & Liberman, M. (2001). A formal framework for linguistic annotation.
arXiv:cs/9903003v1
Bird, S., & Simons, G. (2003a). Extending Dublin Core metadata to support the description and
discovery of language resources. Computers and the Humanities, 37(4), 375-388.
Bird, S., & Simons, G. (2003b). Seven dimensions of portability for language documentation and
description. Language, 79, 557-582. Accessed October 21, 2014 at
http://www.jstor.org.libaccess.sjlibrary.org/stable/4489465
Bond, F. & Foster, R. (2013). Linking and extending an Open Multilingual Wordnet. In Proc. of
51st Annual Meeting of the Association for Computational Linguistics, 1352–1362.
Buitelaar, P. (2010). Ontology-based semantic lexicons: Mapping between terms and object
descriptions. In: C.R. Huang, N. Calzolari, A. Gangemi, A. Lenci, A. Oltramari & L.
Prevot (Eds.), Ontology and the Lexicon, pp. 212–223. Cambridge: Cambridge
University Press.
Buitelaar, P., Cimiano, P., Haase, P., & Sintek, M. (2009). Towards linguistically grounded
ontologies. In The Semantic Web: Research and Applications, pp. 111-125. Berlin:
Springer
Berners-Lee, T. (2006). Linked Data. Accessed September 29, 2014 at
http://www.w3.org/DesignIssues/LinkedData.html
Chiarcos, C., Dipper, S., Götze, M., Leser, U., Lüdeling, A., Ritz, J., & Stede, M. (2008). A
flexible framework for integrating annotations from different tools and tagsets.
Traitement Automatique des Langues, 49(2), 271-293.
Cyganiak, R.; Wood, D.; Lanthaler, M. (2014) RDF 1.1 Concepts and abstract syntax. Accessed
October 7, 2014 at http://www.w3.org/TR/2014/REC-rdf11-concepts-20140225/#dfn-iri
de Melo, G. & Weikum, G. (2010). MENTA: Inducing multilingual taxonomies from Wikipedia,
in: Proceedings of the Nineteenth ACM Conference on Information and, Knowledge
Management, pp. 1099-1108. Toronto, Canada, 26–30.
de Melo, G. (2014a). Lexvo.org: Language-related information for the linguistic linked data
cloud. Semantic Web Journal. Accessed October 15, 2014 at http://www.semantic-web-
journal.net/system/files/swj521.pdf
de Melo, G. (2014b). Lexvo.org main page. Accessed November 18, 2014 at www.lexvo.org
de Melo, G., & Weikum, G. (2008, October). Language as a foundation of the semantic web. In
International Semantic Web Conference (Posters & Demos). Accessed October 17, 2014
at https://www.mpi-inf.mpg.de/~gdemelo/papers/demelo-lexvo-iswc2008.pdf
Edmonds, P. (2000). Designing a task for SENSEVAL-2 [Technical report]. Brighton, UK:
University of Brighton.
Ehrmann, M., Cecconi, F., Vannella, D., McCrae, J., Cimiano, P., & Navigli, R. (2014).
Representing multilingual data as Linked Data: the Case of BabelNet 2.0. In Proceedings.
METADATA AND LINKED DATA IN WORD SENSE DISAMBIGUATION 21
of Language Resource Evaluation Conference. Accessed October 20, 2014 at
http://www.lrec-conf.org/proceedings/lrec2014/pdf/810_Paper.pdf
Elbedweihy, K., Wrigley, S. N., Ciravegna, F., & Zhang, Z. (2013). Using Babelnet in bridging
the gap between natural language queries and linked data concepts. In NLP-DBPEDIA@
ISWC. Accessed October 17, 2014 at http://ceur-ws.org/Vol-
1064/Elbedweihy_Using_BabelNet.pdf
Farrar, S., & Langendoen, D. T. (2003). A linguistic ontology for the semantic web. Glot
International, 7(3), 97-100.
Fellbaum, C. (ed.). (1998). WordNet: An Electronic Lexical Database. Cambridge, MA: MIT
Press.
Fragos, K. (2013). Modeling WordNet glosses to perform word sense disambiguation.
International Journal on Artificial Intelligence Tools, 22(02). Accessed October 24,
2014 at
http://www.worldscientific.com.libaccess.sjlibrary.org/doi/abs/10.1142/S0218213013500
036
Francopoulo, G. (ed.). (2013) LMF Lexical Markup Framework. Hoboken, NJ: Wiley.
Gale, W. A., Church, K., & Yarowsky, D. (1992). Estimating upper and lower bounds on the
performance of word-sense disambiguation programs. In Proceedings of the 30th Annual
Meeting of the Association for Computational Linguistics, pp. 249–256.
Garcia, A., Szomszor, M., Alani, H., & Corcho, O. (2009). Preliminary results in tag
disambiguation using DBpedia. Accessed October 20, 2014 at
http://oro.open.ac.uk/20006/1/ckcar09-final.pdf
Gayo, J. E. L., Kontokostas, D., & Auer, S. (2013). Multilingual linked open data patterns.
Semantic Web Journal Accessed October 15, 2014 at http://www.semantic-web-
journal.net/system/files/swj495.pdf
Gracia, J., Montiel-Ponsoda, E., Cimiano, P., Gómez-Pérez, A., Buitelaar, P., & McCrae, J.
(2012). Challenges for the multilingual web of data. Web Semantics: Science, Services
and Agents on the World Wide Web, 11, 63-71.
Haase, K. (2004) Context for semantic metadata. Proceedings of the 12th Annual ACM
International Conference on Multimedia. ACM, 2004. Accessed October 16, 2014 at
http://www.yamod.ch/media/11026/haase01.pdf
Hellmann, S., Brekle, J., & Auer, S. (2013). Leveraging the crowdsourcing of lexical resources
for bootstrapping a linguistic data cloud. In Semantic Technology, pp. 191-206. Berlin:
Springer.
Huang, X. X. (2007). An OWL-based WordNet lexical ontology. Journal of Zhejiang University
SCIENCE A, 8(6), 864-870.
Ide, N. (2006). Making senses: Bootstrapping sense-tagged lists of semantically-related words.
In Computational Linguistics and Intelligent Text Processing, pp. 13-27. Berlin: Springer.
Ide, N. (2014). FrameNet and Linked Data. In Proceedings of Frame Semantics in NLP: A
Workshop in Honor of Chuck Fillmore (Vol. 1929, pp. 18-21).
Ide, N., & Pustejovsky, J. (2010, January). What does interoperability mean, anyway? Toward an
operational definition of interoperability for language technology. In Proceedings of the
Second International Conference on Global Interoperability for Language Resources.
Hong Kong, China. Accessed October 21, 2014 at
http://www.cs.vassar.edu/~ide/papers/ICGL10.pdf
METADATA AND LINKED DATA IN WORD SENSE DISAMBIGUATION 22
Ide, N., Romary, L., & de la Clergerie, E. (2004). International standard for a linguistic
annotation framework. Natural Language Engineering, 10(3-4), 211-225.
Jurafsky, D., & Martin, J. H. (2009). Speech and Language Processing: An Introduction to
Natural Language Processing, Computational Linguistics, and Speech Recognition.
Upper Saddle River, NJ: Prentice Hall.
Kemps-Snijders, M., Windhouwer, M., Wittenburg, P., & Wright, S. E. (2009). ISOcat:
Remodelling metadata for language resources. International Journal of Metadata,
Semantics and Ontologies, 4(4), 261-276.
Kilgarriff, A. (1997). I don’t believe in word senses. Computers and the Humanities, 31(2), 91–
113.
Kilgarriff, A. (1998, May). Senseval: An exercise in evaluating word sense disambiguation
programs. In Proceedings of the First International Conference on Language Resources
and Evaluation, pp. 581-588. Accessed October 24, 2014 at
http://citeseerx.ist.psu.edu/viewdoc/download;jsessionid=3CDF14935273C9C8D3D5BF
AD0964F349?doi=10.1.1.32.1931&rep=rep1&type=pdf
Krizhanovsky, A. A., & Smirnov, A. V. (2013). An approach to automated construction of a
general-purpose lexical ontology based on Wiktionary. Journal of Computer and Systems
Sciences International, 52(2), 215-225.
Kwong, O. Y. (2013). New Perspectives on Computational and Cognitive Strategies for Word
Sense Disambiguation. Berlin: Springer.
Lee, K., & Romary, L. (2010). Towards interoperability of ISO standards for Language Resource
Management. Proc. ICGL 2010.
Mallery, J. C. (1988). Thinking about foreign policy: Finding an appropriate role for artificial
intelligence computers. Ph.D. dissertation. Cambridge, MA: MIT Political Science
Department.
Manning, C.D., & Schütze, H. (1999). Foundations of Statistical Natural Language Processing.
Cambridge, MA: MIT Press.
McCarthy, D., Koeling, R., Weeds, J., & Carroll, J. (2004, July). Finding predominant word
senses in untagged text. In Proceedings of the 42nd Annual Meeting on Association for
Computational Linguistics (p. 279). Association for Computational Linguistics.
Accessed October 24, 2014 at http://sro.sussex.ac.uk/1210/1/senseranks.pdf
McCrae, J., Aguado-de-Cea, G., Buitelaar, P., Cimiano, P., Declerck, T., Gómez-Pérez, A., ... &
Wunner, T. (2012). Interchanging lexical resources on the Semantic Web. Language
Resources and Evaluation, 46(4), 701-719.
Mendes, P. N., Jakob, M., García-Silva, A., & Bizer, C. (2011, September). DBpedia spotlight:
Shedding light on the web of documents. In Proceedings of the 7th International
Conference on Semantic Systems, pp. 1-8. ACM. Accessed October 21, 2014 at
http://oa.upm.es/11477/2/INVE_MEM_2011_105377.pdf
Miles, A., & Bechhofer, S. (2009). SKOS simple knowledge organization system reference.
http://www.w3.org/TR/skos-reference/. Accessed November 19, 2014.
Montiel-Ponsoda, E., Gracia del Río, J., Aguado de Cea, G., & Gómez-Pérez, A. (2011).
Representing translations on the semantic web. Accessed October 17, 2014 at
http://oa.upm.es/10295/1/Representing.pdf
Moro, A. & Navigli, R. (2012). WiSeNet: Building a Wikipedia-based semantic network with
ontologized relations, in: Proceedings of the 21st ACM Conference on Information and
Knowledge Management. Maui, Hawaii.
METADATA AND LINKED DATA IN WORD SENSE DISAMBIGUATION 23
Moro, A., Navigli, R., Tucci, F. M., & Passonneau, R. J. (2014). Annotating the MASC Corpus
with BabelNet. In Proceedings of LREC. Accessed October 16, 2014 at
http://wwwusers.di.uniroma1.it/~moro/MoroEtAL_LREC2014.pdf
Morsey, M.; Lehmann, J.; Auer, S.; Stadler, C.; & Hellmann, S. (2012). DBpedia and the live
extraction of structured data from Wikipedia. Program, 46(2), 157-181. DOI:
10.1108/00330331211221828
Navigli, R. & Ponzetto, S. P. (2012a). BabelNet: The automatic construction, evaluation and
application of a wide-coverage multilingual semantic network. Artificial Intelligence, 193,
217-250.
Navigli, R. & Ponzetto, S. P. (2012b). Joining forces pays off: multilingual joint word sense
disambiguation. In Proceedings of EMNLP, pp. 1399–1410. Jeju, Korea.
Navigli, R., & Velardi, P. (2005). Structural semantic interconnections: A knowledge-based
approach to word sense disambiguation. Pattern Analysis and Machine Intelligence,
IEEE Transactions on, 27(7), 1075-1086.
Navigli, R. (2009). Word sense disambiguation: A survey. ACM Computing Surveys (CSUR),
41(2), 10.
Navigli, R. (2012). BabelNet goes to the (Multilingual) Semantic Web. In: ISWC 2012 Workshop
on Multilingual Semantic Web. Accessed October 21, 2014 at http://ceur-ws.org/Vol-
936/paper1.pdf
Navigli, R. (2014). About Babelnet. Accessed November 21, 2014 at http://babelnet.org/about
Palmer, M. & Xue, N. (2013). Linguistic annotation. In Clark, A., Fox, C., & Lappin, S. (eds.).
The Handbook of Computational Linguistics and Natural Language Processing, West
Sussex, UK: John Wiley & Sons.
Ponzetto, S. P. & Navigli, R. (2010). Knowledge-rich word sense disambiguation rivaling
supervised systems. In: Proceedings of the 48th Annual Meeting of the Association for
Computational Linguistics, Association for Computational Linguistics, pp. 1522-1531.
Stroudsburg, PA: Association for Computational Linguistics.
Sérasset, G. (2014). DBnary: Wiktionary as a Lemon-based multilingual lexical resource in RDF.
Semantic Web Journal-Special issue on Multilingual Linked Open Data. Accessed
October 24, 2014 at http://www.semantic-web-journal.net/system/files/swj648.pdf
Simons, G., Bird, S., & Spanne, J. (2008). Best practice recommendations for language resource
description. Accessed October 21, 2014 at http://www.language-
archives.org/REC/bpr.html
Tagarelli, A., Longo, M., & Greco, S. (2009). Word sense disambiguation for XML structure
feature generation. In The Semantic Web: Research and Applications (pp. 143-157).
Berlin: Springer.
van Assem, M., Gangemi, A. & Schreiber, G. (2006). RDF/OWL Representation of WordNet.
W3C Working Draft, World Wide Web Consortium. http://www.w3.org/TR/wordnet-rdf/
Top Related