The Annotated Holmes

67
Detecting Meaning with Sherlock Holmes The Annotated Holmes How can we read better? Francis Bond Division of Linguistics and Multilingual Studies http://www3.ntu.edu.sg/home/fcbond/ [email protected] Location: LHS-LT Creative Commons Attribution License: you are free to share and adapt as long as you give appropriate credit and add no additional restrictions: https://creativecommons.org/licenses/by/4.0/. HG8011 (2021)

Transcript of The Annotated Holmes

Detecting Meaning with Sherlock Holmes∗

The Annotated HolmesHow can we read better?

Francis BondDivision of Linguistics and Multilingual Studies

http://www3.ntu.edu.sg/home/fcbond/[email protected]

Location: LHS-LT

∗Creative Commons Attribution License: you are free to share and adapt as long as you give appropriatecredit and add no additional restrictions: https://creativecommons.org/licenses/by/4.0/.

HG8011 (2021)

Outline

ã An online edition of the Canon

ã Teaching through Tagging— Interactive Lexical Semantics

ã Sense Distributions in NTU-MC

ã Word Sense Disambiguation

ã Sentiment

ã Where do we go from here?

The Annotated Holmes 1

An online edition of the Canon

The Annotated Holmes 2

The Adventure of the Readable Texts

For my students (for this class), I wanted texts

ã faithful to the original

ã easy to read on various devices

ã aesthetically pleasing

ã linked to the rich world of information of the Great Game

Sadly, no one site has given us what we want, so like many before us, weended up producing our own.

In the 2017 Beaten Annual, work done with Arthur Bond 3

Data! Data! Data!

First, I took a look at what was already out there:

Edition NameGute The Project Gutenberg HTMLACD Arthur Conan Doyle EncylopediaBSW Baker Street WikiSS Short StoriesSHC The Complete Sherlock Holmes CanonLit2Go Lit2GoCamden Camden House: The Complete Sherlock HolmesMoonFind MoonFind: Searching for Sherlock

“Data! data! data!”he cried impatiently.“I can’t make bricks without clay.”The Adventure of the Copper Beeches (ACD 1892) 4

Selection Criteria

ã Useful Metadata: When and where published, Author, Copyright, …

ã Annotation: links to TV and film versions, chronologies, definitions of wordsand Sherlockian scholarship

ã Spoilers! Does the annotation reveal the villain?

ã Font: Is it a nice serif font (like in the Strand Magazine)

ã Resizability: can you read it on your phone

ã Pictures: does it have the illustrations nicely embedded?

ã Miscellaneous: does it have search?

‘It has long been an axiom of mine that the little things are infinitely the most important.’ A Case of Identity 5

Results

Table 1: Rating Online Texts of the CanonEdition Meta Annotation Spoilers Font Resize Pics MiscACD A A F C F C SearchGute C F - A A F BookBSW A A F B F F AdsSS C F - C A F AdsSHC C F - A B F ToCLit2Go A F - C F F AudioCamden A F - C F A ToCMoonFind F F - C F F Search

Every site had some good points, but no one is perfect.

The Annotated Holmes 6

Our Approach

https://github.com/fcbond/sh-canon/

ã Resizable, nice font with tufte-css

ã Full table of contents, linked from each story

ã Metadata in the margin: the date and place of publication, story number,which collection, date of action?

ã Illustrations in the text ToDo

ã Links to annotation at the end

â Wikipediaâ Arthur Conan Doyle Encylopediaâ Bill Dolan’s Study Guide

‘There is nothing new under the sun. It has all been done before.’ A Study in Scarlet 7

Annotation (new + ToDo)

ã Locations link to google maps

â What should we do for made-up places?â Should link locations to geonames

ToDoã Currency links to amount in today’s currency

done for SPEC

ã Person + Organization should link to entry in ACD Encylopedia

ã Texts are searchable with a site-specific google searchmetadata is hidden from indexing by the <!--googleoff: all--> com-mand.

The Annotated Holmes 8

Annotation ToDo

ã The sense level annotations of the NTU Multilingual Corpus, which linkseach open class word to its definition in Wordnet

â We can also show translations and/or hypernyms

ã Content level annotation like the Annotated Sherlock Holmes

ã We have syntactic annotation (treebanks) for The Adventure of the Speck-led Band which it would be good to display for those who are interested

ã More from the Beacon Society’s award winners, including information aboutword meaning, rhetorical devices, …

ã Links to facsimiles of the original stories

The Annotated Holmes 9

Translations and Glosses

ã The Canon has been widely translated: we would like to also prepare edi-tions of foreign languages, and consider the issue of linking the translations.E.g., InLéctor, who have produced a nice bilingual English-Spanish editionof The Adventures of Sherlock Holmes.

ã With the multilingual wordnets, we can present glosses of the correct sense,in any of 34 languages (although not all will be complete)

ã We can investigate how to choose the words to highlight

â Low frequency (rare) wordâ Lower frequency sense of a word (uncommon meaning)â No direct translation — give definitionâ Word selected by e.g., Bill Dolan, my students, …

it is quite hard to do this nicely, ...

The Annotated Holmes 10

Some Technical Issues

ã There is a lot of duplication in online resources (and I have only made itworse)

ã How can we share information?

â Put the code and meta-data on githubSPEC PUBLISHED The Strand Magazine in February 1892SPEC ORDER 10SPEC COLLECTION The Adventures of Sherlock HolmesSPEC DORN http://www.beaconsociety.com/uploads/3/7/3/8/37380505/dorn_spec_study_guide.pdfSPEC TITLE The Adventure of the Speckled BandSPEC IMAGE She lifted her veil 104.jpgare there standard ways to refer to lines?

ã Is everyone happy to share information?

â We have to make sure not to use data without permission

The Annotated Holmes 11

Conclusion

ã We now have some stories with every (open-class) word linked to an entryin a dictionary, and its grammatical categoryWhat else can we do with this?

ã For SPEC we also have this in Chinese, Japanese, Indonesian, Italian (andare working on Abui, Bulgarian, Dutch, Spanish and Polish). And these arelinked to the EnglishWhat we can do with this?

ã If you have any ideas (or want the data, or want to help me make more)please let me know.

The Annotated Holmes 12

The Canon for Learners

There are some nice versions annotated for non-English speakers.

The Annotated Holmes 13

The Canon for Learners: problems

ã BUT, the glosses are not very good. The correct sense is not highlighted(or sometimes even listed)

â cane — whipv; reedn, canen

â cast — throwv, make formsv, send outv, cast ouvt; metal castn, formn

should be cast one’s eyes “take a quick look”â crane — mechanical crane

should be Crane Water “place name”â masonry — “The occupation or work of a mason”

should be stonework

ã It is not clear how the words are selectedWhy not dog-cart, homely, spring up, stump?

ã With our sense tags we can do better than this!

The Annotated Holmes 14

Teaching through Tagging

The Annotated Holmes 15

Teaching Meaning to my Students

ã Read a Sherlock Holmes storyâ So far SPEC, DANC, REDH and SCAN

(also news, essays and Japanese short stories)

ã Look at every content wordâ Find its meaning in a dictionary (wordnet)â Or write a new definition if neededâ Say if it has positive or negative connotation (for some)this shows how little you need to know to understand and enjoy

ã Rewrite it using only the most common 1,000 words of Englishâ Like XKCD’s Simple Writer (used in Thing Explainer)â We only did this for REDHthis also shows some interesting misunderstandings

ã Look for Idioms and Metaphors

some of the things I make/let my students do 16

Sense Distributions inNTU-MC

The Annotated Holmes 17

Cross-lingual comparison

We consider a single Sherlock Holmes story The Adventure of the Speck-led Band(Doyle, 1892) in the original English and translations in MandarinChinese, Indonesian and Japanese (NTU Multilingual Corpus: Tan and Bond,2012). The senses are tagged with the Princeton Wordnet of English (Fell-baum, 1998), the Chinese Open Wordnet (see Wang and Bond, 2013), theWordnet Bahasa (Bond et al., 2014) and the Japanese wordnet(Isahara et al.,2008) enhanced with pronouns, exclamatives and classifiers (COW: Seahand Bond, 2014; Morgado da Costa and Bond, 2016). On the basis of word-net alignment within the Open Multilingual WordNet (Bond and Foster, 2013),we are able to compare the distribution of senses across languages.

Language Sentences Words Concepts MWC SWCEnglish 599 11741 6425 285 6140Indonesian 709 10345 6140 279 5861Japanese 702 13936 4925 174 4751Mandarin 619 12681 8263 316 7947

Table 2: Corpus size per language

The Annotated Holmes 18

Ambiguity VariationLanguage Lexical Corpus Maximum MaximumEnglish 1.33 1.26 10 4Indonesian 2.88 1.15 11 7Japanese 1.68 1.05 9 7Mandarin 1.30 1.01 9 10

Table 3: Ambiguity per language

ã Lexical Ambiguityâ Indonesian is high as we record both suffixed and root forms: igal,

mengigal “dance” (will treat as variants in OMW 2.0)â Japanese is high due to character variation and multiple scripts:檜,桧,ひのき,ヒノキ hinoki “Japanese cedar” (variants)

ã Corpus ambiguityâ Much closer; Japanese and Chinese less ambiguous⇒ � due to the Chinese characters �

The Annotated Holmes 19

One Sense Per Discourse

ã Gale hypothesized that a single word would tend to have a single sensewithin a given discourse.

ã This was definitely not the case here, in any language.

ã We get from 9–11 different meanings, with (light) verbs and adverbs beingmost ambiguous — more true for nouns

â en: see, be, so, make, haveâ ja: ある,分かる,持つ,考える,ものâ zh: 好,可能,是,发现,有â id: ada, tinggal, melihat, akhir, jadi

The Annotated Holmes 20

Variation for a concept

ã Variation (how many ways can you express the same concept) was gener-ally lower (4-7)

ã Except for Chinese, where we had 10 ways of saying “however”: 还是 (1),还 (2),却 (11),但 (11),但是 (29),然而 (2),仍然 (1),可是 (26),尽管 (1),不过 (4)

â this is probably an artifact of the fact we have checked adverbs less, …â may also be due to inconsistent tokenization

ã We suspect this may change a little for different genres (future work)

The Annotated Holmes 21

Different languages are different

(1) I am sure that I shall say nothing of the kind.a. いやいや

iyaiyaby+no+means

、,,

そんなsonnathat+kind+of

ことkotothing

はwaTOP

言わ-んiwa-nsay-NEG

よyoyo

“no no, I will not say that kind of thing”

ã the components are lexicalized very differently

ã iyaiya ↔ I am sure that I shall ???

ã Decomposing pronouns gives us a lot of this, but the equivalence is far fromdirect: Can DRS help us here?

The Annotated Holmes 22

Pronomilization

(2) Shei shot himj and then herselfia. 奥-さん

oku-sanwife-HON

がgaNOM

旦那-さんdanna-sanhusband-HON

をwoACC

撃ってutteshoot-CONJ

、,,

それからsorekaraand+then

自分jibunself

もmotoo

撃ったuttashoo-PST

Wifei shot husbandj and then shot selfi too

ã Why are Japanese and Chinese different here? We don’t yet know, …

The Annotated Holmes 23

Pronomilization

(3) Shei shot himj and then herselfia. 她

tā3SG

拿nátake

枪qiānggun

先xiānfirst

打dǎshoot

丈夫zhàngfūhusband

,,,

然后ránhòuand+then

打dǎshoot

自己zìjǐself

Shei took the gun to first shoot husbandj, and then shot selfi

The Annotated Holmes 24

AutomaticWord Sense Disambiguation

The Annotated Holmes 25

Word Sense Disambiguation Overview

ã Many words have several meanings (homonymy/polysemy)

ã Determine which sense of a word is used in a specific text

ã Often, the different senses of a word are closely related

â title1 - right of legal ownershipâ title2 - document that is evidence of the legal ownership,

ã sometimes, several senses can be activated in a single context

â …This could bring competition to the tradeâ competition1 - the act of competingâ competition2 - the people who are competing

The Annotated Holmes 26

What are Word Senses?

ã The meaning of a word in a given context

ã Word sense representations

â With respect to a dictionary (WordNet)∗ chair = a seat for one person, with a support for the back;

”he put his coat over the back of the chair and sat down”∗ chair = the officer who presides at the meetings of an organization;

”address your remarks to the chairperson”â With respect to the translation in a second language

∗ chair = chaise∗ chair = directeur

â With respect to the context where it occurs (discrimination)∗ “Sit on a chair” “Take a seat on this chair”∗ “The chair of the Math Department” “The chair of the meeting”

The Annotated Holmes 27

Approaches to Word Sense Disambiguation

ã Knowledge-Based Disambiguation

â Use of external lexical resources such as dictionaries and ontologiesâ Discourse properties

ã Supervised Disambiguation

â based on a labeled training setâ basically a sequence labeling task with a lot of labels

ã Unsupervised Disambiguation

â based on unlabeled corporaâ learn sense distinctions then disambiguate!

The Annotated Holmes 28

All Words Word Sense Disambiguation

ã Attempt to disambiguate all open-class words in a textHe put his suit over the back of the chair

ã Knowledge-based approaches

â Use information from dictionariesâ Definitions / Examples for each meaningâ Find similarity between definitions and current context

ã Position in a semantic network

â Find that table is closer to chair “furniture” than to chair “person”

ã Use discourse properties

â A word exhibits the same sense in a discourse / in a collocation

The Annotated Holmes 29

WSD with Machine Readable Dictionaries (MRD)

ã MRD-based WSD shown to provide very high unsupervised baseline (e.g.Lesk algorithm in Senseval tasks)

ã Suitable for all words WSD tasks (no data bottleneck)

ã MRDs have (relatively) high availability compared to sensebanked data

ã MRD-based WSD is easily adaptable to new MRDs, languages

The Annotated Holmes 30

What does an MRD give us?

ã For each word in the language vocabulary, an MRD provides:

â A list of meaningsâ Definitions (for all word meanings)â Typical usage examples (for most word meanings)

ã A thesaurus adds:

â An explicit synonymy relation between word meanings

ã A semantic network/ontology adds:

â Hypernymy/hyponymy (IS-A), meronymy/holonymy (PART-OF), antonymy,entailnment, etc.

The Annotated Holmes 31

Definitions and Examples

WordNet definitions/examples for the noun plant

1. buildings for carrying on industrial labor; “they built a large plantto manufacture automobiles”

2. a living organism lacking the power of locomotion

3. something planted secretly for discovery by another; “the policeused a plant to trick the thieves; he claimed that the evidenceagainst him was a plant”

4. an actor situated in the audience whose acting is rehearsed butseems spontaneous to the audience

The Annotated Holmes 32

Synonyms and other Relations

WordNet synsets for the noun plant

1. plant, works, industrial plant

2. plant, flora, plant life

WordNet semantic relations for the sense plant life

ã hypernym: organism, being

ã hyponym: house plant, fungus, …

ã meronym: plant tissue, plant part

ã holonym: Plantae, kingdom Plantae, plant kingdom

The Annotated Holmes 33

Lesk Algorithm

Identify senses of words in context using definition overlap (Michael Lesk1986)

1. Retrieve from MRD all sense definitions of the words to be disambiguated

2. Determine the definition overlap for all possible sense combinations

ã number of words overlapping in both definitionsã context can be a window larger than a sentence

3. Choose senses that lead to highest overlap

The Annotated Holmes 34

Example: disambiguate pine cone

ã pine

1. kinds of evergreen tree with needle-shaped leaves2. waste away through sorrow or illness

ã cone

1. solid body which narrows to a point2. something of this shape whether solid or hollow3. fruit of certain evergreen trees

pine1∩ cone1 = 0 pine2∩ cone1 = 0pine1∩ cone2 = 0 pine2∩ cone2 = 0pine1∩ cone3 = 2 pine2∩ cone3 = 0evergreen tree

The Annotated Holmes 35

LESK for many words

ã I saw a man who is 98 years old and can still walk and tell jokes

ã Nine open class words: see(26), man(11), year(4), old(8), can(5), still(4),walk(10), tell(8), joke(3)

ã 43,929,600 sense combinationsif we compare every definition against every definition

ã How to find the optimal sense combination?

â Find an approximate solution (e.g., simulated annealing)â Use a simpler algorithm

The Annotated Holmes 36

Simplified Lesk

ã Original Lesk: measure overlap between sense definitions for all words incontext

â Identify simultaneously the correct senses for all words in contextâ Compare the definitions of words to the definitions of words

ã Simplified Lesk: measure overlap between sense definitions of a word andcurrent context

â Identify the correct sense for one word at a timeâ Search space significantly reduced

The Annotated Holmes 37

Simplified Lesk Algorithm

1. Retrieve from MRD all sense definitions of the words to be disambiguated

2. Determine the overlap between each sense definition and the current con-text

3. Choose senses that lead to highest overlap

Disambiguate: Pine cones hanging in a tree

ã PINE

1. kinds of evergreen tree with needle-shaped leaves2. waste away through sorrow or illness

pine1∩ Sentence = 1 pine2∩ Sentence = 0

The Annotated Holmes 38

Extended Lesk Algorithm (Banerjee and Pedersen, 2003)

1. Retrieve from MRD all sense definitions of the words to be disambiguated

ã Add definitions of hypernyms, hyponymsã Add definitions of the words in the definitions

2. Determine the overlap between each extended sense definition and theextended sense of each word in the context

3. Choose senses that lead to highest overlap

ã kinds of evergreen tree with needle-shaped leaves

evergreen bearing foliage throughout the yeartree1 a tall perennial woody plant having a main trunk and branches form-

ing an elevated crown; includes gymnosperms and angiospermstree2 tree diagram, a figure that branches from a single root; ”genealogical

tree”

The Annotated Holmes 39

Extended Simplified Lesk (Baldwin et al. 2009)

1. Retrieve from MRD all sense definitions of the words to be disambiguated

ã Add definitions and synonyms of hypernyms, hyponymsã Add definitions of the disambiguated words in the definitions

2. Determine the overlap between each extended sense definition and theeach word in the context

3. Choose senses that lead to highest overlap

ã kinds of evergreen1 tree1 with needle-shaped leaves

evergreen bearing foliage throughout the yeartree1 a tall perennial woody plant having a main trunk and branches form-

ing an elevated crown; includes gymnosperms and angiosperms

The Annotated Holmes 40

Position in a Semantic Network

ã Try to find how closely related different senses are

ã …by measuring how close they are in a network

ã The simplest measure is just the shortest path

â measuring all combinations is exponentialâ normally filter by part of speech

ã Better measures weight the paths

â Small differences get low weights

The Annotated Holmes 41

Corpus based Methods

ã If you have a sense tagged corpus (very rare)

â Most Frequent Sense (MFS) does very well∗ count the occurrences of each sense∗ pick the one that occurs most often

ã You can improve on this with a sequence tagger, using n words of context

â the three words on either side help (like with POS)â a window of 10–50 words helps!

The Annotated Holmes 42

Corpus based Learning for WSD

ã Collect a set of examples that illustrate the various possible classificationsor outcomes of an event.

ã Identify patterns in the examples associated with each particular class ofthe event.

ã Generalize those patterns into rules.

ã Apply the rules to classify a new event.

The Annotated Holmes 43

Supervised WSD

ã Learn a classifier from manually sense-tagged text using machine learning

ã Resources

â Sense Tagged Textâ Dictionary (implicit source of sense inventory)â Syntactic Analysis (POS tagger, Chunker, Parser, …)

ã Scope

â Typically one target word per contextâ Part of speech of target word resolvedâ Lends itself to some-words

ã Reduces WSD to a classification problem where a target word is assignedthe most appropriate sense from a given set of possibilities based on thecontext in which it occurs

The Annotated Holmes 44

Tagged Corpus

ã Bonnie and Clyde are two really famous criminals, I think they were bank/1robbers

ã My bank/1 charges too much for an overdraft.

ã I went to the bank/1 to deposit my check and get a new ATM card.

ã The University of Minnesota has an East and a West Bank/2 campus righton the Mississippi River.

ã My grandfather planted his pole in the bank/2 and got a great big catfish!

ã The bank/2 is pretty muddy, I can’t walk there.

The Annotated Holmes 45

Bag-of-words context

bank/1 a an and are ATM Bonnie card charges check Clyde criminals depositfamous for get I much My new overdraft really robbers the they think to tootwo went were

bank/2 a an and big campus cant catfish East got grandfather great has his Iin is Minnesota Mississippi muddy My of on planted pole pretty right RiverThe the there University walk West

The Annotated Holmes 46

Simple Supervised Approach

ã For each word wi in S

â If wi is in bag-of-words(bank/1) then∗ Sense/1 = Sense/1 + 1;

â If wi is in bag-of-words(bank/2) then∗ Sense/2 = Sense/2 + 1;

ã If Sense/1 > Sense/2 then bank/1

ã else if Sense/2 > Sense/1 then bank/2

ã else most frequent sense (bank/2)

The Annotated Holmes 47

Let’s try it

bank/1 a an and are ATM Bonnie card charges check Clyde criminals depositfamous for get I much My new overdraft really robbers the they think to tootwo went were

bank/2 a an and big campus cant catfish East got grandfather great has his Iin is Minnesota Mississippi muddy My of on planted pole pretty right RiverThe the there University walk West

? I’m going to lay down my heavy load, down by the river bank.

? As a leading consumer bank in Singapore, DBS has an extensive branchand ATM network,

? My bank’s Singapore headquarters is by the river at boat quay.

The Annotated Holmes 48

Commonly used features

ã Identify collocational features from sense tagged data.

ã Word immediately to the left or right of target: (unigram)

â I have my bank/1 statement.â The river bank/2 is muddy.

ã Pair of words to immediate left or right of target: (bigram)

â The world’s richest bank/1 is here in New York.â The river bank/2 is muddy.

ã Words found within k positions around target, (k = 10− 50: bag of words)

â My credit is just horrible because my bank/1 has made several mistakeswith my account and the balance is very low.

The Annotated Holmes 49

Discourse based Methods

ã One sense per discourse

ã One sense per collocation

The Annotated Holmes 50

One Sense per Discourse

ã A word tends to preserve its meaning across all its occurrences in a dis-course (Gale, Church, Yarowksy 1992)

â 8 words with two-way ambiguity, e.g. plant, crane, …â 98% of the two-word occurrences in the same discourse carry the same

meaning

ã The grain of salt: Performance depends on granularity

â Performance of “one sense per discourse” over all words is ≈ 70%

The Annotated Holmes 51

One Sense per Collocation

ã A word tends to preserve its meaning when used in the same collocation(Yarowsky 1993)

â Strong for adjacent collocationsâ Weaker as the distance between words increases

ã For example, in a typical corpus

â industrial plant is always the plant/factoryâ plant life is always the plant/flora

ã 97% precision on words with two-way ambiguity

ã ≈ 70% on all words

The Annotated Holmes 52

Typical Performance

ã First Sense: 63% (baseline)

ã Extended Lesk: 68%

ã Supervised: 70-72% (most words)

ã Much harder task than POS tagging

â Improve by reducing granularity (cluster senses)â Improve by increasing training dataâ Improve with more features (adding in syntax)

The Annotated Holmes 53

How can we annotate data?

ã Get people to do it

â per word (e.g. look at all plant) annotation much faster then per sentence

ã Look at translations

â disambiguate with other languages

ã Learn collocations from unambiguous synonyms(pinecone, cone, strobilus, strobile)

ã Bootstrap

â Annotate some, assume one sense/discourse

The Annotated Holmes 54

WSD with Multiple Languages

ã For multilingual corpora

â crosslingual links narrow the interpretations

ã The result is a cheaply tagged corpus

委員長 として 党の 結束を 大切に したいchairperson, A 作为

I B 委员长,would like to 我

regard C 希望the unity of E 维护the party F 党内

as important. G 团结。

The Annotated Holmes 55

WSD with Multiple Wordnets (2)

ã English

â party1 “an organization to gain political power”â party2 “a group of people gathered together for pleasure”â party3 “a band of people associated temporarily in some activity”â party4 “an occasion on which people can assemble for social interaction”

ã Japanese

â 党 1 “an organization to gain political power”

The Annotated Holmes 56

Summary

ã There are many approaches to WSD

ã We haven’t solved it yet.

The Annotated Holmes 57

Sentiment

The Annotated Holmes 58

Words carry connotations

ã Words can directly show how we feel about something

â That is goodâ That is awful

ã Words can indirectly show how we feel about something

â That is cheapâ That is economicalâ That is old-fashionedâ That is classicâ That is vintage

This is part of our knowledge of language, so it should be in the lexicon.

The Annotated Holmes 59

Some more examplesPositive Neutral Negativeinterested questioning nosyemploy use exploitthrifty saving stingysteadfast tenacious stubbornsated filled crammedcourageous confident conceitedunique different peculiarmeticulous selective picky

Can you identify the negative words?

ã East End is a gritty neighborhood, but the rents are low.ã On my train to Sussex, I sat next to a real stunner.ã Every morning my neighbor takes his mutt to the moor. It always barks

loudly when leaving the building.ã You need to be pushy when you are looking for a job.ã Watson is bullheaded sometimes, but he always gets the job done.

The Annotated Holmes 60

How can we represent this?

ã One approach is a simple valence score (−100 – +100)

Score Example Example Example Corpus Examples95 fantastic very good perfect, splendidly64 good good soothing, pleasure34 ok sort of good not bad easy, interesting0 beige neutral puff

-34 poorly a bit bad rumour, cripple-64 bad bad not good hideous, death-95 awful very bad deadly, horror-stricken

ã You can also have two scores (positive and negative) or even three (posi-tive, negative and neutral)

ã People also consider the associated emotion: (Plutchik, 1980)joy vs sadness; anger vs fear; trust vs disgust; surprise vs anticipation.

The Annotated Holmes 61

High and Low Examples

We will just look at a simple valence. Here are some results from DANCand SPEC, annotated in three languages.

Concept freq score English score Chinese score Japanese Scorei40833 24 50 marriage 39 婚事 34 結婚 58

wedding 34i11080 5 40 rich 33 有钱 34 裕福 66i72643 4 33 smile 32 微笑 34 笑み 34i23529 40 −68 die −80 去世 −60 亡くなる −63

死亡 −64 死ぬ −62i36562 5 −83 murder −95 谋杀 −95 殺し −64

殺害 −63

Frequency is across all languages, score is average for lemma.

ã Senses in the same concept tend to have similar scores

â this is true within and across languages

The Annotated Holmes 62

Cross-Language Correlation

We looked at the agreement between annotators for the same conceptacross all three languages. The annotators were shown the scores organizedper word and per sense: where there was a large divergence (greater thanone standard deviation), they went back and checked their annotation. Afterthis harmonization we calculated Pearson’s ρ (Pearson, 1895).

Pair ρ # samplesChinese-English .73 6,843Chinese-Japanese .77 4,099English-Japanese .76 4,163

The Annotated Holmes 63

Bibliography

ã Cheng Xiaoqing (2007) Sherlock in Shanghai: Stories of Crime and Detec-tion Tr. Timothy C. Wong. Honolulu: University of Hawai’i Pres, ISBN978-0-8248-3099-1

ã Christian Metz (1974) Film Language: A Semiotics of the Cinema [Essaissur la signification au cinéma], Oxford University Press, 1974

ã Word Sense Disambiguation: Jurafsky and Martin (2009), Chapter 20.1–8figure borrowed from them

ã Some slides based on Rada Mihalcea and Ted Pedersen’s tutorial at AAAI-2005 “Advances in Word Sense Disambiguation”

ã Nice demo of similarities at:http://maraca.d.umn.edu/cgi-bin/similarity/similarity.cgi

The Annotated Holmes 64

*References

Francis Bond and Ryan Foster. 2013. Linking and extending an open multilingual wordnet.In 51st Annual Meeting of the Association for Computational Linguistics: ACL-2013, pages1352–1362. Sofia. URL http://aclweb.org/anthology/P13-1133.

Francis Bond, Lian Tze Lim, Enya Kong Tan, and Hammam Riza. 2014. The combined word-net Bahasa. Nusa: Linguistic studies of languages in and around Indonesia, 57:83–100.

Arthur Conan Doyle. 1892. The Adventures of Sherlock Homes. George Newnes, London.

Christine Fellbaum, editor. 1998. WordNet: An Electronic Lexical Database. MIT Press.

Hitoshi Isahara, Francis Bond, Kiyotaka Uchimoto, Masao Utiyama, and Kyoko Kanzaki. 2008.Development of the Japanese WordNet. In Sixth International conference on LanguageResources and Evaluation (LREC 2008). Marrakech.

The Annotated Holmes 65

Luís Morgado da Costa and Francis Bond. 2016. Wow! what a useful extension to wordnet!In 10th International Conference on Language Resources and Evaluation (LREC 2016).Portorož.

Karl Pearson. 1895. Notes on regression and inheritance in the case of two parents. Pro-ceedings of the Royal Society of London, 58:240–242.

Robert Plutchik. 1980. Emotion: Theory, research, and experience, volume Vol. 1. Theoriesof emotion. Academic, New York.

Yu Jie Seah and Francis Bond. 2014. Annotation of pronouns in a multilingual corpus of Man-darin Chinese, English and Japanese. In 10th Joint ACL - ISO Workshop on InteroperableSemantic Annotation. Reykjavik.

Liling Tan and Francis Bond. 2012. Building and annotating the linguistically diverse NTU-MC(NTU-multilingual corpus). International Journal of Asian Language Processing, 22(4):161–174.

Shan Wang and Francis Bond. 2013. Building the Chinese Open Wordnet (COW): Startingfrom core synsets. In Proceedings of the 11th Workshop on Asian Language Resources,a Workshop at IJCNLP-2013, pages 10–18. Nagoya.

The Annotated Holmes 66