Interpreting noun compounds using paraphrases

András Dobó, University of Oxford

Stephen G. Pulman, University of Oxford

1. Motivation

2. Related work

3. Method

4. Results

5. Summary

6. Future work

Motivation

English is full of noun compounds, which are sequences of nouns acting as a single noun

Their interpretation is crucial for many NLP tasks

Using dictionaries is unfeasible

Automated methods

Related work

Statistical approaches: Web queries or large corpora

Two main categories of methods:

Inventory-based approaches: small number of abstract relational categories; criticized for numerous reasons

Paraphrasing approaches: verbs and prepositions as paraphrases; e.g. water bottle = "bottle that is for water", paraphrase: be for

Method

Paraphrasing method

Produces a ranked list of paraphrases for each NC

Uses large corpora to search for paraphrases

The second noun is the head: subject = second noun, object = first noun

Validates paraphrases using web queries

Two main approaches in the search for paraphrases

Subject-paraphrase-object-triples

Counts the frequency of all (subject, paraphrase, object) triples in the corpus

Then for each NC it searches for those triples, where subject = second noun, object = first noun

List of suitable paraphrases for each NC

Ranks paraphrases for each NC using a scoring method based on their frequency
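The triple-based search can be illustrated with a minimal sketch. It assumes the (subject, paraphrase, object) triples have already been extracted from a corpus; the function names and the toy data are hypothetical and are not taken from the authors' system.

```python
from collections import Counter


def count_triples(triples):
    """Count how often each (subject, paraphrase, object) triple occurs."""
    return Counter(triples)


def rank_paraphrases(triple_counts, noun_compound):
    """Return (paraphrase, frequency) pairs for a two-word NC, most frequent first.

    The second noun (the head) is matched against the subject and the
    first noun (the modifier) against the object.
    """
    modifier, head = noun_compound.split()
    scores = Counter()
    for (subj, paraphrase, obj), freq in triple_counts.items():
        if subj == head and obj == modifier:
            scores[paraphrase] += freq
    return scores.most_common()


# Toy data standing in for triples extracted from a corpus (assumption).
toy_triples = [
    ("bottle", "be for", "water"),
    ("bottle", "be for", "water"),
    ("bottle", "contain", "water"),
    ("museum", "be devoted to", "arts"),
]
counts = count_triples(toy_triples)
ranking = rank_paraphrases(counts, "water bottle")
```

Here the score of a paraphrase is simply its triple frequency, which matches the scoring described for this version.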

Subject-paraphrase-and-paraphrase-object-pairs

Counts the frequency of all (subject, paraphrase) and (paraphrase, object) pairs in the corpus

Then for each NC it searches for those pairs, where subject = second noun, object = first noun

Two lists of paraphrases for each NC

Ranks paraphrases for each NC using a scoring method based on their frequency

Scoring methods

Subject-paraphrase-object-triples version: simply the frequency of the relevant (subject, paraphrase, object) triple

Subject-paraphrase-and-paraphrase-object-pairs version: using frequencies is not suitable; the score is the product of the pointwise mutual information of the relevant (subject, paraphrase) and (paraphrase, object) pairs
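A minimal sketch of the pairs-version score follows. It assumes PMI(x, y) = log2(p(x, y) / (p(x) p(y))), with probabilities estimated as relative frequencies within each pair table; the log base, the estimation scheme, the function names, and the toy counts are all assumptions, not details given on the slides.

```python
import math
from collections import Counter


def pmi(pair_counts, left, right):
    """Pointwise mutual information of (left, right); None if the pair is unseen."""
    total = sum(pair_counts.values())
    joint = pair_counts.get((left, right), 0)
    if joint == 0:
        return None
    left_total = sum(c for (l, _), c in pair_counts.items() if l == left)
    right_total = sum(c for (_, r), c in pair_counts.items() if r == right)
    # p(x, y) / (p(x) * p(y)) simplifies to joint * total / (left_total * right_total)
    return math.log2(joint * total / (left_total * right_total))


def pmi_product_score(subj_para, para_obj, head, paraphrase, modifier):
    """Product of PMI(head, paraphrase) and PMI(paraphrase, modifier)."""
    a = pmi(subj_para, head, paraphrase)
    b = pmi(para_obj, paraphrase, modifier)
    if a is None or b is None:
        return None
    return a * b


# Toy pair counts (assumption), one table per pair type.
subject_paraphrase = Counter({
    ("bottle", "be for"): 4, ("bottle", "break"): 1,
    ("museum", "be for"): 1, ("museum", "open"): 2,
})
paraphrase_object = Counter({
    ("be for", "water"): 3, ("be for", "art"): 1,
    ("open", "door"): 2, ("break", "glass"): 2,
})
score = pmi_product_score(subject_paraphrase, paraphrase_object,
                          "bottle", "be for", "water")
```

One caveat of a plain product is that two negative PMI values multiply to a positive score; the slides do not say how such cases are handled, so this sketch leaves them as they are.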

Used corpora and their preprocessing

Search for paraphrases:

British National Corpus: 100 million words; grammatical relations from parser

Web 1T 5-gram Corpus: generated from 1 trillion words of web page text; grammatical relations from POS patterns, e.g. noun verb determiner noun
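The POS-pattern idea can be sketched as follows. The sketch assumes the n-grams have already been POS-tagged with Penn Treebank style tags and come with their corpus frequency; the tagging step, the input layout, and the function name are hypothetical, and no lemmatization is attempted.

```python
from collections import Counter


def is_noun(tag):
    return tag.startswith("NN")


def is_verb(tag):
    return tag.startswith("VB")


def triples_from_tagged_ngrams(tagged_ngrams):
    """Collect (subject, verb, object) counts from 4-grams matching
    the pattern noun verb determiner noun.

    tagged_ngrams: iterable of (tokens, frequency), where tokens is a
    sequence of (word, tag) pairs.
    """
    counts = Counter()
    for tokens, freq in tagged_ngrams:
        if len(tokens) != 4:
            continue
        (w1, t1), (w2, t2), (_, t3), (w4, t4) = tokens
        if is_noun(t1) and is_verb(t2) and t3 == "DT" and is_noun(t4):
            counts[(w1, w2, w4)] += freq
    return counts


# Toy tagged 4-grams with made-up frequencies (assumption).
toy_ngrams = [
    ((("bottle", "NN"), ("contains", "VBZ"), ("the", "DT"), ("water", "NN")), 120),
    ((("the", "DT"), ("bottle", "NN"), ("is", "VBZ"), ("empty", "JJ")), 300),
]
extracted = triples_from_tagged_ngrams(toy_ngrams)
```

A surface pattern like this is cheap but error-prone compared with parsing, which is consistent with the slides listing parsing of the corpus as future work.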

Validation of paraphrases: The Web through Google and Yahoo!

Passive paraphrases

Their surface subject is actually their object

(subject, paraphrase) = (paraphrase2, object)

paraphrase: passive, without preposition

paraphrase2: active version of paraphrase

subject = object

Their frequencies are counted together

Passive paraphrases

(subject, paraphrase, object) = (subject2, paraphrase2, object2)

paraphrase: passive, with the preposition "by"

paraphrase2: active version of paraphrase, without preposition

object2 = subject

subject2 = object

Their frequencies are counted together

Such (paraphrase, object) and (subject2, paraphrase2) pairs are treated the same way
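As a rough illustration of counting passive and active forms together, the sketch below maps "by" passives onto their active counterpart before summing frequencies. The lookup table from passive paraphrase to active verb is a hypothetical stand-in for whatever morphological processing the actual system uses, and treating the active form as the canonical one is likewise an assumption.

```python
from collections import Counter

# Hypothetical lookup: passive paraphrase with "by" -> active paraphrase.
PASSIVE_BY_TO_ACTIVE = {
    "be produced by": "produce",
    "be caused by": "cause",
}


def canonical(triple):
    """Rewrite a 'by' passive triple as its active counterpart.

    (subject, passive, object) becomes (object, active, subject);
    every other triple is returned unchanged.
    """
    subject, paraphrase, obj = triple
    active = PASSIVE_BY_TO_ACTIVE.get(paraphrase)
    if active is None:
        return triple
    return (obj, active, subject)


def merged_counts(observations):
    """Sum frequencies after mapping each triple to its canonical form."""
    counts = Counter()
    for triple, freq in observations:
        counts[canonical(triple)] += freq
    return counts


# Toy observations with made-up frequencies (assumption).
observations = [
    (("honey", "be produced by", "bee"), 3),
    (("bee", "produce", "honey"), 5),
]
merged = merged_counts(observations)
```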

Patientive ambitransitive verbs

Three main groups of verbs: strictly transitive, strictly intransitive, ambitransitive

Strictly intransitive verbs have two subclasses: unergative and unaccusative

Ambitransitive verbs have two subclasses too: agentive and patientive

Patientive ambitransitive verbs in intransitive use behave in the same way as passive verbs, so they are treated the same way

Using synonyms, hypernyms, sister words etc.

No paraphrases are found for several NCs

Hypothesis: NCs comprising semantically similar words are interpreted the same way

Using semantically similar words in the search for paraphrases:

Synonyms, hypernyms, sister words from WordNet

Semantically similar words that are automatically found with a method proposed by Dekang Lin

Validation of paraphrases

Some paraphrases are incorrect

Validation is needed

Hypothesis: If a paraphrase is suitable for an NC, then there should exist at least some web pages containing the NC paraphrased by that paraphrase

Validation of paraphrases

Google and Yahoo! queries

Simple queries: “n2Infl THAT p n1Infl”

Extended queries: multiple verb tenses; wildcard characters (up to 9)

Score for each paraphrase is recalculated
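The query template can be sketched as a small string builder. The slides do not define the template's parts, so this sketch assumes that n2Infl and n1Infl stand for inflected forms of the second and first noun, that THAT stands for a complementizer such as "that", and that wildcards go between the paraphrase and the first noun. The function name and parameters are hypothetical, and no search engine API is called.

```python
from itertools import product


def build_queries(head_forms, paraphrase_forms, modifier_forms,
                  complementizers=("that",), max_wildcards=0):
    """Build exact-phrase queries of the shape: head THAT paraphrase [*...] modifier."""
    queries = []
    for head, comp, para, mod in product(
            head_forms, complementizers, paraphrase_forms, modifier_forms):
        for n in range(max_wildcards + 1):
            words = [head, comp, para] + ["*"] * n + [mod]
            queries.append('"' + " ".join(words) + '"')
    return queries


# Inflected forms are supplied by hand here (assumption); a real system
# would generate them with a morphological tool.
queries = build_queries(
    head_forms=["bottle", "bottles"],
    paraphrase_forms=["is for", "are for"],
    modifier_forms=["water"],
    max_wildcards=1,
)
```

The sketch does not enforce subject-verb agreement, so it also yields strings such as "bottle that are for water"; these would simply return few or no hits.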

Testing and evaluation

Tested on the first 50 NCs of the SemEval-2 Task #9

3 best paraphrases for each NC

5 native speakers recruited for evaluation

They score each paraphrase from 1 to 5

Their agreement was checked using Krippendorff’s alpha, and it was too low

The (noun compound, paraphrase) pairs with highest disagreement were omitted
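For reference, Krippendorff's alpha is defined as 1 - D_o / D_e, the ratio of observed to expected disagreement. The sketch below computes it with the interval distance (a - b)^2; the slides do not say which distance metric was used for the 1 to 5 scores, so the interval metric is an assumption, as are the function name and the toy ratings.

```python
def krippendorff_alpha_interval(units):
    """Krippendorff's alpha with the interval distance (a - b) ** 2.

    units: list of lists; each inner list holds the ratings given to one
    item. Items with fewer than two ratings are ignored.
    """
    units = [u for u in units if len(u) >= 2]
    values = [v for u in units for v in u]
    n = len(values)
    if n < 2:
        raise ValueError("need at least one item with two or more ratings")

    # Observed disagreement: ordered pairs within each item.
    observed = sum(
        sum((a - b) ** 2
            for i, a in enumerate(u) for j, b in enumerate(u) if i != j)
        / (len(u) - 1)
        for u in units
    ) / n

    # Expected disagreement: ordered pairs over all pooled ratings.
    expected = sum(
        (a - b) ** 2
        for i, a in enumerate(values) for j, b in enumerate(values) if i != j
    ) / (n * (n - 1))

    if expected == 0:
        raise ValueError("alpha is undefined when all ratings are identical")
    return 1 - observed / expected


# Toy ratings (assumption): three items, each scored by two annotators.
alpha_perfect = krippendorff_alpha_interval([[1, 1], [5, 5], [3, 3]])
alpha_opposed = krippendorff_alpha_interval([[1, 2], [2, 1]])
```

Values near 1 indicate strong agreement, values near 0 indicate agreement no better than chance, and negative values indicate systematic disagreement.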

Best version

Subject-paraphrase-object-triples version

Web 1T 5-gram Corpus

Combination of two basic versions: no substitute words; sister words

Scores are recalculated in a way that favors paraphrases returned by the first version

Validation: Google, present simple, up to 1 wildcard

Results

Mixed performance

Noun compound     1st rank    2nd rank         3rd rank
arts museum       be of       be devoted to    be for
bird droppings    be in       be for           be

Average scores

Rank of paraphrase    Average score
1st rank              3.1842
2nd rank              2.7687
3rd rank              2.5583

Promising results given the difficulty of the task

Results

Best scoring NCs

Noun compound            Avg. score
broadway youngster       4.7500
cell membrane            4.6000
cattle population        4.4000
arts museum              4.3333
business sector          4.2000
arts colleges            4.0000
backwoods protagonist    3.8750
antibiotic regimen       3.8667
census population        3.8667
business applications    3.7000

Worst scoring NCs

Noun compound              Avg. score
championship bout          2.0000
buddhist philosophy        1.8000
cell block                 1.7500
banana industry            1.7333
ancestor spirits           1.6000
anode loss                 1.5000
bird droppings             1.2667
bow scrape                 1.2500
activity spectrum          1.0000
altitude reconnaissance    1.0000

Future work

Parsing the Web 1T 5-gram Corpus

Much lower error rate in obtaining the grammatical relations

Extended validation part

Employing synonyms, hypernyms, sister words or semantically similar words

Combining the different extensions

Summary

Interpreting noun compounds is crucial for many NLP tasks

We presented a method for noun compound interpretation that searches for paraphrases in large corpora and issues web queries to validate the results

The results are promising, and could be further improved

Acknowledgements

Attendance at this workshop was partly supported by the Hungarian National Office for Research and Technology within the framework of the R&D project MASZEKER (Modell-Alapú Szemantikus Kereső Rendszer – Model Based Semantic Search System).

Thank you!