Interpreting noun compounds using paraphrases
András Dobó, University of Oxford
Stephen G. Pulman, University of Oxford
1. Motivation
2. Related work
3. Method
4. Results
5. Summary
6. Future work
Motivation
English is full of noun compounds, which are sequences of nouns acting as a single noun
Their interpretation is crucial for many NLP tasks
Using dictionaries is infeasible, because new compounds can be formed freely
Automated methods are needed
Related work
Statistical approaches
  Web queries or large corpora
  Two main categories of methods
Inventory-based approaches
  Small number of abstract relational categories
  Criticized for numerous reasons
Paraphrasing approaches
  Verbs and prepositions as paraphrases
  Example: water bottle = "bottle that is for water", paraphrase: be for
Method
Paraphrasing method
  Ranked list of paraphrases for each NC
  Uses large corpora to search for paraphrases
    Second noun is the head: subject = second noun, object = first noun
  Validates paraphrases using web queries
Two main approaches in the search for paraphrases
Subject-paraphrase-object-triples
Counts the frequency of all (subject, paraphrase, object) triples in the corpus
Then for each NC it searches for those triples, where subject = second noun, object = first noun
List of suitable paraphrases for each NC
Ranks paraphrases for each NC using a scoring method based on their frequency
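The triple-based search can be illustrated with a minimal sketch. It assumes the triples have already been extracted from the corpus and lemmatized; the function names and the toy data are hypothetical and are not taken from the original system.

```python
from collections import Counter


def count_triples(triples):
    """Count how often each (subject, paraphrase, object) triple occurs."""
    return Counter(triples)


def rank_paraphrases_by_triples(triple_counts, noun_compound):
    """Return (paraphrase, frequency) pairs for a two-word NC, most frequent first."""
    first_noun, second_noun = noun_compound.split()
    scores = Counter()
    for (subj, para, obj), freq in triple_counts.items():
        # The head (second noun) is the subject; the modifier (first noun) is the object.
        if subj == second_noun and obj == first_noun:
            scores[para] += freq
    return scores.most_common()


# Toy data standing in for triples extracted from a corpus (assumption).
toy_triples = [
    ("bottle", "be for", "water"),
    ("bottle", "be for", "water"),
    ("bottle", "contain", "water"),
    ("museum", "be devoted to", "arts"),
]
counts = count_triples(toy_triples)
ranked = rank_paraphrases_by_triples(counts, "water bottle")
```

Here the score is simply the raw triple frequency, which matches the scoring described for this version; a compound with no matching triple yields an empty list.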
Subject-paraphrase-and-paraphrase-object-pairs
Counts the frequency of all (subject, paraphrase) and (paraphrase, object) pairs in the corpus
Then for each NC it searches for those pairs, where subject = second noun, object = first noun
Two lists of paraphrases for each NC
Ranks paraphrases for each NC using a scoring method based on their frequency
Scoring methods
Subject-paraphrase-object-triples version:
  Simply the frequency of the relevant (subject, paraphrase, object) triple
Subject-paraphrase-and-paraphrase-object-pairs version:
  Using frequencies is not suitable
  The product of the pointwise mutual information of the relevant (subject, paraphrase) and (paraphrase, object) pairs
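A rough sketch of the pairs-based score follows. It assumes the standard definition PMI(x, y) = log2(p(x, y) / (p(x) p(y))) with probabilities estimated as relative frequencies within each pair table; the slides do not state how the probabilities were estimated, and all names and counts below are hypothetical.

```python
import math
from collections import Counter


def pmi(pair_counts, left, right):
    """Pointwise mutual information of (left, right); None if the pair is unseen."""
    joint = pair_counts.get((left, right), 0)
    if joint == 0:
        return None
    total = sum(pair_counts.values())
    left_total = sum(c for (l, _), c in pair_counts.items() if l == left)
    right_total = sum(c for (_, r), c in pair_counts.items() if r == right)
    return math.log2(joint * total / (left_total * right_total))


def score_paraphrases_by_pairs(subj_para_counts, para_obj_counts, noun_compound):
    """Score each candidate by PMI(subject, paraphrase) * PMI(paraphrase, object)."""
    first_noun, second_noun = noun_compound.split()
    # Only paraphrases seen with both the head and the modifier are candidates.
    candidates = {p for (s, p) in subj_para_counts if s == second_noun}
    candidates &= {p for (p, o) in para_obj_counts if o == first_noun}
    scores = {
        p: pmi(subj_para_counts, second_noun, p) * pmi(para_obj_counts, p, first_noun)
        for p in candidates
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy counts (assumption), not taken from any real corpus.
subj_para = Counter({("bottle", "be for"): 3, ("bottle", "hold"): 1, ("museum", "hold"): 4})
para_obj = Counter({("be for", "water"): 2, ("hold", "water"): 2, ("hold", "art"): 4})
ranked_pairs = score_paraphrases_by_pairs(subj_para, para_obj, "water bottle")
```

One property of a plain product is worth noting: two negative PMI values multiply to a positive score, as happens for "hold" in the toy data. The slides do not say how such cases were handled, so the sketch leaves them untouched.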
Used corpora and their preprocessing
Search for paraphrases:
  British National Corpus
    100 million words
    Grammatical relations from parser
  Web 1T 5-gram Corpus
    Generated from 1 trillion words of web page text
    Grammatical relations from POS patterns
      Noun verb determiner noun
Validation of paraphrases: The Web through Google and Yahoo!
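The POS pattern "noun verb determiner noun" mentioned above can be matched with a short sketch like the following. It assumes each n-gram has already been POS-tagged with Penn Treebank style tags (NN*, VB*, DT); the tag set, the input format, and the function name are assumptions, and no lemmatization is performed.

```python
def extract_triple(tagged_ngram):
    """Return (subject, verb, object) for the first noun-verb-determiner-noun
    window in a tagged n-gram, or None if the pattern does not occur."""
    for i in range(len(tagged_ngram) - 3):
        window = tagged_ngram[i:i + 4]
        tags = [tag for _, tag in window]
        if (tags[0].startswith("NN") and tags[1].startswith("VB")
                and tags[2] == "DT" and tags[3].startswith("NN")):
            # The determiner is dropped; only the two nouns and the verb are kept.
            return (window[0][0], window[1][0], window[3][0])
    return None


# A hypothetical tagged 5-gram.
five_gram = [("the", "DT"), ("bottle", "NN"), ("holds", "VBZ"), ("the", "DT"), ("water", "NN")]
triple = extract_triple(five_gram)
```

Surface patterns of this kind are cheap to apply to n-gram data but are more error-prone than parser output, since the matched nouns need not be the true subject and object.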
Passive paraphrases
Their surface subject is actually their object
(subject, paraphrase) = (paraphrase2, object)
  paraphrase: passive, without preposition
  paraphrase2: active version of paraphrase
  subject = object
Their frequencies are counted together
Passive paraphrases
(subject, paraphrase, object) = (subject2, paraphrase2, object2)
  paraphrase: passive, with by preposition
  paraphrase2: active version of paraphrase, without preposition
  object2 = subject
  subject2 = object
Their frequencies are counted together
Such (paraphrase, object) and (subject2, paraphrase2) pairs are treated the same way
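The two passive cases above can be sketched as a normalization step applied before counting. The tuple layout (verb lemma, voice, preposition) and the function names are assumptions made for illustration only.

```python
from collections import Counter


def normalize_triple(subject, verb, voice, preposition, obj):
    """Map a passive triple with 'by' onto its active counterpart."""
    if voice == "passive" and preposition == "by":
        # "water is contained by bottle" is counted as "bottle contains water".
        return (obj, verb, "active", None, subject)
    return (subject, verb, voice, preposition, obj)


def normalize_subject_pair(subject, verb, voice, preposition):
    """Map a prepositionless passive (subject, paraphrase) pair onto the
    equivalent (paraphrase, object) pair of the active verb."""
    if voice == "passive" and preposition is None:
        # The surface subject of the passive is really the object.
        return ("paraphrase-object", verb, subject)
    return ("subject-paraphrase", subject, verb)


# Hypothetical observations: one active, one passive with "by".
observed = [
    ("bottle", "contain", "active", None, "water"),
    ("water", "contain", "passive", "by", "bottle"),
]
triple_counts = Counter(normalize_triple(*t) for t in observed)
```

After normalization both observations fall under the same key, so their frequencies are counted together.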
Patientive ambitransitive verbs
Three main groups of verbs: strictly transitive, strictly intransitive, ambitransitive
Strictly intransitive verbs have two subclasses: unergative and unaccusative
Ambitransitive verbs have two subclasses too: agentive and patientive
Patientive ambitransitive verbs in intransitive use behave in the same way as passive verbs, so they are treated the same way
Using synonyms, hypernyms, sister words etc.
No paraphrases are found for several NCs
Hypothesis: NCs comprising semantically similar words are interpreted the same way
Using semantically similar words in the search for paraphrases
  Synonyms, hypernyms, sister words from WordNet
  Semantically similar words that are automatically found with a method proposed by Dekang Lin
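One possible way to use substitute words is to expand each NC into a set of candidate noun pairs and search for paraphrases of all of them. The sketch below assumes a precomputed lookup table of similar words (for instance built from WordNet or from a distributional similarity method); the table contents and the function name are hypothetical.

```python
from itertools import product


def candidate_noun_pairs(noun_compound, similar_words):
    """Return (first noun, second noun) pairs to search for, starting with
    the original NC and followed by pairs that use substitute words."""
    first_noun, second_noun = noun_compound.split()
    firsts = [first_noun] + similar_words.get(first_noun, [])
    seconds = [second_noun] + similar_words.get(second_noun, [])
    return list(product(firsts, seconds))


# Toy similarity table (assumption), e.g. sister words of "museum".
similar = {"museum": ["gallery"]}
pairs = candidate_noun_pairs("arts museum", similar)
```

Keeping the original pair first makes it easy to prefer paraphrases found without substitution, which is in the spirit of the hypothesis that similar compounds share an interpretation.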
Validation of paraphrases
Some paraphrases are incorrect
Validation is needed
Hypothesis: If a paraphrase is suitable for an NC, then there should exist at least some web pages containing the NC paraphrased by that paraphrase
Validation of paraphrases
Google and Yahoo! queries
  Simple queries: “n2Infl THAT p n1Infl”
  Extended queries:
    Multiple verb tenses
    Wildcard characters (up to 9)
Score for each paraphrase is recalculated
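The query template "n2Infl THAT p n1Infl" can be expanded along these lines. The sketch assumes the inflected noun forms, the relative pronouns standing in for THAT, and the inflected paraphrase forms are supplied by the caller, and that wildcards are placed between the paraphrase and the first noun; none of these details are given in the slides, and no search engine API is called.

```python
from itertools import product


def build_queries(n2_forms, that_words, paraphrase_forms, n1_forms, max_wildcards=0):
    """Build quoted phrase queries of the form 'n2 THAT paraphrase [* ...] n1'."""
    queries = []
    for n2, that, para, n1 in product(n2_forms, that_words, paraphrase_forms, n1_forms):
        for k in range(max_wildcards + 1):
            # k wildcard tokens allow for determiners or modifiers before n1.
            parts = [n2, that, para] + ["*"] * k + [n1]
            queries.append('"' + " ".join(parts) + '"')
    return queries


# Hypothetical inputs for "water bottle" with the paraphrase "be for".
queries = build_queries(["bottle", "bottles"], ["that"], ["is for", "are for"], ["water"], max_wildcards=1)
```

The hit counts returned for such queries would then feed into the recalculated score of each paraphrase.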
Testing and evaluation
Tested on the first 50 NCs of the SemEval-2 Task #9
3 best paraphrases for each NC
5 native speakers recruited for evaluation
  They score each paraphrase from 1 to 5
  Their agreement was checked using Krippendorff’s alpha, and it was too low
  The (noun compound, paraphrase) pairs with highest disagreement were omitted
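For reference, Krippendorff's alpha is defined as 1 minus the ratio of observed to expected disagreement. The sketch below uses the interval distance (a - b)^2; the slides do not say which distance metric was used for the 1 to 5 scores, so this choice is an assumption, as are the example ratings.

```python
from itertools import permutations


def krippendorff_alpha_interval(units):
    """Krippendorff's alpha with the interval distance (a - b) ** 2.

    `units` holds one list of ratings per rated item; items with fewer
    than two ratings are skipped because they are not pairable.
    """
    units = [u for u in units if len(u) >= 2]
    values = [v for u in units for v in u]
    n = len(values)
    if n < 2:
        raise ValueError("need at least two pairable ratings")
    # Observed disagreement: squared differences between ratings of the same item.
    observed = sum(
        sum((a - b) ** 2 for a, b in permutations(u, 2)) / (len(u) - 1)
        for u in units
    ) / n
    # Expected disagreement: squared differences between all ratings, ignoring items.
    expected = sum((a - b) ** 2 for a, b in permutations(values, 2)) / (n * (n - 1))
    if expected == 0:
        raise ValueError("alpha is undefined when all ratings are identical")
    return 1 - observed / expected


# Hypothetical scores from 5 annotators for three (NC, paraphrase) pairs.
ratings = [[4, 5, 4, 4, 5], [1, 5, 2, 4, 3], [2, 2, 1, 2, 2]]
alpha = krippendorff_alpha_interval(ratings)
```

A value of 1 indicates perfect agreement and values near 0 indicate agreement no better than chance, so dropping the items with the most disagreement raises alpha.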
Best version
Subject-paraphrase-object-triples version
Web 1T 5-gram Corpus
Combination of two basic versions:
  No substitute words
  Sister words
  Scores are recalculated in a way that favors paraphrases returned by the first version
Validation: Google, present simple, up to 1 wildcard
Results
Mixed performance
Average scores
Promising results given the difficulty of the task
Noun compound    1st rank   2nd rank        3rd rank
arts museum      be of      be devoted to   be for
bird droppings   be in      be for          be

Rank of paraphrase   Average score
1st rank             3.1842
2nd rank             2.7687
3rd rank             2.5583
Results
Best scoring NCs
Noun compound           Avg. score
broadway youngster      4.7500
cell membrane           4.6000
cattle population       4.4000
arts museum             4.3333
business sector         4.2000
arts colleges           4.0000
backwoods protagonist   3.8750
antibiotic regimen      3.8667
census population       3.8667
business applications   3.7000
Worst scoring NCs
Noun compound             Avg. score
championship bout         2.0000
buddhist philosophy       1.8000
cell block                1.7500
banana industry           1.7333
ancestor spirits          1.6000
anode loss                1.5000
bird droppings            1.2667
bow scrape                1.2500
activity spectrum         1.0000
altitude reconnaissance   1.0000
Future work
Parsing the Web 1T 5-gram Corpus
  Much lower error rate in obtaining the grammatical relations
Extended validation part
  Employing synonyms, hypernyms, sister words or semantically similar words
Combining the different extensions
Summary
Interpreting noun compounds is crucial for many NLP tasks
We presented a method for noun compound interpretation that searches for paraphrases in large corpora and issues web queries to validate the results
The results are promising, and could be further improved
Acknowledgements
The attendance of this workshop was partly supported by the Hungarian National Office for Research and Technology within the framework of the R&D project MASZEKER (Modell-Alapú Szemantikus Kereső Rendszer – Model Based Semantic Search System).
Thank you!