Finding multiwords of more than two words
![Page 1: Finding multiwords of more than two words](https://reader038.fdocuments.us/reader038/viewer/2022102714/568164a9550346895dd6a38f/html5/thumbnails/1.jpg)
Finding multiwords of more than two words
Adam Kilgarriff, Pavel Rychly, Vojtech Kovar, Vít Baisa
Lexical Computing Ltd; Masaryk Univ., Cz
![Page 2: Finding multiwords of more than two words](https://reader038.fdocuments.us/reader038/viewer/2022102714/568164a9550346895dd6a38f/html5/thumbnails/2.jpg)
Multiwords
• Lexical items with spaces in (Western languages)
![Page 3: Finding multiwords of more than two words](https://reader038.fdocuments.us/reader038/viewer/2022102714/568164a9550346895dd6a38f/html5/thumbnails/3.jpg)
Two-word multiwords
• Church and Hanks 1989
  – Mutual information
  – A statistic that finds multiwords in a corpus
• Since
  – Other statistics
    • T-score, Log-likelihood, Dice, Fisher's Exact Test
  – Evaluation
    • Krenn and Evert 2001, many others since
  – Better with grammar
    • Wermter and Hahn 2006
• Problem solved
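The slide names the statistics but gives no formulas. As a rough illustration, here is a minimal Python sketch of four of them, using common textbook formulations computed from corpus counts. The function names, the parameter names (`f_xy`, `f_x`, `f_y`, `n`) and the example counts are assumptions made for this sketch, not taken from the talk.

```python
import math


def association_scores(f_xy, f_x, f_y, n):
    """Score a word pair from raw counts (hypothetical helper).

    f_xy: co-occurrence frequency of the pair
    f_x, f_y: corpus frequencies of each word
    n: corpus size in tokens
    """
    expected = f_x * f_y / n  # co-occurrences expected under independence
    return {
        # Mutual information in the sense of Church and Hanks
        "MI": math.log2(f_xy / expected),
        # T-score: observed minus expected, scaled by sqrt(observed)
        "t-score": (f_xy - expected) / math.sqrt(f_xy),
        # Dice coefficient
        "Dice": 2 * f_xy / (f_x + f_y),
    }


def log_likelihood(f_xy, f_x, f_y, n):
    """Log-likelihood ratio (G2) over the 2x2 contingency table."""
    observed = [f_xy, f_x - f_xy, f_y - f_xy, n - f_x - f_y + f_xy]
    expected = [
        f_x * f_y / n,
        f_x * (n - f_y) / n,
        (n - f_x) * f_y / n,
        (n - f_x) * (n - f_y) / n,
    ]
    # Cells with zero observations contribute nothing to the sum
    return 2 * sum(o * math.log(o / e) for o, e in zip(observed, expected) if o > 0)


# Invented counts, for illustration only
scores = association_scores(f_xy=10, f_x=100, f_y=100, n=10_000)
print(scores["MI"], scores["t-score"], scores["Dice"])
print(log_likelihood(10, 100, 100, 10_000))
```

All four take the same inputs, which is why they are easy to swap and compare for two-word candidates; the harder question raised on the next slide is what the counts should be when there are more than two words.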
![Page 4: Finding multiwords of more than two words](https://reader038.fdocuments.us/reader038/viewer/2022102714/568164a9550346895dd6a38f/html5/thumbnails/4.jpg)
More than two words
• Problem 1: what to count
• Problem 2: statistics
• Attempts include
  – Dias 2002
  – Petrovic, Snajder and Basic 2010
• Not convincing
  – No prima facie validity to results
  – Stats only; no grammar
![Page 5: Finding multiwords of more than two words](https://reader038.fdocuments.us/reader038/viewer/2022102714/568164a9550346895dd6a38f/html5/thumbnails/5.jpg)
Responses
• Principle:
  – Word sketches work very well. Build on them
1. Multiword sketches
2. Commonest match
![Page 6: Finding multiwords of more than two words](https://reader038.fdocuments.us/reader038/viewer/2022102714/568164a9550346895dd6a38f/html5/thumbnails/6.jpg)
Multiword sketches
![Page 7: Finding multiwords of more than two words](https://reader038.fdocuments.us/reader038/viewer/2022102714/568164a9550346895dd6a38f/html5/thumbnails/7.jpg)
![Page 8: Finding multiwords of more than two words](https://reader038.fdocuments.us/reader038/viewer/2022102714/568164a9550346895dd6a38f/html5/thumbnails/8.jpg)
![Page 9: Finding multiwords of more than two words](https://reader038.fdocuments.us/reader038/viewer/2022102714/568164a9550346895dd6a38f/html5/thumbnails/9.jpg)
![Page 10: Finding multiwords of more than two words](https://reader038.fdocuments.us/reader038/viewer/2022102714/568164a9550346895dd6a38f/html5/thumbnails/10.jpg)
![Page 11: Finding multiwords of more than two words](https://reader038.fdocuments.us/reader038/viewer/2022102714/568164a9550346895dd6a38f/html5/thumbnails/11.jpg)
![Page 12: Finding multiwords of more than two words](https://reader038.fdocuments.us/reader038/viewer/2022102714/568164a9550346895dd6a38f/html5/thumbnails/12.jpg)
![Page 13: Finding multiwords of more than two words](https://reader038.fdocuments.us/reader038/viewer/2022102714/568164a9550346895dd6a38f/html5/thumbnails/13.jpg)
![Page 14: Finding multiwords of more than two words](https://reader038.fdocuments.us/reader038/viewer/2022102714/568164a9550346895dd6a38f/html5/thumbnails/14.jpg)
![Page 15: Finding multiwords of more than two words](https://reader038.fdocuments.us/reader038/viewer/2022102714/568164a9550346895dd6a38f/html5/thumbnails/15.jpg)
Commonest match
• Problem
  – In our evaluation exercise:
  – Is world a good collocate of final?
    • first glance
      – No
    • Look at concordance
1. Multiword sketches
2. Commonest match
![Page 16: Finding multiwords of more than two words](https://reader038.fdocuments.us/reader038/viewer/2022102714/568164a9550346895dd6a38f/html5/thumbnails/16.jpg)
![Page 17: Finding multiwords of more than two words](https://reader038.fdocuments.us/reader038/viewer/2022102714/568164a9550346895dd6a38f/html5/thumbnails/17.jpg)
Aha
![Page 18: Finding multiwords of more than two words](https://reader038.fdocuments.us/reader038/viewer/2022102714/568164a9550346895dd6a38f/html5/thumbnails/18.jpg)
Intuition
• Where word1 occurs with word2, do they usually (/often) occur in a particular string?
  – If yes, show that string
  – (if no, as now)
• Grow the collocation
  – for as long as the commonest match accounts for plenty of the data
![Page 19: Finding multiwords of more than two words](https://reader038.fdocuments.us/reader038/viewer/2022102714/568164a9550346895dd6a38f/html5/thumbnails/19.jpg)
Algorithm
• Start: two lemmas forming collocation
• Gather all N hits (+ contexts)
• Identify the match
  – From leftmost of the two lemmas to rightmost
  – Commonest match has frequency >= N/4 ?
    • No: end, return lemma-pair
    • Yes
      1. Update new_match to match, N to freq of match
      2. New-match = match extended one word to left (/right)
      3. Commonest match has frequency >= N/4 ?
        » No: end, return match
        » Yes: return to 1.
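The steps above can be sketched in Python roughly as follows. This is one reading of the slide, not the authors' implementation. It makes several assumptions: plain word forms stand in for lemmas, each hit is a whitespace-tokenised sentence, both the left and the right extension are tried at each step and the more frequent one wins, and `None` signals "return the lemma pair". The helper names `find_hits` and `commonest_match` and the example sentences are invented for illustration.

```python
from collections import Counter


def find_hits(sentences, word1, word2):
    """Return (tokens, start, end) for each sentence containing both words.

    start and end index the leftmost and rightmost of the two words.
    Word forms stand in for lemmas here (an assumption of this sketch).
    """
    hits = []
    for sentence in sentences:
        tokens = sentence.split()
        if word1 in tokens and word2 in tokens:
            i, j = tokens.index(word1), tokens.index(word2)
            hits.append((tokens, min(i, j), max(i, j)))
    return hits


def commonest_match(hits, threshold=0.25):
    """Grow a collocation while the commonest string covers >= N * threshold hits."""
    if not hits:
        return None
    n = len(hits)
    spans = Counter(tuple(t[s:e + 1]) for t, s, e in hits)
    match, freq = spans.most_common(1)[0]
    if freq < n * threshold:
        return None  # no dominant string: keep the bare lemma pair
    # Keep only the hits that realise the current match
    current = [(t, s, e) for t, s, e in hits if tuple(t[s:e + 1]) == match]
    while True:
        n = len(current)  # N becomes the frequency of the current match
        candidates = Counter()
        for t, s, e in current:
            if s > 0:
                candidates[("L", t[s - 1])] += 1
            if e + 1 < len(t):
                candidates[("R", t[e + 1])] += 1
        if not candidates:
            return match
        (side, word), freq = candidates.most_common(1)[0]
        if freq < n * threshold:
            return match
        if side == "L":
            current = [(t, s - 1, e) for t, s, e in current
                       if s > 0 and t[s - 1] == word]
            match = (word,) + match
        else:
            current = [(t, s, e + 1) for t, s, e in current
                       if e + 1 < len(t) and t[e + 1] == word]
            match = match + (word,)


# Invented concordance lines, for illustration only
sentences = [
    "the world cup final was postponed",
    "the world cup final in Paris",
    "the world cup final drew crowds",
    "the world cup final and after",
    "the world cup final . Then",
    "a world 's final hope",
]
print(commonest_match(find_hits(sentences, "world", "final")))
# → ('the', 'world', 'cup', 'final')
```

With very few hits the N/4 test becomes trivially easy to pass (a single occurrence is a quarter of four hits), so a practical version would probably also want a minimum absolute frequency; that is an addition not mentioned on the slide.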
![Page 20: Finding multiwords of more than two words](https://reader038.fdocuments.us/reader038/viewer/2022102714/568164a9550346895dd6a38f/html5/thumbnails/20.jpg)
![Page 21: Finding multiwords of more than two words](https://reader038.fdocuments.us/reader038/viewer/2022102714/568164a9550346895dd6a38f/html5/thumbnails/21.jpg)
![Page 22: Finding multiwords of more than two words](https://reader038.fdocuments.us/reader038/viewer/2022102714/568164a9550346895dd6a38f/html5/thumbnails/22.jpg)
Status and plans
• Implemented but too slow
  – Re-engineering in progress
• Then
  – Alternative-format word sketches
    • Default?
    • Don’t show gramrels?
  – Automatic collocations dictionary
  – Build into GDEX
![Page 23: Finding multiwords of more than two words](https://reader038.fdocuments.us/reader038/viewer/2022102714/568164a9550346895dd6a38f/html5/thumbnails/23.jpg)
![Page 24: Finding multiwords of more than two words](https://reader038.fdocuments.us/reader038/viewer/2022102714/568164a9550346895dd6a38f/html5/thumbnails/24.jpg)
Colligation and collocation
![Page 25: Finding multiwords of more than two words](https://reader038.fdocuments.us/reader038/viewer/2022102714/568164a9550346895dd6a38f/html5/thumbnails/25.jpg)
Birmingham vs. Lancaster
• Lemmas or word forms?
• Grammar or strings?
• McEnery and Hardie, Corpus Linguistics, CUP
  red textbooks
![Page 26: Finding multiwords of more than two words](https://reader038.fdocuments.us/reader038/viewer/2022102714/568164a9550346895dd6a38f/html5/thumbnails/26.jpg)
![Page 27: Finding multiwords of more than two words](https://reader038.fdocuments.us/reader038/viewer/2022102714/568164a9550346895dd6a38f/html5/thumbnails/27.jpg)
In sum
• Two-word multiwords
  – Solved
• More than two
  – Hard
  – Build on word sketches
  – Two implemented solutions
    • Multiword sketches
    • Commonest string
Thank you