Corpus Evaluation
Adam Kilgarriff
Lexical Computing Ltd
Portsmouth, Nov 2011
Then
• Very few corpora
• Use what’s there
Now
• Corpora to spec
• Choice
• Need to evaluate
Intrinsic
• See what it looks like
Extrinsic
• Embed in a task
• How well do you do at the task?
• Better
  • It all depends what you want it for
It all depends what you want it for, but
‘General English (/French/Chinese/ …)’
• Many purposes
• Not specialist sublanguage
A decent construct?
• Not sure, but it has form
  • General language dictionaries
  • “How good is a corpus, for making them?”
General truths
• Duplicates bad
• Noise bad
• Big good
• Diverse (good coverage of varieties within research scope, not dominated by any one variety) good
word sketch
A corpus-derived one-page summary of a word’s grammatical and collocational behaviour
Macmillan English Dictionary for Advanced Learners
Ed.: Rundell, 2002
11 years, 1999-2010
Feedback
• Good but anecdotal
Formal evaluation
Goal
Collocations dictionary
• Model: Oxford Collocations Dictionary
• Publication-quality
Ask a lexicographer
• For 42 headwords
  • For the 20 best collocates per headword: “Should we include this collocation in a published dictionary?”
Sample of headwords
Nouns, verbs, adjectives; random
High (top 3,000)
• N: space, solution, opinion, mass, corporation, leader
• V: serve, incorporate, mix, desire
• Adj: high, detailed, open, academic
Mid (3,000-9,999)
• N: cattle, repayment, fundraising, elder, biologist, sanitation
• V: grieve, classify, ascertain, implant
• Adj: adjacent, eldest, prolific, ill
Low (10,000-30,000)
• N: predicament, adulterer, bake, bombshell, candy, shellfish
• V: slap, outgrow, plow, traipse
• Adj: neoclassical, votive, adulterous, expandable
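The sample is stratified by frequency band and part of speech: 6 nouns, 4 verbs and 4 adjectives per band, 42 headwords in all. A minimal sketch of how such a sample could be drawn, assuming a frequency-ranked lemma list is available. The function name, the input format, the exact band boundaries and the toy data are all hypothetical, not the procedure used in the study.

```python
import random

# Frequency bands from the slide: (name, lowest rank, highest rank), 1-based.
# Treating rank 3000 as "high" and 3001 as "mid" is an assumption.
BANDS = [("high", 1, 3000), ("mid", 3001, 9999), ("low", 10000, 30000)]
# Headwords per part of speech in each band (6 + 4 + 4 = 14; 3 bands give 42).
QUOTA = {"N": 6, "V": 4, "Adj": 4}


def sample_headwords(ranked, seed=0):
    """Draw a stratified random sample of headwords.

    `ranked` is a hypothetical input: a list of (lemma, pos) pairs sorted by
    descending corpus frequency, so position 0 holds rank 1.
    """
    rng = random.Random(seed)
    sample = {}
    for name, lo, hi in BANDS:
        band = ranked[lo - 1:hi]
        for pos, k in QUOTA.items():
            pool = [lemma for lemma, p in band if p == pos]
            # Take everything if the band has fewer than k lemmas of this POS.
            sample[(name, pos)] = rng.sample(pool, min(k, len(pool)))
    return sample


# Toy ranked list: 30,000 made-up lemmas cycling through the three POS tags.
toy = [(f"w{i}", ("N", "V", "Adj")[i % 3]) for i in range(1, 30001)]
picked = sample_headwords(toy)
total = sum(len(words) for words in picked.values())
```

Fixing the seed keeps the sample reproducible, which matters if the same headwords are to be reused when comparing corpora later.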
Precision and recall
We tested precision
Recall is harder
• How do we find all the collocations that the system should have found?
Four languages, three families
• Dutch: ANW, 102m-word lexicographic corpus
• English: UKWaC, 1.5b web corpus
• Japanese: JpWaC, 400m web corpus
• Slovene: FidaPlus, 620m lexicographic corpus
User evaluation
Evaluate whole system
• Will it help with my task?
  • E.g. preparing a collocations dictionary
Contrast: developer evaluation
• Can I make the system better?
  • Evaluate each module separately
  • Current work
Components
Corpus
NLP tools
• Segmenter, lemmatiser, POS-tagger
Sketch grammar
Statistics
Practicalities
Interface
• Good, Good-but
  • Merge to good
• Maybe, Maybe-specialised, Bad
  • Merge to bad
For each language
• Two/three linguists/lexicographers
• If they disagree
  • Don't use for computing performance
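The scoring procedure described here can be sketched as follows: merge the five interface labels into two classes, drop any collocation on which the annotators disagree after merging, and report the share of the remainder judged good. This is a minimal sketch under those assumptions. The function name, the data layout and the example judgements are invented for illustration.

```python
# Five interface labels from the slide, merged into two classes.
MERGE = {
    "good": "good",
    "good-but": "good",
    "maybe": "bad",
    "maybe-specialised": "bad",
    "bad": "bad",
}


def precision(judgements):
    """Share of collocations judged good, among those the annotators agree on.

    `judgements` maps a (headword, collocate) pair to the list of labels given
    by the two or three annotators. Pairs on which the annotators disagree
    after merging are left out of the calculation.
    """
    agreed = []
    for labels in judgements.values():
        merged = {MERGE[label.lower()] for label in labels}
        if len(merged) == 1:  # all annotators agree
            agreed.append(merged.pop())
    if not agreed:
        return None
    return agreed.count("good") / len(agreed)


# Made-up judgements, for illustration only.
example = {
    ("opinion", "public"): ["Good", "Good-but"],          # agree: good
    ("opinion", "express"): ["Good", "Good", "Good"],     # agree: good
    ("opinion", "page"): ["Maybe", "Bad"],                # agree: bad
    ("opinion", "poll"): ["Good", "Maybe-specialised"],   # disagree: dropped
}
score = precision(example)
```

Note that merging happens before the agreement check, so a Good and a Good-but count as agreement.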
Results
• Dutch: 66%
• English: 71%
• Japanese: 87%
• Slovene: 71%
Two thirds of a collocations dictionary can be gathered automatically
<world, final> problem
Is it good?
• Superficially no
• Look at concordances:
  • World cup finals
Solution
• ‘Commonest string’
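One way to read the ‘commonest string’ idea: for a collocate pair, look at the concordance lines containing both words and report the most frequent word sequence that spans them. The sketch below is an assumed definition for illustration only, not the implementation in the word sketch software. Prefix matching stands in for lemmatisation, and the example lines are invented.

```python
from collections import Counter


def commonest_string(lines, node, collocate):
    """Return (string, count) for the most frequent span containing both words.

    `lines` are tokenised concordance lines. On each line the shortest span
    running from one word to the other is counted. Matching is by prefix, a
    crude stand-in for lemmas, so 'finals' counts as a hit for 'final'.
    """
    node, collocate = node.lower(), collocate.lower()
    counts = Counter()
    for tokens in lines:
        lowered = [t.lower() for t in tokens]
        pairs = [
            (a, b)
            for a, t in enumerate(lowered) if t.startswith(node)
            for b, u in enumerate(lowered) if u.startswith(collocate) and a != b
        ]
        if not pairs:
            continue
        # The closest pair of positions gives the shortest span on this line.
        a, b = min(pairs, key=lambda p: abs(p[0] - p[1]))
        start, end = min(a, b), max(a, b)
        counts[" ".join(lowered[start:end + 1])] += 1
    return counts.most_common(1)[0] if counts else None


# Invented concordance lines for the <world, final> pair.
lines = [
    "England reached the World Cup finals in 1990".split(),
    "tickets for the World Cup finals".split(),
    "the World Cup final was played in Paris".split(),
    "the final of the world championship".split(),
]
result = commonest_string(lines, "final", "world")
```

Shown next to the bare pair, a string such as "world cup finals" lets the lexicographer see at once why the collocate was proposed.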
Next step
Recall
• 200 collocates per headword
  • Selected from
    • All the corpora we have
    • Various parameter settings
• Plus just-in-time evaluation for 'new' collocates
Then
• For a sample of headwords
  • These are the collocations we should get
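The plan above amounts to pooling: candidates from several corpora and parameter settings are merged, judged once, and the good ones become the reference set against which recall is measured. A minimal sketch of that idea follows. The round-robin pooling order, the `judge` callable standing in for the lexicographer, and the example data are assumptions, and the just-in-time judging of 'new' collocates is not modelled.

```python
def pooled_gold(candidate_lists, judge, pool_size=200):
    """Build a reference set of good collocates for one headword.

    `candidate_lists` holds ranked collocate lists from different corpora and
    parameter settings. `judge` is a hypothetical callable that returns True
    for a good collocate. Candidates are pooled round-robin by rank, and the
    first `pool_size` distinct ones are judged.
    """
    pool = []
    seen = set()
    depth = max((len(c) for c in candidate_lists), default=0)
    for rank in range(depth):
        for candidates in candidate_lists:
            if rank < len(candidates) and candidates[rank] not in seen:
                seen.add(candidates[rank])
                pool.append(candidates[rank])
    return {c for c in pool[:pool_size] if judge(c)}


def recall(system_output, gold):
    """Fraction of the gold collocates that the system returned."""
    if not gold:
        return None
    return len(set(system_output) & gold) / len(gold)


# Invented candidate lists for one headword, for illustration only.
lists = [
    ["strong", "public", "popular", "dissenting"],
    ["public", "expert", "page", "strong"],
]
good = {"strong", "public", "popular", "dissenting", "expert"}
gold = pooled_gold(lists, lambda c: c in good)
r = recall(["public", "strong", "page"], gold)
```

Recall measured this way is relative to the pool: a good collocate that no corpus or setting ever proposed cannot be counted as missed.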
From sketches to corpora
Hold other inputs constant
• Just one varies
• Evaluate that one
Hold tools, stats, grammar constant
• Evaluate corpora
Criteria
• Duplicates bad
• Noise bad
• Big good
• Diverse (good coverage of varieties within research scope, not dominated by any one variety) good
We think so
Over next year
Build test sets
Textbook cases
English
• BNC vs UKWaC vs OEC vs Gigaword
Dutch
• ANW corpus vs web corpus
Web crawling, deduplication
• Which parameters give best results?
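To make the deduplication question concrete, here is a minimal sketch of one common approach: compare documents by their word n-gram "shingles" and drop any document whose Jaccard overlap with an already kept one reaches a threshold. The shingle length `n` and the `threshold` are the kind of parameters whose settings could be compared. This is an illustrative technique chosen for the example, not the tool or settings used for the corpora named in these slides.

```python
def shingles(text, n=5):
    """Set of word n-grams in a text (the whole text if shorter than n)."""
    words = text.lower().split()
    if len(words) < n:
        return {tuple(words)}
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def deduplicate(docs, n=5, threshold=0.5):
    """Keep a document only if it is not too similar to one already kept.

    Similarity is the Jaccard overlap of shingle sets. The comparison is
    quadratic in the number of documents, so this is an illustration only.
    """
    kept, kept_shingles = [], []
    for doc in docs:
        s = shingles(doc, n)
        is_dup = any(len(s & t) / len(s | t) >= threshold for t in kept_shingles)
        if not is_dup:
            kept.append(doc)
            kept_shingles.append(s)
    return kept


# Invented documents: the first two differ only in their last word.
docs = [
    "the cat sat on the mat and looked at the dog",
    "the cat sat on the mat and looked at the bird",
    "corpus evaluation needs test sets and textbook cases",
]
loose = deduplicate(docs, n=3, threshold=0.5)
strict = deduplicate(docs, n=3, threshold=0.9)
```

With the looser threshold the second document is dropped as a near-duplicate; with the stricter one all three are kept. Holding everything else constant and varying only such settings fits the evaluation design described earlier.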
Thank you
http://www.sketchengine.co.uk