Comparable Corpora BootCaT (CCBC) Adam Kilgarriff, Avinesh PVS, Jan Pomikalek Lexical Computing Ltd.
Comparable Corpora BootCaT (CCBC)
Adam Kilgarriff, Avinesh PVS, Jan Pomikalek
Lexical Computing Ltd.
Just-in-time corpora
Krista Varantola
Translators, terminologists
In-domain terminology: Domain dictionaries
• Don’t exist
• Out of date
• Not accessible
Collect in-domain web pages
Instant corpus
BootCaT (Bootstrapping Corpora and Terms)
Baroni and Bernardini 2004
User: input ‘seed terms’
Send 3-at-a-time to a search engine
• Returns search hits page
Retrieve those pages
A corpus!
• Cleaning, deduplicating, linguistic processing
Extract terms
• Can use extracted terms as seeds, iterate
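
The "send 3-at-a-time" step can be sketched as follows. This is a minimal illustration, not the BootCaT code itself: the function name `build_queries`, its parameters, and the seed list are assumptions made for the example, and the search, download, cleaning and term-extraction steps are left out.

```python
import itertools
import random

def build_queries(seeds, tuple_size=3, max_queries=10, rng_seed=0):
    """Combine seed terms into search-engine queries, tuple_size terms at a time."""
    # Every unordered combination of tuple_size distinct seeds is a candidate query.
    combos = list(itertools.combinations(seeds, tuple_size))
    # Shuffle with a fixed seed so the sample is random but repeatable.
    random.Random(rng_seed).shuffle(combos)
    return [" ".join(combo) for combo in combos[:max_queries]]

# Hypothetical seed terms; multiword seeds are quoted so they are searched as phrases.
seeds = ["volcanology", "tephra", "magma", '"volcanic ash"', "rhyolitic"]
queries = build_queries(seeds, max_queries=4)
for query in queries:
    print(query)
```

To iterate, the terms extracted from the resulting corpus would be passed back in as the next `seeds` list.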
Works well
Widely used
More implementations
SkE has WebBootCaT, web front end
Secret: piggybacks on search engines
They do the donkey-work
• on-domain, text-rich pages, no spam, …
Also in use for
General language corpus
Long list of general seed words
• Pioneer: Serge Sharoff
• LCL: Corpus Factory
‘Varieties of Learner English’
General English, same queries except
• Region=UK, US, Canada, Aus, China, Japan, Korea
Validation under way
The Sketch Engine
Corpus query tool, since 2003
Widely used by lexicographers
• Commercial: OUP, CUP, Collins, Macmillan, Le Robert, Cornelsen, Shogakukan
• National dictionary projects: Bulgaria, Czech Republic, Estonia, Netherlands, Slovakia, Slovenia
Universities
• Linguistics, language research, NLP, language teaching, teaching translation
55 languages and counting
Large corpora ready-to-use for
Arabic Bengali Bulgarian Chinese Czech Croatian Danish Dutch English Estonian Finnish French German Greek Gujarati Hebrew Hindi Indonesian Irish Italian Japanese Korean Latin Malay Malayalam Norwegian Persian Polish Portuguese Romanian Russian Serbian Setswana Slovak Slovene Spanish Swahili Swedish Tamil Telugu Thai Turkish Urdu Vietnamese
Handles large corpora
• Largest to date: 8 billion words
Fast
Web-based: no software to install
Build ‘instant corpora’ from the web
Load your own corpus
• Quota of space on SkE server
Word sketches
• One-page, automatic accounts of a word’s grammatical and collocational behaviour
Free 30-day trial: sketchengine.co.uk
WebBootCaT
BootCaT integrated in SkE
BootCaT a corpus
Clean, de-dupe, POS-tag, then
Load into Sketch Engine
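
As a rough illustration of the "clean, de-dupe" step, the sketch below drops very short pages and exact duplicates. It is an assumption-laden toy: the function names, the `min_words` threshold and the sample pages are invented for the example, and it does not reproduce the boilerplate removal, near-duplicate detection or POS-tagging that the actual pipeline applies.

```python
import hashlib
import re

def normalise(text):
    """Lower-case and collapse whitespace so trivially different copies match."""
    return re.sub(r"\s+", " ", text).strip().lower()

def dedupe(pages, min_words=5):
    """Drop very short pages and exact duplicates (after normalisation)."""
    seen = set()
    kept = []
    for page in pages:
        norm = normalise(page)
        # Pages with almost no running text are treated as navigation residue.
        if len(norm.split()) < min_words:
            continue
        digest = hashlib.sha1(norm.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(page)
    return kept

# Hypothetical downloaded pages, already stripped of markup.
pages = [
    "Tephra is fragmental material produced by a volcanic eruption.",
    "Tephra  is fragmental material produced by a volcanic   eruption.",
    "Home | Login",
    "Magma is molten rock found beneath the surface of the Earth.",
]
corpus = dedupe(pages)
```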
How big a corpus do we get?
Observation
Specialist domain, L1
Specialist domain, L2
Matching terminology
Going multilingual
Translate seeds
English: volcanology volcanologist "volcanic eruption" seismographs Eyjafjallajokull geodic "deformation monitoring" tephra magma stratigraphic tephrochronology geochronological "volcanic ash" ablation rhyolitic
French: vulcanologue volcanologie "éruption volcanique" sismographes Eyjafjallajokull "surveillance de la déformation" géodiques tephra magma téphrochronologie stratigraphique géochronologiques "de cendres volcaniques" ablation rhyolitiques
BootCaT for French
CCBC
Input: L1, L1 seeds, L2
Choose dictionary
Google as default
• Google dictionary (25 language pairs, limited API)
• Google translate (1225 language pairs, only 1 translation)
Option: edit translations
BootCaT 2 corpora
Bilingual word sketches
Bilingual word sketches (very first pass)
For L1 nodeword n
For each of its translations n1, n2, …
• For each collocate c in word sketch
• For each of its translations c1, c2, …
• Does ci occur as collocate in word sketch for ni?
• If yes: output <c; ni, ci>
• Add L1 and L2 example sentences
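
The nested loops above can be sketched in Python as follows. This is only an illustration under stated assumptions: word sketches are reduced to plain collections of collocates, the dictionary is a simple mapping from a word to its candidate translations, and all names and data (`bilingual_sketch`, `l1_sketch`, `l2_sketch`, `translate`) are hypothetical. The final step, adding example sentences, is omitted.

```python
def bilingual_sketch(node, l1_sketch, l2_sketch, translate):
    """Return triples (c, n_i, c_i): c collocates with node in L1,
    and its translation c_i collocates with the node's translation n_i in L2."""
    matches = []
    for n_i in translate.get(node, []):          # each translation of the nodeword
        l2_collocates = l2_sketch.get(n_i, set())
        for c in l1_sketch.get(node, []):        # each L1 collocate of the nodeword
            for c_i in translate.get(c, []):     # each translation of that collocate
                if c_i in l2_collocates:
                    matches.append((c, n_i, c_i))
    return matches

# Toy data invented for the example.
l1_sketch = {"eruption": ["volcanic", "explosive", "recent"]}
l2_sketch = {"éruption": {"volcanique", "explosive", "solaire"}}
translate = {
    "eruption": ["éruption"],
    "volcanic": ["volcanique"],
    "explosive": ["explosive", "explosif"],
    "recent": ["récent"],
}
result = bilingual_sketch("eruption", l1_sketch, l2_sketch, translate)
for triple in result:
    print(triple)
```

With this toy data, "recent" produces no triple because its only listed translation does not appear among the L2 collocates.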
Matching seeds – how?
User translates
• Yes but limited
Bilingual dictionary
• Yes but finding them??
• Google dictionary
Machine translations
Wikipedia
• Matching articles
Evaluation
Extract terms for L1, L2
Ask expert
1. Are they terms?
2. Do the L1, L2 lists contain translations of each other?
3 language pairs
• En-Fr, En-De, En-Cz
• One expert for each pair
3 domains
• Volcanoes
• Stradivarius
• Pancreatic cancer
• Wikipedia: En and De only
Results
In brief
• Words good
• Multiwords bad
Unithood and termhood
To find terms
For multiwords only
• Does it hang together?
• Unithood
Is it distinctive?
• Keywords
• Termhood
We didn’t use termhood for multiwords but need to
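
One way to put a number on "is it distinctive?" is to compare an item's normalised frequency in the domain corpus with its frequency in a general reference corpus. The sketch below uses a smoothed ratio of per-million frequencies; the slides do not say which statistic was used, so the formula, the `smoothing` value and all counts here are assumptions made for illustration.

```python
def per_million(count, corpus_size):
    """Normalise a raw count to occurrences per million tokens."""
    return count * 1_000_000 / corpus_size

def keyness(focus_count, focus_size, ref_count, ref_size, smoothing=1.0):
    """Smoothed ratio of per-million frequencies: high values mean 'distinctive'."""
    fpm_focus = per_million(focus_count, focus_size)
    fpm_ref = per_million(ref_count, ref_size)
    # The smoothing constant avoids division by zero for items absent from the reference.
    return (fpm_focus + smoothing) / (fpm_ref + smoothing)

# Invented counts: (count in domain corpus, count in reference corpus).
focus_size, ref_size = 200_000, 100_000_000
candidates = {
    "volcanic ash": (120, 300),
    "last year": (40, 60_000),
}
scores = {
    term: keyness(f, focus_size, r, ref_size)
    for term, (f, r) in candidates.items()
}
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

In this toy example both multiwords hang together, but only "volcanic ash" scores well above 1, which is the kind of filtering a termhood measure for multiwords would add.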
Next steps
Termhood for multiwords
WebBootCaT from Wikipedia
From collocations to terms
More-than-2-word collocations
• … deadline next Tuesday
Thank you
http://www.sketchengine.co.uk