Comparable Corpora BootCaT (CCBC) (or: In Praise of BootCaT)
Adam Kilgarriff, Jan Pomikalek, Avinesh PVS
Lexical Computing Ltd.
Work supported by EU FP7 Project PRESEMT
Just-in-time corpora
Krista Varantola
Translators, terminologists
In-domain terminology: Domain dictionaries
• Don’t exist
• Out of date
• Not accessible
Collect in-domain web pages
Instant corpus
BootCaT (Bootstrapping Corpora and Terms)
Baroni and Bernardini 2004
User: input ‘seed terms’
Send 3-at-a-time to a search engine
• Returns search hits page
Retrieve those pages
A corpus!
• Cleaning, deduplicating, linguistic processing
Extract terms
• Can use extracted terms as seeds, iterate
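The "3-at-a-time" step can be sketched as below. This is only an illustration of how seed terms might be combined into search queries, not the BootCaT tool's actual code: the function names `seed_tuples` and `to_query`, the fixed random seed, and the query limit are all assumptions, and no search engine is contacted. The seed terms are borrowed from the deck's own volcanology example.

```python
import itertools
import random


def seed_tuples(seeds, tuple_size=3, max_queries=10, rng=None):
    """Return up to max_queries distinct random combinations of seed terms.

    Hypothetical helper: the real tool may select tuples differently.
    """
    rng = rng or random.Random(0)  # fixed seed so the sketch is repeatable
    combos = list(itertools.combinations(seeds, tuple_size))
    rng.shuffle(combos)
    return combos[:max_queries]


def to_query(terms):
    """Join terms into one query string, quoting multi-word terms as phrases."""
    return " ".join(f'"{t}"' if " " in t else t for t in terms)


seeds = ["volcanology", "volcanic eruption", "tephra", "magma", "seismographs"]
queries = [to_query(t) for t in seed_tuples(seeds, max_queries=4)]
for q in queries:
    print(q)
```

Each printed line is one query to submit; the pages behind the returned hits would then be downloaded, cleaned and deduplicated to form the corpus.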
Very successful
Widely used
More implementations
SkE has WebBootCaT, web front end
Secret: piggybacks on search engines
They do the donkey-work
• on-domain, text-rich pages, no spam, …
Also use for
General language corpus
Long list of general seed words
Pioneer: Sharoff
LCL: Corpus Factory
‘Varieties of Learner English’
General English, same queries except
• Region=UK, US, Canada, Aus, China, Japan, Korea
Validation under way
Sketch Engine
Corpus query tool, since 2003
Widely used by lexicographers
Commercial
• OUP, CUP, Collins, Macmillan, Le Robert, Cornelsen, Shogakukan
National dictionary projects
• Bulgaria, Czech Republic, Estonia, Netherlands, Slovakia, Slovenia
Universities
Linguistics, language research, NLP, language teaching
44 languages and counting
Large corpora ready-to-use for
Arabic Bengali Bulgarian Chinese Czech Croatian Danish Dutch English Estonian Finnish French German Greek Gujarati Hebrew Hindi Indonesian Irish Italian Japanese Korean Latin Malay Malayalam Norwegian Persian Polish Portuguese Romanian Russian Serbian Setswana Slovak Slovene Spanish Swahili Swedish Tamil Telugu Thai Turkish Urdu Vietnamese
Handles large corpora
Largest to date: 8 billion words
Fast
Web-based: no software to install
Build ‘instant corpora’ from the web
Load your own corpus
Quota of space on SkE server
Word sketches
One-page, automatic accounts of a word’s grammatical and collocational behaviour
Free 30-day trial: sketchengine.co.uk
WebBootCaT
BootCaT integrated in SkE
BootCaT a corpus
Clean, de-dupe, POS-tag, then
Load into Sketch Engine
Observation
Specialist domain, L1
Specialist domain, L2
Matching terminology
Going multilingual
Translate seeds
English: volcanology volcanologist "volcanic eruption" seismographs Eyjafjallajokull geodic "deformation monitoring" tephra magma stratigraphic tephrochronology geochronological "volcanic ash" ablation rhyolitic
French: vulcanologue volcanologie "éruption volcanique" sismographes Eyjafjallajokull "surveillance de la déformation" géodiques tephra magma téphrochronologie stratigraphique géochronologiques "de cendres volcaniques" ablation rhyolitiques
Thanks again Google
BootCaT for French
CCBC
Input: L1, L1 seeds, L2
Choose dictionary
Google as default
• Google dictionary (25 language pairs, limited API)
• Google translate (1225 language pairs, only 1 translation per term)
Option: edit translations
BootCaT 2 corpora
Bilingual word sketches
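The seed-translation step, including the option to edit translations by hand, might look like the sketch below. It is an assumption-laden illustration: a small in-memory Python dict stands in for the dictionary service (no Google API is called), and `translate_seeds` and its `overrides` parameter are hypothetical names. The English and French terms are taken from the volcanology seed lists in this deck.

```python
def translate_seeds(seeds, dictionary, overrides=None):
    """Translate L1 seeds into L2, letting manual edits take precedence.

    Returns (translated_seeds, seeds_without_translation).
    """
    overrides = overrides or {}
    translated, missing = [], []
    for seed in seeds:
        # A user-edited translation wins over the dictionary entry.
        target = overrides.get(seed) or dictionary.get(seed)
        if target is None:
            missing.append(seed)
        else:
            translated.append(target)
    return translated, missing


# Toy English-to-French dictionary standing in for the real lookup service.
en_fr = {
    "volcanology": "volcanologie",
    "volcanic eruption": "éruption volcanique",
    "seismographs": "sismographes",
    "volcanic ash": "de cendres volcaniques",
}

seeds = ["volcanology", "volcanic eruption", "volcanic ash", "tephrochronology"]
fr_seeds, missing = translate_seeds(
    seeds, en_fr, overrides={"volcanic ash": "cendres volcaniques"}
)
print(fr_seeds)
print(missing)
```

Seeds that come back in `missing` would need a manual translation before they can be used; the translated list then feeds a second BootCaT run for L2.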
Bilingual word sketches (very first pass)
For L1 nodeword n
For each of its translations n1, n2, …
• For each collocate c in the word sketch for n
• For each of its translations c1, c2, …
• Does cj occur as a collocate in the word sketch for ni?
• If yes: output <c; ni, cj>
• Add L1 and L2 example sentences
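The first-pass procedure above can be sketched as nested loops. This is a minimal illustration under stated assumptions, not the Sketch Engine implementation: each word sketch is reduced to a flat list of collocates (grammatical relations are ignored), the bilingual dictionary is a plain dict from an L1 word to its L2 translations, example sentences are left out, and the function name and toy data are invented for the example.

```python
def bilingual_sketch(node, sketch_l1, sketch_l2, dictionary):
    """Return (c, n_i, c_j) triples where collocate c of the L1 node word
    has a translation c_j that is a collocate of the node's translation n_i.
    """
    results = []
    for n_i in dictionary.get(node, []):           # translations of the node word
        l2_collocates = set(sketch_l2.get(n_i, []))
        for c in sketch_l1.get(node, []):          # L1 collocates of the node word
            for c_j in dictionary.get(c, []):      # translations of the collocate
                if c_j in l2_collocates:
                    results.append((c, n_i, c_j))
    return results


# Toy data, invented for illustration only.
dictionary = {
    "eruption": ["éruption"],
    "volcanic": ["volcanique"],
    "violent": ["violent", "violente"],
    "sudden": ["soudain"],
}
sketch_en = {"eruption": ["volcanic", "violent", "sudden"]}
sketch_fr = {"éruption": ["volcanique", "violente", "explosive"]}

pairs = bilingual_sketch("eruption", sketch_en, sketch_fr, dictionary)
for c, n_i, c_j in pairs:
    print(c, n_i, c_j)
```

With this toy data, "sudden" produces no output because its only listed translation is not a collocate of "éruption" in the L2 sketch, which is where threshold and dictionary choices start to matter.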
Notes
Grammatical relations
Used to find collocations
Then thrown away
Thresholds: what is “in a word sketch”
Which dictionary
Issue: as for seeds
Live (just)