Using collocations from comparable corpora to find translation equivalents
Serge Sharoff, Bogdan Babych, Anthony Hartley
Centre for Translation Studies, University of Leeds
{s.sharoff,b.babych,a.hartley}@leeds.ac.uk
25 May 2006 S.Sharoff, B.Babych, A.Hartley. Using Collocations from Comparable Corpora
Outline
1. Introduction
   – Problems in finding translation equivalents
   – Corpora and tools used
2. Methodology for finding translation equivalents
   – Basic steps for collocations
   – Construction of the MWE database
   – An extension for single words
3. Results
   – Evaluation
   – Legitimate translation variation
   – Future work
What are translation equivalents?
• Terminology: ignitron=Ignitron=игнитрон
• General lexicon: исчерпывающий ответ=irrefragable answer
• strong: 57 subentries in Oxford Russian Dictionary (ORD), but no strong feeling, field, opposition, sense, voice
• Parallel corpora are not always available: strong voice: 16 in Europarl vs. 46 in the BNC
• Comparable corpora for terminology: (Dagan, Church, 1997; Bennison, et al, 2000), but not for words from the general lexicon
• Comparable corpora for translators: absolutely vs. assolutamente, but not a procedure for finding equivalents
The problems we address
– Hospital admission can prove a particularly daunting experience.
– I did all the cleaning, cooking and kept his books in order, which was no mean feat.
• The problem of finding a bridge between two comparable corpora
• Main steps
  1. Generalising source contexts in SL
  2. Translating generalisations using bilingual MRDs and generalising them
  3. Filtering suggestions down to what occurs in TL
Corpora and tools used
• Databases of multiword expressions
• IMS Corpus Workbench (Christ, Evert)
• Distributional similarity classes (Rapp)
• Oxford Russian Dictionary from OUP

Corpus                                                                      Size (MW)  Time frame
The British National Corpus                                                 100        1970-1992
A corpus of major British newspapers                                        200        2004
The English Internet Corpus, a random snapshot from the English Internet    160        2005
The Russian National Corpus, a representative corpus comparable to the BNC  100        1970-2004
A corpus of major Russian newspapers                                        70         2002-2004
The Russian Internet corpus, a random snapshot from the Russian Internet    160        2005
Step 1: Generalising contexts
• Distributional similarity list: $\Theta(s_0) = s_1, \ldots, s_N$
• Simcluster $S(s_0)$ (words with intersecting similarity lists): $\forall w:\ w \in S(s_0) \Leftrightarrow w \in \Theta(s_0) \wedge w \in \bigcup_i \Theta(s_i)$
  – strong ~ powerful, weak, strength, potent, heavy, good, overwhelming, intense, robust, tough, weaken, compelling, fierce
  – experience ~ knowledge, opportunity, life, encounter, skill, feeling, reality, sensation, dream, vision, learning, perception, learn
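
The simcluster definition can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the similarity lists in `theta` are invented toy data (not the Rapp similarity classes), the function name `simcluster` is hypothetical, and the union is taken over the other members of the similarity list of s0.

```python
from typing import Dict, List, Set


def simcluster(s0: str, theta: Dict[str, List[str]]) -> Set[str]:
    """Keep w from Theta(s0) if w also occurs in Theta(si) for another si in Theta(s0)."""
    neighbours = theta.get(s0, [])
    cluster: Set[str] = set()
    for w in neighbours:
        for si in neighbours:
            # Assumption: a word's own similarity list does not count as support.
            if si != w and w in theta.get(si, []):
                cluster.add(w)
                break
    return cluster


# Toy similarity lists, invented for illustration only.
theta = {
    "strong": ["powerful", "weak", "potent", "table"],
    "powerful": ["strong", "potent", "robust"],
    "weak": ["strong", "powerful", "feeble"],
    "potent": ["powerful", "strong"],
    "table": ["chair", "desk"],
}

print(sorted(simcluster("strong", theta)))
```

In this toy data "table" is dropped because no other neighbour of "strong" lists it, which is the intended effect of requiring intersecting similarity lists.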
Step 2: Translating generalisations
• Full translation class: $T_F = S(T(S(s_0)))$
• Reduced translation class: $T_R = S(T(s_0)) + T(S(s_0))$
  – опыт (experience; experiment) ~ ability, acquire, aptitude, capability, capacity, competence, courage, evidence, experience, experiment, expertise, feasibility, flair, hypothesis, ingenuity, intelligence, investigation, knowledge, laboratory, learning, method, opportunity, perception, qualification, rat, research, skill, stamina, statistical, strength, study, talent, technique, test, training, vision.
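
A minimal sketch of the two translation classes, assuming that S applied to a set returns each word together with its simcluster, that T is a plain dictionary lookup, and that "+" denotes set union. The helper names (`expand`, `translate`, `full_class`, `reduced_class`) and the toy data are hypothetical and not taken from the slides.

```python
from typing import Dict, Iterable, Set

Sim = Dict[str, Set[str]]  # word -> simcluster within one language
Lex = Dict[str, Set[str]]  # source word -> dictionary translations


def expand(words: Iterable[str], sim: Sim) -> Set[str]:
    """S(.) over a set: every word plus its simcluster (assumption: the seed is kept)."""
    out: Set[str] = set()
    for w in words:
        out.add(w)
        out |= sim.get(w, set())
    return out


def translate(words: Iterable[str], lex: Lex) -> Set[str]:
    """T(.) over a set: union of the dictionary translations of every word."""
    out: Set[str] = set()
    for w in words:
        out |= lex.get(w, set())
    return out


def full_class(s0: str, sim_sl: Sim, sim_tl: Sim, lex: Lex) -> Set[str]:
    # T_F = S(T(S(s0)))
    return expand(translate(expand([s0], sim_sl), lex), sim_tl)


def reduced_class(s0: str, sim_sl: Sim, sim_tl: Sim, lex: Lex) -> Set[str]:
    # T_R = S(T(s0)) + T(S(s0)), with "+" read as set union
    return expand(translate([s0], lex), sim_tl) | translate(expand([s0], sim_sl), lex)


# Toy data, invented for illustration only.
sim_sl = {"опыт": {"знание"}}
lex = {"опыт": {"experience", "experiment"}, "знание": {"knowledge"}}
sim_tl = {"experience": {"skill"}, "experiment": {"test"}, "knowledge": {"learning"}}

print(sorted(reduced_class("опыт", sim_sl, sim_tl, lex)))
print(sorted(full_class("опыт", sim_sl, sim_tl, lex)))
```

Under these assumptions the reduced class is always a subset of the full class, because the full class additionally expands the translations of the source simcluster.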
Step 3: finding MWEs in TL
• Cartesian product of translation classes produced for words in the query
• Filtering them against MWEs really occurring in corpora
• четкая программа (‘precise programme’) ~
  – clear idea (486)
  – detailed plan (247)
  – right idea (123)
  – detailed proposal (112)
  – detailed work (109)
  – detailed research (108)
  – clear policy (88)
  – clear strategy (83)
  – clear plan (70)
  – right policy (64)
  – right strategy (52)
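
The product-and-filter step could look roughly like the following. It is a sketch, not the authors' implementation: the function name `candidate_mwes` is hypothetical, the two translation classes are cut down to three words each, and the numbers in `mwe_freq` are the bracketed figures from the list above, treated here as corpus frequencies.

```python
from itertools import product
from typing import Dict, List, Sequence, Set, Tuple

Mwe = Tuple[str, ...]


def candidate_mwes(classes: Sequence[Set[str]],
                   mwe_freq: Dict[Mwe, int]) -> List[Tuple[Mwe, int]]:
    """Combine one word from each class and keep only combinations attested in the MWE database."""
    hits = []
    for combo in product(*classes):
        freq = mwe_freq.get(combo, 0)
        if freq > 0:
            hits.append((combo, freq))
    # Most frequent first; ties broken alphabetically for a stable order.
    return sorted(hits, key=lambda h: (-h[1], h[0]))


# Reduced toy classes for the two words of the query (illustration only).
classes = [{"clear", "detailed", "right"}, {"idea", "plan", "policy"}]
mwe_freq = {
    ("clear", "idea"): 486,
    ("detailed", "plan"): 247,
    ("right", "idea"): 123,
    ("clear", "policy"): 88,
    ("clear", "plan"): 70,
    ("right", "policy"): 64,
}

for mwe, freq in candidate_mwes(classes, mwe_freq):
    print(" ".join(mwe), freq)
```

Combinations such as "detailed idea" are generated by the product but disappear because they are absent from the toy database.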
Building the MWE database
• permissive vs. prudent filtering (Manning, Schütze, 1999)
• weapon~NN of~IN mass~JJ,
• Filter: ~IN ~JJ$

                                  British News         Russian News
no of words                       217,394,039          77,625,002
no of types                       877,566              433,391
REs in filter                     25                   18
N-gram types that pass RE filter
  2-grams                         6,361,596            5,457,848
  3-grams                         14,306,653           11,092,908
  4-grams                         19,668,956           11,514,626
N-gram types that pass frq > 1
  2-grams                         2,176,849 (34.2%)    1,786,171 (32.7%)
  3-grams                         2,869,617 (20.1%)    1,756,200 (15.8%)
  4-grams                         2,100,598 (10.7%)    924,626 (8.0%)
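
One possible reading of the two filtering stages is sketched below. The assumptions are significant: the regular expressions are taken to be reject patterns over "word~TAG" strings, the slide's abbreviated filter "~IN ~JJ$" is spelled out as a preposition followed by a final adjective, and the names `REJECT` and `build_mwe_db` are hypothetical. The real filter set (25 and 18 expressions) is not shown in the slides.

```python
import re
from collections import Counter
from typing import Iterable

# Assumed reject pattern: an n-gram ending in preposition + adjective,
# e.g. "weapon~NN of~IN mass~JJ" (an incomplete expression).
REJECT = [re.compile(r"~IN \S+~JJ$")]


def build_mwe_db(ngrams: Iterable[str], min_freq: int = 2) -> Counter:
    """Count tagged n-grams that match no reject pattern, then keep those with frq > 1."""
    counts = Counter(
        ng for ng in ngrams if not any(p.search(ng) for p in REJECT)
    )
    return Counter({ng: c for ng, c in counts.items() if c >= min_freq})


# Toy stream of tagged n-grams, invented for illustration only.
ngrams = (
    ["weapon~NN of~IN mass~JJ"] * 3
    + ["mass~JJ destruction~NN"] * 2
    + ["strong~JJ voice~NN"]
)

print(build_mwe_db(ngrams))
```

Here the first n-gram is rejected by the pattern despite occurring three times, and the last one is dropped by the frequency threshold.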
An extension for single words
1. Produce two sets of 5 best LL collocates for the immediate left and right contexts of the search expression
2. Produce $T_R$ classes for the search expression and its best collocates
3. Combine $T_R$ classes separately for the left and right context
4. Intersect the set of right collocates in the left class with the set of left collocates in the right class
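
Step 1 of this procedure, ranking immediate neighbours by log-likelihood (LL), can be sketched as follows. The sketch assumes that LL refers to Dunning's G2 statistic over a 2x2 contingency table of adjacent word pairs; the function names and the toy sentence are hypothetical, and steps 2 to 4 are not covered.

```python
import math
from collections import Counter
from typing import List, Sequence, Tuple


def log_likelihood(k11: int, k12: int, k21: int, k22: int) -> float:
    """G2 = 2 * sum(O * ln(O / E)) over the four cells of a 2x2 table."""
    n = k11 + k12 + k21 + k22
    r1, r2 = k11 + k12, k21 + k22
    c1, c2 = k11 + k21, k12 + k22

    def term(obs: int, row: int, col: int) -> float:
        # Empty cells contribute nothing (limit of x * ln x as x -> 0).
        return 0.0 if obs == 0 else obs * math.log(obs * n / (row * col))

    return 2.0 * (term(k11, r1, c1) + term(k12, r1, c2)
                  + term(k21, r2, c1) + term(k22, r2, c2))


def best_collocates(tokens: Sequence[str], node: str,
                    side: str = "left", top: int = 5) -> List[Tuple[str, float]]:
    """Rank the immediate left (or right) neighbours of `node` by G2."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    total = sum(bigrams.values())
    first, second = Counter(), Counter()
    for (a, b), c in bigrams.items():
        first[a] += c
        second[b] += c
    scored = []
    for (a, b), k11 in bigrams.items():
        if side == "left" and b == node:
            colloc = a
        elif side == "right" and a == node:
            colloc = b
        else:
            continue
        k12 = first[a] - k11   # a followed by something other than b
        k21 = second[b] - k11  # b preceded by something other than a
        k22 = total - k11 - k12 - k21
        scored.append((colloc, log_likelihood(k11, k12, k21, k22)))
    return sorted(scored, key=lambda x: -x[1])[:top]


# Toy text, invented for illustration only.
tokens = "a daunting experience and a daunting task and a good experience".split()
print(best_collocates(tokens, "experience", side="left"))
```

On a real corpus the counts would come from the corpus index rather than from an in-memory token list.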
Questionnaire
Problem example: daunting experience, as in Hospital admission can prove a particularly daunting experience.

Translation suggestions    Score
безрадостная ситуация
волнующая возможность
мрачное впечатление
тягостное чувство
устрашающий опыт

Your suggestion? (optional)
The scoring system
5 = The suggestion is an appropriate translation as it is.
4 = The suggestion can be used with some minor amendment (e.g. by turning a verb into a participle)
3 = The suggestion is useful as a hint for another, appropriate translation (e.g. suggestion elated cannot be used, but its close synonym exhilarated can)
2 = The suggestion is not useful, even though it is still in the same domain (e.g. fear is proposed for a problem referring to hatred)
1 = The suggestion is totally irrelevant
Equivalents for unseen cases
– Patrick West recently claimed that Britain’s extravagant mourning for Princess Diana and Holly and Jessica was ‘recreational grief’. Maybe we also suffer from recreational fear.
• спортивный интерес (lit. ‘sports interest’, leisure interest)
• Some translators see more solutions in a context
• Not a competition with dictionaries, but solutions for genuinely difficult cases
Future work
• Disambiguation of simclasses:
  – union ~ federation, strike, trade, worker, soviet, employer, organization, miner, communist, russia, republic, cosatu, confederation
• ASSIST semantic classes (232 categories):
  – I1.1- = Money: lack; (bankrupt, beggar, impoverished, unpaid)
  – A5.1- = Evaluation: bad (abject, abysmal, bastard, crap)
• Finding clusters for language pairs
• Methods from EBMT/SMT