Using collocations from comparable corpora to find translation equivalents

Serge Sharoff, Bogdan Babych, Anthony Hartley

Centre for Translation Studies
University of Leeds

{s.sharoff,b.babych,a.hartley}@leeds.ac.uk

25 May 2006 S.Sharoff, B.Babych, A.Hartley. Using Collocations from Comparable Corpora

Outline
1. Introduction
   – Problems in finding translation equivalents
   – Corpora and tools used
2. Methodology for finding translation equivalents
   – Basic steps for collocations
   – Construction of the MWE database
   – An extension for single words
3. Results
   – Evaluation
   – Legitimate translation variation
   – Future work

What are translation equivalents?
• Terminology: ignitron = Ignitron = игнитрон
• General lexicon: исчерпывающий ответ = irrefragable answer
• strong: 57 subentries in Oxford Russian Dictionary (ORD), but no strong feeling, field, opposition, sense, voice
• Parallel corpora are not always available: strong voice: 16 in Europarl vs. 46 in the BNC
• Comparable corpora for terminology: (Dagan, Church, 1997; Bennison, et al, 2000), but not for words from the general lexicon
• Comparable corpora for translators: absolutely vs. assolutamente, but not a procedure for finding equivalents

The problems we address
– Hospital admission can prove a particularly daunting experience.
– I did all the cleaning, cooking and kept his books in order, which was no mean feat.
• The problem of finding a bridge between two comparable corpora
• Main steps
  1. Generalising source contexts in SL
  2. Translating generalisations using bilingual MRDs and generalising them
  3. Filtering suggestions down to what occurs in TL

Corpora and tools used
• Databases of multiword expressions
• IMS Corpus Workbench (Christ, Evert)
• Distributional similarity classes (Rapp)
• Oxford Russian Dictionary from OUP

Corpus                                                                        Size (MW)   Time frame
The British National Corpus                                                   100         1970-1992
A corpus of major British newspapers                                          200         2004
The English Internet Corpus, a random snapshot from the English Internet      160         2005
The Russian National Corpus, a representative corpus comparable to the BNC    100         1970-2004
A corpus of major Russian newspapers                                          70          2002-2004
The Russian Internet corpus, a random snapshot from the Russian Internet      160         2005

Step 1: Generalising contexts
• Distributional similarity list: $\Theta(s_0) = s_1, \ldots, s_N$
• Simcluster $S(s_0)$ (words with intersecting similarity lists): $\forall w:\ w \in S(s_0) \Leftrightarrow w \in \Theta(s_0) \wedge w \in \bigcup_i \Theta(s_i)$
  – strong ~ powerful, weak, strength, potent, heavy, good, overwhelming, intense, robust, tough, weaken, compelling, fierce
  – experience ~ knowledge, opportunity, life, encounter, skill, feeling, reality, sensation, dream, vision, learning, perception, learn
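
The simcluster definition can be sketched in Python as follows. This is only a minimal illustration: it assumes the similarity lists are available as a dictionary from each word to its list of neighbours, and the function name `simcluster` and the toy data are hypothetical rather than taken from the system described on the slides.

```python
from typing import Dict, List, Set


def simcluster(s0: str, theta: Dict[str, List[str]]) -> Set[str]:
    """Return S(s0): words in Theta(s0) that also occur in the
    similarity list of at least one member of Theta(s0)."""
    neighbours = theta.get(s0, [])
    # Union of the similarity lists Theta(s1) ... Theta(sN).
    union_of_lists: Set[str] = set()
    for si in neighbours:
        union_of_lists.update(theta.get(si, []))
    # Keep only neighbours confirmed by some other similarity list.
    return {w for w in neighbours if w in union_of_lists}


# Toy similarity lists, invented for illustration only.
toy_theta = {
    "strong": ["powerful", "weak", "tall"],
    "powerful": ["strong", "weak", "potent"],
    "weak": ["strong", "powerful"],
    "tall": ["short", "high"],
}

cluster = simcluster("strong", toy_theta)
print(sorted(cluster))
```

In the toy data, "tall" is dropped because no other neighbour of "strong" lists it, which is the intended effect of requiring intersecting similarity lists.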

Step 2: Translating generalisations
• Full translation class: $T_F = S(T(S(s_0)))$
• Reduced translation class: $T_R = S(T(s_0)) + T(S(s_0))$
  – опыт (experience; experiment) ~ ability, acquire, aptitude, capability, capacity, competence, courage, evidence, experience, experiment, expertise, feasibility, flair, hypothesis, ingenuity, intelligence, investigation, knowledge, laboratory, learning, method, opportunity, perception, qualification, rat, research, skill, stamina, statistical, strength, study, talent, technique, test, training, vision.
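
The two translation classes can be sketched as set operations. The sketch rests on several assumptions that the slides do not spell out: T is a lookup in a bilingual dictionary, S applied to a set is the union of the members' simclusters together with the members themselves, and "+" is read as set union. All function names and the toy lexicons are hypothetical.

```python
from typing import Dict, Iterable, Set

Lexicon = Dict[str, Set[str]]


def translate(words: Iterable[str], mrd: Lexicon) -> Set[str]:
    """T(.): union of the dictionary translations of the given words."""
    out: Set[str] = set()
    for w in words:
        out |= mrd.get(w, set())
    return out


def generalise(words: Iterable[str], clusters: Lexicon) -> Set[str]:
    """S(.): each word plus its simcluster (assumption: the word
    itself is kept alongside its cluster members)."""
    out: Set[str] = set()
    for w in words:
        out.add(w)
        out |= clusters.get(w, set())
    return out


def full_class(s0: str, src: Lexicon, mrd: Lexicon, tgt: Lexicon) -> Set[str]:
    """T_F = S(T(S(s0)))."""
    return generalise(translate(generalise([s0], src), mrd), tgt)


def reduced_class(s0: str, src: Lexicon, mrd: Lexicon, tgt: Lexicon) -> Set[str]:
    """T_R = S(T(s0)) + T(S(s0)), reading '+' as set union."""
    direct = generalise(translate([s0], mrd), tgt)
    via_source = translate(generalise([s0], src), mrd)
    return direct | via_source


# Toy data, invented for illustration only.
src_clusters = {"опыт": {"знание"}}
mrd = {"опыт": {"experience", "experiment"}, "знание": {"knowledge"}}
tgt_clusters = {
    "experience": {"skill"},
    "experiment": {"test"},
    "knowledge": {"learning"},
}

t_r = reduced_class("опыт", src_clusters, mrd, tgt_clusters)
t_f = full_class("опыт", src_clusters, mrd, tgt_clusters)
print(sorted(t_r))
print(sorted(t_f))
```

Under these assumptions the reduced class is always a subset of the full class: it skips target-side generalisation for words reached only through the source simcluster ("learning" in the toy data).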

Step 3: Finding MWEs in TL
• Cartesian product of translation classes produced for words in the query
• Filtering them against MWEs really occurring in corpora
• четкая программа (‘precise programme’) ~
  – clear idea (486)
  – detailed plan (247)
  – right idea (123)
  – detailed proposal (112)
  – detailed work (109)
  – detailed research (108)
  – clear policy (88)
  – clear strategy (83)
  – clear plan (70)
  – right policy (64)
  – right strategy (52)
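
The product-and-filter step can be sketched as below. This is a hedged illustration, not the authors' implementation: it assumes the MWE database is a mapping from word tuples to corpus frequencies, and the function name `candidate_mwes` and the class contents are hypothetical. The three frequencies are the ones shown on the slide.

```python
from itertools import product
from typing import Dict, List, Sequence, Set, Tuple


def candidate_mwes(
    classes: Sequence[Set[str]],
    mwe_freq: Dict[Tuple[str, ...], int],
) -> List[Tuple[str, int]]:
    """Combine one word from each translation class and keep only the
    combinations attested in the target-language MWE database."""
    attested = []
    for combo in product(*classes):
        freq = mwe_freq.get(combo, 0)
        if freq > 0:
            attested.append((" ".join(combo), freq))
    # Most frequent first; ties broken alphabetically for determinism.
    return sorted(attested, key=lambda item: (-item[1], item[0]))


# Hypothetical subsets of the translation classes for the two query words.
adjective_class = {"clear", "detailed", "precise"}
noun_class = {"idea", "plan", "programme"}

# Frequencies for the attested pairs are copied from the slide.
toy_mwe_db = {
    ("clear", "idea"): 486,
    ("detailed", "plan"): 247,
    ("clear", "plan"): 70,
}

suggestions = candidate_mwes([adjective_class, noun_class], toy_mwe_db)
print(suggestions)
```

Of the nine combinations in the toy product, only the three attested in the database survive, which is how the literal but unattested "precise programme" would be filtered out.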

Building the MWE database
• permissive vs. prudent filtering (Manning, Schütze, 1999)
• weapon~NN of~IN mass~JJ
• Filter: ~IN ~JJ$

                               British News         Russian News
No. of words                   217,394,039          77,625,002
No. of types                   877,566              433,391
REs in filter                  25                   18
N-gram types passing the RE filter
  2-grams                      6,361,596            5,457,848
  3-grams                      14,306,653           11,092,908
  4-grams                      19,668,956           11,514,626
N-gram types passing freq > 1
  2-grams                      2,176,849 (34.2%)    1,786,171 (32.7%)
  3-grams                      2,869,617 (20.1%)    1,756,200 (15.8%)
  4-grams                      2,100,598 (10.7%)    924,626 (8.0%)
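
The two filtering stages in the table (regular-expression filter, then frequency > 1) might look roughly like this. The slide does not say whether a matching n-gram is dropped or kept, nor how the pattern is applied. The sketch assumes that patterns describe n-grams to drop and that they are matched against the tag sequence only. The helper names are hypothetical, and only the first pattern comes from the slide.

```python
import re
from collections import Counter
from typing import Iterable

# Assumed rejection patterns over tag sequences; '~IN ~JJ$' is from the slide.
REJECT_PATTERNS = [re.compile(p) for p in (r"~IN ~JJ$",)]


def tag_sequence(ngram: str) -> str:
    """'weapon~NN of~IN mass~JJ' -> '~NN ~IN ~JJ'."""
    return " ".join("~" + token.rsplit("~", 1)[1] for token in ngram.split())


def passes_filter(ngram: str) -> bool:
    """True if no rejection pattern matches the n-gram's tag sequence."""
    tags = tag_sequence(ngram)
    return not any(p.search(tags) for p in REJECT_PATTERNS)


def build_mwe_db(ngrams: Iterable[str], min_freq: int = 2) -> Counter:
    """Count n-grams that pass the RE filter, then keep those seen
    at least min_freq times (i.e. frequency > 1 by default)."""
    counts = Counter(ng for ng in ngrams if passes_filter(ng))
    return Counter({ng: c for ng, c in counts.items() if c >= min_freq})


# Toy input, invented for illustration only.
toy_ngrams = [
    "weapon~NN of~IN mass~JJ",
    "weapon~NN of~IN mass~JJ",
    "mass~JJ destruction~NN",
    "mass~JJ destruction~NN",
    "daunting~JJ experience~NN",
]

mwe_db = build_mwe_db(toy_ngrams)
print(dict(mwe_db))
```

Here the incomplete fragment "weapon of mass" is removed by the pattern even though it occurs twice, while "daunting experience" passes the pattern but falls below the frequency threshold.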

An extension for single words
1. Produce two sets of 5 best LL collocates for the immediate left and right contexts of the search expression
2. Produce TR classes for the search expression and its best collocates
3. Combine TR classes separately for the left and right context
4. Intersect the set of right collocates in the left class with the set of left collocates in the right class
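
Only the first step of the list above is sketched here, on the assumption that "LL" stands for the log-likelihood ratio commonly used for collocation scoring; the slides do not define it. The function names and the toy token list are hypothetical, and steps 2 to 4 are not covered.

```python
import math
from collections import Counter
from typing import List, Tuple


def log_likelihood(o11: int, o12: int, o21: int, o22: int) -> float:
    """Log-likelihood ratio (G2) for a 2x2 contingency table."""
    n = o11 + o12 + o21 + o22
    rows = (o11 + o12, o21 + o22)
    cols = (o11 + o21, o12 + o22)
    observed = ((o11, o12), (o21, o22))
    total = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:  # empty cells contribute nothing to the sum
                expected = rows[i] * cols[j] / n
                total += o * math.log(o / expected)
    return 2.0 * total


def best_collocates(
    tokens: List[str], target: str, side: str, k: int = 5
) -> List[Tuple[str, float]]:
    """Top-k immediate left or right neighbours of target, ranked by LL."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    n = sum(bigrams.values())
    first, second = Counter(), Counter()
    for (a, b), c in bigrams.items():
        first[a] += c
        second[b] += c
    scored = []
    for (a, b), o11 in bigrams.items():
        if side == "left" and b == target:
            colloc, o12, o21 = a, first[a] - o11, second[b] - o11
        elif side == "right" and a == target:
            colloc, o12, o21 = b, second[b] - o11, first[a] - o11
        else:
            continue
        o22 = n - o11 - o12 - o21
        scored.append((colloc, log_likelihood(o11, o12, o21, o22)))
    return sorted(scored, key=lambda item: -item[1])[:k]


# Toy corpus, invented for illustration only.
toy_tokens = "a daunting experience a daunting task a good experience".split()
left = best_collocates(toy_tokens, "experience", "left")
print(left)
```

A real run would use corpus-scale counts. With a corpus this small the scores only show the mechanics, not meaningful association strengths.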

Questionnaire
Problem example: daunting experience, as in "Hospital admission can prove a particularly daunting experience."

Translation suggestions        Score
безрадостная ситуация
волнующая возможность
мрачное впечатление
тягостное чувство
устрашающий опыт
Your suggestion? (optional)

The scoring system
5 = The suggestion is an appropriate translation as it is.
4 = The suggestion can be used with some minor amendment (e.g. by turning a verb into a participle)
3 = The suggestion is useful as a hint for another, appropriate translation (e.g. suggestion elated cannot be used, but its close synonym exhilarated can)
2 = The suggestion is not useful, even though it is still in the same domain (e.g. fear is proposed for a problem referring to hatred)
1 = The suggestion is totally irrelevant

Equivalents for unseen cases

– Patrick West recently claimed that Britain’s extravagant mourning for Princess Diana and Holly and Jessica was ‘recreational grief’. Maybe we also suffer from recreational fear.

• спортивный интерес (lit. ‘sports interest’, leisure interest)

• Some translators see more solutions in a context

• Not a competition with dictionaries, but solutions for genuinely difficult cases

Future work
• Disambiguation of simclasses:
  – union ~ federation, strike, trade, worker, soviet, employer, organization, miner, communist, russia, republic, cosatu, confederation
• ASSIST semantic classes (232 categories):
  – I1.1- = Money: lack; (bankrupt, beggar, impoverished, unpaid)
  – A5.1- = Evaluation: bad (abject, abysmal, bastard, crap)
• Finding clusters for language pairs
• Methods from EBMT/SMT