Machine translation Context-based approach Lucia Otoyo.

Machine translation: Context-based approach

Lucia Otoyo

Machine translation

Computerized task of translating from one natural language to another

• Human vs. machine translation
• Difficulties of MT

Brief history of MT

• 17th century: Descartes and Leibniz
• 1930s: bilingual dictionary plus rules
• After the war: Warren Weaver proposes treating translation as decoding a message
• 1954: first public demonstration of MT (Georgetown-IBM experiment; spawned research)
• 1966: ALPAC report finds MT less accurate and more costly
• 1980s: increasing demand, rule-based approach born
• 1990s: parallel corpora approach

MT approaches

Rule based

Parallel corpora based

Context based

Conclusion

Rule Based approach

• Dominant in the 1980s
• Resources: a set of rules and a bilingual dictionary
• Steps:
  - Syntax: grammar
  - Semantics: meaning
  - Pragmatics: differences between languages
• Disadvantages:
  - language experts are needed to write the rules
  - a new language pair needs new rules
  - it is not possible to include all the rules
  - rules have exceptions

Parallel corpora based

• Example based (word frequency and combination)
• Statistical (phrase extraction and combination)
• Resources: parallel corpora (pre-translated), decoder, alignment software
• Steps: disassemble the text into phrases, search the corpora and match phrases, substitute, align phrases to form the text
• Advantages vs. disadvantages:
  - easy to apply to a new language
  - more readable, as it uses human pre-translated text
  - general translation vs. specific domain
  - lexical ambiguity
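
The steps above can be sketched as a greedy longest-match lookup in a phrase table. This is only an illustration under stated assumptions, not a real example-based or statistical system: the function name `translate_by_phrases` and the entries of `demo_table` are hypothetical, and there is no scoring, alignment model or reordering.

```python
def translate_by_phrases(text, phrase_table, max_len=3):
    """Translate text by replacing the longest known phrases, left to right."""
    words = text.split()
    out, i = [], 0
    while i < len(words):
        # Try the longest phrase starting at position i first.
        for size in range(min(max_len, len(words) - i), 0, -1):
            phrase = " ".join(words[i:i + size])
            if phrase in phrase_table:
                out.append(phrase_table[phrase])
                i += size
                break
        else:
            # No phrase matched: copy the unknown word through unchanged.
            out.append(words[i])
            i += 1
    return " ".join(out)


# Hypothetical phrase pairs, as if extracted from a parallel corpus.
demo_table = {
    "machine translation": "traducción automática",
    "is": "es",
    "very interesting": "muy interesante",
}
print(translate_by_phrases("machine translation is very interesting", demo_table))
```

Because "machine translation" is matched as one unit, the target word order comes from the stored phrase rather than from word-by-word substitution.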

Context Based MT

[MT diagram: CBMT system architecture. Labels recovered from the figure:]

• Input and output: Source Language, Target Language
• Components: N-gram segmenter, N-gram builder, Flooder, Edge Locker, Synonym generator, Overlap-based decoder, N-gram Connector
• Data stores: Cache database, Cross-language n-gram database
• Data flows: N-gram candidates, Substitution request, Stored n-gram pairs, Approved n-gram pairs
• Resources: Bilingual dictionary, Target corpora, Source corpora, Gazetteers

CBMT edge illustration

‘This context based machine translation approach looks very interesting’.

1. ‘This context based machine’

2. ‘context based machine translation’

3. ‘based machine translation approach’

4. ‘machine translation approach looks’

5. ‘translation approach looks very’

6. ‘approach looks very interesting’

edge locking

CBMT n-grams

Break down the source text into n-grams (n = 4 to 8):

‘This context based machine translation approach looks very interesting’.

• If n = 4, the n-grams are as follows:

1. ‘This context based machine’

2. ‘context based machine translation’

3. ‘based machine translation approach’

4. ‘machine translation approach looks’

5. ‘translation approach looks very’

6. ‘approach looks very interesting’
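
The segmentation above is a sliding window over the words of the sentence. A minimal sketch, assuming whitespace tokenisation (the function name `word_ngrams` is illustrative, not from the slides):

```python
def word_ngrams(text, n=4):
    """Return every run of n consecutive words in text, in order."""
    words = text.split()
    # A text with fewer than n words yields an empty list.
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


sentence = "This context based machine translation approach looks very interesting"
for number, gram in enumerate(word_ngrams(sentence, 4), start=1):
    print(number, gram)
```

A sentence of 9 words with n = 4 gives 9 - 4 + 1 = 6 n-grams, matching the list above.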

CBMT Flooding

• Search the monolingual corpora with the translated n-grams

• Produces a large number of candidate n-grams, with different translations for each word

• Words can be in any order, taking into account differences between languages

• Each n-gram yields 100-3000 high-density matches
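
A rough sketch of flooding, under several assumptions: the dictionary is a plain mapping from a source word to a set of target words, the corpus is a list of tokens, and "density" is taken to be the share of source words that have at least one candidate translation inside a window. The names `flood`, `toy_dictionary` and `toy_corpus` are hypothetical, and the toy data stands in for a corpus of many gigabytes.

```python
def flood(source_ngram, dictionary, corpus_tokens, window=None, min_density=0.75):
    """Find target-corpus windows dense in candidate translations.

    Word order inside a window is ignored, so reordering between
    languages does not prevent a match.
    """
    candidates = [dictionary.get(word, set()) for word in source_ngram]
    size = window or len(source_ngram)
    matches = []
    for start in range(len(corpus_tokens) - size + 1):
        span = corpus_tokens[start:start + size]
        span_words = set(span)
        # Count source words with at least one translation in this window.
        covered = sum(1 for options in candidates if options & span_words)
        density = covered / len(source_ngram)
        if density >= min_density:
            matches.append((density, " ".join(span)))
    return sorted(matches, reverse=True)


# Hypothetical Spanish-to-English entries; each word has several translations.
toy_dictionary = {
    "enfoque": {"approach", "focus"},
    "de": {"of", "from"},
    "traducción": {"translation"},
    "automática": {"automatic", "machine"},
}
toy_corpus = "this machine translation approach looks very interesting".split()
hits = flood(["enfoque", "de", "traducción", "automática"],
             toy_dictionary, toy_corpus, window=3)
print(hits)
```

The only window that passes the threshold is "machine translation approach": three of the four source words are covered, in a different order from the source, and the function word "de" has no counterpart.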

CBMT Target language lattice overlap maximization

• Align all the n-grams with each other
• Choose the ones with the highest number of left- and right-side overlaps
• Eliminate non-overlapping or partially overlapping n-grams

• 1. n-gram ‘This approach for computer’
• 2. n-gram ‘This context based machine’
• 3. n-gram ‘based machine translation approach’
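
The selection can be sketched by counting, for each candidate, how many other candidates overlap it on the left and on the right, and dropping those with no overlap at all. This is a simplified reading of the slide, not the actual decoder: the helper names `overlap_len` and `rank_by_overlap` are hypothetical, and an overlap here means that the last words of one n-gram equal the first words of another.

```python
def overlap_len(left, right):
    """Length of the longest proper suffix of left that is a prefix of right."""
    for k in range(min(len(left), len(right)) - 1, 0, -1):
        if left[-k:] == right[:k]:
            return k
    return 0


def rank_by_overlap(ngrams):
    """Score each n-gram by its left and right neighbours; drop isolated ones."""
    tokens = [gram.split() for gram in ngrams]
    scored = []
    for i, gram in enumerate(tokens):
        left = sum(1 for j, other in enumerate(tokens)
                   if j != i and overlap_len(other, gram))
        right = sum(1 for j, other in enumerate(tokens)
                    if j != i and overlap_len(gram, other))
        if left + right > 0:
            scored.append((left + right, ngrams[i]))
    # Highest number of overlaps first; ties keep their input order.
    return sorted(scored, key=lambda item: -item[0])


candidates = [
    "This approach for computer",
    "This context based machine",
    "based machine translation approach",
]
kept = rank_by_overlap(candidates)
print(kept)
```

With the three n-grams from the slide, the second and third overlap on "based machine" and are kept, while ‘This approach for computer’ overlaps with nothing and is eliminated.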

CBMT Cross language database

• Stores cross-language n-gram correspondences for later use
• Speeds up the translation process
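
In effect this is a cache keyed by the source n-gram. A minimal in-memory sketch, assuming a hypothetical `translate` callable that stands in for the slow path (flooding plus overlap decoding); the slides do not describe the real storage format:

```python
class CrossLanguageCache:
    """Remember source-to-target n-gram pairs so each is computed only once."""

    def __init__(self, translate):
        self._translate = translate  # slow path, called only on a miss
        self._pairs = {}
        self.misses = 0

    def lookup(self, source_ngram):
        key = tuple(source_ngram)
        if key not in self._pairs:
            self.misses += 1
            self._pairs[key] = self._translate(source_ngram)
        return self._pairs[key]


# Placeholder for the expensive translation step, for illustration only.
def slow_translate(words):
    return [word.upper() for word in words]


cache = CrossLanguageCache(slow_translate)
first = cache.lookup(["traducción", "automática"])
second = cache.lookup(["traducción", "automática"])
print(first == second, cache.misses)
```

The second lookup is answered from the stored pair, so the slow path runs only once.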

CBMT target language

• Find globally longest target language overlap with the highest match density

1. ‘This context based machine’

2. ‘context based machine translation’

3. ‘based machine translation approach’

4. ‘machine translation approach looks’

5. ‘translation approach looks very’

6. ‘approach looks very interesting’

‘This context based machine translation approach looks very interesting’.
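
Once a chain of overlapping n-grams has been chosen, the output sentence is obtained by merging each n-gram with the words it shares with the text built so far. The sketch below shows only this final joining step, assuming the n-grams are already selected and in order; the search for the globally longest, densest chain is not shown, and the name `stitch` is illustrative.

```python
def stitch(ngrams):
    """Join ordered n-grams, merging each with its overlap with the text so far."""
    words = ngrams[0].split()
    for gram in ngrams[1:]:
        nxt = gram.split()
        # Find the longest suffix of the text that is a prefix of the next n-gram.
        k = min(len(words), len(nxt))
        while k > 0 and words[-k:] != nxt[:k]:
            k -= 1
        words.extend(nxt[k:])
    return " ".join(words)


chain = [
    "This context based machine",
    "context based machine translation",
    "based machine translation approach",
    "machine translation approach looks",
    "translation approach looks very",
    "approach looks very interesting",
]
print(stitch(chain))
```

Each step shares three words with the text so far and contributes one new word, which reproduces the full sentence above.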

CBMT – synonymy

• Word and phrasal synonymy
  - increases accuracy if no overlaps or only partial overlaps are found
  - dynamic synonyms, no predefined coded patterns

Stages:

1. Search for the word in the corpus (1,000-100,000 context-related phrases)

• 1. ‘This establishment was founded in the year’

• 2. ‘The number of people working in the establishment is far greater than’

• 3. ‘The establishment is the first hotel’, etc

CBMT – synonymy cont.

2. Search the corpus with only the surrounding phrases, leaving the word blank:

• 1. ‘This ________ was founded in the year’

• 2. ‘The number of people working in the _______ is far greater than’

• 3. ‘The ________ is the first hotel’, etc

3. This may return:

• 1. ‘This company was founded in the year’

• 2. ‘The number of people working in the business is far greater than’

• 3. ‘The institution is the first hotel’, etc

4. Rank synonyms according to various criteria and flood
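
The four stages can be sketched as follows. The slide does not name the ranking criteria, so ranking by the number of shared contexts is an assumption, as are the function name `find_synonyms`, the fixed context width of two words, and the six-sentence corpus built from the examples above.

```python
from collections import Counter


def find_synonyms(word, sentences, width=2):
    """Propose words that occur in the same left and right contexts as word."""
    tokenised = [sentence.split() for sentence in sentences]

    def context(tokens, i):
        return (tuple(tokens[max(0, i - width):i]),
                tuple(tokens[i + 1:i + 1 + width]))

    # Stage 1: collect the contexts in which the word occurs.
    contexts = {context(tokens, i)
                for tokens in tokenised
                for i, token in enumerate(tokens) if token == word}
    # Stages 2 and 3: search with the word blanked out; record what fills the gap.
    fillers = Counter()
    for tokens in tokenised:
        for i, token in enumerate(tokens):
            if token != word and context(tokens, i) in contexts:
                fillers[token] += 1
    # Stage 4: rank, here simply by number of shared contexts (an assumption).
    return fillers.most_common()


corpus = [
    "This establishment was founded in the year",
    "The number of people working in the establishment is far greater than",
    "The establishment is the first hotel",
    "This company was founded in the year",
    "The number of people working in the business is far greater than",
    "The institution is the first hotel",
]
print(find_synonyms("establishment", corpus))
```

On this toy corpus the candidates are "company", "business" and "institution", each found in one shared context.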

CBMT Edge locking

• First and last words are confirmed by overlap only once or a few times

• Search for other source sentences where the first and last words of the original n-gram also appear in the middle of a newly found n-gram

• This confirms their suitability within a particular context

• Also used for words around interior punctuation
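
One possible reading of the check, as a sketch only: an edge word counts as confirmed when some other n-gram contains it, next to the same neighbouring word, at an interior position. The function name `edge_is_confirmed` and the two-word pairing are assumptions; the slides do not give the exact criterion.

```python
def edge_is_confirmed(ngram, others, edge="last"):
    """True if the edge word of ngram, together with its neighbour,
    occurs in the interior of one of the other n-grams."""
    words = ngram.split()
    pair = tuple(words[-2:]) if edge == "last" else tuple(words[:2])
    for other in others:
        tokens = other.split()
        for i in range(len(tokens) - 1):
            if tuple(tokens[i:i + 2]) == pair:
                # Position of the edge word inside the other n-gram.
                position = i + 1 if edge == "last" else i
                if 0 < position < len(tokens) - 1:
                    return True
    return False


target = "approach looks very interesting"
print(edge_is_confirmed(target, ["looks very interesting to researchers"]))
print(edge_is_confirmed(target, ["approach looks very interesting"]))
```

In the first call "interesting" sits in the middle of the other n-gram, so the edge is confirmed; in the second it is still the final word, so it is not.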

CBMT Target corpora

• monolingual

• Very large (50GB – 1 TB)

• The bigger the corpus, the more accurate the translation

• Easy to obtain from the web

CBMT Bilingual dictionary

• Very large
• The bigger the dictionary, the more accurate the translation
• Usually widely available for most languages
• Used to translate the n-grams
• Large number of n-grams
• Different translations for each word
• Words can be in any order, taking into account differences between languages
• Each n-gram yields 100-3000 high-density matches

Conclusion

• Can we?

– Create a universal foundation for all languages

– Eliminate the need for human translators

– Solve the biggest obstacle in MT: ambiguity

It does not seem so in the foreseeable future