Machine Translation, Language Divergence and Lexical Resources
![Page 1: Machine Translation, Language Divergence and Lexical Resources](https://reader036.fdocuments.us/reader036/viewer/2022081506/568150fc550346895dbf1b4c/html5/thumbnails/1.jpg)
Machine Translation, Language Divergence and Lexical Resources
Pushpak Bhattacharyya
Computer Science and Engineering Department, IIT Bombay
![Page 2: Machine Translation, Language Divergence and Lexical Resources](https://reader036.fdocuments.us/reader036/viewer/2022081506/568150fc550346895dbf1b4c/html5/thumbnails/2.jpg)
Acknowledgement
• NLP-AI members, CSE Dept, IIT Bombay.
![Page 3: Machine Translation, Language Divergence and Lexical Resources](https://reader036.fdocuments.us/reader036/viewer/2022081506/568150fc550346895dbf1b4c/html5/thumbnails/3.jpg)
What is MT
Conversion of source language text to target language text
Document in L1 → Computer Program → Document in L2
![Page 4: Machine Translation, Language Divergence and Lexical Resources](https://reader036.fdocuments.us/reader036/viewer/2022081506/568150fc550346895dbf1b4c/html5/thumbnails/4.jpg)
Kinds of MT Systems (How Much Human Participation)
• Fully Automatic
• Semi Automatic
– Human Aided MT (HAMT)
  • Pre-editing
  • Post-editing
– Machine Aided HT (MAHT)
  • On-line Dictionaries
  • Terminology Data Banks
  • Translation Memories
![Page 5: Machine Translation, Language Divergence and Lexical Resources](https://reader036.fdocuments.us/reader036/viewer/2022081506/568150fc550346895dbf1b4c/html5/thumbnails/5.jpg)
Kinds of MT Systems (Domain Coverage)
• General Purpose (SYSTRAN in Europe)
• Domain Specific (TAUM-METEO in Canada; translates weather reports between French and English)
![Page 6: Machine Translation, Language Divergence and Lexical Resources](https://reader036.fdocuments.us/reader036/viewer/2022081506/568150fc550346895dbf1b4c/html5/thumbnails/6.jpg)
Kinds of MT Systems (point of entry from source to the target text)
Figure: levels of analysis, the representation built at each level, and the translation routes that cross from source to target.
• Levels (deepest first)
– Deep understanding level
– Interlingual level
– Logico-semantic level
– Syntactico-functional level
– Morpho-syntactic level
– Syntagmatic level
– Graphemic level
• Translation routes
– Direct translation
– Semi-direct translation
– Syntactic transfer (surface)
– Syntactic transfer (deep)
– Semantic transfer
– Conceptual transfer
– Multilevel transfer
• Representations
– Ontological interlingua
– Semantico-linguistic interlingua
– SPA-structures (semantic & predicate-argument)
– F-structures (functional)
– C-structures (constituent)
– Tagged text
– Text
– Mixing levels: multilevel description
![Page 7: Machine Translation, Language Divergence and Lexical Resources](https://reader036.fdocuments.us/reader036/viewer/2022081506/568150fc550346895dbf1b4c/html5/thumbnails/7.jpg)
Why is MT Difficult? Classical NLP Problems
• Ambiguity
– Lexical
– Structural
• Ellipsis
• Co-reference
– Anaphora
– Hypernymic
![Page 8: Machine Translation, Language Divergence and Lexical Resources](https://reader036.fdocuments.us/reader036/viewer/2022081506/568150fc550346895dbf1b4c/html5/thumbnails/8.jpg)
Why is MT Difficult? Language Divergence
• Lexico-Semantic Divergence
• Structural Divergence
![Page 9: Machine Translation, Language Divergence and Lexical Resources](https://reader036.fdocuments.us/reader036/viewer/2022081506/568150fc550346895dbf1b4c/html5/thumbnails/9.jpg)
Language Divergence (English to Hindi: Noun to Adjective)
• The demands on sportsmen today can lead to burnout at an early age.
(noun: the state of being extremely tired or ill, either physically or mentally, because you have worked too hard)
• खिलाड़ियों से जो आज अपेक्षाएं हैं, वे उन्हें कम उम्र में ही अक्रियाशील कर सकती हैं।
![Page 10: Machine Translation, Language Divergence and Lexical Resources](https://reader036.fdocuments.us/reader036/viewer/2022081506/568150fc550346895dbf1b4c/html5/thumbnails/10.jpg)
Language Divergence (English to Hindi: Noun to Verb)
• Every concert they gave us was a sell-out.
(an event for which all the tickets have been sold)
• उनके हर संगीत-कार्यक्रम के सभी टिकट बिक गए थे।
![Page 11: Machine Translation, Language Divergence and Lexical Resources](https://reader036.fdocuments.us/reader036/viewer/2022081506/568150fc550346895dbf1b4c/html5/thumbnails/11.jpg)
Language Divergence (English to Hindi: Adjective to Adverb)
• The children watched in wide-eyed amazement.
(with eyes fully open because of fear, great surprise, etc.)
• बच्चे आश्चर्य से आँखें फाड़े देख रहे थे।
![Page 12: Machine Translation, Language Divergence and Lexical Resources](https://reader036.fdocuments.us/reader036/viewer/2022081506/568150fc550346895dbf1b4c/html5/thumbnails/12.jpg)
Language Divergence (English to Hindi: Adjective to Verb)
• He was in a bad mood at breakfast and wasn't very communicative.
(able and willing to talk and give information to other people)
• नाश्ते के समय वह खराब मूड में था और ज्यादा बातचीत नहीं कर रहा था।
![Page 13: Machine Translation, Language Divergence and Lexical Resources](https://reader036.fdocuments.us/reader036/viewer/2022081506/568150fc550346895dbf1b4c/html5/thumbnails/13.jpg)
Language Divergence (English to Hindi: Preposition to Adverb)
• It gets cooler toward evening. (near a point in time)
• शाम होते-होते ठंडक बढ़ जाती है।
![Page 14: Machine Translation, Language Divergence and Lexical Resources](https://reader036.fdocuments.us/reader036/viewer/2022081506/568150fc550346895dbf1b4c/html5/thumbnails/14.jpg)
Language Divergence (English to Hindi: Idiomatic Usage)
• Given her interest in children, teaching seems the right job for her.
(when you consider something)
• बच्चों के प्रति (में) उसकी दिलचस्पी देखते हुए, अध्यापन उसके लिए उचित लगता है।
![Page 15: Machine Translation, Language Divergence and Lexical Resources](https://reader036.fdocuments.us/reader036/viewer/2022081506/568150fc550346895dbf1b4c/html5/thumbnails/15.jpg)
Language Divergence (Marathi-Hindi-English: case marking and postpositions transfer: works!)
प्रथम ताख्यात
• वर्तमान (simple present)
– तो जातो.
– वह जाता है।
– He goes.
• स्थिरसत्य (universal truth)
– पृथ्वी सूर्याभोवती फिरते.
– पृथ्वी सूर्य के चारों ओर घूमती है।
– The earth revolves round the sun.
![Page 16: Machine Translation, Language Divergence and Lexical Resources](https://reader036.fdocuments.us/reader036/viewer/2022081506/568150fc550346895dbf1b4c/html5/thumbnails/16.jpg)
Language Divergence (Marathi-Hindi-English: case marking and postpositions: works again!)
• ऐतिहासिक सत्य (historical truth)
– कृष्ण अर्जुनास सांगतो...
– कृष्ण अर्जुन से कहते हैं...
– Krushna says to Arjuna…
• अवतरण (quoting)
– दामले म्हणतात, ...
– दामले कहते हैं, ...
– Damle says, ...
![Page 17: Machine Translation, Language Divergence and Lexical Resources](https://reader036.fdocuments.us/reader036/viewer/2022081506/568150fc550346895dbf1b4c/html5/thumbnails/17.jpg)
Language Divergence (Marathi-Hindi-English: case marking and postpositions: does not work!)
• संनिहित भूत (immediate past)
– कधी आलास? हा येतो इतकाच!
– कब आये? बस अभी आया।
– When did you come? Just now (I came).
• निःसंशय भविष्य (certainty in future)
– आता तो मार खातो खास!
– अब वह मार खायेगा ही!
– He is in for a thrashing.
• आश्वासन (assurance)
– मी तुम्हाला उद्या भेटतो.
– मैं आप से कल मिलता हूँ।
– I will see you tomorrow.
![Page 18: Machine Translation, Language Divergence and Lexical Resources](https://reader036.fdocuments.us/reader036/viewer/2022081506/568150fc550346895dbf1b4c/html5/thumbnails/18.jpg)
Language Divergence Theory: Lexico-Semantic Divergences
• Conflational divergence
• Structural divergence
• Categorial divergence
• Head swapping divergence
• Lexical divergence
![Page 19: Machine Translation, Language Divergence and Lexical Resources](https://reader036.fdocuments.us/reader036/viewer/2022081506/568150fc550346895dbf1b4c/html5/thumbnails/19.jpg)
Language Divergence Theory: Syntactic Divergences
• Constituent Order divergence
• Adjunction Divergence
• Preposition-Stranding divergence
• Null Subject Divergence
• Pleonastic Divergence
![Page 20: Machine Translation, Language Divergence and Lexical Resources](https://reader036.fdocuments.us/reader036/viewer/2022081506/568150fc550346895dbf1b4c/html5/thumbnails/20.jpg)
MT Approaches
Vauquois Triangle:
• Interlingua Based
• Transfer Based
• Direct
![Page 21: Machine Translation, Language Divergence and Lexical Resources](https://reader036.fdocuments.us/reader036/viewer/2022081506/568150fc550346895dbf1b4c/html5/thumbnails/21.jpg)
Interlingua Methodology
• Directly obtain the meaning of the source sentence.
• Generate the target sentence from the meaning representation.
Example: John gave the book to Mary.
Meaning representation:
give-action:
  agent: John
  object: the book
  receiver: Mary
ATLAS system at Fujitsu: precursor to the worldwide project on UNL
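The interlingua idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `GiveFrame` class, the toy analyser and the two generators are hypothetical names, not the ATLAS or UNL representation, and the verb-final output is an English gloss of the word order rather than real target-language text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GiveFrame:
    """Language-neutral meaning of a 'give' event (hypothetical structure)."""
    agent: str
    obj: str
    receiver: str


def analyse_english(sentence: str) -> GiveFrame:
    # Toy analyser: handles only "<agent> gave <object> to <receiver>."
    words = sentence.rstrip(".").split()
    verb_at = words.index("gave")
    to_at = words.index("to")
    return GiveFrame(
        agent=" ".join(words[:verb_at]),
        obj=" ".join(words[verb_at + 1:to_at]),
        receiver=" ".join(words[to_at + 1:]),
    )


def generate_svo(frame: GiveFrame) -> str:
    # Generator for a subject-verb-object language (English).
    return f"{frame.agent} gave {frame.obj} to {frame.receiver}."


def generate_verb_final_gloss(frame: GiveFrame) -> str:
    # English gloss of a verb-final ordering with a postposition on the receiver.
    return f"{frame.agent} {frame.receiver} to {frame.obj} gave."


frame = analyse_english("John gave the book to Mary.")
print(frame)
print(generate_svo(frame))
print(generate_verb_final_gloss(frame))
```

The point of the design is that each generator reads only the frame, never the source sentence, so adding a language means adding one analyser and one generator rather than one module per language pair.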
![Page 22: Machine Translation, Language Divergence and Lexical Resources](https://reader036.fdocuments.us/reader036/viewer/2022081506/568150fc550346895dbf1b4c/html5/thumbnails/22.jpg)
Competing approaches
Direct
Transfer based
![Page 23: Machine Translation, Language Divergence and Lexical Resources](https://reader036.fdocuments.us/reader036/viewer/2022081506/568150fc550346895dbf1b4c/html5/thumbnails/23.jpg)
Direct Approach
Source: I like mangoes
1. Word replacement: maOM AcCa laga Aama (I like (root) mangoes)
2. Morphology: maOM AcCa lagata Aama (I like mangoes)
3. Syntactic re-arrangement: maOM Aama AcCa lagata hO (I mangoes like)
4. Idiomatization: mauJao Aama AcCa lagata hO (I (dative) mangoes like)
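The four stages of the direct approach can be chained as a small pipeline. The sketch below is a toy that covers only this one sentence: the romanised forms are copied from the slide, while the function names and the three-word lexicon are assumptions made for illustration.

```python
# Toy word-for-word lexicon (romanised forms copied from the slide).
LEXICON = {"I": "maOM", "like": "AcCa laga", "mangoes": "Aama"}


def replace_words(words):
    # Stage 1: look up each source word independently.
    return [LEXICON[w] for w in words]


def apply_morphology(words):
    # Stage 2: inflect the verb root.
    return ["AcCa lagata" if w == "AcCa laga" else w for w in words]


def rearrange(words):
    # Stage 3: subject-verb-object becomes subject-object-verb plus auxiliary.
    subject, verb, obj = words
    return [subject, obj, verb, "hO"]


def idiomatize(words):
    # Stage 4: this verb takes a dative subject in the target language.
    return ["mauJao" if w == "maOM" else w for w in words]


def direct_translate(sentence):
    words = sentence.split()
    for stage in (replace_words, apply_morphology, rearrange, idiomatize):
        words = stage(words)
    return " ".join(words)


print(direct_translate("I like mangoes"))
```

No stage builds a parse tree or a meaning representation; every rule is tied to specific words of one language pair, which is why the direct approach is hard to extend.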
![Page 24: Machine Translation, Language Divergence and Lexical Resources](https://reader036.fdocuments.us/reader036/viewer/2022081506/568150fc550346895dbf1b4c/html5/thumbnails/24.jpg)
Transfer Based
Source sentence processed for parsing, chunking etc.
[S [NP I] [VP [V like] [NP mangoes]]]
![Page 25: Machine Translation, Language Divergence and Lexical Resources](https://reader036.fdocuments.us/reader036/viewer/2022081506/568150fc550346895dbf1b4c/html5/thumbnails/25.jpg)
Transfer Based
Transfer structures obtained for the target sentence.
[S [NP I] [VP [NP mangoes] [V like]]]
![Page 26: Machine Translation, Language Divergence and Lexical Resources](https://reader036.fdocuments.us/reader036/viewer/2022081506/568150fc550346895dbf1b4c/html5/thumbnails/26.jpg)
Transfer Based
Morphology and language specific modifications
[S [NP mauJao] [VP [NP Aama] [V AcCa lagataa hO]]]
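The three transfer-based steps shown on these slides (parse the source, restructure the tree, then apply target morphology and lexical choices) can be sketched as follows. The tuple encoding of trees, the single reordering rule and the `TARGET_WORDS` table are assumptions for illustration; the source parse is written by hand rather than produced by a parser, and the dative subject is folded into the lexical table for brevity.

```python
# A tree is (label, children) for internal nodes and (label, word) for leaves.
SOURCE_TREE = ("S", [("NP", "I"), ("VP", [("V", "like"), ("NP", "mangoes")])])

# Target forms after morphology and language-specific changes (from the slide).
TARGET_WORDS = {"I": "mauJao", "like": "AcCa lagataa hO", "mangoes": "Aama"}


def transfer(tree):
    """Apply one structural rule: VP -> V NP becomes VP -> NP V."""
    label, body = tree
    if isinstance(body, str):
        return tree
    children = [transfer(child) for child in body]
    if label == "VP" and [child[0] for child in children] == ["V", "NP"]:
        children.reverse()
    return (label, children)


def lexicalize(tree):
    """Replace each source leaf with its target-language form."""
    label, body = tree
    if isinstance(body, str):
        return (label, TARGET_WORDS[body])
    return (label, [lexicalize(child) for child in body])


def leaves(tree):
    """Read the sentence off the tree, left to right."""
    label, body = tree
    if isinstance(body, str):
        return [body]
    return [word for child in body for word in leaves(child)]


target_tree = lexicalize(transfer(SOURCE_TREE))
print(" ".join(leaves(target_tree)))
```

Unlike the direct approach, the reordering rule here refers to syntactic categories, so it applies to any verb phrase of that shape and not just to this sentence.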
![Page 27: Machine Translation, Language Divergence and Lexical Resources](https://reader036.fdocuments.us/reader036/viewer/2022081506/568150fc550346895dbf1b4c/html5/thumbnails/27.jpg)
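Relation Between the Transfer and the Interlingua Models
Figure: triangle relating the two models.
• Source side (going up): source language words → (parsing) → source language parse tree → (interpretation) → interlingua
• Target side (going down): interlingua → (generation) → target language parse tree → (generation) → target language words
• Transfer: source language parse tree → target language parse tree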
![Page 28: Machine Translation, Language Divergence and Lexical Resources](https://reader036.fdocuments.us/reader036/viewer/2022081506/568150fc550346895dbf1b4c/html5/thumbnails/28.jpg)
State of Affairs
• Systran reports 19 different language pairs.
• Only 8 are adequate for their intended use.
• Even fewer are capable of quality written or spoken text translation.
![Page 29: Machine Translation, Language Divergence and Lexical Resources](https://reader036.fdocuments.us/reader036/viewer/2022081506/568150fc550346895dbf1b4c/html5/thumbnails/29.jpg)
Notable Systems in India
• Anusaaraka (IITK and IIIT Hyderabad: information access; one of the earliest systems)
• Angla-Hindi (IITK: transfer based)
• Shakti and Shiva (IIIT Hyderabad: use of simple modules to create complex and high-level performance)
• UNL based system (IIT Bombay, part of the UN effort: emphasis on semantics)
• Hindi-Tamil system (AU-KBC, Chennai: based on the approach at IIIT Hyderabad)
![Page 30: Machine Translation, Language Divergence and Lexical Resources](https://reader036.fdocuments.us/reader036/viewer/2022081506/568150fc550346895dbf1b4c/html5/thumbnails/30.jpg)
Semantics: use of Lexical Resources
• WordNet
• Word Sense Disambiguation
![Page 31: Machine Translation, Language Divergence and Lexical Resources](https://reader036.fdocuments.us/reader036/viewer/2022081506/568150fc550346895dbf1b4c/html5/thumbnails/31.jpg)
Wordnet
• A lexical knowledgebase based on conceptual lookup
• Organizing concepts in a semantic network.
• Organize lexical information in terms of word meaning, rather than word form
• Wordnet can also be used as a thesaurus.
![Page 32: Machine Translation, Language Divergence and Lexical Resources](https://reader036.fdocuments.us/reader036/viewer/2022081506/568150fc550346895dbf1b4c/html5/thumbnails/32.jpg)
Lexical Matrix
![Page 33: Machine Translation, Language Divergence and Lexical Resources](https://reader036.fdocuments.us/reader036/viewer/2022081506/568150fc550346895dbf1b4c/html5/thumbnails/33.jpg)
The Structure of Hindi Wordnet
• 30,000 unique words
• 13,000 synsets
• Wordnet Relations
1. Lexical Relations (between word forms)
Synonymy
Antonymy
2. Semantic Relations (between word meanings)
Hyponymy/Hypernymy
Meronymy/Holonymy
Entailment/Troponymy
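A minimal sketch of how the two kinds of relation differ in a data structure: lexical relations attach to individual word forms, while semantic relations link whole synsets. The `Synset` class and the English toy entries are assumptions for illustration and are not taken from the Hindi Wordnet or its API.

```python
from dataclasses import dataclass, field


@dataclass
class Synset:
    words: list                 # synonymous word forms (synonymy)
    gloss: str                  # short definition of the shared meaning
    hypernyms: list = field(default_factory=list)  # semantic relation: links to Synsets


# Toy English entries, invented for the example.
animal = Synset(["animal"], "a living organism that can move")
lion = Synset(["lion", "king of beasts"], "a large cat", hypernyms=[animal])

# Antonymy is a lexical relation: it holds between word forms, not synsets.
ANTONYMS = {"hot": "cold", "cold": "hot"}


def synonyms(word, synsets):
    """Other word forms that share a synset with `word`."""
    return sorted({w for s in synsets if word in s.words for w in s.words if w != word})


def hypernym_chain(synset):
    """Follow the first hypernym link upward until a root is reached."""
    chain = []
    while synset.hypernyms:
        synset = synset.hypernyms[0]
        chain.append(synset.words[0])
    return chain


print(synonyms("lion", [animal, lion]))
print(hypernym_chain(lion))
```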
![Page 34: Machine Translation, Language Divergence and Lexical Resources](https://reader036.fdocuments.us/reader036/viewer/2022081506/568150fc550346895dbf1b4c/html5/thumbnails/34.jpg)
A small part of Hindi Wordnet
![Page 35: Machine Translation, Language Divergence and Lexical Resources](https://reader036.fdocuments.us/reader036/viewer/2022081506/568150fc550346895dbf1b4c/html5/thumbnails/35.jpg)
Hindi WordNet APIs
Figure: API functions layered over the Hindi data.
• findtheinfo
• getindex
• in_wn
• index_lookup
• read_synset
• free_synset
• free_index
• morphstr
![Page 36: Machine Translation, Language Divergence and Lexical Resources](https://reader036.fdocuments.us/reader036/viewer/2022081506/568150fc550346895dbf1b4c/html5/thumbnails/36.jpg)
The Hindi WSD System
![Page 37: Machine Translation, Language Divergence and Lexical Resources](https://reader036.fdocuments.us/reader036/viewer/2022081506/568150fc550346895dbf1b4c/html5/thumbnails/37.jpg)
Approach to WSD
Figure: the Hindi Document yields the Context Bag, the Hindi Wordnet yields the Semantic Bag, and the two bags are compared by Intersection Similarity.
![Page 38: Machine Translation, Language Divergence and Lexical Resources](https://reader036.fdocuments.us/reader036/viewer/2022081506/568150fc550346895dbf1b4c/html5/thumbnails/38.jpg)
WSD Algorithm
1. For a polysemous word w needing disambiguation, a set of context words in its surrounding window is collected. Let this collection be C, the context bag. The window is the current sentence and the preceding and the following sentences.
2. For each sense s of w, do the following. Let B be the bag of words obtained from the
   1. Synonyms in the synsets
   2. Glosses of the synsets
   3. Example sentences of the synsets
   4. Hypernyms (recursively up to the roots)
   5. Glosses of hypernyms
   6. Example sentences of hypernyms
![Page 39: Machine Translation, Language Divergence and Lexical Resources](https://reader036.fdocuments.us/reader036/viewer/2022081506/568150fc550346895dbf1b4c/html5/thumbnails/39.jpg)
WSD Algorithm (continued)
   7. Hyponyms (recursively up to the leaves)
   8. Glosses of hyponyms
   9. Example sentences of hyponyms
   10. Meronyms (recursively up to the beginner synset)
   11. Glosses of meronyms
   12. Example sentences of meronyms
3. Measure the overlap between C and B using intersection similarity.
4. Output as the winner the sense that has the maximum overlap similarity value.
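The overlap computation at the heart of this algorithm can be sketched as below. Several things are assumptions: the slides do not define intersection similarity precisely, so the raw count of shared words is used; each sense's semantic bag is given as one pre-pooled string instead of being gathered from Wordnet relations; and the English senses of "chair" are toy data, not Hindi Wordnet entries.

```python
def bag(text):
    """Turn a stretch of text into a bag (here: a set) of lower-cased words."""
    return set(text.lower().split())


def intersection_similarity(context_bag, semantic_bag):
    # Assumed definition: number of words the two bags share.
    return len(context_bag & semantic_bag)


def disambiguate(word, context, senses):
    """Return (winning sense, scores) for `word` in `context`.

    `senses` maps a sense name to text pooled from that sense's synonyms,
    glosses, example sentences, hypernyms and so on.
    """
    context_bag = bag(context) - {word}
    scores = {
        name: intersection_similarity(context_bag, bag(text))
        for name, text in senses.items()
    }
    return max(scores, key=scores.get), scores


# Toy sense inventory, invented for the example.
SENSES = {
    "furniture": "seat furniture sit legs back",
    "person": "chairperson president meeting preside committee officer",
}

winner, scores = disambiguate(
    "chair",
    "the committee elected a new chair to preside over the meeting",
    SENSES,
)
print(winner, scores)
```

With these toy bags the context shares three words with the "person" sense and none with the "furniture" sense, so "person" wins.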
![Page 40: Machine Translation, Language Divergence and Lexical Resources](https://reader036.fdocuments.us/reader036/viewer/2022081506/568150fc550346895dbf1b4c/html5/thumbnails/40.jpg)
Evaluation
• Only Nouns
• Test corpora from CIIL, Mysore.
• Corpus from 8 domains, each containing around 2000 words on an average.
![Page 41: Machine Translation, Language Divergence and Lexical Resources](https://reader036.fdocuments.us/reader036/viewer/2022081506/568150fc550346895dbf1b4c/html5/thumbnails/41.jpg)
![Page 42: Machine Translation, Language Divergence and Lexical Resources](https://reader036.fdocuments.us/reader036/viewer/2022081506/568150fc550346895dbf1b4c/html5/thumbnails/42.jpg)
![Page 43: Machine Translation, Language Divergence and Lexical Resources](https://reader036.fdocuments.us/reader036/viewer/2022081506/568150fc550346895dbf1b4c/html5/thumbnails/43.jpg)
Result: Accuracy
Figure: bar chart of percentage of accuracy (axis from 0 to 80) by domain: Agriculture, Science and Sociology, Sociology, Short Story, Mass Media, Children Literature, History, Science.
![Page 44: Machine Translation, Language Divergence and Lexical Resources](https://reader036.fdocuments.us/reader036/viewer/2022081506/568150fc550346895dbf1b4c/html5/thumbnails/44.jpg)
Conclusions (Knowledge Based MT)
• Language Divergence is the bottleneck
• Not only for languages from distant families (English-Japanese)
• But also for siblings within a family (Hindi-Marathi)
• Solution lies in creating and exploiting knowledge structures
![Page 45: Machine Translation, Language Divergence and Lexical Resources](https://reader036.fdocuments.us/reader036/viewer/2022081506/568150fc550346895dbf1b4c/html5/thumbnails/45.jpg)
Conclusions (Statistical MT)
• Complementary (not really competing) approach
• Example: IBM approach to translation from/to English and other languages (French, Chinese, and currently Hindi)
• Needs vast amounts of aligned text corpora
• Basic idea is to maximize P(T|S) over all target sentences T: needs language modeling (P(T)) and translation modeling (P(S|T))
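The reason both models are needed follows from Bayes' rule. Since P(S) is fixed for a given source sentence S, it drops out of the maximization:

```latex
\hat{T} \;=\; \arg\max_{T} P(T \mid S)
        \;=\; \arg\max_{T} \frac{P(T)\, P(S \mid T)}{P(S)}
        \;=\; \arg\max_{T} P(T)\, P(S \mid T)
```

Here P(T) is the language model, which scores how fluent the candidate T is, and P(S|T) is the translation model, which scores how well T accounts for S.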
![Page 46: Machine Translation, Language Divergence and Lexical Resources](https://reader036.fdocuments.us/reader036/viewer/2022081506/568150fc550346895dbf1b4c/html5/thumbnails/46.jpg)
Pre Editing
The inspection team appointed by the United Nations visited Iraq early July, 2003.
The <cnp> inspection team </cnp> {which was} appointed by the <org> United Nations </org> visited Iraq {in} early <date>July, 2003</date>.
![Page 47: Machine Translation, Language Divergence and Lexical Resources](https://reader036.fdocuments.us/reader036/viewer/2022081506/568150fc550346895dbf1b4c/html5/thumbnails/47.jpg)
Post Editing
(I want to eat well today)
• Machine output: maOM Aaja AcCa Kanaa caahta hUM
• Post-edited: mauJao Aaja AcCa Kanaa caaihe
![Page 48: Machine Translation, Language Divergence and Lexical Resources](https://reader036.fdocuments.us/reader036/viewer/2022081506/568150fc550346895dbf1b4c/html5/thumbnails/48.jpg)
Terminology DB and Translation Memory
• Special lexicon containing the domain terms and their translations
– Nuclear Energy: AaNaivak }jaa-
• Memories of previous translations
– Apply fragments of previous translations to new translation situations
Available
– He bought a pen: ]snanao ek klama KrIda
– All ministers have huge houses: saBaI pMtaoMko pasa bahut baDo Gar hOM
New
– He bought a huge house: ]snanao ek bahut baDa Gar KrIda
![Page 49: Machine Translation, Language Divergence and Lexical Resources](https://reader036.fdocuments.us/reader036/viewer/2022081506/568150fc550346895dbf1b4c/html5/thumbnails/49.jpg)
Pitfall of Translation Memory
• German: Ein Messer ist im Schrank; er mißt Elektrizität.
• TM1: Ein Messer ist im Schrank -> A meter is in the cabinet.
• TM2: er mißt Elektrizität -> It measures electricity.
• New situation: Ein Messer ist im Schrank; er ist sehr scharf.
• A meter is in the cabinet; it is very sharp (?).
• Messer in German: Meter/Knife in English.
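The pitfall can be reproduced with a toy segment-level memory. The dictionary and function below are hypothetical; the first two entries mirror TM1 and TM2 (with standardized spelling), and the third entry is an assumed earlier translation added so that the new sentence finds a match.

```python
# Toy translation memory: stored segment -> stored translation.
TRANSLATION_MEMORY = {
    "ein messer ist im schrank": "A meter is in the cabinet",   # TM1
    "er misst elektrizität": "it measures electricity",         # TM2
    "er ist sehr scharf": "it is very sharp",                   # assumed entry
}


def translate_with_memory(sentence):
    """Translate clause by clause using exact matches only."""
    segments = [s.strip().rstrip(".").lower() for s in sentence.split(";")]
    return "; ".join(
        TRANSLATION_MEMORY.get(segment, f"<no match: {segment}>")
        for segment in segments
    )


print(translate_with_memory("Ein Messer ist im Schrank; er ist sehr scharf."))
```

Each clause is matched in isolation, so the stored "meter" reading of Messer is reused even though the second clause calls for "knife"; the memory has no way to see that the context has changed.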
![Page 50: Machine Translation, Language Divergence and Lexical Resources](https://reader036.fdocuments.us/reader036/viewer/2022081506/568150fc550346895dbf1b4c/html5/thumbnails/50.jpg)
Ambiguity
Chair
![Page 51: Machine Translation, Language Divergence and Lexical Resources](https://reader036.fdocuments.us/reader036/viewer/2022081506/568150fc550346895dbf1b4c/html5/thumbnails/51.jpg)
Co-reference Resolution
• Pronoun
– Sequence of commands to a robot:
  • Place the wrench on the table.
  • Then paint it.
– What does "it" refer to? (anaphora: back reference)
– Learning of his intentions, Shivaji went to meet Afjal Khan, prepared with concealed weapons.
– Who does "his" refer to? (cataphora: forward reference)
• Hypernymic
– Children love to see lions. These animals, however, are becoming extinct.
![Page 52: Machine Translation, Language Divergence and Lexical Resources](https://reader036.fdocuments.us/reader036/viewer/2022081506/568150fc550346895dbf1b4c/html5/thumbnails/52.jpg)
Ellipsis
Sequence of commands to the robot:
Move the table to the corner.
Also the chair.
The second command needs completing by using the first part of the previous command.