Example-based Machine Translation Pursuing Fully Structural NLP Kurohashi-lab M1 56430 Toshiaki...
Language & Knowledge Engineering Lab
Example-based Machine Translation Pursuing Fully Structural NLP
Kurohashi-lab M1 56430 Toshiaki Nakazawa
Outline
I. History of Machine Translation
II. Introduction of recent MT systems
  i. Statistical Machine Translation (SMT)
  ii. Example-based Machine Translation (EBMT)
III. Related work for EBMT
  i. Logical Form
  ii. Efficient retrieval method
IV. EBMT pursuing fully structural NLP
V. Conclusion
History of Machine Translation
[Timeline figure: 1940, 1950, 1960, 1970, 1980]
Beginning of Machine Translation
– "When I look at an article in Russian, I say: 'This is really written in English, but it has been coded in some strange symbols. I will now proceed to decode.'" [Warren Weaver, 1947]
Doldrums of MT
– MT quality did not improve despite heavy spending
MT quality improved because of the development of NLP
"Machine Translation based on analogy" is proposed [Nagao, 1981]
"Mu project" started
SMT became active [Brown et al., 1993]
Not enough quality yet…
Statistical Machine Translation (SMT)
Learns translation models statistically from a parallel corpus
Does not use any linguistic resources
Small translation unit (= "word")
Requires a large parallel corpus for highly accurate translation
田植えフェスティバル石川県輪島市で外国の大使や一般の参加者など千人あまりが急な斜面の棚田で田植えを体験する催しが行われました。
輪島市白米町には(しろよねまち)千枚田と呼ばれる(せんまいだ)大小二千百枚の棚田が急な斜面から海に向かって拡がっています。
田植え体験は農作業を通して米作りの意義などを考えていこうという地球環境平和財団の呼び掛けで開かれたもので、海外三十四ヵ国の大使や書記官、それに一般の参加者ら合わせておよそ千人が集まりました。田植えに使われた苗は去年の秋、天皇陛下が皇居で収穫された稲籾から育てたものです。
参加者たちは裸足になって水田に足を踏み入れ地元に伝わる田植え歌に合わせて慣れない手つきで苗を植えていました。
きょうの輪島市は雲が広がったもののまずまずの天気となり、出席された高円宮さまも海からの風に吹かれながら田植えに加わっていました。地球環境平和財団では今年の夏休みに全国の子どもたちを対象に草刈りや生きものの観察会を開く他、秋には稲刈体験を行なう予定にしています。
Ambassadors and diplomats from 37 countries took part in a rice planting festival on Sunday in small paddies on steep hillsides in Wajima, central Japan.
About one-thousand people gathered at the hill, where some two-thousand 100 miniature paddies, called Senmaida, stretch toward the Sea of Japan.
The event was organized by the private Foundation for Global Peace and Environment.
The rice seedlings are grown from grain harvested by the Emperor at the Imperial Palace in Tokyo last autumn.
Barefoot participants waded into the paddies to plant the seedlings by hand while singing a local folk song about the practice of rice planting.
Parallel Corpus
Basic Method for SMT
Translate by maximizing the probability:
\[
\operatorname*{arg\,max}_{E} P(E \mid J) = \operatorname*{arg\,max}_{E} P(E)\, P(J \mid E)
\]
P(E): Language Model
P(J | E): Translation Model
Learn from a parallel corpus
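The decision rule above can be illustrated with a toy decoder. This is only a minimal sketch: the candidate list and all probabilities are invented for illustration, and the names `language_model`, `translation_model`, `score`, and `decode` are hypothetical, not part of any system described in these slides. A real SMT decoder searches a very large candidate space instead of enumerating a fixed list.

```python
import math

# Toy probabilities, invented for illustration only.
# language_model[E]    ~ P(E):   how fluent the English candidate E is.
# translation_model[E] ~ P(J|E): how well E explains the Japanese input J.
language_model = {
    "at the intersection": 0.020,
    "in the intersection": 0.010,
    "intersection at the": 0.0001,
}
translation_model = {
    "at the intersection": 0.30,
    "in the intersection": 0.25,
    "intersection at the": 0.30,
}

def score(candidate):
    """Return log P(E) + log P(J|E) for one English candidate."""
    return math.log(language_model[candidate]) + math.log(translation_model[candidate])

def decode(candidates):
    """Pick the candidate E that maximizes P(E) * P(J|E)."""
    return max(candidates, key=score)

best = decode(list(language_model))
```

Working in log space avoids numeric underflow when many small probabilities are multiplied. The third candidate shows the role of the language model: it matches the translation model as well as the first, but its low P(E) rules it out.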
Translation Model
IBM Model 4 [Brown et al., 93]
Translation Model =
  (# of Japanese words which each English word generates)
  × (model for generating NULL to adjust the # of words)
  × (probability of translation from one E word to one J word)
  × (model for word order)
Overview of EBMT
[Figure: Parallel Corpus → Alignment → TMDB; Input → Translation → Output; supported by advanced NLP technologies. Example pair: 交差点 で 、 ↔ at the intersection]
Example-based Machine Translation (EBMT)
Divide the input sentence into a few parts
Find similar expressions (= examples, TMs) in the parallel corpus for each part
Combine the examples to generate the output translation
Use linguistic resources as much as possible
A larger translation unit (larger example) is better
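The three steps above (divide, find examples, combine) can be sketched in a few lines. This is a toy illustration under strong assumptions, not the system presented in these slides: the example table `EXAMPLES` is hand-written, "similar expression" is reduced to an exact longest match, and the parts are simply joined in source order with no reordering. The names `segment` and `translate` are hypothetical.

```python
# Hypothetical translation examples: source token tuple -> target text.
EXAMPLES = {
    ("交差", "点", "に", "入る", "時"): "when entering the intersection",
    ("私", "の"): "my",
    ("信号", "は"): "traffic light",
    ("青", "でした"): "was green",
    ("信号",): "signal",
}

def segment(tokens, examples):
    """Greedily split tokens into the longest parts found in examples."""
    parts, i = [], 0
    while i < len(tokens):
        for j in range(len(tokens), i, -1):  # try the longest span first
            if tuple(tokens[i:j]) in examples:
                parts.append(tuple(tokens[i:j]))
                i = j
                break
        else:
            parts.append((tokens[i],))  # unknown token: keep it as its own part
            i += 1
    return parts

def translate(tokens, examples):
    """Translate each part with its example and join the results in source order."""
    parts = segment(tokens, examples)
    return " ".join(examples.get(p, "<" + " ".join(p) + ">") for p in parts)

tokens = ["交差", "点", "に", "入る", "時", "私", "の", "信号", "は", "青", "でした"]
result = translate(tokens, EXAMPLES)
```

Preferring the longest match reflects the point that a larger example is better: `("信号", "は")` wins over the single-word entry `("信号",)`. The output keeps Japanese clause order, which is why a real system needs structural information to combine the parts.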
Flow of EBMT
Furthermore...
The translation algorithm is implicit in EBMT
→ Probabilistic Model for EBMT [Aramaki et al., 05]
Recently, the number of studies handling larger units is increasing
The difference between SMT and EBMT is becoming smaller
Most active area of study = Phrase-based SMT
SMT and EBMT will be merged (?)
Alignment method using Logical Form [Arul et al., 01]
Logical Form
– Represents the relations among the content words of a sentence as an unordered graph
  Nodes are content words
  Branches indicate underlying semantic relations
– Abstracts away language-particular aspects of a sentence
  Ex. word order, inflectional morphology, function words
[Figure: Spanish and English Logical Forms; English sentence: "Under Hyperlink Information, click the hyperlink address"]
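One way to picture an unordered graph over content words is as a set of labeled edges. The sketch below is an assumption-laden illustration, not the representation used in the cited work: the relation labels (`object`, `modifier`, `location`) and the helpers `logical_form` and `nodes` are hypothetical.

```python
def logical_form(edges):
    """Build an order-independent graph from (head, relation, dependent) triples."""
    return frozenset(edges)

def nodes(lf):
    """Collect the content-word nodes that appear in any edge."""
    return {word for head, _, dep in lf for word in (head, dep)}

# "Under Hyperlink Information, click the hyperlink address"
# Function words ("under", "the") are dropped; only content words remain.
lf_a = logical_form([
    ("click", "object", "address"),
    ("address", "modifier", "hyperlink"),
    ("click", "location", "Hyperlink Information"),
])

# The same edges listed in a different order give the same graph.
lf_b = logical_form([
    ("click", "location", "Hyperlink Information"),
    ("click", "object", "address"),
    ("address", "modifier", "hyperlink"),
])

same_graph = lf_a == lf_b
```

Because the edges live in a set, surface word order does not affect equality, which is the property that makes such a structure convenient for aligning sentences across languages.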
Efficient Retrieval Method [Doi et al., 04]
Similarity between the input and examples is calculated by word-based edit distance
Finding suitable examples in a large parallel corpus takes a long time
This problem is addressed by
– Classifying sentences into groups according to the # of content words and function words
– Compressing all sentences in a group into a "directed word graph"
– Searching for the best example in a group with the A* algorithm
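The similarity measure and the grouping idea above can be sketched as follows. This is a simplified illustration, not the cited method: the function-word list is an invented placeholder, only the group with exactly the same key as the query is searched, and a plain linear scan stands in for the directed word graph and A* search. All function names are hypothetical.

```python
from collections import defaultdict

def word_edit_distance(a, b):
    """Levenshtein distance over word tokens (insert/delete/substitute cost 1)."""
    prev = list(range(len(b) + 1))
    for i, wa in enumerate(a, start=1):
        curr = [i]
        for j, wb in enumerate(b, start=1):
            cost = 0 if wa == wb else 1
            curr.append(min(prev[j] + 1,          # delete wa
                            curr[j - 1] + 1,      # insert wb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]

FUNCTION_WORDS = {"the", "a", "at", "in", "is", "was", "to"}  # illustrative list only

def group_key(tokens):
    """Return (# of content words, # of function words) for a tokenized sentence."""
    n_func = sum(1 for t in tokens if t in FUNCTION_WORDS)
    return (len(tokens) - n_func, n_func)

def build_groups(sentences):
    """Classify tokenized sentences by their group key."""
    groups = defaultdict(list)
    for s in sentences:
        groups[group_key(s)].append(s)
    return groups

def best_example(query, groups):
    """Linear scan of the query's own group (a stand-in for graph + A* search)."""
    candidates = groups.get(group_key(query), [])
    return min(candidates, key=lambda s: word_edit_distance(query, s), default=None)

corpus = [
    "the light was green".split(),
    "the car was red".split(),
    "turn left at the intersection".split(),
]
groups = build_groups(corpus)
match = best_example("the signal was green".split(), groups)
```

Grouping narrows the search before any distance is computed: the query is compared with two sentences here rather than three, and the saving grows with corpus size.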
Why EBMT?
Pursuing structural NLP
– Improvement of basic analyses leads to improvement of MT as an application of basic analyses
– Feedback from the application (MT) can be expected
Adequacy of problem settings
– Not a large corpus, but similar examples in a relatively close domain
  Ex. Translation of a version-up of an instruction manual, a related patent document ...
Overview of EBMT
[Figure: Parallel Corpus → Alignment → TMDB; Input → Translation → Output (EBMT); supported by advanced NLP technologies]
Alignment
Japanese: 交差点で、突然あの車が飛び出して来たのです。
English: The car came at me from the side at the intersection.
[Figure: dependency structures of the two sentences. Japanese phrases: 交差点 で 、 / 突然 / あの車 が / 飛び出して 来た のです 。 English phrases: the car / came / at me / from the side / at the intersection]
1. Transform into dependency structure
2. Word-based alignment using bilingual lexicon
3. Extend the correspondence of phrases
4. Extract Translation Examples
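Step 2 of the list above can be sketched as a dictionary lookup over token pairs. This is a minimal illustration only: the lexicon entries and the tokenization are assumptions made for this example, `align_words` is a hypothetical helper, and steps 1, 3, and 4 (dependency analysis, phrase extension, example extraction) are not shown.

```python
# Hypothetical bilingual lexicon: Japanese word -> set of English words.
LEXICON = {
    "交差点": {"intersection"},
    "車": {"car"},
    "来た": {"came"},
}

def align_words(src_tokens, tgt_tokens, lexicon):
    """Return (src_index, tgt_index) pairs licensed by the lexicon."""
    links = []
    for i, s in enumerate(src_tokens):
        for j, t in enumerate(tgt_tokens):
            if t.lower() in lexicon.get(s, set()):
                links.append((i, j))
    return links

src = ["交差点", "で", "突然", "あの", "車", "が", "飛び出して", "来た", "のです"]
tgt = ["The", "car", "came", "at", "me", "from", "the", "side", "at", "the", "intersection"]
links = align_words(src, tgt, LEXICON)
```

Words without a lexicon entry, such as 突然 or "at me", stay unaligned at this stage. Extending the correspondences to whole phrases is what step 3 is for.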
Translation
Input: 交差点に入る時私の信号は青でした。
Input structure (with glosses): 交差 (cross) / 点 に (point) / 入る (enter) / 時 (when) / 私 の (my) / 信号 は (signal) / 青 (blue) / でした 。 (was)
Translation Examples
– 交差 点 で 、 突然 飛び出して 来た のです 。 ↔ came at me from the side at the intersection
– 家 に 入る 時 脱ぐ ↔ to remove when entering a house
– 私 の サイン ↔ my signature
– 信号 は 青 でした 。 ↔ The traffic light was green
[Figure: the selected target fragments (my / traffic / The light / was green / when / entering / the intersection) are combined using a Language Model]
Output: My traffic light was green when entering the intersection.
IWSLT2005
IWSLT
– International Workshop on Spoken Language Translation
– Aiming at translation of ASR (Automatic Speech Recognition) output
Outline of campaign
– Training set: parallel corpus including 20K sentences
– Development set: two sets including 500 and 506 sentences
– Test set: manual transcription and ASR output (500 sentences each)
Evaluation Results
Manual Transcription (Supplied & Tools)

Name        BLEU
ATR-C3      0.4774
MICROSOFT   0.4057
ATR-SLR     0.3884
TUV         0.3718
NGKUT       0.3418
USC         0.2741

Name        NIST
ATR-C3      8.1720
MICROSOFT   8.0375
TUV         7.8472
NGKUT       7.7158
ATR-SLR     4.3928
USC         2.9648
Conclusion
In this presentation …
– History of Machine Translation
– SMT and EBMT
– Two related works for EBMT
– Introduction of our EBMT system
Future work
– Improve our EBMT system
  Resolve paraphrase problem
  Apply anaphora resolution