Improving Machine Translation Quality with Automatic Named Entity Recognition
Bogdan Babych
Centre for Translation Studies, University of Leeds, UK
Department of Computer Science, University of Sheffield, UK
Anthony Hartley
Centre for Translation Studies, University of Leeds, UK
Overview
• Problems of Named Entities (NEs) for MT
• Experiment set-up
  – Segmentation of the MT output
  – Scoring scheme
• Results of the experiment
• Discussion
  – Improving MT with IE techniques
• Conclusions and future work
Problems of NEs for MT
• NEs are the weak point for many MT systems
• Distinct linguistic properties of proper nouns and different translation strategies for NEs
• “NE internal” errors:
  – Proper / common noun disambiguation errors
  – Errors in morphosyntactic categories of NEs
• “NE external” errors in the context of NEs:
  – Word sense disambiguation errors
  – Errors in morphosyntactic features in NE context
  – Segmentation errors
Translation strategies for NEs
• Language-dependent strategies
  – Eastern Slavonic languages: person names are transcribed with Cyrillic characters
• Strategies dependent on the type of NE
  – [Newmark, 1982: 70-83]: organisation names are often left untranslated
  – Languages with Cyrillic writing system: organisation names are often left in original Roman orthography
    • E.g.: 4 articles on international economy from BBC Russian site: Roman-script NEs cover 6% of the total 1000 tokens
Proper / common disambiguation errors
– English: “Ray Rogers”
– MT ProMT E-R: “Луч Rogers” (‘A ray (beam of light) Rogers’)
– English: “Bill Fisher”
– MT ProMT E-R: “Выставить счёт Рыбаку”
– MT ProMT E-F: “Facturez le Pêcheur” (‘(To) send a bill to a fisher’)
– English: “Jeff Levy”
– MT Systran E-F: “prélèvement de Jeff” (‘Jeff’s imposing of a tax’)
Contextual changes around unrecognised NEs
• Errors in morphosyntactic categories
  – English: “… they have been flying in United cockpits”
  – E-R MT: “… они летали в Объединенных кабинах” (‘they have been flying in united (joined) cockpits’)
• Segmentation errors
  – English: “Eastern Airlines executives notified union leaders …”
  – E-R MT: “Восточные исполнители авиалиний уведомили профсоюзных руководителей…” (‘Oriental executives of the Airlines notified …’)
Compound errors -- combining:
• “NE internal” errors and errors in the context of NEs
• Lexical disambiguation errors and errors in morphosyntactic disambiguation / segmentation
  – English: “In Ford-UAW talks…”
  – E-R MT: “В Броде - UAW говорит” (‘In a ford (shallow place) - UAW is talking’)
Information Extraction (IE) technology
• IE: from unrestricted text to a database
  – specific subject domain (e.g. satellite launches)
  – predefined template with fields to be filled
• IE tasks:
  – NE recognition
  – Co-reference resolution
  – Word sense disambiguation
  – Template element filling
  – Scenario template filling
  – Summary generation
NE recognition in IE
• NE recognition is specifically addressed and benchmarked (DARPA MUC6 & MUC7 competitions)
• Manually annotated “gold standard” available
• Highly accurate
  – leading IE systems achieve F-score 80-90%
  – performance is higher and less dependent on a subject domain (compared to Scenario Template Filling)
• Available under GPL: NE recognition module ANNIE in Sheffield’s GATE system
Using NE recognition for MT
• GATE-ANNIE system allows automatic annotation of NEs in English texts
• MT systems accept Do-Not-Translate (DNT) lists
  – acceptable translation strategy for many organisation names in certain language pairs
• Suggestion: if NE recognition is more accurate in IE systems than in MT systems, then passing IE-recognised NEs to MT should improve general MT quality (compared to the baseline performance)
  – NE-internal changes are predictable (DNT strategy)
  – Changes in the context of NEs are more interesting and more difficult to predict
Experiment set-up
• Purpose: evaluating morphosyntactic changes in the context of NEs after DNT-processing
• Corpus:
  – 30 texts (news articles) from MUC6 evaluation set (11,975 tokens, 510 NE occurrences, 174 NE types)
  – GATE “responses”: NE recognition output file generated by GATE-1 for MUC6 competition (Precision: 84%; Recall: 94%; F-measure: 89.06%)
• MT systems:
  – E-R ProMT 98; E-F ProMT 2001; E-F Systran 2000
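The F-measure quoted for the GATE-1 output combines precision and recall as a harmonic mean. A minimal sketch of that calculation, assuming the balanced variant (beta = 1; the slide does not say which weighting was used) and using the rounded percentages from the slide. The function name `f_measure` is illustrative only:

```python
def f_measure(precision: float, recall: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision and recall (F1 when beta == 1)."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Rounded values from the slide; the unrounded inputs are not given there.
print(round(100 * f_measure(0.84, 0.94), 2))
```

With these rounded inputs the result comes out near 88.7%, slightly below the reported 89.06%; presumably the reported figure was computed from unrounded precision and recall.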
Experiment set-up (contd.)
• Stage 1: Automatic generation of DNT lists from GATE-1 annotation
• Stage 2: Generating translations for 3 systems
  – Baseline translation (without a DNT list)
  – DNT-processed translation
• Stage 3: Automatic segmentation of translations into NE-internal and NE-external zones
• Stage 4: Manual scoring of NE-external differences
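Stage 1 could be sketched roughly as follows. This assumes the NE annotations are available as (start, end, type) character offsets over the source text, which is a hypothetical representation and not the actual GATE-1 output format; `build_dnt_list` and the `"ORGANIZATION"` label are likewise illustrative names:

```python
from typing import Iterable, List, Tuple

# (start offset, end offset, NE type): hypothetical annotation format
Annotation = Tuple[int, int, str]


def build_dnt_list(text: str, annotations: Iterable[Annotation]) -> List[str]:
    """Collect the unique annotated NE strings for use as a DNT list."""
    entries = {text[start:end].strip() for start, end, _ne_type in annotations}
    entries.discard("")
    # Longest first, so a longer name is listed before any shorter name it contains.
    return sorted(entries, key=lambda s: (-len(s), s))


text = "Western Union Corp. said its subsidiary, Western Union Telegraph Co."
names = ["Western Union Corp.", "Western Union Telegraph Co."]
annotations = [(text.index(n), text.index(n) + len(n), "ORGANIZATION") for n in names]
print(build_dnt_list(text, annotations))
```

Ordering the entries longest-first is a precaution chosen for this sketch; whether the MT systems in the experiment are sensitive to the order of DNT entries is not stated in the slides.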
Segmentation algorithm
• Annotated NEs in the English original are looked up in the DNT-processed translation
• Strings between found NEs are then looked up in the baseline translation
• If a string is not found, it is highlighted (signaling a difference in the context of the NE)
  – Result: NE-internal and NE-external zones in the baseline translation are separated
  – NE-external differences are highlighted
  – No complex alignment
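The lookup procedure described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: it uses exact substring matching, treats each NE as appearing verbatim in the DNT-processed output, and the function name `segment` and the zone labels `"NE"`, `"SAME"`, `"DIFF"` are made up for the example:

```python
from typing import List, Tuple


def segment(dnt_translation: str, baseline: str, nes: List[str]) -> List[Tuple[str, str]]:
    """Split the DNT-processed translation into NE zones and context zones.

    Returns (zone, text) pairs where zone is "NE", "SAME" (context string also
    found verbatim in the baseline) or "DIFF" (context string to highlight).
    """
    # Locate every NE occurrence; longer NEs take priority when matches overlap.
    hits: List[Tuple[int, int]] = []
    for ne in sorted({n for n in nes if n}, key=len, reverse=True):
        start = dnt_translation.find(ne)
        while start != -1:
            end = start + len(ne)
            if all(end <= s or start >= e for s, e in hits):
                hits.append((start, end))
            start = dnt_translation.find(ne, end)
    hits.sort()

    zones: List[Tuple[str, str]] = []
    pos = 0
    sentinel = (len(dnt_translation), len(dnt_translation))
    for start, end in hits + [sentinel]:
        context = dnt_translation[pos:start].strip()
        if context:
            zones.append(("SAME" if context in baseline else "DIFF", context))
        if end > start:
            zones.append(("NE", dnt_translation[start:end]))
        pos = end
    return zones


dnt = ("Отдельно в его регистрации SEC, USAir раскрыл детали его планов "
       "финансирования приобретения Piedmont")
base = ("Отдельно в его регистрации СЕКУНДЫ, USAir раскрыл детали его планов "
        "финансирования Предгорного приобретения.")
for zone, text in segment(dnt, base, ["SEC", "USAir", "Piedmont"]):
    print(zone, "|", text)
```

On this sentence the string between USAir and Piedmont is not found in the baseline and is therefore flagged as a difference, while the opening string is found unchanged. How the original tool handled punctuation and whitespace is not described in the slides, so the `strip()` call here is a guess.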
Segmentation algorithm (contd.)
| ORIGINAL | DNT-PROCESSED | BASELINE |
|---|---|---|
| Separately in its **SEC** filing **USAir** disclosed details of its plan for financing the **Piedmont** Acquisition | Отдельно в его регистрации **SEC**, **USAir** раскрыл детали его планов финансирования приобретения **Piedmont** | Отдельно в его регистрации СЕКУНДЫ, USAir раскрыл детали его планов финансирования Предгорного приобретения. |
Scoring scheme
Evaluating morphosyntactic well-formedness
| Score | Baseline translation | DNT-processed translation |
|---|---|---|
| +1 | not well-formed | well-formed |
| +0.5 | not well-formed | not well-formed; some features are more correct |
| 0 | equally (not) well-formed | equally (not) well-formed |
| –0.5 | not well-formed; some features are more correct | not well-formed |
| –1 | well-formed | not well-formed |
Scoring examples: +1 score
Original:
(It) represents 4,400 Western Union employees around the country.
Baseline translation:
(Он) представляет 4,400 Западных служащих Союза по всей стране.
('It represents 4,400 Western employees of the Union around the country')
DNT-processed translation:
(Он) представляет 4,400 служащих Western Union по всей стране.
('(It) represents 4,400 employees of Western Union around the country')
Scoring examples: +0.5 score
Original:
Western Union Corp. said its subsidiary, Western Union Telegraph Co. …
Baseline translation:
Западная Корпорация Союза сказала ее вспомогательную, Западную Компанию Телеграфа Союза…
('Western Corporation of a Union said its auxiliary (case.acc.), Western Company of Telegraph of a Union …')
DNT-processed translation:
Western Union Corp. Сказанный его филиал, Western Union Telegraph Co. …
('Western Union Corp. Its branch (case.nom) is said, Western Union Telegraph Co. …')
Scoring examples: =0 score
Original:
American Airlines Calls for Mediation
Baseline translation:
Американские Авиалинии Призывают К посредничеству
(American Airlines Call (num.plur.) for Mediation)
DNT-processed translation:
American Airlines Призывает К посредничеству
(American Airlines Calls (num.sing.) for Mediation)
Scoring examples: -0.5 score
Original:
USAir said that William R. Howard will be elected president of USAir
Baseline translation:
USAir сказал тот Уильям Р. Говард будут избраны президентом USAIR
(USAir said that (particular) (demonstr.pron, nom.) William R. Howard will be elected president of USAir)
DNT-processed translation:
USAir сказал того Уильяма Ра. Говард будут избраны президентом USAir
(USAir said of that (particular) (demonstr.pron, gen.) William Ra. Howard will be elected president of USAir)
Scoring examples: -1 score
Original:
to discuss the benefits of combining TWA and USAir
Baseline translation:
чтобы обсудить выгоды от объединения TWA и USAIR
('to discuss the benefits of the merge (noun) (of) TWA and USAir')
DNT-processed translation:
чтобы обсудить выгоды от объединяющегося TWA и USAir
('to discuss the benefits of the combining (participle, sing.) TWA and (of) USAir')
Manually scored part of the corpus
• 50 highlighted strings for each MT system
• Gain score: Overall score / Scored differences

| Number of: | Original – GATE | MT E-R ProMT | MT E-F ProMT | MT E-F Systran |
|---|---|---|---|---|
| Paras. with NE | 218 | 225 | 225 | 239 |
| Paras. with contextual differences | | 139 (61.8%) | 132 (58.7%) | 207 (86.6%) |
| Paras. manually scored | | 31 (22.3%) | 28 (21.2%) | 30 (14.5%) |
| Strings with differences | | 211 | 212 | 411 |
| Strings scored | | 50 (23.7%) | 50 (23.6%) | 50 (12.2%) |
Results of the experiment
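| Mark | ProMT 1998 E-R: N | ProMT 1998 E-R: Score | ProMT 2001 E-F: N | ProMT 2001 E-F: Score | Systran 2000 E-F: N | Systran 2000 E-F: Score |
|---|---|---|---|---|---|---|
| +1 | 28 | +28.0 | 23 | +23.0 | 18 | +18.0 |
| +0.5 | 2 | +1.0 | 5 | +2.5 | 24 | +12.0 |
| 0 | 4 | 0 | 7 | 0 | 8 | 0 |
| –0.5 | 3 | –1.5 | 1 | –0.5 | 1 | –0.5 |
| –1 | 13 | –13.0 | 14 | –14.0 | 10 | –10.0 |
| SUM | 50 | +14.5 | 50 | +11.0 | 61 | +19.5 |
| Gain | | +29% | | +22% | | +32% |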
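The gain figures follow from the definition given earlier (overall score divided by the number of scored differences). A short sketch that recomputes them from the mark counts in the table; the dictionary layout and the function name `gain` are illustrative choices, not part of the original work:

```python
from typing import Dict


def gain(counts: Dict[float, int]) -> float:
    """Gain = overall score / number of scored differences."""
    overall_score = sum(mark * n for mark, n in counts.items())
    scored = sum(counts.values())
    return overall_score / scored


# Mark -> number of strings given that mark, copied from the results table.
results = {
    "ProMT 1998 E-R": {1: 28, 0.5: 2, 0: 4, -0.5: 3, -1: 13},
    "ProMT 2001 E-F": {1: 23, 0.5: 5, 0: 7, -0.5: 1, -1: 14},
    "Systran 2000 E-F": {1: 18, 0.5: 24, 0: 8, -0.5: 1, -1: 10},
}
for system, counts in results.items():
    print(f"{system}: {gain(counts):+.0%}")
```

Note that the Systran column sums to 61 scored strings, so its gain is 19.5 / 61, roughly 32%.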
Results for additional 50 strings...
ProMT 1998 E-R

| Mark | 50: N | 50: Score | 100: N | 100: Score |
|---|---|---|---|---|
| +1 | 28 | +28.0 | 59 | +59.0 |
| +0.5 | 2 | +1.0 | 8 | +4.0 |
| 0 | 4 | 0 | 6 | 0 |
| –0.5 | 3 | –1.5 | 7 | –3.5 |
| –1 | 13 | –13.0 | 31 | –31.0 |
| SUM | 50 | +14.5 | 111 | +28.5 |
| Gain | | +29% | | +26% |
Improvement in the context of NEs
• Aspects of improvement:
  – morphosyntactic features and categories
  – word sense disambiguation
  – word order and syntactic segmentation
• Consistency in improvement
  – for both languages
  – for all MT systems
Examples of improvement
E-F Systran
Original:
TWA stock closed at $28 …
Baseline translation:
Fermé courant de TWA à $28 …
(‘Closed (Past participle) current (Noun/Present participle) of TWA at $28 …’)
DNT-processed translation:
L’action de TWA s’est fermée à $28 …
(‘The stock of TWA closed (Verb) at $28 …’)
Examples of improvement:2
E-R ProMT
Original:
National Mediation Board is expected to release Pan Am Corp. from their contract negotiations.
Baseline translation:
Национальное Правление Посредничества, как ожидается, выпустит Кастрюлю - Корпорация от их переговоров контракта.
('National Mediation Board is expected to release [put on the market] a Saucepan - Corporation from their contract negotiations.')
DNT-processed translation:
National Mediation Board, как ожидается, освободит Pan Am Corp. от их переговоров контракта.
(‘National Mediation Board is expected to release [make free] Pan Am Corp. from their contract negotiations.’)
Improvement: languages and systems
E-R ProMT
Original:
The agreement was reached by a coalition of four of Pan Am's five unions.
Baseline translation:
Соглашение было достигнуто коалицией четырех Кастрюли пять союзов Ама.
('The agreement was reached by a coalition of four of a Saucepan five unions of Am.')
DNT-processed translation:
Соглашение было достигнуто коалицией четырех из пяти союзов Pan Am.
('The agreement was reached by a coalition of four out of five unions of Pan Am')
Improvement: languages and systems: 2
E-F ProMT
Original:
The agreement was reached by a coalition of four of Pan Am's five unions.
Baseline translation:
L'accord a été atteint par une coalition de quatre de casserole cinq unions d'Am.
(‘The agreement was reached by a coalition of four of saucepan five unions of Am.’)
DNT-processed translation:
L'accord a été atteint par une coalition de quatre de cinq unions de Pan Am.
(‘The agreement was reached by a coalition of four of five unions of Pan Am.’)
Improvement: languages and systems: 3
E-F Systran
Original:
The agreement was reached by a coalition of four of Pan Am's five unions.
Baseline translation:
L'accord a été conclu par une coalition de quatre de la casserole étais cinq syndicats.
(‘The agreement was reached by a coalition of four of the saucepan was five trades-unions.’)
DNT-processed translation:
L'accord a été conclu par une coalition de quatre de Pan Am's cinq syndicats.
(‘The agreement was reached by a coalition of four of Pan Am’s five trades-unions.’)
Discussion
• Different aspects of MT quality are interdependent
  – improvements on one level help other levels
• IE techniques target specific tasks also necessary for the SL analysis stage in MT
  – NE recognition
  – co-reference resolution
  – word sense disambiguation
• MT can benefit from clearly defined evaluation procedures for specific IE tasks
Conclusions and Future Work
• NE recognition within the IE framework not only improves the treatment of NEs by MT, but also boosts the overall MT quality:
  – morphosyntactic and lexical well-formedness
  – features of the wider context of NEs
• Future work: harnessing other focused technologies for MT
  – co-reference resolution
  – word sense disambiguation
  – evaluating the baseline performance of MT systems