2008 – copyright SYSTRAN
SYSTRAN Challenges and Recent Advances in Hybrid Machine Translation
Jean Senellart, Jin Yang, Jens Stephan
jyang@systransoft.com
Overview
SYSTRAN – 40 years of innovation
The MT Challenges
SYSTRANLab
• Projects
• Hybrid Engines
• From Research to Products
CWMT08
Conclusions
SYSTRAN
40 years of history
Located in Paris (La Défense) and San Diego
70+ employees: ~20 linguists, ~30 engineers
• Including 10 PhDs
Core Technology
Core technology is “Rule-Based”
• Based on language description
• Analysis – Transfer – Generation paradigm
• Builds a « syntax tree » based on hierarchical constituents with multi-level relationships
Multi-pass analysis
• Morphology Analysis
• Homograph Resolution
• Clause Boundary
• Syntagm Identification
• Syntactic Role Identification
• …
Relies heavily on linguistic resources
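The Analysis – Transfer – Generation paradigm with multi-pass analysis can be illustrated with a deliberately tiny sketch. Everything in it is an assumption made for illustration: the `Token` class, the two-pass analysis, and the three-word English-to-French lexicons are hypothetical and bear no relation to SYSTRAN's actual resources or code.

```python
from dataclasses import dataclass, field


@dataclass
class Token:
    surface: str
    annotations: dict = field(default_factory=dict)


# Hypothetical toy resources; a real rule-based engine relies on large
# human-built dictionaries and rule sets.
LEXICON = {"the": "DET", "cat": "NOUN", "sleeps": "VERB"}
BILINGUAL = {"the": "le", "cat": "chat", "sleeps": "dort"}


def morphology_pass(tokens):
    # Pass 1: attach a part-of-speech tag from the monolingual lexicon.
    for t in tokens:
        t.annotations["pos"] = LEXICON.get(t.surface.lower(), "UNK")
    return tokens


def syntactic_role_pass(tokens):
    # Pass 2 (toy rule): any noun before the first verb is a subject.
    seen_verb = False
    for t in tokens:
        pos = t.annotations.get("pos")
        if pos == "VERB":
            seen_verb = True
            t.annotations["role"] = "PREDICATE"
        elif pos == "NOUN" and not seen_verb:
            t.annotations["role"] = "SUBJECT"
    return tokens


# Each pass enriches the annotations left by the previous ones.
ANALYSIS_PASSES = [morphology_pass, syntactic_role_pass]


def analyze(sentence):
    tokens = [Token(w) for w in sentence.split()]
    for analysis_pass in ANALYSIS_PASSES:
        tokens = analysis_pass(tokens)
    return tokens


def transfer(tokens):
    # Map each source word to a target word; unknown words pass through.
    for t in tokens:
        t.annotations["target"] = BILINGUAL.get(t.surface.lower(), t.surface)
    return tokens


def generate(tokens):
    # Toy generation: emit target words in source order.
    return " ".join(t.annotations["target"] for t in tokens)


def translate(sentence):
    return generate(transfer(analyze(sentence)))
```

Calling `translate("the cat sleeps")` walks the sentence through all three stages. The point of the design is that each analysis pass only adds annotations, so later passes and the transfer step can rely on earlier decisions.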
Languages
Chinese 882
Arabic 422
Spanish 358
English 350
Hindi 325
Portuguese 250
Russian 170
French 130
Japanese 125
Urdu 100
German 100
Farsi 82
Korean 78
Italian 62
Ukrainian 47
Polish 42
Dutch 23
Serbo-Croatian 21
Greek 18
Czech 12
Albanian 6
Slovak 6
22 source languages
70 language pairs
Dictionaries: 200K-1M entries per language pair (LP)
• ~6M-entry reference multi-source / multi-target dictionary
3600
SYSTRAN Activity
Retail products
• Windows Desktop Product
• SYSTRAN Mobile on PDA
• Mac OS Dashboard Widget
Online Services
• SYSTRANBox, SYSTRANNet, SYSTRANLinks
Corporate customers
• Symantec, Cisco, Verizon, Ford, Daimler, Chemical Abstract…
Institutional Customers
• EC and US agencies
Portals - Online Translation
• “Babel Fish”, Google, Yahoo!, Microsoft Live, …
MT Challenges: RBMT/SMT Strengths and Weaknesses - I
Rule-Based system builds a translation with available linguistic resources (dictionaries, rules)
Human-built resources
• Incremental
Track the translation process
• Predictable output
Some phenomena are hard to formalize
• Need semantic/pragmatic knowledge
Not designed to deal with exceptions to the rules
• … which are very frequent
MT Challenges: RBMT/SMT Strengths and Weaknesses - II
Statistical system finds a translation within a choice of many, many possible translations
Very easy to build
• Automatic training process
Knowledge acquisition is easy…
• Not limited to predefined linguistic patterns – “phrase”
… but cannot “understand” or generalize information
• Not even elementary rules
Output is “unpredictable”
MT Challenges: Corpus-Based or Rule-Based Approach?
No conflict between “corpus” and “rule-based” approaches
Possible to learn rules
• Already learns terminology – monolingual and multilingual
• Some approaches acquire complex rules
Possible to find the best translation amongst several translations
• “Decoding” can be constrained by syntactic restrictions
• Linguistic rules but corpus drives!
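One way to picture "linguistic rules but corpus drives" is a selector that first discards candidates violating a syntactic restriction and then lets corpus statistics choose among the rest. The sketch below is a toy under stated assumptions: the bigram-count score, the `satisfies_constraint` rule, and the three-sentence corpus are all hypothetical, and a real decoder would use a properly normalized language model rather than raw counts (which favor longer candidates).

```python
from collections import Counter


def train_bigram_counts(corpus):
    # Count adjacent word pairs, with sentence boundary markers.
    counts = Counter()
    for sentence in corpus:
        words = ["<s>"] + sentence.split() + ["</s>"]
        counts.update(zip(words, words[1:]))
    return counts


def corpus_score(candidate, counts):
    # Toy fluency score: sum of the corpus counts of the candidate's bigrams.
    words = ["<s>"] + candidate.split() + ["</s>"]
    return sum(counts[bigram] for bigram in zip(words, words[1:]))


def satisfies_constraint(candidate):
    # Hypothetical syntactic restriction: a sentence may not end in "the".
    words = candidate.split()
    return bool(words) and words[-1] != "the"


def pick_best(candidates, counts, constraint=satisfies_constraint):
    # The rule prunes the search space; the corpus picks the winner.
    # If the rule rejects everything, fall back to all candidates.
    allowed = [c for c in candidates if constraint(c)] or list(candidates)
    return max(allowed, key=lambda c: corpus_score(c, counts))


corpus = ["the red car", "the car is red", "a red car"]
counts = train_bigram_counts(corpus)
candidates = ["the car red", "the red car", "red car the"]
best = pick_best(candidates, counts)
```

Here "red car the" is removed by the rule before scoring, and the corpus counts then prefer "the red car" over "the car red".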
SYSTRANLab
Research Projects Overview
Toward Hybrid Engines
• Collaborations
• Statistical Post-Edition
• Lattice Decoding
• Source Analysis Adaptation
From Research to Products
Research Projects
Resources Acquisition
• Consolidating a 6M-entry multilingual dictionary
• Acquiring more from corpus – lexicon and rules
Linguistic Development
• Entity Recognition with local grammars
• Autonomous Generation modules
Introduction of corpus-based technology
Applications
• More interactive applications
• Professional Post-Edition Module (POEM)
SYSTRANLab Research Projects
The Phoenix Project
Collaboration with P. Koehn (University of Edinburgh)
Introduce corpus-based decision modules in SYSTRAN
Specialized modules
• Word Sense Disambiguation
• Lattice Generation
• Preposition / Determiner Choice
SYSTRANLab Research Projects
The Sphinx Project
Collaboration with CNRC
Sequential use of SYSTRAN and statistical engines (Statistical Post-Edition)
GALE (DARPA Project)
Participated in WMT07, NIST08
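The sequential arrangement used in Statistical Post-Edition (a rule-based engine first, a statistical engine second) can be sketched as two chained functions. This is a minimal illustration, not the Sphinx implementation: the word-for-word `rule_based_translate` stub, the one-entry `POST_EDIT_TABLE`, and the greedy longest-match rewriting are all assumptions. In practice the second stage is a phrase-based statistical system trained on pairs of rule-based output and reference translations.

```python
# Hypothetical toy resources for a French-to-English example.
RBMT_LEXICON = {"poser": "put", "une": "a", "question": "question"}

# Hypothetical post-edition "phrase table": rule-based output -> fluent target.
POST_EDIT_TABLE = {("put", "a", "question"): ("ask", "a", "question")}


def rule_based_translate(source):
    # Stage 1 stand-in: literal word-for-word translation.
    return [RBMT_LEXICON.get(w, w) for w in source.lower().split()]


def post_edit(tokens, table, max_len=3):
    # Stage 2 stand-in: greedy left-to-right, longest-match phrase rewriting.
    out = []
    i = 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            phrase = tuple(tokens[i:i + n])
            if phrase in table:
                out.extend(table[phrase])
                i += n
                break
        else:
            # No phrase matched at this position: keep the token unchanged.
            out.append(tokens[i])
            i += 1
    return out


def translate_hybrid(source):
    # Sequential use: rule-based output is the input of the post-editor.
    return " ".join(post_edit(rule_based_translate(source), POST_EDIT_TABLE))
```

With these toy tables, `translate_hybrid("poser une question")` turns the literal "put a question" into "ask a question", while text the table does not cover passes through as the rule-based engine produced it.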
SYSTRANLab Research Projects
The Pegasus Project
Collaboration with H. Schwenk (Université du Maine)
Introduce linguistic knowledge in statistical engines
Participated in WMT08
SYSTRANLab Hybrid Engines
Introduce Self-Learning capability
Learn “post-edition rules”
Deep integration of statistical decision modules
Insert linguistic knowledge in statistical engines
[Diagram label: HYBRID]
CWMT08
Chinese-English MT evaluation
Primary: RBMT+SPE
Contrast: RBMT
• Started in 1994, 1.2M terms, S&T focus

| System | BLEU4 | BLEU4-SBP | NIST5 | GTM | mWER | mPER | ICT |
|---|---|---|---|---|---|---|---|
| Primary-a | 0.2275 | 0.2193 | 7.9180 | 0.7101 | 0.7209 | 0.5085 | 0.3262 |
| Contrast-b | 0.1956 | 0.1930 | 7.6356 | 0.7089 | 0.7165 | 0.5123 | 0.2942 |
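To read the Primary-a (RBMT+SPE) and Contrast-b (RBMT) scores side by side, it can help to compute the relative change per metric. The short script below only does that arithmetic on the values reported above; the dictionary and function names are illustrative. Note that mWER and mPER are error rates, so for those two a negative change is the improvement.

```python
# Scores copied from the CWMT08 results reported above.
PRIMARY = {"BLEU4": 0.2275, "BLEU4-SBP": 0.2193, "NIST5": 7.9180,
           "GTM": 0.7101, "mWER": 0.7209, "mPER": 0.5085, "ICT": 0.3262}
CONTRAST = {"BLEU4": 0.1956, "BLEU4-SBP": 0.1930, "NIST5": 7.6356,
            "GTM": 0.7089, "mWER": 0.7165, "mPER": 0.5123, "ICT": 0.2942}


def relative_change(primary, contrast):
    # (primary - contrast) / contrast for every metric, as a fraction.
    return {m: (primary[m] - contrast[m]) / contrast[m] for m in primary}


changes = relative_change(PRIMARY, CONTRAST)

if __name__ == "__main__":
    for metric, change in changes.items():
        print(f"{metric:10s} {change:+.1%}")
```

On BLEU4, for example, the post-edited system is about 16% above the rule-based contrast run in relative terms, while the differences on GTM and mWER are well under 1%.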
CWMT08: SPE Usage
SPE module trained on 1.8M sentences
• CWMT08 training data not used
Not only translation but also annotation by RBMT
• Dates, numerals, etc.
Transfer model is filtered
• Exclusion of “bad rules” by rule-based filtering
• Examples are “random” quotes, entities appearing
Some expressions are “protected”
• Constituents are replaced with placeholders before SPE
• Translated with RBMT
• Re-injected in translation after SPE
SPE model for CWMT08 is trained using GIZA++, with decoding by Moses (www.statmt.org/moses)
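The "protected" expressions mechanism can be sketched as a mask, post-edit, re-inject round trip. This is an illustrative sketch only: the `PROTECTED` pattern (ISO-style dates and plain numerals), the `__PHn__` placeholder format, and the `naive_post_edit` stand-in are assumptions, and the real system identifies protected constituents through RBMT annotation rather than a regular expression.

```python
import re

# Hypothetical pattern for protected constituents: ISO-style dates, numerals.
PROTECTED = re.compile(r"\d{4}-\d{2}-\d{2}|\d+(?:\.\d+)?")
PLACEHOLDER = re.compile(r"__PH(\d+)__")


def protect(rbmt_output):
    # Replace each protected span (already translated by the rule-based
    # engine) with an indexed placeholder, remembering the original text.
    saved = []

    def mask(match):
        saved.append(match.group(0))
        return f"__PH{len(saved) - 1}__"

    return PROTECTED.sub(mask, rbmt_output), saved


def reinject(post_edited, saved):
    # Put the rule-based renderings back in place of their placeholders.
    return PLACEHOLDER.sub(lambda m: saved[int(m.group(1))], post_edited)


def post_edit_with_protection(rbmt_output, post_edit):
    masked, saved = protect(rbmt_output)
    return reinject(post_edit(masked), saved)


def naive_post_edit(text):
    # Stand-in for a statistical post-editor: one useful rewrite and one
    # rewrite that would damage a date if it ever saw one.
    return text.replace("meeting are", "meeting is").replace("12", "twelve")


rbmt_output = "the meeting are on 2008-11-12 with 12 people"
unprotected = naive_post_edit(rbmt_output)
protected = post_edit_with_protection(rbmt_output, naive_post_edit)
```

Without protection the stand-in post-editor corrupts the date; with protection it only ever sees the placeholders, so the agreement fix is applied while the date and numeral come back exactly as the rule-based engine produced them.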
Statistical Post-Edition: A Case Study
Case Study – SYMANTEC – English>Chinese
| System | BLEU | PERFECT | Improv / Degrad |
|---|---|---|---|
| SYSTRAN Raw | 20.89 | 2 | - |
| SYSTRAN Cust | 34.49 | 4.8 | ref |
| SYSTRAN Raw + Translation Model | 46.86 | 7.4 | - |
| SYSTRAN Cust + Translation Model | 50.90 | 10.5 | 15 |
Conclusions
Our approach is to start with a rule-based framework
• Developed techniques give very competitive results
• Major focus on “degradation” control
• Learn more advanced post-edition rules
Generic Translation – still a long way to go
• Bigger still better?
Domain Translation
• Quality is there – statistics provide adaptation and fluidity
Need dedicated applications, workflow
Bootstrapping new language pair development