8.13: Public Project Presentation (Update) · 2017-04-18



8.13: Public Project Presentation (Update)

Stephan Busemann

Distribution: Public

EuroMatrixPlus

Bringing Machine Translation for European Languages to the User

ICT 231720 Deliverable 8.13

Project funded by the European Community under the Seventh Framework Programme for Research and Technological Development.


Project ref no.: ICT-231720
Project acronym: EuroMatrixPlus
Project full title: Bringing Machine Translation for European Languages to the User
Instrument: STREP
Thematic Priority: ICT-2007.2.2 Cognitive systems, interaction, robotics
Start date / duration: 01 March 2009 / 38 Months
Distribution: Public
Contractual date of delivery: August 31, 2011
Actual date of delivery: November 30, 2011
Date of last update: November 28, 2011
Deliverable number: 8.13
Deliverable title: Public Project Presentation (Update)
Type: Report
Status & version: Final
Number of pages: 3
Contributing WP(s): WP8
WP / Task responsible: DFKI
Other contributors: CU, LSIL
Author(s): Stephan Busemann
EC project officer: Michel Brochard
Keywords:

The partners in EuroMatrixPlus are:
DFKI GmbH, Saarbrücken (DFKI)
University of Edinburgh (UEDIN)
Charles University (CUNI-MFF)
Johns Hopkins University (JHU)
Fondazione Bruno Kessler (FBK)
Université du Maine, Le Mans (LeMans)
Dublin City University (DCU)
Lucy Software and Services GmbH (Lucy)
Central and Eastern European Translation, Prague (CEET)
Ľudovít Štúr Institute of Linguistics, Slovak Academy of Sciences (LSIL)
Institute of Information and Communication Technologies, Bulgarian Academy of Sciences (IICT-BAS)

For copies of reports, updates on project activities and other EuroMatrixPlus-related information, contact:

The EuroMatrixPlus Project Co-ordinator
Prof. Dr. Hans Uszkoreit, DFKI GmbH
Stuhlsatzenhausweg 3, 66123 Saarbrücken, [email protected] +49 (681) 85775-5282 - Fax +49 (681) 85775-5338

Copies of reports and other material can also be accessed via the project’s homepage: http://www.euromatrixplus.net/

© 2011, The Individual Authors

No part of this document may be reproduced or transmitted in any form, or by any means, electronic or mechanical, including photocopy, recording, or any information storage and retrieval system, without permission from the copyright owner.


Executive Summary

This document contains the public project presentation representing the current state of the EuroMatrixPlus project after 33 months. After motivating and describing the goals set out, a survey of scientific progress is given and each major point detailed in the sequel. After an overview of dissemination activities, the presentation concludes with an assessment of how the goals are being met.

This presentation may form the basis for project presentations. The corresponding source file will be used by the Consortium to create updates whenever needed.

The slide set is available as a PDF document from the project website at http://www.euromatrixplus.eu/activities/.


EuroMatrix Plus - ICT 231720

Bringing Machine Translation for European Languages to the User

March 2009 - April 2012

© 2011 EuroMatrix Plus Consortium – Public Project Presentation

Motivation – Approaches to MT

Different approaches to MT have complementary pros and cons.

Source: Chen & Chen: A Hybrid Approach to Machine Translation System Design, Computational Linguistics and Chinese Language Processing, 1996


Motivation – Direction of Research

The different paradigms of rule-based MT (RBMT) and statistical MT (SMT) complement each other regarding their pros and cons. Thus we

• Combine their strengths to compensate for their weaknesses

• Develop special strategies to tackle phenomena that are difficult to handle

                      RBMT   SMT
Syntax, Morphology    ++     -
Structural Semantics  +      --
Lexical Semantics     -      +
Lexical Adaptivity    --     +
Lexical Reliability   +      -


Objectives of EuroMatrix Plus

1. Continue the rapid advance of machine translation technology, creating example systems for every official EU language, and providing other machine translation developers with our infrastructure for building statistical translation models.

2. Continue and broaden the controlled systematic investigation of different approaches and techniques to accelerate the scientific evolution of novel methods, including both selection and cross-fertilization. The aim is to arrive at scientifically well understood novel combinations of methods that are demonstrably superior to the state of the art.

3. Focus on bringing machine translation to the users – both professional translation services and lay end users.


Objectives of EuroMatrix Plus (cont’d)

4. Contribute to the growth and competitiveness of the European MT research scene and infrastructure through open, international, competitive shared tasks and through living, community-supported surveys of resources, tools, systems and their respective capabilities.

5. Create an openly accessible sample application that enables users to automatically translate news stories and web pages from any European language into any other, and whose corrections will be exploited as data for improving translation technology.


Project Features

• FP7 ICT Grant 231720
• Budget: 5.94 M€
• Duration: 03/2009 - 04/2012 (38 months)
• Co-ordinator: DFKI GmbH
• http://www.euromatrixplus.eu


The Partners in Detail

Name                                                         Country                    Research Focus
Deutsches Forschungszentrum für Künstliche Intelligenz GmbH  Germany                    Hybrid MT
University of Edinburgh                                      United Kingdom             Statistical MT
Charles University                                           Czech Republic             Tree-based MT
Johns Hopkins University                                     United States of America   Community-based MT
Fondazione Bruno Kessler                                     Italy                      Statistical MT
Laboratoire d'Informatique de l'Université du Maine          France                     Statistical MT


The Partners in Detail (cont’d)

Name                                                                                          Country          Research Focus
Dublin City University                                                                        Ireland          Translation and Localization & MT
Lucy Software and Services GmbH                                                               Germany          Hybrid MT
CEET language solutions                                                                       Czech Republic   MT evaluation
Ľudovít Štúr Institute of Linguistics                                                         Slovakia         MT between closely related languages
Institute of Information and Communication Technologies of the Bulgarian Academy of Sciences  Bulgaria         HPSG-based MT


Scientific Results

Considerable improvements of SMT by enriching phrase-based and hierarchical models

Next step in hybrid MT research by adding statistical weights to intermediate representations of a commercial RBMT system

Progress in translating between data-poor language pairs by using another translation path through a pivot language, and exploiting comparable data

New training method for quickly updating the model and thus utilizing corrections provided by users

Exploiting monolingual post-editing results to improve MT

Embedding of MT technology into translation and localisation workflows, combined with Translation Memories


Improving SMT by Enriching Models

Shallow syntax modeling
• Improve statistical MT by reordering the source text to reflect the structure of the target text

Reordering for hierarchical models
• Use a maximum entropy model to score the movement proposed by rule application

Mixed source syntax model
• Hierarchical rules use general non-terminals without enforcing a particular category.
• Source syntax rules use linguistic categories which restrict the type of phrase the non-terminal can be replaced with.
• The mixed source syntax model relaxes the strict categories of the syntax model to facilitate translation.

the x of the x → das x des x
the np of the np → das x des x
the np of the x → das x des x
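The difference between the three kinds of rules can be illustrated with a minimal sketch. It is not the project's implementation: the `Slot` class, the function names and the category labels are hypothetical, and the only assumption is that a generic non-terminal (written `X` here) accepts any phrase while a labelled non-terminal accepts only phrases of its own category.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slot:
    """A non-terminal on the source side of a rule (hypothetical representation)."""
    label: str  # "X" for the generic non-terminal, otherwise a category such as "NP"


def slot_accepts(slot, phrase_category):
    # A generic X slot accepts any phrase; a labelled slot only its own category.
    return slot.label == "X" or slot.label == phrase_category


def rule_applies(source_side, phrase_categories):
    """Check that each slot of the rule accepts the phrase that would fill it."""
    slots = [tok for tok in source_side if isinstance(tok, Slot)]
    if len(slots) != len(phrase_categories):
        return False
    return all(slot_accepts(s, c) for s, c in zip(slots, phrase_categories))


# Source sides corresponding to the three example rules on the slide.
hierarchical = ["the", Slot("X"), "of", "the", Slot("X")]
syntactic = ["the", Slot("NP"), "of", "the", Slot("NP")]
mixed = ["the", Slot("NP"), "of", "the", Slot("X")]
```

With this toy representation, the hierarchical rule accepts any pair of phrases, the syntactic rule only two noun phrases, and the mixed rule constrains the first slot while leaving the second open.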


Progress in Hybrid MT

A commercial rule-based MT system was extended by statistical modules according to the “SMT feeds RBMT” hybrid architecture. Results:

1. RBMT analysis now includes a state-of-the-art stochastic parser in order to select the best from the many parse trees.

2. The transfer lexicon has been extended with bilingual terminology extracted from a parallel corpus, enriched with linguistic information including the internal structure of multiword expressions, frequency and category of the overall term.

[Figure: hybrid architecture diagram with the components Source Text, RBMT Engine, Target Text, Lexicon, Linguistic Processing, Manual Validation, Phrase Table, Parallel Corpus, and Alignment / Phrase extraction.]


Other Hybrid MT Set-Ups Explored

Pivot languages
The pivot method used is composition.
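The slide does not spell out what composition means here. One possible reading, shown below purely as an illustrative sketch, is to chain a source-to-pivot and a pivot-to-target translator and to sum the probabilities of every path that leads to the same target. All names and the toy probabilities are hypothetical.

```python
def table_translator(table):
    """Wrap a toy lookup table as a translator returning (translation, probability) pairs."""
    return lambda text: list(table.get(text, {}).items())


def compose(src_to_pivot, pivot_to_tgt):
    """Compose two translators through a pivot language."""
    def translate(text):
        scored = {}
        for pivot, p1 in src_to_pivot(text):
            for target, p2 in pivot_to_tgt(pivot):
                # Sum over all pivot hypotheses that lead to the same target.
                scored[target] = scored.get(target, 0.0) + p1 * p2
        return sorted(scored.items(), key=lambda kv: kv[1], reverse=True)
    return translate


# Toy data (invented for illustration): Slovak -> English -> German.
sk_en = table_translator({"dom": {"house": 0.7, "home": 0.3}})
en_de = table_translator({
    "house": {"Haus": 0.9, "Gebäude": 0.1},
    "home": {"Haus": 0.4, "Heim": 0.6},
})
sk_de = compose(sk_en, en_de)
ranked = sk_de("dom")
```

In this toy example "Haus" is ranked first because both pivot hypotheses support it (0.7 × 0.9 + 0.3 × 0.4 = 0.75).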

Tree-based translation
In addition to morphological and shallow syntactic layers, a new TectoMT system utilizes the so-called tectogrammatical layer, which describes deep syntax including co-reference information. TectoMT is built using our new publicly available platform Treex and makes use of both rule-based and statistical processing.

HPSG-based translation
Following the usual set-up of an RBMT system, HPSG processing is used for analysis and generation, whereas the transfer between the HPSG semantic representations is modeled statistically.


Languages With Low Resources

Updating and extending existing resources
• Europarl (57m words, 21 languages)
• News Commentary (1-2m words)
• UN corpus (300m words)
• Monolingual news corpus (1b words)

Creating new data resources
• Czech-English:
  • Corpus annotated with tectogrammatical information
• Slovak-Czech and English-Slovak:
  • Sentence-aligned corpus annotated with lemma and morphological information
• Bulgarian-English parallel treebank


Languages With Low Resources (cont’d)

Extending parallel corpora
Non-parallel corpora can be exploited to extend a parallel corpus.

1. Translating monolingual texts
2. Extracting parallel sentences from comparable corpora (e.g. press agency releases) using information retrieval methods
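The two steps for extending a parallel corpus can be combined in a small sketch: each source sentence is first translated automatically, and the translation is then used as a query to retrieve the most similar sentence on the target side. This is only a sketch under stated assumptions: the word-overlap (Jaccard) score stands in for a real information retrieval engine, the `translate` callable stands in for an MT system, and the threshold and toy sentences are invented.

```python
def tokens(sentence):
    """Lower-cased bag of words used as a crude retrieval representation."""
    return set(sentence.lower().split())


def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def extract_pairs(source_sentences, translate, target_sentences, threshold=0.5):
    """Pair each source sentence with its best-matching target sentence, if similar enough."""
    target_bags = [(t, tokens(t)) for t in target_sentences]
    pairs = []
    for src in source_sentences:
        query = tokens(translate(src))  # step 1: translate the monolingual text
        best, best_score = None, 0.0
        for tgt, bag in target_bags:    # step 2: retrieve the closest target sentence
            score = jaccard(query, bag)
            if score > best_score:
                best, best_score = tgt, score
        if best is not None and best_score >= threshold:
            pairs.append((src, best, best_score))
    return pairs


# Toy stand-in for an MT system (hypothetical output).
toy_mt = {"le chat dort": "the cat sleeps", "les marchés montent": "the markets rise"}
english_side = ["The cat sleeps on the sofa", "Stock markets fell sharply"]
pairs = extract_pairs(list(toy_mt), toy_mt.get, english_side)
```

Only the first sentence finds a counterpart above the threshold. The second is rejected, which mirrors the need to filter out sentences in comparable corpora that merely share a topic.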


The Moses SMT Decoder

• Moses is an open source SMT system that allows you to automatically train translation models for any language pair. All you need is a parallel corpus. An efficient search algorithm quickly finds the highest-probability translation among the exponential number of choices.

• Moses development has been funded under the Sixth and Seventh Framework Programmes for Research and Technological Development. It is currently supported within EuroMatrix Plus.

• Moses is licensed under the LGPL

• Detailed information is found at http://statmt.org/moses/


The Moses Success Story

• More than 18,000 downloads of release packages, probably many more by SVN checkout.
• 3,785+ revisions in 31 branches inside the SVN repository.
• 718+ scientific citations reported by Google Scholar.
• Moses mailing list has around 485 members and is "one of the most active MT-related list out there".
• MT Marathons organised by EuroMatrix Plus attracted lots of Moses projects/people interested in the software.
• Autodesk used Moses for a post-editing productivity test presented at the MT Marathon 2010 in Dublin.
• Installed, tested and used by EC DGT, by EuroScript GmbH, Germany, by Spanish language service provider Pangeanic.
• TAUS Data Association: "the translation industry is steadily appropriating the Moses translation engine."


User-Centered Support for SMT

• Incremental model updates
  • Redefine SMT as a dynamic, continuous learning process
  • New methods for updates of statistical translation models
  • Allows us to incorporate user feedback immediately

• Translation aid tools for interactive MT
  • Allow monolingual users to translate sentences written in foreign languages
  • The monolingual user is shown a visualization of possible translations for each phrase in an input sentence, and chooses among them to construct a translation.
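The idea behind incremental model updates can be sketched with a toy phrase table that stores raw counts and derives probabilities on demand, so that a user correction takes effect immediately without retraining. The slides do not describe the project's actual training method; the relative-frequency estimate, the class and the example data below are assumptions made for illustration.

```python
from collections import defaultdict


class IncrementalPhraseTable:
    """Toy phrase table: probabilities are relative frequencies over stored counts."""

    def __init__(self):
        self.pair_counts = defaultdict(lambda: defaultdict(int))
        self.source_counts = defaultdict(int)

    def add(self, source, target, count=1):
        # Updating two counters is all that is needed to absorb new evidence.
        self.pair_counts[source][target] += count
        self.source_counts[source] += count

    def prob(self, source, target):
        total = self.source_counts.get(source, 0)
        if total == 0:
            return 0.0
        return self.pair_counts[source].get(target, 0) / total

    def best(self, source):
        options = self.pair_counts.get(source)
        if not options:
            return None
        return max(options, key=options.get)


table = IncrementalPhraseTable()
table.add("Bank", "bank", 3)    # counts from the original training corpus
table.add("Bank", "bench", 1)
before = table.best("Bank")
table.add("Bank", "bench", 3)   # three user corrections arrive
after = table.best("Bank")
```

After the corrections, the preferred translation switches from "bank" to "bench", and `table.prob("Bank", "bench")` rises from 1/4 to 4/7.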


Integrated Localization Workflow

• Translation Memories (TMs) are still the base technology in industrial localization workflows. How can MT be integrated?
• Loose integration of TM and MT: decide which to prefer
  • TM/MT recommendation model based on estimated post-editing effort
  • TM/MT reranking model for outputs
  • Improving post-editing experience
• Tight integration: use bits of TM in MT
  • Constrain the MT system in such a way that matched input bits are translated as per TM and others as per MT system
  • Tree-alignment based system
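The loose integration can be illustrated with a simple recommender that offers the post-editor whichever output is expected to need less editing. This is a sketch under strong assumptions: the word-level fuzzy-match score from Python's `difflib` serves as a crude proxy for TM post-editing effort, the MT effort is a fixed hypothetical constant, and `mt_translate` is a stub. The project's recommendation model estimates post-editing effort, but the slide does not say how.

```python
from difflib import SequenceMatcher


def fuzzy_match(a, b):
    """Word-level similarity in [0, 1] between two source segments."""
    return SequenceMatcher(None, a.lower().split(), b.lower().split()).ratio()


def recommend(source, tm, mt_translate, mt_effort=0.3):
    """Return ("TM", target) or ("MT", target), whichever has lower estimated effort.

    tm is a list of (source_segment, target_segment) pairs; mt_effort is a
    hypothetical constant standing in for a learned estimate.
    """
    best_target, best_sim = None, 0.0
    for tm_source, tm_target in tm:
        sim = fuzzy_match(source, tm_source)
        if sim > best_sim:
            best_target, best_sim = tm_target, sim
    tm_effort = 1.0 - best_sim  # crude proxy: less similar means more editing
    if best_target is not None and tm_effort <= mt_effort:
        return "TM", best_target
    return "MT", mt_translate(source)


memory = [("Press the red button", "Drücken Sie den roten Knopf")]
stub_mt = lambda text: "<MT output for: " + text + ">"

exact = recommend("Press the red button", memory, stub_mt)
distant = recommend("Open the file menu", memory, stub_mt)
```

An exact TM hit is served from the memory, while a segment with little overlap falls through to MT.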


Conferences and Workshops

EuroMatrix Plus organizes various conferences and workshops on a regular basis, thus disseminating its results to both the scientific community and industrial companies. For instance,
• Translingual Europe: 2010 in Berlin, Germany (in connection with Localization World)
• Joint CNGL-EuroMatrix Plus Workshop for Users (in connection with AMTA 2010)
• WMT workshop:
  • 2010 in Uppsala, Sweden, in connection with ACL2010
  • 2011 in Edinburgh, UK, in connection with EMNLP2011
• MT Marathons with papers, discussions, tutorials and hands-on experience:
  • Dublin, Ireland, January 2010
  • Le Mans, France, September 2010
  • Trento, Italy, September 2011


Scientific Publications

The work in the project has so far led to more than 80 scientific publications. Some examples:
• Parallel Sentence Generation from Comparable Corpora for improved SMT (2011) by Sadaf Abdul-Rauf, Holger Schwenk, Machine Translation Journal
• Convergence of Translation Memory and Statistical Machine Translation (2010) by Philipp Koehn, Jean Senellart, AMTA Workshop on MT Research and the Translation Industry
• Hierarchical Hybrid Translation between English and German (2010) by Yu Chen, Andreas Eisele, Proceedings of the 14th Annual Conference of the European Association for Machine Translation

A complete list of publications by the project can be found at http://www.euromatrixplus.eu/publications/


Fulfilling the Goals

1. Advancing MT technology
Translation and language models can now incorporate new data, such as user edits, instantaneously without having to retrain on the entire corpus. A large effort went into making use of (shallow) syntactic information to increase the translation quality.

2. Investigating different approaches & technologies
Different approaches to hybrid MT are being investigated. A rule-based commercial system has been successfully extended with stochastic modules. A hybrid HPSG-based translation approach is currently being implemented.

3. Bringing MT to the users
Work on integrating MT into translation and localization workflows has been carried out to find the most useful set-up for professional users. Lay users are targeted by the WikiTrans work package.


Fulfilling the Goals (cont’d)

4. Contributing to the European MT research scene
Numerous workshops and conferences organized by the consortium increase the visibility of European MT research efforts and foster the exchange between researchers and industry.

5. Create an openly accessible MT sample application
Based on the successful open source system Moses, a broker server platform has been developed (“MT Serverland”). Work on creating an interface for WikiTrans is currently being carried out.
