TAUS OPEN SOURCE MACHINE TRANSLATION SHOWCASE, Paris, Manuel Herranz, Pangeanic, 4 June 2012

Page 1

TAUS OPEN SOURCE MACHINE TRANSLATION SHOWCASE

PangeaMT: Putting Open Standards to Work

16:00-16:15, Monday 4 June

Manuel Herranz, Pangeanic

Page 2

PangeaMT – putting open standards to work… well

Manuel Herranz #manuelhrrnz #pangeanic E: [email protected]

Page 3

MACHINE TRANSLATION

Make my day,

IS NOT

IS

become a post-editor

Page 4

Chomsky: Imagine that if in the future large enough amounts of data existed, they could be processed by computers with enough computing power

rule-based systems, IBM licenses, many linked to patent EN/RU & Intel

First statistical papers

1st Open source SMT

Translation industry appropriating Moses: http://euromatrixplus.net/moses

DIY SMT

http://t.co/HDTboxQ

Page 5

PEAK of Cold War and information control. Products & information directed to consumers / users / citizens

BEGINNING of data resources. Internet. Accessibility to information

Content generated BY USERS / CONSUMERS / CITIZENS, multilingual, free information exchange across the world

Page 6

Types of LSPs (Ben Sargent – TMS Inspiration Days, April '11 – Krakow)

a) developers of a system (they develop it for their own use and for their clients),

b) buyers of systems (they do not want the headache of starting from scratch and prefer to buy ready-built solutions), and finally

c) those who prefer the mix & match approach (buying some good solutions outside and building the interfaces and what they know works best for their business). The trend is towards unification.

Page 7

2007 and before

• RB tests with commercial software
• Insufficiently good output
• Only internal production

2007/08 to 2009/10

• EU Post-Editing Award
• V1: Small data sets (2-5M words), automotive & electronics (ES), then Fr/It/De in other fields
• Division born
• 00's of engine trials and language combinations
• Open-Source to commercial
• TMX / XLIFF workflows

2011/12

• DIY SMT
• Empower Users
• Glossary
• Automated re-training
• Transfer architecture and know-how to users
• Compatibility with commercial formats (ttx, sdlxliff, itd)

As of May 2009: 487 billion gigabytes, or 1,000,000,000 x 487,000,000,000 = 4.87 x 10^20 bytes

Estimates: up 50% a year (Oracle); doubles every 11 hours (IBM)
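As a quick sanity check on the data-volume figure above, the following minimal Python sketch redoes the multiplication and projects the 50%-a-year estimate forward. It assumes decimal gigabytes (10^9 bytes each); the helper name `projected_bytes` is illustrative only and not taken from the slides.

```python
# Assumption: 1 gigabyte = 10**9 bytes (decimal definition).
BYTES_PER_GB = 10**9
GIGABYTES_MAY_2009 = 487 * 10**9  # 487 billion gigabytes

# 487,000,000,000 GB x 1,000,000,000 bytes/GB = 4.87 x 10^20 bytes
total_bytes = GIGABYTES_MAY_2009 * BYTES_PER_GB


def projected_bytes(start_bytes: float, yearly_growth: float, years: int) -> float:
    """Compound growth: volume after `years` at a fixed yearly growth rate."""
    return start_bytes * (1 + yearly_growth) ** years


if __name__ == "__main__":
    print(f"{total_bytes:.2e} bytes as of May 2009")
    print(f"{projected_bytes(total_bytes, 0.5, 3):.2e} bytes after 3 years at +50% a year")
```

The projection simply compounds the quoted yearly rate; it says nothing about which of the two estimates is closer to reality.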

Page 8

OBJECTIVES = CHALLENGES 2007 - 2010

Turn academic development (Moses) into a commercial application.

To provide high-quality MT for post-editing and save time and cost. Not Google-type broad translation, but domain-specific and user-centric.

Lower the entry level for MT. Bring affordability and user control / empowerment to MT. Bring it to the user, take it away from the programmer.

How? By fostering translation automation strategies geared to open standards.

To use only community-based open standards (OASIS / ISO: XLIFF, TMX, XML). NO proprietary formats (technology independence), so USERS are not "locked in" to buying and updating expensive software.

DIY SMT June 2011 http://t.co/HDTboxQ

Page 9

The rush for data

We soon realised that there was a rush to gather data, but that other resources around the data were also necessary

cleaning

More cleaning

Page 10

cleaning

More cleaning

<tu srclang="en-GB">

<tuv xml:lang="EN-GB">

<seg>A system for recovering the methane that is emitted from the manure so that

it does not leak into the atmosphere.</seg>

</tuv>

<tuv xml:lang="FR-FR">

<seg>Système permettant de r€ pérer le méthane qui se dégage de l'engrais naturel

d'origine animale de sorte qu'il ne se dissipe pas dans l'atm sphère.</seg>

</tuv>

<tuv xml:lang="EN-US">

<seg>On 22nd May we decided not to join the group.</seg>

<tuv xml:lang="DE-DE">

<seg>Am 22. </seg>

<tu srclang="en-GB">

<tuv xml:lang="EN-GB">

<seg>The President of the United States visited Costa Rica.</seg>

</tuv>

<tuv xml:lang="ES-ES">

<seg>El Presidente de los Estados Unidos, el señor Obama y su esposa la señora

Michelle, visitaron Costa Rica el pasado sábado.</seg>

</tuv>
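The fragments above are TMX translation units. As a minimal sketch (not PangeaMT's actual pipeline), source/target pairs can be pulled out of a well-formed TMX document with Python's standard `xml.etree.ElementTree`. The function name `read_tmx_pairs` and the sample document are assumptions made for illustration; files with unbalanced tags or curly quotes in attributes would need repairing before they parse at all.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical, well-formed sample built from one of the units on the slide.
SAMPLE_TMX = """<tmx version="1.4">
  <header srclang="en-GB" datatype="plaintext" segtype="sentence"
          adminlang="en" o-tmf="none" creationtool="sketch"
          creationtoolversion="0"/>
  <body>
    <tu>
      <tuv xml:lang="EN-GB">
        <seg>The President of the United States visited Costa Rica.</seg>
      </tuv>
      <tuv xml:lang="ES-ES">
        <seg>El Presidente de los Estados Unidos, el señor Obama y su esposa la señora
        Michelle, visitaron Costa Rica el pasado sábado.</seg>
      </tuv>
    </tu>
  </body>
</tmx>"""


def read_tmx_pairs(tmx_text, src_lang, tgt_lang):
    """Return (source, target) segment pairs for the two language codes."""
    src_lang, tgt_lang = src_lang.lower(), tgt_lang.lower()
    pairs = []
    for tu in ET.fromstring(tmx_text).iter("tu"):
        segments = {}
        for tuv in tu.findall("tuv"):
            lang = (tuv.get(XML_LANG) or tuv.get("lang") or "").lower()
            seg = tuv.find("seg")
            if seg is not None:
                # Join inline-tag text and collapse line wraps into single spaces.
                segments[lang] = " ".join("".join(seg.itertext()).split())
        if src_lang in segments and tgt_lang in segments:
            pairs.append((segments[src_lang], segments[tgt_lang]))
    return pairs


if __name__ == "__main__":
    for source, target in read_tmx_pairs(SAMPLE_TMX, "en-GB", "es-ES"):
        print(source, "|", target)
```

Language codes are compared case-insensitively because the slide mixes `en-GB` and `EN-GB` for the same language.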

Page 11

cleaning

More cleaning

<tuv xml:lang="JP">

<seg>同書は「通訳・翻訳キャリアガイド」の2011-2012年度版。

英字新聞のジャパンタイムズ社が強みとするジャーナリスティックな視点で、通訳や翻訳という仕事が持つ魅力ややりがい、プロに要求されるスキルおよび意識の持ち方などを紹介。また通訳者・翻訳者になるための道すじから、実際の仕事の現場にいたるまで、今日の通訳・翻訳業界の実像を包括的に紹介。</seg>

<tuv xml:lang="EN-US">

<seg>It is a journalistic point of view and strengths of the English-

language newspaper Japan Times. It includes a description of the exciting and

rewarding work of translation and interpretation, as well as the introduction of

consciousness and how to acquire the required professional skills. The road to

becoming a translator and interpreter also down to the actual work site, a

comprehensive guide to interpreting the reality of today's translation industry.

</seg>
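The faulty units on these cleaning slides show typical problems: encoding debris inside words, a target cut off after "Am 22.", and a target that says far more than its source. A minimal sketch of how such pairs might be flagged automatically follows. The function names, the `max_ratio` threshold of 1.8 and the debris rules are assumptions chosen for illustration, not the cleaning rules used in PangeaMT.

```python
import unicodedata


def length_ratio(src: str, tgt: str) -> float:
    """Longer/shorter character-length ratio (1.0 means equal length)."""
    a, b = len(src.strip()), len(tgt.strip())
    if min(a, b) == 0:
        return float("inf")
    return max(a, b) / min(a, b)


def has_encoding_debris(text: str) -> bool:
    """Flag U+FFFD, stray control characters, or a currency sign glued to a letter."""
    for i, ch in enumerate(text):
        if ch == "\ufffd":
            return True
        category = unicodedata.category(ch)
        if category == "Cc" and ch not in "\n\t":
            return True
        if category == "Sc" and i > 0 and text[i - 1].isalpha():
            return True
    return False


def flag_pair(src: str, tgt: str, max_ratio: float = 1.8) -> list:
    """Return the reasons a segment pair looks suspicious (empty list if none)."""
    reasons = []
    if has_encoding_debris(src) or has_encoding_debris(tgt):
        reasons.append("encoding")
    if length_ratio(src, tgt) > max_ratio:
        reasons.append("length")
    return reasons


if __name__ == "__main__":
    print(flag_pair("On 22nd May we decided not to join the group.", "Am 22."))
    print(flag_pair("recover the methane", "r€ pérer le méthane"))
```

A character-length ratio only makes sense between languages with similar script density; for a pair such as Japanese and English the threshold would have to be calibrated separately, and a dropped letter (as in "atm sphère") is not caught by these rules at all.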

Page 13

Domain        Translation    MT+PE
Automotive    400 wph        900 wph
Marketing     250 wph        450 wph
Software      350 wph        1,000 wph
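To make the productivity figures above easier to compare, here is a small Python sketch that derives the speed-up of MT plus post-editing (MT+PE) over translation from scratch, reading "wph" as words per hour. The dictionary and function names are illustrative, and the 10,000-word job is a made-up example size.

```python
# Words per hour from the table: (translation from scratch, MT + post-editing).
THROUGHPUT_WPH = {
    "Automotive": (400, 900),
    "Marketing": (250, 450),
    "Software": (350, 1000),
}


def speedup(translation_wph: float, mt_pe_wph: float) -> float:
    """How many times faster MT+PE is than translating from scratch."""
    return mt_pe_wph / translation_wph


def hours_for(words: int, wph: float) -> float:
    """Hours needed to process `words` at a given words-per-hour rate."""
    return words / wph


if __name__ == "__main__":
    job_words = 10_000  # hypothetical job size
    for domain, (plain, mt_pe) in THROUGHPUT_WPH.items():
        saved = hours_for(job_words, plain) - hours_for(job_words, mt_pe)
        print(f"{domain}: {speedup(plain, mt_pe):.2f}x faster, {saved:.1f} h saved")
```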

Page 14
Page 15
Page 16
Page 17

Page 18

Domains are managed at TM and at engine level

Page 19

I created this engine with medical and pharma TMX files and added environmental TMs to boost coverage. The client deals with plant-based natural drugs / ayurveda.

Tag-based TM selection
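The slides do not show how tag-based selection is implemented, so the following is only a hypothetical Python sketch of the idea: each translation memory carries domain tags, and an engine is trained on every TM that matches at least one requested tag. The class, the function and the file names are all invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class TranslationMemory:
    name: str
    tags: set = field(default_factory=set)


def select_tms(tms, wanted_tags):
    """Return the TMs carrying at least one wanted tag, in their original order."""
    wanted = {tag.lower() for tag in wanted_tags}
    return [tm for tm in tms if wanted & {tag.lower() for tag in tm.tags}]


# Hypothetical TM inventory mirroring the engine described above.
INVENTORY = [
    TranslationMemory("pharma_leaflets.tmx", {"medical", "pharma"}),
    TranslationMemory("environment_reports.tmx", {"environmental"}),
    TranslationMemory("car_manuals.tmx", {"automotive"}),
]

if __name__ == "__main__":
    for tm in select_tms(INVENTORY, ["medical", "pharma", "environmental"]):
        print(tm.name)
```

Matching on any shared tag keeps the sketch simple; a real system might instead weight TMs by how closely their domain matches the client's content.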

Page 20

Page 21
Page 22
Page 23
Page 24

[Timeline figure, 2010 to 2018]

User empowerment

• MT acceptance growth (still)

• Translator engagement challenge (being solved particularly with in-house translators & economic climate)

• Need for data is being addressed – still more work to be done.

• The difference will be made by data handling and MT techniques (hybrid, combination, syntax, re-ordering, etc.)

• Users and practitioners now can build their own systems, A TREND BEING FOLLOWED BY OTHER PLAYERS.

Until 2011/12

YEAR 2016

000's of customized MT systems

In 5 years... after 2017… where?

Tech. not the realm of a few providers

Ubiquitous MT 2009