TAUS OPEN SOURCE MACHINE TRANSLATION SHOWCASE, Monaco, Andrejs Vasiljevs, Tilde, 25 March 2012

28
TAUS OPEN SOURCE MACHINE TRANSLATION SHOWCASE Moses on the Cloud for Do-It-Yourself Machine Translationranslation By Andrejs Vasiļjevs

description

LETS MT! This presentation is a part of the MosesCore project that encourages the development and usage of open source machine translation tools, notably the Moses statistical MT toolkit. MosesCore is supporetd by the European Commission Grant Number 288487 under the 7th Framework Programme. Latest news on Twitter - #MosesCore

Transcript of TAUS OPEN SOURCE MACHINE TRANSLATION SHOWCASE, Monaco, Andrejs Vasiljevs, Tilde, 25 March 2012

Page 1: TAUS OPEN SOURCE MACHINE TRANSLATION SHOWCASE, Monaco, Andrejs Vasiljevs, Tilde, 25 March 2012

TAUS OPEN SOURCE MACHINE TRANSLATION SHOWCASE

Moses on the Cloud for Do-It-Yourself Machine Translationranslation

By Andrejs Vasiļjevs

Page 2: TAUS OPEN SOURCE MACHINE TRANSLATION SHOWCASE, Monaco, Andrejs Vasiljevs, Tilde, 25 March 2012

sAndrejs Vasiļjevs

Chairman of the Board, [email protected]

Moses on the Cloud for Do-It-Yourself Machine

Translation

Page 3: TAUS OPEN SOURCE MACHINE TRANSLATION SHOWCASE, Monaco, Andrejs Vasiljevs, Tilde, 25 March 2012

• Language technology developer

• Localization service provider

• Leadership in smaller languages

• Offices in Riga (Latvia), Tallinn (Estonia) and Vilnius (Lithuania)

• 135 employees

• Strong R&D team

• 9 PhDs and candidates

Page 4: TAUS OPEN SOURCE MACHINE TRANSLATION SHOWCASE, Monaco, Andrejs Vasiljevs, Tilde, 25 March 2012

MTmachine translation

machine translation

Page 5: TAUS OPEN SOURCE MACHINE TRANSLATION SHOWCASE, Monaco, Andrejs Vasiljevs, Tilde, 25 March 2012

INNOVATIONd i s r u p ti v e

d i s r u p ti v e

Page 6: TAUS OPEN SOURCE MACHINE TRANSLATION SHOWCASE, Monaco, Andrejs Vasiljevs, Tilde, 25 March 2012

CHALLENGE

Page 7: TAUS OPEN SOURCE MACHINE TRANSLATION SHOWCASE, Monaco, Andrejs Vasiljevs, Tilde, 25 March 2012

15largest

languages

50%

Page 8: TAUS OPEN SOURCE MACHINE TRANSLATION SHOWCASE, Monaco, Andrejs Vasiljevs, Tilde, 25 March 2012

DATA

Page 9: TAUS OPEN SOURCE MACHINE TRANSLATION SHOWCASE, Monaco, Andrejs Vasiljevs, Tilde, 25 March 2012

one size fits all

?

Page 10: TAUS OPEN SOURCE MACHINE TRANSLATION SHOWCASE, Monaco, Andrejs Vasiljevs, Tilde, 25 March 2012

just use Moses?

[ttable-file]0 0 5 /.../unfactored/model/phrase-table.0-0.gz% ls steps/1/LM_toy_tokenize.1* | catsteps/1/LM_toy_tokenize.1steps/1/LM_toy_tokenize.1.DONEsteps/1/LM_toy_tokenize.1.INFOsteps/1/LM_toy_tokenize.1.STDERRsteps/1/LM_toy_tokenize.1.STDERR.digeststeps/1/LM_toy_tokenize.1.STDOUT% train-model.perl \--corpus factored-corpus/proj-syndicate \--root-dir unfactored \--f de --e en \--lm 0:3:factored-corpus/surface.lm:0% moses -f moses.ini -lmodel-file "0 0 3 ../lm/europarl.srilm.gz“use-berkeley = truealignment-symmetrization-method = berkeleyberkeley-train = $moses-script-dir/ems/support/berkeley-train.shberkeley-process = $moses-script-dir/ems/support/berkeley-process.shberkeley-jar = /your/path/to/berkeleyaligner-2.1/berkeleyaligner.jarberkeley-java-options = "-server -mx30000m -ea"berkeley-training-options = "-Main.iters 5 5 -EMWordAligner.numThreads 8"berkeley-process-options = "-EMWordAligner.numThreads 8"berkeley-posterior = 0.5tokenizein: raw-stemout: tokenized-stemdefault-name: corpus/tokpass-unless: input-tokenizer output-tokenizertemplate-if: input-tokenizer IN.$input-extension OUT.$input-extensiontemplate-if: output-tokenizer IN.$output-extension OUT.$output-extensionparallelizable: yesworking-dir = /home/pkoehn/experimentwmt10-data = $working-dir/data

Page 11: TAUS OPEN SOURCE MACHINE TRANSLATION SHOWCASE, Monaco, Andrejs Vasiljevs, Tilde, 25 March 2012

buildyour ownMT engine

!

Page 12: TAUS OPEN SOURCE MACHINE TRANSLATION SHOWCASE, Monaco, Andrejs Vasiljevs, Tilde, 25 March 2012

s

customized MT

Page 13: TAUS OPEN SOURCE MACHINE TRANSLATION SHOWCASE, Monaco, Andrejs Vasiljevs, Tilde, 25 March 2012

Tilde / CoordinatorLATVIA

University of EdinburghUK

Uppsala UniversitySWEDEN

Copehagen UniversityDENMARK

University of ZagrebCROATIA

MoraviaCZECH REPUBLIC

SemLabNETHERLANDS

Page 14: TAUS OPEN SOURCE MACHINE TRANSLATION SHOWCASE, Monaco, Andrejs Vasiljevs, Tilde, 25 March 2012

• Online collaborative platform for MT building from user-provided data

• Repository of parallel and monolingual corpora for MT generation

• Automated training of SMT systems from specified collections of data

• Users can specify particular training data collections and build customised MT engines from these collections

• Users can also use LetsMT! platform for tailoring MT system to their needs from their non-public data

Page 15: TAUS OPEN SOURCE MACHINE TRANSLATION SHOWCASE, Monaco, Andrejs Vasiljevs, Tilde, 25 March 2012

• User-driven cloud-based MT factory, based on open-source MT tools

• Services for data collection, MT generation, customization and running of variety of user-tailored MT systems

• Application in localization among the key usage scenarios

• Strong synergy with FP7 project ACCURAT to advance data-driven machine translation for under-resourced languages and domains

Page 16: TAUS OPEN SOURCE MACHINE TRANSLATION SHOWCASE, Monaco, Andrejs Vasiljevs, Tilde, 25 March 2012

• Stores SMT training data• Supports different formats –

TMX, XLIFF, PDF, DOC, plain text

• Converts to unified format• Performs format

conversions and alignmentResourceRepository

Page 17: TAUS OPEN SOURCE MACHINE TRANSLATION SHOWCASE, Monaco, Andrejs Vasiljevs, Tilde, 25 March 2012

c

MT

Page 18: TAUS OPEN SOURCE MACHINE TRANSLATION SHOWCASE, Monaco, Andrejs Vasiljevs, Tilde, 25 March 2012

• Integration with CAT tools• Integration in web pages • Integration in web browsers• API-level integration

integration

Page 19: TAUS OPEN SOURCE MACHINE TRANSLATION SHOWCASE, Monaco, Andrejs Vasiljevs, Tilde, 25 March 2012

Training UsingSharing of training data

Giza++Moses SMT toolkit

SMT Resource Repository

SMT Multi-Model Repository

(trained SMT models)

Proc

esin

g, E

valu

ation

...

Upl

oad

Anon

ymou

sac

cess

Auth

entic

ated

acce

ss

System management, user authentication, access rights control ...

Web page

Web service

Web pagetranslation widget

CAT tools

Web browserPlug-ins

SMT Resource Directory

SMT System Directory

Moses decoder

Page 20: TAUS OPEN SOURCE MACHINE TRANSLATION SHOWCASE, Monaco, Andrejs Vasiljevs, Tilde, 25 March 2012

sUser interface webpage UI, web service API

Application Logic Resource Repositorystores MT training data and trained models

High-performance Computing Clusterexecutes all computationally heavy tasks: SMT training, MT service, Processing and aligning of training data etc.

Interface Layer

Web Page UI Public API

Application Logic LayerResource

Repository Adapter

SMT training

Data Storage Layer(Resource Repository)

High-performance Computing (HPC) Cluster

SGE

Widget ...CAT toolsCAT tools CAT toolsBrowser plug-ins

http

sR

ES

T

http

/http

sht

ml

http

sR

ES

T

h ttp

sR

ES

T, S

OA

P, .

. .

TC

P/IP

h tt p

RE

ST

/ SO

AP

CPUCPU

CPU CPU

CPU CPU

CPU

CPU

htt p

RE

ST

/ SO

AP

Translation

RE

ST

System DB

RR API

SVN

File Share

Web Browsers

HPC frontend CPUREST

SystemArchitecture

Page 21: TAUS OPEN SOURCE MACHINE TRANSLATION SHOWCASE, Monaco, Andrejs Vasiljevs, Tilde, 25 March 2012
Page 22: TAUS OPEN SOURCE MACHINE TRANSLATION SHOWCASE, Monaco, Andrejs Vasiljevs, Tilde, 25 March 2012
Page 23: TAUS OPEN SOURCE MACHINE TRANSLATION SHOWCASE, Monaco, Andrejs Vasiljevs, Tilde, 25 March 2012
Page 24: TAUS OPEN SOURCE MACHINE TRANSLATION SHOWCASE, Monaco, Andrejs Vasiljevs, Tilde, 25 March 2012
Page 25: TAUS OPEN SOURCE MACHINE TRANSLATION SHOWCASE, Monaco, Andrejs Vasiljevs, Tilde, 25 March 2012

%

productivity32.9%*

* Skadiņš R., Puriņš M., Skadiņa I., Vasiļjevs A., Evaluation of SMT in localization to under-resourced inflected language, in Proceedings of the 15th International Conference of the European Association for Machine Translation EAMT 2011, p. 35-40, May 30-31, 2011, Leuven, Belgium

Latvian

Page 26: TAUS OPEN SOURCE MACHINE TRANSLATION SHOWCASE, Monaco, Andrejs Vasiljevs, Tilde, 25 March 2012

%

productivity

25.1%

* LetsMT! Project Deliverable D6.4

28.5%

Czech Polish

Page 27: TAUS OPEN SOURCE MACHINE TRANSLATION SHOWCASE, Monaco, Andrejs Vasiljevs, Tilde, 25 March 2012

• incremental training,

• distributed language models

• interpolated language models for domain adaptation

• randomized language models to train using huge corpora

• translation of formatted texts

• running Moses decoder in a server mode

New Moses features

Page 28: TAUS OPEN SOURCE MACHINE TRANSLATION SHOWCASE, Monaco, Andrejs Vasiljevs, Tilde, 25 March 2012

tilde.comtechnologies

for smaller

languages

The research within the LetsMT! project leading to these results has received funding from the ICT Policy Support Programme (ICT PSP), Theme 5 – Multilingual web, grant agreement no 250456