TAUS OPEN SOURCE MACHINE TRANSLATION SHOWCASE, Monaco, Andrejs Vasiljevs, Tilde, 25 March 2012

Post on 26-Jun-2015


LetsMT! This presentation is part of the MosesCore project, which encourages the development and use of open source machine translation tools, notably the Moses statistical MT toolkit. MosesCore is supported by European Commission Grant Number 288487 under the 7th Framework Programme. Latest news on Twitter: #MosesCore


TAUS OPEN SOURCE MACHINE TRANSLATION SHOWCASE

Moses on the Cloud for Do-It-Yourself Machine Translation

By Andrejs Vasiļjevs

Chairman of the Board, Tilde

andrejs@tilde.com

• Language technology developer

• Localization service provider

• Leadership in smaller languages

• Offices in Riga (Latvia), Tallinn (Estonia) and Vilnius (Lithuania)

• 135 employees

• Strong R&D team

• 9 PhDs and candidates

MT: machine translation

Disruptive INNOVATION

CHALLENGE

15 largest languages

50%

DATA

one size fits all?

just use Moses?

[ttable-file]
0 0 5 /.../unfactored/model/phrase-table.0-0.gz

% ls steps/1/LM_toy_tokenize.1* | cat
steps/1/LM_toy_tokenize.1
steps/1/LM_toy_tokenize.1.DONE
steps/1/LM_toy_tokenize.1.INFO
steps/1/LM_toy_tokenize.1.STDERR
steps/1/LM_toy_tokenize.1.STDERR.digest
steps/1/LM_toy_tokenize.1.STDOUT

% train-model.perl \
    --corpus factored-corpus/proj-syndicate \
    --root-dir unfactored \
    --f de --e en \
    --lm 0:3:factored-corpus/surface.lm:0

% moses -f moses.ini -lmodel-file "0 0 3 ../lm/europarl.srilm.gz"

use-berkeley = true
alignment-symmetrization-method = berkeley
berkeley-train = $moses-script-dir/ems/support/berkeley-train.sh
berkeley-process = $moses-script-dir/ems/support/berkeley-process.sh
berkeley-jar = /your/path/to/berkeleyaligner-2.1/berkeleyaligner.jar
berkeley-java-options = "-server -mx30000m -ea"
berkeley-training-options = "-Main.iters 5 5 -EMWordAligner.numThreads 8"
berkeley-process-options = "-EMWordAligner.numThreads 8"
berkeley-posterior = 0.5

tokenize
    in: raw-stem
    out: tokenized-stem
    default-name: corpus/tok
    pass-unless: input-tokenizer output-tokenizer
    template-if: input-tokenizer IN.$input-extension OUT.$input-extension
    template-if: output-tokenizer IN.$output-extension OUT.$output-extension
    parallelizable: yes

working-dir = /home/pkoehn/experiment
wmt10-data = $working-dir/data
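The settings on this slide refer to one another through `$name` references (for example, `wmt10-data = $working-dir/data`). As a rough illustration of how such references expand, here is a minimal Python sketch. It is not the parser used by the Moses experiment management scripts: the function name `resolve_settings` is hypothetical, and the sketch only handles flat `key = value` lines whose references point to keys defined earlier.

```python
import re


def resolve_settings(lines):
    """Parse `key = value` lines and expand `$name` references to earlier keys."""
    settings = {}
    for line in lines:
        line = line.strip()
        # Skip blanks, comments, and anything that is not a key = value pair.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = (part.strip() for part in line.split("=", 1))
        # Replace each $name with an already-defined setting;
        # unknown names are left untouched rather than guessed.
        value = re.sub(
            r"\$([A-Za-z0-9_-]+)",
            lambda m: settings.get(m.group(1), m.group(0)),
            value,
        )
        settings[key] = value
    return settings


# The two lines below are taken from the slide.
config = [
    "working-dir = /home/pkoehn/experiment",
    "wmt10-data = $working-dir/data",
]
settings = resolve_settings(config)
print(settings["wmt10-data"])  # /home/pkoehn/experiment/data
```

Even this toy version hints at the slide's point: a working Moses setup involves many interdependent paths and options that a user has to get right by hand.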

build your own MT engine!

customized MT

Tilde / Coordinator, LATVIA

University of Edinburgh, UK

Uppsala University, SWEDEN

Copenhagen University, DENMARK

University of Zagreb, CROATIA

Moravia, CZECH REPUBLIC

SemLab, NETHERLANDS

• Online collaborative platform for MT building from user-provided data

• Repository of parallel and monolingual corpora for MT generation

• Automated training of SMT systems from specified collections of data

• Users can specify particular training data collections and build customised MT engines from these collections

• Users can also use the LetsMT! platform to tailor an MT system to their needs using their own non-public data

• User-driven cloud-based MT factory, based on open-source MT tools

• Services for data collection, MT generation, customization and running of a variety of user-tailored MT systems

• Application in localization among the key usage scenarios

• Strong synergy with FP7 project ACCURAT to advance data-driven machine translation for under-resourced languages and domains

Resource Repository

• Stores SMT training data

• Supports different formats: TMX, XLIFF, PDF, DOC, plain text

• Converts to unified format

• Performs format conversions and alignment
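To make the "converts to unified format" step concrete, the sketch below pulls aligned sentence pairs out of a small TMX document using only the Python standard library. It is an illustration under stated assumptions, not the LetsMT! converter: the function name `tmx_to_pairs` and the sample sentences are made up for the example, and the sample TMX is trimmed to the elements the code reads.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def tmx_to_pairs(tmx_text, src_lang, tgt_lang):
    """Return (source, target) segment pairs from a TMX document string."""
    root = ET.fromstring(tmx_text)
    pairs = []
    for tu in root.iter("tu"):
        segments = {}
        for tuv in tu.findall("tuv"):
            # TMX 1.4 uses xml:lang; older files may carry a plain lang attribute.
            lang = tuv.get(XML_LANG) or tuv.get("lang") or ""
            seg = tuv.find("seg")
            if seg is not None:
                # itertext() also collects text nested inside inline markup.
                segments[lang.lower()] = "".join(seg.itertext()).strip()
        # Keep only translation units that contain both requested languages.
        if src_lang in segments and tgt_lang in segments:
            pairs.append((segments[src_lang], segments[tgt_lang]))
    return pairs


# Hypothetical sample data, trimmed to what the function needs.
SAMPLE_TMX = """<tmx version="1.4">
  <header srclang="en"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>Open the file.</seg></tuv>
      <tuv xml:lang="lv"><seg>Atveriet failu.</seg></tuv>
    </tu>
    <tu>
      <tuv xml:lang="en"><seg>No translation here.</seg></tuv>
    </tu>
  </body>
</tmx>"""

pairs = tmx_to_pairs(SAMPLE_TMX, "en", "lv")
print(pairs)
```

Formats such as PDF or DOC carry no alignment of their own, which is presumably why the repository lists alignment as a separate processing step.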

MT integration

• Integration with CAT tools

• Integration in web pages

• Integration in web browsers

• API-level integration

[Diagram. Labels: Sharing of training data; Training; Using; Upload; Processing, Evaluation ...; Giza++; Moses SMT toolkit; SMT Resource Repository; SMT Multi-Model Repository (trained SMT models); SMT Resource Directory; SMT System Directory; Moses decoder; Anonymous access; Authenticated access; System management, user authentication, access rights control ...; Web page; Web service; Web page translation widget; CAT tools; Web browser plug-ins]

• User interface: webpage UI, web service API

• Application Logic

• Resource Repository: stores MT training data and trained models

• High-performance Computing Cluster: executes all computationally heavy tasks (SMT training, MT service, processing and aligning of training data, etc.)

[Figure: System Architecture. Labels: Web Browsers; Widget; CAT tools; Browser plug-ins; Interface Layer (Web Page UI, Public API); Application Logic Layer (Resource Repository Adapter, SMT training, Translation, System DB); Data Storage Layer (Resource Repository: RR API, SVN, File Share); High-performance Computing (HPC) Cluster (SGE, HPC frontend, CPU nodes). Connections: http/https (html), REST, SOAP, TCP/IP]

Productivity

Latvian: 32.9%*

* Skadiņš R., Puriņš M., Skadiņa I., Vasiļjevs A., Evaluation of SMT in localization to under-resourced inflected language, in Proceedings of the 15th International Conference of the European Association for Machine Translation EAMT 2011, p. 35-40, May 30-31, 2011, Leuven, Belgium

Productivity

Czech: 25.1%*

Polish: 28.5%*

* LetsMT! Project Deliverable D6.4

New Moses features

• incremental training

• distributed language models

• interpolated language models for domain adaptation

• randomized language models to train using huge corpora

• translation of formatted texts

• running Moses decoder in a server mode
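The idea behind interpolated language models for domain adaptation is to mix an in-domain model with a general one, p(w) = λ·p_in(w) + (1 − λ)·p_general(w). The Python sketch below shows this on toy unigram distributions. It is only an illustration: the function name `interpolate`, the probabilities and the weight 0.7 are invented for the example, and real systems interpolate n-gram models with a weight typically tuned on held-out in-domain text.

```python
def interpolate(p_in, p_general, lam):
    """Linearly interpolate two unigram distributions.

    Returns lam * p_in(w) + (1 - lam) * p_general(w) for every word
    seen in either model; a word missing from a model has probability 0 there.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must be in [0, 1]")
    vocab = set(p_in) | set(p_general)
    return {
        w: lam * p_in.get(w, 0.0) + (1.0 - lam) * p_general.get(w, 0.0)
        for w in vocab
    }


# Toy distributions (hypothetical numbers): a small in-domain model
# and a broader general-domain model.
in_domain = {"driver": 0.6, "install": 0.4}
general = {"driver": 0.1, "install": 0.1, "house": 0.8}

mixed = interpolate(in_domain, general, lam=0.7)
for word in sorted(mixed):
    print(word, round(mixed[word], 2))
```

Because both inputs sum to one and the weights sum to one, the mixture is again a probability distribution; in-domain words gain weight while general vocabulary such as "house" keeps a non-zero probability.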

tilde.com

technologies for smaller languages

The research within the LetsMT! project leading to these results has received funding from the ICT Policy Support Programme (ICT PSP), Theme 5 – Multilingual web, grant agreement no 250456