TAUS OPEN SOURCE MACHINE TRANSLATION SHOWCASE, Monaco, Andrejs Vasiljevs, Tilde, 25 March 2012


Moses on the Cloud for Do-It-Yourself Machine Translation

By Andrejs Vasiļjevs


Chairman of the Board, Tilde
andrejs@tilde.com


• Language technology developer

• Localization service provider

• Leadership in smaller languages

• Offices in Riga (Latvia), Tallinn (Estonia) and Vilnius (Lithuania)

• 135 employees

• Strong R&D team

• 9 PhDs and candidates

MT: machine translation

DISRUPTIVE INNOVATION

CHALLENGE

15 largest languages

one size fits all

just use Moses?

[ttable-file]
0 0 5 /.../unfactored/model/phrase-table.0-0.gz

% ls steps/1/LM_toy_tokenize.1* | cat
steps/1/LM_toy_tokenize.1
steps/1/LM_toy_tokenize.1.DONE
steps/1/LM_toy_tokenize.1.INFO
steps/1/LM_toy_tokenize.1.STDERR
steps/1/LM_toy_tokenize.1.STDERR.digest
steps/1/LM_toy_tokenize.1.STDOUT

% train-model.perl \
    --corpus factored-corpus/proj-syndicate \
    --root-dir unfactored \
    --f de --e en \
    --lm 0:3:factored-corpus/surface.lm:0

% moses -f moses.ini -lmodel-file "0 0 3 ../lm/europarl.srilm.gz"

use-berkeley = true
alignment-symmetrization-method = berkeley
berkeley-train = $moses-script-dir/ems/support/berkeley-train.sh
berkeley-process = $moses-script-dir/ems/support/berkeley-process.sh
berkeley-jar = /your/path/to/berkeleyaligner-2.1/berkeleyaligner.jar
berkeley-java-options = "-server -mx30000m -ea"
berkeley-training-options = "-Main.iters 5 5 -EMWordAligner.numThreads 8"
berkeley-process-options = "-EMWordAligner.numThreads 8"
berkeley-posterior = 0.5

tokenize
    in: raw-stem
    out: tokenized-stem
    default-name: corpus/tok
    pass-unless: input-tokenizer output-tokenizer
    template-if: input-tokenizer IN.$input-extension OUT.$input-extension
    template-if: output-tokenizer IN.$output-extension OUT.$output-extension
    parallelizable: yes

working-dir = /home/pkoehn/experiment
wmt10-data = $working-dir/data

build your own MT engine

customized MT

Tilde / Coordinator, LATVIA

University of Edinburgh, UK

Uppsala University, SWEDEN

Copenhagen University, DENMARK

University of Zagreb, CROATIA

Moravia, CZECH REPUBLIC

SemLab, NETHERLANDS

• Online collaborative platform for MT building from user-provided data

• Repository of parallel and monolingual corpora for MT generation

• Automated training of SMT systems from specified collections of data

• Users can specify particular training data collections and build customised MT engines from these collections

• Users can also use the LetsMT! platform to tailor MT systems to their needs using their non-public data

• User-driven cloud-based MT factory, based on open-source MT tools

• Services for data collection, MT generation, customization and running of a variety of user-tailored MT systems

• Application in localization among the key usage scenarios

• Strong synergy with FP7 project ACCURAT to advance data-driven machine translation for under-resourced languages and domains

Resource Repository

• Stores SMT training data

• Supports different formats: TMX, XLIFF, PDF, DOC, plain text

• Converts to unified format

• Performs format conversions and alignment
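
To make the "converts to unified format" step concrete, here is a minimal Python sketch that extracts aligned segment pairs from a TMX file. It is an illustration only, not the LetsMT! implementation: the function name tmx_to_pairs, the sample data, and the choice of tab-separated pairs as the "unified" output are all assumptions, and language codes are matched exactly (real files often carry regional codes such as en-US).

```python
import xml.etree.ElementTree as ET

# ElementTree expands the predefined xml: prefix to this namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def tmx_to_pairs(tmx_text, src_lang, tgt_lang):
    """Return (source, target) segment pairs from a TMX document string.

    Hypothetical helper for illustration; not part of LetsMT! or Moses.
    """
    root = ET.fromstring(tmx_text)
    pairs = []
    for tu in root.iter("tu"):  # one translation unit per aligned segment
        segs = {}
        for tuv in tu.findall("tuv"):
            # TMX 1.4 uses xml:lang; older files use a plain "lang" attribute.
            lang = (tuv.get(XML_LANG) or tuv.get("lang") or "").lower()
            seg = tuv.find("seg")
            if seg is not None:
                # itertext() flattens inline markup; split/join normalizes spaces.
                segs[lang] = " ".join("".join(seg.itertext()).split())
        # Keep only units that have non-empty text in both languages.
        if segs.get(src_lang) and segs.get(tgt_lang):
            pairs.append((segs[src_lang], segs[tgt_lang]))
    return pairs


# Made-up sample data: the second unit has no Latvian side and is skipped.
SAMPLE_TMX = """<tmx version="1.4">
  <header srclang="en"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>Open the file.</seg></tuv>
      <tuv xml:lang="lv"><seg>Atveriet failu.</seg></tuv>
    </tu>
    <tu>
      <tuv xml:lang="en"><seg>Save   changes</seg></tuv>
    </tu>
  </body>
</tmx>"""

for src, tgt in tmx_to_pairs(SAMPLE_TMX, "en", "lv"):
    print(f"{src}\t{tgt}")
```

Formats such as PDF or DOC would additionally need text extraction and sentence alignment before they could yield pairs like these.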

Integration

• Integration with CAT tools

• Integration in web pages

• Integration in web browsers

• API-level integration

[Diagram: LetsMT! platform overview]

Stages: Sharing of training data, Training, Using

Components: SMT Resource Repository, SMT Resource Directory, Giza++, Moses SMT toolkit, SMT Multi-Model Repository (trained SMT models), SMT System Directory, Moses decoder

Access channels: Web page, Web service, Web page translation widget, CAT tools, Web browser plug-ins

System management, user authentication, access rights control ...

User interface: web page UI, web service API

Application logic

Resource Repository: stores MT training data and trained models

High-performance Computing Cluster: executes all computationally heavy tasks (SMT training, MT service, processing and aligning of training data, etc.)
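
One way an MT service of this kind can be exposed is the Moses server mode, which accepts XML-RPC requests. The Python sketch below shows what a thin client might look like. It assumes a Moses server is already running; the URL and port are placeholders, the helper names connect and translate are hypothetical, and nothing here is taken from the LetsMT! code base.

```python
import xmlrpc.client


def connect(url="http://localhost:8080/RPC2"):
    """Create an XML-RPC proxy for a running Moses server.

    The URL is an assumption; use the host and port your server listens on.
    No network traffic happens until a method is called on the proxy.
    """
    return xmlrpc.client.ServerProxy(url)


def translate(proxy, text):
    """Send one sentence to the server and return the translated text.

    The Moses server's "translate" method takes a struct with a "text"
    member and returns a struct that also carries a "text" member.
    The input is expected to be preprocessed (e.g. tokenized) the same
    way as the training data.
    """
    result = proxy.translate({"text": text})
    return result["text"]


# Usage, with a server running (not executed here):
#   proxy = connect()
#   print(translate(proxy, "das ist ein kleines haus"))
```

Passing the proxy in as an argument keeps the function easy to exercise with a stand-in object when no server is available.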

System Architecture

[Architecture diagram]

Clients: Web Browsers, Widget, CAT tools, Browser plug-ins (connecting over HTTP)

Interface Layer: Web Page UI, Public API

Application Logic Layer: Resource Repository Adapter, SMT training

Data Storage Layer (Resource Repository): RR API, File Share

High-performance Computing (HPC) Cluster: HPC frontend, CPU nodes

Other labels: Translation, System DB, REST

productivity 32.9%*

* Skadiņš R., Puriņš M., Skadiņa I., Vasiļjevs A., Evaluation of SMT in localization to under-resourced inflected language, in Proceedings of the 15th International Conference of the European Association for Machine Translation EAMT 2011, p. 35-40, May 30-31, 2011, Leuven, Belgium

Latvian

productivity

* LetsMT! Project Deliverable D6.4

Czech Polish

New Moses features

• incremental training

• distributed language models

• interpolated language models for domain adaptation

• randomized language models to train using huge corpora

• translation of formatted texts

• running Moses decoder in a server mode
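
As an illustration of the "interpolated language models for domain adaptation" item, the Python sketch below mixes an in-domain and a general model linearly, p(w) = lam * p_in(w) + (1 - lam) * p_out(w), and picks the weight that minimizes perplexity on a small in-domain development set. It is a toy under stated assumptions, not how Moses or its language-model toolkits implement interpolation: the models are unigram dictionaries with made-up probabilities, and the names interpolate, perplexity, and best_lambda are hypothetical.

```python
import math


def interpolate(p_in, p_out, lam):
    """Linear interpolation of two model probabilities with weight lam."""
    return lam * p_in + (1.0 - lam) * p_out


def perplexity(tokens, in_lm, out_lm, lam, floor=1e-6):
    """Perplexity of the interpolated model on a token list.

    Unseen words get a small floor probability (an assumption of this
    sketch, standing in for proper smoothing).
    """
    log_sum = 0.0
    for tok in tokens:
        p = interpolate(in_lm.get(tok, floor), out_lm.get(tok, floor), lam)
        log_sum += math.log(p)
    return math.exp(-log_sum / len(tokens))


def best_lambda(tokens, in_lm, out_lm, steps=10):
    """Grid-search the weight in [0, 1] that minimizes dev-set perplexity."""
    candidates = [i / steps for i in range(steps + 1)]
    return min(candidates, key=lambda lam: perplexity(tokens, in_lm, out_lm, lam))


# Made-up toy models: a small in-domain (software) LM and a general LM.
IN_LM = {"click": 0.5, "file": 0.3, "the": 0.2}
OUT_LM = {"the": 0.6, "file": 0.1, "parliament": 0.3}
DEV = ["click", "the", "file"]

lam = best_lambda(DEV, IN_LM, OUT_LM)
print(f"best lambda: {lam}, perplexity: {perplexity(DEV, IN_LM, OUT_LM, lam):.3f}")
```

Real systems interpolate n-gram models with context and usually estimate the weight with an EM-style procedure rather than a grid search; the grid is used here only to keep the example short.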

tilde.com

technologies for smaller languages

The research within the LetsMT! project leading to these results has received funding from the ICT Policy Support Programme (ICT PSP), Theme 5 – Multilingual web, grant agreement no 250456
