Statistical Machine Translation with Moses, Hieu Hoang, Localization World 2013

Post on 29-Mar-2015


Statistical Machine Translation with Moses

Hieu Hoang, Localization World 2013

Moses by Hieu Hoang, University of Edinburgh


Agenda

• What is Statistical Machine Translation?
• What is Moses?
  – Common misconceptions
• Coming up
• What can we do for you?


What is Statistical Machine Translation?

It is very tempting to say that a book written in Chinese is simply a book written in English which was coded into the “Chinese code.” If we have useful methods for solving almost any cryptographic problem, may it not be that with proper interpretation we already have useful methods for translation?

Warren Weaver, 1949


What is Statistical Machine Translation?

• NLP application
  – search engines, text mining, etc.
• Big data
  – bi-text from the Internet
    • e.g. multilingual websites, documents
  – large monolingual data
• Learn to translate
  – from previous translations
  – models of language


What is Statistical Machine Translation?

Training
• Input: training data and linguistic tools (bi-text, monolingual data, dictionary)
• Output: SMT system (translation model, language model, lots of numbers…)

Using
• Input: source text
• Applied: SMT system (translation model, language model, lots of numbers…)
• Output: translation


What is a model?

thanks to Precision Translation Tools

• Translation Model
• Language Model
  – (of the target language)
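One way to picture how the two models work together is to score each candidate translation with both and keep the highest-scoring one. The sketch below assumes the simplest possible combination, adding the two log-probabilities. All candidate strings and numbers are hypothetical, invented for illustration. A real decoder such as Moses combines more feature scores, each with its own weight, so this is not a description of its actual scoring.

```python
import math

# Hypothetical candidates for one source phrase. The numbers are invented
# for illustration only; they do not come from any trained model.
#   tm: translation-model probability of the candidate given the source
#   lm: language-model probability of the candidate as target-language text
candidates = {
    "the proposal": {"tm": 0.60, "lm": 0.0200},
    "proposal the": {"tm": 0.60, "lm": 0.0001},
    "the idea": {"tm": 0.03, "lm": 0.0300},
}


def combined_score(scores):
    """Add the log-probabilities from the translation and language models."""
    return math.log(scores["tm"]) + math.log(scores["lm"])


# Pick the candidate with the highest combined score.
best = max(candidates, key=lambda text: combined_score(candidates[text]))
print(best)
```

In this toy setup the language model is what separates the fluent word order from the scrambled one, since both receive the same translation-model score.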


What is a model?

• Translation model
  – source and translation
  – probability

source          target          probability
den Vorschlag   the proposal    0.6227
                's proposal     0.1068
                a proposal      0.0341
                the idea        0.0250
                this proposal   0.0227
                proposal        0.0205
                ….              ….
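As a rough illustration, the table above can be held in a dictionary that maps a source phrase to its candidate translations. This is only a sketch: the name `phrase_table` and the helper `best_translation` are hypothetical, and the structure says nothing about how Moses actually stores its phrase table.

```python
# Entries copied from the table above. The slide truncates the list,
# so these probabilities do not sum to 1.
phrase_table = {
    "den Vorschlag": [
        ("the proposal", 0.6227),
        ("'s proposal", 0.1068),
        ("a proposal", 0.0341),
        ("the idea", 0.0250),
        ("this proposal", 0.0227),
        ("proposal", 0.0205),
    ],
}


def best_translation(source):
    """Return the most probable (target, probability) pair for a source phrase."""
    return max(phrase_table[source], key=lambda pair: pair[1])


target, prob = best_translation("den Vorschlag")
print(target, prob)
```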


What is a model?

• Language model
  – likelihood of sentence
  – in target language

text                     probability
I would like             0.489
would like to            0.905
like to commend          0.002
to commend the           0.472
commend the rapporteur   0.147
….                       ….
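To show how such a table could yield a sentence-level likelihood, the sketch below multiplies the entries for every three-word window of a sentence, working in log space. It assumes each number can be read as the probability of the third word given the two before it. It also ignores the start of the sentence and has no smoothing or back-off for unseen word sequences, all of which a real language model handles. The names `trigram_prob` and `sentence_log_prob` are hypothetical.

```python
import math

# Trigram entries copied from the table above.
trigram_prob = {
    ("I", "would", "like"): 0.489,
    ("would", "like", "to"): 0.905,
    ("like", "to", "commend"): 0.002,
    ("to", "commend", "the"): 0.472,
    ("commend", "the", "rapporteur"): 0.147,
}


def sentence_log_prob(words):
    """Sum the log-probabilities of every consecutive three-word window."""
    total = 0.0
    for i in range(len(words) - 2):
        total += math.log(trigram_prob[tuple(words[i:i + 3])])
    return total


words = "I would like to commend the rapporteur".split()
log_prob = sentence_log_prob(words)
print(math.exp(log_prob))
```

Summing logarithms rather than multiplying raw probabilities avoids numerical underflow on longer sentences.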


What is Moses?

• Replacement for Pharaoh
  – Academic software
  – Closed-source
• Open source
• Re-written, clean code
  – More features
• Large developer community
  – Initiated by Hieu Hoang
  – Developed at NLP Workshop




What is Moses?

Common Misconceptions

• Only for Linux
• Difficult to use
• Unreliable
• Only phrase-based
• Developed by one person
• Slow


Only works on Linux

• Tested on
  – Windows 7 (32-bit) with Cygwin 6.1
  – Mac OSX 10.7 with MacPorts
  – Ubuntu 12.10, 32 and 64-bit
  – Debian 6.0, 32 and 64-bit
  – Fedora 17, 32 and 64-bit
  – openSUSE 12.2, 32 and 64-bit
• Project files for
  – Visual Studio
  – Eclipse on Linux and Mac OSX


Difficult to use

• Easier compile and install
  – Boost bjam
  – No installation required
• Binaries available for
  – Linux
  – Mac
  – Windows/Cygwin
  – Moses + Friends
    • IRSTLM
    • GIZA++ and MGIZA
• Ready-made models trained on Europarl


Unreliable

• Monitor check-ins
• Unit tests
• More regression tests
• Nightly tests
  – Run end-to-end training
  – http://www.statmt.org/moses/cruise/
• Tested on all major OSes
• Train Europarl models
  – Phrase-based, hierarchical, factored
  – 8 language-pairs
  – http://www.statmt.org/moses/RELEASE-1.0/models/


Only phrase-based model

  – replacement for Pharaoh
  – extension of Pharaoh
• From the beginning
  – Factored models
  – Lattice and confusion network input
  – Multiple LMs, multiple phrase-tables
• Since 2009
  – Hierarchical model
  – Syntactic models


Developed by one person

• ANYONE can contribute
  – 50 contributors

[Figure: ‘git blame’ of Moses repository. Bar chart by contributor, axis from 0% to 40%. Contributors shown: Kenneth Heafield, Hieu Hoang, phkoehn, Ondrej Bojar, Barry Haddow, sanmarf, Tetsuo Kiso, Eva Hasler, Rico Sennrich, wlin12, nicolabertoldi, eherbst, Ales Tamchyna, Colin Cherry, Matous Machacek, Phil Williams.]


Slow

Decoding

thanks to Ken!!


Slow

Training

• Multithreaded
• Reduced disk IO
  – compress intermediate files
• Reduce disk space requirement

Time (mins)    1-core   2-cores     4-cores     8-cores     Size (MB)
Phrase-based   60       47 (79%)    37 (63%)    33 (56%)    893
Hierarchical   1030     677 (65%)   473 (45%)   375 (36%)   8300
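The training times in the table can be turned into speedup and parallel-efficiency figures with a few lines of arithmetic. The sketch below copies the minutes from the table; the names `training_minutes` and `speedup` are hypothetical. The percentages on the slide appear to be each time relative to the single-core time, and recomputing them from the rounded minutes can differ by a point or so.

```python
# Training times in minutes, copied from the table above,
# keyed by the number of cores used.
training_minutes = {
    "Phrase-based": {1: 60, 2: 47, 4: 37, 8: 33},
    "Hierarchical": {1: 1030, 2: 677, 4: 473, 8: 375},
}


def speedup(times, cores):
    """Single-core time divided by the time on `cores` cores."""
    return times[1] / times[cores]


for model, times in training_minutes.items():
    for cores in (2, 4, 8):
        s = speedup(times, cores)
        # Efficiency is the speedup divided by the number of cores used.
        print(f"{model}: {cores} cores, speedup {s:.2f}x, efficiency {s / cores:.0%}")
```

On these numbers, hierarchical training gains more from extra cores than phrase-based training, though neither scales linearly.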


What is Moses?

Common Misconceptions

• Only for Linux → Windows, Linux, Mac
• Difficult to use → Easier compile and install
• Unreliable → Multi-stage testing
• Only phrase-based → Hierarchical, syntax model
• Developed by one person → everyone
• Slow → Fastest decoder, multithreaded training, less IO


Coming up…

• Code cleanup
• Incremental Training
• Better translation
  – smaller model
  – bigger data
  – faster training and decoding
• Applications
  – CAT tools
  – Speech translation


Applications

Computer-Aided Translation

• EU Project
  – CASMACAT
  – MATECAT


What can we do for you?

  – simpler Moses
  – graphical interface
  – Windows compatibility
  – terminology and glossary
  – incremental training
• What can you do for us?
  – code
  – data
  – funding
