TAUS MT SHOWCASE, A Small LSP’s Guide to Commercialized Open Source SMT, Tom Hoar, Precision...

Post on 22-Apr-2015

This presentation is a part of the MosesCore project that encourages the development and usage of open source machine translation tools, notably the Moses statistical MT toolkit. MosesCore is supported by the European Commission Grant Number 288487 under the 7th Framework Programme. For the latest updates, follow us on Twitter - #MosesCore

TAUS MACHINE TRANSLATION SHOWCASE

A Small LSP’s Guide To Commercialized Open Source SMT
From 28 years of corpus exploitation

Tom Hoar, Precision Translation Tools
15:30 – 15:50, Wednesday, 10 April 2013

12 April 2013, © 2012 Precision Translation Tools Co., Ltd.

Agenda

●  Introduction
●  Who is PTTools?
●  Fundamental Assumptions
●  Models and Proportions
●  SMT Statistical Models
●  New Perspective
●  Acknowledgements

Origin of MT?

●  … the problem of translation could conceivably be treated as a problem in cryptography. When I look at an article in Russian, I say “This is really written in English, but it has been coded in some strange symbols. I will now proceed to decode.”

●  March 4, 1947
●  From: Warren Weaver, Mathematician, Rockefeller Foundation
●  To: Norbert Wiener, Professor of Mathematics, MIT

Origin of Pessimism?

●  … as to the problem of mechanical translation, I frankly am afraid the boundaries of words in different languages are too vague and the emotional and international connotations are too extensive to make any quasi mechanical translation scheme very hopeful.

●  April 30, 1947 (57 days later)
●  Norbert Wiener, Professor of Mathematics, MIT

Sharing An Experience

●  ESL/EFL student:
   –  “What does 'wanton' mean?”
●  Teacher:
   –  “Where did you see it?”
   –  “How was it used?”
●  Despite this, students learn that meaning comes from vocabulary, spelling, grammar, syntax

Working With “Meaning”

●  CONTEXT + CONTENT = MEANING
●  Context: the container
   –  i.e. domain, subject, usage, purpose, culture
●  Content: anything in the container
   –  i.e. vocabulary, spelling, grammar, syntax, punctuation, style

The bird swam to its nest.

●  ESL/EFL students: “The meaning is wrong.”
●  Teacher: “Vocabulary, spelling, grammar, syntax, punctuation are all correct. Why is the meaning wrong?”
   –  Students are confused
●  Homework: Fix the meaning without changing the contents.

Context Is Determinative

●  Possible solution:
   –  The bird is a duck – or swan, goose, penguin, cormorant, etc.
●  Lesson?
   –  Change the container – change the meaning
   –  Machines can’t search for a greater context
●  Only humans can
●  How often do we look beyond the obvious?

Disclaimer

●  Speaker does not have a PhD
●  Results from the School of Hard Knocks, Faculty of Scientific Repetition
●  Only affiliation with the Moses team is as a user

Precision Translation Tools

●  Software publisher
   –  Founded in Feb 2010, Bangkok, Thailand
   –  Not a translation services provider
   –  Software, training and support
●  “Do” Machine Translation
●  “Do” Moses Yourself Community Edition (free)
●  Senior managers have over 75 years of experience serving translation professionals and user documentation

Customers

●  Current
   –  ~300 customers/users
   –  30 countries
●  Target
   –  Small & medium LSPs (2-20 persons)
   –  Translators
●  Accomplishments
   –  First Maori – English SMT system
   –  First English – Khmer SMT system

Mission

●  Make statistical machine translation tools available to everyone with
   –  Open source foundation
   –  Simplified usability
   –  User education and training
   –  Autonomous ecosystems
   –  Intellectual property protection

7 Fundamental Assumptions

●  These are essential if SMT is to work.
●  They cannot be proven.
●  They can only be observed through the success or failure of an SMT system.

SMT Assumption 1

●  Most of the time, most authors create content with appropriate
   –  Vocabulary
   –  Spelling
   –  Grammar
   –  Syntax
   –  Punctuation
   –  Style

SMT Assumption 2

●  Most of the time, most translators create translations with appropriate
   –  Vocabulary
   –  Spelling
   –  Grammar
   –  Syntax
   –  Punctuation
   –  Style

SMT Assumption 3

●  In large collections of original content, fragments repeat proportionately to their occurrences in the real world

green birds fly quickly
red birds fly to the nest
white birds swim across the pond
yellow birds eat sunflower seeds
black birds eat yellow corn
white birds swim gracefully
black birds hover over the nest
pink birds stand on one leg
pink birds eat orange shrimp
grey birds stand in the nest

SMT Assumption 4

●  In large collections of translations of original content, the translations mirror the repetitions in the original content

los pájaros verdes vuelan rápidamente
los pájaros rojos vuelan al nido
los pájaros blancos nadan en el estanque
los pájaros amarillos comen semillas de girasol
los pájaros negros comen maíz amarillo
los pájaros blancos nadan con gracia
los pájaros negros se ciernen sobre el nido
los pájaros rosados se aguantan sobre una sola pierna
los pájaros rosados comen camarones naranjas
los pájaros grises están en el nido

SMT Assumptions 5 & 6

●  Repetitions in past “original content” will repeat in future content in the same proportions.

●  Mirrored repetitions in past translations of “original content” will repeat in future content in the same proportions.

SMT Assumption 7

●  “Exceptions” are exceptions because they don't follow normative rules.
   –  If there’s a rule for a so-called exception, it is a rule, not an exception.
   –  “Exceptions” occur less frequently than “norms.” Therefore, they do not significantly impact the proportions or frequency of repetitions in the large collections.

Machine Learning

●  Borrow content from a library
●  Study the content
●  Retain residual knowledge in memory
●  Return the content to the library
●  Organize and optimize the knowledge
●  Recall and use the residual knowledge to predict future events

Statistical Machine Translation

●  Artificial Intelligence

●  Study = Train
●  Memory = Tables
●  Optimize = Tune
●  Predict = Translate

[Diagram: SMT Model Configuration, made up of a Translation Model (Reordering Table, Phrase Table) and a Language Model]

What is a model?

●  A representation of an original that maintains the original’s proportions, likeness, etc.

●  A working model replicates or emulates the functions of the original

●  A statistical model is a working model
   –  Uses statistical data to “do” something
   –  Statistical data = numbers about the past
   –  “Do” something = predict the future

Examples of Statistical Models

●  Financial models predict account balances

●  Weather models predict hurricanes

●  Traffic models predict traffic jams

●  SMT models predict translations

Proportions Matter

●  Barbie
●  Height 6'0"
●  Weight 100 lbs.
●  Size 4
●  39" x 21" x 33"
●  Distorted likeness
●  >15% of segments in EuroParl are parliamentary protocol

SMT Statistical Model

1. Make SMT model from “original content”

2. Use SMT model to translate new content (predict translations) without the “original content”

[Diagram: SMT Model Configuration, made up of a Translation Model (Reordering Table, Phrase Table) and a Language Model]

Train Translation Model

●  domt train-tm train-model.perl

●  Count frequencies of sentence fragment pairs

●  One or more tables
   –  Can reach 15 GB each

Original Content
los pájaros verdes vuelan rápidamente | green birds fly quickly
los pájaros rojos vuelan al nido | red birds fly to the nest
los pájaros blancos nadan en el estanque | white birds swim across the pond
los pájaros amarillos comen semillas de girasol | yellow birds eat sunflower seeds
los pájaros negros comen maíz amarillo | black birds eat yellow corn
los pájaros blancos nadan con gracia | white birds swim gracefully
los pájaros negros se ciernen sobre el nido | black birds hover over the nest
los pájaros rosados se aguantan sobre una sola pierna | pink birds stand on one leg
los pájaros rosados comen camarones naranjas | pink birds eat orange shrimp
los pájaros grises están en el nido | grey birds stand in the nest

Phrase Table
Source language (stimulus) | Target language (response) | Probability
los pájaros | birds | 50%
los | birds | 50%
negros | black | 50%
pájaros negros | black | 50%
los pájaros negros | black birds | 100%
los pájaros negros comen | black birds eat | 100%
los pájaros negros comen maíz | black birds eat yellow | 100%
los pájaros negros comen maíz amarillo | black birds eat yellow corn | 100%
pájaros verdes | green | 50%
verdes | green | 50%
los pájaros verdes | green birds | 100%
los pájaros verdes vuelan | green birds fly | 100%
los pájaros verdes vuelan rápidamente | green birds fly quickly | 100%
grises | grey | 50%
pájaros grises | grey | 50%
los pájaros grises | grey birds | 100%
los pájaros grises están | grey birds stand | 100%
los pájaros grises están en | grey birds stand in | 100%
los pájaros grises están en el | grey birds stand in the | 100%
los pájaros grises están en el nido | grey birds stand in the nest | 100%
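
The counting step on this slide can be sketched in a few lines of Python. This is only an illustration and not what train-model.perl does internally: it assumes the fragment pairs have already been extracted (the `pairs` list below is hypothetical, as is the function name `relative_frequencies`) and simply estimates P(target | source) by relative frequency.

```python
from collections import Counter, defaultdict


def relative_frequencies(phrase_pairs):
    """Estimate P(target | source) from a list of (source, target) fragment pairs."""
    pair_counts = Counter(phrase_pairs)
    source_counts = Counter(src for src, _ in phrase_pairs)
    table = defaultdict(dict)
    for (src, tgt), n in pair_counts.items():
        # Share of this target among all targets seen for the same source.
        table[src][tgt] = n / source_counts[src]
    return {src: dict(tgts) for src, tgts in table.items()}


# Hypothetical, hand-written fragment pairs (not the output of a real aligner).
pairs = [
    ("los pájaros negros", "black birds"),
    ("los pájaros negros", "black birds"),
    ("pájaros negros", "black birds"),
    ("pájaros negros", "black"),
    ("nido", "nest"),
]

table = relative_frequencies(pairs)
for src, targets in table.items():
    for tgt, p in targets.items():
        print(f"{src} | {tgt} | {p:.0%}")
```

A source fragment that was always paired with the same target gets 100%, while one that was paired with two different targets equally often gets 50% for each, which is the pattern the phrase table on the slide shows.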

Train Language Model

●  domt train-lm build-lm.sh

●  Count frequencies of sentence fragments in target language

●  One or more tables
   –  Can reach 25 GB each

Language Model
\2-grams:
-1.30713 <s> green
-0.265492 green birds
-0.850518 birds fly
-0.677087 birds eat
\3-grams:
-0.112767 <s> green birds
-0.421503 birds fly quickly
-0.592076 birds eat yellow
\4-grams:
-0.10498 <s> green birds fly
-0.0527335 birds fly quickly </s>
-0.0570311 birds eat orange shrimp
\5-grams:
-0.0732878 <s> green birds fly quickly
-0.0274306 birds fly to the nest
-0.0474597 birds swim across the pond
-0.0255669 birds eat yellow corn </s>

Target Content green birds fly quickly

red birds fly to the nest

white birds swim across the pond

yellow birds eat sunflower seeds

black birds eat yellow corn

white birds swim gracefully

black birds hover over the nest

pink birds stand on one leg

pink birds eat orange shrimp

grey birds stand in the nest
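
Counting target-language fragments, as this slide describes, can be sketched as follows. This is a simplified illustration under stated assumptions: the helper names `ngram_counts` and `log10_prob` are made up for the example, and the probabilities are plain unsmoothed relative frequencies. Real language model toolkits apply smoothing and back-off, so the numbers will not match the excerpt shown on the slide.

```python
import math
from collections import Counter

TARGET_CONTENT = [
    "green birds fly quickly",
    "red birds fly to the nest",
    "white birds swim across the pond",
    "yellow birds eat sunflower seeds",
    "black birds eat yellow corn",
    "white birds swim gracefully",
    "black birds hover over the nest",
    "pink birds stand on one leg",
    "pink birds eat orange shrimp",
    "grey birds stand in the nest",
]


def ngram_counts(sentences, n):
    """Count every n-gram, padding each sentence with <s> and </s> markers."""
    counts = Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def log10_prob(ngram, sentences):
    """Unsmoothed log10 P(last word | preceding words); needs n >= 2 and a seen n-gram."""
    n = len(ngram)
    numerator = ngram_counts(sentences, n)[tuple(ngram)]
    denominator = ngram_counts(sentences, n - 1)[tuple(ngram[:-1])]
    return math.log10(numerator / denominator)


bigrams = ngram_counts(TARGET_CONTENT, 2)
print(bigrams[("birds", "eat")])  # how often "birds eat" occurs
print(round(log10_prob(("birds", "eat"), TARGET_CONTENT), 4))
```

In this toy corpus "birds eat" occurs in three of the ten sentences, so its unsmoothed probability after "birds" is 0.3.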

Tune SMT Model

[ttable-file]
0 0 5 ${path}/phrase-table.gz

[distortion-file]
0-0 msd-bidirectional-fe 6 ${path}/reordering-table.gz

[lmodel-file]
0 0 3 ${path}/irstlm_arpa.en.gz

[weight-t]
0.169891
0.0856206
-0.0664389
0.0489578
0.0018491

[ttable-limit]
20

domt train-mert mert-moses.pl

Creates optimal settings for the components to work together

Configuration file defines paths to files and stores optimal settings
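
To make the role of the configuration file concrete, here is a minimal sketch that reads the bracketed sections shown on this slide into a dictionary. It is not Moses's own loader: `parse_ini_sections` is a hypothetical helper, and the sample text only repeats the values from the slide, with `${path}` left as the unexpanded placeholder it is there.

```python
def parse_ini_sections(text):
    """Group the non-empty lines that follow each [section] header."""
    sections = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if line.startswith("[") and line.endswith("]"):
            current = line[1:-1]
            sections[current] = []
        elif current is not None:
            sections[current].append(line)
    return sections


SAMPLE = """\
[ttable-file]
0 0 5 ${path}/phrase-table.gz

[distortion-file]
0-0 msd-bidirectional-fe 6 ${path}/reordering-table.gz

[lmodel-file]
0 0 3 ${path}/irstlm_arpa.en.gz

[weight-t]
0.169891
0.0856206
-0.0664389
0.0489578
0.0018491

[ttable-limit]
20
"""

sections = parse_ini_sections(SAMPLE)
# Tuned weights for the translation table; accepts one value per line or several.
weights = [float(v) for line in sections["weight-t"] for v in line.split()]
print(sections["ttable-file"])
print(weights)
```

The file paths are what the slide means by "defines paths to files", and the numeric weights are the "optimal settings" that tuning stores.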

SMT Model In Use

●  Step 1 domt translate moses -f config

Translation model creates thousands of possible sentences

[Diagram: Translation Model (Reordering Table, Phrase Table)]

los pájaros negros nadan con gracia

1 green birds swim gracefully
2 red birds swim gracefully
3 black birds swim gracefully
4 yellow birds swim gracefully
5 birds yellow fly green corn
6 red corn eats white pond
...
10,000 pink birds swim gracefully

SMT Model In Use

●  Step 2 Language model scores each possible sentence

[Diagram: Language Model]

     1  green birds swim gracefully    0.38
     2  red birds swim gracefully      0.32
     3  black birds swim gracefully    0.84
     4  yellow birds swim gracefully   0.74
     5  birds yellow fly green corn    0.07
     6  red corn eats white pond       0.02
     …  …
10,000  pink birds swim gracefully     0.57

SMT Model In Use

●  Step 3 The highest score is most probable and selected as the translation

black birds swim gracefully

3 black birds swim gracefully 0.84
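
The three steps above can be sketched as a generate, score and select loop. The candidates and scores are the illustrative ones from the slides; the variable names are assumptions for this example, and the sketch mirrors the simplified picture in this talk rather than the internals of the Moses decoder.

```python
# Steps 1 and 2: candidate sentences with the language model scores from the slides.
candidates = {
    "green birds swim gracefully": 0.38,
    "red birds swim gracefully": 0.32,
    "black birds swim gracefully": 0.84,
    "yellow birds swim gracefully": 0.74,
    "birds yellow fly green corn": 0.07,
    "red corn eats white pond": 0.02,
    "pink birds swim gracefully": 0.57,
}

# Step 3: the highest-scoring candidate is selected as the translation.
best = max(candidates, key=candidates.get)

# Full ranking, highest score first, for inspection.
ranked = sorted(candidates.items(), key=lambda item: item[1], reverse=True)

print(best)
for sentence, score in ranked[:3]:
    print(f"{score:.2f}  {sentence}")
```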

Is This Familiar?

●  You have a difficult sentence to translate
●  Despite your training and skills, you create 4 or 5 possible translations with different words and word orders.
●  You struggle
   –  Which one is “right?”
   –  Which is the “best?”
●  You have to pick one or you don't get paid.

What Drives You?

●  How do you make your decision when all these things are equally “right”?
   –  Meaning
   –  Grammar
   –  Syntax
   –  Etc.
●  You have to pick one or you don't get paid.

Feeling and Familiarity

●  The one that feels familiar
   –  Familiarity comes from frequency
●  SMT emulates this process
   –  SMT can generate 10,000-20,000 possibilities. Computers are good at that; people aren’t.
   –  SMT calculates the probabilities for each one. Computers aren’t good at feelings.

Stimulus

●  “los pájaros negros nadan con gracia”
●  English possibilities generated
   –  green birds swim gracefully
   –  red birds swim gracefully
   –  black birds swim gracefully
   –  yellow birds swim gracefully
   –  pink birds swim gracefully

Human Response

●  “black birds swim gracefully”
   –  I’m familiar with swans as black birds that swim gracefully.
   –  I’m familiar with yellow and pink birds that swim, but they don’t swim gracefully.
   –  I’m not familiar with green or red birds that swim at all.

SMT Response

●  “black birds swim gracefully”
   –  All tokens are familiar because they’re in the tables.
   –  The fragment “black birds swim” is the most familiar because it occurs most frequently; therefore it scores highest.
   –  The sentence scored highest because its fragments are in the language model more frequently.

Initial Challenges

●  Requires millions of pairs
●  Requires expensive, powerful hardware
●  Lacks trained user base
●  Faces hostile target users
●  Faces criticism from experts
●  Lacks professional features

Market Response

●  Private SaaS Portals
   –  Asia Online
   –  SDL [1]
   –  Safaba
   –  LetsMT!
   –  Tauyou
   –  Firma8
   –  KantanMT
   –  SmartMATE
   –  Straker Translations
   –  Cloudwords
   –  AVB Translations
   –  Lingo24
   –  MemSource
   –  Translated.net
   –  Trusted Translations
   –  XTM International

●  Integrators & Consultants
   –  CrossLang
   –  Digital Silk Road
   –  PangeaMT
   –  Asia Online
   –  Safaba
   –  SDL [1]
   –  IBM
   –  Systran [2]
   –  LexWorks [2]
   –  Prompsit Language Engineering [3]

●  Software Publishers
   –  Systran [2]
   –  ProMT [3]
   –  Precision Translation Tools

Notes:
[1] LanguageWeaver, not Open Source
[2] SYSTRAN Server, RbMT with Moses
[3] RbMT & SMT options

Learned Challenges

●  Customizing models requires possession and control of TMs
   –  Users don't entrust TMs to portals
   –  Perception they're subsidizing competitors
●  Portals must continuously create models
   –  Overhead for each new model
   –  No portal has talent for every language
   –  Revert to customer's talents

Updated Challenges

●  Requires millions of pairs
●  Requires expensive, powerful hardware
●  Lacks trained user bases
●  Faces hostile, untrained target users
●  Faces criticism from experts
●  Lacks professional features
●  “Trusted 3rd parties” don't exist
●  Continual need for new models

Productivity As Quality

●  Customers want quality
   –  Can't define it for computers to test for it

●  All automated quality scoring systems require human reference translations

●  100% match = raw SMT is identical to independent human translations, not post-edited translations
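
The "100% match" measure defined above can be computed with a very small script. This is a sketch under stated assumptions: the helper names `normalize` and `exact_match_rate` are made up, whitespace is normalized before comparing (a choice made here, not something the talk specifies), and the sample sentences are hypothetical.

```python
def normalize(sentence):
    """Collapse runs of whitespace so spacing differences don't block a match."""
    return " ".join(sentence.split())


def exact_match_rate(mt_outputs, references):
    """Share of raw MT segments identical to independent human translations."""
    if len(mt_outputs) != len(references):
        raise ValueError("need one reference per MT output")
    if not mt_outputs:
        return 0.0
    matches = sum(
        normalize(mt) == normalize(ref)
        for mt, ref in zip(mt_outputs, references)
    )
    return matches / len(mt_outputs)


# Hypothetical raw SMT output and independent (not post-edited) references.
mt_outputs = [
    "black birds swim gracefully",
    "red birds fly to nest",
    "pink  birds eat orange shrimp",
    "grey birds stand on the nest",
]
references = [
    "black birds swim gracefully",
    "red birds fly to the nest",
    "pink birds eat orange shrimp",
    "grey birds stand in the nest",
]

rate = exact_match_rate(mt_outputs, references)
print(f"{rate:.0%} of segments are 100% matches")
```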

2012 Serendipitous Discovery

●  Don't need millions of sentence pairs within a constrained domain

●  PTTools customers with 130K to 300K segments achieve 100% matches on 20-40% of SMT output

●  LetsMT! reports that similar corpus sizes produce 20% productivity gains

●  Tauyou reports that as few as 50K segments result in customer satisfaction with productivity gains

Productivity As Quality

●  Where does productivity begin?
●  How many 100% matches make productivity gain inevitable?

Quality vs. Productivity

System | <100% Match (Post-editing) | 100% Match (Productivity) | Annual TCO | Preparation Time
RbMT | 90 – 95% | 5% – 10% | $150,000 | 2 – 3 weeks
SMT Pre 2007 | > 99% | < 1% | $10,000 | 1 – 3 weeks
SMT 2007 to 2008 | > 99% | < 1% | $6,000 | 5 – 14 days
SMT 2009 to 2011 | 90% – 95% | 5% – 10% | $1,500 | 2 – 7 days
SMT 2012 | *60% – 80% | *20% – 40% | $1,200 | 6 – 48 hours

* actual customer experience

Adjusted Challenges

●  Requires millions of pairs → 150,000 to 300,000 can work fine
●  Requires expensive, powerful hardware → Less than professional graphic arts
●  Lacks trained user bases → Professionals pay for training courses
●  Faces hostile, untrained target users → Attitudes are proportionate to benefits
●  Faces criticism from experts → Early experts liquidate
●  Lacks professional features → New versions add new features
●  “Trusted 3rd parties” don't exist → “Trusted 3rd parties” don't exist
●  Continual need for new models → Continual need for new models

Market Response Revisited

●  Portals, Full Service, Experts
   –  Perpetuate perception of complexity
   –  Control models created with free technology
   –  Protect investments
●  If juke boxes and radio stations preceded phonographs, what would today’s music industry sell?
   –  (a) CDs
   –  (b) pay-per-play MP3s and digital radio?

Acknowledgements

●  Precision Translation Tools
●  Prompsit Language Engineering
●  Tauyou
●  Safaba Translation Solutions
●  LetsMT! by Tilde
●  Digital Silk Road
●  PangeaMT by Pangeanic
●  CrossLang
●  KantanMT
●  Lingo24

DoMT®