Lucia Specia - Quality estimation in MT

Post on 18-Nov-2014

Transcript of Lucia Specia - Quality estimation in MT

Quality of Machine Translation Quality Estimation Open issues Conclusions

Estimativa da qualidade da tradução automática
(Quality estimation of machine translation)

Lucia Specia

University of Sheffield, l.specia@sheffield.ac.uk

Faculdade de Letras da Universidade do Porto, 13 May 2013

Estimativa da qualidade da traducao automatica 1 / 31

Outline

1 Quality of Machine Translation

2 Quality Estimation

3 Open issues

4 Conclusions

Introduction

Machine Translation:

Around since the early 1950s

Increasingly popular since the 1990s: statistical approaches

Software tools and data available to build translation systems - Moses and others

Increasing demand for cheaper and faster translations

How do we measure quality and progress over time?

So far... mostly automatic evaluation metrics

MT evaluation metrics

N-gram matching between system output and one or more reference translations: BLEU and many others

Issue 1: Too many possible good-quality translations; thousands of references would be needed to capture valid variations

Solution: HyTER (Language Weaver) annotation tool to generate all possible correct translations! [DM12]

Translations built bottom-up from word/phrase translation equivalents using FSAs
2-2.5 hours of expert annotation per sentence
One annotator: 5.2 × 10^6 paths
A bunch of annotators: 8.5 × 10^11 paths
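The n-gram matching idea behind BLEU and similar metrics can be sketched as a clipped n-gram precision (a minimal sketch only: the full BLEU metric adds a brevity penalty and a geometric mean over n-gram orders):

```python
from collections import Counter

def ngrams(tokens, n):
    """All n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(system, references, n):
    """BLEU-style clipped n-gram precision of a system output
    against one or more reference translations."""
    sys_counts = Counter(ngrams(system, n))
    # For each n-gram, the maximum count observed in any reference.
    max_ref = Counter()
    for ref in references:
        for g, c in Counter(ngrams(ref, n)).items():
            max_ref[g] = max(max_ref[g], c)
    clipped = sum(min(c, max_ref[g]) for g, c in sys_counts.items())
    total = sum(sys_counts.values())
    return clipped / total if total else 0.0

sys_out = "Do buy this product".split()
refs = [["Do", "not", "buy", "this", "product"]]
print(modified_precision(sys_out, refs, 1))  # 1.0: every unigram matches
print(modified_precision(sys_out, refs, 2))  # only 2 of 3 bigrams match
```

Note how the dropped "not" barely dents the score, which is exactly Issue 2 below: n-gram counts cannot see how severe a mismatch is.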

MT evaluation metrics

Issue 2: Difficult to quantify the severity of mismatching n-grams

ref: Do not buy this product, it’s their craziest invention!
sys: Do buy this product, it’s their craziest invention!

Some attempts to weight mismatches differently - sparse, lexicalised approaches

However, the same error is more or less important depending on the user or purpose:

Severe if the end-user does not speak the source language
Trivial to post-edit by translators

MT evaluation metrics

Conversely:

ref: The battery lasts 6 hours and it can be fully recharged in 30 minutes.

sys: Six-hours battery, 30 minutes to full charge last.

OK for gisting - meaning preserved
Very costly for post-editing if style is to be preserved

Task-based evaluation

Measure translation quality within a task. E.g. Autodesk - productivity test through post-editing [Aut11]

2-day translation and post-editing, 37 participants
In-house Moses (Autodesk data: software)
Time spent on each segment

Task-based evaluation

E.g.: Intel - User satisfaction with un-edited MT

Translation is good if the customer can solve their problem

MT for Customer Support websites [Int10]

Overall customer satisfaction: 75% for English→Chinese
95% reduction in cost
Project cycle from 10 days to 1 day
From 300 to 60,000 words translated/hour
Customers in China using MT texts were more satisfied with support than natives using original texts (68%)!

MT for chat and community forums [Int12]

∼60% “understandable and actionable” (→English/Spanish)
Max ∼10% “not understandable” (→Chinese)

Outline

1 Quality of Machine Translation

2 Quality Estimation

3 Open issues

4 Conclusions

Overview

Metrics either depend on references or on post-editing/use of translations (task-based)

Our proposal

Quality assessment without references, prior to post-editing/use of translations

Overview

Why don’t translators use (more) MT?

Translations are not good enough!
What about TMs? Aren’t fuzzy matches useful?

Framework

Quality estimation (QE): provide an estimate of quality for new translated text *before* it is post-edited

Quality = post-editing effort

No access to reference translations: machine learning techniques to predict post-editing effort scores

Considers interaction with TM systems: only used for low fuzzy-match cases, or to select between TM and MT

QTLaunchPad project

Multidimensional Quality Metrics for MT and HT, for manual and (semi-)automatic evaluation (QE): http://www.qt21.eu/launchpad/
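The framework above amounts to supervised prediction: quality indicators extracted from source/translation pairs are mapped to post-editing effort scores learned from annotated examples. A minimal pure-Python sketch, with a k-nearest-neighbour regressor standing in for the wide range of learning algorithms used in practice (all feature values and labels invented for illustration):

```python
# Hypothetical QE sketch: predict a 1-5 post-editing effort score from
# quality indicators, with no reference translation needed at test time.

def knn_predict(X_train, y_train, x, k=2):
    """Average effort score of the k training points closest to x."""
    dist = lambda a, b: sum((u - v) ** 2 for u, v in zip(a, b)) ** 0.5
    ranked = sorted(zip(X_train, y_train), key=lambda p: dist(p[0], x))
    return sum(y for _, y in ranked[:k]) / k

# Indicators per source/translation pair: [source length, target length,
# target LM log-probability]; labels: 1-5 post-editing effort scores.
X = [[10, 11, -20.5], [25, 19, -80.1], [8, 8, -15.0], [30, 33, -95.2]]
y = [4.0, 2.0, 4.5, 1.5]

# Predict effort for a new, unseen translation.
print(knn_predict(X, y, [12, 13, -25.0]))  # 4.25: resembles the "easy" pairs
```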

Framework

[Diagram: Source text → MT system → Translation → QE system → Quality score. The QE system is trained on examples (source & translations with quality scores) and uses quality indicators.]

Examples of positive results

Time to post-edit the subset of sentences predicted as “good” (low effort) vs time to post-edit a random subset of sentences:

Language  no QE           QE
fr-en     0.75 words/sec  1.09 words/sec
en-es     0.32 words/sec  0.57 words/sec

Accuracy in selecting the best translation among 4 MT systems:

Best MT system  Highest QE score
54%             77%

State-of-the-art

Quality indicators:

Complexity indicators (source text)
Fluency indicators (translation)
Confidence indicators (MT system)
Adequacy indicators (source-translation)

Learning algorithms: wide range

Datasets: few with absolute human scores (1-4/5 scores, PE time, edit distance)

Outline

1 Quality of Machine Translation

2 Quality Estimation

3 Open issues

4 Conclusions

State-of-the-art indicators

Shallow indicators:
(S/T/S-T) Sentence length
(S/T) Language model
(S/T) Type-token ratio
(S) Average number of possible translations per word
(S) % of n-grams belonging to different frequency quartiles of a source language corpus
(T) Untranslated/OOV words
(T) Mismatching brackets, quotation marks
(S-T) Preservation of punctuation
(S-T) Word alignment score, etc.

These do well for estimating post-editing effort...

...but are not enough for other aspects of quality, e.g. adequacy
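A few of the shallow indicators listed above, sketched in pure Python (the feature names and the toy sentence pair are illustrative, not from any QE toolkit):

```python
# (S) = extracted from the source, (T) = from the translation,
# (S-T) = contrastive source-translation indicators.

def shallow_indicators(source, target):
    s_tok, t_tok = source.split(), target.split()
    return {
        "s_length": len(s_tok),                        # (S) sentence length
        "t_length": len(t_tok),                        # (T) sentence length
        "s_type_token": len(set(s_tok)) / len(s_tok),  # (S) type-token ratio
        "t_type_token": len(set(t_tok)) / len(t_tok),  # (T) type-token ratio
        "punct_preserved": source.count(",") == target.count(","),  # (S-T)
        "brackets_match": target.count("(") == target.count(")"),   # (T)
    }

feats = shallow_indicators(
    "The battery lasts 6 hours , it can be fully recharged .",
    "Six-hours battery , 30 minutes to full charge last .",
)
print(feats["s_length"], feats["punct_preserved"])
```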

State-of-the-art indicators

Linguistic indicators - count-based:

(S/T/S-T) Content/non-content words

(S/T/S-T) Nouns/verbs/... NP/VP/...

(S/T/S-T) Deictics (references)

(S/T/S-T) Discourse markers (references)

(S/T/S-T) Named entities

(S/T/S-T) Zero-subjects

(S/T/S-T) Pronominal subjects

(S/T/S-T) Negation indicators

(T) Subject-verb / adjective-noun agreement

(T) Language Model of POS

(T) Grammar checking (dangling words)

(T) Coherence

State-of-the-art indicators

Linguistic indicators - alignment-based:

(S-T) Correct translation of pronouns

(S-T) Matching of dependency relations

(S-T) Matching of named entities

(S-T) Alignment of parse trees

(S-T) Alignment of predicates & arguments, etc.

Some indicators are language-dependent; others need resources that are language-dependent but apply to most languages, e.g. an LM of POS tags

State-of-the-art indicators

Fine-grained, lexicalised indicators:

target-word = “process”: 1 if source-word = “hdhh alamlyt”, 0 otherwise

target-word = “process”: 1 if source-pos = “DT DTNN”, 0 otherwise

Closer to error detection

Need large amounts of training data [BHAO11], or rule-based approaches
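The two binary indicator functions above can be written out directly; the helper name and its exact triggering logic are illustrative assumptions, not the cited systems' implementation:

```python
def lexicalised_feature(target_word, source_side, trigger, word="process"):
    """1 if the target word is `word` and the trigger string appears on
    the given source side (source words or source POS tags), else 0."""
    return 1 if target_word == word and trigger in source_side else 0

# target-word = "process" with source-word trigger "hdhh alamlyt": fires
print(lexicalised_feature("process", "hdhh alamlyt qal", "hdhh alamlyt"))  # 1
# target-word = "process" with source-pos trigger "DT DTNN": does not fire
print(lexicalised_feature("process", "NN VB", "DT DTNN"))                  # 0
```

Because each feature is tied to specific word or POS strings, the feature space is huge and sparse, which is why these indicators need large amounts of training data.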

Do these indicators work?

To some extent... Issues:

Representation of shallow/deep indicators: counts, ratios, (absolute) differences?

F = S − T,  F = |S − T|,  F = T/S,  F = (S − T)/S ...

Resources to extract deep indicators: availability and reliability

Data to extract fine-grained indicators: need previously translated and post-edited data, esp. for negative examples
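The representation alternatives can be computed for any source/target indicator pair, e.g. source and target sentence lengths S and T (the function name is illustrative):

```python
def representations(S, T):
    """Alternative feature representations of one indicator pair."""
    return {
        "diff": S - T,            # F = S - T
        "abs_diff": abs(S - T),   # F = |S - T|
        "ratio": T / S,           # F = T / S
        "rel_diff": (S - T) / S,  # F = (S - T) / S
    }

print(representations(10, 8))
# {'diff': 2, 'abs_diff': 2, 'ratio': 0.8, 'rel_diff': 0.2}
```

Which representation works best is an empirical question: ratios and relative differences are scale-free, while raw differences keep sentence-length information.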

Manual scoring: agreement between translators

Absolute value judgements: difficult to achieve consistency across annotators even in a highly controlled setup

en-es news WMT12 dataset: 3 professional translators, 1-5 scores

15% of the initial dataset discarded: annotators disagreed by more than one category
Remaining annotations had to be scaled (0.33, 0.17, 0.50)

Manual scoring: Agreement between translators

en-pt subtitles of TV series: 3 non-professional annotators, 1-4 scores

351 cases (41%): full agreement
445 cases (52%): partial agreement
54 cases (7%): null agreement

Agreement by score:

Score  Full agreement
4      59%
3      35%
2      23%
1      50%
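The full/partial/null agreement categories above can be counted per segment as follows (a sketch with invented scores; the study's actual scoring pipeline is not described on the slide):

```python
from collections import Counter

def agreement(scores):
    """Classify one segment's 3 annotator scores."""
    distinct = len(set(scores))
    if distinct == 1:
        return "full"     # all three annotators agree
    if distinct == 2:
        return "partial"  # exactly two agree
    return "null"         # all three differ

segments = [(4, 4, 4), (3, 3, 2), (1, 2, 3), (4, 4, 3)]
print(Counter(agreement(s) for s in segments))
```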

More objective ways of annotating translations

HTER: Edit distance between MT output and its minimally post-edited version

HTER = #edits / #words in post-edited version

Edits: substitute, delete, insert, shift

Analysis by Maarit Koponen (WMT-12) on post-edited translations with HTER and 1-5 scores

A number of cases where translations with low HTER (few edits) were assigned low quality scores (high post-editing effort), and vice-versa
Certain edits seem to require more cognitive effort than others - not captured by HTER
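A simplified HTER sketch: word-level Levenshtein distance between the MT output and its post-edited version, divided by the length of the post-edited version. Real HTER uses TER, which also counts block shifts as single edits; this approximation ignores shifts.

```python
def hter(mt, postedited):
    a, b = mt.split(), postedited.split()
    # Standard dynamic-programming edit distance over words
    # (substitute / insert / delete only).
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # delete
                          d[i][j - 1] + 1,         # insert
                          d[i - 1][j - 1] + cost)  # substitute / match
    return d[len(a)][len(b)] / len(b)

# 1 insertion over 5 post-edited words = 0.2
print(hter("Do buy this product", "Do not buy this product"))
```

As Koponen's analysis shows, a low value from this kind of count is not the same as low cognitive effort: one small edit can require rethinking the whole sentence.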

More objective ways of annotating translations

TIME: varies considerably across translators (expected)

[Chart: post-editing time per segment for 20 segments and 8 annotators (A1-A8), shown both in seconds and in seconds/word]

Can we normalise this variation?

A dedicated QE system for each translator?
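One possible answer to the normalisation question, sketched under assumptions (this is not a method proposed on the slide): convert each annotator's times to seconds/word, then z-score them within annotator, so that relative difficulty becomes comparable across fast and slow post-editors.

```python
def zscores(times):
    """Z-score a list of per-segment times within one annotator."""
    mean = sum(times) / len(times)
    var = sum((t - mean) ** 2 for t in times) / len(times)
    return [(t - mean) / var ** 0.5 for t in times]

a1 = [2.0, 4.0, 6.0]  # a slow annotator, seconds/word per segment
a2 = [0.5, 1.0, 1.5]  # a fast annotator, same segments
print(zscores(a1))
print(zscores(a2))  # identical z-scores: same relative difficulty ranking
```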

More objective ways of annotating translations

Time, HTER, Keystrokes: data from 8 post-editors

PET: http://pers-www.wlv.ac.uk/~in1676/pet/

How to use estimated PE effort scores?

Should (supposedly) bad quality translations be filteredout or shown to translators (different scores/colourcodes as in TMs)?

Wasting time to read scores and translations vs wasting“gisting” information

How to define a threshold on the estimated translationquality to decide what should be filtered out?

Translator dependent

Task dependent (SDL)

Do translators prefer detailed estimates (sub-sentence level) or an overall estimate for the complete sentence?

Too much information vs hard-to-interpret scores
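One possible answer to the threshold question, borrowing the colour-code idea from TM fuzzy-match bands: map each estimated score to an action. The function and both thresholds below are purely illustrative; as the slide notes, in practice they would be translator- and task-dependent:

```python
def triage(score, discard_below=0.3, review_below=0.7):
    """Map an estimated quality score in [0, 1] to a TM-style band.
    The thresholds are hypothetical and would need tuning per
    translator and per task."""
    if score < discard_below:
        return "discard"       # translate from scratch instead
    if score < review_below:
        return "post-edit"     # show the MT, flagged for heavier editing
    return "light-review"      # near-publishable, quick check only
```

For gisting, where no human edits the output, only the lower threshold matters: everything above it is shown, everything below is suppressed.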


Outline

1 Quality of Machine Translation

2 Quality Estimation

3 Open issues

4 Conclusions


Conclusions

It is possible to estimate at least certain aspects of MT quality, esp. wrt PE effort: QuEst http://quest.dcs.shef.ac.uk/

PE effort estimates can be used in real applications

Ranking translations: filter out bad quality translations

Selecting translations from multiple MT systems

Commercial products by SDL (document-level for gisting)and Multilizer

A number of open issues to be investigated...

Collaboration with “human translators” essential

My vision

Sub-sentence level QE (error detection), highlighting errors but also giving an overall estimate for the sentence
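One of the applications listed above, selecting translations from multiple MT systems, reduces to an argmax over the estimated scores. A sketch with hypothetical system names and QE scores:

```python
def select_best(candidates):
    """Given (system, translation, estimated_quality) triples produced for
    one source sentence, return the one the QE model scores highest."""
    return max(candidates, key=lambda c: c[2])


# Hypothetical QE scores for three systems translating the same sentence
candidates = [("sys-A", "translation a", 0.62),
              ("sys-B", "translation b", 0.81),
              ("sys-C", "translation c", 0.47)]
best_system, best_translation, best_score = select_best(candidates)
```

Note that this only needs the scores to rank candidates consistently, not to be accurate in absolute terms, which is a weaker requirement on the QE model than the filtering use case.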


