Welocalize EAMT 2014 Presentation Assumptions, Expectations and Outliers in Post-Editing

Assumptions, Expectations and Outliers in Post-Editing

Lena Marg, Laura Casanellas
Language Tools Team

@ EAMT Summit Dubrovnik, Croatia, June 2014

Background on MT Programs @

MT programs vary with regard to:

- Scope
- Locales
- Maturity
- System Setup & Ownership
- MT Solution used
- Key Objective of using MT
- Final Quality Requirements
- Source Content

MT Quality Evaluation @

1. Automatic Scores
- Provided by the MT system (typically BLEU)
- Provided by our internal scoring tool, weScore (range of metrics)

2. Human Evaluation
- Adequacy, scores 1-5
- Fluency, scores 1-5

3. Productivity Tests
- Post-Editing versus Human Translation in iOmegaT, validated through final Quality Assessments

The Database

Objective: Establish correlations between these 3 evaluation approaches to
- draw conclusions on predicting productivity gains in advance
- see how & when to use the different metrics best

Contents:
- Content Type
- Language Pair (English into XX)
- MT engine provider & owner (i.e. who owns training & maintenance)
- Metrics (BLEU & PE Distance, Adequacy & Fluency, Productivity deltas)
- MT error analysis
- Final QA scores
- Level of experience of resource doing productivity test
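The slides do not define how PE Distance or the productivity delta were calculated. As a rough illustration only, the sketch below assumes PE Distance is a word-level edit distance between raw MT and the post-edited segment, normalised by the longer of the two, and that the productivity delta is the relative change in words per hour between post-editing and human translation. The function names and both definitions are assumptions, not the presenters' formulas.

```python
def levenshtein(a, b):
    """Edit distance between two token sequences (insert, delete, substitute cost 1)."""
    prev = list(range(len(b) + 1))
    for i, tok_a in enumerate(a, start=1):
        curr = [i]
        for j, tok_b in enumerate(b, start=1):
            cost = 0 if tok_a == tok_b else 1
            curr.append(min(prev[j] + 1,          # delete tok_a
                            curr[j - 1] + 1,      # insert tok_b
                            prev[j - 1] + cost))  # keep or substitute
        prev = curr
    return prev[-1]


def pe_distance(raw_mt, post_edited):
    """Assumed definition: word-level edits divided by the longer segment length."""
    mt_tokens, pe_tokens = raw_mt.split(), post_edited.split()
    longest = max(len(mt_tokens), len(pe_tokens))
    if longest == 0:
        return 0.0
    return levenshtein(mt_tokens, pe_tokens) / longest


def productivity_delta(words_per_hour_pe, words_per_hour_ht):
    """Assumed definition: percentage throughput change of post-editing over translation."""
    return (words_per_hour_pe - words_per_hour_ht) / words_per_hour_ht * 100.0


print(productivity_delta(750, 500))  # 50.0
```

Under these assumptions, an untouched MT segment has a PE distance of 0.0, and a post-editor who is slower than when translating gets a negative delta.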

Data from 2013

The Database: Data Used

27 locales in total, with varying amounts of available data

5 different MT systems (SMT / Hybrid

Assumptions: Results

General assumptions around best performing languages and content types were confirmed.

Assumptions: Results, II

Interesting results around the correlation between productivity gained when translating and post-editing:

Not all the resources improve equally (or at all) when changing activities from translation to post-editing.

Correlation Results: Summary

Pearson's r | Variables | Strength of Correlation | Tests (N) | Locales | Statistical Significance (p value <)
0.82 | Adequacy & Fluency | Very strong positive relationship | 182 | 22 | 0.0001
0.77 | Adequacy & P Delta | Very strong positive relationship | 23 | 9 | 0.0001
0.71 | Fluency & P Delta | Very strong positive relationship | 23 | 9 | 0.00015
0.55 | Cognitive Effort Rank & PE Distance | Strong positive relationship | 16 | 10 | 0.027
0.41 | Fluency & BLEU | Strong positive relationship | 146 | 22 | 0.0001
0.26 | Adequacy & BLEU | Weak positive relationship | 146 | 22 | 0.0015
0.24 | BLEU & P Delta | Weak positive relationship | 106 | 26 | 0.012
0.13 | Numbers of Errors & PE Distance | No or negligible relationship | 16 | 10 | ns
-0.30 | Predominant Error & BLEU | Moderate negative relationship | 63 | 13 | 0.017
-0.32 | Cognitive Effort Rank & PE Delta | Moderate negative relationship | 20 | 10 | ns
-0.41 | Numbers of Errors & BLEU | Strong negative relationship | 63 | 20 | 0.00085
-0.41 | Adequacy & PE Distance | Strong negative relationship | 38 | 13 | 0.011
-0.42 | PE Distance & P Delta | Strong negative relationship | 72 | 27 | 0.00024
-0.70 | Fluency & PE Distance | Very strong negative relationship | 38 | 13 | 0.0001
-0.81 | BLEU & PE Distance | Very strong negative relationship | 75 | 27 | 0.0001
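For readers who want to reproduce figures of this kind, the following is a minimal sketch of how Pearson's r, its strength label and the t statistic behind a significance test could be computed. The slides do not state the thresholds used for the labels; the bands below (0.20, 0.30, 0.40, 0.70) are an assumption that happens to be consistent with every label in the summary, and all function names are hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson's product-moment correlation for two equally long samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length (at least 2 values)")
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / math.sqrt(var_x * var_y)


def strength_label(r):
    """Map r to a verbal label; the band boundaries are assumed, not from the slides."""
    size = abs(r)
    if size < 0.20:
        return "No or negligible relationship"
    if size < 0.30:
        band = "Weak"
    elif size < 0.40:
        band = "Moderate"
    elif size < 0.70:
        band = "Strong"
    else:
        band = "Very strong"
    direction = "positive" if r > 0 else "negative"
    return f"{band} {direction} relationship"


def t_statistic(r, n):
    """t value for testing r against zero, with n - 2 degrees of freedom."""
    if n < 3 or abs(r) >= 1:
        raise ValueError("need n >= 3 and |r| < 1")
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)


print(strength_label(0.82))   # Very strong positive relationship
print(strength_label(-0.30))  # Moderate negative relationship
```

Turning the t statistic into a p value requires the t distribution, which a statistics package would normally supply. The sketch stops at the t value to stay within the standard library.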

Takeaways

The strongest correlations were found between:

- Adequacy & Fluency
- BLEU and PE Distance
- Adequacy & Productivity Delta
- Fluency & Productivity Delta
- Fluency & PE Distance

The Human Evaluations come out as stronger indicators for potential post-editing productivity gains than Automatic metrics.

CORRELATIONS

Looking at subsets

Adequacy and Fluency versus BLEU

[Chart: Adequacy, Fluency & BLEU Correlation – Select Locales. Y-axis: Pearson's r, from -1.00 to 1.00. X-axis: da_DK, de_DE, es_ES, es_LA, fr_CA, fr_FR, it_IT, ja_JP, ko_KR, pt_BR, ru_RU, zh_CN. Series: Adequacy & BLEU, Fluency & BLEU.]

Adequacy, Fluency and BLEU correlation for locales with 4 or more test sets*

Although the test sets here are too small to be statistically relevant, the correlations seem to vary significantly between locales. Would this be maintained with more data, and what are the reasons for the differences?
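A per-locale breakdown like this one can be sketched as follows: group the test sets by locale, skip locales with fewer than four test sets (the cut-off named in the chart caption), and compute Pearson's r for the rest. The record layout and the key names (`locale`, `fluency`, `bleu`) are hypothetical and not taken from the database described in the slides.

```python
import math
from collections import defaultdict

MIN_TEST_SETS = 4  # cut-off taken from the chart caption


def pearson_r(xs, ys):
    """Pearson's r; raises ValueError if either sample is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / math.sqrt(var_x * var_y)


def correlation_by_locale(records, x_key, y_key, min_sets=MIN_TEST_SETS):
    """Return {locale: r} for every locale with at least `min_sets` test sets."""
    grouped = defaultdict(list)
    for rec in records:
        grouped[rec["locale"]].append((rec[x_key], rec[y_key]))

    result = {}
    for locale, pairs in grouped.items():
        if len(pairs) < min_sets:
            continue  # too few test sets to report a correlation
        xs, ys = zip(*pairs)
        result[locale] = pearson_r(xs, ys)
    return result
```

Calling `correlation_by_locale(records, "fluency", "bleu")` would return one coefficient per qualifying locale. With only a handful of test sets per locale these values are unstable, which matches the caveat above.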

Looking at subsets, II

Adequacy and Fluency versus PE Distance

Fluency and PE distance across all locales have a cumulative Pearson’s r of -0.70, a very strong negative relationship

Adequacy and PE distance across all locales have a cumulative Pearson’s r of -0.41, a strong negative relationship

[Chart: Adequacy, Fluency and PE Distance Correlation. Y-axis: Pearson's r, from -1.00 to 1.00. X-axis: de_DE, es_ES/LA, fr_FR/CA, it_IT, pt_BR. Series: Adequacy & PE Distance, Fluency & PE Distance.]

Looking at a few select locales with the highest numbers of tests, the picture is more varied again.

Outliers: Results

Based on some of the data shown previously, and from the point of view of consistent results versus outliers:

• For Human Evaluations of raw MT output, the inter-annotator agreement was consistent in terms of scores (same test set and language)

• Metrics based on human effort (productivity delta) are less consistent and might include significant variations between individuals (same test set and language)
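One simple way to surface such individual variation is to flag post-editors whose productivity delta lies far from the group median for the same test set. The sketch below uses the median absolute deviation (MAD) with a cut-off of 3. Both the method and the threshold are illustrative assumptions, not the approach used by the presenters.

```python
import statistics


def flag_outliers(deltas_by_editor, threshold=3.0):
    """Return the editors whose delta is more than `threshold` MADs from the median.

    `deltas_by_editor` maps a (hypothetical) editor ID to a productivity delta
    measured on the same test set and language.
    """
    values = list(deltas_by_editor.values())
    median = statistics.median(values)
    mad = statistics.median([abs(v - median) for v in values])
    if mad == 0:
        return []  # no spread to measure against, so nothing can be flagged
    return sorted(
        editor
        for editor, value in deltas_by_editor.items()
        if abs(value - median) / mad > threshold
    )
```

A median-based rule is used here because, with only a handful of post-editors per test, a single extreme value would distort a mean and standard deviation.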

Further Questions

Based on the premise that there are significant variations between different post-editors…

… and with the aim of learning from individual behaviors and predicting future productivity gains, we ask ourselves two questions:

• What circumstances or variables most reliably facilitate good-quality, highly productive post-editing?

• Do conditions and parameters outside the post-editor’s control facilitate or hamper his or her success?

Survey

Q1: What is your primary target language?
Q2: What is your background?
Q3: How many years experience?
Q4: How is your work environment?
Q5: Which of the following CAT tools have you worked with?
Q6: What is your level of proficiency on the CAT tool(s) you use?
Q7: What is your translation methodology?
Q8: How do you primarily enter text?
Q9: What are your quality assurance and automation processes?
Q10: What do you consider most important in your assignments?

5 languages (DE, FR, JP, PTBR, HU)
38 linguists (belonging to 14 different teams)

Probably less surprising…

- Except for 1 respondent, all respondents have more experience with translation than with post-editing
- The overall correlation between translation experience and post-editing experience is “strong”

However, looking at correlations by locale:
- German: very strong
- French: weak
- Japanese: weak
- PTBR: strong
- Hungarian: weak

This suggests that for German and Brazilian Portuguese only, the overall experience as a professional translator (whether junior or senior) gives us insight into how much post-editing experience to expect. For the other 3 locales, profiles are more varied.

Q3: How many years experience do you have?

The choice of CAT tool is to some extent dependent on the client requirement, but what the data shows is that all locales & respondents are using a broad range of CAT tools for their work.

On average, respondents use / are familiar with 6-8 different CAT tools.

There is a slight trend that junior translators use / are familiar with more CAT tools than senior translators.

All respondents claim to be proficient and / or expert in their most frequently used CAT tool.

- 6 out of 8 Hungarian respondents call themselves “Experts”
- 3 out of 8 Germans
- 4 out of 9 French
- 1 out of 7 PTBR
- None of the Japanese respondents (despite having, on average, the most translation experience)

Q5: Which of the following CAT tools have you worked with? Please select all that apply

Of the 5 locales, the French respondents stand out as a very homogeneous group:

- Rarely making use of any pre-processing steps
- Never using free MT tools
- Never using internal MT tools

The Japanese, Brazilian and Hungarian respondents are more likely to perform pre-processing steps

Japanese translators appear to copy to Word more than any other locale

Hungarian translators were the only group with almost half of the respondents never doing draft translations first, but working segment by segment

Q7: Please evaluate the following statements on translation methodology

Looking at respondents who Always / Frequently perform any of the 5 proposed actions,
- There was no clear trend with regard to years of translation experience
- There was no clear trend with regard to background
- There was no clear trend with regard to whether resources work in an office, at home, etc.

With regard to text input methods,

French and German translators seem to make more use of CAT tool shortcuts.

Japanese requires the use of Input Method Editors.

… Q7: Please evaluate the following statements on translation methodology

• Romance languages are the best performers on MT.

• User Assistance is the most suitable content (apart from UGC).

• Translators do not improve homogeneously when moving to post-editing (some of them do not improve at all).

• It is more difficult to foresee post-editing effort than to assess the quality of raw MT. The human effort is still the most variable aspect.

• In some locales (Germany, Brazil) “senior translators” accept post-editing as much as junior translators might do.

• Our French linguists seem to use less automation in their processes.

Final Conclusions

White Papers: Two white papers elaborating on the approach and results of the Analysis of the Database will be published in the near future.

www.welocalize.com

More research: We continue adding data to our Database; we have also included the survey in our hand-off material when doing productivity tests, with the aim of gaining more insight into the post-editors' background.

Next Projects


THANK YOU
