In an ideal world...
Linguistic information is seamlessly combined with statistical information as part of translation systems to produce perfect translations
We are moving in that direction:
◦ Morphology
◦ Syntax
◦ Semantics (SRL): (Wu & Fung 2009), (Liu & Gildea 2010), (Aziz et al. 2011)
Meanwhile…
Linguistic information to evaluate MT quality
◦ Based on reference translations
Linguistic information to estimate MT quality
◦ Using machine learning
Linguistic information to detect errors in MT
◦ Automatic post-editing
Handle variations in MT (words and structure) wrt the reference, or identify differences between MT and reference
◦ METEOR (Denkowski & Lavie 2011): words and phrases
◦ (Giménez & Màrquez 2010): matching of lexical, syntactic, semantic and discourse units
◦ (Lo & Wu 2011): SRL and manual matching of ‘who’ did ‘what’ to ‘whom’, etc.
◦ (Rios et al. 2011): automatic SRL with automatic (inexact) matching of predicates and arguments
Essentially: matching of linguistic units
◦ Similar to n-gram matching metrics, but units are not only words
Metrics based on lexical units perform better
Issues:
◦ Lack of (good) resources for certain languages
◦ Unreliable processing of incorrect translations
◦ Sparsity at sentence level, depending on the actual features, e.g. matching of named entities
Goal: given the output of an MT system for a given input, provide an estimate of its quality
Uses:
◦ Filter out bad-quality translations before post-editing
◦ Select “perfect” translations for publishing
◦ Flag unreliable translations for readers who know only the target language
◦ Select the best translation for a given input when multiple MT/TM systems are available
NOT standard MT evaluation:
◦ Reference translations are NOT available
◦ Estimation for unseen translations
My approach:
◦ Translation unit: sentence
◦ Independent from MT system
1. Define the aspect of quality to estimate and how to represent it
2. Identify and extract features that explain that aspect of quality
3. Collect examples of translations with different levels of quality and annotate them
4. Learn a model to predict quality scores for new translations and evaluate it
[Diagram: Source text → MT system → Translation → Quality?, annotated with four groups of indicators]
◦ Complexity indicators
◦ Confidence indicators
◦ Fluency indicators
◦ Adequacy indicators
Features can be shallow or linguistically motivated
◦ (S/T/S-T) Sentence length
◦ (S/T) Language model
◦ (S/T) Token-type ratio
◦ (S) Readability metrics: Flesch, etc.
◦ (S) Average number of possible translations per word
◦ (S) % of n-grams belonging to different frequency quartiles of a source language corpus
◦ (T) Untranslated/OOV words
◦ (T) Mismatching brackets, quotation marks
◦ (S-T) Preservation of punctuation
◦ (S-T) Word alignment score
◦ etc.
These do well for estimating general quality wrt post-editing needs, but are not enough for other aspects of quality…
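A few of the shallow features listed above can be sketched in plain Python. This is only an illustration under stated assumptions: the regex tokenizer, the feature names, and the reading of "token-type ratio" as tokens divided by distinct types are all choices made here, not details taken from the slides.

```python
import re
from collections import Counter

PUNCT = ".,;:!?"                           # assumed punctuation inventory
PAIRS = {"(": ")", "[": "]", "{": "}"}     # assumed bracket pairs

def tokens(sentence):
    # Hypothetical tokenizer: lowercase words and single punctuation marks.
    return re.findall(r"\w+|[^\w\s]", sentence.lower())

def token_type_ratio(sentence):
    # Assumption: number of tokens divided by number of distinct types.
    toks = tokens(sentence)
    return len(toks) / len(set(toks)) if toks else 0.0

def mismatched_brackets(sentence):
    # Count closing brackets without a matching opener plus openers left unclosed.
    stack, mismatches = [], 0
    closers = set(PAIRS.values())
    for ch in sentence:
        if ch in PAIRS:
            stack.append(PAIRS[ch])
        elif ch in closers:
            if stack and stack[-1] == ch:
                stack.pop()
            else:
                mismatches += 1
    return mismatches + len(stack)

def punctuation_difference(source, target):
    # Sum of absolute count differences per punctuation mark (S-T feature).
    s = Counter(ch for ch in source if ch in PUNCT)
    t = Counter(ch for ch in target if ch in PUNCT)
    return sum(abs(s[p] - t[p]) for p in PUNCT)

def shallow_features(source, target):
    s_len, t_len = len(tokens(source)), len(tokens(target))
    return {
        "source_length": s_len,
        "target_length": t_len,
        "length_ratio": t_len / s_len if s_len else 0.0,
        "source_token_type_ratio": token_type_ratio(source),
        "target_token_type_ratio": token_type_ratio(target),
        "target_bracket_mismatches": mismatched_brackets(target),
        "punctuation_difference": punctuation_difference(source, target),
    }
```

Features such as language model scores or word alignment scores need external resources and are left out of this sketch.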
Count-based
◦ (S/T/S-T) Content/non-content words
◦ (S/T/S-T) Nouns/verbs/… NP/VP/…
◦ (S/T/S-T) Deictics (references)
◦ (S/T/S-T) Discourse markers (references)
◦ (S/T/S-T) Named entities
◦ (S/T/S-T) Zero-subjects
◦ (S/T/S-T) Pronominal subjects
◦ (S/T/S-T) Negation indicators
◦ (T) Subject-verb / adjective-noun agreement
◦ (T) Language Model of POS
◦ (T) Grammar checking (dangling words)
◦ (T) Coherence
Alignment-based
◦ (S-T) Correct translation of pronouns
◦ (S-T) Matching of dependency relations
◦ (S-T) Matching of named entities
◦ (S-T) Alignment of parse trees
◦ (S-T) Alignment of predicates & arguments
◦ etc.
Some features are language-dependent; others need language-dependent resources but apply to most languages, e.g. LM of POS tags
Count-based feature representation:
◦ Source/target only: count or proportion
◦ Contrastive features (S-T): very important, but not a simple matching of linguistic units
   Alignment may not be possible (e.g. clauses/phrases)
   Force same linguistic phenomena in S and T? E.g. verbs (Vs) translated as nouns (Ns)
How to model different linguistic phenomena?
S = linguistic unit in source; T = linguistic unit in target
$$F = S - T \qquad F = |S - T| \qquad F = \frac{S - T}{S} \qquad F = \frac{T}{S} \qquad \dots$$
Count-based feature representation:
◦ Monotonicity of features
◦ Sparsity: is 0-0 as good as 10-10?
Our representation: precision and recall
◦ Does not rely on alignment
◦ Upper bound = 1 (also holds for S, T = 0)
◦ Lower bound = 0
$$F_P = \frac{\min(S, T)}{T} \qquad F_R = \frac{\min(S, T)}{S}$$
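The precision/recall representation can be sketched as a small function over the two counts. The function name is hypothetical, and returning 0.0 when exactly one of the counts is zero is an assumption made here; the slide only states that the upper bound of 1 also holds when both S and T are 0.

```python
def precision_recall(s_count, t_count):
    """Return (F_P, F_R) with F_P = min(S, T) / T and F_R = min(S, T) / S."""
    if s_count < 0 or t_count < 0:
        raise ValueError("counts must be non-negative")
    if s_count == 0 and t_count == 0:
        # The unit is absent on both sides: treated as a perfect match.
        return 1.0, 1.0
    overlap = min(s_count, t_count)
    # Assumption: a score whose denominator is zero is set to 0.0.
    precision = overlap / t_count if t_count else 0.0
    recall = overlap / s_count if s_count else 0.0
    return precision, recall
```

With this representation a 0-0 pair and a 10-10 pair both score (1.0, 1.0), so the feature does not depend on how often the unit occurs, only on whether source and target agree.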
S-T: (Pighin and Màrquez 2011): learn expected projection of SRL from source to target
S-T: (Xiong et al 2010)
◦ Target LM of words and POS tags, dangling words (link grammar parser), word posterior probabilities
S-T: (Bach et al 2011)
◦ Sequences of words and POS tags, context, dependency structures, alignment info
Fine-grained, so they need a lot of training data: 72K sentences, 2.2M words and their manual correction (!)
Estimating post-editing effort
◦ Human scores (1-4): how much post-editing effort?
   1: requires complete retranslation
   2: a lot of post-editing needed, but quicker than retranslation
   3: a little post-editing needed
   4: fit for purpose
Estimating adequacy
◦ Human scores (1-4): to which degree does the translation convey the meaning of the original text?
   1: completely inadequate
   2: poorly adequate
   3: fairly adequate
   4: highly adequate
Machine learning algorithm: SVM for regression
Evaluation: Root Mean Square Error (RMSE)
$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{j=1}^{N}\left(y_j - \hat{y}_j\right)^2}$$
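RMSE is the square root of the mean squared difference between the human scores and the predicted scores over N test sentences. A minimal sketch follows; the function and argument names are chosen here for illustration.

```python
import math

def rmse(gold, predicted):
    """Root mean square error between human scores and predicted scores."""
    if len(gold) != len(predicted) or not gold:
        raise ValueError("need two non-empty score lists of equal length")
    squared_errors = ((y - y_hat) ** 2 for y, y_hat in zip(gold, predicted))
    return math.sqrt(sum(squared_errors) / len(gold))
```

On a 1-4 scale, an RMSE of 1.0 means predictions are off by about one score point on average, with larger errors penalized more heavily than small ones.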
English-Spanish Europarl data
◦ 4 SMT systems, giving 4 sets of 4,000 {source, translation, score} triples
Quality score: 1-4 post-editing effort
Features: 96 shallow versus 169 shallow + linguistic
Distribution of post-editing effort scores:
Score         MT1    MT2    MT3    MT4
1             4%     9%     10%    73%
2             25%    36%    39%    21%
3             54%    40%    43%    6%
4             17%    10%    9%     0%
Avg. quality  2.83   2.56   2.51   1.34
RMSE:
Languages  MT system  All features  No ling. features
en-es      MT1        0.600         0.574
en-es      MT2        0.682         0.671
en-es      MT3        0.671         0.654
en-es      MT4        0.541         0.534
Deviation of 17-22%
MT: The student still has claimed to take the exam at the end of the year - although she has not chosen course.
SRC: A estudante ainda tem pretensão de prestar vestibular no fim do ano – embora não tenha escolhido o curso
REF: The student still has the intention to take the exam at the end of the year – although she has not chosen the course.
Arabic-English Newswire data (GALE)
◦ 2 SMT systems (Rosetta team), giving 2 sets of 2,585 {source, translation, score} triples
Quality score: 1-4 adequacy
Features: 82 shallow versus 122 shallow + linguistic
Distribution of adequacy scores:
Score         MT1    MT2
1             2%     2.3%
2             20%    23%
3             45%    46%
4             33%    28.7%
Avg. quality  3.11   3
RMSE:
Languages  MT system  All features  No ling. features
ar-en      MT1        0.762         0.771
ar-en      MT2        0.756         0.737
Deviation of 14-26%
Best performing:
◦ Length (words, content-words, etc.)
   Absolute numbers are better than proportions
◦ Language model / corpus frequency
◦ Ambiguity of source words
Shallow features are better than linguistic features
◦ Except for one adequacy estimation system
Source/target features are better than contrastive features (shallow and linguistic)
◦ Absolute numbers are better than proportions
Issues:
◦ Feature representation
◦ Sparsity
◦ Need deeper features for adequacy estimation
◦ Annotation:
   1-4 post-editing effort: could be more objective
   1-4 adequacy: can we isolate adequacy from fluency?
◦ Language-dependency
◦ Reliability of resources
   Low quality translations
◦ Availability of resources
General vs specific errors
Bottom-up approach: word-based CE
◦ (Xiong et al 2010)
   Word posterior probability, dangling words (link grammar parser), target words & POS patterns
◦ (Bach et al 2011)
   Dependency relations, words and POS patterns, e.g. relate target words to patterns of POS tags in source
◦ (Bach et al 2011): best features are source-based
Top-down approach (on-going work)
◦ Corpus-based analysis: generalize errors in categories
◦ Portuguese-English
◦ 150 sentences (2 domains, 2 MT systems)
◦ RBMT: more systematic errors
Linguistic indicators         Europarl MT1  News MT1  Europarl MT2  News MT2
Inflectional error            72            40        63            40
Incorrect voice               2             6         13            6
Mistranslated pronoun         61            40        63            35
Missing pronoun               34            13        23            7
Incorrect subject-verb order  6             10        12            9
◦ ~700 errors / 150 sentences
◦ 42 error categories: a few rules per category…
It is possible to estimate the quality of MT systems wrt post-editing needs using shallow, language- and system-independent features
Adequacy estimation is a harder problem
◦ Need more complex linguistic features…
Linguistic features are relevant:
◦ Directly useful for error detection (word-level CE)
◦ Directly useful for automatic post-editing
◦ But… for sentence-level CE:
   Issues with sparsity
   Issues with representation: length bias
Lucia [email protected]
Aziz, W., Rios, M. and Specia, L. 2011. Shallow Semantic Trees for SMT. WMT.
Denkowski, M. and Lavie, A. 2011. Meteor 1.3: Automatic Metric for Reliable Optimization and Evaluation of Machine Translation Systems. WMT.
Giménez, J. and Màrquez, L. 2010. Linguistic Measures for Automatic Machine Translation Evaluation. Machine Translation, Volume 24, Numbers 3-4.
Hardmeier, C. 2011. Improving Machine Translation Quality Prediction with Syntactic Tree Kernels. EAMT-2011.
Liu, D. and Gildea, D. 2010. Semantic role features for machine translation. 23rd Conference on Computational Linguistics.
Pado, S., Galley, M., Jurafsky, D., and Manning, C. 2009. Robust Machine Translation Evaluation with Entailment Features. ACL.
Pighin, D. and Màrquez, L. 2011. Automatic Projection of Semantic Structures: an Application to Pairwise Translation Ranking, SSST-5.
Tatsumi, M. and Roturier, J. 2010. Source Text Characteristics and Technical and Temporal Post-Editing Effort: What is Their Relationship?, 43-51. 2nd JEC Workshop.
Wu, D. and Fung, P. 2009. Semantic roles for SMT: a hybrid two-pass model. HLT/NAACL.
Xiong, D., Zhang, M. and Li, H. 2010. Error Detection for SMT Using Linguistic Features. ACL-2010.
Best features (Pearson’s correlation) (S3 en-es):
[Bar chart of Pearson’s correlation per feature: CE, Aborted nodes, SMT score, Ratio scores, LM target, LM source, Bi-phrase prob, TM, Sent length, BAD 117, BAD 76]
Filtering out bad translations: 1-2 (S3 en-es)
◦ Average human scores in the top n translations (n = 100, 200, 300, 500):
[Chart of average scores x top N; series: Human, CE, Aborted nodes, SMT score, Ratio scores, LM target, LM source, Bi-phrase prob, TM, Sent length, BAD 117, BAD 76]
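The "average human score in the top n translations" used for this comparison can be sketched as follows: rank translations by a predicted score (a CE estimate or any single feature) and average the human scores of the n highest-ranked ones. The function name is hypothetical, and ranking from highest to lowest predicted score is an assumption; a feature where lower is better would need to be negated first.

```python
def average_human_score_top_n(predicted, human, n):
    """Average human score of the n translations ranked best by `predicted`."""
    if len(predicted) != len(human):
        raise ValueError("score lists must have the same length")
    if not 0 < n <= len(predicted):
        raise ValueError("n must be between 1 and the number of translations")
    # Sort by predicted score only, highest first (assumed: higher = better).
    ranked = sorted(zip(predicted, human), key=lambda pair: pair[0], reverse=True)
    return sum(h for _, h in ranked[:n]) / n
```

A ranking that filters well keeps this average high for small n; ranking by the human scores themselves gives the upper bound.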
QE x MT metrics: Pearson’s correlation (S3 en-es)
[Bar chart of Pearson’s correlation per metric: BLEU-4, BLEU-2, NIST, TER, Meteor exact, Meteor porter, CE]
◦ QE score x MT metrics: Pearson’s correlation across MT systems:
Test set  Training set  Pearson QE and human
S3 en-es  S1 en-es      0.478
S3 en-es  S2 en-es      0.517
S3 en-es  S3 en-es      0.542
S3 en-es  S4 en-es      0.423
S2 en-es  S1 en-es      0.531
S2 en-es  S2 en-es      0.562
S2 en-es  S3 en-es      0.547
S2 en-es  S4 en-es      0.442
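Pearson's correlation between predicted and human scores, as reported above, can be computed with a short function. This is a plain-Python sketch with names chosen here; it is not taken from the original experiments.

```python
import math

def pearson(xs, ys):
    """Pearson's correlation coefficient between two equal-length score lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)
```

Values near 1 mean the predictions rank and scale translations much as the human annotators did; values near 0 mean no linear relationship.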
◦ SMT model global score and internal features
◦ Distortion count, phrase probability, ...
◦ % search nodes aborted, pruned, recombined …
◦ Language model using n-best list as corpus
◦ Distance to centre hypothesis in the n-best list
◦ Relative frequency of the words in the translation in the n-best list
◦ Ratio of SMT model score of the top translation to the sum of the scores of all hypotheses in the n-best list, …
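Two of the n-best list features can be sketched in a few lines. Both function names are hypothetical. The sketch assumes the model scores are positive values such as probabilities (log-probabilities would need exponentiating first), that hypotheses are whitespace-tokenized strings, and that the word-frequency feature is averaged over the words of the translation; the slides do not specify these details.

```python
from collections import Counter

def top_score_ratio(scores):
    """Score of the top hypothesis divided by the sum over the n-best list.

    `scores` is ordered best first and assumed to hold positive values.
    """
    total = sum(scores)
    if not scores or total <= 0:
        raise ValueError("need a non-empty list of positive scores")
    return scores[0] / total

def nbest_word_relative_frequency(translation, nbest):
    """Mean relative frequency, within the n-best list, of the translation's words."""
    counts = Counter(word for hyp in nbest for word in hyp.split())
    total = sum(counts.values())
    words = translation.split()
    if not words or total == 0:
        return 0.0
    return sum(counts[w] / total for w in words) / len(words)
```

A ratio close to 1 suggests the decoder strongly preferred its top hypothesis, while a value near 1/n suggests the alternatives scored about the same.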
Best performing:
◦ Length (words, content-words, etc.)
   Absolute numbers are better than proportions
◦ Language model / corpus frequency
◦ Ambiguity of source words
Shallow features are better than linguistic features
◦ Except for one adequacy estimation system
Source/target features are better than contrastive features (shallow and linguistic)
◦ Absolute numbers are better than proportions
Languages  MT system  All features  No ling. features  All features abs.
en-es      MT1        0.600         0.574              0.595
en-es      MT2        0.682         0.671              0.664
en-es      MT3        0.671         0.654              0.662
en-es      MT4        0.541         0.534              0.523