Minimum Error Rate Training in Statistical Machine Translation
By: Franz Och, 2003
Presented By: Anna Tinnemore, 2006
GOAL
To directly optimize translation quality
WHY? Because the standard training criterion does not directly correlate with the evaluation criteria actually used in each field:
F-Measure (parsing)
Mean Average Precision (ranked retrieval)
BLEU and multi-reference word error rate (statistical machine translation)
Problem: the statistical training criterion and the automatic evaluation metrics classify errors differently.
Solution (maybe): optimize the model parameters according to each individual evaluation metric.
Background
The standard decision rule (pick the most probable translation) is optimal only under a “zero-one loss function”
A different error metric implies a different optimal decision rule
Background, continued
Problems: finding suitable feature functions (M of them) and parameter values (λ)
MMI (maximum mutual information) training:
One unique global optimum
Algorithms guaranteed to find it
But does that optimum give optimal translation quality?
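The log-linear decision rule underlying this discussion can be sketched in a few lines (my illustration, not code from the paper): the chosen translation is the candidate maximizing the weighted feature sum Σ_m λ_m h_m(e, f).

```python
# Sketch of the log-linear decision rule: choose the candidate translation
# whose weighted feature sum  sum_m lambda_m * h_m(e, f)  is largest.
# `candidates` pairs each translation with its feature vector (one value per
# feature function h_m); all names here are illustrative, not from the paper.

def best_translation(candidates, lambdas):
    def score(features):
        return sum(l * h for l, h in zip(lambdas, features))
    # argmax over the candidate translations
    return max(candidates, key=lambda c: score(c[1]))[0]
```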
So what?
Review of automatic evaluation criteria
Two training criteria that might help
A new training algorithm for optimizing an unsmoothed error count
Och’s approach
Evaluation of the training criteria
Translation quality metrics
mWER (multi-reference word error rate): edit distance to the closest reference translation
mPER (multi-reference position-independent error rate): bag-of-words edit distance
BLEU: geometric mean of n-gram precisions, with a brevity penalty
NIST: weighted n-gram precision
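As a concrete illustration of mWER, here is a minimal sketch (an assumed implementation, not the paper's code): word-level Levenshtein distance to the closest reference, normalized by that reference's length.

```python
# Minimal mWER sketch: edit distance between the hypothesis and the
# closest reference translation, computed at the word level.

def edit_distance(a, b):
    # Standard Levenshtein distance via dynamic programming (one row kept).
    d = list(range(len(b) + 1))
    for i, wa in enumerate(a, 1):
        prev, d[0] = d[0], i
        for j, wb in enumerate(b, 1):
            prev, d[j] = d[j], min(d[j] + 1,           # deletion
                                   d[j - 1] + 1,       # insertion
                                   prev + (wa != wb))  # substitution
    return d[-1]

def mwer(hypothesis, references):
    # Distance to the closest reference, normalized by its length.
    hyp = hypothesis.split()
    return min(edit_distance(hyp, ref.split()) / len(ref.split())
               for ref in references)
```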
Training
Minimize error rate
Problems: the argmax operation (Eq. 6 in the paper) makes the objective piecewise constant
No known way to guarantee a global optimum; many local optima
Smoothed Error Count
A smoothed error count is easier to handle than the unsmoothed objective, but still tricky (non-convex)
In practice, performance doesn’t change much with smoothing
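One way to smooth the error count can be sketched as follows (a sketch of the general idea with an assumed smoothing factor alpha, not the paper's exact formula): replace the hard argmax over candidates with a softmax-weighted expected error, which is differentiable in λ.

```python
import math

# Softmax-smoothed error count (illustrative sketch, assumed form):
# instead of charging only the top-scoring candidate's error, weight every
# candidate's error by a softmax of its scaled model score.

def smoothed_error(candidates, lambdas, alpha=3.0):
    # candidates: list of (feature_vector, error_count) per hypothesis.
    scores = [alpha * sum(l * h for l, h in zip(lambdas, f))
              for f, _ in candidates]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]  # numerically stable softmax
    z = sum(weights)
    return sum(w / z * err for w, (_, err) in zip(weights, candidates))
```

As alpha grows, the softmax sharpens and the smoothed count approaches the unsmoothed one.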
Unsmoothed Error Count
Standard approach: Powell’s algorithm with grid-based line optimization
A fine-grained grid is slow; a coarse grid can miss the optimal solution
NEW: an exact line search that exploits the structure of the log-linear model
Guaranteed to find the optimum along each search direction; much faster and more stable
![Page 12: Minimum Error Rate Training in Statistical Machine Translation By: Franz Och, 2003 Presented By: Anna Tinnemore, 2006.](https://reader030.fdocuments.us/reader030/viewer/2022013011/56649d5d5503460f94a3caf8/html5/thumbnails/12.jpg)
New Algorithm
Along a search direction, each candidate translation in the set C corresponds to a line f(γ) = t + γ·m
(t and m are constants determined by the fixed and varying parts of the score)
The score of the best candidate, as a function of γ, is therefore piecewise linear
![Page 13: Minimum Error Rate Training in Statistical Machine Translation By: Franz Och, 2003 Presented By: Anna Tinnemore, 2006.](https://reader030.fdocuments.us/reader030/viewer/2022013011/56649d5d5503460f94a3caf8/html5/thumbnails/13.jpg)
Algorithm: the nitty-gritty
For every source sentence f:
Compute the ordered sequence of linear intervals making up the upper envelope of f(γ; f)
Compute the change in error count between adjacent intervals
Merge the sequences γf and ΔEf across all sentences
Traverse the merged sequence of boundaries, keeping a running error count, to find the optimal γ
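The per-sentence sweep can be sketched as follows (my reconstruction of the idea, not Och's code): each candidate is a line with slope m and intercept t in γ; the winning candidate at each γ lies on the upper envelope of these lines, which a convex-hull-style pass over candidates sorted by slope recovers, along with the γ boundary where each segment starts.

```python
# Upper envelope of candidate score lines for one sentence (sketch).
# Each candidate contributes a line  score(gamma) = t + gamma * m.

def upper_envelope(lines):
    # lines: one (slope m, intercept t, error_count) triple per candidate.
    # Returns [(gamma_start, error_count), ...]: from each gamma_start up to
    # the next boundary, that candidate wins and contributes that error.
    lines = sorted(lines, key=lambda l: (l[0], l[1]))
    hull = []  # envelope segments: [m, t, error_count, gamma_start]
    for m, t, err in lines:
        while True:
            if not hull:
                start = float("-inf")
                break
            m0, t0, _, s0 = hull[-1]
            if m == m0:
                hull.pop()                   # equal slopes: higher intercept wins
                continue
            start = (t0 - t) / (m - m0)      # crossing with current top line
            if start <= s0:
                hull.pop()                   # top line is never on the envelope
            else:
                break
        hull.append([m, t, err, start])
    return [(seg[3], seg[2]) for seg in hull]
```

Merging these boundary sequences over all sentences and sweeping left to right with a running total error then yields the optimal γ along the search direction.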
![Page 14: Minimum Error Rate Training in Statistical Machine Translation By: Franz Och, 2003 Presented By: Anna Tinnemore, 2006.](https://reader030.fdocuments.us/reader030/viewer/2022013011/56649d5d5503460f94a3caf8/html5/thumbnails/14.jpg)
Baseline
Same as the alignment template approach: a log-linear model with M = 8 features
Extract n-best candidate translations from the space of all possible translations
Wait a minute . . .
![Page 15: Minimum Error Rate Training in Statistical Machine Translation By: Franz Och, 2003 Presented By: Anna Tinnemore, 2006.](https://reader030.fdocuments.us/reader030/viewer/2022013011/56649d5d5503460f94a3caf8/html5/thumbnails/15.jpg)
N-best???
Overfitting? Unseen data?
First, compute an n-best list using arbitrary initial parameter values; use this list to train the model for new parameters.
Second, search again with the new parameters, produce a new n-best list, and append it to the old one.
Third, use the merged list to train the model for even better parameters.
![Page 16: Minimum Error Rate Training in Statistical Machine Translation By: Franz Och, 2003 Presented By: Anna Tinnemore, 2006.](https://reader030.fdocuments.us/reader030/viewer/2022013011/56649d5d5503460f94a3caf8/html5/thumbnails/16.jpg)
Keep going until the n-best list stops changing – every translation the search can find is already in the list
Each iteration generates approx. 200 additional translations
The algorithm only takes 5-7 iterations to converge
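The iteration described on this and the previous slide can be sketched as a loop (decode and train are placeholders standing in for the n-best search and the error-rate training step, not a real API):

```python
# Iterative n-best training loop (sketch; decode/train are placeholders).

def mert_outer_loop(dev_corpus, initial_lambdas, decode, train):
    # decode(f, lambdas): n-best hypotheses for source sentence f (placeholder)
    # train(nbest):       parameters minimizing error on the list (placeholder)
    lambdas, nbest = initial_lambdas, set()
    while True:
        new = {h for f in dev_corpus for h in decode(f, lambdas)}
        if new <= nbest:            # no unseen hypotheses: list is stable
            return lambdas
        nbest |= new                # append new candidates to the old list
        lambdas = train(nbest)      # retrain on the merged n-best list
```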
![Page 17: Minimum Error Rate Training in Statistical Machine Translation By: Franz Och, 2003 Presented By: Anna Tinnemore, 2006.](https://reader030.fdocuments.us/reader030/viewer/2022013011/56649d5d5503460f94a3caf8/html5/thumbnails/17.jpg)
Additional Sneaky Stuff
Problem with MMI (maximum mutual information) training: the reference sentences have to be part of the n-best list
Solution: fake reference sentences, of course
Select from the n-best list the sentences with the fewest word errors with respect to the REAL references, and call these “pseudo-references”
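The pseudo-reference selection can be sketched like this (word_errors is a placeholder for a word-level edit distance; the names are mine, not the paper's):

```python
# Pseudo-reference selection sketch: keep the n-best hypotheses closest
# (in word errors) to any real reference.

def pseudo_references(nbest, references, word_errors, k=1):
    # word_errors(hyp, ref): word-level edit distance (placeholder function).
    def best_err(hyp):
        return min(word_errors(hyp, ref) for ref in references)
    return sorted(nbest, key=best_err)[:k]
```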
![Page 18: Minimum Error Rate Training in Statistical Machine Translation By: Franz Och, 2003 Presented By: Anna Tinnemore, 2006.](https://reader030.fdocuments.us/reader030/viewer/2022013011/56649d5d5503460f94a3caf8/html5/thumbnails/18.jpg)
Experiment
2002 TIDES Chinese-English small data track task
News text from Chinese to English
Note: no rule-based components used to translate numbers, dates, or names
![Page 19: Minimum Error Rate Training in Statistical Machine Translation By: Franz Och, 2003 Presented By: Anna Tinnemore, 2006.](https://reader030.fdocuments.us/reader030/viewer/2022013011/56649d5d5503460f94a3caf8/html5/thumbnails/19.jpg)
Development Corpus Results
![Page 20: Minimum Error Rate Training in Statistical Machine Translation By: Franz Och, 2003 Presented By: Anna Tinnemore, 2006.](https://reader030.fdocuments.us/reader030/viewer/2022013011/56649d5d5503460f94a3caf8/html5/thumbnails/20.jpg)
Test Corpus Results
![Page 21: Minimum Error Rate Training in Statistical Machine Translation By: Franz Och, 2003 Presented By: Anna Tinnemore, 2006.](https://reader030.fdocuments.us/reader030/viewer/2022013011/56649d5d5503460f94a3caf8/html5/thumbnails/21.jpg)
Conclusions
Alternative training criteria that directly relate to translation quality: unsmoothed and smoothed error count on a development corpus
Optimizing error rate in training yields better results on unseen test data
Maybe “true” translation quality is also increased – we don’t know, because the evaluation metrics themselves need improvement
![Page 22: Minimum Error Rate Training in Statistical Machine Translation By: Franz Och, 2003 Presented By: Anna Tinnemore, 2006.](https://reader030.fdocuments.us/reader030/viewer/2022013011/56649d5d5503460f94a3caf8/html5/thumbnails/22.jpg)
Future Questions
How many parameters can be reliably estimated using different criteria on development corpora of various sizes?
Does the choice of criterion make a difference?
Which error-count criterion (smoothed or unsmoothed) should be optimized in training?
![Page 23: Minimum Error Rate Training in Statistical Machine Translation By: Franz Och, 2003 Presented By: Anna Tinnemore, 2006.](https://reader030.fdocuments.us/reader030/viewer/2022013011/56649d5d5503460f94a3caf8/html5/thumbnails/23.jpg)
Boasting
This approach applies to any evaluation technique
If the evaluation methods ever get better, this algorithm will yield correspondingly better results
![Page 24: Minimum Error Rate Training in Statistical Machine Translation By: Franz Och, 2003 Presented By: Anna Tinnemore, 2006.](https://reader030.fdocuments.us/reader030/viewer/2022013011/56649d5d5503460f94a3caf8/html5/thumbnails/24.jpg)
Side-stepping
It’s possible that this algorithm could be used to “overfit” the evaluation method, giving falsely inflated scores
It’s not our problem: the developers of the evaluation methods should design their metrics so this can’t happen
![Page 25: Minimum Error Rate Training in Statistical Machine Translation By: Franz Och, 2003 Presented By: Anna Tinnemore, 2006.](https://reader030.fdocuments.us/reader030/viewer/2022013011/56649d5d5503460f94a3caf8/html5/thumbnails/25.jpg)
. . . And Around The World
This algorithm has a place wherever evaluation methods are used
It could yield improvements in these other areas as well
![Page 26: Minimum Error Rate Training in Statistical Machine Translation By: Franz Och, 2003 Presented By: Anna Tinnemore, 2006.](https://reader030.fdocuments.us/reader030/viewer/2022013011/56649d5d5503460f94a3caf8/html5/thumbnails/26.jpg)
Questions, observations, accolades . . .
![Page 27: Minimum Error Rate Training in Statistical Machine Translation By: Franz Och, 2003 Presented By: Anna Tinnemore, 2006.](https://reader030.fdocuments.us/reader030/viewer/2022013011/56649d5d5503460f94a3caf8/html5/thumbnails/27.jpg)
My Observations
Improvements do not seem significant
This exposes a problem in the evaluation metrics, but does nothing to solve it
Seems like a good idea, but many questions about the optimal implementation remain unanswered
![Page 28: Minimum Error Rate Training in Statistical Machine Translation By: Franz Och, 2003 Presented By: Anna Tinnemore, 2006.](https://reader030.fdocuments.us/reader030/viewer/2022013011/56649d5d5503460f94a3caf8/html5/thumbnails/28.jpg)
THANK YOU
and Good Night!