Measuring Confidence Intervals for MT Evaluation Metrics
Ying Zhang (Joy), Stephan Vogel
Language Technologies Institute, School of Computer Science, Carnegie Mellon University
TMI, Baltimore, MD, October 2004
Outline
• Automatic Machine Translation Evaluation
– BLEU
– Modified BLEU
– NIST MTEval
• Confidence Intervals based on Bootstrap Percentile
– Algorithm
– Comparing two MT systems
– Implementation
• Discussions
– How much testing data is needed?
– How many reference translations are needed?
– How many bootstrap samples are needed?
Automatic Machine Translation Evaluation
• Subjective MT evaluations
– Fluency and Adequacy scored by human judges
– Very expensive in time and money
• Objective automatic MT evaluations
– Inspired by the Word Error Rate metric used by ASR research
– Measuring the “closeness” between the MT hypothesis and human reference translations
– Precision: n-gram precision
– Recall:
• Against the best matched reference
• Approximated by brevity penalty
– Cheap, fast
– Highly correlated with subjective evaluations
– MT research has greatly benefited from automatic evaluations
– Typical metrics: IBM BLEU, CMU M-BLEU, CMU METEOR, NIST MTeval, NYU GTM
BLEU Metrics
• Proposed by IBM’s SMT group (Papineni et al, 2002)
• Widely used in MT evaluations
– DARPA TIDES MT evaluation
– IWSLT evaluation
– TC-Star
• BLEU Metric:
– p_n: modified n-gram precision
– Geometric mean of p_1, p_2, ..., p_N
– BP: brevity penalty
– Usually, N = 4 and w_n = 1/N

$$\mathrm{BLEU} = BP \cdot \exp\left(\sum_{n=1}^{N} w_n \log p_n\right)$$

$$BP = \begin{cases} 1 & \text{if } c > r \\ e^{(1 - r/c)} & \text{if } c \le r \end{cases}$$

where c is the length of the MT hypothesis and r is the effective reference length.
BLEU Metric
• Example:
– MT Hypothesis: the gunman was shot dead by police .
– Reference 1: The gunman was shot to death by the police .
– Reference 2: The gunman was shot to death by the police .
– Reference 3: Police killed the gunman .
– Reference 4: The gunman was shot dead by the police .
• Precision: p1=1.0(8/8) p2=0.86(6/7) p3=0.67(4/6) p4=0.6 (3/5)
• Brevity Penalty: c = 8, r = 9, BP = 0.8825
• Final Score:

$$\mathrm{BLEU} = (1.0 \times 0.86 \times 0.67 \times 0.6)^{1/4} \times 0.8825 = 0.68$$

• Usually the n-gram precisions and BP are calculated on the test set level
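As a sanity check, a minimal Python sketch of this computation (not the authors' evaluation scripts), plugging in the precisions and lengths from the example:

```python
import math

# n-gram precisions from the example: 8/8, 6/7, 4/6, 3/5
precisions = [8/8, 6/7, 4/6, 3/5]
c, r = 8, 9  # hypothesis length and effective reference length

# Brevity penalty: 1 if c > r, else e^(1 - r/c)
bp = 1.0 if c > r else math.exp(1 - r / c)

# BLEU = BP * exp(sum_n w_n * log p_n), with uniform weights w_n = 1/N
weights = [1 / len(precisions)] * len(precisions)
bleu = bp * math.exp(sum(w * math.log(p) for w, p in zip(weights, precisions)))

print(f"BP = {bp:.4f}, BLEU = {bleu:.2f}")  # BP = 0.8825, BLEU = 0.68
```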
Modified BLEU Metric
• BLEU focuses heavily on long n-grams because of the geometric mean
• Example:
• Modified BLEU Metric (Zhang, 2004)
– Arithmetic mean of the n-gram precisions
– More balanced contribution from different n-grams

$$\text{M-BLEU} = BP \cdot \sum_{n=1}^{N} w_n\, p_n$$
p1 p2 p3 p4 BLEU
MT1 1.0 0.21 0.11 0.06 0.19
MT2 0.35 0.32 0.28 0.26 0.30
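A small illustrative sketch of the difference, assuming BP = 1 for both systems (the slide's table gives only the n-gram precisions); it reproduces the table's BLEU column and shows how the arithmetic mean changes the ranking:

```python
import math

precisions = {
    "MT1": [1.0, 0.21, 0.11, 0.06],
    "MT2": [0.35, 0.32, 0.28, 0.26],
}
bp = 1.0  # assumed: the table gives no hypothesis/reference lengths

for system, p in precisions.items():
    w = 1.0 / len(p)
    bleu = bp * math.exp(sum(w * math.log(x) for x in p))  # geometric mean
    m_bleu = bp * sum(w * x for x in p)                    # arithmetic mean
    print(f"{system}: BLEU = {bleu:.2f}, M-BLEU = {m_bleu:.3f}")

# BLEU ranks MT2 (0.30) above MT1 (0.19); M-BLEU reverses the ranking,
# because MT1's strong unigram precision is no longer dragged down by its
# weak 3/4-gram precisions (M-BLEU: MT1 ~ 0.345 vs MT2 ~ 0.303).
```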
NIST MTEval Metric
• Motivation
– “Weight more heavily those n-grams that are more informative” (NIST 2002)
– Use an arithmetic mean of the n-gram scores
• Pros: more sensitive than BLEU
• Cons:
– Info gain for 2-grams and up is not meaningful
• 80% of the score comes from unigram matches
• Most matched 5-grams have info gain 0!
– Score increases when the test set size increases
$$\mathrm{NIST} = \left\{ \sum_{n=1}^{N} \frac{\displaystyle\sum_{\text{all } w_1 \ldots w_n \text{ that co-occur}} \mathrm{Info}(w_1 \ldots w_n)}{\displaystyle\sum_{\text{all } w_1 \ldots w_n \text{ in hyp}} 1} \right\} \cdot BP$$

$$\mathrm{Info}(w_1 \ldots w_n) = \log_2\left( \frac{\#\,\text{occurrences of } w_1 \ldots w_{n-1}}{\#\,\text{occurrences of } w_1 \ldots w_n} \right)$$
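A minimal sketch of the information-gain computation under these definitions (the `ngram_counts`/`info` helpers are hypothetical, not the official NIST mteval implementation; counts are taken over the reference corpus, and for unigrams the numerator is the total number of reference words):

```python
import math
from collections import Counter

def ngram_counts(reference_corpus, max_n=5):
    """Count all n-grams (1..max_n) over a list of tokenized reference sentences."""
    counts = Counter()
    for tokens in reference_corpus:
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    return counts

def info(ngram, counts, total_words):
    """Info(w1..wn) = log2(#(w1..w_{n-1}) / #(w1..wn))."""
    denom = counts[ngram]
    numer = total_words if len(ngram) == 1 else counts[ngram[:-1]]
    return math.log2(numer / denom)

refs = [["the", "gunman", "was", "shot", "dead", "by", "the", "police"]]
counts = ngram_counts(refs)
total = sum(len(r) for r in refs)
print(info(("the",), counts, total))           # frequent words get low info
print(info(("the", "gunman"), counts, total))  # info of the bigram given "the"
```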
Questions Regarding MT Evaluation Metrics
• Do they rank the MT systems in the same way as human judges?
– IBM showed a strong correlation between BLEU and human judgments
• How reliable are the automatic evaluation scores?
• How sensitive is a metric?
– Sensitivity: the metric should be able to distinguish between systems of similar performance
• Is the metric consistent?
– Consistency: the difference between systems is not affected by the selection of testing/reference data
• How many reference translations are needed?
• How much testing data is sufficient for evaluation?
• If we can measure the confidence interval of the evaluation scores, we can answer the above questions
Outline
• Overview of Automatic Machine Translation Evaluation
– BLEU
– Modified BLEU
– NIST MTEval
• Confidence Intervals based on Bootstrap Percentile
– Algorithm
– Comparing two MT systems
– Implementation
• Discussions
– How much testing data is needed?
– How many reference translations are needed?
– How many bootstrap samples are needed?
Measuring the Confidence Intervals
• One BLEU/M-BLEU/NIST score per test set
• How accurate is this score?
• To measure the confidence interval, a population is required
• Building a test set with multiple human reference translations is expensive
• Solution: bootstrapping (Efron 1986)
– Introduced in 1979 as a computer-based method for estimating the standard errors of a statistical estimate
– Resampling: creating an artificial population by sampling with replacement
– Proposed by Franz Och (2003) to measure the confidence intervals for automatic MT evaluation metrics
A Schematic of the Bootstrapping Process
(Schematic figure: the original test set produces Score0; each bootstrap-resampled test set produces its own score.)
An Efficient Implementation
• Translate and evaluate 2,000 test sets?
– No way!
• Resample the n-gram precision information for the sentences
– Most MT systems are context independent at the sentence level
– MT evaluation metrics are based on information collected for each test sentence
– E.g. for BLEU/M-BLEU and NIST:
RefLen: 17 20 19 24
ClosestRefLen: 17
1-gram: 15 10 89.34
2-gram: 14  4  9.04
3-gram: 13  3  3.65
4-gram: 12  2  2.43
– Similar for human judgments and other MT metrics
• Approximation for the NIST information gain
• Scripts available at: http://projectile.is.cs.cmu.edu/research/public/tools/bootStrap/tutorial.htm
Algorithm
Original test suite T0 with N segments and R reference translations
Represent the i-th segment of T0 as an n-tuple:
T0[i]=<si, ri1,ri2,..,riR>
for(b=1;b<=B;b++){
for(i=1;i<=N;i++){
s = random(1,N);
Tb[i] = T0[s];
}
Calculate BLEU/M-BLEU/NIST for Tb
}
Sort the B BLEU/M-BLEU/NIST scores
Output the scores ranked at the 2.5th and 97.5th percentiles
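A compact Python sketch of this resampling loop, assuming per-sentence sufficient statistics have already been collected as on the previous slide (the `bleu_from_stats` aggregation and the toy `stats` records are illustrative stand-ins, not the authors' released scripts):

```python
import math
import random

def bleu_from_stats(stats, max_n=4):
    """Aggregate per-sentence statistics into a corpus BLEU score.
    Each entry is (hyp_len, closest_ref_len, matches[1..N], totals[1..N]);
    the sketch assumes at least one match per order in every resample."""
    c = sum(s[0] for s in stats)
    r = sum(s[1] for s in stats)
    bp = 1.0 if c > r else math.exp(1 - r / c)
    log_prec = 0.0
    for n in range(max_n):
        matched = sum(s[2][n] for s in stats)
        total = sum(s[3][n] for s in stats)
        log_prec += math.log(matched / total) / max_n
    return bp * math.exp(log_prec)

def bootstrap_ci(stats, B=2000, alpha=0.05, seed=0):
    """Percentile bootstrap: resample sentences with replacement B times."""
    rng = random.Random(seed)
    N = len(stats)
    scores = []
    for _ in range(B):
        sample = [stats[rng.randrange(N)] for _ in range(N)]
        scores.append(bleu_from_stats(sample))
    scores.sort()
    return scores[int(alpha / 2 * B)], scores[int((1 - alpha / 2) * B) - 1]

# Toy data: three "sentences" with made-up n-gram statistics
stats = [
    (8, 9, [8, 6, 4, 3], [8, 7, 6, 5]),
    (15, 17, [10, 4, 3, 2], [15, 14, 13, 12]),
    (12, 12, [11, 8, 5, 3], [12, 11, 10, 9]),
]
low, high = bootstrap_ci(stats)
print(f"95% confidence interval: [{low:.3f}, {high:.3f}]")
```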
Confidence Intervals
• 7 Chinese-English MT systems from June 2002 TIDES evaluation
• Observations:
– Relative confidence interval: NIST < M-BLEU < BLEU
– NIST scores have more discriminative power than BLEU
– The strong impact of long n-grams makes the BLEU score less stable (or: introduces more noise)
Are Two MT Systems Different?
• Comparing two MT systems’ performance
– Using a similar method as for a single system
– E.g. Diff(Sys1-Sys2): Median = -1.7355, interval [-1.9056, -1.5453]
– If the confidence interval of the difference contains 0, the two systems are not significantly different
• M-Bleu and NIST have more discriminative power than Bleu
• Automatic metrics have pretty high correlations with the human ranking
• Human judges like system E (Syntactic system) more than B (Statistical system), but automatic metrics do not
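A minimal sketch of the pairwise test under the same resampling idea (assuming both systems translated the same test set; `score_fn` would be an aggregation such as the hypothetical `bleu_from_stats` above):

```python
import random

def bootstrap_diff_ci(stats_sys1, stats_sys2, score_fn, B=2000, alpha=0.05, seed=0):
    """Percentile bootstrap for the score difference between two systems.
    stats_sys1[i] and stats_sys2[i] must describe the same source sentence i."""
    rng = random.Random(seed)
    N = len(stats_sys1)
    diffs = []
    for _ in range(B):
        idx = [rng.randrange(N) for _ in range(N)]  # same sentences for both systems
        d = score_fn([stats_sys1[i] for i in idx]) - score_fn([stats_sys2[i] for i in idx])
        diffs.append(d)
    diffs.sort()
    return diffs[int(alpha / 2 * B)], diffs[int((1 - alpha / 2) * B) - 1]

# low, high = bootstrap_diff_ci(stats_A, stats_B, bleu_from_stats)
# If low <= 0 <= high, the two systems are not significantly different at this level.
```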
Outline
• Overview of Automatic Machine Translation Evaluation
– BLEU
– Modified BLEU
– NIST MTEval
• Confidence Intervals based on Bootstrap Percentile
– Algorithm
– Comparing two MT systems
– Implementation
• Discussions
– How much testing data is needed?
– How many reference translations are needed?
– How many bootstrap samples are needed?
– Non-parametric interval or normal/t-intervals?
How much testing data is needed
(Figures: NIST, BLEU, and M-BLEU scores, and F+A human judgments, for systems A–G as a function of the percentage of the test data used, from 10% to 100%.)
How much testing data is needed
• NIST scores increase steadily with the growing test set size
• The distance between the scores of the different systems remains stable when using 40% or more of the test set
• The confidence intervals become narrower for larger test sets
• Rule of thumb: doubling the testing data size narrows the confidence interval by about 30% (theoretically justified; see the worked example below)
* System A, (Bootstrap Size B=2000)
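The theoretical justification is the usual $1/\sqrt{N}$ behaviour of a standard error computed from $N$ roughly independent segments:

$$\text{CI width} \;\propto\; \frac{1}{\sqrt{N}}, \qquad \frac{\text{width}(2N)}{\text{width}(N)} \;=\; \frac{1}{\sqrt{2}} \;\approx\; 0.71,$$

i.e. doubling the test set size shrinks the interval by roughly 30%.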
Effects of Using Multiple References
• Single reference from one translator may favor some systems
• Increasing the number of references narrows down the relative confidence interval
(Figure: BLEU scores of MT systems CE01–CE07 scored against each single reference, REF01–REF04, and against all four references, 4-REF.)
How Many Reference Translations are Sufficient?
• Confidence intervals become narrower with more reference translations
• 100% of the test data with 1 reference ≈ 80–90% with 2 references ≈ 70–80% with 3 references ≈ 60–70% with 4 references (comparable confidence interval widths)
• One additional reference translation compensates for 10–15% of the testing data
* System A, (Bootstrap Size B=2000)
Do We Really Need Multiple References?
• Parallel multiple reference
• Single reference from multiple translators*
– Reduced bias from different translators
– Yields the same confidence interval/reliability as the parallel multiple reference
– Costs only half of the effort compared to building a parallel multiple reference set
*Originally proposed in IBM’s BLEU report
Single Reference from Multiple Translators
• Reduced bias by mixing references from different translators
• Yields the same confidence intervals
(Figure: BLEU scores of MT systems CE01–CE07 for eight single-reference sets mixed from different translators, mixedREF1–mixedREF8, and for the full four-reference set, 4-REF.)
Bootstrap-t Interval vs. Normal/t Interval
• Normal distribution / t-distribution
– Assuming that $Z = \dfrac{\hat{\theta} - \theta}{\widehat{se}} \sim N(0,1)$, the interval is $[\hat{\theta} - z^{(1-\alpha)}\,\widehat{se},\ \hat{\theta} - z^{(\alpha)}\,\widehat{se}]$
• Student’s t-interval (when n is small)
– Assuming that $Z = \dfrac{\hat{\theta} - \theta}{\widehat{se}} \sim t_{n-1}$, the interval is $[\hat{\theta} - t_{n-1}^{(1-\alpha)}\,\widehat{se},\ \hat{\theta} - t_{n-1}^{(\alpha)}\,\widehat{se}]$
• Bootstrap-t interval
– For each bootstrap sample b, calculate $Z^{*}(b) = \dfrac{\hat{\theta}^{*}(b) - \hat{\theta}}{\widehat{se}^{*}(b)}$
– The $\alpha$-th percentile of $Z^{*}(b)$ is estimated by the value $\hat{t}^{(\alpha)}$ such that $\#\{Z^{*}(b) \le \hat{t}^{(\alpha)}\}/B = \alpha$
– The bootstrap-t interval is $[\hat{\theta} - \hat{t}^{(1-\alpha)}\,\widehat{se},\ \hat{\theta} - \hat{t}^{(\alpha)}\,\widehat{se}]$
– E.g., if B = 1000, the 50th and the 950th largest values of $Z^{*}(b)$ give the bootstrap-t interval
Bootstrap-t interval vs. Normal/t interval (Cont.)
• Bootstrap-t intervals assume no distribution, but
– They can give erratic results
– They can be heavily influenced by a few outlying data points
• When B is large, the bootstrap sample scores are quite close to a normal distribution
• Assuming a normal distribution gives more reliable intervals, e.g. for the BLEU relative confidence interval (B=500):
– STDEV = 0.27 for the bootstrap-t interval
– STDEV = 0.14 for the normal/Student’s-t interval
(Figure: histogram of 2,000 bootstrap BLEU scores; x-axis: BLEU score, y-axis: frequency.)
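A small sketch of the two ways of turning B bootstrap scores into an interval (assuming `scores` is a list of bootstrap BLEU scores, e.g. produced by the resampling sketch earlier; this is not the authors' code): the percentile interval reads off the sorted scores, while the normal-approximation interval uses only their mean and standard deviation.

```python
import statistics

def percentile_interval(scores, alpha=0.05):
    """Non-parametric interval: sort the scores and read off the percentiles."""
    s = sorted(scores)
    B = len(s)
    return s[int(alpha / 2 * B)], s[int((1 - alpha / 2) * B) - 1]

def normal_interval(scores, z=1.96):
    """Normal-approximation interval: mean +/- z * stdev of the bootstrap scores."""
    mean = statistics.mean(scores)
    sd = statistics.stdev(scores)
    return mean - z * sd, mean + z * sd

# With a large B the two intervals are usually close, because the bootstrap
# scores are approximately normally distributed; the normal interval is less
# sensitive to a few outlying replicates.
```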
The Number of Bootstrap Replications B
• The ideal bootstrap estimate of the confidence interval takes B → ∞
• Computational time increases linearly with B
• The greater B, the smaller the standard deviation of the estimated confidence intervals, e.g. for BLEU’s relative confidence interval:
– STDEV = 0.60 when B=100; STDEV = 0.27 when B=500
• Two rules of thumb:
– Even a small B, say B=100, is usually informative
– B>1000 gives quite satisfactory results
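One way to check this empirically, sketched here by re-running the hypothetical `bootstrap_ci` from the algorithm sketch with different random seeds and measuring how much the interval width fluctuates for each B:

```python
import statistics

def interval_width_stdev(stats, B, repeats=20):
    """Standard deviation of the bootstrap CI width over repeated runs."""
    widths = []
    for seed in range(repeats):
        low, high = bootstrap_ci(stats, B=B, seed=seed)  # from the earlier sketch
        widths.append(high - low)
    return statistics.stdev(widths)

# for B in (100, 500, 1000, 2000):
#     print(B, interval_width_stdev(stats, B))
# The spread of the estimated interval shrinks as B grows; beyond B ~ 1000
# the gain is small relative to the extra computation.
```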
Conclusions
• Using the bootstrapping method to measure the confidence intervals of MT evaluation metrics
• Using confidence intervals to study the characteristics of an MT evaluation metric
– Correlation with human judgments
– Sensitivity
– Consistency
• Modified BLEU is a better metric than BLEU
• Single reference from multiple translators is as good as parallel multiple references and costs only half the effort
References
• Efron, B. and R. Tibshirani: 1986, 'Bootstrap Methods for Standard Errors, Confidence Intervals, and Other Measures of Statistical Accuracy', Statistical Science 1, pp. 54-77.
• Och, F. J.: 2003, 'Minimum Error Rate Training in Statistical Machine Translation', In Proc. of ACL, Sapporo, Japan.
• Bisani, M. and H. Ney: 2004, 'Bootstrap Estimates for Confidence Intervals in ASR Performance Evaluation', In Proc. of ICASSP, Montreal, Canada, Vol. 1, pp. 409-412.
• Leusch, G., N. Ueffing and H. Ney: 2003, 'A Novel String-to-String Distance Measure with Applications to Machine Translation Evaluation', In Proc. of the 9th MT Summit, New Orleans, LA.
• Melamed, I. D., R. Green and J. P. Turian: 2003, 'Precision and Recall of Machine Translation', In Proc. of NAACL/HLT 2003, Edmonton, Canada.
• King, M., A. Popescu-Belis and E. Hovy: 2003, 'FEMTI: Creating and Using a Framework for MT Evaluation', In Proc. of the 9th MT Summit, New Orleans, LA, USA.
• Nießen, S., F. J. Och, G. Leusch and H. Ney: 2000, 'An Evaluation Tool for Machine Translation: Fast Evaluation for MT Research', In Proc. of LREC 2000, Athens, Greece.
• NIST Report: 2002, 'Automatic Evaluation of Machine Translation Quality Using N-gram Co-Occurrence Statistics', http://www.nist.gov/speech/tests/mt/doc/ngram-study.pdf
• Papineni, K., S. Roukos, et al.: 2002, 'BLEU: A Method for Automatic Evaluation of Machine Translation', In Proc. of the 40th ACL.
• Zhang, Y., S. Vogel and A. Waibel: 2004, 'Interpreting BLEU/NIST Scores: How Much Improvement Do We Need to Have a Better System?', In Proc. of LREC 2004, Lisbon, Portugal.
Questions and Comments?
N-gram Contributions to NIST Score