AL˝BUSTAN AL˝MARSAM˝ AL˝MARSAM˝˘ AL˝MATBAKH · al˝manara ˜ˆ ˆ˜ ˝˛ ˘ˆ ˜˝ ...
IBM Arabic-to-English Translation for IWSLT 2006 · 2016. 5. 23. · mkAn AyrfAyn fy Al# sbAq gdA...
Transcript of IBM Arabic-to-English Translation for IWSLT 2006 · 2016. 5. 23. · mkAn AyrfAyn fy Al# sbAq gdA...
-
IBM T. J. Watson Research Center
IBM Arabic-to-English Translation for IWSLT 2006
© 2006 IBM Corporation
Young-Suk Lee
IWSLT 2006November 28, 2006, Kyoto, Japan
-
IBM T. J. Watson Research Center
© 2006 IBM Corporation
Outline
• Baseline Decoding
• Technical Challenges and Solutions
• IWSLT 2006 Evaluation Results
• Conclusions
-
IBM T. J. Watson Research Center
© 2006 IBM Corporation
Baseline Decoding
IBM T. J. Watson Research Center
© 2006 IBM Corporation
Phrase translation models
- Direct / source channel / unigram models
- [Tillmann 2003], [Lee et al. 2006]
Modified IBM Model 1 cost [Lee et al. 2006]
Word trigram language models
Distortion models [Al-Onaizan & Papineni 2006]
Word/block count penalty [Zens & Ney 2004]
DP-based phrase
decoder
Translation output We are open from seven p.m. to midnight .
Input sentence � ا� � '2(01 ا�/.- ,+*() '& ا�%$"!# "! ا��
Pre-processing nftH mn AlsAbEp bEd AlZhr Aly mntSf Allyl
-
IBM T. J. Watson Research Center
© 2006 IBM Corporation
IWSLT 2006 Challenge: High OOV Rate
6.3
10.1
10.5
Avg. seg
length
157
489
609
OOV Cnt
3,164
4,763
5,229
Token Cnt
4.96 %Eval05
10.27 %Dev06
11.65 %Eval06
OOV Rate
Correct Recognition Result
-
IBM T. J. Watson Research Center
© 2006 IBM Corporation
Word Segmentation & Morphological Analysis
Word Segmentation [Lee et al. 2003], Morphological Analysis [Lee 2004]
wsyHl sA}q AltjArb fy jAgwAr AlbrAzyly lwsyAnw bwrty mkAn
AyrfAyn fy AlsbAq gdA AlAHd Al*y sykwn Awly xTwAth fy EAlm
sbAqAt AlfwrmwlA
w# s# y# Hl sA}q Al# tjArb fy jAgwAr Al# brAzyly lwsyAnw bwrty
mkAn AyrfAyn fy Al# sbAq gdA Al# AHd Al*y s# y# kwn Awly
xTw +At +h fy EAlm sbAq +At AlfwrmwlA
w# s# yHl sA}q ø tjArb fy jAgwAr Al# brAzyly lwsyAnw bwrty
mkAn AyrfAyn fy Al# sbAq gdA ø AHd Al*y s# ykwn Awly
xTwAt +h fy EAlm sbAqAt AlfwrmwlA
-
IBM T. J. Watson Research Center
© 2006 IBM Corporation
OOV Rate Reduction
0
2
4
6
8
10
12
Eval06 Dev06 Eval05
OOV Rate (%)
BTEC
baseline
BTEC
segmentation
BTEC
analysis
-
IBM T. J. Watson Research Center
© 2006 IBM Corporation
Translation Quality ImprovementBLEUr7n4c
BTEC
baseline
BTEC
segmentation
BTEC
analysis
0.2
0.25
0.3
0.35
0.4
0.45
0.5
0.55
0.6
Eval06 Dev06 Eval05
0.22
0.24 0.25
0.29
0.51
0.56
-
IBM T. J. Watson Research Center
© 2006 IBM Corporation
Out-of-Domain Corpora: Newswires
20,000189,239159,213BTEC
4,530,669
117,420
80,354
819,354
2,207,934
310,079
681,613
150,865
129,181
33,869
# EN words
3,547,700
86,614
70,183
616,879
1,771,893
227,792
520,971
123,505
103,717
26,146
# AR words
121,119OOD Total
2,624FBIS
2,346LDC2001T55
24,874LDC2005E46
52,042LDC2004E08
8,576LDC2004E11
20,358LDC2004E07
5,003LDC2003E09
4,235LDC2003E05
1,043LDC2003T18
# sent pairsSource
-
IBM T. J. Watson Research Center
© 2006 IBM Corporation
OOV Rate Reduction
0
2
4
6
8
10
12
Eval06 Dev06 Eval05
OOV Rate (%)
BTEC+OOD
baseline
BTEC+OOD
segmentation
BTEC+OOD
analysis
BTEC
baseline
-
IBM T. J. Watson Research Center
© 2006 IBM Corporation
Eval05 Translation Quality Improvement
0.55
0.56
0.57
0.58
0.59
0.6
0.61
1 ; 1 2 ; 1 3 ; 1 4 ; 1 5 ; 1 6 ; 1 1 ; 0
BTEC ; OOD Corpora Ratio
BLEUr16n4c baseline
segmentation
analysis
-
IBM T. J. Watson Research Center
© 2006 IBM Corporation
Dev06 Translation Quality Improvement
0.24
0.25
0.26
0.27
0.28
0.29
0.3
0.31
0.32
1 ; 1 2 ; 1 3 ; 1 4 ; 1 5 ; 1 6 ; 1 1 ; 0
BLEUr7n4c
BTEC ; OOD Corpora Ratio
baseline
segmentation
analysis
-
IBM T. J. Watson Research Center
© 2006 IBM Corporation
Eval06 Translation Quality Improvement
0.2
0.22
0.24
0.26
segmentation analysis
BLEUr7n4c
0.2
0.22
0.24
0.26
segmentation analysis
Correct Recognition ASR Output
BTEC ; OOD Corpora Ratio = 4 ; 1
BTEC
BTEC +
OOD
0.235
0.253
0.244
0.235
0.2120.215
0.224
0.218
-
IBM T. J. Watson Research Center
© 2006 IBM Corporation
IWSLT 2006 Open Data Track Results
0.49580.60350.48426.48670.2428Additional
0.51980.60490.48455.84660.2274Official
PERWERMETEORNISTBLEU4
ASR OutputASR OutputASR OutputASR Output
0.44800.55930.53147.16810.2773Additional
0.48250.56680.53166.37690.2549Official
PERWERMETEORNISTBLEU4
Correct Recognition ResultCorrect Recognition ResultCorrect Recognition ResultCorrect Recognition Result
Scores in BLUE indicate the best scores under the given condition
-
IBM T. J. Watson Research Center
© 2006 IBM Corporation
Conclusions
• Techniques for improving translation quality & increasing vocabulary coverage
• Word segmentation & morphological analysis
• Proper combination of domain-specific and out-of-domain corpora for model training
• Effectiveness of the techniques
• Demonstrated in the IBM Arabic-to-English translation system performances in the Open Data Track