IBM Arabic-to-English Translation for IWSLT 2006 · 2016. 5. 23. · mkAn AyrfAyn fy Al# sbAq gdA...

14
IBM T. J. Watson Research Center IBM Arabic-to-English Translation for IWSLT 2006 © 2006 IBM Corporation Young-Suk Lee IWSLT 2006 November 28, 2006, Kyoto, Japan

Transcript of IBM Arabic-to-English Translation for IWSLT 2006 · 2016. 5. 23. · mkAn AyrfAyn fy Al# sbAq gdA...

  • IBM T. J. Watson Research Center

    IBM Arabic-to-English Translation for IWSLT 2006

    © 2006 IBM Corporation

    Young-Suk Lee

    IWSLT 2006November 28, 2006, Kyoto, Japan

  • IBM T. J. Watson Research Center

    © 2006 IBM Corporation

    Outline

    • Baseline Decoding

    • Technical Challenges and Solutions

    • IWSLT 2006 Evaluation Results

    • Conclusions

  • IBM T. J. Watson Research Center

    © 2006 IBM Corporation

    Baseline Decoding

    IBM T. J. Watson Research Center

    © 2006 IBM Corporation

    Phrase translation models

    - Direct / source channel / unigram models

    - [Tillmann 2003], [Lee et al. 2006]

    Modified IBM Model 1 cost [Lee et al. 2006]

    Word trigram language models

    Distortion models [Al-Onaizan & Papineni 2006]

    Word/block count penalty [Zens & Ney 2004]

    DP-based phrase

    decoder

    Translation output We are open from seven p.m. to midnight .

    Input sentence � ا� � '2(01 ا�/.- ,+*() '& ا�%$"!# "! ا��

    Pre-processing nftH mn AlsAbEp bEd AlZhr Aly mntSf Allyl

  • IBM T. J. Watson Research Center

    © 2006 IBM Corporation

    IWSLT 2006 Challenge: High OOV Rate

    6.3

    10.1

    10.5

    Avg. seg

    length

    157

    489

    609

    OOV Cnt

    3,164

    4,763

    5,229

    Token Cnt

    4.96 %Eval05

    10.27 %Dev06

    11.65 %Eval06

    OOV Rate

    Correct Recognition Result

  • IBM T. J. Watson Research Center

    © 2006 IBM Corporation

    Word Segmentation & Morphological Analysis

    Word Segmentation [Lee et al. 2003], Morphological Analysis [Lee 2004]

    wsyHl sA}q AltjArb fy jAgwAr AlbrAzyly lwsyAnw bwrty mkAn

    AyrfAyn fy AlsbAq gdA AlAHd Al*y sykwn Awly xTwAth fy EAlm

    sbAqAt AlfwrmwlA

    w# s# y# Hl sA}q Al# tjArb fy jAgwAr Al# brAzyly lwsyAnw bwrty

    mkAn AyrfAyn fy Al# sbAq gdA Al# AHd Al*y s# y# kwn Awly

    xTw +At +h fy EAlm sbAq +At AlfwrmwlA

    w# s# yHl sA}q ø tjArb fy jAgwAr Al# brAzyly lwsyAnw bwrty

    mkAn AyrfAyn fy Al# sbAq gdA ø AHd Al*y s# ykwn Awly

    xTwAt +h fy EAlm sbAqAt AlfwrmwlA

  • IBM T. J. Watson Research Center

    © 2006 IBM Corporation

    OOV Rate Reduction

    0

    2

    4

    6

    8

    10

    12

    Eval06 Dev06 Eval05

    OOV Rate (%)

    BTEC

    baseline

    BTEC

    segmentation

    BTEC

    analysis

  • IBM T. J. Watson Research Center

    © 2006 IBM Corporation

    Translation Quality ImprovementBLEUr7n4c

    BTEC

    baseline

    BTEC

    segmentation

    BTEC

    analysis

    0.2

    0.25

    0.3

    0.35

    0.4

    0.45

    0.5

    0.55

    0.6

    Eval06 Dev06 Eval05

    0.22

    0.24 0.25

    0.29

    0.51

    0.56

  • IBM T. J. Watson Research Center

    © 2006 IBM Corporation

    Out-of-Domain Corpora: Newswires

    20,000189,239159,213BTEC

    4,530,669

    117,420

    80,354

    819,354

    2,207,934

    310,079

    681,613

    150,865

    129,181

    33,869

    # EN words

    3,547,700

    86,614

    70,183

    616,879

    1,771,893

    227,792

    520,971

    123,505

    103,717

    26,146

    # AR words

    121,119OOD Total

    2,624FBIS

    2,346LDC2001T55

    24,874LDC2005E46

    52,042LDC2004E08

    8,576LDC2004E11

    20,358LDC2004E07

    5,003LDC2003E09

    4,235LDC2003E05

    1,043LDC2003T18

    # sent pairsSource

  • IBM T. J. Watson Research Center

    © 2006 IBM Corporation

    OOV Rate Reduction

    0

    2

    4

    6

    8

    10

    12

    Eval06 Dev06 Eval05

    OOV Rate (%)

    BTEC+OOD

    baseline

    BTEC+OOD

    segmentation

    BTEC+OOD

    analysis

    BTEC

    baseline

  • IBM T. J. Watson Research Center

    © 2006 IBM Corporation

    Eval05 Translation Quality Improvement

    0.55

    0.56

    0.57

    0.58

    0.59

    0.6

    0.61

    1 ; 1 2 ; 1 3 ; 1 4 ; 1 5 ; 1 6 ; 1 1 ; 0

    BTEC ; OOD Corpora Ratio

    BLEUr16n4c baseline

    segmentation

    analysis

  • IBM T. J. Watson Research Center

    © 2006 IBM Corporation

    Dev06 Translation Quality Improvement

    0.24

    0.25

    0.26

    0.27

    0.28

    0.29

    0.3

    0.31

    0.32

    1 ; 1 2 ; 1 3 ; 1 4 ; 1 5 ; 1 6 ; 1 1 ; 0

    BLEUr7n4c

    BTEC ; OOD Corpora Ratio

    baseline

    segmentation

    analysis

  • IBM T. J. Watson Research Center

    © 2006 IBM Corporation

    Eval06 Translation Quality Improvement

    0.2

    0.22

    0.24

    0.26

    segmentation analysis

    BLEUr7n4c

    0.2

    0.22

    0.24

    0.26

    segmentation analysis

    Correct Recognition ASR Output

    BTEC ; OOD Corpora Ratio = 4 ; 1

    BTEC

    BTEC +

    OOD

    0.235

    0.253

    0.244

    0.235

    0.2120.215

    0.224

    0.218

  • IBM T. J. Watson Research Center

    © 2006 IBM Corporation

    IWSLT 2006 Open Data Track Results

    0.49580.60350.48426.48670.2428Additional

    0.51980.60490.48455.84660.2274Official

    PERWERMETEORNISTBLEU4

    ASR OutputASR OutputASR OutputASR Output

    0.44800.55930.53147.16810.2773Additional

    0.48250.56680.53166.37690.2549Official

    PERWERMETEORNISTBLEU4

    Correct Recognition ResultCorrect Recognition ResultCorrect Recognition ResultCorrect Recognition Result

    Scores in BLUE indicate the best scores under the given condition

  • IBM T. J. Watson Research Center

    © 2006 IBM Corporation

    Conclusions

    • Techniques for improving translation quality & increasing vocabulary coverage

    • Word segmentation & morphological analysis

    • Proper combination of domain-specific and out-of-domain corpora for model training

    • Effectiveness of the techniques

    • Demonstrated in the IBM Arabic-to-English translation system performances in the Open Data Track