Chenhui Chu , Toshiaki Nakazawa , Sadao Kurohashi


Page 1: Chenhui Chu , Toshiaki  Nakazawa ,  Sadao Kurohashi

Accurate Parallel Fragment Extraction from Quasi-Comparable Corpora using Alignment Model and Translation Lexicon

Chenhui Chu, Toshiaki Nakazawa, Sadao Kurohashi
Graduate School of Informatics, Kyoto University

IJCNLP 2013 (2013/10/17)

Page 2: Chenhui Chu , Toshiaki  Nakazawa ,  Sadao Kurohashi

Outline

• Background
• Related Work
• Proposed Method
• Experiments
• Conclusion

Page 3: Chenhui Chu , Toshiaki  Nakazawa ,  Sadao Kurohashi

Outline

• Background
• Related Work
• Proposed Method
• Experiments
• Conclusion

Page 4: Chenhui Chu , Toshiaki  Nakazawa ,  Sadao Kurohashi

Bilingual Corpora [Fung+ 2004]

Type | Definition | Example
Parallel | Sentence-aligned bilingual corpora | Europarl
Noisy Parallel | Bilingual translations of documents | Patent family
Comparable | Topic-aligned bilingual documents | Wikipedia
Quasi-Comparable | Very-non-parallel bilingual documents | This study

• Parallel corpora are scarce
• Parallel sentences can be extracted from noisy parallel and comparable corpora
• Quasi-comparable corpora are more widely available; however, they contain few parallel sentences

Page 5: Chenhui Chu , Toshiaki  Nakazawa ,  Sadao Kurohashi

Parallel Fragments

• In quasi-comparable corpora, there could be parallel fragments in comparable sentences
• Parallel fragments are also helpful for SMT
• We aim to accurately extract parallel fragments from comparable sentences

Zh: 应用 / 铅 / 离子 / 选择 / 电极 / 电位 / 滴定 / 法 / 测定 / 甘草 / 及 / 其 / 制品 / 中 / 的 / 甘草 / 酸
(Applying lead ion selective electrode potentiometric titration method to determine licorice and its products' glycyrrhizic acid)

Ja: < / 原 / 報 / > / 鉛 / イオン / 選択 / 性 / 電極 / を / 用いる / 混合 / 試料 / 中 / の / … / と / 電位 / 差 / 滴定 / 法 / の / 比較
(<Original Report> lead ion selective electrode used mixed sample's … and potentiometric titration method's comparison)

Page 6: Chenhui Chu , Toshiaki  Nakazawa ,  Sadao Kurohashi

Outline

• Background
• Related Work
• Proposed Method
• Experiments
• Conclusion

Page 7: Chenhui Chu , Toshiaki  Nakazawa ,  Sadao Kurohashi

Parallel Sub-sentential Fragment Extraction [Munteanu+ 2006]

1. Extract a translation lexicon from a parallel corpus
2. Apply a lexicon filter to comparable sentences in two directions independently
   – Assign initial scores according to the lexicon
   – Smooth the scores to gain new knowledge that does not exist in the lexicon
3. Extract sub-sentential (not exactly parallel) fragments
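The three steps above can be sketched for a single direction as follows. This is only an illustration of the general idea, not the implementation of [Munteanu+ 2006]: the `-1.0` score for unknown words, the 3-word averaging window, the 3-word minimum length and all function names are assumptions made for the example.

```python
from typing import Dict, List, Tuple


def initial_scores(src: List[str], tgt: List[str],
                   lexicon: Dict[Tuple[str, str], float]) -> List[float]:
    """Score each source word by its best lexicon probability against any
    target word, or -1.0 if the lexicon has no translation (assumed value)."""
    scores = []
    for s in src:
        best = max((lexicon.get((s, t), 0.0) for t in tgt), default=0.0)
        scores.append(best if best > 0.0 else -1.0)
    return scores


def smooth(scores: List[float], window: int = 3) -> List[float]:
    """Moving-average filter over the scores (window size is an assumption)."""
    half = window // 2
    out = []
    for i in range(len(scores)):
        neighbourhood = scores[max(0, i - half): i + half + 1]
        out.append(sum(neighbourhood) / len(neighbourhood))
    return out


def positive_runs(scores: List[float], min_len: int = 3) -> List[Tuple[int, int]]:
    """Return (start, end) spans, end exclusive, of consecutive positive scores."""
    runs, start = [], None
    for i, value in enumerate(scores + [-1.0]):  # sentinel closes the last run
        if value > 0 and start is None:
            start = i
        elif value <= 0 and start is not None:
            if i - start >= min_len:
                runs.append((start, i))
            start = None
    return runs


# Toy data: word "c" has no translation in the lexicon.
src = ["a", "b", "c", "d", "e"]
tgt = ["A", "B", "D", "E"]
lexicon = {("a", "A"): 0.9, ("b", "B"): 0.8, ("d", "D"): 0.7, ("e", "E"): 0.9}

raw = initial_scores(src, tgt, lexicon)
print(positive_runs(raw))          # no run reaches 3 words
print(positive_runs(smooth(raw)))  # smoothing bridges the gap at "c"
```

In this toy case the smoothed scores pull the untranslated word into the fragment, which is why fragments extracted this way are not exactly parallel.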

Page 8: Chenhui Chu , Toshiaki  Nakazawa ,  Sadao Kurohashi

Lexicon Filter on Ja-to-Zh Direction

[Figure: initial scores and smoothed scores, on a vertical axis from -1.5 to 1.5, plotted over the words of the example Zh and Ja sentences]

Page 9: Chenhui Chu , Toshiaki  Nakazawa ,  Sadao Kurohashi

Lexicon Filter on Zh-to-Ja Direction

[Figure: initial scores and smoothed scores, on a vertical axis from -1.5 to 1.5, plotted over the words of the example Zh and Ja sentences]

Page 10: Chenhui Chu , Toshiaki  Nakazawa ,  Sadao Kurohashi

Outline

• Background
• Related Work
• Proposed Method
• Experiments
• Conclusion

Page 11: Chenhui Chu , Toshiaki  Nakazawa ,  Sadao Kurohashi

System Overview

[Figure: system pipeline with steps numbered (1) to (5). Labeled elements: source corpora, target corpora, SMT, translated sentences, IR (step (2): top N results), classifier, comparable sentences, parallel corpus, alignment, parallel fragment candidates, lexicon filter, parallel fragments]

Use an alignment model to locate the source and target fragment candidates simultaneously

Use a more accurate lexicon filter

Page 12: Chenhui Chu , Toshiaki  Nakazawa ,  Sadao Kurohashi

Parallel Fragment Candidate Detection by Alignment

Candidates are the longest monotonically aligned fragments that contain no NULL alignments and are more than 3 tokens long
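A minimal sketch of this candidate detection, under simplifying assumptions that are not stated on the slide: the alignment is one-to-one, it is given as a list where position `i` holds the target index aligned to source token `i` (or `None` for NULL), and "monotonic" is read as consecutive source tokens aligning to consecutive target tokens. The function name and data layout are hypothetical.

```python
from typing import List, Optional, Tuple


def detect_candidates(alignment: List[Optional[int]],
                      min_len: int = 4) -> List[Tuple[int, int, int, int]]:
    """Find maximal runs of source tokens that are all aligned (non-NULL)
    and whose target indices increase by exactly one at each step.

    Returns (src_start, src_end, tgt_start, tgt_end) with exclusive ends.
    min_len=4 encodes "more than 3 tokens".
    """
    candidates = []
    start = None  # start of the current run, or None when outside a run
    n = len(alignment)
    for i in range(n + 1):
        continues = (
            start is not None
            and i < n
            and alignment[i] is not None
            and alignment[i] == alignment[i - 1] + 1
        )
        if continues:
            continue
        # The run (if any) ends just before position i.
        if start is not None and i - start >= min_len:
            candidates.append((start, i, alignment[start], alignment[i - 1] + 1))
        # A new run may begin at i if this token is aligned.
        start = i if i < n and alignment[i] is not None else None
    return candidates


# Source token 0 is unaligned; tokens 1-4 align monotonically to targets 5-8;
# tokens 6-7 form a run that is too short to keep.
example = [None, 5, 6, 7, 8, None, 2, 3]
print(detect_candidates(example))
```

Real GIZA++ output after symmetrization can contain many-to-many links, so an actual implementation would need a richer representation than this one-index-per-token list.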

Page 13: Chenhui Chu , Toshiaki  Nakazawa ,  Sadao Kurohashi

Lexicon Filter − Assign Initial Scores

Assign scores in two directions to aligned word pairs in the candidates according to translation lexicon
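As a rough illustration of the scoring step, the sketch below looks up each aligned word pair in two directional lexicons. The dictionary layout, the function name and the `-1.0` score for pairs missing from the lexicon are assumptions for the example; the slide does not specify them.

```python
from typing import Dict, List, Tuple

Lexicon = Dict[Tuple[str, str], float]


def assign_initial_scores(pairs: List[Tuple[str, str]],
                          zh2ja: Lexicon,
                          ja2zh: Lexicon,
                          unknown: float = -1.0) -> Tuple[List[float], List[float]]:
    """Score every aligned (zh, ja) word pair in both directions.

    A pair found in a lexicon gets its lexicon score; a pair that is
    missing gets the negative `unknown` score (assumed convention).
    """
    zh_to_ja = [zh2ja.get((zh, ja), unknown) for zh, ja in pairs]
    ja_to_zh = [ja2zh.get((ja, zh), unknown) for zh, ja in pairs]
    return zh_to_ja, ja_to_zh


# Toy lexicons with made-up probabilities.
zh2ja = {("电极", "電極"): 0.8, ("电位", "電位"): 0.7}
ja2zh = {("電極", "电极"): 0.9, ("電位", "电位"): 0.6}
pairs = [("电极", "電極"), ("电位", "電位"), ("的", "比較")]

zh_scores, ja_scores = assign_initial_scores(pairs, zh2ja, ja2zh)
print(zh_scores, ja_scores)
```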

Page 14: Chenhui Chu , Toshiaki  Nakazawa ,  Sadao Kurohashi

Lexicon Filter − Score Smoothing

Only smooth a word with a negative score when both the left and right words around it have positive scores
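The condition above can be sketched as follows. The slide does not say which value a smoothed word receives, so replacing it with the mean of its two neighbours is an assumption made here, as is the function name.

```python
from typing import List


def smooth_scores(scores: List[float]) -> List[float]:
    """Smooth a negative score only when both neighbours are positive.

    Neighbours are read from the original scores, so a smoothed word
    cannot trigger further smoothing. The first and last words are never
    smoothed because they lack a neighbour on one side.
    """
    smoothed = list(scores)
    for i in range(1, len(scores) - 1):
        left, right = scores[i - 1], scores[i + 1]
        if scores[i] < 0 and left > 0 and right > 0:
            smoothed[i] = (left + right) / 2  # assumed replacement value
    return smoothed


# Index 1 is an isolated negative score and gets smoothed;
# indices 3 and 4 are adjacent negatives and stay negative.
print(smooth_scores([0.5, -1.0, 0.7, -1.0, -1.0, 0.4]))
```

Compared with averaging over every position, this stricter rule leaves runs of unknown words untouched, which limits how much non-parallel material can be absorbed into a fragment.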

Page 15: Chenhui Chu , Toshiaki  Nakazawa ,  Sadao Kurohashi

Fragment Extraction

Fragments of more than 3 tokens with continuous positive scores in both directions
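A minimal sketch of this extraction rule, assuming the two score lists are indexed by the same aligned word-pair positions inside a candidate. The function name and the span representation are hypothetical.

```python
from typing import List, Tuple


def extract_fragments(zh_to_ja: List[float],
                      ja_to_zh: List[float],
                      min_len: int = 4) -> List[Tuple[int, int]]:
    """Return (start, end) spans, end exclusive, where every position has a
    positive score in both directions. min_len=4 encodes "more than 3 tokens".
    """
    if len(zh_to_ja) != len(ja_to_zh):
        raise ValueError("score lists must have the same length")
    fragments, start = [], None
    n = len(zh_to_ja)
    for i in range(n + 1):
        positive = i < n and zh_to_ja[i] > 0 and ja_to_zh[i] > 0
        if positive and start is None:
            start = i
        elif not positive and start is not None:
            if i - start >= min_len:
                fragments.append((start, i))
            start = None
    return fragments


# Position 4 is negative in the Zh-to-Ja direction, which splits the
# candidate; only the first run is long enough to keep.
zh_to_ja = [0.5, 0.6, 0.7, 0.4, -1.0, 0.3, 0.2]
ja_to_zh = [0.4, 0.5, 0.6, 0.9, 0.1, 0.3, 0.2]
print(extract_fragments(zh_to_ja, ja_to_zh))
```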

Page 16: Chenhui Chu , Toshiaki  Nakazawa ,  Sadao Kurohashi

Outline

• Background
• Related Work
• Proposed Method
• Experiments
  – Parallel Fragment Extraction
  – Translation
• Conclusion

Page 17: Chenhui Chu , Toshiaki  Nakazawa ,  Sadao Kurohashi

Experimental settings (Parallel Fragment Extraction 1/2)

• Parallel corpus: Zh-Ja abstract corpus (680k sentences, scientific domain)
• Quasi-comparable corpora
  – Chinese corpora: CNKI (90k articles, 420k sentences, chemistry domain)
  – Japanese corpora: CiNii (880k articles, 5M sentences, scientific domain)
• Comparable sentences: 30k chemistry domain sentences were extracted

Page 18: Chenhui Chu , Toshiaki  Nakazawa ,  Sadao Kurohashi

Experimental settings (Parallel Fragment Extraction 2/2)

• Alignment: GIZA++ with symmetrization heuristics
  – Only: use only the extracted comparable sentences
  – External: use them together with the 11k chemistry domain sentences in the parallel corpus
• Translation lexicon
  – IBM Model 1 [Brown+ 1993]
  – Log-Likelihood-Ratio (LLR) [Munteanu+ 2006]
  – Sub-corpora sampling lexicon (SampLEX) [Vulic+ 2012]
• Comparison with [Munteanu+ 2006]
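For readers unfamiliar with the LLR lexicon, the sketch below computes the standard log-likelihood-ratio statistic for one word pair from a 2x2 co-occurrence table. It is a generic textbook formulation, not necessarily the exact variant used in [Munteanu+ 2006], and the function name and count layout are assumptions.

```python
import math


def llr(k11: int, k12: int, k21: int, k22: int) -> float:
    """Log-likelihood-ratio statistic for a 2x2 contingency table.

    k11: source word and target word co-occur
    k12: source word occurs without the target word
    k21: target word occurs without the source word
    k22: neither word occurs
    """
    total = k11 + k12 + k21 + k22
    rows = (k11 + k12, k21 + k22)
    cols = (k11 + k21, k12 + k22)
    observed = ((k11, k12), (k21, k22))
    g2 = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:  # a zero cell contributes nothing to the sum
                expected = rows[i] * cols[j] / total
                g2 += o * math.log(o / expected)
    return 2.0 * g2


print(llr(10, 10, 10, 10))  # independent words: statistic is zero
print(llr(20, 5, 5, 20))    # strongly associated words: large statistic
```

The statistic alone does not distinguish positive from negative association, so a lexicon built from it would also need to compare the observed co-occurrence count with the expected one.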

Page 19: Chenhui Chu , Toshiaki  Nakazawa ,  Sadao Kurohashi

Results

Method | # fragments | Avg size (Zh/Ja) | Accuracy
[Munteanu+ 2006] | 28.4k | 20.36/21.39 | (1%)
Only (IBM Model 1) | 18.9k | 4.03/4.14 | 80%
Only (LLR) | 18.3k | 4.00/4.14 | 89%
Only (SampLEX) | 18.4k | 3.96/4.05 | 87%
External (IBM Model 1) | 28.7k | 4.18/4.33 | 81%
External (LLR) | 26.9k | 4.17/4.33 | 85%
External (SampLEX) | 28.0k | 4.11/4.23 | 82%

※ Accuracy: 100 fragments were manually evaluated based on exact match

Page 20: Chenhui Chu , Toshiaki  Nakazawa ,  Sadao Kurohashi

Experimental Settings (Translation)

• Baseline: Zh-Ja paper abstract corpus (680k sentences, including 11k chemistry domain sentences)
• Tuning: 368 sentences of chemistry domain
• Testing: 367 sentences of chemistry domain
• Decoder: Moses
• Language model: 5-gram language model trained on the Ja side of the parallel corpus using SRILM
• Compare MT performance by appending the extracted fragments to the baseline training data

Page 21: Chenhui Chu , Toshiaki  Nakazawa ,  Sadao Kurohashi

BLEU-4 for Different Systems

[Figure: bar chart of BLEU-4 scores, on a vertical axis from 38 to 40, for the systems Baseline, +Sentence, +Munteanu+ 2006, +Only (IBM Model 1), +Only (LLR), +Only (SampLEX), +External (IBM Model 1), +External (LLR) and +External (SampLEX)]

※ "*" denotes that the result is significantly better than "Baseline" at p < 0.05

Page 22: Chenhui Chu , Toshiaki  Nakazawa ,  Sadao Kurohashi

Outline

• Background
• Related Work
• Proposed Method
• Experiments
• Conclusion

Page 23: Chenhui Chu , Toshiaki  Nakazawa ,  Sadao Kurohashi

Conclusion

• We proposed an accurate parallel fragment extraction system using an alignment model and a translation lexicon
• Future work
  – A method to deal with ordering
  – A parallel corpus independent method
  – Try other language pairs and domains

Page 24: Chenhui Chu , Toshiaki  Nakazawa ,  Sadao Kurohashi

Thank you for your attention!

Page 25: Chenhui Chu , Toshiaki  Nakazawa ,  Sadao Kurohashi

Examples of Extracted Fragment Pairs

ID | Zh Fragment | Ja Fragment
1 | 直接甲醇燃料电池 | 直接メタノール燃料電池
2 | X射线光电子能谱(XPS) | X線光電子分光法(XPS)
3 | (OH)24(H2O)12] | (OH)24(H2O)12]
4 | 的原生质体融合 | のプロトプラスト融合
5 | 分子动力学(MD)模拟了 | 分子動力学(MD)シミュレーションを
6 | 扫描电子显微镜(SEM)、透射电子显微镜(TEM) | 型電子顕微鏡(SEM),透過型電子顕微鏡(TEM)
7 | 证明了本算法的 | から本アルゴリズムの
8 | X射线粉末衍射 | X線回折分析

※ Noise is written in red font
• Most noise is due to the noisy translation lexicon (Examples 5-7)
• Score smoothing also produces some noise (Example 8)