Chenhui Chu, Toshiaki Nakazawa, Sadao Kurohashi

Accurate Parallel Fragment Extraction from Quasi-Comparable Corpora using Alignment Model and Translation Lexicon

Chenhui Chu, Toshiaki Nakazawa, Sadao Kurohashi
Graduate School of Informatics, Kyoto University

IJCNLP2013 (2013/10/17)

Outline

• Background
• Related Work
• Proposed Method
• Experiments
• Conclusion


Bilingual Corpora [Fung+ 2004]

Type              Definition                              Example
Parallel          Sentence-aligned bilingual corpora      Europarl
Noisy Parallel    Bilingual translations of documents     Patent family
Comparable        Topic-aligned bilingual documents       Wikipedia
Quasi-Comparable  Very-non-parallel bilingual documents   this study

• Lack of parallel corpora
• Parallel sentences can be extracted from noisy and comparable corpora
• Quasi-comparable corpora are more available, but they contain few parallel sentences

Parallel Fragments

• In quasi-comparable corpora, there could be parallel fragments in comparable sentences
• Parallel fragments are also helpful for SMT
• We aim to accurately extract parallel fragments from comparable sentences

Zh: 应用 / 铅 / 离子 / 选择 / 电极 / 电位 / 滴定 / 法 / 测定 / 甘草 / 及 / 其 / 制品 / 中 / 的 / 甘草 / 酸
(Applying lead ion selective electrode potentiometric titration method to determine licorice and its products 's glycyrrhizic acid)

Ja: ＜ / 原 / 報 / ＞ / 鉛 / イオン / 選択 / 性 / 電極を / 用いる / 混合 / 試料 / 中 / の / … / と / 電位 / 差 / 滴定 / 法 / の / 比較
(<Original Report> lead ion selective electrode used mixed sample 's … and potentiometric titration method 's comparison)


Parallel Sub-sentential Fragment Extraction [Munteanu+ 2006]

1. Extract translation lexicon from a parallel corpus
2. Apply a lexicon filter to comparable sentences in two directions independently
   – Assign initial scores according to the lexicon
   – Score smoothing to gain new knowledge that does not exist in the lexicon
3. Extract sub-sentential (not exactly parallel) fragment
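The three steps above can be illustrated with a minimal one-direction sketch. It is not the implementation of [Munteanu+ 2006]. The toy lexicon, the default score of -1.0 for uncovered words, and the moving-average window of 3 are all assumptions made for illustration; the slide only states that scores are assigned from the lexicon and then smoothed.

```python
from typing import Dict, List, Tuple

# Hypothetical lexicon type: (source word, target word) -> association score.
Lexicon = Dict[Tuple[str, str], float]


def initial_scores(src: List[str], tgt: List[str], lexicon: Lexicon,
                   default: float = -1.0) -> List[float]:
    """Give each source word its best lexicon score against the target sentence.

    Words with no lexicon entry get `default` (an assumed negative value).
    """
    scores = []
    for s in src:
        candidates = [lexicon[(s, t)] for t in tgt if (s, t) in lexicon]
        scores.append(max(candidates) if candidates else default)
    return scores


def smooth(scores: List[float], window: int = 3) -> List[float]:
    """Moving-average filter (assumed smoothing; window size is illustrative)."""
    half = window // 2
    out = []
    for i in range(len(scores)):
        lo, hi = max(0, i - half), min(len(scores), i + half + 1)
        out.append(sum(scores[lo:hi]) / (hi - lo))
    return out


def positive_spans(scores: List[float]) -> List[Tuple[int, int]]:
    """Return (start, end) index pairs, end exclusive, of runs of positive scores."""
    spans, start = [], None
    for i, value in enumerate(scores + [0.0]):  # sentinel closes the last run
        if value > 0 and start is None:
            start = i
        elif value <= 0 and start is not None:
            spans.append((start, i))
            start = None
    return spans
```

In this sketch a word that the lexicon does not cover can still end up inside a fragment when its neighbours score well, which is one way to read "gain new knowledge that does not exist in the lexicon". The same mechanism can also pull in non-parallel words.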

Lexicon Filter on Ja-to-Zh Direction

[Figure: initial score and smoothed score per word for the example Zh-Ja sentence pair; vertical axis from -1.5 to 1.5]

Lexicon Filter on Zh-to-Ja Direction

[Figure: initial score and smoothed score per word for the example Zh-Ja sentence pair; vertical axis from -1.5 to 1.5]


System Overview

[Figure: system pipeline with steps numbered (1) to (5); step (2) is labelled "IR: top N results". Components shown: source corpora, target corpora, parallel corpus, SMT, translated sentences, classifier, comparable sentences, alignment, parallel fragment candidates, lexicon filter, parallel fragments]

Use an alignment model to locate the source and target fragment candidates simultaneously

Use a more accurate lexicon filter

Parallel Fragment Candidate Detection by Alignment

Candidates: the longest monotonic, non-NULL aligned fragments of more than 3 tokens
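A minimal sketch of this candidate detection follows, under stated assumptions. The alignment is taken to be one-to-one and stored as a dict from source index to target index, and "monotonic" is read strictly: each next source token must align to the next target token. The function name and the `min_tokens` default of 4 (a literal reading of "more than 3 tokens") are illustrative, not taken from the slides.

```python
from typing import Dict, List, Tuple


def candidate_fragments(alignment: Dict[int, int], src_len: int,
                        min_tokens: int = 4) -> List[Tuple[int, int, int, int]]:
    """Find maximal runs of consecutively aligned tokens.

    Returns (src_start, src_end, tgt_start, tgt_end) with exclusive ends.
    Source tokens missing from `alignment` are treated as NULL-aligned.
    """
    fragments = []
    i = 0
    while i < src_len:
        if i not in alignment:  # NULL-aligned token cannot start a fragment
            i += 1
            continue
        start = i
        # Extend while the next source token aligns to the next target token.
        while (i + 1) in alignment and alignment[i + 1] == alignment[i] + 1:
            i += 1
        if i - start + 1 >= min_tokens:
            fragments.append((start, i + 1, alignment[start], alignment[i] + 1))
        i += 1
    return fragments


# Source tokens 0-3 align to target tokens 2-5; token 4 is NULL-aligned.
links = {0: 2, 1: 3, 2: 4, 3: 5, 5: 0, 6: 1}
print(candidate_fragments(links, src_len=7))  # [(0, 4, 2, 6)]
```

Because each run is extended as far as possible before it is emitted, only the longest fragment at each position is kept. The two-token run at source positions 5-6 is discarded by the length threshold.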

Lexicon Filter − Assign Initial Scores

Assign scores in two directions to the aligned word pairs in the candidates according to the translation lexicon
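The scoring step can be sketched as a pair of dictionary lookups. The two lexicons, their scores, and the value of -1.0 for pairs missing from a lexicon are hypothetical; the slides do not give the actual score values.

```python
from typing import Dict, List, Tuple

# Hypothetical lexicon type: (word, translation) -> positive association score.
Lexicon = Dict[Tuple[str, str], float]


def assign_initial_scores(pairs: List[Tuple[str, str]],
                          lex_s2t: Lexicon, lex_t2s: Lexicon,
                          missing: float = -1.0) -> Tuple[List[float], List[float]]:
    """Score each aligned (source, target) word pair in both directions.

    A pair absent from a lexicon gets `missing` (assumed negative score).
    """
    s2t = [lex_s2t.get((s, t), missing) for s, t in pairs]
    t2s = [lex_t2s.get((t, s), missing) for s, t in pairs]
    return s2t, t2s


# Aligned word pairs from the example sentence; scores are made up.
pairs = [("铅", "鉛"), ("离子", "イオン"), ("选择", "選択"), ("电极", "電極を")]
lex_zh_ja = {("铅", "鉛"): 0.9, ("离子", "イオン"): 0.8, ("选择", "選択"): 0.7}
lex_ja_zh = {("鉛", "铅"): 0.9, ("イオン", "离子"): 0.6}
zh_to_ja, ja_to_zh = assign_initial_scores(pairs, lex_zh_ja, lex_ja_zh)
```

Scoring only the word pairs linked by the alignment, rather than every source word against the whole target sentence, is what ties the filter to the alignment step.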

Lexicon Filter − Score Smoothing

Only smooth a word with negative score when both the left and right words around it have positive scores
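A minimal sketch of this restricted smoothing follows. The condition is the one stated above; the replacement value (the mean of the two neighbouring scores) and the choice to test neighbours on the unsmoothed scores are assumptions, since the slide does not say what score the smoothed word receives.

```python
from typing import List


def smooth_scores(scores: List[float]) -> List[float]:
    """Smooth a negative score only when both neighbours are positive.

    The first and last words have a single neighbour and are never smoothed.
    """
    smoothed = list(scores)
    for i in range(1, len(scores) - 1):
        left, right = scores[i - 1], scores[i + 1]
        if scores[i] < 0 and left > 0 and right > 0:
            smoothed[i] = (left + right) / 2.0  # assumed replacement value
    return smoothed


print(smooth_scores([1.0, -1.0, 0.5, -1.0, -1.0, 1.0]))
# [1.0, 0.75, 0.5, -1.0, -1.0, 1.0]
```

The isolated negative word at index 1 is bridged, while the two adjacent negative words at indices 3 and 4 are left alone, so a run of unsupported words still splits a fragment.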

Fragment Extraction

Fragments more than 3 tokens with continuous positive scores in both directions
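The extraction rule can be sketched as a scan for runs in which both directional scores are positive. The function name is illustrative, and `min_tokens=4` is a literal reading of "more than 3 tokens", kept as a parameter because the exact threshold is an assumption.

```python
from typing import List, Tuple


def extract_fragments(s2t: List[float], t2s: List[float],
                      min_tokens: int = 4) -> List[Tuple[int, int]]:
    """Return (start, end) spans, end exclusive, positive in both directions."""
    if len(s2t) != len(t2s):
        raise ValueError("score lists must have the same length")
    spans, start = [], None
    n = len(s2t)
    for i in range(n + 1):  # the extra iteration closes a run at the end
        positive = i < n and s2t[i] > 0 and t2s[i] > 0
        if positive and start is None:
            start = i
        elif not positive and start is not None:
            if i - start >= min_tokens:
                spans.append((start, i))
            start = None
    return spans


s2t = [1.0, 1.0, 1.0, 1.0, -1.0, 1.0, 1.0, 1.0]
t2s = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, -1.0, 1.0]
print(extract_fragments(s2t, t2s))  # [(0, 4)]
```

Requiring agreement in both directions means a single negative score in either direction ends the run, which trades recall for precision.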

Outline

• Background
• Related Work
• Proposed Method
• Experiments
  – Parallel Fragment Extraction
  – Translation
• Conclusion

Experimental settings (Parallel Fragment Extraction 1/2)

• Parallel corpus: Zh-Ja abstract corpus (680k sentences, scientific domain)

• Quasi-Comparable Corpora
  – Chinese corpora: CNKI (90k articles, 420k sentences, chemistry domain)
  – Japanese corpora: CiNii (880k articles, 5M sentences, scientific domain)
• Comparable sentences: 30k chemistry domain sentences were extracted

Experimental settings (Parallel Fragment Extraction 2/2)

• Alignment: GIZA++ with symmetrization heuristics
  – Only: only use the extracted comparable sentences
  – External: together with 11k chemistry domain data in the parallel corpus
• Translation lexicon
  – IBM Model 1 [Brown+ 1993]
  – Log-Likelihood-Ratio (LLR) [Munteanu+ 2006]
  – Sub-corpora sampling lexicon (SampLEX) [Vulic+ 2012]
• Compare with [Munteanu+ 2006]
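For the LLR lexicon listed above, the underlying statistic can be sketched as the standard log-likelihood ratio (G²) over a 2x2 co-occurrence table. The count names and the sign convention in `signed_llr` (positive when the pair co-occurs more often than expected under independence) are assumptions for illustration; the slides do not specify how the lexicon scores were derived from the statistic.

```python
import math


def llr(k11: int, k12: int, k21: int, k22: int) -> float:
    """Log-likelihood ratio for a 2x2 contingency table.

    k11: sentence pairs containing both words; k12: source word only;
    k21: target word only; k22: neither word.
    """
    n = k11 + k12 + k21 + k22
    if n == 0:
        raise ValueError("empty contingency table")
    rows = (k11 + k12, k21 + k22)
    cols = (k11 + k21, k12 + k22)
    cells = ((k11, 0, 0), (k12, 0, 1), (k21, 1, 0), (k22, 1, 1))
    total = 0.0
    for count, r, c in cells:
        if count > 0:  # an empty cell contributes nothing
            total += count * math.log(count * n / (rows[r] * cols[c]))
    return 2.0 * total


def signed_llr(k11: int, k12: int, k21: int, k22: int) -> float:
    """Positive for co-occurrence above chance, negative otherwise (assumed)."""
    n = k11 + k12 + k21 + k22
    score = llr(k11, k12, k21, k22)
    expected = (k11 + k12) * (k11 + k21) / n
    return score if k11 > expected else -score
```

A table whose cells match the independence expectation, such as (10, 10, 10, 10), scores zero, and the score grows as the two words co-occur more or less often than chance predicts.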


Results

Method                  # fragments  Avg size (Zh/Ja)  Accuracy
[Munteanu+ 2006]        28.4k        20.36/21.39       (1%)
Only (IBM Model 1)      18.9k        4.03/4.14         80%
Only (LLR)              18.3k        4.00/4.14         89%
Only (SampLEX)          18.4k        3.96/4.05         87%
External (IBM Model 1)  28.7k        4.18/4.33         81%
External (LLR)          26.9k        4.17/4.33         85%
External (SampLEX)      28.0k        4.11/4.23         82%

※ Accuracy: manually evaluated 100 fragments based on exact match

Experimental Settings (Translation)

• Baseline: Zh-Ja paper abstract corpus (680k sentences, including 11k chemistry domain sentences)
• Tuning: 368 sentences of chemistry domain
• Testing: 367 sentences of chemistry domain
• Decoder: Moses
• Language model: 5-gram language model trained on the Ja side of the parallel corpus using SRILM
• Compare MT performance by appending the extracted fragments to the baseline training data

BLEU-4 for Different Systems

[Figure: bar chart of BLEU-4 scores, vertical axis from 38 to 40, for the systems Baseline, +Sentence, +Munteanu+ 2006, +Only (IBM Model 1), +Only (LLR), +Only (SampLEX), +External (IBM Model 1), +External (LLR), +External (SampLEX)]

※ "*" denotes that the result is significantly better than "Baseline" at p < 0.05


Conclusion

• We proposed an accurate parallel fragment extraction system using an alignment model and a translation lexicon
• Future Work
  – A method to deal with ordering
  – Parallel corpus independent method
  – Try other language pairs and domains

Thank you for your attention!

Examples of Extracted Fragment Pairs

ID  Zh Fragment  Ja Fragment
1  直接甲醇燃料电池  直接メタノール燃料電池
2  Ｘ射线光电子能谱（ＸＰＳ）  Ｘ線光電子分光法（ＸＰＳ）
3  （ＯＨ）２４（Ｈ２Ｏ）１２］  （ＯＨ）２４（Ｈ２Ｏ）１２］
4  的原生质体融合  のプロトプラスト融合
5  分子动力学（ＭＤ）模拟了  分子動力学（ＭＤ）シミュレーションを
6  扫描电子显微镜（ＳＥＭ）、透射电子显微镜（ＴＥＭ）  型電子顕微鏡（ＳＥＭ），透過型電子顕微鏡（ＴＥＭ）
7  证明了本算法的  から本アルゴリズムの
8  Ｘ射线粉末衍射  Ｘ線回折分析

※ Noise is written in red font
• Most noise is due to the noisy translation lexicon (Examples 5-7)
• Score smoothing also produces some noise (Example 8)
