Chenhui Chu , Toshiaki Nakazawa , Sadao Kurohashi


Page 1: Chenhui Chu , Toshiaki  Nakazawa ,  Sadao Kurohashi


Iterative Bilingual Lexicon Extraction from Comparable Corpora

with Topical and Contextual Knowledge

Chenhui Chu, Toshiaki Nakazawa, Sadao Kurohashi

Graduate School of Informatics, Kyoto University

CICLing2014 (2014/04/08)

Page 2: Chenhui Chu , Toshiaki  Nakazawa ,  Sadao Kurohashi


Background

• Bilingual lexicons are important for many bilingual NLP tasks, such as SMT and CLIR

• Manual construction is expensive and time-consuming

• Automatic construction from parallel corpora is possible, but parallel corpora remain a scarce resource

Automatic construction from comparable corpora

Page 3: Chenhui Chu , Toshiaki  Nakazawa ,  Sadao Kurohashi

Bilingual Lexicons in Comparable Corpora

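Zh: ...市场经济,又称自由市场经济,是一种经济体系,在这种体系下产品和服务的生产及销售完全由自由市场的自由价格机制所引导,而不是像计划经济一般由国家所引导。市场经济也被用作资本主义的同义词,但是绝大多数的社会主义国家也实行了市场经济。...

(English gloss of the Zh text: ...A market economy, also called a free market economy, is an economic system in which the production and sale of products and services are guided entirely by the free price mechanism of the free market, rather than by the state as in a planned economy. "Market economy" is also used as a synonym for capitalism, but the vast majority of socialist countries have also adopted a market economy...)

En: ...A market economy is an economy in which decisions regarding investment, production and distribution are based on supply and demand, and prices of goods and services are determined in a free price system. The major defining characteristic of a market economy is that decisions on ...

※ Example of comparable texts describing “market economy” from Wikipedia (bilingual lexicon entries are linked with blue lines).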

Page 4: Chenhui Chu , Toshiaki  Nakazawa ,  Sadao Kurohashi


Related Work

• Topic Model Based Method [Vulic+ 2011]

– Bilingual lexicon entries often appear in the same cross-lingual topics (document-level context)

– Does not require any prior knowledge

• Context Based Method [Rapp+ 1999]

– Bilingual lexicon entries appear in similar contexts across languages (usually window-based context)

– Requires a seed dictionary

Page 5: Chenhui Chu , Toshiaki  Nakazawa ,  Sadao Kurohashi


System Overview

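[Figure: system pipeline]

Comparable corpora → Topic model based method → Topical bilingual lexicons

Topical bilingual lexicons → (unsupervised seed dictionary) → Context based method → Contextual bilingual lexicons

Topical bilingual lexicons + Contextual bilingual lexicons → Combination → Combined bilingual lexicons

Combined bilingual lexicons → (seed dictionary, iteration) → Context based method

Example candidates for 市场 (market):

Topical bilingual lexicons: company 0.0626, market 0.0600, consumer 0.0474, ・・・

Contextual bilingual lexicons: consumer 0.0840, market 0.0680, company 0.0557, ・・・

Combined bilingual lexicons: market 0.0616, company 0.0612, consumer 0.0547, ・・・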
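The control flow of the overview can be sketched roughly as follows. This is only an illustration of the loop, not the authors' implementation: `context_method` and `combine` are hypothetical callables standing in for the context based method and the combination step, and building the seed dictionary from the top-ranked candidate of each source word is an assumption made here for simplicity.

```python
from typing import Callable, Dict

# source word -> {target candidate: similarity score}
Scores = Dict[str, Dict[str, float]]


def top1_seed(scores: Scores) -> Dict[str, str]:
    """Assumed seed construction: keep the best-scoring candidate per source word."""
    return {src: max(cands, key=cands.get) for src, cands in scores.items() if cands}


def iterate(
    topical: Scores,
    context_method: Callable[[Dict[str, str]], Scores],  # hypothetical
    combine: Callable[[Scores, Scores], Scores],          # hypothetical
    iterations: int = 20,
) -> Scores:
    """Alternate between the context based method and the combination step."""
    # The topical lexicons act as the initial (unsupervised) seed dictionary.
    seed = top1_seed(topical)
    combined = topical
    for _ in range(iterations):
        contextual = context_method(seed)        # needs a seed dictionary
        combined = combine(topical, contextual)  # merge both knowledge sources
        seed = top1_seed(combined)               # seed for the next iteration
    return combined
```

Passing the two steps in as callables keeps the sketch focused on the iteration itself; any concrete scoring functions can be plugged in.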

Page 6: Chenhui Chu , Toshiaki  Nakazawa ,  Sadao Kurohashi

Topic Model Based Method

市场 : <T1: 0.0138, T2: 0.0087, T3: 0.0004, T4: 0.0102 ・・・ >

consumer: <T1: 0.0028, T2: 0.0009, T3: 0.0058, T4: 0.0037 ・・・ > Sim=0.0474

market: <T1: 0.0029, T2: 0.0039, T3: 0.0251, T4: 0.0081 ・・・ > Sim=0.0600

company: <T1: 0.0054, T2: 0.0120, T3: 0.0014, T4: 0.0089 ・・・ > Sim=0.0626

[Figure: BiLDA plate diagram with symbols α, θ (topic distribution), φ and ψ (word–topic distributions), β, z, w, M^S, M^T, K and D]

Page 7: Chenhui Chu , Toshiaki  Nakazawa ,  Sadao Kurohashi


Similarity Measure

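• TI: cosine similarity between the TF-ITF (term frequency, inverse topic frequency) vectors of a source word and a target word

• Cue: probability P(w^T | w^S) of the target word given the source word, computed through the shared topics

• TI+Cue: linear combination of the TI and Cue scores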
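As a rough illustration of a topic-based score, the sketch below computes a Cue-style probability P(w^T | w^S) by summing over shared topics. Several things here are assumptions: the per-topic values from the previous slide are read as P(word | topic), the topic prior is taken to be uniform when inverting P(w^S | z) into P(z | w^S), and only the four topics visible on the slide are used. The resulting numbers therefore do not reproduce the Sim values shown there; only the mechanics are illustrated.

```python
# Assumed to be P(word | topic) for the first four topics (truncated on the slide).
source_word = [0.0138, 0.0087, 0.0004, 0.0102]  # 市场
candidates = {
    "consumer": [0.0028, 0.0009, 0.0058, 0.0037],
    "market": [0.0029, 0.0039, 0.0251, 0.0081],
    "company": [0.0054, 0.0120, 0.0014, 0.0089],
}


def cue(p_ws_given_z, p_wt_given_z):
    """Sum over topics of P(w^T | z) * P(z | w^S), assuming a uniform P(z)."""
    total = sum(p_ws_given_z)
    p_z_given_ws = [p / total for p in p_ws_given_z]
    return sum(pt * pz for pt, pz in zip(p_wt_given_z, p_z_given_ws))


scores = {word: cue(source_word, vec) for word, vec in candidates.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

On these truncated vectors the candidates come out in the order company, market, consumer, the same order as the topical ranking on the slide, although the scores themselves differ.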

Page 8: Chenhui Chu , Toshiaki  Nakazawa ,  Sadao Kurohashi

Context Based Method

市场 : < 规律 : 24.9, 系统 : 38.4, 厂 : 8.9, 饮料 : 22.2 ・・・ >

↓ (projection via a seed dictionary)

市场 : <law: 24.9, system: 38.4, factory: 8.9, drink: 22.2 ・・・ >

consumer: <law: 23.5, system: 38.3, factory: 10.2, drink: 21.8 ・・・ > Sim=0.0840

market: <law: 14.6, system: 31.2, factory: 3.7, drink: 22.8 ・・・ > Sim=0.0680

company: <law: 64.0, system: 18.7, factory: 1.9, drink: 11.4 ・・・ > Sim=0.0557

Page 9: Chenhui Chu , Toshiaki  Nakazawa ,  Sadao Kurohashi


Context Modeling and Similarity

• Window-based context (±2)

– e.g. mainstream drink factory [market] law system sellers exchange goods services information (the context of "market" is drink, factory, law, system)

• Cosine similarity
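The three steps of the context based method (window-based context, projection via a seed dictionary, cosine similarity) can be sketched as below. This is a minimal sketch under stated assumptions: `window_context` uses raw co-occurrence counts, whereas the weights on the slides (24.9, 38.4, ...) come from a weighting scheme that is not specified here, and the toy seed dictionary is made up from the four words shown on the slide. The cosine values computed on these truncated four-dimensional vectors will not equal the Sim values on the slide.

```python
import math
from collections import Counter


def window_context(tokens, target, window=2):
    """Count words within +/- `window` positions of each occurrence of `target`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == target:
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            counts.update(t for j, t in enumerate(tokens[lo:hi], start=lo) if j != i)
    return counts


def project(vector, seed):
    """Map source-language context words to the target language; drop unknown words."""
    projected = Counter()
    for word, weight in vector.items():
        if word in seed:
            projected[seed[word]] += weight
    return projected


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


sentence = ("mainstream drink factory market law system "
            "sellers exchange goods services information").split()
context = window_context(sentence, "market")  # drink, factory, law, system

zh_vector = {"规律": 24.9, "系统": 38.4, "厂": 8.9, "饮料": 22.2}
seed = {"规律": "law", "系统": "system", "厂": "factory", "饮料": "drink"}  # toy seed
projected = project(zh_vector, seed)

candidates = {
    "consumer": {"law": 23.5, "system": 38.3, "factory": 10.2, "drink": 21.8},
    "market": {"law": 14.6, "system": 31.2, "factory": 3.7, "drink": 22.8},
    "company": {"law": 64.0, "system": 18.7, "factory": 1.9, "drink": 11.4},
}
ranking = sorted(candidates, key=lambda w: cosine(projected, candidates[w]), reverse=True)
print(ranking)
```

Words missing from the seed dictionary are simply dropped during projection, which is why a larger and more accurate seed dictionary gives the context based method more dimensions to compare.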

Page 10: Chenhui Chu , Toshiaki  Nakazawa ,  Sadao Kurohashi


Combination

• Combined similarity score

0.8 × (topical score) + 0.2 × (contextual score) = (combined score)

Example for 市场 (market):

Topical bilingual lexicons: company 0.0626, market 0.0600, consumer 0.0474, ・・・

Contextual bilingual lexicons: consumer 0.0840, market 0.0680, company 0.0557, ・・・

Combined bilingual lexicons: market 0.0616, company 0.0612, consumer 0.0547, ・・・
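The numbers on this slide can be checked with a few lines of arithmetic. The sketch below applies the 0.8 / 0.2 weighting to the three candidates for 市场 (market) and re-ranks them; the variable names are illustrative only.

```python
GAMMA = 0.8  # weight on the topical score; the contextual score gets 1 - GAMMA

topical = {"company": 0.0626, "market": 0.0600, "consumer": 0.0474}
contextual = {"consumer": 0.0840, "market": 0.0680, "company": 0.0557}

combined = {
    word: GAMMA * topical[word] + (1 - GAMMA) * contextual[word]
    for word in topical
}
ranked = sorted(combined.items(), key=lambda kv: kv[1], reverse=True)

for word, score in ranked:
    print(f"{word}\t{score:.4f}")
# market    0.0616
# company   0.0612
# consumer  0.0547
```

Neither method alone ranks "market" first for this word, but the interpolated score does, because "market" is ranked second by both methods while "company" and "consumer" each score well in only one of them.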

Page 11: Chenhui Chu , Toshiaki  Nakazawa ,  Sadao Kurohashi


Dataset

• Wikipedia: 10k Chinese-English and Japanese-English article pairs via the interlanguage links

• Kept only lemmatized noun forms

– Zh-En: 112k Chinese and 179k English nouns

– Ja-En: 48k Japanese and 188k English nouns

Page 12: Chenhui Chu , Toshiaki  Nakazawa ,  Sadao Kurohashi


Experimental Settings

• BiLDA topic model training: PolyLDA++ [Richardson+ 2013]

– α = 50/K, β = 0.01, Gibbs sampling with 1k iterations

• TI+Cue measure: BLETM [Vulic+ 2011]

• Proposed method

– Linear interpolation parameter γ = 0.8, 20 iterations
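For reference, the settings above can be collected in a small configuration object. This is only a convenience sketch: the class and field names are hypothetical and are not options of PolyLDA++ or BLETM. The two values of K are the ones that appear in the result plots.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    """Hypothetical container for the hyperparameters listed on the slide."""
    num_topics: int                # K
    beta: float = 0.01             # BiLDA hyperparameter β
    gibbs_iterations: int = 1000   # Gibbs sampling iterations
    gamma: float = 0.8             # linear interpolation parameter γ
    lexicon_iterations: int = 20   # iterations of the proposed method

    @property
    def alpha(self) -> float:
        """α = 50 / K, so it shrinks as the number of topics grows."""
        return 50.0 / self.num_topics


configs = [Settings(num_topics=k) for k in (200, 2000)]
for cfg in configs:
    print(cfg.num_topics, cfg.alpha)
```

With K = 200 this gives α = 0.25, and with K = 2000 it gives α = 0.025.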

Page 13: Chenhui Chu , Toshiaki  Nakazawa ,  Sadao Kurohashi


Evaluation Criterion

• Manually created Zh-En and Ja-En test sets for the 1k most frequent source words

• Metrics

– Precision@1

– Mean Reciprocal Rank (MRR) [Voorhees+, 1999]
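Both metrics are standard and can be computed as in the sketch below. The data layout (a ranked candidate list per source word and a set of gold translations per source word) and the toy entries are assumptions for illustration, not the authors' test set.

```python
def precision_at_1(rankings, gold):
    """Fraction of source words whose top-ranked candidate is a gold translation."""
    if not rankings:
        return 0.0
    hits = sum(
        1 for src, cands in rankings.items()
        if cands and cands[0] in gold.get(src, set())
    )
    return hits / len(rankings)


def mean_reciprocal_rank(rankings, gold):
    """Average of 1/rank of the first correct candidate (0 if none is correct)."""
    if not rankings:
        return 0.0
    total = 0.0
    for src, cands in rankings.items():
        for rank, cand in enumerate(cands, start=1):
            if cand in gold.get(src, set()):
                total += 1.0 / rank
                break
    return total / len(rankings)


# Toy data: the second ranking is invented to show a correct answer at rank 2.
rankings = {
    "市场": ["market", "company", "consumer"],
    "研究": ["study", "research"],
}
gold = {"市场": {"market"}, "研究": {"research"}}

print(precision_at_1(rankings, gold))        # 0.5
print(mean_reciprocal_rank(rankings, gold))  # 0.75
```

Precision@1 only rewards a correct top candidate, while MRR also gives partial credit when the correct translation appears lower in the list.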

Page 14: Chenhui Chu , Toshiaki  Nakazawa ,  Sadao Kurohashi


Results (Chinese-English Precision@1)

[Figure: Precision@1 (y-axis, 0.2 to 0.4) against Iteration (x-axis, 1 to 20). Curves: Combination (K=2000), Context (K=2000), Topic (K=2000), Combination (K=200), Context (K=200), Topic (K=200).]

Page 15: Chenhui Chu , Toshiaki  Nakazawa ,  Sadao Kurohashi


Results (Chinese-English MRR)

[Figure: MRR (y-axis, 0.3 to 0.46) against Iteration (x-axis, 1 to 20). Curves: Combination (K=2000), Context (K=2000), Topic (K=2000), Combination (K=200), Context (K=200), Topic (K=200).]

Page 16: Chenhui Chu , Toshiaki  Nakazawa ,  Sadao Kurohashi


Results (Japanese-English Precision@1)

[Figure: Precision@1 (y-axis, 0.18 to 0.36) against Iteration (x-axis, 1 to 20). Curves: Combination (K=2000), Context (K=2000), Topic (K=2000), Combination (K=200), Context (K=200), Topic (K=200).]

Page 17: Chenhui Chu , Toshiaki  Nakazawa ,  Sadao Kurohashi


Results (Japanese-English MRR)

[Figure: MRR (y-axis, 0.28 to 0.42) against Iteration (x-axis, 1 to 20). Curves: Combination (K=2000), Context (K=2000), Topic (K=2000), Combination (K=200), Context (K=200), Topic (K=200).]

Page 18: Chenhui Chu , Toshiaki  Nakazawa ,  Sadao Kurohashi


Improved Examples (1/2)

※ An improved example for the word 研究 (research), where the topical similarity scores are close to each other, while the contextual similarity scores are distinguishable

Page 19: Chenhui Chu , Toshiaki  Nakazawa ,  Sadao Kurohashi


Improved Examples (2/2)

※ An improved example for the word 施設 (facility), where neither the topical nor the contextual similarity scores are distinguishable on their own

Page 20: Chenhui Chu , Toshiaki  Nakazawa ,  Sadao Kurohashi


Not Improved Example

※ An example without improvement for the word 执行 (execution), where the linear combination of the two scores is not discriminative enough

Page 21: Chenhui Chu , Toshiaki  Nakazawa ,  Sadao Kurohashi


Conclusion

• Proposed a bilingual lexicon extraction system exploiting both topical and contextual knowledge in an iterative process

• Experiments on Wikipedia data verified the effectiveness of our system

• Software and dataset are freely available at: http://orchid.kuee.kyoto-u.ac.jp/~chu/code/iBiLexExtractor.tgz

• Future work

– Extraction for polysemous words, compound nouns and rare words