Approach

1
ara Institute of Science and Technology Approach Background NAIST-NTT System Description for Patent Translation NAIST-NTT System Description for Patent Translation Task at NTCIR-7 Task at NTCIR-7 -- Semi-supervised Learning of Bilingual Lexicon for Technical -- Semi-supervised Learning of Bilingual Lexicon for Technical Terms from Wikipedia -- Terms from Wikipedia -- Mamoru Komachi (NAIST, Japan), Masaaki Nagata (NTT, Japan), Yuji Matsumoto (NAIST, Japan) Experiment Conclusion and Future Work Demonstrated that a large scale bilingual dictionary can be extracted from Wikipedia Improved the quality of extracted bilingual lexicon by applying a graph kernel Plan to integrate graph-based word sense disambiguation into statistical machine translation framework The cost of hand-tagging resources is crucial for MT Wikipedia has become one of the important source for knowledge acquisition Wikipedia’s link structure has attracted attention in NLP fields, and the link structure turns out to be useful for extracting bilingual lexicon from Wikipedia [Adafre and Rijke 2006; Erdmann 2008] This study aims at: Domain adaptation (word sense disambiguation per domain) of term extraction Automatic refinement of translation pairs extracted from Wikipedia Random samples of extracted lexicon Wikiped ia Seed translat ion pairs Ranked bilingua l lexicon Corpora (Target domain) (Plant, 植植 ) organism (Plant, 植植 ) (Library, 植植植 ) (Airport, 植植 ) 植植 building Epoxy 植植植植植植 Single crystal 植植植 Laser cooling 植植植植植植 Centrifugal compressor 植植植植植 Fmlrun-int JE EJ JE EJ Baseline (WMT08) 26.3 9 28.2 5 25.3 4 27.1 9 Wikipedia (10%) N/A 27.4 7 N/A N/A Wikipedia (50%) N/A 27.4 6 N/A N/A Wikipedia (75%) N/A 27.4 2 N/A N/A Wikipedia (100%) 26.4 8 27.2 8 25.4 8 28.1 5 Wikiped ia # of words (OOV coverage) samples 10% 11,970 (1.9%) (natural selection, 植植植植植 ), (scrabble, 植植植植植 ), (phase transition, 植植植 ), (diamond, 植植植植植植 ), (videocassette recorder, 植植植植植植植植植植 ) 50% 75.420 (7.7%) (movement for multiparty democracy, 植植植植植植植植植植植 ), (fentanyl, 植植植植植植 ) [an opioid analgestic], (sigma sagittarii, 植植植 ) [the second brightest star system in the constellation Sagittarius]), (shintaro abe, 植植植植植 ) [the former prime minister of Japan], (nippon television, 植植植植植植植植 ) 75% 113,277 (11.5%) (pride final conflict 2003, pride grandprix 2003 植植植 ) [a mixed martial arts event held by PRIDE Fighting Championships], (uglyness, 植 ), (palma il vecchio, 植植植 植植植 植植植植植 ・・ ) [an Italian painter], (jean gilles, 植植植 植植 ) [a French composer; a French soldier], (amiloride, 植植植植植植 ) [a potassium-sparing diuretic] 100% 197,770 (13.5%) (brilliant corners, 植植植植植植 植植植植植 ) [an album by a jazz musician], (charly mottet 植植植植植 植植 ) [a French former professional cyclist], (deep purple in rock, 植植植植植 植植植植植 植植植 植植植 ) [an album by an English rock band], (june 2003, 植植植植植植植 「」 2003 植 6 植 ) [navigational entry for events happened in June 2003], (moanin', 植植植植 ) [a jazz album] filtere d 24.969 (1, 1 植 ) [year], (UTC+9, UTC+9) [Japanese side contains only alphanumeric characters], (Aera, AERA) [case-insensitive match] ( 植植植植 , 植植植 植 ) [garbage in English side], (image:himeji castle frontview.jpg, himeji castle frontview.jpg) [Wikipedia format navigational links], (user:eririnrinrin, eririnrinrin) [Wikipedia specific entries] System overview Extract bilingual lexicon from Wikipedia: 222,739 translation pairs (197,770 after filtering) Split the bilingual lexicon randomly into 8 sub-lexicons, and choose 5 seeds for each sub lexicon (total 8 * 5 = 40 seeds) After applying the regularized Laplacian kernel, collect the top 10%, 50% and 75% of the ranked list for each sub lexicon Accumulate sub lexicons by taking the intersection of the 8 collected lists to obtain final bilingual lexicon Performance of Wikipedia dict. Sample seeds Dataset: NTCIR-7 Patent Translation Task (1.8 million parallel sentences) Add the extracted bilingual lexicon to the training corpus to learn the translation probability between translation pairs The extracted bilingual lexicon slightly improves BLUE score However, increasing the extracted lexicon constantly degrades BLEU score for EJ on single-reference setting → Need re-examination Use interlingual links in Wikipedia to extract translation pairs Create a bipartite graph using Wikipedia abstract and interlingual links Manually select seed translation pairs from training corpora Select top-ranked translation pairs given seed translation pairs ion: one sense per domain [Thelen and Riloff 2002] elatedness” measure in link analysis to select the most relevant sense to the domain Bipartite graph from Wikipedia link structure ite graph construction steps translation pairs (en, ja) as white nodes bag-of-content words appearing abstracts of both languages as black nodes edges from translation pairs to co-occurring patterns (black nodes) Graph Laplacian R n ( L ) n ( I L ) 1 n 0 Regularized Laplacian matrix A: adjacency matrix of the graph D:(diagonal) degree matrix β:parameter Each column of R β gives the rankings relative to a node Wikipe dia Englis h abstra cts Wikiped ia Japanes e abstrac ts

description

NAIST-NTT System Description for Patent Translation Task at NTCIR-7 -- Semi-supervised Learning of Bilingual Lexicon for Technical Terms from Wikipedia --. (Plant, 植物 ). Wikipedia. Ranked bilingual lexicon. organism. - PowerPoint PPT Presentation

Transcript of Approach

Page 1: Approach

Nara Institute of Science and Technology

Approach

Background

NAIST-NTT System Description for Patent Translation Task at NTCIR-NAIST-NTT System Description for Patent Translation Task at NTCIR-77

-- Semi-supervised Learning of Bilingual Lexicon for Technical Terms from Wikipedia ---- Semi-supervised Learning of Bilingual Lexicon for Technical Terms from Wikipedia --Mamoru Komachi (NAIST, Japan), Masaaki Nagata (NTT, Japan), Yuji Matsumoto (NAIST, Japan)

Experiment

Conclusion and Future WorkDemonstrated that a large scale bilingual dictionary can be extracted from WikipediaImproved the quality of extracted bilingual lexicon by applying a graph kernelPlan to integrate graph-based word sense disambiguation into statistical machine translation framework

The cost of hand-tagging resources is crucial for MTWikipedia has become one of the important source for knowledge acquisitionWikipedia’s link structure has attracted attention in NLP fields, and the link structure turns out to be useful for extracting bilingual lexicon from Wikipedia [Adafre and Rijke 2006; Erdmann 2008]This study aims at:Domain adaptation (word sense disambiguation per domain) of term extractionAutomatic refinement of translation pairs extracted from Wikipedia

Random samples of extracted lexicon

WikipediaWikipedia

Seed translationpairs

Seed translationpairs

Rankedbilinguallexicon

Rankedbilinguallexicon

Corpora(Target domain)

Corpora(Target domain)

(Plant, 植物 )

organism

(Plant, 工場 )

(Library, 図書館 )

(Airport, 空港 )

施設

building

English Japanese

Thermal spray 溶射

Epoxy エポキシ樹脂

Single crystal 単結晶

Laser cooling レーザー冷却

Centrifugal compressor 遠心圧縮機

Single-ref Fmlrun-int

JE EJ JE EJ

Baseline (WMT08) 26.39 28.25 25.34 27.19

Wikipedia (10%) N/A 27.47 N/A N/A

Wikipedia (50%) N/A 27.46 N/A N/A

Wikipedia (75%) N/A 27.42 N/A N/A

Wikipedia (100%) 26.48 27.28 25.48 28.15

Wikipedia # of words (OOV coverage) samples

10% 11,970 (1.9%) (natural selection, 自然選択説 ), (scrabble, スクラブル ), (phase transition, 相転移 ), (diamond, ダイアモンド ), (videocassette recorder, ビデオテープレコーダ )

50% 75.420 (7.7%) (movement for multiparty democracy, 複数政党制民主主義運動 ), (fentanyl, フェンタニル ) [an opioid analgestic], (sigma sagittarii, ヌンキ ) [the second brightest star system in the constellation Sagittarius]), (shintaro abe, 安倍晋太郎 ) [the former prime minister of Japan], (nippon television, 日本テレビ放送網 )

75% 113,277 (11.5%) (pride final conflict 2003, pride grandprix 2003 決勝戦 ) [a mixed martial arts event held by PRIDE Fighting Championships], (uglyness, 醜 ), (palma il vecchio, パルマ・イル・ヴェッキオ ) [an Italian painter], (jean gilles, ジャン・ジル ) [a French composer; a French soldier], (amiloride, アミロライド ) [a potassium-sparing diuretic]

100% 197,770 (13.5%) (brilliant corners, ブリリアント・コーナーズ ) [an album by a jazz musician], (charly mottet シャーリー・モテ ) [a French former professional cyclist], (deep purple in rock, ディープ・パープル・イン・ロック ) [an album by an English rock band], (june 2003, 「最近の出来事」 2003年 6月 ) [navigational entry for events happened in June 2003], (moanin', モーニン ) [a jazz album]

filtered 24.969 (1, 1年 ) [year], (UTC+9, UTC+9) [Japanese side contains only alphanumeric characters], (Aera, AERA) [case-insensitive match] ( 大岡越前 , 大岡越前 ) [garbage in English side], (image:himeji castle frontview.jpg, himeji castle frontview.jpg) [Wikipedia format navigational links], (user:eririnrinrin, eririnrinrin) [Wikipedia specific entries]

System overview

Extract bilingual lexicon from Wikipedia: 222,739 translation pairs (197,770 after filtering)Split the bilingual lexicon randomly into 8 sub-lexicons, and choose 5 seeds for each sub lexicon (total 8 * 5 = 40 seeds)After applying the regularized Laplacian kernel, collect the top 10%, 50% and 75% of the ranked list for each sub lexiconAccumulate sub lexicons by taking the intersection of the 8 collected lists to obtain final bilingual lexicon

Performance of Wikipedia dict. Sample seedsDataset: NTCIR-7 Patent Translation Task (1.8 million parallel sentences)Add the extracted bilingual lexicon to the training corpus to learn the translation probability between translation pairsThe extracted bilingual lexicon slightly improves BLUE scoreHowever, increasing the extracted lexicon constantly degrades BLEU score for EJ on single-reference setting → Need re-examination

Use interlingual links in Wikipedia to extract translation pairsCreate a bipartite graph using Wikipedia abstract and interlingual linksManually select seed translation pairs from training corporaSelect top-ranked translation pairs given seed translation pairs

Assumption: one sense per domain [Thelen and Riloff 2002]→Use “relatedness” measure in link analysis to select the most relevant sense to the domain

Bipartite graph from Wikipedia link structure

Bipartite graph construction steps1. Add translation pairs (en, ja) as white nodes2. Add bag-of-content words appearing abstracts of both languages as black nodes3. Add edges from translation pairs to co-occurring patterns (black nodes)

Graph Laplacian

R n ( L)n (I L) 1

n0

Regularized Laplacian matrix

A: adjacency matrix of the graphD: (diagonal) degree matrixβ: parameterEach column of Rβ gives the rankings relative to a node

Wikipedia

Englishabstract

s

Wikipedia

Japanese

abstracts