Deriving a Japanese-English Technical Lexicon from the NTCIR Scientific Collections, with...

Deriving a Japanese-English Technical Lexicon from the NTCIR Scientific Collections, with Implications for Language Engineering Fredric Gey Visiting Researcher, NII University of California, Berkeley (Final Presentation before returning to USA September 5) August 30, 2007 (appreciation to David K Evans for all his help)


Deriving a Japanese-English Technical Lexicon from the NTCIR Scientific Collections,

with Implications for Language Engineering

Fredric Gey

Visiting Researcher, NII

University of California, Berkeley

(Final Presentation before returning to USA September 5)

August 30, 2007

(appreciation to David K Evans for all his help)


NII - Bilingual Technical Lexicon from NTCIRSic-naics-mapping-xmdr.ppt

Lexicon Development and Potential Use for Transliteration Research

• Goal of the development

• Description of NTCIR 1 and 2 collections

• Method(s) used to create the lexicon

• Details about the lexicon

• Validation of the lexicon

• Research uses of the lexicon

– Transliteration

– Romanization and approximate matching

• Discussion of next steps


Goals of the Lexicon Development

• The NTCIR 1 and 2 test collections are the only available text research collections derived from the science and technology domains*

– Technical vocabulary translation is an important part of the translation industry

– A freely available technical lexicon may be of use to professional translators

– The lexicon may also stimulate further interest in the collections

• Technical lexicons may be useful in linguistic research and language engineering

– Technical terms in Japanese are often borrowed from English (katakana)

– A technical lexicon may be useful for transliteration research and matching new vocabulary between Japanese and English

* The NTCIR Patent collection may also be one, depending upon viewpoint


The NTCIR-1/2 collections

• The NTCIR 1 and 2 test collections are actually three sub-collections

• NTCIR-1 J-E gakkai collection (339,483 documents)

– Author abstracts of articles from 65 Japanese scientific society-hosted conferences for the period 1988-1992

• NTCIR-2 J and E gakkai collection

– Extension of NTCIR-1 collection for years 1997-1999

– 77,433 English abstracts, 116,177 Japanese abstracts

– Independent files, not pre-joined

• NTCIR-2 J and E kaken collection

– Abstracts of funded research final reports 1988-1997

– 57,545 English abstracts, 287,071 Japanese abstracts

– Independent files, not pre-joined


The NTCIR-1 J-E collection

• 339,483 documents (334,515 (98.5%) with Japanese abstracts)

• Only 188,907 (55.6%) have English abstracts, however

• 313,673 (92.3%) have author-assigned keywords (terms) in both Japanese and English

• While 65 societies are represented, the bulk of the documents come from a few societies. For example, there are only 15 documents from the Japan Society for Wind Engineering

• The top 10 societies account for 82.7% of all documents

• The bottom 30 account for only 2.7% of documents

Documents (share)    Society
88,207 (26.05%)      The Institute of Electronics, Information and Communication Engineers
55,629 (16.43%)      Architectural Institute of Japan
27,191 (8.03%)       Information Processing Society of Japan
23,395 (6.91%)       The Society of Polymer Science, Japan
21,352 (6.31%)       Japan Society of Civil Engineers
20,033 (5.92%)       Japan Society for Bioscience, Biotechnology and Agrochemistry
18,226 (5.38%)       The Institute of Electrical Engineers of Japan
12,112 (3.58%)       The Society of Instrument and Control Engineers
7,100 (2.10%)        The Ceramic Society of Japan
6,682 (1.97%)        The Pharmaceutical Society of Japan


The NTCIR-1 J-E example document

<ACCN>gakkai-0000185279</ACCN>
<TITL TYPE="kanji"> 動画像圧縮イメージセンサの検討 </TITL>
<TITE TYPE="alpha">On-sensor Video Compression</TITE>
<AUPK TYPE="kanji"> 大野 洋 / 浜本 隆之 / 相澤 清晴 / 羽鳥 光俊 / 山崎 順一 / 丸山 裕孝 </AUPK>
<AUPE TYPE="alpha">Ohno,Hiroshi / Hamamoto,Takayuki / Aizawa,Kiyoharu / Hatori,Mitsutoshi / Yamazaki,Jun-ichi / Maruyama,Hirotaka</AUPE>
<CONF TYPE="kanji"> 画像応用研究会 </CONF>
<CNFE TYPE="alpha">Technical Group on Applied Image Processing and System</CNFE>
<CNFD>1994. 08. 26</CNFD>
<ABST TYPE="kanji"><ABST.P> 画像を扱う既存のシステムにおいては , 画像獲得と画像処理はほぼ完全に分離している。ところが画像技術の応用分野が広がるにつれ , イメージセンサに対して , 高レート化 , 高機能化が要求されるようになってきた。これらの要求に従来の枠組で対応していくと ,画像情報を 1 次元の時系列信号として転送する場合 , 転送遅延がボトルネックとなってしまう。この問題に対して , センサ上で一部 ( あるいは全て ) の処理を実行し , 画像取得と画像処理をより密接に関連させて解決しようというアプローチが検討され始めている。 </ABST.P><ABST.P> 我々はセンサ上で適切な画像圧縮を施すことで , 取得画像の高レート化 ( 高速化 , 高精細化 ) に対応することを考えている。本稿では , センサ上での動画像圧縮のためのアルゴリズム , およびそのチップへの実装について論じる。 </ABST.P></ABST>
<ABSE TYPE="alpha"><ABSE.P>In this paper, we propose new computational image sensors which compress image signal in the process of image acquisition. Conditional replenishment is used to reduce the band-width necessary for image read-out. We also describe about the design of the experimental chip. This chip has an extensible, parallel, architecture.</ABSE.P></ABSE>
<KYWD TYPE="kanji"> 画像センサ // コンピュテーショナルセンサ // 画像圧縮 // 画像符号化 </KYWD>
<KYWE TYPE="alpha">Image Sensors // Computational Sensors // Image Compression // Image Coding</KYWE>
<SOCN TYPE="kanji"> テレビジョン学会 </SOCN>
<SOCE TYPE="alpha">The Institute of Television Engineers of Japan</SOCE>
</REC><REC>

Slide callouts: Abstract in Japanese (ABST), Keywords in Japanese (KYWD), Keywords in English (KYWE)


Lexicon creation methodology

• Remember, for the NTCIR-1 collection:

– 313,673 (92.3%) have author-assigned keywords (terms) in both Japanese and English, while only 188,907 (55.6%) have English abstracts

• Thus, pairing keywords may be more useful than the more complicated task of pairing sentences in documents (the usual approach of statistical machine translation) to align term pairs

• Pairing keywords is also a lot easier to program

• Some regularity and ordering of the keywords helps to facilitate the process

• Thus we extracted keyword pairs and counted their occurrence in the collection
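
The extract-and-count step can be sketched in Python. This is a minimal illustration, not the program used to build the lexicon. It assumes each record is a string containing KYWD and KYWE fields with keywords separated by "//", as in the example document. The function names and the lower-casing of English keywords are assumptions made for the sketch.

```python
import re
from collections import Counter

# Keyword fields as they appear in the example record:
# KYWD holds the Japanese keywords, KYWE the English ones.
KYWD_RE = re.compile(r'<KYWD[^>]*>(.*?)</KYWD>', re.S)
KYWE_RE = re.compile(r'<KYWE[^>]*>(.*?)</KYWE>', re.S)


def split_keywords(field):
    """Split a keyword field on the '//' separator and trim whitespace."""
    return [k.strip() for k in field.split('//') if k.strip()]


def keyword_pairs(record):
    """Return (Japanese, English) keyword pairs for one record, by position."""
    j = KYWD_RE.search(record)
    e = KYWE_RE.search(record)
    if not j or not e:
        return []  # keywords missing in at least one language
    japanese = split_keywords(j.group(1))
    # Lower-casing is an assumption of this sketch, so that
    # "Image Coding" and "image coding" count as the same term.
    english = [k.lower() for k in split_keywords(e.group(1))]
    return list(zip(japanese, english))


def count_pairs(records):
    """Count how often each (Japanese, English) pair occurs in the collection."""
    counts = Counter()
    for record in records:
        counts.update(keyword_pairs(record))
    return counts
```

Calling `count_pairs` on an iterable of record strings returns a `Counter` whose keys are (Japanese, English) tuples.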


Lexicon creation methodology

• To further simplify, we observe that pairs seem to be ordered in the original documents

gakkai-0000000016|KYWD| ワイエルシュトラス楕円関数 | 母数変換 | 等角写像 | テーター関数 |KYWE| Weierstrass Elliptic Function | Modulus Translation | Conformal Mapping | Theta Function

gakkai-0000006972|KYWD| 等角写像解析 | 楕円関数 | 楕円テータ関数 | 特性インピーダンス |KYWE|conformal mapping analysis | elliptic function | Elliptic Theta function | charactaristic impeadance

gakkai-0000024631|KYWD| アンテナ共用器 | 導波管 | フィルタ | 楕円関数 | サーキュレータ | マイクロ波 |KYWE|Duplexer | Waveguide | Filter | Elliptic Function | Circulator | Microwave


Lexicon creation methodology

• Of course, not all records have keywords in both languages, nor do they have the same number of keywords in each language – calls for more statistical investigation

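Records with English keywords but no Japanese keywords       18,140
Records with Japanese keywords but no English keywords        4,583
Records with neither Japanese nor English keywords            3,087
Records with both Japanese and English keywords             313,673
Total records                                               339,483
Records where |KYWE| != |KYWD| (keyword counts differ)       16,289 (5.2% of records with both)

• I chose only records with both Japanese and English keywords present. Where the number of English keywords differs from the number of Japanese keywords, only the first min(|KYWE|,|KYWD|) keywords are processed, in sequence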
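
A small sketch of the record bookkeeping and the min(|KYWE|,|KYWD|) rule follows. It assumes the keywords of each record have already been split into two Python lists. The function names and the row labels are hypothetical, chosen only for this illustration.

```python
from collections import Counter


def classify(jap_kw, eng_kw):
    """Assign one record to a row of the keyword-availability table."""
    if jap_kw and eng_kw:
        return 'both'
    if eng_kw:
        return 'english_only'
    if jap_kw:
        return 'japanese_only'
    return 'neither'


def ordered_pairs(jap_kw, eng_kw):
    """Pair keywords by position, keeping only the first min(|KYWE|,|KYWD|)."""
    n = min(len(jap_kw), len(eng_kw))
    return list(zip(jap_kw[:n], eng_kw[:n]))


def summarize(records):
    """Tally the table rows for an iterable of (jap_kw, eng_kw) list pairs."""
    stats = Counter()
    for jap_kw, eng_kw in records:
        kind = classify(jap_kw, eng_kw)
        stats[kind] += 1
        stats['total'] += 1
        if kind == 'both' and len(jap_kw) != len(eng_kw):
            stats['count_mismatch'] += 1
    return stats
```

When the two lists differ in length, the surplus keywords at the end of the longer list are dropped. A keyword missing from the middle of one list would shift every later pair out of alignment.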


Lexicon creation results: NTCIR-1

• For the NTCIR-1 collection we obtain 598,439 unique J-E pairs. The distribution has a very long tail, which may include many erroneous pairs, including misspellings:

495 情報検索 | information retrieval
8 情報検索 | information retrival
7 検索 | information retrieval
4 情報検索 | information retieval
3 文書検索 | information retrieval
3 情報検索 | information retreival
3 情報収集 | information retrieval
2 情報検索 | information retrieving
1 情報検索 | information retrilval
1 情報検索 | information retrievol
1 情報検索 | information retrierol
1 情報検索 | information retreaval

Number of occurrences    Number of pairs
5 or more                34,044
4                        11,698
3                        23,063
2                        64,726
1                        464,908
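
The occurrence table can be derived from the pair counts with a few lines of Python. The sketch below assumes the counts are held in a dictionary keyed by (Japanese, English) pair. `occurrence_table` is a hypothetical helper name.

```python
from collections import Counter


def occurrence_table(pair_counts, cap=5):
    """Count how many distinct pairs occur 1, 2, ... times.

    Frequencies of `cap` or more are grouped into a single row.
    """
    table = Counter()
    for freq in pair_counts.values():
        table[min(freq, cap)] += 1
    return dict(table)


# Consistency check on the NTCIR-1 figures from the slide:
# the rows should add up to the number of unique pairs.
ntcir1_rows = {5: 34044, 4: 11698, 3: 23063, 2: 64726, 1: 464908}
assert sum(ntcir1_rows.values()) == 598439
```

The final assertion confirms that the five rows of the NTCIR-1 table sum to the 598,439 unique pairs reported.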


Lexicon creation methodology: NTCIR-2 gakkai

• The NTCIR-2 Gakkai collection is just an extension of the NTCIR-1 collection to additional years, with the complication that you have to join independent English and Japanese sub-collections. The NTCIR-2 joined subset has the following keyword characteristics:

Records with English keywords but no Japanese keywords           68
Records with Japanese keywords but no English keywords          277
Records with neither Japanese nor English keywords            1,996
Records with both Japanese and English keywords              71,839
Total records                                                74,180
Records where |KYWE| != |KYWD| (keyword counts differ)        2,643 (3.7% of records with both)

• I chose only records with both Japanese and English keywords present. Where the number of English keywords differs from the number of Japanese keywords, only the first min(|KYWE|,|KYWD|) keywords are processed, in sequence


Lexicon creation results: NTCIR-2 gakkai

• For the NTCIR-2 gakkai sub-collection we obtain 172,400 unique J-E pairs (compared to 598,439 for NTCIR-1) with the following distribution, also with a long tail:

528 シミュレーション | simulation
493 有限要素法 | finite element method
470 液状化 | liquefaction
466 インターネット | internet
412 遺伝的アルゴリズム | genetic algorithm
383 ニューラルネットワーク | neural network
245 ラジカル重合 | radical polymerization
245 データベース | database
235 アルミナ | alumina
228 鉄筋コンクリート | reinforced concrete
202 地震 | earthquake
198 数値解析 | numerical analysis

Number of occurrences    Number of pairs
5 or more                8,032
4                        3,210
3                        6,644
2                        19,380
1                        135,134


Lexicon creation: NTCIR-2 Kaken

• The NTCIR-2 Kaken collection is totally different from the Gakkai sub-collections. It derives from final reports of funded research. As such it has more diversity. There are no ‘societies’ to anchor the text by ‘domain.’

• The statistical characteristics are also different:

Records with English keywords but no Japanese keywords           56
Records with Japanese keywords but no English keywords           98
Records with neither Japanese nor English keywords                4
Records with both Japanese and English keywords              57,354
Total records                                                57,512
Records where |KYWE| != |KYWD| (keyword counts differ)       15,530 (27.1% of records with both)

• Thus the idea of ordered keyword pairing, which works so well with the Gakkai collections, may be inappropriate for Kaken. We may need to cast a broader statistical association net.


NTCIR-2 Kaken keyword diversity – no correspondence of E to J required of authors

kaken-j-0975082400|KYWE| Healt impaired children | Chronically Ill children | Healt psychology | Healt education | Denelopment of concepts | LOC | self-care | Coping behavior |KYWD| 病弱児 | 慢性疾患児 | 健康心理学 | 健康教育 | 概念発達 | LOC | セルフケア | 対処行動 (good example of misspellings in English keywords)

kaken-j-0965522600|KYWE| environmental issues | mass media | public opinion | social research | content analysis | effects of mass communication | global warming | social psychology |KYWD| 環境問題 | マスメディア | 世論 | 社会調査 | 内容分析 | マスコミ効果論 | 地球温暖化 | 社会心理学

kaken-j-0861763900|KYWE| Methylglyoxal | D-Lactate | HPLC | 2-Methylquinoxaline | o-Phenylenediamine | 4,5-Dichloro-o-phenylenediamine|KYWD| メチルグリオキサール | D- 乳酸 | HPLC | 2- メチルキノキサリン | オルトフェニレンジアミン | 4.5- ジクロロオルトフェニレンジアミン | 6.7- ジクロロメチルキノキサリン (good J-E correspondence)

kaken-j-0861806300| KYWE | Joro spider toxin | Purification of JSTX | Glutamate receptor | Glutamate binding | 2,4-Dihydroxyphenylacetic acid |KYWD| グルタミン酸レセプター | クモ毒 | ラット脳シナプス膜 | クモ毒の精製 | クモ毒の作用機序 (seems to have little J-E correspondence)


Lexicon creation results: NTCIR-2 Kaken

• For NTCIR-2 Kaken, using the paired ordered keyword assumption, we obtain 238,820 unique J-E pairs (compared to 172,400 for NTCIR-2 gakkai), with an even longer tail.

282 ラット | rat

266 モノクローナル抗体 |monoclonal antibody

251 アポトーシス | apoptosis

243 サイトカイン | cytokine

236 遺伝子発現 | gene expression

233 免疫組織化学 | immunohistochemistry

188 シミュレーション | simulation

181 データベース | database

163 カルシウム | calcium

161 マウス | mouse

159 画像処理 | image processing

151 セラミックス | ceramics

Number of occurrences    Number of pairs
5 or more                5,685
4                        2,353
3                        4,549
2                        14,001
1                        212,232


Lexicon creation: NTCIR-2 Kaken, another approach (thanks David)

• Maximalist approach: pair every English keyword in a record with every Japanese keyword in the same record, independent of position

E1 E2 E3 E4 E5

J1 J2 J3 J4 J5 J6 J7

(E1,J1)(E1,J2) … (E1,J7)

(E2,J1)(E2,J2) …

Produces >2 million pairs (2,219,878)

Number of occurrences    Number of pairs
5 or more                6,941
4                        4,135
3                        12,065
2                        66,298
1                        2,130,259
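
The maximalist pairing is a Cartesian product of the two keyword lists of a record. A minimal sketch, assuming the keywords are already split into lists (the function names are hypothetical):

```python
from collections import Counter
from itertools import product


def all_pairs(eng_kw, jap_kw):
    """Every (Ei, Jk) combination within one record."""
    return list(product(eng_kw, jap_kw))


def count_all_pairs(records):
    """Count candidate pairs over an iterable of (eng_kw, jap_kw) list pairs."""
    counts = Counter()
    for eng_kw, jap_kw in records:
        counts.update(product(eng_kw, jap_kw))
    return counts


# The slide's schematic record: 5 English and 7 Japanese keywords.
eng = ['E1', 'E2', 'E3', 'E4', 'E5']
jap = ['J1', 'J2', 'J3', 'J4', 'J5', 'J6', 'J7']
pairs = all_pairs(eng, jap)  # 5 * 7 = 35 candidate pairs
```

A record with m English and n Japanese keywords contributes m × n candidate pairs, so the pair count grows much faster than under ordered pairing.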


Lexicon creation: NTCIR-2 Kaken final approach (thanks David; unfinished)

• Pair all English and Japanese keywords independently

E1 E2 E3 E4 E5

J1 J2 J3 J4 J5 J6 J7

(E1,J1)(E1,J2) … (E1,J7)

(E2,J1)(E2,J2) …

Produces >2 million pairs (2,219,878)

For each pair, collect the following contingency table; from it you can compute an association measure (Yates-corrected chi-square or Dunning’s log-likelihood ratio)

        Jk    ~Jk
Ei      a     b
~Ei     c     d

Produces a ranked list of the most likely translations for either Ei or Jk
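
Both association measures can be computed directly from the four cells. The sketch below assumes that a, b, c and d are record counts: a is the number of records containing both Ei and Jk, b those with Ei but not Jk, c those with Jk but not Ei, and d those with neither. The helper names are hypothetical, and this is an illustration of the two standard formulas, not the unfinished implementation the slide refers to.

```python
import math


def contingency(pair_count, e_count, j_count, n_records):
    """Build (a, b, c, d) for one (Ei, Jk) pair from record counts."""
    a = pair_count               # records with both Ei and Jk
    b = e_count - a              # Ei without Jk
    c = j_count - a              # Jk without Ei
    d = n_records - a - b - c    # neither
    return a, b, c, d


def yates_chi_square(a, b, c, d):
    """Chi-square for a 2x2 table with Yates' continuity correction."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0
    return n * max(0.0, abs(a * d - b * c) - n / 2) ** 2 / denom


def log_likelihood_ratio(a, b, c, d):
    """Dunning's G^2 = 2 * sum(O * ln(O / E)) over the four cells."""
    n = a + b + c + d
    cells = ((a, a + b, a + c), (b, a + b, b + d),
             (c, c + d, a + c), (d, c + d, b + d))
    g2 = 0.0
    for observed, row_total, col_total in cells:
        if observed > 0:  # a zero cell contributes nothing
            g2 += observed * math.log(observed * n / (row_total * col_total))
    return 2.0 * g2
```

Sorting the Japanese candidates of a given Ei by either score, highest first, gives the ranked list. Both measures are zero when the two keywords are independent and grow as co-occurrence exceeds chance.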


Lexicon Validation

• There are many ways to validate the results of the lexicon

– Have human beings who understand both Japanese and English validate the pairs

– Use an external translation engine (say, the Google language tools Japanese-to-English translator) to translate the Japanese and compare to the English


Lexicon Validation (continued)

• Of course, not all external translations will be the same


Lexicon Validation (continued)

• But the most frequent ones can be validated either way


Lexicon Validation (continued)

• The problem is, of course, exhaustivity, especially with the long tails of the distribution

• My approach will be stratified sampling (suggestions welcome)

– For a total sample of 1000, take

– 200 from top pairs with frequency >= 5

– 200 pairs from frequency = 4

– 200 pairs from frequency = 3

– 200 pairs from frequency = 2

– 600 pairs from frequency = 1
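
A stratified sample by pair frequency can be sketched as follows. The stratum sizes are passed in as an argument and are not fixed in the code. The function name, the seed, and the convention that the largest key collects all higher frequencies are assumptions of this sketch.

```python
import random


def stratified_sample(pair_counts, allocation, seed=0):
    """Draw a random sample of pairs from each frequency stratum.

    pair_counts: dict mapping (japanese, english) -> frequency
    allocation:  dict mapping stratum -> sample size; the largest key
                 also collects every pair at or above that frequency
    """
    top = max(allocation)
    strata = {key: [] for key in allocation}
    for pair, freq in pair_counts.items():
        key = min(freq, top)
        if key in strata:
            strata[key].append(pair)
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    return {key: rng.sample(sorted(pool), min(allocation[key], len(pool)))
            for key, pool in strata.items()}
```

For the strata on the slide, the allocation would have keys 5, 4, 3, 2 and 1, with key 5 standing for "5 or more". A stratum with fewer pairs than requested is returned whole.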


If Google Translate does such a good job with technical terms, why bother with this project?

• Google technology is proprietary – reverse engineering their dictionary would violate intellectual property

• Developers may wish to have more control over the use of the lexicon for MT, CLIR or other purposes

• Existing available dictionaries are very limited

• The lexicon may have translations not found in Google Translate

• This lexicon can be used for further research in correspondences between the Japanese and English languages

• In particular for transliteration development and matching techniques for Katakana to English

• The lexicon is a great source for Katakana (David Evans extracted 20,871 distinct Katakana terms from NTCIR-2 Gakkai alone)


What are the Limitations to the Lexicon Use?

• The lexicon has limited domain coverage – major areas of electrical engineering, information technology, architecture for NTCIR-1 and 2 gakkai collections

• The lexicon is statistically derived, and thus noisy

– Probably excellent for research into transliteration and matching

– Spelling variants can be matched for English spelling errors

– The utility of the ‘long tail’ still needs to be investigated

• BUT, it is (to my knowledge) the first freely available Japanese-English technical lexicon


Lexicon implications for Transliteration/Matching research*

• Transliteration research, especially between Japanese and English, was jump-started in 1997 with Kevin Knight and Jon Graehl’s paper “Machine Transliteration” (ACL 1997)

• Transliteration & Back-transliteration

• Transliteration:

– Translating proper names, technical terms, etc. based on phonetic equivalents

– Complicated for language pairs with different alphabets & sound inventories

– E.g. “computer” --> “konpyuuta” コンピュータ

• Back-transliteration

– E.g. “konpyuuta” --> “computer”

– Inversion of a lossy process

• Knight & Graehl developed a probabilistic finite-state machine model for transliteration and back-transliteration

*(slide adapted from U Washington seminar on web)


Application areas of Transliteration/Matching research

• Cross-language search where search terms in the source language are not in the translation dictionary, e.g. person and place names

– English-French is easy because of so many cognates (see Buckley et al 1997, Using Clustering and SuperConcepts Within SMART). Buckley: “We regard French as just misspelled English.”

– Savoy & Rasolofo 2002 (Report on the TREC 11 Experiment: Arabic, Named Page and Topic Distillation Searches) performed English-Arabic search by first Romanizing the entire Arabic corpus (done in Malta where Arabic is represented with Latin alphabet).

• Multilingual, multidocument summarization (Newsblaster, NewsExplorer) – see Steinberger & Pouliquen 2007, Cross-lingual Named Entity Recognition.

• Machine Translation for finding out-of-vocabulary words to augment the translation dictionaries


Transliteration versus Romanization

• Transliteration is a process of phonetic mapping from one language to another

– Transliteration research uses machine learning to find the best phonetic representation to maximize matching over a training set

– Fairly easy for Latin alphabet languages

– Less easy for Cyrillic script languages (but often 1-1 reversible)

– More difficult for foreign scripts with different phonetic bases

– Next to impossible for Chinese

• Romanization is rule-based mapping from a non-Latin script to the Latin alphabet

– Romanization has been around for a long time

– Hepburn created Japanese Romanization in 1887. Hepburn for コンピュータ is kompyuta

– There is a Hepburn module in the Perl archive CPAN (utf-8). Dr. Apel at NII has Hepburn Romanization software written in Perl (eucJP).

– Library of Congress has Romanization rules for dozens of languages, which are used in cataloging of non-English books
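
To show what "rule-based" means here, the toy sketch below romanizes a handful of Katakana terms by table lookup. It is not the CPAN module or Dr. Apel's software. The kana table covers only the characters needed for the examples, long-vowel marks are simply dropped (no macrons), and syllabic ン is written m before b, m or p, which is one common Hepburn convention. All names in the code are hypothetical.

```python
# Deliberately tiny kana table: just enough for the example words.
KANA = {
    'コ': 'ko', 'タ': 'ta', 'デ': 'de', 'ベ': 'be', 'ス': 'su',
    'ア': 'a', 'ル': 'ru', 'ミ': 'mi', 'ナ': 'na',
    'ン': 'N',   # placeholder, resolved to n or m below
}
DIGRAPHS = {'ピュ': 'pyu'}  # kana followed by a small kana
LONG_MARK = 'ー'


def romanize(word):
    """Rule-based Katakana-to-Latin mapping for the kana in the toy table."""
    tokens = []
    i = 0
    while i < len(word):
        if word[i:i + 2] in DIGRAPHS:
            tokens.append(DIGRAPHS[word[i:i + 2]])
            i += 2
        elif word[i] == LONG_MARK:
            i += 1  # long vowel written without a macron
        elif word[i] in KANA:
            tokens.append(KANA[word[i]])
            i += 1
        else:
            raise ValueError('kana not in the toy table: ' + word[i])
    out = []
    for k, tok in enumerate(tokens):
        if tok == 'N':
            nxt = tokens[k + 1] if k + 1 < len(tokens) else ''
            tok = 'm' if nxt[:1] in ('b', 'm', 'p') else 'n'
        out.append(tok)
    return ''.join(out)


print(romanize('コンピュータ'))
```

A full romanizer needs the complete kana table plus rules for doubled consonants and the remaining small kana, all omitted here.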


Can Romanization be rescued?

• My research question is whether this vast amount of existing Romanization can be used for search across languages using approximate string matching, for example using Edit distance:

– Edit distance between computer and kompyuta ( コンピュータ ) is 4 (replace c with k, insert y, replace e with a, delete r)

– Edit distance between fish (E) and fisch (DE) is 1, between fish (E) and frisch (DE) is 2, between fresh (E) and frisch (DE) is also 2

– “Edit distance is a terrible cross-lingual matching method” (Martin Braschler, University of Zurich)

• Can we use older Rule-based phonetic matching methods?

– Soundex (patented 1918) for telephone lookup

– Phonix developed by Gadd (Fisching fore Werds: Phonetic Retrieval of Written Text in Information Systems, 1988)

– a little background on approximate matching is in order
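
For reference, edit distance over insertions, deletions and replacements (Levenshtein distance) can be computed with the standard dynamic-programming recurrence. The sketch below is a textbook version with a hypothetical function name, applied to the word pairs from this slide.

```python
def edit_distance(s, t):
    """Minimum number of insertions, deletions and replacements
    needed to turn string s into string t (Levenshtein distance)."""
    prev = list(range(len(t) + 1))  # distances from the empty prefix of s
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,                 # delete cs
                           cur[j - 1] + 1,              # insert ct
                           prev[j - 1] + (cs != ct)))   # replace or match
        prev = cur
    return prev[-1]


for a, b in [('computer', 'kompyuta'), ('fish', 'fisch'),
             ('fish', 'frisch'), ('fresh', 'frisch')]:
    print(a, b, edit_distance(a, b))
```

Only one row of the table is kept at a time, so memory use is proportional to the length of the second string.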


Well-known Approximate String Matching Methods

• Edit distance: number of insertions, deletions and replacements needed to transform one string to its matching string

• Q-grams – number of substrings of length q which the string has in common with its matching string

– Simply counting q-grams is not adequate because it ignores string length (Fred has as many common q-grams with itself as with Frederick)

– Ukkonen proposed a length-based distance metric (Approximate string matching with q-grams and maximal matches, 1992)

– |Gs| + |Gt| - 2|Gs ∩ Gt|, where Gs and Gt are the sets of q-grams in strings s and t

• Q-gram methods were used by Robertson & Willett for searching historical English (Searching for historical word-forms in a database of 17th-century English text using spelling-correction methods, 1992)

• A modified q-gram method, targeted s-grams (where 2-grams are allowed to include a character skip), was investigated by Pirkola et al for cross-language search between Finnish, Swedish and German (Targeted s-gram matching: a novel n-gram matching technique for cross- and mono-lingual word form variants, 2002) - would resolve the fish/fisch match.
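
The q-gram count and the length-based distance can be sketched as below, using q-gram sets exactly as in the formula on this slide. The function names and the choice q = 2 are assumptions of the example.

```python
def qgrams(s, q=2):
    """Set of all substrings of length q in s."""
    return {s[i:i + q] for i in range(len(s) - q + 1)}


def common_qgrams(s, t, q=2):
    """Number of distinct q-grams the two strings share."""
    return len(qgrams(s, q) & qgrams(t, q))


def qgram_distance(s, t, q=2):
    """|Gs| + |Gt| - 2|Gs ∩ Gt| over q-gram sets."""
    gs, gt = qgrams(s, q), qgrams(t, q)
    return len(gs) + len(gt) - 2 * len(gs & gt)


# "fred" shares as many 2-grams with "frederick" as with itself ...
assert common_qgrams('fred', 'fred') == common_qgrams('fred', 'frederick') == 3
# ... but the distance separates the two cases.
assert qgram_distance('fred', 'fred') == 0
assert qgram_distance('fred', 'frederick') == 5
```

The distance counts the q-grams that occur in only one of the two strings, which is how string length enters the measure.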


Classical Phonetic Matching Methods

• The classical phonetic matching algorithms (originally developed for telephone operators to do lookup for similar-sounding names) operate by shrinking (removing vowels) and mapping consonants to a canonical subset, as well as truncation

For example, if computer → kmptr and kompyuta → kmpt, and a maximum of 4 leading characters are retained, you have a match

• Soundex – map {a e i o u y h w} → 0, {b p f v} → 1, {c g j k q s x z} → 2, {d t} → 3, {l} → 4, {m n} → 5, {r} → 6

– Replace all but the first letter by phonetic map

– Eliminate adjacent repetitions of codes

– Eliminate all occurrences of code 0

– Truncate to the first 4 characters of the result

• Phonix – similar to Soundex, but with slightly different coding, preceded by 160 letter-group transformations (e.g. x → ecs, tjV → chV at start of word)
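
The Soundex steps listed above translate almost line for line into code. The sketch below follows the procedure as stated on this slide, which may differ in detail from other published Soundex variants (for example, it does not pad short codes). The function name is hypothetical.

```python
# Letter groups and codes as given on the slide.
CODES = {}
for letters, digit in (('aeiouyhw', '0'), ('bpfv', '1'), ('cgjkqsxz', '2'),
                       ('dt', '3'), ('l', '4'), ('mn', '5'), ('r', '6')):
    for ch in letters:
        CODES[ch] = digit


def soundex(word, length=4):
    """Soundex code following the steps on the slide."""
    letters = [ch for ch in word.lower() if ch in CODES]
    if not letters:
        return ''
    # Replace all but the first letter by the phonetic map.
    coded = letters[0] + ''.join(CODES[ch] for ch in letters[1:])
    # Eliminate adjacent repetitions of codes.
    collapsed = coded[0]
    for ch in coded[1:]:
        if ch != collapsed[-1]:
            collapsed += ch
    # Eliminate all occurrences of code 0, then truncate.
    return (collapsed[0] + collapsed[1:].replace('0', ''))[:length]


print(soundex('computer'), soundex('kompyuta'))
```

For computer and kompyuta the digit part of the code is the same (513), but the retained first letters c and k differ. A match between the two therefore also requires mapping the first letter, as in the kmptr/kmpt illustration.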


Zobel-Dart Combined Methods

• In 1995 and 1996 Justin Zobel and Phillip Dart (University of Melbourne) carried out substantial experimental evaluations of multiple approximate string matching methods on large corpora

– (Finding Approximate Matches in Large Lexicons, 1995, and Phonetic String Matching: Lessons from Information Retrieval, 1996)

– Used IR methods (recall/precision) to evaluate match effectiveness

– Concluded that Phonix and Soundex have terrible performance, but Phonix finds matches which other methods don’t

– Developed Phonix+ (modified Phonix without truncation) and the Zobel-Dart algorithm, which

• combined Phonix+ with edit distance for better-performing matching

• rewarded matches at the beginning of the strings

• Zobel-Dart has not been applied to cross-lingual term matching

• String matching may also be useful to correct English spelling errors in the NTCIR lexicon


Discussion of Potential Next Steps

• Further processing of the Kaken sub-collection to remove common prefixes and hence reduce the number of singly occurring pairs (may be useful for chemical compounds)

– kaken-j-0860020200|KYWE| La-Ce geochronometer | La-Ba geochronometer | &lt;^(138)Ce&gt; isotope tracer | Re-Os geochronometer | Decay constant of &lt;^(138)La&gt; | REE pattern | Ce anomaly |KYWD| La-Ce 年代測定法 | La-Ba年代測定法 | 【 ^(138)Ce 】同位体トレーサ | Re-Os 年代測定法 | 【 ^(138)La 】の壊変定数 | REE パターン | Ce 異常 | 希土鉱物

– kaken-j-0891105400|KYWE| Re-Os geochronometer | sulfide ore minerals | ICP mass spectrometer | isotopic equilibrium | mass discrimination | gas-mist merging introduction | molybdenite |KYWD| Re-Os 年代測定法 | 硫化鉱物 | ICP-MS | 同位体平衡 | 質量差別効果 | 蒸気発生 - 混合導入法 | モリブデナイト

– Google translates 年代測定法 as “age determination method”

• Application of Statistical MT (Giza++) to the Kaken sub-collection to enhance the lexicon and cross-validate the keyword matching methods


Discussion of Potential Next Steps

• Assessing the overlap of the sub-collection lexicons and merging into a single lexicon

• Subsetting the lexicon by levels of quality

• Creation of a Katakana subset of the lexicon

– With or without Romanization

• And, of course, public release of the lexicon (with an accompanying paper for LREC 2008)

– Downloadable file(s)

– Online search version

– Has user interface and display issues for chemical formulae

– Professor H Satoh has pointed me to http://homepage3.nifty.com/xymtex/fujitas/rd/choshoe.html#FBOOK6
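
The Katakana subset mentioned above could be selected with a Unicode range test on the Japanese side of each pair. This is a minimal sketch. It assumes the lexicon is a dictionary keyed by (Japanese, English) pair and treats a term as Katakana only if every character is a Katakana letter or the long-vowel mark. The names are hypothetical.

```python
import re

# Katakana letters U+30A1..U+30FA plus the long-vowel mark U+30FC.
KATAKANA_ONLY = re.compile(r'[\u30A1-\u30FA\u30FC]+')


def is_katakana(term):
    """True if the term consists solely of Katakana characters."""
    return KATAKANA_ONLY.fullmatch(term.strip()) is not None


def katakana_subset(pair_counts):
    """Keep only the pairs whose Japanese side is written in Katakana."""
    return {pair: n for pair, n in pair_counts.items() if is_katakana(pair[0])}


lexicon = {
    ('シミュレーション', 'simulation'): 528,
    ('有限要素法', 'finite element method'): 493,
    ('データベース', 'database'): 245,
}
print(katakana_subset(lexicon))
```

Mixed terms such as 画像センサ are excluded by this test. Whether they belong in the subset is a design decision left open here.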


Fini ( 終えられる )

• I enjoyed my time at NII and hope to return

• Thank you very much

• 本当にありがとう (I hope Google translate is correct)