Keyword Translation Accuracy and Cross-Lingual Question Answering in Chinese and Japanese

Teruko Mitamura, Mengqiu Wang, Hideki Shima, Frank Lin

CMU

In EACL 2006

Introduction

JAVELIN is a modular and extensible architecture for building QA systems.

Since JAVELIN is language independent, this paper extends the original English version to cross-lingual question answering (CLQA) in Chinese and Japanese.

JAVELIN

The JAVELIN architecture:

Question Analyzer: producing keywords

Retrieval Strategist: finding documents

Information Extractor: extracting answers

Answer Generator: ranking answers

JAVELIN Extension

Extension for Cross-Lingual QA:

Question Analyzer

Input questions are processed using the RASP parser (Korhonen and Briscoe, 2004). The module output contains three main components:

a) selected keywords

b) the answer type (e.g. numeric-expression, person-name, location)

c) the answer subtype (e.g. author, river, city)

The Question Analyzer module is extended with a keyword translation sub-module.

Translation Module

The Translation Module (TM) selects the best combination of translated keywords from several sources:

Machine Readable Dictionaries (MRDs)

Machine Translation systems (MTs)

Web-mining-Based Keyword Translators (WBMTs) (Nagata et al., 2001; Li et al., 2003)

For E-to-J translation, two MRDs, eight MTs and one WBMT are used. If none of them returns a translation, the word is transliterated into kana for Japanese.

For E-to-C translation, one MRD, three MTs and one WBMT are used.
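The gathering step can be sketched as follows. This is a minimal illustration, not the system's code: each resource (MRD, MT or WBMT) is assumed to be a plain function from a keyword to a list of translations, and `gather_translations` and `fallback` are hypothetical names.

```python
from typing import Callable, List, Optional, Sequence

# Assumption: every resource is modelled as keyword -> list of translations.
Translator = Callable[[str], List[str]]


def gather_translations(
    keyword: str,
    resources: Sequence[Translator],
    fallback: Optional[Callable[[str], str]] = None,
) -> List[str]:
    """Collect de-duplicated candidate translations for one keyword."""
    candidates: List[str] = []
    for translate in resources:
        for translation in translate(keyword):
            if translation and translation not in candidates:
                candidates.append(translation)
    # Hypothetical fallback, e.g. transliteration into kana for E-to-J.
    if not candidates and fallback is not None:
        candidates.append(fallback(keyword))
    return candidates
```

The fallback is only consulted when every resource comes back empty, which mirrors the kana transliteration rule for Japanese.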

Translation Module

After gathering all possible translations for every keyword, the TM uses a noisy channel model to select the best combination of translated keywords.

The TM estimates model statistics using the World Wide Web.
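A minimal sketch of noisy channel selection, assuming the generic decomposition in which a combination is scored as a language model score over the target keywords multiplied by the per-keyword translation probabilities. The exact factorization and the Web-based estimates used by the TM are not reproduced here; `translation_prob` and `language_model` are hypothetical callables supplied by the caller.

```python
from itertools import product
from math import prod
from typing import Callable, List, Optional, Sequence, Tuple


def best_combination(
    candidates: Sequence[Tuple[str, List[str]]],
    translation_prob: Callable[[str, str], float],
    language_model: Callable[[Tuple[str, ...]], float],
) -> Tuple[Optional[Tuple[str, ...]], float]:
    """Return the highest-scoring combination of target keywords.

    candidates holds one (source keyword, [target translations]) pair
    per source keyword.
    """
    sources = [source for source, _ in candidates]
    best, best_score = None, float("-inf")
    # Enumerate one translation per source keyword.
    for combo in product(*(targets for _, targets in candidates)):
        # Assumed channel model: independent per-keyword probabilities.
        channel = prod(translation_prob(s, t) for s, t in zip(sources, combo))
        score = language_model(combo) * channel
        if score > best_score:
            best, best_score = combo, score
    return best, best_score
```

If any keyword has no candidate translation, no combination exists and the function returns `(None, -inf)`.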

TM: Smoothing

When the co-occurrence count of the n target keywords is below a set threshold, a moving window of size n-1 “moves” through the keywords in sequence.

If the co-occurrence counts of all of these sets are above the threshold, the product of the scores of these sets is returned as the language model score.

If not, the window size is decreased and the process is repeated until either all of the split sets are above the threshold or the window size reaches 1.
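The moving-window procedure can be sketched as below. Two points are assumptions rather than details from the slides: the "score" of a keyword set is taken to be its raw co-occurrence count, and a count equal to the threshold is treated as passing. `count` is a hypothetical callable that would, in practice, query the Web.

```python
from math import prod
from typing import Callable, Sequence, Tuple


def smoothed_lm_score(
    keywords: Sequence[str],
    count: Callable[[Tuple[str, ...]], int],
    threshold: int,
) -> float:
    """Language model score with moving-window smoothing."""
    n = len(keywords)
    if n == 0:
        raise ValueError("at least one keyword is required")
    # Start with the full keyword set, then shrink the window.
    for window in range(n, 0, -1):
        # Slide the window through the keywords in sequence.
        subsets = [tuple(keywords[i:i + window]) for i in range(n - window + 1)]
        counts = [count(subset) for subset in subsets]
        # Stop when every subset passes, or when the window cannot shrink.
        if window == 1 or all(c >= threshold for c in counts):
            return float(prod(counts))
```

With three keywords and a failing full-set count, the window of size 2 yields the sets (k1, k2) and (k2, k3). The middle keyword k2 appears in both, which is the source of the extra weight noted among the drawbacks.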

Drawbacks:

1. "Moving-window smoothing" assumes that keywords that are next to each other are also more semantically related, which may not always be the case.

2. "Moving-window smoothing" tends to give the keywords near the middle of the question more weight, which may not be desirable.

TM: Pruning

1. Early Pruning: Possible translations of the individual keywords are pruned before being combined. A very simple pruning heuristic based on a word frequency list is used: very rare translations produced by a resource are not considered.

2. Late Pruning: Possible translation candidates of the entire set of keywords are pruned after calculating translation probabilities. Since the calculation of the translation probabilities requires little access to the web, we can calculate the language model score only for the top N candidates with the highest translation score and prune the rest.
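Both pruning steps can be sketched in a few lines. The names `early_prune`, `late_prune`, `min_frequency` and `top_n` are hypothetical, and the final ranking by the product of the two scores is an assumption in line with a noisy channel model.

```python
import heapq
from typing import Callable, Dict, Hashable, List, Optional, Sequence


def early_prune(
    translations: Sequence[str],
    frequency: Dict[str, int],
    min_frequency: int,
) -> List[str]:
    """Drop translations that are very rare in a word frequency list."""
    return [t for t in translations if frequency.get(t, 0) >= min_frequency]


def late_prune(
    combos: Sequence[Hashable],
    translation_score: Callable[[Hashable], float],
    language_model: Callable[[Hashable], float],
    top_n: int,
) -> Optional[Hashable]:
    """Run the costly language model only on the top N combinations."""
    # Cheap step: rank every combination by translation score.
    shortlist = heapq.nlargest(top_n, combos, key=translation_score)
    # Costly step: one language model call per shortlisted combination.
    return max(
        shortlist,
        key=lambda combo: translation_score(combo) * language_model(combo),
        default=None,
    )
```

Late pruning trades some accuracy for fewer Web queries: a combination outside the top N by translation score is never scored by the language model, even if it would have won overall.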

Retrieval Strategist

The Lemur 3.0 toolkit (Ogilvie and Callan, 2001) is used. Lemur supports structured queries using operators such as Boolean AND, Synonym, Ordered/Un-Ordered Window and NOT.

The Retrieval Strategist (RS) uses an incremental relaxation technique:

It starts from an initial query that is highly constrained. The algorithm searches for all the keywords and data types in close proximity to each other.

Keyword priority is a function of the likely answer type, the keyword type (word, proper name, or phrase) and the inverse document frequency of each keyword.

The query is gradually relaxed until the desired number of relevant documents is retrieved.
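Incremental relaxation can be sketched as a loop that loosens the query until enough documents are found. The relaxation schedule here (widen the proximity window first, then drop the lowest-priority keyword) is an assumption; the slides do not specify the order. `search` and `priority` are hypothetical callables, and no Lemur API is used.

```python
from typing import Callable, List, Sequence


def retrieve_with_relaxation(
    keywords: Sequence[str],
    priority: Callable[[str], float],
    search: Callable[[Sequence[str], int], List[str]],
    wanted: int,
    windows: Sequence[int] = (10, 50, 200),
) -> List[str]:
    """Relax a constrained query until `wanted` documents are retrieved."""
    ranked = sorted(keywords, key=priority, reverse=True)
    found: List[str] = []
    seen = set()
    # Outer loop: drop the lowest-priority keyword at each round.
    for keep in range(len(ranked), 0, -1):
        # Inner loop: widen the proximity window before dropping a keyword.
        for window in windows:
            for doc_id in search(ranked[:keep], window):
                if doc_id not in seen:
                    seen.add(doc_id)
                    found.append(doc_id)
            if len(found) >= wanted:
                return found[:wanted]
    return found
```

Documents matched by the strictest queries come first in the result, so the ordering reflects how tightly each document satisfied the constraints.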

Information Extractor

The Light IX module:

The algorithm considers only those terms that are tagged as named entities and that match the desired answer type.

E-C uses character-based tokenization and the linear Dist measure.

E-J uses word-based tokenization and the logarithmic Dist measure.
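A rough sketch of distance-based candidate scoring follows. The scoring function is not given in the slides, so everything beyond the type filter and the linear/logarithmic choice is an assumption: Dist is taken to be the token gap (or log(1 + gap)), and a candidate's score is the sum of 1 / (1 + Dist) over keyword occurrences, divided by the number of keywords.

```python
import math
from typing import Dict, Sequence, Set


def dist(a: int, b: int, logarithmic: bool = False) -> float:
    """Assumed Dist measure: linear token gap or log(1 + gap)."""
    gap = abs(a - b)
    return math.log(1 + gap) if logarithmic else float(gap)


def score_candidates(
    tokens: Sequence[str],
    entity_types: Dict[int, str],
    keywords: Set[str],
    answer_type: str,
    logarithmic: bool = False,
) -> Dict[int, float]:
    """Score tokens whose named-entity type matches the answer type."""
    keyword_positions = [i for i, tok in enumerate(tokens) if tok in keywords]
    scores: Dict[int, float] = {}
    if not keyword_positions:
        return scores
    for i in range(len(tokens)):
        # Only named entities of the desired answer type are considered.
        if entity_types.get(i) != answer_type:
            continue
        total = sum(1.0 / (1.0 + dist(i, k, logarithmic)) for k in keyword_positions)
        scores[i] = total / len(keywords)
    return scores
```

For character-based tokenization, as in E-C, `tokens` would be the list of characters in the passage; for word-based tokenization, as in E-J, it would be the output of a word segmenter.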

Answer Generator

The task of the AG module is to produce a ranked list of answer candidates from the IX output.

The AG is designed to resolve representational differences and combine answer candidates that differ only in surface form.

Even though the AG module plays an important role in JAVELIN, its full potential is not used in the E-C and E-J systems, since some language-specific resources required for multilingual answer merging are lacking.
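Merging candidates that differ only in surface form can be sketched as below. This is a generic illustration, not the AG implementation: the normalization (Unicode NFKC, case folding, removal of whitespace and middle dots) and the summing of scores are assumptions, and `merge_answers` is a hypothetical name.

```python
import unicodedata
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def normalize(answer: str) -> str:
    """Map surface variants of an answer to one comparison key."""
    text = unicodedata.normalize("NFKC", answer).casefold()
    # Assumed rule: whitespace and middle dots carry no meaning.
    return "".join(ch for ch in text if not ch.isspace() and ch not in "・·")


def merge_answers(candidates: Sequence[Tuple[str, float]]) -> List[Tuple[str, float]]:
    """Combine candidates with the same key and rank by total score."""
    totals: Dict[str, float] = defaultdict(float)
    surface: Dict[str, str] = {}
    for answer, score in candidates:
        key = normalize(answer)
        totals[key] += score
        # Keep the first surface form seen as the displayed answer.
        surface.setdefault(key, answer)
    ranked = sorted(totals, key=lambda key: totals[key], reverse=True)
    return [(surface[key], totals[key]) for key in ranked]
```

Under these assumptions, two spellings of the same name that differ only by a middle dot pool their scores and can outrank a single higher-scoring candidate.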

Experiment Settings

Three different runs were carried out for both the E-C and E-J systems. The same 200-question test set and the document corpora provided by the NTCIR CLQA task are used.

The first run: A fully automatic run using the original translation module in the CLQA system.

The second run: The keywords selected by the Question Analyzer module are manually translated.

The third run: The corresponding term for each English keyword is looked up in the translations of the English questions provided by the NTCIR organizers.

Experiment Results

Discussions

Chinese:

Word sense disambiguation for keywords, e.g. “埋” and “葬”

Regional language differences, e.g. “软件” and “軟體”

Japanese:

Representational gaps, e.g. “ヴェルナー・シュピース” and “ヴェルナーシュピース”. This problem can be resolved by Lemur.

Transliteration in WBMT

Japanese nouns written in romaji (e.g. Funabashi) are transliterated into hiragana for a better result in WBMT, since there may be higher positive co-occurrence between kana and kanji (e.g. “ふなばし” and “船橋”) than between romaji and kanji (e.g. “funabashi” and “船橋”).

Document Retrieval in Kana

Transliteration from kana into kanji is more ambiguous than transliteration from romaji into kana. Therefore, indexing kana readings in the corpus and querying in kana is sometimes a useful technique for CLQA.
