
Multilingual experiments of UTA @ CLEF 2003

Eija Airio, Heikki Keskustalo, Turid Hedlund, Ari Pirkola

University of Tampere, Finland
Department of Information Studies


Multilingual indexing: two possibilities

• create a common index for all the languages
• create a separate index for each language

UTA followed the approach of separate indexes.


Our result merging strategies in CLEF 2003

• the raw score approach as a baseline
• the dataset size based method
  - 185 German, 81 French, 99 Italian, 106 English, 285 Spanish, 120 Dutch, 35 Finnish and 89 Swedish documents (sum = 1000 docs)
• the score difference based method
  - every score is compared with the best score of the topic
  - only documents whose score differs from the best score by less than a predefined value are taken to the final list
  - e.g., if the best score of the topic is 0.480001 and the difference value is 0.08, we take a document with score 0.400002 but not a document with score 0.400001
• the final ordering (1000 docs / topic) is done by the raw score merging strategy
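The three strategies above can be sketched in Python. This is a toy illustration, not the actual CLEF run code: the data layout, the 0.08 threshold, and the choice to compare against each language run's own best score are illustrative assumptions.

```python
# Toy sketch of the merging strategies described above.
# Each per-language run is a list of (doc_id, score) pairs, best score first.

def raw_score_merge(runs, k=1000):
    """Baseline: pool all per-language results and sort by raw score."""
    pooled = [pair for run in runs.values() for pair in run]
    return sorted(pooled, key=lambda p: p[1], reverse=True)[:k]

def dataset_size_merge(runs, sizes, k=1000):
    """Take from each language a quota proportional to its collection size,
    then order the picked documents by raw score."""
    total = sum(sizes.values())
    picked = []
    for lang, run in runs.items():
        quota = round(k * sizes[lang] / total)
        picked.extend(run[:quota])
    return sorted(picked, key=lambda p: p[1], reverse=True)[:k]

def score_difference_merge(runs, max_diff=0.08, k=1000):
    """Keep only documents whose score is within max_diff of the best score
    (here: the best score of each language run, an assumption on our part);
    the final ordering is again done by raw score."""
    picked = []
    for run in runs.values():
        if not run:
            continue
        best = run[0][1]
        picked.extend(p for p in run if best - p[1] < max_diff)
    return sorted(picked, key=lambda p: p[1], reverse=True)[:k]
```

With `max_diff=0.08` and a best score of 0.48, a document scoring 0.41 survives (difference 0.07) while one scoring 0.39 is dropped, mirroring the strict-inequality example on the slide.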


Indexing methods

• inflected index
  - dataset words are stored as such
  - employed by WWW search engines
• normalized index
  - stemming
  - morphological analysis
• we applied normalized indexing in our CLEF 2003 runs
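The difference between the two index types can be shown with a toy inverted index; the one-rule "stemmer" below is a crude stand-in for a real stemmer such as Porter's, and the two documents are invented.

```python
# Toy contrast between an inflected index (words stored as such)
# and a normalized index (words reduced to a common form first).

def crude_stem(word):
    """Crude suffix stripping; a stand-in for a real Porter stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def build_index(docs, normalize=None):
    """Map each (optionally normalized) word to the set of doc ids."""
    index = {}
    for doc_id, text in docs.items():
        for word in text.lower().split():
            key = normalize(word) if normalize else word  # inflected: as such
            index.setdefault(key, set()).add(doc_id)
    return index

docs = {1: "merged results", 2: "merging the result lists"}
inflected = build_index(docs)             # "merged" and "merging" stay distinct
normalized = build_index(docs, crude_stem)  # both collapse to "merg"
```

In the inflected index a query for "results" reaches only document 1; in the normalized index "results" and "result" share one posting list covering both documents.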


Word normalization methods

• stemming
  - suitable for languages with weak morphology
  - several stemming techniques exist
  - in CLEF 2003 we mostly applied stemmers based on the Porter stemmer
• morphological analysis
  - full description of inflectional morphology
  - large lexicon of basic vocabulary
  - suitable for languages with strong morphology
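The mechanical difference is that a stemmer applies suffix rules without a lexicon, while a morphological analyzer looks inflected forms up in a lexicon of base vocabulary. A minimal contrast, where the single rule and the two-entry lexicon are purely illustrative:

```python
# Toy contrast of the two word normalization methods.

# Stemming: rule-based suffix stripping, no lexicon.
def toy_stem(word):
    return word[:-2] if word.endswith("es") else word

# Morphological analysis: lexicon lookup of full inflected forms
# (a hand-made two-entry lexicon, illustrative only).
TOY_LEXICON = {"boxes": "box", "series": "series"}

def toy_analyze(word):
    return TOY_LEXICON.get(word, word)

# The suffix rule happens to work for "boxes" but over-stems "series",
# while the lexicon returns the correct base form for both.
```

This also suggests why rule-only stemming degrades for morphologically rich languages: the richer the inflection, the more often a blind suffix rule either misses a form or truncates it wrongly.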


UTA indexes

• UTA applied both stemmers and morphological analyzers in the multilingual runs of CLEF 2003
• we built both stemmed and morphologically analyzed indexes for English, Finnish and Swedish
• for Dutch, French, German, Italian and Spanish we built stemmed indexes


The UTACLIR process

• each source word is normalized utilizing a morphological analyzer
• source stop words are removed
• each normalized source word is translated
• translated words are normalized (by a morphological analyzer or a stemmer, depending on the target language code)
• target stop words are removed
• if a source word is untranslatable, the two highest ranked words obtained by n-gram matching against the target index are selected as query words
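The steps above can be sketched as a pipeline. The helper names (`translate`, `normalize_src`, the stop-word sets) are hypothetical stand-ins for the real analyzers, dictionaries and lists, and the digram-based Dice similarity is only one common way to realize the n-gram matching step, not necessarily the one UTACLIR used.

```python
# Sketch of the UTACLIR query-construction steps listed above.

def ngrams(word, n=2):
    """Set of character n-grams (digrams by default) of a word."""
    return {word[i:i + n] for i in range(len(word) - n + 1)}

def ngram_match(word, index_vocab, top_k=2):
    """Fallback for untranslatable words: pick the top_k target-index
    words sharing the most digrams with the source word (Dice score)."""
    def dice(a, b):
        ga, gb = ngrams(a), ngrams(b)
        return 2 * len(ga & gb) / (len(ga) + len(gb) or 1)
    return sorted(index_vocab, key=lambda w: dice(word, w), reverse=True)[:top_k]

def utaclir_query(source_words, translate, normalize_src, normalize_tgt,
                  src_stop, tgt_stop, index_vocab):
    query = []
    for word in source_words:
        stem = normalize_src(word)               # 1. normalize source word
        if stem in src_stop:                     # 2. drop source stop words
            continue
        translations = translate(stem)           # 3. dictionary translation
        if not translations:                     # 6. untranslatable: n-gram fallback
            query.extend(ngram_match(stem, index_vocab))
            continue
        for t in translations:
            t_norm = normalize_tgt(t)            # 4. normalize translated word
            if t_norm not in tgt_stop:           # 5. drop target stop words
                query.append(t_norm)
    return query
```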


Our results

index type     merging strategy   average precis. %   difference %
morph./stem.   raw score          18.6
morph./stem.   dataset size       18.3                -1.6
morph./stem.   score diff/top     18.2                -2.1
morph./stem.   round robin        18.4                -1.1
stemmed        dataset size       18.6                 0.0
stemmed        score diff/top     18.5                -0.5
stemmed        raw score          18.3                -1.6
stemmed        round robin        18.4                -1.1


The results of our additional monolingual English, bilingual English-Finnish and bilingual English-Swedish runs

language   index type       average precis. %   difference %
English    morphol. anal.   45.6
English    stemmed          46.3                +1.5
Finnish    morphol. anal.   34.0
Finnish    stemmed          19.0                -44.1
Swedish    morphol. anal.   27.1
Swedish    stemmed          19.0                -29.9


Conclusions

• all the result merging strategies we applied produced almost equal results
• in the multilingual task, performance did not vary depending on the index type


Conclusions II

• the impact of different word normalization methods on IR performance has not yet been investigated properly
• our monolingual and bilingual tests show that stemming is an adequate normalization method for English, but not for Finnish and Swedish
• so far, morphological analysis seems to offer a hard baseline for competing methods (e.g., stemming) in Finnish and Swedish
• the reasons why stemming is not adequate for Finnish and Swedish may be different for the two languages and should be investigated