Multilingual experiments of UTA @ CLEF 2003
Eija Airio, Heikki Keskustalo, Turid Hedlund, Ari Pirkola
University of Tampere, Finland
Department of Information Studies
Multilingual indexing: two possibilities
- create a common index for all the languages
- create a separate index for each language

UTA followed the separate-index approach.
Our result merging strategies in CLEF 2003
- the raw score approach as a baseline
- the dataset-size-based method: per-language quotas proportional to dataset size: 185 German, 81 French, 99 Italian, 106 English, 285 Spanish, 120 Dutch, 35 Finnish and 89 Swedish documents (sum = 1000 docs)
- the score-difference-based method: every score is compared with the best score of the topic, and only documents whose score differs from the best by less than a predefined value are taken to the final list. E.g., if the best score of the topic is 0.480001 and the difference value is 0.08, a document with score 0.400002 is taken, but not a document with score 0.400001
- the final ordering (1000 docs / topic) is done by the raw-score merging strategy
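The two non-baseline strategies above can be sketched in a few lines of Python. This is an illustrative reconstruction from the slide's description, not the actual UTACLIR merging code; the function and variable names are my own, and `runs` is assumed to map each language to a list of `(doc_id, score)` pairs ranked by score.

```python
def dataset_size_merge(runs, quotas, k=1000):
    """Dataset-size-based merging: take a per-language quota of
    top-ranked documents (quotas proportional to dataset size,
    e.g. 185 German + ... + 89 Swedish = 1000), then order the
    merged list by raw score."""
    merged = []
    for lang, ranked in runs.items():
        merged.extend(ranked[:quotas[lang]])
    merged.sort(key=lambda pair: pair[1], reverse=True)
    return merged[:k]

def score_difference_merge(runs, max_diff=0.08, k=1000):
    """Score-difference-based merging: keep only documents whose
    score differs from the topic's best score by less than
    max_diff, then order the kept documents by raw score."""
    best = max(score for ranked in runs.values() for _, score in ranked)
    kept = [(doc, score)
            for ranked in runs.values()
            for doc, score in ranked
            if best - score < max_diff]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:k]
```

With a best score of 0.5 and `max_diff=0.1`, a document scoring 0.45 is kept (difference 0.05) while one scoring 0.35 is dropped, mirroring the 0.480001/0.08 example above.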
Indexing methods
- inflected index: dataset words are stored as such; employed by WWW search engines
- normalized index: stemming or morphological analysis

We applied normalized indexing in our CLEF 2003 runs.
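The difference between the two index types can be shown with a toy inverted index. This is only an illustration of the concept, not the indexing software used in the runs; `normalize` here is a crude stand-in (lowercase plus stripping a final "s") for a real stemmer or morphological analyzer.

```python
from collections import defaultdict

def normalize(word):
    # toy normalizer standing in for a stemmer/analyzer:
    # lowercase and strip a trailing plural "s" (an assumption,
    # far cruder than any real normalization method)
    w = word.lower()
    return w[:-1] if w.endswith("s") else w

def build_index(docs, normalizer=None):
    """Build an inverted index doc_id sets; with normalizer=None
    the surface (inflected) forms are stored as such."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.split():
            term = normalizer(word) if normalizer else word.lower()
            index[term].add(doc_id)
    return index

docs = {1: "the system runs", 2: "a test run"}
inflected = build_index(docs)              # "runs" and "run" stay distinct
normalized = build_index(docs, normalize)  # both conflate to "run"
```

A query for "run" matches only document 2 in the inflected index but both documents in the normalized one, which is why normalization matters for recall.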
Word normalization methods
- stemming: suitable for languages with weak morphology; several stemming techniques exist; in CLEF 2003 we mostly applied stemmers based on the Porter stemmer
- morphological analysis: full description of inflectional morphology; large lexicon of basic vocabulary; suitable for languages with strong morphology
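Suffix-stripping stemming of the kind listed above can be sketched as follows. This is a deliberately minimal sketch in the spirit of the Porter stemmer, not the real algorithm (which applies ordered rule phases with measure conditions); the suffix list and minimum-stem length are illustrative assumptions.

```python
# longest suffixes first, so "ing" is tried before "s"
SUFFIXES = ["ingly", "edly", "ing", "ed", "ly", "es", "s"]

def simple_stem(word):
    """Strip the first matching suffix, but only if at least
    three characters of stem would remain."""
    w = word.lower()
    for suffix in SUFFIXES:
        if w.endswith(suffix) and len(w) - len(suffix) >= 3:
            return w[: -len(suffix)]
    return w
```

A rule-based stripper like this works tolerably for weakly inflecting languages such as English, but cannot cope with the compounding and rich case systems of Finnish or Swedish, which is what motivates full morphological analysis there.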
UTA indexes
UTA applied both stemmers and morphological analyzers in the multilingual runs of CLEF 2003:
- for English, Finnish and Swedish we built both stemmed and morphologically analyzed indexes
- for Dutch, French, German, Italian and Spanish we built stemmed indexes
The UTACLIR process
- each source word is normalized using a morphological analyzer
- source stop words are removed
- each normalized source word is translated
- translated words are normalized (by a morphological analyzer or a stemmer, depending on the target language code)
- target stop words are removed
- if a source word is untranslatable, the two highest-ranked words obtained by n-gram matching against the target index are selected as query words
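The steps above can be sketched as a single pipeline function. This is a reconstruction of the process as described, not the actual UTACLIR implementation; the analyzers, dictionary, and n-gram matcher are passed in as stand-in callables, and all names are my own.

```python
def translate_query(source_words, source_stops, target_stops,
                    normalize_source, translate, normalize_target,
                    ngram_candidates):
    """Sketch of the UTACLIR query-translation pipeline.
    translate(word) returns a list of translations or None;
    ngram_candidates(word) returns target-index words ranked
    by n-gram similarity."""
    query = []
    for word in source_words:
        base = normalize_source(word)          # 1. normalize source word
        if base in source_stops:               # 2. drop source stop words
            continue
        translations = translate(base)         # 3. dictionary translation
        if translations:
            for t in translations:
                norm = normalize_target(t)     # 4. normalize translations
                if norm not in target_stops:   # 5. drop target stop words
                    query.append(norm)
        else:
            # 6. untranslatable word: take the two highest-ranked
            #    n-gram matches from the target index
            query.extend(ngram_candidates(base)[:2])
    return query
```

The n-gram fallback in step 6 is what lets proper names and other out-of-dictionary words still contribute useful query terms.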
Our results

| index type   | merging strategy | average precis. % | difference % |
|--------------|------------------|-------------------|--------------|
| morph./stem. | raw score        | 18.6              |              |
| morph./stem. | dataset size     | 18.3              | -1.6         |
| morph./stem. | score diff/top   | 18.2              | -2.1         |
| morph./stem. | round robin      | 18.4              | -1.1         |
| stemmed      | dataset size     | 18.6              | 0.0          |
| stemmed      | score diff/top   | 18.5              | -0.5         |
| stemmed      | raw score        | 18.3              | -1.6         |
| stemmed      | round robin      | 18.4              | -1.1         |
The results of our additional monolingual English, bilingual English-Finnish and bilingual English-Swedish runs

| language | index type     | average precis. % | difference % |
|----------|----------------|-------------------|--------------|
| English  | morphol. anal. | 45.6              |              |
| English  | stemmed        | 46.3              | +1.5         |
| Finnish  | morphol. anal. | 34.0              |              |
| Finnish  | stemmed        | 19.0              | -44.1        |
| Swedish  | morphol. anal. | 27.1              |              |
| Swedish  | stemmed        | 19.0              | -29.9        |
Conclusions
- all the result merging strategies we applied produced almost equal results
- in the multilingual task, performance did not vary with the index type
Conclusions II
- the impact of different word normalization methods on IR performance has not been investigated properly
- our monolingual and bilingual tests show that stemming is an adequate normalization method for English, but not for Finnish or Swedish
- so far, morphological analysis seems to offer a hard baseline for competing methods (e.g., stemming) in Finnish and Swedish
- the reasons why stemming is not adequate for Finnish and Swedish may differ between the two languages and should be investigated