Ir 1 lec 7
Transcript of Ir 1 lec 7
Cross Language IR
Cross Language Information Retrieval
Multilingual CollectionsThere are 6,703 languages listed in the
EthnologueDigital libraries
OCLC Online Computer Library Center serves more than 17,000 libraries in 52 countries and contains over 30 million bibliographic records
with over 500 million records ownership attached in more than 370 languages
World Wide WebAround 40% of Internet users do not speak
English, however, 80% of Web sites are still in EnglishHsin-Hsi Chen 10-3
4
The General ProblemFind documents written in any language
Using queries expressed in a single languageTraditional IR identifies relevant documents in
the same language as the query (monolingual IR)
Cross-language information retrieval (CLIR) tries to identify relevant documents in a language different from that of the query
This problem is more and more acute for IR on the Web due to the fact that the Web is a truly multilingual environment
5
Why is CLIR important?
6
Global Internet User Population
Source: Global Reach
5%
9%
6%
5%
5%
4%
4%
3%
2%
2%
3%
52%
Spanish Japanese German French
Chinese Scandanavian Italian Dutch
Korean Portuguese Other English
8%
8%
5%
3%
21%
2%5%2%5%
3%
6%
32%
Spanish Japanese German
French Chinese Scandanavian
Italian Dutch Korean
Portuguese Other English
8%
12%
6%
4%
8%
2%5%
2%6%2%5%
40%
Spanish Japanese German French
Chinese Scandanavian Italian Dutch
Korean Portuguese Other English
EnglishEnglish
2000 2005
Chinese
Multilingual CollectionsThere are 6,703 languages listed in the
EthnologueDigital libraries
OCLC Online Computer Library Center serves more than 17,000 libraries in 52 countries and contains over 30 million bibliographic records
with over 500 million records ownership attached in more than 370 languages
World Wide WebAround 40% of Internet users do not speak
English, however, 80% of Web sites are still in EnglishHsin-Hsi Chen 10-7
8
Importance of CLIRCLIR research is becoming more and more
important for global information exchange and knowledge sharing.National SecurityForeign Patent Information AccessMedical Information Access for Patients
9
CLIR is MultidisciplinaryCLIR involves researchers from the following fields:
information retrieval, natural language processing, machine translation and summarization, speech processing, document image understanding, human-computer interaction
10
User Needs Search a monolingual collection in a
language that the user cannot read.Retrieve information from a multilingual
collection using a query in a single language.Select images from a collection indexed with
free text captions in an unfamiliar language.Locate documents in a multilingual collection
of scanned page images.
11
Why Do Cross-Language IR?When users can read several languages
Eliminates multiple queriesQuery in most fluent language
Monolingual users can also benefitIf translations can be providedIf it suffices to know that a document existsIf text captions are used to search for images
12
CLIR Experimental System2 systems:
SMART Information retrieval system modified to work with 11 European languages (Danish, Dutch, English, Finnish, French, German, Italian, Norwegian, Portuguese, Spanish, Swedish) Generation of restricted bigrams Pseudo-Relevance feedback
TAPIR is a language model IR system written by M. Srikanth. It has been adated to work with 12 different European languages (Danish, Dutch, English, Finnish, French, German, Italian, Norwegian, Portuguese, Russian, Spanish, Swedish)
13
Approaches to CLIR
14
Design DecisionsWhat to index?
Free text or controlled vocabularyWhat to translate?
Queries or documentsWhere to get translation knowledge?
Dictionary, ontology, training corpus
15
Term-aligned Sentence-aligned Document-aligned Unaligned
Parallel Comparable
Knowledge-based Corpus-based
Controlled Vocabulary Free Text
Cross-Language Text Retrieval
Query Translation Document Translation
Text Translation Vector Translation
Ontology-based Dictionary-based
Thesaurus-based
16
Problems with CLIRMorphological processing difficult for some
languages (e.g. Arabic)Many different encodings for Arabic
Windows Arabic (e.g. dictionaries) Unicode (UTF-8) (e.g. corpus) Macintosh Arabic (e.g. queries)
Normalization Remove diacritics ِب�َّي�ة Arabic (language) الَعَرِب�َّية to الَع�َر�
Standardize spellings for foreign names Kleentoon” vs “Klntoon” for“ آلنتون vs آلَّينتون
Clinton
17
Problems with CLIR (contd)Morphological processing (contd.)
Arabic stemmingRoot + patterns+suffixes+prefixes=wordktb+CiCaC=kitabAll verbs and nouns derived from fewer than 2000 rootsRoots too abstract for information retrievalktb → kitab a book kitabi my bookalkitab the book kitabuki your book (f)kataba to write kitabuka your book (m)maktab office kitabuhu his bookmaktaba library, bookstore ...Want stem=root+pattern+derivational affixes?No standard stemmers available,only morphological (root) analyzers
18
Problems with CLIR (contd)Availability of resources
Names and phrases are very important, most lexicons do not have good coverage
Difficult to get hold of bilingual dictionaries can sometimes be found on the Web e.g. for recent Arabic cross-lingual evaluation we
used 3 on-line Arabic- English dictionaries (including harvesting) and a small lexicon of country and city names
Parallel corpora are more difficult and require more formal arrangements
19
CLIR better than IR?How can cross-language beat within-language?
We know there are translation errorsSurely those errors should hurt performance
Hypothesis is that translation process may disambiguate some query termsWords that are ambiguous in Arabic may not be
ambiguous in EnglishExpansion during translation from English to Arabic
prevents the ambiguity from re-appearingHas been proposed that CLIR is a model for IR
Translate query into one language and then back to original
Given hypothesis, should have an improved queryShould be reasonable to do this across many different
languages