Ir 1 lec 7

Cross Language IR

Cross Language Information Retrieval

Multilingual CollectionsThere are 6,703 languages listed in the

EthnologueDigital libraries

OCLC Online Computer Library Center serves more than 17,000 libraries in 52 countries and contains over 30 million bibliographic records

with over 500 million records ownership attached in more than 370 languages

World Wide WebAround 40% of Internet users do not speak

English, however, 80% of Web sites are still in EnglishHsin-Hsi Chen 10-3

4

The General ProblemFind documents written in any language

Using queries expressed in a single languageTraditional IR identifies relevant documents in

the same language as the query (monolingual IR)

Cross-language information retrieval (CLIR) tries to identify relevant documents in a language different from that of the query

This problem is more and more acute for IR on the Web due to the fact that the Web is a truly multilingual environment

5

Why is CLIR important?

6

Global Internet User Population

Source: Global Reach

5%

9%

6%

5%

5%

4%

4%

3%

2%

2%

3%

52%

Spanish Japanese German French

Chinese Scandanavian Italian Dutch

Korean Portuguese Other English

8%

8%

5%

3%

21%

2%5%2%5%

3%

6%

32%

Spanish Japanese German

French Chinese Scandanavian

Italian Dutch Korean

Portuguese Other English

8%

12%

6%

4%

8%

2%5%

2%6%2%5%

40%

Spanish Japanese German French

Chinese Scandanavian Italian Dutch

Korean Portuguese Other English

EnglishEnglish

2000 2005

Chinese

Multilingual CollectionsThere are 6,703 languages listed in the

EthnologueDigital libraries

OCLC Online Computer Library Center serves more than 17,000 libraries in 52 countries and contains over 30 million bibliographic records

with over 500 million records ownership attached in more than 370 languages

World Wide WebAround 40% of Internet users do not speak

English, however, 80% of Web sites are still in EnglishHsin-Hsi Chen 10-7

8

Importance of CLIRCLIR research is becoming more and more

important for global information exchange and knowledge sharing.National SecurityForeign Patent Information AccessMedical Information Access for Patients

9

CLIR is MultidisciplinaryCLIR involves researchers from the following fields:

information retrieval, natural language processing, machine translation and summarization, speech processing, document image understanding, human-computer interaction

10

User Needs Search a monolingual collection in a

language that the user cannot read.Retrieve information from a multilingual

collection using a query in a single language.Select images from a collection indexed with

free text captions in an unfamiliar language.Locate documents in a multilingual collection

of scanned page images.

11

Why Do Cross-Language IR?When users can read several languages

Eliminates multiple queriesQuery in most fluent language

Monolingual users can also benefitIf translations can be providedIf it suffices to know that a document existsIf text captions are used to search for images

12

CLIR Experimental System2 systems:

SMART Information retrieval system modified to work with 11 European languages (Danish, Dutch, English, Finnish, French, German, Italian, Norwegian, Portuguese, Spanish, Swedish) Generation of restricted bigrams Pseudo-Relevance feedback

TAPIR is a language model IR system written by M. Srikanth. It has been adated to work with 12 different European languages (Danish, Dutch, English, Finnish, French, German, Italian, Norwegian, Portuguese, Russian, Spanish, Swedish)

13

Approaches to CLIR

14

Design DecisionsWhat to index?

Free text or controlled vocabularyWhat to translate?

Queries or documentsWhere to get translation knowledge?

Dictionary, ontology, training corpus

15

Term-aligned Sentence-aligned Document-aligned Unaligned

Parallel Comparable

Knowledge-based Corpus-based

Controlled Vocabulary Free Text

Cross-Language Text Retrieval

Query Translation Document Translation

Text Translation Vector Translation

Ontology-based Dictionary-based

Thesaurus-based

16

Problems with CLIRMorphological processing difficult for some

languages (e.g. Arabic)Many different encodings for Arabic

Windows Arabic (e.g. dictionaries) Unicode (UTF-8) (e.g. corpus) Macintosh Arabic (e.g. queries)

Normalization Remove diacritics ِب�َّي�ة Arabic (language) الَعَرِب�َّية to الَع�َر�

Standardize spellings for foreign names Kleentoon” vs “Klntoon” for“ آلنتون vs آلَّينتون

Clinton

17

Problems with CLIR (contd)Morphological processing (contd.)

Arabic stemmingRoot + patterns+suffixes+prefixes=wordktb+CiCaC=kitabAll verbs and nouns derived from fewer than 2000 rootsRoots too abstract for information retrievalktb → kitab a book kitabi my bookalkitab the book kitabuki your book (f)kataba to write kitabuka your book (m)maktab office kitabuhu his bookmaktaba library, bookstore ...Want stem=root+pattern+derivational affixes?No standard stemmers available,only morphological (root) analyzers

18

Problems with CLIR (contd)Availability of resources

Names and phrases are very important, most lexicons do not have good coverage

Difficult to get hold of bilingual dictionaries can sometimes be found on the Web e.g. for recent Arabic cross-lingual evaluation we

used 3 on-line Arabic- English dictionaries (including harvesting) and a small lexicon of country and city names

Parallel corpora are more difficult and require more formal arrangements

19

CLIR better than IR?How can cross-language beat within-language?

We know there are translation errorsSurely those errors should hurt performance

Hypothesis is that translation process may disambiguate some query termsWords that are ambiguous in Arabic may not be

ambiguous in EnglishExpansion during translation from English to Arabic

prevents the ambiguity from re-appearingHas been proposed that CLIR is a model for IR

Translate query into one language and then back to original

Given hypothesis, should have an improved queryShould be reasonable to do this across many different

languages

Ir 1 lec 7

Documents

Transcript of Ir 1 lec 7