
CROSS LANGUAGE INFORMATION RETRIEVAL FOR LANGUAGES WITH SCARCE RESOURCES

Christian E. Loza

Thesis Prepared for the Degree of

MASTER OF SCIENCE

UNIVERSITY OF NORTH TEXAS

May 2009

APPROVED:

Rada Mihalcea, Major Professor
Paul Tarau, Committee Member
Miguel Ruiz, Committee Member
Armin Mikler, Graduate Advisor
Krishna Kavi, Chair of the Department of Computer Science and Engineering
Michael Monticino, Interim Dean of the Robert B. Toulouse School of Graduate Studies


Loza, Christian. Cross Language Information Retrieval for Languages with Scarce Resources. Master of Science (Computer Science), May 2009, 53 pp., 2 tables, 20 illustrations, bibliography, 21 titles.

Our generation has experienced one of the most dramatic changes in how society 

communicates. Today, we have online information on almost any imaginable topic. However, 

most of this information is available in only a few dozen languages. In this thesis, I explore the 

use of parallel texts to enable cross‐language information retrieval (CLIR) for languages with 

scarce resources. To build the parallel text I use the Bible. I evaluate different variables and 

their impact on the resulting CLIR system, specifically: (1) the CLIR results when using different 

amounts of parallel text; (2) the role of paraphrasing on the quality of the CLIR output; (3) the 

impact on accuracy when translating the query versus translating the collection of documents; 

and finally (4) how the results are affected by the use of different dialects. The results show 

that all these variables have a direct impact on the quality of the CLIR system. 

 

 


Copyright 2009

By

Christian E. Loza



CONTENTS

CHAPTER 1. INTRODUCTION
1.1. Problem Statement
1.2. Contribution

CHAPTER 2. A BRIEF REVIEW OF INFORMATION RETRIEVAL, MACHINE TRANSLATION AND NLP FOR SCARCE RESOURCE LANGUAGES
2.1. Information Retrieval
2.1.1. Definition of Information Retrieval
2.1.2. Formal Definition and Model
2.1.3. Boolean Model
2.1.4. Vector Space Model
2.1.5. Probabilistic Model
2.1.6. Other Models
2.2. Machine Translation
2.2.1. Development of Machine Translation
2.2.2. Approaches to MT
2.2.3. Software Resources for Machine Translation
2.3. Cross-language Information Retrieval
2.3.1. The Importance of Cross Language Information Retrieval
2.3.2. Main Approaches to CLIR
2.3.3. Challenges Associated with CLIR for Languages with Scarce Resources
2.4. Related Work and Current Research
2.4.1. Computational Linguistics for Resource-Scarce Languages
2.4.2. Focusing on a Particular Language
2.4.3. NLP for Languages with Scarce Resources
2.4.4. CLIR for Languages with Scarce Resources
2.4.5. Cross Language Information Retrieval with Dictionary Based Methods
2.4.6. Cross Language Information Retrieval Experiments
2.4.7. Word Alignment for Languages with Scarce Resources

CHAPTER 3. EXPERIMENTAL FRAMEWORK
3.1. Metrics
3.2. Data Sources
3.2.1. Time Collection
3.2.2. Parallel Text

CHAPTER 4. EXPERIMENTAL RESULTS
4.1. Objectives
4.2. Experiments
4.2.1. Experiment 1
4.2.2. Experiment 2
4.2.3. Experiment 3
4.2.4. Experiment 4
4.2.5. Experiment 5

CHAPTER 5. DISCUSSION AND CONCLUSIONS
5.1. Contributions
5.2. Conclusions

APPENDIX. TIME COLLECTION FILES

BIBLIOGRAPHY


CHAPTER 1

INTRODUCTION

1.1. Problem Statement

One of the most important changes in our society and the way we live in the last decades is the availability of information, a result of the development of the Internet and the World Wide Web.

Since English is the lingua franca of the modern era, and the United States is the place where the Internet was born, most of the current content on the World Wide Web is written in English.

Along with English, major European languages such as Spanish, French and German, and others such as Chinese, Russian and Japanese, are examples of the few dozen languages with a significant amount of content available on the Web, in direct relationship with the number of speakers of those languages who have regular access to the Web.

Figure 1.1, based on information reported in [8], shows that most of the languages in the

world have relatively few speakers.

Figure 1.1. Distribution of number of speakers.



As a consequence of this distribution of content, out of all the languages spoken in the world today, only a few dozen have a significant number of computational resources developed for them. Most currently spoken languages have little or no computational resources.

I will call the first set of languages “languages with computational resources,” as opposed

to the rest of the languages spoken in the world, which I call “scarce resource languages.”

We want to study methods to develop standard natural language processing (NLP) computational resources for these languages, and to see how different approaches to producing them affect other tasks. In particular, in this thesis I concentrate on the task of information retrieval.

I chose Quechua for this study because it has fourteen million speakers, but most of them have little or no access to the Web, making it a scarce resource language.

As shown in earlier works such as [1] and [5], even gathering the initial resources is a difficult task when creating computational resources for a language that does not have them. The quantity of documents available for any language is generally proportional to the number of its speakers; most scarce resource languages have relatively few speakers and few written documents available, which increases the difficulty of obtaining them.

Other problems in developing initial resources are the absence of a single standard written form of the language, the availability of an alphabet, the presence of dialects, and the use of different character sets with current software tools. All these problems are common to most scarce resource languages, and many researchers have proposed different approaches to them, according to the language that was the focus of each study.

1.2. Contribution

In this thesis, we analyze specific aspects of the construction and use of parallel corpora for information retrieval.



More specifically, we want to discuss the following:

(i) How much does the amount of parallel data affect the precision of information retrieval?

(ii) Is it possible to increase the precision of the information retrieval results using paraphrasing?

(iii) Is there a significant difference if we translate the query or the collection of documents? How is this affected by the other factors?

(iv) Is it possible to use information retrieval across different dialects of the same language? How much is the precision affected by this?



CHAPTER 2

A BRIEF REVIEW OF INFORMATION RETRIEVAL, MACHINE TRANSLATION AND

NLP FOR SCARCE RESOURCE LANGUAGES

2.1. Information Retrieval

Information retrieval (IR) is defined as the task of retrieving, from a collection, the documents that satisfy a query or an information need [2]. In the context of computer science, IR is a branch of natural language processing (NLP) that studies the automatic search of information and all the subtasks associated with it.

Remarkable events in the development of IR include Herman Hollerith's mechanical tabulator, based on punch cards, built in 1890 to rapidly tabulate statistics; Vannevar Bush's 'As We May Think', which appeared in the Atlantic Monthly in 1945; and the 1947 work of Hans Peter Luhn (research engineer at IBM) on mechanized punch cards for searching chemical compounds. More recently, in 1992 the first Text Retrieval Conference (TREC) took place.

One of the most important developments in IR to date has been the rise of massive search engines in the last decade. A decade ago libraries were still the main place to search for information, but today we can find information on almost any topic using the Internet.

2.1.1. Definition of Information Retrieval

IR studies the representation, storage, organization of, and access to information items [2]. A retrieval system receives a query from a user and returns an information item or items, sorted by their degree of relevance to the query. An information retrieval system tries to retrieve, from the collection, the information that will answer the particular query.



The availability of resources and information on the Web has changed how we search for information. The Internet has not only created a vast number of people looking for information, it has also created a greater number of people writing content on all subjects.

2.1.2. Formal Definition and Model

IR can be defined using the following model [2]:

(1) [D, Q, F, R(qi, dj)]

where:

(i) D is a set of logical representations for the documents.

(ii) Q is a set composed of logical representations of the user information needs, called queries.

(iii) F is a framework for modeling the documents, the queries and their relationships.

(iv) R(qi, dj) is a ranking function.

There are three main models in IR: the Boolean model, the vector space model, and the probabilistic model.

2.1.3. Boolean Model

The Boolean model is based on Boolean algebra: the queries are specified in Boolean terms, using operators such as AND and OR. This gives the model simplicity and clarity, and it is easy to formalize and implement.

Unfortunately, these same characteristics also present disadvantages. Because the model is binary, it prevents any notion of scale, which tends to decrease the quality of the results returned by the retrieval system.



The representation of a document consists of a set of term weights assumed to be binary. A query is composed of index terms linked by the Boolean operators AND, OR and NOT.

The Boolean model can be defined as follows [2]:

(i) D: The terms can either be present (True) or absent (False).

(ii) Q: The queries are Boolean expressions, linking terms with operators.

(iii) F: The framework is Boolean algebra, with its operators AND, OR and NOT.

(iv) The similarity function is defined in the context of Boolean algebra as:

(2) sim(dj, q) = True if ∃ qcc | (qcc ∈ qdnf ∧ (∀ ki, gi(dj) = gi(qcc))); False otherwise

where qdnf is the disjunctive normal form of the query q, qcc is one of its conjunctive components, and gi(dj) is the (binary) weight of the index term ki in document dj.

As shown above, the relevance criterion is a Boolean value: true (relevant) or false (not relevant). Another disadvantage of this model is that the ranking function does not indicate how relevant a document is, so it is not possible to compare its relevance relative to other documents. This makes the model less flexible than the other two.

As a result of these disadvantages, but taking into account all the advantages, the Boolean model is generally used in conjunction with the other models, or modified to express a degree of relevance, as in the fuzzy Boolean model.
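The binary matching described above can be sketched in a few lines of Python; the toy collection and the index below are invented for illustration, and only conjunctive (AND) queries are shown:

```python
# Minimal Boolean (conjunctive) retrieval sketch; the toy collection is invented.
docs = {
    "d1": "the grace of our lord",
    "d2": "grace and truth",
    "d3": "the lord is my shepherd",
}

# Term-incidence index: term -> set of documents in which the term is present.
index = {}
for doc_id, text in docs.items():
    for term in text.split():
        index.setdefault(term, set()).add(doc_id)

def boolean_and(*terms):
    """Documents satisfying 'term1 AND term2 AND ...': binary relevance only."""
    sets = [index.get(t, set()) for t in terms]
    return sorted(set.intersection(*sets)) if sets else []

print(boolean_and("grace", "lord"))  # → ['d1']
```

Note that every matching document is returned as equally relevant; OR and NOT can be added with set union and difference, which illustrates both the model's simplicity and its lack of ranking.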

2.1.4. Vector Space Model

The vector space model presents a framework where a document is represented as a vector in an N-dimensional space [18]. Each dimension of the vector is a weight wi, a non-negative, non-binary number. The space has N dimensions, corresponding to the unique terms present in the collection. The query is also defined as a vector in the same N-dimensional space.

(i) D: The N terms are dimensions, and the documents are vectors in this space.

(ii) Q: The queries qj are also vectors in this N-dimensional space.



(iii) F: The framework is defined in the context of vector analysis and its operations. One of the most commonly used metrics of distance between vectors is the cosine similarity, defined below.

(3) sim(dj, q) = (dj · q) / (|dj| × |q|)

The main advantage of using this framework is that we can define the similarity between queries and documents as a function of the similarity between two vectors. There are many functions that can give us a measure of distance between two vectors, the cosine similarity being one of the most widely used for IR purposes.

2.1.4.1. Weight Metrics

There are many methods to calculate the weights wi for the N dimensions of each document, as shown in [18], and this remains an active research topic. For our experiments, we used the term frequency - inverse document frequency, tf-idf.

We define the term frequency (tf) tfi,j as:

(4) tfi,j = fi,j / maxk fk,j

where fi,j is the frequency of the term i in the document j, normalized by dividing by the maximum frequency observed for any term k in the document dj.

The inverse document frequency (idf) is defined as:

(5) idfi = lg (N / fi)

where N is the number of documents in the collection and fi is the number of documents that contain the term i.

The tf-idf weighting scheme is defined as:

(6) wi,j = tfi,j × idfi = (fi,j / maxk fk,j) × lg (N / fi)
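As a minimal sketch of how these weights combine with the cosine similarity of equation (3), the following fragment ranks three documents against a query. The toy collection is invented for illustration, and lg is taken as the base-2 logarithm (any base yields the same ranking):

```python
import math

# Toy collection, invented for illustration.
docs = ["information retrieval systems",
        "machine translation systems",
        "information need"]
tokenized = [d.split() for d in docs]
N = len(tokenized)  # number of documents in the collection
vocab = sorted({t for doc in tokenized for t in doc})

# f_i: number of documents that contain term i.
df = {t: sum(1 for doc in tokenized if t in doc) for t in vocab}

def tfidf_vector(tokens):
    """w_{i,j} = (f_{i,j} / max_k f_{k,j}) * lg(N / f_i), per equations (4)-(6)."""
    counts = {t: tokens.count(t) for t in set(tokens)}
    max_f = max(counts.values())
    return [(counts[t] / max_f) * math.log2(N / df[t]) if t in counts else 0.0
            for t in vocab]

def cosine(u, v):
    """Equation (3): cosine of the angle between two weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

query_vec = tfidf_vector("information retrieval".split())
scores = {i: cosine(query_vec, tfidf_vector(doc)) for i, doc in enumerate(tokenized)}
best = max(scores, key=scores.get)  # document 0 matches both query terms
```

Note how the rare term 'retrieval' receives a higher idf than 'information', which occurs in two of the three documents, so it contributes more to the ranking.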

2.1.5. Probabilistic Model

Defined in the framework of probability theory, the probabilistic model tries to estimate the probability that a specific document is relevant to a given query. The model assumes there is a subset R of documents which constitute the ideal answer for the query q; given the query, the model assigns each document a measure of similarity to it.

Formally, we have the following.

(i) D: The terms have binary weights, w ∈ {0, 1}.

(ii) Q: The queries are subsets of index terms.

(iii) F: The framework is defined in the context of probability theory, as follows. Let R be the set of documents known to be relevant, and let R̄ be the complement of R. P(R|dj) is the probability that document dj is relevant. Using Bayes' rule, the similarity can be expressed as:

(7) sim(dj, q) = (P(dj|R) × P(R)) / (P(dj|R̄) × P(R̄))

One of the advantages of this model is that the documents are ranked according to their probability of being relevant. Among its disadvantages are that it requires an initial classification of documents into relevant and non-relevant (which is the original purpose of the task), and that it assumes the terms are independent of each other.
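Under the term-independence assumption just mentioned, the odds in equation (7) factor over query terms. A toy sketch, with invented documents, an artificial initial relevance judgment, and add-0.5 smoothing (an assumption, used here only to avoid zero probabilities):

```python
import math

# Toy corpus with an artificial initial relevance judgment for one query
# (document names and terms are invented for illustration).
docs = {
    "d1": {"quechua", "retrieval"},
    "d2": {"spanish", "retrieval"},
    "d3": {"quechua", "corpus"},
    "d4": {"spanish", "corpus"},
}
relevant = {"d1", "d3"}             # R: documents known to be relevant
nonrelevant = set(docs) - relevant  # the complement of R

def term_odds(term):
    """log of P(term | R) / P(term | R-complement), with add-0.5 smoothing."""
    p_rel = (sum(term in docs[d] for d in relevant) + 0.5) / (len(relevant) + 1)
    p_non = (sum(term in docs[d] for d in nonrelevant) + 0.5) / (len(nonrelevant) + 1)
    return math.log(p_rel / p_non)

def score(doc_id, query_terms):
    # Term independence: the document's log-odds is a sum over query terms.
    return sum(term_odds(t) for t in query_terms if t in docs[doc_id])

query = {"quechua", "retrieval"}
ranking = sorted(docs, key=lambda d: score(d, query), reverse=True)
```

Here 'quechua' discriminates relevant from non-relevant documents, while 'retrieval' occurs equally on both sides and so contributes log-odds of zero, which is exactly the ranking behavior described above.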



2.1.6. Other Models

Most current approaches in the field of IR build on top of one or more of the previous models. Among other models, we can mention the fuzzy set model and the extended Boolean model.

2.2. Machine Translation

Machine translation (MT) is a field in NLP that investigates ways to perform automatic translation between two languages. In the context of this task, the original language is called the source (S) language, and the desired result language is called the target (T) language.

The naive approach is to replace words from the source language with words from the target language, based on a dictionary. During the past two decades, research has been active, and many new methods have been proposed.

2.2.1. Development of Machine Translation

The history of machine translation can be traced back to René Descartes and Gottfried Leibniz. In 1629, Descartes proposed a universal language, which would have one symbol for equivalent ideas in different languages. In 1667, in the preface to his General Science, Leibniz proposed that symbols are important for human understanding, and he made important contributions to symbolic reasoning and symbolic logic.

In the past century, in 1933, the Russian Petr Petrovich Troyanskij patented a device for translation that stored multilingual dictionaries, and he continued this work for 15 years, using Esperanto to deal with grammatical roles between languages.

During the Second World War, scientists worked on breaking codes and performing cryptography. Mathematicians and early computer scientists thought that the task of MT was very similar in essence to the task of breaking encrypted codes; the use of terms such as 'decoding' a text dates from this period.

In 1952, Yehoshua Bar-Hillel, MIT's first full-time MT researcher, organized the first MT research conference. In 1954, the first public demonstration of an MT system was held at Georgetown University, translating 49 sentences from Russian to English. In 1964, the Academy of Sciences created the Automatic Language Processing Advisory Committee (ALPAC) to evaluate the feasibility of MT.

One of the best known events in the history of MT is the publication of ALPAC's report in 1966, which had an undeniably negative effect on the MT research community. The report concluded that after many years of research the task of MT had not produced useful results, and qualified it as hopeless. The result was a substantial funding cut for MT research in the following years; consequently, in the immediately following decades, research on MT was somewhat slow.

In 1967, L. E. Baum, T. Petrie, G. Soules and N. Weiss introduced the expectation maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains, which, after the Shannon Lecture by Welch, became known as the Baum-Welch algorithm for finding the unknown parameters of a hidden Markov model (HMM). The Baum-Welch algorithm is a generalized expectation-maximization (GEM) algorithm.

Although many consider that stochastic approaches to MT began in the 1980s, short contributions during the 1950s already pointed in this direction. Most systems were based on mainframes.

At the end of the 1980s, a group at IBM led by P. Brown developed the statistical models in [4], which led to the widely used IBM models for statistical machine translation (SMT).

The 1990s saw the development of machine translation systems for personal computers, and at the end of the decade many websites started offering MT services.



In the 2000s machines became more powerful and more data was available, making many

approaches for SMT feasible.

2.2.2. Approaches to MT

The main paradigms for MT can be classified into three groups: rule-based MT, corpus-based MT, and hybrid approaches.

2.2.2.1. Rule Based Machine Translation

Rule-based machine translation (RBMT) is a paradigm that generally involves many steps or processes which analyze the morphology, syntax and semantics of an input text, and use rule-based mechanisms to transform the source language into the target language, through an internal structure or some interlingua.

2.2.2.2. Corpus Based Machine Translation

Corpus-based methods use a bilingual corpus to extract statistical parameters and apply them to models.

• Statistical machine translation (SMT): parameters are estimated from a parallel corpus and applied to statistical models. These methods were reintroduced by [4].

• Example-based machine translation (EBMT): this approach is comparable to case-based learning, with examples extracted from a parallel corpus.

2.2.3. Software Resources for Machine Translation

2.2.3.1. GIZA++

As part of the Egypt project, GIZA is used for analyzing bilingual corpora, and includes the implementation of the algorithms described in [4].



This tool was developed as part of the SMT system EGYPT. In addition to GIZA, GIZA++ was developed by Franz Josef Och as an extension that includes Models 4 and 5 [15]. It implements the Baum-Welch forward-backward algorithm, with empty words and dependency on word classes.

2.2.3.2. MKCLS

MKCLS implements language modeling, and is used to train word classes using a maximum-likelihood criterion [14]. The task of this model is to estimate the probability of a sequence of words, w1^N = w1, w2, ..., wN. To approximate this, we can use:

Pr(w1^N) = ∏ i=1..N p(wi | wi−1)

When we want to use this to estimate bigram probabilities (N = 2), we will have problems with data sparseness, since most bigrams will not be seen. To solve this problem, we can create a function C that maps words w to classes, in this way:

(8) p(w1^N | C) := ∏ i=1..N p(C(wi) | C(wi−1)) · p(wi | C(wi))

The mapping is obtained with a maximum likelihood approach:

(9) C = argmax_C p(w1^N | C)

As described in [14], and following [16], we can arrive at the following optimization criterion:

(10) LP1(C, n) = −∑ C,C′ h(n(C|C′)) + 2 ∑ C h(n(C))

(11) C = argmax_C LP1(C, n)
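A toy sketch of the class-based bigram factorization in equation (8), with an invented word-to-class mapping and maximum-likelihood counts (a real run of MKCLS would learn the mapping itself by optimizing the criterion above):

```python
from collections import Counter

# Invented word-to-class mapping and toy corpus (illustrative only).
word_class = {"the": "DET", "a": "DET", "cat": "N", "dog": "N"}
corpus = ["the", "cat", "the", "dog", "a", "cat"]

classes = [word_class[w] for w in corpus]
class_bigrams = Counter(zip(classes, classes[1:]))
class_counts = Counter(classes)
word_counts = Counter(corpus)

def p_class_bigram(c_prev, c):
    """ML estimate of p(C(w_i) | C(w_{i-1}))."""
    total = sum(n for (a, _), n in class_bigrams.items() if a == c_prev)
    return class_bigrams[(c_prev, c)] / total

def p_word_given_class(w):
    """ML estimate of p(w_i | C(w_i))."""
    return word_counts[w] / class_counts[word_class[w]]

# One factor of equation (8), for the word pair ("the", "cat"):
p = p_class_bigram(word_class["the"], word_class["cat"]) * p_word_given_class("cat")
```

Because all words are pooled into two classes, the class bigram (DET, N) is observed three times even though each word bigram is rare, which is how the class mapping fights data sparseness.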



MKCLS uses alignment templates extracted from the parallel corpus, which contain an alignment between the sentences. The implementation of the template model was an early development of phrase translation, which is used in current machine translation systems.

2.2.3.3. Moses

Moses, like GIZA++, is an implementation of a statistical machine translation system. It is a factored, phrase-based, beam-search decoder. The phrase-based statistical translation model [10] is different in that it translates sequences of words instead of aligning individual words; phrase-based models map text phrases without any linguistic information.

We model this problem as a noisy channel problem. By Bayes' rule:

(12) p(e|f) = p(f|e) p(e) / p(f)

Since p(f) does not depend on e, we can rewrite the search for the most probable translation as:

(13) argmax_e p(e|f) = argmax_e p(f|e) p(e)

Each sentence in this case is transformed into a sequence of phrases, translated with the following distribution:

(14) ϕ(fi | ei)

Using a line-aligned parallel text, we produce an alignment between the words. To do this step, we use GIZA++.



For the example, we used the book of Revelation, chapter 22, verse 21, which is the last verse of the Bible, using the New King James Version and the Dios Habla Hoy translations respectively. In figures 2.1 and 2.2, we can see the translation alignments for English to Spanish and Spanish to English of this verse.

• e: the grace of our lord jesus christ be with you all

• f: que el señor jesus derrame su gracia sobre todos

Figure 2.1. Translations English to Spanish, and Spanish to English.

Figure 2.2. The intersection.

Using the word alignment, several approaches try to approximate the best way to learn phrase translations.
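One common way to learn phrase translations is to extract every phrase pair that is consistent with the word alignment, meaning no aligned word inside the pair links outside it. A minimal sketch, with an invented sentence pair and alignment:

```python
# Toy word alignment between a source and a target sentence (invented example).
src = ["the", "grace", "of", "the", "lord"]
tgt = ["la", "gracia", "del", "señor"]
# Alignment links as (src_index, tgt_index) pairs.
links = {(0, 0), (1, 1), (2, 2), (3, 2), (4, 3)}

def consistent_pairs(src, tgt, links, max_len=3):
    """Extract phrase pairs consistent with the alignment: every link touching
    the source span must land inside the target span, and vice versa."""
    pairs = []
    n = len(src)
    for s1 in range(n):
        for s2 in range(s1, min(n, s1 + max_len)):
            touched = {j for (i, j) in links if s1 <= i <= s2}
            if not touched:
                continue
            t1, t2 = min(touched), max(touched)
            # No target word in [t1, t2] may link outside [s1, s2].
            if all(s1 <= i <= s2 for (i, j) in links if t1 <= j <= t2):
                pairs.append((" ".join(src[s1:s2 + 1]),
                              " ".join(tgt[t1:t2 + 1])))
    return pairs

pairs = consistent_pairs(src, tgt, links)
```

Note that ("of", "del") is rejected because "del" also links to the second "the", while ("of the", "del") is accepted; real systems additionally handle unaligned boundary words and estimate the distribution ϕ(f|e) from the counts of the extracted pairs.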



2.3. Cross-language Information Retrieval

Cross-language information retrieval (CLIR) is the task of performing IR when the query and the collection are in two different languages. One of the main motivations for the development of CLIR is the availability of information in English, and the need to make these resources available to speakers of other languages.

2.3.1. The Importance of Cross Language Information Retrieval

CLIR is important because most of the content available online is written in a couple dozen languages, and it is not available in the thousands of other languages for which little or no content exists.

For example, a query in English in a search engine for 'cold sickness' returns 2,130,000 results, while the same search in Spanish, 'resfrío', returns only 106,000 documents. A search in Quechua for 'Qhonasuro' shows no documents available in the index.

A query in English for the word 'answer' returns 511 million relevant documents. The query for the Spanish word 'respuesta' returns 76 million results. The same query for the Quechua word 'kutichiy' returns 638 documents.

A query in English for the words 'blind wiki' returns 2.4 million results, with 30 different senses in Wikipedia. The same query in Spanish, for 'ciego', returns 71 thousand relevant documents, with 10 different senses. The query for the Quechua word 'awsa' returns two thousand results, without any entry in a wiki page.

These examples show that although there is a lot of useful information on the Internet, it may not be accessible in most languages.

UNESCO has estimated that around six thousand of the close to seven thousand living languages spoken around the world today are disappearing or at risk of disappearing [21].

A language not only constitutes a way of human communication; it also constitutes a repository of cultural knowledge, a source of wisdom and a particular way to interpret and see the world. In this context, this work is an effort to provide tools for native speakers to use their languages to access the vast amount of information available in other languages.

This is also part of the effort to encourage the use of modern tools in languages other than the few major ones. Creating modern tools like an IR system in different languages constitutes another way to preserve them: it can promote the use of a language, given that there are documents accessible to the people who speak it.

2.3.2. Main Approaches to CLIR

Depending on what we choose to translate, there are two main approaches to CLIR.

(i) Query translation: The main advantage of this method is that it is very efficient for short queries. The main disadvantages are that it is difficult to disambiguate query terms, and it is difficult to include relevance feedback.

(ii) Document translation: This approach involves translating the collection into the query language. This step can be done once, and then applied only to new documents added to the collection. This approach is difficult if the collection is very large or the number of target languages increases, since each document must be translated into all target languages.
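A minimal sketch of the query translation approach, using an invented toy bilingual dictionary to expand a Spanish query into English terms before running monolingual retrieval (a real system would derive translation candidates from a parallel corpus or a published lexicon):

```python
# Invented toy bilingual dictionary (Spanish -> English); illustrative only.
dictionary = {
    "respuesta": ["answer", "reply"],
    "libro": ["book"],
}

def translate_query(query):
    """Replace each source term with its translation candidates (OR-expansion).
    Unknown terms are kept as-is, which also handles names and loanwords."""
    target_terms = []
    for term in query.split():
        target_terms.extend(dictionary.get(term, [term]))
    return target_terms

translated = translate_query("respuesta libro quechua")
# The translated term list can then be fed to any monolingual IR engine.
```

Keeping all candidate translations, as above, sidesteps disambiguation at the cost of noisier queries, which is exactly the trade-off noted for this approach.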

2.3.3. Challenges Associated with CLIR for Languages with Scarce Resources

2.3.3.1. The Alphabet

To perform MT and CLIR and MT with any pair of languages, the first factor to consider

is the written form of that language. A native written system may or may not be present,

and in many cases, transliteration is needed to map words from one written system to the

other. While some languages have already transliteration rules, the complexity of this task is

important, but is beyond the present experiment.



In the absence of a native alphabet, many languages have adopted a borrowed alphabet from a different language. Quechua adopted the written alphabet of Spanish (which is based on the Latin alphabet). As a result, some phonemes have been represented differently by different linguists:

• Five-vowel system: This is the old approach, based on Spanish orthography. It is supported almost exclusively by the “Academia Mayor de la Lengua Quechua”.

• Three-vowel system: This approach uses only the three vowels a, i and u. Most modern linguists support this system.

2.3.3.2. The Encoding

Another particular task on the development of resources for a scarce resource language

at the implementation level will involve definitely the task to use different encodings in the

application.

This sub task can be very challenging, since this issue doesn’t involve only problems with

software and programs developed by other researchers, but also some times includes deciding

many aspects that are specific to the language being studied.

For example, while it is easy to agree on the alphabetical order of characters in English

and in other languages that use the Latin alphabet, the task presents different difficulties

when the language is not English. Quechua uses the Spanish alphabet, but it is very common

to include the symbol “ ’ ”. While Spanish has rules on how to order this symbol, they

may or may not make sense for Quechua, taking into account the phonetics of the symbol.

In the specific case of Quechua, many systems for phonetic transcription were in use,

and there was still ongoing debate about this topic among linguists and native speakers.


2.4. Related Work and Current Research

2.4.1. Computational Linguistics for Resource-Scarce Languages

As we mentioned earlier, one of the first problems researchers face when building new

NLP resources for a resource-scarce language is the limited availability of written resources,

native speakers and general knowledge for the language.

There are two possible approaches to this task:

(i) To work on a particular language, creating resources specific to it and working with

experts to develop methods that are, most of the time, applicable only to that

language.

(ii) To create methods that would be applicable with little effort to any language.

2.4.2. Focusing on a Particular Language

Specific methods have been researched and developed that were focused on a particular

scarce resource language. [20] describes the process of creating resources for two languages,

Cebuano and Hindi. Cebuano is a native language of the Philippines.

Most of the experiments done for languages with scarce resources start from the same initial

grounds, which are to create as many resources as possible in order to start using traditional

methods and models.

For example, [12] describe a shared task on word alignment, where seven teams were

provided with training and evaluation data, and submitted word alignments between English

and Romanian, and between English and French.

Other projects have also focused on the creation of resources for Quechua. For example,

[5] described the efforts to create NLP resources for Mapudungun, an indigenous language

spoken in Chile and Argentina, and for Quechua. In this experiment, EBMT and RBMT systems

were developed for both languages. As in the other experiments mentioned, one of the main

problems is building the initial resources.


2.4.3. NLP for Languages with Scarce Resources

The construction of NLP resources for languages with scarce resources has been studied

recently. For example, [6] proposes an unsupervised method for POS acquisition.

For the task of word alignment, [12] and [11] describe different approaches; [11]

proposes improvements to word alignment by incorporating syntactic knowledge.

2.4.4. CLIR for Languages with Scarce Resources

One of the common tasks of CLIR is the experimentation with languages with few or no

resources.

Nevertheless, it is necessary to note that these experiments have most often targeted

languages of interest with many speakers. While TREC-2001 and TREC-2002 focused

on Arabic, other experiments have also been carried out for languages with few or no resources

at all. For example, in 2003, in the context of the Translingual Information Detection,

Extraction and Summarization (TIDES) program, DARPA organized a competition where teams

had to build from scratch an MT system for a surprise language, Hindi. In these experiments,

researchers needed to port and create resources for a new language in a relatively short

amount of time, in this case one month.

2.4.5. Cross Language Information Retrieval with Dictionary Based Methods

Earlier experiments worked with dictionary-based methods for CLIR. One of these works

was presented by Lisa Ballesteros [3].

In this work, they found that Machine Readable Dictionaries would drop retrieval

effectiveness by about 50% due to ambiguity.

They reported that local feedback prior to translation improves precision. They also

showed that the effect is larger for longer queries, attributing it to the reduction of

irrelevant translations.


The combination of both methods led to better results for both short and long queries.

2.4.6. Cross Language Information Retrieval Experiments

Ten groups participated in TREC-2001, where systems had to retrieve Arabic-language

documents based on 25 queries in English.

Since that was the first year that many resources in English were available, new approaches

were tried [7].

One finding of this experiment was that query length did not improve retrieval

effectiveness.

In the TREC-2002, nine teams participated in the CLIR track [13].

The monolingual runs served as a comparative baseline against which cross-language

results can be compared.

As had happened the year before, the teams observed substantial variability in effectiveness

on a topic-by-topic basis.

2.4.7. Word Alignment for Languages with Scarce Resources

More recently, the ACL organized a shared task [9] to align words for languages with scarce

resources. The language pairs were English-Inuktitut, English-Hindi and Romanian-English.

One of the most interesting outcomes of this work, relative to this study, is in the results

for English-Hindi: the use of additional resources resulted in absolute improvements of 20%,

which was not the case for languages with larger training corpora.


CHAPTER 3

EXPERIMENTAL FRAMEWORK

3.1. Metrics

For our experiments, we used precision and recall to evaluate the effectiveness of the

results.

In CLIR, the quality of the translation directly affects the results obtained with IR,

but it is different from the quality measured by MT metrics alone. One of the main reasons

for this is that most IR methods use a bag-of-words approach, as opposed to MT, where the

order of the words is a factor.

To analyze the results in more detail, we also calculated the precision at other points

(different from the 11 standard recall points), taking into account a progressive inclusion

of results.

P = (Relevant documents retrieved) / (Documents returned)

R = (Relevant documents retrieved) / (Total relevant documents)

In the formula, in order to calculate more points, we made the total number of relevant

documents vary accordingly.
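These definitions, with a progressive inclusion of returned documents, can be sketched as follows; the ranked list and the relevance set are illustrative:

```python
# Sketch: precision and recall computed over a ranked result list,
# progressively including one more returned document at each step.

def precision_recall_curve(ranked, relevant):
    """Yield (precision, recall) after each returned document."""
    hits = 0
    points = []
    for i, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
        points.append((hits / i, hits / len(relevant)))
    return points

ranked = ["d3", "d1", "d7", "d2", "d5"]   # illustrative ranking
relevant = {"d1", "d2", "d9"}             # illustrative judgments
for p, r in precision_recall_curve(ranked, relevant):
    print(f"P={p:.2f} R={r:.2f}")
```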

3.2. Data Sources

To evaluate how different quantities of parallel text affects the translation quality, we used

the following data sources:

3.2.1. Time Collection

The Time magazine collection is a set of 423 documents, associated with 83 queries,

for which relevance judgments were manually assigned.


This collection originally contained 425 documents, two of which were duplicates: documents

504 and 505 according to the original numbering. For this reason, both documents were not

used.

There is a total of 324 relevant documents associated with the 83 queries.

3.2.1.1. Preprocessing

In order to use this collection, we needed to preprocess the files. The collection's docu-

ments are distributed in a single file, which has to be parsed and transformed to be usable.

The files were numbered in an unspecified order. For this experiment, they were rearranged

and indexed. The mapping is detailed in Appendix ‘A’.
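A minimal sketch of this parsing step, assuming that each document in the single file begins with a marker line starting with *TEXT (as in the standard distribution of the collection); the exact header fields are not reproduced here:

```python
# Sketch: split a single-file collection into individual documents.
# Assumption: each document starts with a line beginning with "*TEXT".

def split_collection(text):
    docs = []
    current = None
    for line in text.splitlines():
        if line.startswith("*TEXT"):
            if current is not None:
                docs.append("\n".join(current))
            current = [line]
        elif current is not None:
            current.append(line)
    if current is not None:
        docs.append("\n".join(current))
    return docs

sample = "*TEXT 017\nfirst doc body\n*TEXT 018\nsecond doc body\n"
print(len(split_collection(sample)))  # 2
```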

3.2.2. Parallel Text

Many sets of parallel text were used on this experiment. The main source of parallel text

used was the Bible.

3.2.2.1. The Bible as a Source of Parallel Text

Many authors have described advantages and disadvantages of using the Bible as a source

of parallel text, for example [17].

The Bible is a source of high-quality translation that is available in several languages.

According to the United Bible Society [19], there are 438 languages with translations of the

Bible, and 2,454 languages with at least a portion of the Bible translated, making it the

parallel corpus that is available in the most languages.

Table 3.1 contains all the versions that were used in this experiment.

All the Bibles required preprocessing to be usable for this experiment.

This process involved many manual steps. The final format of the Bibles, for post-

processing, was defined as follows: every line introducing a new versicle starts with a

header of the form

[a, b, c, d, e] Text of the versicle.


Table 3.1. List of English translations of the Bible used for this experiment.

ID   Language         Name
008  English          American Standard Version
009  English          King James Version
015  English          Youngs Literal Translation
016  English          Darby Translation
031  English          New International Version
045  English          Amplified Bible
046  English          Contemporary English Version
047  English          English Standard Version
048  English          21st Century King James Version
049  English          New American Standard Bible
050  English          New King James Version
051  English          New Living Translation
063  English          Douay Rheims 1899 American Edition
064  English          New International Version U K
072  English          Todays New International Version
074  English          New Life Version
076  English          New International Readers Version
077  English          Holman Christian Standard Bible
078  English          New Century Version
998  Quechua Bolivia  Quechua Catholic Bolivia
999  Quechua Cuzco    Quechua Catholic Cuzco

Where:

a) Bible translation ID
b) Book number
c) Chapter number
d) Versicle number
e) Type of annotation

The different types of annotation are:

4) Title
5) Subtitle
6-10) Warning of duplicate versicles

We introduced lines to be skipped in the document, which contain annotations of the

Bible: many chapters have a name, and some stories have titles. These titles were commented

out; although they are present in the Bible file, they use a different notation so that

they are skipped by the post-processing.
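A parser for this line format can be sketched as follows; we assume here that ordinary versicles carry an annotation type below 4, since only types 4 and above are defined as material to skip:

```python
# Sketch: parse "[a, b, c, d, e] Text" versicle lines and skip lines
# whose annotation type marks a title (4), subtitle (5) or duplicate
# warning (6-10). Assumption: ordinary versicles use a type below 4.
import re

LINE = re.compile(r"\[(\d+),\s*(\d+),\s*(\d+),\s*(\d+),\s*(\d+)\]\s*(.*)")

def parse_versicle(line):
    m = LINE.match(line)
    if not m:
        return None
    bible, book, chapter, versicle, ann = map(int, m.groups()[:5])
    if ann >= 4:   # title, subtitle or duplicate warning: skip
        return None
    return (bible, book, chapter, versicle, m.group(6))

print(parse_versicle("[47, 1, 1, 1, 0] In the beginning ..."))
```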

The versions of the Bibles contained many particularities which are not relevant in this

context. Nevertheless, some of them had to be accounted for in order to process the parallel

files.

3.2.2.2. Catholic and Protestant Versions

We used both Catholic and Protestant versions of the Bible. The difference between

them, in this context, is the number of books included, and the length of some of the books.


We include a listing of the books of the Bible used in both versions.

3.2.2.3. Joined Versicles

Some of the Bibles contained versicles that were, for translation purposes, joined together

into one. Some versions had noticeably more joined versicles than others. This seemed to

be arbitrary since, on inspection, the joined versicles did not match from version to version.

In order to create the parallel files for the parallel Bible, we joined those versicles

together.

In some cases, joined versicles corresponded to versicles that were not joined in the

other version. We grouped all the versicles that needed to be grouped so that there was a

matching number of corresponding versicles on both sides.

This process is dynamic, since different versions of the Bible are grouped differently, on a

case-by-case basis.
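The grouping step can be sketched as an alignment of segment boundaries. In the sketch below, each side of the bitext is summarized by the last versicle number covered by each segment (a joined versicle covers several numbers); this input format is an illustrative simplification:

```python
# Sketch: merge segments on both sides of a bitext until their span
# boundaries coincide, yielding a 1-to-1 parallel grouping.

def align_groups(ends_a, ends_b):
    """ends_a/ends_b: last versicle number covered by each segment."""
    pairs, i, j = [], 0, 0
    ga, gb = [], []
    while i < len(ends_a) and j < len(ends_b):
        ga.append(i)
        gb.append(j)
        while ends_a[i] != ends_b[j]:
            if ends_a[i] < ends_b[j]:
                i += 1
                ga.append(i)   # extend group on side A
            else:
                j += 1
                gb.append(j)   # extend group on side B
        pairs.append((ga, gb))
        ga, gb = [], []
        i += 1
        j += 1
    return pairs

# Side A joined verses 2-3; side B joined 3-4: verses 2-4 end up grouped.
print(align_groups([1, 3, 4, 5], [1, 2, 4, 5]))
```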


Table 3.2. Comparison of content between Catholic and Protestant Bibles.

Book number  Book name         Versicles Catholic Bible  Versicles Protestant Bible
1            Genesis           50                        50
2            Exodus            40                        40
3            Leviticus         27                        27
4            Numbers           36                        36
5            Deuteronomy       34                        34
6            Joshua            24                        24
7            Judges            21                        21
8            Ruth              4                         4
9            1 Samuel          31                        31
10           2 Samuel          24                        24
11           1 Kings           22                        22
12           2 Kings           25                        25
13           1 Chronicles      29                        29
14           2 Chronicles      36                        36
15           Ezra              10                        10
16           Nehemiah          13                        13
17           Tobit             14                        0
18           Judith            16                        0
19           Esther            10                        10
20           1 Maccabees       16                        0
21           2 Maccabees       15                        0
22           Job               42                        42
23           Psalm             150                       150
24           Proverbs          31                        31
25           Ecclesiastes      12                        12
26           Song              8                         8
27           Wisdom            19                        0
28           Sirach            51                        0
29           Isaiah            66                        66
30           Jeremiah          52                        52
31           Lamentations      5                         5
32           Baruch            6                         0
33           Ezekiel           48                        48
34           Daniel            13                        12
35           Hosea             14                        14
36           Joel              3                         3
37           Amos              9                         9
38           Obadiah           1                         1
39           Jonah             4                         4
40           Micah             7                         7
41           Nahum             3                         3
42           Habakkuk          3                         3
43           Zephaniah         3                         3
44           Haggai            2                         2
45           Zechariah         14                        14
46           Malachi           4                         4
47           Matthew           28                        28
48           Mark              16                        16
49           Luke              24                        24
50           John              21                        21
51           Acts              28                        28
52           Romans            16                        16
53           1 Corinthians     16                        16
54           2 Corinthians     13                        13
55           Galatians         6                         6
56           Ephesians         6                         6
57           Philippians       4                         4
58           Colossians        4                         4
59           1 Thessalonians   5                         5
60           2 Thessalonians   3                         3


Book number  Book name         Versicles Catholic Bible  Versicles Protestant Bible
61           1 Timothy         6                         6
62           2 Timothy         4                         4
63           Titus             3                         3
64           Philemon          1                         1
65           Hebrews           13                        13
66           James             5                         5
67           1 Peter           5                         5
68           2 Peter           3                         3
69           1 John            5                         5
70           2 John            1                         1
71           3 John            1                         1
72           Jude              1                         1
73           Revelation        22                        22


CHAPTER 4

EXPERIMENTAL RESULTS

4.1. Objectives

The objective of our experiments is to measure the impact of different variables on the

quality of the results obtained doing CLIR, within the context and limitations of resource-

scarce languages.

Namely, in the first experiment we want to analyze the effect of the amount of parallel

data on precision and recall. We contrast the quality of IR results on different corpus sizes. In

the second experiment, we analyze if paraphrasing improves the results. In this experiment,

we contrast the use of more than one version of the English Bible, and we align this to

the Quechua Bibles. In the third experiment, we compare the results of translating the

collection against translating the query. In the fourth experiment, we analyze the possibility

of using MT across similar dialects, and how this affects the IR results. In the fifth and last

experiment, we analyze whether removing punctuation affects the results in any way. We

remove punctuation from the parallel corpus and contrast the results across the different

experimental settings.

To establish the upper bound of the experiments, we performed the IR experiment with the

collection and query both in English. Figure 4.1 shows the results of monolingual IR on the

collection using the vector space model, which is the upper bound of our CLIR experiment.

We contrast these results with the baseline, which is to pick documents from the list at

random.

As the size of the parallel corpus grows, translation quality improves and, as a direct

result, the results of the IR task improve. In the first experiment, we want to analyze the

impact of increasing the amount of data available for the parallel corpus, and how much

data is needed to make an impact on the results.


[Figure: precision vs. recall curves; series: Upper bound, Baseline]

Figure 4.1. Upper bound and baseline.

For all the experiments, when we refer to the set of all Bibles, we refer to the aggregation

of the following English Bibles:

(i) American Standard Version

(ii) King James Version

(iii) Youngs Literal Translation

(iv) Darby Translation

(v) New International Version

(vi) Amplified Bible

(vii) Contemporary English Version

(viii) English Standard Version

(ix) 21st Century King James Version

(x) New American Standard Bible

(xi) New King James Version

(xii) New Living Translation

(xiii) Douay Rheims 1899 American Edition


(xiv) New International Version U K

(xv) Todays New International Version

(xvi) New Life Version

(xvii) New International Readers Version

(xviii) Holman Christian Standard Bible

(xix) New Century Version

The entire corpus has a total of 574,296 lines.

When we refer to the set of four Bibles, we refer to the aggregation of the following

English Bibles:

(i) American Standard Version

(ii) Contemporary English Version

(iii) English Standard Version

(iv) New International Readers Version

The corpus of four Bibles has a total of 118,762 lines.

The set of two Bibles is the aggregation of the following English Bibles:

(i) American Standard Version

(ii) English Standard Version

The corpus of two Bibles has a total of 58,711 lines.

When we refer to the set of one Bible, we refer to the following Bible:

(i) English Standard Version

This corpus has a total of 27,620 lines.

For the experiments involving only sections of 5,000 lines, 10,000 lines, 15,000 lines,

20,000 lines and 25,000 lines, we used in all cases the English Standard Version.


4.2. Experiments

4.2.1. Experiment 1

Analyze the effect of the size of the corpus

Translation of the collection

Punctuation was not removed

We used the English Standard Version and the Quechua Cuzco Version of the Bible in

order to construct the parallel files.

We aligned parallel versions of the two Bibles and prepared the parallel files. In this case,

we translated the queries from Quechua to English, using the different translators built

from different quantities of lines from the corpus, in increments of 5,000 lines. We evaluated

the precision and recall of the results obtained using the vector space model.
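The text does not spell out the translation model at this point. As a simplified stand-in, the sketch below induces a word-translation lexicon from aligned lines by scoring candidate pairs with the Dice coefficient of line-level co-occurrence; the example words are illustrative and this is not necessarily the alignment method used in the experiments:

```python
# Sketch: induce a word-translation lexicon from parallel lines by
# Dice-coefficient co-occurrence (a simplified stand-in for alignment
# based training).
from collections import Counter

def dice_lexicon(src_lines, tgt_lines):
    src_count, tgt_count, joint = Counter(), Counter(), Counter()
    for s_line, t_line in zip(src_lines, tgt_lines):
        s_words, t_words = set(s_line.split()), set(t_line.split())
        src_count.update(s_words)
        tgt_count.update(t_words)
        for s in s_words:
            for t in t_words:
                joint[(s, t)] += 1
    return {
        pair: 2 * c / (src_count[pair[0]] + tgt_count[pair[1]])
        for pair, c in joint.items()
    }

# Tiny illustrative bitext ("wasi" = house, "hatun" = big).
src = ["wasi hatun", "wasi"]
tgt = ["big house", "house"]
lex = dice_lexicon(src, tgt)
best = max((t for (s, t) in lex if s == "wasi"),
           key=lambda t: lex[("wasi", t)])
print(best)  # house
```

More parallel lines sharpen the co-occurrence statistics, which is exactly the effect the 5,000-line increments are designed to measure.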

[Figure: precision vs. recall curves; series: Upper bound, 5k lines, 10k lines, 15k lines, 20k lines, 25k lines, 27.6k (complete)]

Figure 4.2. Increments of 5,000 lines for the parallel corpus.

One thing we observed is that there is a very clear difference between the results

using the complete Bible (27,610 lines) and the results obtained using only 25,000 parallel

lines. This difference is less evident between the other sets, even when the difference of


Table 4.1. Normalized F-measure at different recall points.

Recall point   5k      10k     15k     20k     25k     27.6k
0.1            0.9731  0.9590  0.9590  0.9567  0.9567  0.9975
0.2            0.9203  0.9203  0.9153  0.9170  0.9119  0.9522
0.3            0.8568  0.8142  0.8604  0.8554  0.8279  0.8889
0.4            0.7995  0.7889  0.7820  0.7925  0.7943  0.8497
0.5            0.7644  0.7133  0.6791  0.6910  0.7203  0.8224
0.6            0.6030  0.5315  0.5232  0.5264  0.5469  0.6307
0.7            0.4013  0.3217  0.3346  0.3217  0.3526  0.3685
0.8            0.3534  0.3003  0.3112  0.2876  0.3085  0.3157
0.9            0.3878  0.3650  0.3650  0.3561  0.3739  0.4105
µ              0.6733  0.6349  0.6366  0.6338  0.6437  0.6929

parallel data is several times greater. One possible reason for this difference is that the New

Testament uses words that are more likely to appear in a contemporary text.

In general terms, more data positively affects the quality of the IR results.

[Figure: precision vs. recall curves, "Increasing amount of data"; series: Complete Bible, 5k lines]

Figure 4.3. Contrasting 5000 lines with 27600 lines (complete Bible).

In figure 4.3 we contrast the difference between using a corpus of 5,000 lines and the

complete set of 27,600 lines. We can see clearly that precision and recall increase

as we increase the size of the corpus.

In table 4.1 we calculated the F-measure at different recall points. Then, we normalized

these values by the upper bound, to see the impact of increasing the size of the corpus.
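This normalization can be sketched as follows: the F-measure of a run at each recall point is divided by the F-measure of the monolingual upper bound at the same point. The precision values below are illustrative:

```python
# Sketch: F-measure at fixed recall points, normalized by the
# monolingual upper bound at the same points (illustrative values).

def f_measure(p, r):
    return 2 * p * r / (p + r) if p + r else 0.0

def normalized_f(points, upper_points):
    """points/upper_points: {recall: precision} at matching recall points."""
    return {
        r: f_measure(points[r], r) / f_measure(upper_points[r], r)
        for r in points
    }

system = {0.1: 0.40, 0.5: 0.25}   # illustrative CLIR run
upper = {0.1: 0.45, 0.5: 0.32}    # illustrative upper bound
for r, v in normalized_f(system, upper).items():
    print(f"recall {r}: {v:.4f}")
```

A normalized value close to 1 means the run matches the upper bound at that recall point, which is how the columns of table 4.1 should be read.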


4.2.2. Experiment 2

Analyze the effect of paraphrasing.

Quechua Version used: Cuzco Version.

Translation of the collection.

In the second experiment we want to evaluate whether paraphrasing improves precision and

recall. Paraphrasing increases the size of the parallel corpus, but the corpus is aligned in

all cases with the corresponding Quechua Bible, of which we used only one version.

For this reason, in figure 4.4 we compare the results for one Bible, the combination of

two Bibles, and the combination of four.

[Figure: precision vs. recall curves, "With punctuation"; series: One bible, Two bibles, Four bibles]

Figure 4.4. Query translation.

We note that the increase in parallel data came from paraphrasing: we used different

versions of the Bible in English, and we aligned all of them to the Quechua Bible, using

only one of the Quechua Bibles at a time.

The results show an increase in precision when using different versions of the Bible in

the same language.

In figure 4.4 we can observe that the best performing corpus is the set of four Bibles.


[Figure: precision vs. recall curves, "Removing punctuation"; series: One bible, Two bibles, Four bibles]

Figure 4.5. Query translation.

[Figure: precision vs. recall curves; series: One bible, Two bibles, Four bibles, each with and without punctuation removed]

Figure 4.6. Query translation.

In figure 4.5 we repeated the same experiment with punctuation removed, and again the

best results belonged to the set of four Bibles.

We also compared both sets of results; the best results came from the set of four

Bibles with punctuation removed, as shown in figure 4.6.


Table 4.2. Normalized F-measure at different recall points.

Recall point  1 bible  2 bibles  4 bibles
0.1           0.9567   0.9567    0.9567
0.2           0.9102   0.9170    0.9186
0.3           0.8493   0.8198    0.9011
0.4           0.7583   0.7601    0.8219
0.5           0.7099   0.6791    0.7852
0.6           0.5770   0.5342    0.6121
0.7           0.3969   0.3994    0.4740
0.8           0.3507   0.3765    0.4546
0.9           0.4268   0.4080    0.4544
µ             0.6595   0.6501    0.7087

In table 4.2 we can see clearly that, although the corpus of two Bibles performed nearly 1%

worse than the corpus of one Bible, the best performance was for the corpus of four Bibles,

almost 5% better than the others.


4.2.3. Experiment 3

Analyze the effect of translating collection vs query.

Quechua Version used: Cuzco Version.

One of the research questions we wanted to answer was whether it is better to translate the

query or the collection, or whether both approaches give the same result. In this experiment,

we compare the results obtained translating the query against those obtained translating

the collection.

This experiment was carried out using the corpus of 5,000 lines, one Bible, and finally

four Bibles, using the Cuzco dialect of the Quechua Bible, without removing punctuation.

Figure 4.7 shows that translating the collection gave better results with 5,000 lines.

[Figure: precision vs. recall curves; series: Collection, Query]

Figure 4.7. Query translation and collection translation (5k lines).

Figure 4.8 shows the same experiment using the corpus of one Bible, and we can see

clearly that translating the collection yields better results here too.

Lastly, we repeated the experiment with the corpus of four Bibles. Figure 4.9 shows

that collection translation again performed better than query translation in this

case.


[Figure: precision vs. recall curves; series: Collection, Query]

Figure 4.8. Query translation and collection translation (one Bible).

[Figure: precision vs. recall curves; series: Collection, Query]

Figure 4.9. Query translation and collection translation (four Bibles).

4.2.4. Experiment 4

Analyze the effect of using a different dialect

Quechua Version used: Cuzco Version and Bolivia


Table 4.3. Normalized F-measure at different recall points for a corpus of 5k lines.

Recall point  Collection translation  Query translation  δ
0.1           0.1587                  0.1560             0.2675
0.2           0.2383                  0.2366             0.1751
0.3           0.2796                  0.2776             0.2009
0.4           0.2891                  0.2815             0.7533
0.5           0.2621                  0.2445             1.7617
0.6           0.1852                  0.1710             1.4213
0.7           0.0940                  0.0999             -0.5891
0.8           0.0626                  0.0766             -1.3906
0.9           0.0541                  0.0609             -0.6747

Table 4.4. Normalized F-measure at different recall points for a corpus of one bible.

Recall point  Collection translation  Query translation  δ
0.1           0.1587                  0.1560             0.2675
0.2           0.2401                  0.2374             0.2633
0.3           0.2845                  0.2788             0.5742
0.4           0.3019                  0.2781             2.3721
0.5           0.2991                  0.2558             4.3289
0.6           0.2405                  0.1856             5.4882
0.7           0.1607                  0.1078             5.2943
0.8           0.0840                  0.0716             1.2309
0.9           0.0629                  0.0633             -0.0372

Table 4.5. Normalized F-measure at different recall points for a corpus of four bibles.

Recall point  Collection translation  Query translation  δ
0.1           0.1587                  0.1552             0.3423
0.2           0.2392                  0.2366             0.2628
0.3           0.2841                  0.2784             0.5701
0.4           0.3015                  0.2740             2.7450
0.5           0.2982                  0.2443             5.3989
0.6           0.2326                  0.1577             7.4918
0.7           0.1486                  0.0870             6.1532
0.8           0.0823                  0.0667             1.5635
0.9           0.0629                  0.0562             0.6731

In this experiment, we want to see how much impact using a different dialect has for IR

purposes, that is, how the results change when the experiment uses a different dialect in

the query and the collection.

The Time collection was translated by human translators who spoke Quechua Cuzco.

In this experiment, we contrast the impact of creating translators using the same dialect,

Quechua Cuzco, with translators using a different dialect, which is Quechua Bolivia.

In figure 4.12 we translated the collection for the first set of experiments.

In the next experiment, we used only one Bible, as shown in figure 4.11.

In the following experiment, we used only five thousand lines of parallel text (figure 4.10).


[Figure: precision vs. recall curves; series: Quechua Cuzco, Quechua Bolivia]

Figure 4.10. Using different dialects, translating collection for 5k lines corpus.

[Figure: precision vs. recall curves; series: Quechua Cuzco, Quechua Bolivia]

Figure 4.11. Using different dialects, translating collection for one bible corpus.

Only in the last experiment do we see a different behavior, where both dialects perform

almost equally.

In the next set of experiments, we used query translation.


[Figure: precision vs. recall curves; series: Quechua Cuzco, Quechua Bolivia]

Figure 4.12. Using different dialects, translating collection for four bibles corpus.

[Figure: precision vs. recall curves; series: Quechua Cuzco, Quechua Bolivia]

Figure 4.13. Using different dialects, translating query for 5k lines corpus.

In figures 4.13, 4.14 and 4.15 the results do not show a significant difference between

the two dialects. We believe this is because keywords are similar in both dialects, and the

difference between them becomes less relevant when a smaller amount of text is translated,

which is the case in query translation.


[Figure: precision vs. recall curves; series: Quechua Cuzco, Quechua Bolivia]

Figure 4.14. Using different dialects, translating query for one bible corpus.

[Figure: precision vs. recall curves; series: Quechua Cuzco, Quechua Bolivia]

Figure 4.15. Using different dialects, translating query for four bibles corpus.

Table 4.6. F-measure average using different dialects on different sizes of corpus, with collection translation.

Corpus       Cuzco dialect  Bolivian dialect  δ
5k           0.1804         0.1810            0.2990
one bible    0.2036         0.1750            16.3466
four bibles  0.2009         0.1809            11.0547


Table 4.7. F-measure average using different dialects on different sizes of corpus, with query translation.

Corpus       Cuzco dialect  Bolivian dialect  δ
5k           0.1783         0.1741            2.4258
one bible    0.1816         0.1750            3.7862
four bibles  0.1729         0.1809            4.4234

Table 4.6 clearly shows that there is an improvement from using the correct dialect when

we do collection translation. Table 4.7 shows that there is no noticeable difference between

dialects when the query is translated.


4.2.5. Experiment 5

Analyze the effect of removing punctuation

Quechua Version used: Cuzco Version

During the previous experiments, we noticed that many words were aligned with punctu-

ation. In this experiment, we wanted to investigate whether removing punctuation would affect

the quality of the translation in any way. For the experiment, we removed all punctuation signs.
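This preprocessing step can be sketched as follows; note that blanket removal of ASCII punctuation would also strip the Quechua apostrophe, so in practice that symbol may need separate handling (an assumption worth checking against the corpus):

```python
# Sketch: strip all punctuation signs from each line of the parallel
# corpus before training.
import string

TABLE = str.maketrans("", "", string.punctuation)

def strip_punctuation(line):
    # Removes ASCII punctuation; language-specific symbols such as the
    # Quechua apostrophe would need to be handled deliberately.
    return " ".join(line.translate(TABLE).split())

print(strip_punctuation("Ama suwa, ama llulla, ama qhella."))
```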

In figures 4.16 and 4.17 we translated the collection. We can see that the results are

almost identical, and there is no measurable difference between them.

Figure 4.16. Removing punctuation, translating the collection (four-bibles corpus; precision-recall curves with and without punctuation symbols).

For the next set of runs, we translated the query. Figures 4.18 and 4.19 show a clear difference between the results obtained with and without punctuation.

Figures 4.16 through 4.19 show that while removing punctuation did not improve the results for collection translation, it clearly improved the results for query translation.


Figure 4.17. Removing punctuation, translating the collection (one-bible corpus; precision-recall curves with and without punctuation symbols).

Figure 4.18. Removing punctuation, translating the query (four-bibles corpus; precision-recall curves with and without punctuation symbols).


Figure 4.19. Removing punctuation, translating the query (one-bible corpus; precision-recall curves with and without punctuation symbols).


CHAPTER 5

DISCUSSION AND CONCLUSIONS

In experiment 1, the results showed clearly that an increase in the size of the parallel corpus results in better precision.

Given the relative sizes of the corpora, there was a significant increase in precision from 25k lines to the complete Bible (27k lines). This may be an effect of words relevant to the experiment being included in the last books of the Bible.

In experiment 2 we saw that using paraphrasing to construct a bigger corpus increases precision for information retrieval purposes. The increase in precision was consistent with the amount of parallel data used.

In experiment 3 we contrasted translating the query with translating the collection of documents. Translating the collection gave higher precision in all cases. To see whether this was due to the quality of the parallel data, we contrasted query and collection translation when a different Quechua dialect was used. In that case, both approaches returned comparatively similar results, indicating that the choice between translating the query or the collection matters only when the quality of the aligned data is higher.

In experiment 4 we analyzed translating the text using a dialect different from the one used originally. We translated the collection first, and then the query.

For collection translation, using the same dialect gave better precision than using a different dialect in all cases. We found the same difference across different amounts of parallel data.

One interesting result was the translation of the query using different dialects: both dialects performed similarly, under different amounts of parallel data.


In experiment 5 we showed that removing punctuation symbols made a noticeable impact when doing query translation. For collection translation, there was almost no difference in precision.

5.1. Contributions

For this work, we had to develop all the NLP resources for MT for Quechua Cuzco and Quechua Bolivia from the beginning.

We aligned this text with the English Bibles, and designed the experiments to measure how different variables affect the results of SMT for IR purposes.

5.2. Conclusions

Based on these results, we draw the following conclusions.

(i) How much does the amount of parallel data affect the precision of information retrieval?

Every addition of parallel text increased the precision of the results. The best results corresponded to the parallel corpus with the most lines, with a significant difference over the other sets.

(ii) Is it possible to increase the precision of the information retrieval results using paraphrasing?

We showed that paraphrasing had a direct impact on the results, increasing the precision. The best performing system was the one trained on the four-bibles corpus.

(iii) Is there a significant difference if we translate the query or the collection of documents? How is this affected by the other factors?

Translating the collection performed better. For query translation, removing punctuation increased precision, while it had no impact for collection translation.


(iv) Is it possible to use information retrieval across different dialects of the same language? How much is the precision affected?

While the results were inferior for the different dialect, they suggest that it is possible to use the same system for a different dialect. Collection and query translation had similar results in this case.


APPENDIX

TIME COLLECTION FILES


Documents in the Time collection are identified by a number, which is associated with the query relevance document. These numbers range from 17 to 563.

In our experiment, we identified the documents using consecutive numbers, each of which maps to a Time collection document.

Columns labeled ‘E’ contain the document number used in this experiment, while columns labeled ‘C’ contain the document number assigned in the Time collection.
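In code, this mapping amounts to a simple dictionary built from the (E, C) pairs of the appendix table. A sketch (only the first few pairs from the table are shown; the variable names are illustrative):

```python
# (E, C) pairs: experiment document number -> Time collection number.
# Only the first rows of the appendix table are included here.
PAIRS = [(1, 17), (2, 18), (3, 19), (4, 20), (5, 22), (6, 23)]

to_collection = dict(PAIRS)                  # E -> C
to_experiment = {c: e for e, c in PAIRS}     # C -> E

print(to_collection[5])   # → 22
print(to_experiment[18])  # → 2
```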


  E   C     E   C     E   C     E   C     E   C     E   C     E   C     E   C     E   C
  1  17    51  83   101 143   151 199   201 259   251 312   301 387   351 471   401 537
  2  18    52  84   102 144   152 200   202 260   252 314   302 388   352 472   402 538
  3  19    53  85   103 145   153 201   203 261   253 316   303 389   353 474   403 539
  4  20    54  86   104 146   154 202   204 262   254 317   304 390   354 475   404 540
  5  22    55  87   105 147   155 203   205 263   255 318   305 391   355 476   405 541
  6  23    56  89   106 148   156 212   206 264   256 319   306 393   356 477   406 542
  7  24    57  90   107 149   157 213   207 265   257 320   307 395   357 478   407 543
  8  25    58  91   108 150   158 214   208 266   258 321   308 397   358 479   408 544
  9  26    59  92   109 151   159 216   209 267   259 322   309 398   359 484   409 545
 10  27    60  93   110 152   160 217   210 268   260 323   310 399   360 486   410 547
 11  28    61  94   111 153   161 218   211 269   261 325   311 400   361 489   411 548
 12  29    62  95   112 154   162 219   212 271   262 328   312 401   362 490   412 549
 13  30    63  96   113 155   163 220   213 272   263 329   313 402   363 491   413 550
 14  31    64  97   114 156   164 221   214 273   264 330   314 403   364 492   414 551
 15  32    65  98   115 157   165 222   215 274   265 331   315 404   365 493   415 552
 16  33    66  99   116 158   166 223   216 275   266 332   316 405   366 494   416 554
 17  34    67 100   117 159   167 224   217 276   267 333   317 406   367 495   417 555
 18  35    68 101   118 160   168 225   218 277   268 334   318 407   368 496   418 556
 19  39    69 103   119 161   169 226   219 278   269 335   319 410   369 497   419 557
 20  41    70 104   120 162   170 227   220 279   270 336   320 411   370 500   420 558
 21  42    71 105   121 169   171 228   221 280   271 340   321 412   371 501   421 559
 22  44    72 106   122 170   172 229   222 281   272 341   322 413   372 502   422 560
 23  46    73 107   123 171   173 230   223 282   273 344   323 414   373 503   423 561
 24  47    74 108   124 172   174 231   224 283   274 345   324 416   374 504   424 562
 25  48    75 109   125 173   175 233   225 284   275 346   325 417   375 505   425 563
 26  49    76 110   126 174   176 234   226 285   276 347   326 421   376 506
 27  50    77 111   127 175   177 235   227 286   277 349   327 423   377 507
 28  51    78 112   128 176   178 236   228 287   278 350   328 424   378 508
 29  52    79 114   129 177   179 237   229 288   279 352   329 425   379 509
 30  53    80 115   130 178   180 238   230 291   280 353   330 426   380 511
 31  54    81 116   131 179   181 239   231 292   281 354   331 429   381 512
 32  56    82 117   132 180   182 240   232 293   282 355   332 430   382 513
 33  57    83 118   133 181   183 241   233 294   283 356   333 433   383 514
 34  58    84 119   134 182   184 242   234 295   284 357   334 435   384 516
 35  59    85 120   135 183   185 243   235 296   285 358   335 436   385 518
 36  60    86 121   136 184   186 244   236 297   286 360   336 437   386 519
 37  61    87 122   137 185   187 245   237 298   287 362   337 441   387 521
 38  62    88 125   138 186   188 246   238 299   288 363   338 442   388 522
 39  63    89 127   139 187   189 247   239 300   289 364   339 443   389 523
 40  64    90 128   140 188   190 248   240 301   290 366   340 444   390 524
 41  65    91 129   141 189   191 249   241 302   291 367   341 445   391 525
 42  66    92 130   142 190   192 250   242 303   292 368   342 448   392 526
 43  67    93 132   143 191   193 251   243 304   293 369   343 458   393 527
 44  68    94 133   144 192   194 252   244 305   294 379   344 459   394 528
 45  69    95 134   145 193   195 253   245 306   295 380   345 460   395 529
 46  70    96 135   146 194   196 254   246 307   296 381   346 461   396 532
 47  71    97 136   147 195   197 255   247 308   297 382   347 462   397 533
 48  80    98 137   148 196   198 256   248 309   298 383   348 463   398 534
 49  81    99 139   149 197   199 257   249 310   299 384   349 469   399 535
 50  82   100 142   150 198   200 258   250 311   300 385   350 470   400 536

