CROSS LANGUAGE INFORMATION RETRIEVAL FOR LANGUAGES WITH SCARCE RESOURCES
Christian E. Loza
Thesis Prepared for the Degree of
MASTER OF SCIENCE
UNIVERSITY OF NORTH TEXAS
May 2009
APPROVED:
Rada Mihalcea, Major Professor
Paul Tarau, Committee Member
Miguel Ruiz, Committee Member
Armin Mikler, Graduate Advisor
Krishna Kavi, Chair of the Department of Computer Science and Engineering
Michael Monticino, Interim Dean of the Robert B. Toulouse School of Graduate Studies
Loza, Christian. Cross Language Information Retrieval for Languages with Scarce
Resources. Master of Science (Computer Science), May 2009, 53 pp., 2 tables, 20 illustrations,
bibliography, 21 titles.
Our generation has experienced one of the most dramatic changes in how society
communicates. Today, we have online information on almost any imaginable topic. However,
most of this information is available in only a few dozen languages. In this thesis, I explore the
use of parallel texts to enable cross‐language information retrieval (CLIR) for languages with
scarce resources. To build the parallel text I use the Bible. I evaluate different variables and
their impact on the resulting CLIR system, specifically: (1) the CLIR results when using different
amounts of parallel text; (2) the role of paraphrasing on the quality of the CLIR output; (3) the
impact on accuracy when translating the query versus translating the collection of documents;
and finally (4) how the results are affected by the use of different dialects. The results show
that all these variables have a direct impact on the quality of the CLIR system.
Copyright 2009
By
Christian E. Loza
CONTENTS

CHAPTER 1. INTRODUCTION
1.1. Problem Statement
1.2. Contribution

CHAPTER 2. A BRIEF REVIEW OF INFORMATION RETRIEVAL, MACHINE TRANSLATION AND NLP FOR SCARCE RESOURCE LANGUAGES
2.1. Information Retrieval
2.1.1. Definition of Information Retrieval
2.1.2. Formal Definition and Model
2.1.3. Boolean Model
2.1.4. Vector Space Model
2.1.5. Probabilistic Model
2.1.6. Other Models
2.2. Machine Translation
2.2.1. Development of Machine Translation
2.2.2. Approaches to MT
2.2.3. Software Resources for Machine Translation
2.3. Cross-language Information Retrieval
2.3.1. The Importance of Cross Language Information Retrieval
2.3.2. Main Approaches to CLIR
2.3.3. Challenges Associated with CLIR for Languages with Scarce Resources
2.4. Related Work and Current Research
2.4.1. Computational Linguistics for Resource-Scarce Languages
2.4.2. Focusing on a Particular Language
2.4.3. NLP for Languages with Scarce Resources
2.4.4. CLIR for Languages with Scarce Resources
2.4.5. Cross Language Information Retrieval with Dictionary Based Methods
2.4.6. Cross Language Information Retrieval Experiments
2.4.7. Word Alignment for Languages with Scarce Resources

CHAPTER 3. EXPERIMENTAL FRAMEWORK
3.1. Metrics
3.2. Data Sources
3.2.1. Time Collection
3.2.2. Parallel Text

CHAPTER 4. EXPERIMENTAL RESULTS
4.1. Objectives
4.2. Experiments
4.2.1. Experiment 1
4.2.2. Experiment 2
4.2.3. Experiment 3
4.2.4. Experiment 4
4.2.5. Experiment 5

CHAPTER 5. DISCUSSION AND CONCLUSIONS
5.1. Contributions
5.2. Conclusions

APPENDIX. TIME COLLECTION FILES

BIBLIOGRAPHY
CHAPTER 1
INTRODUCTION
1.1. Problem Statement
One of the most important changes in our society and the way we live in the last decades
is the availability of information, as a result of the development of the World Wide Web and
the Internet.
With English as the lingua franca of the modern era and the United States as the birthplace of the Internet, most of the current content on the World Wide Web is written in English.
Along with English, major European languages like Spanish, French and German, and others like Chinese, Russian and Japanese, are examples of the few dozen languages with a significant amount of content available on the Web, in direct relationship with the number of speakers of those languages who have regular access to the Web.
Figure 1.1, based on information reported in [8], shows that most of the languages in the
world have relatively few speakers.
Figure 1.1. Distribution of number of speakers.
As a result of this distribution of content, out of all the languages spoken in the world today, only a few dozen have a significant number of computational resources developed for them. Most currently spoken languages have little or no computational resources.
I will call the first set of languages “languages with computational resources,” as opposed
to the rest of the languages spoken in the world, which I call “scarce resource languages.”
We want to study methods to develop standard natural language processing (NLP) computational resources for these languages, and to see how different approaches to producing them affect other tasks. In particular, this thesis concentrates on the task of information retrieval.
I chose Quechua for this study because it has fourteen million speakers, most of whom have little or no access to the Web, making it a scarce resource language.
As shown in earlier works such as [1] and [5], even gathering the initial resources is a difficult task when creating computational resources for a language that does not have them.
The quantity of documents available in any language is generally proportional to the number of its speakers. Most scarce resource languages have relatively few speakers and few written documents, which increases the difficulty of obtaining them.
Other problems in developing initial resources include the availability of a unique written form of the language, the availability of an alphabet, the presence of dialects, and the use of different character sets with current software tools.
All these problems are common to most scarce resource languages. Many researchers have proposed different approaches to them, according to the language that was the focus of their study.
1.2. Contribution
In this thesis, I analyze specific aspects of the construction and use of parallel corpora for information retrieval.
More specifically, we want to discuss the following:
(i) How much does the amount of parallel data affect the precision of information retrieval?
(ii) Is it possible to increase the precision of the information retrieval results using paraphrasing?
(iii) Is there a significant difference if we translate the query or the collection of documents? How is this affected by the other factors?
(iv) Is it possible to use information retrieval across different dialects of the same language? How much is the precision affected by this?
CHAPTER 2
A BRIEF REVIEW OF INFORMATION RETRIEVAL, MACHINE TRANSLATION AND
NLP FOR SCARCE RESOURCE LANGUAGES
2.1. Information Retrieval
Information retrieval (IR) is defined as the task of retrieving, from a collection, the documents that satisfy a query or an information need [2].
In the context of computer science, IR is a branch of natural language processing (NLP) that studies the automatic search of information and all the subtasks associated with it.
Remarkable early events in the development of IR include Herman Hollerith's mechanical tabulator, based on punch cards and built in 1890 to rapidly tabulate statistics; Vannevar Bush's 'As We May Think', which appeared in the Atlantic Monthly in 1945; and the 1947 work of Hans Peter Luhn (research engineer at IBM) on mechanized punch cards for searching chemical compounds. More recently, in 1992 the first Text Retrieval Conference (TREC) took place.
One of the most important developments in IR to date has been the rise of massive search engines over the last decade. A decade ago libraries were still the main place to search for information, but today we can find information on almost any topic using the Internet.
2.1.1. Definition of Information Retrieval
IR studies the representation, storage, organization of, and access to information items [2].
A data retrieval system receives a query from a user and returns one or more information items, sorted by their degree of relevance to the query.
An information retrieval system tries to retrieve from the collection the information that will answer the particular query.
The availability of resources and information on the Web has changed how we search for information. The Internet has not only created a vast number of people looking for information, it has also created a greater number of people writing content on all subjects.
2.1.2. Formal Definition and Model
IR can be defined using the following model [2]:

(1) [D, Q, F, R(q_i, d_j)]

where:
(i) D is a set of logical representations of the documents.
(ii) Q is a set of logical representations of the user information needs, called queries.
(iii) F is a framework for modeling the documents, the queries, and their relationships.
(iv) R(q_i, d_j) is a ranking function.
There are three main models in IR: the Boolean model, the vector space model, and the probabilistic model.
2.1.3. Boolean Model
The Boolean model is based on Boolean algebra: queries are specified in Boolean terms, using operators such as AND and OR. This makes the model simple and clear, and easy to formalize and implement.
Unfortunately, these same qualities also present disadvantages. Because the model is binary, it prevents any notion of scale, which tends to decrease the quality of the results returned by the information system.
The representation of a document consists of a set of term weights assumed to be binary. A query is composed of index terms linked by the Boolean operators AND, OR, and NOT.
The Boolean model can be defined as follows [2]:
(i) D: the terms can either be present (True) or absent (False).
(ii) Q: the queries are Boolean expressions, linking terms with operators.
(iii) F: the framework is Boolean algebra, with its operators AND, OR, and NOT.
(iv) The similarity function is defined in the context of Boolean algebra as:

(2) sim(d_j, q) = True if \exists \vec{q}_{cc} \mid (\vec{q}_{cc} \in \vec{q}_{dnf}) \wedge (\forall k_i,\ g_i(\vec{d}_j) = g_i(\vec{q}_{cc})); False otherwise

where \vec{q}_{dnf} is the disjunctive normal form of the query, \vec{q}_{cc} is one of its conjunctive components, and g_i returns the weight of term k_i.
As shown above, the relevance criterion is a Boolean value: true (relevant) or false (not relevant). Another disadvantage of this model is that the ranking function does not indicate how relevant a document is, so it is not possible to compare its relevance to that of other documents. This makes the model less flexible than the other two.
As a result of these disadvantages, but taking into account its advantages, the Boolean model is generally used in conjunction with the other models, or modified to express a degree of relevance, as in the fuzzy Boolean model.
2.1.4. Vector Space Model
The vector space model presents a framework where a document is represented as a vector in an N-dimensional space [18]. Each dimension of the vector carries a weight w_i, a non-negative, non-binary number. The space has N dimensions, corresponding to the unique terms present in the collection. The query is also defined as a vector in the same N-dimensional space.
(i) D: the N terms are dimensions, and the documents are vectors in this space.
(ii) Q: the queries q_j are also vectors in this N-dimensional space.
(iii) F: the framework is defined in the context of vector analysis and its operations. One of the most commonly used metrics of distance between vectors is the cosine similarity, defined below.

(3) sim(d_j, q) = (\vec{d}_j \cdot \vec{q}) / (|\vec{d}_j| \times |\vec{q}|)

The main advantage of this framework is that we can define the similarity between queries and documents as a function of the similarity between two vectors. Many functions can give us a measure of distance between two vectors, the cosine similarity being one of the most used for IR purposes.
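Equation (3) can be computed directly; the sketch below, with hypothetical three-term vectors, is only an illustration of the formula.

```python
# Cosine similarity between a document vector and a query vector, equation (3):
# the dot product divided by the product of the vector norms.
import math

def cosine(d, q):
    dot = sum(di * qi for di, qi in zip(d, q))
    norm = math.sqrt(sum(di * di for di in d)) * math.sqrt(sum(qi * qi for qi in q))
    return dot / norm if norm else 0.0

d = [1.0, 2.0, 0.0]   # weights of a document over three index terms
q = [1.0, 0.0, 1.0]   # weights of a query over the same terms
print(round(cosine(d, q), 3))  # 0.316
```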
2.1.4.1. Weight Metrics
There are many methods to calculate the weights w_i for the N dimensions of each document, as shown in [18], and this remains an active research topic. For our experiments, we used the term frequency-inverse document frequency, tf-idf.
We define the term frequency (tf) tf_{i,j} as:

(4) tf_{i,j} = f_{i,j} / max_k f_{k,j}

where f_{i,j} is the frequency of term i in document j, normalized by the maximum frequency observed for any term k in document d_j.
The inverse document frequency (idf) is defined as:

(5) idf_i = log(N / n_i)

where n_i is the number of documents in which term i appears.
The tf-idf weighting scheme is defined as:

(6) w_{i,j} = tf_{i,j} \times idf_i = (f_{i,j} / max_k f_{k,j}) \times log(N / n_i)
2.1.5. Probabilistic Model
Defined in the context (framework) of probabilities, the Probabilistic Model tries to esti-
mate the probability that specific document is relevant to a given query.
The model assumes there is a subset R of documents which are the ideal answer for the
query q. Given the query q, the model assigns measure of similarity to the query.
Formally, we have the following:
(i) D: the term weights are binary, w \in {0, 1}.
(ii) Q: the queries are subsets of index terms.
(iii) F: the framework is defined in the context of probability theory. Let R be the set of documents known to be relevant, and let \bar{R} be the complement of R. P(R | \vec{d}_j) is the probability that document d_j is relevant, and the similarity function is defined as:

(7) sim(d_j, q) = (P(\vec{d}_j | R) \times P(R)) / (P(\vec{d}_j | \bar{R}) \times P(\bar{R}))
One advantage of this model is that documents are ranked according to their probability of being relevant. Among its disadvantages are that it requires an initial classification of documents into relevant and non-relevant (which is the original purpose of the system), and that it assumes the terms are independent of each other.
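Under the term-independence assumption, the odds in equation (7) factor into per-term contributions, which is usually computed as a sum of log-odds. The probabilities below are hypothetical, chosen only to illustrate the ranking behavior.

```python
# Sketch of probabilistic ranking under term independence: score a document by
# the log-odds of relevance, using assumed per-term probabilities
# p_i = P(term present | R) and u_i = P(term present | not R).
import math

p = {"grace": 0.8, "lord": 0.6}   # P(term | R), hypothetical
u = {"grace": 0.2, "lord": 0.4}   # P(term | not R), hypothetical

def score(doc_terms):
    s = 0.0
    for t in p:
        if t in doc_terms:
            s += math.log(p[t] / u[t])              # term present in document
        else:
            s += math.log((1 - p[t]) / (1 - u[t]))  # term absent
    return s

# A document containing more query terms gets a higher relevance score.
print(score({"grace", "lord"}) > score({"lord"}) > score(set()))  # True
```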
2.1.6. Other Models
Most current approaches in the field of IR build on top of one or more of the previous models. We can mention, among others, the fuzzy set model and the extended Boolean model.
2.2. Machine Translation
Machine translation (MT) is a field in NLP that investigates ways to perform automatic
translation between two languages.
In the context of this task, the original language is called the source (S) language, and
the desired result language is called the target (T) language.
The naive approach is to replace words from the source language with words from the
target language, based on a dictionary.
During the past two decades, research has been active, and many new methods have been proposed.
2.2.1. Development of Machine Translation
The history of machine translation can be traced back to René Descartes and Gottfried Leibniz. In 1629, Descartes proposed a universal language in which equivalent ideas in different languages would share one symbol. In 1667, in his Preface to the General Science, Leibniz argued that symbols are important for human understanding, and he made important contributions to symbolic reasoning and symbolic logic.
In the past century, in 1933, the Russian Petr Petrovich Troyanskij patented a device for translation that stored multilingual dictionaries; he continued this work for 15 years, using Esperanto to deal with grammatical roles between languages.
During the Second World War, scientists worked on the tasks of breaking codes and performing cryptography. Mathematicians and early computer scientists thought that the task of MT was
very similar in essence to the task of breaking encrypted codes. The use of terms such as 'decoding' a text dates from this period.
In 1952, Yehoshua Bar-Hillel, MIT's first full-time MT researcher, organized the first MT research conference. In 1954, the first public demonstration of an MT system was held at Georgetown University, translating 49 sentences from Russian to English. In 1964, the National Academy of Sciences created the Automatic Language Processing Advisory Committee (ALPAC) to evaluate the feasibility of MT.
One of the best known events in the history of MT is the publication of ALPAC's report in 1966, which had an undeniable negative effect on the MT research community. The report concluded that, after many years of research, MT had not produced useful results, and deemed the task hopeless. The result was a substantial funding cut for MT research in the following years; consequently, in the immediately following decades, progress on MT was somewhat slow.
In 1967, L. E. Baum, T. Petrie, G. Soules and N. Weiss introduced the expectation-maximization technique for the statistical analysis of probabilistic functions of Markov chains which, after the Shannon Lecture by Welch, became known as the Baum-Welch algorithm for finding the unknown parameters of a hidden Markov model (HMM). The Baum-Welch algorithm is a generalized expectation-maximization (GEM) algorithm.
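The core quantity Baum-Welch re-estimates from is the likelihood of an observation sequence under the current HMM parameters, computed with the forward algorithm. The sketch below shows only this forward pass, on a toy two-state model with hypothetical probabilities; a full Baum-Welch implementation would add the backward pass and the EM parameter updates.

```python
# Forward algorithm for an HMM: computes P(observations) by summing over all
# hidden-state paths. Baum-Welch uses these forward probabilities (together
# with a backward pass) in its E-step.
def forward(obs, states, start_p, trans_p, emit_p):
    alpha = [{s: start_p[s] * emit_p[s][obs[0]] for s in states}]
    for o in obs[1:]:
        prev = alpha[-1]
        alpha.append({
            s: sum(prev[r] * trans_p[r][s] for r in states) * emit_p[s][o]
            for s in states
        })
    return sum(alpha[-1].values())

states = ("A", "B")
start_p = {"A": 0.6, "B": 0.4}
trans_p = {"A": {"A": 0.7, "B": 0.3}, "B": {"A": 0.4, "B": 0.6}}
emit_p = {"A": {"x": 0.5, "y": 0.5}, "B": {"x": 0.1, "y": 0.9}}

likelihood = forward(("x", "y"), states, start_p, trans_p, emit_p)
print(likelihood)  # a probability strictly between 0 and 1
```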
Although many consider that stochastic approaches to MT began in the 1980s, short contributions during the 1950s already pointed in this direction. Most systems at the time were mainframe-based.
At the end of the 1980s, a group at IBM led by P. Brown developed the statistical models in [4], which led to the widely used IBM models for statistical machine translation (SMT).
The 1990s saw the development of machine translation systems for personal computers, and at the end of the decade many websites started offering MT services.
In the 2000s, machines became more powerful and more data became available, making many approaches to SMT feasible.
2.2.2. Approaches to MT
The main paradigms for MT can be classified into three groups: rule-based MT, corpus-based MT, and hybrid approaches.
2.2.2.1. Rule Based Machine Translation
Rule-based machine translation (RBMT) is a paradigm that generally involves several steps or processes which analyze the morphology, syntax and semantics of an input text, and then use rule-based mechanisms to transform the source language into the target language through an internal structure or some interlingua.
2.2.2.2. Corpus Based Machine Translation
Corpus-based machine translation involves using a bilingual corpus to extract statistical parameters and applying them to models.
• Statistical machine translation (SMT): using a parallel corpus, the parameters of a statistical model are estimated and then applied for translation. These methods were reintroduced by [4].
• Example-based machine translation (EBMT): this approach is comparable to case-based learning, with examples extracted from a parallel corpus.
2.2.3. Software Resources for Machine Translation
2.2.3.1. GIZA++
As part of the Egypt project, GIZA is used for analyzing bilingual corpora, and includes an implementation of the algorithms described in [4].
This tool was developed as part of the SMT system EGYPT. In addition to GIZA, GIZA++ was developed by Franz Josef Och as an extension that includes Models 4 and 5 [15].
It implements the Baum-Welch forward-backward algorithm, with empty words and dependency on word classes.
2.2.3.2. MKCLS
MKCLS implements language modeling and is used to train word classes using a maximum-likelihood criterion [14]. The task of this model is to estimate the probability of a sequence of words, w_1^N = w_1, w_2, ..., w_N. To approximate this, we can use Pr(w_1^N) = \prod_{i=1}^{N} p(w_i | w_{i-1}).
When we use this to estimate bigram probabilities, we run into data sparseness, since most bigrams will never be seen. To address this problem, we can create a function C that maps words w to classes C, in this way:

(8) p(w_1^N | C) := \prod_{i=1}^{N} p(C(w_i) | C(w_{i-1})) \cdot p(w_i | C(w_i))
The class mapping is obtained with a maximum-likelihood approach:

(9) C = \arg\max_{C} p(w_1^N | C)

As described in [14], following [16], we can arrive at the following optimization:

(10) LP_1(C, n) = -\sum_{C, C'} h(n(C | C')) + 2 \sum_{C} h(n(C))

(11) C = \arg\max_{C} LP_1(C, n)
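To make equation (8) concrete, the sketch below scores a sentence under a class-based bigram model with a tiny hypothetical corpus and class map. It is only an illustration of the factorization, not the MKCLS implementation, and for simplicity it scores the bigram factors only, ignoring the sentence-initial term.

```python
# Class-based bigram model, equation (8): each factor is
# p(C(w_i) | C(w_{i-1})) * p(w_i | C(w_i)), estimated here by relative
# frequencies over a toy corpus with a hypothetical class map.
import math
from collections import Counter

corpus = "the cat sat the dog sat".split()
cls = {"the": 0, "cat": 1, "dog": 1, "sat": 2}  # hypothetical word classes

class_seq = [cls[w] for w in corpus]
bigrams = Counter(zip(class_seq, class_seq[1:]))  # class-bigram counts
class_counts = Counter(class_seq)
word_counts = Counter(corpus)

def p_class(c2, c1):
    # p(C(w_i) | C(w_{i-1})) from class-bigram counts
    return bigrams[(c1, c2)] / sum(n for (a, _), n in bigrams.items() if a == c1)

def p_word(w):
    # p(w | C(w)): word frequency within its own class
    return word_counts[w] / class_counts[cls[w]]

sent = "the dog sat".split()
logp = sum(
    math.log(p_class(cls[w], cls[prev]) * p_word(w))
    for prev, w in zip(sent, sent[1:])
)
print(logp < 0)  # True: a valid probability below 1
```

Because "cat" and "dog" share a class, the unseen word bigram "the dog" still receives probability mass through the class bigram, which is exactly the sparseness remedy described above.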
MKCLS uses alignment templates extracted from the parallel corpus, which contain an alignment between the sentences.
The implementation of the template model was an early development of phrase translation, which is used in current machine translation systems.
2.2.3.3. Moses
Moses, like GIZA++, is an implementation of a statistical machine translation system. It is a factored, phrase-based beam-search decoder. The phrase-based statistical translation model [10] is different in that it builds sequences of words (phrases) instead of aligning individual words. Phrase-based models map text phrases without any linguistic information.
We model this problem as a noisy-channel problem:

(12) p(e | f) = p(f | e) p(e) / p(f)

Using Bayes' rule, and noting that p(f) does not depend on e, we can rewrite this expression as follows:

(13) \arg\max_{e} p(e | f) = \arg\max_{e} p(f | e) p(e)
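The decision rule argmax_e p(f | e) p(e) can be sketched over an explicit candidate set; the candidates and probabilities below are hypothetical, and a real decoder searches a vast space of phrase segmentations instead of enumerating translations.

```python
# Noisy-channel decoding sketch: pick the translation e that maximizes the
# product of the translation model p(f | e) ("tm") and the language model
# p(e) ("lm"). Hypothetical candidates and scores, for illustration only.
candidates = {
    "the grace of the lord": {"tm": 0.30, "lm": 0.020},
    "grace the lord of":     {"tm": 0.30, "lm": 0.001},  # same words, bad order
}

def decode(cands):
    return max(cands, key=lambda e: cands[e]["tm"] * cands[e]["lm"])

print(decode(candidates))  # the language model prefers the fluent word order
```

This also shows the division of labor in the model: the translation model judges adequacy, while the language model judges fluency of the output.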
Each sentence in this case is decomposed into a sequence of phrases, translated with the following distribution:

(14) \phi(f_i | e_i)

Using a line-aligned parallel text, we produce an alignment between the words. For this step, we use GIZA++.
For the example, we used the book of Revelation, chapter 22, verse 21. Figures 2.1 and 2.2 show the translation alignments for English to Spanish and Spanish to English of this verse, which is the last verse of the Bible, using the New King James Version and the Dios Habla Hoy translations respectively.
• e: the grace of our lord jesus christ be with you all
• f: que el senor jesus derrame su gracia sobre todos
Figure 2.1. Translations English to Spanish, and Spanish to English.
Figure 2.2. The intersection.
Starting from the word alignment, several approaches try to approximate the best way to learn phrase translations.
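The intersection step shown in Figure 2.2 can be sketched directly on alignment links. The link sets below are hypothetical (they do not reproduce the verse alignment above); each link is a (source position, target position) pair.

```python
# Symmetrizing two directional word alignments: the intersection keeps only
# links proposed by both directions (high precision); the union keeps links
# from either direction (high recall) and is a common starting point for
# growing phrase pairs. Hypothetical link sets, both in the same orientation.
e2f = {(0, 1), (1, 4), (2, 5), (3, 2)}   # links from the English->Spanish run
f2e = {(0, 1), (1, 4), (3, 2), (4, 6)}   # links from the Spanish->English run

intersection = e2f & f2e
union = e2f | f2e
print(sorted(intersection))  # [(0, 1), (1, 4), (3, 2)]
```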
2.3. Cross-language Information Retrieval
Cross-language information retrieval (CLIR) is the task of performing IR when the query and the collection are in two different languages.
One of the main motivations for the development of CLIR is the availability of information in English, and the need to make these resources available to speakers of other languages.
2.3.1. The Importance of Cross Language Information Retrieval
CLIR is important because most of the content available online is written in a couple dozen languages, and it is not accessible in the thousands of other languages for which little or no content is available.
For example, a query in English in a search engine for 'cold sickness' returns 2,130,000 results, while the same search in Spanish, 'resfrio', returns only 106,000 documents. A search for the same concept in Quechua, 'Qhonasuro', shows no documents available in the index.
A query in English for the word 'answer' returns 511 million relevant documents. The query for the Spanish word 'respuesta' returns 76 million results. The same query for the Quechua word 'kutichiy' returns 638 documents.
A query in English for the words 'blind wiki' returns 2.4 million results, with 30 different senses in Wikipedia. The same query in Spanish, for 'ciego', returns 71 thousand relevant documents, with 10 different senses. The query for the Quechua word 'awsa' returns two thousand results, with no entry in any Wiki page.
These examples show that although there is a lot of useful information on the Internet,
it may not be accessible in most languages.
UNESCO has estimated that, out of close to seven thousand living languages spoken around the world today, around six thousand are disappearing or at risk of disappearing [21].
A language not only constitutes a way of human communication, it also constitutes a
repository of cultural knowledge, a source of wisdom and a particular way to interpret and to
see the world. In this context, this work is an effort to provide tools for native speakers to use their languages to access the vast amount of information available in other languages. It is also part of the effort to encourage the use of modern tools in languages other than the few major ones.
Creating modern tools like an IR system for different languages constitutes another way to preserve them: it can promote the use of a language by making documents accessible to the people who speak it.
2.3.2. Main Approaches to CLIR
Depending on what we choose to translate, there are two main approaches to CLIR.
(i) Query translation: the main advantage of this method is that it is very efficient for short queries. The main disadvantages are that it is difficult to disambiguate query terms, and difficult to include relevance feedback.
(ii) Document translation: this approach involves translating the collection into the query language. This step can be done once, and then applied only to new documents added to the collection. This approach is difficult if the collection is very large or the number of target languages increases, since each document needs to be translated into all target languages.
2.3.3. Challenges Associated with CLIR for Languages with Scarce Resources
2.3.3.1. The Alphabet
To perform MT and CLIR with any pair of languages, the first factor to consider is the written form of each language. A native writing system may or may not be present, and in many cases transliteration is needed to map words from one writing system to the other. While some languages already have transliteration rules, this task is complex, and it is beyond the scope of the present experiments.
In the absence of a native alphabet, many languages have adopted a borrowed alphabet
from a different language. In the case of Quechua, it adopted the written alphabet of Spanish
(which is based on the Latin alphabet). As a result, some phonetic features have been
represented differently by different linguists:
• Five-vowel system: This is the older approach, based on Spanish orthography, supported
almost exclusively by the “Academia Mayor de la Lengua Quechua”.
• Three-vowel system: This approach uses only the three vowels a, i and u. Most
modern linguists support this system.
2.3.3.2. The Encoding
Another task in the development of resources for a language with scarce resources, at
the implementation level, is handling the different character encodings used in the
application.
This subtask can be very challenging, since the issue involves not only problems with
software and programs developed by other researchers, but also, at times, decisions about
aspects that are specific to the language being studied.
For example, while it is easy to agree on the alphabetic order of characters in English
and other languages based on the Latin alphabet, this task presents different difficulties
when the language is not English. Quechua uses the Spanish alphabet, but it is very common to
include the symbol “ ’ ”. While there are rules on how to order this symbol in Spanish, they
may or may not make sense for Quechua, taking into account the phonetics of the symbol.
In the specific case of Quechua, many systems for phonetic transcription have been proposed,
and there is still ongoing debate about this topic among linguists and native speakers.
2.4. Related Work and Current Research
2.4.1. Computational Linguistics for Resource-Scarce Languages
As we mentioned earlier, one of the first problems researchers face when building new
NLP resources for a language with scarce resources is the availability of written resources,
native speakers and general knowledge of the language.
There are two possible approaches to this task.
(i) To work on a particular language, creating resources specific to it, working with
experts to develop methods that are, most of the time, applicable only to that
language.
(ii) To create methods that would be applicable with little effort to any language.
2.4.2. Focusing on a Particular Language
Specific methods have been researched and developed focusing on a particular language
with scarce resources. [20] describes the process of creating resources for two languages,
Cebuano and Hindi. Cebuano is a native language of the Philippines.
Most of the experiments done for languages with scarce resources start from the same
ground: to create as many resources as possible in order to start applying traditional methods
and models.
For example, [12] describes a shared task on word alignment, where seven teams were
provided training and evaluation data and submitted word alignments between English and
Romanian, and English and French.
Other projects have also focused on the creation of resources for Quechua. For example,
[5] described the efforts to create NLP resources for Mapudungun, an indigenous language
spoken in Chile and Argentina, and for Quechua. In this experiment, EBMT and RBMT systems
were developed for both languages. As in the other experiments mentioned, one of the main
problems is building the initial resources.
2.4.3. NLP for Languages with Scarce Resources
The construction of NLP resources for languages with scarce resources has been studied
recently. For example, [6] proposes an unsupervised method for POS acquisition.
In the task of word alignment, [12] and [11] describe different approaches; [11]
proposes improvements to word alignment by incorporating syntactic knowledge.
2.4.4. CLIR for Languages with Scarce Resources
One of the common tasks in CLIR is experimentation with languages with few or no
resources.
Nevertheless, it is necessary to note that these experiments have, most of the time,
targeted languages of interest with many speakers. While TREC-2001 and TREC-2002 focused
on Arabic, other experiments have also been carried out for languages with few or no resources
at all. For example, in 2003 DARPA organized a competition, in the context of the Translingual
Information Detection, Extraction and Summarization (TIDES) program, where teams had to
build from scratch an MT system for a surprise language, Hindi. In these experiments,
researchers needed to port and create resources for a new language in a relatively short
amount of time, in this case, one month.
2.4.5. Cross Language Information Retrieval with Dictionary Based Methods
Earlier experiments worked with dictionary-based methods for CLIR. One of these works
was presented by Lisa Ballesteros [3].
In this work, they found that machine-readable dictionaries would drop the effectiveness
of retrieval by about 50% due to ambiguity.
They reported that local feedback prior to translation improves precision. They also
showed that this affects longer queries more, attributing the effect to the reduction of
irrelevant translations.
The combination of both methods led to better results for both short and long
queries.
2.4.6. Cross Language Information Retrieval Experiments
Ten groups participated in TREC-2001, where they had to retrieve Arabic-language
documents based on 25 queries in English.
Since that was the first year that many resources in English were available, new approaches
were tried [7].
One result of this experiment was that query length did not improve retrieval
effectiveness.
In TREC-2002, nine teams participated in the CLIR track [13].
The monolingual runs were used as a baseline against which the cross-language
results could be compared.
As had happened a year before, the teams observed substantial variability in effectiveness
on a topic-to-topic basis.
2.4.7. Word Alignment for Languages with Scarce Resources
More recently, the ACL organized a task [9] to align words for languages with scarce
resources. The pairs were English-Inuktitut, English-Hindi and Romanian-English.
One of the most interesting outcomes of this work, relative to this study, is in the results
for English-Hindi. The use of additional resources resulted in absolute improvements of 20%,
which was not the case for languages with larger training corpora.
CHAPTER 3
EXPERIMENTAL FRAMEWORK
3.1. Metrics
For our experiments, we used precision and recall to evaluate the effectiveness of the
results.
In CLIR, the quality of the translation directly affects the results obtained using IR, but
it is different from the quality measured by MT metrics alone.
One of the main reasons for this is that most IR methods use a bag-of-words approach,
as opposed to MT, where the order of the words is a factor.
To analyze the results in more depth, we also calculated precision at additional points
(beyond the standard 11 points), taking into account a progressive inclusion of results.
P = (relevant documents retrieved) / (documents returned)
R = (relevant documents retrieved) / (total relevant documents)
In the formulas, in order to calculate more points, we varied the total number of relevant
documents accordingly.
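The two metrics above can be computed over a ranked result list at every cutoff; this is a minimal sketch with hypothetical document IDs:

```python
# Precision and recall computed at each cutoff of a ranked result list,
# following the definitions above. Document IDs are hypothetical.
def precision_recall_at_cutoffs(ranked, relevant):
    relevant = set(relevant)
    points = []
    hits = 0
    for k, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
        # precision = relevant retrieved / documents returned
        # recall    = relevant retrieved / total relevant documents
        points.append((hits / k, hits / len(relevant)))
    return points

points = precision_recall_at_cutoffs(["d1", "d7", "d3"], {"d1", "d3"})
```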
3.2. Data Sources
To evaluate how different quantities of parallel text affect the translation quality, we used
the following data sources:
3.2.1. Time Collection
The Time magazine collection is a set of 423 documents, associated with 83 queries,
for which relevance was manually assigned.
This collection originally contained 425 documents, two of which were duplicates
(documents 504 and 505 in the original numeration); both were discarded for this reason.
There is a total of 324 relevant documents associated with the 83 queries.
3.2.1.1. Preprocessing
In order to use this collection, we needed to preprocess the files. The collection's
documents are distributed in a single file, which has to be parsed and transformed to be usable.
The files were numbered in an unspecified order. For this experiment, they were rearranged
and indexed. The mapping is detailed in Appendix ‘A’.
3.2.2. Parallel Text
Several sets of parallel text were used in this experiment. The main source of parallel
text was the Bible.
3.2.2.1. The Bible as a Source of Parallel Text
Many authors have described advantages and disadvantages of using the Bible as a source
of parallel text, for example [17].
The Bible is a source of high-quality translation that is available in several languages.
According to the United Bible Society [19], there are 438 languages with complete translations
of the Bible, and 2,454 languages with at least a portion of the Bible translated, making it the
parallel corpus available in the largest number of languages.
Table 3.1 contains all the versions that were used in this experiment.
All the Bibles required preprocessing to be usable for this experiment.
This process involved many manual steps. The final format of the Bibles, for post-processing,
was defined as follows: every line containing a new versicle starts with the following:
[a, b, c, d, e] Text of the versicle.
Table 3.1. List of Bible translations used for this experiment.

ID   Language         Name
008  English          American Standard Version
009  English          King James Version
015  English          Youngs Literal Translation
016  English          Darby Translation
031  English          New International Version
045  English          Amplified Bible
046  English          Contemporary English Version
047  English          English Standard Version
048  English          21st Century King James Version
049  English          New American Standard Bible
050  English          New King James Version
051  English          New Living Translation
063  English          Douay Rheims 1899 American Edition
064  English          New International Version U K
072  English          Todays New International Version
074  English          New Life Version
076  English          New International Readers Version
077  English          Holman Christian Standard Bible
078  English          New Century Version
998  Quechua Bolivia  Quechua Catholic Bolivia
999  Quechua Cuzco    Quechua Catholic Cuzco
Where:
a) Bible translation ID
b) Book number
c) Chapter number
d) Versicle number
e) Type of annotation

The different types of annotation are:
4) Title
5) Subtitle
6-10) Warning of duplicate versicles
We introduced lines to be skipped in the document, containing annotations of the
Bible: many chapters have a name, and some stories have titles. These titles were
commented out; although present in the Bible file, they use a different notation so that
they are skipped by the post-processing.
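A sketch of a parser for the line format described above; the regular expression and the assumption that ordinary versicles carry an annotation type outside the 4-10 range are ours, and the exact separators in the real files may differ:

```python
import re

# Parse a preprocessed Bible line of the form "[a, b, c, d, e] Text",
# where a = translation ID, b = book, c = chapter, d = versicle,
# e = annotation type. Annotation types 4-10 (titles, subtitles,
# duplicate-versicle warnings) are skipped.
LINE_RE = re.compile(r"^\[(\d+),\s*(\d+),\s*(\d+),\s*(\d+),\s*(\d+)\]\s*(.*)$")
SKIP_ANNOTATIONS = set(range(4, 11))

def parse_versicle(line):
    """Return (bible_id, book, chapter, versicle, text), or None if skipped."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    bible_id, book, chapter, versicle, annotation = (int(g) for g in m.groups()[:5])
    if annotation in SKIP_ANNOTATIONS:
        return None
    return (bible_id, book, chapter, versicle, m.group(6))
```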
The versions of the Bibles contained many particularities, which are not relevant in this
context. Nevertheless, some of them had to be accounted for in order to process the parallel
files.
3.2.2.2. Catholic and Protestant Versions
We used both Catholic and Protestant versions of the Bible. The difference between
them, in this context, is the number of books included, and the length of some of the books.
23
![Page 29: Cross Language Information Retrieval for Languages with ...CHAPTER 1. INTRODUCTION 1 1.1. Problem Statement 1 1.2. Contribution 2 CHAPTER 2. A BRIEF REVIEW OF INFORMATION RETRIEVAL,](https://reader036.fdocuments.us/reader036/viewer/2022071607/61447d8db5d1170afb43e86e/html5/thumbnails/29.jpg)
We include a listing of the books of the Bible used in both versions.
3.2.2.3. Joined Versicles
Some of the Bibles contained versicles that were, for translation purposes, joined together
into one. Some versions had noticeably more joined versicles than others. This seemed to
be arbitrary, since on inspection the joined versicles did not match from version to version.
When creating the parallel files for the parallel Bible, we joined the corresponding versicles
together.
In some cases, newly joined versicles corresponded to versicles that were not joined in the
other version. We grouped all the versicles that needed to be grouped so that there was a
matching number of corresponding versicles on both sides.
This process is dynamic, since different versions of the Bible are grouped differently, on a
case-by-case basis.
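The grouping process can be sketched as follows. Each version is represented as a list of (versicle numbers, text) pairs, joined versicles carrying more than one number; the representation and function are our own simplification, assuming both versions cover the same consecutive versicles:

```python
# Merge versicle groups from two versions of the same chapter until both
# sides cover the same versicle numbers, producing one-to-one parallel pairs.
# Assumes both versions list the same consecutive versicles in order.
def align_groups(version_a, version_b):
    aligned = []
    i = j = 0
    while i < len(version_a) and j < len(version_b):
        ids_a, texts_a = set(version_a[i][0]), [version_a[i][1]]
        ids_b, texts_b = set(version_b[j][0]), [version_b[j][1]]
        i += 1
        j += 1
        # Grow whichever side covers fewer versicles until the groups match.
        while ids_a != ids_b:
            if ids_a < ids_b:
                ids_a |= set(version_a[i][0])
                texts_a.append(version_a[i][1])
                i += 1
            else:
                ids_b |= set(version_b[j][0])
                texts_b.append(version_b[j][1])
                j += 1
        aligned.append((" ".join(texts_a), " ".join(texts_b)))
    return aligned
```

Because each version starts a chapter at the same versicle and lists versicles consecutively, one side's covered range is always a subset of the other's, so the inner loop terminates.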
Table 3.2. Comparison of content between Catholic and Protestant Bibles.

Book number  Book name        Versicles Catholic Bible  Versicles Protestant Bible
1   Genesis          50   50
2   Exodus           40   40
3   Leviticus        27   27
4   Numbers          36   36
5   Deuteronomy      34   34
6   Joshua           24   24
7   Judges           21   21
8   Ruth             4    4
9   1 Samuel         31   31
10  2 Samuel         24   24
11  1 Kings          22   22
12  2 Kings          25   25
13  1 Chronicles     29   29
14  2 Chronicles     36   36
15  Ezra             10   10
16  Nehemiah         13   13
17  Tobit            14   0
18  Judith           16   0
19  Esther           10   10
20  1 Maccabees      16   0
21  2 Maccabees      15   0
22  Job              42   42
23  Psalm            150  150
24  Proverbs         31   31
25  Ecclesiastes     12   12
26  Song             8    8
27  Wisdom           19   0
28  Sirach           51   0
29  Isaiah           66   66
30  Jeremiah         52   52
31  Lamentations     5    5
32  Baruch           6    0
33  Ezekiel          48   48
34  Daniel           13   12
35  Hosea            14   14
36  Joel             3    3
37  Amos             9    9
38  Obadiah          1    1
39  Jonah            4    4
40  Micah            7    7
41  Nahum            3    3
42  Habakkuk         3    3
43  Zephaniah        3    3
44  Haggai           2    2
45  Zechariah        14   14
46  Malachi          4    4
47  Matthew          28   28
48  Mark             16   16
49  Luke             24   24
50  John             21   21
51  Acts             28   28
52  Romans           16   16
53  1 Corinthians    16   16
54  2 Corinthians    13   13
55  Galatians        6    6
56  Ephesians        6    6
57  Philippians      4    4
58  Colossians       4    4
59  1 Thessalonians  5    5
60  2 Thessalonians  3    3
61  1 Timothy        6    6
62  2 Timothy        4    4
63  Titus            3    3
64  Philemon         1    1
65  Hebrews          13   13
66  James            5    5
67  1 Peter          5    5
68  2 Peter          3    3
69  1 John           5    5
70  2 John           1    1
71  3 John           1    1
72  Jude             1    1
73  Revelation       22   22
CHAPTER 4
EXPERIMENTAL RESULTS
4.1. Objectives
The objective of our experiments is to measure the impact of different variables on the
quality of the results obtained doing CLIR, within the context and limitations of languages
with scarce resources.
Namely, in the first experiment we analyze the effect of the amount of parallel
data on precision and recall, contrasting the quality of IR results for different corpus sizes. In
the second experiment, we analyze whether paraphrasing improves the results: we contrast
the use of more than one version of the English Bible, aligning each of them to
the Quechua Bibles. In the third experiment, we compare the results of translating the
collection against translating the query. In the fourth experiment, we analyze the possibility
of using MT on similar dialects, and how this affects the IR results. In the fifth and last
experiment, we analyze whether removing punctuation affects the results in any way: we
remove punctuation in the parallel corpus and contrast the results across the different
experimental settings.
To establish the upper bound of the experiments, we performed the IR experiment with
the collection and queries in English. Figure 4.1 shows the results of monolingual IR on the
collection using the vector space model, which is the upper bound of our CLIR experiments.
We contrast these results with the baseline, which is to pick documents from the list at
random.
As the size of the parallel corpus grows, the quality of the translation improves, and as a
direct result, the results of the IR task improve. In this experiment, we want to analyze the
impact of incrementing the data available for the parallel corpus, and how much data is
needed to make an impact on the results.
Figure 4.1. Upper bound and baseline.
For all the experiments, when we refer to the set of all Bibles, we refer to the aggregation
of the following English Bibles:
(i) American Standard Version
(ii) King James Version
(iii) Youngs Literal Translation
(iv) Darby Translation
(v) New International Version
(vi) Amplified Bible
(vii) Contemporary English Version
(viii) English Standard Version
(ix) 21st Century King James Version
(x) New American Standard Bible
(xi) New King James Version
(xii) New Living Translation
(xiii) Douay Rheims 1899 American Edition
(xiv) New International Version U K
(xv) Todays New International Version
(xvi) New Life Version
(xvii) New International Readers Version
(xviii) Holman Christian Standard Bible
(xix) New Century Version
The entire corpus has a total of 574,296 lines.
When we refer to the set of four Bibles, we refer to the aggregation of the following
English Bibles:
(i) American Standard Version
(ii) Contemporary English Version
(iii) English Standard Version
(iv) New International Readers Version
The corpus of four Bibles has a total of 118,762 lines.
The set of two Bibles is the aggregation of the following English Bibles:
(i) American Standard Version
(ii) English Standard Version
The corpus of two Bibles has a total of 58,711 lines.
When we refer to the set of one Bible, we refer to the following Bible:
(i) English Standard Version
This corpus has a total of 27,620 lines.
For the experiments involving only sections of 5,000 lines, 10,000 lines, 15,000 lines,
20,000 lines and 25,000 lines, we used in all cases the English Standard Version.
4.2. Experiments
4.2.1. Experiment 1
Analyze the effect of the size of the corpus
Translation of the collection
Punctuation was not removed
We used the English Standard Version and the Quechua Cuzco Version of the Bible in
order to construct the parallel files.
We aligned the parallel versions of the two Bibles and prepared the parallel files. In this
case, we translated the queries from Quechua to English, using the different translators created
from different quantities of lines of the corpus, in sections of 5,000 lines. We evaluated the
precision and recall of the results obtained using the vector space model.
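For reference, a minimal version of the vector space ranking used for evaluation can be sketched as follows; this is our own simplification using raw term frequencies, whereas the actual system may use tf-idf weighting and stemming:

```python
import math
from collections import Counter

# Rank documents by cosine similarity between term-frequency vectors.
# A simplified sketch of the vector space model; the real system may
# apply tf-idf weighting and stemming.
def cosine(query_terms, doc_terms):
    qv, dv = Counter(query_terms), Counter(doc_terms)
    dot = sum(qv[t] * dv[t] for t in qv)
    norm = math.sqrt(sum(v * v for v in qv.values())) * \
           math.sqrt(sum(v * v for v in dv.values()))
    return dot / norm if norm else 0.0

docs = {"d1": "the dog barks".split(), "d2": "the house is big".split()}
query = ["dog"]
ranked = sorted(docs, key=lambda d: cosine(query, docs[d]), reverse=True)
```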
Figure 4.2. Increments of 5,000 lines for the parallel corpus.
One thing that we observed is that there is a very clear difference between the results
using the complete Bible, 27,610 lines, and the results obtained using only 25,000 parallel
lines. This difference is less evident between the other sets, even when the difference in
parallel data is several times greater. One possible reason for this difference is that the New
Testament uses words that are more likely to appear in a current text.

Table 4.1. Normalized F-measure at different recall points.

Recall point  5k      10k     15k     20k     25k     27.6k
0.1           0.9731  0.9590  0.9590  0.9567  0.9567  0.9975
0.2           0.9203  0.9203  0.9153  0.9170  0.9119  0.9522
0.3           0.8568  0.8142  0.8604  0.8554  0.8279  0.8889
0.4           0.7995  0.7889  0.7820  0.7925  0.7943  0.8497
0.5           0.7644  0.7133  0.6791  0.6910  0.7203  0.8224
0.6           0.6030  0.5315  0.5232  0.5264  0.5469  0.6307
0.7           0.4013  0.3217  0.3346  0.3217  0.3526  0.3685
0.8           0.3534  0.3003  0.3112  0.2876  0.3085  0.3157
0.9           0.3878  0.3650  0.3650  0.3561  0.3739  0.4105
µ             0.6733  0.6349  0.6366  0.6338  0.6437  0.6929
In general terms, more data positively affects the quality of IR results.
Figure 4.3. Contrasting 5,000 lines with 27,600 lines (complete Bible).
In figure 4.3 we contrast the difference between using a corpus of 5,000 lines and the
complete set of 27,600 lines. We can see clearly that precision and recall increase as we
increase the size of the corpus.
In table 4.1 we calculated the F-measure at different recall points. We then normalized
these values by the upper bound, to see the impact of increasing the size of the corpus.
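This normalization can be sketched as: compute F at each recall point and divide by the upper bound's F at the same point. The numeric values below are illustrative, not taken from the table:

```python
# F-measure (harmonic mean of precision and recall) and its normalization
# by the upper bound's F at the same recall point, as used in table 4.1.
def f_measure(precision, recall):
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def normalized_f(precision, recall, upper_f):
    return f_measure(precision, recall) / upper_f

# Illustrative values only, not figures from the experiment.
f = f_measure(0.30, 0.50)
```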
4.2.2. Experiment 2
Analyze the effect of paraphrasing.
Quechua Version used: Cuzco Version.
Translation of the collection.
In the second experiment we want to evaluate whether paraphrasing improves precision
and recall. Paraphrasing increases the size of the parallel corpus, but the corpus is aligned in
all cases with the corresponding Quechua Bible, of which we used only one version at a time.
For this reason, in figure 4.4 we compare the results for one Bible, the combination of
two Bibles, and the combination of four.
Figure 4.4. Query translation.
We have to note that the increase in parallel data came from paraphrasing. We used
different versions of the Bible in English, and we aligned all of them to the Quechua Bibles.
We used only one of the Quechua Bibles at a time.
The results show an increase in precision when using different versions of the Bible in
the same language.
In figure 4.4 we can observe that the best performing corpus is the set of four Bibles.
32
![Page 38: Cross Language Information Retrieval for Languages with ...CHAPTER 1. INTRODUCTION 1 1.1. Problem Statement 1 1.2. Contribution 2 CHAPTER 2. A BRIEF REVIEW OF INFORMATION RETRIEVAL,](https://reader036.fdocuments.us/reader036/viewer/2022071607/61447d8db5d1170afb43e86e/html5/thumbnails/38.jpg)
Figure 4.5. Query translation.
Figure 4.6. Query translation.
In figure 4.5 we repeated the same experiment with punctuation removed, and again the
best results belonged to the set of four Bibles.
We also compared both sets of results; the best results came from the set of four
Bibles with punctuation removed, as shown in figure 4.6.
Table 4.2. Normalized F-measure at different recall points.

Recall point  1 bible  2 bibles  4 bibles
0.1           0.9567   0.9567    0.9567
0.2           0.9102   0.9170    0.9186
0.3           0.8493   0.8198    0.9011
0.4           0.7583   0.7601    0.8219
0.5           0.7099   0.6791    0.7852
0.6           0.5770   0.5342    0.6121
0.7           0.3969   0.3994    0.4740
0.8           0.3507   0.3765    0.4546
0.9           0.4268   0.4080    0.4544
µ             0.6595   0.6501    0.7087
In table 4.2 we can clearly see that although the corpus of two Bibles performed nearly 1%
worse than the corpus of one Bible, the best performance came from the corpus of four Bibles,
almost 5% better than the others.
4.2.3. Experiment 3
Analyze the effect of translating collection vs query.
Quechua Version used: Cuzco Version.
One of the research questions we wanted to answer was whether it is better to translate
the query or the collection, or whether both approaches give the same result. In this
experiment, we compare the results of translating the query with the results of translating
the collection.
This experiment was constructed using the corpus of 5,000 lines, one Bible, and finally
four Bibles, using the Cuzco dialect of the Quechua Bible without removing punctuation.
Figure 4.7 shows that translating the collection with 5,000 lines gave better results.
Figure 4.7. Query translation and collection translation (5k lines).
Figure 4.8 shows the same experiment using the corpus of one Bible, and we can see
clearly that translating the collection yields better results too.
Lastly, we repeated the experiment with the corpus of four Bibles. Figure 4.9 shows
that translating the collection performed better than translating the query in this case as
well.
Figure 4.8. Query translation and collection translation (one Bible).
Figure 4.9. Query translation and collection translation (four Bibles).
4.2.4. Experiment 4
Analyze the effect of using a different dialect
Quechua Versions used: Cuzco Version and Bolivia Version
Table 4.3. Normalized F-measure at different recall points for a corpus of 5k lines.

Recall point  Collection translation  Query translation  δ
0.1           0.1587                  0.1560             0.2675
0.2           0.2383                  0.2366             0.1751
0.3           0.2796                  0.2776             0.2009
0.4           0.2891                  0.2815             0.7533
0.5           0.2621                  0.2445             1.7617
0.6           0.1852                  0.1710             1.4213
0.7           0.0940                  0.0999             -0.5891
0.8           0.0626                  0.0766             -1.3906
0.9           0.0541                  0.0609             -0.6747
Table 4.4. Normalized F-measure at different recall points for a corpus of one bible.

Recall point  Collection translation  Query translation  δ
0.1           0.1587                  0.1560             0.2675
0.2           0.2401                  0.2374             0.2633
0.3           0.2845                  0.2788             0.5742
0.4           0.3019                  0.2781             2.3721
0.5           0.2991                  0.2558             4.3289
0.6           0.2405                  0.1856             5.4882
0.7           0.1607                  0.1078             5.2943
0.8           0.0840                  0.0716             1.2309
0.9           0.0629                  0.0633             -0.0372
Table 4.5. Normalized F-measure at different recall points for a corpus of four bibles.

Recall point  Collection translation  Query translation  δ
0.1           0.1587                  0.1552             0.3423
0.2           0.2392                  0.2366             0.2628
0.3           0.2841                  0.2784             0.5701
0.4           0.3015                  0.2740             2.7450
0.5           0.2982                  0.2443             5.3989
0.6           0.2326                  0.1577             7.4918
0.7           0.1486                  0.0870             6.1532
0.8           0.0823                  0.0667             1.5635
0.9           0.0629                  0.0562             0.6731
In this experiment, we want to see how much impact using a different dialect has for IR
purposes, by performing the experiment with a different dialect in the query and in the
collection.
The Time collection was translated by human translators who spoke Quechua Cuzco.
In this experiment, we contrast the impact of creating translators using the same dialect,
Quechua Cuzco, with translators using a different dialect, which is Quechua Bolivia.
For the first set of experiments, we translated the collection using the corpus of four
Bibles, as shown in figure 4.12.
In the next experiment, we used only one Bible, as shown in figure 4.11.
In the following experiment, we used only five thousand lines of parallel text, as shown in
figure 4.10.
Figure 4.10. Using different dialects, translating collection for 5k lines corpus.
Figure 4.11. Using different dialects, translating collection for one bible corpus.
Only in the last experiment do we see a different behavior, where both dialects perform
almost equally.
In the next set of experiments, we used query translation.
[Precision-recall plot: Quechua Cuzco vs. Quechua Bolivia]
Figure 4.12. Using different dialects, translating collection for four bibles corpus.
[Precision-recall plot: Quechua Cuzco vs. Quechua Bolivia]
Figure 4.13. Using different dialects, translating query for 5k lines corpus.
In figures 4.13, 4.14 and 4.15 the results do not show a significant difference between the two dialects. We believe this is because keywords are similar in both dialects, and the difference between them becomes less relevant when a smaller amount of text is translated, as is the case in query translation.
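The precision–recall curves in these figures are presumably plotted at standard recall levels; a common way to compute interpolated precision at a recall level is to take the maximum precision at any recall at or above it. A sketch of that convention (not the thesis' actual evaluation code):

```python
def interpolated_precision(points, recall_level):
    """Max precision over all (recall, precision) pairs with recall >= recall_level."""
    candidates = [p for r, p in points if r >= recall_level]
    return max(candidates) if candidates else 0.0

# (recall, precision) pairs from a hypothetical ranked result list
pr = [(0.2, 0.50), (0.4, 0.40), (0.6, 0.45), (0.8, 0.30)]
print(interpolated_precision(pr, 0.5))  # -> 0.45
```

Interpolation makes the plotted curve monotonically non-increasing, which matches the shape of standard precision–recall figures.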
[Precision-recall plot: Quechua Cuzco vs. Quechua Bolivia]
Figure 4.14. Using different dialects, translating query for one bible corpus.
[Precision-recall plot: Quechua Cuzco vs. Quechua Bolivia]
Figure 4.15. Using different dialects, translating query for four bibles corpus.
Table 4.6. F-measure average using different dialects on different sizes of corpus, with collection translation.

Corpus        Cuzco dialect   Bolivian dialect   δ
5k            0.1804          0.1810             0.2990
one bible     0.2036          0.1750             16.3466
four bibles   0.2009          0.1809             11.0547
Table 4.7. F-measure average using different dialects on different sizes of corpus, with query translation.

Corpus        Cuzco dialect   Bolivian dialect   δ
5k            0.1783          0.1741             2.4258
one bible     0.1816          0.1750             3.7862
four bibles   0.1729          0.1809             4.4234
Table 4.6 clearly shows an improvement when the correct dialect is used for collection translation. Table 4.7 shows no noticeable difference between dialects when the query is translated.
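In Tables 4.6 and 4.7, δ instead looks like the relative difference between the two averaged F-measures, in percent; this interpretation is our reconstruction, not stated in the thesis. A sketch under that assumption:

```python
def average(values):
    """Mean of a list of per-recall-point F-measures."""
    return sum(values) / len(values)

def relative_delta(f_a: float, f_b: float) -> float:
    """Relative difference between two average F-measures, in percent."""
    return abs(f_a - f_b) / f_b * 100.0

# Four-bible row of Table 4.6: 0.2009 (Cuzco) vs 0.1809 (Bolivian)
print(round(relative_delta(0.2009, 0.1809), 2))  # close to the 11.0547 reported
```

Note this differs from the δ of Table 4.5, which reads as an absolute difference in percentage points rather than a relative one.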
4.2.5. Experiment 5
Analyze the effect of removing punctuation
Quechua Version used: Cuzco Version
During the previous experiments, we noticed that many words were aligned with punctuation marks. In this experiment, we wanted to investigate whether removing punctuation would affect the quality of the translation in any way. For the experiment, we removed all punctuation marks.
In figures 4.16 and 4.17 we translated the collection. The results are nearly identical, with no measurable difference between them.
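Removing punctuation from the parallel text before alignment can be sketched as follows; the thesis only says "all punctuation signs" were removed, so treating every character that is neither a word character nor whitespace as punctuation is our assumption:

```python
import re

def strip_punctuation(line: str) -> str:
    """Replace non-word, non-space characters with spaces, then renormalize spacing.
    The character class is an assumption; the thesis does not list the exact symbols."""
    return " ".join(re.sub(r"[^\w\s]", " ", line).split())

print(strip_punctuation("Jesus wept."))           # -> "Jesus wept"
print(strip_punctuation("¿Imaynalla kashanki?"))  # -> "Imaynalla kashanki"
```

A character-class approach like this also catches Spanish-style inverted marks (¡ ¿), which occur in the Quechua Bibles' orthography.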
[Precision-recall plot: with symbols vs. without symbols, four bibles]
Figure 4.16. Removing punctuation, translating the collection.
For the next set, we translated the query. Figures 4.18 and 4.19 show a clear difference between the results obtained with and without punctuation.
Figures 4.16, 4.17, 4.18 and 4.19 show that while removing the punctuation didn’t improve
the results for collection translation, it clearly improved the results for query translation.
[Precision-recall plot: with symbols vs. without symbols, one bible]
Figure 4.17. Removing punctuation, translating the collection.
[Precision-recall plot: with symbols vs. without symbols, four bibles]
Figure 4.18. Removing punctuation, translating the query.
[Precision-recall plot: with symbols vs. without symbols, one bible]
Figure 4.19. Removing punctuation, translating the query.
CHAPTER 5
DISCUSSION AND CONCLUSIONS
In experiment 1, the results clearly showed that increasing the size of the parallel corpus improves precision.
Given the relative sizes of the corpora, the increase in precision from 25k lines to the complete Bible, at 27k lines, was significant. This may be an effect of words relevant to the experiment appearing in the last books of the Bible.
In experiment 2 we saw that using paraphrasing to construct a bigger corpus increases precision for information retrieval purposes. The increase in precision was consistent with the amount of parallel data used.
In experiment 3 we contrasted translating the query against translating the collection of documents. Translating the collection gave higher precision in all cases. To see whether this was due to the quality of the parallel data, we contrasted query and collection translation when a different Quechua dialect was used. In that case, both approaches returned comparatively similar results, indicating that the choice between query and collection translation matters when the quality of the aligned data is higher.
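The two setups being contrasted can be sketched abstractly. Here `translate`, `index`, and `search` are hypothetical placeholders standing in for the SMT system and the IR engine, not interfaces from the thesis:

```python
def collection_translation(docs_quechua, query_english, translate, index, search):
    """Translate every document into English, then search in English."""
    translated = [translate(d, source="quechua", target="english") for d in docs_quechua]
    return search(index(translated), query_english)

def query_translation(docs_quechua, query_english, translate, index, search):
    """Translate only the query into Quechua, then search in Quechua."""
    query_quechua = translate(query_english, source="english", target="quechua")
    return search(index(docs_quechua), query_quechua)

# Toy stand-ins so the sketch runs; a real system would plug in SMT + an IR engine.
def translate(text, source, target):
    return text  # identity "translation", for illustration only

def index(docs):
    return docs

def search(idx, query):
    return [d for d in idx if query in d]

docs = ["inti raymi festival", "harvest season"]
print(collection_translation(docs, "festival", translate, index, search))
```

The asymmetry the thesis observes comes from where translation errors land: collection translation amortizes them over many documents, while query translation concentrates them in a few keywords.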
In experiment 4 we analyzed the translation of the text using a different dialect from the
one used originally. We translated the collection, and later the query.
For collection translation, using the same dialect gave better precision than using a different dialect in all cases, and we found the same difference across different amounts of parallel data.
One interesting result was the translation of the query using different dialects. Both
performed relatively similarly, under different amounts of parallel data.
In experiment 5 we showed that removing punctuation symbols had a noticeable impact on query translation. For collection translation, there was almost no difference in precision.
5.1. Contributions
For this work, we had to develop all the NLP resources for MT for Quechua Cuzco and Quechua Bolivia from the beginning.
We aligned this text with the English Bibles, and finally designed the experiments to measure the effect of different variables when performing IR.
We designed the experiments in order to compare different variables and see how they
affect the results of SMT for IR purposes.
5.2. Conclusions
Based on these results, we draw the following conclusions.
(i) How much does the amount of parallel data affect the precision of information
retrieval?
Every addition of parallel text increased the precision of the results. The best results corresponded to the parallel corpus with the most lines, with a significant difference over the other sets.
(ii) Is it possible to increase the precision of the information retrieval results using paraphrasing? We showed that paraphrasing had a direct impact on the results, increasing the precision. The best performing system was the one trained on the corpus with 4 Bibles.
(iii) Is there a significant difference if we translate the query or the collection of documents? How is this affected by the other factors? Translating the collection performed better. When translating the query, we noticed that removing punctuation increased precision, whereas it made no impact for collection translation.
(iv) Is it possible to use information retrieval across different dialects of the same language? How much is precision affected by this? While the results were inferior for the different dialect, they also lead us to believe that it is possible to use the same system for a different dialect. Collection and query translation had similar results in this case.
APPENDIX
TIME COLLECTION FILES
Documents in the Time collection are identified by a number, which is associated with the query relevance judgments. These numbers range from 17 to 563.
In our experiment, we identified the documents using a consecutive number, which maps to the corresponding Time collection document.
Columns labeled ‘E’ contain the document number used in this experiment, while columns labeled ‘C’ contain the document number assigned in the Time collection.
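This mapping can be represented as a simple lookup table; the pairs below are taken from the first rows of the appendix table, and the names `E_TO_C`/`C_TO_E` are ours:

```python
# Experiment document number ('E') -> Time collection document number ('C').
# Only the first few pairs from the appendix table are shown here.
E_TO_C = {1: 17, 2: 18, 3: 19, 4: 20, 5: 22}

# Inverse mapping, for going from a Time collection id back to the experiment id.
C_TO_E = {c: e for e, c in E_TO_C.items()}

print(E_TO_C[5])   # -> 22
print(C_TO_E[22])  # -> 5
```

The mapping is needed because some Time documents (e.g. 21) have no relevance judgments and were skipped, so the experiment numbering is dense while the collection numbering has gaps.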
E C E C E C E C E C E C E C E C E C1 17 51 83 101 143 151 199 201 259 251 312 301 387 351 471 401 5372 18 52 84 102 144 152 200 202 260 252 314 302 388 352 472 402 5383 19 53 85 103 145 153 201 203 261 253 316 303 389 353 474 403 5394 20 54 86 104 146 154 202 204 262 254 317 304 390 354 475 404 5405 22 55 87 105 147 155 203 205 263 255 318 305 391 355 476 405 5416 23 56 89 106 148 156 212 206 264 256 319 306 393 356 477 406 5427 24 57 90 107 149 157 213 207 265 257 320 307 395 357 478 407 5438 25 58 91 108 150 158 214 208 266 258 321 308 397 358 479 408 5449 26 59 92 109 151 159 216 209 267 259 322 309 398 359 484 409 54510 27 60 93 110 152 160 217 210 268 260 323 310 399 360 486 410 54711 28 61 94 111 153 161 218 211 269 261 325 311 400 361 489 411 54812 29 62 95 112 154 162 219 212 271 262 328 312 401 362 490 412 54913 30 63 96 113 155 163 220 213 272 263 329 313 402 363 491 413 55014 31 64 97 114 156 164 221 214 273 264 330 314 403 364 492 414 55115 32 65 98 115 157 165 222 215 274 265 331 315 404 365 493 415 55216 33 66 99 116 158 166 223 216 275 266 332 316 405 366 494 416 55417 34 67 100 117 159 167 224 217 276 267 333 317 406 367 495 417 55518 35 68 101 118 160 168 225 218 277 268 334 318 407 368 496 418 55619 39 69 103 119 161 169 226 219 278 269 335 319 410 369 497 419 55720 41 70 104 120 162 170 227 220 279 270 336 320 411 370 500 420 55821 42 71 105 121 169 171 228 221 280 271 340 321 412 371 501 421 55922 44 72 106 122 170 172 229 222 281 272 341 322 413 372 502 422 56023 46 73 107 123 171 173 230 223 282 273 344 323 414 373 503 423 56124 47 74 108 124 172 174 231 224 283 274 345 324 416 374 504 424 56225 48 75 109 125 173 175 233 225 284 275 346 325 417 375 505 425 56326 49 76 110 126 174 176 234 226 285 276 347 326 421 376 50627 50 77 111 127 175 177 235 227 286 277 349 327 423 377 50728 51 78 112 128 176 178 236 228 287 278 350 328 424 378 50829 52 79 114 129 177 179 237 229 288 279 352 329 425 379 50930 53 80 115 130 178 180 238 230 291 280 353 330 
426 380 51131 54 81 116 131 179 181 239 231 292 281 354 331 429 381 51232 56 82 117 132 180 182 240 232 293 282 355 332 430 382 51333 57 83 118 133 181 183 241 233 294 283 356 333 433 383 51434 58 84 119 134 182 184 242 234 295 284 357 334 435 384 51635 59 85 120 135 183 185 243 235 296 285 358 335 436 385 51836 60 86 121 136 184 186 244 236 297 286 360 336 437 386 51937 61 87 122 137 185 187 245 237 298 287 362 337 441 387 52138 62 88 125 138 186 188 246 238 299 288 363 338 442 388 52239 63 89 127 139 187 189 247 239 300 289 364 339 443 389 52340 64 90 128 140 188 190 248 240 301 290 366 340 444 390 52441 65 91 129 141 189 191 249 241 302 291 367 341 445 391 52542 66 92 130 142 190 192 250 242 303 292 368 342 448 392 52643 67 93 132 143 191 193 251 243 304 293 369 343 458 393 52744 68 94 133 144 192 194 252 244 305 294 379 344 459 394 52845 69 95 134 145 193 195 253 245 306 295 380 345 460 395 52946 70 96 135 146 194 196 254 246 307 296 381 346 461 396 53247 71 97 136 147 195 197 255 247 308 297 382 347 462 397 53348 80 98 137 148 196 198 256 248 309 298 383 348 463 398 53449 81 99 139 149 197 199 257 249 310 299 384 349 469 399 53550 82 100 142 150 198 200 258 250 311 300 385 350 470 400 536
BIBLIOGRAPHY
[1] Yaser Al-Onaizan, Ulrich Germann, Ulf Hermjakob, Kevin Knight, Philipp Koehn, Daniel
Marcu, and Kenji Yamada, Translating with scarce resources, Proceedings of the Sev-
enteenth National Conference on Artificial Intelligence and Twelfth Conference on In-
novative Applications of Artificial Intelligence, AAAI Press / The MIT Press, 2000,
pp. 672–678.
[2] Ricardo Baeza-Yates and Berthier Ribeiro-Neto, Modern information retrieval, Addison-
Wesley, 1999.
[3] Lisa Ballesteros and W. Bruce Croft, Dictionary methods for cross-lingual information
retrieval, DEXA (Roland Wagner and Helmut Thoma, eds.), Lecture Notes in Computer
Science, vol. 1134, Springer, 1996, pp. 791–801.
[4] P. F. Brown, J. Cocke, S. A. Della Pietra, F. Jelinek, J. D. Lafferty, R. L. Mercer, and
P. S. Roossin, A statistical approach to machine translation, Computational Linguistics
16 (1990), no. 2, 79–85.
[5] Christian Monson, Ariadna Font Llitjós, Roberto Aranovich, Lori Levin, Ralf Brown, Eric Peterson, Jaime Carbonell, and Alon Lavie, Building NLP systems for two resource-scarce indigenous languages: Mapudungun and Quechua, The International Conference on Language Resources and Evaluation (2006).
[6] S. Dasgupta and V. Ng, Unsupervised part-of-speech acquisition for resource-scarce
languages, Proceedings of the EMNLP-CoNLL, 2007, pp. 218–227.
[7] F. C. Gey and D. W. Oard, The TREC-2001 cross-language information retrieval track: Searching arabic using english, french or arabic queries, Text REtrieval Conference (TREC) TREC 2001 Proceedings, Department of Commerce, National Institute of Standards and Technology, 2001, NIST Special Publication 500-250: The Tenth Text REtrieval Conference (TREC 2001), pp. 16–??
[8] Raymond G. Gordon, Jr., Ethnologue: Languages of the world, SIL International, 2005.
[9] Joel Martin, Rada Mihalcea, and Ted Pedersen, Word alignment for languages with scarce resources, 2005.
[10] Philipp Koehn, Franz Josef Och, and Daniel Marcu, Statistical phrase-based translation,
HLT-NAACL, 2003.
[11] A. Lopez and P. Resnik, Improved HMM alignment models for languages with scarce
resources, Proceedings of the ACL Workshop on Building and Using Parallel Texts, 2005,
pp. 83–86.
[12] Rada Mihalcea and Ted Pedersen, An evaluation exercise for word alignment, June 21
2003.
[13] Douglas W. Oard and Fredric C. Gey, The TREC 2002 arabic/english CLIR track,
TREC, 2002.
[14] Franz Josef Och, An efficient method for determining bilingual word classes, EACL,
1999, pp. 71–76.
[15] Franz Josef Och and Hermann Ney, A systematic comparison of various statistical align-
ment models, Computational Linguistics 29 (2003), no. 1, 19–51.
[16] R. Kneser and H. Ney, Forming word classes by statistical clustering for statistical language modelling, Quantitative Linguistics Conference 1 (1991).
[17] Philip Resnik, Mari Broman Olsen, and Mona Diab, Creating a parallel corpus from the
”book of 2000 tongues”, November 17 1997.
[18] G. Salton and C. Buckley, Term-weighting approaches in automatic text retrieval, Infor-
mation Processing and Management 24 (1988), no. 5, 513–523.
[19] United Bible Society, Statistical summary of languages with the scriptures, 2008.
[20] Stephanie Strassel, Mike Maxwell, and Christopher Cieri, Linguistic resource creation
for research and technology development: A recent experiment, ACM Transactions on
Asian Language Information Processing 2 (2003), no. 2, 101–117.
[21] Stephen Wurm, Atlas of the world’s languages in danger of disappearing, 2 ed., The
United Nations Educational, Scientific and Cultural Organization (UNESCO), 2001.