FaDA: Fast document aligner with
word embedding
Pintu Lohar, Debasis Ganguly, Haithem Afli,
Andy Way and Gareth F. Jones
ADAPT Centre, School of Computing, Dublin
City University
Contents
• Objective
• Introduction to FaDA
• Methodology used
• Word vector-based similarity
• Architecture of the whole system
• Experiments
• Results
• Conclusions and future work
Objective
• To align documents in two different
languages within a large collection of
comparable documents.
• The alignment procedure should run in less
than quadratic time.
Example of comparable documents
• The same news published in two languages
Introduction to FaDA
• FaDA (Fast Document Aligner) is a free/open-source
tool for aligning bilingual documents.
• It is a fast alignment tool with linear time
complexity.
Methodology used
• Crosslingual information retrieval (CLIR)-
based document-alignment system with word
vector-based similarity measurements.
Why word vector-based similarity?
• The CLIR-based approach considers only surface
text overlap and does not capture the
underlying semantic match between words.
• The word vector-based approach captures the
semantic similarity between words.
Word vectors
• Example:
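The example figure from this slide is not reproduced in the transcript. As a minimal sketch of the idea, with made-up 2-D vectors rather than the embeddings FaDA actually uses, semantically related words sit close together, and cosine similarity captures that:

```python
import math

# Toy 2-D word vectors (illustrative values only, not real embeddings).
vectors = {
    "king":  (0.9, 0.8),
    "queen": (0.85, 0.82),
    "apple": (0.1, 0.9),
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# In this toy space, "king" is closer to "queen" than to "apple".
sim_kq = cosine(vectors["king"], vectors["queen"])
sim_ka = cosine(vectors["king"], vectors["apple"])
```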
Word vector-based similarity
• Query likelihood is used to score documents against the query.
• In the accompanying figure: q1, q2, q3 are the query terms, and the dots are the words of a document in 2-D space. The centroid of the document in Figure (a) is closer to the query terms than that of the document in Figure (b).
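The figure itself is lost in this transcript, but the centroid idea it illustrates can be sketched as follows (toy 2-D points with an assumed layout, not FaDA's actual data): the document whose word-vector centroid lies nearer the query terms is the better match.

```python
import math

def centroid(points):
    """Arithmetic mean of a list of 2-D word vectors."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)

def distance(u, v):
    """Euclidean distance between two 2-D points."""
    return math.hypot(u[0] - v[0], u[1] - v[1])

# Query terms q1, q2, q3 as 2-D word vectors (toy values).
query_terms = [(1.0, 1.0), (1.1, 0.9), (0.9, 1.1)]
query_centroid = centroid(query_terms)

# Words of two documents in the same 2-D space.
doc_a = [(0.9, 1.0), (1.0, 1.2), (1.2, 0.8)]   # clusters near the query
doc_b = [(3.0, 3.2), (2.8, 3.1), (3.1, 2.9)]   # far from the query

# Document (a)'s centroid is closer to the query terms, as in Figure (a).
dist_a = distance(centroid(doc_a), query_centroid)
dist_b = distance(centroid(doc_b), query_centroid)
```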
Combination of word vector-based
and text-based similarity
• The overall score is a linear interpolation of the two similarities, score = α · sim_text + (1 − α) · sim_vec, where α is the interpolation parameter denoting the relative contributions of the text-overlap and word vector-based similarities.
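The interpolation described above can be sketched as follows (the function name and the example scores are illustrative, not from the paper's code):

```python
def combined_score(text_sim, vec_sim, alpha):
    """Linear interpolation of text-overlap and word-vector similarity.

    alpha weights the text-based component; (1 - alpha) weights the
    word vector-based component.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * text_sim + (1.0 - alpha) * vec_sim

# alpha = 1.0 reduces to pure text similarity; alpha = 0.0 to pure
# word-vector similarity.
score = combined_score(text_sim=0.4, vec_sim=0.8, alpha=0.5)  # ≈ 0.6
```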
System architecture of FaDA
1. The bilingual collection is split into source documents and target documents, and each side is indexed separately (source index and target index).
2. A pseudo-query is built from each source document.
3. The query terms are translated using a bilingual dictionary.
4. The translated query is compared against the target index, and the top-n target documents are retrieved.
5. Word-vector and text similarity are combined, and the target document with the best score is selected as the retrieved alignment.
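A highly simplified sketch of this pipeline (the documents, the word-for-word dictionary, and the overlap scorer are toy placeholders; the real system uses proper indexes, query-likelihood retrieval, and trained embeddings):

```python
# Toy source/target collections and a word-for-word bilingual dictionary.
source_docs = {"s1": ["chien", "noir"], "s2": ["chat", "blanc"]}
target_docs = {"t1": ["black", "dog"], "t2": ["white", "cat"]}
dictionary = {"chien": "dog", "noir": "black", "chat": "cat", "blanc": "white"}

def pseudo_query(doc_terms):
    """Step 2: use the document's own terms as a pseudo-query."""
    return doc_terms

def translate(terms):
    """Step 3: translate query terms with the bilingual dictionary."""
    return [dictionary[t] for t in terms if t in dictionary]

def text_similarity(query, doc_terms):
    """Step 4: crude overlap score standing in for index retrieval."""
    return len(set(query) & set(doc_terms)) / len(set(query) | set(doc_terms))

def align(source_id):
    """Steps 2-5: pick the best-scoring target document for one source."""
    query = translate(pseudo_query(source_docs[source_id]))
    return max(target_docs, key=lambda t: text_similarity(query, target_docs[t]))

alignment = {s: align(s) for s in source_docs}  # {'s1': 't1', 's2': 't2'}
```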
Experiments
• Dataset
Baseline
• Based on the Jaccard similarity coefficient, which measures the term overlap between
document pairs.
• Cosine similarity-based and named-entity
matching-based approaches did not
work well, and hence were not used as baselines.
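The baseline's Jaccard coefficient can be sketched as follows (whitespace tokenization is an assumption, not the paper's exact preprocessing):

```python
def jaccard(doc_a, doc_b):
    """Jaccard similarity coefficient: |A ∩ B| / |A ∪ B| over term sets."""
    a, b = set(doc_a.split()), set(doc_b.split())
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

# Two shared terms ("the", "black") out of six distinct terms overall.
sim = jaccard("the black dog barks", "the black cat sleeps")  # ≈ 0.333
```

Scoring every source document against every target document this way is what gives the baseline its quadratic time complexity.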
Tuning (Euronews data):
Optimal parameter settings:
i. λ = 0.9
ii. number of translation terms M = 7
iii. query-to-document ratio τ = 0.6
Results on WMT test data:
Conclusions and future work
• FaDA uses a CLIR-based approach that is much faster
than the baseline, which has quadratic time complexity.
• Performance is further enhanced by the word
vector embedding-based approach.
• In future work, we would like to apply our approach to
other language pairs.
Thank you
Questions?
and/or
Suggestions!