Yuliya Morozova Institute for Informatics Problems of the Russian Academy of Sciences, Moscow.

EXTRACTION OF TRANSLATION CORRESPONDENCES FROM A PARALLEL CORPUS USING METHODS OF DISTRIBUTIONAL SEMANTICS Yuliya Morozova Institute for Informatics Problems of the Russian Academy of Sciences, Moscow


Page 1:

EXTRACTION OF TRANSLATION CORRESPONDENCES FROM A PARALLEL CORPUS USING METHODS OF DISTRIBUTIONAL SEMANTICS

Yuliya Morozova
Institute for Informatics Problems of the Russian Academy of Sciences, Moscow

Page 2:

Distributional semantics

a new area of linguistic research
inferring semantic properties of linguistic units from corpora

Theoretical foundations: distributional methodology by Z. Harris, F. de Saussure, L. Wittgenstein.

Distributional hypothesis: semantically similar words occur in similar contexts.

J. R. Firth: “You shall know a word by the company it keeps”.

Page 3:

Vector space

drink coffee – occurred 1 time
drink tea – occurred 2 times
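One possible reading of these counts is that the word "drink" becomes a vector whose dimensions are its context words. The sketch below assumes a toy list of (word, context) pairs invented to match the counts on the slide; the variable names are hypothetical.

```python
from collections import Counter

# Toy observations chosen to reproduce the slide's counts (illustrative only).
pairs = [("drink", "coffee"), ("drink", "tea"), ("drink", "tea")]

# Count how often each (word, context) pair was observed.
counts = Counter(pairs)

# Each distinct context word is one dimension of the vector space.
dimensions = sorted({context for _, context in pairs})  # ['coffee', 'tea']
words = sorted({word for word, _ in pairs})

# One count vector per word, ordered by `dimensions`.
vectors = {w: [counts[(w, c)] for c in dimensions] for w in words}

print(dimensions, vectors)
```

Here "drink" is placed at the point (1, 2) in a two-dimensional space spanned by "coffee" and "tea".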

Page 4:

Cosine measure of vector similarity

$$
\cos(x, y) = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^{2}} \, \sqrt{\sum_{i=1}^{n} y_i^{2}}}
$$
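The cosine measure can be sketched in a few lines of Python. The function name `cosine` and the convention of returning 0.0 for an all-zero vector are assumptions made for this illustration, not something stated on the slide.

```python
import math


def cosine(x, y):
    """Cosine of the angle between two equal-length numeric vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # assumed convention: a zero vector is similar to nothing
    return dot / (norm_x * norm_y)


print(cosine([1, 2], [2, 4]))  # parallel vectors
print(cosine([1, 0], [0, 1]))  # orthogonal vectors
```

Parallel vectors give a value of 1 (up to rounding) and vectors with no shared non-zero dimension give 0.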

Page 5:

Main application areas

lexical ambiguity resolution
information retrieval
dictionaries of semantic relations
multilingual dictionaries
semantic maps of different domains
modelling of synonymy
document topic detection
sentiment analysis

Page 6:

The present research

Goal: to apply distributional semantics models to extraction of translation correspondences from a parallel corpus.

Vector space model + test corpus

Page 7:

Test corpus

Patent texts in French translated into Russian
Texts split into sentences
Alignment at the sentence level – manually verified (in the visual editor MakeBilingua)
Uploaded to the Sketch Engine corpus manager

Page 8:

Preprocessing

Lemmatization
Frequent words removed (prepositions, conjunctions, etc.)
Punctuation marks removed
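The last two preprocessing steps might look roughly like the sketch below. It assumes the input has already been lemmatized by an external tool, and the stop-word list `STOPWORDS` is a small hypothetical sample, not the list used in the study.

```python
import string

# Hypothetical stop-word sample (French and Russian function words).
STOPWORDS = {"le", "la", "de", "et", "à", "в", "и", "на"}

# ASCII punctuation plus a few marks common in French and Russian text.
PUNCTUATION = set(string.punctuation) | {"«", "»", "…"}


def preprocess(lemmas, stopwords=STOPWORDS):
    """Drop stop words and punctuation-only tokens from a list of lemmas."""
    kept = []
    for lemma in lemmas:
        token = lemma.lower()
        if token in stopwords:
            continue
        if all(ch in PUNCTUATION for ch in token):
            continue  # also discards empty tokens
        kept.append(token)
    return kept


print(preprocess(["le", "présent", "invention", ",", "et", "liant", "."]))
```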

Page 9:

Vector space model

type of linguistic units: single words;
type of context: aligned regions;
frequency measure: Boolean frequency (equal either to 1 or 0);
method used to compute the distance between vectors: cosine measure.
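A minimal sketch of such a model, under stated assumptions: each word is represented by the set of aligned regions in which it occurs, which is equivalent to a Boolean vector, so the cosine reduces to |A ∩ B| / sqrt(|A| · |B|). The first region uses lemmas from the presentation's own example; regions 2 and 3 are invented for illustration, and all function names are hypothetical.

```python
import math
from collections import defaultdict

# (French lemmas, Russian lemmas) per aligned region.
regions = [
    (["présent", "invention", "concerner", "liant", "minéral",
      "notamment", "hydraulique"],
     ["настоящий", "изобретение", "касаться", "неорганический",
      "связующий", "частность", "гидравлический", "связующий"]),
    (["invention", "concerner", "procédé"],        # invented region
     ["изобретение", "касаться", "способ"]),
    (["liant", "hydraulique", "ciment"],           # invented region
     ["гидравлический", "связующий", "цемент"]),
]


def occurrence_sets(regions, side):
    """Map each lemma to the set of region indices where it occurs."""
    occ = defaultdict(set)
    for idx, region in enumerate(regions):
        for lemma in region[side]:
            occ[lemma].add(idx)
    return occ


def boolean_cosine(a, b):
    """Cosine between two Boolean vectors given as sets of indices."""
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))


fr = occurrence_sets(regions, 0)
ru = occurrence_sets(regions, 1)


def best_match(word):
    """Russian lemma whose occurrence pattern is closest to `word`."""
    return max(ru, key=lambda cand: boolean_cosine(fr[word], ru[cand]))


print(best_match("procédé"), best_match("ciment"))
```

With so few regions many candidates tie (for example, "invention" is equally close to "изобретение" and "касаться"), which is why a realistic corpus is needed before the ranking becomes informative.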

Page 10:

Example (aligned region as a context)

Aligned region #1

présent invention concerner liant minéral notamment hydraulique

настоящий изобретение касаться неорганический связующий частность гидравлический связующий

Page 11:

Example (vector space)

               Aligned region
               #1    #2    #3
présent        1     …     …
invention      1     …     …
concerner      1     …     …
настоящий      1     …     …
изобретение    1     …     …
касаться       1     …     …

Page 12:

Results

A list of translation correspondences.

Linguistic filter: the same part of speech.

Precision: 78%.
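One reading of the filter and the precision figure is sketched below: candidate pairs are kept only when both words carry the same part-of-speech tag, and precision is the share of kept pairs judged correct. The candidate list, the tags, and the gold set are invented for illustration and do not reproduce the study's data.

```python
# (French lemma, French POS, Russian lemma, Russian POS); illustrative only.
candidates = [
    ("invention", "NOUN", "изобретение", "NOUN"),
    ("concerner", "VERB", "касаться", "VERB"),
    ("traiter", "VERB", "обработка", "NOUN"),
    ("moins", "ADV", "мера", "NOUN"),
    ("liant", "NOUN", "изобретение", "NOUN"),  # deliberately wrong pair
]

# Hypothetical manual judgements of which pairs are correct translations.
gold = {("invention", "изобретение"), ("concerner", "касаться")}


def same_pos_only(cands):
    """Keep candidate pairs whose two words share a part-of-speech tag."""
    return [c for c in cands if c[1] == c[3]]


def precision(pairs, gold):
    """Fraction of proposed pairs that appear in the gold set."""
    if not pairs:
        return 0.0
    correct = sum(1 for fr, _, ru, _ in pairs if (fr, ru) in gold)
    return correct / len(pairs)


kept = same_pos_only(candidates)
print(len(kept), precision(kept, gold))
```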

Page 13:

Correspondences with different POS

Syntactic transformations

verbal infinitive (French) → noun (Russian)
traiter (“to process”) → обработка (“processing”)

noun (French) → adjective (Russian)
crochet (“hook”) → крюкообразный (“hook-shaped”)

verbal infinitive (French) → adjective (Russian)
connaître (“to know”) → известный (“well-known”)

Page 14:

Correspondences with different POS

Parts of multi-word expressions

au moins (“at least”) → по меньшей мере (“at least”)

The output of the program: moins → мера

Page 15:

Evaluation

Eduardo Cendejas, Grettel Barceló, Alexander Gelbukh, Grigori Sidorov. Incorporating Linguistic Information to Statistical Word-Level Alignment // Proceedings of the 14th Iberoamerican Conference on Pattern Recognition, CIARP 2009, Guadalajara, Jalisco, Mexico, November 15-18, 2009.

Vector space model + similarity measures: PMI, T-score, log-likelihood ratio and Dice coefficient.

Precision – 53%.
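For orientation, two of the measures named here can be sketched over Boolean occurrence sets, using their textbook definitions. Whether the cited work used exactly these forms is an assumption; the function names and the example sets are hypothetical.

```python
import math


def dice(a, b):
    """Dice coefficient: 2|A ∩ B| / (|A| + |B|) for two sets of region indices."""
    if not a and not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))


def pmi(a, b, n_regions):
    """Pointwise mutual information, log2(p(x, y) / (p(x) p(y))),
    with probabilities estimated as relative frequencies over regions."""
    joint = len(a & b)
    if joint == 0:
        return float("-inf")  # the two words never co-occur
    return math.log2(n_regions * joint / (len(a) * len(b)))


# Two words that each occur in two of four regions and share one region.
print(dice({0, 1}, {1, 2}), pmi({0, 1}, {1, 2}, 4))
```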

Page 16:

Conclusion

Distributional semantics methodology can be used to extract translation correspondences from a parallel corpus with a high level of precision.

It can be used to study productive syntactic transformations occurring in translation.

The present vector space model needs to be enhanced to take into account multi-word expressions.

Page 17:

Thank you!