LREC'2008 translation universals
![Page 1: LREC'2008 translation universals](https://reader033.fdocuments.us/reader033/viewer/2022061607/55d2979abb61eb5a398b4575/html5/thumbnails/1.jpg)
Translation universals: do they exist?
A corpus-based and NLP approach to convergence
Gloria Corpas*, Ruslan Mitkov**, Naveed Afzal***, Lisette Garcia Moya***
* University of Malaga
** University of Wolverhampton
*** Centre for Pattern Recognition and Data Mining, Santiago de Cuba
![Page 2: LREC'2008 translation universals](https://reader033.fdocuments.us/reader033/viewer/2022061607/55d2979abb61eb5a398b4575/html5/thumbnails/2.jpg)
Translation universals (Baker 1993, 1996; Toury 1995)
Translated texts tend to be simpler than non-translated, original texts (simplification)
Translated texts tend to be more explicit than non-translated texts (explicitation)
Translated texts tend to be more similar than non-translated texts (convergence)
![Page 3: LREC'2008 translation universals](https://reader033.fdocuments.us/reader033/viewer/2022061607/55d2979abb61eb5a398b4575/html5/thumbnails/3.jpg)
Previous research on translation universals
Formulation and initial explanation have been based on intuition and introspection
Follow-up corpus research has been limited to comparatively small corpora, literary or newswire texts, and semi-manual analysis
Insufficient guidance as to which features should account for these universals if they are to be regarded as valid
![Page 4: LREC'2008 translation universals](https://reader033.fdocuments.us/reader033/viewer/2022061607/55d2979abb61eb5a398b4575/html5/thumbnails/4.jpg)
Objective of this study
To test the validity of convergence (translated texts tend to be more similar than non-translated texts)
Test (target) language: Spanish
![Page 5: LREC'2008 translation universals](https://reader033.fdocuments.us/reader033/viewer/2022061607/55d2979abb61eb5a398b4575/html5/thumbnails/5.jpg)
General methodology
NLP techniques applied to corpora of translated Spanish and to comparable corpora of non-translated (original) Spanish
Similarity computed between every pair of corpora of translated texts and between every pair of corpora of original texts
Similarity measured in terms of both style and syntax
![Page 6: LREC'2008 translation universals](https://reader033.fdocuments.us/reader033/viewer/2022061607/55d2979abb61eb5a398b4575/html5/thumbnails/6.jpg)
Corpora used
Corpus of Medical Spanish Translations by Professionals (MSTP: 1,058,122)
Corpus of Medical Spanish Translations by Students (MSTS: 1,058,122)
Corpus of Technical Spanish Translations (TST: 1,736,027)
Corpus of Original Medical Spanish Comparable to Translations by Professionals (MSTPC: 1,402,172)
Corpus of Original Medical Spanish Comparable to Translations by Students (MSTSC: 1,164,435)
Corpus of Original Technical Spanish Comparable to Technical Translations (TSTC: 1,986,651)
![Page 7: LREC'2008 translation universals](https://reader033.fdocuments.us/reader033/viewer/2022061607/55d2979abb61eb5a398b4575/html5/thumbnails/7.jpg)
Comparability of corpora
Comparability in terms of
(i) Text types and forms
(ii) Domains and sub-domains
(iii) Level of specialisation and formality
(iv) Diatopic restrictions (Peninsular Spanish)
(v) Time span (2005-2008)
(vi) Similar size
![Page 8: LREC'2008 translation universals](https://reader033.fdocuments.us/reader033/viewer/2022061607/55d2979abb61eb5a398b4575/html5/thumbnails/8.jpg)
Comparability of corpora (2)
Comparability of corpora is mathematically an equivalence relation
Reflexivity: a corpus is comparable to itself (C ~ C)
Symmetry: if a corpus C is comparable to corpus D, then D is comparable to C (if C ~ D, then D ~ C)
Transitivity: if a corpus C is comparable to corpus D and corpus D is comparable to corpus E, then C is comparable to E (if C ~ D and D ~ E, then C ~ E)
Therefore, comparability splits the universe of all corpora into disjoint classes of corpora.
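As an illustration of the last point, here is a minimal Python sketch of how an equivalence relation partitions a set into disjoint classes. It is not taken from the slides: the function name `equivalence_classes` is hypothetical, and the input pairs simply follow the corpus names listed earlier, where each original corpus is described as comparable to one translated corpus.

```python
from collections import defaultdict


def equivalence_classes(items, related_pairs):
    """Partition items into disjoint classes, given pairs declared comparable.

    Reflexivity, symmetry and transitivity are enforced by construction:
    every item starts in its own class, and each pair merges two classes.
    """
    parent = {item: item for item in items}

    def find(x):
        # Walk up to the class representative, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in related_pairs:
        parent[find(a)] = find(b)

    classes = defaultdict(set)
    for item in items:
        classes[find(item)].add(item)
    return sorted(sorted(members) for members in classes.values())


corpora = ["MSTP", "MSTPC", "MSTS", "MSTSC", "TST", "TSTC"]
# Assumption: only the comparability stated in the corpus names is declared.
pairs = [("MSTP", "MSTPC"), ("MSTS", "MSTSC"), ("TST", "TSTC")]
classes = equivalence_classes(corpora, pairs)
print(classes)
```

Declaring one further pair, for example MSTP ~ MSTS, would merge two classes into one, which is exactly what transitivity requires.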
![Page 9: LREC'2008 translation universals](https://reader033.fdocuments.us/reader033/viewer/2022061607/55d2979abb61eb5a398b4575/html5/thumbnails/9.jpg)
Corpus design
Translated corpus, ES (TT): MST (MSTP, MSTS) and TST
Non-translated corpus, ES (NT): MSC (MSTPC, MSTSC) and TSTC
![Page 10: LREC'2008 translation universals](https://reader033.fdocuments.us/reader033/viewer/2022061607/55d2979abb61eb5a398b4575/html5/thumbnails/10.jpg)
Specific methodology (1)
Compared:
all 3 pairs of translated texts (MSTP-MSTS; MSTS-TST; MSTP-TST)
all 3 pairs of comparable non-translated texts (MSTPC-MSTSC; MSTSC-TSTC; MSTPC-TSTC)
Premise: if the convergence universal holds, higher similarity is expected for pairs of translated texts.
![Page 11: LREC'2008 translation universals](https://reader033.fdocuments.us/reader033/viewer/2022061607/55d2979abb61eb5a398b4575/html5/thumbnails/11.jpg)
Specific methodology (2)
Texts compared on the basis of
(i) style (stylistic features)
(ii) syntax (syntactic features)
Our proposal for stylistic and syntactic features
![Page 12: LREC'2008 translation universals](https://reader033.fdocuments.us/reader033/viewer/2022061607/55d2979abb61eb5a398b4575/html5/thumbnails/12.jpg)
Style comparison: stylistic features
Lexical density: (number of types) / (total number of tokens present in corpus)
Lexical richness: (number of lemmas) / (number of tokens present in corpus)
Sentence length: (number of tokens in corpus) / (number of sentences)
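The three ratios above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the corpus is assumed to be already split into sentences and tokens, types are compared case-insensitively, and `toy_lemmas` is a hypothetical stand-in for a real Spanish lemmatizer.

```python
def stylistic_features(sentences, lemmatize):
    """Compute the three ratios for a corpus given as a list of token lists.

    `lemmatize` is any callable mapping a token to its lemma (assumption:
    a real lemmatizer would be plugged in here).
    """
    tokens = [tok for sent in sentences for tok in sent]
    if not tokens:
        raise ValueError("corpus must contain at least one token")
    types = {tok.lower() for tok in tokens}
    lemmas = {lemmatize(tok.lower()) for tok in tokens}
    return {
        "lexical_density": len(types) / len(tokens),
        "lexical_richness": len(lemmas) / len(tokens),
        "sentence_length": len(tokens) / len(sentences),
    }


# Hypothetical toy lemma table, for illustration only.
toy_lemmas = {"los": "el", "la": "el", "pacientes": "paciente",
              "toma": "tomar", "toman": "tomar"}

sample = [
    ["el", "paciente", "toma", "la", "medicina"],
    ["los", "pacientes", "toman", "la", "medicina"],
]
features = stylistic_features(sample, lambda tok: toy_lemmas.get(tok, tok))
print(features)
```

On the toy sample there are 10 tokens, 8 types and 4 lemmas over 2 sentences, so the three values are 0.8, 0.4 and 5.0.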
![Page 13: LREC'2008 translation universals](https://reader033.fdocuments.us/reader033/viewer/2022061607/55d2979abb61eb5a398b4575/html5/thumbnails/13.jpg)
Style comparison: stylistic features (2)
Simple/complex sentences
Discourse markers (Spanish)
Two statistical tests (Chi-Square test and T-test) employed
![Page 14: LREC'2008 translation universals](https://reader033.fdocuments.us/reader033/viewer/2022061607/55d2979abb61eb5a398b4575/html5/thumbnails/14.jpg)
Syntax comparison
Sequences of POS tags compared for every pair of corpora
Corpora represented as frequency vectors of 3-grams (Nerbonne and Wiersma, 2006)
Measures:
Cosine
Recurrence metrics R and Rsq (Kessler, 2001)
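The following Python sketch shows one way such a comparison could be computed. It rests on several assumptions that the slides do not confirm: raw (unnormalised) 3-gram counts are used, the cosine measure is reported as a distance 1 - C, and R and Rsq follow the formulation usually given with these metrics, namely the sum over all 3-grams of |c_i - m_i| and of (c_i - m_i)^2, where m_i is the mean of the two corpora's counts for 3-gram i. All function names are hypothetical.

```python
from collections import Counter
from math import sqrt


def pos_trigram_counts(tagged_sentences):
    """Count POS-tag 3-grams; each sentence is a list of POS tags."""
    counts = Counter()
    for tags in tagged_sentences:
        for i in range(len(tags) - 2):
            counts[tuple(tags[i:i + 3])] += 1
    return counts


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return 1.0 - dot / (norm_a * norm_b)


def recurrence(a, b):
    """Return (R, Rsq) under the assumed definition described above."""
    keys = a.keys() | b.keys()
    # Counter returns 0 for 3-grams that one corpus does not contain.
    r = sum(abs(a[k] - (a[k] + b[k]) / 2) for k in keys)
    rsq = sum((a[k] - (a[k] + b[k]) / 2) ** 2 for k in keys)
    return r, rsq


corpus_a = pos_trigram_counts([["DET", "NOUN", "VERB", "DET", "NOUN"]])
corpus_b = pos_trigram_counts([["DET", "NOUN", "VERB"]])
print(cosine_distance(corpus_a, corpus_b), recurrence(corpus_a, corpus_b))
```

With raw counts, R and Rsq grow with corpus size, so corpora of different sizes would normally be normalised first; whether and how the study did this is not stated in the slides.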
![Page 15: LREC'2008 translation universals](https://reader033.fdocuments.us/reader033/viewer/2022061607/55d2979abb61eb5a398b4575/html5/thumbnails/15.jpg)
Experimental results
Computation of stylistic features
Chi-square values for global comparison
T-test values for statistical significance
Measuring vector differences for syntax comparison
![Page 16: LREC'2008 translation universals](https://reader033.fdocuments.us/reader033/viewer/2022061607/55d2979abb61eb5a398b4575/html5/thumbnails/16.jpg)
Style comparison: Stylistic Features
| Features | MSTP | MSTS | TST | MSTPC | MSTSC | TSC |
| --- | --- | --- | --- | --- | --- | --- |
| Lexical Density | 0.027954 | 0.052715 | 0.020679 | 0.042505 | 0.041159 | 0.025529 |
| Lexical Richness | 0.016929 | 0.037709 | 0.013281 | 0.029992 | 0.028905 | 0.015591 |
| Average Sentence Length | 25.256248 | 28.499456 | 27.292782 | 20.702349 | 26.442412 | 18.124363 |
| Simple Sentences (proportion) | 0.441768121 | 0.507205751 | 0.476949103 | 0.638889238 | 0.52120611 | 0.592110096 |
| Discourse Markers (Ratio) | 0.001268941 | 0.001852604 | 0.000763805 | 0.002022331 | 0.002099085 | 0.001649655 |
![Page 17: LREC'2008 translation universals](https://reader033.fdocuments.us/reader033/viewer/2022061607/55d2979abb61eb5a398b4575/html5/thumbnails/17.jpg)
Style comparison: Chi-Square Values
Translated Corpora

| Corpora | Chi-Square Values |
| --- | --- |
| MSTP - MSTS | 0.010622566 |
| MSTP - TST | 0.00266151 |
| MSTS - TST | 0.023731912 |
| Total | 0.037015988 |
| Average | 0.012338663 |

Non-Translated Corpora

| Corpora | Chi-Square Values |
| --- | --- |
| MSTPC - MSTSC | 0.059779549 |
| MSTPC - TSC | 0.006140764 |
| MSTSC - TSC | 0.07122404 |
| Total | 0.137144352 |
| Average | 0.045714784 |
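The totals and averages reported for the chi-square values can be re-derived with a short Python sketch. The pairwise values are copied from the slide; the variable and function names are hypothetical, and reading a lower value as "more similar" is an assumption based on how the discussion slides interpret these figures.

```python
# Pairwise chi-square values as reported on the slide.
translated = {
    ("MSTP", "MSTS"): 0.010622566,
    ("MSTP", "TST"): 0.00266151,
    ("MSTS", "TST"): 0.023731912,
}
non_translated = {
    ("MSTPC", "MSTSC"): 0.059779549,
    ("MSTPC", "TSC"): 0.006140764,
    ("MSTSC", "TSC"): 0.07122404,
}


def total_and_average(pair_values):
    """Return the sum and the mean of the pairwise values."""
    total = sum(pair_values.values())
    return total, total / len(pair_values)


total_t, avg_t = total_and_average(translated)
total_nt, avg_nt = total_and_average(non_translated)
print(f"translated:     total={total_t:.9f} average={avg_t:.9f}")
print(f"non-translated: total={total_nt:.9f} average={avg_nt:.9f}")
# Assumption: a lower value means the pair of corpora is more similar.
print("translated pairs closer on average:", avg_t < avg_nt)
```

The recomputed figures agree with the reported ones up to rounding in the last digit.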
![Page 18: LREC'2008 translation universals](https://reader033.fdocuments.us/reader033/viewer/2022061607/55d2979abb61eb5a398b4575/html5/thumbnails/18.jpg)
Style comparison: T-Test Values

Translated Corpora (T-test Values)

| Features | MSTP - MSTS | MSTS - TST | MSTP - TST |
| --- | --- | --- | --- |
| Lexical Density | 0.002545387 | 0.000123172 | 0.079875166 |
| Lexical Richness | 0.0006604 | 0.0000069792 | 0.140236542 |
| Sentence Length | 0.011826639 | 0.522122939 | 0.202480843 |
| Simple Sentences | 0.057465277 | 0.673936375 | 0.202830407 |
| Discourse Markers | 0.001048007 | 0.005746253 | 0.351552034 |

Non-translated Corpora (T-test Values)

| Features | MSTPC - MSTSC | MSTSC - TSC | MSTPC - TSC |
| --- | --- | --- | --- |
| Lexical Density | 0.140348431 | 0.201151185 | 0.000748439 |
| Lexical Richness | 0.140711253 | 0.015893183 | 0.0000971905 |
| Sentence Length | 0.145216739 | 0.002807505 | 0.368840258 |
| Simple Sentences | 0.096465071 | 0.462960518 | 0.21217697 |
| Discourse Markers | 0.063428055 | 0.00084074 | 0.072337471 |
![Page 19: LREC'2008 translation universals](https://reader033.fdocuments.us/reader033/viewer/2022061607/55d2979abb61eb5a398b4575/html5/thumbnails/19.jpg)
Syntax comparison: Results Measuring Vector Differences
Translated texts

| Corpora | 1-C | R | Rsq |
| --- | --- | --- | --- |
| MSTP - MSTS | 0.206015066283 | 252526.914323 | 638848591.082 |
| MSTP - TST | 0.337626383799 | 388466.504863 | 3146471863.13 |
| MSTS - TST | 0.176310545152 | 432725.578482 | 2643068563.82 |

Non-Translated texts

| Corpora | 1-C | R | Rsq |
| --- | --- | --- | --- |
| MSTPC - MSTSC | 0.0176469276126 | 98448.0858054 | 82218137.9687 |
| MSTPC - TSC | 0.150912596476 | 364322.217714 | 851312764.364 |
| MSTSC - TSC | 0.167167511143 | 372940.61477 | 1008322991.78 |
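As a quick cross-check of these figures, the Python sketch below compares each translated pair with its non-translated counterpart on every measure. The values are copied from the slide; the pairing of rows (MSTP-MSTS with MSTPC-MSTSC, and so on) and all names are assumptions made for illustration.

```python
MEASURES = ("1-C", "R", "Rsq")

# Each entry: (translated pair values, matching non-translated pair values).
syntax_results = {
    "professionals vs students": (
        (0.206015066283, 252526.914323, 638848591.082),   # MSTP - MSTS
        (0.0176469276126, 98448.0858054, 82218137.9687),  # MSTPC - MSTSC
    ),
    "professionals vs technical": (
        (0.337626383799, 388466.504863, 3146471863.13),   # MSTP - TST
        (0.150912596476, 364322.217714, 851312764.364),   # MSTPC - TSC
    ),
    "students vs technical": (
        (0.176310545152, 432725.578482, 2643068563.82),   # MSTS - TST
        (0.167167511143, 372940.61477, 1008322991.78),    # MSTSC - TSC
    ),
}

translated_always_larger = True
for label, (translated, non_translated) in syntax_results.items():
    for name, t_value, nt_value in zip(MEASURES, translated, non_translated):
        larger = t_value > nt_value
        translated_always_larger = translated_always_larger and larger
        print(f"{label:28s} {name:4s} translated larger: {larger}")

print("translated pairs differ more on every measure:", translated_always_larger)
```

In all nine comparisons the translated pair has the larger distance, so on these figures the translated corpora are syntactically less similar to each other than the original corpora are.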
![Page 20: LREC'2008 translation universals](https://reader033.fdocuments.us/reader033/viewer/2022061607/55d2979abb61eb5a398b4575/html5/thumbnails/20.jpg)
Discussion (1)
Stylistic features: translated texts included in experiment are more similar than non-translated texts (Chi-square test)
![Page 21: LREC'2008 translation universals](https://reader033.fdocuments.us/reader033/viewer/2022061607/55d2979abb61eb5a398b4575/html5/thumbnails/21.jpg)
Discussion (2)
T-test observations
There are non-translated texts which are not statistically different in terms of stylistic features, whereas the corresponding translated texts differ statistically
There are non-translated texts which are statistically different in terms of only one stylistic feature, whereas the corresponding translated texts differ statistically with regard to two stylistic features
Translated texts can often differ significantly with regard to certain style features (e.g. lexical density).
![Page 22: LREC'2008 translation universals](https://reader033.fdocuments.us/reader033/viewer/2022061607/55d2979abb61eb5a398b4575/html5/thumbnails/22.jpg)
Discussion (3)
Translated texts differ more in terms of syntax for all compared pairs and from the point of view of all measures (1-C, R and Rsq)
![Page 23: LREC'2008 translation universals](https://reader033.fdocuments.us/reader033/viewer/2022061607/55d2979abb61eb5a398b4575/html5/thumbnails/23.jpg)
Conclusions
Style: convergence appears to be broadly holding, but no definite conclusion can be made that convergence is a clear-cut universal
Syntax: there is no evidence that convergence holds in terms of syntax
General: results do not provide sufficient support to the convergence ‘universal’