Corpus Linguistics Notes
CASS: Corpus Approaches to Social Science
Using comparable and parallel corpora in contrastive and translation studies
Richard Xiao Lancaster University
Outline of the session
• Types of corpora used in translation and contrastive studies
• Paradigmatic shift in contrastive and translation studies
• A model of Contrastive Corpus Linguistics
• Alignment and parallel concordancing
• Corpus resources and tools
Types of corpora: Some distinctions
• Monolingual versus multilingual corpora
• Parallel versus comparable corpora
• Comparable versus comparative corpora
Monolingual vs. multilingual corpora
• Monolingual corpora
• A corpus that only involves one language
• Multilingual corpora
• A corpus that contains texts of more than one language
• A corpus covering two languages is conventionally known as ‘bilingual’
• Multilingual corpora, in a narrow sense, must involve more than two languages
• ‘Multilingual’ and ‘bilingual’ are often used interchangeably
• Parallel and comparable corpora
Parallel vs. comparable corpora • Terminological confusion centres around the terms • For some scholars (e.g. Aijmer & Altenberg 1996; Granger 1996: 38)
• Corpora composed of source texts in one language and their translations in another language (or other languages) are ‘translation corpora’ while those comprising different components sampled from different native languages using comparable sampling techniques are called ‘parallel corpora’
• For many others (e.g. Baker 1993: 248, 1995, 1999; Barlow 1995, 2000: 110; Hunston 2002: 15; McEnery and Wilson 1996: 57; McEnery, Xiao & Tono 2006) • Corpora of the first type are labelled ‘parallel corpora’ while
those of the latter type are ‘comparable corpora’
Parallel vs. comparable corpora • Consistent and logical ways of doing things…
• We can say a corpus is a translation or a non-translation corpus if the criterion of corpus content is used
• But if we choose to define corpus types by the criterion of corpus form, we must use the criterion consistently • We can say a corpus is parallel if the corpus contains source
texts and translations in parallel, or it is a comparable corpus if its components or subcorpora are comparable by applying the same sampling techniques and representing similar balance
• It is simply inconsistent and illogical to refer to corpora of the first type as ‘translation corpora’ by the criterion of content while referring to corpora of the latter type as ‘comparable corpora’ by the criterion of form!
Multilingual vs. monolingual comparable corpora
• A common practice in TS is to compare a corpus of translated texts (‘translational corpus’) with a corpus comprising comparably sampled non-translated native texts in the target language
• The ZJU Corpus of Translational Chinese (ZCTC) vs. the Lancaster Corpus of Mandarin Chinese (LCMC)
• The two sub-corpora form monolingual comparable corpora, as opposed to multilingual comparable corpora composed of comparable texts for different languages (LCMC vs. FLOB)
Comparative corpora • Corpora containing different varieties of the same
language are not comparable corpora • e.g. the International Corpus of English (ICE); the
Brown family of corpora • All corpora, as a resource for linguistic research, are well suited for comparative studies, in either intralingual or interlingual research
• Corpora of this kind are comparative corpora
Use of parallel & comparable corpora • Parallel and comparable corpora “offer specific uses and
possibilities” for contrastive and translation studies (Aijmer & Altenberg 1996:12) • Giving new insights into the languages compared – insights
that are not likely to be gained from the study of monolingual corpora
• Used for a range of comparative purposes and increasing our knowledge of language-specific, typological and cultural differences, as well as of universal features
• Illuminating differences between source texts and translations, and between native and non-native texts
• Used for a number of practical applications, e.g. in lexicography, language teaching and translation
Use of parallel & comparable corpora • Used primarily for translation and contrastive studies • The two types of corpora have their own characteristics, and serve
different purposes • Parallel corpora: useful in translation studies, but they alone
serve as a poor basis for cross-linguistic contrast, because translations cannot avoid the effect of translationese
• Comparable corpora: well suited for contrastive research, but are less useful in translation studies, e.g. in studying translation equivalents
Using corpora in translation studies • Translational corpora
• Used in combination with a comparable TL corpus to provide primary evidence in product-oriented Translation Studies, and in studies of “translation universals”
• If corpora of this kind are encoded with sociolinguistic and cultural parameters, they can also be used to study the sociocultural environment of translations
• Monolingual SL and TL corpora • Can raise the translator’s linguistic and cultural awareness in
general • A useful and effective reference tool for translators • Used in combination with a parallel corpus to form a so-called
‘translation evaluation corpus’: helping translator trainers or critics to evaluate translations more effectively and objectively
Using corpora in translation studies • Parallel corpora
• Useful in exploring how an idea in one language is conveyed in another language, thus providing indirect evidence to the study of the translation process
• Indispensable for building statistical or example-based machine translation (EBMT) systems, and for the development of bilingual lexicons and translation memories
• Parallel concordancing is a useful tool for translators • Comparable corpora of SL and TL
• Useful in improving the translator’s understanding of the subject field and improving the quality of translation in terms of fluency, correct term choice and idiomatic expressions in the chosen field
• Can also be used to build terminology banks
Corpora in contrastive linguistics • Contrastive analysis
• Became an important part of FLT methodology after WWII and remained dominant throughout the 1960s
• Lost ground to more learner-oriented approaches e.g. error analysis, performance analysis, and interlanguage analysis
• Revived in the 1990s • The rapid development of corpus linguistics has been
recognized as a principal reason for its revival (cf. Salkie 2002; Xiao & McEnery 2010; Xiao 2011)
Corpora in contrastive linguistics • The marriage of corpus linguistics and contrastive analysis is an
entirely natural one • Corpus linguistics is inherently comparative in nature • The combination of corpus analysis and contrastive analysis can
produce a synergy that can and has benefited both corpus linguistics and contrastive analysis
• Corpora have “always been pre-eminently suited for comparative studies” (Aarts 1998: ix)
• Corpora of the Brown family (Lancaster 1931, LOB, FLOB, BE2006; B-Brown, Brown, Frown, AE2006)
• Even the BNC, which is designed as a balanced corpus representing modern British English in general, provides a useful basis for various intra-lingual comparisons
Corpora in contrastive linguistics • Corpus analysis techniques are also intrinsically
comparative • keyword analysis • collocation analysis • interlanguage analysis
• Corpus-based contrastive linguistics has emerged with a wealth of methodologies, addressing a wide spectrum of cross-linguistic issues (cf. Altenberg & Granger 2002; Granger 2003)
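Keyword analysis is comparative by construction: a word is "key" only relative to a reference corpus. The sketch below illustrates this with the log-likelihood statistic that is commonly used for keyness. The function names and the toy token lists are assumptions made for illustration; they do not come from the slides or from any particular corpus tool.

```python
import math
from collections import Counter


def log_likelihood(freq_a, size_a, freq_b, size_b):
    """Log-likelihood keyness score for one word in corpus A vs. corpus B."""
    total = size_a + size_b
    # Expected frequencies if the word were equally common in both corpora.
    expected_a = size_a * (freq_a + freq_b) / total
    expected_b = size_b * (freq_a + freq_b) / total
    score = 0.0
    for observed, expected in ((freq_a, expected_a), (freq_b, expected_b)):
        if observed > 0:  # a zero count contributes nothing to the sum
            score += observed * math.log(observed / expected)
    return 2 * score


def keywords(tokens_a, tokens_b, top_n=5):
    """Rank words by keyness; the flag marks overuse in A relative to B."""
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    size_a, size_b = len(tokens_a), len(tokens_b)
    scored = []
    for word in set(counts_a) | set(counts_b):
        score = log_likelihood(counts_a[word], size_a, counts_b[word], size_b)
        overused_in_a = counts_a[word] / size_a > counts_b[word] / size_b
        scored.append((word, score, overused_in_a))
    return sorted(scored, key=lambda item: item[1], reverse=True)[:top_n]


if __name__ == "__main__":
    translated = "the translation of the text follows the source text".split()
    native = "the cat sat on the mat near the door".split()
    for word, score, overused in keywords(translated, native):
        print(f"{word:12s} {score:6.2f} {'+' if overused else '-'}")
```

A word with the same relative frequency in both corpora scores 0; the further the two relative frequencies diverge, the higher the score.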
Corpus-based Translation Studies • Laviosa (1998a): “the corpus-based approach is evolving,
through theoretical elaboration and empirical realisation, into a coherent, composite and rich paradigm that addresses a variety of issues pertaining to theory, description, and the practice of translation.” • Hypothesis that translation universals can be identified and tested by using corpus data (Baker 1993, 1995)
• Rapid development of corpus linguistics, especially multilingual corpus research in the early 1990s
• Increasing interest in Descriptive Translation Studies (Toury 1995)
Corpus-based Translation Studies • Tymoczko (1998): “Corpus Translation Studies is central to the way that
Translation Studies as a discipline will remain vital and move forward.”
• Meta 43/4 (1998); Kenny (2001); Bowker (2002); Laviosa (2002); Granger et al (2003); Teich (2003); Zanettin et al (2003); Mauranen et al (2004); Olohan (2004); Santos (2004); Rogers & Anderman (2007); Beeby et al (2009); Saldanha (2009); Hruzov (2010); Izwaini (2010); Tengku et al (2010); Véronis (2010); Xiao (2010, 2011, 2012); Hu (2011); Kruger et al (2011); Wang (2012)
• Corpus-based Translation Studies book series (Shanghai Jiao Tong University Press / Springer)
The Holmes-Toury map • Applied Translation Studies
• Descriptive Translation Studies
• Theoretical Translation Studies
Applied Translation Studies • Three major contributions of corpora
• Corpus-assisted translating • Bowker (1998: 631): “corpus-assisted translations are of a higher
quality with respect to subject field understanding, correct term choice and idiomatic expressions.”
• Corpus-aided translation teaching and training
• Bernardini (1997): ‘large corpora concordancing’ (LCC) can help students to develop ‘awareness’, ‘reflectiveness’ and ‘resourcefulness’, which are the skills that distinguish a professional translator from unskilled amateurs
• Development of translation tools • Corpora, and especially aligned parallel corpora, are essential for
the development of translation technology such as machine translation (MT) systems, and computer-aided translation (CAT) tools and translation memories (TM)
Descriptive Translation Studies • Characterized by its emphasis on the study of
translation per se, aiming to answer the question of “why a translator translates in this way” instead of “how to translate”
• Baker (1993) predicted that the availability of large corpora of both source and translated texts, together with the development of the corpus-based approach, would enable translation scholars to uncover the nature of translation as a mediated communicative event
Descriptive Translation Studies • Three focuses (Holmes 1972/1988) • The function of translation
• Concerned with the study of contexts rather than texts: e.g. function or impact of a translation work
• Relatively few function-oriented studies that are corpus-based • Translation as a process
• Aiming to reveal the thought processes that take place in the mind of the translator while they are translating
• One possible route for corpus-based DTS is to analyze off-line the written transcripts of recordings in which translators verbalise their thoughts while translating (Think-Aloud Protocols, or TAPs)
• Research on translation as a product can also provide indirect evidence about translation as a process (product vs. process)
• Translation as a product • Concerned with describing translation as a product by comparing
comparable corpora of translated and non-translated texts in TL • Attempting to uncover evidence to validate / invalidate the so-called
translation universal hypotheses
Descriptive Translation Studies • Core patterns of lexical use (Laviosa 1998b)
• A relatively low proportion of lexical words over function words
• A relatively high proportion of high-frequency words over low-frequency words
• A relatively greater repetition of the most frequent words
• Less variety in most frequently used words
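The lexical patterns listed above can be approximated with simple counts over a tokenised text. The following is a minimal sketch, not Laviosa's actual procedure: the tiny function-word list, the function name and the choice of three measures are assumptions for illustration, and a real study would use a full function-word list or a POS tagger and compare a translated corpus against a comparable non-translated one.

```python
from collections import Counter

# Tiny illustrative function-word list (an assumption, far from complete).
FUNCTION_WORDS = {"the", "a", "an", "of", "in", "on", "to", "and",
                  "is", "was", "it", "that"}


def lexical_profile(tokens, top_n=3):
    """Return rough measures related to the core patterns of lexical use."""
    tokens = [token.lower() for token in tokens]
    if not tokens:
        raise ValueError("cannot profile an empty text")
    total = len(tokens)
    counts = Counter(tokens)
    lexical = sum(1 for token in tokens if token not in FUNCTION_WORDS)
    top = counts.most_common(top_n)
    return {
        # Share of lexical (content) words among all running words.
        "lexical_density": lexical / total,
        # Share of the text covered by the top_n most frequent words.
        "top_coverage": sum(count for _, count in top) / total,
        # Distinct words divided by running words (lexical variety).
        "type_token_ratio": len(counts) / total,
    }


if __name__ == "__main__":
    print(lexical_profile("the cat sat on the mat".split()))
```

Under the hypotheses on this slide, a translated corpus would be expected to show lower lexical density, higher coverage by its most frequent words and lower variety than a comparable native corpus of the same size.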
Descriptive Translation Studies • Beyond the lexical level
• Simplification: “tendency to simplify the language used in translation” (Baker 1996: 181-182)
• Normalisation: “tendency to exaggerate features of the target language and to conform to its typical patterns” (Baker 1996: 183)
• Explicitation: translations tend to “spell things out rather than leave them implicit” (Baker 1996: 180)
• Sanitisation: translated texts are “somewhat ‘sanitised’ versions of the original” (Kenny 1998: 515)
• Leveling out (convergence): “tendency of translated text to gravitate towards the centre of a continuum” (Baker 1996: 184)
Theoretical Translation Studies • Aims “to establish general principles by means of which
these phenomena can be explained and predicted” (Holmes 1988: 71) • Closely related to, and often reliant on the empirical
findings produced by Descriptive Translation Studies
• One good battleground for using DTS findings to pursue a general theory of translation is the hypothesis of so-called “translation universals” (TUs): the inherent common features of translational language
• An important area of corpus-based TS over the past decade
Contrastive Corpus Linguistics • Bringing together the strengths of contrastive analysis and
corpus analysis • This synergy has not only revived contrastive analysis but
has also expanded the fields of corpus linguistics, translation studies, and SLA research
• A new model of Contrastive Corpus Linguistics (Xiao & McEnery 2010) to demonstrate the promise and potential value of the corpus-based approach to contrastive and translation studies • Common platform for research areas including corpus
linguistics, contrastive linguistics, translation studies, and SLA
Corpus alignment • We have so far assumed that ‘parallel corpora’ means aligned parallel corpora
• An essential step in the construction and exploitation of parallel corpora
• Without alignment, we cannot easily determine which sentences in TL are translations of which in SL
• Corpus alignment makes explicit the information regarding the translation in a parallel corpus, with the aim of finding translation equivalents at different levels (sentence, phrase, word) between the SL and TL texts in a parallel corpus
• Most multilingual corpus tools only take pre-aligned parallel texts as input in parallel concordancing
Corpus alignment • Levels of alignment
• Document level • Paragraph • Sentence • Phrase (multi-word unit) • Word
• Sentence alignment is generally the first step to phrase and word alignment
Corpus alignment • Combined vs. stand-alone format
• Combined/embedded: the source and translated texts stored in a single text
• Stand-alone: stored in separate files, with SL and TL segment in each translation equivalent linked together with a unique identifier or pointer
• Conversion between the two formats is possible • Different parallel concordancers may have different
requirements
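Conversion between the two storage formats amounts to joining or splitting segments on their shared identifier. The sketch below uses in-memory dictionaries and tuples as a stand-in for real files; this representation and the function names are assumptions for illustration, since each concordancer defines its own file layout.

```python
def combine(source_segments, target_segments):
    """Stand-alone -> combined: join SL and TL segments on their shared id."""
    combined = []
    for seg_id, source_text in source_segments.items():
        if seg_id in target_segments:  # keep only segments linked on both sides
            combined.append((seg_id, source_text, target_segments[seg_id]))
    return combined


def split(combined):
    """Combined -> stand-alone: one id-keyed mapping per language."""
    source = {seg_id: sl for seg_id, sl, _ in combined}
    target = {seg_id: tl for seg_id, _, tl in combined}
    return source, target


if __name__ == "__main__":
    sl = {"s1": "Good morning.", "s2": "Thank you."}
    tl = {"s1": "Bonjour.", "s2": "Merci."}
    for seg_id, source_text, target_text in combine(sl, tl):
        print(seg_id, source_text, "|||", target_text)
```

The unique identifier does the work here: as long as it survives the conversion, either format can be rebuilt from the other.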
Corpus alignment
• Statistical (probabilistic) approach to sentence alignment
• Usually based on sentence length in terms of words or characters
• Linguistic (knowledge/rule-based) approach
• Using morpho-syntactic information to explore similarities between languages
• Punctuation and “anchor points”
• Achieving more accurate alignment, but necessarily slow
• Hybrid approach
• Most widely used approach to sentence alignment
• Integrating linguistic knowledge into a probabilistic algorithm to achieve improved accuracy
• Making use of anchor points
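The length-based idea can be sketched as a small dynamic programme that prefers pairings whose character lengths match. This is a deliberately simplified illustration: the cost function and the two penalty values are assumptions chosen for readability, not the probabilistic model used by published length-based aligners or by any of the tools named in these slides.

```python
def align_by_length(source, target, skip_penalty=1.0, merge_penalty=0.2):
    """Align two sentence lists by character length (simplified sketch).

    Allowed beads: 1-1, 1-0, 0-1, 2-1 and 1-2. Returns a list of
    (source_sentences, target_sentences) pairs in document order.
    """
    src_len = [len(s) for s in source]
    tgt_len = [len(t) for t in target]
    n, m = len(source), len(target)
    inf = float("inf")
    # best[i][j]: minimal cost of aligning the first i source and j target sentences.
    best = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    best[0][0] = 0.0
    beads = [(1, 1), (1, 0), (0, 1), (2, 1), (1, 2)]
    for i in range(n + 1):
        for j in range(m + 1):
            if best[i][j] == inf:
                continue
            for di, dj in beads:
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                if di == 0 or dj == 0:
                    cost = skip_penalty  # sentence left without a partner
                else:
                    s = sum(src_len[i:ni])
                    t = sum(tgt_len[j:nj])
                    cost = abs(s - t) / max(s + t, 1)  # relative length mismatch
                    if (di, dj) != (1, 1):
                        cost += merge_penalty  # mild preference for 1-1 beads
                if best[i][j] + cost < best[ni][nj]:
                    best[ni][nj] = best[i][j] + cost
                    back[ni][nj] = (i, j)
    # Trace the cheapest path back from the end of both texts.
    pairs = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        pairs.append((source[pi:i], target[pj:j]))
        i, j = pi, pj
    return pairs[::-1]


if __name__ == "__main__":
    sl = ["a" * 10, "b" * 40, "c" * 5]
    tl = ["x" * 10, "y" * 20, "z" * 20, "w" * 5]
    for src, tgt in align_by_length(sl, tl):
        print(len(src), "source sentence(s) <->", len(tgt), "target sentence(s)")
```

In the toy input, the 40-character source sentence is paired with the two 20-character target sentences (a 1-2 bead), while the other sentences align 1-1. A hybrid aligner would add linguistic evidence, such as anchor points, to a cost function of this kind.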
Corpus alignment
• Research on alignment has focused on European language pairs
• Sentence alignment among closely related European language pairs has achieved a very high accuracy rate (98%+)
• But less accurate for typologically different languages such as English and Chinese (ca. 80%+), typically requiring human intervention or post-editing
Corpus alignment • InterText Editor (with automatic Hunalign)
• Supporting different operating systems • Local and networked server • http://wanthalf.saga.cz/intertext
• WinAlign in SDL-Trados • Commercial CAT software tool
• Uplug corpus tools • http://sourceforge.net/projects/uplug/?source=dlp
Parallel concordancing • ParaConc
• Commercial software (US$89): http://www.paraconc.com/
• Unicode compliant • Semi-automatic alignment • Computing and highlighting collocation • Supporting 2-4 aligned parallel texts stored in
separate files
Parallel concordancing
• CUC_Paraconc
• Freeware tool
• Supporting up to 16 parallel texts stored either in one file or in different files
• Unicode compliant
• Supporting Regular Expression search
• Displaying results in KWIC format, and saving results either in a single text file or in different files
• www.fass.lancs.ac.uk/projects/corpus/data/CUC_Paraconc.zip
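The core of parallel concordancing (a regular-expression search on one side of an aligned corpus, shown in KWIC format together with the aligned segment on the other side) can be sketched in a few lines. This is an illustrative toy, not the implementation of ParaConc or CUC_Paraconc; the function name, the context width and the sample sentence pairs are all assumptions.

```python
import re


def parallel_concordance(aligned_pairs, pattern, width=20):
    """Search the source side with a regex; return (KWIC line, aligned target) hits."""
    regex = re.compile(pattern, re.IGNORECASE)
    hits = []
    for source, target in aligned_pairs:
        for match in regex.finditer(source):
            left = source[max(0, match.start() - width):match.start()]
            right = source[match.end():match.end() + width]
            # Pad both contexts so the keyword lines up in a column.
            kwic = f"{left:>{width}} [{match.group()}] {right:<{width}}"
            hits.append((kwic, target))
    return hits


if __name__ == "__main__":
    pairs = [
        ("The corpus is aligned.", "Le corpus est aligné."),
        ("No match here.", "Rien ici."),
    ]
    for kwic, target in parallel_concordance(pairs, r"\bcorp\w+"):
        print(kwic, "=>", target)
```

The search only works because the input is pre-aligned: each hit on the source side already knows which target segment to display.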
Parallel concordancing
• Terminology in multilingual corpus linguistics
• Types of corpora used in contrastive and translation studies
• Relationship between corpus linguistics and contrastive analysis
• Corpus-based translation studies
• Corpus alignment and parallel concordancing
• Well known and influential corpora
• www.fass.lancs.ac.uk/projects/corpus/cbls/corpus_survey.pdf
UCCTS conferences
• International conferences on Using Corpora in Contrastive and Translation Studies
• UCCTS1: China
• www.lancs.ac.uk/fass/projects/corpus/UCCTS2008Proceedings
• UCCTS2: UK
• www.lancs.ac.uk/fass/projects/corpus/UCCTS2010Proceedings
• UCCTS3 (jointly with ICLC7): Belgium
• http://www.iclc7-uccts3.ugent.be/
• UCCTS4: July 2014, Lancaster
• http://ucrel.lancs.ac.uk/uccts4/