Contrasting corpora to identify phraseological suggestions ...

Contrasting corpora to identify phraseological suggestions to enhance L2 English research writing. Gustavo Zomer, Ana Frankenberg-Garcia. Centre for Translation Studies, School of Literature and Languages, University of Surrey


Page 1: Contrasting corpora to identify phraseological suggestions ...

Contrasting corpora to identify phraseological suggestions to enhance L2 English research writing

Gustavo Zomer, Ana Frankenberg-Garcia

Centre for Translation Studies, School of Literature and Languages

University of Surrey

Page 2: Contrasting corpora to identify phraseological suggestions ...

Research problem/questions

• Research writing can be especially challenging for scholars and scientists whose habitual working language is not English (L2 English researchers) (Schuster, Levkowitz, & Oliveira Junior, 2014; Politzer-Ahles, Holliday, Girolamo, Spychalska, & Berkson, 2016)

• A major hurdle in the way of L2 English researchers when writing for publication in high-ranking international journals is academic English phraseology

• This study aims to help Brazilian researchers improve their phraseological repertoire when writing for publication in English.

RQ1: What are the main phraseological differences in journal articles published locally by Brazilian researchers when compared with a reference corpus of expert academic English?

RQ2: Can L1 Portuguese academic phraseology explain some of the discrepancies above?

RQ3: How can target academic English phraseological solutions be provided automatically to help Brazilian researchers writing for publication in English?

Page 3: Contrasting corpora to identify phraseological suggestions ...

Corpora

• The focus corpus, the Brazilian Academic Corpus of English (BrACE) v2, is a much larger, 35M word version of the 1M word BrACE corpus used in Tavares Pinto, Rees & Frankenberg-Garcia (2021).

• It was compiled from a balanced sample of journal articles in seven broad subject areas, written in English and published in Brazil, which were downloaded from the Scientific Electronic Library Online (SciELO).

• The reference corpus is the Expert Academic Corpus of English (ExpACE), which was built specifically for this study. It consists of 35M words from more than 5000 highly cited papers published in international high-impact journals in eight subject areas.

• The third corpus is CoPEP, a 40M word corpus of academic Portuguese sourced from published journal articles from SciELO (Kuhn, 2017).

Page 4: Contrasting corpora to identify phraseological suggestions ...

Methodology

• To identify phraseological contrasts between publications by L2 English Brazilian researchers and expert academic English (RQ1), we follow Granger’s (1996) Contrastive Interlanguage Analysis.

• We began by extracting the top 2- to 4-grams in BrACE and ExpACE with a normalized frequency of over ten per million, selecting only n-grams occurring in more than three disciplines in each corpus, and removing proper nouns and non-phrases (e.g., preposition+preposition).

• Next, we contrasted BrACE with ExpACE using a smoothed frequency ratio (Kilgarriff, 2009), and selected the n-grams that occurred at least twice as often in BrACE to look for evidence of overuse.

• To identify possible sources of n-gram overuse that could be related to the L1 (RQ2), we machine-translated the overused n-grams into Brazilian Portuguese using DeepL, and selected the translation that occurred most frequently in the CoPEP corpus.

• Finally, we machine-translated each Brazilian Portuguese phrase back to English, which resulted in alternative lexical suggestions for English phrases overused in Brazilian papers (RQ3).
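The steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the plain token-list input, the smoothing constant `n=1.0`, and applying the ten-per-million threshold to the focus corpus only are all hypothetical choices. The ratio uses the general add-n form described by Kilgarriff (2009). The discipline filter, the removal of proper nouns and non-phrases, and the machine translation call itself are left out; `most_frequent_translation` simply assumes a list of candidate translations is already available.

```python
from collections import Counter


def ngram_counts(tokens, n_min=2, n_max=4):
    """Count every n-gram of length n_min..n_max in a list of tokens."""
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts


def per_million(counts, corpus_size):
    """Normalize raw counts to frequency per million tokens."""
    return {gram: c * 1_000_000 / corpus_size for gram, c in counts.items()}


def smoothed_ratio(focus_fpm, ref_fpm, n=1.0):
    """Add-n smoothed frequency ratio; n=1.0 is an assumed constant."""
    return (focus_fpm + n) / (ref_fpm + n)


def overused(focus, ref, min_fpm=10.0, min_ratio=2.0, n=1.0):
    """Return {n-gram: ratio} for n-grams overused in the focus corpus.

    focus and ref map n-gram -> frequency per million.
    Assumption: the frequency threshold is applied to the focus corpus only.
    """
    result = {}
    for gram, fpm in focus.items():
        if fpm <= min_fpm:
            continue  # too rare in the focus corpus to consider
        ratio = smoothed_ratio(fpm, ref.get(gram, 0.0), n)
        if ratio >= min_ratio:
            result[gram] = ratio
    return result


def most_frequent_translation(candidates, l1_fpm):
    """Pick the candidate translation that is most frequent in the L1 corpus."""
    return max(candidates, key=lambda cand: l1_fpm.get(cand, 0.0))


# Usage: count n-grams in a short token sequence.
counts = ngram_counts("in this study we show".split())
print(counts["in this study"])
```

Adding a constant to both frequencies keeps the ratio defined when an n-gram is absent from the reference corpus, and it dampens ratios driven by very low counts.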

Page 5: Contrasting corpora to identify phraseological suggestions ...

Results

N-gram (1)        | ExpACE frequency (2) | BrACE frequency (2) | Factor (3) | Translation       | CoPEP n-gram rank (4) | MT suggestion 1    | MT suggestion 2
in this study     | 88.7                 | 272.9               | 3.1        | neste estudo      | 930                   | in this paper      | -
according to the  | 90.6                 | 396.5               | 4.4        | de acordo com o   | 4                     | in accordance with | in line with
the present study | 42.8                 | 242.7               | 5.7        | o presente estudo | 782                   | this study         | the current study
of this study     | 32.6                 | 140.0               | 4.3        | deste estudo      | 677                   | of this paper      | -
of the sample     | 21.3                 | 59.7                | 2.8        | da amostra        | 503                   | from the sample    | -
with the aim      | 8.0                  | 19.1                | 2.4        | com o objetivo    | 98                    | in order to        | aiming
portion of the    | 10.4                 | 47.3                | 4.6        | parte do          | 266                   | part of the        | -
in all the        | 7.6                  | 21.8                | 2.9        | em todo o         | 198                   | throughout the     | across the
among the         | 105.5                | 285.0               | 2.7        | entre os          | 17                    | between the        | -
is necessary      | 40.6                 | 92.3                | 2.3        | é necessário      | 606                   | is required        | -
considering the   | 38.0                 | 186.7               | 4.9        | tendo em conta o  | 276                   | in view of         | given the

1: The complete table has over 1000 entries.
2: Tokens per million.
3: BrACE frequency / ExpACE frequency.
4: Rank of the translated n-gram in CoPEP.
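As a worked check of note 3, the factor for a few rows can be recomputed from the two per-million frequencies. This is only an illustrative sketch: the helper name `factor` and the rounding to one decimal place are assumptions, and the three rows are copied from the table above.

```python
# (ExpACE frequency, BrACE frequency) in tokens per million, from the table.
rows = {
    "in this study": (88.7, 272.9),
    "the present study": (42.8, 242.7),
    "among the": (105.5, 285.0),
}


def factor(expace_fpm, brace_fpm):
    """BrACE frequency divided by ExpACE frequency, rounded to one decimal."""
    return round(brace_fpm / expace_fpm, 1)


for gram, (expace_fpm, brace_fpm) in rows.items():
    print(gram, factor(expace_fpm, brace_fpm))
# prints:
# in this study 3.1
# the present study 5.7
# among the 2.7
```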

Page 6: Contrasting corpora to identify phraseological suggestions ...

Final remarks

• Results suggest that many phraseological contrasts observed can be traced back to academic Portuguese.

• Overused and underused phraseology is not always easy to detect and address.

• This study addresses this difficulty by developing a contrastive approach that automatically identifies typical issues and offers alternative suggestions without relying on error-annotated corpora.

• Our methodology can be extended to researchers with different L1 backgrounds by changing the L2 English corpus and the L1 academic corpus.

Example suggestion in context:

Original: The plans also assist in understanding the reasons for adverse outcomes with the aim to improve future mine-to-plan compliance.

Suggestion: The plans also assist in understanding the reasons for adverse outcomes aiming to improve future mine-to-plan compliance.