

This is a contribution from Corpus-Informed Research and Learning in ESP. Issues and applications. Edited by Alex Boulton, Shirley Carter-Thomas and Elizabeth Rowley-Jolivet. © 2012. John Benjamins Publishing Company

This electronic file may not be altered in any way. The author(s) of this article is/are permitted to use this PDF file to generate printed copies to be used by way of offprints, for their personal use only. Permission is granted by the publishers to post this file on a closed server which is accessible to members (students and staff) only of the author’s/s’ institute, it is not permitted to post this PDF on the open internet. For any other use of this material prior written permission should be obtained from the publishers or through the Copyright Clearance Center (for USA: www.copyright.com). Please contact [email protected] or consult our website: www.benjamins.com

Tables of Contents, abstracts and guidelines are available at www.benjamins.com

John Benjamins Publishing Company


Corpora and academic writing

A contrastive analysis of research articles in biology and linguistics

Céline Poudat & Peter Follette
Lexiques, Dictionnaires, Informatique, Université Paris 13 / Bases, Corpus et Langages, Université Nice Sophia Antipolis

Research articles represent a major form of academic discourse and are known to vary based on genre and discipline. Although numerous studies have been conducted to describe and explain these variations, few of them have used quantitative methods. However, text statistics is particularly well developed in France, and the methods and tools developed would be very useful for ESP and EAP teachers and corpus linguists. The present chapter offers an overview of the main methods that have been developed, and combines qualitative and text-statistics approaches to examine variation between the academic fields of biology and linguistics, two disciplines that differ widely with respect to their experimental approaches, methodology, and intellectual communities and history.

Keywords: academic genres; corpus linguistics; text statistics

1.  Introduction

Among the variations that significantly impact academic discourse, those involving discipline and genre are certainly among the most frequently analyzed. Numerous studies have followed the example of Swales (1990), who successfully drew attention to the notion of genre in ESP, and there has been growing interest in disciplinary discourse practices over the last two decades. Research articles, which are undoubtedly the most examined form of academic English, have been extensively studied with respect to numerous aspects, from their rhetorical structure and moves (e.g. Swales 1990; Thompson 1993) to specific linguistic regularities – notably, hedging (Hyland 1998) and voice (Fløttum, Dahl & Kinn 2006; Suomela-Salmi & Dervin 2009). At the same time, disciplinary variations have also been widely examined (e.g. Evangelisti Allori 1994; Duszak 1997), and notable differences have been highlighted between disciplines with respect to style and recurrent patterns. More importantly, the literature has distinguished between disciplines showing more similarities in their writings and those showing more variation, with the hard sciences appearing to be quite different from the humanities and social sciences (HSS) in this regard.

In this context, we have chosen to concentrate on variations in research articles in two significantly different disciplines, one from the hard sciences (biology) and the other from the humanities and social sciences (linguistics). Because these disciplines contrast sharply and reflect two different ways of ‘doing’ science, we found it relevant to compare them and assess what we could learn from the comparison using a corpus-based approach. Biology is certainly one of the most frequently studied disciplines in EAP: the research field is strongly internationalized and large article databases (e.g. PubMed¹) have been developed, enabling biologists to be continuously up-to-date in their fields – and allowing corpus linguists and EAP researchers to benefit from available digitized corpora. Research papers in biology almost always follow the classic IMRAD structure (Introduction, Method, Results and Discussion) describing hypothesis-driven experimental research (Swales 1990). This structure both facilitates the writing process and enables researchers to find the information they need quickly. Biology articles have generally been examined with respect to their individual sections and the regular patterns that can be found therein from one move to another (see for instance Thompson 1993; Kanoksilapatham 2005; Saber this volume). Finally, key papers in biology are nearly always published in English, and for this reason numerous EAP researchers teach academic English to biologists. When writing in English, biologists use a large number of recurrent fixed expressions and stereotypical features; as Myers pointed out:

For established researchers, the English of research articles in their field may be almost a sublanguage, with a very restricted set of stock phrases strung together with a very limited range of vocabulary and syntax and structure. A new report can be written almost as a fill in the blanks. (Myers 1994: 45)

In contrast, linguistics research articles are far from being a case of “fill in the blanks”. Among the HSS, it is recognized that linguistics research varies considerably (Fløttum 2007; Evangelisti Allori 1994; Duszak 1997). Further, as global research collaborations are less common in linguistics than they are in biology, the discipline is notably less internationalized, and numerous national communities exist in the field. For instance, the French linguistics community is very active, and a French linguist can have a successful academic career without ever publishing in English. Accordingly, the most important linguistics articles are not necessarily published in English, nor do they necessarily follow an IMRAD structure, although they generally do comprise an introduction and a conclusion. Finally, courses for linguistics students in academic English – or in academic French, for that matter – are far less widespread, further increasing the heterogeneity of academic writing produced by linguists.

1.  ⟨http://www.ncbi.nlm.nih.gov/pubmed/⟩ (12 April, 2011).

Advances in ESP and EAP have been based to a large extent on the development of corpora and corpus linguistics, which has significantly influenced both linguistic descriptions and language teaching practices. Corpus linguistics concentrates on naturally occurring data and designs corpora to investigate language use and variation, enabling researchers to assess distances between intuition and data as well as to explore large corpora. We will not discuss here whether corpus linguistics is a theory or a methodology (see for example McEnery & Wilson 1996; or Tognini-Bonelli 2001), but in any case numerous corpus tools have been developed to carry out analyses on the patterns found in natural texts. In this process, concordance programs are crucial, and in fact they are essentially the main tools used by corpus linguists and language teachers. For example, when Baker (2010) lists the most popular corpus tools, he actually lists the major concordance programs (WordSmith Tools,² AntConc,³ MonoConc Pro,⁴ Xaira,⁵ SketchEngine,⁶ COBUILD Concordance Sampler⁷). The search for patterns is indeed at the heart of corpus linguistics, as collocations are considered to be one of the two main organizing features of texts (see Sinclair’s idiom and open choice principles [1991]).

Variation in academic discourse has therefore been generally examined through collocation differences, and numerous comparative studies have been conducted within this framework. This approach is more qualitative than quantitative, and most analyses have individually examined articles within relatively small corpora. Although we do not question the relevance and importance of these studies, in the present chapter we have mainly concentrated on quantitative methods, which are significantly less widespread and less established in the field. We took a more inductive approach to reveal the specific linguistic elements that best characterize our (sub-)corpora. This will enable us to present an overview of the quantitative techniques and tools that may be used to support qualitative statements and confirm, or disconfirm, linguistic hypotheses.

2.  ⟨http://www.lexically.net/wordsmith/⟩ (12 April, 2011).

3.  ⟨http://www.antlab.sci.waseda.ac.jp/software.html⟩ (12 April, 2011).

4.  ⟨http://www.athel.com/mp.html⟩ (12 April, 2011).

5.  ⟨http://www.oucs.ox.ac.uk/rts/xaira/⟩ (12 April, 2011).

6.  ⟨http://www.sketchengine.co.uk/⟩ (12 April, 2011).

7.  ⟨http://www.harpercollins.co.uk/about-harpercollins/Imprints/collins/Pages/Collins.aspx⟩

2.  Tools and methodology

Text statistics is particularly well developed in France, where it is known by the name Analyse de Données Textuelles, or ADT. This approach originated in Saint Cloud (Bonnafous & Tournier 1995) in the 1980s and has been corpus-based and language-use oriented since its inception. Researchers built corpora and contrasted them using statistical measures such as relative frequencies or hypergeometric distributions.⁸ Thorough research has been conducted on corpus representativeness and comparisons, as frequencies and statistical measures can only be interpreted within or between corpora. Linguistic units do not have frequencies in language (Lafon 1980), as the norm is basically endogenous to the corpus. Consequently, many parameters have to be considered to guarantee the representativeness and relevance of the sample. Indeed, statistics provides results regardless of the data set, and humans have a natural propensity to interpret and generalize from figures and visual representations.

The methods developed were quite innovative for their time, and they are still frequently used for exploring and contrasting large corpora. Most of them are available in the main ADT software packages: Lexico3⁹ (Salem, Paris 3 University), Hyperbase¹⁰ (Brunet, Nice University), Alceste¹¹ (Reinert, Image company) and DtmVic¹² (Lebart, TelecomParisTech), to name but a few. In spite of the wide range of functions they offer (from concordance to multivariate statistics), these programs are not very well known outside of the French-speaking scientific community; this may be partly due to the fact that most of the programs are in French, with no translation into English, as are some of the key articles that helped shape this research field. However, the methods and software developed would be very useful for corpus linguists, who mainly work with concordance programs that include comparatively basic statistical measures. The text statistics field benefits from advances made in lexicometry, which has developed a wide range of quantitative methods, enabling the user, for instance, to describe the vocabulary and style of a given author or political party. It also benefits from advances in Natural Language Processing (NLP), and notably from automatic tagging and parsing, which make it possible to handle annotated or even multi-annotated corpora. Moving beyond an outdated conception of texts as ‘bags of words’, the approach is basically textual and considers texts sequentially and in terms of their structure, notably through searches for structuring patterns or motifs (Longrée & Mellet 2009) that echo Sinclair’s idiom principle (1991). Text statistics and corpus linguistics obviously share common interests, and would benefit substantially by cooperating to accomplish intersecting goals.

8.  The hypergeometric distribution, on which Fisher’s exact test is based, is a discrete probability distribution; see Lafon (1980).

9.  ⟨http://www.tal.univ-paris3.fr/lexico/lexico3.htm⟩ (12 April, 2011).

10.  ⟨http://www.unice.fr/bcl/spip.php?rubrique38⟩ (12 April, 2011).

11.  ⟨http://www.image-zafar.com/index_alceste.htm⟩ (12 April, 2011).

12.  ⟨http://www.dtmvic.com/⟩ (12 April, 2011).

Furthermore, a recent project¹³ involving the major software designers (Hyperbase, Lexico3, Weblex¹⁴) has led to the development of a standalone open-source platform, TXM, which should be powerful enough to challenge the existing corpus tools. As we will see in the following sections, the TXM workbench includes an impressive concordance program that notably enables researchers to perform multi-level searches on morphosyntactically labeled corpora (i.e. on words, lemmas and morphosyntactic categories), thanks to a Corpus Query Processor¹⁵ that also allows users to submit regular expressions. In addition, TXM is a corpus manager in which users can partition corpora or create sub-corpora according to the metadata they have encoded (e.g. author, genre, domain, date). In its current version, it already offers a wide range of statistical functions, some of which will be discussed in further detail in the present chapter.
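The kind of multi-level search described above can be illustrated with a short Python sketch. It does not reproduce TXM's interface or the Corpus Query Processor's query syntax: the `find_pattern` function, the token dictionaries and the hand-annotated toy sentence are all hypothetical, and the tags only imitate a Penn-style tagset. Each position in the pattern constrains one or more annotation levels (word, lemma, part of speech) with a regular expression.

```python
import re
from typing import Dict, List, Sequence, Tuple

# One token = one dictionary with three annotation levels (hypothetical layout).
Token = Dict[str, str]


def find_pattern(tokens: Sequence[Token],
                 pattern: Sequence[Dict[str, str]]) -> List[Tuple[int, int]]:
    """Return (start, end) index pairs of token sequences matching `pattern`.

    Each pattern position maps an annotation level ("word", "lemma", "pos")
    to a regular expression that must match the whole attribute value.
    """
    compiled = [{level: re.compile(rx) for level, rx in position.items()}
                for position in pattern]
    size = len(compiled)
    hits = []
    for start in range(len(tokens) - size + 1):
        window = tokens[start:start + size]
        if all(rx.fullmatch(tok.get(level, ""))
               for tok, position in zip(window, compiled)
               for level, rx in position.items()):
            hits.append((start, start + size))
    return hits


# Hand-annotated toy sentence (illustrative only, not taken from the corpus).
sentence = [
    {"word": "The", "lemma": "the", "pos": "DT"},
    {"word": "expression", "lemma": "expression", "pos": "NN"},
    {"word": "of", "lemma": "of", "pos": "IN"},
    {"word": "genes", "lemma": "gene", "pos": "NNS"},
    {"word": "regulates", "lemma": "regulate", "pos": "VVZ"},
    {"word": "growth", "lemma": "growth", "pos": "NN"},
]

# "Noun of Noun": a part-of-speech constraint, a word form, another POS constraint.
noun_of_noun = find_pattern(
    sentence, [{"pos": "NN.*"}, {"word": "of"}, {"pos": "NN.*"}])

# Any inflected form of the lemma "regulate".
regulate_hits = find_pattern(sentence, [{"lemma": "regulate"}])
```

In this sketch a single query mixes levels, which is the point of a multi-level search: the first pattern retrieves "expression of genes" through tags and a word form, the second retrieves "regulates" through its lemma. A real query engine would work on an indexed corpus rather than scanning a list.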

Most of the analyses carried out in this study were computed with TXM. We also resorted to the previously mentioned Lexico3 software, which is still one of the leading tools used in text statistics. Lexico3 notably computes repeated segments, which are useful in two primary ways: first, they give us insight into the way language is used in each corpus and the kinds of themes or patterns that are typical of each type of article, and second they reflect how language is often learned – not just word-by-word but also phrase-by-phrase. Bolinger (1976) argued that users memorize prefabrications or ‘prefabs’ to ensure language fluency in the real time process of communication. In Sinclair’s terms, “the language user has available to him a large number of preconstructed or semi-preconstructed phrases that constitute single choices, even though they appear to be analyzable into segments” (1987: 320). Applied to relevant corpora, repeated segments, combined with advanced concordances, are quite helpful for capturing these prefabs, which seem crucial for language acquisition.

13.  See the Project website at ⟨http://textometrie.ens-lyon.fr/⟩ (12 April, 2011).

14.  Developed by Heiden: ⟨http://weblex.ens-lsh.fr/doc/weblex/⟩ (12 April, 2011).

15.  IMS Corpus Workbench (Christ et al. 1999): ⟨http://www.ims.uni-stuttgart.de/projekte/CorpusWorkbench/CQPUserManual/HTML/⟩ (12 April, 2011).
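The notion of repeated segments can be made concrete with a minimal Python sketch. It is only an approximation of the idea, not Lexico3's algorithm: the `repeated_segments` function and its parameters are hypothetical, the toy text is invented, and the sketch ignores sentence boundaries and punctuation, which a dedicated tool would be expected to handle.

```python
from collections import Counter
from typing import Dict, Sequence, Tuple


def repeated_segments(tokens: Sequence[str],
                      min_len: int = 2,
                      max_len: int = 4,
                      min_freq: int = 2) -> Dict[Tuple[str, ...], int]:
    """Count every contiguous token sequence of min_len..max_len tokens
    and keep those occurring at least min_freq times."""
    counts: Counter = Counter()
    for size in range(min_len, max_len + 1):
        for start in range(len(tokens) - size + 1):
            counts[tuple(tokens[start:start + size])] += 1
    return {segment: freq for segment, freq in counts.items()
            if freq >= min_freq}


# Invented toy text (not from the BIOLING corpus).
text = ("it has been shown that the gene is expressed and "
        "it has been shown that the protein is expressed")
segments = repeated_segments(text.split())

# Longest repeated segments first, as candidate 'prefabs'.
longest_first = sorted(segments, key=len, reverse=True)
```

Sorting the result by length brings out the longer recurring phrases, such as "it has been shown", which are the candidates one would then inspect in a concordance.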

3.  Corpus description

The corpus (hereafter referred to as the BIOLING corpus) consists of 144 published articles in English, and comprises three different subcorpora:

– 90 articles in biology, divided into two sets, to observe genre variations:
    – 45 research articles (285,211 words)
    – 45 review articles (240,270 words)

– 54 articles in linguistics (389,614 words), to examine domain variation.
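As a quick check on these figures, the following Python sketch adds up the subcorpora and derives the mean article length in each. The numbers are those given in the list above; the dictionary layout and variable names are illustrative assumptions.

```python
# Article and word counts as listed for the three subcorpora.
subcorpora = {
    "BIOL research": {"articles": 45, "words": 285_211},
    "BIOL review": {"articles": 45, "words": 240_270},
    "LING": {"articles": 54, "words": 389_614},
}

total_articles = sum(part["articles"] for part in subcorpora.values())
total_words = sum(part["words"] for part in subcorpora.values())

# Mean number of words per article in each subcorpus.
mean_length = {name: part["words"] / part["articles"]
               for name, part in subcorpora.items()}

for name, words in mean_length.items():
    print(f"{name}: {words:.0f} words per article on average")
print(f"Total: {total_articles} articles, {total_words} words")
```

Because the parts differ in size, any frequency comparison between them has to take these totals into account rather than rely on raw counts.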

The biology subcorpus is divided into two distinct subcorpora containing 45 articles each. The two sets will enable us to observe genre distinctions. Indeed, there are two main categories of published articles in biology, research articles and review articles. The central feature of research articles is that they present original data that has not been published elsewhere. Review articles, on the other hand, provide an overview of a research area but without providing original data. The articles were all written by presumed native speakers (based on their presence within an English-speaking university). To achieve parity between the two biology genres, the review articles were selected in almost every case from the same journals and within one year of the research articles. The articles came from a total of 42 different journals¹⁶ related primarily to various aspects of molecular biology and genetics, and were published between the years 2004 and 2009.

The linguistics corpus is smaller, and comprises 54 articles, as the distinction between research and review articles is not that clear in the field, and in HSS in general – although it may be more established in coming years, as linguistics journals refer increasingly to the distinction. For instance, Lingua already makes a distinction between the two genres when describing its aims and scope:

Lingua publishes papers of any length, if justified, as well as review articles surveying developments in the various fields of linguistics, and occasional discussions. (emphasis added)¹⁷

Nevertheless, the great majority of the texts published in Lingua are labeled “original research articles” and the distinction is still not made in most journals, which simply distinguish between articles and book reviews.

16.  AIDS, Applied Environmental Microbiology, Biochemistry, Biotechnology Progress, Blood, BMC Bioinformatics, Cancer Cell International, Cancer Letters, Cell, Cell Death and Differentiation, Cell Metabolism, Cell Signaling, Current Biology, Current Genetics, Current Opinions in Genetics and Development, Development, Developmental Cell, EMBO Journal, EMBO Reports, European Journal of Biochemistry, FEMS Yeast Research, Gene Expression Patterns, Genes and Development, Genetics, Human Genetics, Journal of Endocrinology, Journal of Immunology, Journal of Virology, Mechanisms of Development, Molecular Pharmacology, Nature, Nature Biotechnology, Nature Reviews in Genetics, Nucleic Acids Research, PLoS One, PNAS, Science, Stem Cells, Trends in Cell Biology, Trends in Genetics, Vaccine, and Yeast.

The 54 linguistics articles were extracted from three international journals published between 2001 and 2002: Journal of Pragmatics (20 articles), Linguistics (16) and English for Specific Purposes (18).

In spite of the different designs of the two corpora, we tried to describe them with the same metadata. We distinguished between single- and multi-authored articles, and in this regard the difference between biology and linguistics was very marked, as most of the articles of the linguistics subset were single-authored (42 single-authored vs 12 multi-authored): in linguistics, science is not yet a team activity, but a rather solitary process (see Carter-Thomas & Chambers this volume, for the case of economics research articles). By contrast, most articles in biology are multi-authored: all of the research articles included in the corpus were written by multiple authors, as were all but eight of the review articles. Figure 1 gives an overview of the BIOLING corpus.

[Figure 1: bar chart of the number of articles (vertical axis, 0 to 60) in each subcorpus (LING research articles, BIOL research articles, BIOL review articles), showing multi-authored and single-authored articles.]

Figure 1. Overall structure of the BIOLING corpus

17.  ⟨http://www.elsevier.com/wps/find/journaldescription.cws_home/505590/description #description⟩ (12 April, 2011).


Author affiliations, keywords, figures, tables and references were removed from the texts, as were most of the examples¹⁸ we found in the linguistics articles, as they are subject to different textual regimes and are very often in other languages. Finally, the corpus was POS-tagged with TreeTagger (Schmid 1994), using the default parameters for English.¹⁹

4.  Results: Domain and genre variations

We first explored domain and genre variations to assess which categories were most interesting to compare, and to see what we could learn about their similarities and differences. To this end, we made the most of the three annotation levels we had (words, lemmas and parts-of-speech) and used hypergeometric distributions to determine which linguistic criteria were specific to biology and linguistics (Section 4.1), and what best distinguished research from review articles (Section 4.2).

4.1  Domain variation

We first used TXM and computed ‘specificities’ to determine which criteria best differentiate biology from linguistics. Specificities share the same goals as the keywords facility of WordSmith,²⁰ with two exceptions: (a) two or more corpora can be compared (no reference corpus is needed²¹ and the corpora can have different sizes); and (b) the statistical tests are different – WordSmith options allow the user to choose either the classic Yates’ chi-square test²² or the log likelihood test, originally applied to textual data by Dunning in 1993.²³ Specificities, on the other hand, are calculated according to a probabilistic model (Lafon 1980) based on hypergeometric distributions. This measure is still commonly used in text statistics, and is, incidentally, available in most ADT software packages (e.g. Lexico3 or Hyperbase). Different levels of specificities can be computed, depending on the corpus annotation. In our case, three levels were available: words, lemmas, and morphosyntactic categories. The results are given in a table whose rows list all the linguistic units found within a given partition (the corpus is divided into different ‘parts’ according to the categories considered, e.g. authors, dates, genres, or disciplines). In addition to providing the total frequency of each linguistic unit within the corpus, each row has as many columns as there are parts, and each column gives the logarithm of the specificity score for the relevant part. The scores can be positive or negative, according to the over- or under-use of the given linguistic unit, and scores around zero are considered inconclusive, or neutral.

The two sections of the BIOLING corpus, biology (BIOL) and linguistics (LING), were submitted to computations of specificities for the three levels of annotation (words, lemmas and parts-of-speech). Figure 2 shows the results we obtained at the lemma level, on which we will mainly concentrate in the present sub-section. Note that the lemmas are sorted in descending order of their specificity to the linguistics part. The top lemmas are not associated with a specific score but are instead labeled +infinity: they exceed the hypergeometric threshold, which TXM does not report in this case, and are therefore highly specific (or highly non-specific, depending on whether the item is over- or under-used). We decided to concentrate on these highly specific items to explore domain variation.

Figure 2. TXM lemma specificities – domain partition

18.  It was not possible to remove all the examples, as they are often repeated in the body of the articles.

19.  Tagset available at: ⟨http://www.sketchengine.co.uk/tagsets/penn.html⟩ (12 April, 2011).

20.  Developed by Scott: ⟨http://www.lexically.net/wordsmith/⟩ (12 April, 2011).

21.  The measure is nevertheless available in Hyperbase.

22.  This test is not well suited to this use, as the chi-square statistic is computed on the sample size, and consequently depends excessively on corpus size.

23.  Thanks to Dunning, the log likelihood test is often used by corpus linguists to assess the distance between two word cooccurrences. However, it needs more data than hypergeometric distributions to provide relevant results and it does not consider the theoretical frequencies of the data occurrences, and only gives approximations (Kilgarriff 2001). Dunning’s test takes into account a parameter that hypergeometric distributions disregard, which is the presence of word A when word B is absent, although a precise evaluation of the two tests would be needed to go further.
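The specificity computation described in this section can be sketched in Python under a few stated assumptions. The symbols are ours, not the chapter's: T is the corpus size, t the size of the part, F the total frequency of the unit and f its frequency in the part. The sketch returns a signed base-10 logarithm of the hypergeometric tail probability, positive for over-use and negative for under-use, and decides the direction by comparing f with the expected frequency F·t/T. Actual tools such as TXM may use other conventions, thresholds or numerical methods, so this is an illustration of the model rather than a reimplementation.

```python
import math


def _log_comb(n: int, r: int) -> float:
    """Natural logarithm of the binomial coefficient C(n, r)."""
    return math.lgamma(n + 1) - math.lgamma(r + 1) - math.lgamma(n - r + 1)


def _log_pmf(k: int, T: int, F: int, t: int) -> float:
    """Log-probability of observing exactly k occurrences in the part."""
    return _log_comb(F, k) + _log_comb(T - F, t - k) - _log_comb(T, t)


def _log_sum_exp(values):
    """Stable log(sum(exp(v))) for a non-empty list of log-probabilities."""
    peak = max(values)
    return peak + math.log(sum(math.exp(v - peak) for v in values))


def specificity(f: int, T: int, F: int, t: int) -> float:
    """Signed log10 specificity of a unit in one part of a corpus.

    Over-use  (f at or above expectation): -log10 P(X >= f), a positive score.
    Under-use (f below expectation):        log10 P(X <= f), a negative score.
    """
    k_min = max(0, t - (T - F))   # fewest occurrences the part can contain
    k_max = min(F, t)             # most occurrences the part can contain
    if not k_min <= f <= k_max:
        raise ValueError("f is impossible for these values of T, F and t")
    if f >= F * t / T:
        tail, sign = range(f, k_max + 1), -1.0
    else:
        tail, sign = range(k_min, f + 1), 1.0
    # Clamp at 0 to absorb floating-point error in the summed probability.
    log_p = min(0.0, _log_sum_exp([_log_pmf(k, T, F, t) for k in tail]))
    return sign * log_p / math.log(10)


# Invented toy figures: a unit occurring 10 times in a 1,000-token corpus,
# examined in a part of 100 tokens.
over_used = specificity(f=10, T=1000, F=10, t=100)   # all occurrences in the part
under_used = specificity(f=0, T=1000, F=10, t=100)   # none in the part
```

Working with log-probabilities keeps the computation usable for corpora of several hundred thousand tokens, where the probabilities themselves are far too small to represent directly; extremely small probabilities are presumably what a tool reports as +infinity or -infinity.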

If we concentrate on lemmas,²⁴ 528 items appear to be highly specific to biology, whereas 534 are highly specific to linguistics: this large number of specific items already shows the great distance that separates these two fields. Indeed, the number and nature of the specificities obtained fluctuate substantially from one comparison to another and, in general, the more two corpora differ, the more specificities there are (compare with the slight differences we obtained below when contrasting the two biology genres). Unsurprisingly, a large percentage of these items do not appear at all in the other field: 311 of the 528 lemmas that are highly specific to biology occur only in biology, while 202 of the 534 lemmas that are highly characteristic of linguistics are found only in linguistics. We concentrated on the 100 most frequent items out of these two specificity lists, and distinguished between lexical (i.e. content words) and stylistic markers (i.e. function words, punctuation and symbols). Table 1 shows the lexical markers that were absent or almost absent from the other field (<20 occurrences), as we considered them to be more characteristic. Items are sorted in descending order of their under-use in the other field, and we highlighted in bold those markers that are positively specific (+infinity) in one field and negatively specific (-infinity) in the other.

What is striking in Table 1 is that biology tends to be characterized by content markers that reflect a specialized terminology, whereas linguistics uses a wider variety of function words that signal a particular rhetorical style.

The terms that are characteristic for all of the biology articles include a combination of general scientific words – with significant relevance for biological research – and biological terms that are specifically related to the fields of molecular biology and genetics (e.g. protein, gene, DNA). Interestingly, the lemmas that are absent in linguistics are mostly nouns, whereas the lemmas that are rare in linguistics also include verbs. The verbs we noted were all action predicates, which may be divided into two groups: (a) verb predicates related to biological properties or reactions, whose arguments are the scientific objects under examination (for instance, proteins mediate, or regulate); and (b) verbs that refer to the biological method and its performance, like measure, compare, test, result, which were found in the 100 lemma list but were not reported in Table 1 as they exceeded 20 occurrences in linguistics. The latter verbs are certainly common to most experimental sciences, which write similar scientific narratives organized around series of temporal and measurable steps (see the specific function words we identified, e.g. during, after, and lemmas like loss).

24.  Note that lemmas also include punctuation marks and symbols.

Table 1. Domain specificities: Biology vs linguistics – lemma level

Biology
    Lexical markers absent from the other field: protein, gene, mouse, mutant, mutation, DNA, GFP, RNA, wild, chromosome, receptor, strain, embryo, kinase, membrane
    Lexical markers rare in the other field (<20 occ.): cell, pathway, transcription, bind, region, regulate, mediate, anti, growth, loss, tissue
    Stylistic markers:
        Function words (F): and, with, by, at, after, during
        Punctuation (P): dots, dashes, brackets, square brackets, slashes, underscores, question marks
        Symbols (S): %, +, =

Linguistics
    Lexical markers absent from the other field: speaker, language, literal, sentence, discourse, linguistic, utterance, English, semantic, verb, speech
    Lexical markers rare in the other field (<20 occ.): meaning, word, student, say
    Stylistic markers:
        Function words (F): the, of, be, to, a, that, as, this, not, it, or, on, an, which, do, they, their, can, other, but, one, may, between, what, more, such, I, only, some, than, there, if, would, will, about, his, he, so, where, like, should
        Punctuation (P): commas, quotation marks, colons, single quotation marks
        Symbols (S): none

Although linguistics unquestionably works with a specific terminology (e.g. speaker, language, sentence or discourse), the domain is primarily characterized by the style used, and notably by the significant use of: (a) “Noun of Noun” patterns (twice as many occurrences and distinct patterns identified as in biology research articles: 7479 vs 4084 occurrences; 3454 vs 1853 distinct patterns), explaining the high specificity score of the determiner the; and (b) function words showing more rhetorical caution. Note, for instance, the presence of modals or hedges (can, may, would, will, should), which weaken the truth value of the findings. These linguistic cues relate in large part to the linguistics method, and more generally to the interpretation process: it is once again tempting to generalize the results we obtained for linguistics to most interpretative sciences.
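Since the subcorpora differ in size, raw counts such as these are easier to compare once they are expressed as relative frequencies. The Python sketch below does this for the “Noun of Noun” occurrences, assuming that the two counts refer to the linguistics subcorpus (389,614 words) and to the biology research articles (285,211 words) as described in Section 3, and that those word counts are an adequate denominator. The `per_10k` helper is a hypothetical name introduced for the illustration.

```python
def per_10k(count: int, corpus_size: int) -> float:
    """Relative frequency expressed per 10,000 words."""
    return 10_000 * count / corpus_size


# Occurrence counts from the text; subcorpus sizes assumed from Section 3.
ling_rate = per_10k(7479, 389_614)   # linguistics articles
biol_rate = per_10k(4084, 285_211)   # biology research articles

raw_ratio = 7479 / 4084              # ratio of raw counts
rate_ratio = ling_rate / biol_rate   # ratio once corpus size is factored in

print(f"LING: {ling_rate:.1f} per 10,000 words")
print(f"BIOL research: {biol_rate:.1f} per 10,000 words")
print(f"raw ratio {raw_ratio:.2f}, size-adjusted ratio {rate_ratio:.2f}")
```

Under these assumptions the pattern remains clearly more frequent in linguistics after adjustment, although the size-adjusted ratio is smaller than the ratio of raw counts, which is one reason why specificity scores rather than raw frequencies are used for the comparison.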


Contrasting biology and linguistics highlights a continuum between general scientific terms and specialized domain terminology, and this is mainly due to the fact that biology and linguistics are very different disciplines within academic discourse. The two fields advance separately and enjoy few scientific (note 25) and academic relations, despite some overlap in methodology and terminology in certain subfields. For this reason, trends affecting the writing of biology articles would likely have very little impact on linguistics – and vice versa – although they might influence closer disciplines such as chemistry. Therefore, contrasting biology and linguistics allows us to capture much more than a simple difference between two research fields: it also enables us to oppose two ways of doing science, echoing the traditional distinction between natural and cultural, or interpretative, sciences. Figure 3 illustrates the two levels of differences we obtained within academic discourse.

[Figure 3. Lexical variations in academic discourse: a continuum? Diagram labels: Academic discourse; Natural Sciences (NS); Humanities Sciences (HS); Biology; Linguistics. Question posed at the top level: Is there a general academic terminology? Natural vs Humanities sciences – NS: experimental discourse (e.g. compare, test, measure); HS: speculative discourse (e.g. modals). Specialized terminologies – Biology: protein, gene, mouse…; Linguistics: speaker, language, literal…]

While academic discourse is basically structured into research fields and subfields, it is built on genres as well. The next section further refines this distinction while also enabling us to further explore hypergeometric distributions and to present the traditional ADT method of correspondence factor analysis.

4.2  Genre variation

After setting aside the linguistics corpus, we contrasted biology research and review articles (henceforth RAs and RVAs), resorting once again to hypergeometric distributions and high specificities. The two sets of articles first turned out to be notably closer: only 42 lemmas are highly specific to RAs and 26 to RVAs. Moreover, no lemmas were totally absent from the other set. As a result, we used the specificity score to select the lemmas that were significantly under-used in the other genre. We also used the part-of-speech (POS) level, as scientific genres are known to involve morphosyntactic variations and changes (Biber 1988; Poudat 2006). Table 2 shows the main specificities we obtained using the lemma and the POS levels; the items are sorted in descending order of their under-use in the other genre, whose score is given. As in Table 1, items that are positively specific (+infinity) in one genre and negatively specific (−infinity) in the other are highlighted in bold.

25.  Studies highlighting similarities based on concepts of evolution and phylogeny can nevertheless be found (e.g. Atkinson & Gray 2005).

Table 2. Genre specificities: Research vs review articles – lemma and POS levels

Research articles
  Lexical markers: GFP (−62), mutant (−47), Fig (−45), figure (−39), wild (−34), use (−34), data (−33), strain (−29), experiment (−26), compare (−26), expression (−23), antibody (−22), time (−20), mRNA (−17), indicate (−17)
  Stylistic markers: (F) we (−109), with (−24), each (−22), at (−20); (P) underscores, slashes (−57), brackets (−55), colons (−48), dashes (−41); (S) = (−94), % (−23)
  Parts of speech: proper nouns (note 26), verb to be conjugated in the simple past, other verbs conjugated in the simple past (−124), numerals (−48), symbols (−48)

Review articles
  Lexical markers: cell (−69), stem (−59), tumor (−38), factor (−36), complex (−28), structure (−27), process (−25), mechanism (−22), development (−21), study (−17)
  Stylistic markers: (F) can (−61), have (−48), that (−35), such (−34), it (−33), they (−32), a (−25), to (−19), their (−18), may (−18); (P) square brackets (−38), commas (−28); (S) NONE; abbrev. et al. (−79)
  Parts of speech: modals (−102), relative pronouns (−80), verb be conjugated in the simple present (−74), be (−64), other verbs conjugated in the simple present (−51), been (−49), is (−44), verb have, 3rd person sing. present (−33), singular nouns (−32), plural nouns (−25), adjectives (−25)

26.  The presence of proper nouns in first position was a real surprise; after investigating this issue in the annotated corpus, we found the tagging to be faulty. The numerous symbols as well as the gene and cell names caused extensive tagging mistakes. It is important to recall that TreeTagger has not been trained on scientific texts, and we have shown in previous work how beneficial it is to use adapted annotations (e.g. Poudat 2006).


Table 2 further refines the previous results: RAs resort significantly to an experiment-based vocabulary (e.g. use, data, experiment, compare, time, indicate), whereas RVAs take a broader view and use more general scientific terms (e.g. factor, structure, process, mechanism, development) and more nouns in general (both singular and plural nouns are highly specific to RVAs). Plural nouns may also be more widely used, as revealed by the extensive use of third-person plurals (they, their) denoting longer reference chains referring to plural discourse entities – and more specifically, the plural nouns cells, tumors, factors, structures, processes.

(1) These nuclear bodies self-assemble by virtue of nucleation around certain molecular components and are continuous with the nucleoplasm in which they reside, and in many cases their appearance and their numbers within the nuclear landscape are connected to cellular activity. – EMBOJ253469 – Shal-Tov (note 27)

It is worth highlighting here the presence of the lemma study, which is seldom employed in a singular sense: reference to previous studies is indeed central in reviews, and these citations are typically accompanied by the use of square brackets (note 28) as well as the recurring et al. abbreviation and the reader-friendly imperative see referring to the cited piece of work.

RAs and RVAs are also clearly associated with different verb tenses: the simple past is strongly characteristic of RAs, while RVAs tend to use verbs conjugated in the simple present. This echoes what Swales has already shown (Swales 1990; Swales & Feak 2004): researchers typically use verbs in the simple past tense to present new results – which is precisely what RAs do; in contrast, while RVAs do survey past results, they largely focus on the conclusions of the studies, discussing them in the present tense. However, we wanted to go further and determine which of the verbs conjugated in the simple past and in the simple present were the most specific to RAs and RVAs. To achieve this, we used TXM to build a lexical table of all the verbs conjugated in the two tenses, distributed among the two genres (with a minimum threshold of 10 occurrences), on which we ran a specificity calculation. The new table included 291 conjugated verbs. Table 3 lists the most specific verbs we obtained per genre.
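The specificity calculation can be illustrated with a minimal sketch. It assumes the hypergeometric model this section relies on and reports the score as a signed base-10 logarithm of the tail probability; the function name, the sign convention and the toy counts are illustrative assumptions, not TXM's actual implementation and not figures from the corpus.

```python
from math import comb, log10


def specificity(f: int, t: int, F: int, T: int) -> float:
    """Signed log10 tail probability of observing f hits in a part of size t.

    f: occurrences of the item in the part (e.g. RAs)
    t: total number of tokens in the part
    F: occurrences of the item in the whole lexical table
    T: total number of tokens in the whole lexical table
    """
    total = comb(T, t)

    def pmf(k: int) -> float:
        # Probability of exactly k occurrences when t tokens are drawn
        # without replacement from T tokens, F of which are the item.
        return comb(F, k) * comb(T - F, t - k) / total

    lowest, highest = max(0, t - (T - F)), min(F, t)
    expected = F * t / T
    if f >= expected:
        # Over-use: P(X >= f), reported as a positive score.
        p = sum(pmf(k) for k in range(f, highest + 1))
        return -log10(p) if p > 0 else float("inf")
    # Under-use: P(X <= f), reported as a negative score.
    p = sum(pmf(k) for k in range(lowest, f + 1))
    return log10(p) if p > 0 else float("-inf")


# Hypothetical counts: an item with 50 occurrences in a 1000-token table,
# observed in a part that holds 400 tokens (expected frequency: 20).
score_over = specificity(35, 400, 50, 1000)   # more hits than expected
score_under = specificity(5, 400, 50, 1000)   # fewer hits than expected
```

Under this convention, a large positive score marks an item as characteristic of the part and a large negative score marks it as under-used there, which is how the bracketed scores in the tables of this section can be read.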

Verbs characteristic of RAs are once again related to the experimental design: some of these refer to the experimental process itself (e.g. did, used), others to the gathering of results (e.g. examined), and others to their interpretation (e.g. indicated). A few verbs conjugated in the simple present or found in their infinitive form were also present (test, determine, compare). In contrast, verbs that are specific to RVAs are more general and explanatory, and, notably, are all in the present tense.

27.  To refer to the articles of the corpus, we use the following format: Journal name (here, EMBO Journal), Volume, Issue and Author name(s).

28.  In biology journals, articles are commonly referred to with a number enclosed in square brackets, e.g. [1] or [26].

Table 3. Verb specificities: RAs and RVAs

RAs                              RVAs
Verb         Specif. score       Verb            Specif. score
did          11.8                acquire         9.6
examined     11                  integrate       9.2
indicated    11                  maintain        8.3
observed     8.9                 understand      8
test         8.5                 renew           7.9
determine    7.1                 contribute      7.7
used         6.8                 lead            6.6
tested       6.2                 differentiate   6
compare      6.1                 help            5.8
appeared     5.4                 regulate        5.7
described    5.2                 prevent         5.7

Research articles are also significantly characterized by the use of the we pronoun. We is in fact strongly characteristic of RAs, even if we add the linguistics articles (henceforth LAs): RAs: +infinity, RVAs: −79 and LAs: −5. The role of we is quite clear in RAs, and obviously refers to the authors of the paper, as the performers of the experiments, while in the two other subcorpora, we is endowed with different values – and the frequencies are significantly lower (Table 4).

RVAs, and to a large extent LAs, are non-experimental and comprise more modal verbs than do RAs, but the collocates of the we pronoun are very different: we has a more inclusive value, showing more reader-friendliness in linguistics articles (e.g. we have seen, we shall see, we see that), although both sets resort to the future tense to direct the reader through the article; the pronoun is also associated with subjective verbs related to the interpretation process (e.g. we believe that), or restricting the value of the findings (e.g. we seem to have three levels of communicated contents – JP3404 Vicente), at least at a rhetorical level. On the other hand, we in RVAs shows the authors playing a more active role in the review process, making proposals or stating new findings on the basis of the articles they are reviewing (e.g. we searched for, we propose that, we determined that).
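The kind of index behind these collocates of we (the pronoun followed by two words) can be approximated in a few lines. This is only a sketch under stated assumptions: it matches the word form case-insensitively instead of querying a lemma annotation, it expects text that is already tokenized, and the function name and sample sentence are invented for illustration.

```python
from collections import Counter


def we_trigrams(tokens: list[str]) -> Counter:
    """Count 3-token sequences whose first token is the pronoun 'we'."""
    counts: Counter = Counter()
    for i in range(len(tokens) - 2):
        if tokens[i].lower() == "we":
            # Keep the surface form, so "We found that" and
            # "we found that" are counted as separate entries.
            counts[" ".join(tokens[i:i + 3])] += 1
    return counts


# Invented sample, already split on whitespace.
sample = "We found that X . Here we found that Y and we show that Z".split()
index = we_trigrams(sample)
```

Keeping the surface form rather than lowercasing the whole sequence separates sentence-initial uses from sentence-internal ones, which can itself be informative about where the pronoun appears.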

Therefore, although the three sets share the generic name of article, they clearly do not fulfill the same objectives and this considerably affects the linguistic properties and styles of the articles; on that basis, they should be considered as three distinct but partially overlapping scientific genres.

Table 4. Collocation statistics of the pattern we followed by 2 words (note 29)

RAs                    F   RVAs                   F   Linguistics          F
we found that         26   we need to             8   we have seen        15
we show that          22   we searched for        6   we do not           14
We found that         20   we do not              5   we have to          11
we examined the       19   we review the          5   we would expect      9
We conclude that      19   we propose that        5   we need to           9
we have shown         14   We propose that        5   we shall see         6
we did not            13   we focus on            4   we believe that      6
We suggest that       11   we determined that     3   we will consider     6
we conclude that      10   we discuss the         3   we see that          5
we demonstrate that   10   we will focus          3   we wanted to         5
we determined the      9   we will discuss        3   we get the           4
We find that           9   We found that          3   we seem to           4
We next examined       9   we may be              2   we consider all      4
we used the            8   we found that          2   we consider the      4
we propose that        7   we observe that        2   we should not        4
We show that           7   we describe the        2

29.  TXM index function – CQP query: [lemma=“we”][][] – RAs vs RVAs vs LAs.

The final question we addressed in this part of the analysis concerns the structure of the corpus according to these domain and genre categories. How are the three genres and the two domains opposed? To answer this question, we computed a correspondence factor analysis (CFA), a classic method in text statistics. Developed by Benzécri (1973), CFA is a multivariate technique that detects associations and oppositions between individuals (texts, domains, authors, genres, etc.) and observations (words, lemmas, morphosyntactic categories, etc.) within a contingency table. Individuals and observations can be visualized separately or simultaneously on two-dimensional factor maps, enabling the user to assess the distances between individuals and observations. To measure the distance between the three genres (RAs, RVAs and LAs), we considered the verbs conjugated in the simple present and past tenses as well as the personal pronouns, which were significant criteria in our previous analyses. We chose a threshold of a minimum of 50 occurrences within the BIOLING corpus as a whole.
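The distances that a factor map displays derive from the chi-square distance between row profiles of the contingency table. The sketch below computes only that distance and the total inertia; it does not extract the factor axes themselves, which would require a singular value decomposition. Function names are illustrative, and the code assumes a table with no empty row or column.

```python
def chi2_distances(table: dict[str, list[float]]) -> dict[tuple[str, str], float]:
    """Chi-square distance between every pair of row profiles."""
    n = sum(sum(row) for row in table.values())
    n_cols = len(next(iter(table.values())))
    # Column masses: share of the grand total held by each column.
    col_mass = [sum(row[j] for row in table.values()) / n for j in range(n_cols)]
    # Row profiles: each row divided by its own total.
    profiles = {name: [x / sum(row) for x in row] for name, row in table.items()}
    names = list(table)
    out = {}
    for idx, a in enumerate(names):
        for b in names[idx + 1:]:
            d2 = sum((pa - pb) ** 2 / m
                     for pa, pb, m in zip(profiles[a], profiles[b], col_mass))
            out[(a, b)] = d2 ** 0.5
    return out


def total_inertia(table: dict[str, list[float]]) -> float:
    """Total inertia of the table: chi-square statistic / grand total."""
    rows = list(table.values())
    n = sum(sum(r) for r in rows)
    col_tot = [sum(r[j] for r in rows) for j in range(len(rows[0]))]
    chi2 = 0.0
    for r in rows:
        row_tot = sum(r)
        for j, observed in enumerate(r):
            expected = row_tot * col_tot[j] / n
            chi2 += (observed - expected) ** 2 / expected
    return chi2 / n


# Hypothetical counts (columns: simple past, simple present, pronouns).
toy = {"RA": [60, 20, 20], "RVA": [20, 60, 20], "LA": [10, 50, 40]}
distances = chi2_distances(toy)
inertia = total_inertia(toy)
```

Each factor axis accounts for a share of this total inertia, which is what the percentages printed on the axes of a factor map report.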

The results are eloquent enough: linguistics articles are clearly opposed to biology RAs and RVAs on the first (horizontal) axis, whereas most RAs are opposed to RVAs on the second (vertical) axis (Figure 4).

[Figure 4. First factor map of the CFA – individuals. Axis 1 (horizontal): inertia = 9.76%; axis 2 (vertical): inertia = 3.73%.]

The above analysis contrasting biology and linguistics articles using quantitative methods revealed key distinctions between experimental sciences and HSS, involving both the terminologies used and the styles employed. Nevertheless, we found it relevant to go further and to adopt a more qualitative focus. The following section will concentrate on a set of linguistic phenomena and examine their use in the corpora and their role in structuring the texts.


5.  Results: Textual organizers in the two disciplines

In addition to the above analysis, we conducted a more targeted study of specific terms that carry out specific textual functions and play a role in text structuring within articles. Two sets of linguistic markers were selected: (a) a set of features taken from previous studies that are considered to signal text moves in articles, particularly those in biochemistry (Swales & Feak 2004; Kanoksilapatham 2005); and (b) some of the markers that turned out to play a differential role in the preceding section. This investigation will enable us to assess the meaning and stability of the markers, and to consider the relevance of features observed in other studies – especially for linguistics.

Rhetorical moves are a key concept in ESP and EAP. Following the example of Swales (1981, 1990), numerous studies have concentrated on these moves or steps, generally examined within the different article sections (e.g. Abstracts, Introductions, Discussions). Importantly, many of the moves and steps described for research articles appear to have distinctive and representative linguistic features and patterns that are often taught in academic English classes. However, most of these investigations have been performed on hard sciences articles – e.g. biochemistry in Kanoksilapatham (2005) – as HSS articles do not follow such a regular structure and vary significantly more. We nonetheless considered characteristic terms identified by Swales and Feak (2004) and Kanoksilapatham (2005) – S&F and K respectively in Table 5 – as relevant features to examine. We chose to concentrate on two different sets of criteria that we hypothesised would play an important role in text structuring: (a) markers related to the relevant field (mostly represented in moves 1 and 2 of Introductions), and (b) features related to the results obtained (statements and interpretation).

Table 5 reports the set of markers we selected. Each cue is associated with its source (references or Section 4 – in this case, the set the cue is specific for is given), its supposed function, and the number of times it occurs, as well as its specificity in the three sets. Let us finally point out that the words considered to be significant in the literature were nonspecific to RAs.

5.1  Introduction features

5.1.1  Importance

While we found only one significant pattern built with key in RAs (plays a key role, 5 occurrences), ultimately this word did not permit us to formulate any conclusions about our corpus, as no regular patterns were found in LAs or in RVAs. On the other hand, important turned out to be much more distinctive. The same two patterns ranked first in all three sets: an important role in (LAs: 6 occurrences, RAs: 5, RVAs: 13) and is important to note (LAs: 5, RAs: 3, RVAs: 7). We then looked more closely at the expressions important to + Verb and important + Noun: important to note and important role remained the most frequent in all sets (respectively LAs 5, RAs 3, RVAs 7 and LAs 9, RAs 11, RVAs 17). Here again, RAs showed the greatest number of fixed expressions and the least variation: important to establish, account, point, consider (1 occurrence each) and important roles (3), function (2), distinction (2), translation (2). For important to + Verb, RVAs also included recognize (3), determine (3), point (2), consider (2), and 11 other verbs (e.g. understand, remember, keep, ask), whereas LAs comprised point (2) and 15 other verbs (e.g. stress, clarify, ensure, provide, emphasize). Note that the nouns qualified as important in RAs were mostly technical (e.g. drug, antioxidant, modulators), while RVAs and LAs comprised a variety of more general nouns (e.g. RVAs: mechanism, factors, function, area, findings or insights; LAs: aspect, difference, implications, impact, characteristics).

This is again related to the very function of research articles in biology: RAs primarily report experiments and leave little space for more general interpretations. As the use of qualifying adjectives such as important implies a comparison process, the objects being compared in RAs mostly deal with the specific (technical) elements of the experiments. In contrast, RVAs and LAs are similar to each other in that they both aim at deriving generalizations; this overlap between LAs and RVAs may explain why formal review articles are less common in linguistics than in biology.

Table 5. Selected features

                                                                      Nb. occ. & specificity scores
Group     Feature     Source                   Supposed function       LAs         RAs         RVAs
Field     important   S&F, K, intro. move 1    Introduction of topic   188 (0)     90 (−9)     200 (+12)
Field     key         K, intro. mv 1           Introduction of topic   56 (0)      34 (−1)     52 (+2)
Field     remains     S&F, intro. mv 2         Gap in knowledge        50 (−2)     35 (−2)     80 (+8)
Field     unclear     S&F, intro. mv 2         Gap in knowledge        18 (−1)     26 (0)      22 (0)
Results   indicated   S4, RAs                  Stating results         51 (−26)    259 (+i)    32 (−14)
Results   observed    K, S4, RAs               Stating results         92 (−39)    342 (+i)    139 (0)

5.1.2  Gap

First of all, remains appeared to be more structuring (i.e. more stable, and more distinctive) for the two biology sets than for linguistics. Whereas the subjects of remains vary considerably (although 20% of the occurrences were preceded by the pronoun it), the context to the right is remarkably stable: RVAs comprised 11 occurrences of remains to be determined and 3 of remains to be identified, while RAs included remains to be determined (4) and elucidated (3). In both cases, remains signals a gap (which will not necessarily be filled in the article, as the pattern turned out to be rare in Introductions), as do the words following it (e.g. RAs: elusive, poorly understood, controversial, unclear, a mystery why, unknown; vs RVAs: a challenge, poorly characterized, obscure, unclear, a high priority, a risk today, a mystery). Although the very meaning of remains contains an idea of limitation that is also perceptible in linguistics articles, much more variation is noted there: the first pattern we observed was remains to be seen (3 occurrences), which is ultimately more reader-friendly, as the ‘obscure’ issue is developed further:

(2) It remains to be seen whether this definition of literal meaning is really coherent and functional in any way (cf. Section 4). JP 3404 – Ariel

(3) It remains to be seen precisely how this link functions; I will return to this problem in Section 7. JP3309 – Waltereit

On the other hand, unclear appeared to be used in two distinct but related ways in biology articles. The first, corresponding to a move 2-like function, highlights gaps in the available knowledge of a given topic. In this context, unclear is associated with markers indicating that the gap is current (with cooccurrents conveying the same idea, e.g. currently, with 7 occurrences). The concordance lines in Figure 5 include examples of this type.

 sequence of phosphorylation is currently  unclear  ( Chen et al . 2006 ) . Protein kinase A
or acentrosomal spindle formation remain  unclear  . Here , we performed live - cell microscopy
 although the molecular basis for this is  unclear  . Here we show that nephrin selectively
  this alternative mechanism is currently  unclear  . In summary , although it is apparent that
  brings about these changes is currently  unclear  . The fact that overexpression of membrane

Figure 5. Concordance lines of unclear within biology RAs: gaps in knowledge

The second usage of this feature refers to the results of the same paper, pointing to limitations to their interpretation or to knowledge gaps that remain in view of the results, as illustrated in Figure 6.


 activation induced by E4 - ORF1 . It is  unclear  whether , like E4 - ORF1 , high - risk HPV
 Okuda et al , 2004 ) . At present it is  unclear  whether proteolysis plays a role in mammalian
be accomplished in a single step . It is  unclear  why no more than nine introns could be integrated
eyed flies in Table 1 , cross 11 . It is  unclear  why the D. melanogaster product is predominant
trains ( Fig . 3b , right half ) . It is  unclear  why the parental clbl , 3 , 4 strain showed

Figure 6. Concordance lines of unclear within biology RAs: presented experiments.
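Concordance lines such as those in Figures 5 and 6 can be produced with a simple keyword-in-context routine. The following is a minimal sketch, not the tool used for the figures: it assumes tokenized input, measures context in tokens rather than characters, and the function name is illustrative. The sample tokens are taken from one line of Figure 6.

```python
def kwic(tokens: list[str], keyword: str, width: int = 5) -> list[tuple[str, str, str]]:
    """Return (left context, keyword, right context) for each hit."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines


tokens = "At present it is unclear whether proteolysis plays a role in mammalian".split()
hits = kwic(tokens, "unclear", width=3)

# Sorting on the right context groups the "whether ..." and "why ..."
# continuations together, which makes the two usages easier to compare.
hits_sorted = sorted(hits, key=lambda line: line[2].lower())
```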

Interestingly, RAs are the only set with a balanced use of unclear: RVAs, aiming to define current gaps in knowledge, significantly prefer the first usage (75% of occurrences). On the other hand, the distinction was not clear in LAs, as gaps in current knowledge and references can be found throughout the article. However, unclear was never found with the term currently in LAs but rather with other, sometimes similar, adverbs like still (3 occurrences), often and sometimes (1 occurrence each).

5.2  Statement of results and interpretation

By observing the collocates of indicated and observed, we found that these verbs were systematically followed by the conjunction that. In general, this construction was used either to refer to conclusions that can be drawn based on experimental results (these data show that…) or to report the findings of previous research (studies in vertebrates show that…) (Charles 2006). To systematize our findings, we found it relevant to observe verbs preceding a that conjunction (Table 6).
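Counting verbs that precede that can be sketched on POS-tagged input. The example assumes tokens paired with tags in which every verbal tag begins with V; the tag values in the sample are illustrative, and the function name and sentence are invented. Like any purely positional pattern, it does not check that that is actually a conjunction.

```python
from collections import Counter


def verb_that(tagged: list[tuple[str, str]]) -> Counter:
    """Count 'VERB that' bigrams, where VERB carries a tag starting with V."""
    counts: Counter = Counter()
    for (word, pos), (next_word, _) in zip(tagged, tagged[1:]):
        if pos.startswith("V") and next_word == "that":
            counts[f"{word} that"] += 1
    return counts


# Invented sample with illustrative tags.
tagged = [
    ("These", "DT"), ("data", "NNS"), ("suggest", "VVP"), ("that", "IN"),
    ("X", "NP"), ("binds", "VVZ"), ("Y", "NP"), (".", "SENT"),
    ("We", "PP"), ("suggest", "VVP"), ("that", "IN"), ("that", "DT"),
    ("protein", "NN"), ("is", "VBZ"), ("active", "JJ"), (".", "SENT"),
]
pairs = verb_that(tagged)
```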

In spite of noteworthy similarities (use of the verbs suggest, show, indicate, be or find), the three sets show important variations and distinct choices when discussing or interpreting findings. Linguists acknowledge their subjectivity with verbs like argue, claim, say or believe, whereas biologists conceal themselves behind the data, especially in RVAs. If we consider the first argument of the verb suggest (lemma search) in the three sets, results (16), studies (13), This (11), data (9), evidence (6), observations (4) and findings (3) are the most highly represented in RVAs. The collocates are very similar in RAs, except for the We pronoun: data (31 occurrences), This (22), results (18), We (11), studies (5), findings (3) and evidence (3). Interestingly, linguists seem more hesitant or cautious when presenting results based on data, as they almost systematically resort to hedging with verbs like suggest or indicate. Suggest is often weakened by might (5 occurrences) or accompanied by other hedging devices:

(4) The results of this study seem to suggest that discussion of previous research is an element that does not just establish the territory (Move 1) but can also be employed to realize steps that belong to other moves. ESP 2101 – Samraj.

(5) It would indeed be untenable to suggest that linguistic communication can be explained in terms of a simple process of encoding and decoding. JP3310 – Chapman.

(6) This suggests that these nonverbs may only be marked, or at least may originally have only been marked, with conjunct (…) LI4003 – Curnow.


Table 6. Collocation statistics of the pattern [VERB] that (note 30) – RAs vs RVAs vs LAs

RAs                    F   RVAs                   F   Linguistics          F
suggest that         113   suggest that          85   is that            254
show that             84   is that               80   argue that          76
suggesting that       84   shown that            73   suggests that       75
found that            75   suggests that         73   shows that          66
suggests that         67   suggesting that       70   show that           64
indicate that         66   indicating that       49   suggest that        57
indicating that       61   showed that           49   argues that         48
is that               60   demonstrated that     46   means that          46
shown that            57   indicate that         38   argued that         43
demonstrated that     53   revealed that         34   indicates that      43
suggested that        48   found that            30   shown that          39
demonstrate that      48   suggested that        26   claims that         33
showed that           39   indicates that        26   indicate that       33
indicated that        37   proposed that         21   found that          32
conclude that         32   demonstrate that      19   say that            32
shows that            26   reported that         18   assume that         30
determined that       24   note that             17   believe that        30
revealed that         24   propose that          16   claim that          27
indicates that        24   show that             14   mean that           27
propose that          16   appears that          14   conclude that       26
confirm that          15   implies that          13   note that           25
confirmed that        15   reveal that           11   showed that         24
demonstrating that    15   concluded that        11   demonstrate that    24
find that             14   showing that          11   noted that          22
showing that          14   noted that            10   seems that          21
observed that         13   shows that             9   imply that          20
implies that          13   postulated that        9   be that             19
Given that            12   means that             9   find that           18
appears that          11   speculate that         9   ensure that         16

30.  TXM index function – CQP query: [ttpos=“V.*”]“that”.


This difference in the level of hedging observed between biology and linguistics articles may reflect the nature of the scientific process involved in the two disciplines, with linguistics involving more delicate interpretation of data, and biology aiming more for clear-cut, conclusive results based on testable hypotheses.

This section has enabled us to highlight important differences between the three sets of articles, and three quite opposed writing styles: as underlined in previous studies, linguistics articles show a high level of variation (Fløttum 2007; Evangelisti Allori 1994; Duszak 1997), whereas biology articles are much more structured (moves and sections, and fixed expressions) (Swales 1990; Thompson 1993; Kanoksilapatham 2005). Linguists accept their subjectivity and continuously weaken the truth value of their findings. Given the more interpretive nature of linguistics research and analysis, this stance is not only rhetorical: it also reflects a complex relationship to the data. On the contrary, data guarantee the objectivity and the impact of biologists’ results. Accordingly, authors tend to hide behind data, results and previously published studies.

6.  Conclusion

The overall aims of this paper were twofold: (a) to provide a contrastive description of biology and linguistics articles; and (b) to present an overview of the quantitative techniques and tools that may be useful for describing and comparing corpora. The quantitative approach we adopted enabled us to highlight key distinctions between the two disciplines which ultimately echo larger lexical and stylistic differences between experimental laboratory sciences such as biology and HSS. Further, the results obtained led us to question the very identity of a discipline as a single coherent whole, which is the product of distinct writing cultures and traditions. Comparing two such divergent disciplines draws attention to fundamental differences and criteria that academics have consciously or unconsciously acquired, and which are in any case very important for learning how to write academic papers. Moreover, the quantitative or inductive techniques we used interestingly echo the learning process itself (collocations or prefabs, or even specificities, as the acquisition of a specialized sublanguage is necessarily a differential process).

In that respect, we assume that the comparative approach we have adopted here would be very relevant in EAP teaching, which still remains rather general. Indeed, the supposed homogeneity of academic English should be further assessed, as we have shown that the different functions of an article, and the lexical elements used to achieve them, differ significantly between different academic fields and between similar but distinct genres. Further, as ESP and EAP teachers typically teach students in a variety of academic disciplines, their intuitions regarding what characterizes these various genres and fields may not be reliable.

These observations raise various questions regarding research and teaching in academic English. Given the differences between these fields and genres, what could constitute a core curriculum that would be useful for advanced students in all disciplines? Also, once we go beyond such a core curriculum, how do we define the specific needs of students in various disciplines, and, for students within a single discipline, how do we approach teaching them to read or produce different genres?

In view of the results set out above, we believe that a corpus approach would certainly help to answer such questions and we hope that our work may contribute to the greater use of corpora by ESP and EAP teachers and researchers.
