The Use of Corpus Linguistics in Lexicography

The Use of Corpus Linguistics in Lexicography An Integrative Review Lexicography ENGL 6203 Submitted by: Ihsan Ibadurrahman (G1025429) Syareen Izzaty Bt Majelan (G1029580) Rudiana Razali (G1115202)


The Use of Corpus Linguistics in Lexicography

An Integrative Review

Lexicography ENGL 6203

Ihsan Ibadurrahman (G1025429)

Syareen Izzaty Bt Majelan (G1029580)

Rudiana Razali (G1115202)


I. Introduction

The practice of English dictionary-making began as early as 1604, when Robert Cawdrey compiled a dictionary of words that were deemed difficult because they had been borrowed from other languages (Siemens, 1994). The words were taken from Latin-English dictionaries and from other texts available at the time, and each was given a concise definition, a synonym, and a fixed form (Siemens, 1994). It was Samuel Johnson who, in the 1700s, explicitly set out the methods he followed in creating his dictionary. Some of these methods were later followed by the committee entrusted in the 1800s with creating “A New Dictionary”, now known as the Oxford English Dictionary.

A corpus is a collection of samples of authentic spoken and written text which is used for the analysis of words, meanings, grammar and usage (David, 1992). In Saussurean terminology, a text is akin to parole, while the corpus provides evidence of langue (Tognini & Bonelli, 2001). The term corpus linguistics is used when a corpus is used specifically to study a language. Lindquist (2009: 1) distinguishes it from other branches of linguistics such as sociolinguistics (the study of language and society) or psycholinguistics (the study of language and the mind) in that corpus linguistics is a specific method used in language study, the “how to” rather than the “what”. In other words, corpus linguistics is an approach rather than a specific field of language study (Gries, 2009).

This paper aims to highlight major findings in the literature on corpus linguistics, with an added emphasis on its use in dictionary-making. In developing this integrative literature review, 18 sources were obtained: 13 books, 2 journal articles, and 3 online articles. After the literature was reviewed, recurring ideas were compared, listed, and discussed. For ease of reading, the literature has been categorized under separate subheadings, namely the pre-corpus era, the initial corpus, and the present corpus.


II. Literature Review

a. Pre-corpus linguistics

Robert Cawdrey's Table Alphabeticall (1604) is considered to be the first monolingual English dictionary ever made, even though glosses of words had been produced before it (Jackson, 2002). Cawdrey's dictionary consisted of 2543 'hard' words, loanwords that were considered difficult for the 'uneducated' reader to learn. The words were gathered from Latin-English dictionaries and from glosses of religious, legal and scientific texts (Siemens, 1994). Cawdrey provided a concise definition of each word, a synonym or explanatory phrase, and a fixed form for many of the difficult words (Siemens, 1994; Jackson, 2002). After Cawdrey's dictionary appeared, much effort went into improving the quality of dictionaries, and subsequent dictionaries followed Cawdrey's method of extracting 'hard' words from different texts and including them in the dictionary.

In 1755 Samuel Johnson published a two-volume dictionary that he had worked on for 9 years (Jackson, 2002). It remained the standard English dictionary for 150 years, until the conception of the Oxford English Dictionary in England, and was the first dictionary to use quotations to show how each word was used (Baugh & Cable, 2002). In a letter to his patron, Johnson wrote that he faced difficulties in adding a word to the dictionary, in the following order:

1) Selecting words. Johnson had to decide which words to include in the dictionary and to classify each word as foreign or English, since a lot of borrowing had taken place from other languages. He also had to decide whether words from specific professions should be included.

2) Orthography. Johnson proposed that no change should be made to the spelling of words without sufficient reason, because change would only inconvenience others and is a mark of weakness or inconsistency.

3) Pronunciation. Johnson says that, along with orthography, pronunciation should also be constant, because stability is important to the lifespan of a language and any change would create almost a new speech, which would corrupt the spoken English of the time.

4) Etymology and derivation. It is important to know the etymology of a word because, given the amount of borrowing from different languages, it is hard to discern which words are native to English.

5) Analogy. The rules that govern how the words are used are included.

6) Syntax. The construction of each word is shown because the construction of English is too inconsistent to be reduced to rules alone.

7) Phraseology. The phrases in which a word is used are included to illustrate the different ways the word can be used.

8) Interpretation. Johnson considers the interpretation of a word to be the most difficult part of creating the dictionary, because he had to look at the different usages of each word and come up with the best explanation of it.

9) Distribution. After all the above-mentioned steps had been taken, Johnson slotted each word into its proper class.

After more than 150 years as the main source of reference, with several revisions, Johnson’s dictionary was found to be inadequate for the standards of modern scholarship (Jackson, 2002). In 1857 a committee was appointed to collect words that were not in the dictionary, to be added as a supplement, but the committee found that a supplement would not be enough, and in 1858 it was decided that a new dictionary should be created (Baugh & Cable, 2002; Jackson, 2002). The main aims of the new project were to record every word that could be found in English from about the year 1000 and to exhibit the history of each through a selection of quotations from the whole range of English writing (Baugh & Cable, 2002). A total of six million slips containing quotations were gathered from volunteers, not only in England but all over the world. After 24 years of hard work, the first instalment of the dictionary, covering part of the letter A, was published in 1884. Another 16 years passed before four and a half volumes, reaching the letter H, had been published. Finally, in 1928, the final section of the dictionary was issued, completing "A New Dictionary", now known as the Oxford English Dictionary (OED), after 70 years (Baugh & Cable, 2002). The committee drew up rules to be observed by the editors of the OED before a word could be included in the dictionary, in the following order (Considine, 1996):


1) The Word to be explained.

2) The Pronunciation and Accent.

3) The Various Forms assumed by the word, and its principal grammatical inflexions.

4) The Etymon of the word.

5) The Cognate Forms in kindred languages.

6) The Meanings which are logically deduced from the Etymology, and arranged to show

the common thread or threads which unite them together.

Even though over a century had passed since Johnson created his dictionary, some of the steps he took were still used in creating the OED. This shows that Johnson's methods were still relevant to lexicographers and remained the main steps in making a dictionary before corpus linguistics was introduced into dictionary-making.

b. The initial stage of corpus linguistics

In the 1950s, there was growing dissatisfaction that language theory (e.g. Noam Chomsky’s Syntactic Structures) could not account for the many ‘ungrammatical’ patterns found in English (e.g. the distinction between transitive and intransitive verbs). There was a strong call for empirical, real language data (Teubert, 2004). It was then that the modern corpus emerged. Two early projects were the Survey of English Usage, conducted at the University of London, and the corpus compiled at Brown University in Providence. In the 1960s, the Brown team compiled a million-word corpus of written text from 500 text samples, which was named the Brown Corpus. This American corpus was a landmark in corpus linguistics since it was the first corpus to employ a computer in its making. In 1982, the British counterpart, the LOB corpus, was compiled by Hofland and Johansson. LOB stands for Lancaster-Oslo/Bergen and, as its name suggests, the corpus was a collaboration between three institutions: the University of Lancaster, the University of Oslo, and the Norwegian Computing Centre for the Humanities in Bergen.

However, both the Brown corpus and the LOB corpus were deemed inadequate for sampling English vocabulary. This gave birth to John Sinclair’s English Lexical Studies, which specifically aimed to investigate vocabulary using electronic texts of spoken and written language. The study gave prominence to collocation, that is, words that naturally co-occur. Aiming to represent varieties of English in places where it is used as a first or second language, Sidney Greenbaum compiled one-million-word corpora called the International Corpus of English in 1988. The unique feature of this corpus is that it samples more spoken language (60%) than written language (40%).

In the early 1990s, major universities and companies together compiled the British National Corpus (BNC), containing 100 million words from 1980 up to 1993. The compilers were Oxford University Press, Longman, Chambers, the British Library, Oxford University and Lancaster University. The aim was to provide a balanced corpus that represents British English. The corpus comprises 10% spoken language and 90% written language, the latter consisting of 25% fiction and 75% non-fiction. One big distinction between the BNC and Brown is that the former took longer samples, of between 40,000 and 50,000 words, from each text. This gives the BNC an added advantage in representativeness, since a text uses words differently at the beginning, in the middle, and at the end (Lindquist, 2009). Due to its sheer size, its representativeness, and the care taken in compiling it, most British publishers prefer to use this corpus as their source of lexicographic information.
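As a rough worked example, the proportions quoted above can be turned into approximate word counts. The sketch below assumes that the 25%/75% fiction split applies to the written part only, as the wording suggests; the function name is illustrative and the figures are simple arithmetic, not official BNC statistics.

```python
# Approximate word counts implied by the proportions quoted in the text.
TOTAL_WORDS = 100_000_000  # BNC size as stated above


def bnc_breakdown(total=TOTAL_WORDS):
    """Split a corpus size into spoken/written and fiction/non-fiction parts."""
    spoken = total * 10 // 100        # 10% spoken
    written = total - spoken          # 90% written
    fiction = written * 25 // 100     # assumption: 25% of the written part
    nonfiction = written - fiction    # remaining 75% of the written part
    return {
        "spoken": spoken,
        "written": written,
        "fiction": fiction,
        "non-fiction": nonfiction,
    }


if __name__ == "__main__":
    for part, words in bnc_breakdown().items():
        print(f"{part:12} {words:>12,}")
```

On these assumptions, the written part would hold about 90 million words, of which roughly 22.5 million would be fiction.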

Typically, a corpus goes through a three-step process in its making. Before these three steps, however, the compiler needs to determine the basic outline of the corpus, such as its size, its genres, and whether it will cover written language, spoken language, or both. Sinclair (1996) points out that a corpus should be as large as possible and should include samples from a broad range of material, in order to achieve a degree of representativeness within the limits of the technology of the time. The corpus should also be classified into different genres and have samples of even size. Once this basic outline is determined, the three-step process may begin. The first step is collecting the data, spoken and/or written. This entails gathering a large mass of speech and written texts, obtaining permission, and keeping careful and organized records. The next step is computerization, which entails converting raw spoken or written text into a digital format. Recorded speech can be painstaking to process since it needs to be transcribed manually. Another concern with spoken text is naturalness; speech needs to be recorded in a natural, casual way that resembles how people speak in everyday life, not in a stilted way. Though written texts seem less painstaking, they have their own problem, mainly copyright. In addition, some texts from books, magazines, and other written sources still need to be retyped, since the output of optical character recognition (OCR) software, which scans and detects words automatically, usually contains errors, so many that it is best to avoid using it altogether. The last step is annotation, which involves assigning information such as part of speech or etymology to each item of data. It should be noted that the three steps need not be seen as separate processes; they are closely connected. For example, after a recording of speech has been gathered, it may be best to transcribe it there and then.
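The three steps described above (collection, computerization, annotation) can be pictured as a small pipeline. The following is only a minimal sketch under stated assumptions: the class, the function names, and the toy tag lexicon are all hypothetical, and real annotation would rely on a trained tagger rather than a lookup table.

```python
from dataclasses import dataclass, field


@dataclass
class CorpusText:
    source: str        # record-keeping: where the sample came from
    mode: str          # "spoken" or "written"
    permission: bool   # whether permission to use the text was obtained
    raw: str = ""      # filled in by computerization
    tokens: list = field(default_factory=list)  # filled in by annotation


# Hypothetical toy lexicon standing in for a real part-of-speech tagger.
TOY_TAGS = {"the": "DET", "a": "DET", "dog": "NOUN", "barks": "VERB"}


def collect(source, mode, permission):
    """Step 1: register a sample together with its record-keeping data."""
    return CorpusText(source=source, mode=mode, permission=permission)


def computerize(text, transcription):
    """Step 2: store the transcribed or retyped text in digital form."""
    text.raw = transcription.strip()
    return text


def annotate(text):
    """Step 3: attach a part-of-speech tag to every word ("UNK" if unknown)."""
    text.tokens = [(w, TOY_TAGS.get(w.lower(), "UNK")) for w in text.raw.split()]
    return text


if __name__ == "__main__":
    sample = annotate(computerize(collect("interview-01", "spoken", True),
                                  "The dog barks"))
    print(sample.tokens)
```

Chaining the three calls in one expression mirrors the point made above that the steps are closely connected rather than strictly separate.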

Corpora may have contributed a great deal to language study, but their impact on lexicography did not start until 1989. Together with advances in computer software, corpora have since contributed significantly to the development of lexicography. Since much of the work is automated and recorded in a digital format, lexicographers can now save much of the time and the tremendous amount of work needed to compile a dictionary. Typically, a dictionary has information on the part of speech, usage, meaning, pronunciation, and etymology of a word. Before the advent of corpora, all this information had to be gathered manually; lexicographers had to do the hard labor of collecting slips of paper containing the text that they intended to include in the dictionary. For this reason, it took roughly 50 years to complete the Oxford English Dictionary, which was originally known as the New English Dictionary (Meyer, 2002). With corpora, dictionary makers can now use a large sample of authentic spoken and written text as a source to illustrate how each word in their list is used in real life. The citations used in the dictionary come from real-life discourse. Real contexts also support accurate, well-defined lexical meanings in dictionary definitions, which is a huge improvement over earlier practice, where words were defined in an unscientific manner. Another huge improvement is the rich information available for words that have many variant meanings, such as take, go, and time, which tended to be overlooked in earlier dictionary practice (Lindquist, 2009).

Another huge advantage of using corpora in lexicography is that information on word frequency can also be obtained. This way, lexicographers can indicate whether a word is among the 500 most common words, the next 500, and so on. Meyer (2002) notes that the most frequent words are function words such as the, an, a, and, and of, which carry little lexical meaning, and that the least frequent words are content words such as proper nouns. Gries (2009) mentions two kinds of frequency information that lexicographers can obtain from a corpus: frequencies of occurrence of linguistic elements, given in a so-called frequency list, and frequencies of co-occurrence of these linguistic elements, shown in concordances. Lindquist (2009: 5) defines a concordance as “a list of all the contexts in which a word occurs in a particular text”. Using a Key Word in Context (KWIC) concordance, words can be retrieved together with their surrounding text and presented vertically on the screen. Since the information is presented in context, lexicographers can easily identify the collocations of each word in their dictionary. Below is an excerpt from concordance software in which the word “corpus” is highlighted.

Figure 1: Concordance from a software called AntConc 3.2.2w (Gries, 2009).

The above figure illustrates the concordance software AntConc in use. It should be noted that the software does not come with a ready-made corpus; users therefore need to supply their own text file to generate KWIC output. The latest version of the software is 3.2.4w and can be downloaded online at http://www.antlab.sci.waseda.ac.jp/software.html. Similar software that lexicographers may use to find how words are used in context is WordSmith Tools, devised by Mike Scott in 1993. Since then the software has gone through many changes and now includes a concordancer, word-listing, a web text downloader and many other features (Wikipedia, 2011). Previous versions of the software were sold and owned by Oxford University Press. The current version is owned by Lexical Analysis Software Ltd. The current WordSmith version is 5.0, and can be downloaded online at: http://www.lexically.net/wordsmith/version5/index.html. However, unlike AntConc, WordSmith is shareware. To unlock the demo version from the website, a user needs to buy a single-user license for £50, or around $70-80, from one of two online retailers (Lexical Analysis Software and Oxford University Press).
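Returning to the frequency information discussed earlier, a frequency list and the 500-word frequency bands can be sketched in a few lines. This is a minimal illustration and not how AntConc or WordSmith work internally; the tokenizer is a deliberately crude assumption and the function names are hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    """Crude tokenizer (an assumption): lowercase runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def frequency_list(text):
    """Return (word, count) pairs, most frequent first."""
    return Counter(tokenize(text)).most_common()


def frequency_band(word, freq_list, band_size=500):
    """Return 1 for the first `band_size` most frequent words, 2 for the
    next `band_size`, and so on; None if the word is not in the list."""
    for rank, (w, _count) in enumerate(freq_list):
        if w == word:
            return rank // band_size + 1
    return None


if __name__ == "__main__":
    freqs = frequency_list("the cat and the dog and the bird")
    print(freqs[:2])
    # A tiny band size is used here only because the sample text is tiny.
    print(frequency_band("cat", freqs, band_size=2))
```

With a real corpus, `band_size` would be left at 500 so that band 1 corresponds to the 500 most common words, band 2 to the next 500, and so on.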

Since a corpus is discourse-based, a word appears in a haphazard, arbitrary collection of occurrences, as illustrated in the figure above. Dictionary makers need to check for contradictions with the ‘real’ meaning. It is thus dangerous to depend solely on a corpus (Teubert, 2004). One way to check a word in context is to expand the text by retrieving its original source. Such a feature is lacking in both programs mentioned previously, AntConc and WordSmith Tools. Fortunately, the feature is available for free from the Brigham Young University website, which provides a concordancer for the BNC, COCA (Corpus of Contemporary American English), and some other corpora, and can be accessed at: http://corpus.byu.edu/
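To make the idea of a KWIC display and of "expanding" the context concrete, here is a minimal sketch. It is not the implementation used by any of the tools named above; the function names and the `window` parameter are assumptions made for illustration.

```python
def kwic(tokens, keyword, window=4):
    """Return (left context, keyword, right context) for every hit.

    `window` is the number of words kept on each side; a larger value
    corresponds to expanding the context around the keyword.
    """
    hits = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            hits.append((left, tok, right))
    return hits


def format_kwic(hits, width=30):
    """Align the keyword vertically, as concordancers do on screen."""
    return "\n".join(f"{left:>{width}}  [{kw}]  {right}" for left, kw, right in hits)


if __name__ == "__main__":
    words = "a corpus is a collection of texts and a corpus can be large".split()
    print(format_kwic(kwic(words, "corpus", window=2)))
    print()
    print(format_kwic(kwic(words, "corpus", window=5)))
```

Running the same query with a wider window is a simple stand-in for retrieving more of the original source when a short line is ambiguous.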

The huge amount of data in a corpus also allows lexicographers to look for new words that occur for the first time in spoken or written text. However, the corpus has to be large enough to yield information on vocabulary items (Meyer, 2002). A small corpus such as the LOB corpus, which stores roughly one million words, cannot give lexicographers enough information on the range of vocabulary items. A monitor corpus is also needed, in which large amounts of language data are added from time to time rather than fixed in one particular time period. This way, the corpus is frequently updated with new words and meanings from today’s growing language.
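One simple way to picture how a monitor corpus surfaces new words is to compare each newly added batch of text against the vocabulary already recorded. The sketch below assumes a very crude tokenizer and hypothetical function names; a real project would also need to filter typos, proper nouns, and inflected forms.

```python
import re


def vocabulary(text):
    """Set of lowercase word forms in a text (crude tokenizer, an assumption)."""
    return set(re.findall(r"[a-z]+", text.lower()))


def new_words(known_vocab, new_batch):
    """Word forms in the new batch that the existing corpus has not seen."""
    return sorted(vocabulary(new_batch) - known_vocab)


def update_monitor(known_vocab, new_batch):
    """Add the batch to the monitor corpus vocabulary and report what was new."""
    fresh = new_words(known_vocab, new_batch)
    known_vocab |= vocabulary(new_batch)
    return fresh


if __name__ == "__main__":
    known = vocabulary("people send letters and read books")
    print(update_monitor(known, "people send emails and read blogs"))
    print(update_monitor(known, "people read blogs"))
```

After the first update the candidate words become part of the known vocabulary, so a later batch containing them reports nothing new.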

The first dictionary to be founded wholly on a corpus is the Collins COBUILD series of English Language dictionaries, compiled in 1987 under the guidance of John Sinclair. The dictionary takes its citations from real-life discourse, and each word is defined from these authentic texts instead of relying on previous dictionaries. This entails using a very large corpus so that it can include all lemmas together with their word senses. However, this presents a problem in that rare words such as apothegm tend to be excluded (Teubert, 2004). Besides being the first corpus-based dictionary, COBUILD is innovative in that its definitions resemble a classroom teacher explaining the words. For example, in describing the word junk, it says: “You can use junk to refer to old and second-hand goods that people buy and collect” (Jackson, 2002).

In the practice of dictionary-making, one crucial distinction has to be made between a corpus-based dictionary and a corpus-driven dictionary. Dictionaries such as the Collins COBUILD series of English Language dictionaries are said to be corpus-based if the corpus is used to validate the information presented in the dictionary. However, if the corpus is used to extract the information used in the dictionary, the dictionary is called corpus-driven. Teubert (2004: 112) suggests that dictionaries should follow the corpus-driven approach so that they complement standard linguistics rather than just extend it.

c. Modern corpus linguistics

During the 1970s, computational research on English did not develop much in Birmingham because much effort went into preparation: devising software packages, instituting undergraduate courses and influencing opinion on campus (Sinclair, 1991). At that time, when computing was almost restricted to a number of crises, the importance of data processing was highlighted. It has taken approximately fifty years to make real progress in corpus-based linguistics, progress driven by systems that work and methodologies that can produce reasonable coverage of linguistic phenomena (Lawler & Dry, 1998). Over the years, computational resources such as fast machines and sufficient storage have become accessible enough to process large volumes of data. In addition, there is a growing availability of corpora with linguistic annotation, for example part of speech, prosodic intonation, and proper names, as well as bilingual parallel corpora. Furthermore, the maturing of computational linguistics technology has improved the commercial market for natural language products, and corpus linguistics is now equipped with efficient parsing and statistical techniques.

From 1980 to 1986, computational work on language was put to good effect and developed into a completely new set of techniques for language observation, analysis, and recording. This in turn led to the editing of substantial dictionaries using these techniques and a huge database of annotated examples.


One of the most prominent uses of a corpus in recent years is as a resource for lexicography. Corpus-based work had earlier been used in lexicography for a small number of languages. Only recently has the need for very large corpora come to the fore. Collaboration between lexicography and Natural Language Processing (NLP) has encouraged the use of corpora in dictionary projects that have access to very large corpora (Hua, 2001).

The computer has a clerical role in lexicography, reducing the labor of sorting and filing and making it possible to examine very large amounts of English in a short time (Sinclair, 1991). In the late 1970s, the prospects of computerized typesetting were growing more realistic. By the early 1980s, a multi-million word corpus had become available for study, but it was still limited. From simple tools, the field has progressed substantially, yielding crucial, profound and basic linguistic generalizations (Lawler & Dry, 1998). Such tools have revealed many topics for inquiry that had not been well explored by traditional linguistic methods.

In the modern era, the word corpus has been reserved for collections of texts that are stored and accessed electronically. Electronic corpora are usually larger than the small paper-based collections previously used to study aspects of language (Hunston, 2002). This is because computers can store and process far larger amounts of information than was possible before.

One contribution in the area of corpus linguistics is the work done by Johansson and colleagues in producing a parallel corpus of British English, which made it possible for researchers to scrutinize and view texts of greater length than before. The main structural features of these corpora are:

- A classification into genres (15) of printed texts

- A large number (500) of fairly short extracts (2000 words), giving a total of around

one million words.

- A close to random selection of extracts within genres.
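The design listed above (genres, fixed-length extracts, near-random selection within genres) can be illustrated with a small sampling sketch; 500 extracts of about 2000 words each give 500 × 2000 = 1,000,000 words. The code is only an illustration: the function name is hypothetical, texts are represented as plain word lists, and the same number of extracts is drawn from every genre, which is a simplifying assumption.

```python
import random


def sample_extracts(texts_by_genre, per_genre, extract_len=2000, seed=0):
    """Draw fixed-length extracts at random from randomly chosen texts in each genre.

    texts_by_genre maps a genre name to a list of texts, each text being a
    list of words. Equal numbers of extracts per genre are an assumption.
    """
    rng = random.Random(seed)  # fixed seed so that the selection is repeatable
    extracts = []
    for genre, texts in texts_by_genre.items():
        chosen = rng.sample(texts, min(per_genre, len(texts)))
        for words in chosen:
            # Pick a random starting point that leaves room for a full extract.
            start = rng.randint(0, max(0, len(words) - extract_len))
            extracts.append((genre, words[start:start + extract_len]))
    return extracts


if __name__ == "__main__":
    toy = {
        "press": [["word"] * 5000 for _ in range(3)],
        "fiction": [["word"] * 3000 for _ in range(2)],
    }
    picked = sample_extracts(toy, per_genre=2)
    print(len(picked), sum(len(words) for _, words in picked))
    print(500 * 2000)  # total size implied by the design above
```

Sampling within each genre separately keeps every genre represented, while the random choice of texts and starting points avoids favouring particular passages.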

Because of this structure, a great amount of useful information can be extracted easily from the corpora. Besides that, many sites hold samples of text amounting to hundreds of billions of words. Many collections are available, such as the Association for Computational Linguistics’ Data Collection Initiative (ACL/DCI), the European Corpus Initiative (ECI), ICAME, the British National Corpus (BNC), the Linguistic Data Consortium (LDC), the Consortium for Lexical Research (CLR), and Electronic Dictionary Research (EDR), along with standardization efforts such as the Text Encoding Initiative (TEI) (Armstrong, 1994).

The application of corpora in applied linguistics extends beyond lexicography to language teaching and has benefited a wide variety of fields. Other relevant applications of corpora are the production of dictionaries and grammars, critical linguistics, translation, literary studies and stylistics, forensic linguistics, and the design of writer support packages (Hunston, 2002).

In dictionary-making, the contribution of corpora has been especially far-reaching and influential. The use of corpora has changed dictionaries by placing emphasis on frequency, collocation and phraseology, variation, lexis in grammar, and authenticity (Hunston, 2002). Recent innovations in dictionaries include the on-line Longman Web Dictionary and the Collins COBUILD English Collocations on CD-ROM.

Sinclair (1996) points out that a corpus should be as large as possible and should include samples from a broad range of material, in order to achieve a degree of representativeness within the limits of the technology of the time. The corpus should also be classified into different genres and have samples of even size.

d. The use of corpora in language teaching

The use of corpora is common across many disciplines (McEnery & Wilson, 1996: 4). Apart from lexicography, other possible areas include language teaching, discourse and pragmatics, semantics, sociolinguistics, historical linguistics and stylistics. Within language teaching there is also a branch known as CALL (Computer-Assisted Language Learning), which provides a further application of corpora. A study conducted at Lancaster University examined the role of corpus-based computer software in teaching undergraduates the basic concepts of grammatical analysis (Hua, 2001). The software, called Cytor, reads an annotated corpus, either part-of-speech tagged or parsed, one sentence at a time. It hides the annotation and asks the students to annotate the sentence on their own. In addition, students can call up help in the form of a list of tag mnemonics, examples from a frequency lexicon, or concordances.
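The kind of exercise described above (hide the annotation, let the student tag the sentence, then compare) can be sketched as follows. This is not Cytor's actual code or data format; the `word_TAG` notation, the tag names, and all function names are assumptions made purely for illustration.

```python
def parse_tagged(sentence):
    """Split tokens written as 'word_TAG' (assumed format) into (word, tag) pairs."""
    return [tuple(tok.rsplit("_", 1)) for tok in sentence.split()]


def hide_annotation(pairs):
    """Show the student only the words, without the tags."""
    return " ".join(word for word, _tag in pairs)


def check_answer(pairs, student_tags):
    """Compare the student's tags with the corpus annotation, word by word."""
    return [(word, gold, guess, gold == guess)
            for (word, gold), guess in zip(pairs, student_tags)]


def score(results):
    """Number of correctly tagged words."""
    return sum(1 for *_rest, correct in results if correct)


if __name__ == "__main__":
    gold = parse_tagged("The_DET dog_NOUN barks_VERB")
    print(hide_annotation(gold))
    feedback = check_answer(gold, ["DET", "VERB", "VERB"])
    print(score(feedback), "of", len(gold), "tags correct")
```

Keeping the gold annotation separate from what is displayed is what allows the same corpus sentence to serve both as the exercise and as the answer key.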

How effective is Cytor at teaching parts of speech? McEnery, Baker and Wilson (1995, cited in Hua, 2001) compared two groups of students who received different treatments, one taught with Cytor and the other through traditional lecturer-based methods. The results suggest that the computer-taught students performed better than the human-taught students throughout the term.

Another use of corpora in language teaching and learning is the adoption of classroom concordancing (data-driven learning) by classroom practitioners, for whom the corpus has become a source of empirical teaching data (Hua, 2001: 5). One example of a link to data-driven learning is Tim Johns’s home page at http://web.bham.ac.uk/johnstf/. It provides an outstanding online bibliographic database of books and articles related to corpora and language teaching. It also includes online worksheets that use corpora for classroom teaching. Another interesting resource is the “Grammar Safari” site developed at Champaign-Urbana, which can be found online at http://deil.lang.uiuc.edu/web.pages/grammarsafari.html and provides a careful and thoughtful selection of corpus-based activities. Furthermore, the Longman Grammar of Spoken and Written English by Douglas Biber et al. can be used to answer student questions related to grammar; it draws on a corpus categorized into fiction, conversation, news, etc.


III. Discussions and Conclusions:

The reviewed literature shows that dictionaries have been around for centuries. The first English dictionary was made in the 1600s and was based on words that were considered difficult at that time. During this initial stage, lexicographers faced challenges in adding words to their dictionaries: selecting words, orthography, pronunciation, etymology and derivation, analogy, syntax, phraseology, interpretation, and distribution. All this information had to be gathered manually; lexicographers had to do the hard labor of collecting slips of paper containing the text that they intended to include in the dictionary. For this reason, it took roughly 50 years to complete the Oxford English Dictionary, which was originally known as the New English Dictionary. However, with the advent of corpus linguistics, things began to change dramatically.

In 1989, together with technological advances in computing, corpora began to contribute significantly to the development of dictionary-making. Corpus linguistics has had a huge impact on dictionary-making:

a. It significantly reduces the time and the heavy work needed to compile a dictionary, since much of the process is automated and computerized.

b. Each dictionary now reflects how language is used in the real world. Meaning is assigned from these samples rather than from the writer’s point of view.

c. The frequency of each word in the list can be identified.

d. Much more information can be given for words with many variant meanings, such as go and take.

e. It makes it easy to include collocations because words appear within their surrounding text.

f. New everyday words can quickly be taken into the system.

However, because a corpus is discourse-based, a word appears in a haphazard, arbitrary collection of occurrences. Dictionary makers need to check for contradictions with the ‘real’ meaning. It is thus dangerous to depend solely on a corpus. Another disadvantage of corpus-based dictionaries is that they tend to exclude rare words (those not appearing in real-world language) such as apothegm. The first dictionaries to be corpus-based are the Collins COBUILD series of English dictionaries.


Corpus linguistics serves linguistic purposes and also preserves texts for their intrinsic value (Hunston, 2002). A corpus can also be used as groundwork for research. Storing a corpus electronically allows users to study it non-linearly, and both quantitatively and qualitatively. A corpus does not in itself contain new information about language; rather, it offers a new viewpoint on information that is already there. It shows us a way in which language can be examined. Most available software packages process data from a corpus in three ways: showing frequency, phraseology, and collocation (Hunston, 2002).
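Of the three kinds of processing just mentioned, collocation is perhaps the least obvious, so a minimal sketch may help. It simply counts the words that occur within a fixed span of a node word; the function name and the `span` parameter are hypothetical, and real packages typically add statistical measures on top of raw counts.

```python
from collections import Counter


def collocates(tokens, node, span=2):
    """Count the words occurring within `span` positions of each occurrence of `node`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            lo = max(0, i - span)
            hi = min(len(tokens), i + span + 1)
            # Count every neighbour in the window except the node itself.
            counts.update(t for j, t in enumerate(tokens[lo:hi], start=lo) if j != i)
    return counts


if __name__ == "__main__":
    text = "she drinks strong tea he likes strong tea nobody wants weak tea"
    counts = collocates(text.split(), "tea", span=1)
    print(counts.most_common(1))
```

In this toy text, strong turns up next to tea more often than any other word, which is the kind of evidence a lexicographer would use when listing typical collocations.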

Corpora have made life simpler as well as more complex. They make life simpler when, for example, a translator can quickly compare words that are more or less equivalent, or a teacher can refer to the corpus to show why a particular usage is incorrect or inexact. On the other hand, corpora can also make life more complex, in the sense that language is patterned in a much more fine-grained way than might have been expected, so that a simple and general rule turns out to apply only in certain contexts (Hunston, 2002).

The term corpus is now reserved for collections of texts that are stored and accessed electronically. Electronic corpora are usually larger than the small paper-based collections previously used to study aspects of language. Electronic corpora gave birth to recent innovations in dictionaries, which include the on-line Longman Web Dictionary and the Collins COBUILD English Collocations on CD-ROM.


References:

Armstrong, S. (1994). Using Large Corpora. Cambridge: MIT Press.

Baugh, A. C. & Cable, T. (2002). A History of the English Language. Oxon: Routledge.

Considine, J. (1996). The Meanings, deduced logically from etymology. In M. Gellerstam, Jeker Jäborg, Sven-Göran Malmgren, Kerstin Norén, Lena Rogström and Catarina Röjder Pammehl (eds.), Euralex ‘96 Proceedings: Papers submitted to the Seventh EURALEX International Congress on Lexicography in Göteborg, Sweden (pp. 365-371). Göteborg: Göteborg University, Department of Swedish.

David, C. (1992). An Encyclopedic Dictionary of Language and Languages. Oxford: Oxford University Press. Retrieved from: http://www.tuchemintz.de/phil/english/chairs/linguist/independent/kursmaterialien/language_computers/whatis.htm

Gries, S. T. (2009). ‘What is Corpus Linguistics?’, Language and Linguistics Compass, Vol. 3, pp. 1-14.

Hua, T. K. (2001). Corpora: Characteristics and Related Studies. Kuala Lumpur: Maziza Sdn Bhd.

Hunston, S. (2002). Corpora in Applied Linguistics. UK: Cambridge University Press.

Jackson, H. (2002). Lexicography, an Introduction. Oxon: Routledge.

Johnson, S. (1747). The Plan of a Dictionary of the English Language.

Lawler, J. M. & Dry, H. A. (1998). Using Computers in Linguistics: A Practical Guide. London: Routledge.

Lindquist, H. (2009). Corpus Linguistics and the Description of English. Edinburgh: Edinburgh University Press.

Mason, O. (2000). Programming for Corpus Linguistics: How to Do Text Analysis with Java. Edinburgh: Edinburgh University Press.

Meyer, C.F. (2002). English Corpus Linguistics. Cambridge: Cambridge University Press.

McEnery T. & Wilson, A. (1996). Corpus Linguistics. Edinburgh: Edinburgh University Press.

Siemens, R. G. (1994). Robert Cawdrey: A Table Alphabetical of Hard Usual English Words (1604). Retrieved from http://www.library.utoronto.ca/utel/ret/cawdrey/cawdrey0.html

Sinclair, J. (1991). Corpus, Concordance, Collocation. Oxford: Oxford University Press.


Teubert, W. (2004). ‘Language and corpus linguistics’. Lexicology and Corpus Linguistics. London: Continuum.

Tognini, E., Bonelli. (2001). Corpus Linguistics at Work. Amsterdam: John Benjamins Publishing Co.

WordSmith. (2011, October 15). In Wikipedia, The Free Encyclopedia. Retrieved April 22, 2012, from http://en.wikipedia.org/w/index.php?title=WordSmith&oldid=455732307
