BBI3210 DR AFIDA MOHAMAD ALI


In linguistics, a corpus (plural corpora) is a large and structured set of texts (now usually electronically stored, processed and analysed). A corpus may contain texts in a single language (monolingual corpus) or text data in multiple languages (multilingual corpus). Multilingual corpora that have been specially formatted for side-by-side comparison are called aligned parallel corpora. (Webster’s Online Dictionary)

A corpus is a collection of naturally-occurring language text, chosen to characterize a state or variety of a language. (Sinclair, Corpus, Concordance, Collocation, 1991:171)

- spoken vs. written
- monolingual vs. bi/multilingual
- parallel vs. comparable corpora (translation corpora)
- general language purpose vs. specialised language purpose
- diachronic vs. synchronic
- plain text vs. annotated (tagged) text

Spoken corpora aim at representing spoken language:
- London-Lund Corpus (LLC)
- Lancaster/IBM Spoken English Corpus (SEC)
- Cambridge and Nottingham Corpus of Discourse in English (CANCODE)
- Santa Barbara Corpus of Spoken American English (SBCSAE)
- Wellington Corpus of Spoken New Zealand English (WSC)

Written corpora aim at representing written language:
- BROWN Corpus (written texts, AE, 1961)
- LOB Corpus (comparable to BROWN Corpus, BE, early 1960s)
- FROWN Corpus (AE, early 1990s)
- FLOB Corpus (BE, early 1990s)

Multilingual corpora aim at representing several, at least two, different languages, often with the same text types (for contrastive analyses).
- Parallel corpora (source texts plus translations): Canadian Hansard
- Comparable corpora (monolingual subcorpora designed using the same sampling techniques): Aarhus corpus of contract law

They are important resources for translation and contrastive studies.

Multilingual corpora:
- give new insight into the languages compared
- can be used to study language-specific and universal features
- illuminate differences between source texts and translations
- can be used for a number of practical applications, in lexicography, language teaching, translation, etc.

- Bilingual vs. multilingual
- Unidirectional (from La to Lb or from Lb to La alone) vs. bidirectional (from La to Lb and from Lb to La) vs. multidirectional (from La to Lb, Lc etc.)

A comparable corpus contains components that are collected using the same sampling techniques and with similar balance and representativeness, e.g. the same proportions of texts of the same genres in the same domains in a range of different languages in the same sampling period.

For the latest comprehensive website on corpora and corpus tools, go to http://www.uow.edu.au/~dlee/CBLLinks.htm

The sampling frame is essential for comparable corpora but not for parallel corpora, because the texts in a parallel corpus are exact translations of each other.

A generalized corpus is the broadest type of corpus: very large (more than 10 million words) and containing a variety of language, so that findings from it may be somewhat generalized.

Although no corpus will ever represent all possible language, generalized corpora seek to give users as much of a whole picture of a language as possible.

Analysis of patterns of language use as a whole.

Examples:
- British National Corpus (BNC, 100,106,008 words)
- The American National Corpus
- ICE (regional corpus)
- COCA (The Corpus of Contemporary American English)

These large, generalized corpora contain written texts (newspaper and magazine articles, works of fiction and nonfiction, writing from scholarly journals) and spoken transcripts (informal conversations, government proceedings and business meetings).

If generalizations about language as a whole are to be drawn, a large general corpus should be consulted.

A specialized corpus is compiled to describe language use in a specific variety, register or genre.

It contains texts of a certain type and aims to be representative of the language of this type.

Specialized corpora can be large or small and are often created to answer very specific questions.

MICASE (1,700,000 words of English spoken in the academic domain) contains only spoken language from a university setting.

CHILDES Corpus - contains language used by children

MICUSP (Michigan Corpus of Upper-level Student Papers) – a collection of papers from a range of university disciplines

Medical corpus – contains language used by nurses and hospital staff

Guangzhou Petroleum English Corpus (411,612 words of written English from the petrochemical domain)

HKUST Computer Science Corpus (1,000,000 words of written English sampled from undergraduate textbooks in computer science)

CPSA (Corpus of Professional Spoken American English)

Specialized corpora are often used in ESP settings.

The AWL (Academic Word List) was generated from a specialized corpus of academic texts.

Diachronic corpora are also known as historical corpora. Texts date to different periods in time.

Ideal to study language change and history.

- Brown/Frown
- Lob/Flob
- Helsinki Diachronic Corpus of English Texts (8th-18th century)
- Archer Corpus: A Representative Corpus of Historical English Registers (BE and AE, 1650-1990)

Synchronic corpora are useful to compare varieties of English. Texts all date to the same period.
- Brown and Lob
- Frown and Flob
- International Corpus of English (ICE) (texts produced after 1989)
- BNC

A learner corpus is a specialized corpus that contains written texts and/or spoken transcripts of language used by students who are currently acquiring the language.

Learner corpora aim at representing the language as produced by learners of this language.

Learner corpora are often tagged and can be examined, e.g., to see common errors students make.

Lstr or L2 acquisition/L1 acquired by children

- International Corpus of Learner English, ICLE (LC): a generalized corpus; contains essays written by English language learners with 14 different native languages.
- Standard Speaking Test Corpus (SST): more specialized; comprised of oral interview tests of Japanese learners.
- Other examples: CHILDES (DC), Cambridge Learner Corpus (LC)

Targeted instruction can be developed for general language teaching or for specific language groups depending on the type of learner corpus.

A pedagogic corpus is a corpus that contains language used in classroom settings.

It can include academic textbooks, transcripts of classroom interactions, or any other written text or spoken transcript that learners encounter in an educational setting.

- Lexicography / terminology
- Linguistics / computational linguistics
- Dictionaries & grammars (Collins Cobuild English Dictionary for Advanced Learners; Longman Grammar of Spoken and Written English)
- Critical Discourse Analysis
  - Study texts in social context
  - Analyze texts to show underlying ideological meanings and assumptions
  - Analyze texts to show how other meanings and ways of talking could have been used, and therefore the ideological implications of the ways that things were stated
- Literary studies
- Translation practice and theory
- Language teaching / learning
  - ESL teaching
  - LSP teaching (exemplar texts)

Issues such as:
1. How common are different words?
2. How common are the different senses for a given word across registers?
3. Do words have systematic associations with other words?
4. Do words have systematic associations with particular registers or dialects?

Research on empirical linguistics: study language use in various aspects
- Verify linguistic theory, e.g. the explanation of definite description
- Lexical studies, e.g. study the near-synonyms ‘little’ and ‘small’
- Sociolinguistics: compare the differences in language produced by different social groups (m/f)
- Cultural study, e.g. differences found in 2 comparable corpora (British/American)

Corpus based: use corpus as a resource

Knowledge:
- Know better about English: answer specific questions of certain words, phrases, structures
- Know where the problems are: error analysis on a learner corpus
- Know what should be taught: word frequency, comparing native/learner corpora

References:
- create better references: dictionary, grammar book, textbooks
- verify certain hypotheses about languages: find support examples / counter examples
- use a native corpus as a reference: see whether it is possible, which one is more natural

Corpus based: use corpus as a resource

Syllabus design:
- Native corpora => what are actually used
- Learner corpora => what are the problems
- Find out which aspects should be given priority
- Lexical syllabus = focus on frequency of occurrence
- How many words should the students know? What are they?
- Knowing 90% or 95% of the words?
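The question of how many words cover 90% or 95% of a text can be explored with a simple frequency-based calculation. The sketch below is only an illustration: the function name `words_needed` and the toy token list are assumptions, and a real syllabus study would use a large native corpus and probably word families rather than raw word forms.

```python
from collections import Counter

def words_needed(tokens, coverage=0.95):
    """Return the most frequent word types that together reach `coverage` of all tokens."""
    total = len(tokens)
    covered = 0
    needed = []
    for word, n in Counter(tokens).most_common():
        if covered / total >= coverage:
            break
        needed.append(word)
        covered += n
    return needed

# Toy corpus (hypothetical): 100 tokens, 4 word types.
toy_tokens = ["the"] * 50 + ["of"] * 30 + ["data"] * 15 + ["corpus"] * 5
for target in (0.5, 0.9, 0.99):
    print(target, words_needed(toy_tokens, target))
```

The list that comes back answers both questions on the slide at once: its length is "how many words", and its contents are "what are they".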

“In a corpus-driven approach the commitment of the linguist is to the integrity of the data as a whole, and descriptions aim to be comprehensive with respect to corpus evidence. The corpus, therefore, is seen as more than a repository of examples to back pre-existing theories or a probabilistic extension to an already well defined system. […] Examples are normally taken verbatim, in other words they are not adjusted in any way to fit the predefined categories of the analyst; recurrent patterns and frequency distributions are expected to form the basic evidence for linguistic categories; the absence of a pattern is considered potentially meaningful.” (Tognini-Bonelli, Corpus linguistics at work, 2001:84)

Corpus driven
- provides new paradigm of teaching/learning
- students as researchers
- data-driven learning
- learn how to use concordance + corpora
- extract generalization from data
- Is it possible?

Intuition alone is not enough
- Is “starting” always replaceable by “beginning”?
- Is it only “time” that is “immemorial”?
- “think of” vs. “think about”

Native speaker intuition is unreliable
- provides no information on frequency of occurrence
- “head” => body part. Is this the most used sense?

Help answering questions of usage easily
- More than one character is/are
- Worth to do / worth doing
- toward / towards
- Is “sheer” a synonym of “pure”, “complete”, “utter” and “absolute”?

TEXT                                 CORPUS
Read whole                           Read fragmented
Read horizontally                    Read vertically
Read for content                     Read for formal patterning
Read as a unique event               Read for repeated events
Read as an individual act of will    Read as a sample of social practice
Coherent communicative event         Not a coherent communicative event

From time to time there is also the need for high quality information to support particular initiatives, such as the (successful) application for accreditation. Some progress has been made in recording data on the Polytechnic 's rooms and buildings, and on the teaching space requirements of individual courses. These data are analysed, along with the database on course details and students ' course and module registrations, using the methodology in DES Design Note 44. Ad hoc reports are an essential part of any system that aspires not merely to process data routinely but to permit management information to be creamed off the top.

N   Concordance
13       ement system. They can choose whether to enter  data  themselves or to use the data preparation se
14  individuals is recorded. As well as student-related  data, details on courses, on modules and their
15     on the module student is entered, for subsequent  data  processing by the registry. In addition
16   s detailed record-keeping and effortless access to  data. All of this the system provides.
17     nd passed to the Registry who, together with the  data  preparation service, enter and verify approxi
18        considerable use of student management system  data  processing. The marksheet for each module
19   ons that can assist their work. Individual student  data  are available to assist counselling. Registry
20       tion. Some progress has been made in recording  data  on the Polytechnic 's rooms and buildings, a
21    onitors to allow the whole Committee to view such  data. A detailed analysis of the performance of
22      space requirements of individual courses. These  data  are analysed, along with the database on c
23     easy to record in a computer (unless complicated  data  structures are used) and are even harder to

- Word frequency
- Concordance
- Collocation
- Key word
- Dispersion plots

Frequency counts can be given as raw counts or percentages.

Frequency analysis allows:
- comparison between different words in a corpus
- grammatical forms in a corpus to be ascertained
- a word list to be created: a list of all of the words in a corpus along with their frequencies and the percentage contribution that each word makes towards the corpus
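A word list of this kind can be sketched in a few lines. This is a minimal illustration, not the procedure of any particular corpus tool: the function name `word_list`, the simple regular-expression tokenizer and the sample sentence are all assumptions made for the example.

```python
import re
from collections import Counter

def word_list(text):
    """Return (word, raw count, percentage of corpus) rows, most frequent first."""
    # Assumed tokenizer: lower-case runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = len(tokens)
    return [(word, n, 100.0 * n / total) for word, n in counts.most_common()]

# Hypothetical mini-corpus, for illustration only.
sample = "These data are analysed. Some data are recorded. The data are available."
rows = word_list(sample)
for word, n, pct in rows[:3]:
    print(f"{word:<10}{n:>3}{pct:>7.1f}%")
```

Each row carries both the raw count and the percentage, so the same list supports either way of reporting frequency.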

A concordance is simply a list of all of the occurrences of a particular search term in a corpus, presented within the context that they occur in; usually a few words to the left and right of the search term.

A concordance is also sometimes referred to as key word in context, or KWIC.

Here key word simply means the word that is currently under examination - and that can be any word that takes the interest of the researcher.
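A KWIC display can be produced with a short script. The sketch below is an assumption-laden illustration (the function name `kwic`, the character-based context window and the sample text are not from any specific concordancer); it shows only the core idea of lining up every hit with its left and right context.

```python
import re

def kwic(text, term, width=30):
    """Return one line per hit of `term`, with `width` characters of context on each side."""
    text = " ".join(text.split())  # collapse line breaks so each hit stays on one line
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", flags=re.IGNORECASE)
    lines = []
    for m in pattern.finditer(text):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        # Right-align the left context so the search term forms a vertical column.
        lines.append(f"{left:>{width}} {m.group(0)} {right:<{width}}")
    return lines

# Hypothetical sample text, for illustration only.
sample = ("Some progress has been made in recording data on the rooms and buildings. "
          "These data are analysed, along with the database on course details.")
hits = kwic(sample, "data")
for line in hits:
    print(line)
```

Because the pattern uses word boundaries, "database" is not counted as a hit for "data"; reading down the aligned column is what the slides call reading vertically.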

All words co-occur with each other to some degree.

However, when a word regularly appears near another word, and the relationship is statistically significant in some way, then such co-occurrences are referred to as collocates, and the phenomenon of certain words frequently occurring next to or near each other is collocation.
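One way to make "statistically significant in some way" concrete is to compare how often two words actually co-occur with how often they would co-occur by chance. The sketch below uses a mutual information (MI) score for this; that choice is an assumption for illustration (t-score and log-likelihood are common alternatives), and the function name `collocates` and the toy token list are hypothetical.

```python
import math
from collections import Counter

def collocates(tokens, node, span=2):
    """Map each word found within `span` tokens of `node` to (observed count, MI score)."""
    freq = Counter(tokens)
    n = len(tokens)
    near = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            window = tokens[max(0, i - span):i] + tokens[i + 1:i + 1 + span]
            near.update(window)
    scores = {}
    for word, observed in near.items():
        # Co-occurrences expected by chance, given both frequencies and the window size.
        expected = freq[node] * freq[word] * 2 * span / n
        scores[word] = (observed, math.log2(observed / expected))
    return scores

# Toy token list (hypothetical), for illustration only.
toy = ("data processing by the registry and data preparation "
       "by the registry use data processing").split()
result = collocates(toy, "data", span=1)
for word, (observed, mi) in sorted(result.items(), key=lambda kv: -kv[1][0]):
    print(f"{word:<12}{observed:>3}{mi:>7.2f}")
```

A positive score means the pair occurs together more often than chance would predict; on a corpus this small the numbers are not meaningful, so treat them only as a demonstration of the calculation.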

The notion of keyness derives from keywords.

Keywords are words which are significantly more frequent in one corpus than another (Hunston 2002).

They are words that are either unique to, or found more frequently in, a specialised corpus compared with a general reference corpus.

These words can be one of the defining characteristics of the specialized corpus.
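The comparison between a specialized corpus and a reference corpus can be sketched as follows. The slides do not name a significance test, so the log-likelihood statistic used here is an assumption, as are the function names and the toy corpora; the threshold of 3.84 is the chi-square critical value for p < 0.05 with one degree of freedom.

```python
import math
from collections import Counter

def log_likelihood(a, b, c, d):
    """Log-likelihood for a word occurring a times in c tokens and b times in d tokens."""
    e1 = c * (a + b) / (c + d)  # expected count in the study corpus
    e2 = d * (a + b) / (c + d)  # expected count in the reference corpus
    ll = 0.0
    if a:
        ll += a * math.log(a / e1)
    if b:
        ll += b * math.log(b / e2)
    return 2 * ll

def keywords(study_tokens, reference_tokens, threshold=3.84):
    """Return (word, score) pairs that are significantly more frequent in the study corpus."""
    study, ref = Counter(study_tokens), Counter(reference_tokens)
    c, d = len(study_tokens), len(reference_tokens)
    found = []
    for word, a in study.items():
        b = ref[word]
        score = log_likelihood(a, b, c, d)
        if a / c > b / d and score >= threshold:
            found.append((word, score))
    return sorted(found, key=lambda pair: -pair[1])

# Toy corpora (hypothetical): "the" is equally common in both, "petroleum" is not.
specialised = ["petroleum"] * 20 + ["the"] * 80
reference = ["the"] * 800 + ["of"] * 200
key = keywords(specialised, reference)
print([word for word, _ in key])
```

Words that are frequent in both corpora at the same rate, such as "the" here, score zero and drop out, which is why keywords tend to reflect what the specialized corpus is about.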

Dispersion is the rate of occurrence of a word or phrase across a particular file or corpus.

A dispersion plot enables us to visually determine whether a term is equally spread throughout a text or occurs as a central theme in one or more parts of the text.
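A rough text-only version of a dispersion plot can be sketched by cutting the text into equal slices and marking the slices that contain the term. The function names `dispersion` and `plot`, the number of segments and the toy token list are assumptions for illustration; corpus tools normally draw this graphically.

```python
def dispersion(tokens, term, segments=10):
    """Count hits of `term` in each of `segments` equal slices of the text."""
    counts = [0] * segments
    for i, tok in enumerate(tokens):
        if tok == term:
            counts[i * segments // len(tokens)] += 1
    return counts

def plot(counts):
    """Draw one character per slice: '#' where the term occurs, '.' where it does not."""
    return "|" + "".join("#" if c else "." for c in counts) + "|"

# Toy text (hypothetical): 100 tokens with "data" near the start and the end.
toy = ["x"] * 100
toy[3] = toy[5] = toy[95] = "data"
counts = dispersion(toy, "data")
print(plot(counts))
```

A term that is evenly spread gives marks across the whole bar, while a term tied to one theme clusters in a few neighbouring slices.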