Corpus Linguistics
Varieties of English
• Relevance of corpus linguistics to this course
  – Previously, studies of stylistics were largely informal and subjective
  – Using computers to look at larger amounts of data allows us to be more formal and objective
  – “Corpus linguistics” basically provides a “mindset” (and some procedures) for doing this
What is a corpus?
• Corpus (pl. corpora) = ‘body’
• Collection of written text or transcribed speech
• Usually but not necessarily purposefully collected
• Usually but not necessarily structured
• Usually but not necessarily annotated
• (Usually stored on and accessible via computer)
• Corpus ~ text archive
“Purposefully collected”
• Text samples collected to meet a specific need
• Corpus may be quite focused, eg corpus of newswire texts, or may be more general
• Issue of balance often important
  – Demographic features (age, sex, location, social class of writer/reader)
  – Different styles and genres
“Structured”
• Overall corpus is divided into sections defined by parameters
• Again balance will ensure that different genres or demographic features are equally represented
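One way to picture a structured corpus is as a set of texts, each labelled with the parameters that define its section. The sketch below is only an illustration: the mini-corpus, the `genre` parameter and the word counts are all invented, and the function simply reports each section's share of the total so that balance can be checked.

```python
from collections import defaultdict

# Hypothetical mini-corpus: every text carries the parameter used to section it.
texts = [
    {"genre": "newspaper", "words": 2000},
    {"genre": "newspaper", "words": 1800},
    {"genre": "fiction", "words": 2000},
    {"genre": "academic", "words": 1900},
    {"genre": "academic", "words": 2000},
]


def section_shares(texts, parameter):
    """Return each section's share of the total word count for one parameter."""
    totals = defaultdict(int)
    for text in texts:
        totals[text[parameter]] += text["words"]
    grand_total = sum(totals.values())
    return {section: words / grand_total for section, words in totals.items()}


shares = section_shares(texts, "genre")
for genre, share in sorted(shares.items()):
    print(f"{genre:10s} {share:.1%}")
```

A section with a much smaller share than the others (here, fiction) would be a candidate for further sampling if equal representation is the goal.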
Parameters in the BNC (written portion)
Genre distinctions in the BNC (written portion)
Parameters in BNC (spoken part)
Parameters in BNC (spoken part) cont
“Annotated”
• Not just plain text
• Most corpora are at least “POS tagged”
  – Each word has its part of speech (POS) identified
  – POS tags contain quite rich information, eg not just “verb” but including some morphological information
  – Tags also disambiguate where possible, eg between book as a noun and book as a verb
• Some may also have other information indicated
  – structural information resulting from parse
  – word sense distinctions for same-POS homonyms
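A POS-tagged text can be pictured as a sequence of word/tag pairs. The sketch below assumes a simple `word_TAG` plain-text format and hand-written tags in the style of the BNC's CLAWS5 tagset (eg `NN1` singular noun, `VVI` infinitive verb); the two sentences and the helper names are invented for illustration and are not taken from any real corpus file.

```python
from collections import Counter

# Hypothetical tagged text in "word_TAG" format (tags written by hand).
tagged = (
    "I_PNP will_VM0 book_VVI a_AT0 table_NN1 ._PUN "
    "The_AT0 book_NN1 is_VBZ on_PRP the_AT0 table_NN1 ._PUN"
)


def parse_tagged(text):
    """Split whitespace-separated 'word_TAG' tokens into (word, tag) pairs."""
    return [tuple(token.rsplit("_", 1)) for token in text.split()]


def tags_for(word, pairs):
    """Count the tags assigned to one word form, ignoring case."""
    return Counter(tag for w, tag in pairs if w.lower() == word.lower())


pairs = parse_tagged(tagged)
print(tags_for("book", pairs))
```

Because the tag travels with each token, a search can ask for book as a verb only, which a plain-text search cannot do.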
What is corpus linguistics?
• Not a branch of linguistics, like socio~, psycho~, …
• Not a theory of linguistics
• A set of tools and methods (and a philosophy) to support linguistic investigation across all branches of the subject
Evidence in linguistics
• Real attested usage as linguistic evidence
• Contrasts with introspective approach previously typical
• Relates to the competence~performance (langue~parole) distinction
• Corpus linguists often more interested in trends than rules (probabilities rather than certainties)
• Famous stories of corpus evidence contradicting widely-held assumptions about language use.
Activities in corpus linguistics
• Design and compilation of corpora
• Development of tools for corpus analysis
• Descriptive linguists using corpora to analyze lexical and grammatical behaviour of language, eg for lexicography, and of course stylistics
• Exploiting corpora in applied linguistics – language teaching, translation.
History of Corpus Linguistics
www.essex.ac.uk/linguistics/clmt/w3c/corpus_ling/content/history.html
• Textual study has always included an element of counting and cataloguing, despite impracticalities – notably concordances of Shakespeare, the Bible, etc.
• Arrival of computers in 1950s of course changed everything
Brown corpus
• First modern computer-readable corpus
• W.N. Francis and H. Kučera, Brown University, Providence, RI
• one million words of American English texts printed in 1961
• sampled from 15 different text categories
• used as model for other corpora, including …
LOB corpus
• compiled by researchers in Lancaster, Oslo and Bergen
• one million words of British English texts printed in 1961
• sampled from same 15 text categories as Brown corpus
• All texts ≤ 2,000 words long
• Kolhapur corpus of Indian English compiled in 1978 to same specification
The London-Lund Corpus of Spoken English (LLC)
• First corpus of transcribed spoken language
• Part of Survey of Spoken English at Lund University under the direction of J. Svartvik
• 500,000 words of spoken British English recorded from 1953 to 1987
• different categories, such as spontaneous conversation, spontaneous commentary, spontaneous and prepared oration
COBUILD
• 1m-word corpus too small for many applications
• 1980: Collins instigated collection of 20m-word corpus to support lexicographers writing a new learners’ dictionary under the COBUILD project (Collins Birmingham University International Language Database; John Sinclair)
• Now expanded to Bank of English corpus, 320m words and growing
• www.collins.co.uk/Corpus/CorpusSearch.aspx
• www.collins.co.uk/books.aspx?group=153
BNC (1995)
• http://www.natcorp.ox.ac.uk/
• 100m word collection of written and spoken text from 1975-93 (already dated in some respects!)
• Carefully designed and balanced
• Corpus is closed (finite, synchronic)
• All text tagged to high quality
• Lots of tools available for exploration
• Nice online interface (available on campus): http://bnc.humanities.manchester.ac.uk/cgi-bnc/BNCquery.pl?theQuery=search&urlTest=yes
What can you do with a corpus?
• Many things, but just some examples:
• Investigate behaviour of words and how they relate to genre, mode, sex of speaker/hearer
• Prove (or disprove) supposed trends with quantitative data
Example 1: swearing
• Women and men swear (and use taboo words) differently
• Data (from BNC spoken part) shows
  – Women and men use different swear words
  – They use them for different effect (men use them to disparage, women use them to intensify)
  – Their use changes depending on the sex of the listener(s): women swear more in single-sex groups; men don’t swear more in mixed-sex than amongst themselves
Example 2.1: Near synonyms
• Subtle differences in the meaning of near synonyms can be distinguished by looking at the words they collocate with
  – “You shall know a word by the company it keeps” (Firth)
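A rough way to find the “company” a word keeps is to count the words that occur within a few tokens of it. The sketch below is a minimal illustration, not a method prescribed by these slides: the sample sentence is invented, the window size of 2 is an arbitrary assumption, and real collocation studies usually add a statistical association measure on top of raw counts.

```python
from collections import Counter


def collocates(tokens, node, window=2):
    """Count words occurring within `window` tokens either side of `node`."""
    counts = Counter()
    for i, token in enumerate(tokens):
        if token != node:
            continue
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        counts.update(left + right)
    return counts


# Hypothetical sample text, invented for illustration only.
sample = (
    "the frail old man carried a fragile glass vase "
    "past a frail elderly woman near the fragile peace talks"
)
tokens = sample.split()

print(collocates(tokens, "frail").most_common())
print(collocates(tokens, "fragile").most_common())
```

Comparing the two lists side by side is the kind of evidence used to separate near synonyms such as frail and fragile.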
frail vs fragile
Example 2.2: Near synonyms
• In addition, near synonyms can be shown to be favoured depending on genre, eg big vs large
| Category | big | large |
|---|---|---|
| Spoken conversation | 768.55 | 488.34 |
| Other spoken material | 395.89 | 447.58 |
| Newspapers | 365.27 | 431.62 |
| Fiction and verse | 333 | 293.06 |
| Other published written material | 290.84 | 223.43 |
| Unpublished written material | 247.39 | 186.35 |
| Non-academic prose and biography | 139.63 | 181.19 |
| Academic prose | 38.85 | 45.11 |

Frequency per million words
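Frequencies are given per million words so that sections of different sizes can be compared directly. The sketch below shows the normalisation and then reads the preferred word per category off the table's figures; the raw count and section size passed to `per_million` are invented for illustration, while the `rates` values are copied from the table above.

```python
def per_million(raw_count, section_words):
    """Normalise a raw count to a frequency per million words."""
    return raw_count * 1_000_000 / section_words


# Hypothetical figures: 50 hits in a 2-million-word section.
print(per_million(50, 2_000_000))

# (big, large) frequencies per million words, copied from the table.
rates = {
    "Spoken conversation": (768.55, 488.34),
    "Other spoken material": (395.89, 447.58),
    "Newspapers": (365.27, 431.62),
    "Fiction and verse": (333, 293.06),
    "Other published written material": (290.84, 223.43),
    "Unpublished written material": (247.39, 186.35),
    "Non-academic prose and biography": (139.63, 181.19),
    "Academic prose": (38.85, 45.11),
}

# Which of the two near synonyms is more frequent in each category?
preference = {
    category: "big" if big > large else "large"
    for category, (big, large) in rates.items()
}

for category, word in preference.items():
    print(f"{category:35s} {word}")
```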