BTANT 129 w5
Introduction to corpus linguistics
Corpus
• The old school concept
  – A collection of texts, especially if complete and self-contained: the corpus of Anglo-Saxon verse (The Oxford Companion to the English Language)
• The modern view
  – A collection of naturally occurring language text, chosen to characterize a state or variety of a language (John Sinclair, Corpus, Concordance, Collocation, OUP)
Corpus vs. archive
• Text archive
  – Collection of texts in their original format (Oxford Text Archive: http://ota.ox.ac.uk/)
• Corpus
  – Texts collected and processed in a unified, systematic manner (British National Corpus: http://www.natcorp.ox.ac.uk/)
Short history
Brief mention of just a select few!
• Brown Corpus (Brown University)
  – 1 m words
  – 15 genres
  – 500 samples of 2,000 words each
  – Area: US
  – Time: 1961
• LOB Corpus (Lancaster-Oslo/Bergen)
  – GB replica of Brown
Cobuild
• Major corpus initiative by Collins and Birmingham Univ. (John Sinclair)
• 1991: 20 m words
• -> Bank of English: currently 450 m words
• http://www.cobuild.collins.co.uk
British National Corpus
• 100 m words, careful selection
• 10% spoken material
• time span: from 1960 (fiction) and from 1975 (non-fiction)
• texts of 40,000-50,000 words
• TEI-compliant SGML coding
• http://www.comp.lancs.ac.uk/ucrel/bncindex/
International Corpus of English
• 20 corpora of 1 m words each, devoted to varieties of English around the world
• 500 texts (300 spoken, 200 written) of 2,000 words each
• time span: 1990-1996
• ICE-GB available in demo version
• syntactic annotation, graphical tool ICECUP
Corpus processing: tokenization
• Preprocessing
  – Tokenization: segmenting the text into
    • sentences (sometimes tricky: sentence delimiters can occur in mid-sentence positions)
    • words (multi-word units are a problem)
  – Normalization
    • restoring clitics and abbreviations to their full forms ("can't", "I've")
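The preprocessing steps above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not a production tokenizer: the abbreviation list `ABBREVIATIONS`, the clitic table `CLITICS`, and all function names are hypothetical samples made up for this example.

```python
import re

# Hypothetical sample lists; a real tokenizer would need far larger ones.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "prof.", "e.g.", "i.e.", "univ."}
CLITICS = {"can't": "can not", "i've": "I have", "it's": "it is"}

# A word (optionally with an apostrophe part), a number, or one punctuation mark.
WORD_RE = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)?|\d+|[^\w\s]")


def split_sentences(text):
    """Split after ., ! or ? unless the chunk is a known abbreviation."""
    sentences, current = [], []
    for chunk in text.split():
        current.append(chunk)
        if chunk[-1] in ".!?" and chunk.lower() not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:  # trailing material without a final delimiter
        sentences.append(" ".join(current))
    return sentences


def normalize(token):
    """Restore a clitic form to its full form if it is in the table."""
    return CLITICS.get(token.lower(), token)


def tokenize(sentence):
    """Split a sentence into word and punctuation tokens, normalizing clitics."""
    tokens = []
    for chunk in sentence.split():
        if chunk.lower() in ABBREVIATIONS:
            tokens.append(chunk)  # keep "Dr." as a single token
            continue
        for piece in WORD_RE.findall(chunk):
            tokens.extend(normalize(piece).split())
    return tokens


text = "Dr. Smith can't come. I've told Prof. Jones already!"
for sentence in split_sentences(text):
    print(tokenize(sentence))
# ['Dr.', 'Smith', 'can', 'not', 'come', '.']
# ['I', 'have', 'told', 'Prof.', 'Jones', 'already', '!']
```

The full stop after "Dr." and "Prof." is exactly the kind of mid-sentence delimiter that makes sentence splitting tricky; the sketch handles it only because those two forms happen to be in the sample list.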
Corpus processing: tagging
• Tagging
  – labelling every word with its part-of-speech (POS) category
  – Problem: ambiguity
    • out of context, a word can belong to different parts of speech or have different analyses within the same POS
      – set N vs. set V
      – Hungarian bánt: past tense of 'bánik' (VBD) or present tense of 'bánt' (VBZ)
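The ambiguity problem can be made concrete with a toy lexicon lookup. The sketch below is an assumption-laden illustration: the `LEXICON` entries are invented, the tag labels follow the Penn Treebank style used on the slide, and `UNK` is a made-up placeholder for unknown words.

```python
# Hypothetical toy lexicon: each word form maps to every tag it can carry.
LEXICON = {
    "the": {"DT"},
    "they": {"PRP"},
    "set": {"NN", "VB", "VBD", "VBN"},
    "table": {"NN", "VB"},
}


def candidate_tags(word):
    """Return all possible tags for a word form, ignoring context."""
    return sorted(LEXICON.get(word.lower(), {"UNK"}))  # UNK: placeholder tag


def ambiguous_words(words):
    """List the words that have more than one candidate tag."""
    return [w for w in words if len(candidate_tags(w)) > 1]


sentence = ["They", "set", "the", "table"]
for word in sentence:
    print(word, candidate_tags(word))
print(ambiguous_words(sentence))  # ['set', 'table']
```

A lookup like this can only list the candidates; choosing among them requires context.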
Corpus processing: disambiguation
• Disambiguation
  – determining the correct analysis in context
• Two approaches (both need a manually corrected training corpus):
  – statistical
    • Hidden Markov model
    • calculating probability within a span of usually one or two words
    • success rate can be around 98%
  – rule-based
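As a rough illustration of the statistical approach, here is a minimal bigram Hidden Markov model tagger using the Viterbi algorithm, so each tag decision looks at a span of one preceding word. All probabilities and the coarse tag set are invented for the example; a real tagger would estimate them from a manually corrected training corpus.

```python
TAGS = ["PRON", "VERB", "DET", "NOUN"]

# Invented probabilities, for illustration only.
START = {"PRON": 0.5, "DET": 0.3, "NOUN": 0.15, "VERB": 0.05}
TRANS = {  # TRANS[previous_tag][tag]
    "PRON": {"VERB": 0.8, "NOUN": 0.1, "DET": 0.05, "PRON": 0.05},
    "VERB": {"DET": 0.6, "NOUN": 0.2, "PRON": 0.15, "VERB": 0.05},
    "DET": {"NOUN": 0.9, "VERB": 0.04, "DET": 0.03, "PRON": 0.03},
    "NOUN": {"VERB": 0.5, "NOUN": 0.3, "DET": 0.1, "PRON": 0.1},
}
EMIT = {  # EMIT[tag][word]
    "PRON": {"they": 1.0},
    "VERB": {"set": 0.7, "table": 0.3},
    "DET": {"the": 1.0},
    "NOUN": {"set": 0.3, "table": 0.7},
}


def viterbi(words):
    """Return the most probable tag sequence for a non-empty word list."""
    # Each column maps tag -> (best path probability, previous tag).
    columns = [{t: (START[t] * EMIT[t].get(words[0], 0.0), None) for t in TAGS}]
    for word in words[1:]:
        column = {}
        for tag in TAGS:
            column[tag] = max(
                (columns[-1][prev][0] * TRANS[prev][tag] * EMIT[tag].get(word, 0.0), prev)
                for prev in TAGS
            )
        columns.append(column)
    # Follow the back-pointers from the best final tag.
    tag = max(TAGS, key=lambda t: columns[-1][t][0])
    path = [tag]
    for column in reversed(columns[1:]):
        tag = column[tag][1]
        path.append(tag)
    return path[::-1]


print(viterbi(["they", "set", "the", "table"]))  # ['PRON', 'VERB', 'DET', 'NOUN']
print(viterbi(["the", "set"]))                   # ['DET', 'NOUN']
```

With these toy numbers the same word form "set" comes out as a verb after a pronoun and as a noun after a determiner, which is the kind of contextual decision the model is meant to make.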
Syntactic annotation
• Difficult to do on such a scale
• shallow parsing
• Treebank: collection of syntactically analyzed sentences
• Penn Treebank
• http://www.cis.upenn.edu/~treebank/
Recent trends
• Word sense disambiguation (SENSEVAL)
  – http://www.itri.brighton.ac.uk/events/senseval/
• Message understanding
  – http://www.itl.nist.gov/iaui/894.02/related_projects/muc/index.html
• Semantic Web
  – making information on the web understandable for machines
  – a vision requiring a huge effort; not clear whether it is feasible at all
Representative sample?
• A corpus of any size is inevitably a sample
• Of what?
• Two approaches
  – sampling speakers: demographic sampling
  – sampling their output: text type sampling
The notion of representativeness
• Sample vs. population
• the sample should be proportional to the population for a given feature
  – example of demographic sampling: if we know from census figures that 48% of people living in Budapest are male, we should compile our sample so that 48% of the informants are male
  -> our sample is representative of Budapest residents for gender
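The proportional sampling idea in the Budapest example can be sketched as a simple quota draw. The 48% figure comes from the slide; the pool of informants, the field name `gender`, and the function `proportional_sample` are hypothetical and exist only for this illustration.

```python
import random


def proportional_sample(informants, feature, target_shares, sample_size, seed=0):
    """Draw a sample whose share of each feature value matches target_shares.

    Raises ValueError if a quota is larger than the available pool.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    sample = []
    for value, share in target_shares.items():
        pool = [person for person in informants if person[feature] == value]
        quota = round(sample_size * share)
        sample.extend(rng.sample(pool, quota))
    return sample


# Hypothetical pool of 1,000 potential informants, half male and half female.
pool = [{"id": i, "gender": "male" if i % 2 == 0 else "female"} for i in range(1000)]

# Census shares as in the slide's example: 48% male, hence 52% female.
sample = proportional_sample(pool, "gender", {"male": 0.48, "female": 0.52}, 100)
males = sum(1 for person in sample if person["gender"] == "male")
print(len(sample), males)  # 100 48
```

The sample is then representative for gender only; nothing in the procedure guarantees representativeness for any other feature.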
Trouble with representativeness
• What should be the units of sampling?
• Registers, text types, genres etc.
• But there is no independent evidence about their ratio in the totality of language output
-> representativeness is an ideal, but impossible to implement
Approaches to Representativeness
• Douglas Biber:
  – Rejects the notion of proportional sampling
  – Sample should be as varied as possible
  – Representativeness measured in terms of the wide variety of text types included in the sample
The Web as a corpus?
• Pros:
  – immense database
  – dynamically growing
  – ideal 'quick and dirty' method
• Cons:
  – lots of rubbish, irrelevant data
  – difficult to extract hits
  – no language analysis
  – only string query, which is crude
One quick example
• Representativity or representativeness?
• Throw the two words at Google and have a look at the figures
• Think about the conclusions
• There are special front-end sites