BTANT 129 w5

Introduction to corpus linguistics

Corpus

• The old-school concept
– A collection of texts, especially if complete and self-contained: the corpus of Anglo-Saxon verse
– Source: The Oxford Companion to the English Language
• The modern view
– A collection of naturally occurring language text, chosen to characterize a state or variety of a language
– Source: John Sinclair, Corpus, Concordance, Collocation (OUP)

Corpus vs. archive

• Text archive
– Collection of texts in their original format (Oxford Text Archive: http://ota.ox.ac.uk/)
• Corpus
– Texts collected and processed in a unified, systematic manner (British National Corpus: http://www.natcorp.ox.ac.uk/)

Short history

Brief mention of just a select few!
• Brown Corpus (Brown University)
– 1 m words
– 15 genres
– 500 samples of 2,000 words each
– Area: US
– Time: 1961
• LOB Corpus (Lancaster-Oslo/Bergen)
– GB replica of Brown

Cobuild

• Major corpus initiative by Collins and Birmingham Univ. (John Sinclair)
• 1991: 20 m words
• -> Bank of English: currently 450 m words
• http://www.cobuild.collins.co.uk

British National Corpus

• 100 m words, carefully selected
• 10% spoken material
• time span: from 1960 (fiction), from 1975 (non-fiction)
• texts of 40,000-50,000 words
• TEI-compliant SGML coding
• http://www.comp.lancs.ac.uk/ucrel/bncindex/

International Corpus of English

• 20 corpora of 1 m words each, devoted to varieties of English around the world
• 500 texts (300 spoken, 200 written) of 2,000 words each
• time span: 1990-1996
• ICE-GB available in demo version
• syntactic annotation, graphical tool: ICECUP

Corpus processing: tokenization

• Preprocessing
– Tokenization: segmenting the text into
  • sentences (sometimes tricky: sentence delimiters can occur in mid-sentence positions)
  • words (multi-word units are a problem)
– Normalization
  • restoring clitics and abbreviations ("can't", "I've")

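The preprocessing steps above can be sketched in a few lines of Python. This is only a minimal illustration, not a production tokenizer: the abbreviation list, the contraction table and the function names are all assumptions made up for the example, and real corpora need far larger resources.

```python
import re

# Hypothetical abbreviation list; a real system would need a much larger one.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "prof.", "univ.", "vs.", "etc."}

# Hypothetical contraction table used for normalization.
CONTRACTIONS = {"can't": ["can", "not"], "i've": ["I", "have"]}


def split_sentences(text):
    """Split after ., ! or ? unless the token is a known abbreviation."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        if token[-1] in ".!?" and token.lower() not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences


def tokenize(sentence):
    """Split a sentence into words and punctuation, keeping apostrophes inside words."""
    return re.findall(r"\w+(?:'\w+)*|[^\w\s]", sentence)


def normalize(tokens):
    """Expand contractions listed in CONTRACTIONS; leave other tokens unchanged."""
    out = []
    for tok in tokens:
        out.extend(CONTRACTIONS.get(tok.lower(), [tok]))
    return out


for sentence in split_sentences("Dr. Smith can't come. I've seen it!"):
    print(normalize(tokenize(sentence)))
```

The full stop after "Dr." is the kind of mid-sentence delimiter the slide warns about: without the abbreviation list, the first sentence would be cut after "Dr.".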

Corpus processing: tagging

• Tagging
– labelling every word with its part-of-speech (POS) category
– Problem: ambiguity
  • out of context, a word can belong to different parts of speech or have different analyses within the same POS
  • set N vs. set V
  • Hungarian bánt: past tense of 'bánik' (VBD) or present tense of 'bánt' (VBZ)
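To make the ambiguity problem concrete, the sketch below collects every tag a word form receives in a tiny tagged sample and lists the forms with more than one candidate. The sample sentences, the Penn-style tags and the function names are invented for illustration.

```python
from collections import defaultdict

# Hypothetical hand-tagged mini sample: (word, tag) pairs with Penn-style tags.
TAGGED = [
    ("the", "DT"), ("set", "NN"), ("is", "VBZ"), ("complete", "JJ"),
    ("they", "PRP"), ("set", "VBP"), ("the", "DT"), ("table", "NN"),
]


def candidate_tags(tagged_tokens):
    """Map each lower-cased word form to the set of tags seen with it."""
    lexicon = defaultdict(set)
    for word, tag in tagged_tokens:
        lexicon[word.lower()].add(tag)
    return lexicon


def ambiguous_words(lexicon):
    """Return only the word forms that have more than one candidate tag."""
    return {w: sorted(tags) for w, tags in lexicon.items() if len(tags) > 1}


print(ambiguous_words(candidate_tags(TAGGED)))
```

A lookup like this only tells us which analyses are possible out of context; choosing among them requires the context of the word.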

Corpus processing: disambiguation

• Disambiguation
– determining the correct analysis in context
• Two approaches (both need a manually corrected training corpus)
– statistical
  • Hidden Markov model
  • calculating probability within a span of usually one or two words
  • success rate can be around 98%
– rule-based
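The statistical approach can be illustrated with a toy bigram Hidden Markov model tagger. This is a hedged sketch, not the method of any particular tagger: the three training sentences, the add-one smoothing and all function names are assumptions chosen for the example. It looks back only one tag (a one-word span) and uses the Viterbi algorithm to pick the most probable tag sequence.

```python
import math
from collections import Counter

# Hypothetical manually corrected training corpus: lists of (word, tag) pairs.
TRAIN = [
    [("the", "DT"), ("set", "NN"), ("is", "VBZ"), ("complete", "JJ")],
    [("they", "PRP"), ("set", "VBP"), ("the", "DT"), ("table", "NN")],
    [("the", "DT"), ("table", "NN"), ("is", "VBZ"), ("set", "VBN")],
]


def train(sentences):
    """Count tag bigrams (transitions) and tag/word pairs (emissions)."""
    trans, emit, tag_count, vocab = Counter(), Counter(), Counter(), set()
    for sent in sentences:
        prev = "<s>"  # sentence-start pseudo-tag
        tag_count[prev] += 1
        for word, tag in sent:
            trans[(prev, tag)] += 1
            emit[(tag, word)] += 1
            tag_count[tag] += 1
            vocab.add(word)
            prev = tag
    return trans, emit, tag_count, vocab


def viterbi(words, model):
    """Return the most probable tag sequence for words under the bigram model."""
    if not words:
        return []
    trans, emit, tag_count, vocab = model
    tags = [t for t in tag_count if t != "<s>"]

    def log_trans(prev, tag):  # add-one smoothed estimate of P(tag | prev)
        return math.log((trans[(prev, tag)] + 1) / (tag_count[prev] + len(tags)))

    def log_emit(tag, word):  # add-one smoothed estimate of P(word | tag)
        return math.log((emit[(tag, word)] + 1) / (tag_count[tag] + len(vocab) + 1))

    # best[i][tag] = (log probability of the best path ending in tag, previous tag)
    best = [{t: (log_trans("<s>", t) + log_emit(t, words[0]), None) for t in tags}]
    for word in words[1:]:
        column = {}
        for t in tags:
            score, prev = max((best[-1][p][0] + log_trans(p, t), p) for p in tags)
            column[t] = (score + log_emit(t, word), prev)
        best.append(column)

    # Follow the back-pointers from the best final tag.
    tag = max(tags, key=lambda t: best[-1][t][0])
    path = [tag]
    for column in reversed(best[1:]):
        tag = column[tag][1]
        path.append(tag)
    return list(reversed(path))


model = train(TRAIN)
print(viterbi(["they", "set", "the", "table"], model))
print(viterbi(["the", "set", "is", "complete"], model))
```

In this toy model "set" is tagged as a verb after "they" and as a noun after "the", which is the kind of contextual decision the slide describes. The accuracy figure quoted above refers to taggers trained on large corpora, not to a sketch of this size.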

Syntactic annotation

• Difficult to do on such a scale
• Shallow parsing
• Treebank: collection of syntactically analyzed sentences
– Penn Treebank: http://www.cis.upenn.edu/~treebank/
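Treebank sentences are commonly stored as labelled bracketings. The sketch below reads one such bracketed string into nested Python lists and recovers the words. The example sentence and the function names are made up for illustration, and the reader assumes well-formed input (it does no error handling for unbalanced brackets).

```python
import re


def parse_tree(text):
    """Parse a bracketed tree, e.g. "(S (NP (DT the) (NN table)) ...)", into nested lists."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def read():
        nonlocal pos
        token = tokens[pos]
        pos += 1
        if token != "(":
            return token          # a leaf: a word
        node = [tokens[pos]]      # the category label follows the opening bracket
        pos += 1
        while tokens[pos] != ")":
            node.append(read())
        pos += 1                  # skip the closing bracket
        return node

    return read()


def leaves(node):
    """Return the words of the tree from left to right."""
    if isinstance(node, str):
        return [node]
    words = []
    for child in node[1:]:
        words.extend(leaves(child))
    return words


tree = parse_tree("(S (NP (DT the) (NN table)) (VP (VBZ is) (VP (VBN set))))")
print(tree)
print(leaves(tree))
```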

Recent trends

• Word sense disambiguation (SENSEVAL)
– http://www.itri.brighton.ac.uk/events/senseval/
• Message understanding
– http://www.itl.nist.gov/iaui/894.02/related_projects/muc/index.html
• Semantic Web
– making information on the web understandable for machines
– a vision requiring a huge effort; not clear whether it is feasible at all

Representative sample?

• A corpus of any size is inevitably a sample
• Of what?
• Two approaches
– sampling speakers: demographic sampling
– sampling their output: text-type sampling

The notion of representativeness

• Sample vs. population
• Sample should be proportional to the population for a given feature
– Example for demographic sampling:
  • if we know from census figures that 48% of people living in Budapest are male,
  • we should compile our sample so that 48% of the informants are male
  • -> our sample is representative of Budapest residents for gender
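The arithmetic behind proportional sampling can be sketched as a small quota calculation. The sample sizes, the 52% share for the remaining residents (simply the complement of the 48% on the slide) and the function name are assumptions for illustration. Percentages are expected to be whole numbers that add up to 100.

```python
def quotas(percentages, sample_size):
    """Allocate sample_size informants in proportion to whole-number percentages."""
    products = {group: pct * sample_size for group, pct in percentages.items()}
    counts = {group: value // 100 for group, value in products.items()}
    # Hand out the informants lost to rounding down, largest remainder first.
    leftover = sample_size - sum(counts.values())
    by_remainder = sorted(products, key=lambda g: products[g] % 100, reverse=True)
    for group in by_remainder[:leftover]:
        counts[group] += 1
    return counts


# 48% male is the census figure from the slide; 52% is its assumed complement.
print(quotas({"male": 48, "female": 52}, 200))
```

With 200 informants the quota is 96 male informants, which matches the 48% population share exactly. With very small samples the shares can only be approximated.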

Trouble with representativeness

• What should be the units of sampling?
• Registers, text types, genres etc.
• But there is no independent evidence about their ratio in the totality of language output
-> representativeness is an ideal, but impossible to implement

Approaches to Representativeness

• Douglas Biber:
– Rejects the notion of proportional sampling
– Sample should be as varied as possible
– Representativeness is measured in terms of the wide variety of text types included in the sample

The Web as a corpus?

• Pros:
– immense database
– dynamically growing
– ideal 'quick and dirty' method
• Cons:
– lots of rubbish, irrelevant data
– difficult to extract hits
– no language analysis
– only string query, which is crude

One quick example

• Representativity or representativeness?
• Throw the two words at Google and have a look at the figures
• Think about the conclusions
• There are special front-end sites
