Why We Need Corpora and the Sketch Engine
Adam Kilgarriff
Lexical Computing Ltd, UK
Universities of Leeds and Sussex
Madrid April 2010 Kilgarriff: Why corpora and how 3
Exercise
Think about the word planet.
What could you say about it if you were writing a dictionary entry?
Write down three (or more) things.
The Sketch Engine: demo
http://www.sketchengine.co.uk
Dictionaries
How to decide what to say about the word?
• What the native speaker knows (introspection)
• What other dictionaries say
• Corpus
Age 1: Pre-computer
Oxford English Dictionary:
• 20 million index cards
Age 2: KWIC Concordances
• From 1980
• Computerised
• Overhauled lexicography
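A KWIC (key word in context) concordance lists every occurrence of a word with a few words of context on each side. The following is a minimal sketch in Python, assuming whitespace-tokenised text; the function name `kwic`, the `width` parameter and the sample sentence are illustrative and do not come from the slides or from the Sketch Engine.

```python
def kwic(tokens, node, width=4):
    """Return one (left context, node, right context) triple per hit."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == node:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines


# Illustrative text only, not corpus data.
text = ("The planet Mars is the fourth planet from the Sun . "
        "We must save the planet .")

for left, word, right in kwic(text.split(), "planet", width=3):
    # Right-align the left context so the node words line up in a column.
    print(f"{left:>20}  {word}  {right}")
```

Aligning the node word in a central column is what makes the lines quick to scan by eye.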
Age 2: limitations
As corpora get bigger: too much data
• 50 lines for a word: read all
• 500 lines: could read all, but it takes a long time
• 5000 lines: no
Age 3: Collocation statistics
Problem: too much data - how to summarise?
Solution: list of words occurring in the neighbourhood of the headword, with frequencies
• Sorted by salience
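The idea of a collocate list sorted by salience can be sketched as follows. This is only an illustration: the slides do not say which statistic is used, so pointwise mutual information (PMI) is assumed here as one common choice, and the function name, window size and sample tokens are hypothetical.

```python
import math
from collections import Counter


def collocates(tokens, node, window=3, min_freq=1):
    """Rank words found up to `window` tokens to the right of `node`.

    Salience is pointwise mutual information (an assumption; the
    slides do not name the statistic).
    """
    n = len(tokens)
    word_freq = Counter(tokens)
    pair_freq = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            pair_freq.update(tokens[i + 1:i + 1 + window])
    rows = []
    for word, f_pair in pair_freq.items():
        if f_pair < min_freq:
            continue
        pmi = math.log2(f_pair * n / (word_freq[node] * word_freq[word]))
        rows.append((word, f_pair, round(pmi, 2)))
    # Most salient collocates first.
    return sorted(rows, key=lambda row: row[2], reverse=True)


# Illustrative tokens only, not corpus data.
tokens = "save money and save lives and save money now and then".split()
for word, freq, score in collocates(tokens, "save", window=2):
    print(word, freq, score)
```

In this toy sample "and" co-occurs with "save" as often as "money" does, but it is frequent everywhere, so it ranks last. That is the difference between sorting by salience and sorting by raw frequency.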
Collocation listing
Collocates of save (>5 hits), to the right of the node word

word        word
forests     life
$1.2        dollars
lives       costs
enormous    thousands
annually    face
jobs        estimated
money       your
Age-3 collocation statistics: limitations
• Lists contain junk
• Unsorted for type: mixes together adverbs, subjects, objects, prepositions
What we really want:
• noise-free lists
• one list for each grammatical relation
Age 4: The word sketch
• Large well-balanced corpus
• Parse to find subjects, objects, heads, modifiers etc
• One list for each grammatical relation
• Statistics to sort each list, as before
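The grouping step behind a word sketch can be illustrated with a short Python sketch. It assumes a parser has already produced (relation, head, dependent) triples; the triples, the relation names and the function `word_sketch` are hypothetical, and raw frequency stands in for the salience statistic to keep the example short. It is not the Sketch Engine's own implementation.

```python
from collections import Counter, defaultdict


def word_sketch(triples, headword):
    """Group the collocates of `headword` into one list per relation."""
    by_relation = defaultdict(Counter)
    for relation, head, dependent in triples:
        if head == headword:
            by_relation[relation][dependent] += 1
    # One list per grammatical relation, most frequent collocate first.
    return {rel: counts.most_common() for rel, counts in by_relation.items()}


# Hypothetical parser output: (relation, head, dependent).
triples = [
    ("object", "save", "money"),
    ("object", "save", "life"),
    ("object", "save", "money"),
    ("subject", "save", "doctor"),
    ("modifier", "save", "annually"),
    ("object", "spend", "money"),
]

sketch = word_sketch(triples, "save")
for relation, items in sketch.items():
    print(relation, items)
```

Keeping objects, subjects and modifiers in separate lists avoids the mixing of word types noted under the Age 3 limitations.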
Macmillan English Dictionary for Advanced Learners
Ed: Rundell, 2002, 2007
Fruit task
• Choose a fruit
• Concordance: lemma, noun, lower case
• Frequency: node forms
• Write down:
  plural freq (pl)
  singular freq (sing)
• Compute proportion: pl/(pl+sing)
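The last step of the task is simple arithmetic, sketched below in Python. The function name is illustrative and the counts are made up for the example; real values come from the frequency listing of whichever fruit was chosen.

```python
def plural_proportion(plural_freq, singular_freq):
    """Share of a noun's occurrences that are plural: pl / (pl + sing)."""
    total = plural_freq + singular_freq
    if total == 0:
        raise ValueError("the word does not occur in the corpus")
    return plural_freq / total


# Hypothetical counts for illustration only, not taken from any corpus.
print(plural_proportion(plural_freq=300, singular_freq=700))
```

Using a proportion instead of the raw plural count makes frequent and rare fruit words directly comparable.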
What is a corpus?
A collection of texts (as used for linguistic study)
Which texts? How many?
Written
• Books
  Fiction, non-fiction, textbooks
• Newspapers
• Letters, unpublished
• Web pages
• Academic journals
• Student essays
• …
Spoken
Must be transcribed, for text corpora
• Conversation
  Who? Region, class, age-group, situation…
• Lectures
• TV and Radio
• Film transcripts
• Meetings, seminars
• …
Which texts?
Different purposes, different text types
Making dictionaries:
• Cover the whole language
• Some of everything
How much?
• Most words are rare (Zipf’s Law)
• To get enough data for most words, we need very big corpora
Zipf’s Law
Word (pos)       r        f          r x f
the (det)        1        6187267    6187267
to (prep)        10       917579     9175790
as (adv)         100      91583      9158300
playing (vb)     1000     9738       9738000
paint (vb)       2000     4539       9078000
amateur (adj)    10,000   741        7410000
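The r x f column can be recomputed from the ranks and frequencies in the table. This short Python sketch uses the figures from the slide; the variable and function names are illustrative.

```python
# (word, rank r, frequency f) as given in the table.
rows = [
    ("the", 1, 6187267),
    ("to", 10, 917579),
    ("as", 100, 91583),
    ("playing", 1000, 9738),
    ("paint", 2000, 4539),
    ("amateur", 10000, 741),
]


def rank_times_freq(rows):
    """Return (word, r, f, r * f) for each row."""
    return [(word, r, f, r * f) for word, r, f in rows]


products = rank_times_freq(rows)
for word, r, f, product in products:
    print(f"{word:<10}{r:>8}{f:>10}{product:>10}")
```

Frequency falls by about four orders of magnitude between rank 1 and rank 10,000, while r x f stays between roughly 6 and 10 million. That near-constant product is the pattern Zipf's Law describes.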
Zipf’s Law
• the: 6%
• 100 most frequent: 45%
• 7500 most frequent: 90%
• all others: rare
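Coverage figures of this kind are the share of all running words accounted for by the N most frequent word types. A minimal Python sketch of the calculation follows, assuming a simple token list; the function name and the toy sentence are illustrative and do not reproduce the percentages on the slide.

```python
from collections import Counter


def coverage(tokens, top_n):
    """Fraction of all tokens covered by the `top_n` most frequent types."""
    counts = Counter(tokens)
    covered = sum(freq for _, freq in counts.most_common(top_n))
    return covered / len(tokens)


# Illustrative tokens only, not corpus data.
tokens = "the cat sat on the mat and the dog sat on the rug".split()

for n in (1, 3, 8):
    print(n, round(coverage(tokens, n), 2))
```

On a real corpus the curve rises steeply for the first few hundred types and then flattens, which is why the long tail of rare words needs very large corpora.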
Zipf’s Law
[Bar chart: % of all texts (0 to 100) accounted for by 'the', the 100 most frequent, the 3500 most frequent and the 7500 most frequent words]
Leading English Corpora: Size
[Chart: size of corpora (in words, scale from 10^6 to 10^9) by decade, 1960s to 2000s; corpora shown: Brown/LOB, COBUILD, BNC, OEC]