School of Computing, FACULTY OF ENGINEERING

What’s in a Corpus?

Eric Atwell, Language Research Group (with thanks to Katja Markert, Marti Hearst, and other contributors)

Reminder

Why NLP is difficult: language is a complex system

How to solve it? Corpus-based machine-learning approaches

Motivation: applications of “The Language Machine”

BACKGROUND READING: (Atwell 99) The Language Machine

Intro to NLTK

Visit the website: http://www.nltk.org

Today

The main areas of linguistics

Rationalism: language models based on expert introspection

Empiricism: models via machine-learning from a corpus

Corpus: text selected by language, genre, domain, …

Brown, LOB, BNC, Penn Treebank, MapTask, CCA, …

Corpus Annotation: text headers, PoS, parses, …

Corpus size is no. of words – depends on tokenisation

We can count word tokens, word types, type-token distribution

Lexeme/lemma is “root form”, v inflections (be v am/is/was…)

The main sub-areas of linguistics

◮ Phonetics and Phonology: The study of linguistic sounds or speech.

◮ Morphology: The study of the meaningful components of words.

◮ Syntax (grammar): The study of the order and links between words.

◮ Semantics: The study of meanings of words, phrases, sentences.

◮ Discourse: The study of linguistic units larger than a single utterance.

◮ Pragmatics: The study of how language is used to accomplish goals.

Why is NLP hard?

Main reason: Ambiguity in all areas and on all levels, e.g.:

◮ Phonetic Ambiguity: 1 expression being pronounced in several ways

◮ POS Ambiguity: 1 word having several different Parts of Speech (adjective/noun...)

◮ Lexical Ambiguity: 1 word having several different meanings

◮ Syntactic/Structural Ambiguity: 1 phrase or sentence having several different possible structures

◮ Pragmatic Ambiguity: 1 sentence communicating several different intentions

◮ Referential Ambiguity: 1 expression having several different possible references

Key Task in NLP: Disambiguation in context!

Rationalism v Empiricism

Rationalism: the doctrine that knowledge is acquired by reason without regard to experience (Collins English Dictionary)

Noam Chomsky, 1957 Syntactic Structures

-Argued that we should build models through introspection:

-A language model is a set of rules thought up by an expert

Like “Expert Systems”…

Chomsky thought data was full of errors, better to rely on linguists’ intuitions…

Empiricism v Rationalism

Empiricism: the doctrine that all knowledge derives from experience (Collins English Dictionary)

The field was stuck for quite some time: rationalist linguistic models for a specific example did not generalise.

A new approach started around 1990: Corpus Linguistics

• Well, not really new, but in the 1950s to 1980s they didn’t have the text, disk space, or GHz

Main idea: machine learning from CORPUS data

How to do corpus linguistics:

• Get large text collection (a corpus; plural: several corpora)

• Compute statistical models over the words/PoS/parses/… in the corpus

Surprisingly effective
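
One way to read "compute statistical models over the words" is a bigram model: count how often each word follows another and turn the counts into relative frequencies. The sketch below is only an illustration under that assumption; the function name `train_bigram_model` and the toy corpus are invented here and are not from the lecture.

```python
from collections import Counter, defaultdict


def train_bigram_model(tokens):
    """Estimate P(next word | previous word) by relative frequency."""
    follower_counts = defaultdict(Counter)
    # Count every adjacent pair (previous word, next word) in the corpus.
    for prev, nxt in zip(tokens, tokens[1:]):
        follower_counts[prev][nxt] += 1
    model = {}
    for prev, followers in follower_counts.items():
        total = sum(followers.values())
        model[prev] = {word: count / total for word, count in followers.items()}
    return model


# Toy corpus invented for illustration; a real corpus would be far larger.
toy_corpus = "the cat sat on the mat and the cat slept".split()
model = train_bigram_model(toy_corpus)
print(model["the"])
```

With a real corpus the same counts can be collected over PoS tags or parse fragments instead of words.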

What is a corpus?

A corpus is a finite machine-readable body of naturally occurring text, selected according to specified criteria, e.g.:

◮ Language and type: English/German/Arabic/…, dialects v. “standard”, edited text v. spontaneous speech, …

◮ Genre and Domain: 18th century novels, newspaper text, software manuals, train enquiry dialogue...

◮ Web as Corpus: URL “domain” = country: .uk .ar

◮ Media: “Written” Text, Audio, Transcriptions, Video.

◮ Size: 1000 words, 50K words, 1M words, 100M words, ???

Brown and LOB

◮ Brown: Famous first corpus! (well, first widely-used corpus)

◮ by Nelson Francis and Henry Kucera, Brown University USA

◮ A balanced corpus: representative of a whole language

◮ Brown: balanced corpus of written, published American English from 1960s (newspapers, books, … NOT handwritten)

◮ 1 million words, Part-of-Speech tagged.

◮ LOB: Lancaster-Oslo/Bergen corpus: British English version

◮ published British English text from equivalent 1960s sources

◮ FROWN, FLOB: US, UK text from equivalent 1990s sources

Some recent corpora

Corpus features: Size, Domain, Language

British National Corpus: 100M words, balanced British English

Newswire Corpus: 600M words, newswire, American English

UN or EU proceedings: 20M+ words, legal, 10 language pairs

Penn Treebank: 2M words, newswire American English

MapTask: 128 dialogues, British English

Corpus of Contemporary Arabic: 1M words, balanced Arabic

Web: 8 billion(?) words, many domains and languages

Web-as-Corpus: harvest your own corpus from WWW, via “seed terms” → Google API → web-pages → Corpus!

Marco Baroni: BootCat, Adam Kilgarriff: SketchEngine, …

Corpus Annotation

Annotation is a process in which linguistics experts add (linguistic) information to the corpus that is not explicitly there (increases utility of a corpus), e.g.:

◮ Text Headers: meta-data for each text: author, date, type,…

◮ Part-of-speech tag for each word (very common!).

◮ Syntactic structure: parse-tree for each sentence

◮ Word Sense label for each word

◮ Prosodic information: pauses, rise and fall in pitch, etc.

Annotation example: POS tagging

◮ Some texts are annotated with Part-of-speech (POS) tags.

◮ POS tags encode simple grammatical functions.

<s><w pos=RN> Here </w> <w pos=BEZ> is </w> <w pos=AT> a </w>

<w pos=NN> sentence </w>.</s>

◮ Several tag sets:

◮ Brown tag set (87 tags) in Brown corpus

◮ CLAWS / LOB tag set (132 tags) in LOB corpus

◮ Penn tag set (45 tags) in Penn Treebank

◮ CLAWS c5 tag set (62 tags) in BNC (British National Corpus)

◮ Tagging is usually done automatically (then proofread and corrected)
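
As a rough illustration of how such annotation can be read back by a program, the sketch below pulls (word, tag) pairs out of the example sentence with a regular expression. It assumes the exact `<w pos=TAG> word </w>` layout shown on this slide; real corpora use richer markup and would normally be read with a proper parser. The names `WORD_RE` and `read_tagged` are made up for this example.

```python
import re

# The annotated example sentence from the slide, joined onto one line.
TAGGED = ('<s><w pos=RN> Here </w> <w pos=BEZ> is </w> <w pos=AT> a </w> '
          '<w pos=NN> sentence </w>.</s>')

# Capture the tag, then the word with surrounding whitespace trimmed.
WORD_RE = re.compile(r'<w pos=(\w+)>\s*(.*?)\s*</w>')


def read_tagged(sentence):
    """Return a list of (word, tag) pairs found in the annotated sentence."""
    return [(word, tag) for tag, word in WORD_RE.findall(sentence)]


pairs = read_tagged(TAGGED)
print(pairs)
```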

http://www.comp.leeds.ac.uk/eric/atwell00icamej.pdf

http://www.comp.leeds.ac.uk/eric/atwell08clih.pdf

What’s a word?

How many words do you find in the following short text?

What is the biggest/smallest plausible answer to this question?

What problems do you encounter?

It’s a shame that our data-base is not up-to-date. It is a shame that um, data base A costs $2300.50 and that database B costs $5000. All databases cost far too much.

Time: 1 minute

Counting words: tokenisation

Tokenisation is a processing step where the input text is automatically divided into units called tokens where each is either a word or a number or a punctuation mark…

So, word count can ignore numbers, punctuation marks (?)

Word: Continuous alphanumeric characters delineated by whitespace.

Whitespace: space, tab, newline.

BUT dividing at spaces is too simple: It’s, data base
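
To make the contrast concrete, here is a small sketch that tokenises the short text from the "What’s a word?" exercise in two ways: splitting at whitespace, and using a regular expression that keeps prices and hyphenated or apostrophised words together while splitting off punctuation. The pattern is one plausible choice among many, not a standard; `TOKEN_RE` and the variable names are assumptions made for this example.

```python
import re

TEXT = ("It’s a shame that our data-base is not up-to-date. "
        "It is a shame that um, data base A costs $2300.50 and that "
        "database B costs $5000. All databases cost far too much.")

# Alternatives, tried in order:
#   1. an amount such as $2300.50 or 5000
#   2. a word, optionally with internal hyphens or apostrophes
#   3. any single punctuation character
TOKEN_RE = re.compile(r"\$?\d+(?:\.\d+)?|\w+(?:[-'’]\w+)*|[^\w\s]")

whitespace_tokens = TEXT.split()
regex_tokens = TOKEN_RE.findall(TEXT)

print(len(whitespace_tokens), whitespace_tokens)
print(len(regex_tokens), regex_tokens)
```

Neither version decides whether "data base" is one word or two; that question needs a lexicon, not just a pattern.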

Counting words: types v tokens

◮ Word token: individual occurrence of words.

◮ Q: How big is the corpus (N)?

= how many word tokens are there? (LOB: 1M; BNC: 100M)

◮ Word type: the “word itself” regardless of context

◮ Q: How many “different words” (word types) are there?

= Size of corpus vocabulary (LOB: 50K, BNC: 650K)

◮ Q: What is the frequency of each word type?

= type-token distribution

A few word-types (the, of, a, …) are very frequent, but most are rare, and half of all word-types occur only once! Zipf’s Law
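
The three questions above can be answered with a frequency table. The sketch below counts tokens, types, and types that occur only once, on a toy token list invented for illustration; the function name `type_token_stats` is likewise an assumption, and a real corpus would be tokenised first.

```python
from collections import Counter


def type_token_stats(tokens):
    """Summarise corpus size, vocabulary size and the type-token distribution."""
    freq = Counter(tokens)  # word type -> number of tokens of that type
    once_only = [word for word, count in freq.items() if count == 1]
    return {
        "tokens": len(tokens),          # N: size of the corpus
        "types": len(freq),             # size of the vocabulary
        "once_only": len(once_only),    # types that occur exactly once
        "most_frequent": freq.most_common(3),
    }


# Toy token list invented for illustration.
toy_tokens = "the cat sat on the mat the dog sat on the log".split()
stats = type_token_stats(toy_tokens)
print(stats)
```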

Other sorts of “words”

◮ Lemma/Lexeme: dictionary form of a word. cost and costs are derived from the same lexeme “cost”.

data-base, data base, database, databases – same lexeme

Can include spaces: data base, New York

Ambiguous tokenisation: as well (= also), as well as (= and)

Inflection: grammatical variant, e.g. cost v costs

◮ Morpheme: basic “atomic” indivisible unit of meaning or grammar, e.g. data, base, s

◮ For languages other than English, morphological analysis can be hard: root/stem, affixes (prefix, suffix, infix)

morph ologi cal or morpho logic al ?
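
A very small sketch of mapping tokens to lexemes, including multi-word entries, is given below. It uses a hand-written lookup table and tries the longest match first; the table `LEXICON` and the function `lemmatise` are hypothetical and cover only the examples from this slide, whereas a real lemmatiser would rely on a full dictionary or morphological analyser.

```python
# Hypothetical mini-lexicon: tuples of lower-cased tokens -> lexeme.
LEXICON = {
    ("data", "base"): "database",
    ("data-base",): "database",
    ("database",): "database",
    ("databases",): "database",
    ("cost",): "cost",
    ("costs",): "cost",
    ("new", "york"): "New York",
}
MAX_LEN = max(len(key) for key in LEXICON)


def lemmatise(tokens):
    """Replace each token (or multi-word unit) by its lexeme, longest match first."""
    lemmas, i = [], 0
    while i < len(tokens):
        for n in range(MAX_LEN, 0, -1):
            key = tuple(t.lower() for t in tokens[i:i + n])
            if len(key) == n and key in LEXICON:
                lemmas.append(LEXICON[key])
                i += n
                break
        else:
            # Unknown token: keep it, lower-cased, as its own lexeme.
            lemmas.append(tokens[i].lower())
            i += 1
    return lemmas


print(lemmatise("data base A costs".split()))
print(lemmatise("All databases cost far too much".split()))
```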

Arabic Morphology

Templatic Morphology

[Figure: a root combines with a pattern to form a lexeme]

Root: ك ت ب (k-t-b)

Pattern: ma__ū_ , _ā_i_

Lexeme: مكتوب maktūb “written”; كاتب kātib “writer”

Lexeme.Meaning = (Root.Meaning + Pattern.Meaning) * Idiosyncrasy.Random

Arabic Morphology: Root Meaning + Pattern Meaning + ??

ك ت ب KTB = notion of “writing”

كتب /katab/ write

كاتب /kātib/ writer

مكتوب /maktūb/ letter

كتاب /kitāb/ book

مكتبة /maktaba/ library

مكتب /maktab/ office

مكتوب /maktūb/ written

Reminder

Rationalism: language models based on expert introspection

Empiricism: models via machine-learning from a corpus

Corpus: text selected by language, genre, domain, …

Brown, LOB, BNC, Penn Treebank, MapTask, CCA, …

Corpus Annotation: text headers, PoS, parses, …

Corpus size is no. of words – depends on tokenisation

We can count word tokens, word types, type-token distribution

Morpheme: basic lexical unit, “root form”, plus affixes

Lexeme: dictionary entry, can be multi-word: New York