1 SIMS 290-2: Applied Natural Language Processing Marti Hearst Sept 8, 2004.


Page 1: 1 SIMS 290-2: Applied Natural Language Processing Marti Hearst Sept 8, 2004.

1

SIMS 290-2: Applied Natural Language Processing

Marti Hearst
Sept 8, 2004

 

Page 2: 1 SIMS 290-2: Applied Natural Language Processing Marti Hearst Sept 8, 2004.

2

Today

Tokenizing using Regular Expressions
Elementary Morphology
Frequency Distributions in NLTK

Page 3: 1 SIMS 290-2: Applied Natural Language Processing Marti Hearst Sept 8, 2004.

3

Modified from Dorr and Habash (after Jurafsky and Martin)

Tokenizing in NLTK

The Whitespace Tokenizer doesn’t work very well

What are some of the problems?

NLTK provides an easy way to incorporate regex’s into your tokenizer

Uses Python’s regex package (re)
http://docs.python.org/lib/re-syntax.html
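One way to see the problems is to try whitespace splitting on a sentence. The sketch below uses plain Python `str.split()` as a stand-in for a whitespace tokenizer (an assumption; it is not the NLTK class itself), and the sample sentence and the `whitespace_tokenize` name are made up for illustration.

```python
def whitespace_tokenize(text):
    """Split on runs of whitespace only (stand-in for a whitespace tokenizer)."""
    return text.split()

# Hypothetical sample sentence, chosen to show typical trouble spots.
sample = "The girl's depart- ment store (downtown) closed, didn't it?"
tokens = whitespace_tokenize(sample)
print(tokens)
```

Punctuation stays glued to the neighbouring word ("closed,", "it?", "(downtown)"), and the end-of-line hyphen leaves one word split across two tokens ("depart-", "ment").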

Page 4: 1 SIMS 290-2: Applied Natural Language Processing Marti Hearst Sept 8, 2004.

4

Modified from Dorr and Habash (after Jurafsky and Martin)

Regex’s for Tokenizing

Build up your recognizer piece by piece

Make a string of regex’s combined with OR’s
Put each one in a group (surrounded by parens)

Things to recognize:
  URLs
  Words with hyphens in them
  Words in which hyphens should be removed (end-of-line hyphens)
  Numerical terms
  Words with apostrophes

Page 5: 1 SIMS 290-2: Applied Natural Language Processing Marti Hearst Sept 8, 2004.

5

Regex’s for Tokenizing

Here are some I put together:

url = r'((http:\/\/)?[A-Za-z]+(\.[A-Za-z]+){1,3}(\/)?(:\d+)?)'
  » Allows port number but no argument variables.

hyphen = r'(\w+\-\s?\w+)'
  » Allows for a space after the hyphen

apostro = r'(\w+\'\w+)'

numbers = r'((\$|#)?\d+(\.)?\d+%?)'
  » Needs to handle large numbers with commas

punct = r'([^\w\s]+)'

wordr = r'(\w+)'

A nice Python trick:
regexp = string.join([url, hyphen, apostro, numbers, wordr, punct],"|")

– Makes one string in which a “|” goes in between each substring
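The combined pattern can be tried without NLTK. The sketch below copies the slide's patterns and applies them with the standard `re` module. The `tokenize` helper is hypothetical, and `"|".join(...)` is used because the `string.join` function no longer exists in Python 3. Since every pattern contains groups, `re.finditer` with `group(0)` is used to get the whole match rather than the group contents.

```python
import re

# Patterns as given on the slide.
url = r'((http:\/\/)?[A-Za-z]+(\.[A-Za-z]+){1,3}(\/)?(:\d+)?)'
hyphen = r'(\w+\-\s?\w+)'
apostro = r'(\w+\'\w+)'
numbers = r'((\$|#)?\d+(\.)?\d+%?)'
punct = r'([^\w\s]+)'
wordr = r'(\w+)'

# Alternatives are tried left to right, so the more specific patterns come first.
regexp = "|".join([url, hyphen, apostro, numbers, wordr, punct])

def tokenize(text):
    """Return every non-overlapping match of the combined pattern, in order."""
    return [m.group(0) for m in re.finditer(regexp, text)]

print(tokenize("This is the girl's depart- ment store."))
print(tokenize("It costs $3.50 at www.nltk.org"))
```

The order of the alternatives matters: if `wordr` came first, "girl's" would be split at the apostrophe before `apostro` had a chance to match.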

Page 6: 1 SIMS 290-2: Applied Natural Language Processing Marti Hearst Sept 8, 2004.

6

Regex’s for Tokenizing

More code:

import string
from nltk.token import *
from nltk.tokenizer import *
t = Token(TEXT='This is the girl\'s depart- ment store.')
regexp = string.join([url, hyphen, apostro, numbers, wordr, punct],"|")
RegexpTokenizer(regexp,SUBTOKENS='WORDS').tokenize(t)
print t['WORDS']

[<This>, <is>, <the>, <girl's>, <depart- ment>, <store>, <.>]
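The hyphen pattern keeps "depart- ment" as a single token, but the hyphen and space are still inside it. A possible follow-up step, sketched below with a hypothetical helper name, removes a hyphen only when whitespace follows it, on the assumption that this marks an end-of-line break.

```python
import re

def join_line_break_hyphens(token):
    """Drop a hyphen that is followed by whitespace: 'depart- ment' -> 'department'.

    Hyphens with no following space (e.g. 'e-mail') are left untouched.
    """
    return re.sub(r'(\w)-\s+(\w)', r'\1\2', token)

print(join_line_break_hyphens("depart- ment"))
print(join_line_break_hyphens("New York-New Jersey"))
```

This is only a heuristic: a suspended hyphen such as "pre- and post-processing" would be wrongly joined, so a real tokenizer would likely need a dictionary check as well.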

Page 7: 1 SIMS 290-2: Applied Natural Language Processing Marti Hearst Sept 8, 2004.

7

Modified from Dorr and Habash (after Jurafsky and Martin)

Tokenization Issues

Sentence Boundaries
  Include parens around sentences?
  What about quotation marks around sentences?
  Periods – end of line or not?
  – We’ll study this in detail in a couple of weeks.

Proper Names
  What to do about
  – “New York-New Jersey train”?
  – “California Governor Arnold Schwarzenegger”?

Clitics and Contractions

Page 8: 1 SIMS 290-2: Applied Natural Language Processing Marti Hearst Sept 8, 2004.

8

Modified from Dorr and Habash (after Jurafsky and Martin)

Morphology

Morphology:
  The study of the way words are built up from smaller meaning units.

Morphemes:
  The smallest meaningful units in the grammar of a language.

Contrasts:
  Derivational vs. Inflectional
  Regular vs. Irregular
  Concatenative vs. Templatic (root-and-pattern)

A useful resource:
  Glossary of linguistic terms by Eugene Loos
  http://www.sil.org/linguistics/GlossaryOfLinguisticTerms/contents.htm

Page 9: 1 SIMS 290-2: Applied Natural Language Processing Marti Hearst Sept 8, 2004.

9

Modified from Dorr and Habash (after Jurafsky and Martin)

Examples (English)

“unladylike”
  3 morphemes, 4 syllables
  un- ‘not’
  lady ‘(well behaved) female adult human’
  -like ‘having the characteristics of’
  Can’t break any of these down further without distorting the meaning of the units

“technique”
  1 morpheme, 2 syllables

“dogs”
  2 morphemes, 1 syllable
  -s, a plural marker on nouns

Page 10: 1 SIMS 290-2: Applied Natural Language Processing Marti Hearst Sept 8, 2004.

10

Modified from Dorr and Habash (after Jurafsky and Martin)

Morpheme Definitions

Root
  The portion of the word that:
  – is common to a set of derived or inflected forms, if any, when all affixes are removed
  – is not further analyzable into meaningful elements
  – carries the principal portion of meaning of the words

Stem
  The root or roots of a word, together with any derivational affixes, to which inflectional affixes are added.

Affix
  A bound morpheme that is joined before, after, or within a root or stem.

Clitic
  A morpheme that functions syntactically like a word, but does not appear as an independent phonological word
  – Spanish: un beso, las aguas
  – English: Hal’s (genitive marker)

Page 11: 1 SIMS 290-2: Applied Natural Language Processing Marti Hearst Sept 8, 2004.

11

Modified from Dorr and Habash (after Jurafsky and Martin)

Inflectional vs. Derivational

Word Classes
  Parts of speech: noun, verb, adjective, etc.
  Word class dictates how a word combines with morphemes to form new words

Inflection:
  Variation in the form of a word, typically by means of an affix, that expresses a grammatical contrast.
  – Doesn’t change the word class
  – Usually produces a predictable, nonidiosyncratic change of meaning.

Derivation:
  The formation of a new word or inflectable stem from another word or stem.

Page 12: 1 SIMS 290-2: Applied Natural Language Processing Marti Hearst Sept 8, 2004.

12

Modified from Dorr and Habash (after Jurafsky and Martin)

Inflectional Morphology

Adds: tense, number, person, mood, aspect

Word class doesn’t change
Word serves new grammatical role

Examples
  come is inflected for person and number:
    The pizza guy comes at noon.
  las and rojas are inflected for agreement with manzanas in grammatical gender by -a and in number by -s:
    las manzanas rojas (‘the red apples’)

Page 13: 1 SIMS 290-2: Applied Natural Language Processing Marti Hearst Sept 8, 2004.

13

Modified from Dorr and Habash (after Jurafsky and Martin)

Derivational Morphology

Nominalization (formation of nouns from other parts of speech, primarily verbs in English):
  computerization
  appointee
  killer
  fuzziness

Formation of adjectives (primarily from nouns):
  computational
  clueless
  embraceable

Difficult cases:
  building: from which sense of “build”?

A resource:
  CatVar: Categorial Variation Database
  http://clipdemos.umiacs.umd.edu/catvar

Page 14: 1 SIMS 290-2: Applied Natural Language Processing Marti Hearst Sept 8, 2004.

14

Modified from Dorr and Habash (after Jurafsky and Martin)

Concatenative Morphology

Morpheme+Morpheme+Morpheme+…

Stems: also called lemma, base form, root, lexeme
  hope+ing -> hoping
  hop -> hopping

Affixes
  Prefixes: Antidisestablishmentarianism
  Suffixes: Antidisestablishmentarianism
  Infixes: hingi (borrow) – humingi (borrower) in Tagalog
  Circumfixes: sagen (say) – gesagt (said) in German

Agglutinative Languages
  uygarlaştıramadıklarımızdanmışsınızcasına
  uygar+laş+tır+ama+dık+lar+ımız+dan+mış+sınız+casına
  Behaving as if you are among those whom we could not cause to become civilized

Page 15: 1 SIMS 290-2: Applied Natural Language Processing Marti Hearst Sept 8, 2004.

15

Modified from Dorr and Habash (after Jurafsky and Martin)

Templatic Morphology

Roots and Patterns

Example: Hebrew verbs

Root:
  – Consists of 3 consonants CCC
  – Carries basic meaning

Template:
  – Gives the ordering of consonants and vowels
  – Specifies semantic information about the verb
    Active, passive, middle voice

Example:
  – lmd (to learn or study)
    CaCaC -> lamad (he studied)
    CiCeC -> limed (he taught)
    CuCaC -> lumad (he was taught)
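The root-and-pattern idea can be illustrated by filling the C slots of a template with the consonants of a root. This is a minimal sketch on the transliterated example above; the function name is hypothetical, and real Hebrew morphology involves more than slot filling.

```python
def apply_template(root, template):
    """Fill each 'C' slot in the template with the next consonant of the root."""
    if template.count("C") != len(root):
        raise ValueError("template must have one C slot per root consonant")
    consonants = iter(root)
    return "".join(next(consonants) if slot == "C" else slot for slot in template)

for template in ("CaCaC", "CiCeC", "CuCaC"):
    print(template, "->", apply_template("lmd", template))
```

Unlike concatenative morphology, the root and the template are interleaved, so the root never appears as a contiguous substring of the surface form.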

Page 16: 1 SIMS 290-2: Applied Natural Language Processing Marti Hearst Sept 8, 2004.

16

Modified from Dorr and Habash (after Jurafsky and Martin)

Nouns and Verbs (in English)

Nouns have simple inflectional morphology
  cat
  cat+s, cat+’s

Verbs have more complex morphology

Page 17: 1 SIMS 290-2: Applied Natural Language Processing Marti Hearst Sept 8, 2004.

17

Modified from Dorr and Habash (after Jurafsky and Martin)

Nouns and Verbs (in English)

Nouns
  Have simple inflectional morphology
  Cat/Cats
  Mouse/Mice, Ox/Oxen, Goose/Geese

Verbs
  More complex morphology
  Walk/Walked
  Go/Went, Fly/Flew

Page 18: 1 SIMS 290-2: Applied Natural Language Processing Marti Hearst Sept 8, 2004.

18

Modified from Dorr and Habash (after Jurafsky and Martin)

Regular (English) Verbs

Morphological Form Classes    Regularly Inflected Verbs
Stem                          walk     merge    try     map
-s form                       walks    merges   tries   maps
-ing form                     walking  merging  trying  mapping
Past form or -ed participle   walked   merged   tried   mapped
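The regular forms in this table follow a few spelling rules (drop a final e before -ing, change y to i after a consonant, double a final consonant after a single vowel). The sketch below encodes those rules under simplifying assumptions: stems are lowercase with at least two letters, and the doubling test is a consonant-vowel-consonant heuristic that is only reliable for one-syllable stems. All function names are hypothetical.

```python
VOWELS = "aeiou"

def _ends_in_consonant_y(stem):
    return len(stem) > 1 and stem.endswith("y") and stem[-2] not in VOWELS

def _doubles_final_consonant(stem):
    # Consonant-vowel-consonant ending, where the last letter is not w, x or y.
    return (len(stem) >= 3
            and stem[-1] not in VOWELS + "wxy"
            and stem[-2] in VOWELS
            and stem[-3] not in VOWELS)

def s_form(stem):
    if _ends_in_consonant_y(stem):
        return stem[:-1] + "ies"            # try -> tries
    if stem.endswith(("s", "x", "z", "ch", "sh")):
        return stem + "es"                  # catch -> catches
    return stem + "s"

def ing_form(stem):
    if stem.endswith("e") and not stem.endswith(("ee", "ye", "oe")):
        return stem[:-1] + "ing"            # merge -> merging
    if _doubles_final_consonant(stem):
        return stem + stem[-1] + "ing"      # map -> mapping
    return stem + "ing"

def ed_form(stem):
    if stem.endswith("e"):
        return stem + "d"                   # merge -> merged
    if _ends_in_consonant_y(stem):
        return stem[:-1] + "ied"            # try -> tried
    if _doubles_final_consonant(stem):
        return stem + stem[-1] + "ed"       # map -> mapped
    return stem + "ed"

for stem in ("walk", "merge", "try", "map"):
    print(stem, s_form(stem), ing_form(stem), ed_form(stem))
```

Rules like these cover regular verbs only; irregular forms such as ate or caught have to come from a lexicon, and doubling in longer stems depends on stress, which spelling alone does not show.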

Page 19: 1 SIMS 290-2: Applied Natural Language Processing Marti Hearst Sept 8, 2004.

19

Modified from Dorr and Habash (after Jurafsky and Martin)

Irregular (English) Verbs

Morphological Form Classes   Irregularly Inflected Verbs
Stem                         eat      catch     cut
-s form                      eats     catches   cuts
-ing form                    eating   catching  cutting
Past form                    ate      caught    cut
-ed participle               eaten    caught    cut

Page 20: 1 SIMS 290-2: Applied Natural Language Processing Marti Hearst Sept 8, 2004.

20

Modified from Dorr and Habash (after Jurafsky and Martin)

“To love” in Spanish

Page 21: 1 SIMS 290-2: Applied Natural Language Processing Marti Hearst Sept 8, 2004.

21

Modified from Dorr and Habash (after Jurafsky and Martin)

Syntax and Morphology

Phrase-level agreement
  Subject-Verb
  – John studies hard (STUDY+3SG)
  Noun-Adjective
  – Las vacas hermosas

Sub-word phrasal structures
  נויספרבש
  נו+ים+ספר+ב+ש
  That+in+book+PL+Poss:1PL
  Which are in our books

Page 22: 1 SIMS 290-2: Applied Natural Language Processing Marti Hearst Sept 8, 2004.

22

Modified from Dorr and Habash (after Jurafsky and Martin)

Phonology and Morphology

Script Limitations

Spoken English has 14 vowels
  – heed hid hayed head had hoed hood who’d hide how’d taught Tut toy enough

English Alphabet has 5
  – Use vowel combinations: far fair fare
  – Consonantal doubling (hopping vs. hoping)

Page 23: 1 SIMS 290-2: Applied Natural Language Processing Marti Hearst Sept 8, 2004.

23

Modified from Dorr and Habash (after Jurafsky and Martin)

Computational Morphology

Approaches
  Lexicon only
  Rules only
  Lexicon and Rules
  – Finite-state Automata
  – Finite-state Transducers

Systems
  WordNet’s morphy
  PCKimmo
  – Named after Kimmo Koskenniemi, much work done by Lauri Karttunen, Ron Kaplan, and Martin Kay
  – Accurate but complex
  – http://www.sil.org/pckimmo/
  Two-level morphology
  – Commercial version available from InXight Corp.

Background
  Chapter 3 of Jurafsky and Martin
  A short history of Two-Level Morphology
  – http://www.ling.helsinki.fi/~koskenni/esslli-2001-karttunen/

Page 24: 1 SIMS 290-2: Applied Natural Language Processing Marti Hearst Sept 8, 2004.

24

Modified from Dorr and Habash (after Jurafsky and Martin)

Porter Stemmer

Discount morphology
  So not all that accurate

Uses a series of cascaded rewrite rules
  ATIONAL -> ATE (relational -> relate)
  ING -> ε if stem contains vowel (motoring -> motor)
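The ING rule can be read as "delete the suffix, but only if what remains contains a vowel". A minimal sketch of that single rule follows, assuming lowercase input and treating only a, e, i, o, u as vowels (the full Porter algorithm also treats y as a vowel in some positions). The function names are hypothetical.

```python
def contains_vowel(stem):
    """True if the stem has at least one of a, e, i, o, u (simplified vowel test)."""
    return any(ch in "aeiou" for ch in stem)

def strip_ing(word):
    """Remove a final 'ing' only when the remaining stem contains a vowel."""
    if word.endswith("ing"):
        stem = word[:-3]
        if contains_vowel(stem):
            return stem
    return word

for word in ("motoring", "walking", "sing"):
    print(word, "->", strip_ing(word))
```

The vowel condition is what stops short words such as "sing" from being reduced to "s".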

Page 25: 1 SIMS 290-2: Applied Natural Language Processing Marti Hearst Sept 8, 2004.

25

Modified from Dorr and Habash (after Jurafsky and Martin)

Porter Stemmer

Step 4: Derivational Morphology I: Multiple Suffixes

(m>0) ATIONAL -> ATE     relational -> relate
(m>0) TIONAL  -> TION    conditional -> condition
                         rational -> rational
(m>0) ENCI    -> ENCE    valenci -> valence
(m>0) ANCI    -> ANCE    hesitanci -> hesitance
(m>0) IZER    -> IZE     digitizer -> digitize
(m>0) ABLI    -> ABLE    conformabli -> conformable
(m>0) ALLI    -> AL      radicalli -> radical
(m>0) ENTLI   -> ENT     differentli -> different
(m>0) ELI     -> E       vileli -> vile
(m>0) OUSLI   -> OUS     analogousli -> analogous
(m>0) IZATION -> IZE     vietnamization -> vietnamize
(m>0) ATION   -> ATE     predication -> predicate
(m>0) ATOR    -> ATE     operator -> operate
(m>0) ALISM   -> AL      feudalism -> feudal
(m>0) IVENESS -> IVE     decisiveness -> decisive
(m>0) FULNESS -> FUL     hopefulness -> hopeful
(m>0) OUSNESS -> OUS     callousness -> callous
(m>0) ALITI   -> AL      formaliti -> formal
(m>0) IVITI   -> IVE     sensitiviti -> sensitive
(m>0) BILITI  -> BLE     sensibiliti -> sensible
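The condition (m>0) refers to the "measure" of the stem that is left after the suffix is removed: the number of vowel-consonant sequences in it. The sketch below computes m and applies the rules of this one step, choosing the longest matching suffix and leaving the word unchanged when the condition fails. It is an illustration of this step only, not a complete Porter stemmer, and the function names are hypothetical.

```python
def is_consonant(word, i):
    """A letter other than a, e, i, o, u, and other than a y preceded by a consonant."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True

def measure(stem):
    """Return m: how many times a vowel run is followed by a consonant."""
    m = 0
    prev_was_vowel = False
    for i in range(len(stem)):
        consonant = is_consonant(stem, i)
        if consonant and prev_was_vowel:
            m += 1
        prev_was_vowel = not consonant
    return m

# (suffix, replacement) pairs from the slide, in lowercase.
RULES = [
    ("ational", "ate"), ("tional", "tion"), ("enci", "ence"), ("anci", "ance"),
    ("izer", "ize"), ("abli", "able"), ("alli", "al"), ("entli", "ent"),
    ("eli", "e"), ("ousli", "ous"), ("ization", "ize"), ("ation", "ate"),
    ("ator", "ate"), ("alism", "al"), ("iveness", "ive"), ("fulness", "ful"),
    ("ousness", "ous"), ("aliti", "al"), ("iviti", "ive"), ("biliti", "ble"),
]

def apply_step(word, rules=RULES):
    """Apply the rule with the longest matching suffix, if the stem has m > 0."""
    matching = [(s, r) for s, r in rules if word.endswith(s)]
    if not matching:
        return word
    suffix, replacement = max(matching, key=lambda rule: len(rule[0]))
    stem = word[:-len(suffix)]
    return stem + replacement if measure(stem) > 0 else word

for word in ("relational", "conditional", "rational", "vietnamization"):
    print(word, "->", apply_step(word))
```

The word "rational" is left alone because removing ATIONAL leaves the stem "r", which has m = 0.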

Page 26: 1 SIMS 290-2: Applied Natural Language Processing Marti Hearst Sept 8, 2004.

26

Modified from Dorr and Habash (after Jurafsky and Martin)

Porter Stemmer

Errors of Omission (related forms that are not conflated)
  European        Europe
  analysis        analyzes
  matrices        matrix
  noise           noisy
  explain         explanation

Errors of Commission (unrelated forms that are conflated)
  organization    organ
  doing           doe
  generalization  generic
  numerical       numerous
  university      universe

Page 27: 1 SIMS 290-2: Applied Natural Language Processing Marti Hearst Sept 8, 2004.

27

Modified from Dorr and Habash (after Jurafsky and Martin)

Computational Morphology

WORD      STEM (+FEATURES)*
cats      cat +N +PL
cat       cat +N +SG
cities    city +N +PL
geese     goose +N +PL
ducks     (duck +N +PL) or (duck +V +3SG)
merging   merge +V +PRES-PART
caught    (catch +V +PAST-PART) or (catch +V +PAST)

Page 28: 1 SIMS 290-2: Applied Natural Language Processing Marti Hearst Sept 8, 2004.

28

Modified from Dorr and Habash (after Jurafsky and Martin)

Lexicon-only Morphology

acclaim acclaim $N$

acclaim acclaim $V+0$

acclaimed acclaim $V+ed$

acclaimed acclaim $V+en$

acclaiming acclaim $V+ing$

acclaims acclaim $N+s$

acclaims acclaim $V+s$

acclamation acclamation $N$

acclamations acclamation $N+s$

acclimate acclimate $V+0$

acclimated acclimate $V+ed$

acclimated acclimate $V+en$

acclimates acclimate $V+s$

acclimating acclimate $V+ing$

• The lexicon lists all surface level and lexical level pairs

• No rules …

• Analysis/Generation is easy

• Very large for English

• What about Arabic or Turkish or Chinese?
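With a lexicon-only approach, analysis and generation are both table lookups. The sketch below loads a few of the entries listed above into two dictionaries. The data layout and the function names are assumptions made for illustration.

```python
from collections import defaultdict

# (surface form, lexical form, tag) triples taken from the listing above.
LEXICON = [
    ("acclaim", "acclaim", "N"),
    ("acclaim", "acclaim", "V+0"),
    ("acclaimed", "acclaim", "V+ed"),
    ("acclaimed", "acclaim", "V+en"),
    ("acclaiming", "acclaim", "V+ing"),
    ("acclaims", "acclaim", "N+s"),
    ("acclaims", "acclaim", "V+s"),
    ("acclamation", "acclamation", "N"),
    ("acclamations", "acclamation", "N+s"),
]

analyses = defaultdict(list)   # surface form -> [(lexical form, tag), ...]
surfaces = {}                  # (lexical form, tag) -> surface form
for surface, lemma, tag in LEXICON:
    analyses[surface].append((lemma, tag))
    surfaces[(lemma, tag)] = surface

def analyze(surface):
    """Return every (lexical form, tag) pair listed for a surface form."""
    return analyses.get(surface, [])

def generate(lemma, tag):
    """Return the surface form for a lexical form and tag, or None if unlisted."""
    return surfaces.get((lemma, tag))

print(analyze("acclaimed"))
print(generate("acclaim", "V+ing"))
print(analyze("acclaimable"))
```

The last call returns an empty list: anything not listed is simply unknown, which is why this approach becomes impractical for languages with very productive morphology.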

Page 29: 1 SIMS 290-2: Applied Natural Language Processing Marti Hearst Sept 8, 2004.

29

For Next Week

Software status:
  Software on 3 lab machines, more coming

Lecture on Monday Sept 13:
  Part of speech tagging

For Wed Sept 15
  Do exercises 1-3 in Tutorial 2 (Tokenizing)
  Do the following exercises from Tutorial 3 (Tagging)
  – 1a-h
  – 2, 3, 4, 5a-b

Turn them in online (I’ll have something available for this by then)