CSC 9010 - NLP - 3: Morphology, Finite State Transducers
CSC 9010 Natural Language Processing
Lecture 3: Morphology, Finite State Transducers
Paula Matuszek, Mary-Angela Papalaskari
Presentation slides adapted from: Marti Hearst (some following Dorr and Habash) http://www.sims.berkeley.edu/courses/is290-2/f04/ and Jim Martin: http://www.cs.colorado.edu/~martin/csci5832.html
Today
• Elementary Morphology
• Computational morphology
• Finite State Transducers
• Lexicon-only schemes
• Rule-only schemes
• Lab: Introduction to NLTK
Morphology
• Morphology: the study of the way words are built up from smaller meaning units.
• Morphemes: the smallest meaningful units in the grammar of a language.
• Contrasts:
  – Derivational vs. Inflectional
  – Regular vs. Irregular
  – Concatenative vs. Templatic (root-and-pattern)
• A useful resource: Glossary of linguistic terms by Eugene Loos
  http://www.sil.org/linguistics/GlossaryOfLinguisticTerms/contents.htm
Examples (English)
• “unladylike”: 3 morphemes, 4 syllables
  – un- ‘not’
  – lady ‘(well behaved) female adult human’
  – -like ‘having the characteristics of’
  – Can’t break any of these down further without distorting the meaning of the units
• “technique”: 1 morpheme, 2 syllables
• “dogs”: 2 morphemes, 1 syllable
  – -s, a plural marker on nouns
Morpheme Definitions
• Root: the portion of the word that
  – is common to a set of derived or inflected forms, if any, when all affixes are removed
  – is not further analyzable into meaningful elements
  – carries the principal portion of meaning of the words
• Stem: the root or roots of a word, together with any derivational affixes, to which inflectional affixes are added.
• Affix: a bound morpheme that is joined before, after, or within a root or stem.
• Clitic: a morpheme that functions syntactically like a word, but does not appear as an independent phonological word
  – Spanish: un beso, las aguas
  – English: Hal’s (genitive marker)
  – Proto-Indo-European: *kwe -> -que (Latin), te (Greek), and -ca (Sanskrit)
Inflectional vs. Derivational
• Word classes
  – Parts of speech: noun, verb, adjective, etc.
  – Word class dictates how a word combines with morphemes to form new words
• Inflection: variation in the form of a word, typically by means of an affix, that expresses a grammatical contrast.
  – Doesn’t change the word class
  – Usually produces a predictable, non-idiosyncratic change of meaning.
• Derivation: the formation of a new word or inflectable stem from another word or stem.
Inflectional Morphology
• Adds: tense, number, person, mood, aspect
• Word class doesn’t change
• Word serves new grammatical role
• Examples
  – come is inflected for person and number: The pizza guy comes at noon.
  – las and rojas are inflected for agreement with manzanas in grammatical gender by -a and in number by -s: las manzanas rojas (‘the red apples’)
Derivational Morphology
• Nominalization (formation of nouns from other parts of speech, primarily verbs in English):
  – computerization
  – appointee
  – killer
  – fuzziness
• Formation of adjectives (primarily from nouns):
  – computational
  – clueless
  – embraceable
• Difficult cases:
  – building: from which sense of “build”?
Concatenative Morphology
• Morpheme+Morpheme+Morpheme+…
• Stems: also called lemma, base form, root, lexeme
  – hope+ing -> hoping
  – hop+ing -> hopping
• Affixes
  – Prefixes: antidisestablishmentarianism (anti-, dis-)
  – Suffixes: antidisestablishmentarianism (-ment, -arian, -ism)
  – Infixes: hingi (borrow) – humingi (borrower) in Tagalog
  – Circumfixes: sagen (say) – gesagt (said) in German
• Agglutinative languages
  – uygarlaştıramadıklarımızdanmışsınızcasına (Turkish)
  – uygar+laş+tır+ama+dık+lar+ımız+dan+mış+sınız+casına
  – “Behaving as if you are among those whom we could not cause to become civilized”
Templatic Morphology
• Roots and Patterns
• Example: Hebrew verbs
• Root:
  – Consists of 3 consonants CCC
  – Carries basic meaning
• Template:
  – Gives the ordering of consonants and vowels
  – Specifies semantic information about the verb: active, passive, middle voice
• Example: lmd (to learn or study)
  – CaCaC -> lamad (he studied)
  – CiCeC -> limed (he taught)
  – CuCaC -> lumad (he was taught)
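A minimal sketch of root-and-pattern interleaving, using the lmd example. The function name apply_template is hypothetical, and the sketch assumes each C slot in the template takes the next root consonant in order.

```python
def apply_template(root, template):
    """Fill each 'C' slot in the template with the next root consonant."""
    if template.count("C") != len(root):
        raise ValueError("template slots and root consonants do not match")
    consonants = iter(root)
    # Vowels in the template are copied through unchanged.
    return "".join(next(consonants) if ch == "C" else ch for ch in template)


for template in ("CaCaC", "CiCeC", "CuCaC"):
    print(template, "->", apply_template("lmd", template))
```

This only covers the interleaving step. It says nothing about which templates a given root actually allows.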
Nouns and Verbs (in English)
• Nouns have simple inflectional morphology
  – cat
  – cat+s, cat+’s
• Verbs have more complex morphology
Nouns and Verbs (in English)
• Nouns
  – Have simple inflectional morphology
  – Cat/Cats
  – Mouse/Mice, Ox/Oxen, Goose/Geese
• Verbs
  – More complex morphology
  – Walk/Walked
  – Go/Went, Fly/Flew
Regular (English) Verbs
Morphological Form Classes    Regularly Inflected Verbs
Stem                          walk     merge    try     map
-s form                       walks    merges   tries   maps
-ing form                     walking  merging  trying  mapping
Past form or -ed participle   walked   merged   tried   mapped
Irregular (English) Verbs
Morphological Form Classes    Irregularly Inflected Verbs
Stem                          eat      catch     cut
-s form                       eats     catches   cuts
-ing form                     eating   catching  cutting
Past form                     ate      caught    cut
-ed participle                eaten    caught    cut
“To love” in Spanish
Syntax and Morphology
• Phrase-level agreement
  – Subject-Verb: John studies hard (STUDY+3SG)
  – Noun-Adjective: Las vacas hermosas
• Sub-word phrasal structures
  – שבספרינו
  – ש+ב+ספר+ים+נו
  – That+in+book+PL+Poss:1PL
  – “Which are in our books”
Phonology and Morphology
Script Limitations
• Spoken English has 14 vowels
  – heed hid hayed head had hoed hood who’d hide how’d taught Tut toy enough
• The English alphabet has only 5 vowel letters
  – Use vowel combinations: far fair fare
  – Consonantal doubling (hopping vs. hoping)
Computational Morphology
• Approaches
  – Lexicon only
  – Rules only
  – Lexicon and Rules: Finite-state Automata, Finite-state Transducers
• Systems
  – WordNet’s morphy
  – PCKimmo: named after Kimmo Koskenniemi, much work done by Lauri Karttunen, Ron Kaplan, and Martin Kay; accurate but complex; http://www.sil.org/pckimmo/
  – Two-level morphology: commercial version available from InXight Corp.
• Background
  – Chapter 3 of Jurafsky and Martin
  – A short history of Two-Level Morphology: http://www.ling.helsinki.fi/~koskenni/esslli-2001-karttunen/
Computational Morphology
WORD      STEM (+FEATURES)*
cats      cat +N +PL
cat       cat +N +SG
cities    city +N +PL
geese     goose +N +PL
ducks     (duck +N +PL) or (duck +V +3SG)
merging   merge +V +PRES-PART
caught    (catch +V +PAST-PART) or (catch +V +PAST)
FSAs and the Lexicon
• First we’ll capture the morphotactics
  – The rules governing the ordering of affixes in a language.
• Then we’ll add in the actual words
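A rough sketch of that two-step idea for English noun inflection: an FSA over morpheme classes (the morphotactics), plus a small lexicon that assigns words to those classes. The state names, class names and word list are assumptions made for illustration. They are not taken from the slide figures.

```python
# Hypothetical toy lexicon: morpheme -> morpheme class.
LEXICON = {
    "cat": "reg-noun", "dog": "reg-noun", "fox": "reg-noun",
    "goose": "irreg-sg-noun", "mouse": "irreg-sg-noun",
    "geese": "irreg-pl-noun", "mice": "irreg-pl-noun",
    "s": "plural",
}

# Morphotactics: which morpheme classes may follow which.
TRANSITIONS = {
    ("q0", "reg-noun"): "q1",
    ("q0", "irreg-sg-noun"): "q2",
    ("q0", "irreg-pl-noun"): "q2",
    ("q1", "plural"): "q2",
}
ACCEPTING = {"q1", "q2"}


def accepts(morphemes):
    """Recognize a sequence of morphemes such as ['cat', 's']."""
    state = "q0"
    for morpheme in morphemes:
        state = TRANSITIONS.get((state, LEXICON.get(morpheme)))
        if state is None:  # unknown morpheme or illegal ordering
            return False
    return state in ACCEPTING


print(accepts(["cat", "s"]))    # regular noun + plural
print(accepts(["goose", "s"]))  # irregular nouns take no -s
```

This is recognition only. It says whether a morpheme sequence is well formed, not what its structure is.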
Simple Rules
Adding the Words
Derivational Rules
Parsing/Generation vs. Recognition
• Recognition is usually not quite what we need.
  – Usually if we find some string in the language we need to find the structure in it (parsing)
  – Or we have some structure and we want to produce a surface form (production/generation)
• Example
  – From “cats” to “cat +N +PL” and back
• Morphological analysis
Finite State Transducers
• The simple story
  – Add another tape
  – Add extra symbols to the transitions
  – On one tape we read “cats”, on the other we write “cat +N +PL”, or the other way around.
FSTs
Transitions
• c:c means read a c on one tape and write a c on the other
• +N:ε means read a +N symbol on one tape and write nothing on the other
• +PL:s means read +PL and write an s
c:c a:a t:t +N:ε +PL:s
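The path above can be sketched as a tiny transducer that runs in both directions. The state numbers, the function name transduce and the extra +SG:ε arc are assumptions added for illustration. The empty string stands in for ε.

```python
EPS = ""  # stands in for ε: read or write nothing

# Arcs are (source state, lexical symbol, surface symbol, target state).
ARCS = [
    (0, "c", "c", 1),
    (1, "a", "a", 2),
    (2, "t", "t", 3),
    (3, "+N", EPS, 4),
    (4, "+PL", "s", 5),
    (4, "+SG", EPS, 5),  # assumption: singular arc added for illustration
]
START, FINAL = 0, {5}


def transduce(symbols, invert=False):
    """Return every output sequence for the input symbols.

    invert=False reads lexical symbols and writes surface symbols
    (generation); invert=True does the reverse (parsing).
    Assumes the machine has no ε-cycles.
    """
    results = []

    def walk(state, pos, out):
        if pos == len(symbols) and state in FINAL:
            results.append(out)
        for src, lex, surf, dst in ARCS:
            if src != state:
                continue
            read, write = (surf, lex) if invert else (lex, surf)
            new_out = out + [write] if write != EPS else out
            if read == EPS:
                walk(dst, pos, new_out)      # consume no input
            elif pos < len(symbols) and symbols[pos] == read:
                walk(dst, pos + 1, new_out)  # consume one symbol

    walk(START, 0, [])
    return results


print(transduce(["c", "a", "t", "+N", "+PL"]))  # generation
print(transduce(list("cats"), invert=True))     # parsing
```

The function returns a list because a transducer may have several accepting paths for one input.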
Ambiguity
• Recall that in non-deterministic recognition multiple paths through a machine may lead to an accept state.
  – It didn’t matter which path was actually traversed
• In FSTs the path to an accept state does matter, since different paths represent different parses and produce different outputs
Ambiguity
• What’s the right parse for “unionizable”?
  – union-ize-able
  – un-ion-ize-able
• Each represents a valid path through the derivational morphology machine.
Ambiguity
• There are a number of ways to deal with this problem
  – Simply take the first output found
  – Find all the possible outputs (all paths) and return them all (without choosing)
  – Bias the search so that only one or a few likely paths are explored
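A small sketch of these three policies over a fixed list of candidate parses. The candidate list and its scores are made up for illustration. A real system would obtain parses from paths through the transducer, and a truly biased search would prune paths during the search rather than rank finished parses as done here.

```python
# Hypothetical candidate parses with made-up preference scores.
CANDIDATES = {
    "ducks": [("duck +V +3SG", 0.3), ("duck +N +PL", 0.7)],
}


def all_parses(word):
    """Return every parse, in the order the paths were found."""
    return [parse for parse, _ in CANDIDATES.get(word, [])]


def first_parse(word):
    """Take whichever parse happens to be found first."""
    parses = all_parses(word)
    return parses[0] if parses else None


def likely_parses(word, k=1):
    """Approximate a biased search: keep the k highest-scoring parses."""
    ranked = sorted(CANDIDATES.get(word, []), key=lambda p: p[1], reverse=True)
    return [parse for parse, _ in ranked[:k]]


print(first_parse("ducks"))
print(all_parses("ducks"))
print(likely_parses("ducks"))
```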
The Gory Details
• Of course, it’s not as easy as “cat +N +PL” <-> “cats”
  – As we saw earlier there are geese, mice and oxen
  – But there are also a whole host of spelling/pronunciation changes that go along with inflectional changes
  – Cats vs. dogs: the plural -s is pronounced /s/ in cats but /z/ in dogs
• Multi-tape machines
Multi-Level Tape Machines
We use one machine to transduce between the lexical and the intermediate level, and another to handle the spelling changes to the surface tape
Lexical to Intermediate Level
Intermediate to Surface
The “add an e” rule, as in fox^s# <-> foxes
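A minimal sketch of the e-insertion rule in the generation direction only, written as a regular-expression rewrite rather than a real transducer. It assumes ^ marks a morpheme boundary and # the end of the word, and that the rule fires after x, s, z, ch and sh. That trigger set is an assumption, since the slide shows only the fox example.

```python
import re


def intermediate_to_surface(form):
    """Map an intermediate form such as 'fox^s#' to a surface form."""
    # Insert e when a sibilant-final stem meets the -s suffix at word end.
    form = re.sub(r"(x|s|z|ch|sh)\^(?=s#)", r"\1e", form)
    # Boundary symbols never appear on the surface tape.
    return form.replace("^", "").replace("#", "")


for form in ("fox^s#", "cat^s#", "church^s#", "fox#"):
    print(form, "->", intermediate_to_surface(form))
```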
Foxes
Foxes
FST Review
• FSTs allow us to take an input and deliver a structure based on it
• Or… take a structure and create a surface form
• Or take a structure and create another structure
• In many applications it’s convenient to decompose the problem into a set of cascaded transducers, where the output of one feeds into the input of the next.
• We’ll see this scheme again for deeper semantic processing.
Overall Plan
Lexicon-only Morphology
acclaim acclaim $N$
acclaim acclaim $V+0$
acclaimed acclaim $V+ed$
acclaimed acclaim $V+en$
acclaiming acclaim $V+ing$
acclaims acclaim $N+s$
acclaims acclaim $V+s$
acclamation acclamation $N$
acclamations acclamation $N+s$
acclimate acclimate $V+0$
acclimated acclimate $V+ed$
acclimated acclimate $V+en$
acclimates acclimate $V+s$
acclimating acclimate $V+ing$
• The lexicon lists all surface level and lexical level pairs
• No rules …
• Analysis/Generation is easy
• Very large for English
• What about Arabic or Turkish or Chinese?
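A sketch of the lexicon-only scheme as two lookup tables built from a few of the rows listed above. The table and function names are hypothetical.

```python
from collections import defaultdict

# (surface form, lexical form, features), copied from the listing above.
ROWS = [
    ("acclaim", "acclaim", "$N$"),
    ("acclaim", "acclaim", "$V+0$"),
    ("acclaimed", "acclaim", "$V+ed$"),
    ("acclaimed", "acclaim", "$V+en$"),
    ("acclaims", "acclaim", "$N+s$"),
    ("acclaims", "acclaim", "$V+s$"),
]

analysis_table = defaultdict(list)  # surface -> all lexical analyses
generation_table = {}               # (lexical, features) -> surface
for surface, lexical, features in ROWS:
    analysis_table[surface].append((lexical, features))
    generation_table[(lexical, features)] = surface


def analyze(surface):
    """Analysis is a plain lookup; unknown words get no analysis."""
    return analysis_table.get(surface, [])


def generate(lexical, features):
    """Generation is the same lookup in the other direction."""
    return generation_table.get((lexical, features))


print(analyze("acclaims"))
print(generate("acclaim", "$V+ed$"))
```

Every surface form needs its own row, which is why the table grows so quickly for languages with rich morphology.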
Stemming vs Morphology
• Sometimes you just need to know the stem of a word and you don’t care about the structure.
• In fact you may not even care if you get the right stem, as long as you get a consistent string.
• This is stemming… it most often shows up in IR applications
Stemming in IR
• Run a stemmer on the documents to be indexed
• Run a stemmer on users’ queries
• Match
  – This is basically a form of hashing
• Example: computerization
  – -ization -> -ize: computerize
  – -ize -> ε: computer
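A toy sketch of stem-then-match retrieval. The stemmer applies only the two suffix rules from the example, and the documents, names and index layout are made up for illustration. It is not the Porter stemmer.

```python
from collections import defaultdict

# The two rewrite rules from the example, longest suffix first.
RULES = [("ization", "ize"), ("ize", "")]


def toy_stem(word):
    """Apply the suffix rules repeatedly until none matches."""
    word = word.lower()
    changed = True
    while changed:
        changed = False
        for suffix, replacement in RULES:
            if word.endswith(suffix) and len(word) > len(suffix):
                word = word[: -len(suffix)] + replacement
                changed = True
                break
    return word


def build_index(docs):
    """Map each stem to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            index[toy_stem(token)].add(doc_id)
    return index


def search(index, query):
    """Stem the query the same way, then intersect the posting sets."""
    postings = [index.get(toy_stem(token), set()) for token in query.split()]
    return set.intersection(*postings) if postings else set()


docs = {1: "computerization of records", 2: "buy a computer", 3: "garden tools"}
index = build_index(docs)
print(search(index, "computerize"))
```

The stem acts as a hash key: documents and queries only need to agree on it, whether or not it is a real word.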
Porter Stemmer
Step 4: Derivational Morphology I: Multiple Suffixes
(m>0) ATIONAL -> ATE     relational -> relate
(m>0) TIONAL -> TION     conditional -> condition
                         rational -> rational
(m>0) ENCI -> ENCE       valenci -> valence
(m>0) ANCI -> ANCE       hesitanci -> hesitance
(m>0) IZER -> IZE        digitizer -> digitize
(m>0) ABLI -> ABLE       conformabli -> conformable
(m>0) ALLI -> AL         radicalli -> radical
(m>0) ENTLI -> ENT       differentli -> different
(m>0) ELI -> E           vileli -> vile
(m>0) OUSLI -> OUS       analogousli -> analogous
(m>0) IZATION -> IZE     vietnamization -> vietnamize
(m>0) ATION -> ATE       predication -> predicate
(m>0) ATOR -> ATE        operator -> operate
(m>0) ALISM -> AL        feudalism -> feudal
(m>0) IVENESS -> IVE     decisiveness -> decisive
(m>0) FULNESS -> FUL     hopefulness -> hopeful
(m>0) OUSNESS -> OUS     callousness -> callous
(m>0) ALITI -> AL        formaliti -> formal
(m>0) IVITI -> IVE       sensitiviti -> sensitive
(m>0) BILITI -> BLE      sensibiliti -> sensible
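A sketch of how one such rule step can be applied, using a subset of the rules listed for this step. In Porter's definition, m counts the vowel-consonant sequences in the part of the word before the suffix, so (m>0) blocks the rewrite when the remaining stem is too short. The function names are hypothetical. For real use, a tested implementation such as NLTK's PorterStemmer is the safer choice.

```python
# A subset of the (suffix, replacement) rules listed for this step.
RULES = [
    ("ational", "ate"), ("tional", "tion"), ("izer", "ize"),
    ("ization", "ize"), ("ation", "ate"), ("iveness", "ive"),
    ("fulness", "ful"), ("biliti", "ble"),
]


def measure(stem):
    """Porter's m: the number of vowel-to-consonant transitions."""
    m, prev_is_vowel = 0, False
    for i, ch in enumerate(stem):
        # y counts as a vowel only when it follows a consonant.
        is_vowel = ch in "aeiou" or (ch == "y" and i > 0 and not prev_is_vowel)
        if prev_is_vowel and not is_vowel:
            m += 1
        prev_is_vowel = is_vowel
    return m


def apply_step(word):
    """Apply only the longest matching suffix rule, and only if m > 0."""
    for suffix, replacement in sorted(RULES, key=lambda r: len(r[0]), reverse=True):
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            return stem + replacement if measure(stem) > 0 else word
    return word


for word in ("relational", "conditional", "rational", "digitizer"):
    print(word, "->", apply_step(word))
```

For rational, the longest matching suffix is ATIONAL, which leaves the stem r with m = 0, so the word is left unchanged.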
Porter
• No lexicon needed
• Basically a set of staged sets of rewrite rules that strip suffixes
• Handles both inflectional and derivational suffixes
• Doesn’t guarantee that the resulting stem is really a stem (see first bullet)
• Lack of guarantee doesn’t matter for IR
Porter Stemmer
Errors of Omission         Errors of Commission
European   Europe          organization     organ
analysis   analyzes        doing            doe
matrices   matrix          generalization   generic
noise      noisy           numerical        numerous
explain    explanation     university       universe
Soundex
You work as the Villanova telephone operator. Someone calls looking for:
Dr Papalarsky or Dr Matuzka
???????? What do you type as your query string?
Soundex
1. Keep the first letter
2. Drop non-initial occurrences of vowels, h, w and y
3. Replace the remaining letters with numbers according to group (e.g., b, f, p, and v -> 1)
4. Replace strings of identical numbers with a single number (333 -> 3)
5. Drop any numbers beyond a third one
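The five steps can be sketched as below, following the order given here rather than the official Soundex specification, which treats vowels, h and w slightly differently. The letter groups are the standard Soundex ones. Padding short codes with zeros is an assumption taken from the usual convention, and the function name is hypothetical.

```python
# Standard Soundex letter groups.
GROUPS = {"bfpv": "1", "cgjkqsxz": "2", "dt": "3", "l": "4", "mn": "5", "r": "6"}
CODE = {ch: digit for letters, digit in GROUPS.items() for ch in letters}


def soundex(name):
    letters = [ch for ch in name.lower() if ch.isalpha()]
    if not letters:
        return ""
    first, rest = letters[0], letters[1:]       # step 1: keep the first letter
    kept = [ch for ch in rest if ch in CODE]    # step 2: drop vowels, h, w, y
    digits = [CODE[ch] for ch in kept]          # step 3: letters -> group numbers
    collapsed = []
    for digit in digits:                        # step 4: collapse repeated numbers
        if not collapsed or collapsed[-1] != digit:
            collapsed.append(digit)
    # Step 5: keep at most three numbers (zero-padding is an assumption).
    return first.upper() + "".join(collapsed[:3]).ljust(3, "0")


for name in ("Matuszek", "Matuzka", "Papalaskari", "Papalarsky"):
    print(name, soundex(name))
```

With these steps Matuszek and Matuzka get the same code. Papalaskari and Papalarsky do not, because the extra r changes the third number.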
Soundex
• Effect is to map (hash) all similar sounding transcriptions to the same code.
• Structure your directory so that it can be accessed by code as well as by correct spelling
• Used for census records, phone directories, author searches in libraries etc.