October 2006 Advanced Topics in NLP 1
CSA3050: NLP Algorithms
Finite State Transducers for Morphological Parsing
Acknowledgement
• This lecture is largely based on material from Jurafsky & Martin chapter 3
Resumé
• FSAs are equivalent to regular languages
• FSTs are equivalent to regular relations (over pairs of regular languages)
• FSTs are like FSAs but with complex labels.
• We can use FSTs to transduce between surface and lexical levels.
Morphological Parsing
• Given the input cats, we’d like to output cat +N +Pl, telling us that cat is a plural noun.
• Given the Spanish input bebo, we’d like to output beber +V +PInd +1P +Sg, telling us that bebo is the present indicative first person singular form of the Spanish verb beber, ‘to drink’.
Two-Level Paradigm
[Figure from Jurafsky & Martin]
English Plural
surface    lexical
cat        cat+N+Sg
cats       cat+N+Pl
foxes      fox+N+Pl
mice       mouse+N+Pl
sheep      sheep+N+Pl
sheep      sheep+N+Sg
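A minimal sketch of the table as a relation, assuming we simply store it as a set of (surface, lexical) pairs. The names PLURAL_RELATION, analyse and generate are made up for this illustration; a real FST encodes the relation compactly instead of listing it.

```python
# Toy relation: (surface, lexical) pairs copied from the English Plural table.
PLURAL_RELATION = {
    ("cat", "cat+N+Sg"),
    ("cats", "cat+N+Pl"),
    ("foxes", "fox+N+Pl"),
    ("mice", "mouse+N+Pl"),
    ("sheep", "sheep+N+Pl"),
    ("sheep", "sheep+N+Sg"),
}


def analyse(surface):
    """Return every lexical form paired with the given surface form."""
    return sorted(lex for surf, lex in PLURAL_RELATION if surf == surface)


def generate(lexical):
    """Return every surface form paired with the given lexical form."""
    return sorted(surf for surf, lex in PLURAL_RELATION if lex == lexical)
```

Because this is a relation and not a function, analyse("sheep") returns two analyses, while generate("mouse+N+Pl") returns the single form mice.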
Morphological Analyser
To build a morphological analyser we need:
• lexicon: the list of stems and affixes, together with basic information about them
• morphotactics: the model of morpheme ordering (e.g. the English plural morpheme follows the noun rather than a verb)
• orthographic rules: spelling rules that model the changes that occur in a word, usually when two morphemes combine (e.g. fly+s = flies)
Lexicon & Morphotactics
• Typically the list of word parts (lexicon) and the model of ordering can be combined into an FSA which will recognise all the valid word forms.
• For this to be possible the word parts must first be classified into sublexicons.
• The FSA defines the morphotactics (ordering constraints).
Sublexicons to classify the list of word parts
reg-noun    irreg-pl-noun    irreg-sg-noun    plural
cat         mice             mouse            -s
fox         sheep            sheep
            geese            goose
FSA Expresses Morphotactics (ordering model)
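A rough sketch of such an FSA, working over whole morphemes rather than letters. The state names q0, q1, q2 and the arc layout are assumptions modelled on the usual textbook automaton for English noun number (regular noun, then optional plural; irregular forms go straight to a final state).

```python
# Sublexicons as given on the previous slide.
SUBLEXICONS = {
    "reg-noun": {"cat", "fox"},
    "irreg-pl-noun": {"mice", "sheep", "geese"},
    "irreg-sg-noun": {"mouse", "sheep", "goose"},
    "plural": {"-s"},
}

# Assumed arcs: (state, sublexicon) -> next state.
TRANSITIONS = {
    ("q0", "reg-noun"): "q1",
    ("q0", "irreg-pl-noun"): "q2",
    ("q0", "irreg-sg-noun"): "q2",
    ("q1", "plural"): "q2",
}
FINAL_STATES = {"q1", "q2"}


def accepts(morphemes):
    """Return True if the morpheme sequence obeys the ordering constraints."""
    states = {"q0"}
    for morpheme in morphemes:
        next_states = set()
        for state in states:
            for name, members in SUBLEXICONS.items():
                if morpheme in members and (state, name) in TRANSITIONS:
                    next_states.add(TRANSITIONS[(state, name)])
        states = next_states
    return bool(states & FINAL_STATES)
```

With these arcs accepts(["fox", "-s"]) holds, but accepts(["geese", "-s"]) does not, since no plural arc leaves q2.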
Towards the Analyser
• We can use lexc or xfst to build such an FSA (see lex1.lexc)
• To augment this to produce an analysis we must create a transducer Tnum which maps between the lexical level and an "intermediate" level that is needed to handle the spelling rules of English.
Three Levels of Analysis
1. Tnum: Noun Number Inflection
• multi-character symbols
• morpheme boundary ^
• word boundary #
Towards the Analyser
• We do this by first allowing the lexicon itself to also have two levels. Since surface geese maps to lexical goose, the new lexical entry will be “g:g o:e o:e s:s e:e” (see lex2.lexc)
• We must also add the appropriate morphological labels (see lex3.lexc)
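A small sketch of how a two-level entry can be read, assuming each symbol pair is written lexical:surface and pairs are separated by spaces. The function split_entry is hypothetical, and so is the tagged entry in the usage note; the "0" for the empty string follows the lexc convention, but the actual contents of lex2.lexc and lex3.lexc are not shown here.

```python
EPSILON = "0"  # assumed notation for the empty string on one side of a pair


def split_entry(entry):
    """Split a two-level entry into its (lexical, surface) strings."""
    lexical, surface = [], []
    for pair in entry.split():
        upper, lower = pair.split(":")
        lexical.append("" if upper == EPSILON else upper)
        surface.append("" if lower == EPSILON else lower)
    return "".join(lexical), "".join(surface)
```

For the entry on this slide, split_entry("g:g o:e o:e s:s e:e") gives ("goose", "geese"). With made-up tag pairs added, split_entry("g:g o:e o:e s:s e:e +N:0 +Pl:0") gives ("goose+N+Pl", "geese").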
Intermediate Form to Surface
• The reason we need to have an intermediate form is that funny things happen at morpheme boundaries, e.g.
  cat^s → cats
  fox^s → foxes
  fly^s → flies
• The rules which describe these changes are called orthographic rules or "spelling rules".
More English Spelling Rules
• consonant doubling: beg/begging
• y replacement: try/tries
• k insertion: panic/panicked
• e deletion: make/making
• e insertion: watch/watches
• Each rule can be stated in more detail ...
Spelling Rules
• Chomsky & Halle (1968) introduced a rewrite-rule notation for phonological rules; the same notation is used to state spelling rules.
• A very similar notation is embodied in the "conditional replacement" rules of xfst.
  E -> F || L _ R
which means: replace E with F when it appears between left context L and right context R
A Particular Spelling Rule
This rule does e-insertion
^ -> e || x _ s#
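The rule schema E -> F || L _ R and this particular rule can be imitated with a regular-expression substitution. This is only a sketch: it assumes E, L and R are plain non-empty-E literal strings (xfst allows full regular expressions), and the names conditional_replace and e_insertion are invented for the example.

```python
import re


def conditional_replace(text, e, f, left, right):
    """Apply E -> F || L _ R, with all four parts given as literal strings."""
    pattern = re.escape(e)
    if left:
        # Lookbehind: L must precede E but is not consumed or rewritten.
        pattern = "(?<=" + re.escape(left) + ")" + pattern
    if right:
        # Lookahead: R must follow E but is not consumed or rewritten.
        pattern = pattern + "(?=" + re.escape(right) + ")"
    return re.sub(pattern, lambda _match: f, text)


def e_insertion(intermediate):
    """The rule from this slide: ^ -> e || x _ s#"""
    return conditional_replace(intermediate, "^", "e", "x", "s#")
```

Here e_insertion("fox^s#") yields "foxes#", while e_insertion("cat^s#") leaves the string unchanged because the left context x is missing.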
e insertion over 3 levels
The rule corresponds to the mapping between surface and intermediate levels
e insertion as an FST
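One possible hand-built machine for the simplified rule ^ -> e || x _ s#, running from the intermediate to the surface level. This is an illustrative sketch, not the transducer drawn on the original slide: it is deterministic and holds back output (the pending string) until it knows whether the right context s# really follows. The state names are assumptions.

```python
def e_insertion_fst(intermediate):
    """Deterministic transducer for ^ -> e || x _ s# with delayed output."""
    state = "q0"      # q0: neutral, "x": just read x, "x^": read x^, "x^s": read x^s
    pending = ""      # input symbols read but not yet written out
    output = []
    for ch in intermediate:
        if state == "x" and ch == "^":
            state, pending = "x^", "^"        # maybe an e-insertion site
        elif state == "x^" and ch == "s":
            state, pending = "x^s", "^s"      # still undecided
        elif state == "x^s" and ch == "#":
            output.append("es#")              # context confirmed: ^ becomes e
            state, pending = "q0", ""
        else:
            output.append(pending + ch)       # context failed: copy held symbols
            pending = ""
            state = "x" if ch == "x" else "q0"
    output.append(pending)                    # flush at end of input
    return "".join(output)
```

The machine maps fox^s# to foxes# and copies cat^s# through unchanged; other boundary symbols are left for further rules to handle.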
Incorporating Spelling Rules
• Spelling rules, each corresponding to an FST, can be run in parallel provided that they are "aligned".
• The set of spelling rules is positioned between the surface level and the intermediate level.
• Parallel execution of FSTs can be carried out:
– by simulation: in this case FSTs must first be aligned.
– by first constructing a single FST corresponding to their intersection.
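A sketch of the parallel idea, under two assumptions: "aligned" is taken to mean that every rule reads the same sequence of (intermediate, surface) symbol pairs, and each rule is modelled as a Python predicate instead of an explicit state machine. A pair sequence is accepted only if every rule accepts it, which is what intersecting the rule FSTs amounts to. All names here are hypothetical.

```python
EPS = ""  # the empty string stands for "no symbol" on the surface side


def e_insertion_ok(pairs):
    """^ surfaces as e exactly when it sits between x and s#."""
    upper = [u for u, _ in pairs]
    for i, (u, low) in enumerate(pairs):
        if u != "^":
            continue
        in_context = i > 0 and upper[i - 1] == "x" and upper[i + 1:i + 3] == ["s", "#"]
        if in_context != (low == "e"):
            return False
    return True


def boundaries_ok(pairs):
    """# is always deleted; ^ surfaces as e or is deleted."""
    for u, low in pairs:
        if u == "#" and low != EPS:
            return False
        if u == "^" and low not in ("e", EPS):
            return False
    return True


def letters_ok(pairs):
    """Ordinary letters are copied unchanged."""
    return all(u == low for u, low in pairs if u not in ("^", "#"))


RULES = [e_insertion_ok, boundaries_ok, letters_ok]


def accepted(pairs):
    """Run all rules over the same aligned pairs; every one must agree."""
    return all(rule(pairs) for rule in RULES)
```

For example, the alignment f:f o:o x:x ^:e s:s #:ε is accepted, while the same alignment with ^ deleted is rejected by e_insertion_ok alone.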
Putting it all together
execution of FSTi takes place in parallel
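A toy end-to-end sketch of generation through the three levels (lexical, intermediate, surface). It is a simplification: the lexicon step is a dictionary lookup standing in for Tnum, and the spelling step applies e-insertion and then boundary removal one after the other instead of running rule FSTs in parallel. The word lists and function names are assumptions for the example.

```python
import re

REG_NOUNS = {"cat", "fox"}
IRREG_PL = {"goose": "geese", "mouse": "mice", "sheep": "sheep"}
IRREG_SG = {"goose", "mouse", "sheep"}


def lexical_to_intermediate(lexical):
    """Stand-in for Tnum: e.g. fox+N+Pl -> fox^s#. Returns None if unknown."""
    stem, _, tags = lexical.partition("+")
    if tags == "N+Sg" and (stem in REG_NOUNS or stem in IRREG_SG):
        return stem + "#"
    if tags == "N+Pl" and stem in REG_NOUNS:
        return stem + "^s#"
    if tags == "N+Pl" and stem in IRREG_PL:
        return IRREG_PL[stem] + "#"
    return None


def intermediate_to_surface(intermediate):
    """Spelling step: e-insertion, then erase the remaining ^ and #."""
    with_e = re.sub(r"(?<=x)\^(?=s#)", "e", intermediate)
    return with_e.replace("^", "").replace("#", "")


def generate(lexical):
    """Map a lexical form to its surface form, or None if not in the lexicon."""
    intermediate = lexical_to_intermediate(lexical)
    if intermediate is None:
        return None
    return intermediate_to_surface(intermediate)
```

So generate("fox+N+Pl") passes through fox^s# and comes out as foxes, while generate("goose+N+Pl") comes out as geese without any spelling rule applying.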
Kaplan and Kay
The Xerox View
FSTi are aligned but separate
FSTi intersected together