Morphological Recognition
• We take the sub-lexicon of each stem class and expand the corresponding arc (e.g. the reg-noun arc) with all the morphemes that make up the set of stems in that word class (e.g. reg-noun).
• The result is an FSA that can be used for morphological recognition.
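The expansion described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the word lists, the state numbering (0 = start, 1 = after a regular noun stem, 2 = after a complete plural or irregular form) and the helper names build_fsa and recognize are all assumptions made for the example.

```python
from collections import defaultdict
from itertools import count

def build_fsa(reg_nouns, irreg_sg, irreg_pl):
    """Expand each stem-class arc into letter-by-letter arcs."""
    delta = defaultdict(set)   # (state, letter) -> set of next states
    fresh = count(3)           # ids for the intermediate letter states

    def add_word(src, word, dst):
        # Spell out `word` as a chain of arcs from src to dst.
        state = src
        for letter in word[:-1]:
            nxt = next(fresh)
            delta[(state, letter)].add(nxt)
            state = nxt
        delta[(state, word[-1])].add(dst)

    for w in reg_nouns:
        add_word(0, w, 1)              # the reg-noun arc
    for w in irreg_sg | irreg_pl:
        add_word(0, w, 2)              # irregular singular and plural arcs
    add_word(1, "s", 2)                # the plural -s arc
    return delta, {1, 2}

def recognize(word, delta, finals):
    """Simulate the (possibly nondeterministic) FSA on `word`."""
    states = {0}
    for letter in word:
        states = {n for s in states for n in delta.get((s, letter), ())}
        if not states:
            return False
    return bool(states & finals)

# Hypothetical toy sub-lexicons.
delta, finals = build_fsa({"cat", "dog", "fox"},
                          {"goose", "mouse"},
                          {"geese", "mice"})
print(recognize("cats", delta, finals), recognize("gooses", delta, finals))
```

Because the stems are only concatenated with -s, this recognizer accepts foxs and rejects foxes; it knows nothing about spelling changes.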
Two-level Morphology
• Ideally, for morphological parsing we would like to input a word and get as output its stem with morphological information. e.g. cats -> cat + N + PL
• Two-level morphology represents a word as the correspondence between the lexical and the surface level.
Finite State Transducer (FST)
• An FST is an automaton that we use to perform the mapping between the two levels.
• An FST is an automaton with two tapes that recognizes or generates pairs of strings; it therefore defines a relation between strings.
• Another view of an FST is as a machine that reads one string and generates another string.
Formal FST definition
• Extension of the FSA definition:
– Q: a finite set of states (q0, q1, q2, …)
– Σ: a finite alphabet of complex symbols, i.e. i:o pairs, where i is a symbol from the input alphabet and o a symbol from the output alphabet (ε may be part of both the input and output alphabets)
– q0: the start state
– F: the set of final states (a subset of Q)
– δ(q, i:o): the transition function from states and complex symbols to states. Given a state q and a complex symbol i:o, it returns a new state q’.
• e.g. Σ = {a:a, b:b, !:!, a:!, a:b, b:a, a:ε, ε:!}
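The five components of the definition map directly onto a small data structure. The sketch below is one possible encoding, assuming a deterministic δ stored as a dictionary and ε written as the empty string; the class name FST, its field names and the toy machine are illustrative only.

```python
from dataclasses import dataclass

EPS = ""  # stands in for ε on either tape

@dataclass
class FST:
    states: set      # Q
    alphabet: set    # Σ, a set of (i, o) pairs
    start: str       # q0
    finals: set      # F
    delta: dict      # δ: (state, (i, o)) -> next state

    def accepts(self, pairs):
        """True if the sequence of i:o pairs leads from q0 to a final state."""
        state = self.start
        for pair in pairs:
            if pair not in self.alphabet or (state, pair) not in self.delta:
                return False
            state = self.delta[(state, pair)]
        return state in self.finals

# A hypothetical three-state machine over part of the example alphabet.
toy = FST(
    states={"q0", "q1", "q2"},
    alphabet={("a", "b"), ("b", "a"), ("!", "!"), ("a", EPS)},
    start="q0",
    finals={"q2"},
    delta={
        ("q0", ("a", "b")): "q1",
        ("q1", ("b", "a")): "q0",
        ("q1", ("!", "!")): "q2",
    },
)

pairs = [("a", "b"), ("!", "!")]
upper = "".join(i for i, _ in pairs)   # string on the input tape
lower = "".join(o for _, o in pairs)   # string on the output tape
print(toy.accepts(pairs), upper, lower)
```

Reading off the first and second members of an accepted pair sequence gives the two related strings, which is the sense in which the machine defines a relation.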
Useful FST Properties
• Inversion: The inversion of a transducer simply switches the input and output labels of the transducer (the two tapes). It is therefore very easy to turn an FST from a parser into a generator.
• Composition: Given two FSTs, T1 that maps from I to C and T2 that maps from C to O, their composition is a new transducer T1 o T2 that maps from I to O. Therefore, if we have a number of FSTs that run serially, it is possible to build a single FST that maps from the initial input to the final output.
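Both properties are easy to illustrate on a toy encoding. In the sketch below a transducer is just a set of arcs (source, input, output, destination), which is an assumption made for the example, and compose uses the plain product construction, which is only correct when neither machine has ε arcs. The helper transduce is likewise hypothetical.

```python
def invert(arcs):
    """Swap the input and output label on every arc."""
    return {(src, o, i, dst) for (src, i, o, dst) in arcs}

def compose(arcs1, arcs2):
    """Product construction for T1 o T2, assuming no ε arcs in either FST."""
    return {((p, q), i, o, (p2, q2))
            for (p, i, mid, p2) in arcs1
            for (q, mid2, o, q2) in arcs2
            if mid == mid2}

def transduce(arcs, start, finals, text):
    """Return every output string the transducer relates to `text`."""
    results = set()
    stack = [(start, 0, "")]
    while stack:
        state, pos, out = stack.pop()
        if pos == len(text):
            if state in finals:
                results.add(out)
            continue
        for (src, i, o, dst) in arcs:
            if src == state and i == text[pos]:
                stack.append((dst, pos + 1, out + o))
    return results

# Two hypothetical one-state transducers.
T1 = {(0, "a", "b", 0), (0, "b", "a", 0)}   # maps a -> b, b -> a
T2 = {(0, "b", "c", 0), (0, "a", "a", 0)}   # maps b -> c, a -> a

T12 = compose(T1, T2)
print(transduce(T12, (0, 0), {(0, 0)}, "ab"))    # one step instead of two
print(transduce(invert(T2), 0, {0}, "ca"))       # T2 run in the other direction
```

Running T1 and then T2 on "ab" gives "ba" and then "ca"; the composed machine produces "ca" directly.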
Finite State Transducers
• It is convenient to view an FST as having two tapes.
– The upper or lexical tape
– The lower or surface tape
• Each symbol a:b in the FST alphabet expresses how a symbol from one tape is mapped to a symbol on the other tape.
• Symbols such as a:a are called default pairs and are represented simply as a.
FST Morphotactics
FST for English plural formation. ^ marks a morpheme boundary and # a word boundary.
FST Lexicon
Combining FST Lexicon and Morphotactics
• The two FSTs for lexicon and morphotactics can be cascaded, i.e. the input is run through the lexicon FST and then the output is run through the morphotactics FST.
• Based on the composition property, it is possible to compose these two FSTs into a single FST that maps directly from the lexical to the surface level (without any reference to word classes).
Orthographic Rules
• The previous FST will accept the word foxs and reject the word foxes.
• We need a way to deal with the spelling changes that often take place at morpheme boundaries. This is done by introducing orthographic rules. E.g. for English:
– e is inserted before -s after a stem ending in -s, -z, -x, -ch or -sh (e.g. fox + s -> foxes).
– -y becomes -ie before -s (e.g. try + s -> tries).
• Formal rule notation: a -> b / c__d means “rewrite a as b when it occurs between c and d”.
– e.g. ε -> e / {x,s,z}^__s# (insert e between a morpheme ending in x, s or z and a word-final -s).
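The e-insertion rule can be tried out with a regular-expression substitution. This is only a stand-in for the rule, not a transducer: the function names e_insertion and to_surface are made up for the example, and the rule covers only the x, s, z contexts of the formal notation, not -ch or -sh.

```python
import re

def e_insertion(intermediate):
    """Apply ε -> e / {x,s,z}^__s# to an intermediate-level string."""
    # Match x^, s^ or z^ only when it is followed by s#, then append e.
    return re.sub(r"([xsz]\^)(?=s#)", r"\1e", intermediate)

def to_surface(intermediate):
    """Apply the rule, then drop the ^ and # boundary symbols."""
    return e_insertion(intermediate).replace("^", "").replace("#", "")

for form in ["fox^s#", "cat^s#", "kiss^s#", "geese#"]:
    print(form, "->", to_surface(form))
```

The lookahead (?=s#) checks the right-hand context without consuming it, which mirrors the c__d shape of the rule.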
Orthographic Rules and FST
• The spelling rule can be seen as taking a simple concatenation of morphemes (intermediate level) and producing the surface form of the word.
• The previous orthographic rule can be represented as an FST.
• Transition table for the previous FST.
State/Input   s:s   x:x   z:z   ^:ε   ε:e   #     other
q0:           1     1     1     0     -     0     0
q1:           1     1     1     2     -     0     0
q2:           5     1     1     0     3     0     0
q3            4     -     -     -     -     -     -
q4            -     -     -     -     -     0     -
q5            1     1     1     2     -     -     0
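The table can be executed directly as a generator from the intermediate to the surface level. The sketch below rests on three assumptions that the table itself does not spell out: the colon after q0, q1 and q2 is read as marking accepting states, the # column is read as the default pair #:#, and "other" is taken to mean any symbol other than s, x, z, ^ and #, copied unchanged. The ε:e arc consumes no input, so the simulation has to explore both choices in q2.

```python
# Rows of the transition table; missing entries are the "-" cells.
TABLE = {
    0: {"s": 1, "x": 1, "z": 1, "^": 0, "#": 0, "other": 0},
    1: {"s": 1, "x": 1, "z": 1, "^": 2, "#": 0, "other": 0},
    2: {"s": 5, "x": 1, "z": 1, "^": 0, "eps:e": 3, "#": 0, "other": 0},
    3: {"s": 4},
    4: {"#": 0},
    5: {"s": 1, "x": 1, "z": 1, "^": 2, "other": 0},
}
FINALS = {0, 1, 2}   # assumption: the colon-marked rows

def generate(intermediate):
    """Return all surface strings the table relates to `intermediate`."""
    results = set()
    stack = [(0, 0, "")]               # (state, input position, output so far)
    while stack:
        state, pos, out = stack.pop()
        row = TABLE[state]
        if "eps:e" in row:             # ε:e: emit e without reading input
            stack.append((row["eps:e"], pos, out + "e"))
        if pos == len(intermediate):
            if state in FINALS:
                results.add(out)
            continue
        sym = intermediate[pos]
        col = sym if sym in "sxz^#" else "other"
        if col in row:
            emitted = "" if sym == "^" else sym   # ^:ε deletes the boundary
            stack.append((row[col], pos + 1, out + emitted))
    return results

print(generate("fox^s#"), generate("cat^s#"))
```

For fox^s# the path through q5 dies on # (the "-" cell), so only the path that inserts e survives. This is how the machine rules out foxs while still allowing cats.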
Combining FST Lexicon and Rules
• First the lexicon FST maps between the lexical level and the intermediate level which is just a concatenation of morphemes.
• Then, a number of spelling rule FSTs run in parallel (or as a cascade) mapping from the intermediate level to the surface level.
• The lexicon FST and the orthographic rules FST form a cascade. This can be run top-down (generation) or bottom-up (parsing).
FST Parsing
• Parsing is more complicated than generation because of ambiguity. E.g. foxes may be parsed both as fox+V+3SG and as fox+N+PL. Disambiguation cannot be performed at the lexical level, so the FST should return both parses.
• Ambiguities also arise during parsing because of ε arcs or multiple possible paths. This is the same situation as with NFSAs, and similar search techniques must be employed.
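The ambiguity of foxes can be reproduced with a very small experiment. The sketch below does not search a transducer; it takes a shortcut that only works for a finite toy lexicon: generate the surface form of every lexical entry and invert that mapping. The lexicon, the tag names other than N, V, PL and 3SG, and the function names are all hypothetical, and the e-insertion rule is approximated with a regular expression.

```python
import re
from collections import defaultdict

# Hypothetical toy lexicon: stem -> word classes, class -> feature -> suffix.
STEMS = {"fox": ["N", "V"], "cat": ["N"]}
SUFFIXES = {"N": {"SG": "", "PL": "^s"}, "V": {"INF": "", "3SG": "^s"}}

def to_surface(intermediate):
    """E-insertion (regex approximation), then remove boundary symbols."""
    with_e = re.sub(r"([xsz]\^)(?=s#)", r"\1e", intermediate)
    return with_e.replace("^", "").replace("#", "")

def build_parse_table():
    """Generate every surface form and record which lexical forms yield it."""
    table = defaultdict(set)
    for stem, classes in STEMS.items():
        for word_class in classes:
            for feature, suffix in SUFFIXES[word_class].items():
                lexical = f"{stem}+{word_class}+{feature}"
                table[to_surface(stem + suffix + "#")].add(lexical)
    return table

PARSES = build_parse_table()

def parse(word):
    """Return all lexical-level analyses of a surface word."""
    return sorted(PARSES.get(word, ()))

print(parse("foxes"))
print(parse("cats"))
```

Both analyses of foxes come back, and nothing at this level can choose between them. A real parser would instead run the inverted cascade with a nondeterministic search, since a realistic lexicon is too large to enumerate.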