Morphological Recognition We take each sub-lexicon of each stem class and we expand each arc (e.g....


Morphological Recognition

• We take each sub-lexicon of each stem class and we expand each arc (e.g. the reg-noun arc) with all the morphemes that make up the set of stems in the reg-noun word class.

• The result is an FSA that can be used for morphological recognition.
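The expansion of stem-class arcs into letter arcs can be sketched as follows. This is only a minimal illustration: the stems, the class names and the helper names `build_noun_fsa` and `accepts` are assumptions made for the example, not part of the slides.

```python
def build_noun_fsa(reg_nouns, irreg_sg, irreg_pl):
    """Expand each stem class into letter-level arcs; state 0 is the start state."""
    delta = {}       # (state, letter) -> next state
    finals = set()
    counter = 0

    def add_word(word):
        # Follow existing arcs where possible, create new states otherwise.
        nonlocal counter
        state = 0
        for letter in word:
            key = (state, letter)
            if key not in delta:
                counter += 1
                delta[key] = counter
            state = delta[key]
        return state

    for stem in reg_nouns:
        finals.add(add_word(stem))        # singular: the bare stem
        finals.add(add_word(stem + "s"))  # plural: stem followed by the -s arc
    for word in irreg_sg | irreg_pl:
        finals.add(add_word(word))        # irregular forms are listed whole
    return delta, finals


def accepts(fsa, word):
    """Return True if the FSA recognizes the word."""
    delta, finals = fsa
    state = 0
    for letter in word:
        state = delta.get((state, letter))
        if state is None:
            return False
    return state in finals


fsa = build_noun_fsa({"cat", "fox"}, {"goose"}, {"geese"})
print(accepts(fsa, "cats"))    # True
print(accepts(fsa, "gooses"))  # False
```

Note that this recognizer only answers yes or no; it says nothing about the stem or the morphological features of the word.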


Two-level Morphology

• Ideally, for morphological parsing we would like to input a word and get as output its stem with morphological information. e.g. cats -> cat + N + PL

• Two-level morphology represents a word as the correspondence between the lexical and the surface level.


Finite State Transducer (FST)

• An FST is an automaton that performs the mapping between the two levels.

• An FST is an automaton with two tapes that recognizes or generates pairs of strings; it therefore defines a relation between strings.

• Another view of an FST is as a machine that reads one string and generates another string.


Formal FST definition

• Extension of the FSA definition:

– Q: a finite set of states (q0, q1, q2, …)

– Σ: a finite alphabet of complex symbols, i.e. i:o pairs, where i is a symbol from the input alphabet and o is a symbol from the output alphabet (ε may belong to both the input and the output alphabet)

– q0: the start state

– F: the set of final states (a subset of Q)

– δ(q, i:o): the transition function from states and complex symbols to states. Given a state q and a complex symbol i:o, it returns a new state q’.

• E.g. Σ = {a:a, b:b, !:!, a:!, a:b, b:a, a:ε, ε:!}
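The five components above can be written down as a small data structure. The sketch below is one possible encoding, not a standard API: the class name `FST`, the method `accepts_pair`, and the use of the empty string for ε are assumptions made for illustration. The toy machine uses the pairs a:b, b:a and !:! from the example alphabet.

```python
from dataclasses import dataclass

EPS = ""  # assumption: the empty string stands in for ε


@dataclass
class FST:
    states: set      # Q
    alphabet: set    # Σ, a set of (i, o) pairs
    start: str       # q0
    finals: set      # F
    delta: dict      # δ: (state, (i, o)) -> next state

    def accepts_pair(self, upper, lower):
        """Return True if the pair (upper, lower) is in the relation."""
        stack = [(self.start, 0, 0)]
        seen = set()
        while stack:
            config = stack.pop()
            if config in seen:
                continue  # avoids looping on ε:ε cycles
            seen.add(config)
            state, u, l = config
            if state in self.finals and u == len(upper) and l == len(lower):
                return True
            for (src, (i, o)), dst in self.delta.items():
                # An ε on either side consumes nothing on that tape.
                if src == state and upper.startswith(i, u) and lower.startswith(o, l):
                    stack.append((dst, u + len(i), l + len(o)))
        return False


# Toy machine: swaps a and b, then requires a final "!".
swap = FST(
    states={"q0", "q1"},
    alphabet={("a", "b"), ("b", "a"), ("!", "!")},
    start="q0",
    finals={"q1"},
    delta={
        ("q0", ("a", "b")): "q0",
        ("q0", ("b", "a")): "q0",
        ("q0", ("!", "!")): "q1",
    },
)
print(swap.accepts_pair("ab!", "ba!"))  # True
print(swap.accepts_pair("ab!", "ab!"))  # False
```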


Useful FST Properties

• Inversion: The inversion of a transducer simply switches its input and output labels (the two tapes). It is therefore very easy to turn an FST from a parser into a generator.

• Composition: Given two FSTs, T1 that maps from I to C and T2 that maps from C to O, their composition is a new transducer T1 o T2 that maps from I to O. Therefore, if we have a number of FSTs that run serially, it is possible to build a single FST that maps from the initial input to the final output.
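Both properties can be illustrated on a deliberately simplified representation. In the sketch below an FST is a tuple `(start, finals, arcs)` with arcs of the form `(source, input, output, target)`; this encoding, the function names, and the restriction to ε-free machines are all assumptions made to keep the example short.

```python
def invert(fst):
    """Swap the input and output label on every arc."""
    start, finals, arcs = fst
    return start, finals, {(p, o, i, q) for (p, i, o, q) in arcs}


def compose(t1, t2):
    """Product construction for ε-free FSTs: T1's output must match T2's input."""
    s1, f1, a1 = t1
    s2, f2, a2 = t2
    arcs = {
        ((p1, p2), i, o, (q1, q2))
        for (p1, i, c1, q1) in a1
        for (p2, c2, o, q2) in a2
        if c1 == c2
    }
    finals = {(x, y) for x in f1 for y in f2}
    return (s1, s2), finals, arcs


def transduce(fst, text):
    """Return every output string the (ε-free) FST produces for the input text."""
    start, finals, arcs = fst
    configs = {(start, "")}
    for ch in text:
        configs = {
            (q, out + o)
            for (state, out) in configs
            for (p, i, o, q) in arcs
            if p == state and i == ch
        }
    return {out for (state, out) in configs if state in finals}


t1 = (0, {0}, {(0, "a", "b", 0)})  # maps a to b
t2 = (0, {0}, {(0, "b", "c", 0)})  # maps b to c
print(transduce(compose(t1, t2), "aa"))  # {'cc'}
print(transduce(invert(t1), "bb"))       # {'aa'}
```

Handling ε arcs in composition needs extra care and is left out of this sketch.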


Finite State Transducers

• It is convenient to view an FST as having two tapes.

– The upper or lexical tape

– The lower or surface tape

• Each symbol a:b in the FST alphabet expresses how a symbol from one tape is mapped to a symbol on the other tape.

• Symbols such as a:a are called default pairs and are represented simply as a.


FST Morphotactics

FST for English plural formation. ^ marks a morpheme boundary and # a word boundary.


FST Lexicon


Combining FST Lexicon and Morphotactics

• The two FSTs for lexicon and morphotactics can be cascaded, i.e. the input is run through the lexicon FST and then its output is run through the morphotactics FST.

• Based on the composition property, it is possible to compose these two FSTs into a single FST that maps directly from the lexical to the surface level (without any reference to word classes).


Orthographic Rules

• The previous FST will accept the word foxs and reject the word foxes.

• We need a way to deal with the spelling changes that often take place at morpheme boundaries. This is done by introducing orthographic rules. E.g. for English:

– e is inserted after -s, -z, -x, -ch, -sh when the next morpheme is -s.

– -y becomes -ie before -s.

• Formal rule notation: a -> b / c __ d means “rewrite a as b when it occurs between c and d”.

– ε -> e / {x,s,z}^ __ s#
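To make the notation concrete, the e-insertion rule can be mimicked with a regular-expression substitution. This is a stand-in for the rule, not an FST: the function name `insert_e` is an assumption, and the lookbehind and lookahead play the roles of the left context {x,s,z}^ and the right context s#.

```python
import re

# Zero-width match at the position between "{x,s,z}^" and "s#".
E_INSERTION = re.compile(r"(?<=[xsz]\^)(?=s#)")


def insert_e(intermediate):
    """Apply ε -> e / {x,s,z}^ __ s# to an intermediate-level string."""
    return E_INSERTION.sub("e", intermediate)


print(insert_e("fox^s#"))  # fox^es#
print(insert_e("cat^s#"))  # cat^s#
```

The boundary symbols ^ and # are still present in the result; removing them is a separate step.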


Orthographic Rules and FST

• The spelling rule can be seen as taking a simple concatenation of morphemes (intermediate level) and producing the surface form of the word.


Orthographic Rules and FST

• The previous orthographic rule can be represented as an FST.


Orthographic Rules and FST

• Transition table for the previous FST.

State/Input   s:s   x:x   z:z   ^:ε   ε:e   #    other
q0:           1     1     1     0     -     0    0
q1:           1     1     1     2     -     0    0
q2:           5     1     1     0     3     0    0
q3            4     -     -     -     -     -    -
q4            -     -     -     -     -     0    -
q5            1     1     1     2     -     -    0
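The table can be run directly as a generator from the intermediate level to the surface level. The sketch below rests on several assumptions that the slides do not state explicitly: rows written with a colon (q0:, q1:, q2:) are taken to be final states, # is treated as the default pair #:#, and "other" is taken to mean any symbol other than s, x, z, ^ and #. The names `TABLE`, `FINALS` and `generate` are illustrative.

```python
COLUMNS = ["s", "x", "z", "^", "eps:e", "#", "other"]
# One row per state; None marks a missing ("-") transition.
TABLE = {
    0: [1, 1, 1, 0, None, 0, 0],
    1: [1, 1, 1, 2, None, 0, 0],
    2: [5, 1, 1, 0, 3, 0, 0],
    3: [4, None, None, None, None, None, None],
    4: [None, None, None, None, None, 0, None],
    5: [1, 1, 1, 2, None, None, 0],
}
FINALS = {0, 1, 2}  # assumption: rows marked with ":" are final states
EPS_COL = COLUMNS.index("eps:e")


def generate(intermediate):
    """Return every surface string the table allows for an intermediate string."""
    results = set()
    stack = [(0, 0, "")]  # (state, position in input, output so far)
    while stack:
        state, pos, out = stack.pop()
        row = TABLE[state]
        if pos == len(intermediate) and state in FINALS:
            results.add(out)
        # ε:e consumes no input and writes an e.
        if row[EPS_COL] is not None:
            stack.append((row[EPS_COL], pos, out + "e"))
        if pos < len(intermediate):
            ch = intermediate[pos]
            col = COLUMNS.index(ch) if ch in ("s", "x", "z", "^", "#") else COLUMNS.index("other")
            if row[col] is not None:
                emitted = "" if ch == "^" else ch  # ^:ε deletes the boundary
                stack.append((row[col], pos + 1, out + emitted))
    return results


print(generate("fox^s#"))  # {'foxes#'}
print(generate("cat^s#"))  # {'cats#'}
```

For fox^s# the path through q5 (no e inserted) dies because q5 has no transition on #, so only the path through q3 and q4 survives.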


Combining FST Lexicon and Rules

• First the lexicon FST maps between the lexical level and the intermediate level which is just a concatenation of morphemes.

• Then, a number of spelling rule FSTs run in parallel (or as a cascade) mapping from the intermediate level to the surface level.

• The lexicon FST and the orthographic rules FST form a cascade. This can be run top-down (generation) or bottom-up (parsing).
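The top-down (generation) direction of the cascade can be sketched with plain functions standing in for the FSTs. Everything named here is an assumption for illustration: the tag table `SUFFIXES` (including the `+N+SG` tag), the function names, and the use of regular expressions in place of rule transducers. The two rules are the e-insertion and y-to-ie rules as stated earlier; the latter is taken literally and ignores stems such as boy.

```python
import re

# Lexical level -> intermediate level (hypothetical tag table).
SUFFIXES = {"+N+PL": "^s#", "+N+SG": "#"}

# Intermediate level -> surface level: spelling rules applied as a cascade.
RULES = [
    (re.compile(r"(?<=[xsz]\^)(?=s#)"), "e"),  # ε -> e / {x,s,z}^ __ s#
    (re.compile(r"y(?=\^s#)"), "ie"),          # -y becomes -ie before -s
]


def lexical_to_intermediate(lexical):
    """Replace the morphological tags with a concatenation of morphemes."""
    for tags, suffix in SUFFIXES.items():
        if lexical.endswith(tags):
            return lexical[: -len(tags)] + suffix
    raise ValueError(f"no known tags in {lexical!r}")


def intermediate_to_surface(intermediate):
    """Apply each spelling rule in turn, then drop the boundary symbols."""
    for pattern, replacement in RULES:
        intermediate = pattern.sub(replacement, intermediate)
    return intermediate.replace("^", "").replace("#", "")


def generate(lexical):
    return intermediate_to_surface(lexical_to_intermediate(lexical))


print(generate("fox+N+PL"))  # foxes
print(generate("fly+N+PL"))  # flies
print(generate("cat+N+PL"))  # cats
```

Unlike real FSTs, these functions cannot simply be inverted to run bottom-up, which is one practical reason for expressing both stages as transducers.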


FST Parsing

• Parsing is more complicated than generation because of ambiguity. E.g. foxes may be parsed both as fox+V+3SG and as fox+N+PL. Disambiguation cannot be performed at the lexical level, so the FST should return both parses.

• Ambiguities also arise during parsing because of ε arcs or multiple possible paths. This is the same situation as with non-deterministic FSAs (NFSA), and similar search techniques must be employed.
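The search can be sketched as a depth-first exploration that consumes the surface string and collects every lexical string that reaches a final state. The hand-built arcs below are a toy, already-composed lexical-to-surface machine for fox only; the multi-character symbols such as "+PL" and the "es" output on a single arc are simplifying assumptions, as is the function name `parse`.

```python
# Arcs: (source state, lexical symbol, surface symbol, target state).
# The empty string stands in for ε.
ARCS = [
    (0, "f", "f", 1),
    (1, "o", "o", 2),
    (2, "x", "x", 3),
    (3, "+N", "", 4),
    (4, "+PL", "es", 6),
    (3, "+V", "", 5),
    (5, "+3SG", "es", 6),
]
START, FINALS = 0, {6}


def parse(surface):
    """Return all lexical strings whose surface side matches the input.

    Assumes the machine has no cycle of arcs with an empty surface side,
    otherwise the search would not terminate.
    """
    parses = set()
    stack = [(START, 0, "")]  # (state, position in surface, lexical so far)
    while stack:
        state, pos, lexical = stack.pop()
        if state in FINALS and pos == len(surface):
            parses.add(lexical)
        for src, lex, surf, dst in ARCS:
            if src == state and surface.startswith(surf, pos):
                stack.append((dst, pos + len(surf), lexical + lex))
    return parses


print(sorted(parse("foxes")))  # ['fox+N+PL', 'fox+V+3SG']
```

The stack holds the alternatives that remain to be explored, which is the same bookkeeping used when recognizing with a non-deterministic FSA.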