Finite-State Methods in Natural Language Processing
Lauri Karttunen
LSA 2005 Summer Institute
July 20, 2005
Course Outline

July 18:
Intro to computational morphology
XFST
Readings:
Lauri Karttunen, “Finite-State Constraints”, The Last Phonological Rule, J. Goldsmith (ed.), pages 173-194, University of Chicago Press, 1993.
Karttunen and Beesley, “25 Years of Finite-State Morphology”
Chapter 1: “Gentle Introduction” (B&K)

July 20:
Regular expressions
More on XFST
Readings:
Chapter 2: “Systematic Introduction”
Chapter 3: “The XFST interface”

July 25:
Concatenative morphotactics
Constraining non-local dependencies
Readings:
Chapter 4. “The LEXC Language”
Chapter 5. “Flag Diacritics”

July 27:
Non-concatenative morphotactics
Reduplication, interdigitation
Readings:
Chapter 8. “Non-Concatenative Morphotactics”

August 1:
Realizational morphology
Readings:
Gregory T. Stump, Inflectional Morphology: A Theory of Paradigm Structure, Cambridge U. Press, 2001. (An excerpt)
Lauri Karttunen, “Computing with Realizational Morphology”, Lecture Notes in Computer Science, Volume 2588, Alexander Gelbukh (ed.), pages 205-216, Springer Verlag, 2003.

August 3:
Optimality theory
Readings:
Paul Kiparsky, “Finnish Noun Inflection”, Generative Approaches to Finnic and Saami Linguistics, Diane Nelson and Satu Manninen (eds.), pages 109-161, CSLI Publications, 2003.
Nine Elenbaas and René Kager, “Ternary rhythm and the lapse constraint”, Phonology 16, pages 273-329.

Scripting xfst

xfst -l myscript
    Start XFST, execute myscript, then wait for more commands from the command line.

xfst -f myscript
    Execute myscript and exit.

xfst -e "echo Welcome" \
     -e "regex a b c;" \
     -e "save foo" \
     -stop
    Execute the commands in the given order. The commands must be on the same line. The -stop at the end is required to make xfst quit.

Numeral Script

# This script constructs the language of English
# numerals from "one" to "ninety-nine".
# This is a comment.

# From "one" through "nine":

define OneToNine [{one} | {two} | {three} | {four} | {five} |
                  {six} | {seven} | {eight} | {nine}];

# It is convenient to define a set of prefixes that
# can be followed either by "teen" or by "ty".

define TeenTyStem [{thir} | {fif} | {six} | {seven} | {eigh} | {nine}];

Numeral Script (Continued)

# From "ten" to "nineteen"

define Teens [{ten} | {eleven} | {twelve} |
              [TeenTyStem | {four}] {teen}];

# Let's define stems that can be followed by "ty".

define TyStem [TeenTyStem | {twen} | {for}];

# TyStem is followed either by "ty" or by "ty-"
# and a number from OneToNine.

define Tens [TyStem [{ty} | {ty-} OneToNine]];

define OneToNinetyNine [ OneToNine | Teens | Tens ];

push OneToNinetyNine
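
The same language can be enumerated outside xfst as a sanity check. The following is a minimal Python sketch, not part of the course material: sets of strings stand in for the defined networks, and the helper `concat` (a hypothetical name) plays the role of concatenation.

```python
from itertools import product

def concat(*parts):
    """Concatenate every combination of strings drawn from the given sets."""
    return {"".join(combo) for combo in product(*parts)}

# Each set mirrors one "define" in the script.
one_to_nine = {"one", "two", "three", "four", "five",
               "six", "seven", "eight", "nine"}
teen_ty_stem = {"thir", "fif", "six", "seven", "eigh", "nine"}
teens = {"ten", "eleven", "twelve"} | concat(teen_ty_stem | {"four"}, {"teen"})
ty_stem = teen_ty_stem | {"twen", "for"}
tens = concat(ty_stem, {"ty"}) | concat(ty_stem, {"ty-"}, one_to_nine)
one_to_ninety_nine = one_to_nine | teens | tens

print(len(one_to_ninety_nine))            # 99
print("forty-two" in one_to_ninety_nine)  # True
```

Keeping "four" out of the shared stem set is what lets the language contain "fourteen" and "forty" while excluding "fourty".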

Number to Numeral
Generation: 105 → hundred five, hundred and five, one hundred and five
Analysis: hundred five → 105

NumberToNumeral script

# This script constructs a transducer that relates the
# English numerals "one", "two", ..., "ninety-nine"
# to the corresponding numbers "1", "2", ..., "99".

define OneToNine [1:{one} | 2:{two} | 3:{three} | 4:{four} | 5:{five} |
                  6:{six} | 7:{seven} | 8:{eight} | 9:{nine}];

define TeenTyStem [3:{thir} | 5:{fif} | 6:{six} | 7:{seven} | 8:{eigh} | 9:{nine}];

define Teens [1:0 [{0}:{ten} | 1:{eleven} | 2:{twelve} |
                   [TeenTyStem | 4:{four}] 0:{teen}]];

NumberToNumeral (Continued)

define TyStem [2:{twen} | TeenTyStem | 4:{for}];

# TyStem is followed either by "ty" paired with a zero
# or by "ty-" mapped to an epsilon and followed by a
# number. Note that {0} means zero and not epsilon.

define Tens [TyStem [{0}:{ty} | 0:{ty-} OneToNine]];

define OneToNinetyNine [ OneToNine | Teens | Tens ];

push OneToNinetyNine
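
To see how pairing digits with letter strings yields a two-way mapping, here is a hedged Python sketch that models the relation as a set of (number, numeral) pairs. The names `concat`, `generate`, and `analyze` are illustrative only; an empty string stands for epsilon, while the digit "0" is an ordinary symbol, as in the script.

```python
from itertools import product

def concat(*relations):
    """Concatenate relations pairwise: digits with digits, letters with letters."""
    return {
        ("".join(d for d, _ in combo), "".join(w for _, w in combo))
        for combo in product(*relations)
    }

one_to_nine = {("1", "one"), ("2", "two"), ("3", "three"), ("4", "four"),
               ("5", "five"), ("6", "six"), ("7", "seven"), ("8", "eight"),
               ("9", "nine")}
teen_ty_stem = {("3", "thir"), ("5", "fif"), ("6", "six"),
                ("7", "seven"), ("8", "eigh"), ("9", "nine")}
# 1:0 pairs the digit "1" with epsilon on the numeral side.
teens = concat(
    {("1", "")},
    {("0", "ten"), ("1", "eleven"), ("2", "twelve")}
    | concat(teen_ty_stem | {("4", "four")}, {("", "teen")}),
)
ty_stem = {("2", "twen"), ("4", "for")} | teen_ty_stem
# {0}:{ty} pairs the digit zero with "ty"; 0:{ty-} pairs epsilon with "ty-".
tens = concat(ty_stem, {("0", "ty")}) | concat(ty_stem, {("", "ty-")}, one_to_nine)
relation = one_to_nine | teens | tens

generate = dict(relation)                                      # number -> numeral
analyze = {numeral: number for number, numeral in relation}    # numeral -> number

print(generate["42"])          # forty-two
print(analyze["ninety-nine"])  # 99
```

A real transducer stores the relation compactly as a network rather than as an explicit list of pairs; the explicit set is only practical here because the relation is finite and small.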

Xerox RE Operators

$         containment
=>        restriction
->  @->   replacement

Make it easier to describe complex languages and relations without extending the formal power of finite-state systems.

Containment

$a

[?* a ?*]

(Figure: the network for $a, with arcs labeled ? and a.)
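
As a rough illustration outside xfst, membership in a containment language can be tested with an ordinary regular-expression library. The sketch below uses Python's `re` module; the function name `contains_a` is hypothetical, and `?*` is approximated by `.*` with the DOTALL flag so that any character is accepted.

```python
import re

# [?* a ?*]: any symbols, then an "a", then any symbols.
CONTAINS_A = re.compile(r".*a.*", re.DOTALL)

def contains_a(word):
    """Return True if word belongs to the language $a."""
    return CONTAINS_A.fullmatch(word) is not None

print(contains_a("banana"))  # True
print(contains_a("xyz"))     # False
```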

Restriction

a => b _ c

“Any a must be preceded by b and followed by c.”

Equivalent expression:
~[~[?* b] a ?*] & ~[?* a ~[c ?*]]

(Figure: the network for a => b _ c, with arcs labeled ?, a, b, and c.)
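
The two conjuncts of the equivalent expression can be read as two prohibitions: no a that lacks a preceding b, and no a that lacks a following c. A minimal Python sketch of that reading, using lookaround assertions from the `re` module (the function name `satisfies_restriction` is an assumption for illustration):

```python
import re

# An "a" violates the restriction if it is not preceded by "b"
# or not followed by "c".
VIOLATION = re.compile(r"(?<!b)a|a(?!c)")

def satisfies_restriction(word):
    """Return True if every "a" in word occurs between "b" and "c"."""
    return VIOLATION.search(word) is None

print(satisfies_restriction("xbacx"))  # True
print(satisfies_restriction("bax"))    # False
```

Strings with no a at all satisfy the restriction vacuously, which matches the complement-based definition.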

Replacement

a b -> b a

“Replace ‘ab’ by ‘ba’.”

Equivalent expression:
[[~$[a b] [[a b] .x. [b a]]]* ~$[a b]]

(Figure: the network for a b -> b a, with arcs labeled ?, a, b, a:b, and b:a.)
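
The equivalent expression describes the input as stretches that contain no "ab", alternating with occurrences of "ab" that are mapped to "ba". For this particular rule the decomposition is unique, because two occurrences of "ab" cannot overlap, so a plain Python sketch can mirror it with `split` and `join` (the function name `replace_ab` is hypothetical):

```python
def replace_ab(word):
    """Map every "ab" in word to "ba", leaving everything else unchanged."""
    # The pieces between occurrences of "ab" correspond to ~$[a b].
    unchanged_pieces = word.split("ab")
    # Each removed "ab" is replaced by "ba": [a b] .x. [b a].
    return "ba".join(unchanged_pieces)

print(replace_ab("aabb"))  # abab
print(replace_ab("xyz"))   # xyz
```

The output "abab" again contains "ab", which is fine: the constraint ~$[a b] applies to the input side only.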

Marking

a|e|i|o|u -> %[ ... %]

p o t a t o
p[o]t[a]t[o]

(Figure: the network for the marking rule, with arcs labeled ?, a, e, i, o, u, [, ], 0:[ and 0:].)
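
The effect of the marking rule on the example can be reproduced with a substitution that wraps each match instead of replacing it. A small Python sketch (the function name `mark_vowels` is illustrative only):

```python
import re

def mark_vowels(word):
    """Enclose every vowel in square brackets, like a|e|i|o|u -> %[ ... %]."""
    return re.sub(r"[aeiou]", lambda match: "[" + match.group(0) + "]", word)

print(mark_vowels("potato"))  # p[o]t[a]t[o]
```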

Multiple Results

a b | b | b a | a b a -> x
(a) b (a) -> x

applied to “aba”

a b a    a b a    a b a    a b a
a x a    a x      x a      x

Four factorizations of the input string.
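
The four results can be reproduced by brute force. The Python sketch below assumes the semantics suggested by the replacement slide: the input is split into replaced matches and unchanged stretches, and an unchanged stretch may not contain any match. The function name `replace_all_ways` is hypothetical.

```python
def replace_all_ways(word, patterns, replacement="x"):
    """Return every output of an undirected, obligatory replacement."""
    results = set()

    def contains_match(segment):
        return any(p in segment for p in patterns)

    def walk(i, gap, built):
        # gap is the unchanged stretch collected since the last replacement.
        if i == len(word):
            if not contains_match(gap):
                results.add(built + gap)
            return
        # Option 1: leave word[i] unchanged.
        walk(i + 1, gap + word[i], built)
        # Option 2: replace a pattern that starts at position i.
        if not contains_match(gap):
            for p in patterns:
                if word.startswith(p, i):
                    walk(i + len(p), "", built + gap + replacement)

    walk(0, "", "")
    return results

print(sorted(replace_all_ways("aba", ["ab", "b", "ba", "aba"])))
# ['ax', 'axa', 'x', 'xa']
```

Because every unchanged stretch must be free of matches, the b is always inside a replaced segment; the four outputs differ only in how much of its neighborhood is taken along.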

Directed Replace Operators

guarantee a unique result by constraining the factorization of the input string by
Direction of the match (rightward or leftward)
Length (longest or shortest)

@-> Left-to-right, Longest-match Replacement

(a) b (a) @-> x

applied to “aba”

a b a    a b a    a b a    a b a
a x a    a x      x a      x

Of the four factorizations, only the last is kept: the leftmost, longest match is the whole string “aba”, so the unique result is “x”.
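
A directed replacement can be sketched as a single left-to-right scan that always takes the longest pattern available at the current position. This is an illustrative Python approximation for a finite list of patterns, not the xfst compilation algorithm; the function name `replace_leftmost_longest` is an assumption.

```python
def replace_leftmost_longest(word, patterns, replacement="x"):
    """Scan left to right, replacing the longest match at each position."""
    longest_first = sorted(patterns, key=len, reverse=True)
    output = []
    i = 0
    while i < len(word):
        for p in longest_first:
            if word.startswith(p, i):
                output.append(replacement)
                i += len(p)
                break
        else:
            # No pattern starts here: copy the symbol unchanged.
            output.append(word[i])
            i += 1
    return "".join(output)

print(replace_leftmost_longest("aba", ["ab", "b", "ba", "aba"]))  # x
```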

Conditional Replacement

A -> B    Replacement
L _ R     Context

The relation that replaces A by B between L and R leaving everything else unchanged.

Sources of complexity:
Replacements and contexts may overlap
Alternative ways of interpreting “between left and right.”
A -> B || L _ R    both contexts on the input
A -> B // L _ R    left context on the output
A -> B \\ L _ R    right context on the output

Vowel shortening after a long vowel

Slovak
v o l + a: v + a: m e:
v o l + a: v + a m e
‘we call often’
V %: -> V || V %: C* _
Left context on the input side

Gidabal
g u n u: m + ba: + d a: ng + b e: +
g u n u: m + ba + d a: ng + b e +
‘is certainly right on the stump’
V %: -> V // V %: C* _
Left context on the output side

Shortening script

define V [ a | e | i | o | u ];
define C [ b | c | d | f | g | h | j | k | l | m | n | p | q | r | s | t | v | x | y | z ];

define SlovakShortening %: -> 0 || V %: C* V _ ;

define GidabalShortening %: -> 0 // V %: C* V _ ;

push SlovakShortening
down vola:va:me:
vola:vame

push GidabalShortening
down gunu:mba:da:ngbe:
gunu:mbada:ngbe
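
The difference between checking the left context on the input and on the output can be made concrete with a small simulation. The Python sketch below is an assumption-laden approximation of the two rules, not how xfst compiles them: it scans left to right and deletes a length mark ":" after a vowel whenever the material to its left ends in a long vowel plus optional consonants. The function names are hypothetical.

```python
import re

VOWELS = set("aeiou")
# V %: C* at the end of the left-hand material.
LONG_VOWEL_THEN_CONSONANTS = re.compile(r"[aeiou]:[bcdfghjklmnpqrstvxyz]*$")

def shorten(word, left_context_on_output):
    """Delete ":" in the context V %: C* V _ (Slovak: input; Gidabal: output)."""
    output = []
    for i, symbol in enumerate(word):
        if symbol == ":" and i > 0 and word[i - 1] in VOWELS:
            if left_context_on_output:
                # Everything already produced, minus the vowel being shortened.
                left = "".join(output[:-1])
            else:
                # The original input to the left of that vowel.
                left = word[:i - 1]
            if LONG_VOWEL_THEN_CONSONANTS.search(left):
                continue  # drop the length mark
        output.append(symbol)
    return "".join(output)

print(shorten("vola:va:me:", left_context_on_output=False))        # vola:vame
print(shorten("gunu:mba:da:ngbe:", left_context_on_output=True))   # gunu:mbada:ngbe
print(shorten("gunu:mba:da:ngbe:", left_context_on_output=False))  # gunu:mbadangbe
```

The last line shows what the Slovak-style rule would do to the Gidabal word: every long vowel after the first is shortened, whereas the output-oriented rule produces an alternating pattern because a vowel that has just been shortened no longer triggers shortening.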

Palatalization and Vowel Raising

Palatalization
tim --> cim

Vowel Raising
memi --> mimi

Interaction
temi --> cimi
tememi --> cimimi

Vowel Raising & Palatalization

define C [ b | c | d | f | g | h | j | k | l | m | n | p | q | r | s | t | v | x | y | z ];

define Raising e -> i \\ _ C* i ;
define Palatalization t -> c || _ i;

regex Raising .o. Palatalization;

down memi
mimi
down tim
cim
down temi
cimi
down tememi
cimimi

t e m e m i
t i m i m i
c i m i m i
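
The cascade can be imitated by two string functions applied in sequence. In this hedged Python sketch, raising checks its right context on the output by building the result from right to left, so a raised vowel can trigger raising further to the left; palatalization then runs on the result. The function names are illustrative, and composition is modeled simply as applying one function after the other.

```python
import re

CONSONANTS_THEN_I = re.compile(r"[bcdfghjklmnpqrstvxyz]*i")

def raise_vowels(word):
    """e -> i \\\\ _ C* i : the right context is checked on the output side."""
    output = ""
    for symbol in reversed(word):
        # output holds the already-processed material to the right.
        if symbol == "e" and CONSONANTS_THEN_I.match(output):
            symbol = "i"
        output = symbol + output
    return output

def palatalize(word):
    """t -> c || _ i"""
    return re.sub(r"t(?=i)", "c", word)

def apply_down(word):
    """Raising .o. Palatalization, applied in the downward direction."""
    return palatalize(raise_vowels(word))

for word in ["memi", "tim", "temi", "tememi"]:
    print(word, "->", apply_down(word))
# memi -> mimi
# tim -> cim
# temi -> cimi
# tememi -> cimimi
```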

Making a lexical transducer

Morphotactics: Lexicon Regular Expression → Compiler → Lexicon FST
Alternations: Rules Regular Expressions → Compiler → Rule FSTs
Lexicon FST and Rule FSTs → composition → Lexical Transducer (a single FST)

Finnish Gradation Script

define Stems [ {tukka} | {kakku} | {pappi} | {tippa} | {katto} | {juttu} |
               {tikka} | {huppu} | {rotta} | {nahka} |
               {lika} | {maku} | {rako} | {tuke} | {halko} | {jalka} |
               {virka} | {lanka} | {linko} | {puku} | {suku} | {tiuku} | {raaka} |
               {ripa} | {sopu} | {tapa} | {kampa} | {rumpu} | {sampe} |
               {sota} | {pata} | {kita} | {rinta} | {kanto} | {ranta} |
               {ilta} | {kulta} | {parta} | {kerta} ];

define Case [ "+Part":a | "+Gen":n ];

define Finnish [Stems Case];

Auxiliary definitions

define V [a | e | i | o | u | y | ä | ö];
define C [b | c | d | f | g | h | j | k | l | m | n | p | q | r | s | t | v | w | x | z];

define Coda [ C [C | .#.] ];

define ClosedSyll [V Coda] ;

Weak form of k

define WeakK k -> ' || V a _ a Coda, V u _ u Coda
        .o.  k -> j || r _ e Coda
        .o.  k -> v || u _ u Coda
        .o.  k -> g || n _ V Coda
        .o.  k -> 0 || \[s|h] _ V Coda ;   # kiskon 'rail', nahkan 'skin'

Weak form of p

define WeakP p -> m || m _ V Coda
        .o.  p -> v || \[s|p] _ V Coda     # piispan 'bishop'
        .o.  p -> 0 || p _ V Coda ;

Weak form of t

define WeakT t -> n || n _ V Coda
        .o.  t -> l || l _ V Coda
        .o.  t -> r || r _ V Coda
        .o.  t -> d || \[s|t] _ V Coda     # koston 'revenge'
        .o.  t -> 0 || t _ V Coda ;

Putting it all together

define Gradation WeakK .o. WeakP .o. WeakT;

regex Finnish .o. Gradation;

print lower-words

echo *** Size of Finnish .o. Gradation
print size
echo *** Size of Finnish
push Finnish
print size
echo *** Size of Gradation
push Gradation
print size

Syllabification

define C [ b | c | d | f ...
define V [ a | e | i | o | u ];

[C* V+ C*] @-> ... "-" || _ [C V]

“Insert a hyphen after the longest instance of the C* V+ C* pattern in front of a C V pattern.”

s t r u k t u r a l i s m i
s t r u k - t u - r a - l i s - m i
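
The syllabification rule can be approximated with a backtracking regular expression, since greedy matching of C* V+ C* followed by a lookahead for C V finds the longest match at each starting point. The Python sketch below is an illustration under assumptions: the consonant set is elided in the slide, so the class used here (all lowercase ASCII letters other than a, e, i, o, u) is a guess, and the function name `syllabify` is hypothetical.

```python
import re

V = "[aeiou]"
C = "[bcdfghjklmnpqrstvwxyz]"  # assumed; the slide's definition of C is elided

# [C* V+ C*] followed by (but not including) [C V].
SYLLABLE_BEFORE_CV = re.compile(f"({C}*{V}+{C}*)(?={C}{V})")

def syllabify(word):
    """Insert a hyphen after each longest C* V+ C* match that precedes C V."""
    return SYLLABLE_BEFORE_CV.sub(r"\1-", word)

print(syllabify("strukturalismi"))  # struk-tu-ra-lis-mi
```

The final syllable receives no hyphen because nothing of the form C V follows it.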