Finite-State Methods in Natural Language Processing
Lauri Karttunen
LSA 2005 Summer Institute
July 18, 2005

Course Outline
July 18: Intro to computational morphology; XFST
Readings:
  Lauri Karttunen, “Finite-State Constraints”, The Last Phonological Rule. J. Goldsmith (ed.), pages 173-194, University of Chicago Press, 1993.
  Karttunen and Beesley, “25 Years of Finite-State Morphology”
  Chapter 1: “Gentle Introduction” (B&K)
July 20: Regular expressions; more on XFST
Readings:
  Chapter 2: “Systematic Introduction”
  Chapter 3: “The XFST interface”
July 25: Concatenative morphotactics; constraining non-local dependencies
Readings:
  Chapter 4. “The LEXC Language”
  Chapter 5. “Flag Diacritics”
July 27: Non-concatenative morphotactics (reduplication, interdigitation)
Readings:
  Chapter 8. “Non-Concatenative Morphotactics”
August 1: Realizational morphology
Readings:
  Gregory T. Stump. Inflectional Morphology. A Theory of Paradigm Structure. Cambridge U. Press. 2001. (An excerpt)
  Lauri Karttunen, “Computing with Realizational Morphology”, Lecture Notes in Computer Science, Volume 2588, Alexander Gelbukh (ed.), 205-216, Springer Verlag. 2003.
August 3: Optimality theory
Readings:
  Paul Kiparsky, “Finnish Noun Inflection”, Generative Approaches to Finnic and Saami Linguistics, Diane Nelson and Satu Manninen (eds.), pp. 109-161, CSLI Publications, 2003.
  Nine Elenbaas and René Kager, “Ternary rhythm and the lapse constraint”, Phonology 16, 273-329.

Getting credit for LSA 207
There will be three assignments, given on each Wednesday. The first two are to be turned in by the following Monday, the last one by the following Friday.
You will get credit for the course if you solve at least two of the three assignments. The solutions will involve programming in the xfst scripting language. The problems will be easy to solve if you have attended the class.
If you have any problems in doing the assignments, Michael Wagner and I will be happy to help you.

Textbook
Copies will arrive in the Linguistics Department tomorrow afternoon.
You can purchase a copy there tomorrow as soon as the books have arrived.
Starting Wednesday, books can be purchased from our TA, Michael Wagner.
The price is $35.
With the book comes a software CD for Solaris, Linux, MacOSX and Windows operating systems.

LSA 207 Web site
http://lsa.dlp.mit.edu/Class/207
You can use this username and password to access materials:
Username: LSA207
Password: seunsehi207
You are free to copy, modify and use the slides for whatever purpose provided that you give appropriate credit to the original source.
The readings for Wednesday’s class (“Finite-State Constraints”, “25 Years of Finite-State Morphology” and “Gentle Introduction”, Chapter 1 of the B&K book) are posted on the web site.

Software
The software on the Book CD dates back to the Spring of 2003. For an update, point your browser to http://www.stanford.edu/~laurik/.lsa207/
Please read the README file and the License Agreement before downloading the software.
The updated software supports UTF-8 encoded Unicode input/output. The Book version supports only Latin-1 (ISO-8859-1).
The XFST application will be available locally on some computers (ask Michael).
Check out the web site for the Book: http://www.fsmbook.com/

Finite-State Methods in NLP
Domains of Application
  Tokenization
  Sentence breaking
  Spelling correction
  Morphology (analysis/generation)
  Phonological disambiguation (Speech Recognition)
  Morphological disambiguation (“Tagging”)
  Pattern matching (“Named Entity Recognition”)
  Shallow Parsing
Types of Finite-State Systems
  Classical (non-weighted) automata
  Weighted (associated with weights in a semi-ring)
  Binary relations (simple transducers)
  N-ary relations (multi-tape transducers)

Computational morphology
Analysis: leaves --> leaf N Pl, leave N Pl, leave V Sg3
Generation: hang V Past --> hanged, hung

Two challenges
Morphotactics
  Words are composed of smaller elements that must be combined in a certain order:
  piti-less-ness is English
  piti-ness-less is not English
Phonological alternations
  The shape of an element may vary depending on the context:
  pity is realized as piti in pitilessness
  die becomes dy in dying

Morphology is regular (=rational)
The relation between the surface forms of a language and the corresponding lexical forms can be described as a regular relation.
A regular relation consists of ordered pairs of strings:
  leaf+N+Pl : leaves
  hang+V+Past : hung
Any finite collection of such pairs is a regular relation.
Regular relations are closed under operations such as concatenation, iteration, union, and composition.
Complex regular relations can be derived from simple relations.
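To make the idea of a finite relation concrete, here is a minimal Python sketch that stores a relation as a set of (lexical, surface) pairs and looks strings up in either direction. The pairs are adapted from the examples on the earlier slides; the names `RELATION`, `analyze` and `generate` are invented for this illustration and have nothing to do with how XFST stores networks.

```python
# A finite relation as a plain set of (lexical, surface) pairs.
# Illustrative only: a real system compiles the relation into a transducer.
RELATION = {
    ("leaf+N+Pl", "leaves"),
    ("leave+N+Pl", "leaves"),
    ("leave+V+Sg3", "leaves"),
    ("hang+V+Past", "hung"),
    ("hang+V+Past", "hanged"),
}

def analyze(surface, relation=RELATION):
    """Surface form -> all lexical forms paired with it."""
    return sorted(lex for lex, surf in relation if surf == surface)

def generate(lexical, relation=RELATION):
    """Lexical form -> all surface forms paired with it."""
    return sorted(surf for lex, surf in relation if lex == lexical)

print(analyze("leaves"))
print(generate("hang+V+Past"))
```

Because the relation is just a set of pairs, the same data serves both analysis and generation, and the union of two such relations is simply the union of the two sets.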

Morphology is finite-state
A regular relation can be defined using the metalanguage of regular expressions.
[{talk} | {walk} | {work}]
[%+Base:0 | %+SgGen3:s | %+Progr:{ing} | %+Past:{ed}];
A regular expression can be compiled into a finite-state transducer that implements the relation computationally.

Compilation
Regular expression:
[{talk} | {walk} | {work}]
[%+Base:0 | %+SgGen3:s | %+Progr:{ing} | %+Past:{ed}];
[Figure: the finite-state transducer compiled from the regular expression, with an initial state, arcs spelling the stems talk, walk and work, suffix arcs for +Base, +3rdSg, +Progr and +Past, and a final state.]

Generation
work+3rdSg --> works
[Figure: the same transducer applied in the generation direction, with stem arcs k:k, t:t, a:a, w:w, o:o, l:l, r:r and the suffix arcs.]

Analysis
talked --> talk+Past
[Figure: the same transducer applied in the analysis direction.]

XFST Demo 1
Start xfst:
% xfst
xfst[0]:
Compile a regular expression:
xfst[0]: regex
[{talk} | {walk} | {work}]
[%+Base:0 | %+SgGen3:s | %+Progr:{ing} | %+Past:{ed}];
Apply the result:
xfst[1]: apply up walked
walk+Past
xfst[1]: apply down talk+SgGen3
talks
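The demo can be imitated in a few lines of Python, as a rough sketch of what the regular expression denotes rather than of how xfst works internally. The relation is built by concatenating each stem with each tag/ending pair; `apply_up` and `apply_down` are hypothetical helper names chosen to echo the xfst commands, and the empty string stands in for the epsilon symbol 0.

```python
from itertools import product

STEMS = ["talk", "walk", "work"]
# (lexical tag, surface ending); "" plays the role of 0 (epsilon).
SUFFIXES = [("+Base", ""), ("+SgGen3", "s"), ("+Progr", "ing"), ("+Past", "ed")]

# Concatenate the stem language with the suffix relation.
RELATION = {
    (stem + tag, stem + ending)
    for stem, (tag, ending) in product(STEMS, SUFFIXES)
}

def apply_up(surface):
    """Analysis: surface form -> lexical forms."""
    return sorted(lex for lex, surf in RELATION if surf == surface)

def apply_down(lexical):
    """Generation: lexical form -> surface forms."""
    return sorted(surf for lex, surf in RELATION if lex == lexical)

print(apply_up("walked"))         # ['walk+Past']
print(apply_down("talk+SgGen3"))  # ['talks']
```

Enumerating all pairs only works because this relation is finite (3 stems times 4 suffixes); a transducer represents the same pairs compactly and also handles infinite relations.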

Lexical transducer
[Figure: a finite-state transducer relating the inflected form veut to the citation form plus inflection codes, vouloir +IndP +SG +P3.]
v o u l o i r +IndP +SG +P3
v e u t
Bidirectional: generation or analysis
Compact and fast
Comprehensive systems have been built for over 40 languages: English, German, Dutch, French, Italian, Spanish, Portuguese, Finnish, Russian, Turkish, Japanese, Korean, Basque, Greek, Arabic, Hebrew, Bulgarian, …

How lexical transducers are made
[Figure: a lexicon regular expression (morphotactics) and rule regular expressions (alternations) are passed through the compiler to give a lexicon FST and rule FSTs, which are combined by composition into the lexical transducer (a single FST). Example path: f a t +Adj +Comp on the lexical side, f a t t e r on the surface side.]

Sequential Model
An ordered sequence of rewrite rules (Chomsky & Halle ‘68) can be modeled by a cascade of finite-state transducers (Johnson ‘72, Kaplan & Kay ‘81).
[Figure: Lexical form, fst 1, Intermediate form, fst 2, ..., fst n, Surface form.]

Discovery and Rediscovery
C. Douglas Johnson (1972) showed that
– phonological rewrite rules are interpreted in a way that makes them less powerful than they appear
– rewrite rules can be modeled by finite transducers
– for any two finite transducers applied in a sequence there exists an equivalent single transducer (Schützenberger 1961).
Johnson’s result was ignored and forgotten until it was rediscovered by Ronald M. Kaplan and Martin Kay at Xerox around 1980.

Application constraint
Phonological rewrite rules are not as powerful as they appear because of the constraint that a rule does not apply to its own output (Johnson 1972, Kaplan & Kay 1980).

Sequential application
k a N p a n
  N -> m / _ p
k a m p a n
  p -> m / m _
k a m m a n
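The derivation above can be replayed with ordinary string rewriting. The following Python sketch uses regular-expression lookahead and lookbehind to express the two contexts; the function names are made up for the example, and this is a string-level imitation of the rules, not a transducer.

```python
import re

def nasal_assimilation(s):
    # N -> m / _ p   (N becomes m before p)
    return re.sub(r"N(?=p)", "m", s)

def p_assimilation(s):
    # p -> m / m _   (p becomes m after m)
    return re.sub(r"(?<=m)p", "m", s)

lexical = "kaNpan"
intermediate = nasal_assimilation(lexical)
surface = p_assimilation(intermediate)
print(lexical, "->", intermediate, "->", surface)
```

The order matters: applying `p_assimilation` first leaves kaNpan unchanged, so the reversed cascade stops at kampan.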

Sequential application in detail
[Figure: state diagrams of the two rule transducers. The transducer for N -> m / _ p has states 0, 1, 2 and arcs labeled N:m, N, p, m, ?; the transducer for p -> m / m _ has states 0, 1 and arcs labeled p:m, p, m, ?.]
k a N p a n
0 0 0 2 0 0 0   (states visited in the first transducer)
k a m p a n
0 0 0 1 0 0 0   (states visited in the second transducer)
k a m m a n

Composition
[Figure: the single transducer obtained by composing the two rule transducers, with states 0, 1, 2, 3 and arcs labeled N:m, p:m, N, m, p, ?.]
k a N p a n
0 0 0 3 0 0 0   (states visited in the composed transducer)
k a m m a n

Parallel Model
A set of parallel two-level rules (constraints) compiled into finite-state automata interpreted as transducers (Koskenniemi ‘83).
[Figure: fst 1, fst 2, ..., fst n side by side between the Lexical form and the Surface form.]

Sequential vs. parallel rules
[Figure, left: Chomsky & Halle 1968. Lexical form, rule 1, Intermediate form, ..., rule n, Surface form; the rules are combined into a single FST by composition.]
[Figure, right: Koskenniemi 1983. rule 1, rule 2, ..., rule n apply side by side between the Lexical form and the Surface form; the rules are combined into a single FST by intersection.]

Rewrite rules
Yawelmani Vowel Harmony (Kisseberth 1969)
? u: t y ? A s
  Epenthesis
? u: t I y ? A s
  Harmony
? u: t u y ? a s
  Lowering
? o: t u y ? a s

Two-level constraints
? u: t 0 y ? A s
? o: t u y ? a s
Underlying representation controls all three alternations.
Epenthesis: Insert u or i (underspecification)
Harmony: Rounding next to a round V of the same height.
Lowering: Long u always realized as long o.
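The parallel idea can be sketched in Python on the simpler kaNpan example from the earlier slides rather than on the Yawelmani data. Each constraint inspects an aligned sequence of lexical:surface pairs on its own, and a pair of strings is accepted only if every constraint accepts it. The two-level restatement of the rules (`N:m <=> _ p:` and `p:m <=> :m _`), the set of allowed pairs, and all function names are assumptions made for this illustration.

```python
# Hypothetical set of non-identity pairs allowed in this toy grammar.
SPECIAL_PAIRS = {("N", "m"), ("p", "m")}

def feasible(pairs):
    # Every pair is either an identity pair or one of the special pairs.
    return all(lex == surf or (lex, surf) in SPECIAL_PAIRS for lex, surf in pairs)

def n_rule(pairs):
    # N:m <=> _ p:   lexical N surfaces as m exactly when a lexical p follows.
    for i, (lex, surf) in enumerate(pairs):
        if lex == "N":
            before_p = i + 1 < len(pairs) and pairs[i + 1][0] == "p"
            if (surf == "m") != before_p:
                return False
    return True

def p_rule(pairs):
    # p:m <=> :m _   lexical p surfaces as m exactly when a surface m precedes.
    for i, (lex, surf) in enumerate(pairs):
        if lex == "p":
            after_m = i > 0 and pairs[i - 1][1] == "m"
            if (surf == "m") != after_m:
                return False
    return True

def accepts(lexical, surface, rules=(feasible, n_rule, p_rule)):
    """All constraints check the same lexical/surface pairing in parallel."""
    if len(lexical) != len(surface):
        return False
    pairs = list(zip(lexical, surface))
    return all(rule(pairs) for rule in rules)

print(accepts("kaNpan", "kamman"))  # True
print(accepts("kaNpan", "kampan"))  # False: p after a surface m must be m
```

There is no intermediate form here: the second constraint refers directly to the surface side of the preceding pair, which is what lets the constraints apply simultaneously.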

Rewrite Rules vs. Constraints
• Two different ways of decomposing the complex relation between lexical and surface forms into a set of simpler relations that can be more easily understood and manipulated.
• One approach may be more convenient than the other for particular applications.

The Big Picture
[Figure: a regular expression (for example a) describes a language or relation (for example {a}); the regular expression compiles into a finite-state network; the network encodes the language or relation.]

XFST Demo 2
xfst[0]: define Cat {cat} | {tiger} | {lion};
defined Cat: 640 bytes. 11 states, 12 arcs, 3 paths.
...
xfst[0]: set verbose off
xfst[0]: define Dog {dog} | {spaniel} | {poodle};
xfst[0]: regex Cat | Dog ;
xfst[1]: apply up
apply up> dog
dog
apply up> panther
apply up>
apply up> END;
xfst[1]: define Animal
xfst[0]:
xfst[0]: regex Cat & Dog;
xfst[1]: print net
Sigma: a c d e g i l n o p r s t
Size: 13, Label Map: Default
Net: Flags: deterministic, pruned, minimized, epsilon_free, ...
s0: (no arcs)
xfst[1]:
xfst[1]: pop
xfst[0]:
xfst[0]: regex Animal - Dog;
xfst[1]: push Cat
xfst[2]: test equivalent
1, (0=NO,1=YES)
xfst[2]: clear
xfst[0]:
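Since the languages in this demo are finite, the same operations can be checked with Python sets. This is only an analogy: xfst operates on networks, not on enumerated word lists, and the variable names below are chosen to mirror the `define` commands.

```python
cat = {"cat", "tiger", "lion"}
dog = {"dog", "spaniel", "poodle"}

animal = cat | dog             # regex Cat | Dog ;
empty = cat & dog              # regex Cat & Dog ;  (no words in common)
same = (animal - dog) == cat   # regex Animal - Dog ; ... test equivalent

print("dog" in animal)      # apply up> dog      -> found
print("panther" in animal)  # apply up> panther  -> not found
print(empty, same)
```

The empty intersection corresponds to the network with a single state and no arcs printed by `print net`.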

Compiling networks from words
[Figure: the network compiled from the six words below.]
xfst[0]: read text
clear
clever
ear
ever
fat
father
^D
432 bytes. 10 states, 12 arcs, 6 paths.
read text < file
read regex {clear}|{clever}|{ear}|{ever}|{fat}|{father} ;
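The state and arc counts reported for this word list can be reproduced with a short Python sketch that builds a trie and then merges all nodes that accept the same set of continuations. This is one simple way to minimize an acyclic automaton and is not claimed to be the algorithm xfst uses; `minimal_network` is a name invented for the example.

```python
def minimal_network(words):
    """Return (states, arcs) of the minimal acyclic automaton for `words`."""
    # Step 1: build a trie; the key "" marks the end of a word.
    trie = {}
    for word in words:
        node = trie
        for ch in word:
            node = node.setdefault(ch, {})
        node[""] = {}

    # Step 2: give equivalent nodes the same number, bottom-up.
    # A node's signature is (is_final, its outgoing arcs to numbered states).
    registry = {}

    def canon(node):
        arcs = tuple(sorted((ch, canon(child))
                            for ch, child in node.items() if ch))
        signature = ("" in node, arcs)
        return registry.setdefault(signature, len(registry))

    canon(trie)
    n_arcs = sum(len(arcs) for _, arcs in registry)
    return len(registry), n_arcs

words = ["clear", "clever", "ear", "ever", "fat", "father"]
print(minimal_network(words))  # (10, 12)
```

The savings come from shared suffixes: the tails of clear and ear, and those of clever, ever and father, end up on the same states.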

Regular Expression Calculus
Symbols
  Simple symbols vs. symbol pairs
  Special symbols: ANY, EPSILON
Common regular expression operators
  concatenation, union, intersection, negation, composition
Xerox operators
  contains, restriction, replacement

Symbols and Labels
Single and multicharacter symbols
  a, b, c, … , +Adj, +SG, ^Fin
Special symbols
  0  EPSILON
  ?  ANY
Symbols vs. symbol pairs
  In general, no distinction is made between
  a    the language {“a”}
  a:a  the identity relation {<“a”, “a”>}

Common RE Operators
  (juxtaposition)  concatenation
  * +      iteration
  |        union
  &        intersection*
  ~ \ -    complementation*, minus*
  .x. :    crossproduct
  .o.      composition
* = not applicable to regular relations because the result may not be encodable by a finite-state network.

Iteration
A*  zero or more concatenations of A
A+  one or more concatenations of A
?*  the universal language/the universal identity relation
[a:A | b:B | c:C | d:D | … ]*
[Figure: one-state networks for ?* (a loop labeled ?) and for [a:A | b:B | c:C | d:D | … ]* (loops labeled a:A, b:B, c:C, d:D).]

Negation
\A  any single symbol that is not in A
\?  the null language
~A  any string that is not in A
[Figure: networks for a, for \a (Sigma: a, ?) and for ~a, with arcs labeled a and ?.]

Crossproduct
A .x. B  The relation that maps every string in A to every string in B, and vice versa
A:B      Same as [A .x. B].
Three equivalent expressions:
  a b c .x. x y
  [a b c] : [x y]
  {abc}:{xy}
[Figure: the resulting network, a path with arcs a:x, b:y, c:0.]
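For finite languages the crossproduct is just the Cartesian product of the two sets of strings. The Python sketch below also shows one way to obtain the arc labels a:x, b:y, c:0 from the slide by padding the shorter string with epsilon at the end; the helper names and the padding strategy are assumptions made for this illustration.

```python
from itertools import product, zip_longest

def crossproduct(a, b):
    """Pair every string in language a with every string in language b."""
    return set(product(a, b))

def align(upper, lower, epsilon="0"):
    """Pair symbols position by position, padding the shorter side with epsilon."""
    return list(zip_longest(upper, lower, fillvalue=epsilon))

print(crossproduct({"abc"}, {"xy"}))  # {('abc', 'xy')}
print(align("abc", "xy"))             # [('a', 'x'), ('b', 'y'), ('c', '0')]
```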

Composition
A .o. B  The relation C such that if A maps x to y and B maps y to z, C maps x to z.
Example: {abc} .o. [a:A | b:B | c:C | d:D]*
[Figure: the network for {abc} (arcs a, b, c), the one-state network for [a:A | b:B | c:C | d:D]* (loops a:A, b:B, c:C, d:D), and the result of the composition, a path with arcs a:A, b:B, c:C.]
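The definition of composition translates directly into a set comprehension over pairs. In this Python sketch the language {abc} is treated as an identity relation, and the infinite relation [a:A | b:B | c:C | d:D]* is replaced by a small finite sample of its pairs, so the code illustrates the definition only; `compose` and the sample strings are invented for the example.

```python
def compose(r1, r2):
    """If r1 maps x to y and r2 maps y to z, the result maps x to z."""
    return {(x, z) for x, y1 in r1 for y2, z in r2 if y1 == y2}

# The language {abc}, viewed as the identity relation on its strings.
abc = {("abc", "abc")}

# A finite sample of [a:A | b:B | c:C | d:D]* (the real relation is infinite).
uppercase = {(w, w.upper()) for w in ["", "a", "abc", "abcd", "dcba"]}

print(compose(abc, uppercase))  # {('abc', 'ABC')}
```

Only the pair whose upper side matches abc survives, which is why composing with a language acts as a filter on the upper side of the relation.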

Xerox RE Operators
  $        containment
  =>       restriction
  -> @->   replacement
Make it easier to describe complex languages and relations without extending the formal power of finite-state systems.

Containment
$a
Equivalent expression: [?* a ?*]
[Figure: the two-state network for $a, with loops labeled ? and a and an arc labeled a into the final state.]

Restriction
a => b _ c
“Any a must be preceded by b and followed by c.”
Equivalent expression: ~[~[?* b] a ?*] & ~[?* a ~[c ?*]]
[Figure: the network for a => b _ c, with arcs labeled a, b, c and ?.]
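The equivalence between the restriction and its expansion can be checked by brute force on short strings. The Python sketch below implements the plain-English reading and the two negated conjuncts separately and compares them on every string of up to four symbols over the alphabet a, b, c; the function names are invented, and the bounded alphabet and length are assumptions that keep the check finite.

```python
from itertools import product

def restricted(s):
    """Direct reading: every a is preceded by b and followed by c."""
    return all(s[i - 1:i] == "b" and s[i + 1:i + 2] == "c"
               for i, ch in enumerate(s) if ch == "a")

def in_bad_left(s):
    # s is in [~[?* b] a ?*]: some a whose left context does not end in b.
    return any(ch == "a" and not s[:i].endswith("b")
               for i, ch in enumerate(s))

def in_bad_right(s):
    # s is in [?* a ~[c ?*]]: some a whose right context does not start with c.
    return any(ch == "a" and not s[i + 1:].startswith("c")
               for i, ch in enumerate(s))

def equivalent_expression(s):
    # ~[~[?* b] a ?*] & ~[?* a ~[c ?*]]
    return not in_bad_left(s) and not in_bad_right(s)

strings = ["".join(p) for n in range(5) for p in product("abc", repeat=n)]
assert all(restricted(s) == equivalent_expression(s) for s in strings)
print(restricted("bac"), restricted("ac"), restricted("ba"))  # True False False
```

Strings without any a, including the empty string, satisfy the restriction vacuously.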