Finite-State Methods in Natural Language Processing

Lauri Karttunen, LSA 2005 Summer Institute, July 18, 2005


Page 1: Finite-State Methods in Natural Language Processing

Finite-State Methods in Natural Language Processing

Lauri Karttunen
LSA 2005 Summer Institute
July 18, 2005

Page 2: Finite-State Methods in Natural Language Processing

Course Outline

July 18: Intro to computational morphology, XFST

Readings:
• Lauri Karttunen, “Finite-State Constraints”, The Last Phonological Rule. J. Goldsmith (ed.), pages 173-194, University of Chicago Press, 1993.
• Karttunen and Beesley, “25 Years of Finite-State Morphology”
• Chapter 1: “Gentle Introduction” (B&K)

July 20: Regular expressions, more on XFST

Readings:
• Chapter 2: “Systematic Introduction”
• Chapter 3: “The XFST interface”

Page 3: Finite-State Methods in Natural Language Processing

July 25: Concatenative morphotactics, constraining non-local dependencies

Readings:
• Chapter 4. “The LEXC Language”
• Chapter 5. “Flag Diacritics”

July 27: Non-concatenative morphotactics (reduplication, interdigitation)

Readings:
• Chapter 8. “Non-Concatenative Morphotactics”

Page 4: Finite-State Methods in Natural Language Processing

August 1: Realizational morphology

Readings:
• Gregory T. Stump. Inflectional Morphology. A Theory of Paradigm Structure. Cambridge U. Press. 2001. (An excerpt)
• Lauri Karttunen, “Computing with Realizational Morphology”, Lecture Notes in Computer Science, Volume 2588, Alexander Gelbukh (ed.), 205-216, Springer Verlag. 2003.

August 3: Optimality theory

Readings:
• Paul Kiparsky, “Finnish Noun Inflection”, Generative Approaches to Finnic and Saami Linguistics, Diane Nelson and Satu Manninen (eds.), pp. 109-161, CSLI Publications, 2003.
• Nine Elenbaas and René Kager. “Ternary rhythm and the lapse constraint”. Phonology 16. 273-329.

Page 5: Finite-State Methods in Natural Language Processing

Getting credit for LSA 207

There will be three assignments, given on each Wednesday. The first two are to be turned in by the following Monday, the last one by the following Friday.

You will get credit for the course if you solve at least two of the three assignments. The solutions will involve programming in the xfst scripting language. The problems will be easy to solve if you have attended the class.

If you have any problems in doing the assignments, Michael Wagner and I will be happy to help you.

Page 6: Finite-State Methods in Natural Language Processing

Textbook

Copies will arrive in the Linguistics Department tomorrow afternoon.

You can purchase a copy there tomorrow as soon as the books have arrived.

Starting Wednesday, books can be purchased from our TA, Michael Wagner.

The price is $35.

With the book comes a software CD for Solaris, Linux, MacOSX and Windows operating systems.

Page 7: Finite-State Methods in Natural Language Processing

LSA 207 Web site

http://lsa.dlp.mit.edu/Class/207

You can use this username and password to access materials:
Username: LSA207
Password: seunsehi207

You are free to copy, modify and use the slides for whatever purpose provided that you give appropriate credit to the original source.

The readings for Wednesday’s class (“Finite-State Constraints”, “25 Years of Finite-State Morphology” and “Gentle Introduction”, Chapter 1 of the B&K book) are posted on the web site.

Page 8: Finite-State Methods in Natural Language Processing

Software

The software on the Book CD dates back to the Spring of 2003. For an update, point your browser to http://www.stanford.edu/~laurik/.lsa207/

Please read the README file and the License Agreement before downloading the software.

The updated software supports UTF-8 encoded Unicode input/output. The Book version supports only Latin-1 (ISO-8859-1).

The XFST application will be available locally on some computers (ask Michael).

Check out the web site for the Book: http://www.fsmbook.com/

Page 9: Finite-State Methods in Natural Language Processing

Finite-State Methods in NLP

Domains of Application
• Tokenization
• Sentence breaking
• Spelling correction
• Morphology (analysis/generation)
• Phonological disambiguation (Speech Recognition)
• Morphological disambiguation (“Tagging”)
• Pattern matching (“Named Entity Recognition”)
• Shallow Parsing

Types of Finite-State Systems
• Classical (non-weighted) automata
• Weighted (associated with weights in a semi-ring)
• Binary relations (simple transducers)
• N-ary relations (multi-tape transducers)

Page 10: Finite-State Methods in Natural Language Processing

Computational morphology

Analysis

leaves --> leaf N Pl, leave N Pl, leave V Sg3

Generation

hang V Past --> hanged, hung

Page 11: Finite-State Methods in Natural Language Processing

Two challenges

Morphotactics
Words are composed of smaller elements that must be combined in a certain order:
• piti-less-ness is English
• piti-ness-less is not English

Phonological alternations
The shape of an element may vary depending on the context:
• pity is realized as piti in pitilessness
• die becomes dy in dying

Page 12: Finite-State Methods in Natural Language Processing

Morphology is regular (=rational)

The relation between the surface forms of a language and the corresponding lexical forms can be described as a regular relation.

A regular relation consists of ordered pairs of strings.
• leaf+N+Pl : leaves
• hang+V+Past : hung

Any finite collection of such pairs is a regular relation.

Regular relations are closed under operations such as concatenation, iteration, union, and composition.

Complex regular relations can be derived from simple relations.
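As a rough illustration of these closure properties, the following Python sketch treats a finite relation as a set of (lexical, surface) string pairs. The helper names `union`, `concat` and `compose`, and the intermediate form `leaf^s`, are assumptions made for this example only; they are not part of XFST.

```python
# A finite relation is modeled as a set of (upper, lower) string pairs.

def union(r1, r2):
    """All pairs that belong to either relation."""
    return r1 | r2

def concat(r1, r2):
    """Concatenate the upper sides and the lower sides pairwise."""
    return {(u1 + u2, l1 + l2) for (u1, l1) in r1 for (u2, l2) in r2}

def compose(r1, r2):
    """Map x to z whenever r1 maps x to y and r2 maps y to z."""
    return {(x, z) for (x, y) in r1 for (y2, z) in r2 if y == y2}

stems = {("talk", "talk"), ("walk", "walk")}
past = {("+Past", "ed")}
irregular = {("hang+V+Past", "hung"), ("leaf+N+Pl", "leaves")}

verbs = concat(stems, past)        # talk+Past:talked, walk+Past:walked
lexicon = union(verbs, irregular)  # four pairs in total

# Composition through a hypothetical intermediate form "leaf^s".
step1 = {("leaf+N+Pl", "leaf^s")}
step2 = {("leaf^s", "leaves")}
print(compose(step1, step2))       # {('leaf+N+Pl', 'leaves')}
```

Iteration is left out because it produces an infinite relation, which a set of pairs cannot enumerate; that is where finite-state networks come in.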

Page 13: Finite-State Methods in Natural Language Processing

Morphology is finite-state

A regular relation can be defined using the metalanguage of regular expressions.

[{talk} | {walk} | {work}]

[%+Base:0 | %+SgGen3:s | %+Progr:{ing} | %+Past:{ed}];

A regular expression can be compiled into a finite-state transducer that implements the relation computationally.
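To make the relation denoted by this expression concrete, here is a small Python sketch that simply enumerates its pairs. The names `apply_up` and `apply_down` are borrowed from the xfst commands for readability, but the brute-force lookup below is an assumption of this sketch; a compiled transducer does not work by enumeration.

```python
STEMS = ["talk", "walk", "work"]
# (lexical tag, surface suffix); "" plays the role of 0 (epsilon).
SUFFIXES = [("+Base", ""), ("+SgGen3", "s"), ("+Progr", "ing"), ("+Past", "ed")]

# Concatenating the two bracketed parts yields 3 * 4 = 12 pairs.
RELATION = {(stem + tag, stem + suffix)
            for stem in STEMS
            for tag, suffix in SUFFIXES}

def apply_up(surface):
    """Analysis: surface form to lexical forms."""
    return sorted(lex for lex, surf in RELATION if surf == surface)

def apply_down(lexical):
    """Generation: lexical form to surface forms."""
    return sorted(surf for lex, surf in RELATION if lex == lexical)

print(apply_up("walked"))         # ['walk+Past']
print(apply_down("talk+SgGen3"))  # ['talks']
```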

Page 14: Finite-State Methods in Natural Language Processing

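Compilation

Regular expression

[{talk} | {walk} | {work}]

[%+Base:0 | %+SgGen3:s | %+Progr:{ing} | %+Past:{ed}];

Finite-state transducer

[Figure: the transducer compiled from the regular expression. Arcs for the stem letters (t a l k, w a l k, w o r k) lead from the initial state to a shared state; from there the suffix arcs +Base:0, +3rdSg:s, +Past:e 0:d and +Progr:i 0:n 0:g lead to the final state.]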

Page 15: Finite-State Methods in Natural Language Processing

Generation

work+3rdSg --> works

[Figure: the same transducer, with stem arcs t:t, a:a, l:l, k:k, w:w, o:o, r:r and suffix arcs +Base:0, +3rdSg:s, +Past:e 0:d and +Progr:i 0:n 0:g.]

Page 16: Finite-State Methods in Natural Language Processing

Analysis

talked --> talk+Past

[Figure: the same transducer, with stem arcs t:t, a:a, l:l, k:k, w:w, o:o, r:r and suffix arcs +Base:0, +3rdSg:s, +Past:e 0:d and +Progr:i 0:n 0:g.]

Page 17: Finite-State Methods in Natural Language Processing

XFST Demo 1

Start xfst:

% xfst
xfst[0]:

Compile a regular expression:

xfst[0]: regex [{talk} | {walk} | {work}]
[%+Base:0 | %+SgGen3:s | %+Progr:{ing} | %+Past:{ed}];

Apply the result:

xfst[1]: apply up walked
walk+Past
xfst[1]: apply down talk+SgGen3
talks

Page 18: Finite-State Methods in Natural Language Processing

Lexical transducer

[Figure: a finite-state transducer relating the inflected form "veut" to the citation form with inflection codes "vouloir +IndP +SG +P3".]

v o u l o i r +IndP +SG +P3
v e u t

• Bidirectional: generation or analysis
• Compact and fast
• Comprehensive systems have been built for over 40 languages: English, German, Dutch, French, Italian, Spanish, Portuguese, Finnish, Russian, Turkish, Japanese, Korean, Basque, Greek, Arabic, Hebrew, Bulgarian, …

Page 19: Finite-State Methods in Natural Language Processing

How lexical transducers are made

[Figure: a lexicon regular expression (morphotactics) and rule regular expressions (alternations) go through the compiler to give a lexicon FST and rule FSTs; composition merges them into a lexical transducer (a single FST).]

f a t +Adj +Comp
f a t t e r

Page 20: Finite-State Methods in Natural Language Processing

Sequential Model

[Figure: lexical form -> fst 1 -> intermediate form -> fst 2 -> ... -> fst n -> surface form.]

An ordered sequence of rewrite rules (Chomsky & Halle ‘68) can be modeled by a cascade of finite-state transducers (Johnson ‘72, Kaplan & Kay ‘81).

Page 21: Finite-State Methods in Natural Language Processing

Discovery and Rediscovery

C. Douglas Johnson (1972) showed that
– phonological rewrite rules are interpreted in a way that makes them less powerful than they appear
– rewrite rules can be modeled by finite transducers
– for any two finite transducers applied in a sequence there exists an equivalent single transducer (Schützenberger 1961).

Johnson’s result was ignored and forgotten, then rediscovered by Ronald M. Kaplan and Martin Kay at Xerox around 1980.

Page 22: Finite-State Methods in Natural Language Processing

Application constraint

Phonological rewrite rules are not as powerful as they appear because of the constraint that a rule does not apply to its own output (Johnson 1972, Kaplan & Kay 1980).

Page 23: Finite-State Methods in Natural Language Processing

Sequential application

k a N p a n
    N -> m / _ p
k a m p a n
    p -> m / m _
k a m m a n
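The two-step derivation can be imitated with ordinary string rewriting. The sketch below uses Python's `re` module, with a lookahead and a lookbehind standing in for the rule contexts; the function names are hypothetical, and this is string substitution rather than a transducer cascade.

```python
import re

def nasal_assimilation(s):
    """N -> m / _ p : rewrite N as m when p follows."""
    return re.sub(r"N(?=p)", "m", s)

def p_to_m(s):
    """p -> m / m _ : rewrite p as m when m precedes."""
    return re.sub(r"(?<=m)p", "m", s)

def cascade(s):
    """Apply the two rules in order, feeding the first output to the second."""
    return p_to_m(nasal_assimilation(s))

print(nasal_assimilation("kaNpan"))  # kampan
print(cascade("kaNpan"))             # kamman
```

`re.sub` checks contexts against its input string, so within one pass a rule never sees its own output: `p_to_m("mpp")` gives `mmp`, not `mmm`. That loosely mirrors the application constraint mentioned earlier.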

Page 24: Finite-State Methods in Natural Language Processing

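Sequential application in detail

[Figure: the two rule transducers. The transducer for N -> m / _ p has states 0, 1 and 2, with arcs labeled N:m, N, p, m and ?. The transducer for p -> m / m _ has states 0 and 1, with arcs labeled p:m, p, m and ?.]

k a N p a n
0 0 0 2 0 0 0    (states of the first transducer)
k a m p a n
0 0 0 1 0 0 0    (states of the second transducer)
k a m m a n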

Page 25: Finite-State Methods in Natural Language Processing

Composition

[Figure: the single transducer obtained by composing the two rule transducers, with states 0 to 3 and arcs labeled N:m, N, m, p, p:m and ?.]

k a N p a n
0 0 0 3 0 0 0
k a m m a n

Page 26: Finite-State Methods in Natural Language Processing

Parallel Model

A set of parallel two-level rules (constraints) compiled into finite-state automata interpreted as transducers (Koskenniemi ‘83).

[Figure: fst 1, fst 2, ..., fst n side by side, each relating the lexical form directly to the surface form.]

Page 27: Finite-State Methods in Natural Language Processing

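Sequential vs. parallel rules

[Figure: two routes to a single FST. Sequential rules (Chomsky & Halle 1968): rule 1 ... rule n map the lexical form through intermediate forms to the surface form and are combined by composition. Parallel rules (Koskenniemi 1983): rule 1, rule 2 ... rule n each relate the lexical form directly to the surface form and are combined by intersection.]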

Page 28: Finite-State Methods in Natural Language Processing

Rewrite rules

Yawelmani Vowel Harmony (Kisseberth 1969)

? u: t y ? A s
    Epenthesis
? u: t I y ? A s
    Harmony
? u: t u y ? a s
    Lowering
? o: t u y ? a s

Page 29: Finite-State Methods in Natural Language Processing

Two-level constraints

? u: t 0 y ? A s

? o: t u y ? a s

Underlying representation controls all three alternations.

• Epenthesis: Insert u or i (underspecification)
• Harmony: Rounding next to a round V of the same height.
• Lowering: Long u always realized as long o.

Page 30: Finite-State Methods in Natural Language Processing

Rewrite Rules vs. Constraints

• Two different ways of decomposing the complex relation between lexical and surface forms into a set of simpler relations that can be more easily understood and manipulated.

• One approach may be more convenient than the other for particular applications.

Page 31: Finite-State Methods in Natural Language Processing

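The Big Picture

[Figure: a triangle of three notions. A regular expression describes a language or relation; the regular expression compiles into a finite-state network; the network encodes the language or relation. Example: the regular expression a, the language {a}, and a network with a single arc labeled a.]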

Page 32: Finite-State Methods in Natural Language Processing

XFST Demo 2

xfst[0]: define Cat {cat} | {tiger} | {lion};
defined Cat: 640 bytes. 11 states, 12 arcs, 3 paths.
...
xfst[0]: set verbose off
xfst[0]: define Dog {dog} | {spaniel} | {poodle};
xfst[0]: regex Cat | Dog ;
xfst[1]: apply up
apply up> dog
dog
apply up> panther
apply up>
apply up> END;
xfst[1]: define Animal
xfst[0]:

Page 33: Finite-State Methods in Natural Language Processing

xfst[0]: regex Cat & Dog;
xfst[1]: print net
Sigma: a c d e g i l n o p r s t
Size: 13, Label Map: Default
Net:
Flags: deterministic, pruned, minimized, epsilon_free, ...
s0: (no arcs)
xfst[1]: pop
xfst[0]: regex Animal - Dog;
xfst[1]: push Cat
xfst[2]: test equivalent
1, (0=NO,1=YES)
xfst[2]: clear
xfst[0]:
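Because Cat, Dog and Animal are finite languages, the same checks can be replayed with plain Python sets. This is only an analogy for what the session shows: xfst works on networks, not on enumerated sets of words.

```python
Cat = {"cat", "tiger", "lion"}
Dog = {"dog", "spaniel", "poodle"}
Animal = Cat | Dog          # regex Cat | Dog ;

print(Cat & Dog)            # set(), the empty language (a network with no arcs)
print("dog" in Animal)      # True, like "apply up> dog"
print("panther" in Animal)  # False, like the empty answer for panther
print(Animal - Dog == Cat)  # True, like "test equivalent" answering 1
```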

Page 34: Finite-State Methods in Natural Language Processing

Compiling networks from words

[Figure: the network compiled from the six words below, with arcs labeled c, l, e, a, r, v, f, t, h.]

xfst[0]: read text
clear
clever
ear
ever
fat
father
^D
432 bytes. 10 states, 12 arcs, 6 paths.

read text < file

read regex {clear}|{clever}|{ear}|{ever}|{fat}|{father} ;
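The state and arc counts reported above can be reproduced with a short Python sketch: build a trie from the word list, then merge all nodes that accept the same set of suffixes. The function name `minimal_network` and the data layout are assumptions of this sketch; it is not a description of how xfst compiles word lists internally.

```python
def minimal_network(words):
    """Return (states, arcs) of the minimal acyclic network for a word list."""
    # Step 1: build a trie; each node has a final flag and outgoing arcs.
    trie = {"final": False, "next": {}}
    for word in words:
        node = trie
        for ch in word:
            node = node["next"].setdefault(ch, {"final": False, "next": {}})
        node["final"] = True

    # Step 2: bottom-up, give every node a signature made of its final flag
    # and its (label, target state) arcs. Equal signatures mean equal suffix
    # languages, so such nodes collapse into one state.
    registry = {}  # signature -> state id

    def canon(node):
        arcs = tuple(sorted((ch, canon(child))
                            for ch, child in node["next"].items()))
        signature = (node["final"], arcs)
        return registry.setdefault(signature, len(registry))

    canon(trie)
    n_states = len(registry)
    n_arcs = sum(len(arcs) for _, arcs in registry)
    return n_states, n_arcs

words = ["clear", "clever", "ear", "ever", "fat", "father"]
print(minimal_network(words))  # (10, 12)
```

The number of paths equals the number of distinct words, here 6.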

Page 35: Finite-State Methods in Natural Language Processing

Regular Expression Calculus

Symbols
• Simple symbols vs. symbol pairs
• Special symbols: ANY, EPSILON

Common regular expression operators
• concatenation, union, intersection, negation, composition

Xerox operators
• contains, restriction, replacement

Page 36: Finite-State Methods in Natural Language Processing

Symbols and Labels

Single and multicharacter symbols
a, b, c, … , +Adj, +SG, ^Fin

Special symbols
0    EPSILON
?    ANY

Symbols vs. symbol pairs
In general, no distinction is made between
a      the language {“a”}
a:a    the identity relation {<“a”, “a”>}

[Figure: a network with a single arc labeled a.]

Page 37: Finite-State Methods in Natural Language Processing

Common RE Operators

(juxtaposition)    concatenation
* +                iteration
|                  union
&                  intersection (*)
~ \ -              complementation (*), minus (*)
.x. :              crossproduct
.o.                composition

(*) = not applicable to regular relations because the result may not be encodable by a finite-state network.

Page 38: Finite-State Methods in Natural Language Processing

Iteration

A*    zero or more concatenations of A
A+    one or more concatenations of A
?*    the universal language/the universal identity relation

[a:A | b:B | c:C | d:D | … ]*

[Figure: one-state networks with looping arcs: one with a single arc labeled ?, one with arcs a:A, b:B, c:C, d:D.]

Page 39: Finite-State Methods in Natural Language Processing

Negation

\A    any single symbol that is not in A
\?    the null language
~A    any string that is not in A

[Figure: networks for a, \a and ~a, with Sigma: a, ?.]
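The difference between the two kinds of negation can be seen on a toy example. The Python sketch below assumes a two-symbol alphabet and bounds string length so that the complement can be listed; both restrictions, and the function names, are assumptions of the sketch. In xfst the symbol ? also covers unknown symbols and ~A is an infinite language.

```python
from itertools import product

SIGMA = ["a", "b"]  # assumed toy alphabet

def term_complement(symbols):
    """\\A: the single symbols of SIGMA that are not in A."""
    return {s for s in SIGMA if s not in symbols}

def complement(language, max_len):
    """~A, restricted to strings of at most max_len symbols."""
    universe = {"".join(p)
                for n in range(max_len + 1)
                for p in product(SIGMA, repeat=n)}
    return universe - language

print(sorted(term_complement({"a"})))   # ['b']
print(term_complement(set(SIGMA)))      # set(), cf. \? the null language
print(sorted(complement({"a"}, 2)))     # ['', 'aa', 'ab', 'b', 'ba', 'bb']
```

Note that ~a contains the empty string and longer strings such as aa, whereas \a contains only one-symbol strings.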

Page 40: Finite-State Methods in Natural Language Processing

Crossproduct

A .x. B    The relation that maps every string in A to every string in B, and vice versa
A:B        Same as [A .x. B].

a b c .x. x y
[a b c] : [x y]
{abc}:{xy}

[Figure: network with arcs a:x, b:y, c:0.]
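A minimal Python sketch of the crossproduct, assuming languages are given as finite sets of strings. The `align` helper is hypothetical; it shows one way of pairing symbols, padding the shorter side with 0 (epsilon), which reproduces the arc labels a:x b:y c:0 from the slide.

```python
from itertools import product, zip_longest

def crossproduct(A, B):
    """Pair every string in A with every string in B."""
    return set(product(A, B))

def align(upper, lower, epsilon="0"):
    """Pair symbols position by position, padding with epsilon."""
    return [f"{u}:{l}"
            for u, l in zip_longest(upper, lower, fillvalue=epsilon)]

print(crossproduct({"abc"}, {"xy"}))  # {('abc', 'xy')}
print(align("abc", "xy"))             # ['a:x', 'b:y', 'c:0']
```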

Page 41: Finite-State Methods in Natural Language Processing

Composition

A .o. B    The relation C such that if A maps x to y and B maps y to z, C maps x to z.

{abc} .o. [a:A | b:B | c:C | d:D]*

[Figure: the network a b c composed with the one-state network looping on a:A, b:B, c:C, d:D gives the network a:A b:B c:C.]
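The example on this slide can be mimicked in Python by representing each relation as a function from an input string to the set of its outputs, which also handles the infinite starred relation. All names below are hypothetical and the sketch only covers application in the downward direction.

```python
MAPPING = {"a": "A", "b": "B", "c": "C", "d": "D"}

def identity_on(language):
    """A language viewed as an identity relation."""
    return lambda s: {s} if s in language else set()

def star_map(s):
    """[a:A | b:B | c:C | d:D]*: rewrite symbol by symbol, or fail."""
    if all(ch in MAPPING for ch in s):
        return {"".join(MAPPING[ch] for ch in s)}
    return set()

def compose(f, g):
    """A .o. B: feed every output of f into g."""
    return lambda s: {z for y in f(s) for z in g(y)}

abc_upper = compose(identity_on({"abc"}), star_map)
print(abc_upper("abc"))  # {'ABC'}
print(abc_upper("abd"))  # set(), because abd is not in {abc}
```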

Page 42: Finite-State Methods in Natural Language Processing

Xerox RE Operators

$         containment
=>        restriction
-> @->    replacement

Make it easier to describe complex languages and relations without extending the formal power of finite-state systems.

Page 43: Finite-State Methods in Natural Language Processing

Containment

$a

[?* a ?*]

[Figure: network for $a, with arcs labeled a and ?.]
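For single-character symbols, containment corresponds to a plain substring test. The Python sketch below checks that test against a regular expression shaped like [?* a ?*], with `.` standing in for ?. The function names are hypothetical, and xfst symbols can be multicharacter, which this sketch ignores.

```python
import re

def contains_a(s):
    """$a: strings with at least one a somewhere."""
    return "a" in s

def matches_expansion(s):
    """[?* a ?*], written as a Python regular expression."""
    return re.fullmatch(r".*a.*", s, flags=re.DOTALL) is not None

for s in ["", "a", "banana", "xyz"]:
    print(repr(s), contains_a(s), matches_expansion(s))
```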

Page 44: Finite-State Methods in Natural Language Processing

Restriction

a => b _ c

“Any a must be preceded by b and followed by c.”

Equivalent expression:

~[~[?* b] a ?*] & ~[?* a ~[c ?*]]

[Figure: network for the restriction, with arcs labeled a, b, c and ?.]
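The equivalence can be checked by brute force on short strings. In the Python sketch below, `restricted` reads the restriction directly and `equivalent` follows the shape of the complement-based expression; the four-symbol alphabet and the length bound are assumptions of the sketch, so this is a sanity check rather than a proof.

```python
from itertools import product

def restricted(s):
    """a => b _ c: every a has b right before it and c right after it."""
    return all(
        i > 0 and s[i - 1] == "b" and i + 1 < len(s) and s[i + 1] == "c"
        for i, ch in enumerate(s) if ch == "a"
    )

def equivalent(s):
    """~[~[?* b] a ?*] & ~[?* a ~[c ?*]]"""
    # Every way of splitting s around one occurrence of a.
    splits = [(s[:i], s[i + 1:]) for i, ch in enumerate(s) if ch == "a"]
    in_first = any(not left.endswith("b") for left, _ in splits)
    in_second = any(not right.startswith("c") for _, right in splits)
    return not in_first and not in_second

strings = ["".join(p) for n in range(5) for p in product("abcx", repeat=n)]
print(all(restricted(s) == equivalent(s) for s in strings))      # True
print(restricted("xbacx"), restricted("bac"), restricted("ac"))  # True True False
```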