Finite-State Methods in Natural Language Processing. Lauri Karttunen, LSA 2005 Summer Institute, July 27, 2005


Page 1: Finite-State Methods in Natural Language Processing Lauri Karttunen LSA 2005 Summer Institute July 27, 2005.

Finite-State Methods in Natural Language Processing

Lauri Karttunen

LSA 2005 Summer Institute

July 27, 2005

Page 2

Course Outline

July 18: Intro to computational morphology; XFST

Readings
Lauri Karttunen, “Finite-State Constraints”, The Last Phonological Rule. J. Goldsmith (ed.), pages 173-194, University of Chicago Press, 1993.
Karttunen and Beesley, “25 Years of Finite-State Morphology”
Chapter 1: “Gentle Introduction” (B&K)

July 20: Regular expressions; More on XFST

Readings
Chapter 2: “Systematic Introduction”
Chapter 3: “The XFST interface”

Page 3

July 25: More on XFST: Date Parser; Concatenative morphotactics: The LEXC language

Readings
Chapter 4: “The LEXC Language”

July 27: Constraining non-local dependencies: Flag Diacritics; Complex morphotactics and alternations: Finnish Numerals

Readings
Chapter 5: “Flag Diacritics”

Page 4

August 1: Non-concatenative morphotactics; Reduplication, interdigitation; Realizational morphology

Readings
Chapter 8: “Non-Concatenative Morphotactics”
Gregory T. Stump. Inflectional Morphology. A Theory of Paradigm Structure. Cambridge U. Press. 2001. (An excerpt)
Lauri Karttunen, “Computing with Realizational Morphology”, Lecture Notes in Computer Science, Volume 2588, Alexander Gelbukh (ed.), 205-216, Springer Verlag. 2003.

August 3: Optimality theory

Readings
Paul Kiparsky, “Finnish Noun Inflection”, Generative Approaches to Finnic and Saami Linguistics, Diane Nelson and Satu Manninen (eds.), pp. 109-161, CSLI Publications, 2003.
Nine Elenbaas and René Kager. "Ternary rhythm and the lapse constraint". Phonology 16. 273-329.

Page 5

Syllabification revisited

define MarkNonDiphthongs [ [. .] -> "." ||
    [HighV | MidV] _ LowV,               # i.a, e.a
    LowV _ MidV,                         # a.e
    i _ [MidV - e],                      # i.o, i.ä
    u _ [MidV - o],                      # u.e
    y _ [MidV - ö],                      # y.e
    $V i _ e,                            # poiki.en
    V u _ o,
    $V y _ ö,
    $V [MidV | LowV] _ [u|y] [C C|.#.]]; # oike.us

define Syllabify [ C* V+ C* @-> ... "." || _ C V ];

regex FinnWords .o. MarkNonDiphthongs .o. Syllabify;
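The Syllabify rule can be mimicked outside xfst. What follows is a minimal Python sketch, not the xfst implementation: the vowel and consonant sets are assumptions (the slide does not show the definitions of V and C), and the function name `syllabify` is made up for illustration. For this particular pattern, Python's greedy matching finds the same span as the leftmost-longest operator `@->`, because from any starting point at most one span satisfies the right context.

```python
import re

# Assumed symbol classes; the slide does not define V and C.
V = "[aeiouyäö]"
C = "[bcdfghjklmnpqrstvwxz]"

# C* V+ C* followed by C V (the right context is a lookahead, so it is
# not consumed and can start the next syllable).
SYLLABLE = re.compile(f"{C}*{V}+{C}*(?={C}{V})")


def syllabify(word: str) -> str:
    """Insert "." after every maximal C* V+ C* span that precedes C V."""
    return SYLLABLE.sub(lambda m: m.group(0) + ".", word)


print(syllabify("kaksi"))
print(syllabify("omistaja"))
# A boundary already inserted by a MarkNonDiphthongs-style step is kept.
print(syllabify("oike.us"))
```

The dot placed by the earlier marking step is neither a vowel nor a consonant here, so it blocks the vowel run from being read as a diphthong, which is the reason the two rules are composed in that order.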

Page 6

Constraints

[Figure: network diagram of the Esperanto lexicon. Legible arc labels include ge, hund, bon, ne, mal, eg, et, in, o, a, ec, j, n, and the tags MF+, +Fem, +Pl.]

MF%+ => _ ~$[%+Fem] %+Pl ;

Page 7

Constraining by composition

xfst[0]: read lexc < adj-noun-tags.lexc
Root...2, Nouns...2, NounRoots...4, Nmf...5, ....
Building lexicon...Minimizing...Done!
2.7 Kb. 45 states, 70 arcs, Circular.

xfst[1]: up gehundino
MF+hund+Noun+Fem+Sg

xfst[1]: regex "MF+" => _ ~$["+Fem"] "+Pl" ;
1.2 Kb, 2 states, 7 arcs, Circular

xfst[2]: compose
3.2 Kb, 61 states, 89 arcs, Circular
xfst[1]: up gehundino
xfst[1]: *** Not accepted ***

Fewer words, bigger network.
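The restriction `"MF+" => _ ~$["+Fem"] "+Pl"` says that MF+ may occur only when a +Pl follows with no +Fem in between. A minimal Python sketch of that check on a single analysis string is given below. It is an illustration only: the function name is hypothetical, and the analysis is assumed to be pre-split into symbols, with the root treated as one symbol for brevity.

```python
def satisfies_restriction(symbols: list[str]) -> bool:
    """Check "MF+" => _ ~$["+Fem"] "+Pl" on one symbol sequence."""
    for i, symbol in enumerate(symbols):
        if symbol != "MF+":
            continue
        licensed = False
        for later in symbols[i + 1:]:
            if later == "+Fem":   # a +Fem before any +Pl breaks the context
                break
            if later == "+Pl":    # +Pl reached with no +Fem in between
                licensed = True
                break
        if not licensed:
            return False
    return True


print(satisfies_restriction(["MF+", "hund", "+Noun", "+Fem", "+Sg"]))
print(satisfies_restriction(["MF+", "hund", "+Noun", "+Pl"]))
```

In xfst the same filtering is done once, at compile time, by composing the constraint with the lexicon, which is why the resulting network grows.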

Page 8

Esperanto with Flags

Multichar_Symbols
+Noun +Adj +Nsuff +ASuff +Nize
+Pl +Sg +Acc MF+
+Aug +Dim +Fem Op+ Neg+
@U.MF.Yes@ @U.MF.No@

LEXICON Root
Nouns ;
Adjectives ;

LEXICON Nouns
NounRoots ;
@U.MF.Yes@ Ge ;

LEXICON Ge
MF+:ge NounRoots ;

LEXICON NounRoots
bird Nmf ;
hund Nmf ;
kat Nmf ;

LEXICON Nmf
+Noun:0 AugDimFem ;

LEXICON AugDimFem
@U.MF.No@ Fem ;
+Dim:et AugDimFem ;
+Aug:eg AugDimFem ;
Nend ;
Adjend ;

LEXICON Fem
+Fem:in AugDimFem ;

Page 9

Constraining by flags

xfst[0]: read lexc < esperanto-flags.lexc

xfst[1]: up gehundino
xfst[1]:
xfst[1]: down MF+hund+Noun+Fem+NSuff+Sg
xfst[1]:

xfst[1]: set obey-flags off
variable obey-flags = off

xfst[1]: up gehundino
xfst[1]: MF+hund+Noun+Fem+NSuff+Sg

xfst[1]: set show-flags on
variable show-flags = on

xfst[1]: down [rest of the line lost in extraction: the input and an output that displays the flag diacritics]

Page 10

Flags in the sigma

xfst[1]: print sigma
MF+ Neg+ Op+ a b c d e f g h i j k l m n o r t u v +ASuff +Acc +Adj +Aug +Dim +Fem +Nsuff +Nize +Noun +Pl +Sg @U.MF.No@ @U.MF.Yes@
Size: 35

@U.MF.Yes@: UNIFY feature 'MF' with value 'Yes'

@U.MF.No@: UNIFY feature 'MF' with value 'No'

2 flag diacritics

Page 11

Eliminating flags

xfst[1]: eliminate flag MF
3.2 Kb. 61 states 89 arcs, Circular
Size: 35

xfst[1]: print sigma
MF+ Neg+ Op+ a b c d e f g h i j k l m n o r t u v +ASuff +Acc +Adj +Aug +Dim +Fem +NSuff +Nize +Noun +Pl +Sg
Size: 33

The eliminate flag command composes the network with constraint networks that have the same effect as the flag diacritics that are removed.

Page 12

Flag Diacritics

Special symbols for encoding features, that is, attribute-value pairs.

Checked at runtime to avoid the cost of compiling them into the structure of the network

If a check fails, the path is abandoned.

Page 13

Attributes and Values

Epsilon arcs with feature constraints.

@U.Feature.Value@    Unify ‘Feature’ with ‘Value’ if possible.

@C.Feature@          Set ‘Feature’ to the unspecified value.

Page 14

Rules

There can be any number of attributes.

An attribute can have any number of values.

If the value of an attribute is unspecified, it unifies successfully with any given value and is set to that value.

If the value of an attribute is specified, it unifies only with the given value.

Page 15

Actions: Unify, Positive Set

@U.Feature.Value@    Unify Value with the current setting of Feature, if possible. Otherwise fail.

@P.Feature.Value@    Set Feature to Value regardless of the current setting. Always succeeds.

Page 16

More Actions: Negative Set, Clear

@N.Feature.Value@    Set Feature to the complement of Value regardless of the current setting. Always succeeds.

@C.Feature@          Make Feature be unspecified. Always succeeds.

Page 17

More Actions: Require

@R.Feature.Value@    Succeed if Feature is set to Value. Otherwise fail.

@R.Feature@          Succeed if Feature has been set to some value. Otherwise fail.

Page 18

More Actions: Equality

@E.Feature1.Feature2@    Succeed if Feature1 has the same value as Feature2. Otherwise fail.
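The actions listed on the preceding slides (Unify, Positive Set, Negative Set, Clear, Require, Equality) can be read as a small interpreter that runs along a path. Below is a minimal Python sketch of that reading, not xfst's implementation. The names `apply_flag` and `accepts` are made up for illustration. Two behaviours are assumptions that the slides do not settle: how Unify treats a feature that was set negatively, and whether Equality succeeds when both features are unspecified.

```python
import re

# @ACTION.Feature@ or @ACTION.Feature.Value@
FLAG = re.compile(r"@([UPNCRE])\.([^.@]+)(?:\.([^.@]+))?@")


def apply_flag(flag: str, state: dict) -> bool:
    """Apply one flag to `state`; return False when the check fails.

    `state` maps a feature to (value, positive). A missing key means the
    feature is unspecified.
    """
    match = FLAG.fullmatch(flag)
    if match is None:
        raise ValueError(f"not a flag diacritic: {flag}")
    action, feature, arg = match.groups()
    if action in "UPNE" and arg is None:
        raise ValueError(f"{flag} needs a second argument")
    current = state.get(feature)
    if action == "U":
        if current is None:                 # unspecified: unify and set
            state[feature] = (arg, True)
            return True
        value, positive = current
        if positive:                        # specified: must be the same value
            return value == arg
        if value == arg:                    # assumption: clashes with "not Value"
            return False
        state[feature] = (arg, True)
        return True
    if action == "P":
        state[feature] = (arg, True)
        return True
    if action == "N":
        state[feature] = (arg, False)
        return True
    if action == "C":
        state.pop(feature, None)
        return True
    if action == "R":
        if arg is None:
            return current is not None
        return current == (arg, True)
    # action == "E"; assumption: two unspecified features count as equal
    return current == state.get(arg)


def accepts(path: list[str]) -> bool:
    """Walk a path; ordinary symbols are skipped, a failed flag abandons it."""
    state: dict = {}
    for symbol in path:
        if FLAG.fullmatch(symbol) and not apply_flag(symbol, state):
            return False
    return True


# gehundino: the prefix sets MF to Yes, the feminine suffix requires No.
print(accepts(["@U.MF.Yes@", "MF+", "hund", "+Noun", "@U.MF.No@", "+Fem", "+NSuff", "+Sg"]))
# hundino: MF is unspecified when @U.MF.No@ is reached, so it unifies.
print(accepts(["hund", "+Noun", "@U.MF.No@", "+Fem", "+NSuff", "+Sg"]))
```

Keeping the state in a dictionary mirrors the point that flags are checked at runtime: the network itself stays small, and the cost is paid while a path is being followed.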

Page 19

Eliminating flags

The constraints on "@U.FEATURE.VALUE@" have the form

~[?* PROHIBIT_FLAGS ~$[ALLOW_FLAGS] SELF ?*]

Constraint for eliminating @U.MF.No@:

~[?* ["@U.MF.Yes@"]               # prohibit
     ~$["@P.MF.No@" | "@C.MF@"]   # allow
     "@U.MF.No@"
  ?*]
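The constraint above is a static filter: it rejects any path on which @U.MF.Yes@ is followed by @U.MF.No@ without an intervening @P.MF.No@ or @C.MF@. A minimal Python sketch of the same test on one symbol sequence follows. The function name and its keyword parameters are hypothetical, chosen to echo PROHIBIT_FLAGS, ALLOW_FLAGS and SELF in the general form.

```python
def violates(symbols, prohibit=("@U.MF.Yes@",),
             allow=("@P.MF.No@", "@C.MF@"), self_flag="@U.MF.No@"):
    """True if the path matches ?* PROHIBIT ~$[ALLOW] SELF ?*."""
    pending = False  # a prohibiting flag was seen and not yet neutralized
    for symbol in symbols:
        if symbol in prohibit:
            pending = True
        elif symbol in allow:
            pending = False
        elif symbol == self_flag and pending:
            return True
    return False


print(violates(["@U.MF.Yes@", "MF+", "hund", "+Noun", "@U.MF.No@", "+Fem"]))
print(violates(["@U.MF.Yes@", "MF+", "@C.MF@", "hund", "@U.MF.No@", "+Fem"]))
```

Composing the network with the complement of this pattern removes exactly the paths the runtime check would have abandoned, after which the flag symbols themselves are no longer needed.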

Page 20

Finnish Numerals

Page 21

Numbers and Numerals

The mapping from integers 0, 1, 2, 3 … to the corresponding numerals one, two, three… is a regular relation.

Some languages have a very simple numeral system, some are more complicated: seventy-three, soixante-treize, drei-und-siebzig

We can compile transducers that map between the numbers and the corresponding numerals.

Page 22

Number-to-Numeral transducer

Generation
105
hundred five, hundred and five, one hundred and five

Analysis
hundred five
105

Page 23

The Goal Ahead: Finnish

Analysis

sadanviiden

105+Sg+Gen

hundred and five (Sg Gen)

Generation

28+Ord+Pl+Gen

kahdensienkymmenensienkahdeksansien

twenty-eighth (Pl Gen)

Page 24

Finnish Numerals

Compound numerals are written as one word
2 • 1000 + 5 • 100 + 3 • 10 + 1 = 2531
kaksituhattaviisisataakolmekymmentäyksi

Express ordinality, number, and case
sata+Sg+Nom (100)    sata+Ord+Sg+Nom (100th)
sata                 sadas

sata+Sg+Gen (100)    sata+Ord+Sg+Gen (100th)
sadan                sadannen

sata+Pl+Gen (100)    sata+Ord+Pl+Gen (100th)
satojen              sadansien
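The decomposition 2 • 1000 + 5 • 100 + 3 • 10 + 1 can be turned into the singular nominative cardinal directly. The Python sketch below does only that simplest case and is not the transducer built in the course: it covers 1 to 9999, handles no other case, number, or ordinal forms, and the function name is hypothetical. The teens (yksitoista to yhdeksäntoista) are not on the slide; they are added from standard Finnish.

```python
UNITS = ["", "yksi", "kaksi", "kolme", "neljä", "viisi",
         "kuusi", "seitsemän", "kahdeksan", "yhdeksän"]

# (power of ten, form when the multiplier is 1, partitive form after 2..9)
POWERS = [(1000, "tuhat", "tuhatta"),
          (100, "sata", "sataa"),
          (10, "kymmenen", "kymmentä")]


def finnish_cardinal(n: int) -> str:
    """Singular nominative cardinal for 1..9999, written as one word."""
    if not 1 <= n <= 9999:
        raise ValueError("only 1..9999 are covered by this sketch")
    parts = []
    for power, nominative, partitive in POWERS:
        digit, n = divmod(n, power)
        if power == 10 and digit == 1 and n > 0:
            parts.append(UNITS[n] + "toista")   # 11..19
            n = 0
        elif digit == 1:
            parts.append(nominative)            # sata, not yksisata
        elif digit > 1:
            parts.append(UNITS[digit] + partitive)
    parts.append(UNITS[n])
    return "".join(parts)


print(finnish_cardinal(2531))
print(finnish_cardinal(105))
```

The multiplier stays in the nominative while the power of ten appears in the partitive (kaksi + tuhatta), which is the pattern the later slides treat as the exceptional singular nominative.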

Page 25

Singular vs. Plural

Numerals generally occur with singular nouns
kaksi+Sg+Gen kenkä+Sg+Gen
kahden kengän omistaja
(owner of two shoes)

Sets and public events may be in plural
kaksi+Pl+Gen kenkä+Pl+Gen
kaksien kenkien omistaja
(owner of two pairs of shoes)

kolme+Ord+Pl+Nom olympialainen+Pl+Nom
kolmannet olympialaiset
(third olympic games)

yksi+Pl+Nom hää+Pl+Nom
yhdet häät
(one wedding)

Page 26

Morphotactics

All parts of compound numerals agree in all respects

two thousand five hundred (2500)
kaksi+Sg+Gen tuhat+Sg+Gen viisi+Sg+Gen sata+Sg+Gen
kahden tuhannen viiden sadan

two ten eighth (28th)
kaksi+Ord+Pl+Gen kymmenen+Ord+Pl+Gen kahdeksan+Ord+Pl+Gen
kahde ns i en kymmene ns i en kahdeksa ns i en

Page 27

Singular nominative is exceptional

Numeral with a noun
kaksi+Gen kenkä+Gen
kahden kengän (two shoes)

kaksi+Nom kenkä+Part
kaksi kenkää (two shoes)

Compound numeral
kaksi+Gen tuhat+Gen viisi+Gen sata+Gen kolme+Gen (2503)
kahden tuhannen viiden sadan kolmen

kaksi+Nom tuhat+Part viisi+Nom sata+Part kolme+Nom (2503)
(kaksi • tuhatta) + (viisi • sataa) + kolme

Page 28

Morphological Alternations

Semiregular stem alternations
yksi+Sg+Nom : yksi (one)
yksi+Sg+Ess : yhtenä
yksi+Sg+Gen : yhden
yksi+Sg+Part : yhtä
yksi+Pl+Gen : yksien

Irregular stem alternations
yksi+Ord+Sg+Nom : ensimmäinen (first)

Regular suffix alternations

Vowel harmony
kolme+Sg+Part : kolmea vs. neljä+Sg+Part : neljää

Illative vowel
kolme+Sg+Ill : kolmeen vs. neljä+Sg+Ill : neljään

Partitive t
yksi+Sg+Part : yhtä vs. neljä+Sg+Part : neljää

Page 29

Solution for Finnish

Numbers/Finnish Transducer
Maps a number with morphological tags into an inflected Finnish numeral. Encodes morphotactic constraints.

.o.

lexc source lexicon
Looping lexicon with all the forms of all Finnish single numerals concatenated in all possible ways. Composed with morphophonological rules.

Page 30

Example

Numbers/Finnish Transducer
2 5 +Ord +Pl +Gen
kaksi +Ord +Pl +Gen kymmenen +Ord +Pl +Gen viisi +Ord +Pl +Gen

.o.

lexc source lexicon
kaksi +Pl +Nom kymmenen +Part VIISI +Ord +Gen
kahdet kymmentä viidennen (ungrammatical)

kaksi +Ord +Pl +Gen kymmenen +Ord +Pl +Gen viisi +Ord +Pl +Gen
kahdensien kymmenensien viidensien

Page 31

Sublexicon for One

LEXICON Yksi
YKSI+Sg:yksi Nom;                     ! singular nominative
YKSI+Sg:yhde WeakGrade;               ! weak stem (most cases)
YKSI+Sg:yhte StrongGrade;             ! strong stem (essive, ill.)
YKSI+Sg:yht Par;                      ! partitive stem
YKSI:yks PlStem1;                     ! plural stem
YKSI+Ord1+Sg:ensimmäinen Nom;         ! singular nominative
YKSI+Ord1+Sg:ensimmäise AnyGrade;     ! weak/strong stem
YKSI+Ord1+Sg:ensimmäis Par;           ! partitive stem
YKSI+Ord+Sg:yhdes Nom;                ! singular nominative
YKSI+Ord+Sg:yhdenne WeakGrade;        ! weak stem
YKSI+Ord+Sg:yhdente StrongGrade;      ! strong stem
YKSI+Ord+Sg:yhdet Par;                ! partitive stem
YKSI+Ord:yhdens PlStem1;              ! plural stem

Page 32

Some sublexicons

LEXICON WeakGrade

SgGen; ! Singular Genitive

PlNom; ! Plural Nominative

InvarWeak; ! Invariant (plural and singular) cases

LEXICON InvarWeak

+Tra:ksi Next; ! Translative “into”

+Ine:ssA Next; ! Inessive “in”

+Ela:ltA Next; ! Elative “from” (inside)

+Ade:llA Next; ! Adessive “on”

+Abl:ltA Next; ! Ablative “from” (outside)

+All:lle Next; ! Allative “onto”

+Abe:ttA Next; ! Abessive “without”

Page 33

Sample paths for Two

kaksi+Sg+Nom    kaksi+Sg+Gen    kaksi+Sg+Ess
kaksi           kahde n         kahte na

kaksi+Sg+Par    kaksi+Pl+Gen    kaksi+Pl+Ill
kah TA          kaks i en       kaks i Vn

kaksi+Ord+Sg+Nom    kaksi+Ord1+Sg+Nom
kahde s             toinen

kaksi+Ord+Sg+Ill    kaksi+Ord1+Sg+Ill
kahde nte Vn        toise Vn

Page 34

Morphophonological rules

define BackV [a | o | u];
define FrontV [ä | ö | y];
define Vow [BackV | FrontV | i | e];

define VHarmony [A -> a || BackV ~$[FrontV] _
                 .o.
                 A -> ä];

define IllativeV [V -> a || a (h) _ ,
                  V -> e || e (h) _ , … ]

define PartitiveT [T -> 0 || \Vow Vow _ ];
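To see what these rules do to the lexical strings from the sample paths, here is a minimal Python sketch that applies string rewrites with the same contexts. It is an illustration, not a translation of the xfst rules: the order in which the three rules are applied is an assumption, only the two IllativeV cases shown on the slide (a and e) are implemented, and rewriting any surviving T as t is an assumption, since no such rule appears on the slide. All function names are hypothetical.

```python
import re

BACK, FRONT = "aou", "äöy"
VOW = BACK + FRONT + "ie"


def partitive_t(s: str) -> str:
    """T -> 0 || \\Vow Vow _  (T deleted after a non-vowel plus a vowel)."""
    return re.sub(f"(?<=[^{VOW}][{VOW}])T", "", s)


def illative_v(s: str) -> str:
    """V copies a preceding a or e, optionally across h (slide cases only)."""
    return re.sub(r"([ae])(h?)V", r"\1\2\1", s)


def vowel_harmony(s: str) -> str:
    """A -> a if the nearest preceding harmonic vowel is back, else ä."""
    out = []
    for i, ch in enumerate(s):
        if ch != "A":
            out.append(ch)
            continue
        harmonic = [c for c in s[:i] if c in BACK + FRONT]
        out.append("a" if harmonic and harmonic[-1] in BACK else "ä")
    return "".join(out)


def realize(lexical: str) -> str:
    s = vowel_harmony(illative_v(partitive_t(lexical)))
    return s.replace("T", "t")  # assumption: a surviving T surfaces as t


for form in ["kolmeTA", "neljäTA", "kahTA", "kolmeVn"]:
    print(form, "->", realize(form))
```

Checking only the nearest harmonic vowel is equivalent to the slide's context `BackV ~$[FrontV] _`: if the nearest one is back, it serves as the BackV; if it is front, every earlier back vowel has a front vowel between it and A.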

Page 35

Example again

Numbers/Finnish Transducer
2 5 +Ord +Pl +Gen
KAKSI +Ord +Pl +Gen KYMMENEN +Ord +Pl +Gen VIISI +Ord +Pl +Gen

.o.

lexc source lexicon

.o.

morpho-phonological rules

KAKSI +Pl +Nom KYMMENEN +Part VIISI +Ord +Gen (ungrammatical)
kahdet kymmentä viidennen

KAKSI +Ord +Pl +Gen KYMMENEN +Ord +Pl +Gen VIISI +Ord +Pl +Gen
kahdensien kymmenensien viidensien

Page 36

Remaining problems

Special ordinals for yksi (one), kaksi (two)
ensimmäinen (1st) vs. kahdeskymmenesyhdes (21st)
Compose the lexicon with an appropriate filter to eliminate unwanted variants.

No internal tags
2+Sg+Gen00+Sg+Gen
Delete them: 0 <- Tag || _ $[\Tag Tag+] .#. ;

Singular nominative as partitive in compounds
%+Nom -> %+Par // %+Sg %+Nom ~$Tag %+Sg _ ;

Ordinal/Plural/Case agreement
Flag diacritics!

Page 37

Flags for Finnish numerals

@U.Type.Card@ @U.Type.Ord@

@U.Number.Sg@ @U.Number.Pl@

@U.Case.Nom@ @U.Case.Gen@ @U.Case.Par@ @U.Case.Tra@

@U.Case.Ess@ @U.Case.Abe@ @U.Case.Ine@ @U.Case.Ela@

@U.Case.Ill@ @U.Case.Ade@ @U.Case.Abl@ @U.Case.All@

@U.Case.Com@ @U.Case.Ins@

3 00 +Sg +Gen
@U.Type.Card@ @U.Num.Sg@ @U.Case.Gen@    @U.Type.Card@ @U.Num.Sg@ @U.Case.Gen@
kolmen                                   sadan

300+Sg+Gen
kolmensadan
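Because every part of the numeral carries Unify flags for Type, Number and Case, the first part fixes the values and each later part must repeat them. A minimal Python sketch of that agreement check follows. It is an illustration only: each part is represented as a dictionary of the features its flags would unify, and the function name and this representation are hypothetical.

```python
def parts_agree(parts: list[dict[str, str]]) -> bool:
    """Unify the features of all parts; fail on the first clash."""
    state: dict[str, str] = {}
    for part in parts:
        for feature, value in part.items():
            # setdefault stores the first value seen and returns it afterwards
            if state.setdefault(feature, value) != value:
                return False
    return True


kolmen = {"Type": "Card", "Number": "Sg", "Case": "Gen"}
sadan = {"Type": "Card", "Number": "Sg", "Case": "Gen"}
viidennen = {"Type": "Ord", "Number": "Sg", "Case": "Gen"}

print(parts_agree([kolmen, sadan]))
print(parts_agree([kolmen, viidennen]))
```

With flags, this bookkeeping happens while the path is followed, so the lexicon can loop freely over all numeral forms without multiplying states for every Type, Number and Case combination.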

Page 38

Conclusion

Mapping from numbers to numerals can be done in a simple and elegant way even for languages with complex morphology.

Necessary for text-to-speech applications.

Tervetuloa kahdensienkymmenensienkahdeksansien olympialaisten avajaisiin!

Welcome to the opening ceremonies of the 28th Olympic Games!

Page 39

Demo!