
TWO-LEVEL MORPHOLOGY and FINITE STATE METHODS: A CONSUMER’S VIEW

Kemal Oflazer
Sabancı University
İstanbul, Turkey


20 Years of Finite State Systems

OVERVIEW

Engineering a Morphological Analyzer for Turkish: Experiences and Reflections

Lenient Morphology

Bootstrapping Morphological Lexicons


TURKISH

Turkish is an agglutinative language (like Finnish, Hungarian, Basque, Korean ...)

Very productive inflectional and derivational suffixation

Small root word lexicon (~60 K roots), but essentially an infinite number of word forms


TURKISH

Rich morphophonological processes (Vowel harmony, etc.)

evinizdekilerden (from the ones at your house)

ev+iniz+de+ki+ler+den
ev+HnHz+DA+ki+lAr+DAn
A = {a, e}, H = {ı, i, u, ü}, D = {d, t}

cf. odanızdakilerden (from the ones in your room)

oda+[ı]nız+da+ki+ler+den
oda+HnHz+DA+ki+lAr+DAn
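The two examples above can be reproduced with a small Python sketch that resolves the archiphonemes A, H and D from left to right. This is only an illustration under simplifying assumptions, not the two-level rule set of the analyzer: the function name `realize`, the character classes, and the rule that a suffix-initial H drops after a vowel are assumptions chosen to cover just these two words.

```python
BACK = set("aıou")            # back vowels
ROUNDED = set("ouöü")         # rounded vowels
VOWELS = set("aeıioöuü")
VOICELESS = set("çfhkpsşt")   # voiceless consonants

def realize(lexical: str) -> str:
    """Map a lexical form such as 'ev+HnHz+DA' to a surface form."""
    surface = []
    last_vowel = None
    for morpheme in lexical.split("+"):
        for i, ch in enumerate(morpheme):
            prev = surface[-1] if surface else ""
            if ch == "A":
                # A agrees in backness with the preceding vowel.
                ch = "a" if last_vowel in BACK else "e"
            elif ch == "H":
                # Assumption: a suffix-initial H drops after a vowel.
                if i == 0 and prev in VOWELS:
                    continue
                if last_vowel in BACK:
                    ch = "u" if last_vowel in ROUNDED else "ı"
                else:
                    ch = "ü" if last_vowel in ROUNDED else "i"
            elif ch == "D":
                # D is voiceless after a voiceless consonant.
                ch = "t" if prev in VOICELESS else "d"
            if ch in VOWELS:
                last_vowel = ch
            surface.append(ch)
    return "".join(surface)

print(realize("ev+HnHz+DA+ki+lAr+DAn"))   # evinizdekilerden
print(realize("oda+HnHz+DA+ki+lAr+DAn"))  # odanızdakilerden
```

Note that ki carries a fixed vowel, so the suffixes after it harmonize with i in both words.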


TURKISH

evinizdekilerden (from the ones at your house)

ev+iniz+de+ki+ler+den
ev+HnHz+DA+ki+lAr+DAn

ev+Noun+A3sg+P2pl+Loc^DB+Adj^DB+Noun+A3pl+Pnon+Abl


ENGINEERING A MORPHOLOGICAL ANALYZER

First implementation in 1992 – 1993 using PC-KIMMO

About two months to get the representation and 30+ two-level rules right
• kgen rule compiler + some hand compilation

Crude morphotactics
• Manual replications of lexicons to deal with exceptions (maintenance nightmare)
• Manual partitioning of root lexicons to deal with allomorph selections (more of the same)


ENGINEERING A MORPHOLOGICAL ANALYZER

First implementation in 1992 – 1993 using PC-KIMMO

No easy way to deal with numeric forms

Slow (~5 words / second on (old) workstations)


ENGINEERING A MORPHOLOGICAL ANALYZER

Reimplementation using twolc and lexc in late 1994.

Rule component was essentially a rewrite of the rules from the PC-KIMMO version, taking advantage of some of the notational conveniences offered by twolc.

Additional contexts were included to deal with vocalization of numeric constructions.



The morphotactics (encoded in the ordering of root and suffix lexicons) was completely restructured and streamlined.

About 300 finite state constraints added to deal with
• Long distance feature constraints,
• Exceptions,
• Allomorph selection (which was a MAJOR pain in the PC-KIMMO version)



Availability of regular expressions in lexicon specifications enabled us to handle simple vocalizations to deal with forms like
• 2/3’ü, 2/3’si, 1995’te vs 1996’da, 12.si vs 12’yi, F16’ları, 100,000’i vs 1,000,000’u
and with variable forms like
• aaaaaah! (interjection) as a+ h,
• çoook (emphatic form of çok) as ç o+ k
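Why 1995’te but 1996’da: the suffix follows the last spoken word of the number (beş vs altı), not its digits. The Python sketch below illustrates that idea for the locative suffix only. It is not the lexc regular-expression encoding used in the analyzer; the helper names `last_spoken_word` and `locative`, and the restriction to numbers below 10^12, are assumptions made for the illustration.

```python
BACK = set("aıou")
VOWELS = set("aeıioöuü")
VOICELESS = set("çfhkpsşt")

DIGITS = {"1": "bir", "2": "iki", "3": "üç", "4": "dört", "5": "beş",
          "6": "altı", "7": "yedi", "8": "sekiz", "9": "dokuz"}
TENS = {"1": "on", "2": "yirmi", "3": "otuz", "4": "kırk", "5": "elli",
        "6": "altmış", "7": "yetmiş", "8": "seksen", "9": "doksan"}

def last_spoken_word(number: str) -> str:
    """Last word of the number as read aloud (assumes number < 10**12)."""
    digits = number.replace(",", "").replace(".", "")
    stripped = digits.rstrip("0")
    zeros = len(digits) - len(stripped)
    if not stripped:
        return "sıfır"
    if zeros == 0:
        return DIGITS[digits[-1]]
    if zeros == 1:
        return TENS[digits[-2]]
    if zeros == 2:
        return "yüz"
    if zeros <= 5:
        return "bin"
    if zeros <= 8:
        return "milyon"
    return "milyar"

def locative(number: str) -> str:
    """Attach the locative suffix DA according to the spoken form."""
    word = last_spoken_word(number)
    last_vowel = [c for c in word if c in VOWELS][-1]
    d = "t" if word[-1] in VOICELESS else "d"
    a = "a" if last_vowel in BACK else "e"
    return f"{number}'{d}{a}"

print(locative("1995"))  # 1995'te  (beş)
print(locative("1996"))  # 1996'da  (altı)
```

The same lookup explains 100,000’i vs 1,000,000’u: the first ends in bin, the second in milyon.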


Turkish Analyzer Architecture

Tes-is: Transducer to normalize case and map to a platform-independent character representation (xfst)
Tis-lx: Morphographemics transducer (twolc) = intersection of rule transducers TR1, TR2, TR3, TR4, ..., TRn
TR1 ... TRn: Transducers for individual two-level rules (twolc)
Tlx-if: Root and morpheme lexicon transducer (lexc)
TC: Transducers for morphotactic constraints (twolc/xfst)
Tif-ef: Transducer to clean up symbolic output (xfst)



Tes-is, Tis-lx (= intersection of rule transducers TR1, TR2, TR3, TR4, ..., TRn), Tlx-if, TC, Tif-ef

kütüğünden, Kütüğünden, KÜTÜĞÜNDEN

kUtUGUnden

kUtUk+sH+ndAn
kUtUk+yH+ndAn

kUtUk+Noun+A3sg+P3sg+nAbl

kütük+Noun+A3sg+P3sg+Abl
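The first step of this trace, from the three spellings of kütüğünden to the single internal form kUtUGUnden, can be sketched in Python. Only the mappings ü to U and ğ to G are visible in the example; the other entries in the table, and the function name `normalize`, are assumptions added for illustration. The real step is an xfst transducer.

```python
# Internal, platform-independent symbols for non-ASCII letters.
# ü -> U and ğ -> G are attested in the example; the rest are assumed.
TO_INTERNAL = str.maketrans({
    "ü": "U", "ğ": "G",
    "ö": "O", "ı": "I", "ş": "S", "ç": "C",
})

def normalize(surface: str) -> str:
    """Lowercase with Turkish dotted/dotless i, then map to internal symbols."""
    # Python's str.lower() maps "I" to "i", which is wrong for Turkish,
    # so the two capital i letters are handled explicitly first.
    lowered = surface.replace("I", "ı").replace("İ", "i").lower()
    return lowered.translate(TO_INTERNAL)

for form in ("kütüğünden", "Kütüğünden", "KÜTÜĞÜNDEN"):
    print(normalize(form))  # kUtUGUnden each time
```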



kütüğünden, Kütüğünden, KÜTÜĞÜNDEN

Turkish Analyzer (after all transducers are intersected or composed; ~300K states, 800K transitions)

kütük+Noun+A3sg+P3sg+Abl


REFLECTIONS

Getting the two-level rules right is reasonably simple:

Get ALL your data right
Use a consistent representation
Test early and often
Hack idiosyncratic cases with diacritics or other special markers. No real need to be very religious about “theory” here.


REFLECTIONS

(For languages like Turkish) Getting the morphotactics “right” is REALLY hard:

Have a clean and manageable lexicon structure
Handle
• Overgeneration,
• Exceptions, long distance dependencies,
• Allomorph selection
using carefully crafted finite state filters.


REFLECTIONS

Any serious analyzer will consist of tens of files:

Use scripts and makefiles
During development, save intermediate transducers during compositions, so that you can trace bugs by checking intermediate results.

Resulting system compiles in a few minutes on a high-end SparcStation and runs at about 5-6K forms / second.
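The advice about saving intermediate results can be illustrated with a small Python sketch in which plain functions stand in for transducers. The helper `run_with_trace` and the two stages are hypothetical; with the real tools, each intermediate network would be saved to a file instead.

```python
from typing import Callable

Stage = tuple[str, Callable[[str], str]]

def run_with_trace(stages: list[Stage], form: str) -> list[tuple[str, str]]:
    """Apply the stages in order, recording the output of every stage."""
    trace = [("input", form)]
    for name, stage in stages:
        form = stage(form)
        trace.append((name, form))
    return trace

# Hypothetical stages, for illustration only.
stages: list[Stage] = [
    ("lowercase", str.lower),
    ("strip-apostrophe", lambda s: s.replace("'", "")),
]

for name, value in run_with_trace(stages, "Bordeaux'yu"):
    print(f"{name:18} {value}")
```

When the final output is wrong, the recorded trace shows the first stage at which the form went astray.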


REFLECTIONS

This analyzer (along with an unknown word processor) will assign analyses to about 98-99% of the forms encountered in news text.

Basic analyzer covers about 97%.

Unknown word processor will attempt to analyze any word whose root is not in the lexicon (provided the orthography does not violate Turkish rules!)


ARE WE THERE YET?

How can we deal with Eng-ish/Fren-ish?

Dell serverları çok yetenekli. (Dell servers are very capable.)

Galatasaray Bordeaux’yu 2-1 yendi. (Galatasaray beat Bordeaux 2-1.)


ARE WE THERE YET?



Very common in technical text (IT papers, journals, popular science magazines, etc.)




ARE WE THERE YET?



The problem is that even though foreign orthography is used, suffixation proceeds based on foreign pronunciation! (sörvır, Bordo)

Orthographically, such forms violate two-level rules (e.g., vowel harmony is violated in serverları)



ARE WE THERE YET?



Use the CMU pronunciation dictionary to build a “TTS” transducer that maps forms to a different representation capturing pronunciation, do the morphology, and use a reverse “TTS” transducer to get back to orthography.
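A minimal Python sketch of this round trip: pick the suffix vowels from the pronunciation, then attach the result to the original spelling. The two-entry table uses the pronunciations given on the earlier slide (sörvır, Bordo) and stands in for a transducer built from a pronunciation dictionary; the names `PRONUNCIATION`, `harmonize` and `attach` are assumptions for the illustration.

```python
BACK = set("aıou")
ROUNDED = set("ouöü")
VOWELS = set("aeıioöuü")

# Hand-written stand-in for the "TTS" mapping (assumption).
PRONUNCIATION = {"server": "sörvır", "Bordeaux": "Bordo"}

def harmonize(stem: str, suffix: str) -> str:
    """Resolve A and H in suffix against the last vowel of stem."""
    last_vowel = [c for c in stem if c in VOWELS][-1]
    out = []
    for ch in suffix:
        if ch == "A":
            ch = "a" if last_vowel in BACK else "e"
        elif ch == "H":
            if last_vowel in BACK:
                ch = "u" if last_vowel in ROUNDED else "ı"
            else:
                ch = "ü" if last_vowel in ROUNDED else "i"
        if ch in VOWELS:
            last_vowel = ch
        out.append(ch)
    return "".join(out)

def attach(word: str, suffixes: list[str], apostrophe: bool = False) -> str:
    """Choose suffix vowels from the pronunciation, keep the spelling."""
    spoken = PRONUNCIATION.get(word, word)
    ending = ""
    for suffix in suffixes:
        ending += harmonize(spoken + ending, suffix)
    return word + ("'" if apostrophe else "") + ending

print(attach("server", ["lAr", "H"]))                # serverları
print(attach("Bordeaux", ["yH"], apostrophe=True))   # Bordeaux'yu
print("server" + harmonize("server", "lAr"))         # serverler (spelling-based)
```

The last line shows what harmony computed on the spelling alone would produce, which is not the form writers actually use.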



LENIENT MORPHOLOGY

Two-level morphology requires that all rules accept a given lexical-surface string pair: All rules have to put in a good word!

We want to analyze word forms even if they are mildly (and controllably) malformed.

• Mismatches between orthography and pronunciation
• Linguistic variants

We do not want to do spelling correction!


LENIENT MORPHOLOGY

Allow some two-level rules to (conceptually) fail (in the analysis direction), instead of requiring all to succeed.

Use an “optimality theory” style constraint cascade to (leniently) filter / accept forms (Karttunen 1998, Gerdemann & van Noord 2000)


OT FILTERING

GEN .O. C0 .O. C1 .O. C2 .O. Ck

• Each filter Ci passes
  - all input forms if NONE satisfy the constraint, OR
  - only those input forms that satisfy the constraint
• C0 passes forms with 0 violations
• C1 passes forms with at most 1 violation (of possibly selected types).
• ...
• Transducers are composed with Karttunen’s lenient composition operator.
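The behaviour of the cascade can be sketched in Python over a plain list of candidate analyses for one input. This is a simplification: lenient composition is defined on transducers, whereas the sketch works on candidate lists. The names `lenient_compose`, `at_most` and `ot_filter`, and the use of the symbol X as the only violation marker, are assumptions for the illustration.

```python
from typing import Callable

def lenient_compose(candidates: list[str],
                    constraint: Callable[[str], bool]) -> list[str]:
    """Keep the candidates that satisfy the constraint; if none do, keep all."""
    passed = [c for c in candidates if constraint(c)]
    return passed if passed else list(candidates)

def at_most(k: int, marker: str = "X") -> Callable[[str], bool]:
    """Constraint Ck: at most k violation markers in the form."""
    return lambda form: form.count(marker) <= k

def ot_filter(candidates: list[str], max_violations: int) -> list[str]:
    """Apply C0, C1, ..., Ck in order with lenient composition."""
    for k in range(max_violations + 1):
        candidates = lenient_compose(candidates, at_most(k))
    return candidates

print(ot_filter(["server+lXr+DA", "sXrvXr+lXr+DA"], 3))  # ['server+lXr+DA']
print(ot_filter(["ev+lAr", "ev+lXr"], 3))                # ['ev+lAr']
```

Because each filter passes everything when nothing satisfies it, the cascade never returns an empty result for a non-empty input; it returns the candidates with the fewest violations.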


LENIENT MORPHOLOGY

Two-level Rules Transducer .O. C0 .O. C1 .O. C2 .O. Ck .o. Clean-up

• Failing rules mark failures with additional symbols.
• Filters select outputs with selected violations.
• Clean-up removes failure symbols.


LENIENT MORPHOLOGY

a:b => LC _ RC;
X:b (new feasible pair)
X:b /<= LC _ RC;

Potentially overgenerating; filter with lexicon later

Clean-up handles a <- X replacement later


LENIENT MORPHOLOGY

a:b /<= LC _ RC;
Y:b (new feasible pair)
Y:b <=> LC _ RC;

Clean-up handles a <- Y replacement later


LENIENT MORPHOLOGY

a:b <= LC _ RC;
Z:w new feasible pair for each w ≠ b such that a:w is a feasible pair
Z:w => LC _ RC; for each such w.

Clean-up handles a <- Z replacement later.


LENIENT MORPHOLOGY

Assume the rules
A:a <= LC1 _ ;
A:e <= LC2 _ ;
handle vowel harmony.

X:a is a new feasible pair generated from the second rule.
X:a => LC2 _ ; is an additional rule.

serverlarda

Two-level transducer

server+lXr+DA

Allow 0 violations

Allow ≤ 1 violations

...

server+lXr+DA

Clean-up (..., A <- X, ...)

server+lAr+DA
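The serverlarda example can be traced with a short Python sketch that marks a disharmonic realization of A with X and later cleans it up. It covers only the A vowel, takes the segmentation as given, and uses assumed names (`mark_violations`, `clean_up`); in the approach described here the marking is done by the modified two-level rules themselves.

```python
BACK = set("aıou")
VOWELS = set("aeıioöuü")

def mark_violations(root: str, suffixes: list[tuple[str, str]]) -> str:
    """Build the lexical form, writing X for an A that breaks harmony.

    Each suffix is a (surface, lexical template) pair, e.g. ("lar", "lAr").
    """
    surface_so_far = root
    parts = [root]
    for surface, template in suffixes:
        marked = []
        for s_ch, t_ch in zip(surface, template):
            if t_ch == "A":
                last_vowel = [c for c in surface_so_far if c in VOWELS][-1]
                expected = "a" if last_vowel in BACK else "e"
                t_ch = "A" if s_ch == expected else "X"
            marked.append(t_ch)
            surface_so_far += s_ch
        parts.append("".join(marked))
    return "+".join(parts)

def clean_up(lexical: str) -> str:
    """Restore the failure symbol X to A."""
    return lexical.replace("X", "A")

marked = mark_violations("server", [("lar", "lAr"), ("da", "DA")])
print(marked)            # server+lXr+DA
print(marked.count("X")) # 1
print(clean_up(marked))  # server+lAr+DA
```

Only the first suffix is marked: lar follows the front vowel e of the spelling, while da follows the a of lar and is harmonic.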


LENIENT MORPHOLOGY

Before or after the lenient filtering cascade, one can employ finite state filters that limit violations
• to just before or after the root-suffix boundary,
• to specific morphemes,
• to specific roots, etc.,
or allow only selected rules to be violated.
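One such filter, restricting violations to the suffix immediately after the root, can be sketched as a regular expression in Python. The marker X, the + morpheme separator and the name `FIRST_SUFFIX_ONLY` are assumptions carried over from the examples above; an actual filter would be written as a finite state network and composed with the cascade.

```python
import re

# Root without X, then optionally: a first suffix that may contain X,
# followed by any number of suffixes without X.
FIRST_SUFFIX_ONLY = re.compile(r"[^+X]*(\+[^+]*(\+[^+X]*)*)?")

def violation_allowed(lexical: str) -> bool:
    """True if every X occurs in the first suffix after the root."""
    return FIRST_SUFFIX_ONLY.fullmatch(lexical) is not None

print(violation_allowed("server+lXr+DA"))  # True
print(violation_allowed("ev+lAr+DX"))      # False
```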


THANKS
