TWO-LEVEL MORPHOLOGY and FINITE STATE METHODS: A CONSUMER’S VIEW

Kemal Oflazer
Sabancı University
İstanbul, Turkey


20 Years of Finite State Systems

OVERVIEW

Engineering a Morphological Analyzer for Turkish: Experiences and Reflections

Lenient Morphology

Bootstrapping Morphological Lexicons


TURKISH

Turkish is an agglutinative language (like Finnish, Hungarian, Basque, Korean, ...)
• Very productive inflectional and derivational suffixation
• Small root word lexicon (~60K roots), but essentially an infinite number of word forms


Rich morphophonological processes (Vowel harmony, etc.)

evinizdekilerden (from the ones at your house)

ev+iniz+de+ki+ler+den
ev+HnHz+DA+ki+lAr+DAn
A = {a, e}, H = {ı, i, u, ü}, D = {d, t}

cf. odanızdakilerden (from the ones in your room)

oda+[ı]nız+da+ki+ler+den
oda+HnHz+DA+ki+lAr+DAn
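To make the role of the archiphonemes A, H and D concrete, here is a minimal Python sketch that realizes a lexical string as a surface form. It is only an illustration of the idea, not the two-level rules used in the analyzer: the function name `realize` and the simplified rules (harmony with the nearest preceding vowel, suffix-initial H dropping after a vowel, D devoicing after a voiceless consonant) are assumptions made for this example.

```python
BACK = set("aıou")
FRONT = set("eiöü")
ROUNDED = set("ouöü")
VOWELS = BACK | FRONT
VOICELESS = set("çfhkpsşt")


def realize(lexical: str) -> str:
    """Map a lexical string such as 'ev+HnHz+DA' to a surface form."""
    out = []
    for morpheme in lexical.split("+"):
        for pos, ch in enumerate(morpheme):
            prev = out[-1] if out else ""
            last_vowel = next((c for c in reversed(out) if c in VOWELS), "")
            if ch == "A":
                # A = {a, e}: agree in backness with the preceding vowel.
                out.append("a" if last_vowel in BACK else "e")
            elif ch == "H":
                if pos == 0 and prev in VOWELS:
                    continue  # suffix-initial H drops after a vowel (oda+HnHz)
                # H = {ı, i, u, ü}: agree in backness and rounding.
                if last_vowel in BACK:
                    out.append("u" if last_vowel in ROUNDED else "ı")
                else:
                    out.append("ü" if last_vowel in ROUNDED else "i")
            elif ch == "D":
                # D = {d, t}: t after a voiceless consonant, d otherwise.
                out.append("t" if prev in VOICELESS else "d")
            else:
                out.append(ch)
    return "".join(out)


print(realize("ev+HnHz+DA+ki+lAr+DAn"))
print(realize("oda+HnHz+DA+ki+lAr+DAn"))
```

Note that the invariant morpheme ki resets the harmony domain in this sketch simply because its vowel becomes the nearest preceding vowel for the suffixes that follow.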


Morphological analysis of evinizdekilerden (from the ones at your house):

ev+iniz+de+ki+ler+den
ev+HnHz+DA+ki+lAr+DAn
ev+Noun+A3sg+P2pl+Loc^DB+Adj^DB+Noun+A3pl+Pnon+Abl


ENGINEERING A MORPHOLOGICAL ANALYZER

First implementation in 1992–1993 using PC-KIMMO

About two months to get the representation and 30+ two-level rules right
• kgen rule compiler + some hand compilation

Crude morphotactics
• Manual replications of lexicons to deal with exceptions (maintenance nightmare)
• Manual partitioning of root lexicons to deal with allomorph selections (more of the same)


No easy way to deal with numeric forms

Slow (~5 words / second on (old) workstations)


Reimplementation using twolc and lexc in late 1994

Rule component was essentially a rewrite of the rules from the PC-KIMMO version, taking advantage of the notational conveniences twolc offers.

Additional contexts were included to deal with vocalization of numeric constructions.


The morphotactics (encoded in the ordering of root and suffix lexicons) was completely re-structured and streamlined.

About 300 finite state constraints added to deal with
• Long distance feature constraints,
• Exceptions,
• Allomorph selection (which was a MAJOR pain in the PC-KIMMO version)


Availability of regular expressions in lexicon specifications enabled us to handle simple vocalizations to deal with forms like
• 2/3’ü, 2/3’si, 1995’te vs 1996’da, 12.si vs 12’yi, F16’ları, 100,000’i vs 1,000,000’u

and with variable forms like
• aaaaaah! (interjection) as a+ h,
• çoook (emphatic form of çok) as ç o+ k
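The variable forms above can be illustrated with ordinary regular expressions. The following Python sketch is an analogy only: the table `VARIABLE_FORMS` and the function `canonical_form` are hypothetical names for this example, and Python's `re` module stands in for the regular expressions that lexc allows inside lexicon entries.

```python
import re
from typing import Optional

# Hypothetical table: canonical entry -> pattern covering its stretched variants.
VARIABLE_FORMS = {
    "ah": re.compile(r"a+h"),    # a+ h
    "çok": re.compile(r"ço+k"),  # ç o+ k
}


def canonical_form(token: str) -> Optional[str]:
    """Return the canonical entry a stretched token belongs to, if any."""
    word = token.lower().rstrip("!")
    for entry, pattern in VARIABLE_FORMS.items():
        if pattern.fullmatch(word):
            return entry
    return None


print(canonical_form("aaaaaah!"))
print(canonical_form("çoook"))
```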


Turkish Analyzer Architecture

• Tes-is: transducer to normalize case and map to a platform-independent character representation (xfst)
• Tis-lx: morphographemics transducer (twolc) = intersection of the rule transducers TR1, TR2, TR3, TR4, ..., TRn, the transducers for individual two-level rules (twolc)
• Tlx-if: root and morpheme lexicon transducer (lexc)
• TC: transducers for morphotactic constraints (twolc/xfst)
• Tif-ef: transducer to clean up symbolic output (xfst)


Intermediate representations for one example, in cascade order:

kütüğünden, Kütüğünden, KÜTÜĞÜNDEN
kUtUGUnden
kUtUk+sH+ndAn, kUtUk+yH+ndAn
kUtUk+Noun+A3sg+P3sg+nAbl
kütük+Noun+A3sg+P3sg+Abl
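The first step of this cascade (case normalization plus mapping to a platform-independent character representation) can be sketched in a few lines of Python. This is a guess at the behaviour, not the actual xfst transducer: only ü → U and ğ → G are attested in the example above, and the other table entries, as well as the names `INTERNAL` and `to_internal`, are assumptions.

```python
# Only ü -> U and ğ -> G are attested on the slide (kütüğünden -> kUtUGUnden);
# the remaining entries are assumed by analogy.
INTERNAL = str.maketrans(
    {"ü": "U", "ğ": "G", "ö": "O", "ş": "S", "ç": "C", "ı": "I"}
)


def to_internal(surface: str) -> str:
    """Lowercase with Turkish casing rules, then map to the internal alphabet."""
    # Turkish casing: İ lowers to i and I lowers to ı; str.lower() alone
    # would get both wrong, so handle them before lowercasing.
    lowered = surface.replace("İ", "i").replace("I", "ı").lower()
    return lowered.translate(INTERNAL)


for form in ("kütüğünden", "Kütüğünden", "KÜTÜĞÜNDEN"):
    print(form, "->", to_internal(form))
```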


Turkish Analyzer (after all transducers are intersected or composed; ~300K states, 800K transitions):

kütüğünden, Kütüğünden, KÜTÜĞÜNDEN → kütük+Noun+A3sg+P3sg+Abl


REFLECTIONS

Getting the two-level rules right is reasonably simple:

• Get ALL your data right
• Use a consistent representation
• Test early and often
• Hack idiosyncratic cases with diacritics or other special markers. No real need to be very religious about “theory” here.


(For languages like Turkish) Getting the morphotactics “right” is REALLY hard:

Have a clean and manageable lexicon structure

Handle
• Overgeneration,
• Exceptions, long distance dependencies,
• Allomorph selection

using carefully crafted finite state filters.
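As an illustration of what such a filter does, the sketch below rejects overgenerated analyses with a regular expression. The constraint itself is invented for this example (no two case markers inside one inflectional group, where groups are separated by ^DB), and of the case tags only +Loc and +Abl appear in this talk; the others are assumed names.

```python
import re

# Hypothetical constraint: two case tags may not occur in the same
# inflectional group, i.e. without a ^DB boundary between them.
CASE = r"\+(?:Nom|Acc|Dat|Loc|Abl|Gen)"
DOUBLE_CASE = re.compile(rf"{CASE}(?:(?!\^DB).)*{CASE}")


def passes(analysis: str) -> bool:
    """True if the analysis does not violate the constraint."""
    return DOUBLE_CASE.search(analysis) is None


def apply_filter(analyses):
    """Keep only the analyses that pass the constraint."""
    return [a for a in analyses if passes(a)]


candidates = [
    "ev+Noun+A3sg+P2pl+Loc^DB+Adj^DB+Noun+A3pl+Pnon+Abl",
    "ev+Noun+A3sg+Pnon+Loc+Abl",  # overgenerated: two cases in one group
]
print(apply_filter(candidates))
```

In a finite state toolkit the same effect is obtained by composing the lexicon transducer with the complement of the language described by the offending pattern.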


Any serious analyzer will consist of tens of files:

• Use scripts and makefiles
• During development, save intermediate transducers during compositions, so that you can trace bugs by checking intermediate results.

Resulting system compiles in a few minutes on a high-end SparcStation and runs at about 5-6K forms / second.
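The advice about keeping intermediate results can be mimicked outside any finite state toolkit. In the Python sketch below, plain functions stand in for compiled transducers, and the helper `run_cascade` (a hypothetical name) records the output of every stage so that a bad result can be traced to the stage that introduced it.

```python
def run_cascade(stages, forms):
    """Apply (name, function) stages in order, keeping every intermediate result."""
    trace = [("input", forms)]
    for name, stage in stages:
        forms = stage(forms)
        trace.append((name, forms))
    return trace


# Stand-in stages; real ones would apply transducers saved during compilation.
stages = [
    ("lowercase", lambda fs: sorted({f.lower() for f in fs})),
    ("strip-apostrophe", lambda fs: [f.replace("’", "") for f in fs]),
]

for name, result in run_cascade(stages, ["Ev", "EV", "F16’ları"]):
    print(f"{name:>16}: {result}")
```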


This analyzer (along with an unknown word processor) will assign analyses to about 98-99% of the forms encountered in news text.

Basic analyzer covers about 97%.

Unknown word processor will attempt to analyze any word whose root is not in the lexicon (provided the orthography does not violate Turkish rules!)


ARE WE THERE YET?

How can we deal with Eng-ish/Fren-ish?

Dell serverları çok yetenekli. (Dell servers are very capable.)

Galatasaray Bordeaux’yu 2-1 yendi. (Galatasaray beat Bordeaux 2-1.)


Very common in technical text (IT papers and journals, popular science magazines, etc.)


• The problem is that even though foreign orthography is used, suffixation proceeds based on foreign pronunciation! (sörvır, Bordo)
• Orthographically, such forms violate two-level rules (e.g., vowel harmony is violated in serverları)


Use the CMU pronunciation dictionary to build a “TTS” transducer that maps forms to a different representation capturing pronunciation, do the morphology, and use a reverse “TTS” transducer to get back to orthography.
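A toy version of this round trip, in Python, looks as follows. The tiny `PRONUNCIATION` table is a hand-written stand-in for a real pronunciation lexicon (its two entries come from the examples above), and the function names are hypothetical. Suffix vowels are chosen from the pronounced form, while the written root is kept in the output.

```python
BACK = set("aıou")
ROUNDED = set("ouöü")
VOWELS = set("aeıioöuü")

# Hand-written stand-in for a pronunciation lexicon (an assumption for this
# sketch); entries follow the slide: server -> sörvır, Bordeaux -> Bordo.
PRONUNCIATION = {"server": "sörvır", "bordeaux": "bordo"}


def spoken(root: str) -> str:
    """Pronounced form if known, otherwise fall back to the spelling."""
    return PRONUNCIATION.get(root.lower(), root.lower())


def last_vowel(s: str) -> str:
    return next((c for c in reversed(s) if c in VOWELS), "")


def plural(root: str) -> str:
    """Attach +lAr, harmonizing with the pronunciation, not the spelling."""
    return root + ("lar" if last_vowel(spoken(root)) in BACK else "ler")


def accusative(root: str, apostrophe: str = "") -> str:
    """Attach +(y)H, harmonizing with the pronunciation."""
    s = spoken(root)
    v = last_vowel(s)
    if v in BACK:
        h = "u" if v in ROUNDED else "ı"
    else:
        h = "ü" if v in ROUNDED else "i"
    buffer = "y" if s[-1] in VOWELS else ""
    return root + apostrophe + buffer + h


print(plural("server"))
print(accusative("Bordeaux", "’"))
```

Without the table entry, `plural("server")` would follow the spelling and produce serverler, which is exactly the mismatch the slide describes.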


LENIENT MORPHOLOGY

Two-level morphology requires that all rules accept a given lexical-surface string pair: All rules have to put in a good word!

We want to analyze word forms even if they are mildly (and controllably) malformed.

• Mismatches between orthography and pronunciation
• Linguistic variants

We do not want to do spelling correction!


Allow some two-level rules to (conceptually) fail (in the analysis direction), instead of requiring all to succeed.

Use an “optimality theory” style constraint cascade to (leniently) filter / accept forms (Karttunen 1998, Gerdemann & van Noord 2000)


OT FILTERING

Cascade: GEN .O. C0 .O. C1 .O. C2 .O. ... .O. Ck

• Each filter Ci passes
  ◦ All input forms if NONE satisfy the constraint, OR
  ◦ Only those input forms that satisfy the constraint
• C0 passes forms with 0 violations
• C1 passes forms with at most 1 violation (of possibly selected types).
• ...
• Transducers are composed with Karttunen’s lenient composition operator.
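The behaviour of one lenient filter, and of the cascade built from such filters, can be mimicked on plain Python lists. This is a set-level analogy for the candidates of a single input, not Karttunen's operator on transducers; counting violations as occurrences of a marker symbol X is an assumption made for the sketch.

```python
def lenient_filter(candidates, constraint):
    """Keep the candidates that satisfy the constraint; if none do, keep all."""
    kept = [c for c in candidates if constraint(c)]
    return kept if kept else list(candidates)


def at_most(k, count=lambda c: c.count("X")):
    """Constraint Ck: a candidate may carry at most k violation marks."""
    return lambda c: count(c) <= k


def cascade(candidates, max_violations):
    """Apply C0, C1, ..., Ck leniently, in that order."""
    for k in range(max_violations + 1):
        candidates = lenient_filter(candidates, at_most(k))
    return candidates


print(cascade(["ev+lAr", "ev+lXr"], 2))   # a clean candidate wins
print(cascade(["server+lXr+DA"], 2))      # nothing clean: one violation survives
```

Because C0 is applied first, a violation-free analysis always blocks the marked ones; marked analyses survive only when no cleaner candidate exists for the same input.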


LENIENT MORPHOLOGY

Cascade: Two-level Rules Transducer .O. C0 .O. C1 .O. C2 .O. ... .O. Ck .o. Clean-up

• Failing rules mark failures with additional symbols.
• Filters select outputs with selected violations.
• Clean-up removes failure symbols.


Context restriction rules:

a:b => LC _ RC;
X:b (new feasible pair)
X:b /<= LC _ RC;

Potentially overgenerating; filter with lexicon later.

Clean-up handles a <- X replacement later.


Exclusion rules:

a:b /<= LC _ RC;
Y:b (new feasible pair)
Y:b <=> LC _ RC;

Clean-up handles a <- Y replacement later.


Surface coercion rules:

a:b <= LC _ RC;
Z:w new feasible pair for each w ≠ b such that a:w is a feasible pair
Z:w => LC _ RC; for each such w.

Clean-up handles a <- Z replacement later.


Assume rules

A:a <= LC1 _ ;
A:e <= LC2 _ ;

handle vowel harmony.

X:a is a new FP generated from the second rule.
X:a => LC2 _ ; is an additional rule.

Example:

serverlarda
→ Two-level transducer → server+lXr+DA
→ Allow 0 violations → Allow ≤ 1 violations → ... → server+lXr+DA
→ Clean-up (..., A <- X, ...) → server+lAr+DA
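The serverlarda example can be replayed with a small Python sketch that marks harmony violations with X and then cleans them up. It assumes the word is already segmented and that each suffix's surface and lexical strings have the same length; `mark` and `clean_up` are hypothetical helper names, and only the archiphoneme A is checked.

```python
BACK = set("aıou")
VOWELS = set("aeıioöuü")


def mark(root, suffixes):
    """Build a lexical string, writing X where a surface vowel violates A-harmony.

    suffixes is a list of (surface, lexical) pairs such as ("lar", "lAr").
    """
    surface = root
    parts = [root]
    for surf, lex in suffixes:
        marked = []
        for s, l in zip(surf, lex):
            if l == "A":
                prev = next((c for c in reversed(surface) if c in VOWELS), "")
                expected = "a" if prev in BACK else "e"
                marked.append("A" if s == expected else "X")
            else:
                marked.append(l)
            surface += s
        parts.append("".join(marked))
    return "+".join(parts)


def clean_up(lexical):
    """Clean-up step: A <- X."""
    return lexical.replace("X", "A")


marked = mark("server", [("lar", "lAr"), ("da", "DA")])
print(marked, marked.count("X"), clean_up(marked))
```

Only the first suffix is marked: once lar is on the surface, the a of da harmonizes correctly with it, so the form carries exactly one violation and passes the "at most 1 violation" filter.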


Before or after the lenient filtering cascade, one can employ finite state filters that limit violations
• to just after or before the root-suffix boundary,
• to specific morphemes,
• to specific roots, etc.

or just allow only selected rules to be violated.
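One such positional restriction is easy to state as a regular expression. The Python sketch below accepts a marked lexical string only if violation marks occur in the first suffix after the root; the marker symbol X, the + boundary convention and the name `violations_only_at_root_boundary` are assumptions for this illustration.

```python
import re

# Root without marks, then optionally a first suffix (marks allowed)
# followed by any number of further suffixes (marks not allowed).
ONLY_FIRST_SUFFIX = re.compile(r"[^+X]+(?:\+[^+]*(?:\+[^+X]*)*)?")


def violations_only_at_root_boundary(lexical: str) -> bool:
    """True if every X mark sits in the suffix right after the root."""
    return ONLY_FIRST_SUFFIX.fullmatch(lexical) is not None


for form in ("server+lXr+DA", "server+lAr+DX", "server+lAr+DA"):
    print(form, violations_only_at_root_boundary(form))
```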


THANKS