Learning Morphology of
Romance, Germanic, and Slavic
languages with the tool Linguistica
Helena Blancafort
LREC 2010
Outline
1. Introduction
2. State of the art
3. Linguistica: How it works
4. Experiments and Results
5. Conclusions and further work
20/05/2010, LREC 2010
Introduction
Motivation
How can we predict the cost of developing a morphosyntactic lexicon for a new language?
Goals
Evaluate whether we can benefit from unsupervised learning of morphology
Input: Bible parallel corpus, tool Linguistica (Goldsmith 2001, 2006)
State of the Art: Induction of morphology
Objective
- induce morphological information from raw data
Affix inventory
• Brent et al. 1995; Kazakov 1997
• MDL (Rissanen 1998)
Cluster of stems and affixes
• Schone and Jurafsky 2001
• Yarowsky and Wicentowski 2001
State of the Art II
Using linguistic knowledge or not
Lexicon
• Nakov et al. (2003); Oliver (2005)
• Learn all possible endings of an unknown word
• Apply Maximum Likelihood Estimation (Mikheev)
Inflection Rules
• Clément et al. (2004)
• Forsberg et al. (2006); Loupy et al. (2008)
• POS tagger: Zanchetta and Baroni (2005)
Linguistica: How it works I
• Knowledge-free
• Input: raw corpus
• Heuristics to generate a probabilistic morphological grammar
• MDL (minimum description length) and EM (expectation-maximization algorithm) to filter out inappropriate analyses
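The intuition behind the MDL criterion can be sketched with a toy calculation: an analysis is preferred when it lets the vocabulary be written down more compactly. The sketch below is a simplification that counts letters rather than bits, so it is not Linguistica's actual cost function; the function names `plain_cost` and `factored_cost` and the sample stems and suffixes are assumptions made for illustration.

```python
def plain_cost(words):
    """Letters needed to list every word form in full."""
    return sum(len(w) for w in words)

def factored_cost(stems, suffixes):
    """Letters needed to list each stem and each suffix once (NULL costs 0)."""
    stem_letters = sum(len(s) for s in stems)
    suffix_letters = sum(len(x) for x in suffixes if x != "NULL")
    return stem_letters + suffix_letters

# Illustrative data: three stems that all combine with the same four suffixes.
stems = ["ask", "look", "knock"]
suffixes = ["NULL", "ed", "ing", "s"]
words = [s + ("" if x == "NULL" else x) for s in stems for x in suffixes]

print(plain_cost(words))               # 66 letters for the 12 full forms
print(factored_cost(stems, suffixes))  # 18 letters for stems plus suffixes
```

Under this simplified measure the stem-plus-suffix analysis is much shorter than the plain word list, which is the kind of saving an MDL-based filter rewards.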
Linguistica: How it works II
Signatures
Paradigm-like clusters of words that share the same affixes;
they could help to build a morphological grammar
The algorithm:
- Splits a word into stem and affix
- For each stem, list of affixes
- Cluster of stems sharing the same affixes
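The three steps can be sketched as follows. This is a minimal illustration, not Linguistica's implementation: it assumes the candidate suffixes are supplied by hand, whereas Linguistica finds candidate splits heuristically from the raw corpus. The function name `signatures` and the sample word list are hypothetical.

```python
from collections import defaultdict

def signatures(words, suffixes):
    """Group stems by the set of suffixes they occur with in the word list."""
    stem_to_suffixes = defaultdict(set)
    for word in set(words):
        for suffix in suffixes:
            if suffix == "NULL":
                stem = word  # the bare word is a stem with an empty suffix
            elif word.endswith(suffix) and len(word) > len(suffix):
                stem = word[: -len(suffix)]  # split the word into stem + suffix
            else:
                continue
            stem_to_suffixes[stem].add(suffix)

    clusters = defaultdict(list)
    for stem, found in stem_to_suffixes.items():
        if len(found) >= 2:  # a lone suffix does not form a paradigm-like cluster
            clusters[".".join(sorted(found))].append(stem)
    return {sig: sorted(stems) for sig, stems in clusters.items()}

words = ["ask", "asks", "asked", "asking",
         "look", "looks", "looked", "looking",
         "fail", "failed"]
result = signatures(words, ["NULL", "ed", "ing", "s"])
print(result)
```

With this input, `ask` and `look` end up under the signature `NULL.ed.ing.s`, while `fail` gets the shorter signature `NULL.ed` because only two of its forms are attested.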
Linguistica: How it works III
Signatures
NULL.ed.ing.s 68 7889
gather abound account ascend ask belong boil
chasten concern confirm consider delay doubt
encamp enter exceed explain fail fasten fold gain
gather glean greet groan guard hang happen harden
insult journey knock lack leap lift listen look minister
number obey offer overflow
Linguistica: How it works IV
Main hurdles
1) Allomorphy
ES colgar -> colg, cuelg
FR acheter -> achet, achèt
2) Incomplete paradigms due to bad segmentation
Spanish verb anunciar: anunci(o, en, etc.), anunciab(a)
3) No distinction between inflectional and derivational suffixes
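The allomorphy hurdle can be illustrated with a naive stem guess. The sketch below uses the longest common prefix of a verb's forms as the stem, which is a stand-in chosen for illustration and not the method Linguistica uses; `shared_stem` is a hypothetical helper, and the word forms are ordinary Spanish inflections of anunciar and colgar.

```python
import os

def shared_stem(forms):
    """Longest common prefix of the forms, used here as a naive stem guess."""
    return os.path.commonprefix(forms)

regular = ["anuncio", "anuncien", "anunciar"]
stem_changing = ["colgar", "colgamos", "cuelgo", "cuelga"]

print(shared_stem(regular))        # anunci
print(shared_stem(stem_changing))  # c
```

For the regular verb the forms agree on `anunci`, but the stem alternation in colgar leaves only `c` in common, so a purely string-based learner tends to treat `colg` and `cuelg` as two unrelated stems.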
Experiments and Results I
[Figure: bar chart of the number of suffixes generated by Linguistica per language (pl, it, cat, es, fr, pt, de, nl, en); y-axis from 0 to 600]
Experiments and Results II
Number of paradigms and number of suffixes
[Figure: chart relating the number of paradigms to the number of suffixes for pt, fr, de, cat, nl, en, it, es, and pl; axis ticks run from 0 to 1000 and from 1 to 14]
Experiments and Results III
[Figure: bar chart of the maximum number of forms per signature (Linguistica) per language (pl, es, cat, it, pt, fr, de, nl, en); y-axis from 0 to 45]
Experiments and Results IV
Knowledge-free vs. knowledge-based

Language   Max nb forms per signature (Linguistica)   Max nb forms per paradigm (Multext)
es         31                                         55
it         28                                         63
fr         24                                         62
de         14                                         29
en         9                                          14
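One way to read the two sets of figures side by side is to divide the longest Linguistica signature by the largest Multext paradigm for each language. The ratio is an illustrative measure introduced here, not one reported on the slides, and it is only a rough indicator because signatures may mix inflectional and derivational suffixes.

```python
# Figures taken from the slide: longest signature vs. largest paradigm.
linguistica = {"es": 31, "it": 28, "fr": 24, "de": 14, "en": 9}
multext = {"it": 63, "fr": 62, "es": 55, "de": 29, "en": 14}

# Share of the knowledge-based paradigm size reached by the knowledge-free tool.
coverage = {lang: round(linguistica[lang] / multext[lang], 2) for lang in linguistica}

for lang, ratio in sorted(coverage.items(), key=lambda item: item[1], reverse=True):
    print(lang, ratio)
```

English comes out highest (0.64) and French lowest (0.39), with Spanish, German, and Italian in between.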
Experiments and Results V
Longest signatures suggested by Linguistica for a stem
Language  Nb affixes  Stem  Signature
pl 39 da NULL.ch.cie.dzą.j.je.jmy.jmyż.ją.jąc.li.liście.liśmy.m.my.na.ne.nej.ni.nie.niu.no.ny.ną.rze.sz.wa.wał.wszy.d.ł.ła.łby.łbyś.łem.łeś.ło.ły.o
es 31 anunci a.ad.ada.adas.adlo.ado.amos.an.ando.ar.ara.arles.aron.aros.arte.ará.arán.arás.aré.as.ase.asen.e.emos.en.es.o.áis.é.éis.ó
de 14 heil NULL.e.en.et.ig.los.lose.loser.sam.same.sames.t.te.ten
en 9 light NULL.ed.en.er.ing.ly.ness.ning.s
Experiments and Results VI
List of most frequent prefixes for German
Prefix Nb occ. Prefix Nb occ. Prefix Nb occ.
ge 40 her 13 er 8
aus 30 un 13 *nied 7
ver 21 weg 11 bei 6
hin 20 be 10 heim 6
auf 19 zu 10 über 5
ab 19 *üb 9 durch 5
ein 16 an 9 ent 4
Conclusions and Further Work
Linguistica's output gives useful information for evaluating the richness and complexity of the morphology of a language
Unsupervised techniques should be improved with human input: handwritten rules are necessary for dealing with allomorphy and correcting bad segmentation (Karasimos & Petropoulou 2010)
Complete paradigms using the web (Oliver 2005) or
Output quality is language-dependent: English yields better results than the other languages (complete verbal paradigms)
Thank you
Grazzi