09:10 Mikko Kurimo: "Unsupervised Morpheme Analysis -- Morpho Challenge Workshop 2007" 09:30 Mikko...
09:10 Mikko Kurimo: "Unsupervised Morpheme Analysis -- Morpho Challenge Workshop 2007"
09:30 Mikko Kurimo: "Evaluation by a Comparison to a Linguistic Gold Standard – Competition 1"
09:50 Ville Turunen: "Evaluation by IR experiments – Competition 2"
10:10 Delphine Bernhard: "Simple Morpheme Labelling in Unsupervised Morpheme Analysis"
10:30 Break
11:00 Stefan Bordag: "Unsupervised and Knowledge-free Morpheme Segmentation and Analysis"
11:30 Christian Monson: "ParaMor: Finding Paradigms across Morphology"
11:50 Paul McNamee: "Applying ngrams and morpheme analysis in IR"
12:10 Michael Tepper: "A Hybrid Approach to the Induction of Underlying Morphology"
12:25 Erwin Chan: "Towards unsupervised induction of morphophonological rules"
13:00 End of workshop
Unsupervised Morpheme Analysis
Morpho Challenge Workshop 2007
Mikko Kurimo, Mathias Creutz, Matti Varjokallio, Ville Turunen
Helsinki University of Technology, Finland
Opening
Welcome to the Morpho Challenge 2007 workshop:
• challenge participants
• workshop speakers
• other CLEF researchers
• others interested in the topic
Motivation
• To design statistical machine learning algorithms that discover which morphemes words consist of
• Follow-up to Morpho Challenge 2005 (segmentation of words into morphs)
• Morphemes are useful as vocabulary units for statistical language modeling in speech recognition, machine translation, and information retrieval
Discussion topics for the end
• New ways to evaluate morphemes?
• New test languages: Hungarian, Estonian, Russian, Arabic, Korean, Japanese, Chinese?
• New application evaluations: MT, ...?
• New organizing partners?
• Morpho Challenge 3?
• Journal special issue?
• 3rd Morpho Challenge workshop?
Thanks
Thanks to all who made Morpho Challenge 2007 possible:
• PASCAL network, CLEF, Leipzig corpora collection
• Morpho Challenge organizing committee
• Morpho Challenge participants
• Morpho Challenge program committee
• Morpho Challenge evaluation team
• CLEF 2007 workshop organizers
Unsupervised Morpheme Analysis
Evaluation by a Comparison to a Linguistic Gold Standard – Competition 1
Mikko Kurimo, Mathias Creutz, Matti Varjokallio
Contents
• Objectives
• Call for participation, Rules, Datasets
• Participants
• Morfessor
• New evaluation method
• Results
• Conclusion
Scientific objectives
• To learn of the phenomena underlying word construction in natural languages
• To discover approaches suitable for a wide range of languages
• To advance machine learning methodology
Call for participation
• Part of the EU Network of Excellence PASCAL’s Challenge Program
• Organized in collaboration with CLEF
• Participation is open to all and free of charge
• Word sets are provided for Finnish, English, German and Turkish
• Implement an unsupervised algorithm that discovers the morpheme analysis of words in each language!
Rules
• Morpheme analyses are submitted to the organizers and two different evaluations are made
• Competition 1: Comparison to a linguistic morpheme "gold standard"
• Competition 2: Information retrieval experiments, where the indexing is based on morphemes instead of entire words
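To make the idea behind Competition 2 concrete, here is a minimal sketch of an inverted index whose terms are morphemes rather than whole words. It is a toy boolean lookup, not the retrieval setup used in the challenge; the names `index_documents` and `search`, the example analyses, and the two documents are all hypothetical.

```python
from collections import defaultdict

def index_documents(docs, analyses):
    """Build an inverted index from morpheme to the set of document ids.

    Words without an analysis are indexed as themselves (an assumption).
    """
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            for unit in analyses.get(word, [word]):
                index[unit].add(doc_id)
    return index

def search(index, query, analyses):
    """Return ids of documents sharing at least one morpheme with the query."""
    hits = set()
    for word in query.lower().split():
        for unit in analyses.get(word, [word]):
            hits |= index.get(unit, set())
    return hits

# Hypothetical analyses and documents, for illustration only.
analyses = {"talossa": ["talo", "ssa"], "talo": ["talo"]}
docs = {1: "talossa", 2: "auto"}
index = index_documents(docs, analyses)
print(search(index, "talo", analyses))
```

With word-based indexing the query "talo" would miss document 1, because the document only contains the inflected form "talossa"; with morpheme-based indexing both map to the shared unit "talo".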
Datasets
• Word lists downloadable at our home page
• Each word in the list is preceded by its frequency
• Finnish: 3M sentences, 2.2M word types
• Turkish: 1M sentences, 620K word types
• German: 3M sentences, 1.3M word types
• English: 3M sentences, 380K word types
• Small gold standard sample available in each language
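As a rough illustration of the word list layout described above (each word preceded by its frequency), the following sketch reads such lines into a counter. The exact file syntax is an assumption here: one entry per line, frequency and word separated by whitespace. The function name `read_word_counts` and the sample lines are made up.

```python
from collections import Counter
from typing import Iterable

def read_word_counts(lines: Iterable[str]) -> Counter:
    """Parse lines of the assumed form '<frequency> <word>' into a Counter."""
    counts: Counter = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        freq, word = line.split(maxsplit=1)
        counts[word] += int(freq)
    return counts

# Made-up sample, not taken from the challenge data.
sample = ["120 talo", "45 talossa", "3 linuxiin"]
counts = read_word_counts(sample)
print(len(counts), "word types,", sum(counts.values()), "word tokens")
```

The distinction between word types (distinct entries) and word tokens (sum of frequencies) is the same one the slide uses when it reports corpus sizes in word types.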
Examples of gold standard analyses
• English: baby-sitters baby_N sit_V er_s +PL
• Finnish: linuxiin linux_N +ILL
• Turkish: kontrole kontrol +DAT
• German: zurueckzubehalten zurueck_B zu be halt_V +INF
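The examples above can be read as a word followed by a sequence of morpheme labels. A small sketch of parsing them into a lookup table follows; it assumes whitespace-separated fields exactly as printed on the slide, which may differ from the actual gold standard file syntax. The helper `parse_analysis` is hypothetical.

```python
def parse_analysis(line: str) -> tuple[str, list[str]]:
    """Split 'word label1 label2 ...' into the word and its morpheme labels."""
    word, *labels = line.split()
    return word, labels

# The four example lines from the slide, without the language prefix.
examples = [
    "baby-sitters baby_N sit_V er_s +PL",
    "linuxiin linux_N +ILL",
    "kontrole kontrol +DAT",
    "zurueckzubehalten zurueck_B zu be halt_V +INF",
]
gold = dict(parse_analysis(e) for e in examples)
for word, labels in gold.items():
    print(word, "->", labels)
```

Note that the labels are not substrings of the word in general (for example `sit_V` in "baby-sitters"), which is why a plain segmentation comparison is not enough for evaluation.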
New evaluation method
• Problem: The unsupervised morphemes may have arbitrary names, not the same as the "real" linguistic morphemes, nor just subword strings
• Solution: Compare to the linguistic gold standard analysis by matching the morpheme-sharing word pairs
• Compute matches from a large random sample of word pairs where both words in the pair have a common morpheme
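The pair sampling step can be sketched as follows. This is only one plausible way to draw word pairs that share a morpheme, not the organizers' sampling procedure; `build_index`, `sample_pairs`, the seed, and the toy analyses with arbitrary labels such as `m1` are all assumptions for illustration.

```python
import random
from collections import defaultdict

def build_index(analyses):
    """Map each morpheme label to the set of words whose analysis contains it."""
    index = defaultdict(set)
    for word, morphemes in analyses.items():
        for m in morphemes:
            index[m].add(word)
    return index

def sample_pairs(analyses, n_pairs, seed=0):
    """Draw (word, partner, shared_morpheme) triples at random."""
    rng = random.Random(seed)
    index = build_index(analyses)
    # Only words that share some morpheme with another word can form a pair.
    candidates = [
        w for w in sorted(analyses)
        if any(len(index[m]) > 1 for m in analyses[w])
    ]
    if not candidates:
        return []
    pairs = []
    while len(pairs) < n_pairs:
        word = rng.choice(candidates)
        shared = rng.choice([m for m in analyses[word] if len(index[m]) > 1])
        partner = rng.choice(sorted(index[shared] - {word}))
        pairs.append((word, partner, shared))
    return pairs

# Toy proposed analyses with arbitrary morpheme names.
proposed = {
    "talo": ["m1"],
    "talossa": ["m1", "m7"],
    "autossa": ["m3", "m7"],
    "auto": ["m3"],
    "linux": ["m9"],
}
pairs = sample_pairs(proposed, n_pairs=5)
print(pairs)
```

Because only the sharing relation between words is used, the labels `m1`, `m7`, and so on never have to match the gold standard names.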
Participants
• Delphine Bernhard, TIMC-IMAG, F (now moved to Darmstadt Univ. Tech., D)
• Stefan Bordag, Univ. Leipzig, D
• Paul McNamee and James Mayfield, JHU, USA
• Daniel Zeman, Karlova Univ., CZ
• Christian Monson et al., CMU, USA
• Emily Pitler and Samarth Keshava, Univ. Yale, USA
• Morfessor MAP, Helsinki Univ. Tech., FI
• (Michael Tepper, Univ. Washington, USA)
Evaluation measures
• F-measure = 2/(1/Precision + 1/Recall), the harmonic mean of Precision and Recall
• Precision is the proportion of suggested word pairs that also have a morpheme in common according to the gold standard
• Recall is the proportion of word pairs sampled from the gold standard that also have a morpheme in common according to the suggested algorithm
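A small worked sketch of these measures on toy data is given below. As a simplification it enumerates all word pairs instead of drawing a random sample, and it ignores details such as alternative analyses; the function names and both toy analyses are hypothetical. The F-measure is computed as 2PR/(P+R), which is algebraically the same as the harmonic mean.

```python
from itertools import combinations

def sharing_pairs(analyses):
    """All unordered word pairs that share at least one morpheme label."""
    return {
        frozenset((a, b))
        for a, b in combinations(sorted(analyses), 2)
        if set(analyses[a]) & set(analyses[b])
    }

def evaluate(proposed, gold):
    """Return (precision, recall, f_measure) over morpheme-sharing word pairs."""
    words = set(proposed) & set(gold)
    proposed_pairs = sharing_pairs({w: proposed[w] for w in words})
    gold_pairs = sharing_pairs({w: gold[w] for w in words})
    hits = len(proposed_pairs & gold_pairs)
    precision = hits / len(proposed_pairs) if proposed_pairs else 0.0
    recall = hits / len(gold_pairs) if gold_pairs else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure

# Toy gold standard and a proposed analysis with arbitrary labels.
gold = {
    "talo": ["talo_N"],
    "talossa": ["talo_N", "+INE"],
    "autossa": ["auto_N", "+INE"],
    "auto": ["auto_N"],
}
proposed = {
    "talo": ["m1"],
    "talossa": ["m1", "m7"],
    "autossa": ["m3", "m7"],
    "auto": ["m3", "m1"],  # spurious m1 creates two wrong pairs
}
p, r, f = evaluate(proposed, gold)
print(f"precision={p:.2f} recall={r:.2f} F={f:.2f}")
```

In this toy case the proposed analysis links five word pairs, three of which the gold standard also links, and it recovers all three gold pairs, so over-eager morpheme sharing costs precision while recall stays perfect.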
Results: Finnish, 2.2M word types
[Figure: F-measure per method, axis from 0% to 50%. Legend: Bernhard 2, Bernhard 1, Bordag 5a, Bordag 5, Zeman, McNamee 3, McNamee 4, McNamee 5, Morfessor MAP. The plotted values are not recoverable from the transcript.]
Results: Turkish, 620K word types
[Figure: F-measure per method, axis from 0% to 50%. Legend: Zeman, Bordag 5a, Bordag 5, Bernhard 2, Bernhard 1, McNamee 3, McNamee 4, McNamee 5, Morfessor MAP, Tepper. The plotted values are not recoverable from the transcript.]
Results: German, 1.3M word types
[Figure: F-measure per method, axis from 0% to 50%. Legend: Monson ParaMor-M., Bernhard 2, Bordag 5a, Bordag 5, Monson Morfessor, Bernhard 1, Monson ParaMor, Zeman, McNamee 3, McNamee 4, McNamee 5, Morfessor MAP. The plotted values are not recoverable from the transcript.]
Results: English, 380K word types
[Figure: F-measure per method, axis from 0% to 60%. Legend: Bernhard 2, Bernhard 1, Pitler, Monson ParaMor-M., Monson ParaMor, Monson Morfessor, Zeman, Bordag 5a, Bordag 5, McNamee 3, McNamee 4, McNamee 5, Morfessor MAP, Tepper. The plotted values are not recoverable from the transcript.]
Conclusion
• 12 different unsupervised algorithms
• 6 participating research groups
• Evaluations for 4 languages
• Good results in all languages and in IR
• Full report and papers in the CLEF proceedings
• Website: http://www.cis.hut.fi/morphochallenge2007/
Acknowledgments
• Data from Leipzig and CLEF
• Gold standard providers in all languages!
• Workshop organization by CLEF
• Funding from PASCAL and Academy of Finland
• Competition participants!