Fsmnlp presentation mohammed_attia
An Open-Source Finite State Morphological Transducer for Modern
Standard Arabic
Mohammed Attia, Pavel Pecina, Antonio Toral, Lamia Tounsi, Josef van Genabith
National Centre for Language Technology (NCLT),
School of Computing, Dublin City University
Funded by:
Enterprise Ireland, the Irish Research Council for Science
Engineering and Technology (IRCSET), and
the EU projects PANACEA and META-NET
Contribution
• We develop a finite state morphological transducer for Modern Standard Arabic
1. Open source, distributed under the GPLv3 license
2. Large scale, more than 30,000 lemmas
3. Corpus based, truly representative of Modern Standard Arabic and not Classical Arabic.
4. Compatible with Foma, an open-source FST compiler
Short Tutorial
(1) Download Foma
http://foma.sourceforge.net
(2) Download AraComLex
http://aracomlex.sourceforge.net
(3) Build the transducer: README
The transducer online
• You can test the transducer online:
http://www.cngl.ie/aracomlex
Introduction
• Modern Standard Arabic vs. Classical Arabic
• Current State of Arabic Lexicography
– Lexicons are not corpus-based
– Buckwalter Arabic Morphological Analyser
• Importance of Lexical Resources
Introduction
• Arabic Morphotactics
lemma
Aim
• Building a finite-state morphological transducer
• Constructing a lexical database of Modern Standard Arabic
Methodology
• Using Open-Source Finite State Technology
• Using statistics from a 1 billion word corpus
– 90% from the LDC's Arabic Gigaword
– 10% collected from the Al-Jazeera website
• Using a medium-scale manually created lexicon of 10,799 lemmas
Methodology
• Using Finite State Technology (XFST)
– Bidirectional: suitable for analysis and generation
– Handles concatenative and non-concatenative morphotactics
– Speed and efficiency in dealing with millions of paths
– Handles separated dependencies
– Handles phonological and orthographic changes through alteration rules
Methodology
• Design Approach: Three approaches
– Root-based morphology
Xerox Arabic FTM
– Stem-based morphology
Buckwalter
$kr $akar PV thank;give thanks
$kr $okur IV thank;give thanks
– Lemma-based morphology
Methodology
Our Approach: Lemma-based morphology
Methodology
Alteration Rules: Alteration Rules are used for handling discrepancies between surface forms and underlying representation or lemmas. We have 130 replace rules.
a -> b || L _ R
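The rule template above rewrites a as b only when a is preceded by L and followed by R. As a rough illustration outside the finite-state toolchain, the same contextual rewrite can be approximated in Python with a lookbehind and a lookahead. The function name `replace_in_context` and the sample strings are hypothetical; this sketches the meaning of the rule, not how Foma or XFST compiles it.

```python
import re

def replace_in_context(text, a, b, left, right):
    """Rewrite `a` as `b` only where it sits between `left` and `right`.

    All arguments are treated as literal strings (hypothetical helper,
    mirroring the rule shape  a -> b || L _ R).
    """
    pattern = re.compile(
        "(?<=" + re.escape(left) + ")" + re.escape(a) + "(?=" + re.escape(right) + ")"
    )
    return pattern.sub(b, text)

# The first "a" is in the context L _ R and is rewritten; the second is not.
print(replace_in_context("LaR LaX", "a", "b", "L", "R"))
```

Because the contexts are matched with lookarounds, they are checked but not consumed, which is the behaviour the L _ R notation describes.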
Results to Date
• Start off with a seed lexicon
– Four Lexical Databases, manually constructed
• 5,925 nominal lemmas
• 1,529 verb lemmas
• 490 patterns (456 for nominals and 34 for verbs)
• lemma-root lookup database
Results to Date
• Automatically Extending the Lexical Database: Lexical Enrichment
– Data-driven filtering technique
• 40,648 lemmas (in Buckwalter or SAMA 3.1)
• Statistics from three web search engines
• Statistics from the corpus annotated by MADA
• 29,627 lemmas (left after filtering)
Results to Date
Automatically Extending the Lexical Database: Feature Enrichment
– Machine Learning
– Multilayer Perceptron classification algorithm
– Training Data: 4,816 nominals and 1,448 verbs
– Classes for nominals: continuation classes (or inflection paths), the semantico-grammatical feature of humanness, and POS (noun or adjective)
– Classes for verbs: transitivity, allowing the passive voice, and allowing the imperative mood
– We feed these datasets with frequency statistics from the corpus and build a vector grid.
Results to Date
• Extending the Lexical Database
– Feature enrichment using Machine Learning
Results to Date
• Extending the Lexical Database
– With Machine Learning we add about 18,000 new lemmas: 12,974 nominals and 5,034 verbs
Results to Date
• Handling Broken Plurals
jAnib (side)
jawAnib (sides)
Poor handling of broken plurals in Buckwalter
(4) <lemmaID>jAnib_1</lemmaID> <voc>jAnib</voc> <pos>jAnib/NOUN</pos> <gloss>side/aspect</gloss>
(5) <lemmaID>jAnib_1</lemmaID> <voc>jawAnib</voc> <pos>jawAnib/NOUN</pos> <gloss>sides/aspects</gloss>
Two differences: voc and gloss
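To make the comparison concrete, the sketch below groups the two entries above by lemmaID and lists the fields in which they differ. It assumes each entry can be wrapped in a hypothetical `<entry>` root element and parsed as XML, and it compares only the tag after the slash in `<pos>`, since the part before the slash repeats the vocalised form. The helper names are illustrative and not part of Buckwalter, SAMA, or AraComLex.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

ENTRIES = [
    "<lemmaID>jAnib_1</lemmaID> <voc>jAnib</voc> <pos>jAnib/NOUN</pos> <gloss>side/aspect</gloss>",
    "<lemmaID>jAnib_1</lemmaID> <voc>jawAnib</voc> <pos>jawAnib/NOUN</pos> <gloss>sides/aspects</gloss>",
]

def parse_entry(fragment):
    """Parse one entry into a dict; <entry> is a hypothetical wrapper element."""
    root = ET.fromstring("<entry>" + fragment + "</entry>")
    entry = {child.tag: child.text for child in root}
    # Keep only the POS tag; the part before "/" duplicates <voc>.
    entry["pos"] = entry["pos"].split("/")[-1]
    return entry

def group_by_lemma(fragments):
    """Collect parsed entries under their shared lemmaID."""
    groups = defaultdict(list)
    for fragment in fragments:
        entry = parse_entry(fragment)
        groups[entry["lemmaID"]].append(entry)
    return dict(groups)

def differing_fields(first, second):
    """Return the sorted names of the fields whose values differ."""
    return sorted(tag for tag in first if first[tag] != second.get(tag))

groups = group_by_lemma(ENTRIES)
print(differing_fields(*groups["jAnib_1"]))  # ['gloss', 'voc']
```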
Results to Date
• Extracting Broken Plurals
<gloss>side/aspect</gloss>
<gloss>sides/aspects</gloss>
We use Levenshtein Distance which measures the difference between two strings (here glosses having the same lemmaID).
distance of 2 / length of the first string = 0.15 (within the threshold 0.4)
We collect 2,266 candidates
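The gloss comparison can be sketched as follows, assuming a standard dynamic-programming Levenshtein distance and the 0.4 threshold stated above. The function names are hypothetical. For these two glosses the distance is 2 and the string lengths are 11 and 13, so the ratio is about 0.18 or 0.15 depending on which gloss is taken as the first string; both fall under the threshold.

```python
def levenshtein(s, t):
    """Edit distance between s and t (insertions, deletions, substitutions)."""
    previous = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        current = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

def gloss_ratio(first, second):
    """Distance divided by the length of the first string."""
    return levenshtein(first, second) / len(first)

def is_plural_candidate(first, second, threshold=0.4):
    """Treat two glosses of the same lemmaID as a singular/plural candidate pair."""
    return gloss_ratio(first, second) <= threshold

print(levenshtein("side/aspect", "sides/aspects"))                # 2
print(round(gloss_ratio("sides/aspects", "side/aspect"), 2))      # 0.15
print(is_plural_candidate("side/aspect", "sides/aspects"))        # True
```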
Results to Date
• Validating Broken Plurals
<voc>jAnib</voc> singular
pattern is: fAEil
regex is: .A.i.
<voc>jawAnib</voc> plural
pattern is: fawAEil
regex is: .awA.i.
Pattern database: 135 singular patterns that choose from a set of 82 broken plural patterns
2,266 candidates -> 1,965 are validated (87%)
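The validation step can be illustrated with a small sketch. It assumes that the letters f, E and l in a pattern stand for root-consonant slots and become "." in the regex, as in the two examples above, and it uses a one-entry dictionary as a hypothetical stand-in for the pattern database. It does not check that singular and plural share the same root consonants.

```python
import re

ROOT_SLOTS = {"f", "E", "l"}  # assumed slot letters, taken from the fAEil examples

def pattern_to_regex(pattern):
    """Turn a morphological pattern such as fAEil into a regex such as .A.i."""
    return "".join("." if ch in ROOT_SLOTS else re.escape(ch) for ch in pattern)

def matches_pattern(voc, pattern):
    """True if the vocalised form matches the whole pattern."""
    return re.fullmatch(pattern_to_regex(pattern), voc) is not None

# Hypothetical stand-in for the database of singular patterns and the
# broken plural patterns each of them may choose from.
ALLOWED_PLURALS = {"fAEil": {"fawAEil"}}

def validate(singular, plural):
    """Accept a candidate pair if its patterns are licensed by the database."""
    for sg_pattern, pl_patterns in ALLOWED_PLURALS.items():
        if matches_pattern(singular, sg_pattern):
            if any(matches_pattern(plural, p) for p in pl_patterns):
                return True
    return False

print(pattern_to_regex("fawAEil"))    # .awA.i.
print(validate("jAnib", "jawAnib"))   # True
print(round(1965 / 2266 * 100))       # 87
```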
Results to Date
• Interesting statistics on Arabic plurals
Insights from the corpus:
5,570 lemmas have a feminine plural suffix
1,942 lemmas have a masculine plural suffix
2,730 lemmas have a broken plural form
Results to Date
• AraComLex Lexicon Writing Application
Results to Date
• FST Morphology Coverage and RPW Results
– a test corpus of 800,000 words, divided as
• 400,000 for Semi-Literary text
• 400,000 for General News texts.
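As a hedged illustration of the two measures, the sketch below takes a list holding the number of analyses returned for each test token. It assumes coverage is the share of tokens with at least one analysis and the rate per word is the mean number of analyses over the analysed tokens; the exact definitions used in the evaluation are not given here, and the function name and sample counts are hypothetical.

```python
def coverage_and_rate(analyses_per_token):
    """Return (coverage, mean analyses per analysed token).

    `analyses_per_token` holds, for every token in a test corpus, how many
    analyses the transducer returned (0 means the token was not analysed).
    """
    total = len(analyses_per_token)
    analysed = [n for n in analyses_per_token if n > 0]
    coverage = len(analysed) / total if total else 0.0
    rate = sum(analysed) / len(analysed) if analysed else 0.0
    return coverage, rate

# Four tokens, one of them unanalysed (made-up counts for illustration only).
print(coverage_and_rate([2, 0, 1, 3]))  # (0.75, 2.0)
```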
Future Work
• Going beyond SAMA
• Including Named Entities and MWEs
Conclusion
• Open-source finite state transducer for Modern Standard Arabic (AraComLex) distributed under the GPLv3 license.
• We successfully use machine learning to predict morpho-syntactic features for newly acquired words.
• Comparing our morphological transducer to SAMA, we find that we achieve comparable coverage and a lower rate of analyses per word.