Post on 24-Sep-2020
Towards Automatic
Normalizaton of the Moroccan
Dialectal User Generated Text
Ridouane Tachicart & Karim Bouzoubaa
Arabic Language Engineering and Learning Modeling – ALELM Lab
Mohammed V University in Rabat - Morocco
ICALP 2019
Social media provides rich information of human interaction and collective behavior
Related worksProblem
DescriptionProposed
Approach
Experimental
ResultsSummaryMotivation
Social media useRich informationUser timeHuge amont of text
Market Trend
others
Moroccan Arabic text
Social media increasing use. 17M active users in MoroccoUsers spend much time in social media platforms (average of 3h per day)Huge amount of shared text (UGT)73% of Moroccan UGT is written in Moroccan ArabicMarket trend is based on the content of the online news articles, sentiments, and events
Several opportunities to understand consumer through text analysis (promote products, reach potential consumers…)
NLP opportunities
Related works Proposed
Approach
Experimental
ResultsSummary
Problem
DescriptionMotivation
How to automatically examine Moroccan social media
text in order to generate
new and useful information ?
PreProcessing
Machine
translation
Sentiment
Analysis
Morphological
Analysis
Topics
detection
social media text
NLP
apps
❑ Noisy data *
❑ Spelling normalization
❑ Arabizi
Related works Experimental
ResultsSummary
Problem
DescriptionMotivation Proposed
Approach
* 37% of Moroccan User generated Text is noisyTachicart & Bouzoubaa, 2019. An Empirical Analysis of Moroccan Dialectal User-Generated Text. ICCCI, Hendaye 2019
Problems
System Authors Description Claimed
accuracy
CODA Habash et al. Writting standards for Arabic dialects -
CODAFy Eskander et al. (tool) converts EGY to CODA -
Tuni CODA Boujelbane et al. (tool) converts TUN to CODA 86%
MADARi Obeid et al. (tool) annotation & spelling correction of GULF -
UGT Afli et al. (tool) error correction system for Arabic UGT
machine translation
68%
Different solutions are proposed for the spelling inconsistency
cannot be extended to Moroccan
Moroccan Arabic has not been targeted yet.
Experimental
ResultsSummary
Problem
DescriptionMotivation Proposed
ApproachRelated works
Existing spelling normalization Works
Language Identification
Morphological Analysis
Normalized Text
Spelling
Normalization
MDA?NO
Moroccan
vocabulary
Machine
translation
Sentiment
Analysis
Morphological
Disambiguation
Topics
detection
Rule based Approach
Related worksProblem
DescriptionExperimental
ResultsSummaryMotivation Proposed Approach
Preprocessing
Moroccan Arabic Morphology
ProcliticPrefixLemmaSuffixEnclitic
وماكانخدموهاش
Stem
We do not process it
Word
Related worksProblem
DescriptionExperimental
ResultsSummaryMotivation Proposed Approach
Concatenative Morphology Templatic Morphology
morpheme
كان
lemma
خدم
morpheme
وها
root
خدم
pattern
كانفعلوهاكانخدموها
Related worksProblem
DescriptionExperimental
ResultsSummaryMotivation Proposed Approach
Language Identification
Morphological Analysis
Normalized Text
Spelling
Normalization
MDA?NO
Moroccan
vocabulary
Machine
translation
Sentiment
Analysis
Morphological
Disambiguation
Topics
detection
Preprocessing
Concatenative Morphology
morpheme
كان
lemma
خدم
morpheme
وها
Related worksProblem
DescriptionExperimental
ResultsSummaryMotivation Proposed Approach
Moroccan Reference Vocabulary
affixes + clitics
lexicon of lemmas
rules
Generator
concatenation orthography
compatibility
Related worksProblem
DescriptionExperimental
ConditionsSummaryMotivation Proposed Approach
MRV4.5M words
Language Identification
Morphological Analysis
Normalized Text
Spelling
Normalization
MDA?NO
Moroccan
vocabulary
Machine
translation
Sentiment
Analysis
Morphological
Disambiguation
Topics
detection
Rule based Approach
Related worksProblem
DescriptionExperimental
ResultsSummaryMotivation Proposed Approach
Preprocessing
Comparison & Results
Related worksProblem
DescriptionExperimental
ConditionsSummaryMotivation Proposed Approach
1
2
A
B
C
Moroccan vocabulary
Moroccan + MSA + NE
Vocabulary
automatic noise removal
manual normalization
UGT (700k words)
Spelling NormalizationInput word
?
Cleaned word
Levenshtein
distance measure
Candidate
words
yes
Noise removal
Lookup
algorithm
Related worksProblem
DescriptionExperimental
ConditionsSummaryMotivation Proposed Approach
Spelling Normalizer web interface
no
Moroccan + MSA + NE
Vocabulary
Spelling Normalization Evaluation
Recall Precision F-measure
Proposing
candidates
50% 69% 58%
Related worksProblem
DescriptionExperimental
ConditionsSummaryMotivation Proposed Approach
Test Corpus
3682 words
400 sentences
=
❑ Investigated the problem of text normalization
❑ Formalized the problem as a task of UGT standardization that involves the building of a reference vocabulary
❑ Proposed an approach to spelling normalization
❑ Spelling Normalization is a crucial stage towards processing the MDA UGT
Related worksProblem
Description
Proposed
ApproachMotivation Experimental
SetupSummary
Perspectives
Language Identification
Normalized Text
MDA?NO
Moroccan
vocabulary
Preprocessing
Spelling
Normalization
Arabic + Arabizi
Morphological
Analysis and
disambiguation
Resources
Morphological Analysis
Normalized
corpus
Questions
Demo
back
Diap
ositi
ve
11
back
back
Concatenation constraints
back
Affixes & clitics
كتب
كتاب
مكتب
كاتب
كتب
To write
A book
An office
A writer
books
affixes + clitics
back
Vocabulary Evaluation
Verbs Nouns Particles
# words 932 1453 615
OOV 11% 23% 7%
Related worksProblem
DescriptionExperimental
ResultsSummaryMotivation Proposed Approach