Keynote - Computational Processing of Arabic Dialects: Challenges, Advances and Future Directions
Transcript of Keynote - Computational Processing of Arabic Dialects: Challenges, Advances and Future Directions
Computational Processing of Arabic Dialects: Challenges, Advances & Future Directions
Keynote, The 2nd Workshop on Arabic Corpora and Processing Tools
LREC, May 24, 2016
Nizar Habash, New York University Abu Dhabi
CAMeL Lab
Roadmap
• Introduction
• Orthographic processing
• Morphological processing
• Working on a new dialect?
• Future directions
Introduction
• Forms of Arabic
  – Classical Arabic (CA)
    • Classical historical texts
    • Liturgical texts
  – Modern Standard Arabic (MSA)
    • News media & formal speeches and settings
    • Only written standard
  – Dialectal Arabic (DA)
    • Predominantly spoken vernaculars
    • No written standards
• Dialect vs. Language
Arabic and its Dialects
• Official language: Modern Standard Arabic (MSA)
  – No one's native language
• What is a 'dialect'?
  – Political and religious factors
• Regional dialects
  – Egyptian Arabic (EGY)
  – Levantine Arabic (LEV)
  – Gulf Arabic (GLF)
  – North African Arabic (NOR): Moroccan, Algerian, Tunisian
  – Iraqi, Yemenite, Sudanese, Maltese?
• Social dialects
  – City, Rural, Bedouin
  – Gender, Religious variants
Introduction
• Arabic Diglossia
  – Diglossia is where two forms of the language exist side by side
  – MSA is the formal public language
    • Perceived as "language of the mind"
  – Dialectal Arabic is the informal private language
    • Perceived as "language of the heart"
• General Arab perception: dialects are a deteriorated form of Classical Arabic
• Continuum of dialects
Code Switching
الأنامابعتقدألنهعمليةالليعمبيعارضوااليومتمديدللرئيسلحودهمالليطالبوابالتمديدللرئيسالهراويوبالتاليموضوعمنهموضوعمبدئيعلىاألرضأنابحترمأنهيكونفينظرةديمقراطيةلألموروأنهيكونفياحترامللعبةالديمقراطيةوأنيكونفيممارسةديمقراطيةوبعتقدإنهالكلفي
علىموضوعإنجازاتبسبدييرجعلحظةأكثريةساحقةفيلبنانتريدهذااملوضوع،لبنانأوفيلبنانمنالنظامرئاسينظامفيلبنانالنظامعنإنجازاتالعهدلكنهليعنينعمنحكيالعهد
عمليابيدالحكومةمجتمعةوالرئيسلحودأثبتهيرئاسيوبالتاليالسلطةنظامبعدالطائفليسشخصمسؤولفيمنصبمعنيوأناعشتهذااملوضوعبأنهملابيكونفياألخيرةممارستهخالل
صالحةضمنخطابومبادئخطابملابياخدمواقفشخصيابممارستيفيموضوعاالتصاالتالسلطةالتنفيذيةألنهمنهرئيسجمهوريةهويكونرئيسمشمطلوبمنإنماهوإلىجانبهالقسم
عليهالتوجيهعليهإبداءاملالحظاتعليهبقىفيلبنانمابعدإتفاقالطائفرئيسالسلطةالتنفيذيةالوطنيةالشاملةكييظلفيمصالحةوطنيةكييظلالقولماهوخطأوماهوصحعليهتثميرجهود
باتجاهيروحتوافقمابنياملسلمواملسيحيفيلبنانيحتضنأبناءهذاالبلدمايتركاملسارفيوآمنوافيهاالليمشيوامعهالخطأنعمإنماخطابالقسمكانموضوعمبادئطرحتهوملتزمفيها
التزموافيهاأناأثبتخاللاألربعسنواتباملمارسةالحكوميةأنيالتزمتفيهاوملاالتزمنابهذاأنابتفهمتمامااملوضوعكانالرئيسلحودإلىجنبنافيهذااملوضوع،أمااملوضوعالديمقراطي
فتحإعادةانتخابهذاهالوجهةالنظربسماممكننقولإنهالدستورأوتعديلههوأوإمكانيةمسحهيئةفيجمهوريةبواليةثانيةهوديمقراطيضمناملجلسوالتصويتإلىماهنالكلرئيس
قناعتيفيهذااملوضوع.يعنيجوهرالديمقراطيةهذاباألقل
MSA and dialect mixing in speech
• Phonology, morphology and syntax
Aljazeera transcript: http://www.aljazeera.net/programs/op_direction/articles/2004/7/7-23-1.htm
Legend: MSA / LEV
Why is Arabic processing hard?
Property | Arabic | English
Orthographic ambiguity | More | Less
Orthographic inconsistency | More | Less
Morphological inflections | More | Less
Morpho-syntactic complexity | More | Less
Word order freedom | More | Less
Dialectal variation | More | Less
Computational Processing of Standard Arabic
• There has been a large and growing amount of work on Standard Arabic processing:
  – Multiple morphological analyzers and taggers
    • BAMA/SAMA, Elixir, AlKhalil, ALMOR, MADAMIRA, etc.
  – Multiple treebanks and parsers
    • Penn ATB, Prague DTB, CATiB, Quran Corpus
  – Large collections of monolingual text
    • Gigaword, news collections, QALB, and others
  – Large collections of bilingual/multilingual text
    • UN corpus, news collections, etc.
  – Sentiment resources
    • ArSenL, SLSA, SAMAR, etc.
  – Not to mention the traditional resources on lexicography, morphology and syntax!
• Much more to do still!
• Resources and work on dialects are very limited in comparison.
Why Work on Arabic Dialects?
• Dialects are the primary form of Arabic used in all unscripted spoken genres: conversational, talk shows, interviews, etc.
  – Speech recognition and dialogue systems must model dialects
• Dialects are increasingly in use in new written media (newsgroups, weblogs, forums, etc.)
  – Text analytics of Arabic must include dialectal modeling
• Substantial Dialect-MSA differences impede direct application of MSA NLP tools
Computational Challenges
• Enormous variety
  – Many dialects and sub-dialects, code switching
• Orthographic ambiguity
  – Under-specification and inconsistency
• Morphological complexity
  – More clitics and fewer morphological features than MSA
• Overall annotated resource poverty
  – There is a lot of monolingual raw data
  – Limited lexicons
  – Limited treebanks, propbanks, etc.
Computational Solutions
• Treat Arabic dialects as different languages
  – Build resources and tools from scratch
    • Morphological analyzers, annotated treebanks, parallel data…
  – Pro: model different genres
  – Con: expensive, effort duplication
• Exploit similarity between dialects and MSA and among dialects
  – Convert (or relate) dialectal resources to MSA, or vice versa, to adapt
  – Pro: less duplication, exploits relationships
  – Con: there is a limit to how well this will work
• Hybrid approach
• Community standards
  – Orthography, morphological analysis, POS tag sets, treebanks, etc.
Dialectal Phonological Variations
• Major variants
Letter | MSA | Dialects
ق | /q/ | /q/, /k/, /ʔ/, /g/, /ʤ/
ث | /θ/ | /θ/, /t/, /s/
ذ | /δ/ | /δ/, /d/, /z/
ج | /ʤ/ | /ʤ/, /g/, /j/
• Some of many limited variants
  – /l/ → /n/: MSA /burtuqāl/ → LEV /burtʔān/ 'orange'
  – /ʕ/ → /ħ/: MSA /kaʕk/ → EGY /kaħk/ 'cookie'
  – Emphasis add/delete: MSA /fustān/ → LEV /fusṭān/ 'dress'
Arabic Script Orthographic Variants
Sound | IRQ | LEV | EGY | TUN | MOR
/ʤ/ | ج | ج | چ | ج | ج
/g/ | گ | چ | ج | ڨ | ڭ
/tʃ/ | چ | تش | تش | تش | تش
/p/ | پ | پ | پ | پ | پ
/v/ | ڤ | ڤ | ڤ | ڥ | ڥ
Latin Script for Arabic?
• Several proposals to the Arabic Language Academy in the 1940s
• Said Akl Experiment (1961)
• Web Arabic (Arabizi, Arabish, Franco-arabe)
  – No standard, but common conventions
Arabic | IPA | Latin
أ إ آ ء ؤ ئ | /ʔ/ | ‘ 2 Ø
ث | /θ/ | th
ة | /a/, /t/ | a t
ط | /tʕ/ | t T 6
ح | /ħ/ | H h 7
ع | /ʕ/ | ‘ 3 Ø
خ | /x/ | kh 7’ x 8
غ | /ʁ/ | g gh 3’
ذ | /δ/ | th
ق | /q/ | q
ش | /ʃ/ | sh ch
ي | /y/ /ay/ /ī/ /ē/ | y, i, e, ai, ei, …
[Image: Akl 1961]
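The digit conventions in the table can be sketched as a simple lookup. This is a minimal, hypothetical illustration and not how any real Arabizi converter works: the names `ARABIZI_DIGITS` and `map_arabizi_digits` are invented here, only the digit conventions listed above are handled, "2" is mapped to a bare hamza although the table allows several hamza forms, and Latin letters are left untouched because they are ambiguous without context.

```python
# Hypothetical sketch: digit conventions taken from the table above.
ARABIZI_DIGITS = {
    "3'": "غ", "3’": "غ",   # digit plus apostrophe (straight or curly)
    "7'": "خ", "7’": "خ",
    "2": "ء",               # assumption: bare hamza; the seat is not recoverable
    "3": "ع",
    "6": "ط",
    "7": "ح",
}

def map_arabizi_digits(text: str) -> str:
    """Replace Arabizi digit conventions with Arabic letters, longest match first."""
    out = []
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if len(pair) == 2 and pair in ARABIZI_DIGITS:
            out.append(ARABIZI_DIGITS[pair])   # two-character convention such as 7'
            i += 2
        else:
            out.append(ARABIZI_DIGITS.get(text[i], text[i]))  # single digit or pass-through
            i += 1
    return "".join(out)
```

The two-character conventions are checked first so that 7' is not read as 7 followed by a stray apostrophe. A usable converter needs context to resolve the Latin letters, which is why the problem is usually treated with learned models rather than a table.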
Lack of Orthographic Standards
• Orthographic inconsistency
• Egyptian /mabinʔulhalakʃ/
  – mAbinquwlhAlak$ مابنقولهالكش
  – mAbin&ulhalak$ مابنؤلهالكش
  – mAbin}ulhAlak$ مابنئلهالكش
  – mAbinqulhAlak$ مابنقلهالكش
  – …
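The Latin strings next to the Arabic spellings use the Buckwalter transliteration, a one-to-one character mapping. A minimal sketch of the letter mapping follows; the function name is hypothetical, and dropping the diacritic symbols is an assumption made here so that the output resembles undiacritized raw text.

```python
# Standard Buckwalter letter mapping (letters only).
BUCKWALTER_TO_ARABIC = {
    "'": "ء", "|": "آ", ">": "أ", "&": "ؤ", "<": "إ", "}": "ئ",
    "A": "ا", "b": "ب", "p": "ة", "t": "ت", "v": "ث", "j": "ج",
    "H": "ح", "x": "خ", "d": "د", "*": "ذ", "r": "ر", "z": "ز",
    "s": "س", "$": "ش", "S": "ص", "D": "ض", "T": "ط", "Z": "ظ",
    "E": "ع", "g": "غ", "f": "ف", "q": "ق", "k": "ك", "l": "ل",
    "m": "م", "n": "ن", "h": "ه", "w": "و", "Y": "ى", "y": "ي",
}
# Buckwalter symbols for short vowels, sukun, shadda and nunation.
DIACRITICS = set("aiuo~FKN")

def buckwalter_to_undiacritized_arabic(text: str) -> str:
    """Convert Buckwalter text to Arabic script, dropping diacritic symbols."""
    out = []
    for ch in text:
        if ch in DIACRITICS:
            continue  # assumption: produce undiacritized text
        out.append(BUCKWALTER_TO_ARABIC.get(ch, ch))  # unknown symbols pass through
    return "".join(out)
```

Applied to the last variant in the list, `mAbinqulhAlak$`, this yields the Arabic spelling shown beside it.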
Spelling Inconsistency
• Social media spelling variations
  – +ak
  – +aaaaak
  – +k
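Elongations such as +aaaaak are the easiest of these variations to undo. The sketch below is a simple heuristic and not the method of any system in this talk: it collapses any character repeated three or more times into one, and leaves doubled letters alone because those are often legitimate. The function name is hypothetical.

```python
import re

# A character followed by two or more copies of itself.
_ELONGATION = re.compile(r"(.)\1{2,}")

def collapse_elongation(token: str) -> str:
    """Collapse runs of 3+ identical characters into a single character."""
    return _ELONGATION.sub(r"\1", token)
```

For example, `collapse_elongation("+aaaaak")` returns `"+ak"`. The variant +k (vowel dropped entirely) cannot be recovered this way and needs lexical or morphological knowledge.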
Arabic Lexical Variation
• Arabic dialects vary widely lexically
• Arabic orthography allows consolidating some variations
English | Table | Cat | Of | I_want | There_is | There_isn't
MSA | Tāwila طاولة | qiTTa قطة | idafa Ø | 'uridu اريد | yūjadu يوجد | lā yujadu لا يوجد
Moroccan | mida ميدة | qeTTa قطة | dyāl ديال | bγīt بغيت | kāyn كاين | mā kāynš ماكاينش
Egyptian | Tarabēza طربيزة | 'oTTa قطة | bitāς بتاع | ςāwez عاوز | fi في | mafiš مفيش
Syrian | Tāwle طاولة | bisse بسة | tabaς تبع | biddi بدي | fi في | māfi مافي
Iraqi | mēz ميز | bazzūna بزونة | māl مال | 'arīd اريد | aku اكو | māku ماكو
CODA: A Conventional Orthography for Dialectal Arabic
• Developed by CADIM for computational processing
• Objectives
  – CODA covers all DAs, minimizing differences in choices
  – CODA is easy to learn and produce consistently
  – CODA is intuitive to readers unfamiliar with it
  – CODA uses Arabic script
• Inspired by previous efforts from the LDC and linguistic studies
CODA Examples
CODA | الامتحانات | قبل | اللي | الفترة | صحابي | ماشفتش
gloss | the exams | before | which | the period | my friends | I did not see
Spelling variants
متحاناتإلا بلأ ـىاللـ هالفتر ـىصحابـ شفتشماـمتحاناتلـا بلا لليإ ةرطـالفـ حابيوصـ شفتشمـ
ناتـحـاالمتـ abl ـىللـإ هرطـالفـ ـىحابـوصـ فتشوماشـناتـحـمتـإلا qbl ـيلـا il�ra Su7abi فتشوشـماناتـحــمتـلـا qabl لىا sohaby فتشوشـمـ
ilimB7anat ـيإلـ masho�ish
limBhanaat إلىilli
CODA Examples
Phenomenon | Original | CODA
Spelling errors: typos | الاجابه، شبب | الإجابة، سبب
Spelling errors: speech effects | كبييييييييير | كبير
Spelling errors: merges | اليومبريستيج | اليوم بريستيج
Spelling errors: splits | المع روف | المعروف
MSA root cognate | آلب، كلب | قلب
Dialectal clitic guidelines | عهلبيت، مشفناش | عهالبيت، ماشافناش
Unique dialect words | بردو، برضو | برضه
CODAfication: Raw Orthography to CODA Conversion
• What:
  – Converts from raw DA orthography to CODA
  – Corrects typos and various speech effects
• Approach
  – Eskander et al. (2012) (CODAFY)
    • Model specific phenomena: hamza, plural wA suffix, etc.
    • Supervised learning
    • Classification problem
  – Farra et al. (2014)
    • Generalized character replacement model
  – Best results: integrated in morphological analysis (MADA-ARZ)
• Example:
  – Input: مشفتش صحابى الفتره الى فاتت (m$ft$ SHAbY Alftrh AlY fAtt)
  – Output: ما شفتش صحابي الفترة اللي فاتت (mA $ft$ SHAby Alftrp Ally fAtt)
• Evaluation: Egyptian Arabic
CODAfication | Accuracy (tokens) | A/Y Norm. Accuracy (tokens)
Baseline (doing nothing) | 76.8% | 90.5%
CODAFY v0.4 | 91.5% | 95.2%
MADA-ARZ | 92.9% | 95.5%
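The two accuracy columns can be illustrated with a short sketch. It assumes that "A/Y Norm." means Alif/Ya normalization, that is, collapsing the hamzated Alif forms to bare Alif and Alif Maqsura to Ya before comparing tokens; the slide does not spell this out. The function names are hypothetical, and the sketch assumes system output and gold tokens are already aligned one to one.

```python
# Assumed Alif/Ya normalization: hamzated/madda Alif -> bare Alif, Alif Maqsura -> Ya.
ALIF_YA = str.maketrans({
    "\u0623": "\u0627",  # ALEF WITH HAMZA ABOVE -> ALEF
    "\u0625": "\u0627",  # ALEF WITH HAMZA BELOW -> ALEF
    "\u0622": "\u0627",  # ALEF WITH MADDA ABOVE -> ALEF
    "\u0649": "\u064A",  # ALEF MAKSURA -> YEH
})

def normalize_alif_ya(token: str) -> str:
    return token.translate(ALIF_YA)

def token_accuracy(predicted, gold, normalize=False):
    """Fraction of aligned tokens that match, optionally after Alif/Ya normalization."""
    if len(predicted) != len(gold) or not gold:
        raise ValueError("predicted and gold must be non-empty and aligned")
    if normalize:
        predicted = [normalize_alif_ya(t) for t in predicted]
        gold = [normalize_alif_ya(t) for t in gold]
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)

# Three raw tokens from the example above compared with their CODA forms.
raw = ["صحابى", "الفتره", "فاتت"]
coda = ["صحابي", "الفترة", "فاتت"]
strict = token_accuracy(raw, coda)
lenient = token_accuracy(raw, coda, normalize=True)
```

Normalization forgives the final Alif Maqsura in the first token but not the Ta Marbuta error in the second, which is why the normalized column is always at least as high as the strict one.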
3Arrib: Arabizi-to-Arabic Conversion
• A system for automatic mapping of Arabizi to Arabic script in CODA
• Evaluation
  – Transliteration is correct for 83.6% of Arabic words and names.
• Examples (Arabizi input, Buckwalter transliteration, Arabic output)
  – ana msh 3aref a2ra elly enta katbo
    AnA m$ EArf AqrA Ally Ant kAtbh
    انا مش عارف اقرا اللي انت كاتبه
  – wfelaa5ertele3fshenkwmab2raasharabic
    wflAxrTlEf$nkwmab2raashArAbyk
    ارابيكmab2raashو+فال+اخرطلعفشنكو
(Al-Badrashiny et al., CONLL 2014; Eskander et al., EMNLP CodeSwitch Workshop 2014)
3Arrib: http://nlp.ldeo.columbia.edu/arrib/
Dialectal Arabic Morphological Variation
• Nouns
  – No case marking
    • Word order implications
  – Paradigm reduction
    • Consolidating masculine & feminine plural
• Verbs
  – Paradigm reduction
    • Loss of dual forms
    • Consolidating masculine & feminine plural (2nd, 3rd person)
    • Loss of morphological moods
      – Subjunctive/jussive form dominates in some dialects
      – Indicative form dominates in others
• Other aspects increase in complexity
DA Morphological Variation: Verb Morphology
Morpheme labels in the figure: conj, verb, object, subj, tense, IOBJ, neg, neg
MSA: ولم تكتبوها له
  /walam taktubūhā lahu/
  /wa+lam taktubū+hā la+hu/
  and+not_past write_you+it for+him
EGY: وماكتبتوهالوش
  /wimakatabtuhalūʃ/
  /wi+ma+katab+tu+ha+lū+ʃ/
  and+not+wrote+you+it+for_him+not
'And you didn't write it for him'
DA Morphological Variation
Variety | Perfect: Past | Imperfect: Subjunctive | Imperfect: Present habitual / progressive | Imperfect: Future
MSA | كتب /kataba/ | يكتب /jaktuba/ | يكتب /jaktubu/ | سيكتب /sajaktubu/
LEV | كتب /katab/ | يكتب /jiktob/ | بيكتب /bjoktob/ (habitual), عم بيكتب /ʕam bjoktob/ (progressive) | حيكتب /ħajiktob/
EGY | كتب /katab/ | يكتب /jiktib/ | بيكتب /bjiktib/ | هيكتب /hajiktib/
IRQ | كتب /kitab/ | يكتب /jiktib/ | ديكتب /dajiktib/ | رح يكتب /raħ jiktib/
MOR | كتب /kteb/ | يكتب /jekteb/ | كيكتب /kjekteb/ | غيكتب /ʁajekteb/
DA Morphological Variation: Verb Conjugation
Variety | Perfect 1S | Perfect 2S♂ | Perfect 2S♀ | Imperfect 1S | Imperfect 1P | Imperfect 2S♀
MSA | كتبت /katabtu/ | كتبت /katabta/ | كتبت /katabti/ | اكتب /aktubu/ | نكتب /naktubu/ | تكتبين /taktubīna/, تكتبي /taktubī/
LEV | كتبت /katabt/ | كتبت /katabt/ | كتبتي /katabti/ | اكتب /aktob/ | نكتب /noktob/ | تكتبي /toktobi/
IRQ | كتبت /kitabt/ | كتبت /kitabt/ | كتبتي /kitabti/ | اكتب /aktib/ | نكتب /niktib/ | تكتبين /tikitbīn/
MOR | كتبت /ktebt/ | كتبتي /ktebti/ | كتبتي /ktebti/ | نكتب /nekteb/ | نكتبوا /nektebu/ | تكتبي /tektebi/
Morphological Ambiguity
• Morphological richness
  – Token Arabic/English = 80%
  – Type Arabic/English = 200%
• Morphological ambiguity
  – Each word: 12.3 analyses and 2.7 lemmas
• Derivational ambiguity
  – العين: the eye, the water spring, Al-Ain city, the notable
Analysis vs. Disambiguation
هل سينجح بين أفليك في دور باتمان؟
Will Ben Affleck be a good Batman?
Analyses of بين:
POS | Diacritized form | Gloss
PV+PVSUFF_SUBJ:3MS | bay~an+a | He demonstrated
PV+PVSUFF_SUBJ:3FP | bay~an+~a | They demonstrated (f.p.)
NOUN_PROP | biyn | Ben
ADJ | bay~in | Clear
PREP | bayn | Between, among
• Morphological analysis is out-of-context
• Morphological disambiguation is in-context
MADA (Habash & Rambow, 2005; Roth et al., 2008) and MADAMIRA (Pasha et al., 2014)
[Figure: a word W0 in its context window W-4 … W4; its candidate analyses are ranked 1st, 2nd, 3rd, 4th, 5th]
• Morphological analyzer
  – Rule-based
  – Human-created
• Morphological classifiers
  – Multiple independent classifiers
  – Corpus-trained
• Ranker
  – Heuristic or corpus-trained
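The analyze, classify, rank flow above can be sketched as follows. This is a simplified illustration under stated assumptions, not the MADA or MADAMIRA implementation: the function name `rank_analyses`, the feature names and values, and the uniform default weights are all hypothetical. The analyzer's out-of-context candidates are scored by how many of the classifiers' in-context feature predictions they agree with.

```python
from typing import Dict, List, Optional

Analysis = Dict[str, str]

def rank_analyses(
    analyses: List[Analysis],
    predictions: Dict[str, str],
    weights: Optional[Dict[str, float]] = None,
) -> List[Analysis]:
    """Sort candidate analyses by weighted agreement with classifier predictions."""
    weights = weights or {}

    def score(analysis: Analysis) -> float:
        return sum(
            weights.get(feature, 1.0)            # assumption: weight 1.0 unless tuned
            for feature, value in predictions.items()
            if analysis.get(feature) == value
        )

    return sorted(analyses, key=score, reverse=True)

# Hypothetical candidates for the ambiguous word in the Batman example.
candidates = [
    {"gloss": "he demonstrated", "pos": "verb", "gen": "m", "num": "s"},
    {"gloss": "Ben", "pos": "noun_prop", "gen": "m", "num": "s"},
    {"gloss": "clear", "pos": "adj", "gen": "m", "num": "s"},
    {"gloss": "between, among", "pos": "prep", "gen": "na", "num": "na"},
]
# Hypothetical in-context predictions from independent per-feature classifiers.
predicted = {"pos": "noun_prop", "gen": "m", "num": "s"}
ranked = rank_analyses(candidates, predicted)
```

Keeping the analyzer and the classifiers separate means the ranker can only choose among analyses the analyzer licenses, so the final output is always a complete, internally consistent analysis.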
MADAMIRA
• Newest tool from the CADIM group (Pasha et al., 2014)
• Combines MADA (Habash & Rambow, 2005) and AMIRA (Diab et al., 2004)
  – Morphological disambiguation
  – Tokenization
  – Base phrase chunking
  – Named entity recognition
• MSA and Egyptian Arabic modes
• Server mode with XML interface
• Online demo
  – http://nlp.ldeo.columbia.edu/madamira/
  – http://camel.abudhabi.nyu.edu/madamira/
Pipeline: Input Arabic Text → Morphological Disambiguation → Tokenization → Base Phrase Chunking → Named Entity Recognition → User NLP Applications
Morphological Disambiguation
System | MDMRA-MSA | MDMRA-MSA | MADA-ARZ | MADA-ARZ
Training Data | MSA | MSA | ARZ | MSA+ARZ
Test Set | MSA | EGY | EGY | EGY
All | 84.3% | 27.0% | 75.4% | 64.7%
POS + Features | 85.4% | 35.7% | 84.5% | 75.5%
Full Diacritization | 86.4% | 32.2% | 83.2% | 72.2%
Lemmatization | 96.1% | 67.1% | 86.3% | 82.8%
Base POS-tagging | 96.1% | 82.1% | 91.1% | 91.4%
ATB Segmentation | 99.1% | 90.5% | 97.4% | 97.5%
Example: wkAtb وكاتب 'and (the) writer of'
  Analysis: wakAtibu kAtib_1 pos:noun prc3:0 prc2:wa_conj prc1:0 prc0:0 per:3 asp:na vox:na mod:na gen:m num:s stt:c cas:n enc0:0
  Tokenization: w+ kAtb
CALIMA-Egyptian v0.5
• CALIMA is the Columbia Arabic Language Morphological Analyzer
• CALIMA-EGY
  – Extends the Egyptian Colloquial Arabic Lexicon (ECAL) (Kilany et al., 2002) and the Standard Arabic Morphological Analyzer (SAMA) (Graff et al., 2009).
  – Follows the part-of-speech (POS) guidelines used by the LDC for Egyptian Arabic (Maamouri et al., 2012b).
  – Accepts multiple orthographic variants and normalizes them to CODA (Habash et al., 2012).
  – Incorporates annotations by the LDC for Egyptian Arabic (~250K words).
CALIMA-ARZ Example
Input: mktbtlhA$ مكتبتلهاش
Analysis 1
  Lemma: katab_1
  CODA: mA_katabt_lahA$
  POS: mA/NEG_PART+katab/PV+t/PVSUFF_SUBJ:2MS++li/PREP+hA/PRON_3FS+$/NEG_PART
  Gloss: not + write + you + to/for + it/them/her + not
Analysis 2
  Lemma: katab_1
  CODA: mA_katabit_lahA$
  POS: mA/NEG_PART+katab/PV+it/PVSUFF_SUBJ:3FS+li/PREP+hA/PRON_3FS+$/NEG_PART
  Gloss: not + write + she/it/they + to/for + it/them/her + not
CALIMA-Egyptian v0.5
• Incorporates LDC ARZ annotations (p1-p6)
  – 251K tokens, 52K types
  – Annotation cleanup needed
  – Extends SAMA (Standard Arabic Morph Analyser)
System | Token Recall | Type Recall
SAMA v3.1 (Standard Arabic) | 67.7% | 59.7%
CALIMA-EGY v0.5 (Egyptian core) | 88.7% | 75.8%
CALIMA-EGY v0.5 (++ SAMA dialect extensions) | 92.6% | 81.5%
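The difference between the two recall columns can be made concrete with a small sketch. The slide does not define the measures, so this assumes a word counts as recalled when the analyzer covers it (here reduced to a yes/no callback), with token recall computed over running words and type recall over distinct word forms. `analyzer_recall` and `covers` are hypothetical names.

```python
from typing import Callable, List, Tuple

def analyzer_recall(
    tokens: List[str],
    covers: Callable[[str], bool],
) -> Tuple[float, float]:
    """Return (token recall, type recall) of an analyzer over a tokenized corpus."""
    if not tokens:
        raise ValueError("tokens must not be empty")
    types = set(tokens)
    token_recall = sum(covers(t) for t in tokens) / len(tokens)
    type_recall = sum(covers(t) for t in types) / len(types)
    return token_recall, type_recall

# Toy corpus: a frequent covered word and a rare uncovered one.
toy_lexicon = {"a"}
result = analyzer_recall(["a", "a", "a", "b"], lambda w: w in toy_lexicon)
```

Here the result is (0.75, 0.5). Frequent words are the ones most likely to be in the lexicon, which is consistent with token recall being higher than type recall in every row of the table.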
Morphological Disambiguation
System | MDMRA-MSA | MDMRA-MSA | MDMRA-EGY | MDMRA-EGY
Training Data | MSA | MSA | EGY | MSA+EGY
Test Set | MSA | Egyptian Arabic (EGY) | Egyptian Arabic (EGY) | Egyptian Arabic (EGY)
All | 84.3% | 27.0% | 75.4% | 64.7%
POS + Features | 85.4% | 35.7% | 84.5% | 75.5%
Full Diacritization | 86.4% | 32.2% | 83.2% | 72.2%
Lemmatization | 96.1% | 67.1% | 86.3% | 82.8%
Base POS-tagging | 96.1% | 82.1% | 91.1% | 91.4%
ATB Segmentation | 99.1% | 90.5% | 97.4% | 97.5%
MADAMIRA: http://camel.abudhabi.nyu.edu/madamira/
Towards Morphological Tagging of a New Dialect?
• Review the literature
  – Hidden gems from previous efforts
• Data collection
• Data annotation
  – Guidelines: CODA, POS tags, etc.
  – Noisy automatic processing: Egyptian MADAMIRA?
  – Training annotators, quality control
  – Annotated data is necessary, at the very least for benchmarking
• Building the morphological analyzer
  – Eskander et al. (2013)'s technique for paradigm completion
  – Salloum and Habash's (2011) ADAM method for extending MSA
• Building the morphological tagger
  – MADAMIRA framework, e.g. Egyptian Arabic (Habash et al. 2012)
  – Other tagging techniques
• Curras Corpus (Jarrar et al., 2014)
• Gumar Corpus (Khalifa et al., 2016)
The Gumar Corpus: A Morphologically Annotated Corpus of Gulf Arabic
• ~100 million words
• Mainly long conversational novels published anonymously online (روايات النت 'Internet novels').
• Writers of the novels remain anonymous under pen names. Although there is no claim of copyright, it is conventional to credit the writer, per the writer's request, when the material is copied or transferred.
السالم علیكم
القصه هاذي قطریه روعه أتمنى انها تعجبكم طبعا أهي قصه منقوله من منتدى ثاني
وطبعا مصرحه الكاتبه نقل القصه مع ذكر اسمها وهي الكاتبهتحفه فنیه )) القطریه ((
نبدأ .....
الكاتبة تحفة فنیة
الفصل االول :-
وضحة والتوتر بدا یظهر علیها : الجازي ماتدرین عمي متى بیجي ؟ الجازي : واهللا یختس مدري بس ماهو باطي ،اله انت وشعندس الیوم على ابوي
؟ اخبرس ما تحبین مقعاد معاه ؟توترها : سالمتس بس بغیت اسلم علیه قبل ما وضحة وهي تحاول السیطرة على یجي حمد و نروح البیت ، قدلي كم مرة اجي وال القاه عد مهب عدله من زمان
ماوجهته . الجازي وهي تغمز عینها : ماوجهتي ابوي وال تنطرین ناس ؟
خجل على طول صار وجه وضحة احمر مثل الطماطم ، والجازي اعتبرت انه وتمت تضحك على وضحة ما تدري ان سبب احمرار وضحة هو القهر وجرح
الكرامة الى تحس به من بدت تلمح عن راشد و تقول في نفسها ماتدرین یالجازي، وفي هذه اللحظة انزلت علیهم ام راشد مرت عم ان اتمنه العمى وال اشوفه
وضحة جایه من غرفتها وفي ایدها كیسه كبیره ومدته على وضحة وهي تقول :خلها توزعه كلن وضحة یمس هذي صوغتن لكم من عند راشد عطیها امس
تعطیه حقه .
An example of raw text (Qatari) from a novel
Gumar Corpus Statistics
Words | 112,410,688
Sentences | 9,335,224
Documents | 1,236
• Words are whitespace tokenized and the counts include punctuation.
• The number of sentences represents the number of lines.
• Each document generally represents a single novel.
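The counting conventions above are simple enough to sketch. This is an illustration of those conventions, not the script used to build the corpus; the function name is hypothetical, and skipping blank lines is an assumption that the slide does not confirm.

```python
from typing import Dict, List

def corpus_counts(documents: List[str]) -> Dict[str, int]:
    """Count whitespace tokens, non-empty lines and documents."""
    words = 0
    sentences = 0
    for text in documents:
        for line in text.splitlines():
            if not line.strip():
                continue  # assumption: blank lines are not counted as sentences
            sentences += 1
            # Whitespace tokenization: punctuation counts when it stands alone
            # and stays attached to a word otherwise.
            words += len(line.split())
    return {"words": words, "sentences": sentences, "documents": len(documents)}
```

With these conventions the reported figures work out to roughly 12 words per line and about 91,000 words per document.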
Gumar Corpus Dialect Distribution (document level)
Dialect | Percentage
SA | 60.52
AE | 13.35
KW | 5.91
OM | 1.13
QA | 0.65
BH | 0.94
GA (other) | 10.03
Arabic (other) | 7.93
• 92% of the corpus is written in GA, with SA being the most dominant.
• GA (other) covers novels that combine several GA dialects, as well as cases of dialect ambiguity (especially between OM, QA and AE).
• The rest of the corpus (7.93%) is mostly MSA (original text or attempted translations of existing non-Arabic text) and other DA such as Egyptian, Iraqi, Levantine, etc.
Morphological Analysis Evaluation
• A preliminary investigation into GA annotation was performed.
• 4,000 words of text were annotated manually for:
  – Orthography (CODA)
  – Morphology (tokenization)
  – Part-of-speech
  – Lemma
• The same text was given to MADAMIRA (MSA & EGY)
  – The outputs were then evaluated against the gold standard.
Gulf CODA
• CODA: Conventional Orthography for Dialectal Arabic (Habash et al. 2012).
• CODA guidelines exist for both EGY and PAL (Palestinian Arabic).
• CODA guidelines for different dialects share general rules that apply to all.
• Exceptional cases differ from one dialect to another.
• One main feature that differs among dialects is the root consonant mapping rules.
• General rules: spelling of Al, Ta Marbuta, clitic attachment
• Other examples of specific spelling…
  – سيدا، مب، مانيب، +ج\+ك
MSA/CODA | Variants | CODA compliant | CODA non-compliant
ق | /q/ or /ɡ/ or /ʤ/ | قدام | جدام
ك | /k/ or /ʧ/ or /ts/ | كبد، كذب | جبد، تسذب
ج | /ʤ/ or /j/ | جلس | يلس
ش | /ʃ/ or /ʧ/ | شاي | چاي
CODAfied text examples
Example 1
  Raw: ياويلتس منتس هالحتسي اسمع
  CODA: ياويلج منج هالحكي اسمع
  English:
Example 2
  Raw: جاهز؟ الغدى عسى
  CODA: جاهز؟ الغدا عسى
  English:
Example 3
  Raw: الجامعهفياللحنياناصغيررونهمنيبساره
  CODA: الجامعةفيالحنياناصغيرونةمانيبسارة
  English:
An Annotation Example
Morphological Analysis Evaluation
• Accuracy of MADAMIRA's automatic output, in two modes (MSA and EGY), measured against the manually annotated features.
• MADAMIRA-EGY outperforms MADAMIRA-MSA on all four features, confirming that it is the better baseline for manual annotation.
• Similar conclusions were reported by Jarrar et al. (2014)
Feature | MADAMIRA-MSA | MADAMIRA-EGY
Ortho | 83.81 | 88.34
Morph | 76.16 | 83.62
POS | 72.37 | 80.39
Lemma | 64.03 | 81.51
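Per-feature accuracies like those in the table can be computed with a few lines. The sketch below assumes one gold and one predicted feature dictionary per word, aligned one to one, and exact-match scoring; the function name and the feature keys are hypothetical, and the actual evaluation setup may differ.

```python
from typing import Dict, List, Sequence

def feature_accuracy(
    gold: List[Dict[str, str]],
    predicted: List[Dict[str, str]],
    features: Sequence[str],
) -> Dict[str, float]:
    """Percentage of words whose predicted value matches gold, per feature."""
    if len(gold) != len(predicted) or not gold:
        raise ValueError("gold and predicted must be non-empty and aligned")
    scores = {}
    for feature in features:
        correct = sum(
            g[feature] == p.get(feature)      # a missing prediction counts as wrong
            for g, p in zip(gold, predicted)
        )
        scores[feature] = 100.0 * correct / len(gold)
    return scores
```

Running the same function on the output of each MADAMIRA mode against the same gold annotations gives one column of the table per mode.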
Summary & Future Directions
• Arabic dialects pose many challenges to NLP
  – No orthographic standards
  – Limited resources
  – Large number of differences from MSA
• A combination of solutions works best
  – Exploit similarities between dialects and MSA
  – Exploit similarities among dialects
  – Address differences through resource building
• Our goal is to bring basic support for MSA and the dialects to the level of English
  – So we can focus more on higher-level applications!
Summary & Future Directions
Although dialect processing may seem daunting, just remember:
• Breathe! There are rules in the dialects. Just not the same rules as the ones in MSA.
• All these challenges are amazing opportunities to advance NLP
  – Not just for Arabic but for all languages.
• For Arabic native speakers, working with dialects is an eye opener (and can be a lot of fun!)
Announcements
• Project MADAR
  – Multi-Arabic Dialect Applications and Resources
  – QNRF funded project
  – Collaboration among CMUQ, NYUAD and Columbia
  – Modeling 25 Arabic city dialects
    • Lexical resources, parallel data, dialect id, dialect MT
  – Looking for linguists and postdocs!
• WARDAT 2016
  – First Workshop on Arabic Dialect Technologies
  – Discuss future of collaborations on Arabic Dialect Technologies
  – Funded by the NYUAD Institute; to be held in NYU Abu Dhabi
  – By invitation. Limited slots. Contact me if interested.
• CAMeL Lab
  – Hiring postdocs!
  – Funded NYU PhD Program in Computer Science.
  – Contact me if interested.
• http://nyuad.nyu.edu/en/
Thank You! Questions?