[IEEE 2013 IEEE Digital Signal Processing and Signal Processing Education Meeting (DSP/SPE) - Napa,...
DIACRITIZATION, AUTOMATIC SEGMENTATION AND LABELING FOR LEVANTINE
ARABIC SPEECH
1Yousef A. Alotaibi, 2Ali H. Meftah, 3Sid-Ahmed Selouani
1,2 College of Computer and Information Sciences, King Saud University, Riyadh, Saudi Arabia
3LARIHS Lab. Université de Moncton, Campus de Shippagan, Canada
{yaalotaibi, ameftah}@ksu.edu.sa, [email protected]
ABSTRACT
It is generally acknowledged that a reliable speech corpus is
necessary for any application involving speech processing. In
this paper, we propose methods to improve the BBN/AUB
DARPA Babylon Levantine Arabic speech corpus to increase
its reliability and efficiency. For this purpose, pronunciation correction, diacritization, and a new transcription are performed manually, along with automatic phoneme segmentation and labeling. The comparison with the original
transcription of the corpus shows a clear improvement in the
output results.
Index Terms— BBN/AUB, Levantine, dialect, transcription,
diacritics
1. INTRODUCTION
Speech corpora are essential for many applications and
automatic speech recognition (ASR) systems, as well as for
language identification and speaker verification. Speech
corpora are very important in linguistic fields, such as in the
areas of phonetics, phonology, typology and sociolinguistics
[1]. Appropriate and well-organized corpora are fundamental
for the development of robust speech recognition systems [2].
Arabic ASR system development has faced many
difficulties and challenges, such as the non-availability of
large corpora, the existence of many dialects with numerous
pronunciations, and the fact that Arabic script does not allow
for full vocalization of text, requiring the reader to infer short
vowels and other missing cues from the specific context [3].
Modern standard Arabic (MSA) is a formal language and is
used in most printed materials in the Arab World and in most
radio and TV broadcasts. Dialectal Arabic is the natural language spoken in everyday life. There are many Arabic dialects, and almost every country has its own dialect [4].
Arabic dialects can be divided in many ways, and
geography and social factors are well-known dividing factors.
As an example, dialects can be divided into Western Arabic
dialects, which include Algerian, Moroccan, Tunisian, and
Libyan dialects, and Eastern Arabic dialects, which include
Gulf, Egyptian, Damascus, and Levantine dialects [4].
Another Arabic dialect classification is based on
the division of the dialects into five groups: Gulf Arabic
(GLF) includes the dialects of Kuwait, Saudi Arabia,
Bahrain, Qatar, the United Arab Emirates, and Oman. Iraqi
Arabic is considered a sub-dialect of Gulf Arabic. Levantine
Arabic (LEV) includes the dialects of Lebanon, Syria, Jordan,
and Palestine. Egyptian Arabic (EGY) covers the dialects of
the Nile valley, namely Egypt and Sudan. Maghrebi
(Western) Arabic covers the dialects of Morocco, Algeria,
Tunisia and Mauritania. Libya is sometimes included in this
group. Yemenite Arabic is often considered to be in a class of
its own [5].
Arabic dialects are not used in written form, so the preparation of appropriate speech corpora for any Arabic dialect is more difficult. Most research on Arabic ASR systems is related to MSA, while dialectal Arabic has received less attention [6], [7].
In this paper, we present our work on the improvement of the
Arabic speech corpus called BBN/AUB DARPA Babylon
[8], which covers the Levantine dialect of Arabic. In fact, there is no baseline system to compare our improved corpus against, but we emphasize that all short vowels of the spoken Arabic in this corpus are missing, and we tried to restore them. Another assessment consists of performing a
basic phone recognition task using the original and improved
corpora. This task is expected to demonstrate the
effectiveness and reliability of the corrections we made on the
original BBN/AUB DARPA corpus.
This paper is organized as follows. Section 2 presents the
objective of this study and how it relates to prior work. The description of the BBN/AUB corpus and its problems are given in Section 3 and Section 4, respectively. Experimental results and discussion are given in Section 5, followed by the conclusion and acknowledgements.
2. OBJECTIVES AND RELATED WORK
Most Arabic speech corpora are available with non-diacritized transcriptions. Restoring the correct diacritization improves Arabic language processing by recovering the missing vowels. Many works have studied how to automatically estimate the missing diacritics from context [9,10,11,12,13,14,15]. However, the problem still remains
unsolved, and the Word Error Rate (WER) of automatic diacritization systems ranges between 15% and 25%; the available commercial applications for automatic diacritization still need manual review in order to achieve a lower WER [16].
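For reference, the WER figures above are edit-distance-based scores. The following is a minimal sketch of the standard Levenshtein-based WER computation; it is our own illustration, not code from any of the cited systems.

```python
def wer(ref, hyp):
    """Word Error Rate: (substitutions + deletions + insertions) / reference length,
    computed as word-level Levenshtein distance between reference and hypothesis."""
    r, h = ref.split(), hyp.split()
    # dp[i][j] = minimum edits turning r[:i] into h[:j]
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i          # delete all remaining reference words
    for j in range(len(h) + 1):
        dp[0][j] = j          # insert all hypothesis words
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = dp[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(r)][len(h)] / len(r)
```

A hypothesis with one substitution and one deletion against a four-word reference scores 0.5, i.e. 50% WER, which is the same scale on which the 15-25% diacritization figures are reported.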
Our aim in this work is to make BBN/AUB a more appropriate corpus, one that is more suitable for speech processing applications. To the best of our knowledge, no one has used BBN/AUB, with its current drawbacks, for speech processing applications. In addition, immediate manual diacritization of BBN/AUB is very complicated because parts of the written (undiacritized) text are not consistent with the pronounced version. Most of the available Arabic corpora and related research address MSA, and very few cover the regional Arabic dialects.
For these reasons, we believe that the contribution of our study is a major step, because it provides a reliable and dependable resource for an important Arabic dialect, namely Levantine Arabic.
3. BBN/AUB CORPUS
The BBN/AUB corpus is a set of spontaneous speech
sentences that were recorded from 164 speakers (101 males
and 63 females) speaking in colloquial Levantine Arabic. The
speakers in the corpus were responding to refugee/medical
questions, where each subject was given a part to play that
prescribed what information they were to give in response to
the questions, but they were told to express themselves
naturally, i.e., in their own way.
The corpus was recorded in two stages. From May 2002
to September 2002, approximately 20% of the corpus was
recorded by BBN in the Boston area using recruited, paid subjects. The remaining 80% was recorded by the American University of Beirut (AUB), under subcontract to BBN, from July 2002 to November 2002. The BBN/AUB audio was recorded in MS WAV format (signed PCM). The
sampling rate was 16 kHz with 16-bit resolution, and the
sound was recorded using a close-talking, noise-cancelling,
headset microphone. A Java-based data-collection tool,
developed by BBN, was used to collect the speech. The
duration of the total recorded speech is 45 hours, which is
distributed among 75,900 audio files, with a total audio size
of 6.5 GB. The total text size is 3.1 MB, the vocabulary
consists of 15,000 words, and the total word count is 336,000
[8].
4. BBN/AUB CORPUS PROBLEMS
Many phonemes are pronounced differently across the Levantine countries, and some phonemes differ in pronunciation even within the same country. For example, the MSA phoneme /ð/ in the word /haða/ "هذا" is pronounced in the Levantine dialect as /z/, /d/, or /ðˤ/, yielding /haza/, /hada/, or /haðˤa/.
Table 1. Arabic consonants [17] (V = voiced, U = unvoiced)

Stops:
  V: ب /b/ (bilabial); د /d/ (alveo-dental); ض /dˤ/ (emphatic alveo-dental); ج /dʒ/ (palatal)
  U: ت /t/ (alveo-dental); ط /tˤ/ (emphatic alveo-dental); ك /k/ (velar); ق /q/ (uvular); ء /ʔ/ (glottal)
Fricatives:
  V: ذ /ð/ (inter-dental); ظ /ðˤ/ (emphatic inter-dental); ز /z/ (alveolar); غ /ɣ/ (uvular); ع /ʕ/ (pharyngeal)
  U: ف /f/ (labio-dental); ث /θ/ (inter-dental); س /s/ (alveolar); ص /sˤ/ (emphatic alveolar); ش /ʃ/ (palatal); خ /x/ (uvular); ح /ħ/ (pharyngeal); هـ /h/ (glottal)
Nasals (V): م /m/ (bilabial); ن /n/ (alveo-dental)
Liquids (V): ل /l/ (alveo-dental); ل /lˤ/ (emphatic); ر /r/ (alveo-dental)
Semivowels (V): و /w/ (bilabial); ي /j/ (palatal)
The main problem of BBN/AUB in transcribing
Arabic phonemes was due to the minimization (or neglect)
of differences between the dialects by transcribing the
allophones as their underlying phonemes. The transcribers used the MSA spelling of words as much as possible.
However, the problem lies in the large difference between the
phoneme pronunciations and phoneme transcriptions; for
example, the phoneme /j/ in some words is transcribed into
/ʔ/, although the /ʔ/ and /j/ phonemes differ in terms of place
and manner of articulation. In particular, /ʔ/ is an unvoiced
stop phoneme and the place of articulation is glottal, while /j/
is a voiced semivowel phoneme and the place of articulation
is palatal. In addition, the /z/, /d/, and /ðˤ/ phonemes in some words are transcribed into the /ð/ phoneme, although these phonemes are produced at different places of articulation: alveo-dental for /z/ and /d/, and inter-dental for /ðˤ/. They also differ in terms of the manner of articulation: /z/ and /ðˤ/ are voiced fricatives, /d/ is a voiced stop, and /ð/ is a voiced fricative. Thus,
BBN/AUB corpus transcription is a challenge in the
development of our speech recognition system. In order to
give examples, Table 1 clarifies the place and manner of
articulation of Arabic phonemes, where V and U indicate
voiced and unvoiced phonemes, respectively. Depending on
this, Table 2 shows some examples of this confusion with
some words with different pronunciations in Levantine and
how BBN/AUB transcribed them into one MSA word.
Table 2. BBN/AUB transcribed phonemes [8]
(example transcription; dialect sounds (allophones); BBN/AUB transcribed MSA phonemes)

  /qabel/ "Before" (قبل): pronounced [gabel], [ʔabel], [kabel]; [g], [ʔ], [k] transcribed as /q/
  /θlaθeen/ "Thirty" (ثلاثين): pronounced [tlateen]; [t] transcribed as /θ/
  /haða/ "this" (هذا): pronounced [haza], [hada], [haðˤa]; [z], [d], [ðˤ] transcribed as /ð/
  /buðˤa/ "ice cream" (بوظه): pronounced [buZa]; [Z] transcribed as /ðˤ/
  /Dabet/ "Officer" (ضابط): pronounced [Zabet], [ðˤabet]; [Z], [ðˤ] transcribed as /D/
  /fumh/ "his mouth" (فمه): pronounced [tumh]; [t] transcribed as /f/
  /Sʁire/ "small" (صغيره): pronounced [Zʁire]; [Z] transcribed as /S/
  /miʔt dinar/ "one hundred" (مئه): pronounced [mit dinar]; [j] transcribed as /ʔ/
The second difficulty we faced in the BBN/AUB Arabic transcription was the missing short vowels (/i/, /u/, and /a/), due to the lack of diacritics in the written text. Diacritics are rarely used in modern written Arabic (e.g., in newspapers, books, and on the Internet). A reader can restore the diacritics by analyzing the text morphologically, syntactically, and semantically before reading, but it is difficult for a designed system to behave like a human reader; for example, an Arabic text-to-speech system cannot reliably produce speech from undiacritized Arabic text because there is more than one way of saying the same undiacritized written Arabic word [18]. This problem is more acute when the sentence contains only one word. Unfortunately, many files in BBN/AUB contain only one word.
Table 3 shows file “276_20021004_162423_001”, which
contains only one word as an example of such files. The
undiacritized Arabic word “/ðkr/, /ذكر/” is shown along with
different ways to pronounce it using different diacritics.
Unfortunately, like many Arabic speech corpora, BBN/AUB phonemes are not labeled and are presented without time segmentation. This is one of the major obstacles in Arabic speech corpora.
Table 3. Possible pronunciations of the undiacritized Arabic word ذكر

  /ðikr/ (ذِكْر): Prayer
  /ðakar/ (ذَكَر): Male
  /ðakara/ (ذَكَرَ): He mentioned
  /ðukira/ (ذُكِرَ): It was mentioned
  /ðakkara/ (ذَكَّرَ): He reminded
  /ðukkira/ (ذُكِّرَ): It was reminded
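The one-to-many ambiguity in Table 3 can be pictured as a lookup table whose entries can only be disambiguated by sentence context. The sketch below is our own illustration (dictionary and variable names are ours) using the paper's example word:

```python
# One undiacritized form maps to many pronunciations (the Table 3 example word).
# An ASR/TTS front end needs surrounding context to pick one reading; in a
# single-word file, as in many BBN/AUB files, no such context exists.
PRONUNCIATIONS = {
    "ذكر": ["ðikr", "ðakar", "ðakara", "ðukira", "ðakkara", "ðukkira"],
}

word = "ذكر"
candidates = PRONUNCIATIONS.get(word, [])
print(len(candidates))  # 6 readings, unresolvable without sentence context
```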
5. EXPERIMENTS AND DISCUSSION
The first step in our experiment was to study the Levantine
dialect phonemes that can be pronounced in different ways,
as well as to determine the closest MSA phonemes that can
be used to transcribe the BBN/AUB corpus (see Table 2). We
also define a rule for transcribing files in which speakers do
not pronounce all phonemes in some words. For example, in
the case where “wθmaanjt” is pronounced “wtmaan”, it is
sometimes difficult to determine whether the speaker
pronounced the last phoneme or not. Transcribing Arabic
dialect is very challenging. Despite this difficulty, we
decided to consider the phonemes as pronounced in the
speech file. To illustrate this, we give the following example.
The file “042_20020821_114056_013” was transcribed in
IPA symbols before the correction step as follows:
”تقريبا مئة وثمانية وثمانين“
“tqrjbn mʔt w θmaanjt w θmaanjn”
After the correction step, the transcription became as follows:
“تأريبا ميه وتمانه وتمانين”
“tʔrjbaan mjh w tmaanh w tmaanjn”
As observed in the above example, the phonemes /q/, /ʔ/, and
/θ/ changed to /ʔ/, /j/, and /t/, respectively, as they were
pronounced, and /t/ was changed to /h/.
The third step is to apply diacritics manually; it is
difficult to diacritize the Arabic dialect automatically because
almost all automatic diacritization systems were designed for
MSA. We focused on only three short Arabic vowels. The
same file, “042_20020821_114056_013”, was diacritized as
follows:
“تأريبا ميه وتمانه وتمانين”
“taʔriibaan miiah w tmaanh w tmaniin”
Diacritical notation in Arabic text provides full vocalization of the Arabic script, where vocalization errors are sometimes not acceptable [3].
Fig. 1. Phoneme labeling without transcription correction and diacritization
Fig. 2. Phoneme labeling after transcription correction and diacritization
Labeling phonemes and time alignment constituted the last step and were performed automatically. To perform this task in a time-efficient manner, we used the parallel-accumulator capability of the HTK [19] HERest tool for HMM re-estimation, in combination with the powerful parallelization capabilities of GNU parallel [20]. The master label file is divided into N parts in order to enable parallel time alignment with the HVite tool.
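The master-label-file split can be sketched as follows. This is a hypothetical helper of our own, not the authors' tooling; it assumes the standard HTK MLF layout: a `#!MLF!#` header line, then one entry per utterance, each beginning with a quoted label-file name and terminated by a line containing a single `.`.

```python
def split_mlf(lines, n):
    """Split the utterance entries of an HTK master label file (MLF) into n
    sub-MLFs so each can be time-aligned by a separate HVite process."""
    assert lines[0].strip() == "#!MLF!#", "not an MLF: missing header"
    entries, current = [], []
    for line in lines[1:]:
        current.append(line)
        if line.strip() == ".":        # '.' closes one utterance's label block
            entries.append(current)
            current = []
    # round-robin the entries into n parts, each with its own MLF header
    parts = [["#!MLF!#"] for _ in range(n)]
    for i, entry in enumerate(entries):
        parts[i % n].extend(entry)
    return parts
```

Each resulting part can then be written to its own file and fed to a separate alignment job, e.g. launched over GNU parallel as the text describes.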
Figure 1 shows the time alignment for file
“042_20020821_114056_013” without phoneme
transcription correction and diacritics, and Figure 2 shows the
time-alignments and segmentation performance
improvement after the completion of step three for the same
file. The dotted line in Figure 2 shows the phonemes that
were corrected or replaced.
In the BBN/AUB corpus, the percentage of short vowels among all phonemes is approximately 23%. With these phonemes omitted, a system cannot recover the correct pronunciation. As shown in Table 4, adding these
phonemes has dramatically improved the performance of a
basic phone recognition task that uses the BBN/AUB corpus
and 4-mixture Gaussian hidden Markov models of
monophones.
Table 4. Percentages of phone recognition rate (%Cphn), insertion
rate (%Ins), deletion rate (%Del), and substitution rate (%Sub) of
systems using the original and improved BBN/AUB corpora.
Corpus %Sub %Del %Ins %Cphn
Original BBN/AUB 26.88 26.51 11.30 46.61
Improved BBN/AUB 23.86 15.65 10.54 60.49
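The rates in Table 4 follow the HTK HResults convention, under which %Corr = 100 - %Sub - %Del (insertions are penalized in the accuracy figure rather than in correctness). A quick consistency check of the table's numbers, with a helper function name of our own:

```python
# HTK HResults convention: %Correct counts hits only, so
#   %Corr = 100 - %Sub - %Del   (insertions affect %Acc, not %Corr)
def percent_correct(sub, dele):
    return round(100.0 - sub - dele, 2)

print(percent_correct(26.88, 26.51))  # 46.61, the original-corpus %Cphn
print(percent_correct(23.86, 15.65))  # 60.49, the improved-corpus %Cphn
```

Both values reproduce the %Cphn column exactly, confirming that the reported gains come chiefly from the large drop in deletions once the short vowels are restored.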
6. CONCLUSION
This paper presents a series of steps taken to improve the
transcription quality of the BBN/AUB corpus. The
improvement in time alignment is clear after phoneme transcription correction and manual diacritization were performed. In fact, it is very difficult for any automatic digital speech system to manipulate and differentiate two different phonemes, for example /ʔ/ and /q/, if they are transcribed as one phoneme, because they share the same manner of articulation (stop, unvoiced, and non-emphatic) but differ in the place of articulation (i.e., /ʔ/ is glottal while /q/ is uvular).
Thus, if we transcribed the dialect sound to the closest MSA
phonemes, the ASR system performance would not be
effective, and the dialect properties would disappear.
Our approach and tools constitute real added value for the BBN/AUB Babylon corpus: roughly 23% of its phonemes, the previously missing short vowels, have been restored. The transcription quality of the BBN/AUB corpus was greatly improved, and it can now be used reliably for any research dedicated to Levantine speech processing.
7. ACKNOWLEDGEMENTS
This work was supported by the NPST program under King
Saud University Project Number 10-INF1325-02.
8. REFERENCES
[1] M. Alghamdi, F. Alhargan, M. Alkanhal, A. Alkhairy, M.
Eldesouki, and A. Alenazi, “Saudi Accented Arabic Voice
Bank,” J. King Saud University, Vol. 20, Comp. & Info. Sci.,
pp. 43-58, Riyadh, 2008.
[2] T. Robinson, J. Fransen, D. Pye, J. Foote, and S. Renals,
“WSJCAMO: a British English speech corpus for large
vocabulary continuous speech recognition,” Acoustics, Speech,
and Signal Processing (ICASSP-95), vol. 1, pp. 81–84, 1995.
[3] S. Ananthakrishnan, S. Bangalore, and S. S. Narayanan,
“Automatic diacritization of Arabic transcripts for automatic
speech recognition,” in Proceedings of the International
Conference on Natural Language Processing (ICON), Kanpur,
India, 2005.
[4] M. Elmahdy, R. Gruhn, W. Minker, and S. Abdennadher,
“Survey on common Arabic language forms from a speech
recognition point of view,” International conference on
Acoustics (NAG-DAGA), 2009.
[5] F. Biadsy, J. Hirschberg, and N. Habash, “Spoken Arabic
Dialect Identification Using Phonotactic Modeling,”
Proceedings of the (EACL) Workshop on Computational
Approaches to Semitic Languages, Athens, Greece, March 31, 2009.
[6] P. Huang, and M. Hasegawa-Johnson, “Cross-Dialectal Data
Transferring for Gaussian Mixture Model Training in Arabic
Speech Recognition,” 4th International Conference on Arabic
Language Processing, Rabat, Morocco, pp. 119–123, May 2–3, 2012.
[7] F. Biadsy, P. J. Moreno, and M. Jansche, “Google’s Cross-
Dialect Arabic Voice Search,” ICASSP, 2012.
[8] Linguistic Data Consortium (LDC) Catalog Number
LDC2005S08, http://www.ldc.upenn.edu/ 2005.
[9] M. Afify, L. Nguyen, B. Xiang, S. Abdou, and J. Makhoul, “Recent Progress in Arabic Broadcast News Transcription at BBN,” INTERSPEECH'05, Lisbon, Portugal, pp. 1637–164.
[10] D. Vergyri, and K. Kirchhoff, “Automatic diacritization of Arabic for acoustic modeling in speech recognition,” in Proceedings of COLING Computational Approaches to Arabic Script-based Languages, Geneva, Switzerland, pp. 66–73, 2004.
[11] R. Sarikaya, O. Emam, I. Zitouni, and Y. Gao, “Maximum
Entropy Modeling for Diacritization of Arabic Text,”
INTERSPEECH'07, pp. 145–148, 2007.
[12] N. Habash, and O. Rambow, “Arabic Diacritization through Full
Morphological Tagging,” In Proceedings of NAACL HLT, pp.
53–56, 2007.
[13] R. Nelken, and S. M. Shieber, “Arabic Diacritization Using Weighted Finite-State Transducers,” Workshop on Computational Approaches to Semitic Languages 5(2), pp. 79–86, 2005.
[14] A. Messaoudi, L. Lamel, and J. Gauvain, “Transcription of
Arabic Broadcast News,” INTERSPEECH'04, Jeju Island,
Korea, pp. 1701–1704, 2004.
[15] A. Messaoudi, J. Gauvain, and L. Lamel, “Arabic Broadcast News Transcription using a One Million-Word Vocalized Vocabulary,” in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), vol. 1, pp. 1093–1096, 2006.
[16] M. Elmahdy, R. Gruhn, and W. Minker, “Novel Techniques for
Dialectal Arabic Speech Recognition,” Springer, Boston (USA),
2012.
[17] Y. A. Alotaibi, K. Abdullah-Al-Mamun, and G. Muhammad,
"Study on unique Pharyngeal and Uvular consonants in foreign
accented Arabic," Proc. INTERSPEECH'08, pp. 751-754,
Brisbane, Australia, September 2008.
[18] M. Alghamdi, Z. Muzaffar, and H. Alhakami, “Automatic
Restoration of Arabic Diacritics: A Simple, Purely Statistical
Approach,” Arabian Journal for Science and Engineering, vol.
35, pp. 137-155, 2010.
[19] S. J. Young, “The HTK hidden Markov model toolkit: Design and philosophy,” Entropic Cambridge Research Laboratory, Ltd., vol. 2, pp. 2–44, 1994.
[20] O. Tange, “Gnu parallel - the command-line power tool,” The
USENIX Magazine, vol. 36, no. 1, pp. 42–47, Feb 2011.
Available: http://www.gnu.org/s/parallel