[IEEE 2013 IEEE Digital Signal Processing and Signal Processing Education Meeting (DSP/SPE) - Napa,...
DIACRITIZATION, AUTOMATIC SEGMENTATION AND LABELING FOR LEVANTINE
ARABIC SPEECH
1Yousef A. Alotaibi, 2Ali H. Meftah, 3Sid-Ahmed Selouani
1,2 College of Computer and Information Sciences, King Saud University, Riyadh, Saudi Arabia
3LARIHS Lab. Université de Moncton, Campus de Shippagan, Canada
{yaalotaibi, ameftah}@ksu.edu.sa, [email protected]
ABSTRACT
It is generally acknowledged that a reliable speech corpus is
necessary for any application involving speech processing. In
this paper, we propose methods to improve the BBN/AUB
DARPA Babylon Levantine Arabic speech corpus to increase
its reliability and efficiency. For this purpose, pronunciation correction, diacritization, and a new transcription are performed manually, along with automatic phoneme segmentation and labeling. The comparison with the original
transcription of the corpus shows a clear improvement in the
output results.
Index Terms— BBN/AUB, Levantine, dialect, transcription,
diacritics
1. INTRODUCTION
Speech corpora are essential for many applications and
automatic speech recognition (ASR) systems, as well as for
language identification and speaker verification. Speech
corpora are very important in linguistic fields, such as in the
areas of phonetics, phonology, typology and sociolinguistics
[1]. Appropriate and well-organized corpora are fundamental
for the development of robust speech recognition systems [2].
Arabic ASR system development has faced many
difficulties and challenges, such as the non-availability of
large corpora, the existence of many dialects with numerous
pronunciations, and the fact that Arabic script does not allow
for full vocalization of text, requiring the reader to infer short
vowels and other missing cues from the specific context [3].
Modern standard Arabic (MSA) is a formal language and is
used in most printed materials in the Arab World and in most
radio and TV broadcasts. Dialectal Arabic is the natural language spoken in everyday life. There are many Arabic dialects, and almost every country has its own dialect [4].
Arabic dialects can be divided in many ways, and
geography and social factors are well-known dividing factors.
As an example, dialects can be divided into Western Arabic
dialects, which include Algerian, Moroccan, Tunisian, and
Libyan dialects, and Eastern Arabic dialects, which include
Gulf, Egyptian, Damascus, and Levantine dialects [4].
Another Arabic dialect classification is based on
the division of the dialects into five groups: Gulf Arabic
(GLF) includes the dialects of Kuwait, Saudi Arabia,
Bahrain, Qatar, the United Arab Emirates, and Oman. Iraqi
Arabic is considered a sub-dialect of Gulf Arabic. Levantine
Arabic (LEV) includes the dialects of Lebanon, Syria, Jordan,
and Palestine. Egyptian Arabic (EGY) covers the dialects of
the Nile valley, namely Egypt and Sudan. Maghrebi
(Western) Arabic covers the dialects of Morocco, Algeria,
Tunisia and Mauritania. Libya is sometimes included in this
group. Yemenite Arabic is often considered to be in a class of
its own [5].
Arabic dialects are not used in written form, so the preparation of appropriate speech corpora for any Arabic dialect is more difficult. Most research on Arabic ASR systems is related to MSA, while dialectal Arabic has received less attention [6], [7].
In this paper, we present our work on the improvement of the
Arabic speech corpus called BBN/AUB DARPA Babylon
[8], which covers the Levantine dialect of Arabic. In fact, there is no baseline system to compare our improved corpus against, but we emphasize that all short vowels of the spoken Arabic in this corpus are missing, and we tried to restore them. Another assessment consists of performing a
basic phone recognition task using the original and improved
corpora. This task is expected to demonstrate the
effectiveness and reliability of the corrections we made on the
original BBN/AUB DARPA corpus.
This paper is organized as follows. Section 2 presents the
objective of this study and how it relates to prior work. The description of the BBN/AUB corpus and its problems are given in Section 3 and Section 4, respectively. Experimental results and discussion are given in Section 5, followed by the conclusion and acknowledgements.
2. OBJECTIVES AND RELATED WORK
Most Arabic speech corpora are available with non-diacritized transcriptions. Restoring the correct diacritization improves Arabic language processing by recovering the missing vowels. Many works have studied how to automatically estimate the missing diacritics from context [9,10,11,12,13,14,15]. However, the problem still remains
unsolved, and the Word Error Rate (WER) of automatic diacritization systems ranges between 15% and 25%; the available commercial applications for automatic diacritization still need manual review in order to achieve a lower WER [16].
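For reference, the WER figures above are edit-distance-based scores. The following is a minimal sketch of the standard Levenshtein-based WER computation; it is our own illustration, not code from any of the cited systems.

```python
def wer(ref, hyp):
    """Word Error Rate: (substitutions + deletions + insertions) / reference length,
    computed as word-level Levenshtein distance between reference and hypothesis."""
    r, h = ref.split(), hyp.split()
    # dp[i][j] = minimum edits turning r[:i] into h[:j]
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i          # delete all remaining reference words
    for j in range(len(h) + 1):
        dp[0][j] = j          # insert all hypothesis words
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = dp[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(r)][len(h)] / len(r)
```

A hypothesis with one substitution and one deletion against a four-word reference scores 0.5, i.e. 50% WER, which is the same scale on which the 15-25% diacritization figures are reported.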
Our aim in this work is to make BBN/AUB a more appropriate corpus, one that is more suitable for speech processing applications. To the best of our knowledge, no one has used BBN/AUB, with its current drawbacks, for speech processing applications. In addition, immediate manual diacritization of BBN/AUB is very complicated because parts of the written (undiacritized) text are not consistent with the pronounced version. Most of the available Arabic corpora and related research address MSA, and very few cover the regional Arabic dialects.
For these reasons, we believe that the contribution of our study is a major step, because it provides a reliable and dependable resource for an important Arabic dialect, namely Levantine Arabic.
3. BBN/AUB CORPUS
The BBN/AUB corpus is a set of spontaneous speech
sentences that were recorded from 164 speakers (101 males
and 63 females) speaking in colloquial Levantine Arabic. The
speakers in the corpus were responding to refugee/medical
questions, where each subject was given a part to play that
prescribed what information they were to give in response to
the questions, but they were told to express themselves
naturally, i.e., in their own way.
The corpus was recorded in two stages. From May 2002
to September 2002, approximately 20% of the corpus was
recorded by BBN in the Boston area using recruited, paid subjects. The remaining 80% was recorded by the American University of Beirut (AUB), under subcontract to BBN, from July 2002 to November 2002. The BBN/AUB audio was recorded in MS WAV format (signed PCM). The
sampling rate was 16 kHz with 16-bit resolution, and the
sound was recorded using a close-talking, noise-cancelling,
headset microphone. A Java-based data-collection tool,
developed by BBN, was used to collect the speech. The
duration of the total recorded speech is 45 hours, which is
distributed among 75,900 audio files, with a total audio size
of 6.5 GB. The total text size is 3.1 MB, the vocabulary
consists of 15,000 words, and the total word count is 336,000
[8].
4. BBN/AUB CORPUS PROBLEMS
Many phonemes are pronounced differently across the Levantine countries, and some phonemes differ in pronunciation even within the same country. For example, the MSA phoneme /ð/ in the word /haða/ "هذا" is pronounced in the Levantine dialect as /z/, /d/, or /ðˤ/, yielding /haza/, /hada/, or /haðˤa/.
Table 1. Arabic consonants [17] (V = voiced, U = unvoiced)

Stops:
  V: ب /b/ (bilabial); د /d/ (alveo-dental); ض /dˤ/ (emphatic alveo-dental); ج /dʒ/ (palatal)
  U: ت /t/ (alveo-dental); ط /tˤ/ (emphatic alveo-dental); ك /k/ (velar); ق /q/ (uvular); ء /ʔ/ (glottal)
Fricatives:
  V: ذ /ð/ (inter-dental); ظ /ðˤ/ (emphatic inter-dental); ز /z/ (alveolar); غ /ɣ/ (uvular); ع /ʕ/ (pharyngeal)
  U: ف /f/ (labio-dental); ث /θ/ (inter-dental); س /s/ (alveolar); ص /sˤ/ (emphatic alveolar); ش /ʃ/ (palatal); خ /x/ (uvular); ح /ħ/ (pharyngeal); هـ /h/ (glottal)
Nasals (V): م /m/ (bilabial); ن /n/ (alveo-dental)
Liquids (V): ل /l/ (alveo-dental); ل /lˤ/ (emphatic); ر /r/ (alveo-dental)
Semivowels (V): و /w/ (bilabial); ي /j/ (palatal)
The main problem of BBN/AUB in transcribing
Arabic phonemes was due to the minimization (or neglect)
of differences between the dialects by transcribing the
allophones as their underlying phonemes. The transcribers used the MSA spelling of words as much as possible.
However, the problem lies in the large difference between the
phoneme pronunciations and phoneme transcriptions; for
example, the phoneme /j/ in some words is transcribed into
/ʔ/, although the /ʔ/ and /j/ phonemes differ in terms of place
and manner of articulation. In particular, /ʔ/ is an unvoiced
stop phoneme and the place of articulation is glottal, while /j/
is a voiced semivowel phoneme and the place of articulation
is palatal. In addition, the /z/, /d/, and /ðˤ/ phonemes in some words are transcribed into the /ð/ phoneme, although these phonemes are produced at different places of articulation: alveo-dental for /z/ and /d/, and inter-dental for /ðˤ/. They also differ in terms of the manner of articulation: /z/ and /ðˤ/ are voiced fricatives, /d/ is a voiced stop, and /ð/ is a voiced fricative. Thus,
BBN/AUB corpus transcription is a challenge in the
development of our speech recognition system. In order to
give examples, Table 1 clarifies the place and manner of
articulation of Arabic phonemes, where V and U indicate
voiced and unvoiced phonemes, respectively. Depending on
this, Table 2 shows some examples of this confusion with
some words with different pronunciations in Levantine and
how BBN/AUB transcribed them into one MSA word.
Table 2. BBN/AUB transcribed phonemes [8]
(example transcription; dialect sounds (allophones); BBN/AUB transcribed MSA phonemes)

  /qabel/ "Before" (قبل): pronounced [gabel], [ʔabel], [kabel]; [g], [ʔ], [k] transcribed as /q/
  /θlaθeen/ "Thirty" (ثلاثين): pronounced [tlateen]; [t] transcribed as /θ/
  /haða/ "this" (هذا): pronounced [haza], [hada], [haðˤa]; [z], [d], [ðˤ] transcribed as /ð/
  /buðˤa/ "ice cream" (بوظه): pronounced [buZa]; [Z] transcribed as /ðˤ/
  /Dabet/ "Officer" (ضابط): pronounced [Zabet], [ðˤabet]; [Z], [ðˤ] transcribed as /D/
  /fumh/ "his mouth" (فمه): pronounced [tumh]; [t] transcribed as /f/
  /Sʁire/ "small" (صغيره): pronounced [Zʁire]; [Z] transcribed as /S/
  /miʔt dinar/ "one hundred" (مئه): pronounced [mit dinar]; [j] transcribed as /ʔ/
The second difficulty we faced in the BBN/AUB Arabic transcription was the missing short vowels (/i/, /u/, and /a/), due to the lack of diacritics in the written text. Diacritics are rarely used in modern written Arabic (e.g., in newspapers, books, and on the Internet). A reader can restore the diacritics by analyzing the text morphologically, syntactically, and semantically before reading, but it is difficult for a designed system to behave like a human reader; for example, an Arabic text-to-speech system cannot reliably produce speech from undiacritized Arabic text because there is more than one way of saying the same undiacritized written Arabic word [18]. This problem is more acute when the sentence contains only one word. Unfortunately, many files in BBN/AUB contain only one word.
Table 3 shows file “276_20021004_162423_001”, which
contains only one word as an example of such files. The
undiacritized Arabic word “/ðkr/, /ذكر/” is shown along with
different ways to pronounce it using different diacritics.
Unfortunately, like many Arabic speech corpora, BBN/AUB phonemes are not labeled and are presented without time segmentation. This is one of the major obstacles in Arabic speech corpora.
Table 3. Possible pronunciations of the undiacritized Arabic word ذكر

  /ðikr/ (ذِكْر): Prayer
  /ðakar/ (ذَكَر): Male
  /ðakara/ (ذَكَرَ): He mentioned
  /ðukira/ (ذُكِرَ): It was mentioned
  /ðakkara/ (ذَكَّرَ): He reminded
  /ðukkira/ (ذُكِّرَ): It was reminded
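The one-to-many ambiguity in Table 3 can be pictured as a lookup table whose entries can only be disambiguated by sentence context. The sketch below is our own illustration (dictionary and variable names are ours) using the paper's example word:

```python
# One undiacritized form maps to many pronunciations (the Table 3 example word).
# An ASR/TTS front end needs surrounding context to pick one reading; in a
# single-word file, as in many BBN/AUB files, no such context exists.
PRONUNCIATIONS = {
    "ذكر": ["ðikr", "ðakar", "ðakara", "ðukira", "ðakkara", "ðukkira"],
}

word = "ذكر"
candidates = PRONUNCIATIONS.get(word, [])
print(len(candidates))  # 6 readings, unresolvable without sentence context
```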
5. EXPERIMENTS AND DISCUSSION
The first step in our experiment was to study the Levantine
dialect phonemes that can be pronounced in different ways,
as well as to determine the closest MSA phonemes that can
be used to transcribe the BBN/AUB corpus (see Table 2). We
also define a rule for transcribing files in which speakers do
not pronounce all phonemes in some words. For example, in
the case where “wθmaanjt” is pronounced “wtmaan”, it is
sometimes difficult to determine whether the speaker
pronounced the last phoneme or not. Transcribing Arabic
dialect is very challenging. Despite this difficulty, we
decided to consider the phonemes as pronounced in the
speech file. To illustrate this, we give the following example.
The file “042_20020821_114056_013” was transcribed in
IPA symbols before the correction step as follows:
”تقريبا مئة وثمانية وثمانين“
“tqrjbn mʔt w θmaanjt w θmaanjn”
After the correction step, the transcription became as follows:
“تأريبا ميه وتمانه وتمانين”
“tʔrjbaan mjh w tmaanh w tmaanjn”
As observed in the above example, the phonemes /q/, /ʔ/, and
/θ/ changed to /ʔ/, /j/, and /t/, respectively, as they were
pronounced, and /t/ was changed to /h/.
The third step is to apply diacritics manually; it is
difficult to diacritize the Arabic dialect automatically because
almost all automatic diacritization systems were designed for
MSA. We focused on only three short Arabic vowels. The
same file, “042_20020821_114056_013”, was diacritized as
follows:
“تأريبا ميه وتمانه وتمانين”
“taʔriibaan miiah w tmaanh w tmaniin”
Diacritical notation in Arabic text provides full vocalization of the Arabic script, where vocalization errors are sometimes not acceptable [3].
Fig. 1. Phoneme labeling without transcription correction and diacritization
Fig. 2. Phoneme labeling after transcription correction and diacritization
Labeling phonemes and time alignment constituted the last step and were performed automatically. To perform this task in a time-efficient manner, we used the parallel-accumulator capability of the HTK [19] HERest tool for HMM re-estimation, in combination with the powerful parallelization capabilities of GNU parallel [20]. The master label file is divided into N parts in order to enable parallel time alignment with the HVite tool.
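The master-label-file split can be sketched as follows. This is a hypothetical helper of our own, not the authors' tooling; it assumes the standard HTK MLF layout: a `#!MLF!#` header line, then one entry per utterance, each beginning with a quoted label-file name and terminated by a line containing a single `.`.

```python
def split_mlf(lines, n):
    """Split the utterance entries of an HTK master label file (MLF) into n
    sub-MLFs so each can be time-aligned by a separate HVite process."""
    assert lines[0].strip() == "#!MLF!#", "not an MLF: missing header"
    entries, current = [], []
    for line in lines[1:]:
        current.append(line)
        if line.strip() == ".":        # '.' closes one utterance's label block
            entries.append(current)
            current = []
    # round-robin the entries into n parts, each with its own MLF header
    parts = [["#!MLF!#"] for _ in range(n)]
    for i, entry in enumerate(entries):
        parts[i % n].extend(entry)
    return parts
```

Each resulting part can then be written to its own file and fed to a separate alignment job, e.g. launched over GNU parallel as the text describes.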
Figure 1 shows the time alignment for file
“042_20020821_114056_013” without phoneme
transcription correction and diacritics, and Figure 2 shows the
time-alignments and segmentation performance
improvement after the completion of step three for the same
file. The dotted line in Figure 2 shows the phonemes that
were corrected or replaced.
In the BBN/AUB corpus, the percentage of short vowels among all phonemes is approximately 23%. With these phonemes omitted, a system cannot recover the correct pronunciation. As shown in Table 4, adding these
phonemes has dramatically improved the performance of a
basic phone recognition task that uses the BBN/AUB corpus
and 4-mixture Gaussian hidden Markov models of
monophones.
Table 4. Percentages of phone recognition rate (%Cphn), insertion
rate (%Ins), deletion rate (%Del), and substitution rate (%Sub) of
systems using the original and improved BBN/AUB corpora.
Corpus %Sub %Del %Ins %Cphn
Original BBN/AUB 26.88 26.51 11.30 46.61
Improved BBN/AUB 23.86 15.65 10.54 60.49
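The rates in Table 4 follow the HTK HResults convention, under which %Corr = 100 - %Sub - %Del (insertions are penalized in the accuracy figure rather than in correctness). A quick consistency check of the table's numbers, with a helper function name of our own:

```python
# HTK HResults convention: %Correct counts hits only, so
#   %Corr = 100 - %Sub - %Del   (insertions affect %Acc, not %Corr)
def percent_correct(sub, dele):
    return round(100.0 - sub - dele, 2)

print(percent_correct(26.88, 26.51))  # 46.61, the original-corpus %Cphn
print(percent_correct(23.86, 15.65))  # 60.49, the improved-corpus %Cphn
```

Both values reproduce the %Cphn column exactly, confirming that the reported gains come chiefly from the large drop in deletions once the short vowels are restored.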
6. CONCLUSION
This paper presents a series of steps taken to improve the
transcription quality of the BBN/AUB corpus. The
improvement in time alignment is clear after phoneme transcription correction and manual diacritization were performed. In fact, it is very difficult for any automatic digital speech system to manipulate and differentiate two different phonemes, for example /ʔ/ and /q/, if they are transcribed as one phoneme, because they share the same manner of articulation (stop, unvoiced, and non-emphatic) but differ in the place of articulation (i.e., /ʔ/ is glottal while /q/ is uvular).
Thus, if we transcribed the dialect sound to the closest MSA
phonemes, the ASR system performance would not be
effective, and the dialect properties would disappear.
Our approach and tools constitute real added value for the BBN/AUB Babylon corpus: roughly 23% of its phonemes, the previously missing short vowels, have been restored. The transcription quality of the BBN/AUB corpus was greatly improved, and it can now be used reliably for any research dedicated to Levantine speech processing.
7. ACKNOWLEDGEMENTS
This work was supported by the NPST program under King
Saud University Project Number 10-INF1325-02.
8. REFERENCES
[1] M. Alghamdi, F. Alhargan, M. Alkanhal, A. Alkhairy, M.
Eldesouki, and A. Alenazi, “Saudi Accented Arabic Voice
Bank,” J. King Saud University, Vol. 20, Comp. & Info. Sci.,
pp. 43-58, Riyadh, 2008.
[2] T. Robinson, J. Fransen, D. Pye, J. Foote, and S. Renals,
“WSJCAMO: a British English speech corpus for large
vocabulary continuous speech recognition,” Acoustics, Speech,
and Signal Processing (ICASSP-95), vol. 1, pp. 81–84, 1995.
[3] S. Ananthakrishnan, S. Bangalore, and S. S. Narayanan,
“Automatic diacritization of Arabic transcripts for automatic
speech recognition,” in Proceedings of the International
Conference on Natural Language Processing (ICON), Kanpur,
India, 2005.
[4] M. Elmahdy, R. Gruhn, W. Minker, and S. Abdennadher,
“Survey on common Arabic language forms from a speech
recognition point of view,” International conference on
Acoustics (NAG-DAGA), 2009.
[5] F. Biadsy, J. Hirschberg, and N. Habash, “Spoken Arabic
Dialect Identification Using Phonotactic Modeling,”
Proceedings of the (EACL) Workshop on Computational
Approaches to Semitic Languages, Athens, Greece, March 31, 2009.
[6] P. Huang, and M. Hasegawa-Johnson, “Cross-Dialectal Data
Transferring for Gaussian Mixture Model Training in Arabic
Speech Recognition,” 4th International Conference on Arabic
Language Processing, Rabat, Morocco, pp. 119–123, May 2–3, 2012.
[7] F. Biadsy, P. J. Moreno, and M. Jansche, “Google’s Cross-
Dialect Arabic Voice Search,” ICASSP, 2012.
[8] Linguistic Data Consortium (LDC) Catalog Number
LDC2005S08, http://www.ldc.upenn.edu/ 2005.
[9] M. Afify, L. Nguyen, B. Xiang, S. Abdou, and J. Makhoul, “Recent Progress in Arabic Broadcast News Transcription at BBN,” INTERSPEECH'05, Lisbon, Portugal, pp. 1637–164.
[10] D. Vergyri, and K. Kirchhoff, “Automatic diacritization of Arabic for acoustic modeling in speech recognition,” in Proceedings of COLING Computational Approaches to Arabic Script-based Languages, Geneva, Switzerland, pp. 66–73, 2004.
[11] R. Sarikaya, O. Emam, I. Zitouni, and Y. Gao, “Maximum
Entropy Modeling for Diacritization of Arabic Text,”
INTERSPEECH'07, pp. 145–148, 2007.
[12] N. Habash, and O. Rambow, “Arabic Diacritization through Full
Morphological Tagging,” In Proceedings of NAACL HLT, pp.
53–56, 2007.
[13] R. Nelken, and S. M. Shieber, “Arabic Diacritization Using Weighted Finite-State Transducers,” Workshop on Computational Approaches to Semitic Languages 5(2), pp. 79–86, 2005.
[14] A. Messaoudi, L. Lamel, and J. Gauvain, “Transcription of
Arabic Broadcast News,” INTERSPEECH'04, Jeju Island,
Korea, pp. 1701–1704, 2004.
[15] A. Messaoudi, J. Gauvain, and L. Lamel, “Arabic Broadcast News Transcription using a One Million-Word Vocalized Vocabulary,” in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), vol. 1, pp. 1093–1096, 2006.
[16] M. Elmahdy, R. Gruhn, and W. Minker, “Novel Techniques for
Dialectal Arabic Speech Recognition,” Springer, Boston (USA),
2012.
[17] Y. A. Alotaibi, K. Abdullah-Al-Mamun, and G. Muhammad,
"Study on unique Pharyngeal and Uvular consonants in foreign
accented Arabic," Proc. INTERSPEECH'08, pp. 751-754,
Brisbane, Australia, September 2008.
[18] M. Alghamdi, Z. Muzaffar, and H. Alhakami, “Automatic
Restoration of Arabic Diacritics: A Simple, Purely Statistical
Approach,” Arabian Journal for Science and Engineering, vol.
35, pp. 137-155, 2010.
[19] S. J. Young, “The HTK hidden Markov model toolkit: Design and philosophy,” Entropic Cambridge Research Laboratory, Ltd., vol. 2, pp. 2–44, 1994.
[20] O. Tange, “Gnu parallel - the command-line power tool,” The
USENIX Magazine, vol. 36, no. 1, pp. 42–47, Feb 2011.
Available: http://www.gnu.org/s/parallel