AUTOMATED TAJWEED CHECKING RULES ENGINE FOR
QURANIC VERSE RECITATION
NOOR JAMALIAH BINTI IBRAHIM
FACULTY OF COMPUTER SCIENCE AND INFORMATION
TECHNOLOGY
UNIVERSITY OF MALAYA
KUALA LUMPUR
APRIL 2010
AUTOMATED TAJWEED CHECKING RULES ENGINE FOR
QURANIC VERSE RECITATION
NOOR JAMALIAH BINTI IBRAHIM
DISSERTATION SUBMITTED IN FULFILLMENT OF THE
REQUIREMENTS FOR THE DEGREE OF MASTER OF
COMPUTER SCIENCE
FACULTY OF COMPUTER SCIENCE AND INFORMATION
TECHNOLOGY
UNIVERSITY OF MALAYA
KUALA LUMPUR
APRIL 2010
ORIGINAL LITERARY WORK DECLARATION
Name of Candidate: NOOR JAMALIAH BINTI IBRAHIM
I.C/Passport No: 840831-11-5602
Registration/Metric No: WGA 070122
Name of Degree: MASTER OF COMPUTER SCIENCE
Title of Project Paper/Research Report/Dissertation/Thesis ("this Work"):
AUTOMATED TAJWEED CHECKING RULES ENGINE FOR QURANIC VERSE RECITATION
Field of Study: SPEECH RECOGNITION
I do solemnly and sincerely declare that:
1) I am the sole author/writer of this Work;
2) This Work is original;
3) Any use of any work in which copyright exists was done by way of fair dealing and for permitted purposes and any excerpt or extract from, or reference to or reproduction of any copyright work has been disclosed expressly and sufficiently and the title of the Work and its authorship have been acknowledged in this Work;
4) I do not have any actual knowledge nor do I ought reasonably to know that the making of this work constitutes an infringement of any copyright work;
5) I hereby assign all and every rights in the copyright to this Work to the University of Malaya ("UM"), who henceforth shall be owner of the copyright in this Work and that any reproduction or use in any form or by any means whatsoever is prohibited without the written consent of UM having been first had and obtained;
6) I am fully aware that if in the course of making this Work I have infringed any copyright whether intentionally or otherwise, I may be subject to legal action or any other action as may be determined by UM.
Candidate's Signature: Date:
Subscribed and solemnly declared before,
Name: ZAIDI RAZAK
Designation: LECTURER, SYSTEM & COMPUTER TECHNOLOGY DEPARTMENT, FACULTY OF COMPUTER SCIENCE AND INFORMATION TECHNOLOGY, UNIVERSITY OF MALAYA
50603 KUALA LUMPUR
Date:
ABSTRACT
Automated speech recognition for Quranic verse recitation with Tajweed rule-checking capabilities is a new research area. The current manual method of teaching Al-Quran reading skills has become less effective and less attractive, especially to the young Muslim generation. This method, known as talaqqi and musyafahah, is a face-to-face learning process between students (recitors) and teachers (Mudarris), in which listening, correction and repetition of the correct Al-Quran recitation take place in real time. An automated speech recognition system with Tajweed rule-checking capability could be an alternative that supports this existing manual method of Quranic learning, without denying the central role of the Mudarris in teaching Al-Quran. The system is not intended to replace the Al-Quran, nor the role of teachers, but to complement the teaching process and to ensure that the art of reciting Al-Quran is not lost and forgotten. In this thesis, an automated Tajweed checking rules engine for Quranic verse recitation was developed and tested, offering Muslims an accessible way to recite and learn Al-Quran with a better understanding of Tajweed. The Mel-frequency Cepstral Coefficients (MFCC) feature extraction technique is used to extract features and characteristics from Quranic verse recitation, and Hidden Markov Models (HMM) are used for training and recognition. The most challenging task in this research was integrating Al-Quran recitation into a speech recognition system, together with the engine's capability to check the Tajweed rules. The engine achieved recognition rates exceeding 91.95% (ayates) and 86.41% (phonemes), which indicates that the development of the engine was successful.
ABSTRAK
An automatic speech recognition system with the capability to assess the rules of Tajweed, specifically for the recitation of the holy verses of Al-Quran, has been developed, and it is a field that is still considered new. This research was carried out in response to problems arising in the existing system of learning and teaching Al-Quran, where the method currently in use is manual, involving the students and the teachers (Mudarris) themselves. This method is believed to be less effective and less attractive to implement, especially for the young Muslim generation. The learning approach is adapted from one form of Al-Quran learning, Talaqqi and Musyafahah, known as face-to-face learning between students and teachers (Mudarris). Through this method, the whole Al-Quran learning process takes place: listening, correcting the recitation, and repeating the recitation fluently and with Tajweed. An automated system with the capability to assess the rules of Tajweed in Al-Quran recitation is an alternative that supports the existing manual method of Al-Quran learning, without neglecting or disputing the primary role of the Mudarris in teaching Al-Quran. The system developed is not intended to replace the Al-Quran, nor to replace the role of the teacher; rather, its function is to complement the existing learning process and to ensure that the art of Al-Quran recitation is neither eroded by time nor easily forgotten. In this thesis, a recognition engine for the rules of Tajweed, specifically for the verses of Al-Quran, was developed and its capability tested, through the introduction of a new method that is easy for the Muslim community to use, with better understanding in learning Al-Quran. The Mel-frequency Cepstral Coefficient (MFCC) feature extraction technique was used in this study, whereby the features and characteristics present in the recitation of the holy verses of Al-Quran were extracted, while Hidden Markov Model (HMM) classification was used for training and recognition. The most challenging task in carrying out this research was the implementation of the verses of Al-Quran in the speech recognition system, together with its capability to check the rules of Tajweed. Nevertheless, the engine developed achieved a high recognition rate exceeding 91.95% (verses) and 86.41% (words), showing that the engine was successfully implemented.
ACKNOWLEDGEMENTS
In the name of Allah, the Most Gracious, the Most Merciful.
All praise is due to Allah, the Creator and Sustainer of this whole universe, the
Most Beneficent and the Most Merciful, for His guidance and blessing and granting me
knowledge, patience and perseverance to accomplish this research successfully.
Firstly, I would like to acknowledge the University of Malaya, especially the
Department of Computer System & Technology, for providing the support to carry out this
research. I take great pride in forwarding my sincere appreciation and deepest gratitude to my
supervisor, Mr. Zaidi Razak, for his valuable guidance, support, encouragement and effort
throughout this research project. Without his tireless efforts, patience and guidance, this
research could not have been successfully completed. My special thanks also go to
my project leader, Prof. Dato’ Dr. Mohd Yakub @ Zulkifli Bin Haji Mohd Yusoff, for his
valuable guidance and moral support throughout these tough years.
I would like to take this opportunity to thank the University of Malaya for funding
me under the University of Malaya Scholarship Scheme (SBUM). I am much honoured to be
the recipient of this scholarship, which supported me financially and funded my studies,
thus enabling me to concentrate on my research project.
Last but not least, my most profound gratitude and respect go to my family, especially my
beloved parents, Haji Ibrahim Bin Husain and Hajjah Maimunah Muda, who have been the
ultimate source of my motivation to work hard and the inspiration of my life. I
proudly dedicate this work to both of them; may Allah SWT bless them.
April 2010
Noor Jamaliah Binti Ibrahim
Department of Computer System & Technology,
Faculty of Computer Science & Information Technology,
University of Malaya,
Kuala Lumpur.
TABLE OF CONTENTS
Page
ABSTRACT ii
ABSTRAK iii
ACKNOWLEDGEMENTS v
TABLE OF CONTENTS vii
LIST OF FIGURES xii
LIST OF TABLES xv
LIST OF ABBREVIATIONS xvi
CHAPTER 1: INTRODUCTION
1.1 Introduction 1
1.2 Background 2
1.3 Motivation 3
1.4 Problem Statements 4
1.5 Research Objectives 4
1.6 Scope of Research 5
1.7 Research Methodology 5
1.8 Terminology 6
1.8.1 Utterances 7
1.8.2 Vocabularies 7
1.8.3 Accuracy 7
1.9 Thesis Outline 8
CHAPTER 2: LITERATURE REVIEW
2.1 Introduction 11
2.2 The “Art of Tajweed” 12
2.3 Effect of the “Art of Tajweed” on the acoustic model 13
2.4 Linguistic properties of Arabic 14
2.5 Quranic Verse Recitation Recognition Systems 21
2.5.1 Pre-processing 23
2.5.1.1 Endpoint Detection 23
2.5.1.2 Pre-emphasis filtering/Noise filtering/Smoothing 23
2.5.1.3 Channel Normalization/Distortion Equalization 24
2.5.2 Feature Extraction 25
2.5.2.1 Linear Predictive Coding (LPC) 25
2.5.2.2 Perceptual Linear Prediction (PLP) 26
2.5.2.3 Mel-Frequency Cepstral Coefficient (MFCC) 26
2.5.2.4 Spectrographic Analysis 29
2.5.3 Training/Feature Classification and Pattern Recognition Techniques 29
2.5.3.1 Hidden Markov Model (HMM) 30
(a) HMM Training 31
(b) HMM Testing 31
2.5.3.2 Artificial Neural Network (ANN) 32
2.5.3.3 Vector Quantization (VQ) 33
2.5.4 Recognition/Identification 34
2.5.4.1 Hidden Markov Model (HMM) 34
2.5.4.2 Vector Quantization (VQ) 35
2.5.4.3 Artificial Neural Network (ANN) 36
2.6 Comparison of Speech Recognition techniques for Quranic Arabic recitation 36
2.7 Summary 38
CHAPTER 3: RESEARCH METHODOLOGY
3.1 Introduction 39
3.2 Tajweed checking rules engine techniques and algorithms 40
3.2.1 Speech Samples Collection (Speech Recording) 42
3.2.2 Mel-Frequency Cepstral Coefficients Feature Extraction 43
3.2.2.1 Preemphasis 46
3.2.2.2 Framing 48
3.2.2.3 Windowing 49
3.2.2.4 Discrete Fourier Transform (DFT) 52
3.2.2.5 Mel Filterbank 53
3.2.2.6 Discrete Cosine Transform (DCT) 54
3.2.3 Hidden Markov Model Classification 56
3.2.3.1 Hidden Markov Model Training 59
(a) Initialization 60
(b) Probability Evaluation 64
(c) Re-Estimation 69
(d) Result – Model of HMM 72
3.2.3.2 Hidden Markov Model Testing/Recognition 72
(a) Initialization 75
(b) Probability Evaluation 76
(c) HMM Recognition Result 78
3.3 Summary 79
CHAPTER 4: DESIGN AND IMPLEMENTATION
4.1 Introduction 81
4.2 Overview of Automated Tajweed Checking Rules Engine 82
4.2.1 Engine Development Part 83
4.2.2 Content Development Part 83
4.3 Tajweed checking rules engine architecture 84
4.4 Data Flow Diagram for Tajweed Checking Rules Engine 87
4.5 Tajweed Checking Rules Engine Flow Chart 88
4.6 Tajweed Checking Rules Graphical User Interfaces 92
4.7 Summary 99
CHAPTER 5: EXPERIMENTAL RESULTS AND DISCUSSION
5.1 Introduction 100
5.2 Speech Samples Collection (Recording Process) 100
5.3 Result of Feature Extraction 103
5.4 Result of Features Training 103
5.4.1 Tajweed Checking Rules Database 106
5.5 Result of Features Matching/Testing 107
5.5.1 Testing – Word (ayates) Like Template 110
5.5.2 Testing – Phonemes Like Template 113
5.6 Summary 121
CHAPTER 6: CONCLUSION & FUTURE ENHANCEMENT
6.1 Introduction 122
6.2 Significance and Contributions of Tajweed Checking Rules engine for Quranic verse Recitation 122
6.3 Observations on Weaknesses and Strengths 123
6.3.1 Strengths 123
6.3.2 Weaknesses 125
6.4 Future Research 126
6.5 Conclusion 127
REFERENCES 128
APPENDIX A 134
APPENDIX B List of Published Papers and Achievements 139
LIST OF FIGURES
Page
Figure 2.1: Arabic general Characteristics 18
Figure 2.2: System architecture 22
Figure 2.3: Block diagram of the computation steps of MFCC 28
Figure 2.4: Interconnected group of nodes in ANN 32
Figure 2.5: The Encoding-Decoding Operation in VQ 34
Figure 3.1: MATLAB code for recording process 42
Figure 3.2: Block diagram of the computation steps of MFCC 44
Figure 3.3: Time and Spectrum graph for the recitation “Bismillahi Al-Rahmani Al-Rahim” 46
Figure 3.4: MATLAB code for the Preemphasis stage of MFCC 47
Figure 3.5: MATLAB code for framing stage of MFCC 48
Figure 3.6: Framing Signal (Frame size = 256 samples) 49
Figure 3.7: MATLAB code for the windowing stage of MFCC 50
Figure 3.8: Hamming Window 51
Figure 3.9: Windowed speech segment 51
Figure 3.10: FFT computation of MATLAB code 52
Figure 3.11: MFCC Cepstral Coefficients computation of MATLAB code 53
Figure 3.12: Result of MFCC Cepstral Coefficients 54
Figure 3.13: The MFCC Cepstral Coefficients for ayates ‘Maaliki yawmid diini’ 55
Figure 3.14: Automated Tajweed Checking Rules system structure 58
Figure 3.15: The HMM sequence of training block diagram 60
Figure 3.16: The state transition probability matrix (A) for ayates ‘Maaliki yawmid diini’ 61
Figure 3.17: MATLAB code for initializing the model (mu, sigma) 62
Figure 3.18: M-File function of hmm_mint 62
Figure 3.19: The mean vectors mu (µ), for ayates ‘Maaliki yawmid diini’ 63
Figure 3.20: The covariance matrices sigma (Σ) for ayates ‘Maaliki yawmid diini’ 64
Figure 3.21: MATLAB code for Forward-Backward Recursions 65
Figure 3.22: MATLAB code for the re-estimation of transition parameters 70
Figure 3.23(a): MAT-file trained model of A_ values (1-14) 72
Figure 3.23(b): MAT-file trained model of mu_ (μ) values (1-13) 72
Figure 3.23(c): MAT-file trained model of sigma_ (Σ) values (1-13) 72
Figure 3.24: The HMM sequence of testing/recognition block diagram 74
Figure 3.25: MATLAB code for ‘realmin’ 75
Figure 3.26(a): Output score for the ayates ‘Maaliki yawmiddiini’ 78
Figure 3.26(b): Log-Likelihood Ratio (LLR) for the ayates ‘Maaliki yawmiddiini’ 79
Figure 4.1: Automated Tajweed Checking Rules for Quranic verse recitation context diagram 81
Figure 4.2: Overview of Automated Tajweed Checking Rules Engine 82
Figure 4.3: Block diagram schematic illustrating Tajweed checking rules engine 85
Figure 4.4: Tajweed checking rules engine architecture 86
Figure 4.5: Tajweed Checking Rules Engine Data Flow Diagram (DFD) 88
Figure 4.6: Automated Tajweed checking rules engine for Quranic flow chart 89
Figure 4.7: Automated Tajweed Checking Rules Engine for Quranic verse Recitation Graphical User Interface 92
Figure 4.8: Load the wave file of input speech sample from sourate Al-Fatihah 93
Figure 4.9: Analyzing process of sourate Al-Fatihah using MFCC (Started) 94
Figure 4.10: Analyzing process of sourate Al-Fatihah using MFCC (Finished) 94
Figure 4.11: The input speech sample and spectrogram graph for ‘Bismillah’ utterance 95
Figure 4.12: The incorrect recitation of ‘Bismillah’ utterance (1st mistake/notification) 96
Figure 4.13: The incorrect recitation part involved and Tajweed rules 96
Figure 4.14: The incorrect recitation of ‘Bismillah’ utterance (2nd mistake/notification) 97
Figure 4.15: The incorrect recitation part involved and Tajweed rules 97
Figure 4.16: The correct recitation of ‘Bismillah’ utterance 98
Figure 4.17: The notification of correct recitation of ‘Arrahmaanirrahiim’ utterance 98
Figure 4.18: The correct recitation of ‘Arrahmaanirrahiim’ utterance 99
Figure 5.1: Percentage of accuracy for recognition rate (Ayates & Phonemes) 119
Figure 5.2: Percentage of Word Error Rate (WER) for ayates & Phonemes 120
LIST OF TABLES
Page
Table 2.1: The Arabic alphabets 15
Table 2.1: The Arabic alphabets (continued) 16
Table 2.2: Arabic diacritics 17
Table 2.3: Arabic Consonants 19
Table 2.4: Approaches used by Quranic Arabic recitation using speech recognition techniques 37
Table 3.1: MFCC Parameter Definition 45
Table 3.2: MFCC Filter Equations 45
Table 5.1: Excerpt from the dictionary of Sourate Al-Fatihah 101
Table 5.2: Summary of the Total Collected Speech Samples for each Ayates 102
Table 5.3: Template Data of HMM Model for Collected Quranic Recitations 104
Table 5.4: The Tajweed Pronunciations rules in Sourate Al-Fatihah 106
Table 5.5: Result of Likelihood Ratio (LLR) for 8 recitations of speech samples (1.0 × 10^3) 111
Table 5.6: Test result for 8 recitations of speech samples (ayates of Sourate Al-Fatihah) 112
Table 5.7: Comparison between correct and incorrect Tajweed rules for ayates “Bismillahir <rahmaanir> rahimi” 114
Table 5.8: Comparison between correct and incorrect Tajweed rules for ayates “Bismillahir rahmaanir <rahiimi>” 115
Table 5.9: Test result for 28 recitations of speech samples (Phonemes) 118
LIST OF ABBREVIATIONS
ANN : Artificial Neural Network
ASR : Automatic Speech Recognizer
CN : Channel Normalization
DCT : Discrete Cosine Transform
DFT : Discrete Fourier Transform
FFT : Fast Fourier Transform
FIR : Finite Impulse Response
FS : Sampling Frequency
GUI : Graphical User Interface
HMM : Hidden Markov Model
Hz : Hertz
ICT : Information & Communication Technology
IDFT : Inverse Discrete Fourier Transform
IV : In Vocabulary
J-QAF : Jawi, Quran, Arabic and Fardhu Ain (Islamic obligatory duty)
LBG : Linde, Buzo and Gray
LLR : Log Likelihood Ratio
LPC : Linear Predictive Coding
MFCC : Mel-frequency Cepstral Coefficients
MSA : Modern Standard Arabic
NN : Neural Network
OOV : Out of Vocabulary
PLP : Perceptual Linear Prediction
PC : Personal Computer
VQ : Vector Quantization
WER : Word Error Rate
CHAPTER 1
INTRODUCTION
1.1 Introduction
In this technological era, information technology has made a great impact on our
daily life. At the same time, the problem of communication between human beings and
information technology has become critical. Until now, this communication has been
carried out mainly through keyboards and screens, but these have weaknesses and
limitations when applied to other applications. Speech is the most widely used and
natural means of communication between humans, and it is an obvious substitute for
keyboards and screens in the communication process. Moreover, the exchange of ideas
among humans is carried out through communication, which has facilitated the
development of technology in various forms. Although speech applications in the
human-computer interface area have been growing rapidly, the capability to generate
and interpret speech is still incomplete and imperfect. Investigations in this
research field have led to the development of automatic speech recognition systems.
As we know, speech recognition is in high demand and has many useful
applications. This research draws on speech recognition technology, which incorporates
various components of Artificial Intelligence: natural language processing, speech
recognition technology and human-computer interaction fundamentals. Here, this research
is concerned with speech recognition technology, which is part of speech and signal
processing technology.
1.2 Background
In learning Al-Quran as shown by our Prophets, different systems and
methodologies are essential in putting the word of God in its rightful place. The
development of Quranic lessons has successfully produced many Quranic scholars and, at
the same time, raised the standard of Quranic study to a high priority. The development
of ICT has also changed the world in many ways, in both positive and negative aspects.
Therefore, every Muslim must be able to identify appropriate and practical ways of
selecting the right type of information obtained from this new technology. Even though
the world has changed drastically, developments in Quranic studies have never become
outdated. Neither the era of globalization nor high technology could prevent academia
in Quranic studies from being influenced by current trends in technology.
Focusing on this research, it stresses only the speech processing of Quranic
recitation related to ‘Tajweed Rules’, based on a recitation recognition process. It is
believed that this recognition system is capable of educating students and adults
through an interactive learning system with Tajweed checking rules (Al-Quran reading
rules) correction capability. Moreover, the existing products and technologies available
are only capable of displaying Al-Quran texts and/or playing stored Al-Quran
recitations, while this system allows students to recite Al-Quran through the system
and have their recitation revised and corrected in real time.
It is believed that the Al-Quran learning process requires a special and effective
way to recite Al-Quran (Tabbal et al., 2006). Furthermore, the Al-Quran learning process
is still handled by a manual method, based on Al-Quran reading skills through the
talaqqi and musyafahah methods. These methods are described as a face-to-face learning
process between students and teachers (Mudarris), in which listening, correction of the
Al-Quran recitation, and repetition of the correct recitation take place (Berita Harian,
2005). This method is very important to implement, so that Muslims will know how the
hijaiyah letters are correctly pronounced. The process can only be done if the Mudarris
and the recitors follow the art, rules and regulations of reading Al-Quran, known as
the “Rules of Tajweed” (Tabbal et al., 2006).
1.3 Motivation
The motivations of this research project are:
(i) In learning Al-Quran, recitors are required to learn through the manual
method of Al-Quran reading skills, the talaqqi and musyafahah method.
Through this method, the Mudarris is required to check the Tajweed rules of
each student individually. This brings many problems for the Mudarris in
controlling or handling students, particularly with a large number of
students per class. The targeted objectives of j-QAF become difficult to
achieve, as completing the specified syllabus (Tasmik & Khatam Al-Quran
module) is constrained by the time schedule provided.
(ii) A shortage of ICT applications in the teaching and learning process may
harm the quality of students' performance. Students easily become bored and
lose interest in participating in class.
(iii) The current busy lifestyle needs a modern, technological approach to
self-learning of Al-Quran recitation, which can improve the Quranic learning
process and also optimize study time.
1.4 Problem Statements
The problem statement of this research project is:
(i) No automated Tajweed checking rules tool exists as a learning tool that is
independently capable of evaluating the user's reading and performance.
1.5 Research Objectives
The objectives of this research project are:
(i) To define the most suitable algorithms for feature extraction and
recognition to be implemented in the Tajweed checking rules engine.
(ii) To determine the most accurate recognition process that suits Quranic
verse recitation.
(iii) To develop an engine that combines feature extraction and recognition,
in order to build a new automated Tajweed checking rules system.
1.6 Scope of Research
The Tajweed checking rules engine only checks the basic rules of Tajweed and “Mad” in
Quranic recitation, such as:
(i) Basic rules (Idgham: Bila Ghunnah, Ma’al Ghunnah, Syamsi; Izhar: Halqi,
Syafawi, Qamari; Iqlab; and Ikhfa’ Haqiqi)
(ii) Mad Asli and Mad Arid Lissukun
This project is a 100% software-based system and does not involve any hardware
implementation. Thus, only MATLAB coding, simulation and GUI modeling are involved in
this research.
1.7 Research Methodology
This automated Tajweed checking rules engine for Quranic verse recitation is
designed mainly to guide and assist the user, specifically Muslim users, while reading
Al-Quran. The aim of this system is to facilitate recitors during the Al-Quran learning
process, focusing on Quranic recitation based on the ‘Rules of Tajweed’. That is, the
system is capable of checking the Tajweed rules against a stored database and
recognizing the particular sourate in Al-Quran, which a recitor may recite either
correctly or incorrectly according to the Tajweed rules guidelines. This research is
carried out in the following stages:
(i) Collect input speech samples from different recitors.
(ii) Extract features from the collected speech samples of Quranic recitations
and produce a set of feature vectors.
(iii) Train the feature vectors against the initial/available database, in order
to build a unique database/model.
(iv) Recognize/match and test unknown feature vectors against the trained
database in order to obtain the recognition accuracy.
(v) Evaluate the performance of the Quranic recitation recognition engine.
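The thesis implements these stages in MATLAB, with MFCC feature extraction and HMM training and recognition (Chapter 3). Purely as an illustrative sketch of the pipeline's shape, not the thesis's actual code, the stages can be outlined in Python, with a stand-in log-energy feature and a crude averaged-template "model" in place of MFCC and HMM:

```python
import math

def frame_signal(signal, frame_size=256, hop=128):
    """Split a signal into overlapping frames (a pre-step of stage (ii));
    the 256-sample frame size follows Figure 3.6 of the thesis."""
    return [signal[i:i + frame_size]
            for i in range(0, len(signal) - frame_size + 1, hop)]

def log_energy(frame):
    """A stand-in one-number feature per frame; the thesis extracts
    MFCC vectors per frame instead."""
    return math.log(sum(s * s for s in frame) + 1e-12)

def extract_features(signal):
    """Stage (ii): turn a recording into a sequence of feature values."""
    return [log_energy(f) for f in frame_signal(signal)]

def train_templates(labelled_recordings):
    """Stage (iii): the thesis trains one HMM per ayate; here we simply
    average the feature sequences of each label as a crude 'model'."""
    grouped = {}
    for label, signal in labelled_recordings:
        grouped.setdefault(label, []).append(extract_features(signal))
    return {label: [sum(col) / len(col) for col in zip(*seqs)]
            for label, seqs in grouped.items()}

def recognize(signal, models):
    """Stage (iv): score unknown features against each model and return
    the best-matching label (the HMM uses log-likelihood instead)."""
    feats = extract_features(signal)
    def distance(model):
        n = min(len(feats), len(model))
        return sum((feats[i] - model[i]) ** 2 for i in range(n))
    return min(models, key=lambda label: distance(models[label]))
```

Stage (v) would then compare the returned labels against the known labels of a held-out test set to obtain accuracy, as reported in Chapter 5.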
1.8 Terminology
The following definitions are the basics needed for understanding speech
recognition technology. These definitions can also act as constraints or difficulties
encountered by a speech recognition system dealing with the Quranic Arabic language.
1.8.1 Utterances
An utterance is the vocalization of a word that represents a single meaning
to the computer. An utterance can be a single word, a few words, a sentence or even
multiple sentences, as long as it has a single meaning to the computer (Oxford
English Dictionary, 11th edition). Here, variability in Quranic Arabic can be caused
by dialectal differences. Variability in dialect between Arabic countries, and even
dialectal differences within the same country, cause words to be pronounced in
different ways.
1.8.2 Vocabularies
A vocabulary, also known as a dictionary, is the list of words or utterances
(Oxford English Dictionary, 11th edition) that can be recognized by the speech
recognition system. In fact, a small dictionary is easier for a computer to
recognize, while a large dictionary is more difficult. Moreover, the Arabic
language is morphologically rich, causing a high vocabulary growth rate. This
high growth rate is problematic for language models because it produces a large
number of out-of-vocabulary words.
1.8.3 Accuracy
The efficiency and ability of a recognizer can be determined by measuring its
accuracy. This includes not only correctly identifying utterances, but also
identifying whether a spoken utterance is in the vocabulary or not. The acceptable
accuracy of a system really depends on the application.
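The thesis reports accuracy as a recognition rate together with the Word Error Rate (WER; see the list of abbreviations and Figure 5.2). As an illustrative sketch, not the thesis's own MATLAB implementation, WER between a reference transcription and a recognizer's hypothesis can be computed with the standard word-level Levenshtein edit distance:

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference length,
    using dynamic programming over word sequences."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i                      # i deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j                      # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution/match
    return dp[len(ref)][len(hyp)] / len(ref)
```

For example, one wrong word in a three-word ayate gives a WER of 1/3, i.e. roughly 33%.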
1.9 Thesis Outline
This thesis contains six chapters, including this introduction. Each chapter
covers certain scopes, which formulate the thesis contents. Below are the chapter
numbers, titles and summaries of each chapter in this thesis.
Chapter 1: Introduction
Chapter 1 presents the definition and background of the project, including the
problem statements, research objectives, scope of research, research
methodology, terminology and thesis outline, which delimit the scope and
coverage of the project.
Chapter 2: Literature Review
Chapter 2 highlights the key related research, algorithms and techniques
relevant to this research, in terms of the commonly used feature extraction,
classification and pattern matching techniques, and provides an overview of
current research related to speech recognition systems.
Chapter 3: Research Methodology
Chapter 3 provides a brief description and explanation of the research
methodology used in this research. The sub-topics for this chapter include
the research procedures of the main techniques adopted, including the
algorithms used in this research.
Chapter 4: Design and Implementation
Chapter 4 provides the architecture design of the automated Tajweed checking
rules engine for Quranic verse recitation. The sub-topics for this chapter
include the research design of this engine and its implementation, as well as
other diagrams that represent the logical and physical designs of the
system.
Chapter 5: Experimental Results and Discussion
Chapter 5 contains the experimental data and results, as well as additional
information, analysis and discussion of the results obtained after the
training and testing procedures were executed on the system, and evaluates
the performance of the overall system.
Chapter 6: Conclusion and Future Enhancement
Chapter 6 summarizes the work accomplished and discusses possibilities and
recommendations for the future.
Appendix A:
Appendix A contains all signals of ayates in Sourate Al-Fatihah, obtained
from the MATLAB simulation.
Appendix B:
Appendix B contains a list of achievements and participation in both
international and national conferences, as well as competitions and
exhibitions.
CHAPTER 2
LITERATURE REVIEW
2.1 Introduction
Speech recognition is one of the most important areas in digital signal processing.
The scope of speech recognition research also involves the ‘artificial intelligence’ of
the system or machine itself, which may be able to ‘hear’ and ‘understand’ the spoken
information in a particular recitation. Automatic speech recognition has reached a very
high standard of performance over the past five years. Moreover, speech recognition is
a highly demanded technology with many useful applications. The main area believed to
contribute to the effectiveness of this research project is pattern recognition
technology. Speech recognition belongs to a much broader scientific topic called
pattern recognition or pattern matching. According to Huang et al. (2001), spoken
language processing relies on pattern recognition, which is one of the most
challenging problems for machines.
In this chapter, the general concepts related to the Quranic Arabic accent are
reviewed, and the significance that motivated this research is presented. First, the
art of Tajweed in Al-Quran is discussed. This provides a short description of the
‘art’, which is quite distinct within the Arabic language and is recognizably unique
in following a set of pronunciation rules known as the Rules of Tajweed. Secondly, a
brief overview of prosody is given. Experimental studies from the literature are
presented, which show the gap and differences between written and recited Al-Quran.
Next comes a brief discussion of the effect of the “Art of Tajweed” on the acoustic
model, which can influence the recitation recognition aspect of checking the Tajweed
rules. Those effects are related to Arabic linguistic properties, which are discussed
in detail in Section 2.4. Finally, key work related to the research is highlighted, as
well as algorithms and techniques relevant to this research. Various types of feature
extraction, classification and matching techniques are also discussed in this chapter.
2.2 The “Art of Tajweed”
“Tajweed” is an Arabic word meaning proper pronunciation during recitation, as
well as recitation at a moderate speed. It is a set of rules governing how Al-Quran
should be read (Bashir, M.S. et al., 2003). It is considered an art because not all
recitors will perform the same recitation of a Quranic verse in the same way (Tabbal,
H. et al., 2006). The “art of Tajweed” is defined by a set of flexible, well-defined
rules for reciting Al-Quran. Those rules create a big difference between normal Arabic
speech and recited Quranic verses, which may produce interesting results from the
impact of the “art” on the automatic speech recognition process, especially on the
acoustic model. Furthermore, teaching the “Art of Tajweed” by manual methods requires
a lot of work and has proved unable to adapt to new recitors. Still, it is believed
that the special way to recite Al-Quran is through the art of Tajweed (Bashir, M.S.
et al., 2003).
2.3 Effect of the “Art of Tajweed” on the acoustic model
As we already know, each person’s voice is different. Thus, the sound of Al-Quran
as recited tends to differ considerably from one person to another. Although the sentences
are taken from the same verse, the way a sentence of Al-Quran is recited or delivered may
differ (Tabbal, H. et al., 2006), producing different sounds for different reciters. Moreover,
many difficulties arise when dealing with the specialties of the Arabic language in
Al-Quran, owing to the differences between written and recited Al-Quran: the same
combination of letters may be pronounced differently due to the use of harakattes (Tabbal,
H. et al., 2006). The most important tajweed rules believed to influence the recitation
recognition aspect are stated below:
i) Necessary prolongation of 6 vowels.
ii) Obligatory prolongation of 4 or 5 vowels.
iii) Permissible prolongation of 2, 4 or 6 vowels.
iv) Normal prolongation of 2 vowels.
v) Nasalization (ghunnah) of 2 vowels.
vi) Silent unannounced letters.
vii) Emphatic pronunciation of the letter R.
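For illustration, the prolongation and nasalization rules above can be encoded as simple data that a checking engine consults. The sketch below is illustrative only: the rule names, the vowel-count sets and the `check_prolongation` helper are hypothetical encodings of the list above, written in Python, and are not taken from any cited system.

```python
# Hypothetical encoding of the prolongation/nasalization rules listed above.
# Each rule maps to the set of admissible prolongation lengths (in vowels).
TAJWEED_RULES = {
    "necessary_prolongation": {6},          # rule (i)
    "obligatory_prolongation": {4, 5},      # rule (ii)
    "permissible_prolongation": {2, 4, 6},  # rule (iii)
    "normal_prolongation": {2},             # rule (iv)
    "ghunnah_nasalization": {2},            # rule (v)
}

def check_prolongation(rule: str, observed_vowels: int) -> bool:
    """Return True if an observed prolongation length satisfies the rule."""
    return observed_vowels in TAJWEED_RULES[rule]
```

A rules engine would apply such checks to prolongation lengths measured from the recited audio.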
The above rules are based on the specific recitation rules. Moreover, predefined
“maqams” are also used by reciters to vary the tone of their recitations (Tabbal, H. et al.,
2006). There are 10 different sets of rules according to the 10 certified scholars who taught
the recitation of the Holy Quran, namely Hafs, Kaloun, Warsh, Shu’bah, Hicham,
Ibn-Dhakwan, Al-Duri, Al-Susi, Al-Bazzi and Kunbul (Habash, M., 1998). In order to deal
with these rules, prolongation needs to be treated as repetition of the vowel the
corresponding number of times, and likewise nasalization (Tabbal, H. et al., 2006). These
rules govern consonant/vowel combinations, the usage of short and long vowels, the
co-articulation effect of emphatics and pharyngeals, pronunciations, the Tanween and
Ghunnah rules, as well as rules for combining words (Ahmed, M.E., 1991). Note that any
echoing sound produced during the Quranic recitation recording process is considered
noise, which can be eliminated using a noise-cancelling filter (Tabbal, H. et al., 2006).
2.4 Linguistic properties of Arabic
Arabic is an official language in more than 22 countries. Since it is also the
language of religious instruction in Islam, many more speakers have at least a passive
knowledge of the language. Arabic is often described as morphologically complex, and the
problem of language modeling for Arabic is compounded by dialectal variation (Vergyri,
D. & Kirchhoff, K., 2004; Maamouri, M. et al., 2006; Kirchhoff, K. et al., 2004). However,
only Modern Standard Arabic (MSA) is used for written and formal communication,
because only MSA has a universally agreed-upon writing standard (Vergyri, D. &
Kirchhoff, K., 2004; Maamouri, M. et al., 2006; Kirchhoff, K. et al., 2004; Kirchhoff, K.,
2002).
As mentioned earlier in section 2.3, many difficulties arise when dealing with the
specialties of the Arabic language in Al-Quran, due to the differences between written and
recited Al-Quran (Tabbal, H. et al., 2006; Maamouri, M. et al., 2006; Kirchhoff, K. et al.,
2004). The Quranic Arabic alphabet consists of 28 letters, known as hijaiyah letters, from
alif (ا) to ya (ي) (Vergyri, D. & Kirchhoff, K., 2004; Kirchhoff, K. et al., 2004). These
comprise 25 letters representing consonants and 3 letters for vowels (/i:/, /a:/, /u:/) and the
corresponding semivowels (/y/ and /w/), where applicable. A letter can have two to four
different shapes: isolated, beginning of a (sub)word, middle of a (sub)word and end of a
(sub)word (Kirchhoff, K. et al., 2004). Letters are mostly connected and there is no
capitalization. The letters, in their various forms, are presented in table 2.1 below.
Table 2.1: The Arabic alphabets (from Ramzi, A.H. & Omar, E.A., 2007)
Furthermore, other aspects of pronunciation are marked by diacritics, such as
consonant doubling (phonemic in Arabic), which is indicated by the “shadda” sign, and the
“tanween”, word-final adverbial markers which add /n/ to the pronunciation (Maamouri, M.
et al., 2006; Kirchhoff, K., 2004), as shown in table 2.2. These signs reflect differences in
pronunciation. Moreover, the diacritics are very important in establishing grammatical
functions, leading to acceptable text understanding and correct reading or analysis
(Maamouri, M. et al., 2006). The entire set of diacritics is listed in table 2.2 below:
Table 2.2: Arabic diacritics (from Vergyri, D. & Kirchhoff, K., 2004)
Some Arabic letters may have an additional character called Hamza. Another non-basic
character is Taa-Marbuwta, which always occurs at the end of a word. The Arabic language
has a very large vocabulary. Arabic characters may carry diacritics, written as strokes
above or below the character, which can change the pronunciation and meaning of the
word. However, they are usually omitted in handwriting.
Figure 2.1: Arabic general characteristics
According to figure 2.1 above, each number represents a certain characteristic, as listed
below:
1. Writing direction
2. Ascenders
3. Descenders
4. Holes (loops)
5. Secondary parts (dots/diacritics)
6. Ligatures
7. Connected components (sub-words)
8. Turning points
9. Different letter forms with regard to their position within the word (Sari, T. et al., 2002)
According to table 2.3, the Arabic language is characterized by a relatively large
number of back consonants. This type of consonant can cause complex co-articulation
phenomena in Arabic speech. In addition, a set of allophones as well as the consonant
letters have been described (Ahmed, M.E., 1991; Youssef, A. & Emam, O., 2004) and
divided into several groups, classified as below:
Table 2.3: Arabic Consonants (from Ahmed, M.E., 1991)
Group A: The Emphatic Consonants: /T/, /S/, /D/, and /∂/.
Group B: The Pharyngeals: /q/, /x/, and /γ/; and /r/.
Group C: The Madd letters: Alif, “أ “, Ya’a, “ی”, Waw, “ۉ”.
Group D: The rest of the letters (except the pharyngealized Lam /L/).
Group E: Glottal/Pharyngeals (Al-Ezhar letters); /E/, /h/, /H/, /? /, /x/, /γ/,
Group F: Ash-Shamsi letters: /t/, /Ө/, /d/, /∂/, /z/, /s/, /∫/, /S/, /D/, /T/, /∂/, /l/, /n/.
Group G: Al-Qamari letters: /E/, /b/, /dz/, /H/, /x/, /? /, /γ/, /f/, /q/, /k/, /m/, /w/, /h/
Group H: Muqalqal letters (aspirated): /q/, /T/, /b/, /dz/, /d/.
Group I: Ikhfa’a letters: /t/, /Ө/, /s/, /∫/, /dz/, /d/, /∂/, /z/, /S/, /D/, /T/, /∂/, /f/, /k/, /q/.
Group J: Voiceless Fricative consonants: /f/, /Ө/, /s/, /∫/, /h/, /H/, /S/, /x/.
Group K: Stops: /D/, /d/, /t/, /T/, /k/, /q/, /b/.
Group L: The Consonants: /dz/, /q/, /k/.
Letter-to-sound conversion for Arabic usually involves a simple one-to-one
mapping between orthography and phonetic transcription, given correct diacritics. 14
vowels are used to accommodate short and long vowels, as well as the emphatic vowels.
Each syllable begins with a consonant followed by a vowel, so syllables are limited in form
and easily detectable. Short vowels are denoted by “V” and long vowels by “V:” (Ahmed,
M.E., 1991; Youssef, A. & Emam, O., 2004; Essa, O., 1998). Syllables can be classified
according to their length, also known as harakattes (Tabbal, H. et al., 2006):
CV    Short; open
CV:   Long; open
CVC   Long; closed
CV:C  Long; closed
CVCC  Long; closed
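The syllable classification above can be expressed as a small helper. The `classify_syllable` function below is a hypothetical Python sketch based only on the table ("V:" marks a long vowel), not a routine from any cited work.

```python
def classify_syllable(pattern: str) -> tuple:
    """Classify a C/V syllable pattern by length and openness.

    Only "CV" is short; a syllable is open when it ends in a vowel
    ("CV" or "CV:"), otherwise it is closed by a consonant.
    """
    length = "Short" if pattern == "CV" else "Long"
    openness = "open" if pattern in ("CV", "CV:") else "closed"
    return (length, openness)
```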
2.5 Quranic Verse Recitation Recognition Systems
This project mainly focuses on the basics of speech recognition technology, but
implements them for a different type of application and language, namely the Arabic of
Al-Quran. Quranic Arabic recitation is best described as long, slow-paced, rhythmic,
monotone utterances (Kirchhoff, K. et al., 2003). The sound of Quranic recitation is
recognizably unique and reproducible according to a set of pronunciation rules, tajweed,
designed for clear and accurate presentation of the text.
Tabbal, H. et al. (2006) have already explored the implementation of Quranic verse
recitation recognition, covering an Al-Quran verse delimitation system for audio files
using speech recognition techniques. Quranic recitation and pronunciation, as well as the
software used for recognition purposes, are discussed there. The Automatic Speech
Recognizer (ASR) was developed using the open-source Sphinx framework as the basis of
the research. The scope of that project focuses on the automated delimiter, which can
extract verses from audio files. Techniques for each phase were discussed and evaluated by
implementing various techniques for different reciters reciting sourate “Al-Ikhlas”. Here,
the most important rules of Tajweed and Tarteel, which may influence the recognition of a
specific recitation, can be specified.
A comprehensive evaluation of Quranic verse recitation recognition techniques
was provided by Ahmad, A.M. et al. (2004). The survey provides recognition rates and
descriptions of test data for the approaches considered, incorporating background on the
area, discussion of the techniques and potential research directions. There, a Recurrent
Neural Network with backpropagation-through-time approaches to speech recognition was
implemented, and differences between the Arabic letters from alif (ا) to ya (ي) were
observed based on the performance of cepstral analysis and recognition effectiveness. In
general, there are four major stages in a speech recognition system, and Quranic Arabic
recitation recognition can be implemented using the same techniques:
1. Pre-Processing
2. Feature Extraction
3. Training / Feature Classification
4. Recognition / Identification
This can be described by the system architecture shown below:
Figure 2.2: System architecture
2.5.1 Pre-Processing
Pre-processing steps are essential for improving the readability and the automatic
recognition of speech. The main benefit of pre-processing in speech recognition is that it
organizes the information and simplifies the subsequent recognition task. The
pre-processing steps mainly consist of the following:
1. Endpoint Detection
2. Pre-Emphasis Filtering/Noise Filtering/Smoothing
3. Channel Normalization/Distortion Equalization
2.5.1.1 Endpoint Detection
Short-time energy or spectral energy is usually used as the primary feature,
augmented with zero-crossing rate, pitch and duration information, in endpoint detection
algorithms. However, these endpoint detection features become less reliable in the
presence of non-stationary noise and various types of sound artifact (Shen, J. et al., 1998),
because the detection and verification of speech segments become relatively difficult in
noisy environments.
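A minimal energy-based endpoint detector along these lines can be sketched in Python as follows. The sketch assumes 16 kHz audio, 10 ms frames with 50% hop, and an energy threshold set relative to the loudest frame; the zero-crossing rate is included as the usual companion feature. All names and parameter values are illustrative, not taken from the cited work.

```python
import numpy as np

def short_time_energy(x, frame_len=160, hop=80):
    """Frame-wise short-time energy (160 samples ~ 10 ms at 16 kHz)."""
    x = np.asarray(x, dtype=float)
    starts = range(0, len(x) - frame_len + 1, hop)
    return np.array([np.sum(x[s:s + frame_len] ** 2) for s in starts])

def zero_crossing_rate(x, frame_len=160, hop=80):
    """Frame-wise zero-crossing rate, a common companion feature."""
    x = np.asarray(x, dtype=float)
    starts = range(0, len(x) - frame_len + 1, hop)
    return np.array([np.mean(np.abs(np.diff(np.sign(x[s:s + frame_len]))) > 0)
                     for s in starts])

def detect_endpoints(x, energy_ratio=0.1):
    """Return (first, last) active frame indices, where a frame is active
    when its energy exceeds a fraction of the loudest frame's energy."""
    e = short_time_energy(x)
    active = np.where(e > energy_ratio * e.max())[0]
    return (int(active[0]), int(active[-1])) if active.size else (0, 0)
```

As the cited work notes, a fixed relative threshold of this kind degrades quickly under non-stationary noise, which is exactly why such detectors become unreliable in noisy environments.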
2.5.1.2 Pre-Emphasis Filtering/Noise Filtering/Smoothing
The purpose of the smoothing stage is to decrease the noise and regularize the word
contours. Ahmad, A.M. et al. (2004) also digitized the Arabic alphabet from speakers and
applied digital filtering. Digital filtering can emphasize the important frequency
components in the signal, after which the start and end points can be analysed from the
signal of the phonemes. Here, the GoldWave audio editor software was used to filter the
input speech signal and convert it from analog to digital, in order to analyse the start-end
points that contain the speech information.
According to Tabbal, H. et al. (2006), the use of a 2-stage pre-emphasis filter with
different factor values (0.92 and 0.97) could increase the recognition ratio of some audio
files, with a speech frame of 10 ms and a threshold of 10 dB chosen for the speech
extractor. It can also be considered a noise-cancelling filter, since it eliminates echo
(noise). Besides pre-emphasis filtering, another technique used by Kirchhoff, K. et al.
(2004) is Kneser-Ney smoothing, which can build trigram models for each morphological
stream and is believed to consistently outperform other smoothing methods, including in
noisy environments.
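The 2-stage pre-emphasis filter described above can be sketched as a cascade of two first-order high-pass filters, y[n] = x[n] - a*x[n-1]. The factor values 0.92 and 0.97 are taken from the text; the Python function names are illustrative assumptions.

```python
import numpy as np

def pre_emphasis(x, alpha):
    """Single-stage pre-emphasis: y[n] = x[n] - alpha * x[n-1]."""
    x = np.asarray(x, dtype=float)
    y = np.empty_like(x)
    y[0] = x[0]                      # first sample passes through unchanged
    y[1:] = x[1:] - alpha * x[:-1]
    return y

def two_stage_pre_emphasis(x, a1=0.92, a2=0.97):
    """Cascade of two pre-emphasis stages with the factor values
    (0.92 and 0.97) quoted from Tabbal, H. et al. (2006)."""
    return pre_emphasis(pre_emphasis(x, a1), a2)
```

Each stage attenuates low frequencies and boosts high ones, which is why the cascade also acts as a crude noise/echo-reducing filter.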
2.5.1.3 Channel Normalization/ Distortion Equalization
Another approach used in pre-processing is known as Channel Normalization.
According to de Veth, J. & Boves, L. (1998), Channel Normalization (CN) techniques
have been developed for application domains where a particular recognizer is trained with
speech recorded using one microphone, while recognition is attempted on speech recorded
with a different microphone. The contribution of channel normalization during training is
still not known in detail, but it remains constant during test time.
2.5.2 Feature Extraction
Feature extraction is the process of extracting measurements from the input in
order to differentiate among classes. The main objective of feature extraction is to extract
characteristics from the speech signal that are unique, discriminative, robust and
computationally efficient for each word, which are then used to differentiate between
different words (Ursin, M., 2002). According to Martens, J.P. (2002), there are various
speech feature extraction techniques, as stated below:
1. Linear Predictive Coding (LPC)
2. Perceptual Linear Prediction (PLP)
3. Mel-Frequency Cepstral Coefficient (MFCC)
4. Spectrographic Analysis
2.5.2.1 Linear Predictive Coding (LPC)
Ahmad, A.M. et al. (2004) used this extraction technique to extract LPC
coefficients from the speech token. The coefficients are then converted to cepstral
coefficients that serve as input to the neural networks. A drawback of LPC is its high
sensitivity to quantization noise; converting the LPC coefficients into cepstral coefficients
decreases the sensitivity of the high- and low-order coefficients to noise.
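The LPC-to-cepstrum conversion mentioned here is commonly computed with a standard recursion, c_m = a_m + sum_{k=1}^{m-1} (k/m) c_k a_{m-k}. Sign conventions for the predictor coefficients vary between texts, so the Python sketch below is a hedged illustration of that recursion rather than the exact procedure of Ahmad, A.M. et al. (2004).

```python
def lpc_to_cepstrum(a, n_ceps=None):
    """Convert LPC coefficients a = [a_1, ..., a_p] to cepstral coefficients
    via c_m = a_m + sum_{k=1}^{m-1} (k/m) * c_k * a_{m-k}.
    Predictor coefficients beyond order p are treated as zero."""
    p = len(a)
    n = n_ceps if n_ceps is not None else p
    c = []
    for m in range(1, n + 1):
        acc = a[m - 1] if m <= p else 0.0
        for k in range(1, m):
            if 1 <= m - k <= p:
                acc += (k / m) * c[k - 1] * a[m - k - 1]
        c.append(acc)
    return c
```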
According to Ahmed, M.E. (1991), the LPC model was replaced with a formant
model that has a much wider frequency spectrum. It is believed that the LPC synthesis
model can give poor results when used to deduce the prosodic rules, which are very
important missing building blocks for constructing an allophone-based Arabic
text-to-speech system by rules.
2.5.2.2 Perceptual Linear Prediction (PLP)
Another popular feature set is Perceptual Linear Prediction (PLP) coefficients,
which were used by Vuuren, S.V. (1996) in his research, where the discriminability and
robustness against noise of both Perceptual Linear Prediction (PLP) and Linear Predictive
Coding (LPC) were compared. In PLP in particular, the spectral scale is the non-linear
Bark scale and the spectral features are smoothed within the frequency bands.
PLP was first introduced by Hermansky, H. (1990), who formulated PLP feature
extraction as a method for deriving a more auditory-like spectrum, based on linear
predictive analysis, by making some engineering approximations to the psychophysical
attributes of the human hearing process.
2.5.2.3 Mel-Frequency Cepstral Coefficient (MFCC)
The purpose of this research is to convert the speech waveform into some type of
parametric representation. Thus, the viability of the Mel-Frequency Cepstral Coefficient
(MFCC) technique for extracting features from Quranic verse recitation can be explored
and investigated. MFCC is perhaps the most popular feature extraction method in recent
use (Bateman, D. et al., 1992; Ehab, M. et al., 2007), and one of the most popular feature
extraction techniques in speech recognition, being based on the frequency domain of the
Mel scale for the human ear (Chetouani, M. et al., 2002). MFCCs are based on the known
variation of the human ear’s critical bandwidths with frequency. The speech signal is
expressed on the Mel frequency scale in order to capture the important phonetic
characteristics of speech. This scale has a linear frequency spacing below 1000 Hz and a
logarithmic spacing above 1000 Hz. A normal speech waveform may vary from time to
time depending on the physical condition of the speaker’s vocal cords; MFCCs are less
susceptible to such variations than the speech waveforms themselves (Rabiner, L. &
Juang, B.H., 1993).
In the research conducted by Ahmad, A.M. et al. (2004), the Mel scale was used to
perform filterbank processing on the power spectrum, after windowing and the FFT had
been applied. A similar approach was carried out by Tabbal, H. et al. (2006). The use of
MFCC has produced remarkable results in the field of speech recognition, because it
attempts to emulate the behavior of the auditory system by transforming the frequency
axis from a linear to a non-linear scale.
According to Youssef, A. & Emam, O. (2004), 12-dimensional Mel-Frequency
Cepstral Coefficients (MFCCs) were coded for the recorded speech data. Pitch marks were
produced using a wavelet transform approach on the glottal closure signal, obtained from
the professional speaker during the recording process. Khalifa, O. et al. (2004) identified
the main steps of MFCC computation, as clearly shown in figure 2.3 below. The main
steps include the following:
Figure 2.3: Block diagram of the computation steps of MFCC
According to figure 2.3, MFCC computation consists of the following steps:
1. Preprocessing
2. Framing
3. Windowing
4. DFT
5. Mel-Filterbank
6. Logarithm
7. Inverse DFT
Likewise, Hasan, M.R. et al. (2004) used MFCCs for feature extraction in a security
system based on speaker identification. Here, the pitch of the speech signal is measured on
the ‘Mel’ scale, which is given by the equation below:

Mel(f) = 2595 * log10(1 + f/700)    (1)
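Equation (1) and its inverse can be written directly as conversion functions, as in the Python sketch below; the names `hz_to_mel` and `mel_to_hz` are illustrative. The inverse mapping is typically what is used to place the centre frequencies of the Mel filterbank.

```python
import math

def hz_to_mel(f_hz):
    """Equation (1): Mel(f) = 2595 * log10(1 + f / 700)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel):
    """Inverse of equation (1), used when spacing filterbank centres."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)
```

With these constants, 1000 Hz maps to approximately 1000 Mel, consistent with the linear-below/logarithmic-above behaviour described in the text.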
2.5.2.4 Spectrographic Analysis
There are a few Arabic speech recognition systems, normally speaker-dependent,
that use different techniques such as formant values and their trends. Automated
spectrogram analysis provides better results than simple formant values. The objective of
the research by Bashir, M.S. et al. (2003) was to implement a feature extraction strategy
for Arabic-language phoneme identification through spectrographic analysis. According to
that research, each phoneme of the Arabic language can be identified from particular
distinct bands within the spectrogram, and determination of a particular phoneme depends
on the specified frequency band. Based on the results, speech processing using
spectrograms gives more accurate results than other conventional techniques. However,
spectrogram analysis is believed to take more time to execute and to be difficult to
automate, especially in speech processing.
2.5.3 Training / Feature Classification and Pattern Recognition Techniques
According to Huang, X. et al. (2001), spoken language processing relies heavily on
pattern recognition, which is one of the most challenging problems for machines. The main
objective of pattern recognition is to classify an object of interest into one of a number of
categories or classes. The objects of interest are known as patterns; in this case the classes
are the individual words. Since the classification procedure applied in this research
operates on extracted features, it can also be referred to as feature matching. Pattern
matching for recognition purposes is divided into 3 types, which are:
1. Hidden Markov Model (HMM)
2. Artificial Neural Network (ANN)
3. Vector Quantization (VQ)
2.5.3.1 Hidden Markov Model (HMM)
Nathan, K. et al. (1995) implemented HMMs for recognizing handwritten words
captured from a tablet, since Hidden Markov Models (HMMs) had already been applied
successfully to speech recognition systems. In the research of Tabbal, H. et al. (2006), the
output of the front-end was used to feed the Sphinx core recognizer, which uses the Hidden
Markov Model (HMM) as its recognition tool; the results of the recognizer were placed in
a hash map to be translated into the common Arabic words. An HMM generates a
discrete-time random process consisting of two sequences of random variables: the hidden
states and the known observations. The underlying structure of an HMM is a set of states
associated with probabilities of transitions between the states, known as a Markov chain
(Hansen, J.C., 2003).
On the other hand, the acoustic decision trees used in synthesis are built from the
HMM alignment. Such an alignment was performed by Youssef, A. & Emam, O. (2004),
where acoustic, energy, pitch and duration trees were developed and executed with the
efficient maximum-likelihood algorithms that exist for HMM training and recognition
(Lee, K.F. & Hon, H.W., 1989).
(a) HMM Training
The Baum-Welch, or Forward-Backward, algorithm is used for training
HMMs. The HMM algorithms play a crucial role in ASR (Automatic Speech
Recognition), mapping states, transitions and observations onto the speech
recognition task. Extensions to the Baum-Welch algorithm are needed to deal with
spoken language. This method was applied by Jurafsky, D. & Martin, J.H. (2007),
where speech recognition systems train each phone HMM embedded in an entire
sentence, so that segmentation and phone alignment are done automatically as part
of the training procedure. Each word in the vocabulary to be recognized is modeled
by a distinct HMM, and each word has a training set of k utterances by different
speakers (Rabiner, L. & Juang, B.H., 1993). The HMM model parameters (A, B, π)
need to be estimated so as to maximize the likelihood of the training set.
(b) HMM Testing
The Viterbi algorithm is used for decoding HMMs. For each unknown word
to be recognized, the observation sequence is measured via feature analysis of the
speech, regardless of the word. The word whose model likelihood is maximum is
then selected using the Viterbi algorithm (Hemantha, G.K. et al., 2006).
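The Viterbi decoding step described above can be sketched in Python for a discrete-observation HMM with parameters (A, B, π); in a word recognizer, this would be run once per word model and the word whose model gives the maximum likelihood selected. The function and variable names are illustrative, not from any cited implementation.

```python
import numpy as np

def viterbi(A, B, pi, obs):
    """Most likely hidden-state path for an observation sequence.

    A:  (N, N) state transition matrix, A[i, j] = P(j | i)
    B:  (N, M) emission matrix, B[j, o] = P(o | j)
    pi: (N,)   initial state probabilities
    obs: sequence of observation symbol indices
    Returns (state_path, path_probability).
    """
    N, T = A.shape[0], len(obs)
    delta = np.zeros((T, N))            # best path probability ending in state j
    psi = np.zeros((T, N), dtype=int)   # back-pointers
    delta[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] * A   # scores[i, j]: come from i, go to j
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) * B[:, obs[t]]
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):            # follow back-pointers
        path.append(int(psi[t][path[-1]]))
    return list(reversed(path)), float(delta[-1].max())
```

In practice log probabilities are used instead of raw products to avoid numerical underflow on long utterances.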
2.5.3.2 Artificial Neural Network (ANN)
An Artificial Neural Network (ANN) is often simply called a Neural Network
(NN). It is a computational or mathematical model based on biological neural networks.
ANNs are made up of interconnected artificial neurons, and may be used either to gain an
understanding of biological neural networks or to solve artificial intelligence problems
without necessarily modeling a real biological system. The ANN belongs to the artificial
intelligence approaches, which attempt to mechanize the recognition procedure in the way
a person applies intelligence in visualizing, analyzing and characterizing speech based on a
set of measured acoustic features (Madisetti, V.K. & William, D.B., 1999).
According to Huang, X. et al. (2001), dealing with non-stationary signals requires
addressing how to map an input sequence to an output sequence properly. The two
sequences may not be synchronous, in which case proper alignment, segmentation and
classification are required. Basic neural networks are not as well equipped to address these
problems as HMMs.
Figure 2.4: Interconnected group of nodes in ANN (from Huang, X. et al., 2001)
2.5.3.3 Vector Quantization (VQ)
Quantization is the process of approximating continuous-amplitude signals by
discrete symbols. Quantization of a single signal value or parameter is known as scalar
quantization, as opposed to vector quantization and other variants. On this topic, Huang,
X. et al. (2001) describe the vector quantizer in terms of a codebook, which is a set of
fixed prototype or reproduction vectors; each prototype vector is known as a codeword. To
perform quantization, each input vector is matched against each codeword in the codebook
using a distortion measure. Thus, the VQ process involves the distortion measure and the
generation of each codeword of the codebook. The goal of VQ is to minimize the
distortion (Vuuren, S.V., 1996).
VQ is divided into 2 parts, known as feature training and feature matching. Feature
training is mainly concerned with selecting feature vectors and training the codebook
using a Vector Quantization (VQ) algorithm. The training of the VQ codebook applies an
important algorithm known as the LBG algorithm, which is used for clustering a set of L
training vectors into a set of M codebook vectors. The algorithm is formally implemented
as a recursive procedure (Linde, Y. et al., 1980); the steps required for training the VQ
codebook using the LBG algorithm are described by Rabiner, L. & Juang, B.H. (1993).
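A compact Python sketch of the LBG training loop follows: the codebook is grown by splitting each codeword into a perturbed pair, then refined with nearest-neighbour assignment and centroid updates, for M a power of two. This is an illustrative rendering of the procedure, not the exact formulation of Linde, Y. et al. (1980); in particular, the multiplicative perturbation assumes non-zero centroids (an additive epsilon is equally common).

```python
import numpy as np

def lbg_codebook(vectors, M, eps=0.01, n_iter=10):
    """Train an M-codeword VQ codebook (M a power of 2) with the LBG
    split-and-refine scheme, using squared Euclidean distortion."""
    vectors = np.asarray(vectors, dtype=float)
    codebook = vectors.mean(axis=0, keepdims=True)   # start from global centroid
    while len(codebook) < M:
        # Split every codeword into a slightly perturbed pair.
        codebook = np.concatenate([codebook * (1 + eps), codebook * (1 - eps)])
        for _ in range(n_iter):
            # Assign each training vector to its nearest codeword.
            d = ((vectors[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
            nearest = d.argmin(axis=1)
            # Centroid update; an empty cell keeps its old codeword.
            for k in range(len(codebook)):
                cell = vectors[nearest == k]
                if len(cell):
                    codebook[k] = cell.mean(axis=0)
    return codebook
```

At recognition time, the accumulated distortion of an utterance against each codebook plays the role of the matching score.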
Figure 2.5 shows the block diagram of a vector quantizer, which consists of two
main parts: the encoder and the decoder. The task of the encoder is to identify in which of
N geometrically specified regions the input vector lies. The decoder is then a table lookup,
fully determined by specifying the codebook (Wai, C.C., 2003).
Figure 2.5: The Encoding-Decoding Operation in VQ (from Wai, C.C., 2003)
2.5.4 Recognition/Identification
There are many methods used for recognition as well as identification. Using the
same speech recognition techniques, the methods normally applied nowadays are listed
below:
1. Hidden Markov Model (HMM)
2. Vector Quantization (VQ)
3. Artificial Neural Network (ANN).
2.5.4.1 Hidden Markov Model (HMM)
As described in section 2.5.3 on feature classification, the HMM method has been
fully implemented for both recognition and training purposes (Lee, K.F. & Hon, H.W.,
1989). In the same research by Jurafsky, D. & Martin, J.H. (2007), HMM recognition was
used for a digit recognition task. A lexicon specifies the phone sequence, and each phone
HMM is composed of three sub-phones with a Gaussian emission likelihood model; the
observation likelihood is computed by the acoustic model. Combining all these elements,
with an optional silence added at the end of each word, results in a single HMM for the
whole task. Note that the transition from the ‘End’ state to the ‘Start’ state allows digit
sequences of arbitrary length.
On the other hand, recognition has also been carried out in a large HMM, following
Viterbi, A.J. (1967). For context-independent phone recognition, an initial and a final state
are created. The initial state is connected with null arcs to the initial state of each phonetic
HMM, and null arcs connect the final state of each phonetic HMM to the final state; the
final state is also connected back to the initial state. The Hidden Markov Model (HMM) is
widely used as a statistical method for characterizing the spectral properties of the frames
of an utterance, since the process can be treated as a random process whose parameters can
be estimated precisely and accurately.
2.5.4.2 Vector Quantization (VQ)
The most successful text-independent recognition methods are based on VQ. In this
approach, a VQ codebook consisting of a small number of representative feature vectors is
used as an efficient means of characterizing speaker-specific features. A speaker-specific
codebook is generated by clustering the training feature vectors of each speaker, as
described in section 2.5.3.3. In the recognition stage, an input utterance is vector-quantized
using the codebook of each reference speaker, and the VQ distortion accumulated over the
entire input utterance is used to make the recognition decision. It is believed that the
VQ-based method is more robust than a continuous HMM method, as stated by Matsui, T.
& Furui, S. (1993) in their research.
2.5.4.3 Artificial Neural Network (ANN)
The Artificial Neural Network (ANN), also known as a Neural Network (NN), is
also widely used for feature matching and recognition in speech processing. It is normally
used to classify a set of features representing the spectral-domain content of the speech
(regions of strong energy at particular frequencies). The features are converted into
phonetic-based categories at each frame, and a Viterbi search is then used to match the
neural-network output scores to the target words (the words assumed to be in the input
speech), in order to determine the word most likely uttered (Hosom, J.P. et al., 1999).
2.6 Comparison of Speech Recognition techniques for Quranic Arabic recitation
This section provides a comparison of speech recognition systems using the
techniques discussed in this literature review. The main criterion for comparing the
approaches applied to Quranic Arabic is their performance, as shown in table 2.4 below.
Table 2.4: Approaches used by Quranic Arabic recitation using speech recognition techniques

| References | Pre-processing | Feature Extraction Method | Classification/Recognition Techniques | Performance |
| [Tabbal, H. et al. '06] | Pre-emphasis filter | MFCC | Hidden Markov Model (HMM) | 85%-92% |
| [Youssef, A. & Emam, O. '04] | - | MFCC | Hidden Markov Model (HMM) | 90.2% |
| [Ahmad, A.M. et al. '04] | Digital filtering | MFCC, LPCC | Recurrent Neural Network (RNN) | MFCC 95.9%-98.6%; LPCC 94.5%-99.3% |
| [Bashir, M.S. et al. '03] | Pre-emphasis filtering (bandpass filter) | Spectrographic analysis | Spectrographic analysis based on different frequency bands of intensity | 93.33% |
| [Kirchhoff, K. et al. '04] | Kneser-Ney smoothing | Not stated | Hidden Markov Model (HMM) | Not stated |
| [Hasan, M.R. et al. '04] | - | MFCC | Vector Quantization (VQ) | 57%-100% |
| [Podder, S.K. '97] | - | LPC | VQ and HMM | 62%-96% |
| [Bhotto, M.Z.A. & Amin, M.R. '04] | - | MFCC | Vector Quantization (VQ) | 70%-85% |
2.7 Summary
In this study, the different methods and approaches have been discussed in order to
find the most suitable ones to be used in this project, and the methods which can logically
be applied were then decided. The MFCC method was chosen for feature extraction,
because it builds on the DFT and FFT algorithms and because the majority of researchers
have used MFCCs as their main features for extraction purposes.
On the other hand, the training as well as the recognition part will be conducted
using either HMM, ANN or VQ. These 3 methods are the ones normally used in speech
recognition at present and the most dominant pattern recognition techniques in the field.
They have all shown strong performance, in different ways and settings, and each has its
own benefits and weaknesses. From this point of view, HMM is the most suitable method,
and it has been implemented by most researchers in Arabic speech recognition, although
those implementations were speaker-dependent rather than speaker-independent, with low
accuracy in the latter case. This contrasts with the VQ algorithm, which has mostly been
used by researchers in speech recognition projects related to the English language. In
addition, the combination of the MFCC and HMM techniques has mostly been
implemented for speech recognition applications, especially for the Arabic language, as
shown in table 2.4. These techniques have also been shown to be applicable to this
research, since the reported performance percentages were above 90%.
CHAPTER 3
RESEARCH METHODOLOGY
3.1 Introduction
The previous chapter showed that a successful automated speech recognition system requires a combination of techniques and algorithms, each performing a specific task toward the main goal of the system. A combination of related algorithms is therefore essential to improve the accuracy and recognition rate of such applications. This chapter presents the research methodology for the development of the automated Tajweed checking rules engine for Quranic verse recitation, with emphasis on the techniques and algorithms used to develop and implement the engine. It provides a step-by-step MATLAB implementation of the feature extraction, feature classification, and feature matching processes used in building the engine.
Here, the main feature extraction algorithm, Mel-Frequency Cepstral Coefficients (MFCC), is described and applied to the full set of ayates, or phonemes, of the Quranic recitation. The engine also implements the Hidden Markov Model (HMM) algorithm, mainly for classification, matching, and pattern recognition.
3.2 Tajweed checking rules engine techniques and algorithms
As highlighted in previous chapters, the conventional method for speech recognition is the Hidden Markov Model (HMM). In this technique, a feature vector is extracted from the speech signal and the recognition result depends on the log-likelihood computed for each word in the vocabulary: the word whose model gives the largest log-likelihood is chosen as the recognition result. Since different people pronounce even the same sentence differently, HMM classification is used to improve the recognition accuracy.
This chapter discusses the techniques and algorithms involved in this research in detail. First, the input speech is filtered to remove noise, and its important characteristics are extracted with the Mel-Frequency Cepstral Coefficients (MFCC) feature extraction technique, producing a set of feature vectors as output. The whole sentence can then be estimated and classified; the pattern classification method used here is the Hidden Markov Model (HMM). The entire process in this research is as follows:
Input : Quranic verse recitation of Sourate Al-Fatihah
Output : Result of Sourate Al-Fatihah recitation – notification for any correct or
incorrect recitation based on Tajweed rules
Stage 1: Training
Begin
Step 1 : Input speech signal of Quranic verse recitation is sampled
Step 2 : Preemphasis is executed – Finite Impulse Response (FIR) filter
Step 3 : The speech signal is framed
Step 4 : Framed speech signal is windowed by using Hamming Window
Step 5 : Fast Fourier Transform is applied to the windowed speech signal
Step 6 : Mel-Frequency Cepstral Coefficients (MFCC) is calculated
Step 7 : HMM model is developed, i.e: λ (A, pi0, mu, sigma) is evaluated and
stored in the database
End
Stage 2: Testing/Recognition
Begin
Step 1 : Input speech signal of Quranic verse recitation is sampled
Step 2 : Preemphasis is executed – Finite Impulse Response (FIR) filter
Step 3 : The speech signal is framed
Step 4 : Framed speech signal is windowed by using Hamming Window
Step 5 : Fast Fourier Transform is applied to the windowed speech signal
Step 6 : Mel-Frequency Cepstral Coefficients (MFCC) is calculated
Step 7 : HMM model is developed, i.e: λ (A, pi0, mu, sigma) is evaluated
Step 8 : The observation sequence and HMM values, obtained from the test input
are compared with all models present in the database, through the Viterbi
algorithm
Step 9 : The recognition results of the recognized word is decided based on the
maximum value of log likelihood of the test data match with trained data
End
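The decision rule in Step 9 amounts to choosing, among all trained models, the one with the maximum log-likelihood for the test utterance. A minimal Python sketch of that rule follows; the labels and score values here are hypothetical, not taken from the thesis experiments:

```python
def recognize(log_likelihoods):
    """Pick the vocabulary entry whose trained HMM scored the test utterance highest."""
    return max(log_likelihoods, key=log_likelihoods.get)

# Hypothetical Viterbi log-likelihood scores for one test recitation
scores = {
    "Bismillahi Al-Rahmani Al-Rahiim": -412.7,
    "Maaliki yawmid diini": -389.2,
}
best = recognize(scores)  # the less negative (larger) score wins
```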
3.2.1 Speech Samples Collection (Speech Recording)
In this part, the recording process is executed to collect Quranic recitation speech samples from different speakers. According to Rabiner and Juang (1993), there are four main factors to consider while collecting speech samples:
1. Who the talkers are
2. The speaking conditions
3. The transducers and transmission systems
4. The speech unit
These four factors must be identified before any recording takes place, because they affect the performance and the output results, especially the training set vectors used in the training and testing processes. In this project, the automated Tajweed checking rules engine uses a simple MATLAB function for recording the speech samples. Figure 3.1 below shows the MATLAB code used in the recording process.
Figure 3.1: MATLAB code for recording process
This function requires the user to define certain parameters before the recording process is carried out, namely the sampling rate (Hz) and the time length in seconds. Here, the MATLAB command "wavrecord" is used to read the audio signal directly from the microphone. The command format is:
y = wavrecord (n, fs);
where "n" is the number of samples to be recorded and "fs" is the sampling rate. In this recording part, the recording lasts 4 seconds and is captured with a normal microphone, so "duration*fs" gives the number of sample points to be recorded. The recorded sample points are stored in the variable "y" as a vector of size 64000x1.
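The relationship between the recording parameters and the stored vector size can be checked with a line of arithmetic, shown here in Python rather than MATLAB:

```python
fs = 16000      # sampling rate in Hz
duration = 4    # time length in seconds

# Number of sample points, as passed to wavrecord(duration*fs, fs)
n = duration * fs
print(n)  # 64000, matching the 64000x1 vector y
```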
3.2.2 Mel-Frequency Cepstral Coefficients Feature Extraction
In the late 1970s, coefficients derived from the cepstrum began to replace the Linear Prediction Coefficients (LPC) as the basic algorithm and parameter set for speech recognition applications. The Mel-Frequency Cepstral Coefficients (MFCC) technique is now frequently used for feature extraction in speech processing. It introduced the use of the Mel scale in the derivation of the cepstrum coefficients; the Mel scale is a mapping of the linear frequency scale based on human auditory perception (Levent, M.A., 1996).
As mentioned earlier, the main objective of feature extraction is to extract from the speech signal the important characteristics that are unique to each word, in order to differentiate between a wide set of distinct words. According to Ursin, M. (2002), MFCC is considered the standard method for feature extraction in speech recognition and is perhaps the most popular feature extraction technique used today. MFCC achieves better accuracy with minor computational complexity compared to other feature extraction techniques (Davis, S.B. & Mermelstein, P., 1980).
Figure 3.2: Block diagram of the computation steps of MFCC
The proposed method for feature extraction is given in figure 3.2 above. This stage concentrates on the MFCC computational process as the main algorithm for feature extraction analysis. The MFCC feature extraction algorithm is applied to all collected speech samples to obtain the targeted feature vectors as output. Certain parameters must be defined before the MFCC algorithm is run and the coefficient values are estimated. Table 3.1 shows the parameter values and table 3.2 the MFCC filter equations used throughout the MFCC MATLAB code.
Table 3.1: MFCC Parameter Definition
Parameter | Value
Time Length | 4 seconds
Sampling Rate | 16,000 Hz
Frame Size (windowSize) | 256
Number of filters | 40

Table 3.2: MFCC Filter Equations
Parameter | Value
FFT points (NFFT) | 2048
Linear filters (Nlinear) | 13
Logarithmic filters (Nlog) | 27
Spacing of linear filters (Slinear) | 66.667 Hz
Spacing of logarithmic filters (Slog) | 1.0712
Lower bound of the 1st filter (f0) | 133.13 Hz
The voice input is recorded using a normal microphone and the sound recorder utility provided by Windows XP or Vista. In the automated Tajweed checking rules engine, the speech is sampled at 16,000 Hz for a time length of 4 seconds, with a sampling precision of 16 bits. In the preprocessing stage, an array of speech signal values is obtained from the microphone after the recording process. The time graph and spectrum of the speech signal are then calculated and displayed in plot format. Figure 3.3 below shows the time graph and spectrum for the Quranic recitation of "Bismillahi Al-Rahmani Al-Rahiim".
Figure 3.3: Time and Spectrum graph for the recitation "Bismillahi Al-Rahmani Al-Rahim"
3.2.2.1 Preemphasis
Preemphasis is the first step of MFCC, part of the preprocessing stage in speech processing, which involves the conversion of the signal from analog to digital. The sequence of samples x[n] is obtained from the continuous-time signal x(t) through the relationship:
x[n] = x(nT) (1)
where T is the sampling period, 1/T = fs is the sampling frequency in samples/sec, and n is the sample index. Equation (1) gives the discrete-time representation of a continuous-time signal obtained through periodic sampling. The size of the digital signal is determined by the sampling frequency and the length of the speech signal in seconds. In this first stage of MFCC feature extraction, the energy in the high frequencies is boosted. The spectrum of speech segments such as vowels shows more energy at the lower frequencies than at the higher frequencies; this drop of energy across frequencies is caused by the nature of the glottal pulse (Jurafsky, D. & Martin, J.H., 2007). Preemphasis increases the energy of the signal as the frequency increases. It is implemented by a filter based on the equation below:
y[n] = x[n] - 0.97 x[n-1] (2)
Figure 3.4: MATLAB code for the Preemphasis stage of MFCC
Preemphasis is executed after the digitization of the speech signal through a first-order FIR (Finite Impulse Response) filter:
H(z) = 1 - αz^(-1) (3)
where α is the preemphasis parameter, set to a value close to 1, in this case 0.97. Applying this FIR filter to the speech signal yields the preemphasized signal of equation (2) above and the MATLAB code in figure 3.4.
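Equation (2) can be sketched directly in code. The following Python version (not the thesis's MATLAB implementation) applies the first-order preemphasis filter with α = 0.97; keeping the first sample unchanged is a boundary convention, not something the thesis specifies:

```python
def preemphasis(x, alpha=0.97):
    """First-order FIR preemphasis: y[n] = x[n] - alpha * x[n-1]."""
    # Boundary convention: the first output sample is passed through unchanged.
    return [x[0]] + [x[n] - alpha * x[n - 1] for n in range(1, len(x))]

# A constant (low-frequency) signal is almost cancelled ...
flat = preemphasis([1.0, 1.0, 1.0, 1.0])
# ... while an alternating (high-frequency) signal is boosted
alternating = preemphasis([1.0, -1.0, 1.0, -1.0])
```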
3.2.2.2 Framing
After the preemphasis filtering is executed, the filtered input speech is framed; here the columns of data from the particular speech input are determined. The Fourier transform used later is reliable only when the signal is stationary, and a speech signal can be treated as stationary only within a short time interval, less than 100 milliseconds. The speech signal is therefore decomposed into a series of short segments, each frame is analyzed, and the useful features are extracted from it. A window size of 256 samples per frame is chosen in this research, as can be seen in figure 3.6.
Figure 3.5: MATLAB code for framing stage of MFCC
Figure 3.6: Framing Signal (Frame size = 256 samples)
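The framing step can be sketched as follows in Python (illustrative only; the 60% overlap ratio is taken from the discussion later in this section, and the hop-size rounding is an assumption):

```python
def frame_signal(x, frame_size=256, overlap=0.6):
    """Split a sample sequence into fixed-size frames with overlapping starts."""
    hop = int(frame_size * (1 - overlap))  # samples between successive frame starts
    frames = []
    start = 0
    while start + frame_size <= len(x):
        frames.append(x[start:start + frame_size])
        start += hop
    return frames

frames = frame_signal(list(range(64000)))  # a 4 s recording at 16 kHz
```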
3.2.2.3 Windowing
Windowing is one of the important parts of the MFCC feature extraction process. Each individual frame of the speech signal is windowed to minimize the signal discontinuities at the beginning and end of each frame. The purpose is to minimize the spectral distortion and to taper the signal toward zero at both ends of the frame. The window is defined as:
w(n), 0 ≤ n ≤ N-1 (4)
where N is the number of samples in each frame. The windowed signal y(n) is defined as:
y(n) = x(n) * w(n), 0 ≤ n ≤ N-1 (5)
The Hamming window w(n) used in this work is given by equation (6) below:
w(n) = 0.54 - 0.46 cos(2πn/(N-1)), 0 ≤ n ≤ N-1; w(n) = 0 otherwise (6)
Figure 3.7 shows the MATLAB code for windowing the segmented speech samples, whereas figure 3.8 shows the resulting Hamming window graph. The effect of windowing a speech sample can be seen clearly in figure 3.9: the speech sample tapers smoothly toward the edges of the frame.
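Equation (6) and the windowing of a frame can be sketched in Python as follows (illustrative, not the thesis's MATLAB code):

```python
import math

def hamming(N):
    """Hamming window: w(n) = 0.54 - 0.46*cos(2*pi*n/(N-1)), 0 <= n <= N-1."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (N - 1)) for n in range(N)]

def window_frame(frame):
    """Taper one frame toward zero at both edges (equation (5))."""
    return [s * w for s, w in zip(frame, hamming(len(frame)))]

w = hamming(256)  # w[0] and w[255] equal 0.08; the center approaches 1
```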
Figure 3.8: Hamming Window
Figure 3.9: Windowed speech segment
Once the speech sample is framed and windowed, the data at the ends of each frame are attenuated toward zero, which would result in a loss of information. Overlapping between frames is therefore applied: each adjacent frame includes a portion of the current frame's data, so that the edges of the current frame fall near the center of the adjacent frames. Normally, an overlap of around 60% is sufficient to cover the lost information and also helps to smooth the varying parameters. The Fast Fourier Transform (FFT) is then applied to the windowed speech samples, converting each frame of N samples from the time domain into the frequency domain.
3.2.2.4 Discrete Fourier Transform (DFT)
According to Owen, F.J. (1993), the Discrete Fourier Transform (DFT) is normally computed via the Fast Fourier Transform (FFT) algorithm. The algorithm is widely used for evaluating the frequency spectrum of speech and converts each frame of N samples from the time domain into the frequency domain. The DFT of the set of N samples x_k is defined as:
X_n = Σ (k=0 to N-1) x_k e^(-2πjkn/N), where n = 0, 1, 2, …, N-1 (7)
In this research, the windowed speech segment is transformed into the frequency domain using the MATLAB command shown in figure 3.10, which computes the FFT and returns the DFT values:
Figure 3.10: FFT computation of MATLAB code
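A direct (non-fast) evaluation of equation (7) can be written in a few lines of Python; real implementations use the FFT, which computes the same values in O(N log N):

```python
import cmath
import math

def dft(x):
    """Direct evaluation of X_n = sum_k x_k * exp(-2*pi*j*k*n/N), equation (7)."""
    N = len(x)
    return [sum(x[k] * cmath.exp(-2j * cmath.pi * k * n / N) for k in range(N))
            for n in range(N)]

# A pure cosine completing one cycle over 8 samples concentrates its
# energy in bins 1 and 7 (= N-1), each with magnitude N/2 = 4
X = dft([math.cos(2 * math.pi * k / 8) for k in range(8)])
```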
3.2.2.5 Mel Filterbank
The Mel scale is applied to place more emphasis on the low-frequency components, because the information carried by the low-frequency components of the speech signal is more important than that of the high-frequency components. The Mel is a unit of measure of the perceived pitch of a tone. The Mel filterbank is also known as Mel frequency warping: it does not correspond linearly to the normal frequency scale, but behaves linearly below 1000 Hz and with logarithmic spacing above 1000 Hz. The following approximate empirical relationship computes the Mel frequency for a given frequency f expressed in Hz:
Mel(f) = 2595 log10(1 + f/700) (8)
To implement the filterbanks, the magnitude coefficients of each Fourier-transformed speech segment are binned by correlating them with the triangular filters in the filterbank. In other words, Mel scaling is performed using a number of triangular filters, or filterbanks (Thomas, F.Q., 2002).
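Equation (8) is straightforward to evaluate. The sketch below (Python) also illustrates the warping: 1000 Hz maps to roughly 1000 mel, while frequencies above 1 kHz are compressed:

```python
import math

def hz_to_mel(f):
    """Mel(f) = 2595 * log10(1 + f / 700), equation (8)."""
    return 2595 * math.log10(1 + f / 700)

m_1000 = hz_to_mel(1000)   # close to 1000 mel
m_2000 = hz_to_mel(2000)   # well under 2000 mel: the logarithmic region
```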
Figure 3.11: MFCC Cepstral Coefficients computation of MATLAB code
In this part, the cepstral coefficients of the Mel-Frequency Cepstral Coefficients (MFCC) corresponding to the input are obtained. The output can be seen in the MFCC cepstral coefficients graph shown in figure 3.12.
Figure 3.12: Result of MFCC Cepstral Coefficients
3.2.2.6 Discrete Cosine Transform (DCT)
The DCT is a Fourier-related transform similar to the Discrete Fourier Transform (DFT), but it uses only real numbers. The DCT is used to extract the Mel-Frequency Cepstral Coefficients (MFCC) and is often used to calculate the cepstrum instead of the inverse FFT.
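The DCT-II applied to the log filterbank energies yields the cepstral coefficients. A sketch in Python follows (unnormalized, as a simple instance of the transform; the thesis's MATLAB routine may use a scaled variant):

```python
import math

def dct2(log_energies):
    """Unnormalized DCT-II: c_n = sum_k e_k * cos(pi * n * (k + 0.5) / N)."""
    N = len(log_energies)
    return [sum(e * math.cos(math.pi * n * (k + 0.5) / N)
                for k, e in enumerate(log_energies))
            for n in range(N)]

# A flat log-spectrum puts all its energy into coefficient 0
c = dct2([1.0, 1.0, 1.0, 1.0])
```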
In this research, this part is the final step of computing the MFCCs. It requires computing the logarithm of the magnitude spectrum in order to obtain the Mel-Frequency Cepstral Coefficients. At this stage the MFCCs are ready to be formed into a vector, known as the feature vector. This feature vector is then the input to the next process, which trains on the feature vectors for recognition purposes. The resulting MFCC cepstral coefficients are shown below:
Figure 3.13: The MFCC cepstral coefficients for ayates ‘Maaliki yawmid diini’
3.2.3 Hidden Markov Model Classification
The Hidden Markov Model (HMM) is a statistical model widely used in pattern recognition, especially in speech recognition, for characterizing the spectral properties of the frames of a given pattern. Using an HMM, the input speech signal is characterized as a parametric random process, and the parameters of the stochastic process can be determined in a precise, well-defined manner. The parameters of the HMM must be updated regularly so that the model fits the observation sequence of the particular application. Training the HMM is therefore important for representing the utterances of words; the trained model is later used when testing utterances and calculating model probabilities over sequences of vectors.
In the HMM statistical approach, the input speech of Quranic recitation is represented by probability distributions. A Markov model whose observations are a probabilistic function of the state is called a Hidden Markov Model: it consists of a doubly embedded stochastic process whose underlying process is not directly observable (hidden), but can be observed only through another stochastic process that produces the sequence of observations (Rabiner, L.R. & Juang, B.H., 2003).
In this research, an HMM with multivariate Gaussian state-conditional distributions is used. The HMM for discrete symbol observations is characterized by the following elements:
N : Number of states.
pi0 (π0) : Row vector containing the probability distribution for the first (unobserved) state: π0(i) = P(s1 = i) (9)
A : State transition probabilities: a_ij = P(s_t = j | s_(t-1) = i) (10)
mu (μ) : Mean vectors of the state-conditional distributions, stacked as row vectors, such that mu(i,:) is the mean (row) vector corresponding to the i-th state of the HMM.
sigma (Σ) : Covariance matrices. These values are stored in two different ways, depending on whether full or diagonal covariance matrices are used.
Full covariance matrices: Sigma((1+(i-1)*p):(i*p), :) (11)
Diagonal covariance matrices: Sigma(i, :) (12)
Figure 3.14 below depicts the structure of the automated Tajweed checking rules system, illustrated as a speaker recognition system for Quranic verse recitation. A speech recognition system has two main stages: training and recognition. In the training stage, models (patterns) are generated from the input speech samples through the feature extraction process and modeling techniques. In the recognition stage, feature vectors are generated from the input speech samples with the same extraction procedure used in the training stage; classification and decision processes are then executed with matching techniques. Depending on the type of classification, the recognition task can be divided into identification or verification.
Figure 3.14: Automated Tajweed Checking Rules system structure (λ = model parameter)
Moreover, a distinct HMM is used to model each word of the vocabulary. Each word in the vocabulary has a training set of k utterances by different speakers (Rabiner, L. & Juang, B.H., 1993), and each utterance constitutes an observation sequence of MFCCs. Isolated-word speech recognition for the automated Tajweed checking rules engine consists of the following three major steps:
(1) Training/Modeling: For each word in the vocabulary, build an HMM model and estimate the model parameters λ = (A, pi0, mu, sigma) that maximize the likelihood of the training set observation vectors.
(2) Identification: For each unknown word to be recognized, measure the observation sequence through feature analysis of the speech corresponding to the word. The word is then selected with the Viterbi algorithm as the one whose model likelihood is maximum, as given in figure 3.14.
(3) Verification: The input features are compared with the registered patterns, and the features giving the highest score identify the selected/target speaker (recitor) and the recitation result. The input features are then compared with those of the claimed speaker (recitor), and a decision is made to accept or reject the claim/result.
According to these three major steps, the training/modeling step is executed during HMM training, while the identification and verification steps are carried out during HMM testing/matching.
3.2.3.1 Hidden Markov Model Training
Training a Hidden Markov Model produces a model that represents a particular utterance of a word or phoneme from the Quranic recitation. A complete specification of the HMM therefore requires two items describing the observation symbols of the model, N and p, together with the three probability measures A, mu, and sigma and the initial state distribution pi0. According to Hemantha, G.K. et al. (2006), the complete parameter set of an HMM is denoted λ = (A, B, pi0); in this research, B is represented by the two probability measures mu and sigma, so the model is denoted:
λ = (A, pi0, mu, sigma) (13)
Training is done by adjusting the parameters of the model λ = (A, pi0, mu, sigma); this adjustment is an estimation of the parameters that maximizes P(O|λ). The values obtained for the model λ are stored in the database for further processing in the testing/recognition part of stage 2. The sequence used to create an HMM model of a speech utterance is shown below:
Figure 3.15: The HMM sequence of training block diagram
(a) Initialization
A: The state transition probability matrix, using the Left-to-Right model. A is initialized with equal probability for each state and can be stored as a sparse matrix to save memory (A is upper triangular for a Left-to-Right model).
The values of A were obtained after the MATLAB simulations were successfully executed. They are initialized with equal probability for each state, as shown in the MATLAB command window:
Figure 3.16: The state transition probability matrix (A) for ayates ‘Maaliki yawmid diini’
pi0: Initialize the initial state probability distribution, using the Left-to-Right model. The initial state distribution pi0 is initialized to be deterministic, starting in state 1 (i.e., pi0 = [1 0 … 0]). This description is based on speech recognition theory (Rabiner, L.R., 1989).
pi0(i) = [1 0 0 … 0] (14 entries)
where 1 ≤ i ≤ number of states. In this case, for the ayates 'Maaliki yawmid diini', the number of states is i = 14.
Initialize the mean vectors (mu (μ)) and covariance matrices (sigma (Σ)) of the model parameters using multiple observations for a Left-to-Right Hidden Markov Model (HMM). From the sizes of the input arguments, these values determine the dimensions of the model (size of the observation vector and number of states) and the type of covariance matrices (full or diagonal).
Figure 3.17 shows the MATLAB code for initializing the model parameters mu (µ) and sigma (Σ) using multiple observations for a Left-to-Right Hidden Markov Model (HMM). Here, each parameter sequence of speech is chopped into N segments of equal length, where N is the number of states.
Figure 3.17: MATLAB code for initialize the model (mu, sigma)
Figure 3.18: M-File function of hmm_mint
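The equal-length chopping used here to seed the model parameters can be sketched as follows (Python, with scalar observations for brevity; the real feature vectors are multidimensional, and hmm_mint's exact handling of leftover samples is not specified here):

```python
def chop(seq, n_states):
    """Split an observation sequence into n_states segments of equal length."""
    seg = len(seq) // n_states
    return [seq[i * seg:(i + 1) * seg] for i in range(n_states)]

def init_means(seq, n_states):
    """Initial state-conditional mean = average of that state's segment."""
    return [sum(s) / len(s) for s in chop(seq, n_states)]

means = init_means([1, 1, 2, 2, 3, 3], 3)  # one mean per state
```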
Most functions that take mu and sigma as input arguments can determine the dimensions of the HMM model (size of the observation sequence and number of states N) and the type of covariance matrices (full or diagonal) from those arguments; this can be checked through the function hmm_chk. Below are the model parameter values of mu (µ) and sigma (Σ) for the ayates 'Maaliki yawmid diini'.
Figure 3.19: The mean vectors mu (µ) for ayates ‘Maaliki yawmid diini’
Figure 3.20: The covariance matrices sigma (Σ) for ayates ‘Maaliki yawmid diini’
(b) Probability Evaluation
In this part, multiple iterations of the Expectation-Maximization (EM) algorithm for a Left-to-Right model are performed with multiple training sequences. This process is a call to the lower-level functions, where the values of A, mu, and sigma supplied in part 3.2.3.1(a) are also used as initialization values (A_, mu_, sigma_). These values are used in the next process, which executes the Forward-Backward Recursions (with scaling). Figure 3.21 shows the MATLAB code for the Forward-Backward Recursions implementation.
Figure 3.21: MATLAB code for Forward-Backward Recursions
In the MATLAB code above, alpha is the forward variable and beta is the backward variable, with log1 as the log-likelihood variable. Note that at each step the log-likelihood is computed from the forward variables using the log1 term returned by hmm_fb (forward-backward), which is the sum of the logarithmic scaling factors used during the computation of alpha and beta. Another variable, dens, contains the values of the Gaussian densities for each time index (useful for estimating the transition probabilities). Brief descriptions of the variables involved in this part are given below:
(i) α (alpha): The Forward Algorithm
The probability of an observation sequence O = o1 o2 … oT for the model λ = (A, pi0, mu, sigma) can be computed by finding which model most likely produced the observation sequence. Every possible state sequence of length T can be evaluated through the equation below (here the mu and sigma values are represented by b or B):
P(O|λ) = Σ over all q1,q2,…,qT of π(q1) b(q1)(o1) a(q1,q2) b(q2)(o2) … a(q(T-1),qT) b(qT)(oT) (14)
In equation (14), initially (at time t = 1) the process is in state q1 with probability π(q1) and generates the symbol o1 with probability b(q1)(o1). The clock advances from t to t + 1, a transition from q1 to q2 occurs with probability a(q1,q2), and the symbol o2 is generated with probability b(q2)(o2). The process continues in this manner until the last transition is made (at time T): a transition from q(T-1) to qT occurs with probability a(q(T-1),qT), and the symbol oT is generated with probability b(qT)(oT). The Forward Algorithm is based on the forward variables α_t(i), defined by:
α_t(i) = P(o1 o2 … ot, qt = i | λ) (15)
From equation (15), α_t(i) is the probability of being in state i at time t, given the model, with the partial observation sequence from the first observation up to observation t, o1 o2 … ot, having been generated. The Forward Algorithm can be computed at any time t, 1 ≤ t ≤ T, as shown below:
1. Initialization
Set t = 1;
α_1(i) = π(i) b_i(o1), 1 ≤ i ≤ N
Here the forward variable gets its starting value (the joint probability of being in state i and observing the symbol o1). Only α_1(1) has a nonzero value in a Left-to-Right model.
2. Induction
α_(t+1)(j) = [ Σ (i=1 to N) α_t(i) a_ij ] b_j(o(t+1)), 1 ≤ j ≤ N
3. Update time
Set t = t + 1;
Return to step 2 if t ≤ T;
Otherwise, terminate the algorithm (go to step 4).
4. Termination
P(O|λ) = Σ (i=1 to N) α_T(i)
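The forward recursion with per-step scaling can be sketched in Python. Here the state-conditional densities b_j(o_t) are supplied directly as a matrix B (in the thesis they come from the multivariate Gaussians defined by mu and sigma); scaling every row, including the first, is a simplification of the t = 2:T scaling convention described below:

```python
import math

def forward(pi0, A, B):
    """Scaled forward algorithm.

    pi0[i]  : initial state distribution
    A[i][j] : state transition probabilities
    B[t][i] : likelihood of observation t in state i
    Returns the scaled alpha rows and log P(O|lambda).
    """
    N, T = len(pi0), len(B)
    alpha = [pi0[i] * B[0][i] for i in range(N)]  # initialization (t = 1)
    rows, log_likelihood = [], 0.0
    for t in range(T):
        if t > 0:  # induction step
            alpha = [sum(alpha[i] * A[i][j] for i in range(N)) * B[t][j]
                     for j in range(N)]
        scale = sum(alpha)                 # scaling factor for this time index
        alpha = [a / scale for a in alpha]
        log_likelihood += math.log(scale)  # sum of log scaling factors (log1)
        rows.append(alpha)
    return rows, log_likelihood

rows, ll = forward([1.0, 0.0],
                   [[0.0, 1.0], [0.0, 1.0]],
                   [[0.5, 0.2], [0.1, 0.5]])
```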
α (Alpha): Alpha scaled
Because multiplying many probabilities exceeds the machine precision range, scaling of α (alpha) and β (beta) is necessary. Scaling of the forward variables is performed at each time index t = 2:T, so that each row of the alpha matrix sums to 1, except the first one, as in the scaled alpha matrix for the ayates 'Maaliki yawmid diini', where 1 ≤ t ≤ T and the number of input arguments is T = 46.
(ii) β (Beta): Backward Algorithm
If the recursion is run to calculate the forward variable in the reverse direction, then β_t(i) is the backward variable, described by the following equation:
β_t(i) = P(o(t+1) o(t+2) … oT | qt = i, λ) (16)
From equation (16), β_t(i) is the probability, given the model and being in state i at time t, of generating the partial observation sequence from observation t + 1 until observation T, o(t+1) o(t+2) … oT. The variable can be calculated inductively as follows:
1. Initialization
Set t = T - 1;
β_T(i) = 1, 1 ≤ i ≤ N
2. Induction
β_t(i) = Σ (j=1 to N) a_ij b_j(o(t+1)) β_(t+1)(j), 1 ≤ i ≤ N
3. Update time
Set t = t - 1;
Return to step 2 if t ≥ 1;
Otherwise, terminate the algorithm.
β (Beta): Beta scaled
The backward variables are scaled using the same normalization factors as the forward variables, to ensure that the re-estimation of the transition matrix is correct.
(iii) Log(P(O|λ)): Probability of the observation sequence
The probability of the observation sequence, Log(P(O|λ)), is saved in a matrix so that the adjustment of the re-estimation sequence can be observed; in the MATLAB program, Log(P(O|λ)) is represented by log1. Note that the summation sum(log(scale)) of the total probability is used in every iteration. The current value of log1 is compared with the log1 of the previous iteration; if the difference is less than a threshold value, the maximum has been reached.
(c) Re-Estimation
The recommended algorithm for re-estimating the parameters of the model λ = (A, pi0, mu, sigma) is the iterative Baum-Welch algorithm, which maximizes the likelihood function of the model. In every iteration, the Baum-Welch algorithm re-estimates the HMM parameters toward a closer (maximum) value. The Baum-Welch algorithm is based on a combination of the forward algorithm and the backward algorithm, which were implemented above.
As mentioned in part 3.2.3.1(b), the values of A, mu, and sigma are also used as initialization values (A_, mu_, sigma_). These values are used to re-estimate the transition parameters for the multiple-observation-sequence Left-to-Right HMM. Before this is carried out, the dimensions of the HMM model are checked and determined through the hmm_chk function, as discussed before. The re-estimation is then executed through the hmm_mest function, as shown in figure 3.22 below:
Figure 3.22: MATLAB code for the re-estimation of transition parameters
In this case, the matrix X contains all the observation sequences, while the vector st holds the index corresponding to the beginning of each sequence. Thus X(1:st(2)-1, :) contains the vectors of the first observation sequence, and so on up to the last observation sequence. The transition parameters are re-estimated in the hmm_mest function, where the posterior state distributions are returned in gamma (γ). Note also that mix_par is used for re-estimating the HMM parameters mu_ and sigma_ from the posterior state probabilities. The transition parameter re-estimation is described below:
(i) A_: Re-estimate the state transition probability matrix
The Baum-Welch algorithm is used to adjust the model parameter, through
maximization the probability of the model, using the below equation:
71
)]|([maxarg*
OP (17)
Here, the re-estimation process for matrix A is quite extensive, due to the use of multiple observation sequences. The equation below is used to calculate an average estimate, with contributions from all utterances used in the training session.
ā_ij = (expected number of transitions from state i to state j) / (expected number of transitions from state i)

     = [ Σ_{t=1}^{T-1} ξ_t(i, j) ] / [ Σ_{t=1}^{T-1} γ_t(i) ]
(ii) mu_ (μ): Re-estimate the mean vector
A new mean value, x_mu(m, n), is used for the next iteration of the process, where the value of gamma (γ), γ_t(j, k), is used:
μ̄_jk = [ Σ_{t=1}^{T} γ_t(j, k) · o_t ] / [ Σ_{t=1}^{T} γ_t(j, k) ]   (18)
(iii) sigma_ (Σ): Re-estimate the covariance matrices
A new covariance value, x_sigma(m, n), is calculated and used for the next iteration, where the value of gamma (γ), γ_t(j, k), is used:
Σ̄_jk = [ Σ_{t=1}^{T} γ_t(j, k) · (o_t - μ_jk)(o_t - μ_jk)′ ] / [ Σ_{t=1}^{T} γ_t(j, k) ]   (19)
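The three re-estimation updates above can be sketched in a few lines. This is a minimal pure-Python sketch of the Baum-Welch M-step, assuming the E-step posteriors gamma[t][j] and xi[t][i][j] have already been produced by the forward-backward pass; scalar observations and all names are illustrative (the thesis works in MATLAB with vector observations).

```python
# Illustrative Baum-Welch M-step sketch; gamma and xi are assumed E-step
# posteriors, obs is a list of scalar observations.

def reestimate(obs, gamma, xi, n_states):
    T = len(obs)
    # Transition matrix: expected i->j transitions / expected exits from i
    A = [[sum(xi[t][i][j] for t in range(T - 1)) /
          sum(gamma[t][i] for t in range(T - 1))
          for j in range(n_states)] for i in range(n_states)]
    # Mean: gamma-weighted average of the observations
    mu = [sum(gamma[t][j] * obs[t] for t in range(T)) /
          sum(gamma[t][j] for t in range(T))
          for j in range(n_states)]
    # Variance: gamma-weighted squared deviation from the new mean
    sigma = [sum(gamma[t][j] * (obs[t] - mu[j]) ** 2 for t in range(T)) /
             sum(gamma[t][j] for t in range(T))
             for j in range(n_states)]
    return A, mu, sigma
```

Each row of the returned A sums to one, matching the normalization in the transition equation above.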
(d) Result – Model of Hidden Markov Model (HMM)
Lastly, after the re-estimation process has executed successfully, the HMM model for the specific utterance needs to be saved. The model developed represents the specific observation sequence (i.e. an isolated word) and is used later for recognition purposes. The HMM model obtained will be discussed in detail in chapter 5. The model is stored with the specific denotation λ = (A_, mu_, sigma_) as a MATLAB MAT-file (7x14 matrices); only half of it is shown in figure 3.23 (‘Maaliki yawmiddiini’):
Figure 3.23(a): MAT-file trained model of A_ values (State, i=1-14)
Figure 3.23(b): MAT-file trained model of mu_ (μ) values (State, i=1-13)
Figure 3.23(c): MAT-file trained model of sigma_ (Σ) values (State, i=1-13)
3.2.3.2 Hidden Markov Model Testing/Recognition
Decoding or aligning the acoustic feature sequence requires the prior specification of the parameters of the particular HMM. As mentioned earlier, the HMM models act as stochastic templates against which the observations are compared. These templates consist of several sentences, which represent different phonemes of Quranic recitation. Each template can be determined and identified through the estimated HMM parameters, trained on a database of observation sequences using either a supervised or an unsupervised learning method.
Based on basic HMM concepts, the parameter set λ defines the probability measure for an observation sequence O, i.e. P(O | λ). This observation sequence O = O1 O2 O3 ... OT needs to be compared with a model λ = (pi0, A, mu, sigma) in order to find the optimal state sequence q = {q1 q2 q3 ... qT} for the given observation sequence and model. To maximize P(q | O, λ), the suitable algorithm is the Viterbi algorithm (Rabiner, L.R., 1989). The Viterbi algorithm is used to find the single best state sequence for a given observation sequence (Rabiner, L.R. & Juang, B.H., 1993). In the testing process, each tested utterance is compared with each model, and a score value is obtained after each comparison.
In this case, the raw observation sequence O is not used directly in the calculation; instead, the MFCC feature analysis of the speech samples corresponding to the word is used. For example, a reasonable measure of the similarity of two HMM models λ1 and λ2, using the concept of logarithmic distance, defines the distance measure D(λ1, λ2) between the two Markov models as:

D(λ1, λ2) = (1 / T2) [ log P(O2 | λ1) - log P(O2 | λ2) ]   (20)

where O2 = (O1 O2 ... OT2) is a sequence of observations generated by model λ2. Basically, the expression above measures how well model λ1 matches the observations generated by model λ2.
Under the same concepts, equation (20) has been implemented in the current research application, mainly for recognizing the Tajweed rules in certain ayates of Quranic recitation. Here, the log likelihood of the word/phoneme itself acts as the measurement. The standard Log Likelihood Ratio (LLR) is calculated as follows:
LLR = (1 / N) [ log P_best(O) - log P_2ndbest(O) ]   (21)

Here, N is the length of the input utterance, log P_best(O) is the largest log likelihood and log P_2ndbest(O) is the second largest log likelihood. HMM testing is done in such a manner that the particular utterance to be tested is compared with each model, and an output score is produced for each comparison. The sequence of tests of the Quranic utterances is based on the following:
Figure 3.24: The HMM sequence of testing/recognition block diagram
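The confidence measure of equation (21) can be sketched as below: the per-frame gap between the best and second-best model scores. This is an illustrative sketch, not the thesis code; the scores list stands in for one log-likelihood per stored HMM model.

```python
# Illustrative sketch of the Log Likelihood Ratio of equation (21).
# scores   : one log-likelihood per stored model (assumed, e.g. from Viterbi)
# n_frames : length N of the input utterance in frames

def log_likelihood_ratio(scores, n_frames):
    best, second = sorted(scores, reverse=True)[:2]
    return (best - second) / n_frames
```

A large LLR means the best model outscores its nearest rival by a wide margin, i.e. a confident match.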
(a) Initialization
(i) Log (A): State transition probability matrix of the model (Refer to
HMM training)
Load the A_ values (MAT-file) from the trained model λ and calculate the logarithm of A. However, with a left-to-right model, taking the logarithm of the zero components in A and π causes problems, because the zero components turn into minus infinity. To avoid this, the MATLAB ‘realmin’ value (the smallest positive floating-point number) can be used, as shown in the MATLAB code in figure 3.25 below:
Figure 3.25: MATLAB code for ‘realmin’
(ii) mu (μ): Mean matrix from the model (Refer to HMM training)
Load the mu_ (μ) values (MAT-file) from the trained model λ.
(iii) Sigma (Σ): Variance matrix from the model (Refer to HMM training)
Load the sigma_ (Σ) values (MAT-file) from the trained model λ.
(iv) Log (pi0): Initial state probability vector (Refer to HMM training)
The problem is similar to that of Log (A). Thus, a small number such as ‘realmin’ is added to the elements that contain a zero value, as described in detail in part 3.2.3.2(a)(i). Note that the value of π is the same for each model.
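The ‘realmin’ guard used for both Log (A) and Log (pi0) has a direct Python analogue, sketched below under the assumption that probabilities are floored at the smallest positive float before the logarithm is taken (illustrative only, not the thesis code):

```python
import math
import sys

# Python analogue of the MATLAB 'realmin' guard: flooring zero probabilities
# at the smallest positive normal float so that log(0) never produces minus
# infinity in the left-to-right model computations.

def safe_log(p):
    return math.log(max(p, sys.float_info.min))
```

With this guard, a zero transition probability maps to a large but finite negative log value instead of -inf.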
(b) Probability Evaluation
(i) Log (P*): The probability calculation of the most likely state sequence. The max argument is taken at the last state. Here, log1 is used to represent Log P*.
(ii) plog1: The state that gives the largest Log (P*) at time T is calculated. Backtracking is then used.
(iii) Path: Backtracking of the state sequence; the optimal state sequence is calculated.
(iv) Log (B): Compute the probability density values per state i, as in the previous section (HMM training). Here, dens is used to represent it.
(v) Delta (δ): Maximization over a single path requires the quantity δ_t(i):

δ_t(i) = max_{q1, q2, ..., q_{t-1}} P(q1 q2 ... q_{t-1}, q_t = i, o1 o2 ... o_t | λ)   (22)

The quantity δ_t(i) is the probability of observing o1 o2 o3 ... o_t along the best single path ending in state i at time t, for a given model.
(vi) Psi (ψ): The optimal state sequence is retrieved from the vector ψ_t(j), which stores the state that maximizes δ_{t+1}(j). While calculating b_j(o_t), the values of μ and Σ are taken from the different models for comparison purposes.
The ayates and phonemes of the Quranic recitation are recognized by comparing against the tested models with the help of the Viterbi algorithm. This algorithm is used to find the single best state sequence for the given observation sequence (Rabiner, L.R. & Juang, B.H., 1993). The steps for finding the best state sequence are included in the Alternative Viterbi Algorithm listed below:
1. Preprocessing
π̃_i = log(π_i), 1 ≤ i ≤ N
ã_ij = log(a_ij), 1 ≤ i, j ≤ N
2. Initialization
Set t = 2;
b̃_i(o_1) = log(b_i(o_1)), 1 ≤ i ≤ N
δ̃_1(i) = π̃_i + b̃_i(o_1), 1 ≤ i ≤ N
3. Induction
b̃_j(o_t) = log(b_j(o_t)), 1 ≤ j ≤ N
δ̃_t(j) = max_{1≤i≤N} [ δ̃_{t-1}(i) + ã_ij ] + b̃_j(o_t), 1 ≤ j ≤ N
ψ_t(j) = arg max_{1≤i≤N} [ δ̃_{t-1}(i) + ã_ij ], 1 ≤ j ≤ N
4. Update time
Set t = t + 1;
Return to step 3 if t ≤ T;
Otherwise, terminate the algorithm (go to step 5).
5. Termination
P̃* = max_{1≤i≤N} [ δ̃_T(i) ]
q*_T = arg max_{1≤i≤N} [ δ̃_T(i) ]
6. Path (state sequence) backtracking
a. Initialization
Set t = T - 1;
b. Backtracking
q*_t = ψ_{t+1}(q*_{t+1})
c. Update time
Set t = t - 1;
Return to step b if t ≥ 1;
Otherwise, terminate the algorithm.
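The steps above can be sketched compactly in Python. This is an illustrative log-domain Viterbi sketch (not the thesis MATLAB code): additions replace multiplications, psi records the arg max at each induction step, and backtracking recovers the single best state path. log_b[t][j] stands for log b_j(o_t); all inputs are assumed precomputed.

```python
# Illustrative log-domain Viterbi sketch following steps 1-6 above.
# log_pi[i]   : log initial state probabilities
# log_A[i][j] : log transition probabilities
# log_b[t][j] : log emission likelihood of frame t in state j

def viterbi_log(log_pi, log_A, log_b):
    T, N = len(log_b), len(log_pi)
    delta = [log_pi[i] + log_b[0][i] for i in range(N)]  # initialization
    psi = []
    for t in range(1, T):  # induction
        prev = delta
        back = [max(range(N), key=lambda i: prev[i] + log_A[i][j])
                for j in range(N)]
        delta = [prev[back[j]] + log_A[back[j]][j] + log_b[t][j]
                 for j in range(N)]
        psi.append(back)
    best_last = max(range(N), key=lambda i: delta[i])  # termination
    path = [best_last]
    for back in reversed(psi):  # backtracking
        path.append(back[path[-1]])
    return list(reversed(path)), delta[best_last]
```

The returned score corresponds to Log (P*) and the returned path to the backtracked state sequence.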
(c) HMM Recognition Result
(i) Score
The score results were obtained from the Viterbi algorithm. From the calculated value of log1, the probability value of the single best path is saved as the result (output score) of each comparison. Below is the output score for the ayates ‘Maaliki yawmiddiini’.
Figure 3.26 (a): Output score for the ayates ‘Maaliki yawmiddiini’
(ii) Log-Likelihood Ratio (LLR)
From the output scores obtained above, the maximum of these probability values is determined using the Log Likelihood Ratio (LLR). The highest output score corresponds to the highest probability that an HMM model (comparison model) produced the particular test utterance, subject to the threshold value set. In this case, the resulting LLR confidence score is 0.7253 x 10^3, which is above the threshold value set (> 0.2). The calculation and results obtained through this method will be discussed in detail in chapter 5.
Figure 3.26 (b): Log-Likelihood Ratio (LLR) for the ayates ‘Maaliki
yawmiddiini’
3.3 Summary
This chapter has presented a brief technical overview of MFCC and HMM, and how the two algorithms relate to each other. It was clearly stated that MFCC handles the feature extraction process, which produces the feature vectors of the Quranic recitation. These output values serve as the training set for HMM classification, used to train the HMM model. HMM therefore works as a classification or pattern recognition technique that classifies different signals of Quranic recitation, based on the calculated Log-Likelihood Ratio (LLR) values.
The combination of MFCC and HMM has been widely used in speaker recognition, especially for the English language. However, applying both algorithms (MFCC & HMM) to Quranic recitation is still considered a new approach. Thus, this research studies the possibility of using this combination in an Automated Tajweed Checking Rules Engine for Quranic Verse Recitation. This chapter has also presented the detailed research methodology and the MATLAB implementation using MFCC and HMM.
CHAPTER 4
DESIGN AND IMPLEMENTATION
4.1 Introduction
This chapter emphasizes the design and implementation of the Automated Tajweed checking rules engine for Quranic verse recitation. It covers all aspects, from the various diagrams and parts that exhibit the logical and physical designs of this application to the algorithms and methodologies involved. Several diagrams are shown, including the most relevant diagrams of the Quranic verse recitation recognition system, such as the context diagram, data flow diagram, flow chart and others. Finally, this chapter also provides some snapshots of the Quranic verse recitation recognition graphical user interface (GUI).
Figure 4.1: Automated Tajweed Checking Rules for Quranic verse recitation context
diagram
4.2 Overview of Automated Tajweed Checking Rules Engine
This project mainly focuses on basic speech recognition technology, but implements it for a different type of application and language, namely Quranic Arabic. The different input content implemented in this engine may affect the recognition accuracy, so the reliability and effectiveness of the system also depend on the language and the system design. The system is implemented using MATLAB as the programming tool. The system developed in this project is divided into two main parts:
1. Engine Development part
2. Content Development part
Figure 4.2: Overview of Automated Tajweed Checking Rules Engine
4.2.1 Engine Development Part
In the Engine Development part, a speech recognition engine is developed to extract, store and analyze the parameters of Al-Quran recitation. The Mel-Frequency Cepstral Coefficient (MFCC) and Hidden Markov Model (HMM) based algorithms are currently selected for feature extraction and classification (comparison). The processes of speech recording (speech sample collection), feature extraction, feature training and pattern recognition formulate the Quranic verse recitation recognition methodology, which underpins the design of the tajweed checking rules guidelines shown below. The architecture/block diagram of this part is shown clearly in this chapter, while the processes and algorithms involved are discussed in detail in part 3.2.
4.2.2 Content Development Part
In the Content Development part, samples of Quranic recitation are recited by a certified teacher (Mudarris) and stored on a PC for analysis purposes. A relevant GUI is also developed, in order to provide a user-friendly Automated Tajweed Checking Rules system. The Content Development part is responsible for all content, including the preparation of the Al-Quran materials, namely the Al-Quran transcript and the Al-Quran recitation. The Al-Quran transcript is already 100% complete and ready to be used by the Engine Development part. For the Al-Quran recitation, each word of the first chapter of Al-Quran (Al-Fatihah) has been carefully recited by a certified teacher (Mudarris) and stored on a personal computer (PC). All the stored files (.wav) are sent to the Engine Development part for integration with the speech processing technology. The Engine and Content Development parts eventually work together to apply speech recognition technology, in order to analyze both recitations (teacher and student) based on the rules of Tajweed. If a student recites Al-Quran incorrectly, the system shows the errors on the Graphical User Interface (GUI) and plays back the correct recitation.
4.3 Tajweed checking rules engine architecture
As noted in part 4.2, this project mainly focuses on basic speech recognition technology, implemented for a different application and language, namely Quranic Arabic. The different input content implemented in this engine may affect the recognition accuracy, so the reliability and effectiveness of the system also depend on the language and the system design.
Quranic Arabic recitation is best described as a long, slow-paced, rhythmic, monotone utterance (Essa, O., 1998; Nelson & Kristina, 1985). The sound of Quranic recitation is recognizably unique and reproducible according to a set of tajweed pronunciation rules, designed for clear and accurate presentation of the text. The inputs to the system are the speech signal and the phonetic transcription of the speech utterance. Thus, this project needs a speaker (input speech sample), feature extraction, feature training and pattern classification/matching, which are the important components in formulating the architecture for Quranic verse recitation recognition. The main architecture of the Automated Tajweed checking rules engine for Quranic verse recitation follows the Engine Development part, mentioned earlier in part 4.2.1. This part is divided into 3 main architectures: feature extraction, the training/testing architecture and lastly the recognition architecture. Figure 4.1 shows the context diagram of the Automated Tajweed Checking Rules engine for Quranic verse recitation, representing the external view of the system: the speaker performs the Quranic recitation via the Tajweed checking rules engine and receives the response from the system after the speech input samples have been processed, with training/testing and recognition responding in turn. The schematic block diagram of the Tajweed checking rules engine is shown in figure 4.3, while the training/testing and recognition architecture is shown in figure 4.4.
Figure 4.3: Block diagram schematic illustrating Tajweed checking rules engine
Figure 4.4: Tajweed checking rules engine architecture
Figure 4.3 (the block diagram of the Automated Tajweed Checking Rules engine) and figure 4.4 (the system architecture) show the process flow of this research. Figure 4.3 shows the overall process of Quranic verse recitation recognition in the form of a block diagram. In this block diagram, 2 distinct phases are represented: the enrolment or training phase, and the matching/testing phase, as shown in figure 4.4. The training and matching/testing phases are entirely different processes. In the training phase, each recitor provides samples of Quranic recitation so that the engine can build or train a reference model for that particular recitor. In this part, the researcher only needs to train and store correct data of certain sourates of Quranic recitation in the database. For the speaker verification process, a specific threshold value can also be computed by the researcher from the training samples. The aim is to provide correct data as a reference for the subsequent recognition process.
On the other hand, the input speech processed in the matching/testing phase is matched against the stored reference models, and a decision can thus be made (recognition). The output data from the Hidden Markov Model (HMM) is compared against the database created during the training process. At the same time, the system acts upon the result and reports whether the output data matches the stored data in the database or not. If the output data differs from the stored data, the system judges that output (Quran recitation) as false/wrong.
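The accept/reject decision described above can be sketched minimally as below, using the LLR confidence threshold of 0.2 quoted in chapter 3 (part 3.2.3.2); the function name and interface are illustrative, not the thesis code.

```python
# Illustrative sketch of the matching-phase decision: a recitation is
# accepted only when its LLR confidence clears the set threshold.

def is_recitation_correct(llr_score, threshold=0.2):
    return llr_score > threshold
```

For example, the confidence score of 0.7253 reported later clears the 0.2 threshold and would be accepted.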
4.4 Data Flow Diagram for Tajweed Checking Rules Engine
In this part, the data flow diagram shows the main processes performed by the Tajweed checking rules engine. There are four main processes, performing different tasks, as shown in figure 4.5: receiving the Quranic recitation (speech samples), analyzing the speech, searching and matching the speech, and producing and returning the result to the recitor.
The recitor acts as the source of the speech inputs and receives the processed output from the Tajweed checking rules engine. The next process analyzes the speech inputs, followed by searching and matching of the analyzed speech samples. Lastly, the Tajweed checking rules engine produces and returns the matching results to the recitor, assisting the recitor until the process has executed successfully to its final destination.
Figure 4.5: Tajweed Checking Rules Engine Data Flow Diagram (DFD)
4.5 Tajweed Checking Rules Engine Flow Chart
The Tajweed Checking Rules Engine flow chart emphasizes the system’s flow of events. This engine has 5 main stages: sampling, segmentation, feature extraction, training/testing and recognition/classification. Figure 4.6 shows these stages, as well as the processes that occur at each stage.
Figure 4.6: Automated Tajweed checking rules engine for Quranic flow chart
Stage 1:
Referring to figure 4.6, the input speech samples were recorded within a particular time frame. The speech input for each utterance is segmented in order to differentiate speech regions from non-speech regions. Non-speech regions are detected immediately, and only the speech regions are allowed through for further processing.
Stage 2:
The speech regions become the input to the phoneme segmentation module, where the basic level of segmentation is performed. After the segmentation process, the MFCC feature extraction module extracts features from the speech signals; MFCC is used extensively as a feature vector for speech recognition systems.
Stage 3:
The next process is HMM classification, together with phoneme classification. HMM classification (recognition) covers both training and testing, and is mainly used for the tajweed rules checking process. In the training part, a set of training speech is used to construct a model for each word/phoneme, independently of the recitor.
Stage 4: (Checking tajweed character/database)
To develop the engine’s database, the HMM training process needs to be executed. The training process mainly involves the tasks of the Content Development part mentioned earlier in part 4.2.2. Here, the recitors need to train/repeat a set of words/phonemes or phrases of the Quranic recitation, and the comparison algorithm is adjusted to match the initial training data set. Each word or phoneme from the vocabulary is connected to a Hidden Markov Model, using the values obtained from HMM modeling, such as A, mu and sigma. These values (A, mu and sigma) are used as reference patterns and stored in the database.
Stage 5: (End of utterance?)
Each line of the Quranic recitation (represented as an array of input sample values), corresponding to the ayates, is arranged in sequence, line by line, in the MATLAB array editor. Based on the phoneme arrangement in that array editor, the values of (A, mu and sigma) are obtained from HMM modeling (HMM training) for the line specified for each ayat in the sourate (Al-Fatihah). The values for each line of phonemes in the ayates are used as reference patterns; the loop over new inputs of Quranic recitation executes line by line against the reference patterns until the looping process ends (based on the line parameter set) and completes.
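The line-by-line matching loop of this stage can be sketched as below: each recited line is scored against every stored reference model and the best-scoring model's label is taken as the recognized ayat. This is a hedged sketch; the score function stands in for the HMM Viterbi scorer, and all names are illustrative.

```python
# Illustrative sketch of the Stage 5 matching loop.
# lines  : recited input lines (feature representations, assumed)
# models : dict mapping an ayat label to its stored reference model
# score  : placeholder for the HMM log-likelihood scorer

def recognize_lines(lines, models, score):
    recognized = []
    for line in lines:
        best = max(models, key=lambda name: score(line, models[name]))
        recognized.append(best)
    return recognized
```

The loop ends naturally once every line in the input array has been scored, mirroring the "end of utterance" check above.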
Stage 6: Result (Response)
The values obtained at this stage are the recognition results. The results are obtained after real-time acquisition of the Quranic recitation, the speech processing stage and HMM modeling have executed. The process then continues with the recognition procedure, where the values are compared with all codebook models (reference patterns) to obtain the maximal likelihood ratio. Only the maximal values correspond to the recognized word.
4.6 Tajweed Checking Rules Graphical User Interfaces
The implementation of the Automated Tajweed Checking Rules engine designed here is crucial. The system must be flexible, user-friendly and easy for the user to visualize. Thus, a Graphical User Interface (GUI) is used for this project’s development.
In this part, both the logical and the physical aspects of the Tajweed checking rules engine are presented and visualized using the GUI. The functional requirements of the Tajweed Checking Rules engine are also described through the graphical representations.
Figure 4.7: Automated Tajweed Checking Rules Engine for Quranic verse Recitation
Graphical User Interface
Figure 4.8 below shows the list of selectable items for the particular ayates that were recorded beforehand and need to be loaded into the engine for further analysis and matching.
Figure 4.8: Load the wave file of input speech sample from sourate Al-Fatihah
After the input speech sample has been selected and loaded into the system, the process continues with the analysis stage. The analysis mainly extracts the features from the input speech sample of sourate Al-Fatihah, in order to obtain the feature vectors. The GUI visualization of this part can be seen in figure 4.9 and figure 4.10.
Figure 4.9: Analyzing process of sourate Al-Fatihah using MFCC (Started)
Figure 4.10: Analyzing process of sourate Al-Fatihah using MFCC (Finished)
Figure 4.11: The input speech sample and spectrogram graph for ‘Bismillah’ utterance
Then, after the analysis process has completed successfully, the process proceeds to the matching analysis, i.e. the Tajweed checking process is executed. If the Quranic recitation was pronounced incorrectly, the engine notifies the user (recitor) of any incorrect word(s) in the ayates involved. The engine shows errors on the Graphical User Interface (GUI) for any incorrect recitation of Al-Quran. Next, the engine guides the user (recitor) towards the correct ayates to be followed, or to recite in order, with playback of the correct recitation. This behaviour can be seen in the GUI shown in figure 4.12, figure 4.13, figure 4.14 and figure 4.15.
Figure 4.12: The incorrect recitation of ‘Bismillah’ utterance (1st mistake/notification)
Figure 4.13: The incorrect recitation part involved and Tajweed rules
Figure 4.14: The incorrect recitation of ‘Bismillah’ utterance (2nd mistake/notification)
Figure 4.15: The incorrect recitation part involved and Tajweed rules
Figure 4.16: The correct recitation of ‘Bismillah’ utterance
On the other hand, if the Quranic recitation is correct, both in its recitation and in its tajweed rules, the engine gives a result and shows the match for the ayates recited by the user (recitor), as visualized in the figure below:
Figure 4.17: The notification of correct recitation of ‘Arrahmaanirrahiim’ utterance
Figure 4.18: The correct recitation of ‘Arrahmaanirrahiim’ utterance
4.7 Summary
This chapter has presented both the logical and the physical aspects of the Automated Tajweed Checking Rules engine for Quranic verse recitation and provided a visualization of the engine’s main graphical user interface. It has also conveyed the functional requirements of the Automated Tajweed checking rules engine through graphical representations.
CHAPTER 5
EXPERIMENTAL RESULTS AND DISCUSSION
5.1 Introduction
In this chapter, the relevant experimental results are presented, based on the findings of this research, and discussed in detail following the system chronology described in the methodology of chapters 3 and 4. The aim is to clearly show the experimental results, starting from the collection of speech samples, followed by feature extraction, then feature training and lastly feature matching/testing. This last part, feature matching/testing, is the main part that evaluates the performance of the Tajweed checking rules engine, with a focus on recognition rate.
5.2 Speech Samples Collection (Recording process)
This section focuses on the collection of speech samples from 5 different speakers (recitors) through a recording process. Each distinct word (ayates in sourate Al-Fatihah) was recorded, and the speech samples were saved for further processing. The speech samples collected comprised 52 words (ayates) and 82 probable phoneme samples from those ayates across different samples of Quranic recitation. These samples are used both in training the Hidden Markov Model (HMM) and in the testing part. The speech samples were recorded in a constrained environment, with 5 selected speakers (recitors) who were highly trained in Quranic recitation according to the Tajweed rules. The first chapter of Al-Quran (Al-Fatihah) was recited, each recitation approximately 4 seconds in length, in ‘.wav’ file format. Table 5.1 summarizes the collected speech samples of sourate Al-Fatihah.
Table 5.1: Excerpt from the dictionary of Sourate Al-Fatihah
The word in the dictionary (wave file assigned): The utterances (phonemes)

Bismillahirrahmanirrahim (Bismillah.wav): Bismi, Llahii, Rraohimani, Rraohiiim
Alhamdu lillahi rabbi alAAalameen (fatihah1.wav): Allhamdu, Lillahhirabbil, A’alamiinna
Arrahmaanirrahiim (fatihah2.wav): Alrrahmani, Alrraheemi
Maalikiyawmiddiini (fatihah3.wav): Maaliki, Yawmi, Alddeeni
Iyyakana’Abudu waiyyaka nastaeen (fatihah4.wav): Iyyaka, naA’Abudu, waiyyaka, nastaAAeenu
Ihdinaassiratholmustakiim (fatihah5.wav): Ihdina, Alssiratho, Almustaqeema
SiraathollazinaAn’amta’Alaihim ghayrillmaghdoobi’Alaihim waladdholeen (fatihah6.wav & fatihah7.wav): Siratho, Allatheena, An’Aamta, ‘AAalayhim, Ghayri, Almaghdoobi, ‘AAalayhim, Wala, Alddhalleena
For the phoneme templates, each ayat sound file of sourate Al-Fatihah was segmented into individual files by cutting out only the desired part or specified region (region of interest), using the GoldWave editor. The input parameters were all set to the same values, in order to avoid any inconsistency in the resulting values. The summary of the phonemes collected from the speech samples of the 8 ayates is listed in table 5.2.
Table 5.2: Summary of the Total Collected Speech Samples for each Ayates

Ayates in wave file: No. of Collected Speech Samples
Bismillah.wav: 17 Samples
fatihah1.wav: 9 Samples
fatihah2.wav: 6 Samples
fatihah3.wav: 7 Samples
fatihah4.wav: 11 Samples
fatihah5.wav: 9 Samples
fatihah6.wav: 11 Samples
fatihah7.wav: 12 Samples
Total No. of Speech Samples: 82 Samples
5.3 Result of Feature Extraction
This part presents the experimental results of the MFCC (Mel-Frequency Cepstral Coefficient) feature extraction algorithm. The feature extraction process was applied to all 52 word and 82 phoneme speech samples of Quranic recitation collected. MFCC cepstral coefficient values are obtained from each input speech sample and then transformed into the output feature vector format. Based on the results, the data ends up as 398 columns with a 13-dimensional feature vector (12 coefficients + 1 log energy).
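The thirteenth feature mentioned above, the frame log energy appended to the 12 cepstral coefficients, can be sketched as below. This is an illustrative sketch only, not the thesis MFCC code; the frame is a hypothetical list of samples, and a small floor avoids log(0) for silent frames.

```python
import math

# Illustrative sketch of the log-energy feature: the log of the sum of
# squared samples in one analysis frame, floored to avoid log(0).

def frame_log_energy(frame):
    return math.log(sum(s * s for s in frame) + 1e-12)
```

Concatenating this value with the 12 cepstral coefficients per frame yields the 13-dimensional feature vector described above.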
5.4 Result of Features Training
After the feature extraction process has executed, the recognition process compares the extracted features with a reference model. This reference model is developed after the enrolment or training phase has been implemented successfully. In this case, the reference models (stored models in the database) consist of 2 types: Word-based Models and Phoneme-based Models. The reference model for the phoneme-based model differs entirely from the word-based model, in which the extracted speech features are compared directly to the word templates. Each word template in the direct matching model is stored as a vector of feature parameters. The word-based model was used as the first model, while the phoneme-based model was the second model used for template matching in the testing/recognition part. The phoneme-based model performs phoneme-like template matching, where the word templates are stored as phoneme-like template parameters. The phoneme-based model is discussed in detail later, under the Tajweed checking rules database and testing (phoneme-like templates).
The experimental work of performing Hidden Markov Model (HMM) feature training has been completed. The feature vectors produced by the Mel-Frequency Cepstral Coefficient (MFCC) are combined to create the database that serves as the HMM model, used specifically to provide the templates for matching while training the data. The feature vectors of each distinct word are combined to create a database entry for that word, where the values of (A, pi0, mu, sigma) are evaluated and stored in the database. Table 5.3 shows the result of creating the HMM models for the particular recitations of Al-Quran during the enrolment or training phase, in (.mat) format. Each distinct word in the dictionary was trained against the initial HMM template mentioned earlier, for 8 training iterations.
Table 5.3: Template Data of HMM Model for Collected Quranic Recitations

HMM Model (Word/ayates-like template): Alfatihah_model.mat

The word in the dictionary (wave file assigned): HMM Model (Phoneme-like template)
Bismillahirrahmanirrahim (Bismillah.wav): Bismillah_model.mat
Alhamdu lillahi rabbi alAAalameen (fatihah1.wav): ayat1_model.mat
Arrahmaanirrahiim (fatihah2.wav): ayat2_model.mat
Maalikiyawmiddiini (fatihah3.wav): ayat3_model.mat
Iyyakana’Abudu waiyyaka nastaeen (fatihah4.wav): ayat4_model.mat
Ihdinaassiratholmustakiim (fatihah5.wav): ayat5_model.mat
SiraathollazinaAn’amta’Alaihim ghayrillmaghdoobi’Alaihim waladdholeen (fatihah6.wav & fatihah7.wav): ayat6_model.mat
As mentioned earlier, the system contains 2 separate HMM model templates built from the training corpus. The first model is the word (ayates) template, while the second is the phoneme-like template. The training corpus was exercised with 2 tests, described in part 5.5 and discussed later. From the corpus, 82 samples of Quranic recitation phoneme-like templates were produced and converted into phoneme strings using the Quranic pronunciation rules. The templates were taken from the 8 words (ayates) of sourate Al-Fatihah and then manually arranged into 7 model files, stored in the database as HMM models in (.mat) format, as shown in table 5.3. These models not only recognize the phonemes but also check the tajweed rules that govern the recitation of Al-Quran. In every experiment executed, both the training data and the uttered word templates come from the same speaker (recitor).
5.4.1 Tajweed Checking Rules Database

The database contains 8 ayates of Sourate Al-Fatihah with 52 samples of utterances, plus another 28 phonemes from those ayates with 82 samples of input phonemes. The engine scans the input Holy Quran Ottoman sound and text, searching for symbols and features, and generates a code, a pronunciation status and the acoustic characteristics of each possibly pronounced character (such as voicing, place of articulation, nasalization and aspiration). The engine then analyses those codes and characteristics and generates the corresponding correct phonetic transcription, according to the Quranic recitation rules and their exceptions. The HMM enrolment/training part gathers all this information to develop the phoneme-based templates (recitation patterns) at the probable pronunciation locations. These pronunciation patterns are matched against the pronunciation-variant rules during the matching and testing process. The engine database contains 10 rules covering pronunciation errors in Quranic recitation; the recitation-error hypotheses are listed below:
Table 5.4: The Tajweed Pronunciation Rules in Sourate Al-Fatihah
(The Arabic words/phonemes in the first column did not survive transcription.)

Word/Sentence/Phoneme | Ahkam al-Tajweed
(Arabic examples) | Mad Asli Mutlak
(Arabic examples) | Idgham Syamsi: alif lam meet ra; alif lam meet dal; alif lam meet syad; alif lam meet zai
(Arabic examples) | Mad ‘arid Lissukun: letter of mad has been Waqf (stop)
(Arabic examples) | Izhar Syafawi: min sukoon meet dal; min sukoon meet ta’; min sukoon meet ghim; min sukoon meet wau
(Arabic examples) | Izhar Qamari: alif lam meet ‘ain
(Arabic examples) | Izhar Halqi: nun sukoon meet ‘ain
Besides the tajweed rules listed above, 4 additional Ahkam al-Tajweed are also checked: Iqlab, Idgham Bila Ghunna, Idgham Ma’al Ghunna and Ikhfa’ Haqiqi.
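The rule lookup performed by the engine might look like the following sketch. The transliterated trigger patterns and the `rules_for` helper are hypothetical stand-ins for the engine's actual encoding of the Arabic letter sequences in table 5.4.

```python
# Hedged sketch of the rule lookup in the Tajweed checking database.
# The transliterated trigger patterns below are illustrative stand-ins for the
# Arabic letter sequences of Table 5.4, not the engine's real encoding.

TAJWEED_RULES = {
    "alif lam meet ra": "Idgham Syamsi",
    "alif lam meet 'ain": "Izhar Qamari",
    "nun sukoon meet 'ain": "Izhar Halqi",
    "min sukoon meet dal": "Izhar Syafawi",
}

def rules_for(patterns):
    """Return the Ahkam al-Tajweed triggered by the detected letter patterns."""
    return [TAJWEED_RULES[p] for p in patterns if p in TAJWEED_RULES]

print(rules_for(["alif lam meet ra", "unknown pattern"]))  # ['Idgham Syamsi']
```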
5.5 Result of Features Matching/Testing

This part presents the experimental results of running the MFCC algorithm for feature extraction on the Quranic recitation speech samples and then matching/testing the features against the trained HMM (Hidden Markov Model) data templates, using the same HMM classification method. As mentioned in part 5.4, these data templates act as template matching, a form of pattern recognition in which each word/ayates or phoneme is stored as a separate template (phoneme-like template and word (ayates) template). Both templates are used as reference models for the recognition task. Any input passing through the engine is compared with the stored templates, and the template that most closely matches the incoming speech pattern identifies the recognized word (ayates) or phrase/phoneme.
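The template-matching decision described above amounts to ranking the stored templates by their score for the input, as in the MLM rows of table 5.5. A minimal sketch, with illustrative scores rather than values from the experiment:

```python
# Sketch of the template-matching decision: score the input against every
# stored template and report the most likely model (MLM) first.

def recognize(scores):
    """scores: {template_name: log-likelihood of the input under that template}.
    Returns template names ranked from best to worst match."""
    return sorted(scores, key=scores.get, reverse=True)

llh = {"ayat1": -4767.5, "ayat2": -439.4, "ayat3": -5150.1}  # illustrative values
ranking = recognize(llh)
print(ranking[0])  # best-matching template: ayat2
```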
The automated Tajweed checking rules engine acts on any Quranic recitation it receives: every speech input passing through the system produces an output score, on which the engine bases its judgement. A score value measuring the confidence of a recognized word therefore needs to be determined. In addition, the ayates and phonemes are classified into 2 categories, In-Vocabulary (IV) data and Out-of-Vocabulary (OOV) data, to ensure that the engine is suited to checking the tajweed rules. The basic idea for separating IV and OOV phonemes/words is that the likelihood difference between the best and 2nd-best results of an OOV input is smaller than that of an IV input, because no model matches the OOV input. As mentioned in chapter 3, the standard Log Likelihood Ratio (LLR) and the augmented LLR are used, according to the equation below:
LLR = (1/N) [log P(Obest) - log P(O2ndbest)]     (1)

where N is the length of the input utterance, log P(Obest) is the largest log likelihood and log P(O2ndbest) is the second largest log likelihood.
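Equation (1) can be computed directly from the per-template log likelihoods; a small sketch, assuming the scores are already available:

```python
# Length-normalized log-likelihood ratio of equation (1):
#   LLR = (1/N) * [log P(O|best) - log P(O|2nd best)]

def llr(log_likelihoods, n_frames):
    """log_likelihoods: log P(O|model) for every template;
    n_frames: length N of the input utterance."""
    ordered = sorted(log_likelihoods, reverse=True)
    return (ordered[0] - ordered[1]) / n_frames

print(llr([-560.0, -800.0, -950.0], n_frames=120))  # (-560 - (-800)) / 120 = 2.0
```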
According to Yongwon, J. et al. (2001), the log likelihood of the word itself is not an appropriate measurement for setting the threshold value, which is why equation (1) is used in their research. In this case, equation (1) needs to be modified in order to improve reliability. If the input utterance of an IV word/phoneme is changed a little, the recognition result will not change much, because of the relatively large likelihood difference between the best and 2nd-best results. For an OOV word/phoneme, however, the result for a changed input will very probably differ from that for the original input. Because of this, a perturbed input is employed to improve the robustness of the confidence score. Several methods are applied to perturb the input feature vector:
coef1 = k1*coef;
coef2 = coef – k2*mc;
coef3 = coef – k3*Oc;
In the formulas above, coef is the feature vector, mc is the mean vector of the feature vector of the input speech, and Oc is the standard deviation vector of the feature vector of the input speech. k1, k2 and k3 are constant values which need to be adjusted so that the divergence between the recognition results of the original and perturbed feature vectors remains below 10%, especially for IV words/phonemes. If the recognized word does not change after the perturbation, a certain value k is added to the log likelihood ratio (LLR), as follows:

LLRA = LLR + k; if Wo = Wp
LLRA = LLR; if Wo ≠ Wp
Here, Wo is the recognized word from the original input feature vector and Wp is the recognized word from the perturbed input feature vector. The threshold value for LLRA is set by training on the IV and OOV inputs, and the LLRA result is obtained once the testing process has executed successfully. If LLRA > threshold, the input is considered an IV word/phoneme; if LLRA < threshold, it is considered an OOV word/phoneme. This threshold setting can, however, be changed, depending on the MATLAB program developed. After LLRA had been implemented, the result obtained was not perfect, especially for the recognition of the Tajweed checking rules, which are presented in terms of phonemes. LLRA had been presented and used before by Yongwon, J. et al. (2001), but for single or direct word recognition.
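The perturbation formulas and the LLRA adjustment above can be sketched as follows. The constants k, k1, k2 and k3 and the example values are placeholders; in the thesis they are tuned on the data.

```python
# Sketch of the input perturbation and the augmented LLR (LLRA).
# k, k1, k2, k3 are illustrative constants, not the thesis's tuned values.

def perturb(coef, mc, oc, k1=0.98, k2=0.02, k3=0.02):
    """The three perturbations of the feature vector described in the text."""
    coef1 = [k1 * c for c in coef]                      # coef1 = k1*coef
    coef2 = [c - k2 * m for c, m in zip(coef, mc)]      # coef2 = coef - k2*mc
    coef3 = [c - k3 * o for c, o in zip(coef, oc)]      # coef3 = coef - k3*Oc
    return coef1, coef2, coef3

def llra(llr_value, w_original, w_perturbed, k=50.0):
    """LLRA = LLR + k if the recognized word is unchanged (Wo = Wp), else LLR."""
    return llr_value + k if w_original == w_perturbed else llr_value

print(llra(-1050.0, "ayat2", "ayat2"))  # -1000.0: a stable result earns the bonus k
```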
In this case, LLRA is not suitable for phoneme input, because the LLR values for IV phonemes and OOV phonemes are almost the same. Thus, an alternative to LLRA was adopted: the LLR difference is divided by the largest log likelihood, as in the equation below:

Diff_ratio = [log P(Obest) - log P(O2ndbest)] / log P(Obest)     (2)
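Equation (2) can be sketched as below. Since the best log likelihood is typically negative, this sketch normalizes by its magnitude so that IV inputs yield the positive ratios (above 0.2) described in part 5.5.1; that normalization is an assumption, not stated explicitly in the text.

```python
# Difference ratio of equation (2):
#   Diff_ratio = [log P(O|best) - log P(O|2nd best)] / log P(O|best)
# Normalizing by abs() of the best score is an assumption made here so that a
# larger best/2nd-best gap gives a larger positive ratio.

def diff_ratio(log_likelihoods):
    ordered = sorted(log_likelihoods, reverse=True)
    return (ordered[0] - ordered[1]) / abs(ordered[0])

print(round(diff_ratio([-1000.0, -1300.0]), 2))  # (-1000 + 1300) / 1000 = 0.3
```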
Two tests were performed on this system in order to evaluate its performance. As mentioned in part 5.4, in every experiment both the training data and the uttered word templates come from the same speaker. Tables 5.6 and 5.9 show the overall results of the two testing sets, respectively.
5.5.1 Testing - Word (Ayates)-Like Template

In this part, the LLR threshold value is -1100 and the difference-ratio threshold is 0.2. If LLRA > -1100, the input is considered an IV word; if LLRA < -1100, it is considered an OOV word. Moreover, the results obtained with equation (2) give diff_ratio values almost always bigger than 0.2 for IV input, while most OOV inputs give values less than 0.2. This can be seen in table 5.5 below, where diff_ratio values above 0.2 (IV words) were highlighted in red and values below 0.2 (OOV words) in blue. All 8 ayates of Sourate Al-Fatihah shown below were therefore categorized as IV words. In the application of this engine, whenever an input is claimed to be an OOV word/ayates, a notification of incorrect recitation of Sourate Al-Fatihah is issued, together with references to the relevant Tajweed rules, for evaluation purposes. Whenever an IV input is identified as IV, a correct-IV detection is notified, and the identified ayates of Sourate Al-Fatihah is played back with the correct recitation.
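The word-level decision in this part can be sketched as follows, using the quoted thresholds (-1100 for LLRA and 0.2 for the difference ratio). Combining the two checks with a logical AND is an assumption based on the description.

```python
# Sketch of the word-level IV/OOV decision of part 5.5.1.
# Thresholds are those quoted in the text; joining them with AND is an assumption.

def classify_word(llra_value, diff_ratio_value,
                  llra_threshold=-1100.0, ratio_threshold=0.2):
    """Return 'IV' (accepted recitation) or 'OOV' (triggers a Tajweed notification)."""
    if llra_value > llra_threshold and diff_ratio_value > ratio_threshold:
        return "IV"
    return "OOV"

print(classify_word(-568.5, 0.35))   # IV
print(classify_word(-1250.0, 0.05))  # OOV
```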
Table 5.5: Result of Likelihood Ratio (LLR) for 8 recitations of speech samples (1.0 x 10^3)
Sequence x1 x2 x3 x4 x5 x6 x7 x8
logP(X│Θ1) 0.2112 -3.9878 -4.4179 -4.6103 -5.1018 -5.4842 -5.6575 -5.7628
MLM 1 3 6 7 8 5 2 4
logP(X│Θ2) 0.4394 -4.7675 -4.8948 -4.9438 -5.1501 -5.2021 -5.5128 -5.8265
MLM 2 7 8 5 6 1 4 3
logP(X│Θ3) 0.2472 -3.6302 -3.9481 -4.5353 -4.8883 -5.0468 -5.1351 -5.1712
MLM 3 1 6 7 2 8 4 5
logP(X│Θ4) 0.7253 -3.9347 -4.1251 -4.2471 -4.4244 -4.5630 -4.6807 -4.8629
MLM 4 6 1 5 7 3 8 2
logP(X│Θ5) 0.2659 -4.8868 -5.7913 -5.9782 -6.6163 -7.6434 -7.6572 -8.4972
MLM 5 6 7 1 8 3 4 2
logP(X│Θ6) 0.2667 -4.4097 -4.8590 -4.8904 -4.9843 -5.3690 -5.8303 -7.7457
MLM 6 7 5 8 1 3 4 2
logP(X│Θ7) 0.6612 -4.6829 -5.1914 -5.3106 -5.4521 -6.4570 -6.9848 -7.8626
MLM 7 6 8 1 5 3 4 2
logP(X│Θ8) 0.8678 -4.3930 -4.6584 -4.8508 -5.1978 -5.8682 -6.4164 -7.1213
MLM 8 1 7 5 6 4 3 2
MLM = Most Likely Model
Table 5.6: Test result for 8 recitations of speech samples (ayates of sourate Al-Fatihah)
(The Arabic ayates labels did not survive transcription; rows are labelled by the corresponding wave files, following the order used in table 5.9.)

Ayates/Articulation | # of utterances | Correct | Wrong | % Accuracy | % Word error rate
Bismillah.wav | 5 | 5 | 0 | 100 | 0
fatihah1.wav | 5 | 5 | 0 | 100 | 0
fatihah2.wav | 7 | 7 | 0 | 100 | 0
fatihah3.wav | 6 | 6 | 0 | 100 | 0
fatihah4.wav | 9 | 8 | 1 | 88.89 | 11.1
fatihah5.wav | 9 | 9 | 0 | 100 | 0
fatihah6.wav | 6 | 4 | 2 | 66.67 | 33.33
fatihah7.wav | 5 | 4 | 1 | 80 | 20
Total | 52 | 48 | 4 | 91.95 | 8.05
For the first test, 8 ayates of sourate Al-Fatihah were tested; the result is shown above in table 5.6. In this experiment, the extracted features of the 8 ayates of Quranic recitation were directly compared to the word templates (word-based model). The accuracy on the training data reached 91.95%, with only 4 errors and a Word Error Rate (WER) of 8.05%. This is better than the results of the previous research carried out by Ehab, M. et al. (2007) and Anwar, M.J. et al. (2006), whose recognition accuracy rates were 85% and 89% respectively.
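The aggregation behind table 5.6 is worth making explicit. Using the per-ayates counts from the table, the quoted 91.95% is (to within rounding) the mean of the per-ayates accuracies (a macro average), while pooling all 52 utterances gives a slightly different figure:

```python
# Sketch of how the totals in Table 5.6 aggregate.
# Per-ayates counts (total, correct) are copied from the table.
rows = [(5, 5), (5, 5), (7, 7), (6, 6), (9, 8), (9, 9), (6, 4), (5, 4)]

per_ayates = [100.0 * c / n for n, c in rows]
macro = sum(per_ayates) / len(per_ayates)                 # mean of per-ayates accuracies
micro = 100.0 * sum(c for _, c in rows) / sum(n for n, _ in rows)  # pooled accuracy

print(round(macro, 2))  # 91.94, within rounding of the quoted 91.95
print(round(micro, 2))  # 92.31, pooled over all 52 utterances
```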
5.5.2 Testing – Phonemes-Like Template

As mentioned earlier, the phonemes-like template experiment was carried out in order to check the Tajweed rules for the given ayates of Quranic recitation. Note that the threshold value for this experiment is -500, with a difference-ratio value of 0.01. The direction of the threshold is opposite to that of the previous testing process: if LLRA > -500, the input is considered an OOV phoneme, while if LLRA < -500 it is considered an IV phoneme. If a particular utterance is detected as an OOV phoneme, the identification and verification process for the pronunciation-rule error (Tajweed rules) is executed, meaning that the pronunciation of that particular Quranic recitation is detected as false/incorrect. Tables 5.7 and 5.8 below show the experimental results for the two sample phonemes in "Bismillahir <rahmaanir> rahimi" and "Bismillahir rahmaanir <rahiimi>" respectively, for a better understanding.
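The phoneme-level decision, with its inverted threshold direction, can be sketched as follows. The rule lookup table is illustrative, built only from the two examples of tables 5.7 and 5.8.

```python
# Sketch of the phoneme-level decision of part 5.5.2: here LLRA > -500 means
# OOV (a mispronounced phoneme), and an OOV detection triggers the Tajweed-rule
# notification. The rule table is an illustrative stand-in.

PHONEME_RULES = {"rahmaanir": "Mad Asli Mutlak",
                 "rahiimi": "Mad 'arid Lissukun"}

def check_phoneme(llra_value, phoneme, threshold=-500.0):
    if llra_value > threshold:                      # OOV: incorrect recitation
        rule = PHONEME_RULES.get(phoneme, "unknown rule")
        return "incorrect: check " + rule
    return "correct"                                # IV: accepted recitation

print(check_phoneme(54.4, "rahmaanir"))    # incorrect: check Mad Asli Mutlak
print(check_phoneme(-568.5, "rahmaanir"))  # correct
```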
Table 5.7: Comparison between correct and incorrect Tajweed rules for ayates "Bismillahir <rahmaanir> rahimi"

Correct recitation
  Utterance (articulation): Bismillahir RAHMAANIR rahimi
  Output score (1.0e+003 *):
    -1.5521 -1.3030 -2.1808 -2.2018 -0.7968 -1.1091 -0.6398 -0.6541 -0.5685
    -0.8995 -1.0463 -0.8684 -1.1624 -0.6604 -1.0446 -0.6033 -0.7845
  Log-likelihood (LLR), sorted (1.0e+003 *):
    -0.5685 -0.6033 -0.6398 -0.6541 -0.6604 -0.7845 -0.7968 -0.8684 -0.8995
    -1.0446 -1.0463 -1.1091 -1.1624 -1.3030 -1.5521 -2.1808 -2.2018
  Tajweed rule: -

Incorrect recitation
  Utterance (articulation): Bismillahir RAHMUUNIR rahimi
  Output score (1.0e+003 *):
    -1.2703 -0.9670 -2.2708 -1.4974 -0.8738 -0.9279 -0.5621 -0.7777 -0.8362
    -0.7123 -0.9958 -0.7422 -0.9294 -0.4929 -1.0265 0.0544 -0.9155
  Log-likelihood (LLR), sorted (1.0e+003 *):
    0.0544 -0.4929 -0.5621 -0.7123 -0.7422 -0.7777 -0.8362 -0.8738 -0.9155
    -0.9279 -0.9294 -0.9670 -0.9958 -1.0265 -1.2703 -1.4974 -2.2708
  Tajweed rule: Mad Asli Mutlak
Table 5.8: Comparison between correct and incorrect Tajweed rules for ayates "Bismillahir rahmaanir <rahiimi>"

Correct recitation
  Utterance (articulation): Bismillahir rahmaanir RAHIIMI
  Output score (1.0e+003 *):
    -2.0779 -1.6710 -1.9139 -2.0321 -1.2066 -1.1630 -1.1592 -1.2839 -0.8137
    -1.5029 -1.6649 -1.5198 -1.6956 -0.8598 -1.4082 -1.1358 -1.7441
  Log-likelihood (LLR), sorted (1.0e+003 *):
    -0.8137 -0.8598 -1.1358 -1.1592 -1.1630 -1.2066 -1.2839 -1.4082 -1.5029
    -1.5198 -1.6649 -1.6710 -1.6956 -1.7441 -1.9139 -2.0321 -2.0779
  Tajweed rule: -

Incorrect recitation
  Utterance (articulation): Bismillahir rahmaanir RAHUUMI
  Output score (1.0e+003 *):
    -1.8007 -1.6138 -2.5081 -2.7402 -0.9999 -1.0721 -0.6768 -0.7334 0.0782
    -1.1591 -1.3392 -1.0342 -1.4091 -0.7923 -1.4355 -0.6912 -0.9625
  Log-likelihood (LLR), sorted (1.0e+003 *):
    0.0782 -0.6768 -0.6912 -0.7334 -0.7923 -0.9625 -0.9999 -1.0342 -1.0721
    -1.1591 -1.3392 -1.4091 -1.4355 -1.6138 -1.8007 -2.5081 -2.7402
  Tajweed rule: Mad ‘arid Lissukun: letter of mad has been Waqf (Stop)
According to the LLR results in tables 5.7 and 5.8, values below the threshold represent IV phonemes, while values above it represent OOV phonemes. Two different phonemes, from the ayates "Bismillahir <rahmaanir> rahimi" and "Bismillahir rahmaanir <rahiimi>", were successfully tested. The best LLR values obtained for the correct recitations are -0.5685 and -0.8137 (x 10^3), which lie below the LLR threshold (LLR < -500) and were classified as IV phonemes (correct recitation). On the other hand, the values 0.0544 and 0.0782 were categorized as OOV phonemes (incorrect recitation), since they lie above the LLR threshold (LLR > -500). For the first phoneme, the tajweed pronunciation error concerns 'Mad Asli Mutlak': the phoneme must be pronounced 'rahmaanir', not 'rahmuunir', with 2 harakat of recitation. The pronunciation of the 2nd phoneme was likewise detected as false with respect to the tajweed rule Mad ‘arid Lissukun (the letter of mad has been Waqf (Stop)), since the phoneme must be pronounced 'rahiimi', not 'rahuumi'.
The sample phonemes shown in tables 5.7 and 5.8 are 2 of the 28 Quranic recitation phonemes in the overall result, tested purposely to check the Tajweed rules in this sourate. In this experiment, as shown in table 5.9 below, the feature vectors of the input phonemes matched the phoneme-based templates with an accuracy of 86.41% and an error rate of only 14.34%. Although the accuracy in this experiment is somewhat smaller than the previous result in table 5.6, it is still within expectation, because this experiment involved a larger number of samples for testing. Furthermore, the method used here is much simpler than LLRA, since it only needs to calculate the perturbed value, with an easier calculation.
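As with table 5.6, the quoted 86.41% for table 5.9 matches the mean of the per-file accuracy percentages (a macro average), while pooling all 82 utterances gives a lower figure:

```python
# Sketch of how the phoneme totals in Table 5.9 aggregate.
# Counts (total, correct) and printed percentages are copied from the table.
rows = [(17, 16), (9, 8), (6, 6), (8, 8), (11, 8), (9, 8), (12, 8), (10, 8)]
percents = [94.12, 88.89, 100.0, 100.0, 72.72, 88.89, 66.67, 80.0]

macro = sum(percents) / len(percents)                              # mean of per-file accuracies
micro = 100.0 * sum(c for _, c in rows) / sum(n for n, _ in rows)  # pooled accuracy

print(round(macro, 2))  # 86.41, the figure quoted in the text
print(round(micro, 2))  # 85.37, pooled over all 82 utterances
```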
Table 5.9: Test result for 28 recitations of speech samples (Phonemes)

Ayates (.wav) | Phonemes | # of utterances | Correct | Wrong | % Accuracy | % WER
Bismillah.wav | Bismi, Llahii, Rraohimani, Rraohiiim | 17 | 16 | 1 | 94.12 | 5.88
fatihah1.wav | Allhamdu, Lillahhirabbil, A’alamiinna | 9 | 8 | 1 | 88.89 | 11.1
fatihah2.wav | Alrrahmani, Alrraheemi | 6 | 6 | 0 | 100 | 0
fatihah3.wav | Maaliki, Yawmi, Alddeeni | 8 | 8 | 0 | 100 | 0
fatihah4.wav | Iyyaka, naA’Abudu, waiyyaka, nastaAAeenu | 11 | 8 | 3 | 72.72 | 33.3
fatihah5.wav | Ihdina, Alssiratho, Almustaqeema | 9 | 8 | 1 | 88.89 | 11.11
fatihah6.wav | Siratho, Allatheena, An’Aamta, ‘AAalayhim, Ghayri, Almaghdoobi | 12 | 8 | 4 | 66.67 | 33.33
fatihah7.wav | ‘AAalayhim, Wala, Alddhalleena | 10 | 8 | 2 | 80 | 20
Total | 28 phonemes | 82 | 70 | 12 | 86.41 | 14.34
Figures 5.1 and 5.2 below are bar charts of the percentage accuracy and Word Error Rate (WER) for ayates and phonemes, summarizing the overall results of the previous experiments.

Figure 5.1: Percentage of accuracy for recognition rate (Ayates & Phonemes)

Figure 5.2: Percentage of Word Error Rate (WER) for ayates & phonemes
Based on figure 5.1, 2 ayates achieved 100% accuracy for both ayates and phonemes: fatihah 2 and fatihah 3 (ayates 2 & 3). Hence, the WER for these ayates remains at 0%, with no error detected, as can be seen in the bar chart in figure 5.2. This is probably because those ayates are short sentences and phonemes, which avoids complexity during the matching and recognition process. Meanwhile, fatihah 6 (ayates 6) achieved the smallest accuracy: its ayates and phonemes reached only 66.67%, while its WER reached the highest value of 33.33%. The rationale behind this result is probably the complexity of pronouncing this ayates, as well as the difficulty of matching and recognizing the exact utterance properly.
5.6 Summary

The overall process conducted in this research is shown clearly in the Data Flow Diagram (DFD) and flowchart of the Tajweed Checking Rules Engine in chapter 4. Based on that DFD and flowchart, the processes involved in this research can be clearly seen and justified.

The experimental results obtained fulfilled the targeted criteria and goals set and planned earlier, although some limitations and unexpected problems occurred while recording and running the simulation process.
CHAPTER 6
CONCLUSION AND FUTURE ENHANCEMENT
6.1 Introduction
This chapter briefly discusses the processes involved in this project and presents recommendations for future enhancement of the overall research, in order to make the system more efficient and sophisticated. It also highlights the significance and contributions of this research, explains the weaknesses and strengths of the work, and proposes improvements and future works. Lastly, the entire research is summarized at the end of the chapter.
6.2 Significance and Contributions of the Tajweed Checking Rules Engine for Quranic Verse Recitation

The significance of this research project is that it:

(i) Provides an alternative way to learn Al-Quran recitation, helping to create a knowledgeable society.

(ii) Facilitates students in reading Al-Quran at their own pace and time. The engine can also serve as a self-learning tool for working adults with time constraints who wish to learn Al-Quran.

(iii) Is capable of checking the tajweed rules against the stored database.

(iv) Enhances skills and understanding of Quranic reading in a faster way.

(v) Promotes Quranic literacy and explores new approaches in signal processing technology.

(vi) Supports the Quranic learning process, especially in the j-QAF educational programme, a complementary school programme that utilizes current ICT developments and the j-QAF curriculum to assist in reciting Al-Quran using interactive learning techniques.

(vii) Encourages Muslims, including newly converted Muslims and students, to advance their recitation and to learn and practice Islam in a more convenient and effective way.
6.3 Observations on Weaknesses and Strengths

Different observers and researchers hold different opinions and views when testing and evaluating this system. That is normal, since any system has its own strengths and weaknesses, as discussed below.

6.3.1 Strengths

Generally, reliable speech recognition is a hard problem that requires a combination of many techniques. Nevertheless, the alternative methods implemented in this research achieved the targeted objectives with their own strengths. The strengths of this research project are:
(i) In this modern technological era, speech interaction systems help users achieve their objectives easily and quickly, and an interactive speech recognition system eases and speeds up the communication process.

(ii) The automated Tajweed Checking Rules engine for Quranic verse recitation enables the user to recite Al-Quran through the MATLAB Graphical User Interface (GUI), hear the correct recitation and hence determine the proper way to recite Al-Quran. As a result, personal improvement in reading Al-Quran can be tracked in real time without delay.

(iii) The interactive engine is a self-learning educational tool that can support students in j-QAF learning, especially in learning Al-Quran (Tasmik & Khatam al-Quran model). It can also ease the work of j-QAF teachers in teaching the Quranic syllabus.

(iv) The project opens a path for more researchers to build on work done by University of Malaya students, since students may refer to this project when developing systems of the same nature.

(v) It allows the interchange of ideas and collaboration between 2 or more faculties (inter-faculty) or agencies, in order to produce an excellent product for the benefit of the Muslim community.
(vi) The engine developed shows promising results, with nearly exact matches of the recitors' preferences and entries.

(vii) This research shows that the combination of MFCC feature extraction and HMM classification works well and produces excellent results in Quranic speech recognition.

(viii) The most challenging task in this research was to combine Al-Quran recitation with a speech recognition system, together with the engine's capability of checking the tajweed rules. The engine achieved recognition rates of 91.95% (ayates) and 86.41% (phonemes), which indicates that it was successful.
6.3.2 Weaknesses

Throughout these years, the research also faced problems and difficulties, owing to limitations and weaknesses in the speech recognition research area. The weaknesses of this research project are:

(i) Implementing a Quranic speech recognition system for all chapters of Al-Quran is not an easy job, since this technology is still new in the market; the required software and hardware may not yet be available.

(ii) Most past research was executed and implemented for the English language only. Thus, the implementation of Quranic recitation in speech recognition systems is still at an early stage and needs considerable improvement.

(iii) Speaker recognition is a difficult task. It is very hard to obtain an exact match with high accuracy, especially across training and testing sessions, which can differ greatly due to many factors, such as changes of the human voice over time, health conditions (e.g. the speaker has a cold) and recording environments.
6.4 Future Research

The engine developed showed promising results, although it was only tested on a small Quranic chapter (i.e. Sourate Al-Fatihah). The research is still at an early stage, and it needs proper attention and improvement to make the engine more capable and useful to end users. The implementation of Quranic recitation in speech recognition systems, especially for checking the Tajweed rules, will always open new developments in this technology, inviting more researchers and creativity. Many things need to be considered in order to improve the system further in the future. Below are the proposed tasks for improving the engine:

(i) The engine should accept more test cases of Quranic recitation input from various users. The engine must become multi-user, accepting voice input from different people, in order to develop a larger evaluation database.
(ii) The engine should be integrated with hardware, allowing users to run it as a real (portable) device rather than as a simulation. The integration process could be costly and very time-consuming, but it would result in a very effective and efficient system.
6.5 Conclusion

This research has covered many aspects of speech recognition, and its findings will be highly beneficial for learning Al-Quran in a more interesting manner while complying with established Islamic ways and rules. For recognition purposes, the recitor's recitation score was evaluated against the database system for transparent evaluation, to ensure that the learning experience is optimized. The research successfully achieved its objectives and will hopefully give many benefits to the end users for whom it was designed. The automated Tajweed Checking Rules engine for Quranic verse recitation showed both strengths and weaknesses after it was successfully developed. The achievements of this engine are valuable, as they will serve as references for other researchers and developers of such systems in the future. It is very much hoped that the engine will be implemented in real life and integrated with a hardware system.
REFERENCES
Ahmad, A.M., Ismail, S., Samaon, D.F., 2004, 'Recurrent Neural Network with Backpropagation through Time for Speech Recognition,' IEEE International Symposium on Communications & Information Technology, 2004. ISCIT ‘04. Volume 1, pp. 98 – 102.
Ahmed, M.E., 1991, "Toward an Arabic Text-To-Speech system." The Arabic Journal of Science and Engineering, 1991.
Anwar, M.J., Awais, M.M., Masud, S. & Shamail, S.,” Automatic Arabic Speech Segmentation System.” Department of Computer Science, Lahore University of Management Sciences, Lahore, Pakistan.
Bashir, M.S., Rasheed, S.F., Awais, M.M., Masud, S., & Shamail, S., 2003,'Simulation of Arabic Phoneme Identification through Spectrographic Analysis,' Department of Computer Science, University of Engineering & Technology, Lahore Pakistan, Lahore Pakistan.
Bateman, D. Bye, D. and Hunt, M., 1992, 'Spectral Constant Normalization and Other Techniques for Speech Recognition in Noise,” Proc. IEEE.Inter.Conf. Acoustic. Speech Signal Process, vol.1, pp. 241-244, 1992.
Chetouani, M., Gas, B., Zarader, J.L. & Chavy, C., 2002, ‘Neural Predictive Coding for speech Discriminant Feature Extraction: The DFE-NPC’, ESANN’2002 Proceedings – European Symposium on Artificial Neural Network, Bruges, Belgium, pp. 275-280.
Davis, S.B. & Mermelstein, P., 1980, ‘Comparison of Parametric Representations of Monosyllabic Word Recognition in Continuously Spoken Sentences’, IEEE Transactions on Acoustics, Speech and Signal Processing, 28, pp.357-366.
Ehab, M., Ahmad, S. and Mousa, A. 2007,'Speaker Independent Quranic Recognizer Based on Maximum Likelihood Linear Regression,' Proceedings of World Academy of Science, Engineering and Technology Volume 20 April 2007.
Essa, O., 1998, ‘Using Prosody in Automatic Segmentation of Speech’, Proceeding 36th ACM Southeast Regional Conference, pp. 44 - 49, April 1998.
Essa, O.,”Using Suprasegmentals in Training Hidden Markov Models for Arabic."Computer Science Department, University of South Carolina, Columbia.
Felber, P. 2001, 'Speech Recognition: Report of an Isolated Word Experiment', Department of Electrical & Computer Engineering, Illinois Institute of Technology, Chicago, USA. Available at: http://www.ece.iit/~pfelber/speechrecognition/ retrieved on 1 September 2008.
Habash, M., 1986, “How to memorize the Quran”, Dar al-Khayr, Beirut 1986.
Hansen, J.C., 2003, ‘Modulation based parameter for Automatic Speech Recognition’,Master Thesis of Department of Electrical Engineering, University of Rhode Island, USA.
Hasan, M.R., Jamil, M., Rabbani, M.G. & Rahman, M.S., 2004, ‘Speaker Identification Using Mel Frequency Cepstral Coefficients’, 3rd International Conference on Electrical & Computer Engineering ICECE 2004, 28-30 December 2004, Dhaka, Bangladesh ISBN 984-32-1804-4 565.
Hemantha, G.K., Ravishankar, M., Nagabushan, P. & Basavaraj, S.A., 2006, ‘Hidden Markov Model based approach for generation of Pitman shorthand language symbols for consonants and vowels from spoken English’, Sadhana – June 2006. Vol. 31, part 3, pp. 227-290.
Hermansky, H., 1990, ‘Perceptual linear predictive (PLP) analysis of speech’, The Journal of the Acoustical Society of America -April 1990. Volume 87, Issue 4, pp. 1738-1752.
Hosom, J.P., Cole, R. and Fanty, M. 1999, Speech Recognition Using Neural Networks at the Center for Spoken Language Understanding, Center for Spoken Language Understanding (CSLU) Oregon Graduate Institute of Science and Technology, July 6, 1999.
Huang, X., Acero, A., & Hon, H.W., 2001, Spoken Language Processing: A Guide to Theory, Algorithm and System Development, Prentice Hall, Upper Saddle River, NJ, USA.
Institute for Research in Islamic education (Newspaper), 2007, The New Strait Times Press-26 September 2007 [Online] Available at: http://www.nst.com.my/ retrieved on 20 November 2007.
J. de Veth and L. Boves, 1998, ‘Channel normalization techniques for automatic speech recognition over the telephone’. Speech Communication 25 (1998) 149-164.
Jurafsky, D. & Martin, J.H., 2007, Automatic Speech Recognition: Speech and Language Processing: An Introduction to natural language processing, computational linguistics, and speech recognition, Prentice Hall, New Jersey, USA.
Khalifa, O., Khan, S., Islam, M.R., Faizal, M. & Dol, D., 2004, ‘Text IndependentAutomatic Speaker Recognition’, 3rd International Conference on Electrical & Computer Engineering, Dhaka, Bangladesh, pp.561-564.
Kirchhoff, K., Bilmes, J., Das, S.,Duta,N., Egan,M. Ji,G. He,F.,Henderson,J., D. Liu, M. Noamany, P. Schone, R. Schwartz, D. Vergyri, 2003, ‘Novel approaches to Arabic speech recognition: report from the 2002 Johns-Hopkins Summer Workshop’, IEEE International Conference on Acoustics, Speech, and Signal Processing, 2003 Proceedings. (ICASSP '03). Volume 1, 6-10 April 2003, pp. I-344 - I-347 vol.1
Kirchhoff, K., Vergyri, D., Bilmes, J., Duh, K. & Stolcke, A., 2004, ‘Morphology-based language modeling for conversational Arabic speech recognition’, Eighth International Conference on Spoken Language Processing (ICSLP), ISCA, 2004.
Lee, K.F. & Hon, H.W., 1989, ‘Speaker-Independent Phone Recognition Using Hidden Markov Models’, IEEE Transactions on Acoustics, Speech and Signal Processing, Vol. 37, No. 11, pp. 1641-1648.
Levent, M.A., 1996, “Foreign Accent Classification in American English.” Dissertation for Doctor of Philosophy in Department of Electrical & Computer Engineering, Graduate School of Duke University, Durham, USA.
Linde, Y., Buzo, A. & Gray, R.M., 1980, ‘An Algorithm for Vector Quantizer Design’, IEEE Transactions on Communications, Vol. COM-28, No. 1, pp. 84-95.
Madisetti, V.K. & Williams, D.B., 1999, Digital Signal Processing Handbook, CRCnetBASE, CRC Press LLC, USA.
Martens, J.P., 2002, 'Continuous Speech Recognition over the Telephone', Electronics & Information Systems, Ghent University, Belgium. Available at: http://trappist.elis.ugent.be/ELISgroups/speech/cost249/report/intro.pdf retrieved on 10 September 2008.
Matsui, T., & Furui, S., 1993, 'Comparison of text-independent speaker recognition methods using VQ-distortion and discrete/continuous HMMs'. Proceedings of the 1993 International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Institute of Electrical and Electronic Engineers. Minneapolis, Minnesota, pp.157 – 160.
Maamouri, M., Bies, A. & Kulick, S., 2006, ‘Diacritization: A Challenge to Arabic Treebank Annotation and Parsing’, Proceedings of the Conference of the Machine Translation SIG, 2006.
Nathan, K., Beigi, H.S.M. and Subrahmonia, J., 1995, ‘On-line Unconstrained Handwriting Recognition Based On Probabilistic Techniques’.
Nelson, K., 1985, The Art of Reciting the Qur'an, University of Texas Press, Austin, USA.
Owen, F.J., 1993, ‘Signal Processing of Speech’. Macmillan Press Ltd., London, UK.
Prime Minister's Office of Malaysia, 2006, Ninth Malaysia Plan 2006-2010, Chapter 11: Enhancing Human Capital. Available at: http://www.epu.jpm.my/rm9/english/Chapter11.pdf retrieved on 18 November 2007.
Program j-QAF sentiasa dipantau [The j-QAF programme is constantly monitored] (Newspaper), 2005, Berita Harian Press, 10 May 2005 [Online] Available at: http://www.bharian.com.my/ retrieved on 18 November 2007.
Penutupan Majlis Tilawah al-Quran [Closing of the Quran Recitation Assembly] (Newspaper), 1995, Utusan Malaysia, 10 January 1995. Retrieved on 18 November 2007.
Rabiner, L.R. & Juang, B.H., 1993, Fundamentals of Speech Recognition, Prentice Hall, New Jersey, USA.
Rabiner, L.R., 1989, ‘A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition’, Proceedings of the IEEE, Vol. 77, No. 2, February 1989, pp. 257-286.
Ramzi, A.H. & Omar, E.A., 2007, ‘CASRA+: A Colloquial Arabic Speech Recognition Application’, American Journal of Applied Sciences 4(1): 23-32, 2007, Science Publications.
Sari, T., Souici, L. and Sellami, M., 2002, ‘Off-Line Handwritten Arabic Character Segmentation Algorithm: ACSA’, Proc. Int’l Workshop Frontiers in Handwriting Recognition, pp. 452-457, 2002.
Shen, J., Hung, J. & Lee, L., 1998, ‘Robust Entropy-based Endpoint Detection for Speech Recognition in Noisy Environments’, 5th International Conference on Spoken Language Processing (ICSLP '98), Sydney, Australia, 1998.
Tabbal, H., El-Falou, W. & Monla, B., 2006, ‘Analysis and Implementation of a "Quranic" verses delimitation system in audio files using speech recognition techniques’, In: Proceedings of the 2nd IEEE International Conference on Information and Communication Technologies (ICTTA '06), Vol. 2, pp. 2979-2984.
Quatieri, T.F., 2002, Discrete-Time Speech Signal Processing: Principles and Practice, Prentice Hall, New Jersey, USA.
Ursin, M., 2002, ‘Triphone Clustering in Finnish Continuous Speech Recognition’, Master Thesis, Department of Computer Science, Helsinki University of Technology, Finland.
Vergyri, D. & Kirchhoff, K., 2004, ‘Automatic Diacritization of Arabic for Acoustic Modeling in Speech Recognition’, COLING Workshop on Arabic-script Based Languages, Geneva, 2004.
Viterbi, A.J., 1967, ‘Error bounds for convolutional codes and an asymptotically optimum decoding algorithm’, IEEE Transactions on Information Theory, Vol. IT-13, pp. 260-269, April 1967.
Vuuren, S.V., 1996, ‘Comparison of Text-Independent Speaker Recognition Methods on Telephone Speech with Acoustic Mismatch’, Proceedings of ICSLP '96, Vol. 3, Philadelphia, PA, pp. 1788-1791.
Yongwon, J. & Hyung, S.K., 2001, ‘Recognition Confidence Scoring using Recognition Results from Perturbed Input Feature Vectors’, Electronics Letters, Vol. 37, Issue 18, pp. 1143-1145.
Youssef, A. & Emam, O., 2004, ‘An Arabic TTS based on the IBM Trainable Speech Synthesizer’, Department of Electronics & Communication Engineering, Cairo University, Giza, Egypt.
Chu, W.C., 2003, Speech Coding Algorithms: Foundation and Evolution of Standardized Coders, John Wiley & Sons, Inc., NJ, USA.
APPENDIX A
Signals of the 8 ayat of Surah Al-Fatihah
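Each figure below pairs a time-domain waveform with a spectrogram of one recorded utterance. As a rough illustration of how such plots are produced, the following sketch computes a short-time spectrogram with conventional speech-analysis settings (25 ms Hamming windows, 10 ms hops). It is not the author's original script: the signal here is a synthetic placeholder tone, and the sampling rate and window parameters are assumptions.

```python
import numpy as np
from scipy.signal import spectrogram

fs = 16000                             # assumed sampling rate of the recordings
t = np.arange(0, 3.5, 1 / fs)          # 3.5 s, matching the time axis of the figures
x = 0.5 * np.sin(2 * np.pi * 220 * t)  # placeholder tone standing in for speech

# Short-time spectrogram: 400-sample (25 ms) Hamming windows,
# 160-sample (10 ms) hops, i.e. noverlap = 400 - 160 = 240.
f, seg_t, Sxx = spectrogram(x, fs=fs, window='hamming',
                            nperseg=400, noverlap=240)

print(f"duration {t[-1]:.2f} s, frequency bins up to {f[-1] / 1000:.0f} kHz")
```

With a real recording loaded in place of the placeholder tone (e.g. via `scipy.io.wavfile.read`), plotting `x` against `t` and `10*log10(Sxx)` as an image would reproduce the waveform/spectrogram layout shown in these figures.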
1) Result from 'Bismillah' utterance
[Figure: time-domain waveform (amplitude vs. time, 0-3.5 s) and spectrogram (frequency, 0-6 kHz, vs. time) of the 'Bismillah' utterance; both panels are titled "Speech Sample of Quranic Recitation".]
2) Result from 'fatihah1' utterance
[Figure: time-domain waveform (amplitude vs. time, 0-3.5 s) and spectrogram (frequency, 0-6 kHz, vs. time); both panels are titled "Speech Sample of Quranic Recitation".]
3) Result from 'fatihah2' utterance
[Figure: time-domain waveform (amplitude vs. time, 0-3.5 s) and spectrogram (frequency, 0-6 kHz, vs. time); both panels are titled "Speech Sample of Quranic Recitation".]
4) Result from 'fatihah3' utterance
[Figure: time-domain waveform (amplitude vs. time, 0-3.5 s) and spectrogram (frequency, 0-6 kHz, vs. time); both panels are titled "Speech Sample of Quranic Recitation".]
5) Result from 'fatihah4' utterance
[Figure: time-domain waveform (amplitude vs. time, 0-3.5 s) and spectrogram (frequency, 0-6 kHz, vs. time); both panels are titled "Speech Sample of Quranic Recitation".]
6) Result from ‘fatihah5’ utterance
7) Result from ‘fatihah6’ utterance
[Figures: time-domain waveform (amplitude vs. time, 0-3.5 s) and spectrogram (frequency, 0-6 kHz, vs. time) for each of the ‘fatihah5’ and ‘fatihah6’ utterances; all panels are titled "Speech Sample of Quranic Recitation".]
8) Result from ‘fatihah7’ utterance
[Figure: time-domain waveform (amplitude vs. time, 0-3.5 s) and spectrogram (frequency, 0-6 kHz, vs. time); both panels are titled "Speech Sample of Quranic Recitation".]
APPENDIX B
List of Published Papers and Achievements
Journal
Zaidi Razak, Noor Jamaliah Ibrahim, Mohd Yamani Idna Idris, Emran Mohd Tamil, Mohd
Yakub @ Zulkifli Mohd Yusoff & Noor Naemah Abdul Rahman, 2008, "Quranic Verse
Recitation Recognition Module for Support in j-QAF Learning: A Review", IJCSNS
International Journal of Computer Science and Network Security, Vol. 8, No. 8, August
2008, pp. 207-216. Journal ISSN: 1738-7906.
Proceeding
1. Noor Jamaliah Ibrahim, Zaidi Razak, Mohd Yakub @ Zulkifli Mohd Yusoff, Mohd
Yamani Idna Idris & Emran Mohd Tamil, "Quranic verse Recitation feature
extraction using Mel-Frequency Cepstral Coefficients (MFCC)", In Proceedings of
the 4th IEEE International Colloquium on Signal Processing and its Applications
(CSPA) 2008, 7-9 March 2008, Kuala Lumpur, MALAYSIA.
2. Noor Jamaliah Ibrahim, Mohd.Yakub@Zulkifli Mohd Yusoff & Zaidi Razak, 2008
"Quranic verse Recitation Recognition Module for Educational Programme",
International Seminar on Research in Islamic Studies 2008 @ ISRIS '08, 17-18
December 2008, Kuala Lumpur, MALAYSIA.
Awards
Gold Medal - Mohd Yakub @ Zulkifli Bin Haji Mohd Yusoff, Zaidi Razak, Noor Jamaliah
Binti Ibrahim, Mohd Yamani Idna Idris, Emran Mohd Tamil & Noorzaily Mohamed Noor,
“Effective Learning of Quranic Verse Recitation Using Automated Tajweed Checking
Rules Educational Tools”, 20th International Invention, Innovation and Technology
Exhibition ITEX 2009, Kuala Lumpur, Malaysia, 15-17 May 2009.