SPEECH RECOGNITION SYSTEM
Submitted in partial fulfillment of the requirements for the award of
Degree of
BACHELOR OF TECHNOLOGY
IN
COMPUTER SCIENCE AND ENGINEERING
Submitted By:
ADITYA SHARMA
Roll No: 1005210005
2013-2014
Under the supervision of
Dr. Y N Singh
Associate Professor
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
INSTITUTE OF ENGINEERING AND TECHNOLOGY
LUCKNOW
Certificate
This is to certify that the project entitled Speech Recognition System, submitted by Aditya
Sharma for the award of the degree of Bachelor of Technology in Computer Science and
Engineering, is a record of the bona fide work carried out by him under my guidance and
supervision at the Department of Computer Science and Engineering, Institute of
Engineering and Technology, Lucknow.
This work has not been submitted anywhere else for the award of any other degree.
Dr. Y N Singh
Associate Professor
Dept. of Computer Science and Engineering
IET Lucknow
Acknowledgment
I would like to place on record my deep appreciation and gratitude towards my project supervisor,
Dr. Y N Singh, Associate Professor, Dept. of Computer Science and Engineering, for his invaluable
support and encouragement. I would also like to express my heartfelt thanks to Mr. Radhe Shyam, who provided me with invaluable guidance in finding the various resources required for the project.
I would like to thank Dr. Manish Gaur, Associate Professor, Dept. of Computer Science and
Engineering, for his kindness and graciousness in providing me a disciplined environment for completing my project.
I would also like to thank Prof. Lawrence Rabiner of the University of California, Santa Barbara, whose
paper inspired me to use Hidden Markov Models in my effort to build an efficient and robust speech recognition system.
Last but not least, I would like to thank my parents for their encouragement and support in my
studies.
Aditya Sharma
B. Tech. Final Year
Computer Science and Engineering
IET Lucknow
Abstract
This report takes a brief look at the basic building blocks of a speech recognition system. It describes
the implementation of the different modules required for the construction of a speech recognition
system. The word detector, feature extractor, and HMM training and recognizer modules are
described in detail. The objective of the project is the implementation of a connected word speech
recognition system using Hidden Markov Models. This involves the design of efficient MATLAB
code on a PC. This phase of the project involves the development of a limited-domain recognition
engine spanning a limited vocabulary and a constrained environment only. The system can recognize
single words as well as connected words. The results of the test runs that were conducted are also
provided. The system shows a high accuracy rate of nearly 90-100% when it is executed in a
known environment and with a known user. The report ends with a conclusion and future work.
Table of Contents
1. Introduction
   1.1. An Overview of Speech Recognition
   1.2. Problem Definition and Scope
   1.3. Design of the System
   1.4. History
   1.5. Uses of Speech Recognition System
   1.6. Applications
   1.7. Speech Recognition Weakness and Flaws
   1.8. Related Works
   1.9. The Future of Speech Recognition
   1.10. Overview of the Project Report
2. Types of Speech Recognition Systems
   2.1. Based on Algorithms
      2.1.1. Hidden Markov Model Based
      2.1.2. Dynamic Time Warping Based
      2.1.3. Artificial Neural Network Based
   2.2. Based on Ability to Recognize Words
      2.2.1. Isolated Speech Recognition
      2.2.2. Connected Speech
      2.2.3. Continuous Speech
      2.2.4. Spontaneous Speech
   2.3. Based on Dependency on User
      2.3.1. Speaker Dependent Speech Recognition
      2.3.2. Speaker Independent Speech Recognition
3. Words Detection and Extraction
   3.1. Principle of Word Detection
   3.2. Methodology
   3.3. Performance
4. Feature Extraction
5. Knowledge Models
   5.1. Acoustic Models
      5.1.1. Word Model
      5.1.2. Phone Model
         5.1.2.1. Context Independent Phone Model
         5.1.2.2. Context Dependent Phone Model
   5.2. Language Model
      5.2.1. Classification
6. Hidden Markov Model
   6.1. HMM and Speech Recognition
   6.2. Three Basic Problems of Hidden Markov Models
   6.3. Solution to Problem 1 - Probability Evaluation
      6.3.1. The Forward Algorithm
      6.3.2. The Backward Algorithm
      6.3.3. Scaling the Forward and Backward Variables
   6.4. Solution to Problem 2 - Optimal State Sequence
      6.4.1. The Viterbi Algorithm
      6.4.2. The Alternative Viterbi Algorithm
   6.5. Solution to Problem 3 - Parameter Estimation
      6.5.1. Initial Estimates of HMM Parameters
7. Implementation
   7.1. Software Modules
   7.2. Working of the System
8. Results and Conclusion
   8.1. Training
   8.2. Recognition
   8.3. Conclusion
   8.4. Future Work
Reading and References
Appendix A - Source Code in MATLAB
List of Figures
1. Fig 1.1 Block Diagram of Speech Recognition System
2. Fig 3.1 Speech Sample
3. Fig 3.2 Energy Plot of the Sample
4. Fig 3.3 Zero Crossing Rate of the Sample
5. Fig 3.4 Detected Word from the Sample
6. Fig 4.1 Steps in Feature Extraction
7. Fig 4.2 Mel Scale Filter Bank
8. Fig 5.1 Word Based Acoustic Model
9. Fig 5.2 Phone Based Acoustic Model
10. Fig 6.1 Diagrammatic Representation of HMM
11. Fig 6.2 Forward Variable Calculation
12. Fig 6.3 Backward Procedure - Induction Step
13. Fig 6.4 Baum-Welch Method
14. Fig 6.5 Left-Right Model of HMM
15. Fig 7.1 Block Diagram of Working of the System
16. Fig 7.2 The Main Interface
17. Fig 7.3 Training
18. Fig 7.4 Recognition
Chapter One
Introduction
1.1 An overview of Speech Recognition
Speech recognition refers to the ability to listen to spoken words (input in audio format),
identify the various sounds present in them, and recognize them as words of some known language.
Speech recognition in the computer systems domain may then be defined as the ability of
computer systems to accept spoken words in an audio format - such as WAV or RAW - and then
generate the content in text format.
Speech recognition on computers involves various steps, each with its own issues.
The steps required to make computers perform speech recognition are: voice
recording, word boundary detection, feature extraction, and recognition with the help of
knowledge models.
Word boundary detection is the process of identifying the start and the end of a spoken
word in a given sound signal[8]. While analyzing the sound signal, it sometimes becomes
difficult to identify the word boundary. This can be attributed to the various accents people
have, such as the duration of the pause they leave between words while speaking.
Feature extraction refers to the process of converting the sound signal into a form suitable for
the following stages to use. Feature extraction may include extracting parameters such as the
amplitude of the signal, the energy of frequencies, etc.
Recognition involves mapping the given input (in the form of various features) to one of the
known sounds. This may involve the use of various knowledge models for precise
identification and ambiguity removal.
Knowledge models refer to models such as the phone acoustic model, language models, etc.,
which help the recognition system. To generate the knowledge models one needs to train
the system. During the training period one shows the system a set of inputs and
the outputs they should map to. This is often called supervised learning.
1.2 Problem Definition and Scope
The aim of this project is to build a speech recognition system for the English language. This
is a connected word speech recognition system. The system receives speech input
consisting of a series of connected words and outputs the text sequence corresponding
to it. The system uses a continuous Hidden Markov Model for acoustic modeling of the
speech.
Scope: This project has speech recognition capabilities. It is designed to work in noise-constrained
as well as user-constrained environments. The software can also recognize a word and
convert it into text, which can be linked with an action such as the execution of a command,
opening a program, sending mail, etc. It can recognize single words as well as sequences of
connected words separated by a small pause.
1.3 Design of the system
An abstract overview of the system is visualized in the block diagram below (Fig 1.1: Block
Diagram of Speech Recognition System). The various components are briefly described below:
Word Detector and Extractor: This component is responsible for taking the
input from the microphone or from a prerecorded WAV file and detecting the
words present in it. In this way the speech signal is segmented into the
individual words it contains. These individual words can then be processed
independently. The word boundary is detected by measuring the energy and the
zero crossing rate of the signal.
Feature Extractor: This component generates feature vectors from the given
signal. It generates Mel Frequency Cepstrum Coefficients (MFCCs) and
normalized energy as the features used to uniquely identify the given
sound signal[9].
Recognizer Module: This is a continuous Hidden Markov Model[7] based
component. It is the most important part of the system which performs the actual
recognition of the speech signal by finding the best match in the knowledge
base.
Acoustic and Knowledge Model: These components define the structure of the
sound and how it will be represented in the computer. They are used to
map and model the acoustic characteristics of a given sound signal.
1.4 History
The concept of speech recognition started sometime in the 1940s; the first practical speech
recognition program appeared in 1952 at Bell Labs and dealt with the recognition of
digits in a noise-free environment.
The 1940s and 1950s are considered the foundational period of speech recognition
technology; in this period work was done on the foundational paradigms of
speech recognition, namely automation and information-theoretic models.
In the 1960s it became possible to recognize small vocabularies (on the order of 10-100
words) of isolated words, based on simple acoustic-phonetic properties of
speech sounds. The key technologies developed during this decade
were filter banks and time normalization methods.
In the 1970s medium vocabularies (on the order of 100-1000 words) were recognized using
simple template-based pattern recognition methods.
In the 1980s large vocabularies (1000 words to unlimited) came into use, and speech recognition
problems based on statistical methods, with a large range of networks for handling
language structures, were addressed. The key inventions of this era were the Hidden
Markov Model (HMM) and the stochastic language model, which together
enabled powerful new methods for handling the continuous speech recognition
problem efficiently and with high performance.
In the 1990s the key technologies developed were the methods
for stochastic language understanding, statistical learning of acoustic and
language models, and the methods for implementing large vocabulary
speech understanding systems.
After five decades of research, speech recognition technology has finally
entered the marketplace, benefiting users in a variety of ways. The challenge of
designing a machine that truly functions like an intelligent human is still a major
one going forward.
1.5 Uses of Speech Recognition System
Speech recognition is basically used for two main purposes. The first and foremost is dictation,
which in the context of speech recognition is the translation of spoken words into text; the
second is controlling the computer, that is, developing software capable
of allowing a user to operate different applications by voice.
Writing by voice lets a person write 150 words per minute or more, if he/she can
speak that quickly. This aspect of speech recognition programs creates an easy
way of composing text and helps people in that industry compose millions of words
digitally in a short time rather than typing them one by one; this way they can save
time and effort.
Speech recognition is an alternative to the keyboard. If you are unable to type or just don't
want to, speech recognition programs help you do almost anything you
used to do with a keyboard.
1.6 Applications
1.6.1 From a medical perspective: People with disabilities can benefit from
speech recognition programs. Speech recognition is especially useful for people
who have difficulty using their hands; in such cases speech recognition
programs are very beneficial and can be used to operate computers. Speech
recognition is also used in deaf telephony, such as voicemail-to-text.
1.6.2 From a military perspective: Speech recognition programs are important from a military perspective; in the Air Force, speech recognition has definite potential
for reducing pilot workload. Besides the Air Force, such programs can also be
trained for use in helicopters, battle management, and other applications.
1.6.3 From an educational perspective: Individuals with learning disabilities who have problems with thought-to-paper communication (essentially they think of
an idea but it is processed incorrectly, causing it to end up differently on paper)
can benefit from the software.
1.7 Speech Recognition weakness and flaws
Despite all these advantages and benefits, a hundred percent perfect speech recognition
system has yet to be developed. There are a number of factors that can reduce the accuracy
and performance of a speech recognition program. The speech recognition process is easy for
a human but a difficult task for a machine. Compared with the human mind, speech
recognition programs seem less intelligent. This is due to the fact that for a human the
capability of thinking, understanding, and reacting is natural, while
for a computer program it is a complicated task: first it needs to understand the spoken words
with respect to their meanings, and it has to strike a sufficient balance between the words,
noise, and spaces. A human has a built-in capability of filtering the noise from speech,
while a machine requires training; a computer requires help in separating the speech sound
from the other sounds.
A few notable factors in this regard are:
Homonyms: These are words that are spelled differently and have different meanings but sound the same, for example "there" and "their", "be" and "bee". It is a
challenge for a computer to distinguish between such types of phrases that sound
alike.
Overlapping speech: A second challenge in the process is to understand speech uttered by different users; current systems have difficulty separating simultaneous
speech from multiple users.
Noise factor: The program needs to hear the words uttered by a human distinctly and clearly. Any extra sound can create interference: first you need to place the system away from
noisy environments, and then speak clearly, or else the machine will get confused and mix up
the words.
1.8 Related Works
A lot of speech-aware applications are already on the market. Various dictation
software packages have been developed by Dragon[10], IBM, and Philips. Genie is an interactive
speech recognition software developed by Microsoft. Various voice navigation
applications, one developed by AT&T, allow users to control their computer by voice, for example
browsing the Internet by voice. Many more applications of this kind are appearing every
day.
The SPHINX speech recognizer from CMU[11] provides the acoustic as well as the language
models used for recognition. It is based on Hidden Markov Models (HMMs). The
SONIC recognizer[12], developed by the University of Colorado, is another one. There
are other recognizers, such as XVoice[13] for Linux, that take input from IBM's ViaVoice,
which now exists only for Windows. Background noise is the worst part of a speech
recognition process. It confuses the recognizer and makes it unable to hear what it is
supposed to. One such recognizer has been devised for robots which, despite the inevitable
motor noise, lets them communicate with people efficiently. This is made possible by
using a noise-type-dependent acoustic model corresponding to a performing motion of the
robot. Optimizations for speech recognition on an HP SmartBadge IV embedded system have
been proposed to reduce energy consumption while still maintaining the quality of the
application.
Another such scalable system has been proposed for DSR (Distributed Speech
Recognition) by combining it with scalable compression, hence reducing the
computational load as well as the bandwidth requirement on the server. Various capabilities
of current speech recognizers in the field of telecommunications have been described, such as Voice
Banking and Directory Assistance.
1.9 The future of speech recognition
Accuracy will become better and better.
Dictation speech recognition will gradually become accepted.
Greater use will be made of intelligent systems which will attempt to guess
what the speaker intended to say, rather than what was actually said, as people
often misspeak and make unintentional mistakes.
Microphones and sound systems will be designed to adapt more quickly to
changing background noise levels and different environments, with better
recognition of extraneous material to be discarded.
1.10 Overview of the Project Report
The contents of the chapters are as follows:
Chapter 2 discusses the types of speech recognition systems based on algorithms,
speaker dependency, and their ability to recognize words. It also compares
the different types of algorithms based on their application and reliability.
Chapter 3 explains how the words are detected and extracted from the speech
samples. It discusses the algorithms and techniques which are required to
implement a word boundary detector.
Chapter 4 explains how the features are extracted from the given speech
samples. It explains how the signal can be segmented into overlapping frames
and each frame transformed into a set of multi-dimensional vectors.
Chapter 5 explains what an acoustic model is and how it is represented using an
HMM. It also gives a brief overview of language models.
Chapter 6 explains the Hidden Markov Model in detail and how it can be used
in speech recognition. It explains in detail how a continuous HMM can
be implemented for the recognition of the words in a speech signal.
Chapter 7 explains how the project was implemented in MATLAB. It gives brief
information about the modules used in the software.
Chapter 8 concludes the report with a summary of the work done, the
next proposed steps, and the further work which can be done.
An appendix is provided at the end which contains the detailed source code of the
project in MATLAB.
Chapter Two
Types of Speech Recognition System
Speech recognition systems can be classified on various bases such as the algorithm used, the ability to
recognize words and the list of words they have, dependency on the user, etc. Some of the classifications are
explained below:
2.1 Based on Algorithms: There are mainly three popular algorithms to perform speech recognition, namely Hidden Markov Model (HMM) based, Dynamic Time
Warping (DTW) based, and Artificial Neural Network based.
2.1.1 Hidden Markov Model based: Modern general-purpose speech recognition systems are based on Hidden Markov Models[1][2][3][4][5][6]. These are statistical models that
output a sequence of symbols or quantities. HMMs are used in speech recognition because
a speech signal can be viewed as a piecewise stationary signal or a short-time stationary
signal. In a short time-scale (e.g., 10 milliseconds), speech can be approximated as a
stationary process. Speech can be thought of as a Markov model for many stochastic
purposes. Another reason why HMMs are popular is because they can be trained
automatically and are simple and computationally feasible to use.
2.1.2 Dynamic Time Warping based: Dynamic time warping[2] is an approach that was historically used for speech recognition but has now largely been displaced by the more
successful HMM-based approach. Dynamic time warping is an algorithm for measuring
similarity between two sequences that may vary in time or speed. For instance, similarities
in walking patterns would be detected, even if in one video the person was walking slowly
and if in another he or she were walking more quickly, or even if there were accelerations
and decelerations during the course of one observation. DTW has been applied to video,
audio, and graphics; indeed, any data that can be turned into a linear representation can be
analyzed with DTW.
2.1.3 Artificial Neural Network based: Neural networks emerged as an attractive acoustic modeling approach in ASR in the late 1980s. Since then, neural networks have
been used in many aspects of speech recognition such as phoneme classification, isolated
word recognition, and speaker adaptation. In contrast to HMMs, neural networks make no
assumptions about feature statistical properties and have several qualities making them
attractive recognition models for speech recognition. When used to estimate the
probabilities of a speech feature segment, neural networks allow discriminative training in
a natural and efficient manner. Few assumptions on the statistics of input features are made
with neural networks. However, in spite of their effectiveness in classifying short-time units
such as individual phones and isolated words, neural networks are rarely successful for
continuous recognition tasks, largely because of their lack of ability to model temporal
dependencies. Thus, one alternative approach is to use neural networks as a pre-processing step,
e.g. feature transformation or dimensionality reduction, for HMM-based recognition.
2.2 Based on ability to recognize words: Speech recognition systems can be divided into a number of classes based on their ability to recognize words and the list of
words they have. A few classes are as follows:
2.2.1 Isolated Speech Recognition: Isolated words usually involve a pause between two utterances; this doesn't mean that the system accepts only a single word, but rather that it requires one
utterance at a time.
2.2.2 Connected Speech: Connected words or connected speech is similar to isolated
speech, but allows separate utterances with a minimal pause between them.
2.2.3 Continuous Speech: Continuous speech allows the user to speak almost naturally; this is also called computer dictation.
2.2.4 Spontaneous Speech: At a basic level, spontaneous speech can be thought of as speech that is natural sounding and not rehearsed. An ASR system with spontaneous speech ability should
be able to handle a variety of natural speech features such as words being run together,
"ums" and "ahs", and even slight stutters.
2.3 Based on dependency on user: Based on the dependency on the user's voice, speech recognition systems can be classified as:
2.3.1 Speaker dependent speech recognition: Speaker-dependent systems work by learning the unique characteristics of a single person's voice, in a way similar to voice
recognition. New users must first "train" the software by speaking to it, so the computer
can analyze how the person talks. This often means users have to read a few pages of text
to the computer before they can use the speech recognition software.
2.3.2 Speaker Independent speech recognition: Speaker-independent software is designed to recognize anyone's voice, so no training is involved. This means it is the only
real option for applications such as interactive voice response systems, where businesses
can't ask callers to read pages of text before using the system. The downside is that speaker-
independent software is generally less accurate than speaker-dependent software.
Chapter Three
Words Detection and Extraction
This component's responsibility is to accept input from a microphone and forward it to the feature
extraction module. Before converting the signal into a suitable or desired form, it also performs the
important task of identifying the segments of the sound containing words. It also has a provision
for saving the sound into WAV files, which are needed by the training component. The microphone
is configured to receive the input signal at a sampling rate of 8000 samples per second with 16 bits
per sample on a mono channel.
3.1 Principle of Word Detection
In speech recognition it is important to detect when a word is spoken. The system
detects the regions of silence; anything other than silence is considered a spoken word
by the system. The system uses the energy pattern present in the sound signal and the zero crossing
rate to detect the silent regions. Using both is important, as energy alone tends to
miss some parts of sounds which are important. This process is also called Voice Activity
Detection.
3.2 Methodology
For word detection, a sample is broken into frames taken every 10
milliseconds. Consecutive segments are separated by an overlapping distance which is
nearly 50% of the length of the frame. Energy and zero crossings for each frame are
calculated. Energy is calculated by adding the square of the value of the waveform at each
instant and then dividing by the number of instants over the period of the sample. The zero
crossing rate is the number of times the value of the wave goes from negative
to positive or vice versa.
The Word Detector assumes that the first 100 milliseconds are silence. It uses the average energy
and average zero crossing rate obtained during this time to characterize the background
noise. The upper thresholds for energy and zero crossings are set to 2 times the average values of the
background noise. The lower thresholds are set to 0.75 times the upper thresholds.
While detecting the presence of a word in the sound, if the energy or zero crossing rate goes above
the upper threshold and stays above it for three consecutive samples, a word is assumed to be
present and the recording is started. The recording continues until the energy and zero
crossing rate both fall below the lower thresholds and stay there for at least 30 milliseconds[8].
Fig 3.1 Speech Sample
Fig 3.2 Energy plot of the samples
Fig 3.3 Zero Crossing Rate Plot of the samples
Fig 3.4 Detected Word from the sample
By applying the same algorithm after the extraction of a word, all the words present in the
speech sample can be extracted in this way, provided they are separated by a small pause.
The set of words extracted from the signal is then fed into the feature extractor, where each
signal is converted into a set of feature vectors. This word detection algorithm is only
applicable in the case when the speech signal is constituted of connected words where there
is a pause between consecutive words.
In the case of continuous and spontaneous speech it is not possible to apply the energy and
zero crossing rate techniques, because the word boundaries in continuous speech
are not visible due to fusion.
3.3 Performance
The word detector was developed in MATLAB and tested. The word detector showed
good performance, and it was able to extract all the words from the signal provided there
was a half-second gap between two consecutive words.
Some words may themselves be constituted of sub-words separated by some distance. This problem
can be solved by choosing a threshold distance and merging frames when the
distance between them is less than the threshold. In the experiments performed, the threshold
distance was measured to be 100 frames.
Chapter Four
Feature Extraction
Humans have the capacity to identify different types of sounds (phones). Phones put in a
particular order constitute a word. If we want a machine to identify the spoken word, it will have
to differentiate between different kinds of sounds the way humans perceive them. The point to be
noted in the case of humans is that although one word spoken by different people produces different
sound waves, humans are able to identify the sound waves as the same. On the other hand, two sounds
which are different are perceived as different by humans. The reason is that even when the same phones
or sounds are produced by different speakers, they have common features. A good feature extractor
should extract these features and use them for further analysis and processing. So feature extraction
is mainly the extraction of relevant information from the speech blocks. A variety of choices for this
task can be applied. The most commonly used methods for speech recognition are linear prediction
and Mel-cepstrum coefficient calculation[9]. These measures are widely used, and here are some
reasons why:
These measures provide a good model of the speech signal. This is particularly true in
the quasi-steady state of voiced regions of speech.
The way these measures are calculated leads to a reasonable source-vocal tract separation.
This property leads to a fairly good representation of vocal tract characteristics.
The measures have analytically tractable models.
Experience has shown that these measures work well in speech recognition applications.
Other measures to add to the feature vectors are the energy measures and also the
delta and acceleration coefficients. The delta coefficients are a derivative approximation of
some measures (e.g. MFCC coefficients), and the acceleration coefficients are the second
derivative approximation of some measures.
The steps of feature extraction are shown in Fig 4.1 (Steps in Feature Extraction)[2]:
Windowing: Windowing is the process in which the speech signal is segmented into overlapping frames. Each segment is called a frame. The length of each frame is 0.01
seconds. The overlap between two frames is kept at up to 50%, i.e. 0.005 seconds. There are
various types of windows which can be used: the rectangular window, the Bartlett window, and the Hamming window.
The system developed uses the Hamming window, as it introduces the least amount of distortion.
The impulse response of the Hamming window is a raised cosine impulse. The transfer function of the Hamming window is:

w(n) = 0.54 + 0.46 cos(πn/M), −M ≤ n ≤ M
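As a small illustration, the framing-plus-windowing step can be written in MATLAB as below; the signal x and the 8000 Hz sampling rate follow Chapter 3, and the variable names are illustrative.

```matlab
% Sketch: split x into 10 ms frames with 50% overlap and apply a
% Hamming window (the raised cosine above) to each frame.
fs       = 8000;
frameLen = round(0.010 * fs);                 % 80 samples per frame
hop      = frameLen / 2;                      % 0.005 s step
win      = hamming(frameLen);
nFrames  = floor((length(x) - frameLen) / hop) + 1;
frames   = zeros(frameLen, nFrames);
for k = 1:nFrames
    seg          = x((k-1)*hop + (1:frameLen));
    frames(:, k) = seg(:) .* win;             % one windowed frame per column
end
```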
Mel Scale: After the framing of the signal, an N-point DFT is calculated to analyze its spectrum. The
frequencies are mapped onto the Mel scale, which is given by

Mel(f) = 2595 log10(1 + f/700)
Fig 4.2 Mel Scale Filter bank
The frequencies are picked according to a logarithmic scale, as this matches the human
auditory system. Then the amplitude values corresponding to the Mel frequencies are
calculated and their logarithms are taken. This sequence of log values is treated as a
signal, and an inverse discrete cosine transform is performed on it. The resulting
coefficients are called the cepstrum.
Lifter: The lifter is used to zero out (or cut away) some of the last Mel cepstrum coefficients. After
this the final Mel cepstrum values are found. This is done to remove or discard the unwanted
Mel coefficients.
Energy Measures: An extra measure used to augment the coefficients derived from the Mel cepstrum is the log of
the signal energy. This means that for every frame an extra energy term is added.
Delta and Acceleration Coefficients: Spectral transitions are believed to play an important role in human speech perception.
Therefore it is desirable to add information about the time differences, the delta coefficients, and
also the acceleration coefficients.
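One simple way to append these dynamic terms in MATLAB is sketched below; a plain first difference stands in for the regression formula (which the text does not specify). With 13 static coefficients per frame this yields the 39-dimensional vectors mentioned in Chapter 7.

```matlab
% Sketch: append delta and acceleration coefficients.
% C is a (13 x nFrames) matrix of static coefficients (MFCCs + energy).
delta = [zeros(size(C,1),1), diff(C,     1, 2)];  % first-order differences
accel = [zeros(size(C,1),1), diff(delta, 1, 2)];  % second-order differences
features = [C; delta; accel];                     % 39 x nFrames
```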
Chapter Five
Knowledge Models
For speech recognition, the system needs to know how the words sound. For this we need to train
the system. During training, using the data given by the user, the system generates an acoustic
model and a language model. These models are later used by the system to map a sound to a word
or a phrase.
5.1 Acoustic Model
The features extracted by the feature extraction module need to be compared against
a model to identify the sound that was produced as the word that was spoken. This model
is called the Acoustic Model.
There are two kinds of Acoustic Models:
1. Word Model
2. Phone Model
5.1.1 Word Model
Word models are generally used for small vocabulary systems. In this model the
words are modelled as a whole. Thus each word needs to be modelled separately. If
we need to add support for recognizing a new word, we will have to train the system
for that word. In the recognition process, the sound is matched against each of the
models to find the best match. This best match is assumed to be the spoken word.
Building a model for a word requires us to collect sound files of the word from
various users. These sound files are then used to train an HMM model. Figure 5.1
shows a diagrammatic representation of a word based acoustic model.
Fig 5.1 Word Based Acoustic Model
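To illustrate the matching step, the following is a minimal MATLAB sketch of word-model recognition; models and hmmLogLik are hypothetical names, the latter standing for a likelihood scorer such as the scaled forward algorithm of Chapter 6.

```matlab
% Sketch: score the feature vectors against every trained word HMM and
% pick the model with the highest log likelihood.
scores = zeros(1, numel(models));
for w = 1:numel(models)
    scores(w) = hmmLogLik(models(w), features);   % log P(O | lambda_w)
end
[~, best]      = max(scores);
recognizedWord = models(best).name;               % best match = spoken word
```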
5.1.2 Phone Model
In the phone model, instead of modelling the whole word, we model only parts of words,
generally phones. The word itself is modelled as a sequence of phones. The heard
sound is now matched against the parts and the parts are recognized. The recognized
parts are put together to form a word. For example, the word "ek" is generated by a
combination of two phones, "A" and "k". This is generally useful when we need a large
vocabulary system. Adding a new word to the vocabulary is easy: as the sounds of the
phones are already known, only the possible phone sequence for the word, with its
probability, needs to be added to the system. Figure 5.2 shows a diagrammatic
representation of a phone based acoustic model.
Fig 5.2 Phone based Acoustic Model
Phone models can be further classified into:
1. Context-Independent Phone Model
2. Context-Dependent Phone Model
5.1.2.1 Context-Independent Phone Model
In this model individual phones are modelled. The context in which they occur
is not modelled. The good thing about this model is that the number of
phones that have to be modelled is small. Thus the complexity of the
system is low.
5.1.2.2 Context-Dependent Phone Model
While modelling a phone, its neighbors are also considered. That means "iy"
surrounded by "z" and "r" is a separate entity compared to "iy" surrounded by
"h" and "r". This results in a growth of the number of modelled phones, which
increases the complexity.
In both the word acoustic model and the phone acoustic model we need to model
silence and filler words too. Filler words are the sounds that humans produce
between two words.
Both these models can be implemented using either a Hidden Markov Model
or a Neural Network. The HMM is the more widely used technique in automatic
speech recognition systems.
5.2 Language Model
Although there are words that have similar sounding phones, humans generally do not find
it difficult to recognize the words. This is mainly because they know the context, and also
have a fairly good idea about what words or phrases can occur in the context. Providing
this context to a speech recognition system is the purpose of the language model. The language
model specifies the valid words in the language and the sequences in which they can
occur.
5.2.1 Classification
Language models are classified into several categories:
Uniform Models: Each word has equal probability of occurrence.
Stochastic Model: The probability of occurrence of a word depends on the words preceding it.
Finite State Language: The language uses a finite state network to define the allowed word sequences.
Context Free Grammar: A context free grammar can be used to encode what kinds of sentences are allowed.
Chapter Six
Hidden Markov Model
The Hidden Markov Model (HMM)[1][2][3][4][5][6] is a state machine. The states of the model are
represented as nodes and the transitions are represented as edges. The difference in the case of an HMM
is that the symbol does not uniquely identify a state. The new state is determined by the symbol
and the transition probabilities from the current state to a candidate state.
Fig 6.1 Diagrammatic Representation of HMM
The above figure shows a diagrammatic representation of an HMM. The nodes denoted as circles are states.
O1 to O5 are observations. Observation O1 takes us to state S1. a_ij defines the transition
probability between S_i and S_j. It can be observed that the states also have self-transitions. If we are
in state S1 and observation O2 is observed, we can either decide to go to state S2 or stay in state
S1. The decision is made depending on the probability of the observation at both states and the
transition probability.
Thus the HMM model is defined as:

λ = (Q, O, A, B, π)

where
Q = {q_i} (all possible states)
O = {v_i} (all possible observations)
A = {a_ij}, where a_ij = P(X_{t+1} = q_j | X_t = q_i) (transition probabilities)
B = {b_i}, where b_i(k) = P(O_t = v_k | X_t = q_i) (probability of observation k at state i)
π = {π_i}, where π_i = P(X_0 = q_i) (initial state probabilities)
X_t denotes the state at time t.
O_t denotes the observation at time t.
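For concreteness, a toy three-state discrete HMM written out in this notation looks as follows in MATLAB; the numerical values are purely illustrative.

```matlab
% A toy HMM lambda = (A, B, pi): 3 states, 2 observation symbols.
A   = [0.6 0.4 0.0;        % a_ij = P(X_{t+1} = q_j | X_t = q_i)
       0.0 0.7 0.3;
       0.0 0.0 1.0];       % zeros below the diagonal: a left-right model
B   = [0.8 0.2;            % b_i(k) = P(O_t = v_k | X_t = q_i)
       0.5 0.5;
       0.1 0.9];
pi0 = [1; 0; 0];           % pi_i = P(X_0 = q_i): always start in state 1
```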
6.1 HMM and Speech Recognition
HMM can be classified upon various criteria:
1. Value of Occurrences
i) Discrete
ii) Continuous
2. Dimension
i) One Dimensional
ii) Multi-Dimensional
3. Probability Density Function
i) Continuous Density (Gaussian Distribution based)
ii) Discrete Density (Vector quantization based)
While using an HMM for recognition, we provide the occurrences to the model and it returns
a number. This number is the probability with which the model could have produced the
output (occurrences). In speech recognition the occurrences are feature vectors rather than just
symbols; each occurrence is a feature vector, i.e. a group of real numbers. Thus, what
we need for speech recognition is a continuous, multi-dimensional HMM.
6.2 Three Basic Problems of Hidden Markov Model
Given the basics of an HMM from the previous section, three basic problems arise when
applying the model to a speech recognition task:

Problem 1
Given the observation sequence O = (o_1, o_2, ..., o_T) and the model λ = (A, B, π), how is
the probability of the observation sequence, given the model, computed? That is, how is
P(O|λ) computed efficiently?

Problem 2
Given the observation sequence O = (o_1, o_2, ..., o_T) and the model λ = (A, B, π), how is a
corresponding state sequence, q = (q_1, q_2, ..., q_T), chosen to be optimal in some sense (i.e.
to best explain the observations)?

Problem 3
How are the probability measures, λ = (A, B, π), adjusted to maximize P(O|λ)?
The first problem can be seen as the recognition problem: given some trained models, each
of which represents a word, which model is the most likely if an observation sequence is given? In the
second problem an attempt is made to uncover the hidden part of the model. It should be
clear that, for all except the case of degenerate models, there is no single "correct" state sequence
to be found. It is therefore a problem to be solved as well as possible under some optimality criterion.
The third problem can be seen as the training problem: given the training sequence,
create a model for each word. The training problem is the crucial one for most applications
of HMMs, because it optimally adapts the model parameters to the observed training data,
i.e. it creates the best models for real phenomena.
6.3 Solution to Problem 1 - Probability Evaluation

The aim of this problem is to find the probability of the observation sequence,
O = (o_1, o_2, ..., o_T), given the model λ, i.e. P(O|λ). Because the observations produced
by states are assumed to be independent of each other and of the time t, the probability of
the observation sequence O = (o_1, o_2, ..., o_T) being generated by a certain state sequence q
can be calculated as a product:

P(O|q, λ) = b_{q_1}(o_1) b_{q_2}(o_2) ··· b_{q_T}(o_T)

And the probability of the state sequence q can be found as

P(q|λ) = π_{q_1} a_{q_1 q_2} a_{q_2 q_3} ··· a_{q_{T−1} q_T}

The joint probability of O and q, i.e. the probability that O and q occur simultaneously, is
simply the product of the above two terms, i.e.:

P(O, q|λ) = P(O|q, λ) P(q|λ)
The aim was to find P(O|λ), and this probability of O (given the model λ) is obtained by
summing the joint probability over all possible state sequences q, giving:

P(O|λ) = Σ_q P(O|q, λ) P(q|λ) = Σ_{q_1, ..., q_T} π_{q_1} b_{q_1}(o_1) a_{q_1 q_2} b_{q_2}(o_2) ··· a_{q_{T−1} q_T} b_{q_T}(o_T)
The interpretation of the above computation is the following. Initially, at time t = 1, the
process starts by jumping to state q_1 with probability π_{q_1} and generates the observation
symbol o_1 with probability b_{q_1}(o_1). The clock changes from t to t+1 and a transition from
q_1 to q_2 occurs with probability a_{q_1 q_2}, and the symbol o_2 is generated with
probability b_{q_2}(o_2). The process continues in this manner until the last transition is made (at
time T), i.e. a transition from q_{T−1} to q_T occurs with probability a_{q_{T−1} q_T}, and the symbol
o_T is generated with probability b_{q_T}(o_T).
This direct computation has one major drawback: it is infeasible due to the exponential
growth of computations as a function of the sequence length T. To be precise, it needs (2T − 1)N^T
multiplications and N^T − 1 additions [1]. Even for small values of N and T, e.g. for
N = 5 (states) and T = 100 (observations), there is a need for (2·100 − 1)·5^100 ≈ 1.6·10^72
multiplications and 5^100 − 1 ≈ 8.0·10^69 additions! Clearly a more efficient procedure is
required to solve this problem. An excellent tool which cuts the computational requirements
to linear, relative to T, is the well-known forward algorithm.
6.3.1 The Forward Algorithm
Consider a forward variable α_t(i), defined as:

α_t(i) = P(o_1 o_2 ... o_t, X_t = q_i | λ)

where t represents time and i is the state. This means that α_t(i) is the probability
of the partial observation sequence, o_1 o_2 ... o_t (until time t), when being in state i at
time t. The forward variable can be calculated inductively, see Fig 6.2.

Fig 6.2 Forward Variable Calculation

α_{t+1}(j) is found by summing the forward variables for all N states at time t, multiplied
by their corresponding state transition probabilities, a_ij, and by the emission
probability b_j(o_{t+1}). This can be done with the following procedure:
1. Initialization:
   α_1(i) = π_i b_i(o_1), 1 ≤ i ≤ N
2. Induction:
   α_{t+1}(j) = [ Σ_{i=1}^{N} α_t(i) a_ij ] b_j(o_{t+1}), 1 ≤ j ≤ N
3. Update time: Set t = t + 1;
   Return to step 2 if t < T;
   Otherwise, go to step 4.
4. Termination:
   P(O|λ) = Σ_{i=1}^{N} α_T(i)

6.3.2 The Backward Algorithm

In a similar manner, a backward variable β_t(i) can be defined as:

β_t(i) = P(o_{t+1} o_{t+2} ... o_T | X_t = q_i, λ)

That is, β_t(i) is the probability of the partial observation sequence from t+1 to the end,
given state q_i at time t and the model λ. Note that this is a conditional
probability. In a similar manner to the forward algorithm, the
backward variable can be calculated inductively, see Fig. 6.3.
Fig 6.3 Backward Procedure Induction Step
The backward algorithm includes the following steps:
1. Initialization:
   β_T(i) = 1, 1 ≤ i ≤ N
2. Induction:
   β_t(i) = Σ_{j=1}^{N} a_ij b_j(o_{t+1}) β_{t+1}(j), 1 ≤ i ≤ N
3. Update time: Set t = t − 1;
   Return to step 2 if t > 0;
   Otherwise, terminate the algorithm.

Note that the initialization step 1 arbitrarily defines β_T(i) to be 1 for all i.
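The two recursions translate almost line for line into MATLAB. Below is a minimal, unscaled sketch for the discrete case, using the toy matrices A, B, pi0 defined earlier; obs is a vector of observation-symbol indices. For long sequences it underflows, which is exactly the problem Section 6.3.3 below addresses.

```matlab
% Sketch: unscaled forward and backward recursions for a discrete HMM.
function [alpha, beta, lik] = forwardBackward(A, B, pi0, obs)
    N = size(A, 1);  T = numel(obs);
    alpha = zeros(N, T);  beta = zeros(N, T);
    alpha(:,1) = pi0 .* B(:, obs(1));                 % initialization
    for t = 1:T-1                                     % forward induction
        alpha(:,t+1) = (A.' * alpha(:,t)) .* B(:, obs(t+1));
    end
    lik = sum(alpha(:,T));                            % termination: P(O|lambda)
    beta(:,T) = 1;                                    % backward initialization
    for t = T-1:-1:1                                  % backward induction
        beta(:,t) = A * (B(:, obs(t+1)) .* beta(:,t+1));
    end
end
```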
6.3.3 Scaling the Forward and Backward Variables
The calculation of α_t(i) and β_t(i) involves multiplication with probabilities. All these
probabilities have a value less than 1 (generally significantly less than 1), and as t
starts to grow large, each term of α_t(i) or β_t(i) starts to head exponentially to zero.
For sufficiently large t (e.g. 100 or more) the dynamic range of the α_t(i) and β_t(i)
computation will exceed the precision range of essentially any machine (even in
double precision). The basic scaling procedure multiplies α_t(i) by a scaling
coefficient that is dependent only on the time t and independent of the state i. The
scaling factor for the forward variable is denoted c_t (scaling is done at every time t for
all states i, 1 ≤ i ≤ N). This factor will also be used for scaling the backward
variable, β_t(i). Scaling α_t(i) and β_t(i) with the same scale factor will prove useful in
problem 3 (parameter estimation).
Consider the computation of the forward variable, α_t(i). In the scaled variant of the
forward algorithm some extra notation will be used: α_t(i) denotes the unscaled
forward variable, α̂_t(i) denotes the scaled and iterated variant of α_t(i), ᾱ_t(i) denotes
the local version of α_t(i) before scaling, and c_t represents the scaling coefficient
at each time. Here follows the scaled forward algorithm:
1. Initialization:
   ᾱ_1(i) = π_i b_i(o_1); c_1 = 1 / Σ_{i=1}^{N} ᾱ_1(i); α̂_1(i) = c_1 ᾱ_1(i)
2. Induction:
   ᾱ_t(i) = [ Σ_{j=1}^{N} α̂_{t−1}(j) a_ji ] b_i(o_t); c_t = 1 / Σ_{i=1}^{N} ᾱ_t(i); α̂_t(i) = c_t ᾱ_t(i)
3. Update time: Set t = t + 1;
   Return to step 2 if t ≤ T;
   Otherwise, go to step 4.
4. Termination:
   log P(O|λ) = − Σ_{t=1}^{T} log c_t
The ordinary induction step can be written as

α_t(i) = [ Σ_{j=1}^{N} α_{t−1}(j) a_ji ] b_i(o_t)

Now it is possible to write

α̂_t(i) = α_t(i) / Σ_{j=1}^{N} α_t(j)

As the above equation shows, α_t(i) is scaled by the sum over all states of α_t(i) when
the scaled forward algorithm is applied.
The termination (step 4) of the scaled forward algorithm, the evaluation of P(O|λ), must
be done in a different way. This is because the sum of the α̂_T(i) cannot be used, since
α̂_T(i) is already scaled. However, the following property can be used:
Π_{t=1}^{T} c_t · Σ_{i=1}^{N} α_T(i) = 1, i.e. P(O|λ) = 1 / Π_{t=1}^{T} c_t

As the above equation shows, P(O|λ) can be found, but the problem is that if it is used
the result will still be very small (and probably out of the dynamic range of a
computer). If the logarithm is taken on both sides, the following equation can be
used:

log P(O|λ) = − Σ_{t=1}^{T} log c_t

This is exactly what is done in the termination step of the scaled forward algorithm.
The logarithm of P(O|λ) is often just as useful as P(O|λ), because in most cases this
measure is used for comparison with other probabilities (for other models).
The scaled backward algorithm can be derived more easily, since it uses the same
scale factors as the forward algorithm. The notation used is similar to that of the forward
variables: β_t(i) denotes the unscaled backward variable, β̂_t(i) denotes the
scaled and iterated variant of β_t(i), β̄_t(i) denotes the local version of β_t(i) before
scaling, and c_t represents the scaling coefficient at each time. Here follows the
scaled backward algorithm:
1. Initialization:
   β̄_T(i) = 1; β̂_T(i) = c_T β̄_T(i)
2. Induction:
   β̄_t(i) = Σ_{j=1}^{N} a_ij b_j(o_{t+1}) β̂_{t+1}(j); β̂_t(i) = c_t β̄_t(i)
3. Update time: Set t = t − 1;
   Return to step 2 if t > 0;
   Otherwise, terminate the algorithm.
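In code, the scaling amounts to normalizing each alpha column to sum to one and accumulating the logarithms of the scale factors, as the following MATLAB sketch shows.

```matlab
% Sketch: scaled forward pass; log P(O|lambda) = -sum(log c_t).
function logLik = scaledForward(A, B, pi0, obs)
    T = numel(obs);  c = zeros(1, T);
    a = pi0 .* B(:, obs(1));            % alpha-bar: local, unscaled
    c(1) = 1 / sum(a);  a = a * c(1);   % scale so the column sums to one
    for t = 2:T
        a = (A.' * a) .* B(:, obs(t));
        c(t) = 1 / sum(a);  a = a * c(t);
    end
    logLik = -sum(log(c));              % termination step of Section 6.3.3
end
```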
6.4 Solution to Problem 2 - Optimal State Sequence

The problem is to find the optimal sequence of states for a given observation sequence and
model. Unlike problem one, for which an exact solution can be found, there are several
possible ways of solving this problem. The difficulty lies with the definition of the optimal
state sequence; that is, there are several possible optimality criteria. One optimality criterion is
to choose the states, q_t, that are individually most likely at each time t. To find this state
sequence the following probability variable is needed:

γ_t(i) = P(X_t = q_i | O, λ)

That is, the probability of being in state i at time t, given the observation sequence O and
the model λ. Another way to look at γ_t(i) is:

γ_t(i) = P(X_t = q_i, O | λ) / P(O|λ)

It is now possible to write

γ_t(i) = α_t(i) β_t(i) / Σ_{i=1}^{N} α_t(i) β_t(i)
When γ_t(i) is calculated according to the above equation, the most likely state at time t, q_t*, is found by:

q_t* = argmax_{1 ≤ i ≤ N} γ_t(i), 1 ≤ t ≤ T

Even though the above equation maximizes the expected number of correct states, there could be
some problems with the resulting state sequence, because the state transition
probabilities have not been taken into account. For example, what happens when some state
transitions have zero probability (a_ij = 0)? This means that the optimal path found may not
be valid. Obviously a method generating a path that is guaranteed to be valid would be
preferable. Fortunately such a method exists, based on dynamic programming, namely the
Viterbi algorithm. Even though γ_t(i) cannot be used for this purpose, it will be useful in
problem 3 (parameter estimation).
6.4.1 The Viterbi Algorithm
This algorithm is similar to the forward algorithm. The main difference is that the
forward algorithm uses summing over previous states, whereas the Viterbi
algorithm uses maximization. The aim of the Viterbi algorithm is to find the single
best state sequence, q = (q_1, q_2, ..., q_T), for the given observation sequence O =
(o_1, o_2, ..., o_T) and a model λ. Consider the following quantity:

δ_t(i) = max_{q_1, ..., q_{t−1}} P(q_1 q_2 ... q_{t−1}, X_t = q_i, o_1 o_2 ... o_t | λ)

That is, the probability of observing o_1 o_2 ... o_t using the best path that ends in state i
at time t, given the model λ. By induction, δ_{t+1}(j) can be found as:

δ_{t+1}(j) = [ max_i δ_t(i) a_ij ] b_j(o_{t+1})

To actually retrieve the state sequence, it is necessary to keep track of the argument
that maximizes the above equation for each t and j. This is done by saving the argument
in an array ψ_t(j). Here follows the complete Viterbi algorithm:
1. Initialization:
   δ_1(i) = π_i b_i(o_1), ψ_1(i) = 0, 1 ≤ i ≤ N
2. Induction:
   δ_t(j) = [ max_{1 ≤ i ≤ N} δ_{t−1}(i) a_ij ] b_j(o_t); ψ_t(j) = argmax_{1 ≤ i ≤ N} δ_{t−1}(i) a_ij, 1 ≤ j ≤ N
3. Update time: Set t = t + 1;
   Return to step 2 if t ≤ T;
   Otherwise, terminate the algorithm: the best score is P* = max_{1 ≤ i ≤ N} δ_T(i), and the
   state sequence is retrieved by backtracking through ψ.
The same problem as for the forward and backward algorithms occurs here: the
algorithm involves multiplication with probabilities, and the precision range will
be exceeded. This is why an alternative Viterbi algorithm is needed.
6.4.2 The Alternative Viterbi Algorithm
As mentioned, the original Viterbi algorithm involves multiplications with
probabilities. One way to avoid this is to take the logarithm of the model parameters,
so that the multiplications become additions. Obviously this logarithm
becomes a problem when some model parameters are zero. This is often
the case for A and π, and can be avoided by adding a small number to the matrices.
Here follows the alternative Viterbi algorithm:
1. Preprocessing:
   π̃_i = log(π_i), b̃_i(o_t) = log(b_i(o_t)), ã_ij = log(a_ij)
2. Initialization:
   δ̃_1(i) = π̃_i + b̃_i(o_1), ψ_1(i) = 0, 1 ≤ i ≤ N
3. Induction:
   δ̃_t(j) = max_{1 ≤ i ≤ N} [ δ̃_{t−1}(i) + ã_ij ] + b̃_j(o_t); ψ_t(j) = argmax_{1 ≤ i ≤ N} [ δ̃_{t−1}(i) + ã_ij ]
4. Update time:
   Set t = t + 1;
   Return to step 3 if t ≤ T;
   Otherwise, go to step 5.
5. Termination:
   P̃* = max_{1 ≤ i ≤ N} δ̃_T(i); q_T* = argmax_{1 ≤ i ≤ N} δ̃_T(i)
6. Path backtracking:
   (a) Set t = T − 1.
   (b) q_t* = ψ_{t+1}(q_{t+1}*)
   (c) Update time:
       Set t = t − 1;
       Return to step (b) if t >= 1;
       Otherwise, terminate the algorithm.
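A MATLAB sketch of this log-domain variant follows; as suggested above, a small constant is added before taking logarithms to handle zero entries in A and π.

```matlab
% Sketch: alternative (log-domain) Viterbi -- additions, no underflow.
function path = viterbiLog(A, B, pi0, obs)
    N = size(A, 1);  T = numel(obs);
    logA = log(A + eps);  logB = log(B + eps);  logPi = log(pi0 + eps);
    delta = zeros(N, T);  psi = zeros(N, T);  path = zeros(1, T);
    delta(:,1) = logPi + logB(:, obs(1));                % initialization
    for t = 2:T                                          % induction
        [m, arg]   = max(repmat(delta(:,t-1), 1, N) + logA, [], 1);
        delta(:,t) = m.' + logB(:, obs(t));
        psi(:,t)   = arg.';
    end
    [~, path(T)] = max(delta(:,T));                      % termination
    for t = T-1:-1:1                                     % path backtracking
        path(t) = psi(path(t+1), t+1);
    end
end
```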
6.5 Solution to Problem 3 - Parameter Estimation
The third problem is concerned with the estimation of the model parameters, λ = (A, B, π).
The problem can be formulated as:
Given an observation sequence O, find, from all possible models λ, the one that maximizes P(O|λ). This problem is the most difficult of the three, because there is no known way to
analytically find the model parameters that maximize the probability of the observation
sequence in closed form. However, the model parameters can be chosen to locally
maximize the likelihood P(O|λ). Commonly used methods for solving this problem are the
Baum-Welch method (also known as the expectation-maximization method) and gradient
techniques. Both of these methods use iterations to improve the likelihood P(O|λ); however,
there are some advantages of the Baum-Welch method compared to the gradient
techniques:
1. Baum-Welch is numerically stable, with the likelihood non-decreasing with every
iteration.
2. Baum-Welch converges to a local optimum.
3. Baum-Welch has linear convergence.
This is why Baum-Welch is used in this project. This section will derive the re-
estimation equations used in the Baum-Welch method.
The model λ has three terms to describe, namely the state transition probability distribution
A, the initial state distribution π, and the observation symbol probability distribution B.
Since continuous observation densities are used, B will be represented by the mixture
weights c_jk, the means μ_jk, and the covariances Σ_jk.
To describe the procedure for re-estimation, the following probability will prove useful:

ξ_t(i,j) = P(X_t = q_i, X_{t+1} = q_j | O, λ)
That is, the probability of being in state i at time t and in state j at time t + 1, given the
model and the observation sequence O. The paths that satisfy the conditions required by the
above equation are illustrated in Fig. 6.4.

Fig 6.4 Baum-Welch method

By using all the previous equations we can conclude that

ξ_t(i,j) = α_t(i) a_ij b_j(o_{t+1}) β_{t+1}(j) / P(O|λ)

As mentioned in problem 2, γ_t(i) is the probability of being in state i at time t, given the
entire observation sequence O and the model λ. Hence

γ_t(i) = Σ_{j=1}^{N} ξ_t(i,j)
If the sum over time t is applied to γ_t(i), one gets a quantity that can be interpreted as
the expected (over time) number of times that state i is visited, or equivalently, the expected
number of transitions made from state i (if the time slot t = T is excluded) [1]. If the same
summation is done over ξ_t(i,j), one gets the expected number of transitions from state i
to state j. The term γ_1(i), the probability of starting in state i, will also prove to be useful.
Given the above definitions it is possible to derive the re-estimation formulas for π and A:

π̄_i = γ_1(i), 1 ≤ i ≤ N

and

ā_ij = Σ_{t=1}^{T−1} ξ_t(i,j) / Σ_{t=1}^{T−1} γ_t(i)
The re-estimation of c_jk, μ_jk and Σ_jk is a bit more complicated. However, if the model had
only one state j and one mixture, it would be an easy averaging task:

μ̄_j = (1/T) Σ_{t=1}^{T} o_t,  Σ̄_j = (1/T) Σ_{t=1}^{T} (o_t − μ̄_j)(o_t − μ̄_j)'

In practice, of course, there are multiple states and multiple mixtures, and there is no direct
assignment of the observation vectors to individual states, because the underlying state
sequence is unknown. Since the full likelihood of each observation sequence is based on
the summation over all possible state sequences, each observation vector o_t contributes to the
computation of the likelihood for each state j. In other words, instead of assigning each
observation vector to a specific state, each observation is assigned to every state, and is
weighted with the probability of the model being in that state, accounting for that specific
mixture, when the vector was observed. This probability, for state j and mixture k (there are
M mixtures), is found by

γ_t(j,k) = [ α_t(j) β_t(j) / Σ_{j=1}^{N} α_t(j) β_t(j) ] · [ c_jk N(o_t; μ_jk, Σ_jk) / Σ_{m=1}^{M} c_jm N(o_t; μ_jm, Σ_jm) ]

The re-estimation formula for c_jk is the ratio between the expected number of times the
system is in state j using the k-th mixture component and the expected number of times the
system is in state j. That is:

c̄_jk = Σ_{t=1}^{T} γ_t(j,k) / Σ_{t=1}^{T} Σ_{k=1}^{M} γ_t(j,k)

To find μ_jk and Σ_jk one can weight the simple averages by the probability of being in state j
and using mixture k when observing o_t:

μ̄_jk = Σ_{t=1}^{T} γ_t(j,k) o_t / Σ_{t=1}^{T} γ_t(j,k)

Σ̄_jk = Σ_{t=1}^{T} γ_t(j,k) (o_t − μ_jk)(o_t − μ_jk)' / Σ_{t=1}^{T} γ_t(j,k)
The re-estimation formulas described in this section are based on one training
sample. This will of course not be sufficient to get reliable estimates,
especially when left-right models are used. To get reliable estimates it is convenient to use
multiple observation sequences.
6.5.1 Initial Estimates of HMM Parameters
Before the re-estimation formulas can be applied for training, it is important to get
good initial parameters, so that the re-estimation leads to the global maximum or as
close as possible to it. An adequate choice for π and A is the uniform distribution.
But since left-right models are used, π will have probability one for the first state
and zero for the other states. For example, the left-right model in the figure below
(Fig 6.5) will have π = (1, 0, ..., 0) and a transition matrix A in which a_ij = 0 for
j < i (no backward transitions).

Fig 6.5 Left-Right model of HMM
The parameters for the emission distributions need good initial estimates to get
rapid and proper convergence. This is done by using uniform segmentation of
every training sample into the states of the model. After segmentation, all observations
for state j are collected from all training samples. Then a clustering algorithm
is used to get the initial parameters for state j, and this procedure is done for every
state. The clustering algorithm used in this project is the well-known k-means
algorithm. Before the clustering proceeds, one has to choose the number of clusters,
K. In this task the number of clusters is equal to the number of mixtures, that is, K
= M.
The K-Means Algorithm

1. Initialization
Choose K vectors at random from the training vectors, here denoted x. These vectors become the initial centroids μ_k, which the algorithm then refines.

2. Recursion
Assign every vector in the training set to a cluster k by choosing the cluster whose centroid is closest to the vector:

\[ k^{*} = \arg\min_{k}\, d(x, \mu_k) \]

where d(x, μ_k) is a distance measure; here the Euclidean distance is used:

\[ d(x, \mu_k) = \lVert x - \mu_k \rVert = \sqrt{\textstyle\sum_{d=1}^{D} (x_d - \mu_{k,d})^2} \]

3. Test
Recompute each centroid μ_k as the mean of the vectors that belong to it. This is done for every k. If no vector belongs to some μ_k, create a new μ_k by choosing a random vector from x. If the centroids did not change from the previous step, go to step 4; otherwise go back to step 2.

4. Termination
From this clustering (done for one state j), the following initial parameters are obtained: the mixture weight c_jk as the fraction of the state's vectors falling in cluster k, the mean μ_jk as the centroid of cluster k, and the covariance Σ_jk as the sample covariance of the vectors in cluster k.
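A compact MATLAB sketch of this initialisation is given below. The function name kmeansInit and the argument names are illustrative and not part of the project code; X is assumed to be the dim × Nvec matrix of training vectors collected for one state:

function [mu, label] = kmeansInit(X, K)
% Sketch of the k-means initialisation described above (illustrative names).
% X : dim x Nvec matrix of training vectors assigned to one state
% K : number of clusters (= number of mixtures M)
    Nvec = size(X, 2);
    idx = randperm(Nvec);
    mu = X(:, idx(1:K));                      % 1. K random vectors as centroids
    prev = zeros(1, Nvec);
    while true
        % 2. assign each vector to the nearest centroid (Euclidean distance)
        d = zeros(K, Nvec);
        for k = 1:K
            diffs = X - repmat(mu(:,k), 1, Nvec);
            d(k,:) = sqrt(sum(diffs.^2, 1));
        end
        [~, label] = min(d, [], 1);
        % 3. recompute centroids; re-seed any empty cluster at random
        for k = 1:K
            members = X(:, label == k);
            if isempty(members)
                mu(:,k) = X(:, randi(Nvec));
            else
                mu(:,k) = mean(members, 2);
            end
        end
        % 4. terminate when the assignments no longer change
        if isequal(label, prev), break; end
        prev = label;
    end
end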
Chapter Seven
Implementation
The system was built and tested on a platform with the following specification:

1. System Specification
1.1. Intel Core i5 CPU @ 2.67 GHz
1.2. 4 GB of RAM
1.3. Microsoft Windows 7 x64
1.4. MATLAB 7.12.0 (R2011a)
1.5. Microphone

2. Minimum System Requirement
2.1. Pentium 200 MHz processor
2.2. 512 MB of RAM
2.3. Microphone
2.4. Soundcard
The system was coded in MATLAB and is limited to a command-line interface. It is divided into various components, each dedicated to a unique task, e.g. speech acquisition, feature extraction, training and recognition. The full detailed source code can be found at the end of the report.

The user can train the model either by recording directly from the microphone or by using samples from pre-recorded WAV files. The same applies to recognition. The project has the constraint that it can recognize either isolated words or a series of connected words separated by a small pause. Each word is detected and extracted by the word detector and then recognized separately.

The sound is recorded at 8000 samples per second with 16 bits per sample, so the quality of the signal is not high enough to register small differences between two similar signals. The user therefore has to speak clearly.
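The WAV-file acquisition path is listed in Appendix A; the microphone path is not reproduced there, but a minimal sketch using MATLAB's audiorecorder (available in R2011a) would be:

recObj = audiorecorder(8000, 16, 1);   % 8 kHz, 16 bits per sample, mono
recordblocking(recObj, 6);             % 6 s for training (10 s for recognition)
x = getaudiodata(recObj);              % column vector of samples
x = x(8001:end);                       % drop first 8000 samples (mic click)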
7.1 Software Modules

All the software modules were coded in MATLAB; a brief description of the most important ones is given below:
1. GetSpeechSample.m: Acquires the speech signal either from a WAV file or directly from the microphone. It records at 8000 samples per second and 16 bits per sample, for 6 seconds in the case of training and 10 seconds in the case of recognition. It also trims the first 8000 samples to remove the initial click caused by the microphone, so the user should start speaking only after approximately 1 second.
2. ExtractFeatures.m: Receives a speech signal and performs feature extraction on it. It returns a set of 39-dimensional feature vectors, where each vector contains MFCCs, energy, and delta and acceleration coefficients.
3. TrainWord.m: Receives feature vectors and the string that the recognizer should output upon recognition, and generates a model from them, assigning it a unique id.
4. Recognize.m: Performs recognition of the speech sample fed to it from a WAV file or directly from the microphone.
5. forwardBackwardAlgorithm.m: Implements the Forward-Backward algorithm used in training the word models.
6. viterbiAlgorithm.m: Implements the Viterbi algorithm, which is used to calculate the probability score for recognition of the words.
7. UpdateKnowledge.m: Updates the dictionary whenever a new word is added to the database by training.
8. GaussianProbability.m: Calculates the Gaussian probability of a frame of the speech sample using the probability density function.
9. ResetDictionary.m: Resets the dictionary and removes all trained models.
10. FindNextSegment.m: Finds the next word by detecting its boundary from the energy and zero-crossing rates.

A minimal sketch of how these modules compose is shown below.
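The call signatures in this sketch follow the module descriptions above; the appendix versions are parameter-less and prompt the user interactively instead:

% Training one word (illustrative glue code; the real entry point is Main.m)
x = GetSpeechSample();          % acquire speech at 8 kHz / 16-bit
f = ExtractFeatures(x);         % 39-dim MFCC + energy + delta + acceleration
TrainWord(f, 'hello');          % build an HMM and register it in the dictionary

% Recognition
y = GetSpeechSample();
Recognize(y);                   % segment words and score each against all models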
7.2 Working of the System

The working of the system can be understood by breaking the full functionality into sub-functions. First, the signal is acquired by the speech acquisition module. This module trims off the first 8000 samples, corresponding to 1 second of signal, and passes the signal to the word detector. The word detector detects the boundary of each word present in the signal by analysing the energy and the zero-crossing rates, and stores the start and finish point of each word in a list. Word segments whose distance is less than the threshold distance are merged together and the list is updated. Each word segment is then fed into the feature extractor module and converted into a set of feature vectors. These are passed on to the recognizer module, which finds the closest match among the models stored in the database. After recognition of this word completes, the next word is fetched from the list and the process is repeated.
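The merging step can be sketched as follows. Since the corresponding lines of TrainWord.m were partially lost in the appendix, this is one plausible reading in which a gap smaller than the threshold joins two segments:

% Sketch: merge word segments separated by less than distance_threshold frames.
% seg_list{l,1} and seg_list{l,2} hold the start and finish of segment l.
distance_threshold = 100;              % minimum inter-word gap, in frames
merged = seg_list(1,:);
for l = 2:size(seg_list,1)
    if seg_list{l,1} - merged{end,2} < distance_threshold
        merged{end,2} = seg_list{l,2};     % too close: extend the previous word
    else
        merged(end+1,:) = seg_list(l,:);   % far enough: start a new word
    end
end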
Before speech recognition can be performed, the individual words need to be trained first. In the training phase the user provides input either from a pre-recorded WAV file or directly from the microphone. The training module extracts features from the input and generates an HMM for it. The probabilities are adjusted by performing the expectation-maximization process. A full overview of the working of the system is shown below in Figure 7.1.
Fig 7.1 Block Diagram of Working of the system
Snapshots

1. The Main Interface
Fig 7.2 The Main Interface

2. Training a New Word
Fig 7.3 Training

3. Recognition
Fig 7.4 Recognition
Chapter Eight
Results and Conclusion
8.1 Training

Ten of the most frequently used English words were trained and their models were generated. For this I used my own voice; approximately 300 samples were taken in total, with 30 training samples for each word.

Each model was trained by the expectation-maximization algorithm with 7 iterations per model. The number of states in each model was 5. Each model was stored in a unique .MAT file.

The sound was recorded at 8000 samples per second, 16 bits per sample, mono channel.
8.2 Recognition

The recognition module was given a WAV file as input containing a pre-recorded speech sample. Each speech sample contained 5 to 6 words separated by a minimum distance of 100 frames. First the words are detected and extracted from the signal, then each word is recognized independently.

Recognition was tested using samples from the user who trained the system as well as from an unknown user. The environment was also varied so that it differed from the one in which the system was trained.

The results of the experiment are shown below:
Type of condition                       No of samples   Correct output   Accuracy (%)
Known user and known environment              30               30             100
Known user and unknown environment            20               14              80
Unknown user and known environment            25               12              48
Unknown user and unknown environment          15                3              20

Table 8.1 Recognition Results
8.3 Conclusion

The theory of Hidden Markov Models has been studied thoroughly. Together with the signal processing of speech signals, a speech recognizer has been implemented in MATLAB. A word boundary detector has also been implemented; its performance met the requirements and it worked well in all tested conditions.

An HMM library was built and used both for recognition and for training a word-based acoustic model. Word models were built for some of the most frequently used words of the English language.

The trained models were used to recognize speech. The recognizer gave its best performance when both the user and the environment were known to it, and its worst performance when both were unknown. This happened because the models were trained in a constrained environment with only one user. The problem can be addressed by including more training data from different speakers and in different environments. The other cases produced intermediate results.
8.4 Future Work

Further improvements and expansions may be achieved through one or more of the following suggestions:

The speech recognizer is implemented in MATLAB and therefore runs slowly. Re-implementing it in C or assembly would be desirable for faster execution.

In a noisy environment, such as in a car, noise reduction algorithms are needed to improve the signal-to-noise ratio. Useful algorithms can be based on adaptive noise reduction, spectral subtraction or beamforming.

Record a larger evaluation database, covering different speakers and different environments, to get more test cases.

Try different settings in the speech recognizer, for example changing the model structure or the number of states or mixtures. Measures of the speech signal can be added to or removed from the feature vectors, i.e. experiment with the feature vector dimension and its content.
Reading and References

[1] Lawrence R. Rabiner, "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition," Proceedings of the IEEE, 1989.
[2] B. Plannerer, "An Introduction to Speech Recognition," March 28, 2005.
[3] Christopher M. Bishop, Pattern Recognition and Machine Learning, Chapter 13, pp. 605-646.
[4] M. Narasimha Murty, V. Susheela Devi, Pattern Recognition: An Algorithmic Approach, Universities Press, Chapters 3, 5 and 9.
[5] B. H. Juang, L. R. Rabiner, "Hidden Markov Models for Speech Recognition," Technometrics, Vol. 33, No. 3, August 1991.
[6] Lawrence Rabiner, Biing-Hwang Juang, Fundamentals of Speech Recognition, Prentice-Hall International, Chapters 6 and 7.
[7] Christophe Couvreur, "Hidden Markov Models and Their Mixtures."
[8] L. R. Rabiner, M. R. Sambur, "An Algorithm for Determining the Endpoints of Isolated Utterances," The Bell System Technical Journal, Vol. 54, No. 2, February 1975.
[9] Mel Frequency Cepstral Coefficient (MFCC) tutorial: http://practicalcryptography.com/miscellaneous/machine-learning/guide-mel-frequency-cepstral-coefficients-mfccs/
[10] http://www.nuance.com/
[11] http://cmusphinx.sourceforge.net/
[12] B. Pellom, "Sonic: The University of Colorado Continuous Speech Recognition System."
[13] http://xvoice.sourceforge.net/
Appendix A
Source Code in MATLAB
function [] = Main()
    clear; clc;
    check = 0;
    fprintf('#===================SPEECH RECOGNITION SYSTEM==================#\n');
    fprintf('#Developed By: Aditya Sharma                                  #\n');
    fprintf('#Branch: CSE Final Year                                       #\n');
    fprintf('#Institute of Engineering and Technology,Lucknow              #\n');
    fprintf('#==============================================================#\n\n');
    fprintf('Choose an option:\n');
    fprintf('1: Train a new word\n');
    fprintf('2: Perform Recognition\n');
    fprintf('3: Reset Knowledge\n');
    fprintf('4: Exit\n\n');
    option = input('Your option is:', 's');
    switch option
        case '1'
            TrainWord;
        case '2'
            Recognize;
        case '3'
            ResetDictionary;
        case '4'
            clc;    % the original's 'cls' is not a MATLAB command
            check = 1;
        otherwise
            fprintf('Invalid option!! Retry...\n');
    end
    if check == 0
        %input('');
    else
    end
end
function [] = TrainWord()
    fprintf('Training Mode....\n');
    x = GetSpeechSampleW();
    fprintf('Enter the string corresponding to this word:');
    name = input('', 's');
    fprintf('Training...\n');
    Ini = 0.1;                      % Initial silence duration in seconds
    Ts = 0.01;                      % Frame width in seconds
    Tsh = 0.005;                    % Frame shift in seconds
    Fs = 8000;                      % Sampling frequency
    ZTh = 40;                       % Zero crossing comparison rate for threshold
    w_sam = fix(Ts*Fs);             % No of samples/window
    o_sam = fix(Tsh*Fs);            % No of samples/overlap
    lengthX = length(x);
    segs = fix((lengthX-w_sam)/o_sam)+1;    % Number of segments in speech signal
    sil = fix((Ini-Ts)/Tsh)+1;      % Number of segments in silent period
    win = hamming(w_sam);
    Limit = o_sam*(segs-1)+1;       % Start index of last segment
    FrmIndex = 1:o_sam:Limit;       % Starting index for each segment
    ZCR_Vector = zeros(1,segs);     % Zero crossing rate for all segments

    % Compute zero crossing rates for all segments of the speech sample
    for t = 1:segs
        ZCRCounter = 0;
        nextIndex = (t-1)*o_sam+1;
        for r = nextIndex+1:(nextIndex+w_sam-1)
            if (x(r) >= 0) && (x(r-1) >= 0)
                % no crossing
            elseif (x(r) >= 0) && (x(r-1) < 0)
                ZCRCounter = ZCRCounter + 1;
            elseif (x(r) < 0) && (x(r-1) < 0)
                % no crossing
            elseif (x(r) < 0) && (x(r-1) >= 0)
                ZCRCounter = ZCRCounter + 1;
            end
        end
        ZCR_Vector(t) = ZCRCounter;
    end

    % Compute frame energy for all segments of the speech sample
    Erg_Vector = zeros(1,segs);
    for u = 1:segs
        nextIndex = (u-1)*o_sam+1;
        Energy = x(nextIndex:nextIndex+w_sam-1).*win;
        Erg_Vector(u) = sum(abs(Energy));
    end

    IMN = mean(Erg_Vector(1:sil));  % Mean silence energy (noise energy)
    IMX = max(Erg_Vector);          % Maximum energy for entire utterance
    I1 = 0.03*(IMX-IMN) + IMN;      % I1 & I2 are initial thresholds
    I2 = 4*IMN;
    ITL = min(I1,I2);               % Lower energy threshold
    ITU = 5*ITL;                    % Upper energy threshold
    IZC = mean(ZCR_Vector(1:sil));  % Mean zero crossing rate of silence region
    stdev = std(ZCR_Vector(1:sil)); % Std dev of crossing rate of silence region
    IZCT = min(ZTh, IZC+2*stdev);   % Zero crossing rate threshold

    flag = 1; startpoint = 1; segment_count = 0;
    while flag == 1
        [st,fi,fl] = FindNextSegment(ITU,ITL,IZCT,x,startpoint,Erg_Vector,ZCR_Vector);
        if fl == 0
            break;
        end
        segment_count = segment_count + 1;
        seg_list{segment_count,1} = st;
        seg_list{segment_count,2} = fi;
        startpoint = fi;
    end
    distance_threshold = 100;
    valid_index = ones(1,segment_count);
    for l = 1:segment_count-1
        if valid_index(l) == 1
            if abs(seg_list{l+1,2}-seg_list{l,2}) < distance_threshold
                % ... (the merge logic and the remainder of TrainWord were
                %      lost in the original document)

function [start,finish,flag] = FindNextSegment(ITU,ITL,IZCT,x,startpoint,Erg_Vector,ZCR_Vector)
    % (the opening of this function was lost in the original document;
    %  the initialisations and the loop header below are reconstructed so
    %  that the surviving body reads coherently)
    counter1 = 0; counter2 = 0; ZCRCountb = 0; flag = 1;
    % Search forward from startpoint for frames with energy greater than ITU
    for i = startpoint:length(Erg_Vector)
        if (Erg_Vector(i) > ITU)
            counter1 = counter1 + 1;
            indexi(counter1) = i;
        end
    end
    if counter1 == 0
        flag = 0; start = 0; finish = 0;
        return;
    end
    ITUs = indexi(1);
    first_hit = ITUs;
    % Search backward for frames with energy smaller than ITL
    for j = ITUs:-1:1
        if (Erg_Vector(j) < ITL)
            counter2 = counter2 + 1;
            indexj(counter2) = j;
        end
    end
    start = indexj(1)+1;
    %BackSearch = min(start,25);
    BackSearch = 25;
    for m = start:-1:start-BackSearch+1
        rate = ZCR_Vector(m);
        if rate > IZCT
            ZCRCountb = ZCRCountb + 1;
            realstart = m;
        end
    end
    if ZCRCountb > 3
        start = realstart;  % If IZCT is exceeded in more than 3 frames,
                            % set start to last index where IZCT is exceeded
    end
    l_c = 0;
    for k = first_hit:length(Erg_Vector)
        if (Erg_Vector(k) < ITL)
            % ... (the remainder of the finish-point search was lost in the
            %      original document)
end
function [pcm] = GetSpeechSampleW()
    fprintf('Enter the name of the WAVE file:');
    fname = input('', 's');
    buffer = wavread(fname);
    FrameRate = 8000;
    [bufferlength, ~] = size(buffer);
    buffer = buffer(FrameRate:bufferlength);   % trim the first second (mic click)
    pcm = buffer;
end
function [features] = ExtractFeatures(signal)
    fs = 8000;
    frameSizeInSec = 0.025;
    frameShiftInSec = 0.010;
    hamming = 1;
    preEmphesis = 0;
    totalFilterBanks = 26;
    cepstralOrder = 12;
    lifter = 22;
    deltaWindow = 2;
    deltaWindowWeight = ones(1,2*deltaWindow+1);
    signal = double(signal);
    len = length(signal);
    preEmpSignal = zeros(len,1);
    preEmpSignal(1) = signal(1);
    preEmpSignal(2:end) = signal(2:end) - preEmphesis*signal(1:end-1);
    frameSize = round(fs*frameSizeInSec);
    frameShift = round(fs*frameShiftInSec);
    frameNo = floor(1 + (len - frameSize)/frameShift);
    maxMF = 2595*log10(1 + 0.5*fs/700.0);          % mel value of the Nyquist frequency
    deltaMF = maxMF/(totalFilterBanks+1);
    f = zeros(totalFilterBanks+2,1);
    for m = 1:totalFilterBanks+2
        f(m) = (10^((m-1)*deltaMF/2595)-1)*700.0;  % filterbank edge frequencies
    end
    mfcc_tran = zeros(cepstralOrder,totalFilterBanks);
    for k = 1:cepstralOrder
        for m = 1:totalFilterBanks
            mfcc_tran(k,m) = sqrt(2/totalFilterBanks)*cos(k*pi/totalFilterBanks*(m-0.5));
        end
    end
    n = (1:cepstralOrder)';
    lifter_weighting = 1 + (lifter/2)*sin(pi*n/lifter);
    k = (1:frameSize)';
    h = 0.54 - 0.46*cos(2*pi*(k-1)/(frameSize-1));  % Hamming window
    mfcc = zeros(cepstralOrder,frameNo);
    melspec = zeros(totalFilterBanks,frameNo);
    for fr = 1:frameNo
        s = preEmpSignal((fr-1)*frameShift+1:(fr-1)*frameShift+frameSize);
        if hamming ~= 0
            s = s.*h;
        end
        fftN = 2;
        while fftN < frameSize    % (condition reconstructed: grow the FFT
            fftN = fftN*2;        %  size to the next power of two)
        end
        % ... (the remainder of ExtractFeatures was lost in the original
        %      document)
% (the header of the following training function was lost in the original
%  document; it continues from feature extraction and trains the HMM by
%  expectation maximization)
    newId = GenerateId();
    id = newId;
    modelFileName = ['hmms\' int2str(newId) '.mat'];
    for iter = ITERATION_BEGIN:ITERATION_END
        if iter == 1
            % First iteration: uniform segmentation of the frames into states
            MIN_SELF_TRANSITION_COUNT = 0;
            vector_sums_i_m = zeros(dim,STATE_NO);
            var_vec_sums_i_m = zeros(dim,STATE_NO);
            fr_no_i_m = zeros(STATE_NO);
            self_tr_fr_no_i_m = zeros(STATE_NO);
            [dim,fr_no] = size(features);
            for i = 1:STATE_NO
                begin_fr = round(fr_no*(i-1)/STATE_NO)+1;
                end_fr = round(fr_no*i/STATE_NO);
                seg_length = end_fr-begin_fr+1;
                vector_sums_i_m(:,i) = vector_sums_i_m(:,i) + sum(features(:,begin_fr:end_fr),2);
                var_vec_sums_i_m(:,i) = var_vec_sums_i_m(:,i) + sum(features(:,begin_fr:end_fr).*features(:,begin_fr:end_fr),2);
                fr_no_i_m(i) = fr_no_i_m(i)+seg_length;
                self_tr_fr_no_i_m(i) = self_tr_fr_no_i_m(i)+seg_length-1;
            end
            for i = 1:STATE_NO
                mean_vec_i_m(:,i) = vector_sums_i_m(:,i)/fr_no_i_m(i);
                var_vec_i_m(:,i) = var_vec_sums_i_m(:,i)/fr_no_i_m(i);
                A_i_m(i) = (self_tr_fr_no_i_m(i)+MIN_SELF_TRANSITION_COUNT)/(fr_no_i_m(i)+2*MIN_SELF_TRANSITION_COUNT);
            end
        else
            % Later iterations: re-estimation via the forward-backward algorithm
            MIN_SELF_TRANSITION_COUNT = 0.00;
            [dim,STATE_NO] = size(mean_vec_i_m);
            vector_sums_i_m = zeros(dim,STATE_NO);
            var_vec_sums_i_m = zeros(dim,STATE_NO);
            fr_no_i_m = zeros(STATE_NO);
            self_tr_fr_no_i_m = zeros(STATE_NO);
            total_log_prob = 0;
            total_fr_no = 0;
            [log_prob, pr_i_t, pr_self_tr_i_t] = forwardBackwardAlgorithm(features, mean_vec_i_m(:,:), var_vec_i_m(:,:), A_i_m(:));
            total_log_prob = total_log_prob + log_prob;
            total_fr_no = total_fr_no + fr_no;
            for i = 1:STATE_NO
                fr_no_i_m(i) = fr_no_i_m(i)+sum(pr_i_t(i,:));
                self_tr_fr_no_i_m(i) = self_tr_fr_no_i_m(i)+sum(pr_self_tr_i_t(i,1:end-1));
                for fr = 1:fr_no
                    vector_sums_i_m(:,i) = vector_sums_i_m(:,i) + pr_i_t(i,fr)*features(:,fr);
                    var_vec_sums_i_m(:,i) = var_vec_sums_i_m(:,i) + pr_i_t(i,fr)*(features(:,fr)-mean_vec_i_m(:,i)).*(features(:,fr)-mean_vec_i_m(:,i));
                end
            end
            old_mean_vec_i_m = mean_vec_i_m;
            old_var_vec_i_m = var_vec_i_m;
            old_A_i_m = A_i_m;
            for i = 1:STATE_NO
                mean_vec_i_m(:,i) = vector_sums_i_m(:,i)/fr_no_i_m(i);
                var_vec_i_m(:,i) = var_vec_sums_i_m(:,i)/fr_no_i_m(i);
                A_i_m(i) = (self_tr_fr_no_i_m(i)+MIN_SELF_TRANSITION_COUNT)/(fr_no_i_m(i)+2*MIN_SELF_TRANSITION_COUNT);
            end
            var_new_to_old_ratio = var_vec_i_m ./ old_var_vec_i_m;
        end
    end
    save(modelFileName, 'mean_vec_i_m', 'var_vec_i_m', 'A_i_m');
    fprintf('The new word is added to the knowledge...\n');
end
function [log_prob, pr_i_t, pr_self_tr_i_t] = forwardBackwardAlgorithm(V, mean_vec_i, var_vec_i, A_i)
    [dim, N] = size(mean_vec_i);
    [dim2, T] = size(V);
    [log_prob, logfw, logObsevation_i_t] = forwardAlgorithm(V, mean_vec_i, var_vec_i, A_i);
    pr_self_tr_i_t = zeros(N,T);
    logbw = ones(N,T)*(-inf);
    t = T;
    logbw(N,T) = log(1-A_i(N));
    for t = T-1:-1:1
        for i = 1:N
            if i == N
                logbw(i,t) = log(A_i(i)) + logObsevation_i_t(i,t+1) + logbw(i,t+1);
                pr_self_tr_i_t(i,t) = exp(logfw(i,t) + log(A_i(i)) + logObsevation_i_t(i,t+1) + logbw(i,t+1) - log_prob);
            else
                logbw(i,t) = CalculateSum([(log(A_i(i)) + logObsevation_i_t(i,t+1) + logbw(i,t+1)), (log(1-A_i(i)) + logObsevation_i_t(i+1,t+1) + logbw(i+1,t+1))]);
                pr_self_tr_i_t(i,t) = exp(logfw(i,t) + log(A_i(i)) + logObsevation_i_t(i,t+1) + logbw(i,t+1) - log_prob);
            end
        end
    end
    pr_i_t = exp(logfw + logbw - log_prob);
    count_at_t(1:T) = sum(pr_i_t,1);
    count_at_t = squeeze(count_at_t);
    if (sum(count_at_t) - T) > 1E-6
        diff = sum(count_at_t) - T;   % sanity check: occupancies should sum to T
    end
end
function [log_pr, varargout] = forwardAlgorithm(V, mean_vec_i, var_vec_i, A_i)
    [dim, N] = size(mean_vec_i);
    [dim2, T] = size(V);
    logObsevation_i_t = zeros(N,T);
    for t = 1:T
        for i = 1:N
            logObsevation_i_t(i,t) = GaussianProbability(V(:,t), mean_vec_i(:,i), var_vec_i(:,i));
        end
    end
    logfw = ones(N,T)*(-inf);
    t = 1;
    logfw(1,1) = logObsevation_i_t(1,1);
    for t = 2:T
        i = 1;
        logfw(i,t) = logfw(i,t-1) + log(A_i(i)) + logObsevation_i_t(i,t);
        for i = 2:N
            logfw(i,t) = CalculateSum([(logfw(i-1,t-1) + log(1-A_i(i-1))), (logfw(i,t-1) + log(A_i(i)))]) + logObsevation_i_t(i,t);
        end
    end
    log_pr = logfw(N,T) + log(1-A_i(N));
    varargout(1) = {logfw};
    varargout(2) = {logObsevation_i_t};
end
function [] = UpdateKnowledge(word_id, value)
    dictionaryFile = 'dictionary.mat';
    if (exist(dictionaryFile,'file'))
        load(dictionaryFile,'dictionary');
        dictionary{word_id,1} = word_id;
        dictionary{word_id,2} = value;
        save(dictionaryFile,'dictionary');
    else
        dictionary{word_id,1} = word_id;
        dictionary{word_id,2} = value;
        save(dictionaryFile,'dictionary');
    end
end
function [] = Recognize()
    x = GetSpeechSampleW();
    fprintf('Recognizing...\n');
    Ini = 0.1;                      % Initial silence duration in seconds
    Ts = 0.01;                      % Frame width in seconds
    Tsh = 0.005;                    % Frame shift in seconds
    Fs = 8000;                      % Sampling frequency
    ZTh = 40;                       % Zero crossing comparison rate for threshold
    w_sam = fix(Ts*Fs);             % No of samples/window
    o_sam = fix(Tsh*Fs);            % No of samples/overlap
    lengthX = length(x);
    segs = fix((lengthX-w_sam)/o_sam)+1;    % Number of segments in speech signal
    sil = fix((Ini-Ts)/Tsh)+1;      % Number of segments in silent period
    win = hamming(w_sam);
    Limit = o_sam*(segs-1)+1;       % Start index of last segment
    FrmIndex = 1:o_sam:Limit;       % Starting index for each segment
    ZCR_Vector = zeros(1,segs);     % Zero crossing rate for all segments

    % Compute zero crossing rates for all segments (as in TrainWord)
    for t = 1:segs
        ZCRCounter = 0;
        nextIndex = (t-1)*o_sam+1;
        for r = nextIndex+1:(nextIndex+w_sam-1)
            if (x(r) >= 0) && (x(r-1) >= 0)
                % no crossing
            elseif (x(r) >= 0) && (x(r-1) < 0)
                ZCRCounter = ZCRCounter + 1;
            elseif (x(r) < 0) && (x(r-1) < 0)
                % no crossing
            elseif (x(r) < 0) && (x(r-1) >= 0)
                ZCRCounter = ZCRCounter + 1;
            end
        end
        ZCR_Vector(t) = ZCRCounter;
    end

    % Compute frame energy for all segments (as in TrainWord)
    Erg_Vector = zeros(1,segs);
    for u = 1:segs
        nextIndex = (u-1)*o_sam+1;
        Energy = x(nextIndex:nextIndex+w_sam-1).*win;
        Erg_Vector(u) = sum(abs(Energy));
    end

    IMN = mean(Erg_Vector(1:sil));  % Mean silence energy (noise energy)
    IMX = max(Erg_Vector);          % Maximum energy for entire utterance
    I1 = 0.03*(IMX-IMN) + IMN;      % I1 & I2 are initial thresholds
    I2 = 4*IMN;
    ITL = min(I1,I2);               % Lower energy threshold
    ITU = 5*ITL;                    % Upper energy threshold
    IZC = mean(ZCR_Vector(1:sil));  % Mean zero crossing rate of silence region
    stdev = std(ZCR_Vector(1:sil)); % Std dev of crossing rate of silence region
    IZCT = min(ZTh, IZC+2*stdev);   % Zero crossing rate threshold

    flag = 1; startpoint = 1; segment_count = 0;
    while flag == 1
        [st,fi,fl] = FindNextSegment(ITU,ITL,IZCT,x,startpoint,Erg_Vector,