Implementing Speech Recognition with Artificial Neural Networks
by
Alexander Murphy
Department of Computer Science
Thesis Advisor: Dr. Yi Feng
Submitted in partial fulfillment
of the requirements for the degree of
Bachelor of Computer Science
Algoma University
Sault Ste. Marie, Ontario
April 11, 2014
Abstract
Speech recognition is a complicated problem, but in today's society, where technology
consistently strives for hands-free and voice-driven interfaces, it can be a very useful tool.
Speech is such a fascinating phenomenon of the human body that many of its properties end up
being unique to each person. This leads to many different options to consider when
implementing a speech recognition system. Which techniques of speech recognition can yield
better results? What is an efficient way of implementing such a system? I will answer these
questions in this thesis by implementing an isolated word recognition system with an artificial
neural network. In using this network to implement the recognition system, I will attempt to
gain an understanding of how neural networks are used for pattern recognition, and of the
techniques behind them.
Acknowledgements
I would like to thank my thesis supervisor, Dr. Yi Feng. Her feedback, guidance, and advice were
a great help in the completion of my thesis. Also, I would like to thank Dr. George Townsend for
agreeing to be my second reader for this project, as well as teaching me how to use the massive
program that is Matlab. All of your help was greatly appreciated.
I would also like to thank the entire Department of Computer Science for helping me through the
past four years. It was a wonderful experience from which I benefitted greatly.
Table of Contents
Abstract ii
Acknowledgements iii
Table of Contents iv
List of Figures vi
Chapter 1 Introduction 1
1.1 Speech Recognition……………………………………………………………………….2
1.2 Neural Networks…………………………………………………………………………..3
1.3 Thesis in Brief……………………………………………………………………………..3
1.4 Thesis Outline……………………………………………………………………………..4
Chapter 2 Neural Networks 5
2.1 Biological Neural Networks………………………………………………………………5
2.2 Artificial Neural Networks………………………………………………………………..6
2.2.1 Fundamentals of Neural Networks………………………………………………..8
2.2.2 Neural Network Structures………………………………………………………11
2.2.3 Training Algorithms……………………………………………………………..15
Chapter 3 Speech Recognition 18
3.1 Fundamentals of Speech Recognition……………………………………………………18
3.2 Dynamic Time Warping Algorithm……………………………………………………...20
3.3 Hidden Markov Model Approaches……………………………………………………...22
3.4 Neural Network Approaches…………………………………………………………..…23
Chapter 4 Design of Neural Network 26
4.1 Limitations of Neural Network…………..………………………………………………26
4.2 Structure of Neural Network……………………………………………..………………28
4.3 Training Algorithm………………………………………………………………………30
4.4 Audio Processing Algorithm……………………………………………….…………….36
Chapter 5 Neural Network Implementation 39
5.1 Materials………………..…………….………………………………………………….39
5.2 Training Data Preparation………………………………………………………………..40
5.3 Building the Neural Network…………………………………………………………….43
5.4 Testing the Neural Network…………………………………………………...…………46
Chapter 6 Results 47
6.1 Testing Results…………………………………………………………………………...47
6.2 Discussion of Results….....................................................................................................49
6.3 Difficulties……………………………………………………………………………….50
Chapter 7 Conclusions and Future Work 51
7.1 Summary…………………………………………………………………………………51
7.2 Future Work……………………………………………………………………………...51
Bibliography 52
Appendices 55
A: trainingPrep.m 55
B: createNetwork.m 58
C: testNetwork.m 60
D: mfcc.m 61
List of Figures
Figure 2.1 – Simplified Biological Neurons……………………………………………………....5
Figure 2.2 – Symbolic Illustration of Linear Threshold Gate……………………………………..7
Figure 2.3 – Activation Functions………………………………………………………………...9
Figure 2.4 – Threshold Logic Unit………………………………………………………………10
Figure 2.5 – Single Layer Feedforward Network [11]…………………………………………..12
Figure 2.6 – Multi Layer Feedforward Network [11]……………………………………………12
Figure 2.7 – Recurrent Neural Network…………………………………………………...…….14
Figure 2.8 – Block Diagram of Reinforcement Learning [14]…………………………………..16
Figure 2.9 – Competitive Neural Network………………………………………………………17
Figure 3.1 – Structure of a Standard Speech Recognition System [15]………………………….18
Figure 3.2 – Image Frame Extraction [18]……………………………………………………….19
Figure 3.3 – Global Distance Grid……………………………………………………………….21
Figure 3.4 – A Simple Hidden Markov Model Topology……………………………………….22
Figure 3.5 – Static and Dynamic Classification………………………………………………….23
Figure 3.6 – The Multilayer Perceptron and Decision Regions………………………………….24
Figure 3.7 – Time Delay Neural Network Structure……………………………………………..25
Figure 4.1 – Simple Fully Connected Feedforward Multilayer Perceptron [23]………………...28
Figure 4.2 – Structure of Neural Network……………………………………………………….30
Figure 5.1 – Matlab Audio Recording…………………………………………………………...40
Figure 5.2 – MFCC function called and vector created………………………………………….40
Figure 5.3 – Wav file and MFCC of “Go”……………………………………………………….41
Figure 5.4 – Wav file and MFCC of “Hello”…………………………………………………….41
Figure 5.5 – Wav file and MFCC of “No”……………………………………………………….41
Figure 5.6 – Wav file and MFCC of “Stop”………………………………………….………….42
Figure 5.7 – Wav file and MFCC of “Yes”……………………………………………..……….42
Figure 5.8 – Creating the Neural Network……………………………………………………….43
Figure 5.9 – Setting Up Division of Data………………………………………………………..43
Figure 5.10 – Matlab Training Code……………………………………………………………..44
Figure 5.11 – Training the Neural Network……………………………………………………...45
Figure 5.12 – testNetwork Function……………………………………………………………..46
Figure 6.1 – Test Results for “Go”………………………………………………………………47
Figure 6.2 – Test Results for “Hello”……………………………………………………………47
Figure 6.3 – Test Results for “No”………………………………………………………………47
Figure 6.4 – Test Results for “Stop”……………………………………………..………………48
Figure 6.5 – Test Results for “Yes”……………………………………………………………...48
Chapter 1
Introduction
Speech is a natural phenomenon that occurs every single day. From a very early point in our
lives we learn the necessary skills to use speech as a primary mode of communication. Because
speech comes so naturally to us, it is easy to forget how complex it really is. It begins with lungs
producing airflow and air pressure. This pressure then vibrates the vocal cords, which separate
the airflow into audible pulses. The muscles of the larynx then adjust the length and tension of
the vocal cords to change the pitch and tone of the sound produced. The vocal tract, consisting of
the tongue, palate, cheek and lips then articulate and filter the sound [1]. Each instance of speech
that occurs after this process is very unique. Voices can vary in volume, speed, pitch, roughness,
tone and other different aspects. Due to different cultural backgrounds voices can also differ in
terms of accent, articulation, and pronunciation. All of these differences make the
implementation of speech recognition a very challenging problem.
In today’s society people prefer options in technology that simplify activities. Using speech
recognition, users could type text or issue device commands through speech. Systems like
language translation and dictation could become simple hands-free devices. But with these ideas
comes the complicated task of implementing the system. Speech recognition in computers is
nowhere near the speech recognition capabilities of the human brain. But perhaps using systems
that mimic human brain functions could lead to further advancements in the field.
1.1 Speech Recognition
Speech recognition can be a very complex problem with a large number of characteristics that
need to be considered.
Vocabulary size and confusability
Speech recognition can generally function more efficiently with a small vocabulary size.
As the number of words that need to be recognized increases, the amount of confusion
and percentage of error can also increase.
Speaker dependence vs. independence
As mentioned above, different speakers can have completely different voices due to
several different factors. A speech recognition system will have a much easier time
recognizing a word from a single user, while a system that accepts input from multiple
users can end up with a higher number of errors.
Isolated, discontinuous, or continuous speech
Isolated speech means a single spoken word is inputted for recognition. Discontinuous
speech is a full sentence that has words separated by silence, while continuous speech
would be naturally spoken sentences. A speech recognition system would have a much
better chance at recognizing an isolated word over a word that is part of continuous
speech.
Read vs. spontaneous speech
When preparing speech to be recognized by a system, read speech is prepared ahead
of time and will likely contain no errors. Spontaneous speech could contain
mispronounced words or incomplete sentences, resulting in a much more difficult
recognition attempt.
1.2 Neural Networks
A human brain works differently than a typical computer. While a computer can use a processor
to execute instructions and access local memory at incredible speeds, a human brain is a slower
collection of nodes called neurons. These neurons are connected by synapses, which pass
electrical or chemical signals back and forth to each other. These connections are capable of
changing each time information is sent or received which represents the brain adapting and
learning information. A neural network is an artificial computer generated system that attempts
to replicate the neural system of the human brain. Nodes are used to represent neurons, and they
are connected with different weights to represent the synapses. These networks can be trained to
process different information and learn from it. Naturally these systems are used to mimic certain
functions that the human brain is capable of, such as pattern recognition and decision making.
1.3 Thesis in Brief
The purpose of this thesis is to implement a speech recognition system using an artificial neural
network. Due to all of the different characteristics that speech recognition systems depend on, I
decided to simplify the implementation of my system. I will be implementing a speech
recognition system that focuses on a set of isolated words. The words “Yes,” “No,” “Hello,”
“Stop,” and “Go” will be recorded to eliminate the problem of spontaneous speech. My speech
recognition system will be trained using only my voice, which will eliminate the need to
compensate for users with different voices. With this system I hope to gain an understanding of
how speech recognition can be implemented, and how neural networks can be used to implement
advanced artificial intelligence problems.
1.4 Thesis Outline
The first chapters of this thesis present the background and fundamentals of neural networks
and speech recognition:
Chapter 2 will review neural networks
Chapter 3 will review speech recognition
The remainder of the chapters will discuss the details of my own neural network implementation
as well as the testing and results. The thesis will conclude with an evaluation of the system and
a discussion of possible future work:
Chapter 4 will review my neural network
Chapter 5 will review my implementation of speech recognition
Chapter 6 will discuss the results of testing the system
Chapter 7 will discuss the conclusion and future work
Chapter 2
Neural Networks
This chapter will begin with an analysis of a biological neural network. After presenting this
concept I will discuss how it is translated into artificial neural networks, and the different
structures and training methods of specific neural networks.
2.1 Biological Neural Networks
Artificial neural networks have been modelled after the structure of the human brain. The
average adult human brain weighs approximately 1.5 kg and has a volume of around 1260 cubic
centimeters. The brain is composed of approximately 200 billion neurons connected by 125
trillion synapses, along with glial cells and blood vessels. The neuron can be separated into
three major parts: the cell body (soma), the dendrites, and the axon [2].
Figure 2.1 – Simplified Biological Neurons
The dendrites contain thin branches that receive signals from surrounding neurons. Each branch
of a dendrite is connected to a single neuron through the small connection of a synapse. A signal
of either chemical diffusion or electrical impulses is transmitted through the thin cylinder of the
axon, through the synapse connection, and into the connected dendrite of a specific neuron [3].
When the neurons perform a specific task, a certain number of neurons must work together in
order to complete it. Input signals are distributed across the required neurons at different
strengths or frequencies. All of these input signals are then added together and processed by a
threshold function. The function then produces an output signal derived from all of the inputs. It
is this entire process that underlies how humans are able to learn new information and solve
problems. While a processing time of about 1 ms per cycle and a signal transmission speed of
0.6 to 120 m/s are slower than a modern computer's, the capability of learning information to
recognize and solve various problems is the reason why the neural network of the human brain
is an excellent model for artificial intelligence [4].
2.2 Artificial Neural Networks
In the late 19th century a neuroscientist named Santiago Ramón y Cajal conducted studies on the
human nervous system. Cajal discovered that the human nervous system was actually composed
of discrete neurons that communicated with signals passed through axons, dendrites and
synapses [5]. This discovery led to further research that identified different types of neurons, as
well as the types of signals that were being passed between them. Neurobiologists were finding it
very difficult however, to understand how the neurons were working together to achieve such a
high level of functionality.
It was not until the advancement of modern computing that researchers were able to build
working models of neural systems. These systems gave a better understanding of how the neural
system of the brain functioned. Warren McCulloch and Walter Pitts created an early model of an
artificial neuron in 1943. This model became known as a linear binary threshold gate. Each of a
series of given inputs is associated with a weight value, normalized in the range of either (0, 1)
or (-1, 1), and a weighted sum of the inputs is calculated. Given a certain threshold, the
output would be one of two binary classes, based on whether the sum exceeded the threshold
or not [6].
Figure 2.2 – Symbolic Illustration of Linear Threshold Gate
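The behaviour of this gate is simple enough to sketch in a few lines of code. The following Python function is only an illustrative sketch of the idea (the names are my own, not taken from McCulloch and Pitts):

```python
def threshold_gate(inputs, weights, threshold):
    """Linear binary threshold gate: output 1 if the weighted sum of the
    inputs reaches the threshold, otherwise 0."""
    weighted_sum = sum(x * w for x, w in zip(inputs, weights))
    return 1 if weighted_sum >= threshold else 0
```

With weights (1, 1) and a threshold of 2, for example, the gate computes a logical AND of two binary inputs.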
With this model of an artificial neuron it was now proven that systems of neurons that were
assembled into a finite state automaton could compute any arbitrary function, when given
suitable values for the weights between the neurons. Researchers went even further by
implementing learning procedures that were capable of automatically finding appropriate weight
values, which enabled a network to compute any specific function. As the years passed and more
research was performed, basic attributes of neural networks were defined, different training
methods were created, and new network architectures were implemented.
2.2.1 Fundamentals of Neural Networks
While there are many different architectures of neural networks, each one of these systems
contains similar basic attributes. These attributes are processing units, weights, a computation
method and a training method.
Processing Units
A neural network system is made up of a specific number of processing units that
represent neurons in a human brain. These units are typically divided into different
groups. Some units represent the input units that receive the data to process, for example
numerical values representing an image. Units that are hidden inside the neural network
work to manipulate and transform the inputted information. The final group of units that
can be known as output units represent the decision of the neural network.
Each neuron that receives data performs a defined function. The result of this function
is then shared with connected neurons. The value that is shared is called the activation
value. This value determines whether the receiving neuron will be activated or whether it will
remain inactive. There are several different activation functions, which map the
activation value to a range of either 0 to 1, or -1 to 1 [7].
Figure 2.3 – Activation Functions
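For illustration, three common activation functions can be written directly from their definitions. This is a Python sketch for the example only; the exact functions used vary between networks:

```python
import math

def step(x):
    """Hard threshold: activation of exactly 0 or 1."""
    return 1 if x >= 0 else 0

def sigmoid(x):
    """Smooth squashing of any input into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def tanh(x):
    """Smooth squashing of any input into the range (-1, 1)."""
    return math.tanh(x)
```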
Weights
The processing units in a neural network must be connected in some way. These
connections between neurons are displayed as lines in neural network diagrams. Each
connection is assigned a value of a weight. These weights represent how influential a
neuron is to another connected neuron. Typically the input from a neuron is multiplied by
the weight of the connection that is sending the data. This value is then entered into the
activation function of the network. If the value exceeds the given threshold of the
activation function then the neuron will be activated, otherwise the neuron will remain
inactive.
Computation Method
A neural network uses what is called a threshold logic unit in order to perform the
required computations stated above. This unit is an object that takes all of the different
inputs from connected neurons, sums them together, and then uses the activation function
to determine a correct output. This output is then either sent to the output layer of the
network, or onto the next connected neuron in another hidden layer.
Figure 2.4 – Threshold Logic Unit
The figure above displays the inputs from either the input layer of the network, or the
connected neurons, being modified by the given weight values of the connection. These
values are then summed together and entered into the activation function with an output
of y [8].
Training Method
In order for a neural network to perform efficiently and correctly it must be able to adapt
its weights in order to achieve desired outputs for any input. The training process of a
neural network is extremely important in order for the network to replicate a human brain
being able to learn and respond to new information. This process involves iterating inputs
of training data into the neural network. With some neural networks there will be a given
desired output along with this training data. Each time the data is inputted into the neural
network, the weights of the connections between the neurons may be updated in order to
achieve the desired output [9]. There are several different techniques to training neural
networks, each one being appropriate for different problems. We will explore these
training algorithms in more detail further on.
2.2.2 Neural Network Structures
Artificial neural networks can be classified into two significant groups: feedforward networks and
recurrent networks.
Feedforward Neural Networks
A feedforward network is a directed and weighted graph that contains no loops. The input
neurons have no connections leading to them, and the output neurons have no
connections leading away from them. As the values of the input nodes are set, using the
threshold logic unit, all other nodes in the network can set their values as well. A
feedforward neural network can either be a single layer network or a multilayer network
[10].
Figure 2.5 – Single Layer Feedforward Network [11]
Figure 2.6 – Multi Layer Feedforward Network [11]
As shown in Figure 2.5, a single layer feedforward network consists of a single layer of
weights connected from the inputs to the outputs. In Figure 2.6 there is a layer in between
the input and output layers. The layers that are in this position are referred to as hidden
layers that hold hidden neurons [11]. These extra layers can provide extra computation on
the data being processed by the neural network, allowing for more advanced functions to
be performed. With both of these networks the data being processed can only travel in the
direction of input to output, hence the term feedforward network. The more popular
option for feedforward networks is the multi-layered structure. These networks have been
found to generalize data very well, and often provide the correct output for an input that
was not in the initial training set of data [10].
Recurrent Neural Network
A recurrent neural network is a directed and weighted graph that contains at least one
feedback loop. This structure allows data to travel backwards through the neural
network and adjust weights based on new outputs. With output values now able to
influence the connection weights in the network, it can exhibit a more dynamic
temporal behavior.
Figure 2.7 – Recurrent Neural Network
In Figure 2.7, the neural network displayed has weight connections going from the input
layer, to a hidden layer, and then to an output layer. The difference in this network
compared to feedforward networks is that there is also a connection from the output layer
leading back to the input layer. With this network structure, each time data is processed
through the network it learns and adapts to it, unlike feedforward networks that only learn
from specific training data. This network structure is very useful in applications that
involve processing arbitrary sequences of inputs, like connected handwriting recognition
[12].
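A single step of such a recurrent layer can be sketched as follows. This Python fragment is an illustration only; the weight names Wx, Wh and the bias b are my own, and real recurrent networks differ in many details:

```python
import math

def rnn_step(x, h, Wx, Wh, b):
    """One recurrent step: the new hidden state depends on both the current
    input x and the previous hidden state h, which is the feedback loop
    that distinguishes recurrent from feedforward networks."""
    return [math.tanh(sum(wx * xi for wx, xi in zip(Wx[k], x)) +
                      sum(wh * hi for wh, hi in zip(Wh[k], h)) + b[k])
            for k in range(len(b))]
```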
2.2.3 Training Algorithms
In order to ensure that a neural network is performing at its peak efficiency, the weights in the
network must be adjusted in order to properly obtain the correct output desired. There are three
main methods of allowing a neural network to learn required information: supervised
learning, reinforcement learning and unsupervised learning.
Supervised Learning
Neural networks that implement supervised learning require a set of training data, which
are arranged as input vectors. Next there needs to be a target vector that determines how
well the inputs are learned, and acts as a guide to adjust weight values in order to reduce
errors.
The output of a feedforward neural network for a given pattern $z_p$ is calculated with a
single pass through the network. For each output unit $o_k$, we have

$o_{k,p} = f_{o_k}\left(\sum_{j=1}^{J+1} w_{kj}\, f_{y_j}\left(\sum_{i=1}^{I+1} v_{ji}\, z_{i,p}\right)\right)$

where $f_{o_k}$ and $f_{y_j}$ are respectively the activation functions of output unit $o_k$ and
hidden unit $y_j$; $w_{kj}$ is the weight between output unit $o_k$ and hidden unit $y_j$; $v_{ji}$
is the weight between hidden unit $y_j$ and input unit $z_i$; $z_{i,p}$ is the value of input unit
$z_i$ for input pattern $z_p$; and the $(I+1)$-th input unit and the $(J+1)$-th hidden unit are bias
units representing the threshold values of neurons in the next layer [13].
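The single forward pass described above can be sketched directly in code. The following Python function is an illustrative example that assumes sigmoid activations in both layers; the names V and W for the hidden and output weight matrices are my own:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward_pass(z, V, W):
    """Single forward pass of a one-hidden-layer feedforward network.
    z: input pattern of length I; V: hidden-layer weights (J rows of I+1
    values); W: output-layer weights (K rows of J+1 values). The extra
    column in each weight matrix multiplies a constant bias unit of 1."""
    z = list(z) + [1.0]                 # the (I+1)-th bias input unit
    y = [sigmoid(sum(v * zi for v, zi in zip(row, z))) for row in V]
    y = y + [1.0]                       # the (J+1)-th bias hidden unit
    return [sigmoid(sum(w * yi for w, yi in zip(row, y))) for row in W]
```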
Reinforcement Learning
In reinforcement learning, also referred to as semi-supervised learning, there is no
explicitly desired output for the neural network. Unlike supervised learning, the learning
of an input-output mapping is performed through signals that represent good or bad
behavior. Reinforcement learning can be difficult to implement because there is no target
to provide a desired response from the neural network [14].
Figure 2.8 – Block Diagram of Reinforcement Learning [14]
Unsupervised Learning
In a neural network that implements unsupervised learning there is no target output
provided or any signals to indicate whether an action was good or bad. It is left up to the
network to observe patterns and regularities in the input data. Once a network has learned
the patterns of the input data, it will be able to attempt to identify the same patterns in
new inputs, or classify them as a new class of output. One method to perform
unsupervised learning is the competitive learning rule. With this rule a network can
consist of an input layer that will receive data and a competitive layer of neurons that will
be competing with each other. These neurons will attempt to respond to certain features
in the input data, however only one neuron will be activated while the others remain
inactive [15]. This type of learning is much more difficult to implement and is suited for
much more complicated problems than speech recognition.
Figure 2.9 – Competitive Neural Network
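One winner-take-all learning step can be sketched as below. This Python fragment is an illustration under assumed details (a squared-distance match measure and a small learning rate), not a description of any particular implementation:

```python
def competitive_step(x, W, lr=0.1):
    """One competitive learning step: only the unit whose weight vector is
    closest to the input x is activated, and only that winner's weights
    are moved toward x; all other units remain inactive and unchanged."""
    dists = [sum((wi - xi) ** 2 for wi, xi in zip(w, x)) for w in W]
    winner = dists.index(min(dists))
    W[winner] = [wi + lr * (xi - wi) for wi, xi in zip(W[winner], x)]
    return winner
```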
Chapter 3
Speech Recognition
This chapter will begin with an analysis of the fundamentals of speech recognition. Following
the basic concepts there will be a review of the popular Dynamic Time Warping algorithm. Once
these topics are presented there will be a review of the use of Hidden Markov Models within
speech recognition, as well as the alternate approach of using Neural Networks.
3.1 Fundamentals of Speech Recognition
Speech recognition is a multileveled pattern recognition task, in which acoustical signals are
examined and structured into a hierarchy of sub-word units, words, phrases, and sentences. Each
level may provide additional temporal constraints such as known pronunciations or legal word
sequences [15]. Figure 3.1 displays the fundamental elements of a standard speech recognition
system.
Figure 3.1 – Structure of a Standard Speech Recognition System [15]
Raw Speech
Speech is sampled at a standard frequency of 16 kHz through a microphone. This sampling
yields a sequence of amplitude values over time [16].
Signal Analysis
Sampled raw speech must be transformed and compressed in order to simplify the
recognition process. There are several popular techniques that extract features from raw
speech and compress the data without significant loss of information. Fourier analysis,
Perceptual Linear Prediction, Linear Predictive Coding and Cepstral analysis can all process the raw
speech into a more usable state [17].
Speech Frames
Once raw speech is processed and analyzed, the audio is broken up into speech frames.
These frames are typically 10ms intervals of the processed audio and provide unique
information relative to the speech recognition process [18].
Figure 3.2 – Image Frame Extraction [18]
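As a concrete example of this step, a signal sampled at 16 kHz cut into 10 ms frames yields 160 samples per frame. The sketch below (Python, with names of my own choosing) uses non-overlapping frames for simplicity, although practical systems usually overlap adjacent frames:

```python
def frame_signal(samples, sample_rate=16000, frame_ms=10):
    """Split a sampled signal into fixed-length, non-overlapping frames;
    any leftover samples shorter than one frame are discarded."""
    frame_len = int(sample_rate * frame_ms / 1000)  # 160 samples at 16 kHz
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, frame_len)]
```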
3.2 Dynamic Time Warping Algorithm
Speech recognition can be complicated for a number of different reasons. When comparing a
sample for recognition against training data, the test sample may be of a different duration than
the training sample. One way to solve this problem is to normalize the speech samples so that
they all have the same duration. Another problem with recognizing speech is that the rate at
which the words are spoken may not always be constant, which would mean the optimal
alignment between a test sample and the training sample may be nonlinear.
An instance of dynamic programming known as Dynamic Time Warping is an efficient method
to solve the time alignment problem in speech recognition. The Dynamic Time Warping
algorithm attempts to align the test vector and training vector by warping the time axis
repetitively until an optimal match between the two vectors is found. The algorithm performs a
piecewise linear mapping of the time axis to align both vectors. For example:
Given two sequences of a feature vector in an n-dimensional space.
x = [x1, x2, …, xn] and y = [y1, y2, …, yn]
These two sequences are aligned on the sides of a grid, with one on the top and the other on the
left hand side. Both sequences start on the bottom left of the grid.
Figure 3.3 – Global Distance Grid
In each cell of the grid in Figure 3.3, a distance measure is placed, comparing elements of the
two sequences. The distance between two points is calculated as the Euclidean distance:

$\mathrm{Dist}(x, y) = \lvert x - y \rvert = \left[(x_1 - y_1)^2 + (x_2 - y_2)^2 + \cdots + (x_n - y_n)^2\right]^{1/2}$

The best alignment for the two sequences is the path through the grid that minimizes the total
distance between them. This total distance is found by searching over the possible routes through
the grid: it is the minimum, over all paths, of the sum of the distances between the individual
elements on the path, divided by the sum of the weighting function [19].
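The recurrence described above can be written out for one-dimensional sequences as follows. This is a minimal illustrative Python version of Dynamic Time Warping that uses |a − b| as the local distance and omits the weighting function and normalization:

```python
def dtw_distance(x, y):
    """Minimal DTW: minimum total local distance over all monotonic
    alignments of sequences x and y, found by dynamic programming."""
    n, m = len(x), len(y)
    INF = float("inf")
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(x[i - 1] - y[j - 1])
            # extend the cheapest of the three allowed predecessor cells
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]
```

For identical sequences the distance is zero, and stretching a sequence in time (repeating an element) does not add any cost, which is exactly the time-warping behaviour described above.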
3.3 Hidden Markov Model Approach
One of the most widely used and successful approaches to speech recognition is using a Hidden
Markov Model. This model is essentially a collection of different states that are connected by
transitions. The model begins in a designated initial state, and at each step in the process, a
transition is used to reach a new state where an output is generated. This system is referred to as
hidden due to the fact that while outputs are observed over the course of running the system, the
sequence of states visited is hidden from the user.
Figure 3.4 – A Simple Hidden Markov Model Topology
A typical Hidden Markov Model consists of:
{s} = a set of states.
{aij} = a set of transition probabilities, where aij is the probability of the transition from
state i to state j.
{bi(u)} = a set of emission probabilities, where bi(u) describes the
likelihood of emitting a sound u while in state i.
In order to simplify the process, probabilities a and b only depend on the current state of the
model. This limits the parameters of training, and makes the training and testing of speech
recognition very efficient [20].
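Given these parameters, the likelihood of an observation sequence can be computed with the standard forward algorithm. The Python sketch below is illustrative only (observations are encoded as symbol indices, and the names are my own):

```python
def hmm_forward(obs, pi, A, B):
    """Forward algorithm: total probability of the observation sequence.
    pi[i]: probability of starting in state i; A[i][j]: probability of the
    transition from state i to state j; B[i][u]: probability of emitting
    symbol u while in state i."""
    n = len(pi)
    alpha = [pi[i] * B[i][obs[0]] for i in range(n)]
    for u in obs[1:]:
        alpha = [B[j][u] * sum(alpha[i] * A[i][j] for i in range(n))
                 for j in range(n)]
    return sum(alpha)
```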
3.4 Neural Network Approaches
Speech recognition can be defined as a pattern recognition problem. Because neural networks
perform well at pattern recognition, many researchers naturally applied neural
networks to speech recognition. The first attempts addressed nothing more than simple problems,
like classifying speech segments as voiced/unvoiced, or nasal/fricative/plosive. When researchers
succeeded in these experiments they decided to move on to the problem of phoneme
classification.
Figure 3.5 – Static and Dynamic Classification
As shown in Figure 3.5, there are two main approaches that can be used to recognize speech with
neural networks. These methods are static and dynamic classification. Static classification allows
the neural network to see all of the input speech at the same time, and the neural network makes
a single decision based on this input. On the other hand, dynamic classification allows the neural
network to only see a small window of the speech input. The small view window leads to a series
of decisions that need to be combined over the entire speech input. Static classification works
well when a neural network is used to recognize isolated words or phonemes. However, dynamic
classification proves to be much better at classifying words or sentences that are spoken together
[15].
Static Classification Approach
In 1988 Dr. Richard Lippmann and Dr. William Huang demonstrated that neural
networks were able to form complex decisions from speech inputs. Using a multilayer
perceptron with 2 inputs, 50 hidden neurons, and 10 outputs, the network was able to
process spoken vowels and form decision regions. After 50,000 iterations of training the
neural network, the decision regions became optimal and yielded very promising
classification results [21].
Figure 3.6 – The Multilayer Perceptron and Decision Regions
Dynamic Approach
One of the most difficult speech recognition tasks is implementing a system that can
process the E-set of English letters, “B, C, D, E, G, P, T, V, and Z.” Because all of these
letters share the same vowel sound, an 8% error rate for a neural network is actually
considered a good result.
In 1989 Dr. Alexander Waibel had excellent results for phoneme recognition using a
Time Delay Neural Network. His structure consisted of an input layer of 3 delays, and a
hidden layer of 5 delays. The final output was computed by integrating over 9 frames of
phoneme activations in the second hidden layer.
Figure 3.7 – Time Delay Neural Network Structure
This network was trained and tested on 2000 samples of phonemes /b, d, g/ that were
manually excised from a database containing 5260 Japanese words. In the end Waibel
achieved an error rate of only 1.5%, which can be compared to 6.5% achieved by a
simple Hidden Markov Model recognition system [22].
Chapter 4
Design of Neural Network
This chapter will discuss the limitations my neural network will abide by, in order to reduce the
scope of my implementation. The structure of the neural network will be described, as well as the
algorithms used to train the network and process the audio samples.
4.1 Limitations of Neural Network
With the complexity of building a speech recognition system, I decided to place some
limitations on my system in order to reduce the scope of the project.
Vocabulary Size and Confusability
As the size of the vocabulary for a speech recognition system increases, so does the
amount of confusability. Many words and letters in the English language sound very
similar, which can cause a neural network to arrive at an incorrect output. In order to
reduce the vocabulary and in turn the confusability of my system, it will be used to
recognize 5 simple words: “Go,” “Hello,” “No,” “Stop” and “Yes.”
Speaker Dependence vs. Independence
When creating a speech recognition neural network, naturally the network begins to adapt
to the training data. If the audio samples used to train the network originate from different
speakers, the network will have a more difficult time distinguishing the spoken words than a
network trained on samples from a single speaker. In order to reduce both the recognition
difficulty and the training time, I will train the neural network using audio samples spoken
only by myself. The test word samples will likewise be recordings of my own voice, so as
not to confuse the network further.
Isolated, Discontinuous, or Continuous Speech
Continuous speech such as naturally spoken sentences can increase the difficulty of a
speech recognition program by a large factor. With natural language there are often
pauses, mispronounced words, or words that are not in the particular vocabulary.
Discontinuous speech must involve separating sentences into individual words or
phonemes. In order to reduce the amount of processing that the audio samples must
undergo, I will record the 5 simple words as isolated words in order to easily detect the
boundaries of each word.
Read vs. Spontaneous Speech
When speech is spoken spontaneously, it can contain many extraneous elements. There
can be pauses in the sentences, coughing or sneezing, and common filler words like “um”
or “uh.” With my speech recognition system the audio samples will be recorded as if they
are being read from a script. With no elements of spontaneous speech involved, and the
words being isolated, the neural network and audio processing will not need to be as
complex as other implementations.
Real-time Recognition vs. Recorded Samples
In most speech recognition systems, the training data samples must be recorded and
processed in order to train the system. The testing of the system can occur in one of two
ways, either in real-time or with recorded test samples. With real-time recognition, audio
samples can contain various background noises that distort the speech being recorded.
Algorithms and further audio processing must be implemented in order to ensure that
only the spoken speech will be used to train the system. In my speech recognition system
I will be pre-recording both the training data samples as well as the test data samples.
This will minimize the amount of background noise that can occur during the audio
samples.
4.2 Structure of Neural Network
A common form of a neural network is a 3-layer, fully connected, feedforward multilayer
perceptron.
Figure 4.1 – Simple Fully Connected Feedforward Multilayer Perceptron [23]
This type of network is arranged in 3 layers: an input layer, a hidden layer, and an output layer.
The training of this network involves a feature vector, or training matrix, and a class
vector, or target matrix. The input training matrix is applied to the network and an output is
calculated. This output is then compared to the target matrix, and the error for each class is
computed. As the training algorithm continues, the weights in the network are adjusted in order to
minimize any errors in recognition. The longer training continues, the better the chance
that the neural network will assign the highest value to the correct class for a specific input.
With my speech recognition system being limited to 5 specific words, essentially 5 classes, this
structure of neural network suits my needs perfectly. The next step in designing the neural
network is deciding on the amount of input, hidden, and output nodes. The output nodes are
simply the number of classes that the network will have to work with, which in this case will be
5, a class for each word to be recognized. The number of input nodes equals the number of
features extracted from each individual word. Using the Mel-frequency cepstral coefficient
analysis provided by the Auditory Toolbox [24] function mfcc(), which will be discussed
further on, each individual word sample is analyzed and processed into 130 features.
We now have 130 input nodes and 5 output nodes for the neural network. In order to determine
an optimal number of hidden neurons, there exist three common rule-of-thumb approaches:
The number of hidden neurons should be between the size of the input layer and the size
of the output layer.
The number of hidden neurons should be 2/3 the size of the input layer, plus the size of
the output layer.
The number of hidden neurons should be less than twice the size of the input layer [25].
For my neural network I decided to follow these guidelines and take 2/3 of the total of input and
output neurons, resulting in 90 hidden neurons.
Figure 4.2 – Structure of Neural Network
The combination of all the parameters resulted in my neural network with 130 input nodes, a
layer of 90 hidden nodes, and an output layer of 5 nodes representing 5 classifications of words.
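As a quick arithmetic check of these rules of thumb (the function below and its names are mine, for illustration only), note that the second rule applied literally to the input layer alone gives roughly 92, while taking 2/3 of the combined 135 input and output neurons, as was done here, gives exactly 90:

```python
def hidden_size_heuristics(n_in, n_out):
    """The three common rules of thumb for choosing a hidden-layer size."""
    return {
        "between_in_and_out": (min(n_in, n_out), max(n_in, n_out)),
        "two_thirds_of_input_plus_output": round(2 * n_in / 3 + n_out),
        "upper_bound_exclusive": 2 * n_in,  # should stay below this
    }

print(hidden_size_heuristics(130, 5))
# The variant used in this thesis: 2/3 of the combined neuron count.
print(round(2 * (130 + 5) / 3))  # 90
```

Either reading lands in the same neighbourhood, well inside the first rule's range and far below the upper bound, so 90 hidden neurons is consistent with all three guidelines.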
4.3 Training Algorithm
For a neural network that is implementing some sort of pattern recognition, it is quite common
and beneficial to use a method called backpropagation. This method is a form of supervised
learning that starts by feeding the training data forward through the network. Once this data makes
it through the network, it generates output activations. The error in these activations is then
propagated backwards through the neural network, generating along the way a delta value
for each of the hidden and output neurons. The weights of the network are then updated using
these calculated delta values, which increases the speed and quality of the learning process [26].
Another algorithm, the Levenberg-Marquardt algorithm, modifies the original idea of
backpropagation and outperforms a variety of other training algorithms. Due to its efficiency and
speed it is one of the most widely used optimization algorithms for problems of this size, which
is why I chose to use it to train my neural network.
Backpropagation Algorithm
Consider a multilayer feedforward network, such as a three-layer network.
The net input to unit i in layer k + 1 is
n^{k+1}(i) = Σ_{j=1}^{S_k} w^{k+1}(i, j) a^k(j) + b^{k+1}(i)    (1)
The output of unit i will be
a^{k+1}(i) = f^{k+1}(n^{k+1}(i))    (2)
For an M layer network the system equations in matrix form are given by
a^0 = p    (3)
a^{k+1} = f^{k+1}(W^{k+1} a^k + b^{k+1}),   k = 0, 1, …, M − 1    (4)
The task of the network is to learn associations between a specified set of input-output
pairs {(p1, t1), (p2, t2) … (pQ, tQ)}.
The performance index for the network is
V = (1/2) Σ_{q=1}^{Q} (t_q − a_q^M)^T (t_q − a_q^M) = (1/2) Σ_{q=1}^{Q} e_q^T e_q    (5)
where a_q^M is the output of the network when the qth input p_q is presented,
and e_q = t_q − a_q^M is the error for the qth input. For the standard backpropagation algorithm
we use an approximate steepest descent rule. The performance index is approximated by
V̂ = (1/2) e_q^T e_q    (6)
where the total sum of squares is replaced by the squared errors for a single input/output
pair. The approximate steepest (gradient) descent algorithm is then
Δw^k(i, j) = −α ∂V̂/∂w^k(i, j)    (7)
Δb^k(i) = −α ∂V̂/∂b^k(i)    (8)
where α is the learning rate. Define
δ^k(i) ≡ ∂V̂/∂n^k(i)    (9)
as the sensitivity of the performance index to changes in the net input of unit i in layer k.
Now it can be shown, using (1), (6), and (9), that
∂V̂/∂w^k(i, j) = δ^k(i) a^{k−1}(j)    (10)
∂V̂/∂b^k(i) = δ^k(i)    (11)
It can also be shown that the sensitivities satisfy the following recurrence relation
δ^k = Ḟ^k(n^k) (W^{k+1})^T δ^{k+1}    (12)
where Ḟ^k(n^k) is the diagonal matrix of activation-function derivatives ḟ^k(n^k(i)).
This recurrence relation is initialized at the final layer
δ^M = −Ḟ^M(n^M) e_q    (15)
The overall learning algorithm now proceeds as follows: first, propagate the input
forward using (3)-(4); next, propagate the sensitivities back using (15) and (12); and
finally, update the weights and offsets using (7), (8), (10), and (11) [27].
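The procedure above can be condensed into code. The following Python/NumPy sketch is a minimal two-layer illustration with sigmoid units; the names, learning rate, and the XOR demonstration are my own choices, not part of the thesis implementation:

```python
import numpy as np

def sigmoid(n):
    return 1.0 / (1.0 + np.exp(-n))

def backprop_step(W1, b1, W2, b2, p, t, alpha=0.5):
    """One approximate steepest-descent step on a single (p, t) pair."""
    # Forward pass: a0 = p, then a_{k+1} = f(W_{k+1} a_k + b_{k+1}).
    a1 = sigmoid(W1 @ p + b1)
    a2 = sigmoid(W2 @ a1 + b2)
    e = t - a2                       # error for this pair
    # Sensitivities: final layer first, then the recurrence backwards.
    d2 = -(a2 * (1 - a2)) * e
    d1 = (a1 * (1 - a1)) * (W2.T @ d2)
    # Steepest-descent updates of weights and offsets.
    W2 -= alpha * np.outer(d2, a1); b2 -= alpha * d2
    W1 -= alpha * np.outer(d1, p);  b1 -= alpha * d1
    return 0.5 * float(e @ e)        # squared error before the update

# Learn XOR with 4 hidden units as a tiny demonstration.
rng = np.random.default_rng(1)
W1, b1 = rng.standard_normal((4, 2)), np.zeros(4)
W2, b2 = rng.standard_normal((1, 4)), np.zeros(1)
pairs = [([0., 0.], [0.]), ([0., 1.], [1.]), ([1., 0.], [1.]), ([1., 1.], [0.])]
pairs = [(np.array(p), np.array(t)) for p, t in pairs]
errors = [sum(backprop_step(W1, b1, W2, b2, p, t) for p, t in pairs)
          for _ in range(3000)]
print(errors[0], errors[-1])
```

The total squared error of the last epoch falls well below that of the first, which is the behaviour the delta updates are designed to produce.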
Levenberg-Marquardt Modification
While backpropagation is a steepest descent algorithm, the Levenberg-Marquardt
algorithm is an approximation to Newton's method. Suppose that we have a function
V(x) which we want to minimize with respect to the parameter vector x; then Newton's
method would be
Δx = −[∇²V(x)]⁻¹ ∇V(x)    (16)
where ∇²V(x) is the Hessian matrix and ∇V(x) is the gradient. If we assume that
V(x) is a sum of squares function
V(x) = Σ_{i=1}^{N} e_i²(x)    (17)
then it can be shown that
∇V(x) = J^T(x) e(x)    (18)
∇²V(x) = J^T(x) J(x) + S(x)    (19)
where J(x) is the Jacobian matrix
J(x) = [∂e_i(x)/∂x_j],   i = 1, …, N;  j = 1, …, n    (20)
and S(x) = Σ_{i=1}^{N} e_i(x) ∇²e_i(x)    (21)
For the Gauss-Newton method it is assumed that S(x) ≈ 0, and the update (16) becomes
Δx = −[J^T(x) J(x)]⁻¹ J^T(x) e(x)    (22)
The Levenberg-Marquardt modification to the Gauss-Newton method is
Δx = −[J^T(x) J(x) + µI]⁻¹ J^T(x) e(x)    (23)
The parameter µ is multiplied by some factor (β) whenever a step would result in an
increased V(x). When a step reduces V(x), µ is divided by β. Notice that when µ is large
the algorithm becomes steepest descent (with step 1/µ), while for small µ the algorithm
becomes Gauss-Newton. The Levenberg-Marquardt algorithm can be considered a
trust-region modification to Gauss-Newton.
The key step in this algorithm is the computation of the Jacobian matrix. For the neural
network mapping problem the terms in the Jacobian matrix can be computed by a simple
modification to the backpropagation algorithm. The performance index for the mapping
problem is given by (5). It is easy to see that this is equivalent in form to (17), where
x = [w^1(1, 1) w^1(1, 2) … w^1(S_1, R) b^1(1) … b^1(S_1) w^2(1, 1) … b^M(S_M)]^T,
and N = Q × S_M.
Standard backpropagation calculates terms like
∂V̂/∂x_l = ∂(e_q^T e_q)/∂x_l    (24)
For the elements of the Jacobian matrix that are needed for the Levenberg-Marquardt
algorithm we need to calculate terms like
∂e_q(i)/∂x_l    (25)
These terms can be calculated using the standard backpropagation algorithm with one
modification at the final layer
Δ^M = −Ḟ^M(n^M)    (26)
Note that each column of the matrix in (26) is a sensitivity vector which must be
backpropagated through the network to produce one row of the Jacobian.
Therefore the Levenberg-Marquardt algorithm proceeds as follows:
1) Present all inputs to the network and compute the corresponding network outputs
(using (3) and (4)) and errors (e_q = t_q − a_q^M). Compute the sum of squared errors
over all inputs, V(x).
2) Compute the Jacobian matrix (using (26), (12), (10), (11), and (20)).
3) Solve (23) to obtain Δx.
4) Recalculate the sum of squares of errors using x + Δx. If this new sum of squares is
smaller than that computed in step 1, then divide µ by β, let x = x + Δx, and go back
to step 1. If the sum of squares is not reduced, then multiply µ by β and go back to
step 3.
5) The algorithm is assumed to have converged when the norm of the gradient (18) is
less than some predetermined value, or when the sum of squares has been reduced to
some error goal [27].
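The five steps above translate almost directly into code. Below is a generic Python/NumPy sketch of the algorithm, applied to a small curve-fitting problem rather than to a network; the function names, the µ and β values, and the stopping tolerance are my own illustrative choices:

```python
import numpy as np

def levenberg_marquardt(residual, jacobian, x, mu=0.01, beta=10.0,
                        tol=1e-8, max_iter=100):
    """Minimize the sum of squared residuals V(x) = e(x)' e(x)."""
    e = residual(x)
    V = float(e @ e)
    for _ in range(max_iter):
        J = jacobian(x)
        grad = J.T @ e                    # gradient of V (up to a factor of 2)
        if np.linalg.norm(grad) < tol:    # step 5: converged
            break
        while mu < 1e12:                  # steps 3-4: adapt mu
            dx = np.linalg.solve(J.T @ J + mu * np.eye(len(x)), -grad)
            e_new = residual(x + dx)
            V_new = float(e_new @ e_new)
            if V_new < V:                 # success: accept step, shrink mu
                x, e, V, mu = x + dx, e_new, V_new, mu / beta
                break
            mu *= beta                    # failure: grow mu, retry
    return x

# Fit y = a * exp(b * t) to noiseless data generated with a = 2, b = -1.
t = np.linspace(0.0, 2.0, 20)
y = 2.0 * np.exp(-t)
residual = lambda x: x[0] * np.exp(x[1] * t) - y
jacobian = lambda x: np.column_stack([np.exp(x[1] * t),
                                      x[0] * t * np.exp(x[1] * t)])
x_fit = levenberg_marquardt(residual, jacobian, np.array([1.0, 0.0]))
print(x_fit)
```

The same loop applies to network training once the residuals are the per-pattern errors and the Jacobian rows are produced by the modified backpropagation pass described above.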
4.4 Audio Processing Algorithm
In order to properly recognize speech patterns with my neural network I need to process the
audio samples and retrieve features that can define them. Commonly used features from audio
samples are mel-frequency cepstral coefficients, which together make up a mel-frequency
cepstrum. In order to derive the coefficients from an audio clip represented in a matrix, there are
steps that need to be followed:
1) Windowing: In the first stage, the signal is multiplied with a tapered window (usually
a Hamming or Hanning window). The windowed speech frames are given by
ŝ = s ∘ w,
where s is a matrix containing the framed speech, w is another matrix whose T rows contain
the same window function of size N, and ∘ denotes entry-wise matrix multiplication.
2) Zero-padding: Zero-padding is required to compute the power spectrum using the fast
Fourier transform (FFT). A sufficient number of zeroes is padded using the following
matrix operation:
ŝ_p = ŝ [I  O],
where I is an identity matrix of size N × N and O is a null matrix of size N × (M − N).
Here M is a power of two and is greater than N.
3) DFT computation: The windowed speech frames are multiplied with the twiddle factor
matrix (W) to formulate the discrete Fourier transform (DFT) coefficients (Ω). Half of the
twiddle factor matrix is sufficient due to the conjugate symmetric property of the Fourier
transform. This operation can be expressed as
Ω = ŝ_p W
4) Power spectrum computation: The power spectrum (Θ) is computed by entry-wise
multiplying the DFT coefficients with their conjugates. This can be written as
Θ = Ω ∘ Ω*
5) Filter bank log energy computation: The speech signal is passed through a triangular
filter bank of frequency response (Λ) which contains p filters, linearly spaced on the Mel
scale. The log energy output (Ψ) of the filter bank is given by
Ψ = log(Θ Λ^T)
6) DCT computation: In the final stage of MFCC computation, Ψ is multiplied with the
discrete cosine transform (DCT) matrix D to create the final coefficients (x). Therefore
x = Ψ D,
where each column of D is a p-dimensional orthogonal basis vector of the DCT. However,
since the first coefficient is the DC coefficient and is discarded, multiplication with a
p × (p − 1) matrix is adequate in the DCT computation [28].
These 6 steps result in a new matrix of extracted features from audio samples, which can then be
used as training data for a neural network.
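As an illustration of the six steps, here is a compact Python/NumPy sketch. It is not the Auditory Toolbox implementation used later in this thesis: the filter-bank construction, the filter and coefficient counts, and every name in it are simplified assumptions for demonstration.

```python
import numpy as np

def mfcc_sketch(frames, fs, p=13, n_coeffs=12):
    """The six MFCC steps for framed speech (one frame per row)."""
    T, N = frames.shape
    # 1) Windowing: entry-wise multiply each frame by a Hamming window.
    windowed = frames * np.hamming(N)
    # 2) Zero-padding up to the next power of two M >= N.
    M = 1 << (N - 1).bit_length()
    padded = np.hstack([windowed, np.zeros((T, M - N))])
    # 3) DFT: half the spectrum suffices (conjugate symmetry).
    omega = np.fft.rfft(padded, axis=1)
    # 4) Power spectrum: entry-wise product with the conjugate.
    theta = (omega * np.conj(omega)).real
    # 5) Triangular filter bank, linearly spaced on the mel scale.
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    imel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    edges = imel(np.linspace(0.0, mel(fs / 2.0), p + 2))
    bins = np.floor((M + 1) * edges / fs).astype(int)
    fbank = np.zeros((p, theta.shape[1]))
    for i in range(p):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        if c > l:
            fbank[i, l:c] = (np.arange(l, c) - l) / (c - l)
        if r > c:
            fbank[i, c:r] = (r - np.arange(c, r)) / (r - c)
    psi = np.log(theta @ fbank.T + 1e-10)   # log filter-bank energies
    # 6) DCT of the log energies, discarding the DC coefficient.
    n_idx = np.arange(p)
    D = np.cos(np.pi * (n_idx[:, None] + 0.5) * n_idx[None, :] / p)
    return (psi @ D)[:, 1:n_coeffs + 1]

rng = np.random.default_rng(0)
coeffs = mfcc_sketch(rng.standard_normal((5, 200)), fs=2000)
print(coeffs.shape)  # one 12-coefficient row per frame
```

Flattening such a matrix of per-frame coefficients into a single vector yields the fixed-length feature vector that a feedforward network expects.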
Chapter 5
Neural Network Implementation
This chapter will discuss the details of my neural network implementation. The training of audio
files and test audio files through Matlab will be reviewed, as well as the programming behind the
neural network in Matlab.
5.1 Materials
In order to design and implement a speech recognition system using a neural network I needed to
decide on a programming language that I would use. Originally I decided to use the Fast
Artificial Neural Network library through the Java programming language. However, this library
did not link into Java very well and was proving difficult to use. After researching other methods
to program neural networks, I decided to attempt to use the program Matlab. Using Matlab to
implement a neural network proved to be much more user friendly than I had anticipated. Linked
within Matlab was a Neural Network Toolbox that provided many of the functions needed to
implement any type of neural network.
For the audio processing I used Matlab to record audio through my computer soundcard. Once
the wav files were recorded I used a library called Auditory Toolbox [24] that could be linked
with Matlab. This library contained the Mel-frequency Cepstrum Coefficient algorithm that I had
decided to use in order to retrieve unique features from my audio samples.
5.2 Training Data Preparation
In order to implement speech recognition with my neural network, I needed to record, analyze
and manipulate audio files.
Figure 5.1 – Matlab Audio Recording
Using Matlab I defined a sampling frequency, Fs, and a duration, Duration. I then assigned
a new wav recording to a variable using the wavrecord function. I needed to manually trim the
audio files to get rid of the empty space where no speech was spoken, in order to improve
my network's error rates.
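The trimming here was done by hand; a simple automatic alternative can be sketched with a frame-energy threshold. Everything below (the function name, frame length, and threshold value) is a hypothetical illustration, not the procedure actually used in this thesis:

```python
import numpy as np

def trim_silence(signal, frame_len=100, threshold=0.02):
    """Keep only the span of frames whose RMS energy rises above
    a fixed threshold; a crude stand-in for manual trimming."""
    n_frames = len(signal) // frame_len
    frames = signal[:n_frames * frame_len].reshape(n_frames, frame_len)
    energy = np.sqrt((frames ** 2).mean(axis=1))
    active = np.flatnonzero(energy > threshold)
    if active.size == 0:
        return signal  # nothing above threshold; leave untouched
    return signal[active[0] * frame_len:(active[-1] + 1) * frame_len]

# A toy signal: silence, a burst of tone, then silence again.
sig = np.concatenate([np.zeros(500),
                      0.5 * np.sin(np.linspace(0, 60, 400)),
                      np.zeros(500)])
trimmed = trim_silence(sig)
print(len(sig), len(trimmed))
```

A fixed threshold like this is fragile in the presence of background noise, which is part of why manual trimming was the safer choice for a small data set.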
Using this recording method I recorded 10 different samples of each of the words “Go,” “Hello,”
“No,” “Stop” and “Yes.” Five samples of each word were set aside to test the neural network after
it was built; the other five samples of each word were used as training data. Using the Auditory
Toolbox function called mfcc I was able to retrieve the proper coefficients that could be used for
my speech recognition. I then needed to transform the matrix produced from the function into a
vector that could be entered into the neural network.
Figure 5.2 – MFCC function called and vector created
Figure 5.3 – Wav file and MFCC of “Go”
Figure 5.4 – Wav file and MFCC of “Hello”
Figure 5.5 – Wav file and MFCC of “No”
Figure 5.6 – Wav file and MFCC of “Stop”
Figure 5.7 – Wav file and MFCC of “Yes”
The last part of the preparation before building the neural network was to create the training
matrix and target matrix. I formed a matrix of size 130 x 25, that is, 130 features for each of
25 words, by taking the MFCC vector of each word and combining them into one matrix. I then
made a target matrix of size 5 x 25: five classes for 25 words.
5.3 Building the Neural Network
With all of the details of my neural network decided, I entered them into Matlab in order to build
the network. I defined the network inputs as the training matrix, and the network targets as the
target Matrix. I created a variable called hiddenLayerSize and assigned the 90 hidden neurons I
decided to use. Using the function patternnet, a standard feedforward neural network that could
classify inputs according to target classes was created.
Figure 5.8 – Creating the Neural Network
Next I defined the division of my input data for training, validation, and testing. I used a standard
division of 70% of the data for training, 15% of the data for validation, and 15% of the data for
testing.
Figure 5.9 – Setting Up Division of Data
I then defined the training algorithm that I wanted my neural network to implement, which was
the Levenberg-Marquardt algorithm. I used the standard mean squared error function to evaluate
the performance of my network, and began the training.
Figure 5.10 – Matlab Training Code
The training of the network took an average of three minutes each time I attempted it. The
network needed between 10 and 20 iterations before it was optimized by the Levenberg-
Marquardt algorithm and stopped improving.
Figure 5.11 – Training the Neural Network
In the end Matlab was very easy to use when specifying the characteristics of the neural network,
and the network was created without any problems.
5.4 Testing the Neural Network
Once the network was created and trained, I wrote a function in Matlab that would test its speech
recognition capabilities.
Figure 5.12 – testNetwork Function
This function takes the parameters testSoundFile and myNetwork in order to read in a recorded
wav file and test it against the network. After reading in the file, I extracted its features using the
mel-frequency cepstral coefficient algorithm, then simulated the trained neural network with the
new input vector. Using if-statements, I output which class the neural network decided the
input word belonged to.
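A more robust alternative to the rounding-based if-statements, sketched here in Python rather than Matlab, is to take the class with the highest output activation; the class ordering below is assumed to match the target matrix:

```python
import numpy as np

CLASS_NAMES = ["Go", "Hello", "No", "Stop", "Yes"]

def classify(result_vector, names=CLASS_NAMES):
    """Pick the class with the highest activation instead of rounding
    each output, so a decision is made even when no value rounds to 1."""
    result_vector = np.asarray(result_vector)
    return names[int(result_vector.argmax())]

print(classify([0.1, 0.2, 0.05, 0.6, 0.05]))  # Stop
```

This arg-max rule always produces an answer, whereas rounding can leave the network undecided when every activation falls below 0.5, a failure mode that appears in the real-time test discussed in the next chapter.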
Chapter 6
Results
In this chapter I will discuss the details of testing my neural network and the results it produced. I
will also discuss how the limitations may have altered these results.
6.1 Testing Results
Using the testing function that I wrote in Matlab, I tested all 5 test samples of each word in the
neural network. These test words were not involved in the training of the network, and therefore
could reveal whether the network correctly classified the words.
Figure 6.1 – Test Results for “Go”
Figure 6.2 – Test Results for “Hello”
Figure 6.3 – Test Results for “No”
Figure 6.4 – Test Results for “Stop”
Figure 6.5 – Test Results for “Yes”
6.2 Discussion of Results
When using a feedforward multilayer perceptron network with the Levenberg-Marquardt training
algorithm to classify the words “Go,” “Hello,” “No,” “Stop” and “Yes” based on their
mel-frequency cepstral coefficients, the test words were recognized with no errors. These results
were very positive and demonstrate the capabilities of neural networks when it comes to pattern
recognition.
Given that the network correctly identified all of the test words that were input, the results
were what I was hoping to achieve. However, the limitations that I set on this project may
have altered the end results in a positive direction.
When designing the speech recognition system I limited the vocabulary to 5 simple commands
which were recorded and spoken only by my voice. I needed to manually isolate the words so
that there was no confusion about where one word ended and another began. Together, these
limitations reduce the difficulty of the recognition task that the system must handle, and
therefore reduce the chance that the network will make a mistake.
After the initial testing was complete, I decided to try a real-time recognition test. Using the
recording function in Matlab to capture a spoken word, I automatically computed the
mel-frequency cepstral coefficients and simulated the neural network with them. As I expected,
the neural network could not make a complete decision, as there was no classification value that
could round up to one. I believe that this was because, even though the word was recorded in
isolation with little to no background noise, there was too much empty space around the word
that needed to be cut out. Having this much empty space around the word altered the
coefficients and ultimately caused the word to go unrecognized.
6.3 Difficulties
Through the implementation of my speech recognition system I encountered some difficulties:
1) When I first started to design the speech recognition system, it appeared to be a daunting
task due to all of the fundamentals detailed earlier. In order to reduce the complexity of
the system and ensure that I was able to complete it, I reduced the system to recognizing
five isolated words and focused on pre-recorded audio instead of real-time recognition.
2) Originally attempting to use the Fast Artificial Neural Network library and link it to the
Java programming language proved to be very complicated and overall not very user friendly.
This problem was easy to overcome once I decided to use Matlab to implement the
system.
3) When recording the audio samples to train and test the neural network, I originally used a
sampling rate of 8000. I tried to use a high rate in order to retrieve enough coefficients to
perform accurate recognition. However, when training the neural network, the number of
features per word was much higher and caused Matlab to run into the “Out of Memory”
error. In the end I reduced the sampling rate of my wav files to 2000, which allowed the
network to be trained in an acceptable amount of time and led to successful recognition.
4) When recording the words through Matlab, I often needed to re-record the word as there
was too much background noise in the wav file. My soundcard would pick up slight
noises which could alter the file and cause confusion within the network.
Chapter 7
Conclusions and Future Work
In this chapter I will discuss how the purpose of my thesis was fulfilled. I will also discuss future
work that this project can lead to.
7.1 Summary
I started this thesis set on implementing a speech recognition system with an artificial neural
network. I wanted to research different techniques behind speech recognition, as well as
techniques for using neural networks efficiently. While the scope of my system was reduced to
an isolated word recognition network, the results were still very positive. Despite limiting the
speech recognition side of the project, I gained an understanding of how neural networks can
tackle a problem like pattern recognition, as well as the benefits of certain structures and training
algorithms. In the end, I accomplished what I set out to and successfully implemented speech
recognition with a neural network.
7.2 Future Work
Now that I have gained an understanding of how speech recognition works with artificial neural
networks, I hope to expand upon my implementation. In the future I would like to implement an
algorithm that can filter out extraneous noise in audio files, so that I can upgrade my recognition
system into a real-time process. With further research into audio analysis, I hope to improve
upon the feature sets that I attempt to recognize, so that I can attempt more difficult problems
like recognizing phonemes from the E-set of the English language.
Bibliography
[1] I. R. Titze, Principles of Voice Production. Prentice Hall, 1994
[2] B. Busse et al, "Single-Synapse Analysis of a Diverse Synapse Population: Proteomic
Imaging Methods and Markers,” Neuron, vol. 68, pp. 639-653. November, 2010.
[3] Kiyoshi Kawaguchi. (2000, June, 17). Biological Neural Networks [Online]. Available:
http://www.ece.utep.edu/research/webfuzzy/docs/kk-thesis/kk-thesis-html/node10.html
[4] Albrecht Schmidt. (2000). Biological Neural Networks [Online]. Available:
http://www.teco.edu/~albrecht/neuro/html/node7.html
[5] Rodolfo R. Llinas, “The contribution of Santiago Ramon y Cajal to functional neuroscience,”
Nature Reviews Neuroscience, vol. 4. January 2003.
[6] Kiyoshi Kawaguchi. (2000, June, 17). The McCulloch-Pitts Model of Neuron [Online].
Available: http://wwwold.ece.utep.edu/research/webfuzzy/docs/kk-thesis/kk-thesis-
html/node12.html
[7] Informatics V – Scientific Computing. Pendulum Project [Online]. Available:
http://www5.in.tum.de/wiki/index.php/Pendulum_Project
[8] Andrew Blais, “An introduction to neural networks – Pattern learning with the back-
propagation algorithm,” IBM Developer Works. July 2001.
[9] Anil K. Jain, “Artificial Neural Networks: A Tutorial,” Michigan State University, MI,
March 1996.
[10] Lawrence Davis, David J. Montana, “Training Feedforward Neural Networks Using Genetic
Algorithms,” BBN Systems and Technologies Corp.
[11] Fundamentals of Neural Networks. [Online] Available:
http://www.myreaders.info/08_Neural_Networks.pdf
[12] Roman Bertolami et al, “A Novel Connectionist System for Unconstrained Handwriting
Recognition,” CH-6928 Manno-Lugano, Switzerland. May 9, 2008.
[13] Andries P. Engelbrecht, Computational Intelligence: An Introduction, 2nd ed. England: John
Wiley & Sons Ltd, 2007.
[14] Simon Haykin, Neural Networks A Comprehensive Foundation, 2nd ed. Pearson Prentice
Hall, 1999.
[15] Speech Recognition. [Online] Available:
http://www.learnartificialneuralnetworks.com/speechrecognition.html#neuralnetwork
[16] Engebretson et al, “Identifying Language from Raw Speech – An Application of
Recurrent Neural Networks,” Department of Computer Science, Washington University.
[17] Yuliang Feng; Xiange Sun, "Analysis and processing speech signal based on MATLAB,"
Electrical and Control Engineering (ICECE), 2011 International Conference on, vol., no.,
pp.555, 556, 16-18 Sept. 2011
[18] Yasunari Yoshitomi, Taro Asada and Masayoshi Tabuse (2011). Vowel Judgment for Facial
Expression Recognition of a Speaker, Speech Technologies, Prof. Ivo Ipsic (Ed.), ISBN:
978-953-307-996-7
[19] Shivanker Dev Dhingra, Geeta Nijhawan, Poonam Pandit, “Isolated Speech Recognition
Using MFCC and DTW,” International Journal of Advanced Research in Electrical,
Electronics and Instrumentation Engineering. Vol. 2, Issue 8, August 2013.
[20] Mark Gales, Steve Young, “The Application of Hidden Markov Models in Speech
Recognition,” Foundations and Trends in Signal Processing. Vol. 1, No. 3, 2007.
[21] William Y. Huang, Richard P. Lippmann, “Neural Net and Traditional Classifiers,” Neural
Information Processing Systems. Colo. 1987.
[22] Alexander Waibel et al, "Phoneme recognition using time-delay neural
networks," Acoustics, Speech and Signal Processing, IEEE Transactions on, vol.37,
no.3, pp.328,339, Mar 1989
[23] Artificial Neural Networks/Feed-Forward Networks. [Online] Available:
http://en.wikibooks.org/wiki/Artificial_Neural_Networks/Feed-Forward_Networks
[24] Malcolm Slaney, Auditory Toolbox. [Online] Available:
https://engineering.purdue.edu/~malcolm/interval/1998-010/
[25] Jeff Heaton, Introduction to Neural Networks with Java, 2nd ed. Heaton Research, Inc. 2008.
[26] Hecht-Nielsen, R., "Theory of the backpropagation neural network," Neural Networks,
1989. IJCNN. International Joint Conference on, vol., no., pp.593, 605 vol.1, 0-0 1989
[27] Hagan, M.T.; Menhaj, M.-B., "Training feedforward networks with the Marquardt
algorithm," Neural Networks, IEEE Transactions on, vol.5, no.6, pp.989, 993, Nov 1994
[28] Md. Sahidullah and Goutam Saha. 2012. “Design, analysis and experimental evaluation of
block based transformation in MFCC computation for speaker recognition.” Speech
Communication 54, 4 (May 2012), 543-565.
Appendix A
Source Code for trainingPrep.m
go1 = wavread('go1', 2000);
go2 = wavread('go2', 2000);
go3 = wavread('go3', 2000);
go4 = wavread('go4', 2000);
go5 = wavread('go5', 2000);
Go1 = mfcc(go1); Go1 = Go1'; Go1 = Go1(:)'; Go1 = Go1';
Go2 = mfcc(go2); Go2 = Go2'; Go2 = Go2(:)'; Go2 = Go2';
Go3 = mfcc(go3); Go3 = Go3'; Go3 = Go3(:)'; Go3 = Go3';
Go4 = mfcc(go4); Go4 = Go4'; Go4 = Go4(:)'; Go4 = Go4';
Go5 = mfcc(go5); Go5 = Go5'; Go5 = Go5(:)'; Go5 = Go5';
hello1 = wavread('hello1', 2000);
hello2 = wavread('hello2', 2000);
hello3 = wavread('hello3', 2000);
hello4 = wavread('hello4', 2000);
hello5 = wavread('hello5', 2000);
Hello1 = mfcc(hello1); Hello1 = Hello1'; Hello1 = Hello1(:)'; Hello1 = Hello1';
Hello2 = mfcc(hello2); Hello2 = Hello2'; Hello2 = Hello2(:)'; Hello2 = Hello2';
Hello3 = mfcc(hello3); Hello3 = Hello3'; Hello3 = Hello3(:)'; Hello3 = Hello3';
Hello4 = mfcc(hello4); Hello4 = Hello4'; Hello4 = Hello4(:)'; Hello4 = Hello4';
Hello5 = mfcc(hello5); Hello5 = Hello5'; Hello5 = Hello5(:)'; Hello5 = Hello5';
no1 = wavread('no1', 2000);
no2 = wavread('no2', 2000);
no3 = wavread('no3', 2000);
no4 = wavread('no4', 2000);
no5 = wavread('no5', 2000);
No1 = mfcc(no1); No1 = No1'; No1 = No1(:)'; No1 = No1';
No2 = mfcc(no2); No2 = No2'; No2 = No2(:)'; No2 = No2';
No3 = mfcc(no3); No3 = No3'; No3 = No3(:)'; No3 = No3';
No4 = mfcc(no4); No4 = No4'; No4 = No4(:)'; No4 = No4';
No5 = mfcc(no5); No5 = No5'; No5 = No5(:)'; No5 = No5';
stop1 = wavread('stop1', 2000);
stop2 = wavread('stop2', 2000);
stop3 = wavread('stop3', 2000);
stop4 = wavread('stop4', 2000);
stop5 = wavread('stop5', 2000);
Stop1 = mfcc(stop1); Stop1 = Stop1'; Stop1 = Stop1(:)'; Stop1 = Stop1';
Stop2 = mfcc(stop2); Stop2 = Stop2'; Stop2 = Stop2(:)'; Stop2 = Stop2';
Stop3 = mfcc(stop3); Stop3 = Stop3'; Stop3 = Stop3(:)'; Stop3 = Stop3';
Stop4 = mfcc(stop4); Stop4 = Stop4'; Stop4 = Stop4(:)'; Stop4 = Stop4';
Stop5 = mfcc(stop5); Stop5 = Stop5'; Stop5 = Stop5(:)'; Stop5 = Stop5';
yes1 = wavread('yes1', 2000);
yes2 = wavread('yes2', 2000);
yes3 = wavread('yes3', 2000);
yes4 = wavread('yes4', 2000);
yes5 = wavread('yes5', 2000);
Yes1 = mfcc(yes1); Yes1 = Yes1'; Yes1 = Yes1(:)'; Yes1 = Yes1';
Yes2 = mfcc(yes2); Yes2 = Yes2'; Yes2 = Yes2(:)'; Yes2 = Yes2';
Yes3 = mfcc(yes3); Yes3 = Yes3'; Yes3 = Yes3(:)'; Yes3 = Yes3';
Yes4 = mfcc(yes4); Yes4 = Yes4'; Yes4 = Yes4(:)'; Yes4 = Yes4';
Yes5 = mfcc(yes5); Yes5 = Yes5'; Yes5 = Yes5(:)'; Yes5 = Yes5';
trainMatrix(:, 1) = Go1;  trainMatrix(:, 2) = Go2;  trainMatrix(:, 3) = Go3;  trainMatrix(:, 4) = Go4;  trainMatrix(:, 5) = Go5;
trainMatrix(:, 6) = Hello1;  trainMatrix(:, 7) = Hello2;  trainMatrix(:, 8) = Hello3;  trainMatrix(:, 9) = Hello4;  trainMatrix(:, 10) = Hello5;
trainMatrix(:, 11) = No1;  trainMatrix(:, 12) = No2;  trainMatrix(:, 13) = No3;  trainMatrix(:, 14) = No4;  trainMatrix(:, 15) = No5;
trainMatrix(:, 16) = Stop1;  trainMatrix(:, 17) = Stop2;  trainMatrix(:, 18) = Stop3;  trainMatrix(:, 19) = Stop4;  trainMatrix(:, 20) = Stop5;
trainMatrix(:, 21) = Yes1;  trainMatrix(:, 22) = Yes2;  trainMatrix(:, 23) = Yes3;  trainMatrix(:, 24) = Yes4;  trainMatrix(:, 25) = Yes5;
targetMatrix = [1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0;
                0 0 0 0 0 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0;
                0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0;
                0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 0 0 0 0 0;
                0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1];
Appendix B
Source Code for createNetwork.m
% Solve a Pattern Recognition Problem with a Neural Network % Script generated by NPRTOOL % Created Fri Mar 14 11:36:58 EDT 2014 % % This script assumes these variables are defined: % % trainMatrix - input data. % targetMatrix - target data.
inputs = trainMatrix; targets = targetMatrix;
% Create a Pattern Recognition Network hiddenLayerSize = 90; myNetwork = patternnet(hiddenLayerSize);
% Choose Input and Output Pre/Post-Processing Functions % For a list of all processing functions type: help nnprocess myNetwork.inputs{1}.processFcns = {'removeconstantrows','mapminmax'}; myNetwork.outputs{2}.processFcns = {'removeconstantrows','mapminmax'};
% Setup Division of Data for Training, Validation, Testing % For a list of all data division functions type: help nndivide myNetwork.divideFcn = 'dividerand'; % Divide data randomly myNetwork.divideMode = 'sample'; % Divide up every sample myNetwork.divideParam.trainRatio = 70/100; myNetwork.divideParam.valRatio = 15/100; myNetwork.divideParam.testRatio = 15/100;
% For help on training function 'trainlm' type: help trainlm
% For a list of all training functions type: help nntrain
myNetwork.trainFcn = 'trainlm';  % Levenberg-Marquardt
% Choose a Performance Function
% For a list of all performance functions type: help nnperformance
myNetwork.performFcn = 'mse';  % Mean squared error
% Choose Plot Functions
% For a list of all plot functions type: help nnplot
myNetwork.plotFcns = {'plotperform','plottrainstate','ploterrhist', ...
    'plotregression', 'plotfit'};
% Train the Network
[myNetwork,tr] = train(myNetwork,inputs,targets);
% Test the Network
outputs = myNetwork(inputs);
errors = gsubtract(targets,outputs);
performance = perform(myNetwork,targets,outputs)
% Recalculate Training, Validation and Test Performance
trainTargets = targets .* tr.trainMask{1};
valTargets = targets .* tr.valMask{1};
testTargets = targets .* tr.testMask{1};
trainPerformance = perform(myNetwork,trainTargets,outputs)
valPerformance = perform(myNetwork,valTargets,outputs)
testPerformance = perform(myNetwork,testTargets,outputs)
% View the Network
view(myNetwork)
% Plots
% Uncomment these lines to enable various plots.
%figure, plotperform(tr)
%figure, plottrainstate(tr)
%figure, plotconfusion(targets,outputs)
%figure, ploterrhist(errors)
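The script above could be driven end to end from the MATLAB command window roughly as follows. This is a sketch only: it assumes the Appendix A script is saved as createTrainingData.m, and the .mat file name is illustrative.

```matlab
% Build the training data, then train and save the network.
createTrainingData;                  % defines trainMatrix and targetMatrix (Appendix A)
createNetwork;                       % trains myNetwork as shown above
save('myNetwork.mat', 'myNetwork');  % keep the trained network for later testing
```

Saving the trained network avoids retraining (and the randomness of dividerand) every time a new recording is classified.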
Appendix C
Source Code for testNetwork.m
function [] = testNetwork( testSoundFile , myNetwork)
%TESTNETWORK Classify an isolated spoken word with a trained network.
%   Reads the first 2000 samples of the given wave file, extracts its
%   MFCC feature vector, and simulates the network to classify the word.
fileName = testSoundFile;
myNetwork1 = myNetwork;
testSound = wavread(fileName, 2000);
TestSound = mfcc(testSound);
TestSound = TestSound';
TestSound = TestSound(:)';
TestSound = TestSound';
ResultMatrix = sim(myNetwork1, TestSound);
Class1 = round(ResultMatrix(1));
Class2 = round(ResultMatrix(2));
Class3 = round(ResultMatrix(3));
Class4 = round(ResultMatrix(4));
Class5 = round(ResultMatrix(5));
if Class1 == 1
    display('The word is Go');
elseif Class2 == 1
    display('The word is Hello');
elseif Class3 == 1
    display('The word is No');
elseif Class4 == 1
    display('The word is Stop');
elseif Class5 == 1
    display('The word is Yes');
else
    % No output rounded to 1, so the network made no confident decision.
    display('The word was not recognized');
end
end
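A short example of how testNetwork might be invoked, assuming a trained network myNetwork is in the workspace; the file name is illustrative, not one of the actual recordings:

```matlab
% Classify a new recording with the trained network.
% 'test_yes.wav' is an illustrative file name.
testNetwork('test_yes.wav', myNetwork);
```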
Appendix D
Source Code for mfcc.m
% mfcc - Mel frequency cepstrum coefficient analysis.
%   [ceps,freqresp,fb,fbrecon,freqrecon] = ...
%       mfcc(input, samplingRate, [frameRate])
% Find the cepstral coefficients (ceps) corresponding to the
% input. Four other quantities are optionally returned that
% represent:
%   the detailed fft magnitude (freqresp) used in MFCC calculation,
%   the mel-scale filter bank output (fb),
%   the filter bank output by inverting the cepstrals with a cosine
%     transform (fbrecon),
%   the smooth frequency response by interpolating the fb reconstruction
%     (freqrecon).
% -- Malcolm Slaney, August 1993
% Modified a bit to make testing an algorithm easier... 4/15/94
% Fixed Cosine Transform (indices of cos() were swapped) - 5/26/95
% Added optional frameRate argument - 6/8/95
% Added proper filterbank reconstruction using inverse DCT - 10/27/95
% Added filterbank inversion to reconstruct spectrum - 11/1/95
% (c) 1998 Interval Research Corporation
function [ceps,freqresp,fb,fbrecon,freqrecon] = ...
    mfcc(input, samplingRate, frameRate)
global mfccDCTMatrix mfccFilterWeights
[r, c] = size(input);
if (r > c)
    input = input';
end
% Filter bank parameters
lowestFrequency = 133.3333;
linearFilters = 13;
linearSpacing = 66.66666666;
logFilters = 27;
logSpacing = 1.0711703;
fftSize = 512;
cepstralCoefficients = 13;
windowSize = 400;
windowSize = 256;   % Standard says 400, but 256 makes more sense.
                    % Really should be a function of the sample
                    % rate (and the lowestFrequency) and the
                    % frame rate.
if (nargin < 2)
    samplingRate = 16000;
end;
if (nargin < 3)
    frameRate = 100;
end;
% Keep this around for later....
totalFilters = linearFilters + logFilters;
% Now figure the band edges. Interesting frequencies are spaced
% by linearSpacing for a while, then go logarithmic. First figure
% all the interesting frequencies. Lower, center, and upper band
% edges are all consecutive interesting frequencies.
freqs = lowestFrequency + (0:linearFilters-1)*linearSpacing;
freqs(linearFilters+1:totalFilters+2) = ...
    freqs(linearFilters) * logSpacing.^(1:logFilters+2);
lower = freqs(1:totalFilters);
center = freqs(2:totalFilters+1);
upper = freqs(3:totalFilters+2);
% We now want to combine FFT bins so that each filter has unit
% weight, assuming a triangular weighting function. First figure
% out the height of the triangle, then we can figure out each
% frequency's contribution.
mfccFilterWeights = zeros(totalFilters,fftSize);
triangleHeight = 2./(upper-lower);
fftFreqs = (0:fftSize-1)/fftSize*samplingRate;
for chan=1:totalFilters
    mfccFilterWeights(chan,:) = ...
        (fftFreqs > lower(chan) & fftFreqs <= center(chan)) .* ...
        triangleHeight(chan).*(fftFreqs-lower(chan))/(center(chan)-lower(chan)) + ...
        (fftFreqs > center(chan) & fftFreqs < upper(chan)) .* ...
        triangleHeight(chan).*(upper(chan)-fftFreqs)/(upper(chan)-center(chan));
end
%semilogx(fftFreqs,mfccFilterWeights')
%axis([lower(1) upper(totalFilters) 0 max(max(mfccFilterWeights))])
hamWindow = 0.54 - 0.46*cos(2*pi*(0:windowSize-1)/windowSize);
if 0   % Window it like ComplexSpectrum
    windowStep = samplingRate/frameRate;
    a = .54;
    b = -.46;
    wr = sqrt(windowStep/windowSize);
    phi = pi/windowSize;
    hamWindow = 2*wr/sqrt(4*a*a+2*b*b)* ...
        (a + b*cos(2*pi*(0:windowSize-1)/windowSize + phi));
end
% Figure out Discrete Cosine Transform. We want a matrix
% dct(i,j) which is totalFilters x cepstralCoefficients in size.
% The i,j component is given by
%    cos( i * (j+0.5)/totalFilters * pi )
% where we have assumed that i and j start at 0.
mfccDCTMatrix = 1/sqrt(totalFilters/2)*cos((0:(cepstralCoefficients-1))' * ...
    (2*(0:(totalFilters-1))+1) * pi/2/totalFilters);
mfccDCTMatrix(1,:) = mfccDCTMatrix(1,:) * sqrt(2)/2;
%imagesc(mfccDCTMatrix);
% Filter the input with the preemphasis filter. Also figure how
% many columns of data we will end up with.
if 1
    preEmphasized = filter([1 -.97], 1, input);
else
    preEmphasized = input;
end
windowStep = samplingRate/frameRate;
cols = fix((length(input)-windowSize)/windowStep);
% Allocate all the space we need for the output arrays.
ceps = zeros(cepstralCoefficients, cols);
if (nargout > 1)
    freqresp = zeros(fftSize/2, cols);
end;
if (nargout > 2)
    fb = zeros(totalFilters, cols);
end;
% Invert the filter bank center frequencies. For each FFT bin
% we want to know the exact position in the filter bank to find
% the original frequency response. The next block of code finds the
% integer and fractional sampling positions.
if (nargout > 4)
    fr = (0:(fftSize/2-1))'/(fftSize/2)*samplingRate/2;
    j = 1;
    for i=1:(fftSize/2)
        if fr(i) > center(j+1)
            j = j + 1;
        end
        if j > totalFilters-1
            j = totalFilters-1;
        end
        fr(i) = min(totalFilters-.0001, ...
            max(1, j + (fr(i)-center(j))/(center(j+1)-center(j))));
    end
    fri = fix(fr);
    frac = fr - fri;
    freqrecon = zeros(fftSize/2, cols);
end
% Ok, now let's do the processing. For each chunk of data:
%   * Window the data with a hamming window,
%   * Shift it into FFT order,
%   * Find the magnitude of the fft,
%   * Convert the fft data into filter bank outputs,
%   * Find the log base 10,
%   * Find the cosine transform to reduce dimensionality.
for start=0:cols-1
    first = start*windowStep + 1;
    last = first + windowSize - 1;
    fftData = zeros(1,fftSize);
    fftData(1:windowSize) = preEmphasized(first:last).*hamWindow;
    fftMag = abs(fft(fftData));
    earMag = log10(mfccFilterWeights * fftMag');
    ceps(:,start+1) = mfccDCTMatrix * earMag;
    if (nargout > 1)
        freqresp(:,start+1) = fftMag(1:fftSize/2)';
    end;
    if (nargout > 2)
        fb(:,start+1) = earMag;
    end
    if (nargout > 3)
        fbrecon(:,start+1) = ...
            mfccDCTMatrix(1:cepstralCoefficients,:)' * ceps(:,start+1);
    end
    if (nargout > 4)
        f10 = 10.^fbrecon(:,start+1);
        freqrecon(:,start+1) = samplingRate/fftSize * ...
            (f10(fri).*(1-frac) + f10(fri+1).*frac);
    end
end
% OK, just to check things, let's also reconstruct the original FB
% output. We do this by multiplying the cepstral data by the transpose
% of the original DCT matrix. This all works because we were careful to
% scale the DCT matrix so it was orthonormal.
if 1 & (nargout > 3)
    fbrecon = mfccDCTMatrix(1:cepstralCoefficients,:)' * ceps;
    % imagesc(mt(:,1:cepstralCoefficients)*mfccDCTMatrix);
end;
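mfcc can also be called on its own to inspect the features used throughout this thesis. A minimal sketch, assuming a 16 kHz recording; the file name is illustrative:

```matlab
% Extract MFCC features from a single recording.
% wavread is the audio reader used elsewhere in this thesis;
% 'go1.wav' is an illustrative file name.
go1 = wavread('go1.wav', 2000);   % first 2000 samples, as in testNetwork.m
ceps = mfcc(go1, 16000);          % 13 cepstral coefficients per frame
size(ceps)                        % rows = coefficients, columns = frames
```

Each column of ceps describes one windowed frame of the recording, which is why the earlier scripts flatten the matrix into a single column before handing it to the network.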