Implementing Speech Recognition with Artificial Neural Networks

by

Alexander Murphy

Department of Computer Science

Thesis Advisor: Dr. Yi Feng

Submitted in partial fulfillment

of the requirements for the degree of

Bachelor of Computer Science

Algoma University

Sault Ste. Marie, Ontario

April 11, 2014


Abstract

Speech recognition is a complicated problem, but in a society where technology consistently strives for hands-free, voice-driven interfaces, it can be a very useful tool. Speech is such a fascinating phenomenon of the human body that many of its properties end up being unique to each person, which leads to many different options to consider when implementing a speech recognition system. Which techniques of speech recognition yield better results? What is an efficient way of implementing such a system? I will answer these questions in this thesis by implementing an isolated word recognition system with an artificial neural network. In using this network to implement the recognition system, I will attempt to gain an understanding of how neural networks are used for pattern recognition, and of the techniques behind them.


Acknowledgements

I would like to thank my thesis supervisor, Dr. Yi Feng. Her feedback, guidance, and advice were a great help in the completion of my thesis. Also, I would like to thank Dr. George Townsend for agreeing to be my second reader for this project, as well as for teaching me how to use the massive program that is Matlab. All of your help was greatly appreciated.

I would also like to thank the entire Department of Computer Science for helping me through the past four years. It was a wonderful experience from which I benefited greatly.


Table of Contents

Abstract

Acknowledgements

Table of Contents

List of Figures

Chapter 1 Introduction
1.1 Speech Recognition
1.2 Neural Networks
1.3 Thesis in Brief
1.4 Thesis Outline

Chapter 2 Neural Networks
2.1 Biological Neural Networks
2.2 Artificial Neural Networks
2.2.1 Fundamentals of Neural Networks
2.2.2 Neural Network Structures
2.2.3 Training Algorithms

Chapter 3 Speech Recognition
3.1 Fundamentals of Speech Recognition
3.2 Dynamic Time Warping Algorithm
3.3 Hidden Markov Model Approaches
3.4 Neural Network Approaches

Chapter 4 Design of Neural Network
4.1 Limitations of Neural Network
4.2 Structure of Neural Network
4.3 Training Algorithm
4.4 Audio Processing Algorithm

Chapter 5 Neural Network Implementation
5.1 Materials
5.2 Training Data Preparation
5.3 Building the Neural Network
5.4 Testing the Neural Network

Chapter 6 Results
6.1 Testing Results
6.2 Discussion of Results
6.3 Difficulties

Chapter 7 Conclusions and Future Work
7.1 Summary
7.2 Future Work

Bibliography

Appendices
A: trainingPrep.m
B: createNetwork.m
C: testNetwork.m
D: mfcc.m


List of Figures

Figure 2.1 – Simplified Biological Neurons
Figure 2.2 – Symbolic Illustration of Linear Threshold Gate
Figure 2.3 – Activation Functions
Figure 2.4 – Threshold Logic Unit
Figure 2.5 – Single Layer Feedforward Network [11]
Figure 2.6 – Multi Layer Feedforward Network [11]
Figure 2.7 – Recurrent Neural Network
Figure 2.8 – Block Diagram of Reinforcement Learning [14]
Figure 2.9 – Competitive Neural Network
Figure 3.1 – Structure of a Standard Speech Recognition System [15]
Figure 3.2 – Image Frame Extraction [18]
Figure 3.3 – Global Distance Grid
Figure 3.4 – A Simple Hidden Markov Model Topology
Figure 3.5 – Static and Dynamic Classification
Figure 3.6 – The Multilayer Perceptron and Decision Regions
Figure 3.7 – Time Delay Neural Network Structure
Figure 4.1 – Simple Fully Connected Feedforward Multilayer Perceptron [23]
Figure 4.2 – Structure of Neural Network
Figure 5.1 – Matlab Audio Recording
Figure 5.2 – MFCC function called and vector created
Figure 5.3 – Wav file and MFCC of “Go”
Figure 5.4 – Wav file and MFCC of “Hello”
Figure 5.5 – Wav file and MFCC of “No”
Figure 5.6 – Wav file and MFCC of “Stop”
Figure 5.7 – Wav file and MFCC of “Yes”
Figure 5.8 – Creating the Neural Network
Figure 5.9 – Setting Up Division of Data
Figure 5.10 – Matlab Training Code
Figure 5.11 – Training the Neural Network
Figure 5.12 – testNetwork Function
Figure 6.1 – Test Results for “Go”
Figure 6.2 – Test Results for “Hello”
Figure 6.3 – Test Results for “No”
Figure 6.4 – Test Results for “Stop”
Figure 6.5 – Test Results for “Yes”


Chapter 1

Introduction

Speech is a natural phenomenon that occurs every single day. From a very early point in our lives we learn the skills necessary to use speech as a primary mode of communication. Because speech comes so naturally to us, it is easy to forget how complex it really is. It begins with the lungs producing airflow and air pressure. This pressure vibrates the vocal cords, which separate the airflow into audible pulses. The muscles of the larynx then adjust the length and tension of the vocal cords to change the pitch and tone of the sound produced. The vocal tract, consisting of the tongue, palate, cheeks and lips, then articulates and filters the sound [1]. Each instance of speech that results from this process is unique. Voices can vary in volume, speed, pitch, roughness, tone and other aspects. Due to different cultural backgrounds, voices can also differ in accent, articulation, and pronunciation. All of these differences make the implementation of speech recognition a very challenging problem.

In today’s society people prefer technology that simplifies activities. Using speech recognition, users could type text or issue device commands through speech. Systems like language translation and dictation could become simple hands-free devices. But with these ideas comes the complicated task of implementing the system. Speech recognition in computers is nowhere near the speech recognition capability of the human brain, but perhaps using systems that mimic human brain functions could lead to further advancements in the field.


1.1 Speech Recognition

Speech recognition can be a very complex problem with a large number of characteristics that

need to be considered.

Vocabulary size and confusability

Speech recognition can generally function more efficiently with a small vocabulary size.

As the number of words that need to be recognized increases, the amount of confusion

and percentage of error can also increase.

Speaker dependence vs. independence

As mentioned above, different speakers can have completely different voices due to

several different factors. A speech recognition system will have a much easier time

recognizing a word from a single user, while a system that accepts input from multiple

users can end up with a higher number of errors.

Isolated, discontinuous, or continuous speech

Isolated speech means a single spoken word is input for recognition. Discontinuous speech is a full sentence in which the words are separated by silence, while continuous speech is naturally spoken sentences. A speech recognition system has a much better chance at recognizing an isolated word than a word that is part of continuous speech.


Read vs. spontaneous speech

When preparing speech to be recognized by a system, read speech is prepared ahead of time and likely contains no errors. Spontaneous speech may contain mispronounced words or incomplete sentences, making recognition much more difficult.

1.2 Neural Networks

A human brain works differently than a typical computer. While a computer uses a processor to execute instructions and access local memory at incredible speeds, a human brain is a slower collection of nodes called neurons. These neurons are connected by synapses, which pass electrical or chemical signals back and forth. These connections are capable of changing each time information is sent or received, which represents the brain adapting and learning. A neural network is an artificial, computer-generated system that attempts to replicate the neural system of the human brain. Nodes are used to represent neurons, and they are connected with different weights to represent the synapses. These networks can be trained to process different information and learn from it. Naturally, these systems are used to mimic certain functions that the human brain is capable of, such as pattern recognition and decision making.

1.3 Thesis in Brief

The purpose of this thesis is to implement a speech recognition system using an artificial neural network. Due to all of the different characteristics that speech recognition systems depend on, I decided to simplify the implementation of my system. I will implement a speech recognition system that focuses on a set of isolated words. The words “Yes,” “No,” “Hello,” “Stop,” and “Go” will be pre-recorded to eliminate the problem of spontaneous speech. My speech recognition system will be trained using only my voice, which eliminates the need to compensate for users with different voices. With this system I hope to gain an understanding of how speech recognition can be implemented, and of how neural networks can be used to tackle advanced artificial intelligence problems.

1.4 Thesis Outline

The first chapters of this thesis present the background and fundamentals of neural networks and speech recognition:

Chapter 2 will review neural networks.

Chapter 3 will review speech recognition.

The remaining chapters discuss the details of my own neural network implementation as well as the testing and results, concluding with an evaluation of the system and a discussion of possible future work:

Chapter 4 will review the design of my neural network.

Chapter 5 will review my implementation of speech recognition.

Chapter 6 will discuss the results of testing the system.

Chapter 7 will present conclusions and future work.


Chapter 2

Neural Networks

This chapter will begin with an analysis of a biological neural network. After presenting this

concept I will discuss how it is translated into artificial neural networks, and the different

structures and training methods of specific neural networks.

2.1 Biological Neural Networks

Artificial neural networks have been modelled after the structure of the human brain. The average adult human brain weighs approximately 1.5 kg and has a volume of around 1260 cubic centimeters. The brain is composed of approximately 200 billion neurons connected by 125 trillion synapses, along with glial cells and blood vessels. A neuron can be separated into three major parts: the cell body (soma), the dendrites, and the axon [2].

Figure 2.1 – Simplified Biological Neurons


The dendrites are thin branching fibres that receive signals from surrounding neurons. Each branch of a dendrite is connected to a single neuron through the small connection of a synapse. A signal of either chemical diffusion or electrical impulses is transmitted through the thin cylinder of the axon, through the synapse connection, and into the connected dendrite of a specific neuron [3].

When the neurons are performing a specific task, a certain number of neurons must be used in order to complete it. Input signals of different strengths or frequencies are distributed across the required neurons. All of these input signals are then added together and processed by a threshold function, which produces an output signal based on all of the inputs. This entire process is the basis of how humans are able to learn new information and solve problems. While a processing time of 1 ms per cycle, or a signal transmission speed of 0.6 to 120 m/s, is slower than a modern computer, the capability of learning information to recognize and solve various problems is the reason why the neural network of a human brain is an excellent model for artificial intelligence [4].

2.2 Artificial Neural Networks

In the late 19th century a neuroscientist named Santiago Ramón y Cajal conducted studies on the human nervous system. Cajal discovered that the nervous system was actually composed of discrete neurons that communicated with signals passed through axons, dendrites and synapses [5]. This discovery led to further research that identified different types of neurons, as well as the types of signals being passed between them. Neurobiologists were finding it very difficult, however, to understand how the neurons were working together to achieve such a high level of functionality.


It was not until the advancement of modern computing that researchers were able to build working models of neural systems. These models gave a better understanding of how the neural system of the brain functions. Warren McCulloch and Walter Pitts created an early model of an artificial neuron in 1943, which became known as a linear binary threshold gate. For a series of given inputs, a weighted sum would be calculated, with the weight values normalized in the range of either (0, 1) or (-1, 1) and each associated with a specific input. Given a certain threshold, the output would be one of two binary classes based on whether or not the sum exceeded the threshold [6].

Figure 2.2 – Symbolic Illustration of Linear Threshold Gate


With this model of an artificial neuron it was proven that systems of neurons assembled into a finite state automaton could compute any arbitrary function, given suitable values for the weights between the neurons. Researchers went even further by implementing learning procedures that were capable of automatically finding appropriate weight values, which enabled a network to compute any specific function. As the years passed and more research was performed, the basic attributes of neural networks were defined, different training methods were created, and new network architectures were implemented.

2.2.1 Fundamentals of Neural Networks

While there are many different architectures of neural networks, each of these systems contains the same basic attributes: processing units, weights, a computation method and a training method.

Processing Units

A neural network system is made up of a specific number of processing units that represent neurons in a human brain. These units are typically divided into different groups. Some units are input units that receive the data to process, for example numerical values representing an image. Units that are hidden inside the neural network work to manipulate and transform the input information. The final group, known as the output units, represents the decision of the neural network.

Each neuron that receives data performs a defined function, and the result of this function is then shared with connected neurons. This shared value is called the activation value, and it determines whether the receiving neuron will be activated or remain inactive. There are several different activation functions, which squash the activation value into the range (0, 1) or (-1, 1) [7].

Figure 2.3 – Activation Functions
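As a rough illustration (not code from this thesis), the following Matlab fragment evaluates three common activation functions over a range of net input values; the variable names and plotting are mine:

    n = -5:0.1:5;                      % range of net input values
    logistic = 1 ./ (1 + exp(-n));     % squashes the value into (0, 1)
    bipolar  = tanh(n);                % squashes the value into (-1, 1)
    hard     = double(n >= 0);         % hard threshold: exactly 0 or 1
    plot(n, logistic, n, bipolar, n, hard);
    legend('logistic', 'tanh', 'threshold');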

Weights

The processing units in a neural network must be connected in some way. These

connections between neurons are displayed as lines in neural network diagrams. Each

connection is assigned a value of a weight. These weights represent how influential a

neuron is to another connected neuron. Typically the input from a neuron is multiplied by

the weight of the connection that is sending the data. This value is then entered into the

activation function of the network. If the value exceeds the given threshold of the

activation function then the neuron will be activated, otherwise the neuron will remain

inactive.


Computation Method

A neural network uses what is called a threshold logic unit in order to perform the

required computations stated above. This unit is an object that takes all of the different

inputs from connected neurons, sums them together, and then uses the activation function

to determine a correct output. This output is then either sent to the output layer of the

network, or onto the next connected neuron in another hidden layer.

Figure 2.4 – Threshold Logic Unit

The figure above displays the inputs from either the input layer of the network, or the

connected neurons, being modified by the given weight values of the connection. These

values are then summed together and entered into the activation function with an output

of y [8].
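To make the computation concrete, here is a minimal Matlab sketch of a threshold logic unit; the input values, weights and threshold are illustrative only:

    x = [0.9; 0.2; 0.4];            % inputs from connected neurons
    w = [0.5; -0.3; 0.8];           % weights of the incoming connections
    theta = 0.4;                    % threshold of the activation function
    s = w' * x;                     % weighted sum of the inputs
    y = double(s >= theta);         % output y: 1 if activated, 0 otherwise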


Training Method

In order for a neural network to perform efficiently and correctly, it must be able to adapt its weights to achieve the desired output for any input. The training process is extremely important in order for the network to replicate a human brain's ability to learn and respond to new information. This process involves iteratively feeding training data into the neural network. With some neural networks a desired output is given along with the training data. Each time the data is input into the neural network, the weights of the connections between the neurons may be updated in order to move closer to the desired output [9]. There are several different techniques for training neural networks, each one being appropriate for different problems. We will explore these training algorithms in more detail further on.

2.2.2 Neural Network Structures

Artificial neural networks can be classified into two significant groups: feedforward networks and recurrent networks.

Feedforward Neural Networks

A feedforward network is a directed and weighted graph that contains no loops. The input neurons have no connections leading to them, and the output neurons have no connections leading away from them. As the values of the input nodes are set, all other nodes in the network can compute their values using the threshold logic unit. A feedforward neural network can be either a single layer network or a multilayer network [10].


Figure 2.5 – Single Layer Feedforward Network [11]

Figure 2.6 – Multi Layer Feedforward Network [11]


As shown in Figure 2.5, a single layer feedforward network consists of a single layer of

weights connected from the inputs to the outputs. In Figure 2.6 there is a layer in between

the input and output layers. The layers that are in this position are referred to as hidden

layers that hold hidden neurons [11]. These extra layers can provide extra computation on

the data being processed by the neural network, allowing for more advanced functions to

be performed. With both of these networks the data being processed can only travel in the

direction of input to output, hence the term feedforward network. The more popular

option for feedforward networks is the multi-layered structure. These networks have been

found to generalize data very well, and often provide the correct output for an input that

was not in the initial training set of data [10].

Recurrent Neural Network

A recurrent neural network is a directed and weighted graph that contains at least one feedback loop. This structure allows data to travel backwards through the neural network and adjust weights based on new outputs. With output values now able to influence the connection weights in the network, recurrent networks can exhibit dynamic temporal behavior.


Figure 2.7 – Recurrent Neural Network

In Figure 2.7, the neural network displayed has weight connections going from the input

layer, to a hidden layer, and then to an output layer. The difference in this network

compared to feedforward networks is that there is also a connection from the output layer

leading back to the input layer. With this network structure, each time data is processed

through the network it learns and adapts to it, unlike feedforward networks that only learn

from specific training data. This network structure is very useful in applications that

involve processing arbitrary sequences of inputs, like connected handwriting recognition

[12].


2.2.3 Training Algorithms

To ensure that a neural network performs at its peak efficiency, the weights in the network must be adjusted in order to obtain the correct desired output. There are three main methods of allowing a neural network to learn the required information: supervised learning, reinforcement learning and unsupervised learning.

Supervised Learning

Neural networks that implement supervised learning require a set of training data, arranged as input vectors, together with a target vector that determines how well the inputs are learned and acts as a guide for adjusting the weight values in order to reduce errors.

The output of a feedforward neural network for a given pattern z_p is calculated with a single pass through the network. For each output unit o_k we have

    o_{k,p} = f_{o_k} \left( \sum_{j=1}^{J+1} w_{kj} \, f_{y_j} \left( \sum_{i=1}^{I+1} v_{ji} \, z_{i,p} \right) \right)

where f_{o_k} and f_{y_j} are respectively the activation functions for output unit o_k and hidden unit y_j; w_{kj} is the weight between output unit o_k and hidden unit y_j; v_{ji} is the weight between hidden unit y_j and input unit z_i; z_{i,p} is the value of input unit z_i for input pattern z_p; and the (I + 1)-th input unit and the (J + 1)-th hidden unit are bias units representing the threshold values of neurons in the next layer [13].
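A small Matlab sketch of this single forward pass may help; the layer sizes, random weights and sigmoid activation are assumptions for illustration, not values from the thesis:

    I = 4; J = 3; K = 2;              % input, hidden and output unit counts
    zp = [rand(I, 1); -1];            % input pattern z_p plus the bias unit
    V  = randn(J, I + 1);             % input-to-hidden weights v_ji
    W  = randn(K, J + 1);             % hidden-to-output weights w_kj
    f  = @(a) 1 ./ (1 + exp(-a));     % sigmoid activation for all units
    y  = [f(V * zp); -1];             % hidden activations plus the bias unit
    o  = f(W * y);                    % one output o_(k,p) per output unit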


Reinforcement Learning

In reinforcement learning, also referred to as semi-supervised learning, there is no

explicitly desired output for the neural network. Unlike supervised learning, the learning

of an input-output mapping is performed through signals that represent good or bad

behavior. Reinforcement learning can be difficult to implement because there is no target

to provide a desired response from the neural network [14].

Figure 2.8 – Block Diagram of Reinforcement Learning [14]

Unsupervised Learning

In a neural network that implements unsupervised learning there is no target output provided, nor any signals to indicate whether an action was good or bad. It is left up to the network to observe patterns and regularities in the input data. Once a network has learned the patterns of the input data, it can attempt to identify the same patterns in new inputs, or classify them as a new class of output. One method of performing unsupervised learning is the competitive learning rule. Under this rule a network consists of an input layer that receives data and a competitive layer of neurons that compete with each other. These neurons attempt to respond to certain features in the input data, but only one neuron is activated while the others remain inactive [15]. This type of learning is much more difficult to implement and is suited to problems more complicated than speech recognition.

Figure 2.9 – Competitive Neural Network


Chapter 3

Speech Recognition

This chapter will begin with an analysis of the fundamentals of speech recognition. Following

the basic concepts there will be a review of the popular Dynamic Time Warping algorithm. Once

these topics are presented there will be a review of the use of Hidden Markov Models within

speech recognition, as well as the alternate approach of using Neural Networks.

3.1 Fundamentals of Speech Recognition

Speech recognition is a multileveled pattern recognition task, in which acoustical signals are

examined and structured into a hierarchy of sub-word units, words, phrases, and sentences. Each

level may provide additional temporal constraints such as known pronunciations or legal word

sequences [15]. Figure 3.1 displays the fundamental elements of a standard speech recognition

system.

Figure 3.1 – Structure of a Standard Speech Recognition System [15]


Raw Speech

Speech is sampled over a microphone at a standard frequency of 16 kHz. This sampling yields a sequence of amplitude values over time [16].

Signal Analysis

Sampled raw speech must be transformed and compressed in order to simplify the recognition process. There are several popular techniques that extract features from raw speech and compress the data without losing the information relevant to recognition. Fourier analysis, Perceptual Linear Prediction, Linear Predictive Coding and Cepstral analysis can all process the raw speech into a more usable state [17].

Speech Frames

Once raw speech is processed and analyzed, the audio is broken up into speech frames.

These frames are typically 10ms intervals of the processed audio and provide unique

information relative to the speech recognition process [18].

Figure 3.2 – Image Frame Extraction [18]
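A minimal sketch of this framing step in Matlab, assuming a 16 kHz sampling rate and the Signal Processing Toolbox function buffer:

    Fs = 16000;                        % sampling rate from the section above
    x  = randn(Fs, 1);                 % stand-in for one second of speech
    frameLen = round(0.010 * Fs);      % 10 ms frames -> 160 samples each
    frames = buffer(x, frameLen);      % one speech frame per column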


3.2 Dynamic Time Warping Algorithm

Speech recognition can be complicated for a number of different reasons. When comparing a

sample for recognition against training data, the test sample may be of a different duration than

the training sample. One way to solve this problem is to normalize the speech samples so that

they all have the same duration. Another problem with recognizing speech is that the rate at

which the words are spoken may not always be constant, which would mean the optimal

alignment between a test sample and the training sample may be nonlinear.

An instance of dynamic programming known as Dynamic Time Warping is an efficient method of solving the time alignment problem in speech recognition. The Dynamic Time Warping algorithm attempts to align the test vector and the training vector by repeatedly warping the time axis until an optimal match between the two vectors is found. The algorithm performs a piecewise linear mapping of the time axis to align both vectors. For example, consider two sequences of feature vectors in an n-dimensional space,

    x = [x_1, x_2, ..., x_n] and y = [y_1, y_2, ..., y_n].

These two sequences are aligned on the sides of a grid, with one on the top and the other on the left hand side. Both sequences start at the bottom left of the grid.


Figure 3.3 – Global Distance Grid

In each cell of the grid in Figure 3.3, a distance measure comparing the corresponding elements of the two sequences is placed. The distance between two points is the Euclidean distance,

    Dist(x, y) = \| x - y \| = \left[ (x_1 - y_1)^2 + (x_2 - y_2)^2 + \cdots + (x_n - y_n)^2 \right]^{1/2}.

The best alignment of the two sequences is the path through the grid that minimizes the total distance between them. This total distance is found by examining the possible routes through the grid: for each route, the distances between the individual elements on the path are summed and divided by the sum of the weighting function, and the minimum over all routes is taken [19].
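The following Matlab function is a minimal sketch of this procedure, assuming a symmetric step pattern with unit weights (so the path weight sum is n + m); it is an illustration, not the thesis code:

    function d = dtwDistance(x, y)
        % x, y: feature sequences, one feature vector per column
        n = size(x, 2); m = size(y, 2);
        D = inf(n + 1, m + 1);          % the global distance grid
        D(1, 1) = 0;
        for i = 1:n
            for j = 1:m
                cost = norm(x(:, i) - y(:, j));      % Euclidean distance
                D(i + 1, j + 1) = cost + ...
                    min([D(i, j + 1), D(i + 1, j), D(i, j)]);
            end
        end
        d = D(n + 1, m + 1) / (n + m);  % normalize by the path weight sum
    end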


3.3 Hidden Markov Model Approaches

One of the most widely used and successful approaches to speech recognition is using a Hidden

Markov Model. This model is essentially a collection of different states that are connected by

transitions. The model begins in a designated initial state, and at each step in the process, a

transition is used to reach a new state where an output is generated. This system is referred to as

hidden due to the fact that while outputs are observed over the course of running the system, the

sequence of states visited is hidden from the user.

Figure 3.4 – A Simple Hidden Markov Model Topology

A typical Hidden Markov Model consists of:

{s} = a set of states.

{aij} = a set of transition probabilities, where aij is the probability of the transition from

state i to state j.

{bi(u)} = a set of emission probabilities, where bi is the probability describing the

likelihood of emitting a sound u while in state i.

In order to simplify the process, probabilities a and b only depend on the current state of the

model. This limits the parameters of training, and makes the training and testing of speech

recognition very efficient [20].
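As a hedged illustration of how such a model scores speech, the Matlab fragment below runs the forward algorithm on a toy two-state model; all probability values and the observation sequence are invented for the example:

    A   = [0.7 0.3; 0.0 1.0];       % transition probabilities a_ij
    B   = [0.6 0.4; 0.1 0.9];       % emission probabilities b_i(u)
    p0  = [1; 0];                   % start in the designated initial state
    obs = [1 2 2];                  % an observed sequence of sound indices
    alpha = p0 .* B(:, obs(1));     % initialize the forward probabilities
    for t = 2:numel(obs)
        alpha = (A' * alpha) .* B(:, obs(t));
    end
    likelihood = sum(alpha);        % P(observation sequence | model)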


3.4 Neural Network Approaches

Speech recognition can be defined as a pattern recognition problem. Because neural networks perform well at pattern recognition, many researchers naturally applied neural networks to speech recognition. The first attempts were nothing more than simple problems, like classifying speech segments as voiced/unvoiced or nasal/fricative/plosive. When researchers succeeded in these experiments they moved on to the problem of phoneme classification.

Figure 3.5 – Static and Dynamic Classification

As shown in Figure 3.5, there are two main approaches to recognizing speech with neural networks: static and dynamic classification. Static classification allows the neural network to see all of the input speech at the same time, and the network makes a single decision based on this input. Dynamic classification, on the other hand, allows the neural network to see only a small window of the speech input. This small view window leads to a series of decisions that must be combined over the entire speech input. Static classification works well when a neural network is used to recognize isolated words or phonemes; dynamic classification, however, proves much better at classifying words or sentences that are spoken together [15].


Static Classification Approach

In 1988 Dr. Richard Lippmann and Dr. William Huang demonstrated that neural networks were able to form complex decision regions from speech inputs. Using a multilayer perceptron with 2 inputs, 50 hidden neurons, and 10 outputs, the network was able to process spoken vowels and form decision regions. After 50,000 iterations of training, the decision regions became optimal and yielded very promising classification results [21].

Figure 3.6 – The Multilayer Perceptron and Decision Regions

Dynamic Approach

One of the most difficult speech recognition tasks is implementing a system that can process the E-set of English letters: “B, C, D, E, G, P, T, V, and Z.” With all of these letters sharing the same vowel sound, an 8% error rate for a neural network is actually considered a good result.


In 1989 Dr. Alexander Waibel had excellent results for phoneme recognition using a

Time Delay Neural Network. His structure consisted of an input layer of 3 delays, and a

hidden layer of 5 delays. The final output was computed by integrating over 9 frames of

phoneme activations in the second hidden layer.

Figure 3.7 – Time Delay Neural Network Structure

This network was trained and tested on 2000 samples of phonemes /b, d, g/ that were

manually excised from a database containing 5260 Japanese words. In the end Waibel

achieved an error rate of only 1.5%, which can be compared to 6.5% achieved by a

simple Hidden Markov Model recognition system [22].


Chapter 4

Design of Neural Network

This chapter will discuss the limitations my neural network will abide by, in order to reduce the

scope of my implementation. The structure of the neural network will be described, as well as the

algorithms used to train the network and process the audio samples.

4.1 Limitations of Neural Network

Given the complexity of building a speech recognition system, I decided to place some limitations on my system in order to reduce the scope of the project.

Vocabulary Size and Confusability

As the size of the vocabulary for a speech recognition system increases, so does the

amount of confusability. Many words and letters in the English language sound very

similar which can cause a neural network to arrive at the incorrect output. In order to

reduce the vocabulary and in turn the confusability of my system, it will be used to

recognize 5 simple words: “Go,” “Hello,” “No,” “Stop” and “Yes.”

Speaker Dependence vs. Independence

When creating a speech recognition neural network, the network naturally begins to adapt to the training data. If the audio samples used to train the network originate from different users, the network will have a more difficult time distinguishing the spoken words than a network trained on samples from one user. In order to reduce the recognition difficulty as well as the training length, I will train the neural network using audio samples spoken only by myself. The test word samples will likewise be only my voice, so as to not confuse the network further.

Isolated, Discontinuous, or Continuous Speech

Continuous speech such as naturally spoken sentences can increase the difficulty of a

speech recognition program by a large factor. With natural language there are often

pauses, mispronounced words, or words that are not in the particular vocabulary.

Discontinuous speech must involve separating sentences into individual words or

phonemes. In order to reduce the amount of processing that the audio samples must

undergo, I will record the 5 simple words as isolated words in order to easily detect the

boundaries of each word.

Read vs. Spontaneous Speech

When speech is spoken spontaneously there can be many useless parts involved. There

can be pauses in the sentences, coughing or sneezing, and common filler words like “um”

or “uh.” With my speech recognition system the audio samples will be recorded as if they

are being read from a script. With no elements of spontaneous speech involved, and the

words being isolated, the neural network and audio processing will not need to be as

complex as other implementations.

Real-time Recognition vs. Recorded Samples

In most speech recognition systems, the training data samples must be recorded and

processed in order to train the system. The testing of the system can occur in one of two


ways, either in real-time or with recorded test samples. With real-time recognition, audio

samples can contain various background noises that distort the speech being recorded.

Algorithms and further audio processing must be implemented in order to ensure that

only the spoken speech will be used to train the system. In my speech recognition system

I will be pre-recording both the training data samples as well as the test data samples.

This will minimize the amount of background noise that can occur during the audio

samples.

4.2 Structure of Neural Network

A common form of a neural network is a 3-layer, fully connected, feedforward multilayer

perceptron.

Figure 4.1 – Simple Fully Connected Feedforward Multilayer Perceptron [23]

This type of network is arranged in 3 layers: an input layer, a hidden layer and an output layer. The training of this network involves a feature vector, or training matrix, and a class vector, or target matrix. The input training matrix is applied to the network and an output is calculated. This output is then compared to the target matrix, and values are assigned to the different classes. As the training algorithm continues, the weights in the network are adjusted in order to minimize recognition errors. The further the training iterations continue, the better the chance that the neural network will assign the highest value to the correct class for a specific input.

With my speech recognition system being limited to 5 specific words, essentially 5 classes, this structure of neural network suits my needs perfectly. The next step in designing the neural network is deciding on the number of input, hidden, and output nodes. The number of output nodes is simply the number of classes that the network will work with, which in this case is 5, one class for each word to be recognized. The number of input nodes is equal to the number of features extracted from each individual word. Using the Mel-frequency cepstrum coefficient analysis from the Auditory Toolbox [24] Matlab function mfcc(), which will be discussed further on, the individual word samples are analyzed and processed into 130 features. We now have 130 input nodes and 5 output nodes for the neural network. In order to determine an optimal number of hidden neurons there exist three common rule-of-thumb approaches:

an optimal number of hidden neurons there exists 3 common rule-of-thumb approaches:

The number of hidden neurons should be between the size of the input layer and the size

of the output layer.

The number of hidden neurons should be 2/3 the size of the input layer, plus the size of

the output layer.

The number of hidden neurons should be less than twice the size of the input layer [25].

For my neural network I decided to follow these guidelines and took 2/3 of the total number of input and output neurons, (130 + 5) × 2/3, resulting in 90 hidden neurons.


Figure 4.2 – Structure of Neural Network

The combination of all the parameters resulted in my neural network with 130 input nodes, a

layer of 90 hidden nodes, and an output layer of 5 nodes representing 5 classifications of words.

4.3 Training Algorithm

For a neural network implementing some sort of pattern recognition, it is quite common and beneficial to use a method called backpropagation. This method is a form of supervised learning that starts by feeding the training data through the network. Once this data makes it through the network, it generates output activations. These activations are then propagated backwards through the neural network, generating along the way a delta value for each of the hidden and output neurons. The weights of the network are then updated using these calculated delta values, which increases the speed and quality of the learning process [26].

Another algorithm, the Levenberg-Marquardt algorithm, modifies the original idea of backpropagation and outperforms a variety of other training algorithms. Due to its efficiency and speed it is one of the most widely used optimization algorithms, which is why I chose it to train my neural network.

Backpropagation Algorithm

Consider a multilayer feedforward network, such as a three-layer network. The net input to unit i in layer k + 1 is

    n^{k+1}(i) = \sum_{j=1}^{S^k} w^{k+1}(i, j) \, a^k(j) + b^{k+1}(i)     (1)

The output of unit i will be

    a^{k+1}(i) = f^{k+1}(n^{k+1}(i))     (2)

For an M layer network the system equations in matrix form are given by

    a^0 = p     (3)

    a^{k+1} = f^{k+1}(W^{k+1} a^k + b^{k+1}), \quad k = 0, 1, \ldots, M - 1     (4)

The task of the network is to learn associations between a specified set of input-output pairs {(p_1, t_1), (p_2, t_2), ..., (p_Q, t_Q)}. The performance index for the network is

    V = \frac{1}{2} \sum_{q=1}^{Q} (t_q - a_q^M)^T (t_q - a_q^M) = \frac{1}{2} \sum_{q=1}^{Q} e_q^T e_q     (5)

where a_q^M is the output of the network when the qth input p_q is presented, and e_q = t_q - a_q^M is the error for the qth input. For the standard backpropagation algorithm we use an approximate steepest descent rule. The performance index is approximated by

    \hat{V} = \frac{1}{2} e_q^T e_q     (6)

where the total sum of squares is replaced by the squared errors for a single input/output pair. The approximate steepest (gradient) descent algorithm is then

    \Delta w^k(i, j) = -\alpha \, \partial \hat{V} / \partial w^k(i, j)     (7)

    \Delta b^k(i) = -\alpha \, \partial \hat{V} / \partial b^k(i)     (8)

where \alpha is the learning rate. Define

    \delta^k(i) \equiv \partial \hat{V} / \partial n^k(i)     (9)

as the sensitivity of the performance index to changes in the net input of unit i in layer k. Now it can be shown, using (1), (6), and (9), that

    \partial \hat{V} / \partial w^k(i, j) = \delta^k(i) \, a^{k-1}(j)     (10)

    \partial \hat{V} / \partial b^k(i) = \delta^k(i)     (11)

It can also be shown that the sensitivities satisfy the following recurrence relation,

    \delta^k = \dot{F}^k(n^k) \, (W^{k+1})^T \delta^{k+1}     (12)

where \dot{F}^k(n^k) is the diagonal matrix of the activation function derivatives evaluated at the net inputs of layer k. This recurrence relation is initialized at the final layer,

    \delta^M = -\dot{F}^M(n^M) (t_q - a_q^M)     (15)

The overall learning algorithm now proceeds as follows: first, propagate the input forward using (3)-(4); next, propagate the sensitivities back using (15) and (12); and finally, update the weights and offsets using (7), (8), (10), and (11) [27].

Levenberg-Marquardt Modification

While backpropagation is a steepest descent algorithm, the Levenberg-Marquardt algorithm is an approximation to Newton's method. Suppose that we have a function V(x) which we want to minimize with respect to the parameter vector x; then Newton's method would be

    \Delta x = -[\nabla^2 V(x)]^{-1} \nabla V(x)     (16)

where \nabla^2 V(x) is the Hessian matrix and \nabla V(x) is the gradient. If we assume that V(x) is a sum of squares function,

    V(x) = \sum_{i=1}^{N} e_i^2(x)     (17)

then it can be shown that

    \nabla V(x) = J^T(x) e(x)     (18)

    \nabla^2 V(x) = J^T(x) J(x) + S(x), \quad S(x) = \sum_{i=1}^{N} e_i(x) \nabla^2 e_i(x)

where J(x) is the Jacobian matrix

    J(x) = [\, \partial e_i(x) / \partial x_j \,], \quad i = 1, \ldots, N, \; j = 1, \ldots, n     (20)

For the Gauss-Newton method it is assumed that S(x) ≈ 0, and the update (16) becomes

    \Delta x = -[J^T(x) J(x)]^{-1} J^T(x) e(x)

The Levenberg-Marquardt modification to the Gauss-Newton method is

    \Delta x = -[J^T(x) J(x) + \mu I]^{-1} J^T(x) e(x)     (23)

The parameter \mu is multiplied by some factor \beta whenever a step would result in an increased V(x). When a step reduces V(x), \mu is divided by \beta. Notice that when \mu is large the algorithm becomes steepest descent (with step 1/\mu), while for small \mu the algorithm becomes Gauss-Newton. The Levenberg-Marquardt algorithm can be considered a trust-region modification to Gauss-Newton.

The key step in this algorithm is the computation of the Jacobian matrix. For the neural network mapping problem the terms in the Jacobian matrix can be computed by a simple modification to the backpropagation algorithm. The performance index for the mapping problem is given by (5). It is easy to see that this is equivalent in form to (17), where

    x = [w^1(1,1) \; w^1(1,2) \; \ldots \; w^1(S^1, R) \; b^1(1) \; \ldots \; b^1(S^1) \; w^2(1,1) \; \ldots \; b^M(S^M)]^T

and N = Q × S^M. Standard backpropagation calculates terms like

    \partial \hat{V} / \partial x_l = \partial (e_q^T e_q) / \partial x_l

For the elements of the Jacobian matrix that are needed for the Levenberg-Marquardt algorithm we need to calculate terms like

    \partial e_q(k) / \partial x_l

These terms can be calculated using the standard backpropagation algorithm with one modification at the final layer,

    \Delta^M = -\dot{F}^M(n^M)     (26)

Note that each column of the matrix in (26) is a sensitivity vector which must be backpropagated through the network to produce one row of the Jacobian.


Therefore the Levenberg-Marquardt algorithm proceeds as follows:

1) Present all inputs to the network and compute the corresponding network outputs (using (3) and (4)) and errors (e_q = t_q - a_q^M). Compute the sum of squares of errors over all inputs (V(x)).

2) Compute the Jacobian matrix (using (26), (12), (10), (11), and (20)).

3) Solve (23) to obtain \Delta x.

4) Recalculate the sum of squares of errors using x + \Delta x. If this new sum of squares is smaller than that computed in step 1, then reduce \mu by \beta, let x = x + \Delta x, and go back to step 1. If the sum of squares is not reduced, then increase \mu by \beta and go back to step 3.

5) The algorithm is assumed to have converged when the norm of the gradient (18) is less than some predetermined value, or when the sum of squares has been reduced to some error goal [27].
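A compact Matlab sketch of steps 1-5 above, assuming hypothetical helper functions errors(x) and jacobianOf(x) that return the error vector and its Jacobian via backpropagation; this is an outline of the update logic, not the thesis implementation:

    mu = 0.01; beta = 10; gradGoal = 1e-6;   % assumed parameter values
    while true
        e = errors(x);                       % step 1: errors over all inputs
        V = e' * e;                          % sum of squared errors
        J = jacobianOf(x);                   % step 2: Jacobian matrix
        if norm(J' * e) < gradGoal           % step 5: convergence test
            break
        end
        while true
            dx = -(J' * J + mu * eye(numel(x))) \ (J' * e);  % step 3, (23)
            eNew = errors(x + dx);           % step 4: try the new weights
            if eNew' * eNew < V
                x = x + dx; mu = mu / beta;  % improvement: accept, shrink mu
                break
            else
                mu = mu * beta;              % no improvement: grow mu, retry
            end
        end
    end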

4.4 Audio Processing Algorithm

In order to properly recognize speech patterns with my neural network I need to process the

audio samples and retrieve features that can define them. Commonly used features from audio

samples are mel-frequency cepstral coefficients, which together make up a mel-frequency

cepstrum. In order to derive the coefficients from an audio clip represented in a matrix, there are

steps that need to be followed:

1) Windowing: In the first stage, the signal is multiplied with a tapered window (usually a Hamming or Hanning window). The windowed speech frames are given by

    \bar{s} = s \circ w

where s is a matrix containing the framed speech, w is another matrix whose T rows each contain the same window function of size N, and \circ denotes entry-wise matrix multiplication.

2) Zero-padding: Zero-padding is required to compute the power spectrum using the fast Fourier transform (FFT). Sufficient zeroes are padded using the following matrix operation:

    \tilde{s} = \bar{s} \, [\, I \;\; O \,]

where I is an identity matrix of size N × N and O is a null matrix of size N × (M - N). Here M is a power of two and is greater than N.

3) DFT computation: The windowed speech frames are multiplied with the twiddle factor matrix (W) to formulate the discrete Fourier transform (DFT) coefficients (Ω). Half of the twiddle factor matrix is sufficient due to the conjugate symmetric property of the Fourier transform. This operation can be expressed as

    \Omega = \tilde{s} \, W

4) Power spectrum computation: The power spectrum (Θ) is computed by entry-wise multiplying the DFT coefficients with their conjugate. This can be written as

    \Theta = \Omega \circ \Omega^{*}


5) Filter bank log energy computation: The speech signal is passed through a triangular filter bank of frequency response (Λ) which contains p filters, linearly spaced on the Mel scale. The log energy output (Ψ) of the filter bank is given by

    \Psi = \log(\Theta \, \Lambda)

6) DCT computation: In the finishing stage of MFCC computation, Ψ is multiplied with the discrete cosine transform (DCT) matrix D to create the final coefficients (x). Therefore,

    x = \Psi \, D

where each column of D is a p-dimensional orthogonal basis vector of the DCT. However, since the first coefficient is discarded as it is the DC coefficient, multiplication with a p × (p - 1) matrix is adequate in the DCT computation [28].

These 6 steps result in a new matrix of features extracted from the audio samples, which can then be used as training data for a neural network.
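A hedged Matlab sketch of the six steps, where s holds framed speech (one frame per row), Lam is an assumed p × (M/2 + 1) mel filter bank matrix, and hamming and dctmtx come from Matlab toolboxes:

    [T, N] = size(s);                    % T frames of N samples each
    win = repmat(hamming(N)', T, 1);     % 1) tapered window in every row
    sw  = s .* win;                      %    entry-wise multiplication
    M   = 2^nextpow2(N);                 % 2) pad each frame to a power of two
    sz  = sw * [eye(N), zeros(N, M - N)];
    W   = exp(-2i*pi*(0:M-1)'*(0:M/2)/M);% 3) half of the twiddle factors
    Om  = sz * W;                        %    DFT coefficients
    Th  = abs(Om).^2;                    % 4) power spectrum
    Psi = log(Th * Lam');                % 5) filter bank log energies
    D   = dctmtx(p);                     % 6) DCT basis, one vector per row
    x   = Psi * D(2:end, :)';            %    drop the DC coefficient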


Chapter 5

Neural Network Implementation

This chapter will discuss the details of my neural network implementation. The preparation of the training and test audio files in Matlab will be reviewed, as well as the programming of the neural network itself.

5.1 Materials

In order to design and implement a speech recognition system using a neural network, I first needed to decide on a programming language. Originally I planned to use the Fast Artificial Neural Network library through the Java programming language. However, this library did not link into Java very well and was proving difficult to use. After researching other methods of programming neural networks, I decided to try Matlab. Using Matlab to implement a neural network proved to be much more user friendly than I had anticipated. Included with Matlab was the Neural Network Toolbox, which provided many of the functions needed to implement any type of neural network.

For the audio processing I used Matlab to record audio through my computer soundcard. Once

the wav files were recorded I used a library called Auditory Toolbox [24] that could be linked

with Matlab. This library contained the Mel-frequency Cepstrum Coefficient algorithm that I had

decided to use in order to retrieve unique features from my audio samples.


5.2 Training Data Preparation

In order to implement speech recognition with my neural network, I needed to record, analyze

and manipulate audio files.

Figure 5.1 – Matlab Audio Recording

Using Matlab I defined a sampling frequency called Fs and a duration called Duration. I then assigned a new wav file to a variable using the wavrecord function. I needed to manually trim the audio files to remove the empty space where no speech was spoken, in order to improve my network's error rates.
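A minimal sketch of this recording step is shown below; the two-second duration and the output file name are assumed examples, and the wavwrite call for saving the sample is my addition for illustration:

Fs = 2000;                          % sampling frequency in Hz
Duration = 2;                       % length of the recording in seconds (assumed)
word = wavrecord(Duration*Fs, Fs);  % record from the default sound card
wavwrite(word, Fs, 'go1.wav');      % save the sample so it can be trimmed later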

Using this recording method I recorded 10 samples of each of the words "Go," "Hello," "No," "Stop," and "Yes." Five samples of each word were set aside to test the neural network after it was built; the other five samples of each word were used as training data. Using the Auditory Toolbox function mfcc, I was able to retrieve the coefficients needed for my speech recognition. I then needed to transform the matrix produced by the function into a vector that could be entered into the neural network.

Figure 5.2 – MFCC function called and vector created
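Concretely, this step follows the Appendix A code: the 13 x 10 coefficient matrix returned by mfcc for a 2000-sample recording is transposed, flattened frame by frame, and turned into a single 130-element column vector:

go1 = wavread('go1', 2000);  % first 2000 samples of the recording
Go1 = mfcc(go1);             % 13 x 10 matrix of cepstral coefficients
Go1 = Go1';                  % put frames in rows
Go1 = Go1(:)';               % flatten into one row, frame by frame
Go1 = Go1';                  % 130 x 1 feature vector for the network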


Figure 5.3 – Wav file and MFCC of “Go”

Figure 5.4 – Wav file and MFCC of “Hello”

Figure 5.5 – Wav file and MFCC of “No”


Figure 5.6 – Wav file and MFCC of “Stop”

Figure 5.7 – Wav file and MFCC of “Yes”

The last part of the preparation before building the neural network was to create the training matrix and the target matrix. I formed a matrix of size 130 x 25 (130 features for each of the 25 words) by taking the MFCC vector of each word and combining them into one matrix. Then I made a target matrix of size 5 x 25: five classes by 25 words.
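Each column of the target matrix is a one-hot vector marking the word's class. The explicit 5 x 25 matrix written out in Appendix A can equivalently be built as:

% Five classes, five consecutive training samples per class
targetMatrix = kron(eye(5), ones(1, 5));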


5.3 Building the Neural Network

With all of the details of my neural network decided, I entered them into Matlab in order to build the network. I defined the network inputs as the training matrix and the network targets as the target matrix. I created a variable called hiddenLayerSize and assigned it the 90 hidden neurons I had decided to use. Using the function patternnet, I created a standard feedforward neural network that could classify inputs according to the target classes.

Figure 5.8 – Creating the Neural Network
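The corresponding lines from Appendix B are:

inputs = trainMatrix;
targets = targetMatrix;

% Create a pattern recognition network with 90 hidden neurons
hiddenLayerSize = 90;
myNetwork = patternnet(hiddenLayerSize);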

Next I defined the division of my input data for training, validation, and testing. I used a standard

division of 70% of the data for training, 15% of the data for validation, and 15% of the data for

testing.

Figure 5.9 – Setting Up Division of Data
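In code (from Appendix B), the random division is configured as:

myNetwork.divideFcn = 'dividerand';         % divide the data randomly
myNetwork.divideMode = 'sample';            % divide up every sample
myNetwork.divideParam.trainRatio = 70/100;
myNetwork.divideParam.valRatio = 15/100;
myNetwork.divideParam.testRatio = 15/100;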


I then defined the training algorithm that I wanted my neural network to implement, which was

the Levenberg-Marquardt algorithm. I used the standard mean squared error function to evaluate

the performance of my network, and began the training.

Figure 5.10 – Matlab Training Code
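The training configuration from Appendix B:

myNetwork.trainFcn = 'trainlm';   % Levenberg-Marquardt
myNetwork.performFcn = 'mse';     % mean squared error
[myNetwork, tr] = train(myNetwork, inputs, targets);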

The training of the network took an average of three minutes each time I attempted it. The network needed between 10 and 20 iterations of the Levenberg-Marquardt algorithm before it stopped improving.


Figure 5.11 – Training the Neural Network

In the end, Matlab made it very easy to specify the characteristics of the neural network, and the network was created without any problems.


5.4 Testing the Neural Network

Once the network was created and trained, I wrote a function in Matlab that would test its speech

recognition capabilities.

Figure 5.12 – testNetwork Function

This function takes the parameters testSoundFile and myNetwork in order to read in a recorded wav file and test it against the network. After reading in the file, I extracted its features using the mel-frequency cepstral coefficient algorithm, then simulated the trained neural network with the new input vector. Using if-statements, I output the class to which the neural network assigned the input word.
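A typical call, assuming a held-out recording saved as go6.wav (the file name is illustrative):

testNetwork('go6', myNetwork);   % prints "The word is Go" when classified correctly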


Chapter 6

Results

In this chapter I will discuss the details of testing my neural network and the results it produced. I will also discuss how the project's limitations may have affected these results.

6.1 Testing Results

Using the testing function that I wrote in Matlab, I tested all five test samples of each word in the neural network. These test words were not involved in the training of the network, and therefore they could reveal whether the network correctly classified unseen recordings.

Figure 6.1 – Test Results for “Go”

Figure 6.2 – Test Results for “Hello”


Figure 6.3 – Test Results for “No”

Figure 6.4 – Test Results for “Stop”

Figure 6.5 – Test Results for “Yes”


6.2 Discussion of Results

When using a feedforward multilayer perceptron network with the Levenberg-Marquardt training

algorithm to classify the words “Go,” “Hello,” “No,” “Stop” and “Yes,” based on their mel-

frequency cepstrum coefficients, the test words were recognized with no errors. These results

were very positive and demonstrate the capabilities of neural networks when it comes to pattern

recognition.

Given that the network correctly identified all of the test words that were inputted, the results were what I had hoped to achieve. However, the limitations that I set on this project may have skewed the end results in a positive direction.

When designing the speech recognition system I limited the vocabulary to five simple commands, recorded and spoken only in my own voice. I needed to manually isolate the words so that there was no confusion about where one word ended and another began. Together these limitations reduce the difficulty of the task the speech recognition system must perform, and therefore reduce the chance that the network will make a mistake.

After the initial testing was complete, I decided to try a real-time recognition test. Using the recording function in Matlab to capture a spoken word, I automatically computed the mel-frequency coefficients and simulated the neural network with them. As I had expected, the neural network could not make a definite decision, as no class output rounded up to one. I believe this was because, even though the word was recorded in isolation with little to no background noise, there was too much empty space around the word that needed to be cut out. Having this much empty space around the word altered the coefficients and ultimately caused the word to go unrecognized.

6.3 Difficulties

Through the implementation of my speech recognition system I encountered some difficulties:

1) When I first started to design the speech recognition system, it appeared to be a daunting task due to all of the fundamentals that I detailed. In order to reduce the complexity of the system and ensure that I was able to complete it, I reduced the system to recognizing five isolated words and focused on pre-recorded audio instead of real-time recognition.

2) Originally, attempting to use the Fast Artificial Neural Network library and link it to the Java programming language proved to be very complicated and overall not very user friendly. This problem was easy to overcome once I decided to use Matlab to implement the system.

3) When recording the audio samples to train and test the neural network, I originally used a sample rate of 8000 Hz. I tried to use a high rate in order to retrieve enough coefficients to perform efficient recognition. However, when training the neural network, the number of features per word was much higher and caused Matlab to run into the "Out of Memory" error. In the end I reduced the sampling rate of my wav files to 2000 Hz, which allowed the network to be trained in an acceptable amount of time and led to successful recognition.

4) When recording the words through Matlab, I often needed to re-record a word because there was too much background noise in the wav file. My sound card would pick up slight noises that could alter the file and cause confusion within the network.

Page 58: Implementing Speech Recognition with Artificial Neural ...archives.algomau.ca/main/sites/default/files/2015-035_001_002_001.pdf · Implementing Speech Recognition with Artificial

51

Chapter 7

Conclusions and Future Work

In this chapter I will discuss how the purpose of my thesis was fulfilled. I will also discuss future

work that this project can lead to.

7.1 Summary

I started this thesis set on implementing a speech recognition system with an artificial neural

network. I wanted to research different techniques behind speech recognition, as well as

techniques for using neural networks efficiently. While the scope of my system was reduced to

an isolated word recognition network, the results were still very positive. Despite limiting the

speech recognition side of the project, I gained an understanding of how neural networks can

tackle a problem like pattern recognition, as well as the benefits of certain structures and training

algorithms. In the end, I accomplished what I set out to do and successfully implemented speech recognition with a neural network.

7.2 Future Work

Now that I have gained an understanding of how speech recognition works with artificial neural networks, I hope to expand upon my implementation. In the future I would like to implement an algorithm that can filter out useless noise in audio files, so that I can update my recognition system into a real-time process. With further research into audio analysis, I hope to improve the feature sets that I attempt to recognize, so that I can attempt more difficult problems like recognizing phonemes from the E-set of the English language.

Page 59: Implementing Speech Recognition with Artificial Neural ...archives.algomau.ca/main/sites/default/files/2015-035_001_002_001.pdf · Implementing Speech Recognition with Artificial

52

Bibliography

[1] I. R. Titze, Principles of Voice Production. Prentice Hall, 1994.

[2] B. Busse et al., "Single-Synapse Analysis of a Diverse Synapse Population: Proteomic

Imaging Methods and Markers," Neuron, vol. 68, pp. 639-653, November 2010.

[3] Kiyoshi Kawaguchi. (2000, June, 17). Biological Neural Networks [Online]. Available:

http://www.ece.utep.edu/research/webfuzzy/docs/kk-thesis/kk-thesis-html/node10.html

[4] Albrecht Schmidt. (2000). Biological Neural Networks [Online]. Available:

http://www.teco.edu/~albrecht/neuro/html/node7.html

[5] Rodolfo R. Llinas, "The contribution of Santiago Ramon y Cajal to functional neuroscience,"

Nature Reviews Neuroscience, vol. 4, January 2003.

[6] Kiyoshi Kawaguchi. (2000, June, 17). The McCulloch-Pitts Model of Neuron [Online].

Available: http://wwwold.ece.utep.edu/research/webfuzzy/docs/kk-thesis/kk-thesis-

html/node12.html

[7] Informatics V – Scientific Computing. Pendulum Project [Online]. Available:

http://www5.in.tum.de/wiki/index.php/Pendulum_Project

[8] Andrew Blais, “An introduction to neural networks – Pattern learning with the back-

propagation algorithm,” IBM Developer Works. July 2001.

[9] Anil K. Jain, "Artificial Neural Networks: A Tutorial," Michigan State University, MI,

March 1996.

[10] Lawrence Davis, David J. Montana, “Training Feedforward Neural Networks Using Genetic

Algorithms,” BBN Systems and Technologies Corp.

[11] Fundamentals of Neural Networks. [Online] Available:

http://www.myreaders.info/08_Neural_Networks.pdf


[12] Roman Bertolami et al, “A Novel Connectionist System for Unconstrained Handwriting

Recognition,” CH-6928 Manno-Lugano, Switzerland. May 9, 2008.

[13] Andries P. Engelbrecht, Computational Intelligence: An Introduction, 2nd ed. England: John

Wiley & Sons Ltd, 2007.

[14] Simon Haykin, Neural Networks: A Comprehensive Foundation, 2nd ed. Pearson Prentice

Hall, 1999.

[15] Speech Recognition. [Online] Available:

http://www.learnartificialneuralnetworks.com/speechrecognition.html#neuralnetwork

[16] Engebretson et al, “Identifying Language from Raw Speech – An Application of

Recurrent Neural Networks,” Department of Computer Science, Washington University.

[17] Yuliang Feng, Xiange Sun, "Analysis and processing speech signal based on MATLAB,"

Electrical and Control Engineering (ICECE), 2011 International Conference on,

pp. 555-556, 16-18 Sept. 2011.

[18] Yasunari Yoshitomi, Taro Asada and Masayoshi Tabuse (2011). Vowel Judgment for Facial

Expression Recognition of a Speaker, Speech Technologies, Prof. Ivo Ipsic (Ed.), ISBN:

978-953-307-996-7

[19] Shivanker Dev Dhingra, Geeta Nijhawan, Poonam Pandit, "Isolated Speech Recognition

Using MFCC and DTW," International Journal of Advanced Research in Electrical,

Electronics and Instrumentation Engineering, vol. 2, issue 8, August 2013.

[20] Mark Gales, Steve Young, “The Application of Hidden Markov Models in Speech

Recognition,” Foundations and Trends in Signal Processing. Vol. 1, No. 3, 2007.

[21] William Y. Huang, Richard P. Lippmann, “Neural Net and Traditional Classifiers,” Neural

Information Processing Systems. Colo. 1987.


[22] Alexander Waibel et al., "Phoneme recognition using time-delay neural

networks," Acoustics, Speech and Signal Processing, IEEE Transactions on, vol. 37,

no. 3, pp. 328-339, Mar 1989.

[23] Artificial Neural Networks/Feed-Forward Networks. [Online] Available:

http://en.wikibooks.org/wiki/Artificial_Neural_Networks/Feed-Forward_Networks

[24] Malcolm Slaney, Auditory Toolbox. [Online] Available:

https://engineering.purdue.edu/~malcolm/interval/1998-010/

[25] Jeff Heaton, Introduction to Neural Networks with Java, 2nd ed. Heaton Research, Inc. 2008.

[26] Hecht-Nielsen, R., "Theory of the backpropagation neural network," Neural Networks,

1989. IJCNN. International Joint Conference on, vol. 1, pp. 593-605, 1989.

[27] Hagan, M. T.; Menhaj, M. B., "Training feedforward networks with the Marquardt

algorithm," Neural Networks, IEEE Transactions on, vol. 5, no. 6, pp. 989-993, Nov 1994.

[28] Md. Sahidullah and Goutam Saha. 2012. “Design, analysis and experimental evaluation of

block based transformation in MFCC computation for speaker recognition.” Speech

Communication 54, 4 (May 2012), 543-565.


Appendix A

Source Code for trainingPrep.m

% Read the five training recordings of each word (first 2000 samples)
go1 = wavread('go1', 2000);
go2 = wavread('go2', 2000);
go3 = wavread('go3', 2000);
go4 = wavread('go4', 2000);
go5 = wavread('go5', 2000);

% Convert each recording into a 130-element MFCC feature vector
Go1 = mfcc(go1); Go1 = Go1'; Go1 = Go1(:)'; Go1 = Go1';
Go2 = mfcc(go2); Go2 = Go2'; Go2 = Go2(:)'; Go2 = Go2';
Go3 = mfcc(go3); Go3 = Go3'; Go3 = Go3(:)'; Go3 = Go3';
Go4 = mfcc(go4); Go4 = Go4'; Go4 = Go4(:)'; Go4 = Go4';
Go5 = mfcc(go5); Go5 = Go5'; Go5 = Go5(:)'; Go5 = Go5';

hello1 = wavread('hello1', 2000);
hello2 = wavread('hello2', 2000);
hello3 = wavread('hello3', 2000);
hello4 = wavread('hello4', 2000);
hello5 = wavread('hello5', 2000);

Hello1 = mfcc(hello1); Hello1 = Hello1'; Hello1 = Hello1(:)'; Hello1 = Hello1';
Hello2 = mfcc(hello2); Hello2 = Hello2'; Hello2 = Hello2(:)'; Hello2 = Hello2';
Hello3 = mfcc(hello3); Hello3 = Hello3'; Hello3 = Hello3(:)'; Hello3 = Hello3';
Hello4 = mfcc(hello4); Hello4 = Hello4'; Hello4 = Hello4(:)'; Hello4 = Hello4';
Hello5 = mfcc(hello5); Hello5 = Hello5'; Hello5 = Hello5(:)'; Hello5 = Hello5';

no1 = wavread('no1', 2000);
no2 = wavread('no2', 2000);
no3 = wavread('no3', 2000);
no4 = wavread('no4', 2000);
no5 = wavread('no5', 2000);

No1 = mfcc(no1); No1 = No1'; No1 = No1(:)'; No1 = No1';
No2 = mfcc(no2); No2 = No2'; No2 = No2(:)'; No2 = No2';
No3 = mfcc(no3); No3 = No3'; No3 = No3(:)'; No3 = No3';
No4 = mfcc(no4); No4 = No4'; No4 = No4(:)'; No4 = No4';
No5 = mfcc(no5); No5 = No5'; No5 = No5(:)'; No5 = No5';

stop1 = wavread('stop1', 2000);
stop2 = wavread('stop2', 2000);
stop3 = wavread('stop3', 2000);
stop4 = wavread('stop4', 2000);
stop5 = wavread('stop5', 2000);

Stop1 = mfcc(stop1); Stop1 = Stop1'; Stop1 = Stop1(:)'; Stop1 = Stop1';
Stop2 = mfcc(stop2); Stop2 = Stop2'; Stop2 = Stop2(:)'; Stop2 = Stop2';
Stop3 = mfcc(stop3); Stop3 = Stop3'; Stop3 = Stop3(:)'; Stop3 = Stop3';
Stop4 = mfcc(stop4); Stop4 = Stop4'; Stop4 = Stop4(:)'; Stop4 = Stop4';
Stop5 = mfcc(stop5); Stop5 = Stop5'; Stop5 = Stop5(:)'; Stop5 = Stop5';

yes1 = wavread('yes1', 2000);
yes2 = wavread('yes2', 2000);
yes3 = wavread('yes3', 2000);
yes4 = wavread('yes4', 2000);
yes5 = wavread('yes5', 2000);

Yes1 = mfcc(yes1); Yes1 = Yes1'; Yes1 = Yes1(:)'; Yes1 = Yes1';
Yes2 = mfcc(yes2); Yes2 = Yes2'; Yes2 = Yes2(:)'; Yes2 = Yes2';
Yes3 = mfcc(yes3); Yes3 = Yes3'; Yes3 = Yes3(:)'; Yes3 = Yes3';
Yes4 = mfcc(yes4); Yes4 = Yes4'; Yes4 = Yes4(:)'; Yes4 = Yes4';
Yes5 = mfcc(yes5); Yes5 = Yes5'; Yes5 = Yes5(:)'; Yes5 = Yes5';

% Assemble the 130 x 25 training matrix, one word per column
trainMatrix(:, 1) = Go1;  trainMatrix(:, 2) = Go2;  trainMatrix(:, 3) = Go3;  trainMatrix(:, 4) = Go4;  trainMatrix(:, 5) = Go5;
trainMatrix(:, 6) = Hello1;  trainMatrix(:, 7) = Hello2;  trainMatrix(:, 8) = Hello3;  trainMatrix(:, 9) = Hello4;  trainMatrix(:, 10) = Hello5;
trainMatrix(:, 11) = No1;  trainMatrix(:, 12) = No2;  trainMatrix(:, 13) = No3;  trainMatrix(:, 14) = No4;  trainMatrix(:, 15) = No5;
trainMatrix(:, 16) = Stop1;  trainMatrix(:, 17) = Stop2;  trainMatrix(:, 18) = Stop3;  trainMatrix(:, 19) = Stop4;  trainMatrix(:, 20) = Stop5;
trainMatrix(:, 21) = Yes1;  trainMatrix(:, 22) = Yes2;  trainMatrix(:, 23) = Yes3;  trainMatrix(:, 24) = Yes4;  trainMatrix(:, 25) = Yes5;

% 5 x 25 target matrix: one row per class, one column per training word
targetMatrix = [1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0;
                0 0 0 0 0 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0;
                0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0;
                0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 0 0 0 0 0;
                0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1];

Page 65: Implementing Speech Recognition with Artificial Neural ...archives.algomau.ca/main/sites/default/files/2015-035_001_002_001.pdf · Implementing Speech Recognition with Artificial

58

Appendix B

Source Code for createNetwork.m

% Solve a Pattern Recognition Problem with a Neural Network
% Script generated by NPRTOOL
% Created Fri Mar 14 11:36:58 EDT 2014
%
% This script assumes these variables are defined:
%
%   trainMatrix - input data.
%   targetMatrix - target data.

inputs = trainMatrix;
targets = targetMatrix;

% Create a Pattern Recognition Network
hiddenLayerSize = 90;
myNetwork = patternnet(hiddenLayerSize);

% Choose Input and Output Pre/Post-Processing Functions
% For a list of all processing functions type: help nnprocess
myNetwork.inputs{1}.processFcns = {'removeconstantrows','mapminmax'};
myNetwork.outputs{2}.processFcns = {'removeconstantrows','mapminmax'};

% Setup Division of Data for Training, Validation, Testing
% For a list of all data division functions type: help nndivide
myNetwork.divideFcn = 'dividerand';  % Divide data randomly
myNetwork.divideMode = 'sample';     % Divide up every sample
myNetwork.divideParam.trainRatio = 70/100;
myNetwork.divideParam.valRatio = 15/100;
myNetwork.divideParam.testRatio = 15/100;

% For help on training function 'trainlm' type: help trainlm
% For a list of all training functions type: help nntrain
myNetwork.trainFcn = 'trainlm';  % Levenberg-Marquardt

% Choose a Performance Function
% For a list of all performance functions type: help nnperformance
myNetwork.performFcn = 'mse';  % Mean squared error

% Choose Plot Functions
% For a list of all plot functions type: help nnplot
myNetwork.plotFcns = {'plotperform','plottrainstate','ploterrhist', ...
    'plotregression', 'plotfit'};

% Train the Network
[myNetwork,tr] = train(myNetwork,inputs,targets);

% Test the Network
outputs = myNetwork(inputs);
errors = gsubtract(targets,outputs);
performance = perform(myNetwork,targets,outputs)

% Recalculate Training, Validation and Test Performance
trainTargets = targets .* tr.trainMask{1};
valTargets = targets .* tr.valMask{1};
testTargets = targets .* tr.testMask{1};
trainPerformance = perform(myNetwork,trainTargets,outputs)
valPerformance = perform(myNetwork,valTargets,outputs)
testPerformance = perform(myNetwork,testTargets,outputs)

% View the Network
view(myNetwork)

% Plots
% Uncomment these lines to enable various plots.
%figure, plotperform(tr)
%figure, plottrainstate(tr)
%figure, plotconfusion(targets,outputs)
%figure, ploterrhist(errors)


Appendix C

Source Code for testNetwork.m

function [] = testNetwork( testSoundFile , myNetwork)
% testNetwork  Classify a recorded word with the trained network.
%   Reads the given wav file, extracts its MFCC feature vector, and
%   prints the class chosen by the network.
fileName = testSoundFile;
myNetwork1 = myNetwork;

% Read the recording and build the 130-element feature vector
testSound = wavread(fileName, 2000);
TestSound = mfcc(testSound);
TestSound = TestSound';
TestSound = TestSound(:)';
TestSound = TestSound';

% Simulate the network and round each class output to 0 or 1
ResultMatrix = sim(myNetwork1, TestSound);
Class1 = round(ResultMatrix(1));
Class2 = round(ResultMatrix(2));
Class3 = round(ResultMatrix(3));
Class4 = round(ResultMatrix(4));
Class5 = round(ResultMatrix(5));

% Report the winning class
if Class1 == 1
    display('The word is Go');
elseif Class2 == 1
    display('The word is Hello');
elseif Class3 == 1
    display('The word is No');
elseif Class4 == 1
    display('The word is Stop');
elseif Class5 == 1
    display('The word is Yes');
end
end


Appendix D

Source Code for mfcc.m

% mfcc - Mel frequency cepstrum coefficient analysis.
%   [ceps,freqresp,fb,fbrecon,freqrecon] = ...
%       mfcc(input, samplingRate, [frameRate])
% Find the cepstral coefficients (ceps) corresponding to the
% input. Four other quantities are optionally returned that
% represent:
%   the detailed fft magnitude (freqresp) used in MFCC calculation,
%   the mel-scale filter bank output (fb)
%   the filter bank output by inverting the cepstrals with a cosine
%       transform (fbrecon),
%   the smooth frequency response by interpolating the fb reconstruction
%       (freqrecon)
% -- Malcolm Slaney, August 1993
% Modified a bit to make testing an algorithm easier... 4/15/94
% Fixed Cosine Transform (indices of cos() were swapped) - 5/26/95
% Added optional frameRate argument - 6/8/95
% Added proper filterbank reconstruction using inverse DCT - 10/27/95
% Added filterbank inversion to reconstruct spectrum - 11/1/95

% (c) 1998 Interval Research Corporation

function [ceps,freqresp,fb,fbrecon,freqrecon] = ...
    mfcc(input, samplingRate, frameRate)
global mfccDCTMatrix mfccFilterWeights

[r c] = size(input);
if (r > c)
    input=input';
end

% Filter bank parameters
lowestFrequency = 133.3333;
linearFilters = 13;
linearSpacing = 66.66666666;
logFilters = 27;
logSpacing = 1.0711703;
fftSize = 512;
cepstralCoefficients = 13;
windowSize = 400;
windowSize = 256;   % Standard says 400, but 256 makes more sense
                    % Really should be a function of the sample
                    % rate (and the lowestFrequency) and the
                    % frame rate.
if (nargin < 2) samplingRate = 16000; end;
if (nargin < 3) frameRate = 100; end;

% Keep this around for later....
totalFilters = linearFilters + logFilters;

% Now figure the band edges. Interesting frequencies are spaced
% by linearSpacing for a while, then go logarithmic. First figure
% all the interesting frequencies. Lower, center, and upper band
% edges are all consequtive interesting frequencies.

freqs = lowestFrequency + (0:linearFilters-1)*linearSpacing;
freqs(linearFilters+1:totalFilters+2) = ...
    freqs(linearFilters) * logSpacing.^(1:logFilters+2);

lower = freqs(1:totalFilters);
center = freqs(2:totalFilters+1);
upper = freqs(3:totalFilters+2);

% We now want to combine FFT bins so that each filter has unit
% weight, assuming a triangular weighting function. First figure
% out the height of the triangle, then we can figure out each
% frequencies contribution
mfccFilterWeights = zeros(totalFilters,fftSize);
triangleHeight = 2./(upper-lower);
fftFreqs = (0:fftSize-1)/fftSize*samplingRate;

for chan=1:totalFilters
    mfccFilterWeights(chan,:) = ...
        (fftFreqs > lower(chan) & fftFreqs <= center(chan)).* ...
        triangleHeight(chan).*(fftFreqs-lower(chan))/(center(chan)-lower(chan)) + ...
        (fftFreqs > center(chan) & fftFreqs < upper(chan)).* ...
        triangleHeight(chan).*(upper(chan)-fftFreqs)/(upper(chan)-center(chan));
end
%semilogx(fftFreqs,mfccFilterWeights')
%axis([lower(1) upper(totalFilters) 0 max(max(mfccFilterWeights))])

hamWindow = 0.54 - 0.46*cos(2*pi*(0:windowSize-1)/windowSize);

if 0    % Window it like ComplexSpectrum
    windowStep = samplingRate/frameRate;
    a = .54;
    b = -.46;
    wr = sqrt(windowStep/windowSize);
    phi = pi/windowSize;
    hamWindow = 2*wr/sqrt(4*a*a+2*b*b)* ...
        (a + b*cos(2*pi*(0:windowSize-1)/windowSize + phi));
end

% Figure out Discrete Cosine Transform. We want a matrix
% dct(i,j) which is totalFilters x cepstralCoefficients in size.
% The i,j component is given by
%   cos( i * (j+0.5)/totalFilters pi )
% where we have assumed that i and j start at 0.
mfccDCTMatrix = 1/sqrt(totalFilters/2)*cos((0:(cepstralCoefficients-1))' * ...
    (2*(0:(totalFilters-1))+1) * pi/2/totalFilters);
mfccDCTMatrix(1,:) = mfccDCTMatrix(1,:) * sqrt(2)/2;

%imagesc(mfccDCTMatrix);

% Filter the input with the preemphasis filter. Also figure how
% many columns of data we will end up with.
if 1
    preEmphasized = filter([1 -.97], 1, input);
else
    preEmphasized = input;
end
windowStep = samplingRate/frameRate;
cols = fix((length(input)-windowSize)/windowStep);

% Allocate all the space we need for the output arrays.
ceps = zeros(cepstralCoefficients, cols);
if (nargout > 1) freqresp = zeros(fftSize/2, cols); end;
if (nargout > 2) fb = zeros(totalFilters, cols); end;

% Invert the filter bank center frequencies. For each FFT bin
% we want to know the exact position in the filter bank to find
% the original frequency response. The next block of code finds the
% integer and fractional sampling positions.
if (nargout > 4)
    fr = (0:(fftSize/2-1))'/(fftSize/2)*samplingRate/2;
    j = 1;
    for i=1:(fftSize/2)
        if fr(i) > center(j+1)
            j = j + 1;
        end
        if j > totalFilters-1
            j = totalFilters-1;
        end
        fr(i) = min(totalFilters-.0001, ...
            max(1,j + (fr(i)-center(j))/(center(j+1)-center(j))));
    end
    fri = fix(fr);
    frac = fr - fri;

    freqrecon = zeros(fftSize/2, cols);
end

% Ok, now let's do the processing. For each chunk of data:
%   * Window the data with a hamming window,
%   * Shift it into FFT order,
%   * Find the magnitude of the fft,
%   * Convert the fft data into filter bank outputs,
%   * Find the log base 10,
%   * Find the cosine transform to reduce dimensionality.
for start=0:cols-1
    first = start*windowStep + 1;
    last = first + windowSize-1;
    fftData = zeros(1,fftSize);
    fftData(1:windowSize) = preEmphasized(first:last).*hamWindow;
    fftMag = abs(fft(fftData));
    earMag = log10(mfccFilterWeights * fftMag');

    ceps(:,start+1) = mfccDCTMatrix * earMag;
    if (nargout > 1) freqresp(:,start+1) = fftMag(1:fftSize/2)'; end;
    if (nargout > 2) fb(:,start+1) = earMag; end
    if (nargout > 3)
        fbrecon(:,start+1) = ...
            mfccDCTMatrix(1:cepstralCoefficients,:)' * ...
            ceps(:,start+1);
    end
    if (nargout > 4)
        f10 = 10.^fbrecon(:,start+1);
        freqrecon(:,start+1) = samplingRate/fftSize * ...
            (f10(fri).*(1-frac) + f10(fri+1).*frac);
    end
end

% OK, just to check things, let's also reconstruct the original FB
% output. We do this by multiplying the cepstral data by the transpose
% of the original DCT matrix. This all works because we were careful to
% scale the DCT matrix so it was orthonormal.
if 1 & (nargout > 3)
    fbrecon = mfccDCTMatrix(1:cepstralCoefficients,:)' * ceps;
%   imagesc(mt(:,1:cepstralCoefficients)*mfccDCTMatrix);
end;