Post on 03-Apr-2018
7/29/2019 project22_final_paper.doc
http://slidepdf.com/reader/full/project22finalpaperdoc 1/21
Word Recognition Device
By
C.K. Liang & Oliver Tsai
ECE 345 Final Project
TA: Inseop Lee
Project Number: 22
ABSTRACT
The word recognition device uses linear predictive coding (LPC) to recognize a small set of words. LPC relies on the resonant frequencies of the human voice to detect a pattern match: the LPC algorithm calculates a specific set of coefficients for a waveform, and if two sets of coefficients are numerically similar, the two waveforms may contain the same vowel sound. Since particular words contain particular vowels, LPC can be used for word recognition. The speaker-dependent word recognition device is implemented on the Motorola DSP56303.
TABLE OF CONTENTS

1. INTRODUCTION (OT)
   1.1 Device Functionality
   1.2 Block Diagrams
   1.3 Performance Specification
   1.4 Subprojects Overview
2. DESIGN PROCEDURE (CK)
   2.1 General Design Block Diagram
   2.2 Input
   2.3 Filtering
   2.4 Spectral Analysis
   2.5 Sound Matching
   2.6 Output
3. DESIGN DETAILS (CK)
   3.1 Block Diagram
   3.2 A/D Converter
   3.3 End Point Detection
   3.4 Pre-emphasis Filter
   3.5 Frame Blocking
   3.6 Hamming Window
   3.7 Auto-Correlation
   3.8 Levinson-Durbin Algorithm
   3.9 Sum of Squared Differences (SSD) Comparison
   3.10 Output
4. DESIGN VERIFICATION (OT)
   4.1 Design Verification Theory
5. COSTS (OT)
6. CONCLUSIONS (CK)
   6.1 Difficulties Encountered
   6.2 Summary
APPENDIX 1. VOWEL SOUND WAVEFORM
APPENDIX 2. CONSONANT SOUND WAVEFORM
APPENDIX 3. LPC COEFFICIENTS SUMMARY
APPENDIX 4. ASSEMBLY CODE
APPENDIX 5. DSP 56303 COMPONENT LAYOUT
REFERENCES
1. INTRODUCTION
1.1 Device Functionality
The speaker-dependent word recognition device is implemented on the Motorola DSP56303. First, the speaker trains the device by storing ten different vowel sounds in memory. The same speaker can then repeat one of the ten words associated with those vowel sounds, and the device detects which word was spoken and flags the appropriate output.
1.2 Block Diagrams

a. Training the device

Fig. 1.1 Training the Device: Vowel Sound → Microphone Input → A/D Converter → Calculate LPC Coefficients → Store Coefficients in Memory

b. Word Recognition

Fig. 1.2 Word Recognition: Vowel Sound → Microphone Input → A/D Converter → Calculate LPC Coefficients → Compare Coefficients with Those in Memory → Output
1.3 Performance Specification

First, 10 sets of LPC coefficients are trained without background noise and stored in the vowel template. Speech recognition is then tested without noise, and with −20 dB, −10 dB, and 0 dB noise in turn. The noise is generated by a MATLAB program using the function 'rand'. Refer to Fig. 1.3 for the noise diagrams; the performance results are shown in Table 1.1.

TABLE 1.1 SUMMARY OF PERFORMANCE WITH NOISE

No Noise   −20 dB Noise   −10 dB Noise   0 dB Noise
 10/10         9/10           9/10          7/10
If the noise level exceeds 0 dB, it crosses the current volume threshold and triggers the device before any speech is spoken.
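The noise in these tests was generated in MATLAB with 'rand'. The sketch below is a rough Python analog, not the original MATLAB code; the function name and the choice to scale the noise level relative to the signal's RMS are illustrative assumptions.

```python
import math
import random

def add_noise(samples, noise_db):
    # Add zero-mean uniform noise scaled relative to the signal's RMS level.
    # noise_db is the noise level relative to the signal, e.g. -20 dB.
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    scale = rms * 10 ** (noise_db / 20.0)   # -20 dB -> 0.1 * rms
    return [s + scale * (2.0 * random.random() - 1.0) for s in samples]
```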
1.4 Subprojects Overview
The project divides into four major components. The first component handles the input of speech samples into the Motorola DSP56303. The second component filters the speech samples. The third component performs spectral analysis on the speech samples. The final component uses logic circuits and LEDs to output the results.
2. DESIGN PROCEDURE
2.1 General Design Block Diagram
Fig. 2.1 General Design Block Diagram
2.2 Input
With a microphone connected to the Motorola DSP56303 chip, raw analog speech can be taken in through the analog-to-digital (A/D) converter for processing. The A/D converter takes digital samples at 8 kHz.
2.3 Filtering

A Finite Impulse Response (FIR) filter increases the magnitudes of the frequencies that are characteristic of the vowel sound. Since vowel sounds concentrate most of their energy at very low frequencies, the FIR filter acts like a high-pass filter: it boosts the high-frequency components of the waveform and reduces the low-frequency components.
[Fig. 2.1: Input → Filtering → Spectral Analysis → Sound Matching (SSD) → Output]
2.4 Spectral Analysis

The primary goal of this component is to calculate 10 LPC coefficients for the waveform. A 30-millisecond window frame is created from the most recent samples. While the DSP56303 processes the current frame, new samples continue to arrive. After 10 ms, the DSP56303 creates another 30 ms window frame from the most recent samples, so there is a 20 ms overlap of samples between the previous window and the most recent one. This process continues throughout the speech sample.
2.5 Sound Matching

To compare the word templates with the spoken word, the Motorola DSP56303 calculates the Sum of Squared Differences (SSD) over the 10 LPC coefficients between each vowel template in memory and the spoken word. Among the ten vowel templates, the one with the minimum SSD value is the likely match for the spoken word. However, a maximum threshold is also set: the minimum SSD must not exceed it for a match to be declared.
2.6 Output

When a match is found, the device indicates it by lighting one of ten LEDs. Each LED corresponds to one of the ten words.
3. DESIGN DETAILS
3.1 Block Diagram

Fig. 3.1 General Block Diagram: A/D Converter → End Point Detection → Pre-emphasis Filter → Frame Blocking → Hamming Window → Auto-Correlation → Levinson-Durbin Algorithm → SSD Comparison → Output

3.2 A/D Converter

On the Motorola DSP56303, the device converts analog signals to digital samples through an assembly file called 'core302.asm'. The samples are read from the CODEC A/D input port as shown in Fig. 5. The assembly file initializes the peripheral settings needed for general I/O. It also contains a macro called waitdata, which waits for a sample and reads it in. The sampling rate is set to 8000 samples/second.

3.3 End Point Detection

In end point detection, each sample taken from the A/D converter is compared with a volume threshold. If the sample is below the threshold, it is treated as background noise and disregarded. Otherwise, the DSP board, as shown in Fig. 5, outputs 4 bits high on Port B to indicate readiness to process speech samples, and the next 2000 samples are stored in a buffer before processing.
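The end-point logic can be sketched in Python (the actual implementation is DSP56303 assembly; the function name and threshold value here are illustrative):

```python
def detect_start(samples, threshold, buffer_size=2000):
    # Scan for the first sample whose magnitude crosses the volume
    # threshold, then return the next buffer_size samples for processing.
    # Returns None if the threshold is never crossed (background noise only).
    for i, s in enumerate(samples):
        if abs(s) >= threshold:
            return samples[i:i + buffer_size]
    return None
```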
3.4 Pre-emphasis Filter

The pre-emphasis filter is a low-order digital filter with the transfer function shown in Equation (3.1).

H(z) = 1 − 0.9375 z⁻¹ (3.1)

The digitized speech signal passes through this filter to even out transmission conditions, background noise, and signal spectra. The filter boosts the high-frequency components of the human voice and attenuates the low-frequency components. Because the human voice typically carries more power at low frequencies, the filter flattens the spectrum and makes the speech sample easier to use for the LPC calculation.
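In the time domain, the filter of Equation (3.1) is the difference equation y(n) = x(n) − 0.9375 x(n−1). A minimal Python sketch (the original runs in DSP56303 assembly):

```python
def pre_emphasize(x, a=0.9375):
    # Apply H(z) = 1 - a * z^-1, i.e. y[n] = x[n] - a * x[n-1].
    # The first output sample has no predecessor, so it passes through.
    y = [x[0]]
    for n in range(1, len(x)):
        y.append(x[n] - a * x[n - 1])
    return y
```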
3.5 Frame Blocking

The pre-emphasized speech samples are divided into 30-ms window frames. Each 30-ms window frame consists of 240 samples, as illustrated in Equations (3.2) and (3.3).

(Sampling Rate)(Frame Length) = Number of Samples in a Frame (3.2)

(8000 samples/second)(0.030 second) = 240 samples (3.3)

In addition, adjacent window frames are separated by 80 samples (240 × 1/3), leaving 160 overlapping samples. The amounts of separation and overlap depend on the frame length, and the frame length is chosen according to the sampling rate: the higher the sampling rate, the longer the frame must be to remain accurate.
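The frame blocking above (240-sample frames, 80-sample separation, 160-sample overlap) can be sketched as:

```python
def frame_block(samples, frame_len=240, hop=80):
    # Slice into overlapping frames: 240 samples (30 ms at 8 kHz),
    # advanced by 80 samples (10 ms), so adjacent frames share 160 samples.
    frames = []
    start = 0
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += hop
    return frames
```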
3.6 Hamming window
The equation to form a Hamming window is shown below in Equation (3.4).
W(n) = 0.54 – 0.46 cos(2πn/239), 0 ≤ n ≤ 239 (3.4)
Equation (3.4) generates 240 discrete points for the Hamming window. For each 30-ms window frame, the 240 samples are multiplied point by point with the 240 discrete Hamming window points. The Hamming window gradually tapers the window frame to zero at its beginning and end boundaries, so the signal discontinuities at the start and end of each frame are minimized. Intuitively, when a signal of finite length is passed through a filter of finite length, the beginning and end of the filtered signal depend on samples before and after the signal; since those samples are unavailable, the output should be weighted most heavily in the middle of the frame.
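Equation (3.4) and the point-by-point multiplication can be sketched as:

```python
import math

def hamming(N=240):
    # W(n) = 0.54 - 0.46 cos(2*pi*n / (N-1)), for n = 0 .. N-1;
    # with N = 240 this matches Equation (3.4).
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (N - 1)) for n in range(N)]

def apply_window(frame):
    # Multiply each frame sample by the corresponding window point.
    w = hamming(len(frame))
    return [s * wn for s, wn in zip(frame, w)]
```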
3.7 Auto-Correlation
After the Hamming window, each window frame of 240 samples can be used to
calculate the auto-correlation coefficients. The auto-correlation formula is shown below
in Equation (3.5).
Ri = ∑ x(n) x(n−i),  summed over n = i to 239,  for i = 0 to 10 (3.5)
The largest index i corresponds to the order of the LPC analysis; for a sampling rate of 8 kHz, the typical LPC order is 10. Note that the auto-correlation coefficient R0 is the energy of the window frame. After this first stage of the LPC computation, the 11 auto-correlation coefficients are combined with the Levinson-Durbin algorithm to find the 10 LPC coefficients.
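Equation (3.5) over a windowed 240-sample frame can be sketched as:

```python
def autocorrelate(frame, order=10):
    # R_i = sum of x(n) * x(n - i) over the frame, for i = 0 .. order.
    # R_0 is the energy of the window frame.
    N = len(frame)
    return [sum(frame[n] * frame[n - i] for n in range(i, N))
            for i in range(order + 1)]
```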
3.8 Levinson-Durbin algorithm
Each window frame is modeled as an IIR filter with the transfer function shown in Equation (3.6).

H(z) = G / (1 + A1 z⁻¹ + A2 z⁻² + …. + A10 z⁻¹⁰) (3.6)

The 10 LPC coefficients (A1, A2, …., A10) are the solution to the auto-correlation system in Equation (3.7) [V&SP 141].

| R0  R1  R2  ….  R9 | | A1  |       | R1  |
| R1  R0  R1  ….  R8 | | A2  |       | R2  |
| R2  R1  R0  ….  R7 | | A3  |  = −  | R3  |   (3.7)
| ….  ….  ….  ….  …. | | ….  |       | ….  |
| R9  R8  R7  ….  R0 | | A10 |       | R10 |

The solution is calculated with the following recursive method.

Given: An(0) = 1 and E0 = R0
For n = 1 to 10:
    Kn = (−1/En−1) ∑ An−1(i) Rn−i,  summed over i = 0 to n−1
    An(n) = Kn
    For i = 1 to n−1:
        An(i) = An−1(i) + Kn An−1(n−i)
    End For
    En = En−1 (1 − Kn²)
End For

where A10(1), A10(2), A10(3), …., A10(10) are the 10 LPC coefficients.
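The recursion above transcribes directly to Python (a sketch, not the assembly implementation):

```python
def levinson_durbin(R, order=10):
    # Solve the Toeplitz system of Eq. (3.7) by the recursion above.
    # R is the list of autocorrelation coefficients R0 .. R_order.
    A = [1.0] + [0.0] * order     # A_n(0) = 1
    E = R[0]                      # E_0 = R_0, the frame energy
    for n in range(1, order + 1):
        # Reflection coefficient K_n
        k = -sum(A[i] * R[n - i] for i in range(n)) / E
        new_A = A[:]
        new_A[n] = k              # A_n(n) = K_n
        for i in range(1, n):
            # A_n(i) = A_{n-1}(i) + K_n * A_{n-1}(n - i)
            new_A[i] = A[i] + k * A[n - i]
        A = new_A
        E *= 1.0 - k * k          # E_n = E_{n-1} (1 - K_n^2)
    return A[1:]                  # the LPC coefficients A1 .. A_order
```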
3.9 Sum of Squared Differences (SSD) Comparison

The Sum of Squared Differences comparison is a quantitative method for comparing two sets of LPC coefficients. Suppose one set of LPC coefficients in the template is A′1, A′2, A′3, …., A′10, and another set obtained from a window frame is A1, A2, A3, …., A10. The SSD is given by Equation (3.8).

SSD = (A′1 − A1)² + (A′2 − A2)² + (A′3 − A3)² + …. + (A′10 − A10)² (3.8)

Each time the window frame is shifted, the SSD is calculated between the LPC coefficients of the window frame and every set of LPC coefficients in the template. The template entry with the minimum SSD value is the closest match to the input vowel.
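Equation (3.8) and the threshold-gated template search can be sketched as (function names are illustrative):

```python
def ssd(a, b):
    # Sum of squared differences between two coefficient sets, Eq. (3.8).
    return sum((x - y) ** 2 for x, y in zip(a, b))

def best_match(templates, coeffs, threshold):
    # Return the index of the closest template, or None if even the
    # minimum SSD exceeds the maximum threshold (no match declared).
    scores = [ssd(t, coeffs) for t in templates]
    best = min(range(len(scores)), key=scores.__getitem__)
    return best if scores[best] <= threshold else None
```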
3.10 Output

The Motorola DSP56303 outputs 4 bits on its Port B. These four bits are mapped to 10 LEDs corresponding to the 10 vowel sounds; an additional LED is used for the volume threshold. The mapping is done with NAND gates as shown in Fig. 5. A specific LED turns off while the other LEDs stay on, indicating that the vowel corresponding to that LED was detected. The threshold LED serves as an indicator for the speaker: while recording, when the LED turns off, it signals that the speech has been picked up by the device and that the next 2000 speech samples will be stored in a buffer and processed.
4. DESIGN VERIFICATION
4.1 Design Verification Theory

The major underlying theme of this project is that vowel sounds are used for storing LPC coefficients and are the key to making word recognition possible. Vowel sounds, as opposed to consonant sounds, are used because their LPC coefficients fluctuate minimally throughout the waveform. This is possible because the vowel sound waveform is periodic, as shown in Fig. 4.1 in Appendix 2. The first plot shows the entire waveform of the vowel sound 'a', and the second plot shows a 60 ms (480 sample) window frame.

Since the waveform is consistently periodic, the LPC coefficients for each window frame do not fluctuate greatly. In MATLAB simulations, LPC coefficients were calculated for each window frame across the entire sample; from these, the mean of each coefficient over all window frames was computed, and with the mean, the variance of each coefficient as well. Table 1 in Appendix 1 summarizes the variances of each coefficient for the vowel sound 'a'. These variances are significantly smaller than the variances for the consonant sound 't', as shown in Table 1 in Appendix 2. Moreover, the plot of the waveform for 't' in Fig. 4.2 in Appendix 3 immediately reveals a lack of periodicity, which explains the high variance for consonant sounds. Thus vowel sounds are used to store LPC templates because of their periodicity.

Moreover, each vowel sound has a distinct set of LPC coefficients, as illustrated in Table 2 in Appendix 1. This characteristic is used to distinguish one vowel sound from another. Since words are composed of vowels, words can also be distinguished from each other. As a result, exploiting these characteristics makes word recognition possible.
5. COSTS

• Parts:

TABLE 5.1 PARTS COSTS

Motorola DSP56303   $300.00
Telex Microphone     $15.00

• Labor:

TABLE 5.2 LABOR COSTS

Name          Total Hours   Salary     Total
C.K. Liang    200           $20/hour   $4,000.00
Oliver Tsai   200           $20/hour   $4,000.00

• Grand Total:

TABLE 5.3 TOTAL COST

Labor       Parts     Grand Total
$8,000.00   $315.00   $8,315.00
6. CONCLUSIONS
6.1 Difficulties Encountered
6.1.1 Indirect connection between the microphone and the DSP board

The device relied on an inexpensive, low-quality microphone for input. Since the computer must power the microphone, the microphone is connected to the computer's microphone port, and the computer's speaker output is connected to the CODEC A/D port of the DSP board. Because the microphone is only indirectly connected to the board, the computer may add undesired noise to the speech samples. An alternative solution is to purchase a high-quality microphone that can be connected directly to the DSP board.
6.1.2 Incompatible core302 I/O assembly file

Since the core302 assembly file was written for DSP56302 boards, it does not work with the DSP56303 board as-is. Because the DSP56303 is short on memory, the file had to be modified slightly; in particular, memory locations needed to be initialized correctly. As a result, more memory space was freed up for use.
6.1.3 Insufficient data memory

Due to insufficient memory on the DSP56303 chip, only 2000 digitized speech samples are stored and processed at a time. Since 2000 samples are not enough to capture a whole word, full word recognition is not feasible on the DSP56303. However, 2000 samples (equivalent to 1/4 second at an 8 kHz sampling rate) are enough to capture a whole vowel sound.
6.2 Summary

The calculation and comparison of LPC coefficients on vowel sounds can be used to recognize different vowel sounds. However, due to the buffer restriction of storing only 2000 samples, only vowel recognition was achievable; increasing the buffer size to 4000 samples should be adequate for word recognition. A major improvement to consider is continuous speech recognition: recent approaches combining LPC coefficients with Hidden Markov Models, state diagrams, and decision rules could be applied to make continuous speech recognition possible.
Please download more appendices from the addresses below:
https://www-s.ece.uiuc.edu/ece345/projects/spring99/project22_file1doc
https://www-s.ece.uiuc.edu/ece345/projects/spring99/project22_file2doc
https://www-s.ece.uiuc.edu/ece345/projects/spring99/project22_file3.doc
REFERENCES
T. Parsons, Voice and Speech Processing. New York: McGraw-Hill, 1987.

L.R. Rabiner and R.W. Schafer, Digital Processing of Speech Signals. New Jersey: Prentice-Hall, 1978.