Post on 03-Apr-2018
7/29/2019 project22_final_paper.doc
http://slidepdf.com/reader/full/project22finalpaperdoc 1/21
Word Recognition Device
By
C.K. Liang & Oliver Tsai
ECE 345 Final Project
TA: Inseop Lee
Project Number: 22
ABSTRACT
The word recognition device uses linear predictive coding (LPC) to recognize a small set of words. LPC relies on the resonant frequencies of the human voice to detect a pattern match: the LPC algorithm calculates a specific set of coefficients for a waveform, and if two sets of coefficients are numerically similar, the two waveforms may contain the same vowel sound. Since particular words contain particular vowels, LPC can be used for word recognition. The speaker-dependent word recognition device is implemented on the Motorola DSP56303.
TABLE OF CONTENTS

1. INTRODUCTION (OT)
   1.1 Device Functionality
   1.2 Block Diagrams
   1.3 Performance Specification
   1.4 Subprojects Overview
2. DESIGN PROCEDURE (CK)
   2.1 General Design Block Diagram
   2.2 Input
   2.3 Filtering
   2.4 Spectral Analysis
   2.5 Sound Matching
   2.6 Output
3. DESIGN DETAILS (CK)
   3.1 Block Diagram
   3.2 A/D Converter
   3.3 End Point Detection
   3.4 Pre-emphasis Filter
   3.5 Frame Blocking
   3.6 Hamming Window
   3.7 Auto-Correlation
   3.8 Levinson-Durbin Algorithm
   3.9 Sum of Squared Differences (SSD) Comparison
   3.10 Output
4. DESIGN VERIFICATION (OT)
   4.1 Design Verification Theory
5. COSTS (OT)
6. CONCLUSIONS (CK)
   6.1 Difficulties Encountered
   6.2 Summary
APPENDIX 1. VOWEL SOUND WAVEFORM
APPENDIX 2. CONSONANT SOUND WAVEFORM
APPENDIX 3. LPC COEFFICIENTS SUMMARY
APPENDIX 4. ASSEMBLY CODE
APPENDIX 5. DSP 56303 COMPONENT LAYOUT
REFERENCES
1. INTRODUCTION
1.1 Device Functionality
The speaker-dependent word recognition device is implemented on the Motorola DSP56303. First, the speaker trains the device by storing ten different vowel sounds in memory. The same speaker can then repeat one of the ten words associated with those vowel sounds, and the device detects which word was spoken and flags the appropriate output.
1.2 Block Diagrams

a. Training the device

Fig. 1.1 Training the Device: Vowel Sound → Microphone Input → A/D Converter → Calculate LPC Coefficients → Store Coefficients in Memory

b. Word Recognition

Fig. 1.2 Word Recognition: Vowel Sound → Microphone Input → A/D Converter → Calculate LPC Coefficients → Compare Coefficients with Those in Memory → Output
1.3 Performance Specification

First, 10 sets of LPC coefficients are trained without background noise and stored in the vowel template. Speech recognition is then tested without noise, and with −20 dB, −10 dB, and 0 dB noise in turn. The noise is generated by a MATLAB program using the function 'rand'. Refer to Fig. 1.3 for the noise diagrams; the performance results are shown in Table 1.1.

TABLE 1.1 SUMMARY OF PERFORMANCE WITH NOISE

No Noise   −20 dB Noise   −10 dB Noise   0 dB Noise
 10/10         9/10           9/10          7/10
If the noise level exceeds 0 dB, it crosses the current volume threshold and triggers the device before any speech is spoken.
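The noise in these tests was generated in MATLAB with 'rand'. The sketch below is a rough Python analog, not the original MATLAB code; the function name and the choice to scale the noise level relative to the signal's RMS are illustrative assumptions.

```python
import math
import random

def add_noise(samples, noise_db):
    # Add zero-mean uniform noise scaled relative to the signal's RMS level.
    # noise_db is the noise level relative to the signal, e.g. -20 dB.
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    scale = rms * 10 ** (noise_db / 20.0)   # -20 dB -> 0.1 * rms
    return [s + scale * (2.0 * random.random() - 1.0) for s in samples]
```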
1.4 Subprojects Overview
The project divides into four major components. The first component handles the input of speech samples into the Motorola DSP56303. The second component filters the speech samples. The third component performs spectral analysis on the speech samples. The final component uses logic circuits and LEDs to output the results.
2. DESIGN PROCEDURE
2.1 General Design Block Diagram
Fig. 2.1 General Design Block Diagram
2.2 Input
With a microphone connected to the Motorola DSP56303 chip, raw analog speech can be taken in through the analog-to-digital (A/D) converter for processing. The A/D converter takes digital samples at 8 kHz.
2.3 Filtering

A Finite Impulse Response (FIR) filter increases the magnitudes of the frequencies that are characteristic of the vowel sound. Since vowel sounds concentrate most of their energy at very low frequencies, the FIR filter acts like a high-pass filter: it boosts the high-frequency components of the waveform and reduces the low-frequency components.
[Fig. 2.1: Input → Filtering → Spectral Analysis → Sound Matching (SSD) → Output]
2.4 Spectral Analysis

The primary goal of this component is to calculate 10 LPC coefficients for the waveform. A 30-millisecond window frame is created from the most recent samples. While the DSP56303 processes the current frame, new samples continue to arrive. After 10 ms, the DSP56303 creates another 30 ms window frame from the most recent samples, so there is a 20 ms overlap of samples between the previous window and the most recent one. This process continues throughout the speech sample.
2.5 Sound Matching

To compare the word templates with the spoken word, the Motorola DSP56303 calculates the Sum of Squared Differences (SSD) over the 10 LPC coefficients between each vowel template in memory and the spoken word. Among the ten vowel templates, the one with the minimum SSD value is the likely match for the spoken word. However, a maximum threshold is also set: the minimum SSD must not exceed it for a match to be declared.
2.6 Output

When a match is found, the device indicates it by lighting one of ten LEDs. Each LED corresponds to one of the ten words.
3. DESIGN DETAILS
3.1 Block Diagram

Fig. 3.1 General Block Diagram: A/D Converter → End Point Detection → Pre-emphasis Filter → Frame Blocking → Hamming Window → Auto-Correlation → Levinson-Durbin Algorithm → SSD Comparison → Output

3.2 A/D Converter

On the Motorola DSP56303, the device converts analog signals to digital samples through an assembly file called 'core302.asm'. The samples are read from the CODEC A/D input port as shown in Fig. 5. The assembly file initializes the peripheral settings needed for general I/O. It also contains a macro called waitdata, which waits for a sample and reads it in. The sampling rate is set to 8000 samples/second.

3.3 End Point Detection

In end point detection, each sample taken from the A/D converter is compared with a volume threshold. If the sample is below the threshold, it is treated as background noise and disregarded. Otherwise, the DSP board, as shown in Fig. 5, outputs 4 bits high on Port B to indicate readiness to process speech samples, and the next 2000 samples are stored in a buffer before processing.
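The end-point logic can be sketched in Python (the actual implementation is DSP56303 assembly; the function name and threshold value here are illustrative):

```python
def detect_start(samples, threshold, buffer_size=2000):
    # Scan for the first sample whose magnitude crosses the volume
    # threshold, then return the next buffer_size samples for processing.
    # Returns None if the threshold is never crossed (background noise only).
    for i, s in enumerate(samples):
        if abs(s) >= threshold:
            return samples[i:i + buffer_size]
    return None
```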
3.4 Pre-emphasis Filter

The pre-emphasis filter is a low-order digital filter with the transfer function shown in Equation (3.1).

H(z) = 1 − 0.9375 z⁻¹ (3.1)

The digitized speech signal passes through this filter to even out transmission conditions, background noise, and signal spectra. The filter boosts the high-frequency components of the human voice and attenuates the low-frequency components. Because the human voice typically carries more power at low frequencies, the filter flattens the spectrum and makes the speech sample easier to use for the LPC calculation.
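In the time domain, the filter of Equation (3.1) is the difference equation y(n) = x(n) − 0.9375 x(n−1). A minimal Python sketch (the original runs in DSP56303 assembly):

```python
def pre_emphasize(x, a=0.9375):
    # Apply H(z) = 1 - a * z^-1, i.e. y[n] = x[n] - a * x[n-1].
    # The first output sample has no predecessor, so it passes through.
    y = [x[0]]
    for n in range(1, len(x)):
        y.append(x[n] - a * x[n - 1])
    return y
```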
3.5 Frame Blocking

The pre-emphasized speech samples are divided into 30-ms window frames. Each 30-ms window frame consists of 240 samples, as illustrated in Equations (3.2) and (3.3).

(Sampling Rate)(Frame Length) = Number of Samples in a Frame (3.2)

(8000 samples/second)(0.030 second) = 240 samples (3.3)

In addition, adjacent window frames are separated by 80 samples (240 × 1/3), leaving 160 overlapping samples. The amounts of separation and overlap depend on the frame length, and the frame length is chosen according to the sampling rate: the higher the sampling rate, the longer the frame must be to remain accurate.
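The frame blocking above (240-sample frames, 80-sample separation, 160-sample overlap) can be sketched as:

```python
def frame_block(samples, frame_len=240, hop=80):
    # Slice into overlapping frames: 240 samples (30 ms at 8 kHz),
    # advanced by 80 samples (10 ms), so adjacent frames share 160 samples.
    frames = []
    start = 0
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += hop
    return frames
```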
3.6 Hamming window
The equation to form a Hamming window is shown below in Equation (3.4).
W(n) = 0.54 – 0.46 cos(2πn/239), 0 ≤ n ≤ 239 (3.4)
Equation (3.4) generates 240 discrete points for the Hamming window. For each 30-ms window frame, the 240 samples are multiplied point by point with the 240 discrete Hamming window points. The Hamming window gradually tapers the window frame to zero at its beginning and end boundaries, so the signal discontinuities at the start and end of each frame are minimized. Intuitively, when a signal of finite length is passed through a filter of finite length, the beginning and end of the filtered signal depend on samples before and after the signal; since those samples are unavailable, the output should be weighted most heavily in the middle of the frame.
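Equation (3.4) and the point-by-point multiplication can be sketched as:

```python
import math

def hamming(N=240):
    # W(n) = 0.54 - 0.46 cos(2*pi*n / (N-1)), for n = 0 .. N-1;
    # with N = 240 this matches Equation (3.4).
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (N - 1)) for n in range(N)]

def apply_window(frame):
    # Multiply each frame sample by the corresponding window point.
    w = hamming(len(frame))
    return [s * wn for s, wn in zip(frame, w)]
```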
3.7 Auto-Correlation
After the Hamming window, each window frame of 240 samples can be used to
calculate the auto-correlation coefficients. The auto-correlation formula is shown below
in Equation (3.5).
Ri = ∑ x(n) x(n−i),  summed over n = i to 239,  for i = 0 to 10 (3.5)
The largest index i corresponds to the order of the LPC analysis; for a sampling rate of 8 kHz, the typical LPC order is 10. Note that the auto-correlation coefficient R0 is the energy of the window frame. After this first stage of the LPC computation, the 11 auto-correlation coefficients are combined with the Levinson-Durbin algorithm to find the 10 LPC coefficients.
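Equation (3.5) over a windowed 240-sample frame can be sketched as:

```python
def autocorrelate(frame, order=10):
    # R_i = sum of x(n) * x(n - i) over the frame, for i = 0 .. order.
    # R_0 is the energy of the window frame.
    N = len(frame)
    return [sum(frame[n] * frame[n - i] for n in range(i, N))
            for i in range(order + 1)]
```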
3.8 Levinson-Durbin algorithm
Each window frame is modeled as an IIR filter with the transfer function shown in Equation (3.6).

H(z) = G / (1 + A1 z⁻¹ + A2 z⁻² + …. + A10 z⁻¹⁰) (3.6)

The 10 LPC coefficients (A1, A2, …., A10) are the solution to the auto-correlation system in Equation (3.7) [V&SP 141].

| R0  R1  R2  ….  R9 | | A1  |       | R1  |
| R1  R0  R1  ….  R8 | | A2  |       | R2  |
| R2  R1  R0  ….  R7 | | A3  |  = −  | R3  |   (3.7)
| ….  ….  ….  ….  …. | | ….  |       | ….  |
| R9  R8  R7  ….  R0 | | A10 |       | R10 |

The solution is calculated with the following recursive method.

Given: An(0) = 1 and E0 = R0
For n = 1 to 10:
    Kn = (−1/En−1) ∑ An−1(i) Rn−i,  summed over i = 0 to n−1
    An(n) = Kn
    For i = 1 to n−1:
        An(i) = An−1(i) + Kn An−1(n−i)
    End For
    En = En−1 (1 − Kn²)
End For

where A10(1), A10(2), A10(3), …., A10(10) are the 10 LPC coefficients.
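The recursion above transcribes directly to Python (a sketch, not the assembly implementation):

```python
def levinson_durbin(R, order=10):
    # Solve the Toeplitz system of Eq. (3.7) by the recursion above.
    # R is the list of autocorrelation coefficients R0 .. R_order.
    A = [1.0] + [0.0] * order     # A_n(0) = 1
    E = R[0]                      # E_0 = R_0, the frame energy
    for n in range(1, order + 1):
        # Reflection coefficient K_n
        k = -sum(A[i] * R[n - i] for i in range(n)) / E
        new_A = A[:]
        new_A[n] = k              # A_n(n) = K_n
        for i in range(1, n):
            # A_n(i) = A_{n-1}(i) + K_n * A_{n-1}(n - i)
            new_A[i] = A[i] + k * A[n - i]
        A = new_A
        E *= 1.0 - k * k          # E_n = E_{n-1} (1 - K_n^2)
    return A[1:]                  # the LPC coefficients A1 .. A_order
```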
3.9 Sum of Squared Differences (SSD) Comparison

The Sum of Squared Differences comparison is a quantitative method for comparing two sets of LPC coefficients. Suppose one set of LPC coefficients in the template is A′1, A′2, A′3, …., A′10, and another set obtained from a window frame is A1, A2, A3, …., A10. The SSD is given by Equation (3.8).

SSD = (A′1 − A1)² + (A′2 − A2)² + (A′3 − A3)² + …. + (A′10 − A10)² (3.8)

Each time the window frame is shifted, the SSD is calculated between the LPC coefficients of the window frame and every set of LPC coefficients in the template. The template entry with the minimum SSD value is the closest match to the input vowel.
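Equation (3.8) and the threshold-gated template search can be sketched as (function names are illustrative):

```python
def ssd(a, b):
    # Sum of squared differences between two coefficient sets, Eq. (3.8).
    return sum((x - y) ** 2 for x, y in zip(a, b))

def best_match(templates, coeffs, threshold):
    # Return the index of the closest template, or None if even the
    # minimum SSD exceeds the maximum threshold (no match declared).
    scores = [ssd(t, coeffs) for t in templates]
    best = min(range(len(scores)), key=scores.__getitem__)
    return best if scores[best] <= threshold else None
```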
3.10 Output

The Motorola DSP56303 outputs 4 bits on its Port B. These four bits are mapped to 10 LEDs corresponding to the 10 vowel sounds; an additional LED is used for the volume threshold. The mapping is done with NAND gates as shown in Fig. 5. A specific LED turns off while the other LEDs stay on, indicating that the vowel corresponding to that LED was detected. The threshold LED serves as an indicator for the speaker: while recording, when the LED turns off, it signals that the speech has been picked up by the device and that the next 2000 speech samples will be stored in a buffer and processed.
4. DESIGN VERIFICATION
4.1 Design Verification Theory

The major underlying theme of this project is that vowel sounds are used for storing LPC coefficients and are the key to making word recognition possible. Vowel sounds, as opposed to consonant sounds, are used because their LPC coefficients fluctuate minimally throughout the waveform. This is possible because the vowel sound waveform is periodic, as shown in Fig. 4.1 in Appendix 2. The first plot shows the entire waveform of the vowel sound 'a', and the second plot shows a 60 ms (480 sample) window frame.

Since the waveform is consistently periodic, the LPC coefficients for each window frame do not fluctuate greatly. In MATLAB simulations, LPC coefficients were calculated for each window frame across the entire sample; from these, the mean of each coefficient over all window frames was computed, and with the mean, the variance of each coefficient as well. Table 1 in Appendix 1 summarizes the variances of each coefficient for the vowel sound 'a'. These variances are significantly smaller than the variances for the consonant sound 't', as shown in Table 1 in Appendix 2. Moreover, the plot of the waveform for 't' in Fig. 4.2 in Appendix 3 immediately reveals a lack of periodicity, which explains the high variance for consonant sounds. Thus vowel sounds are used to store LPC templates because of their periodicity.

Moreover, each vowel sound has a distinct set of LPC coefficients, as illustrated in Table 2 in Appendix 1. This characteristic is used to distinguish one vowel sound from another. Since words are composed of vowels, words can also be distinguished from each other. As a result, exploiting these characteristics makes word recognition possible.
5. COSTS

• Parts:

TABLE 5.1 PARTS COSTS

Motorola DSP56303   $300.00
Telex Microphone     $15.00

• Labor:

TABLE 5.2 LABOR COSTS

Name          Total Hours   Salary     Total
C.K. Liang    200           $20/hour   $4,000.00
Oliver Tsai   200           $20/hour   $4,000.00

• Grand Total:

TABLE 5.3 TOTAL COST

Labor       Parts     Grand Total
$8,000.00   $315.00   $8,315.00
6. CONCLUSIONS
6.1 Difficulties Encountered
6.1.1 Indirect connection between the microphone and the DSP board

The device relied on an inexpensive, low-quality microphone for input. Since the computer must power the microphone, the microphone is connected to the computer's microphone port, and the computer's speaker output is connected to the CODEC A/D port of the DSP board. Because the microphone is only indirectly connected to the board, the computer may add undesired noise to the speech samples. An alternative solution is to purchase a high-quality microphone that can be connected directly to the DSP board.
6.1.2 Incompatible core302 I/O assembly file

Since the core302 assembly file was written for DSP56302 boards, it does not work with the DSP56303 board as-is. Because the DSP56303 is short on memory, the file had to be modified slightly; in particular, memory locations needed to be initialized correctly. As a result, more memory space was freed up for use.
6.1.3 Insufficient data memory

Due to insufficient memory on the DSP56303 chip, only 2000 digitized speech samples are stored and processed at a time. Since 2000 samples are not enough to capture a whole word, full word recognition is not feasible on the DSP56303. However, 2000 samples (equivalent to 1/4 second at an 8 kHz sampling rate) are enough to capture a whole vowel sound.
6.2 Summary

The calculation and comparison of LPC coefficients on vowel sounds can be used to recognize different vowel sounds. However, due to the buffer restriction of storing only 2000 samples, only vowel recognition was achievable; increasing the buffer size to 4000 samples should be adequate for word recognition. A major improvement to consider is continuous speech recognition: recent approaches combining LPC coefficients with Hidden Markov Models, state diagrams, and decision rules could be applied to make continuous speech recognition possible.
Please download more appendices from the addresses below:
https://www-s.ece.uiuc.edu/ece345/projects/spring99/project22_file1doc
https://www-s.ece.uiuc.edu/ece345/projects/spring99/project22_file2doc
https://www-s.ece.uiuc.edu/ece345/projects/spring99/project22_file3.doc
REFERENCES
T. Parsons, Voice and Speech Processing. New York: McGraw-Hill, 1987.

L.R. Rabiner and R.W. Schafer, Digital Processing of Speech Signals. New Jersey: Prentice-Hall, 1978.