DYNAMIC TIME WARPING IN KEY WORD SPOTTING


Transcript of DYNAMIC TIME WARPING IN KEY WORD SPOTTING

Page 1: DYNAMIC TIME WARPING IN KEY WORD SPOTTING

DYNAMIC TIME WARPING IN KEY WORD SPOTTING

Page 2: DYNAMIC TIME WARPING IN KEY WORD SPOTTING

OUTLINE

• KWS and role of DTW in it.

• Brief outline of DTW

• What is training and why is it needed?

• DTW training algorithm

• Raw data preprocessing

• Results

• Suggestions

Page 3: DYNAMIC TIME WARPING IN KEY WORD SPOTTING

SYSTEM BLOCK-DIAGRAM

[Block diagram of the system: Audio Interface, Front-End, Back-End, and Training/Testing/Analysis subsystems making up the Key-Word Recognizer, with output shown on a Monitor.]

Page 4: DYNAMIC TIME WARPING IN KEY WORD SPOTTING

FUNCTIONS OF SUBSYSTEMS

• The audio interface samples sound and provides it to the other subsystems. It also indicates detection of the key word.

• The front-end detects intervals of the data in which voice is present and converts them into sequences of feature vectors.

• The back-end compares the sequence of feature vectors from the front-end with the template (or set of templates) provided by the training block and sends the score to the analysis unit.

• Training/Testing/Analysis creates templates, analyzes the matching score, and decides which key word the input corresponds to.

Page 5: DYNAMIC TIME WARPING IN KEY WORD SPOTTING

DUAL ROLE OF DTW

• DTW can be used as a comparison method in the back-end

• DTW can be used as an averaging tool in training to create the template.

• In the first case the use is on-line; in the second, it is off-line.

Page 6: DYNAMIC TIME WARPING IN KEY WORD SPOTTING

PRINCIPLES OF DTW

• When comparing two sequences of feature vectors, the global difference must be the sum of the local differences between the frames the sequences consist of.

• Because the phonetic content distribution of an input is not known in advance, frames corresponding to different phones may end up being compared: comparing "apples and oranges," so to speak.

• DTW minimizes the effect of such comparisons by finding the correspondence between the frames for which the global distance is minimal.

Page 7: DYNAMIC TIME WARPING IN KEY WORD SPOTTING
Page 8: DYNAMIC TIME WARPING IN KEY WORD SPOTTING

GLOBAL AND LOCAL DISTANCE

Each cell (i, j) of the array stores the accumulated global distance D(i, j) together with the local distance d(i, j) between frame i of one input and frame j of the other. The array is filled with the recurrence

D(i+1, j+1) = min( D(i, j+1), D(i, j), D(i+1, j) ) + d(i+1, j+1)
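As an illustration (not part of the original slides), here is a minimal Python sketch of filling the accumulated-distance array with this recurrence; the Euclidean local distance between frames is an assumption, since the presentation does not specify one.

```python
import numpy as np

def dtw_distance(X, Y):
    """Global DTW distance between two sequences of feature vectors.

    X has shape (N, dim) and Y has shape (M, dim). Cells are filled with
    D[i, j] = min(D[i-1, j], D[i-1, j-1], D[i, j-1]) + d(i, j),
    which is the recurrence shown above.
    """
    N, M = len(X), len(Y)
    D = np.full((N + 1, M + 1), np.inf)  # row/column 0 act as the boundary
    D[0, 0] = 0.0
    for i in range(1, N + 1):
        for j in range(1, M + 1):
            d = np.linalg.norm(X[i - 1] - Y[j - 1])  # local distance d(i, j)
            D[i, j] = min(D[i - 1, j], D[i - 1, j - 1], D[i, j - 1]) + d
    return D[N, M]  # global distance D[N, M]
```

For example, calling dtw_distance on two feature matrices produced by the same front-end yields the matching score that the back-end would send to the analysis unit.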

Page 9: DYNAMIC TIME WARPING IN KEY WORD SPOTTING

[Figure: the DTW array, populated cell by cell starting from the initial cell D[0, 0].]

Page 10: DYNAMIC TIME WARPING IN KEY WORD SPOTTING

DTW continued

• Iteratively populating the array as described above leads to the global distance D[N, M].

• The path itself, however, is not yet known. If the application requires knowledge of the path, the array can be populated with data structures containing not only the global distance at a cell but also the indices of the preceding cell on the best trajectory, as sketched below.
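The following sketch (again Python, with the same assumed Euclidean local distance) illustrates this path-recovery variant: each cell additionally records its predecessor, and the warping path is recovered by backtracking from D[N, M].

```python
import numpy as np

def dtw_with_path(X, Y):
    """DTW that also stores, for each cell, the indices of the preceding
    cell, so the warping path can be recovered by backtracking from (N, M).
    Returns the global distance and the path as 1-based (i, j) frame pairs."""
    N, M = len(X), len(Y)
    D = np.full((N + 1, M + 1), np.inf)
    D[0, 0] = 0.0
    prev = {}  # (i, j) -> predecessor cell on the best trajectory
    for i in range(1, N + 1):
        for j in range(1, M + 1):
            d = np.linalg.norm(X[i - 1] - Y[j - 1])
            candidates = [(D[i - 1, j], (i - 1, j)),
                          (D[i - 1, j - 1], (i - 1, j - 1)),
                          (D[i, j - 1], (i, j - 1))]
            best, prev[(i, j)] = min(candidates)
            D[i, j] = best + d
    # Backtrack from (N, M) to (0, 0) to recover the warping path.
    path, cell = [], (N, M)
    while cell != (0, 0):
        path.append(cell)
        cell = prev[cell]
    path.reverse()
    return D[N, M], path
```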

Page 11: DYNAMIC TIME WARPING IN KEY WORD SPOTTING

What is training and why is it needed?

• There is a huge number of realizations, or tokens, of the same word we wish to recognize. They differ in length and acoustic content, so the distance between two realizations of a given word may be bigger than the distance between a realization of that word and a realization of a different word. That can lead to a wrong recognition decision.

Page 12: DYNAMIC TIME WARPING IN KEY WORD SPOTTING

Training continued

• To reduce the probability of the scenario described above, an averaging procedure is designed. Its result is a realization of the word that is, overall, well matched with all the tokens of the training data. This constructed realization of the word is called a template. Some tokens within the data may be matched perfectly and some not so well, but the template matches fairly well with every token of the training data.

Page 13: DYNAMIC TIME WARPING IN KEY WORD SPOTTING

Possible length distribution in the data

[Figure: histogram of token lengths in the training data. X-axis: number of frames (35 to 70); y-axis: number of tokens (0 to 10).]

Page 14: DYNAMIC TIME WARPING IN KEY WORD SPOTTING

TRAINING USING DTW

• Find the distribution of the lengths of the training tokens and the token(s) of average length.

• Use this average-length token as one input to the DTW program.

• Use the rest of the training data iteratively as the other input to the DTW.

• Every token of the data is warped against the average-length input; frames of different tokens corresponding to the same frame of the average input are added together and averaged.

• The template obtained in this manner is used as an input to the DTW program to repeat the cycle.

• Repeat until convergence (a sketch of this loop follows below).
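A sketch of this training loop in Python, reusing the dtw_with_path function sketched earlier; the convergence test and the iteration cap are assumptions, since the slides only say "repeat until convergence".

```python
import numpy as np

def train_template(tokens, template, n_iter=20, tol=1e-4):
    """Iterative template averaging with DTW, following the steps above.

    tokens: list of (length, dim) arrays of feature vectors.
    template: initial template, e.g. a token of average length.
    """
    template = np.asarray(template, dtype=float)
    for _ in range(n_iter):
        A = np.zeros_like(template)   # A[n]: sum of frames warped onto template frame n
        C = np.zeros(len(template))   # C[n]: number of frames warped onto frame n
        for token in tokens:
            _, path = dtw_with_path(template, token)  # warp token against template
            for i, j in path:         # template frame i aligned with token frame j
                A[i - 1] += token[j - 1]
                C[i - 1] += 1
        new_template = A / C[:, None]  # template frame X[n] = A[n] / C[n]
        if np.max(np.abs(new_template - template)) < tol:
            return new_template        # converged
        template = new_template        # A and C start from zero in the next cycle
    return template
```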

Page 15: DYNAMIC TIME WARPING IN KEY WORD SPOTTING

AVERAGING WITH DTW

[Figure: averaging with DTW. Frames warped onto template frame n are summed into an accumulator An with counter Cn; likewise Am and Cm for frame m.]

Page 16: DYNAMIC TIME WARPING IN KEY WORD SPOTTING

• After one cycle, the new template frame is Xn = An / Cn.

• Before the next cycle, An and Cn are reset to 0.

Page 17: DYNAMIC TIME WARPING IN KEY WORD SPOTTING

RAW DATA

• All training data were given as a directory of raw data files, each resulting from sampling an analog signal at 8 kHz.

• A description file was also given, with information about the date, time, filename, beginning and end of speech and, finally, the word spoken.

Page 18: DYNAMIC TIME WARPING IN KEY WORD SPOTTING

Raw data preprocessing

• The training program required knowledge of the beginning of each speech segment as well as its length, both expressed in frames. For this reason, the description file was read, and the necessary information was extracted and written into the file 'filelist'. A Perl program was used.

• The training program also required tokens represented by feature vectors, which is why the raw data files corresponding to the key word were converted to feature files. Again, the program was written in Perl, with the front-end called from inside it. As a result, a new directory 'Operator' with the feature files was created. (An illustrative sketch of this preprocessing follows below.)
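The Perl scripts themselves are not reproduced in the presentation. Purely as an illustration, here is a Python sketch of the 'filelist' step; the description-file column layout, the key word "operator", and the 80-samples-per-frame conversion (a 10 ms frame shift at 8 kHz) are all assumptions.

```python
SAMPLES_PER_FRAME = 80  # assumption: 10 ms frame shift at the 8 kHz sampling rate

# Assumed (hypothetical) line layout of the description file:
#   date time filename begin_sample end_sample word
with open("description") as desc, open("filelist", "w") as out:
    for line in desc:
        date, time, filename, begin, end, word = line.split()
        if word.lower() != "operator":  # keep only tokens of the key word (assumed)
            continue
        begin_frame = int(begin) // SAMPLES_PER_FRAME
        length = (int(end) - int(begin)) // SAMPLES_PER_FRAME
        # one entry per token: feature-file path, start and length in frames
        out.write(f"Operator/{filename} {begin_frame} {length}\n")
```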

Page 19: DYNAMIC TIME WARPING IN KEY WORD SPOTTING

• The file 'filelist' obtained in this manner, together with the feature file corresponding to the token of average length, served as the inputs to the training program using the DTW averaging algorithm described above.

• The average token was the input X, and the input Y was read sequentially from the directory 'Operator', pointed to by the file paths in the input file 'filelist'.

Page 20: DYNAMIC TIME WARPING IN KEY WORD SPOTTING

RESULTS

• The DTW program was written and tested on data obtained in Dr. Silaghi's Speech Recognition class through the HTK front-end. Although a few deviations were observed, there was a clear correlation between the phonetic similarity of words and the DTW score. In retrospect, it was a mistake not to test it on the training data before writing the training program.

Page 21: DYNAMIC TIME WARPING IN KEY WORD SPOTTING

RESULTS CONTINUED

• When the training program was finished, it turned out that DTW did not discriminate the training data well enough, so it was pointless to test the template produced by the training program.

• Because the DTW program is relatively straightforward, it is more likely that the problem lies in the front-end. In any case, this remains to be seen.

• Overall, it can be concluded that the DTW program and the training program based on it were developed but not tested, due to possible problems in the front-end.

Page 22: DYNAMIC TIME WARPING IN KEY WORD SPOTTING

SUGGESTIONS

• To gain better insight into DTW, it would be interesting to incorporate MATLAB into studying it.

• The DTW code is simple enough to be executed as an M-script. It needs only two arrays, X and Y, as inputs, imported after reading the corresponding feature files. The advantage is that MATLAB provides extensive tools for examining the DTW path.

• The wave files of the inputs should also be imported and their spectrograms studied, to see the correlation between the phonetic similarities indicated by the spectrograms and the behavior of the DTW path.