Mathew 1

7:46 PM me: Hello Sir !

52 minutes

8:38 PM mathew.magimaidoss: hello Shweta

are you online?

me: Hello

yes

8:39 PM mathew.magimaidoss: sorry, I didn't see your last message

me: That's ok

mathew.magimaidoss: is it ok to start the call

me: ya

12 minutes

8:52 PM mathew.magimaidoss: Speech ----> DTW -----> Recognition score

Video -------> Matching (correlation) ----> Recognition score

8:53 PM combine speech score and video/visual score

8:54 PM speech score denote it as A

visual score denote it as B

A + B

w1 . A + w2 . B

w1 + w2 = 1
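
A minimal sketch of the weighted combination above. It assumes both scores have already been normalized to the same range and orientation (for example, higher means a better match); the chat does not say how that is done. The function name and the default weight are hypothetical.

```python
def fuse_scores(speech_score, visual_score, w1=0.5):
    """Combine speech score A and visual score B as w1*A + w2*B, with w1 + w2 = 1."""
    if not 0.0 <= w1 <= 1.0:
        raise ValueError("w1 must lie in [0, 1] so that w2 = 1 - w1 is also a valid weight")
    w2 = 1.0 - w1  # enforces the constraint w1 + w2 = 1
    return w1 * speech_score + w2 * visual_score
```

Setting w1 = 1 falls back to speech only and w1 = 0 to video only, so the weight can be tuned on held-out data.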

8:57 PM http://publications.idiap.ch/downloads/reports/2000/rr00-35.pdf

9:01 PM cepstral coefficients,

mel frequency cepstral coefficients

9:03 PM 20-30 ms - frame size

frame-shift - 10 ms

10 ms shift

25 ms frame
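
The framing step with a 25 ms frame and a 10 ms shift can be sketched as below. The function name is hypothetical, and dropping the incomplete frame at the end is an assumption (padding it is an equally common choice).

```python
def frame_signal(samples, sample_rate, frame_ms=25, shift_ms=10):
    """Split a list of samples into overlapping frames."""
    frame_len = int(sample_rate * frame_ms / 1000)  # samples per frame
    shift = int(sample_rate * shift_ms / 1000)      # samples between frame starts
    frames = []
    start = 0
    # Assumption: a trailing partial frame is discarded rather than padded.
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += shift
    return frames
```

Because the shift is shorter than the frame, consecutive frames overlap by 15 ms.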

9:04 PM HTK

5 minutes

9:10 PM mathew.magimaidoss: Dynamic Time Warping

9:13 PM Sakoe, H. and Chiba, S., Dynamic programming algorithm optimization for spoken word recognition, IEEE Transactions on Acoustics, Speech and Signal Processing, 26(1), pp. 43-49, 1978
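
A minimal sketch of DTW between two sequences of feature vectors. This is the basic textbook recursion with a Euclidean local distance. It is not the exact formulation in the Sakoe and Chiba paper, which adds slope constraints and an adjustment window. The function name is hypothetical.

```python
import math

def dtw_distance(seq_a, seq_b):
    """Accumulated DTW cost between two sequences of equal-dimension feature vectors."""
    def local_dist(x, y):
        # Euclidean distance between two frames (an assumed choice of local distance).
        return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    # cost[i][j] = best accumulated cost aligning the first i frames of seq_a
    # with the first j frames of seq_b.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = local_dist(seq_a[i - 1], seq_b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # seq_b frame repeated
                                 cost[i][j - 1],      # seq_a frame repeated
                                 cost[i - 1][j - 1])  # both sequences advance
    return cost[n][m]
```

For recognition, the test utterance would be compared against one stored template per word, and the word with the lowest cost chosen.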

9:14 PM http://publications.idiap.ch/downloads/papers/2007/aradilla-mlmi-2007.pdf

9:15 PM http://publications.idiap.ch/downloads/papers/2011/Soldo_ICASSP_2011.pdf

9:17 PM http://www.cstr.ed.ac.uk/research/projects/featureMLPs/

9:20 PM speech -> extract cepstral coefficients

9:21 PM close talking microphone

9:22 PM six

9:23 PM end point detection

9:25 PM for each frame compute the energy

9:26 PM N frames

N energies

take frames 1 to 10 at the beginning

and take last 10 frames

9:28 PM 00000010000011111111111111110001000000

median smoothing
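
The end point detection steps above can be sketched as follows. The chat does not spell out how the first and last 10 frames are used; this sketch assumes they contain only background noise and sets the threshold to a multiple of their mean energy. The factor of 3, the smoothing width of 5, and all function names are hypothetical.

```python
from statistics import mean

def frame_energies(frames):
    """One energy value per frame: the sum of squared samples."""
    return [sum(s * s for s in frame) for frame in frames]

def speech_mask(energies, edge_frames=10, factor=3.0):
    """Mark each frame 1 (speech) or 0 (silence) against a noise-based threshold."""
    # Assumption: the first and last `edge_frames` frames hold only background noise.
    noise = energies[:edge_frames] + energies[-edge_frames:]
    threshold = factor * mean(noise)
    return [1 if e > threshold else 0 for e in energies]

def median_smooth(mask, width=5):
    """Median-filter a 0/1 mask (odd width) to remove isolated decisions."""
    if not mask:
        return []
    half = width // 2
    # Repeat the edge values so every window has exactly `width` entries.
    padded = [mask[0]] * half + list(mask) + [mask[-1]] * half
    return [sorted(padded[i:i + width])[half] for i in range(len(mask))]

def endpoints(mask):
    """Return (first, last) index of frames marked as speech, or None if there are none."""
    speech = [i for i, v in enumerate(mask) if v == 1]
    return (speech[0], speech[-1]) if speech else None
```

On a decision sequence like the one above, the smoothing removes the isolated 1s on either side of the long run, so the end points land on the run itself.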

9:30 PM two microphones

one close talking and one table top

9:35 PM speech -> cepstral features -> DTW -> score

speech -> cepstral features -> posterior features -> DTW -> score

9:36 PM MFCC

why not implement PLP cepstral coefficients

9:38 PM HTK

code

5 minutes

9:44 PM mathew.magimaidoss: for each word

10 different participants

5 male 5 female

20 different participants

10 male 10 female

9:45 PM 10 words

10 x 20

200 utterances

9:46 PM 10 x 20 x 5

9:47 PM word_m01_t01

word_m01_t02

word_m02_t01

word_f01_t01

9:48 PM 20 x 20 x 5
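
A small sketch that generates file names in the word_m01_t01 pattern and counts the utterances (words x speakers x trials). The function name and the defaults of 10 male speakers, 10 female speakers and 5 trials are taken as assumptions from the numbers above.

```python
def utterance_names(words, n_male=10, n_female=10, n_trials=5):
    """List names of the form <word>_<m|f><speaker>_t<trial>, e.g. six_m01_t01."""
    names = []
    for word in words:
        for gender, count in (("m", n_male), ("f", n_female)):
            for speaker in range(1, count + 1):
                for trial in range(1, n_trials + 1):
                    # Two-digit, zero-padded indices keep the names sortable.
                    names.append(f"{word}_{gender}{speaker:02d}_t{trial:02d}")
    return names
```

With 10 words this gives 10 x 20 x 5 = 1000 names, and with 20 words 20 x 20 x 5 = 2000.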

9:49 PM age

native language

which region

9:51 PM English words

9:54 PM database

number of words

number of speakers

number of trials

9:55 PM type of microphone

gender balance

equal number of male and female

9:56 PM metadata: age, native language, languages they can speak, which region they are from
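
One possible way to record the speaker metadata listed above, with a check for gender balance. The class, field and function names are hypothetical, and taking the gender from an "m"/"f" prefix on the speaker ID is an assumption based on the file names.

```python
from dataclasses import dataclass, field

@dataclass
class Speaker:
    speaker_id: str                 # e.g. "m01" or "f01", matching the file names
    age: int
    native_language: str
    languages_spoken: list = field(default_factory=list)
    region: str = ""

def is_gender_balanced(speakers):
    """True when the number of male and female speakers is equal."""
    males = sum(1 for s in speakers if s.speaker_id.startswith("m"))
    females = sum(1 for s in speakers if s.speaker_id.startswith("f"))
    return males == females
```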

me: ya ok

mathew.magimaidoss: get a good microphone

for data collection

9:57 PM like a Sennheiser microphone

me: Ok . The call got disconnected.

9:58 PM I'll get a good microphone

9:59 PM mathew.magimaidoss: sampling frequency is based on the bandwidth

20 Hz - 20 kHz

44.1 kHz
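
The link between bandwidth and sampling frequency is the Nyquist criterion: the sampling rate must be at least twice the highest frequency to be captured. A minimal sketch (the function name is hypothetical):

```python
def min_sampling_rate_hz(max_frequency_hz):
    """Nyquist criterion: sample at no less than twice the highest frequency."""
    return 2 * max_frequency_hz
```

For a 20 kHz upper limit this gives 40 kHz, which 44.1 kHz satisfies with some margin.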
