IIS for Speech Processing Michael J. Watts mike.watts.nz
-
Upload
baxter-mcgee -
Category
Documents
-
view
37 -
download
1
description
Transcript of IIS for Speech Processing Michael J. Watts mike.watts.nz
![Page 1: IIS for Speech Processing Michael J. Watts mike.watts.nz](https://reader030.fdocuments.us/reader030/viewer/2022032805/56813164550346895d97da6a/html5/thumbnails/1.jpg)
IIS for Speech Processing
Michael J. Watts
http://mike.watts.net.nz
![Page 2: IIS for Speech Processing Michael J. Watts mike.watts.nz](https://reader030.fdocuments.us/reader030/viewer/2022032805/56813164550346895d97da6a/html5/thumbnails/2.jpg)
Lecture Outline
• Speech Production• Speech Segments• Speech Data Capture• Representing Speech Data• Speech Processing• Speech Recognition
![Page 3: IIS for Speech Processing Michael J. Watts mike.watts.nz](https://reader030.fdocuments.us/reader030/viewer/2022032805/56813164550346895d97da6a/html5/thumbnails/3.jpg)
Speech Production
• Speech is a sound waveo pressure wave
• Created by air passing over the organs of speech
• Sounds modified by the tongue, teeth and lips
![Page 4: IIS for Speech Processing Michael J. Watts mike.watts.nz](https://reader030.fdocuments.us/reader030/viewer/2022032805/56813164550346895d97da6a/html5/thumbnails/4.jpg)
Speech Production
• Http://www.umanitoba.ca/faculties/arts/linguistics/russell/138/sec1/anatomy.htm
![Page 5: IIS for Speech Processing Michael J. Watts mike.watts.nz](https://reader030.fdocuments.us/reader030/viewer/2022032805/56813164550346895d97da6a/html5/thumbnails/5.jpg)
Speech Segments
• There are many levels of speech that linguists deal witho Sentenceo Clauseo Phraseo Wordo Phoneme
• Not a strict hierarchy
![Page 6: IIS for Speech Processing Michael J. Watts mike.watts.nz](https://reader030.fdocuments.us/reader030/viewer/2022032805/56813164550346895d97da6a/html5/thumbnails/6.jpg)
Speech Segments
• A sentence consists of one or more clauses• A clause has at least a predicate, a subject
and expresses a propositiono predicate expresses something about the subjecto proposition is the part of the clause that is
constant
![Page 7: IIS for Speech Processing Michael J. Watts mike.watts.nz](https://reader030.fdocuments.us/reader030/viewer/2022032805/56813164550346895d97da6a/html5/thumbnails/7.jpg)
Speech Segments
• A phrase is more than one word, but less than a clauseo less organised
• A word is a part of a phraseo minimal possible meaningful unito consists of one or more phonemeso there are 250,000+ words in the English language
![Page 8: IIS for Speech Processing Michael J. Watts mike.watts.nz](https://reader030.fdocuments.us/reader030/viewer/2022032805/56813164550346895d97da6a/html5/thumbnails/8.jpg)
Speech Segments
• A phoneme is the smallest unique segment in a spoken languageo there are 44 phonemes in standard New Zealand
Englisho standard (English) English has 43
• A phoneme consists of one or more phones
![Page 9: IIS for Speech Processing Michael J. Watts mike.watts.nz](https://reader030.fdocuments.us/reader030/viewer/2022032805/56813164550346895d97da6a/html5/thumbnails/9.jpg)
Speech Data Capture
• With a microphone o well, Duh!
• Digitising soundo two parameterso sample rateo Resolution
![Page 10: IIS for Speech Processing Michael J. Watts mike.watts.nz](https://reader030.fdocuments.us/reader030/viewer/2022032805/56813164550346895d97da6a/html5/thumbnails/10.jpg)
Speech Data Capture
• A sound wave is a continuous wave• Digital capture (sampling) is discrete• How often a sample is taken is the sample
rateo 44kHz = 44,000 samples / second
• The more samples (higher sample rate) the better the quality of the recording
![Page 11: IIS for Speech Processing Michael J. Watts mike.watts.nz](https://reader030.fdocuments.us/reader030/viewer/2022032805/56813164550346895d97da6a/html5/thumbnails/11.jpg)
Speech Data Capture
• Resolution is the number of bits used to “describe” each sample
• Higher sample resolution means a greater number of sounds can be represented
• Most CDs are 16-bit, 44.1kHz
![Page 12: IIS for Speech Processing Michael J. Watts mike.watts.nz](https://reader030.fdocuments.us/reader030/viewer/2022032805/56813164550346895d97da6a/html5/thumbnails/12.jpg)
Representing Speech Data
• Two main methods of visualisation• Waveform
o plot of signal amplitude against time• Spectrogram
o energy-frequency-time plots
![Page 13: IIS for Speech Processing Michael J. Watts mike.watts.nz](https://reader030.fdocuments.us/reader030/viewer/2022032805/56813164550346895d97da6a/html5/thumbnails/13.jpg)
Representing Speech Data
![Page 14: IIS for Speech Processing Michael J. Watts mike.watts.nz](https://reader030.fdocuments.us/reader030/viewer/2022032805/56813164550346895d97da6a/html5/thumbnails/14.jpg)
Representing Speech Data
![Page 15: IIS for Speech Processing Michael J. Watts mike.watts.nz](https://reader030.fdocuments.us/reader030/viewer/2022032805/56813164550346895d97da6a/html5/thumbnails/15.jpg)
Representing Speech Data
• Speech signal has a lot of variation• Hard to recognise speech from the raw signal• Need to reduce the signal somehow• Various transforms are available• Many based on FFT
![Page 16: IIS for Speech Processing Michael J. Watts mike.watts.nz](https://reader030.fdocuments.us/reader030/viewer/2022032805/56813164550346895d97da6a/html5/thumbnails/16.jpg)
Representing Speech Data
• Common transform is the mel-scale• The mel-scale is a log scale• Models human perception• Divides the signal into frequency bands• Returns the log-energy for each frequency
band
![Page 17: IIS for Speech Processing Michael J. Watts mike.watts.nz](https://reader030.fdocuments.us/reader030/viewer/2022032805/56813164550346895d97da6a/html5/thumbnails/17.jpg)
Speech Processing
• Speech encodingo Compression or de-noising of a speech signalo ANN are good at compression
• Speaker identificationo Who is speaking at any one time?
• Language identificationo what language is being spoken?
• Speech recognition
![Page 18: IIS for Speech Processing Michael J. Watts mike.watts.nz](https://reader030.fdocuments.us/reader030/viewer/2022032805/56813164550346895d97da6a/html5/thumbnails/18.jpg)
Speech Recognition
• Two general types• Continuous
o recognises words without pauses between themo recognise individual words, but uses surrounding
words for increased accuracy• Discrete
o recognises single words
![Page 19: IIS for Speech Processing Michael J. Watts mike.watts.nz](https://reader030.fdocuments.us/reader030/viewer/2022032805/56813164550346895d97da6a/html5/thumbnails/19.jpg)
Speech Recognition
• Systems can be word or phoneme based• Word based systems are easier to build• Require a restricted vocabulary
o require a recogniser for each word how many words in English?
• The most common used, however
![Page 20: IIS for Speech Processing Michael J. Watts mike.watts.nz](https://reader030.fdocuments.us/reader030/viewer/2022032805/56813164550346895d97da6a/html5/thumbnails/20.jpg)
Speech Recognition
• Phoneme based system require fewer recogniserso ~40 phonemes in most English accents
• Require a means of assembling the phonemes into wordso harder than it sounds
![Page 21: IIS for Speech Processing Michael J. Watts mike.watts.nz](https://reader030.fdocuments.us/reader030/viewer/2022032805/56813164550346895d97da6a/html5/thumbnails/21.jpg)
Speech Recognition
• ANN can be trained for both word and phoneme based recognition
• Common models used are MLP, SOM and recurrent networks
• Monolithic vs. Multi-modularo speed vs. simplicity
![Page 22: IIS for Speech Processing Michael J. Watts mike.watts.nz](https://reader030.fdocuments.us/reader030/viewer/2022032805/56813164550346895d97da6a/html5/thumbnails/22.jpg)
Speech Recognition
• Phoneme based speech recognition with MLP• Create a MLP for each phoneme• Train the MLP to activate for the target
phoneme and reject all others• Need to combine the outputs in a meaningful
way
![Page 23: IIS for Speech Processing Michael J. Watts mike.watts.nz](https://reader030.fdocuments.us/reader030/viewer/2022032805/56813164550346895d97da6a/html5/thumbnails/23.jpg)
Speech Recognition
From Kasabov, 1996, pg380
![Page 24: IIS for Speech Processing Michael J. Watts mike.watts.nz](https://reader030.fdocuments.us/reader030/viewer/2022032805/56813164550346895d97da6a/html5/thumbnails/24.jpg)
Speech Recognition
• Phoneme base speech recognition with SOM• Train a SOM on the phonemes
o phonemes will cluster together• Node that is active identifies the current
phonemeo one of the earliest applications of SOMs
![Page 25: IIS for Speech Processing Michael J. Watts mike.watts.nz](https://reader030.fdocuments.us/reader030/viewer/2022032805/56813164550346895d97da6a/html5/thumbnails/25.jpg)
Speech Recognition
• Word based systems work in a similar manner• Obviates the need for assembling phonemes
into words• Large number of words restricts the
application areas
![Page 26: IIS for Speech Processing Michael J. Watts mike.watts.nz](https://reader030.fdocuments.us/reader030/viewer/2022032805/56813164550346895d97da6a/html5/thumbnails/26.jpg)
Speech Recognition
• Fuzzy systems can also be used• Fuzzy rules assemble the phoneme streams
into words• Hybrid systems
![Page 27: IIS for Speech Processing Michael J. Watts mike.watts.nz](https://reader030.fdocuments.us/reader030/viewer/2022032805/56813164550346895d97da6a/html5/thumbnails/27.jpg)
Summary
• Speech processing is a complex problem• Large variability in speech signals• Main application is speech recognition• Many types of speech recognition methods• Intelligent systems can be useful for this
problem