Speech & Language Modeling Cindy Burklow & Jay Hatcher CS521 – March 30, 2006.
Speech & Language Modeling
Cindy Burklow & Jay Hatcher
CS521 – March 30, 2006
Agenda
What is Speech Recognition?
Challenges of Speech Recognition
Expresso III Case Study
IBM Superhuman Speech Tech
Speech Synthesis
What is Speech Recognition
How does it work?
Two approaches
Phonemes
One long rule book
Deductive framework
Search algorithms & math models
Hunting Speech
Phoneme Sequence
Phonemes Energy
Challenges of Speech Recognition
Noise
Users' own preferences
Limit Speech Range
People
Infinite Combinations
Software
Expresso III
Project
Who? Why? What? How?
Expresso III
How is it different?
Why try a new method?
Co-Articulation
Independencies
Duration
Linear Dynamic Model (LDM)
Expresso III
Why Linear Dynamic Model (LDM)?
Expresso III's Hypothesis
Testing Methods
Includes error models
Only linear models allowed
Series of tests (5 total)
Increase “phones” & training data
Switching & Iteration & Data classification
Generated histograms of log likelihood
Divide & Conquer Technique
Results
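A linear dynamic model is a linear-Gaussian state-space model, and the log likelihood of a speech segment under such a model is usually computed with a Kalman filter. The sketch below is a generic scalar version for illustration only; it is not the Expresso III implementation, and the parameter names (`f`, `h`, `q`, `r`, `x0`, `p0`) and the `classify` helper are assumptions.

```python
import math

def ldm_log_likelihood(observations, f, h, q, r, x0=0.0, p0=1.0):
    """Log likelihood of a 1-D observation sequence under a scalar LDM.

    Assumed model (names are illustrative):
        x[t] = f * x[t-1] + w,  w ~ N(0, q)   hidden state
        y[t] = h * x[t]   + v,  v ~ N(0, r)   observation
    """
    x, p = x0, p0  # mean and variance of the first hidden state
    total = 0.0
    for y in observations:
        # Predicted observation and its variance.
        y_pred = h * x
        s = h * p * h + r
        err = y - y_pred
        total += -0.5 * (math.log(2 * math.pi * s) + err * err / s)
        # Kalman measurement update of the state estimate.
        k = p * h / s
        x = x + k * err
        p = (1 - k * h) * p
        # Time update: propagate the estimate to the next frame.
        x = f * x
        p = f * p * f + q
    return total

def classify(observations, models):
    """Return the name of the model that scores the segment highest."""
    return max(models, key=lambda name: ldm_log_likelihood(observations, **models[name]))
```

With one set of parameters per phone, `classify` picks the phone whose model gives the segment the highest log likelihood; collecting those scores over many segments gives the kind of data a log-likelihood histogram is built from.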
IBM Superhuman Speech Tech
ViaVoice 4.4
Products
Goal
“Get performance comparable to humans in the next five years.”
-IBM Jan. 2006
Comprehend languages
Translate dynamically
Create “on-the-fly” subtitles on TV
Speak commands
Free-Form Command
MASTOR
TALES
PDAs, iPods, & DVRs
“Free-Form Command”
• Commands associated with objects
• Simplified Language
• Partnering with Specialized Hardware Manufacturing
• Finding niche markets
• Well-chosen Algorithms
IBM’s MASTOR
Multilingual Automatic Speech-to-Speech Translator
IBM’s TALES
• Server-based system
• Dynamically transcribes & translates any words spoken into English subtitles
• Requires long processing time
• Real-time translations are impossible
• 60%-70% accuracy rate
• High subscription fee for users
Expanding Speech Recognition Applications
PDAs to collect data
iPod: Email & RSS Read Aloud
Navigate Your DVR with Speech
Voice commands
Requires microphone
• TV remote
• Headset
Text to Speech Systems
Two major steps:
1. Convert the text into a pronounceable format
– Look for domain specific sections like time, dates, numbers, addresses, and abbreviations
– Try to identify homographs and the contexts in which they occur
– Use some combination of dictionary and rule-based approaches as a guide to pronunciation
2. Synthesize speech from the phonetic representation using one of many possible approaches
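The first step can be sketched as a small text normalizer. This is a minimal illustration under stated assumptions, not the front end of any particular system: the word lists, the `normalize` function, and the rules for times and numbers are all hypothetical, and only integers from 0 to 99 are spelled out.

```python
import re

# Tiny illustrative lexicons; a real system would use far larger ones.
ABBREVIATIONS = {"Dr.": "doctor", "Mr.": "mister"}
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]

def number_to_words(n):
    """Spell out an integer in the range 0..99."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] if ones == 0 else TENS[tens] + " " + ONES[ones]

def expand_time(match):
    """Turn a clock time such as 3:30 into 'three thirty'."""
    hour, minute = int(match.group(1)), int(match.group(2))
    if minute == 0:
        return number_to_words(hour) + " o'clock"
    if minute < 10:
        return number_to_words(hour) + " oh " + number_to_words(minute)
    return number_to_words(hour) + " " + number_to_words(minute)

def normalize(text):
    """Rewrite times, one- and two-digit numbers, and abbreviations as words."""
    # Times first, so their digits are not consumed by the number rule.
    text = re.sub(r"\b(\d{1,2}):(\d{2})\b", expand_time, text)
    text = re.sub(r"\b\d{1,2}\b", lambda m: number_to_words(int(m.group())), text)
    for abbr, spoken in ABBREVIATIONS.items():
        text = text.replace(abbr, spoken)
    return text

print(normalize("Dr. Smith arrives at 3:30 with 12 students."))
```

The order of the rules matters: each domain-specific pattern has to run before the more general ones that would otherwise break it apart. Dates, addresses, and homographs need further rules or a trained model and are not handled here.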
Speech Synthesis
Formant Synthesis
Recordings
Concatenative synthesis
Unit Selection
Waveform Synthesis
Diphone Synthesis
Hybrid Approaches
Articulatory Synthesis
HMM-based synthesis
Continuum of Speech Synthesis methods
Speech Synthesis at CMU
• Carnegie Mellon University has been doing extensive research in both speech recognition and speech synthesis
• Research primarily uses the Festival Speech Synthesis System, an open-source framework developed by Edinburgh University
Speech Synthesis at CMU
• Research has primarily focused on Diphone Synthesis, with some additional exploration into Unit Selection.
Speech Synthesis at CMU
• Diphone synthesis allows greater control of pitch and voice inflection, but often has a more robotic sound to it.
• Example: This is a short introduction to the Festival Speech Synthesis System. Festival was developed by Alan Black and Paul Taylor, at the Centre for Speech Technology Research, University of Edinburgh.
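A diphone runs from the middle of one phone to the middle of the next, so a phone sequence maps onto a list of overlapping pairs. The sketch below shows that mapping only; the function name `to_diphones`, the phone symbols, and the use of `pau` as the silence marker are assumptions for illustration.

```python
def to_diphones(phones, silence="pau"):
    """Return the diphone names needed to synthesize a phone sequence."""
    # Pad with silence so the utterance starts and ends on a transition.
    padded = [silence] + list(phones) + [silence]
    return [f"{left}-{right}" for left, right in zip(padded, padded[1:])]

# The word "cat" as three phones becomes four diphones.
print(to_diphones(["k", "ae", "t"]))
```

Because each unit is a pair of phones, a language with N phones needs at most N × N recorded diphones, which is why the inventory stays small compared with a unit selection database.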
Speech Synthesis at CMU
• Improvements can be made by performing statistical analysis of the text as a preprocessing step before synthesis.
• This helps with pacing, homographs, and other situations where pronunciation differs depending on context.
• He wanted to go for a drive in.
• He wanted to go for a drive in the country.
• My cat who lives dangerously has nine lives.
• Henry V: Part I Act II Scene XI: Mr X is I believe, V I Lenin, and not Charles I.
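The last example shows why context matters: the same token "I" or "V" may be a pronoun, an initial, a section number, or a regnal number. The sketch below uses a few hand-written context rules as a stand-in for the statistical analysis described above; it is not Festival's method, and the word lists and function names are hypothetical.

```python
# Roman numerals in the example, with cardinal and ordinal readings.
NUMERALS = {
    "I": ("one", "first"), "II": ("two", "second"), "V": ("five", "fifth"),
    "X": ("ten", "tenth"), "XI": ("eleven", "eleventh"),
}
SECTION_WORDS = {"part", "act", "scene", "chapter"}  # followed by a cardinal
ROYAL_NAMES = {"henry", "charles"}                   # followed by an ordinal

def read_token(prev_word, token):
    """Choose a reading for one token from the word before it."""
    if token not in NUMERALS:
        return token
    cardinal, ordinal = NUMERALS[token]
    prev = prev_word.lower()
    if prev in SECTION_WORDS:
        return cardinal
    if prev in ROYAL_NAMES:
        return "the " + ordinal
    return token  # otherwise treat it as a letter or the pronoun "I"

def expand(text):
    """Expand Roman numerals in a sentence using the rules above."""
    out, prev = [], ""
    for word in text.split():
        core = word.rstrip(".,:;")   # separate trailing punctuation
        tail = word[len(core):]
        out.append(read_token(prev, core) + tail)
        prev = core
    return " ".join(out)

print(expand("Henry V: Part I Act II Scene XI: Mr X is I believe, "
             "V I Lenin, and not Charles I."))
```

A trained model replaces the fixed word lists with probabilities estimated from text, which lets it cover contexts nobody wrote a rule for. The "lives" homograph in the third example needs part-of-speech information and is not handled by this sketch.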
Speech Synthesis at CMU
• Unit selection can be used instead of diphones to make the voice sound more natural by drawing on larger recorded units (e.g. whole phones or syllables) and not just diphones (the transitions between adjacent sounds)
• The following examples are based on the same speaker:
• Diphones
• Unit Selection
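Unit selection is commonly framed as a search: for each target sound, pick one recorded unit from the database so that the units match the targets (target cost) and fit together smoothly (join cost). The sketch below runs that search with dynamic programming. It is a simplified illustration, not Festival's implementation; units are assumed to be `(phone, pitch_hz)` tuples and both cost functions are hypothetical.

```python
def pitch_target_cost(target, unit):
    """How far a recorded unit's pitch is from the desired pitch."""
    return abs(target[1] - unit[1])

def pitch_join_cost(left, right):
    """How large the pitch jump is between two adjacent units."""
    return abs(left[1] - right[1])

def select_units(targets, candidates, target_cost, join_cost):
    """Pick one candidate per target with minimal total target + join cost."""
    # best[i][j] = (cost of cheapest path ending in candidates[i][j], back-pointer)
    best = [[(target_cost(targets[0], c), None) for c in candidates[0]]]
    for i in range(1, len(targets)):
        row = []
        for c in candidates[i]:
            prev_cost, prev_j = min(
                (best[i - 1][j][0] + join_cost(p, c), j)
                for j, p in enumerate(candidates[i - 1]))
            row.append((prev_cost + target_cost(targets[i], c), prev_j))
        best.append(row)
    # Trace back from the cheapest final unit.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    total = best[-1][j][0]
    path = []
    for i in range(len(targets) - 1, -1, -1):
        path.append(candidates[i][j])
        j = best[i][j][1]
    path.reverse()
    return path, total

targets = [("h", 100), ("ay", 110)]
candidates = [[("h", 100), ("h", 140)], [("ay", 150), ("ay", 112)]]
print(select_units(targets, candidates, pitch_target_cost, pitch_join_cost))
```

The search can only choose among recordings that exist, so when the database lacks a good match every path is expensive and the output degrades noticeably.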
Speech Synthesis at CMU
• With care, unit selection can produce very convincing natural sound.
– Original Sound
– Synthesis from natural phones, pitch, and duration data
• However, it is difficult to generalize Unit Selection for a variety of situations, and if it does poorly it sounds much worse than diphones.
– Example
Speech Synthesis at CMU
• Most commercial TTS packages use Unit Selection with medium to large databases of samples.
– Example: NeoSpeech VoiceText
• These produce higher quality sound at the expense of memory and processor power.
• CMU’s Festival implementation has focused more on Diphone Synthesis to reduce memory footprint and allow greater control of the synthesizer.
Speech Synthesis at CMU
• Diphone Synthesis can control inflection, pitch, and other factors dynamically.
– A short example with no prosody.
– A short example with declination.
– A short example with accents on stressed syllables and end tones.
– A short example with statistically trained intonation and duration models.
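The first three prosody levels can be illustrated with a toy pitch contour: a flat line (no prosody), a baseline that falls over the utterance (declination), and the same baseline with raised pitch on stressed syllables (accents). This is a sketch under stated assumptions; the function names and the pitch values in Hz are hypothetical and not taken from Festival.

```python
def declination_contour(n_syllables, start_hz=130.0, end_hz=100.0):
    """Linearly falling baseline pitch, one value per syllable."""
    if n_syllables == 1:
        return [start_hz]
    step = (end_hz - start_hz) / (n_syllables - 1)
    return [start_hz + step * i for i in range(n_syllables)]

def add_accents(contour, stressed, boost_hz=20.0):
    """Raise pitch on the stressed syllable positions, keeping the baseline."""
    return [f0 + boost_hz if i in stressed else f0
            for i, f0 in enumerate(contour)]

flat = [115.0] * 4                          # no prosody: constant pitch
falling = declination_contour(4)            # declination only
accented = add_accents(falling, {1})        # accent on the second syllable
print(flat, falling, accented)
```

A statistically trained intonation model would predict these per-syllable targets from features of the text rather than from fixed numbers.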
Conclusion
• CMU’s research using Festival has led to useful technology for embedded systems and servers. The Diphone Synthesis model they have developed can produce generally intelligible speech with minimal memory and processing costs. The model is still being worked on and may one day reach a natural level of quality.
References and Useful Links
What is speech recognition & Challenges?
• http://www.extremetech.com/article2/0,1697,1826664,00.asp
• http://csdl2.computer.org/persagen/DLAbsToc.jsp?resourcePath=/dl/mags/co/&toc=comp/mags/co/2002/04/r4toc.xml&DOI=10.1109/MC.2002.993770
• http://en.wikipedia.org/wiki/Speech_recognition
• http://cslu.cse.ogi.edu/HLTsurvey/ch1node7.html
Expresso III Case Study
• http://www.cstr.ed.ac.uk/publications/users/s0129866_abstracts.html#Couper-02
• http://www.cstr.ed.ac.uk/publications/users/s0129866.html
IBM Superhuman Speech Tech
• http://www.ibm.com
• http://www.pcmag.com/article2/0,1895,1915071,00.asp
References and Useful Links
• The Festival Speech Synthesis System
• NeoSpeech VoiceText Demo
• AT&T’s TTS FAQ
• Reviews of Popular Speech Synthesizers
• Speech Engine Listings with Samples
• BrightSpeech.com
• Festival at CMU
• FestVox