Speech & Language Modeling Cindy Burklow & Jay Hatcher CS521 – March 30, 2006.

Agenda

What is Speech Recognition?

Challenges of Speech Recognition

Expresso III Case Study

IBM Superhuman Speech Tech

Speech Synthesis

What is Speech Recognition?

How does it work?

Two approaches

Phonemes

One long Rule book

Deductive Framework

Search Algorithms & Math Models

Hunting Speech

Phoneme Sequence

Phonemes Energy

Challenges of Speech Recognition

Noise

Users' own preferences

Limit Speech Range

People

Infinite Combinations

Software

Expresso III

Project

Who? Why? What? How?

Expresso III

How is it different?

Why try a new method?

Co-Articulation

Independencies

Duration

Linear Dynamic Model (LDM)
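The slides do not give the model's equations. As a rough sketch, a linear dynamic model is commonly written as a linear state update plus a linear observation, each with its own Gaussian error term. The scalar form below, the function name `simulate_ldm`, and the parameter names `F`, `H`, `q_std`, and `r_std` are all assumptions made for illustration, not details taken from the Expresso III project.

```python
import random

def simulate_ldm(F, H, q_std, r_std, x0, steps, seed=0):
    """Simulate a scalar linear dynamic model (illustrative names only).

    State update:  x[t] = F * x[t-1] + w,  w ~ N(0, q_std^2)
    Observation:   y[t] = H * x[t]   + v,  v ~ N(0, r_std^2)
    """
    rng = random.Random(seed)
    x = x0
    states, observations = [], []
    for _ in range(steps):
        # Hidden state evolves linearly, plus state noise.
        x = F * x + rng.gauss(0.0, q_std)
        # The observed value is a linear function of the state, plus noise.
        y = H * x + rng.gauss(0.0, r_std)
        states.append(x)
        observations.append(y)
    return states, observations

if __name__ == "__main__":
    states, obs = simulate_ldm(F=0.9, H=1.0, q_std=0.1, r_std=0.2, x0=1.0, steps=5)
    print(states)
    print(obs)
```

Both equations are linear and both carry an explicit noise term, which is one way to read the "only linear models" and "includes error models" points in these slides.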

Expresso III

Why Linear Dynamic Model (LDM)?

Expresso III's Hypothesis

Testing Methods

Includes error models

Only linear models allowed

Series of tests (5 total)

Increase “phones” & training data

Switching & Iteration & Data classification

Generated histograms of log likelihood

Divide & Conquer Technique

Results

IBM Superhuman Speech Tech

ViaVoice 4.4

Products

Goal

“Get performance comparable to humans in the next five years.”

-IBM Jan. 2006

Comprehend languages

Translate dynamically

Create “on-the-fly” subtitles on TV

Speak commands

Free-Form Command

MASTOR

TALES

PDAs, iPods, & DVRs

“Free-Form Command”

• Commands associated with objects

• Simplified Language

• Partnering with Specialized Hardware Manufacturing

• Finding niche markets

• Well-chosen Algorithms

IBM’s MASTOR

Multilingual Automatic Speech-to-Speech Translator

IBM’s Tales

• Server-based system

• Dynamically transcribes and translates spoken words into English subtitles

• Requires long processing time

• Real-time translations are impossible

• 60%-70% accuracy rate

• High subscription fee for users

Expanding Speech Recognition Applications

PDAs to collect data

iPod: Email & RSS Read Aloud

Navigate Your DVR with Speech

Voice commands

Requires microphone

• TV remote

• Headset

Text to Speech Systems

Two major steps:

1. Convert the text into a pronounceable format

– Look for domain specific sections like time, dates, numbers, addresses, and abbreviations

– Try to identify homographs and the contexts in which they occur

– Use some combination of dictionary and rule-based approaches as a guide to pronunciation

2. Synthesize speech from the phonetic representation using one of many possible approaches
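A minimal sketch of the first step, assuming a tiny hand-written abbreviation dictionary and a rule that spells out one- and two-digit numbers. The table contents and the function names `number_to_words` and `normalize` are hypothetical; a real system covers far more cases (dates, times, addresses).

```python
import re

# Hypothetical dictionary of abbreviations and their spoken forms.
ABBREVIATIONS = {"Mr.": "Mister", "Ave.": "Avenue", "etc.": "et cetera"}

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]

def number_to_words(n):
    """Spell out an integer from 0 to 99."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] if ones == 0 else f"{TENS[tens]} {ONES[ones]}"

def normalize(text):
    """Rewrite abbreviations and short numbers as pronounceable words."""
    words = []
    for token in text.split():
        if token in ABBREVIATIONS:
            # Dictionary-based rule: known abbreviation.
            words.append(ABBREVIATIONS[token])
        elif re.fullmatch(r"\d{1,2}", token):
            # Rule-based expansion: one- or two-digit number.
            words.append(number_to_words(int(token)))
        else:
            words.append(token)
    return " ".join(words)

if __name__ == "__main__":
    print(normalize("Mr. Smith has 42 cats"))
```

The output of this step would then be mapped to phonemes and passed to whichever synthesis method is used in step 2.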

Speech Synthesis

[Figure: Continuum of Speech Synthesis methods. Labels: Formant Synthesis, Recordings, Concatenative synthesis, Unit Selection, Waveform Synthesis, Diphone Synthesis, Hybrid Approaches, Articulatory Synthesis, HMM-based synthesis]

Speech Synthesis at CMU

• Carnegie Mellon University has been doing extensive research in both speech recognition and speech synthesis

• Research primarily uses the Festival Speech Synthesis System, an open-source framework developed by Edinburgh University


• Research has primarily focused on Diphone Synthesis, with some additional exploration into Unit Selection.


• Diphone synthesis allows greater control of pitch and voice inflection, but often has a more robotic sound to it.

• Example: This is a short introduction to the Festival Speech Synthesis System. Festival was developed by Alan Black and Paul Taylor, at the Centre for Speech Technology Research, University of Edinburgh.


• Improvements can be made by performing statistical analysis of the text as a preprocessing step before synthesis.

• This helps with pacing, homographs, and other situations where pronunciation differs depending on context.

• He wanted to go for a drive in.

• He wanted to go for a drive in the country.

• My cat who lives dangerously has nine lives.

• Henry V: Part I Act II Scene XI: Mr X is I believe, V I Lenin, and not Charles I.
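To illustrate why context matters in the last example, here is a small sketch that reads a Roman numeral as a cardinal after "Part", "Act", or "Scene", as an ordinal after a monarch's name, and leaves it alone otherwise. It uses hand-written context rules, not the statistical analysis mentioned above, and the word lists and function names are hypothetical.

```python
ROMAN_VALUES = {"I": 1, "V": 5, "X": 10}
CARDINAL = ["zero", "one", "two", "three", "four", "five", "six", "seven",
            "eight", "nine", "ten", "eleven", "twelve"]
ORDINAL = ["zeroth", "first", "second", "third", "fourth", "fifth", "sixth",
           "seventh", "eighth", "ninth", "tenth", "eleventh", "twelfth"]

# Hypothetical context word lists, chosen to fit the example sentence.
SECTION_WORDS = {"Part", "Act", "Scene"}
MONARCH_NAMES = {"Henry", "Charles"}

def roman_to_int(numeral):
    """Convert a numeral made of I, V, X to an integer."""
    total = 0
    for ch, nxt in zip(numeral, numeral[1:] + " "):
        value = ROMAN_VALUES[ch]
        # A smaller digit before a larger one is subtracted (IV = 4).
        total += -value if ROMAN_VALUES.get(nxt, 0) > value else value
    return total

def expand_roman(tokens):
    """Expand Roman numerals whose preceding word gives a known context."""
    out = []
    for i, tok in enumerate(tokens):
        core = tok.rstrip(".,:;")
        tail = tok[len(core):]
        prev = tokens[i - 1] if i > 0 else ""
        is_roman = core != "" and all(c in ROMAN_VALUES for c in core)
        value = roman_to_int(core) if is_roman else None
        if is_roman and value < len(CARDINAL) and prev in SECTION_WORDS:
            out.append(CARDINAL[value] + tail)
        elif is_roman and value < len(ORDINAL) and prev in MONARCH_NAMES:
            out.append("the " + ORDINAL[value] + tail)
        else:
            # No known context: keep the token (e.g. "Mr X", pronoun "I").
            out.append(tok)
    return out

if __name__ == "__main__":
    sentence = "Henry V: Part I Act II Scene XI: Mr X is I believe"
    print(" ".join(expand_roman(sentence.split())))
```

Fixed rules like these break down quickly on unseen contexts, which is one motivation for training the disambiguation statistically instead.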


• Unit selection can be used instead of diphones to improve how natural the voice sounds by selecting larger recorded units (e.g. whole phones or syllables) from a database rather than only diphones (sound transitions).

• The following examples are based on the same speaker:

• Diphones

• Unit Selection
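As a rough illustration of what "diphones (sound transitions)" means, the sketch below turns a phone sequence into the list of diphone units a diphone synthesizer would need to concatenate. The phone labels, the `pau` silence symbol, and the function name are assumptions for this example and are not taken from Festival.

```python
def phones_to_diphones(phones, silence="pau"):
    """Return the phone-to-phone transitions covering an utterance.

    The utterance is padded with silence on both sides so that the first
    and last phones also get a transition unit.
    """
    padded = [silence] + list(phones) + [silence]
    return [f"{a}-{b}" for a, b in zip(padded, padded[1:])]

if __name__ == "__main__":
    # Hypothetical phone labels for the word "cat".
    print(phones_to_diphones(["k", "ae", "t"]))
```

A diphone voice stores one recording per transition, while unit selection searches a larger database for longer stretches that already occur together in natural speech.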


• With care, unit selection can produce very convincing natural sound.

– Original Sound

– Synthesis from natural phones, pitch, and duration data

• However, it is difficult to generalize Unit Selection for a variety of situations, and if it does poorly it sounds much worse than diphones.

– Example


• Most commercial TTS packages use Unit Selection with medium to large databases of samples.

– Example: Neospeech VoiceText

• These produce higher quality sound at the expense of memory and processor power.

• CMU’s Festival implementation has focused more on Diphone Synthesis to reduce memory footprint and allow greater control of the synthesizer.


• Diphone Synthesis can control inflection, pitch, and other factors dynamically.

– A short example with no prosody.

– A short example with declination.

– A short example with accents on stressed syllables and end tones.

– A short example with statistically trained intonation and duration models.

Conclusion

• CMU’s research using Festival has led to useful technology for embedded systems and servers. The Diphone Synthesis model they have developed can produce generally intelligible speech with minimal memory and processing costs. The model is still being worked on and may one day reach a natural level of quality.

References and Useful Links

What is speech recognition & Challenges?

• http://www.extremetech.com/article2/0,1697,1826664,00.asp

• http://csdl2.computer.org/persagen/DLAbsToc.jsp?resourcePath=/dl/mags/co/&toc=comp/mags/co/2002/04/r4toc.xml&DOI=10.1109/MC.2002.993770

• http://en.wikipedia.org/wiki/Speech_recognition

• http://cslu.cse.ogi.edu/HLTsurvey/ch1node7.html

Expresso III Case Study

• http://www.cstr.ed.ac.uk/publications/users/s0129866_abstracts.html#Couper-02

• http://www.cstr.ed.ac.uk/publications/users/s0129866.html

IBM Superhuman Speech Tech

• http://www.ibm.com

• http://www.pcmag.com/article2/0,1895,1915071,00.asp

• The Festival Speech Synthesis System

• NeoSpeech VoiceText Demo

• AT&T’s TTS FAQ

• Reviews of Popular Speech Synthesizers

• Speech Engine Listings with Samples

• BrightSpeech.com

• Festival at CMU

• FestVox
