
CS 551/651: Structure of Spoken Language

Lecture 13: Text-to-Speech (TTS) Technology and Automatic Speech Recognition (ASR)

John-Paul Hosom
Fall 2008

Text-to-Speech (TTS) Synthesis

• Having looked at theories of human speech production and speech perception, now we’ll look at structures and algorithms currently used to implement these technologies.

• Text-to-Speech (TTS) has three main approaches:
  (1) formant-based
  (2) concatenative
  (3) articulatory

• All TTS approaches must address:

(a) text analysis: from text, predicting phonemes, stress, and phrase boundaries

(b) prosody: from text-analysis output, predicting pitch contour, energy contour, and duration of each phoneme

(c) signal processing: given phoneme symbols and timing, generate the speech waveform

Text-to-Speech (TTS) Synthesis

• From a linguistic perspective, there may be many more things to consider…

(from Klatt 1987)

Text-to-Speech (TTS) Synthesis

Generating a Waveform: Articulatory Synthesis

• The vocal tract is divided into a large number of short tubes, as in the electrical transmission-line analog (Lecture 11); these tubes are then combined and the resonant frequencies calculated.

from Sinder, 1999 (thesis work with Flanagan, Rutgers)

Text-to-Speech (TTS) Synthesis

Generating a Waveform: Articulatory Synthesis

• Vocal-tract sources include noise and a “buzz” source for voiced sounds

• Articulatory synthesis important for validating the Motor Theory of Speech Perception

• Demos from 1976 and circa 1992 (Haskins Labs)

Text-to-Speech (TTS) Synthesis

Generating a Waveform: Formant Synthesis

• Instead of specifying mouth shapes, formant synthesis specifies frequencies and bandwidths of resonators, which are used to filter a source waveform.

• Formant frequency analysis is difficult; bandwidth estimation is even more difficult. But the biggest perceptual problem in formant synthesis is not in the resonances, but in a “buzzy” quality most likely due to the glottal source model.

• Formant synthesis can sound identical to a natural utterance if the details of the glottal source and formants are well modeled.

NATURAL SPEECH vs. SYNTHETIC SPEECH

(John Holmes, 1973)

Text-to-Speech (TTS) Synthesis

Formant TTS Synthesis: Architecture

• Formant-synthesis systems contain a number of sound sources, whose outputs are passed through filters connected either in parallel or in cascade. Each filter corresponds to one formant (resonance) or anti-resonance.

(From Yamaguchi, 1993)
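
To make the cascade configuration concrete, here is a minimal sketch of a Klatt-style second-order resonator driven by a crude impulse-train source. The formant/bandwidth values and the 100-Hz source below are illustrative assumptions, not parameters from any system described here.

```python
import numpy as np

def resonator(x, f, bw, fs):
    # Two-pole digital resonator (Klatt-style): y[n] = A*x[n] + B*y[n-1] + C*y[n-2],
    # with coefficients derived from formant frequency f (Hz) and bandwidth bw (Hz).
    T = 1.0 / fs
    C = -np.exp(-2.0 * np.pi * bw * T)
    B = 2.0 * np.exp(-np.pi * bw * T) * np.cos(2.0 * np.pi * f * T)
    A = 1.0 - B - C                       # normalizes the gain at 0 Hz to 1
    y = np.zeros(len(x))
    for n in range(len(x)):
        y[n] = A * x[n]
        if n >= 1:
            y[n] += B * y[n - 1]
        if n >= 2:
            y[n] += C * y[n - 2]
    return y

fs = 16000
# crude 100-Hz impulse-train "buzz" source (a placeholder for a real glottal model)
source = np.zeros(fs // 2)
source[::fs // 100] = 1.0
# cascade of resonators at rough formant values for a neutral vowel (assumed values)
out = source
for f, bw in [(500, 60), (1500, 90), (2500, 150)]:
    out = resonator(out, f, bw, fs)
```

A parallel configuration would instead filter the source through each resonator separately, scale each output by its own amplitude control, and sum the results.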

Text-to-Speech (TTS) Synthesis

Formant systems: Rule-Based Synthesis

• For synthesis of arbitrary text, formants and bandwidths for each phoneme are determined by analyzing speech of a single person.

• The models of each phoneme may be a single set of formant frequencies and bandwidths for a canonical phoneme at a single point in time, or a trajectory of frequencies, bandwidths, and source models over time.

• The formant frequencies for each phoneme are combined over time using a model of coarticulation, such as Klatt’s modified locus theory.

• Duration, pitch, and energy rules are applied

• Result: something like this:

Text-to-Speech (TTS) Synthesis

• Despite great success in copy synthesis, synthesis by rule using formants has severely degraded quality. It’s not clear why… Problem with glottal source? Problem with coarticulation and formant transitions? Problem with prosody?

• Formant synthesis was the main TTS technique until the early or mid 1990’s, when increasing memory size and CPU speed allowed concatenative synthesis to be a viable approach.

• Concatenative synthesis uses recordings of small units of speech (typically a diphone: the region from the middle of one phoneme to the middle of the next), and glues these units together to form words and sentences.

• Concatenative synthesis means that you don’t have to worry about glottal source models or coarticulation, since the synthesis is just a concatenation of different waveforms containing “natural” glottal source and coarticulation.

Text-to-Speech (TTS) Synthesis

Concatenative Synthesis: Units

• The basic unit for concatenative synthesis is the diphone:

• More recent TTS research focuses on using larger units. Issues include:

(a) how to decide which units will be used?

(b) how to select the best unit from a very large database?

• With increasing size and variety of units, there is an exponential growth in the database size. Yet, despite massive databases that may take months to record, coverage is nowhere near complete. There is a very large number of infrequent events in speech.

sil-jh jh-aa aa-n n-sil
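
The diphone names above are for the name “John” (/jh aa n/). A minimal sketch of how a phoneme string is turned into a diphone sequence, assuming the silence-padding convention shown:

```python
def to_diphones(phones):
    # Pad with silence, then pair each phone with its successor.
    padded = ["sil"] + phones + ["sil"]
    return ["%s-%s" % (a, b) for a, b in zip(padded, padded[1:])]

print(to_diphones(["jh", "aa", "n"]))  # ['sil-jh', 'jh-aa', 'aa-n', 'n-sil']
```

With roughly 40 to 50 phonemes, a diphone inventory is only on the order of a few thousand units; inventories of larger or more context-specific units grow much faster, which is the database-size problem noted above.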

Concatenative Synthesis: Signal Processing

• Waveform-based Pitch-Synchronous Overlap Add (PSOLA)

• Perform pitch modification by re-spacing pitch-synchronous units (a sketch follows this list)

• Or, use Line Spectral Frequencies (LSFs), which are computed from Linear Predictive Coefficients (LPC)
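
A heavily simplified sketch of the TD-PSOLA idea referenced above: extract a windowed, two-period frame at each pitch mark, then overlap-add the frames at re-scaled spacing. It assumes reliable pitch marks are already available and ignores many practical details.

```python
import numpy as np

def td_psola_pitch(x, marks, factor):
    # marks: sample indices of glottal pulses; factor > 1 raises pitch.
    marks = np.asarray(marks)
    periods = np.diff(marks)
    y = np.zeros(len(x))
    t_out = float(marks[1])
    while t_out < marks[-2]:
        i = int(np.argmin(np.abs(marks - t_out)))   # nearest analysis pitch mark
        i = min(max(i, 1), len(marks) - 2)
        half = int(periods[i - 1])                  # one local pitch period
        lo, hi = marks[i] - half, marks[i] + half
        if lo >= 0 and hi <= len(x):
            frame = x[lo:hi] * np.hanning(hi - lo)  # two-period, Hann-windowed frame
            start = int(t_out) - half
            if start >= 0 and start + len(frame) <= len(y):
                y[start:start + len(frame)] += frame
        t_out += periods[i - 1] / factor            # re-scaled pitch-period spacing
    return y
```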

Text-to-Speech (TTS) Synthesis

DEMOS

• Klatt’s DECtalk (formant synthesis) (early 1990’s): sample 1

• AT&T (large-unit selection): sample 1a (2003), sample 2a (2003), sample 1b (2005), sample 2b (2005)

• Bell Labs (large-unit selection): sample 1a (2003), sample 2a (2003), sample 1b (2005), sample 2b (2005)

• OGI (diphone units): sample 1a (2003), sample 2a (2003), sample 1b (2005), sample 2b (2005)

ASR Technology: Frame-Based Approaches

• Stochastic Approach

includes HMMs and HMM/ANN hybrids

ASR Technology: Frame-Based Approaches

• HMM-Based System Characteristics

System is in only one state at each time t; at time t+1, the system transfers to one of the states indicated by the arcs.

At each time t, the likelihood of each phoneme is estimated using a Gaussian mixture model (GMM) or an artificial neural network (ANN). The classifier uses a fixed time window usually extending no more than 60 msec. Each frame is typically classified into each phoneme in a particular left and right context, e.g. /y−eh+s/, and as the left, middle, or right region of that context-dependent phoneme (3 states per phoneme).
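
For concreteness, a minimal sketch of the per-frame state likelihood under a diagonal-covariance Gaussian mixture model; the array shapes and names are illustrative assumptions, not the notation of any particular toolkit.

```python
import numpy as np

def gmm_log_likelihood(x, weights, means, variances):
    # x: one feature frame, shape (D,); weights: (M,); means, variances: (M, D).
    # The diagonal covariance is what forces the "independent features" assumption.
    x = np.asarray(x)
    log_comp = (np.log(weights)
                - 0.5 * np.sum(np.log(2.0 * np.pi * variances), axis=1)
                - 0.5 * np.sum((x - means) ** 2 / variances, axis=1))
    return float(np.logaddexp.reduce(log_comp))   # log-sum-exp over the M mixtures
```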

The probability of transferring from one state to the next is independent of the observed (test) speech utterance, being computed over the entire training corpus.

The Viterbi search determines the most likely word sequence given the phoneme and state-transition probabilities and the list of possible vocabulary words.
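
A minimal log-domain Viterbi sketch over a single state graph; the array shapes are illustrative assumptions (a real recognizer also builds the graph from the lexicon and language model).

```python
import numpy as np

def viterbi(log_obs, log_trans, log_init):
    # log_obs[t, s]    = log P(frame t | state s)
    # log_trans[s, s'] = log P(state s -> state s'); log_init[s] = log P(start in s)
    T, S = log_obs.shape
    delta = np.full((T, S), -np.inf)          # best log score ending in each state
    back = np.zeros((T, S), dtype=int)        # back-pointers for the best path
    delta[0] = log_init + log_obs[0]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_trans      # predecessor x successor
        back[t] = np.argmax(scores, axis=0)
        delta[t] = scores[back[t], np.arange(S)] + log_obs[t]
    path = [int(np.argmax(delta[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]                          # most likely state sequence
```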

ASR Technology: Frame-Based Approaches

• Issues with HMMs:

Independence is assumed between frames

Implicit duration model for phonemes is Geometric, whereas phonemes actually have Gamma distributions (see the note after this list)

Independence is required between features within one frame for GMM classification (not so for ANN classification)

All frames of speech contribute equally to final result

Duration is not used in phoneme classification

Duration is modeled using a priori averages over the entire training set

Language model uses probability of word N given words N−1, N−2, etc. (bigram, trigram, etc. language model); infrequently occurring word combinations poorly recognized (e.g. “black Monday”, a stock-market ‘crash’ in 1987)
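
To spell out the duration point noted in this list: if a state has self-transition probability $a_{ii}$, the probability of staying in it for exactly $d$ frames is

$$P(d) = a_{ii}^{\,d-1}\,(1 - a_{ii}), \qquad d = 1, 2, 3, \ldots$$

a geometric distribution whose mode is always at $d = 1$, whereas measured phoneme durations peak at some longer, phoneme-dependent value (hence the gamma-like shape mentioned above).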

ASR Technology: Frame-Based Approaches

• Why is HMM Dominant Technique for ASR?

well-defined mathematical structure

does not require expert knowledge about speech signal (more people study statistics than study speech)

errors in analysis don’t propagate and accumulate

does not require prior segmentation

does not require a large number of templates

results are usually the best or among the best

Issues in Developing ASR Systems

• Type of Channel

Microphone signal different from telephone signal, “land-line” telephone signal different from cellular signal.

Channel characteristics: pick-up pattern (omni-directional, unidirectional, etc.), frequency response, sensitivity, noise, etc.

Typical channels:
  desktop boom mic: unidirectional, 100 to 16000 Hz
  hand-held mic: super-cardioid, 40 to 20000 Hz
  telephone: unidirectional, 300 to 8000 Hz

Training on data from one type of channel automatically “learns” that channel’s characteristics; switching channels degrades performance.

Issues in Developing ASR Systems

• Speaking Rate

Even the same speaker may vary the rate of speech.

Most ASR systems require a fixed window of input speech.

Formant dynamics change with different speaking rates and speaking styles (e.g. “frustrated speech”).

ASR performance is best when tested on same rate of speech as training data.

Training on a wide variation in speaking rate results in lower overall performance.

Issues in Developing ASR Systems

• Noise

two types of noise: additive, convolutional

additive: white noise (random values added to waveform)

convolutional: filter (additive values in log spectrum)

techniques for removing noise: RASTA, Cepstral Mean Subtraction (CMS); a sketch of CMS follows this list

(nearly) impossible to remove all noise while preserving all speech

stochastic training “learns” noise as well as speech; if noise changes, performance degrades.
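
A minimal sketch of Cepstral Mean Subtraction, referenced in this list. Because a fixed (convolutional) channel is approximately a constant additive offset in the cepstral domain, subtracting the per-utterance mean removes much of it; the array shape below is an assumption for illustration.

```python
import numpy as np

def cepstral_mean_subtraction(cepstra):
    # cepstra: array of shape (num_frames, num_coeffs) of cepstral features.
    # Subtracting the per-utterance mean removes a time-invariant channel offset
    # (it also removes part of the long-term speech spectrum, a known trade-off).
    return cepstra - cepstra.mean(axis=0, keepdims=True)
```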

Issues in Developing ASR Systems

• Vocabulary

Vocabulary must be specified in advance (can’t recognize new words)

Pronunciation of each word must be specified exactly (phonetic substitutions may degrade performance)

Grammar: either very simple or very structured

Reasons:
  • phonetic recognition is so poor that confidence in each recognized phoneme is usually very low.
  • humans often speak ungrammatically or disfluently.

Issues in Developing ASR Systems

• How Well Does ASR Do?

(Figure: “Error Rates on Increasingly Difficult Problems”: word error rate (log scale, 1% to 100%) versus year (1988 to 2003) for read speech (1k noisy, 5k, and 20k vocabularies), varied microphones, noisy speech, broadcast speech, spontaneous speech (2-3k), structured speech, and conversational speech; human recognition of broadcast speech is about 0.9% WER.)

• Current best performance on conversational telephone speech is around 10% word error rate
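
Word error rate here is computed from a minimum-edit-distance alignment of the recognizer output against a reference transcript:

$$\mathrm{WER} = \frac{S + D + I}{N}$$

where $S$, $D$, and $I$ are the numbers of substituted, deleted, and inserted words and $N$ is the number of words in the reference (so WER can exceed 100%).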

ASR Technology vs. Spectrogram Reading

HMM-Based ASR:
• frame based − no identification of landmarks in speech signal
• duration of phonemes not identified until end of processing
• all frames are equally important
• “cues” are completely unspecified, learned by training
• coarticulation model = context-dependent phoneme models

Spectrogram Reading:
• first identify landmarks in the signal
  Where’s the vowel? Is that change in energy a plosive?
• identify change over duration of a phoneme, relative durations
  Is that formant movement a diphthong or coarticulation?
• identify activity at phoneme boundaries
  F2 goes to 1800 Hz at onset of voicing, voicing continues into frication, so it’s a voiced fric.
• specific cues to phoneme identity
  1800 Hz implies alveolar, F3 2000 Hz implies retroflex
• coarticulation model = tends toward locus theory

ASR Technology vs. Spectrogram Reading

HMM-Based ASR:
• frame based − no identification of landmarks in speech signal
• duration of phonemes not identified until end of processing
• all frames are equally important
• “cues” are completely unspecified, learned by training
• coarticulation model = context-dependent phoneme models

Spectrogram Reading and Human Speech Recognition:
• first identify landmarks in the signal
  Humans thought to have landmark (e.g. plosive) detectors
• identify change over duration of a phoneme, relative durations
  Humans very sensitive to small changes, especially at vowel/consonant boundaries
• identify activity at phoneme boundaries
  Transition into the vowel most important region for human speech perception
• specific cues to phoneme identity
  Humans use (large) set of specific cues, e.g. VOT

The Structure of Spoken Language

Final Points:

• Speech is complex! Not as simple as “sequence of phonemes”

• There is structure in speech, related to broad phonetic categories

• Identifying formant locations and movement is important

• Duration is important even for phoneme identity

• Phoneme boundaries are important

• There are numerous cues to phoneme identity

• Little is understood about how humans process speech

• Current ASR technology is incapable of accounting for all the information that humans use in reading spectrograms, and what is known about human speech processing is often not used… this implies (but does not prove) that current technology may be incapable of reaching human levels of performance.

• Speech is complex!