Human Speech Communication

[Diagram: message → linguistic code (~50 b/s) → motor control → speech production → SPEECH SIGNAL (~50 kb/s) → speech perception → cognitive processes → linguistic code (~50 b/s) → message]


PCM (Pulse Code Modulation)

• Transmit value of each speech sample
– dynamic range of speech is about 50-60 dB
• 11 bits/sample
– maximum frequency in telephone speech is 3.4 kHz
• sampling frequency 8 kHz

8000 x 11 = 88 kb/s

Simple and universal but not very efficient
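The arithmetic on this slide can be checked with a few lines of Python. This is only a sketch: the function names are made up for illustration, and the "about 6 dB per bit" rule used to relate word length to dynamic range is a standard approximation for uniform quantizers, not something stated on the slide.

```python
import math

def pcm_bit_rate(sampling_hz: int, bits_per_sample: int) -> int:
    """Bit rate of plain PCM: one fixed-length word per sample."""
    return sampling_hz * bits_per_sample

def quantizer_dynamic_range_db(bits_per_sample: int) -> float:
    """Ratio of full scale to one quantization step, in dB (about 6.02 dB per bit)."""
    return 20.0 * math.log10(2 ** bits_per_sample)

bit_rate = pcm_bit_rate(8000, 11)            # 88000 b/s, i.e. 88 kb/s
dyn_range = quantizer_dynamic_range_db(11)   # a little over 66 dB
nyquist_ok = 8000 >= 2 * 3400                # 8 kHz sampling covers 3.4 kHz speech

print(bit_rate, round(dyn_range, 1), nyquist_ok)
```

Eleven bits give roughly 66 dB between full scale and one step, which leaves a small margin over the 50-60 dB dynamic range quoted above.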

Better quantization ?

• Less quantization noise for weaker signals

[Figure: input-output (IN vs. OUT) compression curves for μ-law and A-law]

Logarithmic PCM (μ-law, A-law)

• Finer quantization for each individual small-amplitude sample
– how about small signal samples surrounded by large ones?
– it is the instantaneous signal energy which should determine the step
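To make the idea of logarithmic PCM concrete, here is a minimal sketch of μ-law companding around a uniform quantizer. The value μ = 255 and the 8-bit word length are assumptions taken from common telephone practice, not from the slides, and the continuous formula below is the textbook companding curve rather than the segmented table used in real codecs.

```python
import math

MU = 255.0  # assumed value, common in telephone-band codecs

def mu_law_compress(x: float, mu: float = MU) -> float:
    """Map x in [-1, 1] to [-1, 1], stretching small amplitudes."""
    return math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)

def mu_law_expand(y: float, mu: float = MU) -> float:
    """Inverse of mu_law_compress."""
    return math.copysign(math.expm1(abs(y) * math.log1p(mu)) / mu, y)

def quantize_uniform(y: float, bits: int) -> float:
    """Uniform quantizer on [-1, 1); returns the center of the chosen cell."""
    levels = 2 ** bits
    step = 2.0 / levels
    index = min(levels - 1, max(0, int(math.floor((y + 1.0) / step))))
    return -1.0 + (index + 0.5) * step

def linear_pcm(x: float, bits: int = 8) -> float:
    return quantize_uniform(x, bits)

def log_pcm(x: float, bits: int = 8) -> float:
    return mu_law_expand(quantize_uniform(mu_law_compress(x), bits))

weak = 0.01  # a weak sample, 40 dB below full scale
print(abs(linear_pcm(weak) - weak), abs(log_pcm(weak) - weak))
```

For the weak sample the companded quantizer leaves a much smaller error than the uniform one at the same word length, which is the "less quantization noise for weaker signals" property. Note that the step still depends on each sample's own amplitude, which is exactly the limitation raised in the last bullet.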

Differential coding

• For many natural signals, the difference between successive samples quantizes better than samples themselves

• Even better, predict the current sample from the past ones and transmit the error of the prediction

[Figure: waveform over time with the current sample marked]

Differential predictive coding

• DPCM
– a single predictor reflecting global predictability of speech
– predictor order up to 4-5
– delta modulation: gross quantization of prediction error into 1 bit (typically requires up-sampling well over the Nyquist rate)
• adaptive DPCM
– new predictor for every new speech block
– predictor needs to be transmitted together with the prediction error
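The DPCM loop can be sketched in a few lines. This is an illustrative toy, not a standardized coder: the first-order predictor coefficient 0.9, the step size 0.05, and the 200 Hz test tone are all assumed values. The encoder predicts from its own reconstruction, so encoder and decoder stay in step and the error never exceeds half a quantization step.

```python
import math

def dpcm_encode(samples, a=0.9, step=0.05):
    """Quantize the prediction error of a fixed first-order predictor."""
    codes = []
    recon_prev = 0.0  # what the decoder will have reconstructed so far
    for s in samples:
        prediction = a * recon_prev
        code = round((s - prediction) / step)  # integer code sent to the receiver
        codes.append(code)
        recon_prev = prediction + code * step
    return codes

def dpcm_decode(codes, a=0.9, step=0.05):
    """Rebuild the signal from the received prediction-error codes."""
    out = []
    recon_prev = 0.0
    for code in codes:
        recon_prev = a * recon_prev + code * step
        out.append(recon_prev)
    return out

# Assumed test input: 10 ms of a 200 Hz tone sampled at 8 kHz.
signal = [math.sin(2 * math.pi * 200 * n / 8000) for n in range(80)]
codes = dpcm_encode(signal)
decoded = dpcm_decode(codes)
max_err = max(abs(s - d) for s, d in zip(signal, decoded))
print(max(abs(c) for c in codes), max_err)
```

Quantizing the samples directly with the same step would need codes up to about ±20, while the prediction-error codes here stay within a handful of levels, so fewer bits per sample are enough. An adaptive DPCM coder would additionally re-estimate `a` (or a higher-order predictor) for every block and transmit it.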

Speech Coders

Linear model of speech production

A.G. Bell got it almost right

[Diagram: linear model of speech: source → filter → speech; the filter changes slowly]

[Figure: waveform over time with the current sample, showing short-term prediction and long-term prediction]

• short-term prediction: resonance of vocal tract
• long-term prediction: periodicity of voiced speech (vocal cord vibration)
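As a sketch of how a short-term predictor can be estimated, the following uses the autocorrelation method with the Levinson-Durbin recursion, a standard way to solve for linear-prediction coefficients. The slides do not say which method is meant, so treat this as one common choice. The test signal is the impulse response of a made-up two-pole filter standing in for a single vocal-tract resonance.

```python
def autocorrelation(x, max_lag):
    """Autocorrelation r[0..max_lag] of a finite sequence."""
    n = len(x)
    return [sum(x[t] * x[t - lag] for t in range(lag, n)) for lag in range(max_lag + 1)]

def levinson_durbin(r, order):
    """Solve for a[1..order] in x[n] ~ sum_k a[k] * x[n-k]; also return residual energy."""
    a = [0.0] * (order + 1)
    err = r[0]
    for i in range(1, order + 1):
        reflection = (r[i] - sum(a[k] * r[i - k] for k in range(1, i))) / err
        updated = a[:]
        updated[i] = reflection
        for k in range(1, i):
            updated[k] = a[k] - reflection * a[i - k]
        a = updated
        err *= 1.0 - reflection * reflection
    return a[1:], err

def all_pole_impulse_response(coeffs, n_samples):
    """Impulse response of x[n] = sum_k coeffs[k-1] * x[n-k] + impulse."""
    x = []
    for n in range(n_samples):
        excitation = 1.0 if n == 0 else 0.0
        past = sum(c * x[n - 1 - k] for k, c in enumerate(coeffs) if n - 1 - k >= 0)
        x.append(excitation + past)
    return x

true_coeffs = [1.3, -0.8]  # hypothetical resonance, poles inside the unit circle
x = all_pole_impulse_response(true_coeffs, 400)
coeffs, residual_energy = levinson_durbin(autocorrelation(x, 2), 2)
print(coeffs, residual_energy)
```

Because the test signal really is all-pole and has decayed by the end of the window, the estimated coefficients come out very close to the ones used to generate it. A long-term (pitch) predictor would instead look one pitch period back, which this sketch does not cover.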

LPC vocoder

• The same principle as in H. Dudley’s Vocoder
• Used by US Government (LPC-10): 2.4 kb/s

Residual Excited LPC (RELP)

• Transmitter
– simplify prediction error (low-pass filter and down-sample)
• Receiver
– re-introduce high frequencies in the simplified residual (nonlinear distortion)

Analysis-by-synthesis

• Identical synthesizer in coder and in decoder
– change parameters in coder
– use for synthesizing speech
– compare synthesized speech with real speech
– when “close enough”, send parameters to the receiver
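The analysis-by-synthesis loop can be illustrated with a deliberately tiny example. Everything here is hypothetical: the "synthesizer" is just a sine generator with two parameters, the candidate grid is arbitrary, and "close enough" is replaced by picking the smallest squared error. Real coders search excitation codebooks through an LPC synthesis filter, which this sketch does not attempt.

```python
import math
from itertools import product

def synthesize(freq_hz, amplitude, n_samples=80, fs=8000):
    """Toy synthesizer shared by coder and decoder."""
    return [amplitude * math.sin(2 * math.pi * freq_hz * n / fs) for n in range(n_samples)]

def squared_error(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def encode(target, freqs, amps):
    """Try every parameter setting, synthesize, compare, keep the closest."""
    return min(product(freqs, amps),
               key=lambda params: squared_error(target, synthesize(*params)))

def decode(params):
    """The receiver runs the identical synthesizer on the received parameters."""
    return synthesize(*params)

FREQS = range(100, 501, 50)       # candidate frequencies in Hz (assumed grid)
AMPS = [0.25, 0.5, 0.75, 1.0]     # candidate amplitudes (assumed grid)

target = synthesize(310.0, 0.72)  # stands in for one frame of real speech
params = encode(target, FREQS, AMPS)
error = squared_error(target, decode(params))
print(params, error)
```

Only `params` crosses the channel; the waveform itself is never transmitted, which is where the bit-rate saving comes from.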

Future in speech coding?

• No need to transmit what we do not hear
– study human hearing, especially masking
• No need to transmit what is predictable
– speech production mechanism
– speaker characteristics
– linguistic code (recognition-synthesis)
– thought-to-speech

Automatic recognition of speech

[Diagram: electric signal (more than 50 kb/s) → reduce information = decrease entropy → linguistic message, phoneme string (below 50 b/s); the reduction draws on knowledge: prior knowledge (textbook) and acquired knowledge (data)]

• Automatic speech recognition (ASR)
– derive proper response from speech stimulus
• Auditory perception
– how do biological systems respond to acoustic stimuli
• Knowledge of auditory perception?

Principle of stochastic ASR

• Using a model of the speech production process, generate all possible acoustic sequences $w_i$ for all legal linguistic messages
• Compare all generated sequences with the unknown acoustic input $x$ to find which one is the most similar

$$\hat{w} = \arg\max_{w_i} P\big(M(w_i) \mid x\big)$$

1. What is the model $M(w_i)$?
2. Form of the data $x$?

One (simple) model

[Diagram: “hello world” represented as a chain of phoneme-like states]

• Two dominant sources of variability in speech
1. people say the same thing at different speeds (temporal variability)
2. different people sound different, communication environments differ (feature variability)
• “Doubly stochastic” process (Hidden Markov Model)
– speech as a sequence of hidden states: recover the state sequence
1. never know for sure in which state we are
2. never know for sure which data can be generated from a given state

Hidden Markov Model

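[Figure: ten utterances of “hi”. Which sequence of male and female groups produced them?]

f0: 160 Hz, 170 Hz, 160 Hz, 170 Hz, 200 Hz, 110 Hz, 140 Hz, 240 Hz, 170 Hz, 190 Hz
groups: m f m m m m m f m m

[Diagram: two-state model with states m and f; stay probabilities pm and pf, switching probabilities pm-f and pf-m, initial probability p1m; each state emits f0 according to the model P(sound|gender)]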
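The male/female example can be turned into a small hidden Markov model and decoded with the Viterbi algorithm, the usual way to recover the most likely hidden state sequence. All numbers below are assumptions made for illustration: the slides name the probabilities (pm, pf, pm-f, pf-m, p1m) but give no values, and the Gaussian shape of P(sound|gender) over f0 is also a guess.

```python
import math

STATES = ("m", "f")
INITIAL = {"m": 0.5, "f": 0.5}                      # assumed p1m = 0.5
TRANSITION = {"m": {"m": 0.7, "f": 0.3},            # assumed pm, pm-f
              "f": {"m": 0.3, "f": 0.7}}            # assumed pf-m, pf
EMISSION = {"m": (120.0, 30.0), "f": (210.0, 40.0)}  # assumed (mean, std) of f0 in Hz

def log_gauss(x, mean, std):
    """Log density of a Gaussian, used as log P(sound | gender)."""
    return -0.5 * math.log(2 * math.pi * std * std) - 0.5 * ((x - mean) / std) ** 2

def viterbi(observations):
    """Most likely state sequence for a list of f0 values."""
    scores = {s: math.log(INITIAL[s]) + log_gauss(observations[0], *EMISSION[s])
              for s in STATES}
    backpointers = []
    for x in observations[1:]:
        new_scores, pointers = {}, {}
        for s in STATES:
            best_prev = max(STATES, key=lambda p: scores[p] + math.log(TRANSITION[p][s]))
            new_scores[s] = (scores[best_prev] + math.log(TRANSITION[best_prev][s])
                             + log_gauss(x, *EMISSION[s]))
            pointers[s] = best_prev
        backpointers.append(pointers)
        scores = new_scores
    state = max(STATES, key=lambda s: scores[s])
    path = [state]
    for pointers in reversed(backpointers):  # trace the best path backwards
        state = pointers[state]
        path.append(state)
    return path[::-1]

f0 = [160, 170, 160, 170, 200, 110, 140, 240, 170, 190]  # values from the slide
path = viterbi(f0)
print(" ".join(path))
```

The decoded labels depend entirely on the assumed parameters, so they need not match the labels on the slide. The point is the "doubly stochastic" structure: neither the state at a given time nor the data a state emits is known for sure, yet a best state sequence can still be recovered.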

[Diagram: units of speech (phonemes) generating the data x]

What should x be?

Speech signal ?

• Reflects changes in acoustic pressure
– its original purpose is reconstruction of speech
– does carry relevant information
• Always also carries some irrelevant information
– additional processing is necessary to alleviate it

[Figure: speech signal, with its histogram and correlations]

[Figure: panels labeled /u/, /o/, /a/, //, /iy/; “beer”; “Isaac Newton”]

• it is in the spectrum !!

Where Is The Message ?

[Figure: averaged FFT spectra of some vowels (/uw/, /ao/, /ah/, /eh/, /ih/, /iy/) from 3 hours of fluent speech]

Steam Engine (1769)

Internal Combustion Engine (2003)

Inertia in engineering

[Figure: spectrogram (time vs. frequency) of /j/ /u/ /ar/ /j/ /o/ /j/ /o/; spectral components are taken from 10-20 ms segments of the signal]

Short-term Spectrum

[Figure: short-term speech spectral envelope, with its histogram and correlations]

[Figure: logarithmic short-term speech spectral envelope, with its histogram and correlations]

[Figure: cosine transform of logarithmic short-term speech spectral envelope (cepstrum), with its histogram and correlations]
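The last step above, a cosine transform of the logarithmic spectral envelope, is short enough to write out. This is a bare-bones sketch: the unnormalized DCT-II and the four-point "envelopes" are illustrative choices, and no windowing, FFT, or smoothing stage is included.

```python
import math

def dct_ii(values):
    """Unnormalized DCT-II of a sequence."""
    n = len(values)
    return [sum(v * math.cos(math.pi * k * (i + 0.5) / n) for i, v in enumerate(values))
            for k in range(n)]

def cepstrum_from_envelope(envelope, n_coeffs):
    """Cosine transform of the log spectral envelope, truncated to n_coeffs terms."""
    log_envelope = [math.log(e) for e in envelope]
    return dct_ii(log_envelope)[:n_coeffs]

quiet = [1.0, 2.0, 4.0, 8.0]      # made-up envelope samples
loud = [10.0 * e for e in quiet]  # same shape, ten times the energy

c_quiet = cepstrum_from_envelope(quiet, 4)
c_loud = cepstrum_from_envelope(loud, 4)
print(c_quiet)
print(c_loud)
```

Scaling the envelope changes only the zeroth coefficient, because the logarithm turns the gain into an additive constant and the higher cosine basis functions sum to zero over a constant. That is one reason the higher cepstral coefficients are a convenient description of spectral shape.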

What Is Wrong With the Short-term Spectrum ?

1) inconsistent (same message, different representation)

[Diagram: short-term spectrum (over frequency) → auditory-like modifications → “auditory-like” spectrum]

Pitch of the tone (Mel scale)

• Frequency resolution of human ear decreases with frequency

[Figure: FFT spectrum over frequency f and time t, integrated into “critical-band energy”]

Emulating frequency resolution of human ear with FFT
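One way to emulate this decreasing frequency resolution is to space band edges evenly on a warped frequency axis and sum FFT power inside each band. The sketch below uses a widely quoted analytic approximation of the mel scale, 2595·log10(1 + f/700); that formula, the rectangular bands, and the choice of 10 bands up to 4 kHz are assumptions for illustration and do not come from the slides.

```python
import math

def hz_to_mel(f_hz):
    """A common analytic approximation of the mel scale (assumed here)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_band_edges(low_hz, high_hz, n_bands):
    """Band edges in Hz, equally spaced on the mel axis."""
    low, high = hz_to_mel(low_hz), hz_to_mel(high_hz)
    return [mel_to_hz(low + (high - low) * i / n_bands) for i in range(n_bands + 1)]

def band_energies(power_bins, fs, edges_hz):
    """Sum FFT power bins (covering 0..fs/2) inside each band [edge_b, edge_b+1)."""
    bin_hz = (fs / 2.0) / (len(power_bins) - 1)
    energies = [0.0] * (len(edges_hz) - 1)
    for i, power in enumerate(power_bins):
        f = i * bin_hz
        for b in range(len(energies)):
            if edges_hz[b] <= f < edges_hz[b + 1]:
                energies[b] += power
                break
    return energies

edges = mel_band_edges(0.0, 4000.0, 10)
widths = [hi - lo for lo, hi in zip(edges, edges[1:])]
flat = band_energies([1.0] * 129, 8000, edges)  # flat spectrum: energy ~ bins per band
print([round(w) for w in widths])
```

The band widths in Hz grow with frequency, so high-frequency bands pool many FFT bins while low-frequency bands keep fine resolution. Practical front ends usually use overlapping triangular or trapezoidal weights rather than the hard band limits used here.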

Equal Loudness Curves

Perceptual Linear Prediction (PLP) [Hermansky 1990]

• Auditory-like modifications of short-term speech spectrum prior to its approximation by all-pole autoregressive model
– critical-band spectral resolution
– equal-loudness sensitivity
– intensity-loudness nonlinearity
• Today applied in virtually all state-of-the-art experimental ASR systems

[Figure: spectrogram (time vs. frequency) of /j/ /u/ /ar/ /j/ /o/ /j/ /o/]

LDA gives basis for projection of spectral space

Spectral Basis from LDA

LDA vectors from Fourier Spectrum

Spectral resolution of LDA-derived spectral basis is higher at low frequencies

Critical bands of human hearing are narrower at lower frequencies

[Figure: four LDA-derived vectors, labeled 63 %, 16 %, 12 %, and 2 %]

Sensitivity to Spectral Change

[Figure (Malayath 1999): sensitivity to spectral change for cosine basis, LDA-derived bases, and critical-band filterbank]

• if the receiver could be controlled
– put more resources (introduce less noise) where there is more signal
– biological system optimized for information extraction from sensory signals

Combination of channel and signal spectrum should be as flat (as random-like) as possible.
– Shannon, Communication in the Presence of Noise (1949)

[Figure: energy of the signal and level of noise in the channel, plotted over resource space, for both cases]

• if the signal could be controlled (e.g. in communication)
– put more signal where there is less noise
– sensory signal optimized for a given communication channel

What Is Wrong With the Short-term Spectral Envelope?

2) Fragile (easily corrupted by minor disturbances)

[Figure: spectrum over frequency f, shown clean and under two kinds of disturbance]

• additive band-limited noise: ignore the noisy parts of the spectrum
• linear (high-pass) filtering: remove means from parts of the spectrum

Simultaneous Masking

[Figure: threshold of perception of the tone at f versus noise bandwidth, for band-pass filtered noise centered at f; the critical bandwidth is marked]

• Nonlinear frequency resolution of hearing
– Critical bands
  • up to ~600 Hz constant bandwidth
  • above 1 kHz constant Q

More Important Outcome of Masking Experiments

• What happens outside the critical band does not affect detection of events within the band !!!

• Independent processing of parts of the spectrum ?

[Diagram: spectrum S(frequency) split into frequency bands; each band yields its own posteriors pf1, pf2, pf3, pf4, pf5, pf6, collected as {p(f)}]

(Hermansky, Sharma and Pavel 1996; Bourlard and Dupont 1996)

Replace spectral vector by a matrix of posterior probabilities of acoustic events

[Diagram: the “hello world” chain of phoneme-like states, annotated “coarticulation”]

What Is Wrong With the Short-term Spectral Envelope?

3) Coarticulation (inertia of organs of speech production)

human auditory perception

Masking in Time

• suggests ~200 ms buffer in auditory system
– also seen in perception of loudness, detection of short stimuli, gaps in tones, auditory afterimages, binaural release from masking, ...
– what happens outside this buffer does not affect detection of signal within the buffer

[Figure: a masker followed in time by a signal; the increase in threshold spans 0 to 200 ms and is larger for a stronger masker]

Short-term Features?

[Diagram: data x processed in ~10 ms steps over time; should the processing use a longer time span (~250 ms)?]

Cortical Receptive Fields

Average of the first two principal components (83% of variance) along temporal axis from about 180 cortical receptive fields (from D. Klein 2004, unpublished)

• time-frequency distribution of the linear component of the most efficient stimulus that excites the given auditory neuron

Data for Deriving Posterior Probabilities of Speech Events

[Figure: time-frequency plane (TIME [s] vs. FREQUENCY) with a patch spanning 1-3 critical bands and 250-1000 ms]

How to Get Estimates of Temporal Evolution of Spectral Energy?
– with M. Athineos, D. Ellis (Columbia Univ), and P. Fousek (CTU Prague)

[Diagram: data x over time; instead of 10-20 ms frames, take segments of 200-1000 ms in bands of 1-3 Bark and fit an all-pole model of that part of the time-frequency plane]

All-pole Model of Temporal Trajectory of Spectral Energy

conventional LP: the signal → signal power spectrum → all-pole model of the power spectrum

spectral domain LP: DCT of the signal → Hilbert envelope of the signal → all-pole model of the Hilbert envelope

[Diagram: signal → discrete cosine transform → split into low-frequency and high-frequency parts → prediction on each part → all-pole model of low-frequency Hilbert envelope and all-pole model of high-frequency Hilbert envelope]

All-pole Models of Sub-band Energy Contours

Critical-band Spectrum From FFT

[Figure: time vs. tonality]

Critical-band Spectrum From All-pole Models Of Hilbert Envelopes in Critical Bands

[Figure: time vs. tonality]

Putting It All Together

• TRAP-TANDEM
– data-guided features based on frequency-independent processing of relatively long spans of signal
• with S. Sharma, P. Jain, S. Sivadas, ICSI Berkeley and TU Brno

[Diagram: time-frequency data → data processing (trained NN) in separate frequency bands → class posteriors → processing (trained NN) → some function of phoneme posteriors]
