Human Speech Communication
message → linguistic code (~50 b/s) → motor control → speech production → SPEECH SIGNAL (~50 kb/s) → speech perception → cognitive processes → linguistic code (~50 b/s) → message
PCM (Pulse Code Modulation)
• Transmit value of each speech sample
  – dynamic range of speech is about 50-60 dB
    • 11 bits/sample
  – maximum frequency in telephone speech is 3.4 kHz
    • sampling frequency 8 kHz
8000 x 11 = 88 kb/s
Simple and universal but not very efficient
Better quantization?
• Less quantization noise for weaker signals
[Figure: quantizer characteristic (IN vs. OUT) for μ-law and A-law]
Logarithmic PCM (μ-law, A-law)
• Finer quantization for each individual small amplitude sample
  – how about small signal samples surrounded by large ones?
  – it is the instantaneous signal energy which should determine the step
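As a rough illustration of why logarithmic PCM helps weak signals, the sketch below compares uniform and μ-law quantization at the same word length. The μ = 255 companding curve is the commonly used one; the function names, the 8-bit setting and the test tone are assumptions made for this example and do not come from the slides.

```python
import numpy as np

MU = 255.0  # companding constant (assumed; the common choice for mu-law)

def compress(x):
    """Map samples in [-1, 1] through the mu-law curve."""
    return np.sign(x) * np.log1p(MU * np.abs(x)) / np.log1p(MU)

def expand(y):
    """Inverse of compress()."""
    return np.sign(y) * np.expm1(np.abs(y) * np.log1p(MU)) / MU

def quantize_uniform(v, bits):
    """Round values in [-1, 1] to a uniform grid with 2**(bits - 1) steps per unit."""
    levels = 2 ** (bits - 1)
    return np.round(v * levels) / levels

def rms(v):
    return float(np.sqrt(np.mean(v ** 2)))

# A weak test tone at 1% of full scale (hypothetical signal, 8 kHz sampling).
n = np.arange(800)
weak = 0.01 * np.sin(2 * np.pi * 200 * n / 8000)

uniform_error = rms(weak - quantize_uniform(weak, 8))
mulaw_error = rms(weak - expand(quantize_uniform(compress(weak), 8)))
print("uniform 8-bit error:", uniform_error)
print("mu-law  8-bit error:", mulaw_error)
```

The companded quantizer spends its levels near zero, so the weak tone is reproduced with a smaller error than with uniform steps of the same word length.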
Differential coding
• For many natural signals, the difference between successive samples quantizes better than samples themselves
• Even better, predict the current sample from the past ones and transmit the error of the prediction
[Figure: waveform over time with the current sample marked]
Differential predictive coding
• DPCM
  – a single predictor reflecting global predictability of speech
  – predictor order up to 4-5
  – delta modulation: gross quantization of prediction error into 1 bit (typically requires up-sampling well over the Nyquist rate)
• adaptive DPCM
  – new predictor for every new speech block
  – predictor needs to be transmitted together with the prediction error
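A minimal sketch of the DPCM idea, assuming a fixed first-order predictor (coefficient 0.9) and a uniform quantizer for the prediction error. The function names, the predictor coefficient, the step size and the test tone are all illustrative assumptions; the slides mention predictor orders up to 4-5.

```python
import numpy as np

def dpcm_encode(x, step, a=0.9):
    """Quantize the error of predicting each sample from the previous reconstruction."""
    codes = np.empty(len(x), dtype=int)
    prev = 0.0  # previous reconstructed sample, as the decoder will see it
    for n, sample in enumerate(x):
        pred = a * prev
        codes[n] = int(np.round((sample - pred) / step))
        prev = pred + codes[n] * step  # track the decoder's reconstruction
    return codes

def dpcm_decode(codes, step, a=0.9):
    """Rebuild the signal by adding each dequantized error to the prediction."""
    out = np.empty(len(codes))
    prev = 0.0
    for n, code in enumerate(codes):
        prev = a * prev + code * step
        out[n] = prev
    return out

# Hypothetical test tone: 200 Hz sampled at 8 kHz.
x = np.sin(2 * np.pi * 200 * np.arange(400) / 8000)
step = 0.01
codes = dpcm_encode(x, step)
recon = dpcm_decode(codes, step)
pcm_codes = np.round(x / step)

print("largest PCM code :", int(np.max(np.abs(pcm_codes))))
print("largest DPCM code:", int(np.max(np.abs(codes))))
print("max reconstruction error:", float(np.max(np.abs(recon - x))))
```

Because the encoder predicts from the decoder's own reconstruction, the quantization error does not accumulate, and the prediction error spans a much smaller range of codes than the samples themselves.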
Speech Coders
Linear model of speech production
A.G. Bell got it almost right
[Diagram: linear model of speech: source → filter → speech (label: changes slowly)]
[Figure: waveform over time with the current sample marked, showing short-term prediction and long-term prediction]
• short-term: resonance of vocal tract
• long-term: periodicity of voiced speech (vocal cord vibration)
LPC vocoder
• The same principle as in H. Dudley’s Vocoder
• Used by US Government (LPC-10): 2.4 kb/s
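The short-term predictor at the heart of an LPC coder can be sketched as follows. This is an illustrative autocorrelation-method estimate that solves the normal equations directly (practical coders typically use the Levinson-Durbin recursion); the function name `lpc`, the synthetic second-order signal and its coefficients are assumptions for the example, not details of LPC-10.

```python
import numpy as np

def lpc(x, order):
    """Predictor coefficients a[1..order] for x_hat[n] = sum_k a[k] * x[n - k]."""
    x = np.asarray(x, dtype=float)
    # Autocorrelation at lags 0..order.
    r = np.array([np.dot(x[: len(x) - k], x[k:]) for k in range(order + 1)])
    # Normal equations R a = r[1:], with R[i, j] = r[|i - j|].
    lags = np.abs(np.subtract.outer(np.arange(order), np.arange(order)))
    return np.linalg.solve(r[lags], r[1:])

# Synthetic "speech-like" signal: white noise through a known all-pole filter.
rng = np.random.default_rng(0)
excitation = rng.standard_normal(20000)
x = np.zeros(len(excitation))
for n in range(2, len(x)):
    x[n] = 1.3 * x[n - 1] - 0.6 * x[n - 2] + excitation[n]

a = lpc(x, 2)
print("estimated predictor:", a)  # expected to be close to [1.3, -0.6]
```

In a vocoder only the predictor and a coarse description of the excitation (the source) are transmitted, which is what makes rates of a few kb/s possible.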
Residual Excited LPC (RELP)
• Transmitter:
  – simplify prediction error (low-pass filter and down-sample)
• Receiver:
  – re-introduce high frequencies in the simplified residual (nonlinear distortion)
Analysis-by-synthesis
• Identical synthesizer in coder and in decoder
  – change parameters in coder
  – use for synthesizing speech
  – compare synthesized speech with real speech
  – when “close enough”, send parameters to the receiver
Future in speech coding?
• No need to transmit what we do not hear
  – study human hearing, especially masking
• No need to transmit what is predictable
  – speech production mechanism
  – speaker characteristics
  – linguistic code (recognition-synthesis)
  – thought-to-speech
Automatic recognition of speech
[Diagram: electric signal (more than 50 kb/s) → reduce information = decrease entropy, using knowledge → linguistic message, phoneme string (below 50 b/s). Knowledge: prior knowledge (textbook) and acquired knowledge (data)]
• Automatic speech recognition (ASR)
  – derive proper response from speech stimulus
• Auditory perception
  – how do biological systems respond to acoustic stimuli
• Knowledge of auditory perception?
Principle of stochastic ASR
• Using a model of speech production process, generate all possible acoustic sequences w_i for all legal linguistic messages
• Compare all generated sequences with the unknown acoustic input x to find which one is the most similar

$$\hat{w} = \arg\max_{i} P(M(w_i) \mid x)$$

1. What is the model M(w_i)?
2. Form of the data x?
One (simple) model
[Figure: the utterance “hello world” as a sequence of phoneme-like states]
• Two dominant sources of variability in speech
  1. people say the same thing at different speeds (temporal variability)
  2. different people sound different, communication environments differ (feature variability)
• “Doubly stochastic” process (Hidden Markov Model)
  – Speech as a sequence of hidden states: recover the state sequence
    1. never know for sure which state we are in
    2. never know for sure which data can be generated from a given state
Hidden Markov Model
[Figure: ten utterances of “hi”. Sequence of male and female groups?]
f0 = 160, 170, 160, 170, 200, 110, 140, 240, 170, 190 Hz
[Label sequences shown on the slide: m f m m m m m f m m; m f m f m]
[Diagram: two-state model with states m and f; probabilities pm, pf, pm-f, pf-m and p1m; the model P(sound|gender) over f0]
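The male/female example can be worked through with a small Viterbi decoder. Only the ten f0 values come from the slide; the Gaussian form of P(sound|gender), its means and spreads, and the transition and initial probabilities are assumptions chosen for illustration, so the decoded sequence need not match the labels shown on the slide.

```python
import numpy as np

# f0 values (Hz) from the slide.
f0 = np.array([160, 170, 160, 170, 200, 110, 140, 240, 170, 190], dtype=float)

# Everything below is an assumed, illustrative model.
states = ["m", "f"]
means = np.array([120.0, 210.0])          # assumed mean f0 of each group
stds = np.array([30.0, 30.0])             # assumed spread of f0 in each group
log_init = np.log(np.array([0.5, 0.5]))   # assumed initial probabilities
log_trans = np.log(np.array([[0.8, 0.2],  # rows: from state, columns: to state
                             [0.2, 0.8]]))

def log_gauss(x, mean, std):
    """Log of a Gaussian density, used here as log P(sound|gender)."""
    return -0.5 * np.log(2 * np.pi * std ** 2) - 0.5 * ((x - mean) / std) ** 2

def viterbi(obs):
    """Most likely hidden state sequence for the observations."""
    n = len(obs)
    log_emit = log_gauss(obs[:, None], means[None, :], stds[None, :])
    score = np.empty((n, len(states)))
    back = np.zeros((n, len(states)), dtype=int)
    score[0] = log_init + log_emit[0]
    for t in range(1, n):
        cand = score[t - 1][:, None] + log_trans  # cand[i, j]: state i -> state j
        back[t] = np.argmax(cand, axis=0)
        score[t] = np.max(cand, axis=0) + log_emit[t]
    path = [int(np.argmax(score[-1]))]
    for t in range(n - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return [states[s] for s in reversed(path)]

labels = viterbi(f0)
print(" ".join(labels))
```

Both uncertainties named on the slides appear here: the state is never observed directly, and the same f0 value could have been produced by either state.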
[Figure: units of speech (phonemes) generating the data x]
What should x be?
Speech signal?
• always also carries some irrelevant information
  – additional processing is necessary to alleviate it
• Reflects changes in acoustic pressure
  – its original purpose is reconstruction of speech
  – does carry relevant information
[Figure: histogram and correlations of the speech signal]
Where Is The Message?
• it is in the spectrum!!
[Figure: Isaac Newton, beer; vowel sequence /u/ /o/ /a/ // /iy/]
[Figure: averaged FFT spectra of some vowels (/uw/ /ao/ /ah/ /eh/ /ih/ /iy/) from 3 hours of fluent speech]
Inertia in engineering
[Figure: Steam Engine (1769) and Internal Combustion Engine (2003)]
Short-term Spectrum
[Figure: spectrogram (time vs. frequency) of /j/ /u/ /ar/ /j/ /o/ /j/ /o/; spectral components are obtained from 10-20 ms segments]
[Figure: histogram and correlations of the short-term speech spectral envelope]
[Figure: histogram and correlations of the logarithmic short-term speech spectral envelope]
[Figure: histogram and correlations of the cosine transform of the logarithmic short-term speech spectral envelope (cepstrum)]
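The chain from signal to cepstrum (short-term spectrum, logarithm, cosine transform) can be sketched as below. As a simplifying assumption, the code applies the cosine transform to the log power spectrum of each frame rather than to a smoothed spectral envelope; the frame length, hop, window, number of coefficients and all function names are likewise assumptions for the example.

```python
import numpy as np

def frame_signal(x, frame_len, hop):
    """Cut x into overlapping frames, one frame per row."""
    n_frames = 1 + (len(x) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx]

def dct_matrix(n_out, n_in):
    """First n_out rows of an (unnormalized) DCT-II basis."""
    k = np.arange(n_out)[:, None]
    n = np.arange(n_in)[None, :]
    return np.cos(np.pi * k * (n + 0.5) / n_in)

def cepstra(x, sample_rate, frame_ms=20, hop_ms=10, n_ceps=13):
    """Cosine transform of the log short-term power spectrum, frame by frame."""
    frame_len = int(sample_rate * frame_ms / 1000)
    hop = int(sample_rate * hop_ms / 1000)
    frames = frame_signal(x, frame_len, hop) * np.hamming(frame_len)
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    log_power = np.log(power + 1e-10)  # small floor avoids log(0)
    return log_power @ dct_matrix(n_ceps, log_power.shape[1]).T

# One second of noise at 8 kHz stands in for speech (hypothetical input).
rng = np.random.default_rng(0)
features = cepstra(rng.standard_normal(8000), 8000)
print(features.shape)
```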
What Is Wrong With the Short-term Spectrum ?
1) inconsistent (same message, different representation)
[Figure: short-term spectrum → auditory-like modifications → “auditory-like” spectrum, plotted over frequency]
Pitch of the tone (Mel scale)
• Frequency resolution of human ear decreases with frequency
Emulating frequency resolution of human ear with FFT
[Figure: FFT spectrum over frequency (f) and time (t), integrated into “critical-band energy”]
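One common way to emulate this with an FFT is to sum spectral bins under overlapping triangular weights whose widths grow with frequency. The sketch below uses a widely quoted Mel-scale approximation (2595 log10(1 + f/700)); that formula, the triangular shape, the number of bands and all function names are assumptions for the example rather than details given on the slides.

```python
import numpy as np

def hz_to_mel(f):
    """A commonly used Mel-scale approximation (assumed here)."""
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def band_edges(n_bands, f_max):
    """Band edges equally spaced on the Mel scale, returned in Hz."""
    return mel_to_hz(np.linspace(0.0, hz_to_mel(f_max), n_bands + 2))

def band_energies(power_spectrum, sample_rate, n_bands=15):
    """Sum FFT power under triangular weights that widen with frequency."""
    freqs = np.linspace(0.0, sample_rate / 2, len(power_spectrum))
    edges = band_edges(n_bands, sample_rate / 2)
    energies = np.zeros(n_bands)
    for b in range(n_bands):
        lo, mid, hi = edges[b], edges[b + 1], edges[b + 2]
        rising = (freqs - lo) / (mid - lo)
        falling = (hi - freqs) / (hi - mid)
        weights = np.clip(np.minimum(rising, falling), 0.0, None)
        energies[b] = np.sum(weights * power_spectrum)
    return energies

edges = band_edges(15, 4000.0)
energies = band_energies(np.ones(129), 8000)  # flat spectrum as a test input
print(np.round(np.diff(edges)))  # spacing between band edges grows with frequency
```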
Equal Loudness Curves
Perceptual Linear Prediction (PLP) [Hermansky 1990]
• Auditory-like modifications of short-term speech spectrum prior to its approximation by all-pole autoregressive model
  – critical-band spectral resolution
  – equal-loudness sensitivity
  – intensity-loudness nonlinearity
• Today applied in virtually all state-of-the-art experimental ASR systems
Spectral Basis from LDA
[Figure: spectrogram (time vs. frequency) of /j/ /u/ /ar/ /j/ /o/ /j/ /o/]
LDA gives basis for projection of spectral space
LDA vectors from Fourier Spectrum
Spectral resolution of LDA-derived spectral basis is higher at low frequencies
Critical bands of human hearing are narrower at lower frequencies
[Figure: four LDA vectors, labelled 63 %, 16 %, 12 %, 2 %]
Sensitivity to Spectral Change (Malayath 1999)
[Figure panels: cosine basis, LDA-derived bases, critical-band filterbank]
• if the receiver could be controlled
  – put more resources (introduce less noise) where there is more signal
  – biological system optimized for information extraction from sensory signals
Combination of channel and signal spectrum should be as flat (as random-like) as possible. (Shannon, Communication in the Presence of Noise, 1949)
[Figures: energy of the signal and level of noise in the channel, plotted over resource space]
• if signal could be controlled (e.g. in communication)
  – put more signal where there is less noise
  – sensory signal optimized for a given communication channel
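For the case where the signal can be controlled, "more signal where there is less noise" with a flat combined spectrum is what is usually called water-filling. The sketch below is an illustrative implementation by bisection; the function name, the four-band noise profile and the power budget are made-up values for the example.

```python
import numpy as np

def water_fill(noise, total_power, iters=60):
    """Distribute total_power so that signal + noise is flat wherever signal > 0."""
    noise = np.asarray(noise, dtype=float)
    lo, hi = noise.min(), noise.max() + total_power
    for _ in range(iters):  # bisection on the common "water level"
        level = 0.5 * (lo + hi)
        used = np.sum(np.maximum(level - noise, 0.0))
        if used > total_power:
            hi = level
        else:
            lo = level
    level = 0.5 * (lo + hi)
    return np.maximum(level - noise, 0.0), level

# Hypothetical noise levels in four bands and a signal power budget of 3.
noise = np.array([1.0, 2.0, 5.0, 0.5])
signal, level = water_fill(noise, 3.0)
print("signal per band:", np.round(signal, 3))
print("signal + noise :", np.round(signal + noise, 3))
```

The noisiest band receives no signal at all, and in the remaining bands signal plus noise sits at one common level.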
What Is Wrong With the Short-term Spectral Envelope?
2) Fragile (easily corrupted by minor disturbances)
[Figure: spectrum over frequency f]
• additive band-limited noise: ignore the noisy parts of the spectrum
• linear (high-pass) filtering: remove means from parts of the spectrum
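Why mean removal counters linear filtering can be seen in a few lines: a fixed filter multiplies the power in each band by a constant, which becomes an additive constant after the logarithm and disappears when the per-band mean over time is subtracted. The band count, the random "clean" spectra and the channel gains below are made-up values for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_frames, n_bands = 200, 8

# Hypothetical clean band energies and a fixed channel (one gain per band).
clean_power = rng.uniform(0.1, 10.0, size=(n_frames, n_bands))
channel_gain = np.linspace(0.05, 2.0, n_bands)
filtered_power = clean_power * channel_gain  # linear filtering scales each band

def remove_band_means(power):
    """Log band energies with the time average of each band subtracted."""
    log_power = np.log(power)
    return log_power - log_power.mean(axis=0, keepdims=True)

clean_features = remove_band_means(clean_power)
filtered_features = remove_band_means(filtered_power)
print(np.max(np.abs(clean_features - filtered_features)))  # essentially zero
```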
Simultaneous Masking
[Figure: threshold of perception of a tone at f versus the bandwidth of band-pass filtered noise centered at f; the critical bandwidth is marked]
• Nonlinear frequency resolution of hearing
  – Critical bands
    • up to ~600 Hz constant bandwidth
    • above 1 kHz constant Q
More Important Outcome of Masking Experiments
• What happens outside the critical band does not affect detection of events within the band !!!
• Independent processing of parts of the spectrum ?
[Figure: spectrum S(frequency) divided into frequency bands, with per-band estimates pf1 ... pf6 forming {p(f)}] (Hermansky, Sharma and Pavel 1996; Bourlard and Dupont 1996)
Replace spectral vector by a matrix of posterior probabilities of acoustic events
What Is Wrong With the Short-term Spectral Envelope?
3) Coarticulation (inertia of organs of speech production)
[Figure: phoneme-like state sequence for “hello world”, labelled coarticulation]
human auditory perception
Masking in Time
• suggests ~200 ms buffer in auditory system
  – also seen in perception of loudness, detection of short stimuli, gaps in tones, auditory afterimages, binaural release from masking, ...
  – what happens outside this buffer does not affect detection of signal within the buffer
[Figure: signal and masker over time; increase in threshold over 0-200 ms, with a stronger masker also shown]
Short-term Features?
[Figure: data x taken from a ~10 ms window over time, followed by processing]
longer time span? (~250 ms?)
Cortical Receptive Fields
[Figure: average of the first two principal components (83% of variance) along temporal axis from about 180 cortical receptive fields (from D. Klein 2004, unpublished)]
• time-frequency distribution of the linear component of the most efficient stimulus that excites the given auditory neuron
Data for Deriving Posterior Probabilities of Speech Events
[Figure: time-frequency plane (TIME [s] vs. FREQUENCY) with a patch spanning 1-3 critical bands and 250-1000 ms]
How to Get Estimates of Temporal Evolution of Spectral Energy?
– with M. Athineos, D. Ellis (Columbia Univ), and P. Fousek (CTU Prague)
[Figure: data x over time: 10-20 ms analysis windows versus 200-1000 ms segments of 1-3 Bark; all-pole model of part of time-frequency plane]
All-pole Model of Temporal Trajectory of Spectral Energy
• conventional LP: the signal → signal power spectrum → all-pole model of the power spectrum
• spectral domain LP: DCT of the signal → Hilbert envelope of the signal → all-pole model of the Hilbert envelope
All-pole Models of Sub-band Energy Contours
[Diagram: signal → discrete cosine transform → low-frequency and high-frequency parts → prediction on each part → all-pole model of low-frequency Hilbert envelope and all-pole model of high-frequency Hilbert envelope]
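The quantity that these all-pole models approximate is the Hilbert envelope of a (sub-band) signal. The sketch below computes only that envelope, via the analytic signal obtained with an FFT; the all-pole fit itself is not shown. The function name and the amplitude-modulated test signal are assumptions for the illustration.

```python
import numpy as np

def hilbert_envelope(x):
    """Magnitude of the analytic signal, computed with the FFT."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    weights = np.zeros(n)
    weights[0] = 1.0                       # keep DC
    if n % 2 == 0:
        weights[n // 2] = 1.0              # keep the Nyquist bin
        weights[1 : n // 2] = 2.0          # double positive frequencies
    else:
        weights[1 : (n + 1) // 2] = 2.0
    analytic = np.fft.ifft(np.fft.fft(x) * weights)  # negative frequencies zeroed
    return np.abs(analytic)

# Hypothetical sub-band signal: a carrier with a slowly varying energy contour.
t = np.arange(256)
contour = 1.0 + 0.5 * np.cos(2 * np.pi * 2 * t / 256)
carrier = np.cos(2 * np.pi * 40 * t / 256)
envelope = hilbert_envelope(contour * carrier)
print(np.max(np.abs(envelope - contour)))  # essentially zero for this signal
```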
[Figure: Critical-band Spectrum From FFT (time vs. tonality)]
[Figure: Critical-band Spectrum From All-pole Models Of Hilbert Envelopes in Critical Bands (time vs. tonality)]
Putting It All Together
• TRAP-TANDEM
  – data-guided features based on frequency-independent processing of relatively long spans of signal
• with S. Sharma, P. Jain, S. Sivadas, ICSI Berkeley and TU Brno
[Diagram: time-frequency plane → per-band data processing (trained NN) → class posteriors → processing (trained NN) → some function of phoneme posteriors]