Post on 07-Jan-2016
Temporal masking of spectrally reduced speech: psychoacoustical experiments and links with ASR
Frédéric Berthommier and Angélique Grosgeorges
ICP
46 av. Félix Viallet, Grenoble, France
email: (bertho,ggeorges)@icp.inpg.fr
Introduction and motivations
We used the experimental paradigm proposed by [Shannon et al., 95], from which we developed a series of experiments. Following (Horii et al., 1971), they varied the spectro-temporal resolution of speech utterances. The stimuli were composed of white noise modulated by the filtered envelopes extracted in 4 subbands. The task was consonant identification in VCVCV utterances, among 16 French consonants. We then evaluated the transmission of their phonetic features: voicing, manner and place of articulation.
We extend this paradigm by masking this residual signal with stationary [Lorenzi et al., 99] or non-stationary noises [Grosgeorges et al., 00]. In this framework, we replace the pair (local SNR / acoustic representation), analysed in terms of identification rate, with another pair (global SNR / phonetic representation), analysed in terms of feature transmission.
Then, we focus on the problem of acoustic phonetic decoding in noise, and on the impact of the noise on the features grounding the classification process. In other words, we postulate the existence of an intermediate level preceding the phonetic categorisation, and we study its properties.
Introduction and motivations (2)
We therefore expect a set of complementary results from this approach: results that inform the study of the link between auditory and speech processes, that are useful for CASA, and that help develop ASR for noisy and distorted speech.
For RESPITE, the goal of this project is to set up a plausible multi-stream model in which the phonetic identification of consonants is grounded in the extraction of these three phonetic characteristics (voicing, place and manner) in specialised modules having different spectro-temporal resolutions. A pre-classification according to this phonetic representation could be more robust than direct classification, the streams easier to weight according to their information content, and the fusion process easier to control. Remark: vowel identification is considered to be well modelled in current implementations.
Moreover, the visual modality can easily be integrated into this model for the same reason: the audio-visual complementarity is optimally represented.
The Shannon et al. experiment
Spectral degradation: the signal was divided into one, two, three or four frequency bands.
Temporal degradation: the amplitude envelope extracted from each band was low-pass filtered with a cutoff frequency Fc of 16, 50, 160 or 500 Hz.
The identification of 3 features (voicing, manner and place) for 16 French consonants in "a/C/a" context was evaluated by the classical information transmission analysis (Miller and Nicely, 1955).
[Figure: information received (%), from 0 to 100, as a function of the number of bands (1 to 4), in three panels: Voicing, Manner, Place. Legend: Fc = 16 Hz or 50, 160, 500 Hz.]
The main conclusion is: despite the great spectro-temporal reduction, voicing and manner are remarkably well transmitted by the residual envelope, i.e. by the temporal components of the speech.
Some questions arise: how is this residue processed? How can it be used to increase robustness?
One way is to mask it and to analyse what occurs.
Factorial design of the masking experiment
Factor n°1: The spectral resolution was constant at 4 frequency bands, and the envelope was filtered with cutoff frequency Fc at 10 or 500Hz.
Factor n°2: We added different temporal maskers in order to selectively degrade the different components of the residual signal:
(1) in order to mask the coarse component of temporal information, we used a low frequency AM (amplitude modulation < 8Hz) white noise applied in each subband, for all maskers.
(2) to degrade the residual spectral information, we decorrelated the low frequency AM across the 4 frequency bands.
(3) to mask the fine temporal information, we re-modulated the low frequency AM of the masker at 100Hz.
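The three masker manipulations can be sketched as follows. This is a minimal illustration in Python, not the authors' implementation: the function names, the sum-of-sinusoids construction of the slow modulator, the raised-sine form of the 100 Hz re-modulation and the omission of the per-band band-pass filtering are all assumptions of the sketch. Only the sampling rate (11025 Hz), the 8 Hz limit, the 4 bands and the 100 Hz rate come from the slides.

```python
import math
import random

FS = 11025  # sampling rate in Hz, as stated for the stimuli


def slow_modulator(n_samples, rng, fs=FS, max_hz=8.0, n_components=8):
    """Random envelope in [0, 1] whose modulation content lies below max_hz.

    Assumption: a sum of random-phase sinusoids stands in for whatever
    low-frequency AM source was actually used.
    """
    freqs = [rng.uniform(0.5, max_hz) for _ in range(n_components)]
    phases = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(n_components)]
    env = []
    for n in range(n_samples):
        t = n / fs
        s = sum(math.sin(2.0 * math.pi * f * t + p) for f, p in zip(freqs, phases))
        env.append(0.5 * (1.0 + s / n_components))
    return env


def masker_envelopes(level, n_samples, n_bands=4, seed=0, fs=FS):
    """Per-band masker envelopes for Level 1, 2 or 3 of Factor n°2."""
    if level not in (1, 2, 3):
        raise ValueError("level must be 1, 2 or 3")
    rng = random.Random(seed)
    if level == 1:
        # (1): the same low-frequency AM in every band (correlated)
        shared = slow_modulator(n_samples, rng, fs)
        envs = [list(shared) for _ in range(n_bands)]
    else:
        # (1) + (2): an independent low-frequency AM per band (decorrelated)
        envs = [slow_modulator(n_samples, rng, fs) for _ in range(n_bands)]
    if level == 3:
        # (1) + (2) + (3): re-modulate each envelope at 100 Hz
        carrier = [0.5 * (1.0 + math.sin(2.0 * math.pi * 100.0 * n / fs))
                   for n in range(n_samples)]
        envs = [[e * c for e, c in zip(env, carrier)] for env in envs]
    return envs


def apply_to_noise(envs, seed=1):
    """Modulate Gaussian white noise by each envelope (band-pass filtering omitted)."""
    rng = random.Random(seed)
    return [[e * rng.gauss(0.0, 1.0) for e in env] for env in envs]
```

For example, `masker_envelopes(1, FS)` returns four identical one-second envelopes, whereas `masker_envelopes(2, FS)` returns four different ones.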
Factorial design (2)
                         Factor n°2
              Level 1          Level 2           Level 3
              White noise:     White noise:      White noise:
              - correlated     - decorrelated    - decorrelated
                                                 - 100 Hz
Factor n°1
Fc=500 Hz     (1)              (1) + (2)         (1) + (2) + (3)
Fc=10 Hz      (1)              (1) + (2)         (1) + (2) + (3)
Task: consonant identification in a quiet room, with forced choice and no feedback.
Subjects: 6 normal-hearing listeners, not trained on the task; all had experience in psychoacoustical experiments.
Stimuli: 384 stimuli covering the 6 conditions, presented in random order.
Speech and signal processing
16 utterances aCaCa, with C = {b, d, g, v, Z, z, m, n, r, l, p, t, k, f, s, S}
Consonant features:
- voicing: voiced = {b, d, g, v, Z, z, m, n, r, l} / voiceless = {p, t, k, f, s, S}
- manner: fricative + liquid = {f, s, S, v, Z, z, r, l} / occlusive + nasal = {p, t, k, b, d, g, m, n}
- place: labial = {p, b, f, v, m} / dental = {t, d, s, z, n, l} / palatal = {k, g, Z, S, r}
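The feature classes can be written down as a small lookup table, which is convenient when consonant responses must be pooled by feature. The sketch below is only an illustration: the names `FEATURES`, `CONSONANTS` and `describe` are hypothetical, and the place classes are labelled with the usual articulatory terms (labial for p, b, f, v, m; dental for t, d, s, z, n, l), which is an assumption of the sketch.

```python
# Feature classes for the 16 consonants (Z and S are the symbols used on the slide).
VOICING = {"voiced": "b d g v Z z m n r l", "voiceless": "p t k f s S"}
MANNER = {"fricative+liquid": "f s S v Z z r l", "occlusive+nasal": "p t k b d g m n"}
PLACE = {"labial": "p b f v m", "dental": "t d s z n l", "palatal": "k g Z S r"}


def invert(classes):
    """Turn {class label: 'c1 c2 ...'} into {consonant: class label}."""
    return {c: label for label, members in classes.items() for c in members.split()}


FEATURES = {"voicing": invert(VOICING), "manner": invert(MANNER), "place": invert(PLACE)}
CONSONANTS = sorted(FEATURES["voicing"])


def describe(consonant):
    """Return the three feature values of one consonant."""
    return {name: table[consonant] for name, table in FEATURES.items()}
```

Each of the three tables covers all 16 consonants exactly once, so any consonant confusion can be mapped to a feature confusion.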
[Figure: block diagram of the stimulus processing. Input: nonsense speech, FS = 11025 Hz, analysis frame 92.8 ms. Blocks shown: FFT; decomposition into 4 spectral bands (1 to 4); iFFT; signal rectification; low-pass filtering at 500 Hz or 10 Hz; white noise; bandpass filtering; signal reconstruction. A temporal masker built from white noise, (1), (1) + (2) or (1) + (2) + (3), is added (+) at SNR = +6 dB to give the stimulus.]
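Two steps of this processing chain, envelope extraction and the addition of the masker at a global SNR, can be sketched in a few lines. This is a hedged illustration only: the one-pole low-pass filter, the RMS-based definition of the SNR and the function names are assumptions of the sketch, since the slides do not specify the filter design or how the +6 dB was measured.

```python
import math

FS = 11025  # sampling rate in Hz, as stated for the stimuli


def envelope(band_signal, cutoff_hz, fs=FS):
    """Full-wave rectify a band signal, then smooth it with a one-pole low-pass.

    Assumption: a first-order recursive filter stands in for the unspecified
    low-pass filter (cutoff 10 Hz or 500 Hz in the experiment).
    """
    alpha = 1.0 - math.exp(-2.0 * math.pi * cutoff_hz / fs)
    env, state = [], 0.0
    for x in band_signal:
        state += alpha * (abs(x) - state)
        env.append(state)
    return env


def rms(signal):
    """Root-mean-square level of a sequence of samples."""
    return math.sqrt(sum(v * v for v in signal) / len(signal))


def mix_at_snr(speech, masker, snr_db=6.0):
    """Scale the masker so that the global speech-to-masker RMS ratio is snr_db, then add."""
    gain = rms(speech) / (rms(masker) * 10.0 ** (snr_db / 20.0))
    return [s + gain * m for s, m in zip(speech, masker)]
```

With `cutoff_hz=10` the envelope keeps only the coarse temporal component, whereas `cutoff_hz=500` also lets the fine temporal modulation through.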
Example of stimulus
[Figure: four panels showing amplitude as a function of time t for the utterance "aBaBa": the aCaCa speech envelope (Fc = 10 Hz), then the envelope of the stimulus with masker (1), with (1) + (2), and with (1) + (2) + (3).]
Results of the experiment
For all conditions, chance was set at 6.25% (1/16) for consonant recognition.
Overall mean correct identification for the 6 subjects was 28%.
A confusion matrix was generated for each listener and summed across listeners. Then, the mean transmitted information (Miller and Nicely, J. Acoust. Soc. Am., 1955) for voicing, manner and place of articulation was evaluated.
The average information received for each consonant feature is plotted as a function of the level number, as compared with the average information received when there was no temporal masker (dashed lines).
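The information transmission analysis can be illustrated with a short sketch. It pools a consonant confusion matrix into feature classes and returns the relative transmitted information, i.e. the mutual information between stimulus and response classes divided by the stimulus entropy, in the spirit of Miller and Nicely (1955). The function names and the dictionary layout of the confusion matrix are assumptions of the sketch, and the counts in the usage example are invented for illustration.

```python
import math


def collapse(confusions, feature_of):
    """Pool {stimulus: {response: count}} into {(stimulus class, response class): count}."""
    pooled = {}
    for stim, row in confusions.items():
        for resp, count in row.items():
            key = (feature_of[stim], feature_of[resp])
            pooled[key] = pooled.get(key, 0) + count
    return pooled


def relative_transmission(pooled):
    """Mutual information I(x; y) divided by the stimulus entropy H(x), between 0 and 1."""
    total = sum(pooled.values())
    px, py = {}, {}
    for (x, y), n in pooled.items():
        px[x] = px.get(x, 0) + n
        py[y] = py.get(y, 0) + n
    mutual = sum((n / total) * math.log2(n * total / (px[x] * py[y]))
                 for (x, y), n in pooled.items() if n > 0)
    h_x = -sum((n / total) * math.log2(n / total) for n in px.values() if n > 0)
    return mutual / h_x if h_x > 0 else 0.0


# Usage with made-up counts for two consonants and the voicing feature.
voicing_of = {"b": "voiced", "p": "voiceless"}
toy = {"b": {"b": 8, "p": 2}, "p": {"b": 2, "p": 8}}
percent = 100.0 * relative_transmission(collapse(toy, voicing_of))
```

A perfectly diagonal matrix gives 100% and a matrix with responses independent of the stimulus gives 0%, which is the reference used when a feature is said to be "not transmitted".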
Results: transmission of voicing
Voicing is not transmitted by the fine temporal modulation (as in Shannon et al.), and its transmission decreases slightly when decorrelation degrades the residual spectral information.
We therefore conclude that voicing features are acoustically "distributed", so that the degradations caused by the different masker characteristics (low-frequency AM, decorrelation and 100 Hz re-modulation) are cumulative.
[Figure: Voicing recognition. Information received (%), from 0 to 70, as a function of level number (1, 2, 3), for Fc = 10 Hz and Fc = 500 Hz.]
Results: transmission of the manner
Transmission of the manner of articulation is completely suppressed by all temporal maskers, which have a low-frequency AM characteristic in common. The information received does not differ significantly from 0%.
[Figure: Manner recognition, two panels. Information received (%) as a function of level number (1, 2, 3), for Fc = 10 Hz and Fc = 500 Hz; one panel with a vertical axis from 0 to 40, the other from 0 to 70.]
When spectral information is reduced, manner is conveyed by the coarse envelope component, which interferes strongly with a low-frequency AM masker: the distinction between fricatives and occlusives is encoded temporally, and it is well masked by noise having similar temporal characteristics.
Nullification of manner transmission
[Figure: four panels showing amplitude as a function of time t for the utterance "aBaBa": the aCaCa speech envelope (Fc = 10 Hz), then the envelope of the stimulus with masker (1), with (1) + (2), and with (1) + (2) + (3).]
Results : place transmission
Place of articulation is transmitted significantly less well (P<0.05; t-test) at Levels 2 and 3 than at Level 1, for Fc=10 Hz (*).
Decorrelation degrades the residual spectral information (for Fc at 10 Hz).
[Figure: Place recognition. Information received (%), from 0 to 70, as a function of level number (1, 2, 3), for Fc = 10 Hz and Fc = 500 Hz; asterisks (*) mark the significant differences.]
Conclusion of the masking experiment
We replicate the main results of Shannon et al.
Our experiment suggests that:
- voicing is a redundant consonant feature which depends on both categories of information: coarse temporal envelope and spectral information;
- manner, in contrast, is mainly carried by the coarse temporal envelope.
This experiment supports the hypothesis that consonant identification is a complex process which can compensate for the reduction or the masking of either temporal or spectral information by using residual information, for voicing and place but not for manner.
Perspective (1): variation of the spectro-temporal resolution
[Figure: five plots of gain (0 to 1) as a function of frequency (0 to 5000 Hz). Labels: Clean signal, 10 Hz, 500 Hz.]
The intelligibility is weak for 1 and 2 subbands, with a poor transmission of the place of articulation. The difference between Fc at 10 and 500 Hz is weak.
Perspective (2): interaction between spectral reduction and masking
[Figure: information received (%), from 0 to 100, for voicing, place of articulation and manner of articulation, in four conditions: 4 subbands at SNR = +6 dB, 4 subbands clean, 16 subbands clean, 16 subbands at SNR = +6 dB.]
This preliminary experiment (Fc=10 Hz) shows that, for the manner, the effects of spectral reduction and of temporal masking are rather independent, the latter having the stronger impact. This confirms that the manner is mainly encoded temporally. One proposal for multi-stream ASR is therefore to decode this feature temporally in a separate 4-subband stream.
Perspective (3): audio-visual complementarity
As shown by Erber (1972), intelligibility is high even for 1 and 2 subbands: place of articulation is the feature best transmitted by the visual modality, whereas it is the worst transmitted by spectrally reduced audio speech, so global intelligibility is restored thanks to the direct complementarity of transmission in the two modalities.