Temporal masking of spectrally reduced speech: psychoacoustical experiments and links with ASR


Frédéric Berthommier and Angélique Grosgeorges

46 av. Félix Viallet, Grenoble, France

email: (bertho,ggeorges)@icp.inpg.fr

Introduction and motivations

We used the experimental paradigm proposed by [Shannon et al., 95], from which we developed a series of experiments. Following [Horii et al., 1971], they varied the spectro-temporal resolution of speech utterances. The stimuli were composed of white noise modulated by the filtered envelopes extracted in 4 subbands. The task was identification of the consonant in VCVCV utterances among 16 French consonants. We then evaluated the transmission of the phonetic features: voicing, manner and place of articulation.

We extend this paradigm by masking this residual signal with stationary [Lorenzi et al., 99] or non-stationary [Grosgeorges et al., 00] noise. In this framework, we replace the pair (local SNR / acoustic representation), analysed in terms of identification rate, with another pair (global SNR / phonetic representation), analysed in terms of feature transmission.

Then, we focus on the problem of acoustic phonetic decoding in noise, and on the impact of the noise on the features grounding the classification process. In other words, we postulate the existence of an intermediate level preceding the phonetic categorisation, and we study its properties.

Introduction and motivations (2)

We therefore expect a set of complementary results from this approach: informative about the link between auditory and speech processes, useful for CASA, and informative for developing ASR for noisy and distorted speech.

For RESPITE, the goal of this project is to set up a plausible multi-stream model in which the phonetic identification of consonants is grounded in the extraction of these three phonetic characteristics (voicing, place and manner) in specialised modules having different spectro-temporal resolutions. A pre-classification according to this phonetic representation could be more robust than direct classification, the streams easier to weight according to their information content, and the fusion process easier to control. Remark: vowel identification is considered to be well modelled in current implementations.

Moreover, the visual modality can be integrated in this model easily for the same reason: the audio-visual complementarity is optimally represented.

The Shannon et al. experiment

Spectral degradation: the signal was divided into one, two, three or four frequency bands.

Temporal degradation: the amplitude envelope extracted from each band was low-pass filtered with cutoff frequency Fc = 16, 50, 160 or 500 Hz.

The identification of 3 features (voicing, manner and place) for 16 French consonants « a/C/a » was evaluated by the classical information transmission analysis (Miller and Nicely, 1955).

[Figure: information transmitted for voicing, manner and place as a function of the number of bands (1 to 4), for Fc = 16, 50, 160 or 500 Hz.]

The main conclusion is: despite the great spectro-temporal reduction, voicing and manner are remarkably well transmitted by the residual envelope, i.e. by the temporal components of the speech.

Some questions arise: how is this residue processed? How can it be used to increase robustness?

One way to answer is to mask it and to analyse what occurs.

Factorial design of the masking experiment

Factor n°1: The spectral resolution was constant at 4 frequency bands, and the envelope was filtered with cutoff frequency Fc at 10 or 500Hz.

Factor n°2: We added different temporal maskers in order to selectively degrade the different components of the residual signal:

(1) in order to mask the coarse component of temporal information, we used a low frequency AM (amplitude modulation < 8Hz) white noise applied in each subband, for all maskers.

(2) to degrade the residual spectral information, we decorrelated the low frequency AM across the 4 frequency bands.

(3) to mask the fine temporal information, we re-modulated the low frequency AM of the masker at 100Hz.
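The three masker manipulations above can be sketched as follows. This is only an illustrative sketch, not the authors' implementation: the function names, the brick-wall FFT low-pass, the sinusoidal form of the 100 Hz re-modulation and the way the slow AM is made non-negative are all assumptions. Only the 8 Hz limit, the 4 bands, the 100 Hz rate and the 11025 Hz sampling rate come from the slides. The sketch returns the masker envelopes only; each envelope would then modulate white noise in its subband.

```python
import numpy as np

FS = 11025  # sampling rate in Hz, as given on the signal-processing slide


def lowpass_noise(n, cutoff_hz, rng, fs=FS):
    """White Gaussian noise, low-pass filtered by zeroing FFT bins above cutoff_hz."""
    spectrum = np.fft.rfft(rng.standard_normal(n))
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    spectrum[freqs > cutoff_hz] = 0.0
    return np.fft.irfft(spectrum, n)


def slow_am(n, cutoff_hz, rng, fs=FS):
    """Non-negative slow modulator: low-pass noise shifted so its minimum is 0."""
    slow = lowpass_noise(n, cutoff_hz, rng, fs)
    return slow - slow.min()


def masker_envelopes(n, level, rng, n_bands=4, am_cutoff_hz=8.0, remod_hz=100.0, fs=FS):
    """Return an (n_bands, n) array of masker envelopes.

    level 1: the same slow AM in every band        -> (1)
    level 2: an independent slow AM in each band   -> (1) + (2)
    level 3: level 2, re-modulated at remod_hz     -> (1) + (2) + (3)
    """
    if level == 1:
        env = np.tile(slow_am(n, am_cutoff_hz, rng, fs), (n_bands, 1))
    else:
        env = np.stack([slow_am(n, am_cutoff_hz, rng, fs) for _ in range(n_bands)])
    if level == 3:
        t = np.arange(n) / fs
        # Assumed re-modulation: raised sinusoid between 0 and 1 at remod_hz.
        env = env * 0.5 * (1.0 + np.sin(2.0 * np.pi * remod_hz * t))
    return env
```

For example, `masker_envelopes(FS, 2, np.random.default_rng(0))` gives one second of four decorrelated slow envelopes.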

Factorial design (2)

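Factor n°2 (temporal masker):

Level 1: white noise, correlated across bands: (1)

Level 2: white noise, decorrelated across bands: (1) + (2)

Level 3: white noise, decorrelated across bands, re-modulated at 100 Hz: (1) + (2) + (3)

Factor n°1 (envelope cutoff): Fc = 500 Hz or Fc = 10 Hz, each crossed with the three masker levels.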

Task: Consonant identification task in a quiet room, with forced choice and no feedback

Subjects: 6 normal-hearing listeners, not trained on the task. However, all subjects had experience in psychoacoustical experiments.

Stimuli: 384 stimuli covering the 6 conditions were presented in random order.

Speech and signal processing

16 utterances aCaCa, with C = {b, d, g, v, Z, z, m, n, r, l, p, t, k, f, s, S}

Consonant features:

voicing: voiced = {b, d, g, v, Z, z, m, n, r, l} / voiceless = {p, t, k, f, s, S}

manner: fricative + liquid = {f, s, S, v, Z, z, r, l} / occlusive + nasal = {p, t, k, b, d, g, m, n}

place: labial = {p, b, f, v, m} / dental = {t, d, s, z, n, l} / palatal = {k, g, Z, S, r}

[Figure: block diagram of stimulus generation. Labels: nonsense speech; FS = 11025 Hz and frame analysis 92.8 ms; 4 spectral bands decomposition; signal rectification; low-pass filtering at 500 Hz or 10 Hz; white noise; bandpass filtering; signal reconstruction (iFFT); temporal masker built from white noise at levels (1), (1) + (2) and (1) + (2) + (3), added at SNR = +6 dB; output: stimulus.]
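The processing chain named in this diagram (band decomposition, rectification, envelope low-pass filtering, white-noise carrier, bandpass filtering, reconstruction, masker added at +6 dB) can be sketched as below. This is a simplified sketch under stated assumptions, not the authors' code: the band edges in `BAND_EDGES` are hypothetical (the slides do not give them), the filters are brick-wall FFT filters applied to the whole signal rather than to 92.8 ms frames, and the SNR is assumed to be computed from mean power over the whole stimulus.

```python
import numpy as np

FS = 11025  # sampling rate in Hz, from the slide

# Hypothetical band edges in Hz: the slides do not state the cutoffs used.
BAND_EDGES = [0, 800, 1500, 2500, 4000]


def fft_bandpass(x, lo, hi, fs=FS):
    """Brick-wall filter: keep only FFT bins with lo <= f < hi."""
    spectrum = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    spectrum[(freqs < lo) | (freqs >= hi)] = 0.0
    return np.fft.irfft(spectrum, len(x))


def noise_vocode(speech, fc, rng, edges=BAND_EDGES, fs=FS):
    """Replace the fine structure of each band by envelope-modulated white noise."""
    out = np.zeros(len(speech))
    for lo, hi in zip(edges[:-1], edges[1:]):
        band = fft_bandpass(speech, lo, hi, fs)
        # Rectify, then low-pass at fc (10 or 500 Hz in the experiment).
        envelope = fft_bandpass(np.abs(band), 0.0, fc, fs)
        # Brick-wall filtering can ring slightly below zero, so clip.
        envelope = np.maximum(envelope, 0.0)
        carrier = rng.standard_normal(len(speech))
        # Restrict the modulated noise to the original band before summing.
        out += fft_bandpass(envelope * carrier, lo, hi, fs)
    return out


def mix_at_snr(signal, masker, snr_db):
    """Scale masker so that mean signal power / mean masker power equals snr_db."""
    target_ratio = 10.0 ** (snr_db / 10.0)
    gain = np.sqrt(np.mean(signal ** 2) / (np.mean(masker ** 2) * target_ratio))
    return signal + gain * masker
```

A stimulus would then be obtained as `mix_at_snr(noise_vocode(speech, 10.0, rng), masker, 6.0)`, where `masker` is a temporal masker of the same length.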

Example of stimulus

[Figure: envelopes for an utterance « a B a B a », four panels: aCaCa speech envelope with Fc = 10 Hz; envelope of stimulus with masker (1); with masker (1) + (2); with masker (1) + (2) + (3).]

Results of the experiment

For all conditions, chance was set at 6.25% (1/16) for consonant recognition.

Overall mean correct identification for the 6 subjects was 28%.

A confusion matrix was generated for each listener and summed across listeners. Then, the mean transmitted information (Miller and Nicely, J. Acoust. Soc. Am., 1955) for voicing, manner and place of articulation was evaluated.
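As an illustration of this kind of analysis, the sketch below collapses a consonant confusion matrix into a feature confusion matrix and computes the relative transmitted information, i.e. the mutual information between stimulus and response classes divided by the stimulus entropy. The function names and the matrix layout (rows are stimuli, columns are responses, both in the order of `labels`) are assumptions made for the example, and the exact normalisation used by the authors is not stated in the slides.

```python
import math

# Voiced consonants as listed on the "Speech and signal processing" slide.
VOICED = set("bdgvZzmnrl")


def collapse(confusions, labels, feature_of):
    """Sum a consonant confusion matrix into a feature confusion matrix."""
    classes = sorted({feature_of(c) for c in labels})
    index = {cls: i for i, cls in enumerate(classes)}
    table = [[0] * len(classes) for _ in classes]
    for i, stim in enumerate(labels):
        for j, resp in enumerate(labels):
            table[index[feature_of(stim)]][index[feature_of(resp)]] += confusions[i][j]
    return table


def relative_transmission(table):
    """Transmitted information T(x; y) in bits, divided by stimulus entropy H(x)."""
    total = sum(map(sum, table))
    row = [sum(r) / total for r in table]
    col = [sum(r[j] for r in table) / total for j in range(len(table[0]))]
    transmitted = 0.0
    for i, r in enumerate(table):
        for j, count in enumerate(r):
            if count:
                p = count / total
                transmitted += p * math.log2(p / (row[i] * col[j]))
    entropy = -sum(p * math.log2(p) for p in row if p)
    return transmitted / entropy
```

With `labels = list("bdgvZzmnrlptkfsS")`, calling `relative_transmission(collapse(confusions, labels, lambda c: c in VOICED))` gives the voicing score; manner and place follow by passing a different `feature_of` function.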

The average information received for each consonant feature is plotted as a function of the level number, as compared with the average information received when there was no temporal masker (dashed lines).

Results: transmission of voicing

Voicing is not transmitted by the fine temporal modulation (as in Shannon et al.), and its transmission decreases slightly with the degradation of residual spectral information caused by decorrelation.

So we conclude that voicing features are acoustically “distributed”, and hence that the degradation due to the different masker characteristics (low frequency AM, decorrelation and 100 Hz re-modulation) is cumulative.

[Figure: voicing recognition as a function of level number (1 to 3), for Fc = 10 Hz and Fc = 500 Hz.]

Results: transmission of manner

Transmission of the manner of consonant articulation is completely suppressed by all the temporal maskers, which have a low frequency AM characteristic in common. There is no significant difference from 0% information received.

[Figure: manner recognition as a function of level number (1 to 3), for Fc = 10 Hz and Fc = 500 Hz.]

When spectral information is reduced, manner is conveyed by the coarse envelope component, and it interferes strongly with a low frequency AM masker: the distinction between fricatives and occlusives is encoded temporally and is well masked by noise having similar temporal characteristics.

Nullification of manner transmission

[Figure: envelopes for an utterance « a B a B a », four panels: aCaCa speech envelope with Fc = 10 Hz; envelope of stimulus with masker (1); with masker (1) + (2); with masker (1) + (2) + (3).]

Results: place transmission

Place of articulation is significantly less well transmitted (P<0.05; t-test) at Levels 2 and 3 than at Level 1, for Fc = 10 Hz (*).

Decorrelation degrades the residual spectral information (for Fc at 10 Hz).

[Figure: place recognition as a function of level number (1 to 3), for Fc = 10 Hz and Fc = 500 Hz.]

Conclusion of the masking experiment

We replicate the main results of Shannon et al.

Our experiment suggests that:

- voicing is a redundant consonant feature which depends on both categories of information: coarse temporal envelope and spectral information,

- but manner is mainly carried by the coarse temporal envelope.

This experiment supports the hypothesis that consonant identification is a complex process which can compensate for the reduction or the masking of either temporal or spectral information by using residual information for voicing and place, but not for manner.

Perspective (1): variation of the spectro-temporal resolution

[Figure: clean signal compared with Fc = 10 Hz and Fc = 500 Hz.]

The intelligibility is weak for 1 and 2 subbands, with poor transmission of the place of articulation. The difference between Fc at 10 and 500 Hz is small.

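Perspective (2): interaction between spectral reduction and masking

[Figure: transmission of voicing, place of articulation and manner of articulation in four conditions: 4 subbands clean, 4 subbands at SNR = +6 dB, 16 subbands clean, 16 subbands at SNR = +6 dB.]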

This preliminary experiment (Fc = 10 Hz) shows that, for manner, spectral reduction and temporal masking have rather independent effects, the latter having the stronger impact. This confirms that manner is mainly encoded temporally. So one proposal for multistream ASR is to decode this feature temporally in a separate 4-subband stream.

Perspective (3): audio-visual complementarity

As shown by Erber (1972), intelligibility is high even for 1 and 2 subbands: place of articulation is the feature best transmitted by the visual modality, whereas it is the worst transmitted for spectrally reduced audio speech, so global intelligibility is restored thanks to the direct complementarity of transmission in the two modalities.
