Improved ASR in noise using harmonic decomposition


• Introduction

• Pitch-Scaled Harmonic Filter

• Recognition Experiments

• Results

• Conclusion

Motivation & Aims

• Most speech sounds are predominantly voiced or unvoiced.

What happens when the two components are “mixed”?

• Voiced and unvoiced components have different natures:

unvoiced: aperiodic signal from turbulence-noise sources

voiced: quasi-periodic signal from vocal-fold vibration

Why not extract their features separately?

Do the two contributions contain complementary information?

• Human speech recognition still performs well in noise.

How? Does it take advantage of harmonic properties?

Introduction

Voiced and unvoiced parts of a speech signal

[Figure: production of /z/, showing the periodic contribution and the aperiodic contribution]

Introduction

Automatic Speech Recognition

[Diagram: speech signal → Front End → Pattern Recognition → speech labels]

Feature Extraction:

conversion of speech signals to a sequence of parameter vectors

Dynamic Programming:

matching of observation sequences to models of known utterances
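The dynamic-programming step can be illustrated with a minimal sketch. It uses dynamic time warping against stored templates as a simple stand-in for model-based matching; it is not the recogniser used in these experiments. The names `dtw_distance`, `recognise` and the toy one-dimensional "utterances" are hypothetical.

```python
import math


def frame_distance(x, y):
    """Euclidean distance between two parameter vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def dtw_distance(obs, template):
    """Cumulative cost of the best monotonic alignment of obs to template."""
    n, m = len(obs), len(template)
    inf = float("inf")
    # cost[i][j] is the best cost of aligning obs[:i] with template[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_distance(obs[i - 1], template[j - 1])
            # Extend the cheapest of the three allowed predecessor paths.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def recognise(obs, templates):
    """Return the label of the template with the lowest alignment cost."""
    return min(templates, key=lambda label: dtw_distance(obs, templates[label]))


# Toy data (illustrative only): one-dimensional "parameter vectors".
templates = {
    "one": [[0.0], [1.0], [2.0]],
    "two": [[2.0], [1.0], [0.0]],
}
obs = [[0.0], [0.0], [1.0], [2.0], [2.0]]  # a time-stretched version of "one"
label = recognise(obs, templates)
```

The observation sequence is longer than either template, which is the case the alignment is meant to handle: repeated frames are absorbed at no extra cost when they match the same template frame.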

Introduction

PSHF block diagram

[Block diagram. Stages: pitch optimisation, window w(n), harmonic decomposition. Signals: waveform s(n), raw pitch f0raw, optimised pitch f0opt, Nopt, windowed signal sw(n), windowed estimates v̂w(n) and ûw(n), periodic waveform v(n), aperiodic waveform u(n).]

PSHF
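The core idea of the decomposition can be sketched for a single frame. This is a simplified illustration, not the PSHF itself: it assumes the pitch period is already known, that the frame spans exactly four pitch periods (so the harmonics fall on every fourth DFT bin), and it uses a rectangular window. Pitch optimisation, the window w(n) and any refinement of the aperiodic estimate are omitted. The function name `harmonic_split` and the constant `PERIODS_PER_WINDOW` are hypothetical.

```python
import cmath
import math

PERIODS_PER_WINDOW = 4  # assumption: the frame spans four pitch periods


def dft(x):
    """Plain O(N^2) discrete Fourier transform of a real or complex sequence."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(spec):
    """Inverse of dft()."""
    n = len(spec)
    return [
        sum(spec[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n
        for t in range(n)
    ]


def harmonic_split(frame, periods=PERIODS_PER_WINDOW):
    """Split one frame into periodic and aperiodic estimates."""
    if len(frame) % periods != 0:
        raise ValueError("frame length must be a whole number of pitch periods")
    spec = dft(frame)
    # Keep only bins that are multiples of `periods`: these are the harmonics.
    harmonic = [c if k % periods == 0 else 0j for k, c in enumerate(spec)]
    periodic = [v.real for v in idft(harmonic)]
    # Whatever the harmonic bins do not explain is the aperiodic estimate.
    aperiodic = [s - v for s, v in zip(frame, periodic)]
    return periodic, aperiodic


# Toy frame (illustrative only): a two-harmonic "voiced" part plus a
# component that is not periodic at the pitch period.
period = 10
n = PERIODS_PER_WINDOW * period
voiced = [
    math.sin(2 * math.pi * t / period) + 0.5 * math.cos(4 * math.pi * t / period)
    for t in range(n)
]
other = [0.3 * math.sin(2 * math.pi * 5 * t / n) for t in range(n)]
frame = [a + b for a, b in zip(voiced, other)]
periodic, aperiodic = harmonic_split(frame)
```

In this idealised case the two parts are recovered exactly. With real speech the separation is approximate, because the pitch varies within the frame and turbulence noise also has energy in the harmonic bins.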

Decomposition example (waveforms)

[Figure panels: original, periodic part, aperiodic part]

PSHF

Decomposition example (spectrograms)

[Figure panels: original, periodic part, aperiodic part]

PSHF

Decomposition example (MFCC specs.)

[Figure panels: original, periodic part, aperiodic part]

PSHF

Parameterisations

BASE: waveform → MFCC, +Δ, +Δ2 → features

SPLIT: PSHF → MFCC, +Δ, +Δ2 → cat

PCA26, PCA78, PCA13, PCA39: PSHF → MFCC, +Δ, +Δ2 → cat → PCA
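The +Δ, +Δ2 and cat stages can be sketched as follows. This is an illustration under stated assumptions: 13 static coefficients per stream (not given on the slide, but consistent with the names PCA39 and PCA78), the common regression formula for dynamic coefficients with a window of two frames either side, and edge frames replicated. The PCA stage is not shown. All function names are hypothetical.

```python
def deltas(frames, theta=2):
    """Regression-based dynamic coefficients for a sequence of feature vectors."""
    n = len(frames)
    denom = 2 * sum(k * k for k in range(1, theta + 1))
    out = []
    for t in range(n):
        row = []
        for i in range(len(frames[t])):
            num = 0.0
            for k in range(1, theta + 1):
                # Replicate the first and last frames beyond the sequence edges.
                ahead = frames[min(t + k, n - 1)][i]
                behind = frames[max(t - k, 0)][i]
                num += k * (ahead - behind)
            row.append(num / denom)
        out.append(row)
    return out


def with_dynamics(static):
    """Append delta and delta-delta coefficients to each static vector."""
    d1 = deltas(static)
    d2 = deltas(d1)
    return [s + a + b for s, a, b in zip(static, d1, d2)]


def concatenate_streams(periodic, aperiodic):
    """Frame-by-frame concatenation of the two synchronous streams."""
    return [p + a for p, a in zip(periodic, aperiodic)]


N_STATIC = 13  # assumption: number of static coefficients per stream

# Toy static features (illustrative only): a linear ramp and a constant.
static_periodic = [[float(t)] * N_STATIC for t in range(6)]
static_aperiodic = [[0.0] * N_STATIC for _ in range(6)]

split = concatenate_streams(
    with_dynamics(static_periodic), with_dynamics(static_aperiodic)
)
```

Under these assumptions each stream yields 39 coefficients per frame and the concatenated vector has 78, which would then be the input to decorrelation and dimensionality reduction.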

Method

Speech Database: Aurora 2.0

• TIdigits database at 8 kHz, filtered with G.712 channel

• Connected English digit strings (male & female speakers)

Group                                  Signal-to-Noise Ratio (dB)
Train   clean condition
        multi-condition                20 15 10 5
Test    set A (same noises)            20 15 10 5 0 -5
        set B (different noises)       20 15 10 5 0 -5
        set C (different channel)      20 15 10 5 0 -5

Method

Description of the experiments

• Baseline experiment: [base]

standard parameterisation of the original waveforms (i.e., MFCC, +Δ, +Δ2)

• Split experiments: [split]

adjustment of stream weights (voiced vs. unvoiced)

• PCA experiments: [pca26, pca78, pca13 and pca39]

decorrelation of the feature vectors, and reduction of the number of coefficients

Method

Split experiments results

Results

Summary of results

         Word Accuracy (%)           WER reduction (%)
         clean   multi   overall     abs.    rel.
base     52.6    78.3    65.4        --      --
split    77.9    89.1    83.0        17.6    50.9
pca26    71.2    88.8    78.8        13.4    38.7
pca78    61.9    88.1    74.7        9.3     26.9
pca13    72.6    87.6    79.7        14.3    41.3
pca39    70.9    87.5    78.8        13.4    38.7

Results
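The "abs." and "rel." columns are consistent with reductions in word error rate relative to the baseline, computed from the overall word accuracies. A short sketch of that arithmetic, taking word error rate as 100 minus word accuracy (the function name `wer_reduction` is hypothetical):

```python
def wer_reduction(base_accuracy, new_accuracy):
    """Absolute and relative WER reduction (%) from two word accuracies (%)."""
    base_wer = 100.0 - base_accuracy
    new_wer = 100.0 - new_accuracy
    absolute = base_wer - new_wer
    relative = 100.0 * absolute / base_wer
    return absolute, relative


# Overall word accuracies taken from the table.
overall = {
    "base": 65.4,
    "split": 83.0,
    "pca26": 78.8,
    "pca78": 74.7,
    "pca13": 79.7,
    "pca39": 78.8,
}

for name, accuracy in overall.items():
    if name == "base":
        continue
    absolute, relative = wer_reduction(overall["base"], accuracy)
    print(f"{name}: abs. {absolute:.1f}, rel. {relative:.1f}")
```

For the split experiment, the baseline error rate of 34.6% falls to 17.0%, an absolute reduction of 17.6 points and a relative reduction of 50.9%.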

Conclusions

• The PSHF module split Aurora’s speech waveforms into two synchronous streams (periodic and aperiodic).

• Used separately, the streams gave slightly degraded accuracy; used together, they gave substantially increased accuracy in noisy conditions.

• Periodic speech segments provide robustness to noise.

Further Work

• Apply Linear Discriminant Analysis (LDA) to the two-stream feature vector.

• Evaluate the performance of this front end in a more general task, such as phoneme recognition.

• Test the technique for speaker recognition.

COLUMBO PROJECT: Harmonic Decomposition applied to ASR

David M. Moreno 1

Philip J.B. Jackson 2

Javier Hernando 1

Martin J. Russell 3

http://www.ee.surrey.ac.uk/Personal/P.Jackson/Columbo/

http://web.bham.ac.uk/p.jackson/columbo/

Pitch Optimisation: vowel /u/

Cost function

Spectrum derived from a 268-point DFT

Harmonic Decomposition: vowel /u/

Word accuracy results (%)

Observation probability, with stream weights
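The equation on this slide is not preserved in the transcript. A common multi-stream formulation raises each stream's observation probability to the power of its stream weight and multiplies the results, which becomes a weighted sum in the log domain. The sketch below assumes that form and, for brevity, a single diagonal-covariance Gaussian per stream rather than a mixture. The function names and all parameter values are hypothetical.

```python
import math


def log_gaussian(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at x."""
    return -0.5 * sum(
        math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
        for xi, m, v in zip(x, mean, var)
    )


def weighted_log_prob(observation, state, weights):
    """Stream-weighted log observation probability for one state.

    observation: stream name -> feature vector
    state:       stream name -> (mean, variance) of that stream's Gaussian
    weights:     stream name -> stream weight (exponent on the probability)
    """
    return sum(
        weights[s] * log_gaussian(observation[s], *state[s]) for s in observation
    )


# Toy state and observation (illustrative values only).
state = {
    "periodic": ([0.0, 0.0], [1.0, 1.0]),
    "aperiodic": ([1.0, 1.0], [2.0, 2.0]),
}
obs = {"periodic": [0.0, 0.0], "aperiodic": [0.0, 2.0]}

equal = weighted_log_prob(obs, state, {"periodic": 1.0, "aperiodic": 1.0})
periodic_only = weighted_log_prob(obs, state, {"periodic": 1.0, "aperiodic": 0.0})
```

Setting a weight to zero removes that stream's influence entirely, so adjusting the two weights shifts the recogniser's reliance between the periodic and aperiodic streams.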
