A MULTI-BAND SPECTRAL SUBTRACTION METHOD FOR SPEECH ENHANCEMENT
APPROVED BY SUPERVISORY COMMITTEE:
Dr. Philipos Loizou, Chair.
Dr. Robert Hunt
Dr. Mohammad Saquib
Copyright 2001
Sunil Devdas Kamath
All Rights Reserved
To my parents
A MULTI-BAND SPECTRAL SUBTRACTION METHOD FOR SPEECH ENHANCEMENT
by
SUNIL DEVDAS KAMATH, B.E.
THESIS
Presented to the faculty of
The University of Texas at Dallas
in Partial Fulfillment
of the Requirements
for the Degree of
MASTER OF SCIENCE IN ELECTRICAL ENGINEERING
THE UNIVERSITY OF TEXAS AT DALLAS
December 2001
ACKNOWLEDGEMENTS

I would like to thank my adviser, Dr. Philipos Loizou, for his guidance in my studies and my
work. He has offered me many helpful suggestions on conducting research and writing
technical documents.
I would also like to thank Dr. Robert Hunt and Dr. Mohammad Saquib, for their valuable
feedback on this manuscript.
Thanks are also in order to Dr. Emily Tobey for providing me with the opportunity to work
with her wonderful team at the Callier Institute of Communication Disorders / UTD. I would
like to thank Paul Dybala and Amanda Labue for conducting the subject test.
I would like to take this opportunity to express my deepest gratitude to Dr. Neeraj Magotra of
Texas Instruments – Dallas for the invaluable support and guidance he has given me in every
aspect of my student and personal life. I am especially thankful for the wholehearted
confidence that he has shown in my abilities.
And finally to my wife, Sanmati, I would like to acknowledge my deepest appreciation for
her love and caring, for her timely encouragements, for being my driving force and standing
by me through thick and thin.
A MULTI-BAND SPECTRAL SUBTRACTION METHOD FOR SPEECH ENHANCEMENT
Sunil Devdas Kamath, M.S.E.E. The University of Texas at Dallas, 2001
Supervising Professor: Dr. Philipos C. Loizou

The corruption of speech due to the presence of additive background noise causes severe
difficulties in various communication environments. This thesis addresses the problem of
reduction of additive background noise in speech. The proposed approach is a frequency-
dependent speech enhancement method based on the proven spectral subtraction method.
Most implementations and variations of the basic spectral subtraction technique advocate
subtraction of the noise spectrum estimate over the entire speech spectrum. However, real
world noise is mostly colored and does not affect the speech signal uniformly over the entire
spectrum. This thesis explores a Multi-Band Spectral Subtraction (MBSS) approach with
suitable pre-processing of the speech data. Speech is processed into N (1 ≤ N ≤ 8)
frequency bands and spectral subtraction is performed independently on each band using
band-specific over-subtraction factors. This method provides a greater degree of flexibility
and control on the noise subtraction levels, which reduces artifacts in the enhanced speech,
resulting in improved speech quality. The effect of the number of frequency bands and the
type of filter spacing (linear, logarithmic or mel) was investigated. Results showed that the
proposed MBSS method with four linear-spaced frequency bands outperformed the
conventional spectral subtraction method with respect to speech quality and reduced musical
noise.
TABLE OF CONTENTS
ACKNOWLEDGEMENTS ..................................................................................................v
ABSTRACT ........................................................................................................................vi
LIST OF FIGURES ..............................................................................................................x
LIST OF TABLES..............................................................................................................xii
1. INTRODUCTION............................................................................................................1
2. LITERATURE REVIEW.................................................................................................3
2.1 Fundamentals of speech production ...........................................................................4
2.2 Classification of speech enhancement techniques ......................................................6
2.3 Short-term spectral amplitude techniques...................................................................9
2.4 Principle of the basic spectral subtraction method....................................................11
2.5 Drawbacks of the spectral subtraction method .........................................................13
2.6 Modifications to spectral subtraction .......................................................................16
2.7 Frequency – dependent spectral subtraction methods ...............................................23
3. MULTI-BAND SPECTRAL SUBTRACTION ..............................................................26
3.1 Motivation...............................................................................................................26
3.2 Multi-band spectral subtraction................................................................................30
4. IMPLEMENTATION AND PERFORMANCE EVALUATION....................................37
4.1 Implementation .......................................................................................................37
4.2 Objective measures for performance evaluation.......................................................43
4.3 Effect of pre-processing strategies ...........................................................................46
4.4 Effect of frequency spacing .....................................................................................50
4.5 Performance with speech-silence detector................................................................55
4.6 Subjective evaluation of speech intelligibility ..........................................................57
4.7 Optimal configuration..............................................................................................59
5. SUMMARY AND CONCLUSIONS..............................................................................63
BIBLIOGRAPHY...............................................................................................................66
VITA
LIST OF FIGURES

Figure 2.1: Diagrammatic representation of the short-time spectral magnitude enhancement
system. ........................................................................................................................10 Figure 2.2: Spectrograms of the sentence “The shop closes for lunch”, clean speech (top),
with speech shaped noise at 5 dB SNR (middle), and speech enhanced using spectral subtraction (bottom) ....................................................................................................14
Figure 2.3: Over-subtraction factor α as a function of SNR with α₀ = 4 .............................19 Figure 3.1: (a) PSD of WGN, (b) Segmental SNR of four (linearly-spaced) frequency bands
of speech corrupted by WGN at 5dB SNR. ..................................................................28 Figure 3.2: (a) PSD of speech-shaped noise, (b) Segmental SNR of four (linearly-spaced)
frequency bands of speech corrupted by speech-shaped noise at 5dB SNR. .................29 Figure 3.3: (a) PSD of multi-talker babble, (b) Segmental SNR of four (linearly spaced)
frequency bands of speech corrupted by multi-talker babble at 5dB SNR. ........................29 Figure 3.4: (a) PSD of aircraft noise, (b) Segmental SNR of four (linearly spaced) frequency
bands of speech corrupted by aircraft noise at 5dB SNR...................................................30 Figure 3.5: Diagrammatic representation of the multi-band spectral subtraction method......31 Figure 3.6: (a) Original magnitude spectrum of a speech frame, (b) Magnitude spectrum of the
smoothed and averaged version of 3.6(a). ....................................................................33 Figure 4.1: (a) Long-term magnitude spectrum of a speech file from the HINT database, (b)
Magnitude spectrum of the speech-shaped noise. .........................................................39 Figure 4.1: Sentence “The shop closes for lunch,” sampled at 8kHz, (above) time plot and
(below) the corresponding spectrogram. ......................................................................41 Figure 4.2: Speech shaped noise sampled at 8kHz, (above) time plot and (below) the
corresponding spectrogram..........................................................................................41 Figure 4.3: Sentence “The shop closes for lunch,” at 5 dB SNR, (above) time plot and
(below) the corresponding spectrogram. ......................................................................42 Figure 4.4: Sentence “The shop closes for lunch,” at 0 dB SNR, (above) time plot and
(below) the corresponding spectrogram. ......................................................................42 Figure 4.5: Sentence “The shop closes for lunch,” after spectral smoothing and magnitude
averaging, (above) time plot and (below) the corresponding spectrogram. ...................46 Figure 4.6: Mean IS distance measure of the MBSS approach with linear frequency spacing
and without pre-processing, as a function of the number of bands for 10 sentences embedded in speech-shaped noise at (a) 5 dB SNR and (b) 0dB SNR. .........................47
Figure 4.7: Mean IS distance measure of the MBSS approach with linear frequency spacing and with pre-processing, as a function of the number of bands for 10 sentences embedded in speech-shaped noise at (a) 5 dB SNR and (b) 0dB SNR. .........................48
Figure 4.8: Spectrograms of processed speech of the sentence “The shop closes for lunch,” at 5 dB SNR, using MBSS using four linearly spaced frequency bands, (above) without pre-processing and (below) with smoothing and weighted magnitude averaging. .........49
Figure 4.9: Spectrograms of processed speech of the sentence “The shop closes for lunch,” at 0 dB SNR, using MBSS using four linearly spaced frequency bands, (above) without pre-processing and (below) with smoothing and weighted magnitude averaging. ........49
Figure 4.10: Mean IS distance measure of the MBSS approach with logarithmic frequency spacing as a function of the number of bands for 10 sentences embedded in speech-shaped noise at (a) 5 dB SNR and (b) 0dB SNR...........................................................53
Figure 4.11: Mean IS distance measure of the MBSS approach with mel frequency spacing as a function of the number of bands for 10 sentences embedded in speech-shaped noise at (a) 5 dB SNR and (b) 0dB SNR. ..................................................................................53
Figure 4.12: Comparison of spectrograms of enhanced speech at 5 dB SNR processed with the MBSS algorithm using four bands with (top) linear spacing, (middle) logarithmic spacing and (bottom) mel spacing................................................................................54
Figure 4.13: Comparison of spectrograms of enhanced speech at 0 dB SNR processed with the MBSS algorithm using four bands with (top) linear spacing, (middle) logarithmic spacing and (bottom) mel spacing................................................................................54
Figure 4.14: Mean IS distance measure of the MBSS approach with linear frequency spacing and speech-silence detector, as a function of the number of bands for 10 sentences embedded in speech-shaped noise at (a) 5 dB SNR and (b) 0dB SNR. .........................56
Figure 4.13: Spectrograms of speech enhanced with the MBSS algorithm using four linearly spaced frequency bands with a speech-silence detector, at (top) 5 dB SNR and (bottom) 0 dB SNR. ...................................................................................................................57
Figure 4.14: Intelligibility test results for seven subjects scored on percentage words correct.....................................................................................................................................59
Figure 4.15: Comparison of the performance, in terms of mean IS distance measure, of the power spectral subtraction method (indicated with 'PSS') with the multi-band spectral subtraction approach as a function of the number of bands for 10 sentences embedded in speech-shaped noise at (a) 5 dB SNR and (b) 0 dB SNR..............................................60
Figure 4.16: Spectrogram of the sentence ''The shop closes for lunch.'' at 5 dB SNR. The top spectrogram is the corrupted signal, the middle spectrogram is the enhanced signal obtained by the multi-band spectral subtraction method using 4 linearly spaced frequency bands, and the bottom spectrogram is the enhanced signal obtained by the power spectral subtraction method...............................................................................61
Figure 4.17: Spectrogram of the sentence ''The shop closes for lunch.'' at 5 dB SNR. The top spectrogram is the corrupted signal, the middle spectrogram is the enhanced signal obtained by the multi-band spectral subtraction method using 4 linearly spaced frequency bands, and the bottom spectrogram is the enhanced signal obtained by the power spectral subtraction method...............................................................................62
LIST OF TABLES
Table 2.1: Phonemes in American English. .........................................................................5
Table 2.2: Speech enhancement processing strategies (adapted from [30]) ..........................6
Table 4.1: List of sentences used from the HINT database for objective performance
evaluation......................................................................................................................40
Table 4.2: Center frequency values for linear, logarithmic and mel spacing of frequency
bands.............................................................................................................................52
Table 4.3: Mean global and segmental SNR calculated over ten sentences at 5 dB SNR......58
CHAPTER ONE
INTRODUCTION

A major part of the interaction between humans takes place via speech communication.
Hence, research in speech and hearing sciences has been going on for centuries to understand
the dynamics and processes involved in the production and perception of speech. The field of
speech processing is essentially an application of signal processing techniques to acoustic
signals using the knowledge offered by researchers in the field of hearing sciences. The
explosive advances in recent years in the field of digital computing have provided a
tremendous boost to the field of speech processing. Digital signal processing techniques are
more sophisticated and advanced than their analog counterparts. The ease and speed of
representing, storing, retrieving and processing speech data have contributed to the
development of efficient and effective speech processing techniques to address the issues
related to speech.
The presence of background noise in speech significantly reduces the intelligibility of
speech. Degradation of speech severely affects the ability of a person, whether hearing-impaired
or normal-hearing, to understand what the speaker is saying. Noise reduction or speech
enhancement algorithms are used to suppress such background noise and improve the
perceptual quality and intelligibility of speech. Even though speech is perceptible in a
moderately noisy environment, many applications like mobile communications, speech
recognition and aids for the hearing handicapped, to name a few, drive the effort to build
more effective noise reduction algorithms for better performance. Over the years engineers
have developed a variety of theoretical and relatively effective techniques to combat this
issue. However, the problem of cleaning noisy speech still poses a challenge to the area of
signal processing. Removing various types of noise is difficult due to the random nature of
the noise and the inherent complexities of speech. Noise reduction techniques usually involve a
trade-off between the amount of noise removed and the speech distortions introduced by the
processing of the speech signal. The complexity and ease of implementation of a noise
reduction algorithm are also of concern, especially in applications related to portable
devices such as mobile communications and digital hearing aids.
The spectral subtraction method is a well-known noise reduction technique [2] [3]
[18]. Most implementations and variations of the basic technique advocate subtraction of the
noise spectrum estimate over the entire speech spectrum. However, real world noise is
mostly colored and does not affect the speech signal uniformly over the entire spectrum. In
this thesis, we propose a multi-band spectral subtraction approach that takes into account the
fact that colored noise affects the speech spectrum differently at various frequencies. This
method outperforms the standard power spectral subtraction method resulting in superior
speech quality and largely reduced musical noise.
This thesis is organized as follows: Chapter 2 gives a review of the different noise
reduction strategies that have been developed. Chapter 3 discusses the Multi-Band Spectral
Subtraction (MBSS) method. Results and a quantitative performance comparison are discussed in
Chapter 4. Chapter 5 gives the conclusions and presents a summary of the work done and
future work.
CHAPTER TWO
LITERATURE REVIEW

In the past decades, research in the field of speech enhancement has focused on the
suppression of additive background noise [5] [16] [17]. From the point of view of signal
processing, additive noise is easier to deal with than convolutive noise or nonlinear
disturbances. The ultimate goal of speech enhancement is to eliminate the additive noise
present in the speech signal and restore the speech signal to its original form. Several methods
have been developed as a result of these research efforts. Most of these methods have been
developed with various auditory, perceptual or statistical constraints placed on the
speech and noise signals. However, in real world situations, it is very difficult to reliably
predict the characteristics of the interfering noise signal or the exact characteristics of the
speech waveform. Hence, in effect, the speech enhancement methods are sub-optimal and
can only reduce the amount of noise in the signal to some extent. Due to the sub-optimal
nature of these methods, some of the speech signal can be distorted during the process.
Hence, there is a trade-off between distortions in the processed speech and the amount of
noise suppressed. The effectiveness of the speech enhancement system can therefore be
measured based on how well it performs in light of this trade-off.
This chapter presents a review of the production of speech in humans and a literature
review of the different speech enhancement methods used to date. The family of subtractive-
type enhancement methods is discussed in more detail.
2.1 Fundamentals of speech production
Speech, a dynamic, information-bearing signal, is also called the acoustic waveform. These
waves are produced due to the sound pressure generated in the mouth of the speaker as a result
of some sequence of coordinated movements of a series of structures in the human vocal
system. The branch of science that deals with the dynamics and production of the human
sound is called phonetics. The process of speech communication involves the production of
the acoustic wave by the speaker and the perception of the signal by the listener. Though the
process of speech perception still largely remains a mystery to the scientific world, the
process of speech production has been well researched and understood. A sound knowledge
of the processes involved in the production and perception of speech is necessary for
engineers to develop suitable methods to represent and transform the acoustic signals to
achieve the desired results.
The human speech production mechanism consists of the lungs, trachea (windpipe),
larynx, pharyngeal cavity (throat), buccal cavity (mouth), nasal cavity, velum (soft palate),
tongue, jaw, teeth and lips. The lungs and trachea make up the respiratory subsystem of the
mechanism. These provide the source of energy for speech when air is expelled from the
lungs into the trachea. The resulting airflow passes through the larynx, which provides
periodic excitation to the system to produce the voiced sounds. The three cavities of the
system can collectively be termed the main acoustic filter that shapes the sound that is
generated. The velum, tongue, jaw, teeth and lips are known as the articulators. These
provide the finer adjustments to generate speech. The excitation used to generate speech can
be classified into voiced, unvoiced, mixed, plosive, whisper and silence. One or more of
these excitation types can be blended to produce a particular type of sound. A phoneme describes the
linguistic meaning conveyed by a particular speech sound. The American English language
consists of about 42 phonemes, which can be classified as vowels, semivowels, diphthongs
and consonants (fricatives, nasals, affricatives and whisper) as shown in Table 2.1.
Table 2.1: Phonemes in American English.
Vowels are produced due to the periodic vibrations of the vocal cords in the larynx.
The frequency at which the vocal cords vibrate is called the fundamental frequency or pitch
of the speech. The fricatives are caused by the turbulence of the air passing through narrow
constrictions in the vocal tract, causing a random noise-like sound. Nasals are caused by
acoustically coupling the nasal cavity to the pharyngeal cavity by lowering the velum.
Building up pressure in front of the vocal tract and abruptly releasing it produces plosives.
The resonant frequencies generated by the vocal tract are called the formant frequencies or
the formants. The formants depend on the length and shape of the vocal tract.
2.2 Classification of speech enhancement techniques
Speech enhancement systems can be classified in a number of ways [18] [30] based on the
criteria used or the application of the enhancement system (see Table 2.2).
Domain Possible Strategies
Number of input channels One / Two / Multiple
Domain of processing Time / Frequency
Type of algorithm Non-adaptive / Adaptive
Additional constraints Speech production / Perception
Table 2.2: Speech enhancement processing strategies (adapted from [30]).
The speech signal can be acquired from single or multiple channel sensors. As
discussed in Chapter 1, additive noise can make speech enhancement particularly difficult.
Non-stationarity of the noise process can further complicate the enhancement effort. One
microphone input (single channel) could make speech enhancement difficult, as speech and
noise are present in the same channel. Separation of the two signals would require relatively
good knowledge of the speech and noise models or require that the interfering signal be
present exclusively in a different frequency band than that of the speech signal. A costly
solution to this problem is to use a dual microphone approach. Spatial analysis can however
help immensely in speech enhancement as this gives useful information regarding the signal.
In such analysis, the noise source is assumed to be statistically independent and additive.
This assumption is based on the fact that most environmental noise is typically additive in
nature. The discussion in this chapter will be limited to single channel enhancement
techniques, as these are the most common types of enhancement systems found in many
applications.
• Suppression of noise using periodicity of speech
These methods exploit the quasi-periodic nature of voiced speech. As discussed in
Chapter 1, voiced speech is periodic in nature, characterized by a fundamental frequency,
which varies from person to person. This technique however, depends heavily on the
accurate estimation of the pitch period (inverse of the pitch) of the speaker’s voice.
A simple method based on this criterion is the adaptive comb filter [18]. In this
method, a series of notch filters is used to filter out any spectral content between
the fundamental frequency and its harmonics. Another method is the single-channel
adaptive noise cancellation technique [25]. In this method, a delayed version of the
speech signal is used as the input to an adaptive LMS filter while the input is used as the
reference signal. The delay decorrelates the noise in the input signal from that present in
the reference. When the delay is equal to an estimate of the pitch period, the speech
content of the two signals remains correlated.
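As a rough sketch of this delayed-reference arrangement (not the exact implementation of [25] — the delay value, filter order and step size below are illustrative assumptions), an LMS predictor can be written as:

```python
import numpy as np

def delayed_lms_anc(y, delay, order=32, mu=0.005):
    """Single-channel adaptive noise cancellation sketch: predict the
    quasi-periodic (voiced) component of y from a copy of itself delayed
    by one pitch period. The delay decorrelates the noise while the
    speech stays correlated, so the filter output tracks the speech."""
    w = np.zeros(order)                      # adaptive filter weights
    s_hat = np.zeros(len(y))                 # speech estimate
    for i in range(delay + order, len(y)):
        x = y[i - delay - order + 1:i - delay + 1][::-1]  # delayed tap vector
        s_hat[i] = w @ x                     # predicted periodic part
        e = y[i] - s_hat[i]                  # prediction error ~ noise
        w += 2.0 * mu * e * x                # LMS weight update
    return s_hat
```

With the delay matched to the pitch period the output is dominated by the voiced component; unvoiced segments gain nothing, which is the disadvantage noted in the text.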
A major disadvantage of these methods is that there is no improvement in the quality of
the unvoiced speech portions. Moreover, an accurate pitch extraction algorithm is crucial to
achieving good performance.
• Model-based speech enhancement
Enhancement systems in this category are also called statistical-model-based methods
[6]. These methods are usually used when there is no knowledge of the statistical
properties of the speech or noise signal. Speech production models like autoregressive
moving average (ARMA), autoregressive (AR) or moving average (MA) are used
instead. This involves the estimation of the speech model parameters and then the
estimation of the enhanced signal by re-synthesis using speech model parameters or by
using a Wiener or Kalman filter.
The Wiener filter is a popular adaptive technique that has been used in many
enhancement methods. The basic principle of the Wiener filter is to estimate an optimal
filter from the noisy input speech by minimizing the Mean Square Error (MSE) between
the desired signal s(k) and the estimated signal ŝ(k). The Wiener filter can be given in
the frequency domain by:

    H(ω) = P_S(ω) / ( P_S(ω) + P_N(ω) )        (2.1)

where P_S(ω) is the power spectral density (PSD) of the speech and P_N(ω) is the PSD of
the noise spectrum calculated during periods of non-speech activity. From Eq. 2.1 it is
obvious that a priori knowledge of the speech and noise power spectra is necessary. The
speech power spectrum is estimated using the estimated speech model parameters [17].
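As a minimal numerical sketch of Eq. (2.1) (an illustration only: here the speech PSD P_S(ω) is approximated by subtracting the noise PSD from the noisy-speech PSD, rather than derived from estimated speech model parameters as in [17]):

```python
import numpy as np

def wiener_gain(noisy_psd, noise_psd):
    """Wiener gain H(w) = Ps / (Ps + Pn), with the clean-speech PSD
    approximated for this sketch as max(Py - Pn, 0)."""
    ps = np.maximum(noisy_psd - noise_psd, 0.0)   # rough speech PSD estimate
    return ps / (ps + noise_psd + 1e-12)          # small floor avoids 0/0
```

The gain approaches 1 in bins where speech dominates and 0 in bins where noise dominates, which is the attenuation behaviour described above.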
2.3 Short-term spectral amplitude techniques
The short-term spectral amplitude (STSA) of speech has been exploited successfully in the
development of various speech enhancement algorithms. The basic idea is to use the STSA
of the noisy speech input and recover an estimate of the clean STSA by removing the part
contributed by the additive noise. A general representation of the technique is given in Figure
2.1. The input to the system is the noise-corrupted signal y(n). While there are many
methods for the analysis-synthesis processing, the Short-term Fourier Transform (STFT) of
the signal with OverLap and Add (OLA) [5] is the most commonly used method. The
spectral amplitude |Y(k)| of the noisy input signal y(n) is modified using a correction
factor. Usually this correction factor is the spectral amplitude of the estimated noise
signal d(n), measured during periods of silence/non-activity in the speech signal or obtained
from a reference channel (dual-channel method). The correction is obtained by subtracting
the spectral amplitude of the noise signal from that of the noisy speech input. Hence, these
methods are also referred to as subtractive-type algorithms. If the noise is assumed to be
uncorrelated with the speech signal, then the corrected amplitude can be considered as an
estimate |Ŝ(k)| of the original clean speech signal s(n). The unprocessed phase of the noisy
input signal is used to synthesize the enhanced speech signal under the assumption that the
human ear is not able to perceive the distortions in the phase of the speech signal.
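The analysis–modification–synthesis framework described above can be sketched as below. The 256-sample Hamming-windowed frames with 50% overlap are typical choices, not values mandated by the text:

```python
import numpy as np

FRAME, HOP = 256, 128
WIN = np.hamming(FRAME)

def analyze(y):
    """Windowed DFT of overlapping frames (the analysis stage)."""
    starts = range(0, len(y) - FRAME + 1, HOP)
    return np.array([np.fft.rfft(WIN * y[s:s + FRAME]) for s in starts])

def synthesize(frames, out_len):
    """IDFT each (possibly modified) frame and overlap-add (the synthesis stage)."""
    out = np.zeros(out_len)
    wsum = np.zeros(out_len)
    for i, F in enumerate(frames):
        s = i * HOP
        out[s:s + FRAME] += np.fft.irfft(F, n=FRAME)
        wsum[s:s + FRAME] += WIN          # compensate the analysis window
    wsum[wsum < 1e-8] = 1.0               # avoid division by zero at edges
    return out / wsum
```

With no spectral modification, synthesize(analyze(y), len(y)) reconstructs y over the covered samples; the magnitude-modification step would be applied to each frame between the two calls.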
Figure 2.1: Diagrammatic representation of the short-time spectral magnitude enhancement system.
Spectral subtraction is a well-known noise reduction method based on the STSA
estimation technique. The basic power spectral subtraction technique, as proposed by Boll
[3], is popular due to its simple underlying concept and its effectiveness in enhancing speech
degraded by additive noise. The basic principle of the spectral subtraction method is to
subtract the magnitude spectrum of noise from that of the noisy speech. The noise is assumed
to be uncorrelated and additive to the speech signal. An estimate of the noise signal is
measured during silence or non-speech activity in the signal.
Since Boll [3] first proposed this method, several variations and enhancements have
been made to the technique to overcome some inherent drawbacks in the method. Section
2.4 presents the basic principle of the technique, Section 2.5 discusses the drawbacks in the
method and Section 2.6 describes the improvements that have been proposed over the years.
2.4 Principle of the basic spectral subtraction method
If we assume that y(n), the discrete noisy input signal, is composed of the clean speech
signal s(n) and the uncorrelated additive noise signal d(n), then we can represent it as:

    y(n) = s(n) + d(n)        (2.2)
Processing is done on a frame-by-frame basis. Analysis of overlapping frames of the
noisy signal is implemented by using the Discrete Fourier Transform (DFT) preceded by a
Hamming window. The power spectrum of the noisy signal can be written as:
    |Y(k)|² ≈ |S(k)|² + |D(k)|²        (2.3)

where Y(k), the DFT of y(n), is given by:

    Y(k) = Σ_{n=0}^{N−1} y(n) e^{−j2πnk/N} = |Y(k)| e^{jφ(k)}        (2.4)

where φ(k) is the phase of the noise-corrupted signal, i.e. the phase of Y(k).
Since the noise spectrum D(k) cannot be directly obtained, a time-averaged estimate D̂(k)
is calculated during periods of silence. Assuming that noise is
uncorrelated with the speech signal, an estimate of the modified speech spectrum can be
given as:

    |Ŝ(k)|² = |Y(k)|² − |D̂(k)|²        (2.5)
From Eq. (2.5) it can be seen that the subtraction process involves the subtraction of an
averaged estimate of the noise from the instantaneous speech spectrum. Due to the error in
computing the noise spectrum, we may have some negative values in the modified spectrum.
These values are set to zero. This process is called half-wave rectification. With half-wave
rectification the modified spectrum can be written as:
    |Ŝ(k)|² = |Ŝ(k)|²   if |Ŝ(k)|² > 0
               0         otherwise        (2.6)
The modified spectrum of Eq. 2.6 is combined with the phase information from the noise-
corrupted signal to reconstruct the time signal by using the Inverse Discrete Fourier
Transform (IDFT) in conjunction with the OLA method.
    ŝ(n) = IDFT( |Ŝ(k)| e^{jφ(k)} )        (2.7)
The noise suppression can also be implemented as a time-varying filtering process by
rewriting the spectral subtraction method as:
    Ŝ(k) = H(k) Y(k)        (2.8)

where H(k) is a gain function represented by:

    H(k) = 1 − |D̂(k)|² / |Y(k)|²        (2.9)

or

    H(k) = ( |Y(k)|² − |D̂(k)|² ) / |Y(k)|²        (2.10)
In this case, the modified spectrum is obtained by applying a time-varying weight H(k) to
each frequency component. From Eq. 2.9, it can be deduced that the frequency-dependent
gain is a function of the noisy signal-to-noise ratio (NSNR) of each of the frequency
components. The enhanced time signal is synthesized as given in Eq. 2.7, using the original
noisy phase portion.
The enhanced signal has largely reduced noise levels compared to the original noise-
corrupted signal, resulting in a better SNR and improved speech quality. However, the
subtraction process also introduces an annoying artifact called musical noise, caused by
the residual noise in the enhanced speech. This and other drawbacks of the method can
neutralize the improvement in speech quality achieved by the reduction in noise levels,
and the residual noise can be more annoying than the original noise itself.
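The subtraction, rectification and resynthesis steps of Eqs. 2.5-2.7 can be sketched for a single analysis frame as follows. This is a minimal illustration only; the function name and frame-by-frame interface are assumptions, and the OLA bookkeeping across frames is omitted:

```python
import numpy as np

def spectral_subtract_frame(noisy_frame, noise_psd):
    """Basic power spectral subtraction (Eqs. 2.5-2.7) on one windowed frame.

    noisy_frame : time-domain samples of one analysis frame
    noise_psd   : |D_hat(k)|^2, noise power spectrum averaged during silence
    """
    Y = np.fft.fft(noisy_frame)
    phase = np.angle(Y)                        # noisy phase, reused at synthesis
    S2 = np.abs(Y) ** 2 - noise_psd            # Eq. 2.5
    S2 = np.maximum(S2, 0.0)                   # half-wave rectification, Eq. 2.6
    S_hat = np.sqrt(S2) * np.exp(1j * phase)   # recombine with the noisy phase
    return np.fft.ifft(S_hat).real             # Eq. 2.7 (one OLA segment)
```

With a perfect noise estimate the frame's noise power cancels exactly; with a zero noise estimate the frame is returned unchanged.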
2.5 Drawbacks of the spectral subtraction method
While the spectral subtraction method is easily implemented and effectively reduces the
noise present in the corrupted signal, there exist some glaring shortcomings, which are given
below:
• Residual noise (musical noise)
It is obvious that the effectiveness of the noise removal process is dependent on obtaining
an accurate spectral estimate of the noise signal. The better the noise estimate, the smaller
the residual noise content in the modified spectrum. However, since the noise spectrum
cannot be directly obtained, we are forced to use an averaged estimate of the noise.
Hence there are significant variations between the estimated noise spectrum and the
actual noise content present in the instantaneous speech spectrum. The subtraction of
these quantities results in isolated residual noise components of large variance.
This residual spectral content manifests itself in the reconstructed time signal as
varying tonal sounds resulting in a musical disturbance of an unnatural quality. This
musical noise can be even more disturbing and annoying to the listener than the
distortions due to the original noise content.
Figure 2.2: Spectrograms of the sentence “The shop closes for lunch”, clean speech (top), with speech shaped noise at 5 dB SNR (middle), and speech enhanced
using spectral subtraction (bottom)
Figure 2.2 shows spectrograms of the sentence “The shop closes for lunch”
pronounced by a male speaker. The residual noise is quite clearly evident in the bottom
plot of Figure 2.2.
Several residual noise reduction algorithms have been proposed to combat this
problem. However, due to the limitations of the single-channel enhancement methods, it
is not possible to remove this noise completely, without compromising the quality of the
enhanced speech. Hence there is a trade-off between the amount of noise reduction and
speech distortion due to the underlying processing.
• Distortions due to half / full wave rectification
The modified speech spectrum obtained from Eq. 2.5 may contain some negative values
due to the errors in the estimated noise spectrum. These values are rectified using half-
wave rectification (set to zero) or full-wave rectification (set to its absolute value). This
can lead to further distortions in the resulting time signal.
• Roughening of speech due to the noisy phase
The phase of the noise-corrupted signal is not enhanced before being combined with the
modified spectrum to regenerate the enhanced time signal. This is due to the fact that the
presence of noise in the phase information does not contribute significantly to the
degradation of the speech quality. This is especially true at high SNRs (>5 dB). However,
at lower SNRs (<0dB), the noisy phase can lead to a perceivable roughness in the speech
signal contributing to the reduction in speech quality. Experiments conducted by
Schroeder [27] have corroborated this fact. Estimating the phase of the clean speech is
difficult and would greatly increase the complexity of the method. Moreover, the distortion
due to noisy phase information is not very significant compared to that of the magnitude
spectrum, especially for high SNRs. Hence the use of the noisy phase information is
considered to be an acceptable practice in the reconstruction of the enhanced speech
signal.
Most speech enhancement algorithms, including the spectral subtraction methods, try
to optimize noise removal based on mathematical models of the speech and noise signals.
However, speech is a subtle form of communication and is heavily dependent on the
relationship of one frequency with another. Hence, while conventional speech enhancement
algorithms can increase the speech quality of the noisy speech by increasing the SNR, there
is no significant increase in speech intelligibility. Algorithms should take into account the
subtleties of speech and incorporate methods based on the perceptual properties of the speech
signal. The spectral subtraction methods, as well as most other methods, suffer from this
drawback. Studies [5] [30] have shown that there is no improvement in the intelligibility of
speech signals enhanced by the spectral subtraction method.
2.6 Modifications to spectral subtraction
Several variants of the spectral subtraction method originally proposed by Boll [3] have been
developed to address the problems of the basic technique, especially the presence of musical
noise. Other methods derived from this technique perform noise
suppression in the autocorrelation, cepstral, logarithmic and sub-space domains. A variety of
pre and post processing methods have also proved to help reduce the presence of musical
noise while minimizing speech distortion. This section deals with the different techniques
and enhancements that have been proposed over the years.
• Magnitude averaging
Magnitude averaging of the input spectrum reduces spectral error by averaging across
neighboring frames. This has the effect of lowering the noise variance while reinforcing
the speech spectral content and thus preventing destructive subtraction. The magnitude
averaging is viable only for stationary time waveforms. Due to the short-term stationarity
of speech, the number of neighboring frames over which the averaging is done is limited.
If this constraint is ignored, a certain slurring of speech can be detected due to the
smearing of different speech phonemes into each other. A generalized representation of
the averaging operation can be expressed as:
$$\overline{Y}_i(k) = \frac{1}{2M+1}\sum_{j=-M}^{M} W_j\, Y_{i+j}(k) \qquad (2.11)$$
where $i$ is the frame index. The weights $W_j$ can be used to weight the frames. When
$W_j = 1 \;\forall j$, the equation reduces to the basic magnitude averaging operation. In the
case where the frames are weighted by different values of $W_j$, the operation is referred
to as weighted magnitude averaging. Goh et al. [9] proposed multi-blade median filtering
over several frames of speech to identify spectral content contributing to residual noise
and smoothing them out.
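As an illustration of the averaging operation of Eq. 2.11, the following sketch computes a weighted magnitude average over the $2M+1$ neighbouring frames. The function name and the edge-clamping behaviour at the first and last frames are assumptions:

```python
import numpy as np

def magnitude_average(frames, i, M=2, weights=None):
    """Weighted magnitude averaging over 2M+1 neighbouring frames (Eq. 2.11).

    frames  : 2-D array with one magnitude spectrum |Y_j(k)| per row
    i       : index of the frame being smoothed
    weights : W_j for j = -M..M; all-ones gives plain magnitude averaging
    """
    if weights is None:
        weights = np.ones(2 * M + 1)
    acc = np.zeros(frames.shape[1])
    for j in range(-M, M + 1):
        idx = min(max(i + j, 0), len(frames) - 1)   # clamp at the sequence edges
        acc += weights[j + M] * frames[idx]
    return acc / (2 * M + 1)
```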
• Generalized spectral subtraction
A generalized form of the basic spectral subtraction Eq. 2.5 is given by Berouti et al. [2]
as:
$$|\hat S(k)|^a = |Y(k)|^a - |\hat D(k)|^a \qquad (2.12)$$
where the power exponent $a$ can be chosen to obtain optimum performance. In the case
where $a = 2$, the subtraction is carried out on the Short-term Power Density Spectra
(STPDS) and is referred to as power spectral subtraction. When $a = 1$, the equation
reduces to the basic spectral subtraction method proposed by Boll [3], where the
subtraction is carried out on the magnitude spectra.
• Spectral subtraction using over-subtraction and spectral floor
An important variation of spectral subtraction was proposed by Berouti et al. [2] for
reduction of residual musical noise. This proposed technique could be expressed as:
$$|\hat S(k)|^2 = |Y(k)|^2 - \alpha\,|\hat D(k)|^2 \qquad (2.13)$$
$$|\hat S(k)|^2 = \begin{cases} |\hat S(k)|^2 & \text{if } |\hat S(k)|^2 > \beta\,|\hat D(k)|^2 \\ \beta\,|\hat D(k)|^2 & \text{else} \end{cases} \qquad (2.14)$$
where the over-subtraction factor $\alpha$ is a function of the noisy signal-to-noise ratio and is
calculated as:

$$\alpha = \alpha_0 - \frac{3}{20}\,\text{SNR}, \qquad -5\ \text{dB} \le \text{SNR} \le 20\ \text{dB} \qquad (2.15)$$

where $\alpha_0$ is the desired value of $\alpha$ at 0 dB SNR. Figure 2.3 gives a plot of $\alpha$ for $\alpha_0 = 4$
over a range of SNR values. The over-subtraction factor $\alpha$ subtracts an overestimate of
the noise power spectrum from the speech power spectrum. This operation minimizes the
presence of residual noise by decreasing the spectral excursions in the enhanced
spectrum. The over-subtraction factor can be seen as a time-varying factor, which
provides a degree of control over the noise removal process between periods of noise
update.
Figure 2.3: Over-subtraction factor $\alpha$ as a function of SNR with $\alpha_0 = 4$
In Eq. 2.14, the spectral floor $\beta$ prevents the spectral components of the enhanced
spectrum from falling below the lower bound $\beta\,|\hat D(k)|^2$. This operation fills out the
valleys between spectral peaks and the reinsertion of broadband noise into the spectrum
helps to ‘mask’ the neighboring residual noise components. While the proposed
technique has proved to be successful in suppressing the residual noise to a large extent,
over-subtraction of the noise estimate also causes heavy speech distortions.
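A minimal sketch of the over-subtraction rule of Eqs. 2.13-2.15 follows; the function name and the default spectral floor value are illustrative assumptions (Berouti et al. tune $\beta$ together with $\alpha_0$):

```python
import numpy as np

def berouti_subtract(noisy_psd, noise_psd, snr_db, alpha0=4.0, beta=0.01):
    """Over-subtraction with a spectral floor (Eqs. 2.13-2.15)."""
    snr_db = min(max(snr_db, -5.0), 20.0)          # valid range of Eq. 2.15
    alpha = alpha0 - (3.0 / 20.0) * snr_db         # Eq. 2.15
    S2 = noisy_psd - alpha * noise_psd             # Eq. 2.13
    floor = beta * noise_psd                       # spectral floor of Eq. 2.14
    return np.where(S2 > floor, S2, floor), alpha
```

At 0 dB SNR the factor equals $\alpha_0$; above 20 dB it settles to $\alpha_0 - 3$.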
• Spectral subtraction with an MMSE STSA estimator
In 1983, Ephraim and Malah [7] proposed an optimum (in the minimum mean-square
error sense, MMSE) short-time spectral amplitude (STSA) estimator. The method calculates a
gain function based on the a priori and a posteriori SNRs. The following equations describe the method:
$$\hat S(k) = H(k)\, Y(k) \qquad (2.16)$$

$$H(k) = \frac{\sqrt{\pi}}{2}\sqrt{\frac{1}{\gamma_N}\cdot\frac{\gamma_A}{1+\gamma_A}}\; F\!\left(\frac{\gamma_A}{1+\gamma_A}\,\gamma_N\right) \qquad (2.17)$$
where $\gamma_A$ is the a priori SNR, which is calculated as:

$$\gamma_{A,i}(k) = 0.98\,\frac{|\hat S_{i-1}(k)|^2}{|\hat D(k)|^2} + (1 - 0.98)\, P\!\left(\gamma_{N,i} - 1\right) \qquad (2.18)$$
here, $i$ is the frame index, with $P(x) = x$ if $x \ge 0$ and $P(x) = 0$ otherwise. $\gamma_N$ is the a
posteriori SNR and $F$ is the function:

$$F(x) = e^{-x/2}\left[(1+x)\, I_0\!\left(\frac{x}{2}\right) + x\, I_1\!\left(\frac{x}{2}\right)\right] \qquad (2.19)$$
where $I_0(\cdot)$ and $I_1(\cdot)$ are the zero- and first-order modified Bessel functions, respectively.
Unlike magnitude averaging where averaging is performed irrespective of whether
the frame contains speech or noise, the proposed MMSE estimator performs non-linear
smoothing only when the SNR is low, i.e. when the frame predominantly contains noise.
The residual noise present due to this technique has been observed to be colorless. The
method reduces the distortions in the speech parts due to averaging.
• Spectral subtraction based on perceptual properties
As mentioned earlier, while conventional speech enhancement algorithms improve the
speech quality of the noisy speech by increasing the SNR, there is no significant increase
in speech intelligibility due to the quasi-stationary and other subtle properties of speech.
To tackle this problem, researchers have been trying to incorporate the knowledge of
human perceptual properties in the enhancement processing. Methods based on the
perceptual loudness (Petersen and Boll [22]) and lateral inhibition (Cheng and
O’Shaughnessy [4]) have shown that this approach is somewhat successful in preserving
speech content.
Virag [29] proposed a technique based on the masking properties of the human
auditory system, i.e. the property that weak sounds are masked by simultaneously
occurring stronger sounds. A masking threshold is calculated by modeling the frequency
selectivity of the human ear and the masking property. Using the implementation of
spectral subtraction given in Eq. 2.8, the gain function is calculated as:
$$G(\omega) = \begin{cases} \left[1 - \alpha\left(\dfrac{|\hat D(\omega)|}{|Y(\omega)|}\right)^{\gamma_1}\right]^{\gamma_2} & \text{if } \left(\dfrac{|\hat D(\omega)|}{|Y(\omega)|}\right)^{\gamma_1} < \dfrac{1}{\alpha + \beta} \\ \left[\beta\left(\dfrac{|\hat D(\omega)|}{|Y(\omega)|}\right)^{\gamma_1}\right]^{\gamma_2} & \text{else} \end{cases} \qquad (2.20)$$
where the over-subtraction factor $\alpha$ and the spectral floor parameter $\beta$ are functions of
the masking threshold $T(\omega)$. The exponents $\gamma_1$ and $\gamma_2$ determine the sharpness of the
transition of $G(\omega)$. The masking threshold $T(\omega)$ is calculated by applying a spreading
function across the critical bands of the speech spectrum.
Kim et al. [15] proposed a similar method based on masking properties and phonetic
dependency of speech. The method employs a state-dependent subtraction of speech and
residual noise reduction using the masking threshold. While these methods have proved
to improve speech quality as compared to using purely mathematical models of speech
and noise signals, the increase in complexity in implementation is also substantial.
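For a single frequency bin, the piecewise gain of Eq. 2.20 can be sketched as follows. The constant values of $\alpha$, $\beta$, $\gamma_1$ and $\gamma_2$ here are placeholders; in Virag's method $\alpha$ and $\beta$ are driven by the masking threshold $T(\omega)$:

```python
def virag_gain(d_mag, y_mag, alpha=2.0, beta=0.05, g1=2.0, g2=0.5):
    """Piecewise gain of Eq. 2.20 for a single frequency bin.

    d_mag, y_mag : noise-estimate and noisy-speech magnitudes at this bin
    """
    r = (d_mag / y_mag) ** g1
    if r < 1.0 / (alpha + beta):
        return (1.0 - alpha * r) ** g2   # subtraction branch
    return (beta * r) ** g2              # spectral-floor branch
```

The switching condition guarantees that the subtraction branch never raises a negative base to the power $\gamma_2$.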
• Other methods
Other methods based on the spectral subtraction method have been developed that
operate in the autocorrelation, cepstral, logarithmic and signal subspace domains. In the
basic spectral subtraction method and most of its variations, the short-term spectral
estimations are done in the frequency domain. This estimation can also be done in the
logarithmic domain. A major drawback of this method is that the resulting
implementation becomes very complicated and computationally expensive. However, this
drawback can be avoided by using lookup tables.
The signal subspace principles have also been incorporated successfully within the
spectral subtraction framework. The decomposition of the noisy signal into a subspace of
the desired signal and noise is done using the Karhunen-Loeve Transform (KLT) as
described by [8] [24]. However, for non-white noise sources, pre-whitening may be
necessary.
In recent years, researchers have proposed a frequency adaptive subtraction factor
based on the segmental noisy SNR. Most implementations and variations of the basic
technique advocate subtraction of the noise spectrum estimate over the entire speech
spectrum. However, real world noise is mostly colored and does not affect the speech
signal uniformly over the entire spectrum. The frequency-dependent spectral subtraction
approach takes into account the fact that colored noise affects the speech spectrum
differently at various frequencies. The next section deals in detail with methods based on
non-linear spectral subtraction.
2.7 Frequency-dependent spectral subtraction methods
Lockwood and Boudy [19] proposed the non-linear spectral subtraction (NSS) method,
which is based on the linear spectral subtraction method proposed by Berouti et al. [2]. In this
method, the over-subtraction factor is frequency dependent within every frame of speech
input. Hence, the subtraction is non-linear over the range of frequencies in the spectrum. The
enhanced speech spectrum can be expressed as:
$$|\hat S_i(k)| = |\overline{Y}_i(k)| - \frac{\alpha_i(k)}{1 + \gamma\,\rho_i(k)} \qquad (2.21)$$
where $i$ is the frame index, $|\overline{Y}_i(k)|$ is the smoothed noisy speech spectrum of the $i$th frame,
and $\alpha_i(k)$ is the frequency-dependent overestimation factor, calculated as:
$$\alpha_i(k) = \max_{0 \le j \le 40}\left(|\hat D_{i-j}(k)|\right) \qquad (2.22)$$
or, is estimated as:
$$\alpha_i(k) = 1.5 \cdot |\hat D_i(k)| \qquad (2.23)$$
The scaling factor $\gamma$ depends on the variation of the frequency-dependent SNR $\rho_i(k)$,
which is given by:

$$\rho_i(k) = \frac{|\overline{Y}_i(k)|}{|\hat D_i(k)|} \qquad (2.24)$$
The subtraction term in Eq. 2.21 is limited by the bounds given in Eq. 2.25 to
reduce large variations in the modified spectrum.

$$|\hat D_i(k)| \;\le\; \frac{\alpha_i(k)}{1 + \gamma\,\rho_i(k)} \;\le\; 3\,|\hat D_i(k)| \qquad (2.25)$$
To prevent negative values in the enhanced spectrum, a spectral floor is employed as:

$$|\hat S_i(k)| = \begin{cases} |\hat S_i(k)| & \text{if } |\hat S_i(k)| \ge \beta\,|\hat D_i(k)| \\ \beta\,|\overline{Y}_i(k)| & \text{else} \end{cases} \qquad (2.26)$$

where a typical value is $\beta = 0.1$.
The proposed algorithm computes an optimum over-subtraction value for each
frequency in the frame depending on the SNR. Though the algorithm is successful in
suppressing the musical noise to a large extent, there may exist large variations between
neighboring frequency components due to errors in the noise estimate. However, the
algorithm demonstrates that frequency-dependent processing can be used to suppress musical
noise and achieve better speech quality.
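A sketch of the non-linear subtraction rule of Eqs. 2.21-2.26 follows. The function name, the use of the most recent noise frame as $\hat D_i$, and the small guard constant against division by zero are assumptions:

```python
import numpy as np

def nss_subtract(Y_smooth, noise_mags, gamma=1.0, beta=0.1):
    """Non-linear spectral subtraction in the style of Eqs. 2.21-2.26.

    Y_smooth   : smoothed noisy magnitude spectrum of the current frame
    noise_mags : 2-D array of recent noise magnitude estimates (frames x bins);
                 the last row is taken as the current estimate D_hat_i(k)
    """
    D = noise_mags[-1]
    alpha = noise_mags.max(axis=0)               # Eq. 2.22: max over past frames
    rho = Y_smooth / np.maximum(D, 1e-12)        # frequency-dependent SNR, Eq. 2.24
    sub = alpha / (1.0 + gamma * rho)            # non-linear subtraction term
    sub = np.clip(sub, D, 3.0 * D)               # bounds of Eq. 2.25
    S = Y_smooth - sub                           # Eq. 2.21
    return np.where(S >= beta * D, S, beta * Y_smooth)  # spectral floor, Eq. 2.26
```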
Other approaches based on frequency-dependent subtraction have also been proposed.
He and Zweig [11] have proposed a two-band spectral subtraction method using the Berouti
et al. [2] method for the lower frequency band and weighted magnitude averaging for the
higher frequency band, which is considered to be stochastic in nature. The cut-off frequency
between the two bands was determined adaptively for each frame as the highest frequency
below which the separation between adjacent peaks was approximately equal to the
fundamental frequency. Another method (similar to the work presented in this thesis)
proposed by Wu and Chen [31] uses the Berouti et al. [2] spectral subtraction method on
each critical band over the speech spectrum.
The success met by these methods has shown that frequency-dependent processing of
the subtraction procedure is indeed a valid line of research. Chapter 3 presents the proposed
multi-band method for frequency-dependent subtraction.
CHAPTER THREE

MULTI-BAND SPECTRAL SUBTRACTION

In Chapter 2, we have seen that while the conventional power spectral subtraction method
substantially reduces the noise levels in the noisy speech, it also introduces an annoying
distortion in the speech signal called musical noise. This distortion is caused by
inaccuracies in the short-time noise spectrum estimate, resulting in large spectral variations in
the enhanced spectrum. This chapter describes a frequency-dependent spectral subtraction
method that offers better speech quality of the resulting enhanced speech with reduced
residual noise. Section 3.1 discusses the motivation behind the development of the proposed
method. Section 3.2 explains the different processing techniques used in the proposed multi-
band spectral subtraction algorithm.
3.1 Motivation
Most speech enhancement algorithms have been observed to work well only under certain
conditions; the problem of enhancing speech corrupted by a noise source has not yet
been fully resolved. While methods based on mathematical and statistical models of
speech/noise signals have been shown to be effective, they have a key drawback: they
incorporate crucial assumptions about the speech and noise
characteristics. However, real-world noise is highly random in nature. Moreover, the spectral
content of speech can vary significantly from speaker to speaker and with the emotional state
of the speaker. Hence it becomes imperative to exploit as many of the observable properties of
the speech and noise signals as possible. For example, it is possible to observe the noise by
itself during speech pauses, owing to the bursty nature of speech.
Recent research in spectral subtraction methods has focused on a non-linear approach
to the subtraction procedure [11] [19] [28] [31]. This approach has been justified due to the
variation of signal-to-noise ratio across the speech spectrum. Unlike white Gaussian noise
(WGN), which has a flat spectrum, the spectrum of real-world noise is not flat. Thus, the
noise signal does not affect the speech signal uniformly over the whole spectrum. Some
frequencies are affected more adversely than the others. In multi-talker babble, for instance,
the low frequencies, where most of the speech energy resides, are affected more than the high
frequencies. Hence it becomes imperative to estimate a suitable factor that will subtract just
the necessary amount of the noise spectrum from each frequency bin (ideally), to prevent
destructive subtraction of the speech while removing most of the residual noise.
Another factor that leads to SNR variation across frequency bands of speech
corrupted with noise is that noise has a non-uniform effect on different vowels and
consonants. Past research [20] has shown this effect to be present in consonants. Preliminary
results of continuing research at our lab at the University of Texas at Dallas have shown that
various types of noise also affect vowels non-uniformly.
These effects are best illustrated in the plots of the power spectral density (PSD) of
different noise signals and the corresponding variation of segmental SNR of a portion of
speech corrupted with the particular noise. Calculation of the segmental SNR is given by Eq.
3.10 in section 3.2. Figure 3.1(a) depicts the PSD of computer-generated WGN, which is flat
over the whole spectrum. Figure 3.1(b) shows the segmental SNR estimated for 4 (linearly
spaced) frequency bands of speech corrupted by the noise. The segmental SNR was plotted
for a portion of the sentence "The shop closes for lunch." produced by a male speaker.
Figures 3.2 – 3.4 illustrate similar plots for three real-world noise signals, i.e. speech-shaped
noise, babble noise and aircraft noise. The SNR plots indicate that the segmental SNRs of the
high frequency bands (e.g. band 4) are significantly lower than the SNR of the low frequency
bands (e.g. band 2), by as much as 15 dB in some cases.
The non-linear spectral subtraction [19] is a frequency-dependent spectral subtraction
method, which exploits the non-uniformity of the effects of noise on speech. Here, a
frequency-dependent subtraction factor is calculated for each frequency component of the
spectra.
Figure 3.1: (a) PSD of WGN, (b) Segmental SNR of four (linearly-spaced) frequency bands of speech corrupted by WGN at 5dB SNR.
Figure 3.2: (a) PSD of speech-shaped noise, (b) Segmental SNR of four (linearly-spaced) frequency bands of speech corrupted by speech-shaped noise at 5dB SNR.
Figure 3.3: (a) PSD of multi-talker babble, (b) Segmental SNR of four (linearly spaced) frequency bands of speech corrupted by multi-talker babble at 5dB SNR.
Figure 3.4: (a) PSD of aircraft noise, (b) Segmental SNR of four (linearly spaced) frequency bands of speech corrupted by aircraft noise at 5dB SNR.
3.2 Multi-band spectral subtraction
This section describes the proposed method for speech enhancement with reduced residual
noise. A block diagram of the proposed method is shown in Figure 3.5. It consists of 4
stages. In the first stage, the signal is windowed and the magnitude spectrum is estimated
using the FFT. In the second stage, we split the noise and speech spectra into different
frequency bands and calculate the over-subtraction factor for each band. The third stage
includes processing the individual frequency bands by subtracting the corresponding noise
spectrum from the noisy speech spectrum. Lastly, the modified frequency bands are
recombined and the time signal is obtained by using the noisy phase information and taking
the IFFT in the fourth stage. The effect of pre-processing operations is to neutralize the
Figure 3.5: Diagrammatic representation of the multi-band spectral subtraction method
distortion in the spectral content of the input data due to the analysis window and to pre-
condition the input data to surmount the distortion due to errors in the subtraction process.
• Pre-processing techniques
Along with the actual noise suppression operation, some pre-processing methods are also
crucial to achieving good speech quality. It was mentioned in Chapter 2 (Eqs. 2.8 -
2.10) that the spectral subtraction process can be viewed as a time-varying filter. To
reduce the perception of residual noise in the enhanced speech, it is necessary to reduce
the variance of the frequency content of the signal. Hence, instead of directly using the
power spectra of the signal, a smoothed version of the power spectra can be used. A
smoothing window of size 10 ms was found to work well. However, it was seen that
smoothing of the estimated noise spectrum was not helpful in reducing residual noise.
Local or magnitude averaging has also proved to help improve speech quality of the
processed speech [3][4] [5]. The operation is described as:
$$\overline{Y}_j(k) = \sum_{l=-M}^{M} W_l\, Y_{j-l}(k) \qquad (3.1)$$

where $j$ is the frame index and $0 < W_l < 1$. The averaging is done over $M$ preceding and
succeeding frames of speech. Since the residual noise is the difference between the
estimated noise spectrum and its mean, local averaging of the magnitude spectrum
essentially means that noise content in the averaged frame will approach the mean noise
spectrum, i.e., the estimated noise spectrum, $\hat N(k)$. If the error can be written as:
$$E(k) = \hat S(k) - S(k) \qquad (3.2)$$

Then, from Eqs. 2.3 and 2.5,

$$E(k) = N(k) - \hat N(k) \qquad (3.3)$$

Substituting $N(k)$ with its local average $\overline{N}(k)$:

$$\overline{E}(k) = \overline{N}(k) - \hat N(k) \qquad (3.4)$$

where $\overline{N}(k)$ is obtained by the averaging operation described in Eq. 2.11. The error,
i.e. the residual noise, is minimized if $\overline{N}(k) \approx \hat N(k)$. Figure 3.6(b) shows the effect of
smoothing and local averaging on the original spectrum shown in Figure 3.6(a).
Figure 3.6: (a) Original magnitude spectrum of a speech frame, (b) Magnitude spectrum of the smoothed and averaged version of 3.6(a).
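The spectral smoothing step can be sketched as a simple moving average across frequency bins; here the window width is a stand-in for the 10-ms window quoted in the text, and the function name is an assumption:

```python
import numpy as np

def smooth_spectrum(power_spec, width=5):
    """Moving-average smoothing of a power spectrum across frequency bins."""
    kernel = np.ones(width) / width
    return np.convolve(power_spec, kernel, mode="same")
```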
• Proposed subtraction method
A block diagram of the proposed method was given in Figure 3.5. Assuming the additive
noise to be stationary and uncorrelated with the clean speech signal, the resulting input
corrupted speech can be expressed as:
$$y(n) = s(n) + d(n) \qquad (3.5)$$
where $y(n)$, $s(n)$ and $d(n)$ are the corrupted speech signal, the clean speech signal and the
noise, respectively. For a zero-mean uncorrelated noise signal, the power spectrum of the
corrupted speech can be approximated as:
$$|Y(k)|^2 \approx |S(k)|^2 + |D(k)|^2 \qquad (3.6)$$
where $S(k)$ and $D(k)$ are the magnitude spectra of the clean speech and the noise,
respectively. Since the noise spectrum cannot be directly obtained, an estimate $\hat D(k)$ is
calculated during periods of silence or non-speech activity. In the implementation
proposed by Berouti et al. [2], the estimate of the clean speech spectrum is
obtained as:
$$|\hat S(k)|^2 = |Y(k)|^2 - \alpha\,|\hat D(k)|^2 \qquad (3.7)$$
$$|\hat S(k)|^2 = \begin{cases} |\hat S(k)|^2 & \text{if } |\hat S(k)|^2 > \beta\,|\hat D(k)|^2 \\ \beta\,|\hat D(k)|^2 & \text{else} \end{cases} \qquad (3.8)$$
where $\alpha$ is the over-subtraction factor [2], which is a function of the segmental SNR.
This implementation assumes that the noise affects the speech spectrum uniformly, and the
over-subtraction factor $\alpha$ subtracts an over-estimate of the noise over the whole
spectrum. That is not the case, however, with real-world noise (e.g., car noise, cafeteria
noise, etc.).
To take into account the fact that colored noise affects the speech spectrum
differently at various frequencies, we propose a multi-band approach to spectral
subtraction. The speech spectrum is divided into N non-overlapping bands, and spectral
subtraction is performed independently in each band. The estimate of the clean speech
spectrum in the $i$th band is obtained by:

$$|\hat S_i(k)|^2 = |\overline{Y}_i(k)|^2 - \alpha_i\,\delta_i\,|\hat D_i(k)|^2, \qquad b_i \le k \le e_i \qquad (3.9)$$
where $b_i$ and $e_i$ are the beginning and ending frequency bins of the $i$th frequency band,
$\alpha_i$ is the over-subtraction factor of the $i$th band, and $\delta_i$ is a band-subtraction factor that
can be individually set for each frequency band to customize the noise removal
properties. $\overline{Y}_i(k)$ is the $i$th frequency band of the smoothed and averaged
noisy speech spectrum given by Eq. 3.1. The band-specific over-subtraction factor $\alpha_i$
is a function of the segmental SNR$_i$ of the $i$th frequency band, which is calculated as:
$$\text{SNR}_i\,(\text{dB}) = 10 \log_{10}\!\left(\frac{\sum_{k=b_i}^{e_i} |\overline{Y}_i(k)|^2}{\sum_{k=b_i}^{e_i} |\hat D(k)|^2}\right) \qquad (3.10)$$
Using the SNR$_i$ value calculated in Eq. 3.10, $\alpha_i$ can be determined as:

$$\alpha_i = \begin{cases} 4.75 & \text{SNR}_i < -5 \\ 4 - \frac{3}{20}\,\text{SNR}_i & -5 \le \text{SNR}_i \le 20 \\ 1 & \text{SNR}_i > 20 \end{cases} \qquad (3.11)$$
While the use of the over-subtraction factor $\alpha_i$ provides a degree of control over the
noise subtraction level in each band, the use of multiple frequency bands and the
$\delta_i$ weights provide an additional degree of control within each band.
The negative values in the modified spectrum of Eq. 3.9 are floored to a scaled version
of the noisy spectrum as:

$$|\hat S_i(k)|^2 = \begin{cases} |\hat S_i(k)|^2 & \text{if } |\hat S_i(k)|^2 > 0 \\ \beta\,|\overline{Y}_i(k)|^2 & \text{else} \end{cases} \qquad (3.12)$$

where the spectral floor parameter was set to $\beta = 0.002$.
The modified spectra of each frequency band are recombined to obtain the enhanced
speech spectrum $|\hat S(k)|$. The IFFT of the enhanced speech spectrum is computed with
the original noisy phase information. Since overlapping frames of speech were used in
the analysis stage, the enhanced time signal $\hat s(n)$ is obtained by adding the overlapping
portions of the temporal speech frames, i.e. by the OLA method.
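The per-frame processing of Eqs. 3.9-3.12 can be sketched as follows. The function name and interface are assumptions; the $\delta_i$ values and band edges are supplied by the caller, and recombination is implicit because the bands partition the spectrum:

```python
import numpy as np

def multiband_subtract(Y_bar_psd, noise_psd, band_edges, deltas, beta=0.002):
    """Multi-band spectral subtraction (Eqs. 3.9-3.12) on one frame.

    Y_bar_psd  : smoothed/averaged noisy power spectrum |Y_bar_i(k)|^2
    noise_psd  : estimated noise power spectrum |D_hat(k)|^2
    band_edges : list of (b_i, e_i) bin indices delimiting the N bands
    deltas     : per-band subtraction weights delta_i
    """
    S = np.empty_like(Y_bar_psd)
    for (b, e), delta in zip(band_edges, deltas):
        noisy = Y_bar_psd[b:e + 1]
        noise = noise_psd[b:e + 1]
        snr = 10.0 * np.log10(noisy.sum() / noise.sum())        # Eq. 3.10
        # over-subtraction factor of Eq. 3.11
        alpha = 4.75 if snr < -5.0 else (1.0 if snr > 20.0 else 4.0 - 0.15 * snr)
        band = noisy - alpha * delta * noise                    # Eq. 3.9
        S[b:e + 1] = np.where(band > 0.0, band, beta * noisy)   # floor, Eq. 3.12
    return S
```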
In the next chapter, we present the results obtained with the proposed multi-band
spectral subtraction approach.
CHAPTER FOUR

IMPLEMENTATION AND PERFORMANCE EVALUATION

This chapter describes the implementation details and performance evaluation of the
proposed algorithm. Evaluation of a speech enhancement algorithm is not simple. While
objective quality assessment methods can indicate an improvement or degradation in speech
quality based on mathematical measures, the human listener does not judge quality by a simple
mathematical error criterion. Therefore, subjective measurements of intelligibility and quality
are also required. Section 4.1 describes the implementation of the proposed algorithm and the
speech material used to test the algorithm. Section 4.2 explains the objective measures that
were used to evaluate the algorithm. Later sections deal with the results obtained by off-line
simulations of different versions of the proposed algorithm. In section 4.3, the effects of pre-
processing techniques are discussed. In section 4.4, objective results obtained by using
different methods for frequency spacing are given. Section 4.5 evaluates the proposed
algorithm incorporating a speech activity detector. Subjective results are given in section 4.6.
Section 4.7 summarizes the best configuration for the proposed algorithm that is indicated by
the objective measures.
4.1 Implementation
It is necessary to conduct off-line simulations to check the validity and feasibility of an
algorithm before it can be implemented on a real-time system. Implementation on a
workstation permits modifications and changes to the algorithm without constraints of time,
memory or computational power. The simulations were carried out on an IBM PC using
Matlab, a technical computing software package.
The speech signal is first Hamming windowed using a 20-ms window and a 10-ms
overlap between frames. The windowed speech frame is then analyzed using the Fast Fourier
Transform (FFT). Smoothing of the magnitude spectrum as per [1] was found to reduce the
variance of the speech spectrum and contribute to the enhancement in speech quality. A
weighted spectral average is taken over preceding and succeeding frames of data as given by
Eq. 3.1 in section 3.2. The number of frames M is limited to 2 to prevent smearing of the
speech spectral content. The weights $W_l$ were empirically determined and set to
$W_l = [0.09,\ 0.25,\ 0.32,\ 0.25,\ 0.09]$ for $-2 \le l \le 2$. The resulting smoothed and averaged
spectrum and the estimated noise spectrum are divided into $N$ $(1 \le N \le 8)$ frequency bands
using either linear, logarithmic or Mel spacing, as described in section 4.4. The over-
subtraction factor $\alpha_i$ is calculated for each band as described by Eq. 3.11. The values of
$\delta_i$ in Eq. 3.9 were empirically determined and set to:
$$\delta_i = \begin{cases} 1 & f_i \le 1\ \text{kHz} \\ 2.5 & 1\ \text{kHz} < f_i \le \frac{F_s}{2} - 2\ \text{kHz} \\ 1.5 & f_i > \frac{F_s}{2} - 2\ \text{kHz} \end{cases} \qquad (4.1)$$
where $f_i$ is the upper frequency of the $i$th band, and $F_s$ is the sampling frequency in Hz.
The motivation for using smaller $\delta_i$ values for the low frequency bands is to minimize
speech distortion, since most of the speech energy is present in the lower frequencies.
Relaxed subtraction is also used for the high frequency bands. Subtraction is performed over
each band as indicated in Eq. 3.9 and the negative values are rectified using the spectral floor
as given in Eq. 3.12. A small amount of the original noisy spectrum can be introduced back
into the enhanced spectrum to mask any remaining musical noise. In this implementation, 5%
of the original noisy spectrum was added to the enhanced spectrum. The enhanced spectra
of the individual bands are then combined, and the enhanced signal is obtained by taking the IFFT of the
enhanced spectrum using the phase of the original noisy spectrum. Finally, the standard
overlap-and-add method is used to obtain the enhanced signal.
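The band-weight rule of Eq. 4.1 can be written directly as a small helper; the function name is an assumption:

```python
def band_delta(f_i, fs=8000):
    """Band-subtraction weight delta_i of Eq. 4.1 (f_i = band's upper edge, Hz)."""
    if f_i <= 1000.0:
        return 1.0    # gentle subtraction where most speech energy lies
    elif f_i <= fs / 2.0 - 2000.0:
        return 2.5
    else:
        return 1.5    # relaxed subtraction for the highest band
```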
Figure 4.1: (a) Long-term magnitude spectrum of a speech file from the HINT database, (b) Magnitude spectrum of the speech-shaped noise.
Ten sentences (see Table 4.1) of list number 2 (L02) from the HINT (Hearing In
Noise Test) database [21] uttered by a male speaker were used to evaluate the proposed
multi-band spectral subtraction approach. Speech-shaped noise at 5 dB and 0 dB SNR was
added to the sentences after downsampling them to 8 kHz. This noise was generated from the
long-term spectrum of all the sentences in the database and resembles the spectral
characteristics of the male speaker, as illustrated in Figure 4.1.
File Sentence Utterance
L02_1 “A boy ran down the path.”
L02_2 “Flowers grow in the garden.”
L02_3 “Strawberry jam is sweet.”
L02_4 “The shop closes for lunch.”
L02_5 “The police helped the driver.”
L02_6 “She looked in her mirror.”
L02_7 “The match fell on the floor.”
L02_8 “The fruit came in a box.”
L02_9 “He really scared his sister.”
L02_10 “The tub faucet is leaking.”
Table 4.1: List of sentences used from the HINT database for
objective performance evaluation
Figure 4.1: Sentence “The shop closes for lunch,” sampled at 8 kHz, (above) time plot and (below) the corresponding spectrogram.
Figure 4.2: Speech-shaped noise sampled at 8 kHz, (above) time plot and (below) the corresponding spectrogram.
Figure 4.3: Sentence “The shop closes for lunch,” at 5 dB SNR, (above) time plot and (below) the corresponding spectrogram.
Figure 4.4: Sentence “The shop closes for lunch,” at 0 dB SNR, (above) time plot and (below) the corresponding spectrogram.
4.2 Objective measures for performance evaluation
In the evaluation of speech enhancement algorithms, it is necessary to identify the similarities
and differences in perceived quality and subjectively measured intelligibility. Speech quality
is an indicator of the “naturalness” of the processed speech signal. Intelligibility of speech
signals is a measure of the amount of speech information present in the signal that is
responsible for conveying what the speaker is saying. The interrelationship between
perceived speech quality and intelligibility is not clearly understood. While unintelligible speech
may not be considered to be of high quality, the converse may not be true. For human
listeners, it is important for the speech signal to be intelligible, even at the expense of some
degradation in speech quality. For example, human end-users could actually prefer a less
aggressive enhancement method that may not completely remove all of the interfering noise,
to a more aggressive algorithm that may completely remove the noise component but also
reduce the speech intelligibility. Performance evaluation tests can be done by subjective
quality measures or objective quality measures. Subjective measures are discussed in section
4.6. Subjective measures provide only a broad measure of performance, since a large
difference in quality is necessary to be distinguishable to the listener. Hence, it
becomes difficult to obtain a reliable measure of changes due to algorithm parameters. Objective
measures, on the other hand, provide a measure that can be easily implemented and reliably
reproduced.
Objective measures are based on a mathematical comparison of the original and
processed speech signals. The majority of objective quality measures quantify speech quality
in terms of a numerical distance measure or a model of the perception of speech quality by
the human auditory system. It is desired that the objective measures be consistent with the
judgment of the human perception of speech. However, it has been observed that results
obtained by objective measures are not highly correlated with those
obtained by subjective measures. The signal-to-noise ratio (SNR) and the Itakura-Saito (IS)
measure are two of the most widely used objective measures.
• Signal-to-noise ratio (SNR)
The SNR is a popular method to measure speech quality. As the name suggests, it is
calculated as the ratio of the signal power to the noise power in decibels:
$$ SNR_{dB} = 10 \cdot \log_{10}\left( \frac{\sum_n s^2(n)}{\sum_n \left[ s(n) - \hat{s}(n) \right]^2} \right) \qquad (4.2) $$
where $s(n)$ is the clean speech signal, $d(n)$ is the noise signal and $\hat{s}(n)$ is the processed
speech signal. If the summation is performed over the whole signal length, the operation
is called global SNR. However, this measure suffers from a very low correlation to
subjective results [23]. A better measure can be achieved by performing the summation
over smaller periods or frames of the speech signal. This method is referred to as
segmental SNR. An average of the segmental SNRs over the whole speech length can be
performed. This method has proved to have a higher correlation to subjective results as
compared to the global SNR method [23].
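The global and segmental variants of Eq. 4.2 can be sketched as below; the 160-sample (20 ms at 8 kHz) frame length for the segmental variant is an assumed, typical choice:

```python
import numpy as np

def global_snr(clean, processed):
    """Eq. 4.2 with the summation over the whole signal length."""
    noise = clean - processed
    return 10.0 * np.log10(np.sum(clean ** 2) / np.sum(noise ** 2))

def segmental_snr(clean, processed, frame_len=160):
    """Average of per-frame SNRs over the whole speech length."""
    snrs = []
    for i in range(0, len(clean) - frame_len + 1, frame_len):
        s = clean[i:i + frame_len]
        e = s - processed[i:i + frame_len]
        if np.any(e != 0) and np.any(s != 0):  # skip degenerate frames
            snrs.append(10.0 * np.log10(np.sum(s ** 2) / np.sum(e ** 2)))
    return float(np.mean(snrs))
```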
• Itakura-Saito (IS) distance measure
The IS measure is based on the similarity or difference between the all-pole model of the
clean signal and the corrupted or processed speech signal. This measure penalizes any
mismatch in formant locations while errors in the locations of spectral valleys do not
contribute heavily to the distance. It is computed as shown in the following equation:
$$ d_{IS}(\mathbf{a}_d, \mathbf{a}_\phi) = \frac{\sigma_\phi^2}{\sigma_d^2} \cdot \frac{\mathbf{a}_d^T \mathbf{R}_\phi \mathbf{a}_d}{\mathbf{a}_\phi^T \mathbf{R}_\phi \mathbf{a}_\phi} + \log_{10}\left( \frac{\sigma_d^2}{\sigma_\phi^2} \right) - 1 \qquad (4.3) $$
where $\sigma_d^2$ and $\sigma_\phi^2$ are the all-pole gains for the enhanced and clean speech segments
respectively, $\mathbf{a}_d$ and $\mathbf{a}_\phi$ are the linear prediction coefficient vectors for the enhanced and
clean speech segments respectively, and $\mathbf{R}_d$ and $\mathbf{R}_\phi$ are the autocorrelation matrices of the
enhanced and clean speech segments respectively. This method has a correlation of 0.59
with subjective measures [23]. A typical range of results for the IS measure is 1 to 10,
with lower values indicating lesser distance and better speech quality [23].
The IS distance method was used as the objective measure to evaluate the
performance of the proposed algorithm. The highest 5% of the IS distance values were
discarded, as suggested in [10], to exclude unrealistically high spectral distance values. This
method ensured a reasonable overall measure of performance.
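The IS computation of Eq. 4.3 can be sketched as follows, using the autocorrelation (Levinson-Durbin) method for the LPC analysis; the LPC order of 10 is a common assumption rather than a value stated in the text, and the base-10 logarithm follows the notation above:

```python
import numpy as np

def lpc(frame, order=10):
    """Autocorrelation-method LPC via Levinson-Durbin.  Returns the
    coefficient vector a = [1, a1, ..., ap], the all-pole gain
    (residual energy), and the autocorrelation sequence."""
    n = len(frame)
    r = np.correlate(frame, frame, mode='full')[n - 1:n + order]
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        a_prev = a.copy()
        k = -(r[i] + a_prev[1:i] @ r[i - 1:0:-1]) / err
        a[1:i] = a_prev[1:i] + k * a_prev[i - 1:0:-1]
        a[i] = k
        err *= (1.0 - k * k)
    return a, err, r

def is_distance(clean_frame, proc_frame, order=10):
    """Frame-level IS distance of Eq. 4.3; subscript phi denotes the
    clean frame and d the enhanced/processed frame."""
    a_c, g_c, r_c = lpc(clean_frame, order)
    a_d, g_d, _ = lpc(proc_frame, order)
    # Toeplitz autocorrelation matrix R_phi of the clean frame
    R_c = np.array([[r_c[abs(i - j)] for j in range(order + 1)]
                    for i in range(order + 1)])
    return ((g_c / g_d) * (a_d @ R_c @ a_d) / (a_c @ R_c @ a_c)
            + np.log10(g_d / g_c) - 1.0)
```

For identical clean and processed frames the gain ratio and the matrix quotient are both one, so the distance evaluates to zero.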
To determine the optimal (in terms of speech quality) number of bands, we varied the
number of bands from 1 to 8 and examined speech performance using objective measures.
4.3 Effect of pre-processing strategies
Development of the proposed multi-band spectral subtraction (MBSS) algorithm was carried
out in different stages. The performance of the multi-band subtraction process as defined in
Eqs. 3.7 – 3.10 was evaluated with different pre- and post-processing techniques. Smoothing
the magnitude spectrum and taking a weighted spectral average has been shown to help preserve
speech information and improve speech quality by reducing the variance of the noisy input
spectrum. Averaging also strengthens the speech components of the transitional regions. This
results in reduced amounts of destructive subtraction of the speech signal components by the
imperfect noise estimate.
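The pre-processing stage can be sketched as below: smoothing across frequency followed by a weighted magnitude average across adjacent frames. The window shape and the frame weights here are illustrative assumptions, not the thesis's exact values:

```python
import numpy as np

def preprocess_magnitudes(mags, freq_win=(0.25, 0.5, 0.25),
                          frame_wts=(0.09, 0.25, 0.32, 0.25, 0.09)):
    """Spectral smoothing followed by a weighted magnitude average
    across adjacent frames.  mags: 2-D array, one row per frame of
    magnitude spectrum |Y(k)|."""
    mags = np.asarray(mags, dtype=float)
    # smooth each frame's magnitude spectrum across frequency
    smoothed = np.apply_along_axis(
        lambda m: np.convolve(m, freq_win, mode='same'), 1, mags)
    # weighted average over neighbouring frames to reduce spectral variance
    out = np.zeros_like(smoothed)
    half = len(frame_wts) // 2
    for j in range(len(smoothed)):
        acc = np.zeros(smoothed.shape[1])
        wsum = 0.0
        for w, off in zip(frame_wts, range(-half, half + 1)):
            if 0 <= j + off < len(smoothed):
                acc += w * smoothed[j + off]
                wsum += w
        out[j] = acc / wsum  # renormalise at the signal edges
    return out
```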
Figure 4.5: Sentence “The shop closes for lunch,” after spectral smoothing and magnitude averaging, (above) time plot and (below) the corresponding spectrogram.
The spectrogram in Figure 4.5 shows the effects of smoothing and weighted
magnitude averaging on speech. When compared to the spectrogram of the original signal in
Figure 4.1, it can be observed that the speech spectral components are darker, indicating
higher values of speech magnitude and hence higher speech spectral concentration in those
regions. However, smoothing the estimated magnitude spectrum of noise and the
over-subtraction factors $\alpha_i$ did not result in any significant improvement in signal quality.
Figures 4.6 and 4.7 plot the mean IS distance values for 10 sentences at 5 dB and 0
dB SNR for the MBSS algorithm without and with pre-processing respectively, as a function
of the number of bands used. The IS distance shows a marked improvement as the number
of bands increased from 1 to 4. The error bars indicate standard deviations. The improvement
in speech quality is also marked.
Figure 4.6: Mean IS distance measure of the MBSS approach with linear frequency spacing and without pre-processing, as a function of the number of bands for 10 sentences
embedded in speech-shaped noise at (a) 5 dB SNR and (b) 0 dB SNR.
Figure 4.7: Mean IS distance measure of the MBSS approach with linear frequency spacing and with pre-processing, as a function of the number of bands for 10 sentences embedded
in speech-shaped noise at (a) 5 dB SNR and (b) 0 dB SNR.
However, the enhanced speech pre-processed using the above-mentioned techniques
has significantly reduced musical noise. This is also evident in Figures 4.8 and 4.9, which
illustrate the effect of using pre-processing strategies. The spectrograms at the top of Figures
4.8 and 4.9 represent the enhanced speech for the sentence “The shop closes for lunch,” at 5
dB and 0dB SNRs respectively after being processed through the MBSS algorithm using four
linearly spaced frequency bands without spectral smoothing and weighted magnitude
averaging. The bottom spectrograms are obtained after processing the noisy speech through
the same configuration of MBSS, but with spectral smoothing and weighted magnitude
averaging as a pre-processing strategy. The top spectrograms exhibit high levels of residual
noise in the enhanced spectrum, whereas the bottom spectrograms have relatively lower
residual noise levels.
Figure 4.8: Spectrograms of processed speech of the sentence “The shop closes for lunch,” at 5 dB SNR, using MBSS with four linearly spaced frequency bands, (above) without pre-
processing and (below) with smoothing and weighted magnitude averaging.
Figure 4.9: Spectrograms of processed speech of the sentence “The shop closes for lunch,” at 0 dB SNR, using MBSS with four linearly spaced frequency bands, (above) without pre-
processing and (below) with smoothing and weighted magnitude averaging.
In addition to reduced residual noise, informal listening tests showed that the
enhanced speech obtained by the use of the above-mentioned pre-processing techniques shows a
significant improvement in speech quality. Hence, all later versions of the MBSS algorithm
referred to in sections 4.4 and 4.5 incorporate the spectral smoothing and the weighted
spectral average as the pre-processing strategy.
4.4 Effect of frequency spacing
The central idea behind the development of the proposed algorithm is that the enhancement
process is more effective and accurate when carried out over different frequency bands rather
than over the whole spectrum taken as a single band. The process of splitting the speech
signal into different bands can be performed in the time domain by using band-pass filters or
in the frequency domain by using appropriate windows. The latter method was adopted
because it is computationally more economical and technically more reasonable to
implement considering the frequency domain implementation of the subtraction stage in the
proposed method.
Three frequency spacing techniques, viz. linear, logarithmic and mel spacing, were
evaluated for the MBSS method. In the linear spacing of frequency bands, the speech
bandwidth is divided into N linearly spaced frequency bands. The logarithmic and mel
spacing are non-linear frequency scales that approximate the sensitivity of the human ear. In
logarithmic frequency spacing, the center frequencies are distributed logarithmically over the
speech bandwidth. The upper and lower band frequencies are non-overlapping. The mel is a
psychoacoustic unit of measure for pitch as perceived by the human ear. The mapping
between the mel scale and real frequency is non-linear, reflecting the non-linearity of the
human ear. The center frequencies for the corresponding frequency spacing methods are
given in Table 4.2.
The performance of the MBSS algorithm was evaluated, as described in section 4.2,
for each frequency spacing allocation. The plots for the IS distance values obtained for
linear, logarithmic and mel spacing are given in Figures 4.7, 4.10 and 4.11 respectively. The
IS distance shows a consistent improvement as the number of bands is increased. However,
the improvement in speech quality is not very obvious when more than four frequency
bands are employed. All three spacing methods exhibited comparable performance results
and speech quality. However, logarithmic spacing caused some distortion in the lower
frequency ranges. This can be established from Figures 4.12 and 4.13, which show the
spectrograms for the sentence “The shop closes for lunch,” at 5 dB and 0 dB respectively,
processed by the MBSS algorithm using the three spacing methods with four frequency
bands. The top spectrograms show the processed speech using linear spacing. The middle
spectrograms illustrate the processed speech for logarithmic spacing. It can be observed that
there is some removal of speech in the lower frequencies. This can be explained by
observing the center frequencies listed for logarithmic spacing in Table 4.2. There is a higher
concentration of bands in the lower frequency region as the number of bands is increased,
resulting in disproportionate subtraction of lower frequencies. The bottom spectrograms
represent speech processed by mel spacing. This is comparable to the spectrograms on the
top parts of the figures.
Center Frequencies (kHz)
Number of bands | Linear Spacing | Logarithmic Spacing | Mel Spacing
1 | 2.0 | 2.0005 | 2.5798
2 | 1.0, 3.0 | 0.0321, 2.0316 | 1.2476, 2.9208
3 | 0.6667, 2.0, 3.3334 | 0.0084, 0.1339, 2.1260 | 0.8058, 1.7133, 3.1335
4 | 0.5, 1.5, 2.5, 3.5 | 0.0045, 0.0356, 0.2831, 2.2515 | 0.5915, 1.1911, 2.0492, 3.2772
5 | 0.4, 1.2, 2.0, 2.8, 3.6 | 0.0031, 0.0164, 0.0863, 0.4532, 2.3807 | 0.4661, 0.9066, 1.5006, 2.3012, 3.3804
6 | 0.3333, 1.0, 1.6667, 2.3333, 3.0, 3.6667 | 0.0025, 0.0099, 0.0396, 0.1576, 0.6280, 2.5020 | 0.3841, 0.7295, 1.1757, 1.7520, 2.4964, 3.4580
7 | 0.2857, 0.8571, 1.4286, 2.0, 2.5714, 3.1429, 3.7143 | 0.0021, 0.0070, 0.0228, 0.0747, 0.2442, 0.7986, 2.6116 | 0.3264, 0.6092, 0.9630, 1.4056, 1.9592, 2.6519, 3.5184
8 | 0.25, 0.75, 1.25, 1.75, 2.25, 2.75, 3.25, 3.75 | 0.0019, 0.0054, 0.0152, 0.0428, 0.1208, 0.3407, 0.9607, 2.7092 | 0.2838, 0.5225, 0.8138, 1.1693, 1.6031, 2.1325, 2.7785, 3.5668
Table 4.2: Center frequency values for linear, logarithmic and mel spacing of frequency bands.
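The linear and mel band allocations can be sketched as below. The mel mapping uses the common 2595·log10(1 + f/700) formula as an assumption, so the mel values need not reproduce Table 4.2 exactly; the midpoints of the linearly spaced edges, however, do match the table's linear row:

```python
import numpy as np

def hz_to_mel(f):
    # common O'Shaughnessy mapping; an assumption, not the thesis's stated formula
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)

def band_edges(n_bands, f_max=4000.0, spacing='linear'):
    """Non-overlapping band edges over 0..f_max Hz: uniform in Hz for
    'linear', uniform on the mel scale for 'mel'."""
    if spacing == 'linear':
        return np.linspace(0.0, f_max, n_bands + 1)
    if spacing == 'mel':
        return mel_to_hz(np.linspace(0.0, hz_to_mel(f_max), n_bands + 1))
    raise ValueError(spacing)

def band_centers(n_bands, f_max=4000.0, spacing='linear'):
    e = band_edges(n_bands, f_max, spacing)
    return 0.5 * (e[:-1] + e[1:])
```

For four linear bands over a 4 kHz bandwidth this yields centers of 0.5, 1.5, 2.5 and 3.5 kHz, as in the table.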
Figure 4.10: Mean IS distance measure of the MBSS approach with logarithmic frequency spacing as a function of the number of bands for 10 sentences embedded in speech-
shaped noise at (a) 5 dB SNR and (b) 0 dB SNR.
Figure 4.11: Mean IS distance measure of the MBSS approach with mel frequency spacing as a function of the number of bands for 10 sentences embedded in speech-shaped noise at
(a) 5 dB SNR and (b) 0 dB SNR.
Figure 4.12: Comparison of spectrograms of enhanced speech at 5 dB SNR processed with the MBSS algorithm using four bands with (top) linear spacing, (middle) logarithmic
spacing and (bottom) mel spacing.
Figure 4.13: Comparison of spectrograms of enhanced speech at 0 dB SNR processed with the MBSS algorithm using four bands with (top) linear spacing, (middle) logarithmic
spacing and (bottom) mel spacing.
4.5 Performance with speech-silence detector
The error between the processed signal and the clean speech signal is minimized if the
estimate of the noise spectrum is accurate. Hence, it is desirable to estimate the noise signal
at every available instant to get a more accurate estimate of the noise spectrum. This is not a
problem in dual channel implementations since the noise signal is exclusively made available
in a second channel. However, in single channel methods, the noise signal has to be
estimated from the noisy speech signal itself due to the non-availability of a second noise
channel. Hence a voice activity detector (VAD) is required that will identify those frames of
the input signal in which the speaker is silent. These frames are assumed to contain only
the interfering noise, and the noise spectrum is updated during them. The VAD has to
accurately identify such periods of silence to prevent calculating an erroneous update of the
noise spectrum with parts of the speech signal, since subsequent subtraction will remove the
speech signal from the succeeding input frames. Therefore, the VAD is only required to
detect pauses between words or sentences and not transitions between phonemes or words.
Hence, the method is also called the speech-silence detector.
The MBSS algorithm with linear frequency spacing was evaluated with a speech-
silence detector as per [12]. This detector is based on a statistical model-based voice activity
detection method to detect non-speech frames. The method computes the likelihood ratio of
speech being present or absent in the input frame as:
$$ \frac{1}{N} \sum_{k=0}^{N-1} \left[ \frac{Y^2(k)}{\hat{D}^2(k)} - \log_{10}\!\left( \frac{Y^2(k)}{\hat{D}^2(k)} \right) - 1 \right] \; \mathop{\gtrless}_{\text{speech absent}}^{\text{speech present}} \; \eta \qquad (4.4) $$
where $\eta$ is a preset threshold. When speech was absent in the $j$-th frame, the noise spectrum
was updated as:
$$ \hat{D}_j^2(k) = \lambda_d \cdot \hat{D}_{j-1}^2(k) + (1 - \lambda_d) \cdot Y_j^2(k) \qquad (4.5) $$
where $\lambda_d = 0.9$.
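One frame of the detector of Eq. 4.4 and the noise update of Eq. 4.5 can be sketched as follows; λ_d = 0.9 follows the text, while the threshold value is illustrative:

```python
import numpy as np

def vad_update(noise_psd, noisy_spec, eta=0.15, lam_d=0.9):
    """Speech-silence decision (Eq. 4.4) and noise-spectrum update
    (Eq. 4.5) for one input frame."""
    y2 = np.abs(noisy_spec) ** 2      # Y^2(k)
    ratio = y2 / noise_psd            # Y^2(k) / D_hat^2(k)
    # likelihood statistic of Eq. 4.4 (log base 10, as in the text)
    stat = np.mean(ratio - np.log10(ratio) - 1.0)
    speech_present = stat > eta
    if not speech_present:
        # Eq. 4.5: recursive smoothing of the noise power spectrum
        noise_psd = lam_d * noise_psd + (1.0 - lam_d) * y2
    return speech_present, noise_psd
```

When the frame power matches the current noise estimate the statistic is zero and the frame is treated as silence; a much louder frame drives the statistic well above the threshold and the noise estimate is left untouched.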
The IS distance values obtained for the above configuration of the MBSS method are
shown in Figure 4.14 for 5 dB and 0 dB SNRs. As seen earlier, the IS distance values
consistently decrease as the number of bands is increased.
Figure 4.14: Mean IS distance measure of the MBSS approach with linear frequency spacing and speech-silence detector, as a function of the number of bands for 10 sentences
embedded in speech-shaped noise at (a) 5 dB SNR and (b) 0 dB SNR.
Figure 4.15: Spectrograms of speech enhanced with the MBSS algorithm using four linearly spaced frequency bands with a speech-silence detector, at (top) 5 dB SNR and (bottom) 0
dB SNR.
Global and segmental SNR values were calculated for both noise conditions for all
the four versions of the MBSS algorithm for four frequency bands. The gain in the mean
SNR values calculated over the ten sentences was seen to be consistent and comparable for
all four versions (see Table 4.3).
4.6 Subjective evaluation of speech intelligibility
Subjective tests are conducted by having human subjects listen to the prepared test
speech files and evaluate them based on specified criteria. Intelligibility tests were carried out at the
Callier Center for Communication Disorders / UTD on seven hearing-impaired subjects with
severe to profound hearing loss.
Table 4.3: Mean global and segmental SNR calculated over 10 sentences at 5 and 0 dB SNR.
Speech enhanced by the MBSS algorithm with four linearly spaced frequency bands
was evaluated against the noisy speech. Twenty different sentences at 20 kHz were used for
each condition. The sentences were corrupted using speech-shaped noise at 0 dB SNR. The
noise-corrupted sentences were played in a random order through speakers in a sound-
insulated booth. The sound pressure level was maintained at an average level of 67 dB SPL
with a variance of 2 dB. The subjects were asked to repeat the sentence they heard.
Intelligibility was measured in terms of percentage of words correct.
Figure 4.16 gives the bar plots of the intelligibility scores achieved by each subject
and the average score for both the test conditions. The tests showed no increase in
intelligibility of the corrupted speech after processing by the speech enhancement algorithm.
This result is in accordance with those obtained by other spectral subtraction methods [5]. On
average, the subjects’ scores showed a decrease of 22% for the enhanced speech.
However, one subject (S6) actually scored better on the test with the speech enhanced by the
MBSS algorithm.
Figure 4.16: Intelligibility test results for seven subjects scored on percentage words correct.
4.7 Optimal configuration
The results obtained by objective evaluation are an indicator of the best speech quality that
can be obtained by the different configurations of the algorithm. From the performance plots
(Figures 4.6, 4.7, 4.10, 4.11 and 4.14) of the mean IS distance of the five configurations
mentioned above, it is evident that the MBSS algorithm provides the best performance when
the subtraction process is preceded by spectral smoothing and magnitude spectral averaging
with linear spacing of the frequency bands. A speech-silence detector is necessary if a
practical implementation is considered. The speech-silence detector provides a better
estimate of the noise when the noise spectrum is updated between speech pauses.
For comparative purposes, performance of the traditional power spectral subtraction
(PSS) method as implemented by Berouti et al. [2] is given in Figure 4.17 along with IS
measures for the proposed method. The proposed multi-band spectral subtraction approach
(with number of bands > 3) performed better than the PSS approach for both SNRs.
Figure 4.17: Comparison of the performance, in terms of mean IS distance measure, of the power spectral subtraction (indicated with 'PSS') with the multi-band spectral
subtraction approach as a function of the number of bands for 10 sentences embedded in speech-shaped noise at (a) 5 dB SNR and (b) 0 dB SNR.
The performance obtained with power spectral subtraction is indicated with 'PSS'.
While the IS distance does show a slight increase in performance for a higher number of bands,
there was no perceivable improvement in speech quality. Informal listening tests indicated
that the multi-band approach yielded very good speech quality with very little trace of
musical noise and with minimal, if any, speech distortion.
The lack of musical noise can also be seen in Figures 4.18 and 4.19, which show the
spectrograms of enhanced speech obtained with multi-band spectral subtraction (4 bands)
and enhanced speech obtained with power spectrum subtraction.
Figure 4.18: Spectrogram of the sentence ''The shop closes for lunch.'' at 5 dB SNR. The top spectrogram is the corrupted signal, the middle spectrogram is the enhanced signal
obtained by the multi-band spectral subtraction method using 4 linearly spaced frequency bands, and the bottom spectrogram is the enhanced signal obtained by the power spectral
subtraction method.
Figure 4.19: Spectrogram of the sentence ''The shop closes for lunch.'' at 5 dB SNR. The top spectrogram is the corrupted signal, the middle spectrogram is the enhanced signal
obtained by the multi-band spectral subtraction method using 4 linearly spaced frequency bands, and the bottom spectrogram is the enhanced signal obtained by the power spectral
subtraction method.
CHAPTER FIVE
SUMMARY AND CONCLUSIONS
The work in this thesis addressed the problem of enhancing speech in noisy conditions. A
multi-band spectral subtraction method, based on the direct estimation of the short-term
spectral amplitude of speech and the non-uniform effect of noise on speech, was proposed.
The results establish the superiority of the proposed method over the conventional spectral
subtraction method with respect to speech quality of the enhanced signal and reduced
residual noise.
The major contributions of this thesis are:
(a) Development of a multi-band speech enhancement strategy based on the spectral
subtraction method. Speech processed by the new algorithm shows reduced levels of
residual noise and good speech quality.
(b) Proposal of a band subtraction factor $\delta_i$ that provides greater control over the subtraction
process in each band and can be tuned to minimize speech distortion.
(c) Evaluation of various pre-processing strategies for improving the output speech quality.
It was shown that spectral smoothing and weighted spectral averaging of the input
speech spectrum helped preserve the speech content and improved speech quality.
(d) Assessment of linear and non-linear frequency spacing techniques. Linear and mel
frequency spacing methods provide consistently good results.
(e) Determination of the optimal number of frequency bands for the MBSS algorithm. Results
showed that the speech quality obtained by using four frequency bands is comparable to
that obtained for a higher number of bands. This conclusion allows a significant reduction in
computation as compared to using critical bands (23 bands) or a separate band for
every frequency component of the FFT, e.g., 256, 512 or 1024 bands.
The multi-band spectral subtraction method provides a definite improvement over the
conventional power spectral subtraction method and does not suffer from musical noise. The
improvement can be attributed to the fact that the multi-band approach takes into account the
non-uniform effect of colored noise on the spectrum of speech. The added computational
complexity of the algorithm is minimal. Four linearly spaced frequency bands were found to
be adequate in obtaining good speech quality.
Further research can be conducted to adaptively calculate the value of the band
subtraction factor $\delta_i$, in place of the empirically derived value proposed in this thesis.
The algorithm can be implemented in real-time on a fixed point Digital Signal Processor
(DSP) (e.g., the Texas Instruments TMS320C54x/55x) platform for evaluation in real-world
conditions. This would require a detailed quantization analysis of the algorithm. Fixed-point
DSPs are becoming increasingly popular in applications such as cellular-phones, personal
entertainment devices, digital hearing aids and headsets due to their low-power consumption
and high processing rates. Speech enhancement algorithms are a major component of these
applications for operation in adverse environments. The proposed method can eventually be
incorporated into such systems. However, these applications also demand low MIPS (Million
Instructions Per Second), i.e., low number of operations, to conserve battery life. Hence a
study can be made to optimize the processes involved. For instance, an alternative method to
calculate the over-subtraction factor $\alpha_i$ can be researched, because the use of the log function
is computationally expensive in real-time systems. Also, methods can be developed to
preserve the transitional regions and unvoiced regions, which contain low speech levels.
BIBLIOGRAPHY
[1] L. Arslan, A. McCree and V. Viswanathan, “New methods for adaptive noise
suppression,” ICASSP, vol.1, pp. 812-815, May 1995.
[2] M. Berouti, R. Schwartz and J. Makhoul, “Enhancement of speech corrupted by
acoustic noise,” Proc. IEEE Int. Conf. on Acoust., Speech, Signal Procs., pp. 208-
211, Apr. 1979.
[3] S. Boll, “Suppression of acoustic noise in speech using spectral subtraction,” IEEE
Trans. Acoust., Speech, Signal Process., vol.27, pp. 113-120, Apr. 1979.
[4] Y. Cheng and D. O'Shaughnessy, “Speech enhancement based conceptually on
auditory evidence,” ICASSP, vol.2, pp. 961-964, Apr. 1991.
[5] J. Deller Jr., J. Hansen and J. Proakis, “Discrete-Time Processing of Speech Signals”,
NY: IEEE Press, 2000.
[6] Y. Ephraim, “Statistical-model-based speech enhancement systems,” Proc. IEEE, vol.
80, No.10, pp. 1526-1555, Oct.1992.
[7] Y. Ephraim and D. Malah, “Speech enhancement using a minimum mean-square
error short-term spectral amplitude estimator,” IEEE Trans. on Acoust., Speech,
Signal Proc., vol.ASSP-32, No.6, pp. 1109-1121, Dec.1984.
[8] Y. Ephraim and H. Van Trees, “A signal subspace approach for speech
enhancement,” IEEE Trans. Speech Audio Procs., vol. 3, pp. 251-266, Jul. 1995.
[9] Z. Goh, K.Tan and T. Tan, “Postprocessing method for suppressing musical noise
generated by spectral subtraction,” IEEE Trans. Speech Audio Procs., vol. 6, pp. 287-
292, May 1998.
[10] J. Hansen and B. Pellom, “An effective quality evaluation protocol for speech
enhancements algorithms,” Inter. Conf. on Spoken Language Processing, vol.7, pp.
2819-2822, Sydney, Australia, Dec.1998.
[11] C. He and G. Zweig, “Adaptive two-band spectral subtraction with multi-window
spectral estimation,” ICASSP, vol.2, pp. 793-796, 1999.
[12] Y. Hu, M. Bhatnagar and P. Loizou, “A cross-correlation technique for enhancing
speech corrupted with correlated noise,” ICASSP, vol. 1, pp. 673-676, 2001.
[13] S. Kamath and P. Loizou, “A multi-band spectral subtraction method for enhancing
speech corrupted by colored noise,” submitted to ICASSP 2002.
[14] P. Kasthuri, “Multichannel speech enhancement for a digital programmable hearing
aid,” Master’s thesis, University of New Mexico, 1999.
[15] W. Kim, S. Kang and H. Ko, “Spectral subtraction based on phonetic dependency and
masking effects,” Proc. IEEE Vis. Image Signal Procs., vol. 147, No. 5, Oct. 2000.
[16] H. Levitt, “Noise reduction in hearing aids: An overview”, Journal of Rehabilitation
Research and Development, vol. 38, No. 1, January/February 2001.
[17] J. Lim and A. Oppenheim, “All-pole modeling of degraded speech,” IEEE
Transactions on Acoustics, Speech and Signal Processing, vol. 26, No. 3, pp. 197-
210, June 1978.
[18] J. Lim and A. Oppenheim, “Enhancement and bandwidth compression of noisy
speech,” Proc. IEEE, vol. 67, No. 12, pp. 221-239, Dec. 1979.
[19] P. Lockwood and J. Boudy, “Experiments with a nonlinear spectral subtractor (NSS),
Hidden Markov Models and the projection, for robust speech recognition in cars,”
Speech Communication, Vol. 11, Nos. 2-3, pp. 215-228, 1992.
[20] G. Miller and P. Nicely, “An analysis of perceptual confusions among some English
consonants,” Jour. Acoust. Soc. America, vol. 27, No. 2, pp. 338-352, March 1955.
[21] M. Nilsson, S. Soli and J. Sullivan, “Development of the hearing in noise test for the
measurement of speech reception thresholds in quiet and in noise,” J. Acoust. Soc.
Am., vol.95, pp. 1085-1099, 1994.
[22] T. Peterson and S. Boll, “Acoustic noise suppression in the context of a perceptual
model”, Proc. IEEE Inter. Conf. Acoust. Speech Signal Procs., pp. 1086-1088, 1981.
[23] S. Quackenbush, T. Barnwell, and M. Clements, “Objective Measures for Speech
Quality Testing,” Prentice-Hall, 1988.
[24] A. Rezayee and S. Gazor, “An adaptive KLT approach for speech enhancement,”
IEEE Trans. Speech Audio Processing, vol. 9, pp. 87-95, Feb. 2001.
[25] M. Sambur, “Adaptive noise canceling for speech signals,” IEEE Transactions on
Acoustics, Speech and Signal Processing, vol. 26, pp. 419-423, 1978.
[26] S. Savadatti, “Real time, fixed-point implementation of multi-channel speech
amplitude compression,” Master’s thesis, University of New Mexico, 2000.
[27] M. Schroeder, “Models of hearing,” Proc. IEEE, vol. 63, No. 9, pp. 1332-1350, Sept.
1975.
[28] I. Soon, S. Koh and C. Yeo, “Selective magnitude subtraction for speech
enhancement,” Proc. The Fourth International Conference/Exhibition on High
Performance Computing in the Asia-Pacific Region, vol.2, pp. 692-695, 2000.
[29] N. Virag, “Single channel speech enhancement based on masking properties of the
human auditory system,” IEEE Trans. Speech and Audio Processing, vol. 7, pp 126-
137, March 1999.
[30] N. Virag, “Speech enhancement based on masking properties of the human auditory
system,” Master’s thesis, Swiss Federal Institute of Technology, 1996.
[31] K. Wu and P. Chen, “Efficient speech enhancement using spectral subtraction for car
hands-free application,” International Conference on Consumer Electronics, vol. 2,
pp. 220-221, 2001.
VITA
Sunil Kamath was born in Mumbai, India on December 27, 1974, the son of the late Shri
Mangalore Devdas Pandurang Kamath and Shrimati Geetha Kamath. After
completing his pre-university education from Canara Pre-University College,
Mangalore, he joined the Karnatak University, Dharwad India, where he received the
Bachelor’s degree in Electrical and Electronic Engineering in 1996. He worked as a
network engineer at Microland Ltd., India until 1999.
He was admitted to the Master’s program at the University of New Mexico,
Albuquerque in the Department of Electrical Engineering in August 1999. He
transferred to the Master’s program in the Electrical Engineering Department at the
University of Texas at Dallas in August 2000. He has been working with the Callier
Institute of Communication Disorders on speech enhancement in hearing aids since
August 2000.