Post on 14-Jan-2016
Quantile Based Histogram Equalization for Noise Robust Speech Recognition
by Diplom-Physiker Florian Erich Hilger
from Bonn - Bad Godesberg
Reviewer: Univ.-Prof. Dr.-Ing. Hermann Ney
Presenter : Chen Hung_Bin
December 2004
2
Outline
Histogram Normalization
Quantile Based Histogram Equalization
Experiment
Conclusion
3
Histogram Normalization
Histogram normalization is a general non-parametric method that makes the cumulative distribution function (CDF) of some given data match a reference distribution.
It is used to reduce a possible mismatch between the distribution of the incoming test data and the distribution of the training data, which serves as the reference.
4
Histogram Normalization
The mismatch between the test and the training data distributions is caused by the different acoustic conditions.
The two CDFs can be used directly to define a transformation.
If $P$ is the CDF of the current test data and $P_{\mathrm{train}}^{-1}$ the inverse reference CDF of the training data, the transformation is:
$$\hat{Y} = P_{\mathrm{train}}^{-1}(P(Y))$$
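Mapping a test value through the test CDF and then through the inverse training CDF can be sketched as follows. This is a minimal illustration on empirical CDFs, not the implementation used in the thesis; the function names and the nearest-rank inverse are assumptions made for the example.

```python
import math
from bisect import bisect_right


def empirical_cdf(sorted_data, value):
    """Fraction of samples in sorted_data that are <= value."""
    return bisect_right(sorted_data, value) / len(sorted_data)


def inverse_cdf(sorted_data, p):
    """Smallest sample whose empirical CDF value reaches p."""
    n = len(sorted_data)
    # Nearest-rank inverse; the small epsilon guards against float round-off.
    index = math.ceil(p * n - 1e-9) - 1
    return sorted_data[min(max(index, 0), n - 1)]


def histogram_normalize(test_data, train_data):
    """Map each test value y to P_train^{-1}(P(y))."""
    test_sorted = sorted(test_data)
    train_sorted = sorted(train_data)
    return [inverse_cdf(train_sorted, empirical_cdf(test_sorted, y))
            for y in test_data]


noisy = [12.0, 10.0, 13.0, 11.0]  # test data, shifted by noise
clean = [1.0, 2.0, 3.0, 4.0]      # training (reference) data
normalized = histogram_normalize(noisy, clean)
```

After the mapping, the rank order of the noisy values is preserved while their distribution matches the reference data.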
5
Histogram Normalization
Example for the cumulative distribution functions of a clean and noisy signal.
The arrows show how an incoming noisy value is transformed based on these two cumulative distribution functions.
6
Histogram Normalization
Two-pass method:
Two separate histograms, one for silence and the other for speech, can be estimated on the training data.
Then a first recognition pass can be used to determine the amount of silence in the recognition utterances.
Based on that percentage the appropriate target histogram can be determined.
This requires a sufficiently large amount of data from the same recording environment or noise condition to get reliable estimates for the high-resolution histograms.
7
Histogram Normalization
Two-pass method:
It cannot be used when a real-time response of the recognizer is required, as in command and control applications or spoken dialog systems.
A straightforward solution to this problem is to reduce the number of histogram bins in order to get reliable estimates even with little data. This is the idea behind quantile equalization.
8
Quantile Based Histogram Equalization
Quantiles are very easy to determine by just sorting the sample data set.
Cumulative distributions can be approximated using quantiles.
Example: two cumulative distribution functions with four 25% quantiles, NQ = 4.
9
Quantile Based Histogram Equalization
With NQ = 4, as shown in the example, about one second of data (100 time frames) is already sufficient to get a rough estimate of the cumulative distribution.
Another advantage of the quantiles: even if the data set to be considered consists of very few samples, or in the extreme case just one sample, the quantiles can be calculated without any special modification of the algorithm.
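Determining quantiles by sorting can be sketched as below. The nearest-rank index convention and the function name are assumptions for illustration; the slides do not specify how positions between samples are handled.

```python
def quantiles(samples, num_quantiles=4):
    """Return Q_0 .. Q_NQ, where Q_0 is the minimum and Q_NQ the maximum."""
    ordered = sorted(samples)
    last = len(ordered) - 1
    # Pick the sample closest to each i/NQ position in the sorted data.
    return [ordered[round(i * last / num_quantiles)]
            for i in range(num_quantiles + 1)]


window = [5, 1, 9, 3, 7, 2, 8, 4, 6]
q = quantiles(window)

# The same code runs unchanged on a single sample.
q_single = quantiles([2.5])
```

With only one sample all quantiles collapse onto that value, so no special case is needed.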
10
Quantile Based Histogram Equalization
The quantiles of the incoming data and the corresponding reference quantiles of the training data define a set of points that can be used to determine the parameters $\theta$ of a transformation function $T$. This function transforms the incoming data $Y$ to $\tilde{Y}$ and thus reduces the mismatch between the test and training data quantiles:
$$\tilde{Y} = T(Y, \theta)$$
Applying a transformation function to make the four training and recognition quantiles match.
11
Quantile Based Histogram Equalization
Within the context of this work the transformation is applied to the output of the Mel-scaled filter bank after applying a 10th root to reduce the dynamic range. In the following, $Y$ denotes the output vector of the filter bank and $Y_k$ correspondingly denotes its $k$-th component.
The incoming filter output values are first scaled down to the interval [0, 1] by dividing by the largest quantile $Q_{k,N_Q}$. After the power function transformation is applied, the values are scaled back to the original range:
$$\tilde{Y}_k = T_k(Y_k, \theta_k) = Q_{k,N_Q} \left( \frac{Y_k}{Q_{k,N_Q}} \right)^{\gamma_k}$$
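The scale, power, and rescale steps can be sketched as a single function. The names `power_transform`, `q_max` (the largest quantile of the filter channel) and `gamma` (the exponent) are chosen for this illustration only.

```python
def power_transform(y, q_max, gamma):
    """Scale y to [0, 1] by q_max, apply the exponent gamma, scale back."""
    return q_max * (y / q_max) ** gamma


half = power_transform(2.0, 4.0, 2.0)       # value at half of the maximum
top = power_transform(4.0, 4.0, 3.0)        # the maximum maps onto itself
unchanged = power_transform(2.0, 4.0, 1.0)  # gamma = 1 is the identity
```

Because the values are scaled by the largest quantile, the maximum is a fixed point and only values below it are moved.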
12
Quantile Based Histogram Equalization
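With the power function alone, small values are scaled down even further towards zero:
$$\tilde{Y}_k = T_k(Y_k, \theta_k) = Q_{k,N_Q} \left( \frac{Y_k}{Q_{k,N_Q}} \right)^{\gamma_k}$$
Small amplitude differences will then be enhanced considerably if a logarithm is applied afterwards, which contradicts the desired compression of the signal to a smaller range.
The transformation function that will always be used within the context of this work therefore combines the power function with a linear term:
$$\tilde{Y}_k = T_k(Y_k, \theta_k) = Q_{k,N_Q} \left( \alpha_k \left( \frac{Y_k}{Q_{k,N_Q}} \right)^{\gamma_k} + (1 - \alpha_k) \frac{Y_k}{Q_{k,N_Q}} \right)$$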
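A weighted combination of a power term and a linear term, as motivated above, can be sketched like this. The function name `qe_transform` and the weight name `alpha` are assumptions for the example; the comparison at the end only illustrates how the linear term keeps small values from collapsing towards zero.

```python
def qe_transform(y, q_max, alpha, gamma):
    """Weighted sum of a power term and a linear term on the scaled value."""
    scaled = y / q_max
    return q_max * (alpha * scaled ** gamma + (1.0 - alpha) * scaled)


mixed = qe_transform(2.0, 4.0, 0.5, 2.0)
linear_only = qe_transform(2.0, 4.0, 0.0, 3.0)  # alpha = 0: identity
power_only = qe_transform(2.0, 4.0, 1.0, 2.0)   # alpha = 1: pure power function

# A small input value under a strong exponent, with and without the linear term.
small_power = qe_transform(0.04, 4.0, 1.0, 3.0)
small_mixed = qe_transform(0.04, 4.0, 0.5, 3.0)
```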
13
Quantile Based Histogram Equalization
Both transformation parameters $\theta_k = \{\alpha_k, \gamma_k\}$ are jointly optimized to minimize the squared distance between the current quantiles $Q_{k,i}$ and the training quantiles $Q_i^{\mathrm{train}}$:
$$\theta_k = \operatorname*{argmin}_{\theta'_k} \sum_{i=1}^{N_Q - 1} \left( T_k(Q_{k,i}, \theta'_k) - Q_i^{\mathrm{train}} \right)^2$$
The minimum is determined with a simple grid search over the ranges
$$\alpha_k \in [0, 1], \quad \gamma_k \in [1, \gamma_{\max}]$$
The step size for the grid search can be set to a value on the order of 0.01.
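The joint optimization by grid search can be sketched as follows. This is an illustrative version under stated assumptions: the weight is searched in [0, 1], the exponent in [1, `gamma_max`], and the default `gamma_max=3.0` is a placeholder, since the slides give no value for the upper limit. The largest quantile is left out of the cost because the transformation maps it onto itself.

```python
def qe_transform(y, q_max, alpha, gamma):
    """Power term plus linear term on the value scaled by q_max."""
    scaled = y / q_max
    return q_max * (alpha * scaled ** gamma + (1.0 - alpha) * scaled)


def fit_parameters(current_q, train_q, gamma_max=3.0, step=0.01):
    """Grid search for (alpha, gamma) given quantiles Q_1 .. Q_NQ (ascending)."""
    q_max = current_q[-1]
    best_alpha, best_gamma, best_cost = 0.0, 1.0, float("inf")
    for a in range(round(1.0 / step) + 1):
        alpha = a * step
        for g in range(round((gamma_max - 1.0) / step) + 1):
            gamma = 1.0 + g * step
            # Squared distance over the quantiles below the maximum.
            cost = sum((qe_transform(q, q_max, alpha, gamma) - t) ** 2
                       for q, t in zip(current_q[:-1], train_q[:-1]))
            if cost < best_cost:
                best_alpha, best_gamma, best_cost = alpha, gamma, cost
    return best_alpha, best_gamma


current = [1.0, 2.0, 3.0, 4.0]    # quantiles of the incoming data
train = [0.625, 1.5, 2.625, 4.0]  # reference quantiles (synthetic example)
alpha, gamma = fit_parameters(current, train)
```

With four quantiles and a step of 0.01 the search evaluates roughly twenty thousand parameter pairs per filter channel, which is cheap enough to repeat for every window.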
14
Quantile Based Histogram Equalization
Example: output of the 6th Mel-scaled filter over time for a sentence from the Aurora 4 test set.
In this case the grid search yields the values 0.1 and 4.1.
Cumulative distributions of the signals.
15
Quantile Based Histogram Equalization
Combine neighboring filter channels: a linear combination of a filter with its left and right neighbor can be used to further reduce the remaining difference.
$\tilde{Y}_k$ are the filter output values and $\tilde{Q}_{k,i}$ the recognition quantiles after the preceding power function transformation.
The combination factors are denoted $\lambda_k$ for the left neighbors and $\rho_k$ for the right neighbors.
With $\tilde{\theta}_k = \{\lambda_k, \rho_k\}$ the transformation step can be written as:
$$\hat{Y}_k = \tilde{T}_k(\tilde{Y}_k, \tilde{\theta}_k) = (1 - \lambda_k - \rho_k)\,\tilde{Y}_k + \lambda_k \tilde{Y}_{k-1} + \rho_k \tilde{Y}_{k+1}$$
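The linear combination of a filter with its left and right neighbor can be sketched for one time frame as below. The names `lam` and `rho` stand for the left and right combination factors, and the treatment of the first and last filter (reusing the filter's own value in place of the missing neighbor) is an assumption of this example, not something stated on the slide.

```python
def combine_neighbors(values, lam, rho):
    """Mix each filter output with its left and right neighbor."""
    n = len(values)
    combined = []
    for k in range(n):
        # Assumption: edge filters reuse their own value for a missing neighbor.
        left = values[k - 1] if k > 0 else values[k]
        right = values[k + 1] if k < n - 1 else values[k]
        combined.append((1.0 - lam[k] - rho[k]) * values[k]
                        + lam[k] * left + rho[k] * right)
    return combined


frame = [1.0, 2.0, 4.0]  # transformed filter outputs of one frame
smoothed = combine_neighbors(frame, [0.25] * 3, [0.25] * 3)
untouched = combine_neighbors(frame, [0.0] * 3, [0.0] * 3)
```

Because the three weights sum to one, setting both factors to zero leaves the filter outputs unchanged.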
16
Quantile Based Histogram Equalization
Comparison of the RWTH baseline feature extraction front-end
17
Experiment
Car Navigation: isolated German words recorded in cars; the vocabulary consists of 2100 equally probable words; the training data was recorded in a quiet office environment.
Aurora 3 (SpeechDat Car): continuous digit strings recorded in cars; four languages are available: Danish, Finnish, German, and Spanish.
Aurora 4 (noisy WSJ 5k): utterances read from the Wall Street Journal with various artificially added noises; the vocabulary consists of 5000 words.
18
Comparison of Logarithm and Root Functions
Comparison of the logarithm with different root functions on the isolated-word Car Navigation database.
LOG: logarithm, CMN: cepstral mean normalization, 2nd - 20th: root instead of logarithm, FMN: filter mean normalization.
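Two of the operations abbreviated in the legend, the n-th root used instead of the logarithm and filter mean normalization (FMN), can be sketched as follows. The function names are hypothetical, and subtracting the mean over the whole sequence of one filter channel is an assumption; the slides do not state the normalization window.

```python
def root_compress(filter_output, root=10):
    """Compress the dynamic range with an n-th root instead of a logarithm."""
    return filter_output ** (1.0 / root)


def mean_normalize(track):
    """Subtract the mean over time from one filter channel."""
    mean = sum(track) / len(track)
    return [value - mean for value in track]


compressed = root_compress(1024.0)          # 10th root
normalized = mean_normalize([1.0, 2.0, 3.0])
```

Unlike the logarithm, the root stays finite and close to zero for very small filter outputs.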
19
Comparison of Logarithm and Root Functions
Comparison of logarithm and 10th root on Aurora 3 database
WM: well matched, MM: medium mismatch, HM: high mismatch, FMN: filter mean normalization
20
Comparison of Logarithm and Root Functions
Comparison of the logarithm with different root functions on the Aurora 4 noisy WSJ 16 kHz database.
LOG: logarithm, CMN: cepstral mean normalization, 2nd - 20th: root instead of logarithm, FMN: filter mean normalization.
21
Experiment - Quantile Equalization
Recognition results on the Car Navigation database with quantile equalization
LOG: logarithm, CMN: cepstral mean normalization, 10th: root instead of logarithm, FMN: filter mean normalization, QE: quantile equalization, QEF(2): quantile equalization with filter combination (2 neighbors).
22
Experiment - Quantile Equalization
Comparison of quantile equalization with histogram normalization on the Car Navigation database.
QE train: applied during training and recognition. HN: speaker session wise histogram normalization, HN sil: histogram normalization dependent on the amount of silence, ROT: feature space rotation.
23
Comparison of QE and HN
Cumulative distribution function of the 6th filter output.
HN: after histogram normalization, QE: after quantile equalization. Clean: data from test set 1, noisy: test set 12.
24
Experiment - Quantile Equalization
Recognition results on the Car Navigation database for different numbers of quantiles.
10th: root instead of logarithm, FMN: filter mean (and variance) normalization, QE: quantile equalization with NQ quantiles, QEF: quantile equalization with filter combination.
25
Experiment - Quantile Equalization
Comparison of the logarithm in the feature extraction with different root functions on the Car Navigation database.
2nd - 20th: root instead of logarithm, FMN: filter mean normalization, QE: quantile equalization, QEF: quantile equalization with filter combination.
26
Conclusion
Replacing the logarithm in the feature extraction by a root function significantly increased the recognition performance on noisy data.
Using four quantiles (NQ = 4) can be recommended as the standard setup; it can be used on short windows as well as on complete utterances.
Spectral Entropy Feature in Full-Combination Multi-Stream for Robust ASR
Hemant Misra, Hervé Bourlard, IDIAP Research Institute, Martigny, Switzerland
Presenter : Chen Hung_Bin
INTERSPEECH 2005
28
Introduction
Spectral entropy features are computed from sub-bands of the spectrum in order to locate its spectral peaks.
The spectral entropy features are used along with PLP features in a multi-stream framework.
A separate multi-layered perceptron (MLP) is trained for the PLP features.
The approach yields a 9.2% relative error reduction compared to the baseline.
29
Spectral entropy feature
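Entropy measures can be used to capture the “peakiness” of a distribution: a sharp peak will have low entropy, a flat distribution will have high entropy.
The spectrum is converted into a probability mass function (PMF) like function by normalizing it:
$$x_i = \frac{X_i}{\sum_{j=1}^{N} X_j}, \quad i = 1, \ldots, N$$
where $X_i$ is the energy of the $i$-th frequency component of the spectrum $X = (X_1, \ldots, X_N)$. The entropy is then computed as
$$H = -\sum_{i=1}^{N} x_i \log_2 x_i$$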
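Normalizing the spectrum and computing its entropy can be sketched as below. The function name is chosen for illustration, and skipping zero-energy bins (treating 0 log 0 as 0) is a common convention that is assumed here rather than taken from the slides.

```python
import math


def spectral_entropy(spectrum):
    """Entropy in bits of the spectrum normalized to a PMF-like function."""
    total = sum(spectrum)
    pmf = [energy / total for energy in spectrum]
    # Bins with zero energy contribute nothing to the sum.
    return -sum(p * math.log2(p) for p in pmf if p > 0)


flat = spectral_entropy([1.0, 1.0, 1.0, 1.0])    # flat spectrum
peaked = spectral_entropy([1.0, 0.0, 0.0, 0.0])  # single sharp peak
```

The flat spectrum reaches the maximum entropy for four bins, while the single peak gives zero entropy, matching the “peakiness” interpretation.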
30
Spectral entropy feature
We observe that the entropy computed on the full-band spectrum can be used as an estimate for speech/silence detection.
Entropy computed from the full-band spectrum. (a) Clean speech waveform, (b) entropy contour for clean speech, (c) speech corrupted with factory noise at 6 dB SNR, and (d) entropy contour for speech corrupted with factory noise at 6 dB SNR.
31
Multi-band/multi-resolution spectral entropy feature
The full-band spectral entropy feature can capture only the gross peakiness of the spectrum.
The best results were obtained by dividing the normalized full-band spectrum into 24 overlapping sub-bands defined by the Mel scale and computing the entropy from each sub-band.
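Computing one entropy value per overlapping sub-band of the normalized full-band spectrum can be sketched as follows. The band edges in the example are hypothetical and not Mel-spaced, and whether the values are renormalized within each sub-band is not stated on the slide; this sketch assumes they are not.

```python
import math


def subband_entropies(spectrum, bands):
    """One entropy value per (start, end) sub-band of the normalized spectrum."""
    total = sum(spectrum)
    pmf = [energy / total for energy in spectrum]
    # Assumption: sub-bands reuse the full-band normalization as-is.
    return [-sum(p * math.log2(p) for p in pmf[start:end] if p > 0)
            for start, end in bands]


spectrum = [1.0, 1.0, 1.0, 1.0]
bands = [(0, 2), (1, 4)]  # hypothetical overlapping bands (end index exclusive)
features = subband_entropies(spectrum, bands)
```

Each sub-band contributes one feature, so 24 sub-bands would give a 24-dimensional entropy feature vector per frame.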
32
Entropy based full-combination multi-stream (FCMS)
Full-combination multi-stream:
All possible combinations of the two features are treated as separate streams. An MLP expert is trained for each stream. The posteriors at the output of the experts are weighted and combined. The combined posteriors thus obtained are passed to an HMM decoder.
33
Entropy based full-combination multi-stream
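The combined output posterior probability for the $k$-th class and $n$-th frame:
$$\hat{P}(q_k \mid X_n, \Theta) = \sum_{i=1}^{I} w_n^i \, P(q_k \mid x_n^i, \theta_i)$$
The entropy of the $i$-th stream at frame $n$:
$$h_n^i = -\sum_{k=1}^{K} P(q_k \mid x_n^i, \theta_i) \log_2 P(q_k \mid x_n^i, \theta_i)$$
Inverse entropy weighting with the average entropy as threshold:
$$\bar{h}_n = \frac{1}{I} \sum_{i=1}^{I} h_n^i$$
$$\tilde{h}_n^i = \begin{cases} 10000 & : h_n^i > \bar{h}_n \\ h_n^i & : h_n^i \le \bar{h}_n \end{cases}$$
$$w_n^i = \frac{1/\tilde{h}_n^i}{\sum_{j=1}^{I} 1/\tilde{h}_n^j}$$
where
$X_n = \{x_n^1, \ldots, x_n^I\}$
$x_n^i$: $i$-th stream feature vector
$n$: frame number
$\theta_i$: parameter set of the $i$-th stream, $\Theta = \{\theta_1, \ldots, \theta_I\}$
$I$: number of streams (3 in this case)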
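Inverse entropy weighting of the stream posteriors for a single frame can be sketched as below. The sketch assumes that streams whose entropy exceeds the frame average have their entropy replaced by 10000 before inversion; the small floor `eps`, which avoids a division by zero for a perfectly confident stream, is an addition of this example.

```python
import math


def entropy(posteriors):
    """Entropy in bits of one stream's class posteriors."""
    return -sum(p * math.log2(p) for p in posteriors if p > 0)


def inverse_entropy_weights(stream_posteriors, penalty=10000.0, eps=1e-10):
    """Weights proportional to 1/entropy, penalizing above-average entropies."""
    entropies = [entropy(p) for p in stream_posteriors]
    average = sum(entropies) / len(entropies)
    adjusted = [penalty if h > average else max(h, eps) for h in entropies]
    inverse = [1.0 / h for h in adjusted]
    total = sum(inverse)
    return [value / total for value in inverse]


def combine_posteriors(stream_posteriors):
    """Weighted sum of the stream posteriors for one frame."""
    weights = inverse_entropy_weights(stream_posteriors)
    num_classes = len(stream_posteriors[0])
    return [sum(w * p[k] for w, p in zip(weights, stream_posteriors))
            for k in range(num_classes)]


streams = [[0.5, 0.5], [0.9, 0.1], [0.25, 0.75]]  # three streams, two classes
weights = inverse_entropy_weights(streams)
combined = combine_posteriors(streams)
```

In this example only the second stream has below-average entropy, so it dominates the combination and the combined posteriors stay close to its output.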
34
Spectral entropy feature in Tandem framework
The Tandem framework exploits the advantages of both HMM/ANN and HMM/GMM systems.
Multi-stream Tandem: outputs from the different experts are weighted and combined. The combined output undergoes a KL transform before being fed as features into the HMM/GMM system.
35
In the Tandem framework we only have access to the ‘outputs before softmax’.
Therefore we cannot use the entropy based weighting directly. To overcome this problem we converted the ‘outputs before softmax’ into posteriors by applying the “softmax” nonlinearity (exponentials normalized to sum to 1):
$$P(q_k \mid x_n) = \frac{\exp(y_k \mid x_n)}{\sum_{k'} \exp(y_{k'} \mid x_n)}$$
where $y_k$ is the output before softmax for the $k$-th class $q_k$, $x_n$ is the feature vector, and $n$ is the time instant.
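Converting the outputs before softmax into posteriors can be sketched as follows. Subtracting the maximum before exponentiation is a standard numerical-stability step added for this example; it does not change the result and is not taken from the slides.

```python
import math


def softmax(outputs):
    """Exponentials of the pre-softmax outputs, normalized to sum to 1."""
    peak = max(outputs)
    # Shifting by the maximum avoids overflow for large outputs.
    exponentials = [math.exp(y - peak) for y in outputs]
    total = sum(exponentials)
    return [e / total for e in exponentials]


posteriors = softmax([2.0, 1.0, 0.1])
uniform = softmax([0.0, 0.0])
large = softmax([1000.0, 1000.0])
```

The resulting posteriors can then be fed into the entropy computation used for the stream weights.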
36
Experimental
The Numbers95 database of US English connected-digit telephone speech is used.
There are 30 words in the database, represented by 27 phonemes.
Noise from the Noisex92 database is added at different signal-to-noise ratios (SNRs).
There were 3,330 utterances for training, and 2,250 utterances were used for testing the system.
37
Results
Hybrid system under different noise conditions:
WERs for PLP features, 24 Mel-band spectral entropy features and their time derivatives (24-Mel), the two features appended (PLP + 24-Mel), and PLP and spectral entropy features in FCMS with inverse entropy weighting.
38
Results
Tandem system under different noise conditions:
WERs for PLP features, 24 Mel-band spectral entropy features and their time derivatives (24-Mel), the two features appended (PLP + 24-Mel), and PLP and spectral entropy features in FCMS with inverse entropy weighting.
39
Conclusion
We demonstrated that better performance can be achieved by FCMS as compared to appending the multi-resolution entropy feature vector to the PLP feature vector.