http://www.iaeme.com/IJECET/index.asp 82 [email protected]
International Journal of Electronics and Communication Engineering & Technology
(IJECET)
Volume 6, Issue 9, Sep 2015, pp. 82-96, Article ID: IJECET_06_09_010
Available online at
http://www.iaeme.com/IJECETissues.asp?JType=IJECET&VType=6&IType=9
ISSN Print: 0976-6464 and ISSN Online: 0976-6472
© IAEME Publication
COMPARATIVE STUDY OF LPCC AND FUSED MEL FEATURE SETS FOR SPEAKER IDENTIFICATION USING GMM-UBM
Anagha S. Bawaskar
Dept. of Electronics & Telecommunication,
M. E. S. College of Engineering, Pune, India
Prabhakar N. Kota
Dept. of Electronics & Telecommunication,
M. E. S. College of Engineering, Pune, India
ABSTRACT
Biometric identifiers are measurable characteristics used to label and describe individuals. They combine two classes of characteristics: physiological and behavioral. Physiological characteristics relate to the shape of the body; examples include, but are not limited to, fingerprint, palm, hand geometry, iris, and retina. Behavioral characteristics relate to patterns of behavior, including, but not limited to, typing rhythm and voice. Biometric technology, which analyzes such human body characteristics, is today well established, and speech is one of its most active areas. Speech-related tasks include language recognition, speech recognition, and speaker recognition. Speaker recognition, broadly defined, is the task of identifying the correct speaker from a group of people; it is further classified into speaker identification and speaker verification. This paper concentrates on speaker identification: the main aim is to identify the correct speaker from given speech samples. Features are extracted from these samples and used for modeling. The standard TIMIT database is used for identification. The paper covers several feature extraction algorithms: Mel Frequency Cepstral Coefficients (MFCC), Inverted Mel Frequency Cepstral Coefficients (IMFCC), and Linear Predictive Cepstral Coefficients (LPCC). The term Fusion refers to the combination of two of these algorithms, namely MFCC and IMFCC. The
comparison is made between the results of Fusion and LPCC. The results show that, on average, Fusion performs better than LPCC.
Index Terms: Gaussian Mixture Models (GMM), Inverted Mel Frequency
Cepstral Coefficients (IMFCC), Linear Predictive Cepstral Coefficients
(LPCC), Mel Frequency Cepstral Coefficients (MFCC), Universal
Background Model (UBM)
Cite this Article: Anagha S. Bawaskar and Prabhakar N. Kota. Comparative
Study of LPCC and Fused Mel Feature Sets For Speaker Identification Using
GMM-UBM, International Journal of Electronics and Communication
Engineering & Technology, 6(9), 2015, pp. 82-96.
http://www.iaeme.com/IJECET/issues.asp?JType=IJECET&VType=6&IType=
9
1. INTRODUCTION
Various biometric systems exist today, and the last decades have seen rising interest in security systems, which employ various biometric schemes. Biometrics refers to technologies that measure and analyze human body characteristics. Many biometric methods are in use for authentication, among them face recognition, retina and iris recognition, fingerprints, DNA, and hand measurements. Another well-known method to add to this list is speech signal processing. Speech is one of the natural forms of communication, and speech recognition has applications ranging from voice identification on ordinary personal computers to biometric and forensic systems. There are two main techniques in speech processing: speaker recognition and speech recognition. This paper focuses on speaker recognition, which is further divided into speaker identification and speaker verification.
Speaker identification determines which of the enrolled speakers produced a given utterance, whereas in speaker verification a claimed identity is accepted or rejected: identification is a 1:N match, while verification is a 1:1 match. In this paper a text-independent speaker identification system is used. In speaker identification, specific characteristics of the voice are extracted from a given voice sample of the speaker; this is known as feature extraction. The speaker model is then trained and stored in the system database. Extraction yields specific information about the speaker's voice, called feature vectors, which represent speaker-specific information based on one or more of the following: the vocal tract, the excitation source, and behavioral traits. Speaker recognition systems use sets of scores to enhance the probability and reliability of the recognizer. Before feature extraction, the signal goes through a pre-processing stage. Pre-processing plays an important role in speaker identification: it reduces the amount of variation in the database that carries no important information about the speech, removing irrelevant information, and is considered good practice.
The algorithms used for feature extraction are Mel Frequency Cepstral Coefficients (MFCC), Inverted Mel Frequency Cepstral Coefficients (IMFCC), and Linear Predictive Cepstral Coefficients (LPCC); in this paper, feature extraction is done using all three. Researchers have found speaker-specific information complementary to MFCC, captured by the Inverted Mel Frequency Cepstral Coefficients (IMFCC). This complementary information is used to combine the score models with those of MFCC, yielding what is named the Fused Mel Feature Set; such models are simply mathematical representations of the particular system [1]. An inverted filter bank method is used to capture this complementary information from the high-frequency part of the energy spectrum: IMFCC captures information that MFCC neglects. The respective features are modeled using a Gaussian Mixture Model with a Universal Background Model (GMM-UBM). All algorithms used in this paper are based on Gaussian filters, and the results are verified on the standard TIMIT database.
The final results compare the LPCC results with the Fused Mel Feature Set results, and the more accurate results are noted. The next sections present the Fused Mel Feature Set using Gaussian filters and the Linear Predictive Cepstral Coefficients using Gaussian filters, followed by comparative results for both.
2. FEATURE EXTRACTION AND FILTER DESIGN
The goal of feature extraction is to represent a speech signal by a finite number of measures. Features are a representation of the spectrum of the speech signal in each window frame. The cepstral vectors are derived from a filter bank designed according to some model of the auditory system [2]. Most feature extraction methods use a standard triangular filter bank, which filters the spectrum of the speech signal to simulate the characteristics of the human ear. Triangular filters have a disadvantage, however: they impose a sharp, crisp partition of the energy spectrum, so some information is lost. In this paper, Gaussian filters are used instead. Gaussian filters avoid the crisp, sharp transitions in the energy spectrum and give a smoother adaptation from one sub-band to the next. Because of this adaptive property, a degree of correlation is always maintained between neighboring sub-bands, from the midpoints of the triangular filters at their base as well as from their end points. The mathematical calculations for Gaussian filters are also simple. Because of these advantages over triangular filters, Gaussian filters are used here. The motivation for Mel Frequency Cepstrum Coefficients is that the auditory response of the human ear resolves frequencies nonlinearly. The mapping from linear frequency to mel frequency is defined as [3]
    f_mel = 2595 * log10( 1 + f / 700 )    (1)

where f_mel is the subjective pitch in mels corresponding to the frequency f, measured in Hz.
MFCCs are among the more popular parameterization methods used by researchers in the speech technology field. They have the benefit of capturing the phonetically important characteristics of speech, and they are band-limited in nature, so they can easily be adapted for applications such as telephony.
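The mel warping of Eq. (1), and the inverse mapping used later to place the filter boundaries, can be sketched in Python as follows (the function names are ours, not from the paper):

```python
import math

def hz_to_mel(f_hz):
    # Eq. (1): subjective pitch in mels for a frequency in Hz
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(f_mel):
    # inverse mapping back to Hz
    return 700.0 * (10.0 ** (f_mel / 2595.0) - 1.0)
```

The two maps are exact inverses, so mel_to_hz(hz_to_mel(f)) recovers f for any non-negative frequency.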
Feature extraction with MFCC generally uses a triangular filter bank. The triangular filter tapers asymmetrically and does not provide correlation between a sub-band and the nearby spectral components, which causes information loss. Using Gaussian filters avoids these drawbacks: Gaussian filters taper toward both ends and provide correlation between sub-bands and their nearby spectral components [4].
IMFCC is a feature extraction technique that captures the complementary information present in the high-frequency part of the spectrum. The figure below shows the steps involved in extracting both Gaussian MFCC and Gaussian IMFCC features. Let the input be the preprocessed speech frame y(n), n = 1, ..., M. First, y(n) is converted to the frequency domain by a DFT, which yields the energy spectrum; this is followed by the Gaussian filter bank block.
Figure 1 Steps involved in extraction of Gaussian MFCC and IMFCC [5]
Mathematically, the Gaussian filter is written as

    psi_i^g(k) = exp( -(k - k_{b_i})^2 / (2 * sigma_i^2) )    (2)

where k is the coefficient index in the N-point DFT and k_{b_i}, the boundary point at the base of the i-th triangular filter, is taken as the mean of the i-th Gaussian filter. sigma_i, the standard deviation (square root of the variance) of the i-th filter, can be written as

    sigma_i = ( k_{b_{i+1}} - k_{b_i} ) / gamma    (3)

where gamma is the parameter by which the variance is controlled.
Figure 2 Filterbank design [5]
[Figure 1 block diagram: Speech Signal -> Pre-Processing -> |FFT|^2 -> Gaussian MFCC filter bank / Gaussian IMFCC filter bank -> 10*log10(.) -> DCT -> MFCC / IMFCC features]
Figure 2 shows two plots in a single figure: one for the triangular filter and one for the Gaussian filter, drawn for a single value of sigma; plots for other values of sigma can be drawn in the same way. Figures 3 and 4 show the individual responses of the Gaussian filter banks for MFCC and IMFCC.
Figure 3 Mel scale Gaussian filter bank [6]
Figure 4 Inverted Mel scale Gaussian filter bank [6]
Mathematically, the boundary points of the Gaussian MFCC filter bank can be written as

    k_{b_i} = (M_s / F_s) * f_mel^{-1}( f_mel(f_low) + i * ( f_mel(f_high) - f_mel(f_low) ) / (Q + 1) )    (4)

where M_s is the number of points in the DFT, F_s is the sampling frequency, f_low and f_high are the low- and high-frequency boundaries of the filter bank, Q is the number of filters in the bank, and f_mel^{-1} is the inverse of the mel transformation:
    f_mel^{-1}(f_mel) = 700 * ( 10^{f_mel / 2595} - 1 )    (5)
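A minimal Python sketch of Eqs. (4)-(5), computing the DFT-bin boundary points of a mel-spaced filter bank; the helper names and the rounding to integer bins are our own assumptions:

```python
import math

def mel(f):
    # Eq. (1)
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_inv(m):
    # Eq. (5)
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def filter_boundaries(Ms, Fs, f_low, f_high, Q):
    """Eq. (4): the Q+2 boundary bins k_{b_i}, equally spaced on the
    mel scale between f_low and f_high, mapped back to Hz and then
    scaled to DFT bin indices."""
    m_low, m_high = mel(f_low), mel(f_high)
    return [round((Ms / Fs) * mel_inv(m_low + i * (m_high - m_low) / (Q + 1)))
            for i in range(Q + 2)]
```

For example, a 512-point DFT at 16 kHz over 0-8000 Hz with Q = 20 filters yields 22 monotonically increasing boundary bins from 0 to 256.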
The inverted mel scale filter bank structure can be obtained by flipping the original filter bank around the midpoint of the frequency range under consideration:

    psi'_i(k) = psi_{Q+1-i}( M_s/2 + 1 - k )    (6)

where psi_i(k) is the original MFCC filter bank response.
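Eq. (6) is a pure re-indexing, so with the bank stored as a matrix (rows = filters, columns = DFT bins 1..M_s/2) the inverted bank is just a flip along both axes. A tiny sketch, with our own data layout assumed:

```python
def invert_filter_bank(psi):
    # Eq. (6): psi'_i(k) = psi_{Q+1-i}(Ms/2 + 1 - k).
    # psi[i-1][k-1] holds filter i at bin k, so reversing the row
    # order and each row's bin order realizes the flip.
    return [list(reversed(row)) for row in reversed(psi)]
```

For example, a two-filter toy bank [[1, 0, 0], [0, 1, 0]] inverts to [[0, 1, 0], [0, 0, 1]], and applying the flip twice recovers the original bank.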
These filter banks are applied to the energy spectrum, obtained by taking the Fast Fourier Transform of the preprocessed signal, as follows:

    e_i^g = sum_{k=1}^{M_s/2} |Y(k)|^2 * psi_i^g(k)    (7)

where psi_i^g(k) is the respective filter response and |Y(k)|^2 is the energy spectrum.

Figure 5 Response psi_i(k) of a typical Mel scale filter [5]
Finally, the DCT is taken on the log filter bank energies { log[e_i^g] }_{i=1}^{Q}, and the final MFCC coefficients can be written as

    C_m^{MFCC_g} = sqrt(2/Q) * sum_{l=0}^{Q-1} log[ e^g(l+1) ] * cos( m * (2l + 1) * pi / (2Q) )    (8)

where 0 <= m < R, and R is the desired number of cepstral features. The same procedure extracts the IMFCC features as well [4], using the energies e_hat^g of the inverted filter bank, and these are denoted as

    C_m^{IMFCC_g} = sqrt(2/Q) * sum_{l=0}^{Q-1} log[ e_hat^g(l+1) ] * cos( m * (2l + 1) * pi / (2Q) )    (9)
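The final DCT step of Eqs. (8)-(9) can be sketched as below: given the Q log filter-bank energies of one frame, it returns the first R cepstral coefficients (the function name and list-based layout are our own):

```python
import math

def cepstra_from_log_energies(log_e, R):
    # Eq. (8): C_m = sqrt(2/Q) * sum_{l=0}^{Q-1} log_e[l] * cos(m*(2l+1)*pi/(2Q))
    Q = len(log_e)
    return [math.sqrt(2.0 / Q) *
            sum(log_e[l] * math.cos(math.pi * m * (2 * l + 1) / (2.0 * Q))
                for l in range(Q))
            for m in range(R)]
```

Feeding it energies from the original bank gives MFCC (Eq. 8) and from the inverted bank gives IMFCC (Eq. 9). For a flat log-energy vector every coefficient beyond C_0 vanishes, which makes a quick sanity check.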
3. LINEAR PREDICTIVE CEPSTRAL COEFFICIENTS (LPCC)
The predictor coefficients are determined by minimizing the squared differences between the actual speech samples and their linearly predicted values, and they form a unique set of parameters. In practice, the actual predictor coefficients are never used as-is because of their high variance; they are transformed to a more robust set of parameters known as cepstral coefficients. The procedure for extracting LPCC follows that of MFCC and IMFCC, and a Gaussian filter bank is used here as well.
Figure 6 Block diagram of LPCC algorithm [7]
[Figure 6 pipeline: Speech Sequence -> Pre-emphasis and Hamming window -> Linear Predictive Analysis -> Cepstral Analysis -> LPCC]
Pre-emphasis and Hamming Window
The first block, to which the input signal is given, is pre-emphasis. The idea of pre-emphasis is to spectrally flatten the speech signal and equalize the inherent spectral tilt of speech [8]. Pre-emphasis is implemented by a first-order FIR digital filter with transfer function

    H_p(z) = 1 - alpha * z^{-1}    (10)

where alpha is a constant with a typical value of 0.97.
After pre-emphasis, the speech signal is subdivided into frames. This process is the same as multiplying the entire speech sequence by a windowing function,

    s_m[n] = s[n] * w[n - m]    (11)

where s[n] is the entire speech sequence, s_m[n] is a windowed speech frame at time m, and w[n] is the windowing function.
The typical frame length is about 20-30 milliseconds. In the equation above, m is the time shift, or step size, of the windowing function; a new frame is obtained by shifting the window to a subsequent time, typically by 10 milliseconds. The shape of the windowing function is important: a rectangular window is not recommended, since it causes severe spectral distortion (leakage) in the speech frames [9], so window functions that minimize spectral distortion should be used instead. One of the most commonly used windows is the Hamming window:

    w[n] = 0.54 - 0.46 * cos( 2 * pi * n / (N - 1) )    (12)

where N is the length of the windowing function. After Hamming windowing, the speech frame is passed to the next stage for further processing.
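Eqs. (10)-(12) together form the front end. A minimal Python sketch, with our own function names; frame and hop lengths are in samples:

```python
import math

def preemphasize(x, alpha=0.97):
    # Eq. (10): y[n] = x[n] - alpha * x[n-1] (first-order FIR filter)
    return [x[0]] + [x[n] - alpha * x[n - 1] for n in range(1, len(x))]

def hamming(N):
    # Eq. (12): w[n] = 0.54 - 0.46*cos(2*pi*n/(N-1)), 0 <= n < N
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (N - 1)) for n in range(N)]

def frame_signal(x, frame_len, hop):
    # Eq. (11): each shift m yields one Hamming-windowed frame
    w = hamming(frame_len)
    return [[x[m + n] * w[n] for n in range(frame_len)]
            for m in range(0, len(x) - frame_len + 1, hop)]
```

At 16 kHz, a 25 ms frame with a 10 ms hop corresponds to frame_len = 400 and hop = 160.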
Linear predictive analysis
In human speech production, the shape of the vocal tract governs the nature of the sound produced. The main idea, based on the basic speech production model, is that the vocal tract can be modeled by an all-pole filter; the LPC coefficients are simply the coefficients of this all-pole filter, and they trace the smooth envelope of the log spectrum of speech. The main idea behind LPC is that a given speech sample can be approximated as a linear combination of past speech samples: LPC models the signal s(n) as a linear combination of its past values and the present input (vocal cord excitation). If the signal is represented only in terms of a linear combination of its past values, the difference between the real and predicted output is called the prediction error; LPC finds the coefficients by minimizing this prediction error.

The cepstrum is the inverse transform of the log of the magnitude of the spectrum. It is useful for separating convolved signals, such as the source and filter in the speech production model: the log operation separates the vocal tract transfer function, which has slow spectral variations, from the excitation signal, which has fast spectral variations. Cepstral coefficients generally provide more efficient and robust coding of speech information than LPC coefficients.
Figure 7 LPCC [10]
The predictor coefficients are rarely used directly as features; they are transformed into the more robust Linear Predictive Cepstral Coefficient (LPCC) features. The LPCs are obtained using the Levinson-Durbin recursive algorithm; this is known as LP analysis. The difference between the actual and the predicted sample value is termed the prediction error, or residual [11], and is given by

    e(n) = s(n) - s_hat(n) = s(n) - sum_{k=1}^{p} a_k * s(n - k)    (13)

which, absorbing the sign into the coefficients for k >= 1, can be written as

    e(n) = sum_{k=0}^{p} a_k * s(n - k),  a_0 = 1    (14)
Optimal predictor coefficients minimize the mean square error E. At the minimum value of E,

    dE / da_k = 0,  k = 1, 2, ..., p    (15)

Differentiating and equating to zero, we get the normal equations

    R a = r    (16)

where

    a = [a_1 a_2 ... a_p]'    (17)

    r = [r(1) r(2) ... r(p)]'    (18)

and R is the Toeplitz symmetric autocorrelation matrix given by

        | r(0)    r(1)   ...  r(p-1) |
    R = | r(1)    r(0)   ...  r(p-2) |
        | ...     ...    ...  ...    |
        | r(p-1)  r(p-2) ...  r(0)   |
The equation can be solved for the predictor coefficients using the Levinson-Durbin algorithm as follows:

    E^(0) = r[0]    (19)

    k_i = ( r[i] - sum_{j=1}^{i-1} a_j^(i-1) * r[i - j] ) / E^(i-1)    (20)
    a_i^(i) = k_i,  1 <= i <= p    (21)

    a_j^(i) = a_j^(i-1) - k_i * a_{i-j}^(i-1)    (22)

    E^(i) = ( 1 - k_i^2 ) * E^(i-1)    (23)

The above set of equations is solved recursively for i = 1, 2, ..., p, and the final solution is given by

    a_m = a_m^(p),  1 <= m <= p    (24)

where the a_m are the linear predictive coefficients (LPC).
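A small, self-contained rendering of the Levinson-Durbin recursion of Eqs. (19)-(24), with r given as autocorrelation values r[0..p] (the names are our own):

```python
def levinson_durbin(r, p):
    """Solve the Toeplitz normal equations of Eq. (16) for the LPC
    coefficients a_1..a_p via Eqs. (19)-(23); returns (a, E)."""
    a = [0.0] * (p + 1)               # a[m] holds a_m; a[0] unused
    E = r[0]                           # Eq. (19)
    for i in range(1, p + 1):
        k = (r[i] - sum(a[j] * r[i - j] for j in range(1, i))) / E  # Eq. (20)
        a_prev = a[:]
        a[i] = k                       # Eq. (21)
        for j in range(1, i):
            a[j] = a_prev[j] - k * a_prev[i - j]   # Eq. (22)
        E *= (1.0 - k * k)             # Eq. (23)
    return a[1:], E                    # Eq. (24)
```

For an AR(1)-like autocorrelation r(k) = 0.9^k, the recursion recovers a_1 = 0.9 and a_2 = 0, with residual energy 1 - 0.9^2 = 0.19.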
Cepstral Analysis.
In reality, the actual predictor coefficients are never used in recognition, since they typically show high variance. The predictor coefficients are instead transformed to a more robust set of parameters known as cepstral coefficients.

Before defining the cepstral coefficients, consider the cepstrum itself. A cepstrum is the result of taking the Fourier transform of the logarithm of the estimated spectrum of a signal. The three different types of cepstrum are the power cepstrum, the complex cepstrum, and the real cepstrum; among them, the power cepstrum in particular finds application in the analysis of human speech. The name cepstrum was derived from the word spectrum by reversing its first four letters.

The input speech signal passes through preprocessing, then feature extraction, and then modeling. Preprocessing reduces the complexity of operating on the speech signal by reducing the number of samples per operation: it is very difficult to work on a huge set of samples, so instead of operating on such a large set we restrict operations to frames of sufficiently reduced length. After this signal conditioning, the speech signal goes through the feature extraction stage, where the coefficients are calculated using the DCT.
Mathematically (in MATLAB-style notation),

    Ceps = dct( log( abs( fft( y_window ) ) ) )    (25)
The principal advantage of cepstral coefficients is that they are generally decorrelated, which allows diagonal covariances in the models. One minor problem with them is that the higher-order cepstral coefficients are numerically quite small, which results in a very wide range of variances when going from the low to the high cepstral coefficients.

Cepstral coefficients can be used to separate the excitation signal (which contains the words and the pitch) from the transfer function (which contains the voice quality); the cepstrum can be seen as information about the rate of change across the different spectral bands. The recursive relation between the predictor coefficients and
cepstral coefficients is used to convert the LP coefficients (LPC) into the LP cepstral coefficients c_k:
    c_0 = ln( sigma^2 )    (26)

    c_m = a_m + sum_{k=1}^{m-1} (k/m) * c_k * a_{m-k},  1 <= m <= p    (27)

    c_m = sum_{k=m-p}^{m-1} (k/m) * c_k * a_{m-k},  m > p    (28)

where sigma^2 is the gain term in the LP analysis and the recursion is run up to d, the desired number of LP cepstral coefficients.
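The LPC-to-cepstrum recursion of Eqs. (26)-(28) as a Python sketch, with a = [a_1, ..., a_p] and gain_sq = sigma^2 (naming is our own):

```python
import math

def lpc_to_cepstrum(a, gain_sq, d):
    """Returns [c_0, c_1, ..., c_d] per Eqs. (26)-(28)."""
    p = len(a)
    c = [math.log(gain_sq)]                       # Eq. (26): c_0 = ln(sigma^2)
    for m in range(1, d + 1):
        # the sum only involves a_{m-k} with 1 <= m-k <= p
        acc = sum((k / m) * c[k] * a[m - k - 1]
                  for k in range(max(1, m - p), m))
        c.append(a[m - 1] + acc if m <= p else acc)  # Eq. (27) / Eq. (28)
    return c
```

For a single-pole model a = [0.9] with unit gain, the recursion yields the known cepstrum c_m = 0.9^m / m: c_1 = 0.9, c_2 = 0.405, c_3 = 0.243.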
4. GAUSSIAN MIXTURE MODEL (GMM) AND UNIVERSAL
BACKGROUND MODEL
The text-independent speaker recognition system used in this paper uses the GMM-UBM approach for modeling. Generally, two models are developed: a target speaker model and an impostor model (the UBM). The approach has the generalization ability to handle unseen acoustic patterns [12].

In a biometric system, a GMM is commonly used as a parametric model of the probability distribution of continuous measurements or features; in a speaker identification system the features are generally vocal tract features. GMMs are favored for text-independent speaker identification because there is no prior knowledge of what the speaker will say, so modeling is generally done with GMMs. A Gaussian mixture model is a weighted sum of M component Gaussian densities, as given by the equation [13]

    p(x | lambda) = sum_{i=1}^{M} w_i * g(x | mu_i, Sigma_i)    (29)

where x is a D-dimensional continuous-valued data vector (i.e., measurements or features), w_i, i = 1, ..., M, are the mixture weights, and g(x | mu_i, Sigma_i), i = 1, ..., M, are the component Gaussian densities. Each component density is a D-variate Gaussian function of the form

    g(x | mu_i, Sigma_i) = ( 1 / ( (2*pi)^{D/2} * |Sigma_i|^{1/2} ) ) * exp( -(1/2) * (x - mu_i)' * Sigma_i^{-1} * (x - mu_i) )    (30)

with mean vector mu_i and covariance matrix Sigma_i. The mixture weights satisfy the constraint

    sum_{i=1}^{M} w_i = 1

The complete Gaussian mixture model is parameterized by the mean vectors, covariance matrices, and mixture weights of all component densities. These parameters are collectively represented by the notation

    lambda = { w_i, mu_i, Sigma_i },  i = 1, ..., M    (31)
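Eqs. (29)-(30) with diagonal covariance matrices, as commonly used in GMM-UBM systems, can be sketched as follows (our own layout: lists of per-dimension means and variances):

```python
import math

def gmm_log_density(x, weights, means, variances):
    """log p(x | lambda), Eq. (29), with each component a D-variate
    diagonal-covariance Gaussian, Eq. (30)."""
    D = len(x)
    p = 0.0
    for w, mu, var in zip(weights, means, variances):
        log_det = sum(math.log(v) for v in var)            # log |Sigma_i|
        quad = sum((xi - mi) ** 2 / vi                     # (x-mu)' Sigma^-1 (x-mu)
                   for xi, mi, vi in zip(x, mu, var))
        p += w * math.exp(-0.5 * (D * math.log(2 * math.pi) + log_det + quad))
    return math.log(p)
```

With one standard-normal component in one dimension, log p(0) = -0.5 * log(2*pi), which serves as a quick check.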
For a sequence of T training vectors X = {x_1, ..., x_T}, the GMM likelihood, assuming independence between the vectors, can be written as

    p(X | lambda) = prod_{t=1}^{T} p(x_t | lambda)    (32)

For utterances with T frames, the log-likelihood of a speaker model lambda_s is

    L_s(X) = log p(X | lambda_s) = sum_{t=1}^{T} log p(x_t | lambda_s)    (33)

For speaker identification, the value of L_s(X) is computed for all speaker models s enrolled in the system, and the owner of the model that generates the highest value is returned as the identified speaker. During the training phase, the feature vectors are trained using the Expectation-Maximization (EM) algorithm, which iteratively updates each of the parameters in lambda, with the log-likelihood increasing monotonically at each step.
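The identification decision of Eq. (33) is just an argmax over per-model log-likelihood sums. A sketch with a pluggable per-frame scorer (all names are our own):

```python
def identify_speaker(frames, models, frame_loglik):
    """Return the id of the enrolled model s maximizing
    L_s(X) = sum_t log p(x_t | lambda_s), Eq. (33)."""
    scores = {sid: sum(frame_loglik(x, m) for x in frames)
              for sid, m in models.items()}
    return max(scores, key=scores.get)
```

With a toy scorer -|x - m| and models {'alice': 0.0, 'bob': 5.0}, frames near 5 identify 'bob'; in the real system frame_loglik would evaluate the GMM density of Eq. (29).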
GMMs are generally used for text-independent speaker identification, and GMM-UBM overcomes drawbacks of previous systems: the cost of the model is lower than that of a per-speaker GMM alone, and no big phoneme or vocabulary database is needed. GMM is also more advantageous than HMM for this task.

The basic idea of the UBM is to capture the general characteristics of a population and then adapt them to an individual speaker. Put more briefly, the UBM is a model, used in many application areas including biometric systems, against which person-independent feature characteristics are compared with the person-specific feature model when deciding acceptance or rejection. The UBM is simply a GMM trained over a large set of speakers.

The UBM is trained with the EM algorithm on its training data. For the speaker recognition process, it fulfills two main roles: it is the a priori model for all target speakers when applying Bayesian adaptation to derive speaker models, and it helps to compute the log-likelihood ratio much faster by selecting, for each frame, the best Gaussians for which the likelihood is relevant. This work uses the UBM as a guide to discriminative training of speakers [14].
5. COMPARATIVE RESULTS OF FUSED MEL FEATURE SETS
AND LINEAR PREDICTIVE CEPSTRAL COEFFICIENTS
The main method in this paper is the fusion of the two algorithms MFCC and IMFCC, and the main aim is to compare the fused results with the results obtained from LPCC; that is, the accuracy obtained from the fused Mel feature set is compared with that of the Linear Predictive Cepstral Coefficients. The better of the two identifies the speaker more accurately within the database used. A system performs better when two or more combined feature sets supply information that is complementary in nature; to obtain higher identification accuracy, the MFCC and IMFCC features, which are complementary to each other, can be fused together. Many combination rules are possible: product, sum, minimum, maximum, median, average, and so on. The sum rule outperforms the other combinations and is the most resilient to estimation errors.

Consider the block diagrams of the fused Mel feature set and of LPCC with the GMM-UBM modeling technique.
From Figures 8 and 9 we can see that the system includes training and testing for both the fused Mel feature set and the LPCC feature set. The implementation is done on the TIMIT database, a standard corpus used by many researchers for the purpose of speaker identification; a subset of 16 speakers is used here.
Figure 8 Steps involved in speaker identification system (fused Mel features sets) [5]
Figure 9 Speaker identification system (LPCC) [6]
The recordings are from 8 dialect regions. Each speaker has 10 utterances, for a total of 160 recorded sentences (10 recordings per speaker). The audio format is .wav: single channel, 16 kHz sampling, 16-bit samples, PCM encoding.
The features are extracted using Gaussian mel scale filter banks, and the feature vectors are trained using the Expectation-Maximization algorithm. As the diagrams show, a separate model is created for each speaker [5]. In the testing step, features are extracted from the incoming test signal and the likelihood of these features against each speaker model is determined, for MFCC and IMFCC as well as for LPCC. Two separate block diagrams are drawn for the fused Mel feature sets and for LPCC. In the first, a uniform weighted sum rule is adopted to fuse the scores from the two classifiers:
    S_com^i = w * S_MFCC^i + (1 - w) * S_IMFCC^i    (34)

where S_com^i is the combined score of MFCC and IMFCC for model i, S_MFCC^i and S_IMFCC^i are the scores generated by the MFCC model and the IMFCC model respectively, and w is the fusion coefficient. Along similar lines, the values for LPCC are calculated and denoted S_LPCC^i.
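Applied per enrolled model i, Eq. (34) is a one-line weighted sum; a minimal sketch (names are our own):

```python
def fuse_scores(s_mfcc, s_imfcc, w):
    # Eq. (34): S_com^i = w * S_MFCC^i + (1 - w) * S_IMFCC^i
    return [w * a + (1.0 - w) * b for a, b in zip(s_mfcc, s_imfcc)]
```

For example, fuse_scores([1.0, 0.0], [0.0, 1.0], 0.6) weights the MFCC scores by 0.6 and the IMFCC scores by 0.4.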
The accuracies for Fusion and LPCC are calculated and compared. The weights and the number of mixtures can be varied to test the system for the optimum result.
Table I shows the performance of the proposed system for different weights and numbers of mixtures. The standard TIMIT subset of 16 speakers is divided into two parts for training and testing: the UBM is built from 5 speakers and the GMMs from the remaining 11. The background model is generated by the UBM, and the pre-emphasis constant alpha is kept at 0.97. Accuracy is calculated on the basis of false positives and false negatives: in a false positive, a false speaker is accepted as the true one, while in a false negative, a true speaker is rejected as an impostor. The formula for the accuracy calculation is

    Accuracy (%) = 100 - (FP + FN) * 100 / (M * N)

where M * N is the size of the confusion matrix.
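The accuracy formula as code; the FP/FN counts in the example below are illustrative values, not the paper's:

```python
def accuracy_percent(fp, fn, m, n):
    # Accuracy (%) = 100 - (FP + FN) * 100 / (M * N),
    # where M x N is the size of the confusion matrix
    return 100.0 - (fp + fn) * 100.0 / (m * n)
```

For an 11 x 11 confusion matrix, 6 total errors give 100 - 600/121, roughly 95.04%.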
Table I. Comparative results for different numbers of mixtures and weights for the proposed system

    No. of   | Score thr. = 0.6 | Score thr. = 0.77 | Score thr. = 0.8 | Score thr. = 0.97
    mixtures | Fusion%   LPCC%  | Fusion%   LPCC%   | Fusion%   LPCC%  | Fusion%   LPCC%
    4        | 92.56     84.29  | 92.56     85.95   | 92.56     85.95  | 91.73     86.77
    8        | 94.21     86.77  | 92.56     86.77   | 92.56     87.60  | 92.56     87.60
    16       | 92.56     71.07  | 95.04     73.55   | 95.04     74.38  | 92.56     76.85
Figure 10 Graphical Representation of table 1
The table shows the accuracy percentages obtained for different numbers of mixtures and different values of the score threshold. The accuracy tends to increase as the threshold increases, and in every case the accuracy of Fusion is better than that of LPCC: the performance of the fused system exceeds the performance of LPCC. The maximum performance is 95.04%, giving good identification with limited errors.
6. CONCLUSION
Many methods have been used for feature extraction, including MFCC and IMFCC. These two algorithms work well individually and give good accuracy; since IMFCC helps MFCC improve its accuracy further, the two are combined into what is called the fused Mel feature set, and the Gaussian Mixture Model is evaluated for speaker identification on it. Performance increases by fusing the complementary information. As the table shows, the accuracy calculated for Fusion reaches 95.04% at score thresholds 0.77 and 0.8, which is better than the 73.55% and 74.38% achieved by LPCC at the same thresholds. Further enhancements may be possible by changing the modeling technique and by trying various combinations of weights.

Future work may apply the same database approach to develop a real-time application, and the system could also be developed using an artificial neural network based approach.
REFERENCES
[1] J. Kittler, M. Hatef, R. Duin, J. Mataz, On Combining Classifier, IEEE
Transaction, Pattern Analysis and Machine Intelligence, 20(3), pp.226-
239,March 1998.
[2] Rana, Mukesh, and Saloni Miglani, Performance analysis of MFCC and LPCC
Techniques in Automatic speech Recognition, International Journal of
Engineering and Computer Science, 3(8), pp.7727-7732, August, 2014
[3] Sridharan, Sridha & Wong, Eddie, Comparison of Linear Prediction Cepstrum
Coefficients and Mel-Frequency Cepstrum Coefficients for language
identification, Proceedings of International Symposium on Intelligent
Multimedia, Video and Speech Processing, pp. 95-98, 2-4 May 2001
[4] Chakroborty Sandipan, and Goutam Saha, Improved text-independent speaker
identification using fused MFCC & IMFCC feature sets based on Gaussian filter,
International Journal of Signal Processing, 5(1), pp. 11-19, 2009
[5] R. Shantha Selva Kumari , S. selva Nidhyananthan , Anand, Fused Mel Feature
sets based Text-Independent Speaker Identification using Gaussian Mixture
Model, International Conference on Communication Technology and System
Design , Procedia Engineering, 30, pp. 319–326, 2012
[6] Anagha S. Bawaskar, Prabhakar N. Kota, Speaker Identification Based on MFCC
and IMFCC Using GMM-UBM, International Organization of Scientific
Research (IOSR Journals), 5(2), pp. 53-60, March-April 2015
[7] Cheng, Octavian, Waleed Abdulla, and Zoran Salcic, Performance evaluation of
front-end processing for speech recognition systems, School of Engineering
Report. The University of Auckland, Electrical and Computer Engineering, 2005
[8] Rabiner, L. and Juang, B, Fundamentals of speech recognition, Prentice Hall,
Inc., Upper Saddle River, New Jersey, 22 April 1993
[9] Rabiner, L.R., Schafer, R.W., Digital Processing of Speech Signals, Prentice
Hall, 1978.
[10] Pallavi P. Ingale and Dr. S.L. Nalbalwar, Novel Approach To Text Independent
Speaker Identification, International Journal of Electronics and Communication
Engineering & Technology, 3(2), 2012, pp. 87-93.
[11] Chang, Wen-Wen, Time Frequency Analysis and Wavelet Transform Tutorial
Time-Frequency Analysis for Voiceprint (Speaker) Recognition, National Taiwan
University.
[12] Pazhanirajan, S., and P. Dhanalakshmi, EEG Signal Classification using Linear
Predictive Cepstral Coefficient Features, International Journal of Computer
Applications, 73(1), pp. , 2013
[13] Chao, Yi-Hsiang; Tsai, W.-H.; Hsin-Min Wang, Discriminative Feedback
Adaptation for GMM-UBM Speaker Verification, Chinese Spoken language
Processing( ISCSL) 6th International Symposium on , pp.1,4, 16-19 Dec. 2008
[14] Manan Vyas, A Gaussian Mixture Model Based Speech Recognition System
Using Matlab, Signal & Image Processing: An International Journal (SIPIJ),
4(4), August 2013
[15] Amr Rashed, Fast Algorithm For Noisy Speaker Recognition Using Ann,
International journal of Computer Engineering & Technology, 5(2), 2014, pp. 12
- 18.
[16] Viplav Gautam, Saurabh Sharma,Swapnil Gautam and Gaurav Sharma,
Identification and Verification of Speaker Using Mel Frequency Cepstral
Coefficient, International Journal of Electronics and Communication
Engineering & Technology, 3(2), 2012, pp. 413-423.
[17] Scheffer N, Bonastre. J.F, UBM-GMM Driven Discriminative Approach for
Speaker verification, Speaker and Language Recognition workshop, IEEE
Odyssey, pp.1-7, 28-30 June 2006.