2013 IEEE International Conference on Signal Processing, Computing and Control (ISPCC), Solan, India, 26-28 September 2013

Improved Language Identification Using Sampling Rate Compensation & Gender Based Language Models For Indian Languages

Deepak Joshi
Department of Electrical Engg., Indian Institute of Technology
New Delhi, India [email protected]

Shiv Dutt Joshi
Department of Electrical Engg., Indian Institute of Technology
New Delhi, India [email protected]

Abstract: A language identification system may receive test language samples from many sources, such as landline or mobile telephone calls, VoIP packets, radio transmissions or recordings made on a computer. The sampling rate of the test sample often differs from that of the training samples used to build the back end language models, and this mismatch degrades the performance of the system. We therefore propose adding a sampling rate compensation block to the language identification system, so that the system is independent of sampling rate variation between the test and training language data. We also propose the use of gender based language models. This approach improves performance on the language identification task and additionally gives the gender of the speaker, which is useful for gender based user classification or for providing customized gender based services. Gender based language identification combined with sampling rate compensation has been found to improve results for both Vector Quantization codebook and Gaussian Mixture Model language models. The sampling rate compensation block can also readily be used in speaker identification systems to improve performance.

Keywords: sampling rate compensation, vector quantisation, Gaussian mixture modelling, gender based language models.

978-1-4673-6190-3/13/$31.00 ©2013 IEEE

I. INTRODUCTION

When an individual utters a word or sentence, a wide variety of information can be extracted from the speech signal: the topic of the speech, the identity of the speaker, the emotional state of the speaker and the language being spoken. Language identification is of paramount importance for military applications, emergency services, tourism information, etc. [1]. Most automatic language identification systems extract features from the test speech in the front end and then compare these features with the language models in the back end in order to identify the language [2]. The back end language models can be either Vector Quantisation (VQ) codebook models or Gaussian Mixture Model (GMM) based models [3-5]. Techniques such as maximum likelihood may be used to ascertain the unknown language under test. Until now, language identification, gender identification and accent identification have been studied separately [2],[6],[7].

Speech in an unknown language may come from various sources, such as a telephone call (mobile or fixed line), a speech sample received via radio, VoIP packets or speech recorded on a computer. These signals differ from each other in sampling rate and other spectral properties, and hence also differ significantly from the language models used in the back end. This mismatch between the unknown test samples and the back end language models leads to significant deterioration in the performance of language identification systems. The problem can be overcome by introducing a sampling rate compensation block into the language identification system. The block first resamples the training language samples to a pre-decided rate, so that the back end language models are built from samples of a single sampling rate, and then resamples the test samples used for identification to the same rate. This technique has shown promising results in an environment where there is significant variation between the test and training language sample sources. The sampling rate compensation block can also be used in speaker identification systems to improve performance.

Male and female vocal pitch differ significantly from each other due to the difference in the length and thickness of the vocal folds [8]. This paper also presents a technique to ascertain the gender of the speaker while carrying out the primary task of language identification. The proposed technique also works well for short test samples, for which error rates are otherwise very high, particularly when the back end language model does not have adequate representation of speech samples from both genders. Features extracted from the speech as Mel Frequency Cepstral Coefficients (MFCCs) are compared with the gender based language models to ascertain the gender and language of the test sample. The proposed gender based technique gives better results for both VQ codebook models and GMM models. It has also been observed that GMM gives better results than VQ for gender based language identification, which suggests that GMM is the better technique for modelling the variations in speech utterances that arise from differences in gender.

In the remainder of this paper, Section II describes the database used for training the back end. The database comprises samples from both genders, and the speakers speak the languages in their respective dialects. Section III explains the use of the proposed sampling rate compensation module. Section IV explains the gender based language models for gender and language identification after incorporating the re-sampling module for sampling rate compensation. Section V gives the results of the proposed system.

II. LANGUAGE DATABASE

India is a country with an ancient culture. Languages affiliated to almost all important language families are spoken in India; however, the percentage of the population speaking a particular language varies substantially. Almost 74 % of the population speaks languages of the Indo-Aryan family, such as Hindi and Urdu, and 23 % speak Dravidian languages, such as Kannada and Tamil. A small part of the population speaks languages from the Tibeto-Burman and Austro-Asiatic language families, such as Manipuri.

Tamil is one of the oldest surviving classical languages in the world, and Tamil literature has a 2000 year old history. It belongs to the southern branch of the Dravidian languages. Present day Tamil has 12 vowels, 18 consonants and 1 special character. Besides being the dominant language in the Tamil Nadu state of India, Tamil is widely spoken in Sri Lanka.

Kannada is one of the 40 most spoken languages of the world. It also belongs to the southern branch of the Dravidian language family and is widely spoken in Karnataka, India.

Urdu is an official language of five Indian states and is also the national language of Pakistan. It is a South Asian language and belongs to the Indo-Aryan branch of the Indo-European language family. The Urdu alphabet has 38 letters, of which 10 are vowels.

Kashmiri is a language from the Dardic subgroup of the Indo-Aryan language family. It is spoken primarily in the Kashmir region of the state of Jammu and Kashmir. Kashmiri has 39 consonants and 8 pairs of long and short vowels.

Manipuri is a language from the Tibeto-Burman branch of the Sino-Tibetan language family. It has 6 vowels and 24 consonants. It is spoken in the states of Manipur, Assam and Tripura in India, and it is a tonal language.

An effort has been made to include different dialects of each language so as to build a good language model in the back end. Care has been taken to include speakers of all age groups, ranging from 12 years to 56 years. The samples chosen for building the back end language models have not been used as test samples for the identification task. It has been ensured that all the linguistic aspects of each language are covered by the speakers. Table I gives the number of dialects included and the number of speakers who contributed language samples for this work.

TABLE I. LANGUAGES AND DIALECTS STUDIED

Language    Dialects studied    No. of speakers
Tamil       4                   12
Kannada     4                   13
Urdu        3                   24
Kashmiri    5                   14
Manipuri    4                   12

III. PROPOSED SAMPLING RATE COMPENSATION MODULE

This section first describes the requirement for sampling rate compensation and then its proposed usage in a language identification system for obtaining the best possible performance. The technique can similarly be used in speaker identification problems to improve performance.

A. Requirement of sampling rate compensation

At present, a language test sample may arrive at a language identification system from various sources, such as a landline or mobile telephone call, an HF/VHF radio conversation or a VoIP packet. Samples from different sources may differ significantly from each other in sampling frequency and spectral properties. Hence, the sampling rate of the test language sample must be brought to the sampling rate of the back end language model to achieve the best performance.
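The requirement above, bringing every sample to the sampling rate of the back end models, can be illustrated with a minimal sketch. It is not the authors' implementation: the paper does not say which resampling method was used, so plain linear interpolation stands in here, and the function names `resample_linear` and `compensate` are hypothetical. A practical system would more likely use a polyphase resampler with an anti-aliasing filter, especially when reducing the rate.

```python
def resample_linear(samples, src_rate, dst_rate):
    """Resample a mono signal from src_rate to dst_rate (both in Hz).

    Assumption: linear interpolation is used only for illustration; it has
    no anti-aliasing filter, so it is a poor choice for real downsampling.
    """
    if src_rate <= 0 or dst_rate <= 0:
        raise ValueError("sampling rates must be positive")
    if not samples or src_rate == dst_rate:
        return list(samples)
    n_out = max(1, round(len(samples) * dst_rate / src_rate))
    step = src_rate / dst_rate          # input samples advanced per output sample
    last = len(samples) - 1
    out = []
    for k in range(n_out):
        pos = k * step                  # fractional position in the input
        i = min(int(pos), last)
        j = min(i + 1, last)
        frac = pos - i if i < last else 0.0
        out.append(samples[i] * (1.0 - frac) + samples[j] * frac)
    return out


def compensate(samples, src_rate, backend_rate=44100):
    """Bring a test or training sample to the common back end rate.

    44100 Hz matches the Compact Disc rate; the default value itself is an
    assumption of this sketch.
    """
    return resample_linear(samples, src_rate, backend_rate)
```

For example, `compensate(call_samples, 8000)` would bring a hypothetical 8 kHz telephone recording to the 44.1 kHz rate before feature extraction, and the same function would be applied to any training sample that is not already at the common rate.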


B. Implementation of sampling rate compensation

It has been found experimentally that the best results are obtained when the test sample is re-sampled at the first stage itself, before carrying out any other process. Hence, it is proposed that the samples be re-sampled at the beginning. In this paper the back end language models have been built from samples taken from Compact Discs (CD) with a sampling rate of 44.1 kHz. If the samples provided for the back end themselves come from different sources, then they must first be resampled in order to bring them to one common rate. The test sample was initially fed as input at its own sampling rate, which was different from the back end rate, and later fed to the system after re-sampling to the rate of the back end language models. The performance of the system in these two situations was evaluated. It was found that the performance of the system is better when sampling rate compensation is carried out.

It has also been observed that if the sampling rate varies among the languages used in the back end and sampling rate compensation of the test sample has not been carried out, the test sample is likely to associate itself more with a language which is at the same sampling rate as the test sample. In this study, feature extraction has been done using MFCC and the language models have been prepared using VQ and GMM. The improvement achieved by carrying out sampling rate compensation can be ascertained from rows I and II of the results summarised in Table II and Table III.

IV. GENDER BASED LANGUAGE IDENTIFICATION SYSTEM

In order to discriminate between languages, the language identification system must be able to analyse the characteristics that differ between languages and should also take into account the differences which arise due to speaker variations [9]. The gender of a speaker has a significant effect on the fundamental frequency (Fo) of the speaker. This effect is attributable to the length and thickness of the vocal folds. Male speakers have an average Fo of 100-150 Hz. The female vocal folds are smaller and lighter than the male vocal folds and hence vibrate at 200-220 Hz, which is almost twice the male value [10]. Fo may also vary between speakers of the same gender in two different languages. This variation is attributable to the ways in which the languages have developed over thousands of years. The proposed gender based language identification system is shown in Figure 1 for the VQ based back end and in Figure 2 for the GMM based language model.

A. Speech Pre-processing: This step primarily comprises bringing the sampling rate to the desired rate, followed by noise and silence removal.

B. Feature Extraction: Mel Frequency Cepstral Coefficients (MFCCs) are the most commonly used technique for feature extraction [11].

Fig. 1. Re-sampled Gender based VQ Language codebook.

Fig. 2. Re-sampled Gender based GMM Language models.

The mapping from linear frequency to Mel frequency is given by equation (1)

$$\mathrm{Mel}(f) = 1127 \ln\left(1 + \frac{f}{700}\right) \qquad (1)$$

where f is the linear frequency in Hz.
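As a small worked illustration of equation (1), the sketch below converts between linear frequency and the Mel scale. It assumes the common form of the mapping with constants 1127 and 700; the function names are hypothetical and the inverse mapping is included only for convenience.

```python
import math


def hz_to_mel(f_hz):
    """Map a linear frequency in Hz to the Mel scale (equation (1))."""
    return 1127.0 * math.log(1.0 + f_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel, obtained by solving equation (1) for f."""
    return 700.0 * (math.exp(mel / 1127.0) - 1.0)


def mel_spaced_frequencies(n_points, f_max_hz):
    """Return n_points frequencies in Hz, equally spaced on the Mel scale
    between 0 Hz and f_max_hz. Such points are typically used as the centre
    frequencies of the Mel filter bank; the exact filter bank layout used in
    the paper is not stated, so this helper is an assumption."""
    if n_points < 2:
        raise ValueError("need at least two points")
    top = hz_to_mel(f_max_hz)
    return [mel_to_hz(top * k / (n_points - 1)) for k in range(n_points)]
```

The mapping is close to linear at low frequencies and logarithmic at high frequencies, so points that are equally spaced in Mel become progressively further apart in Hz.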


A Hamming window has been used for windowing the speech frames. It is given by equation (2)

$$w(n) = 0.54 - 0.46 \cos\left(\frac{2\pi n}{N-1}\right), \quad 0 \le n \le N-1 \qquad (2)$$

where N is the number of samples in a frame. In this study the spectral information has been represented using 12-dimensional MFCCs. We exclude the first component, as it represents the mean value of the input signal and carries little useful information [12]. Other feature extraction techniques include Perceptual Linear Prediction (PLP) and Linear Prediction Cepstral Coefficients (LPCC).
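The windowing step can be sketched as follows, assuming the standard Hamming window with coefficients 0.54 and 0.46 as in equation (2). The function names are hypothetical. The helper that drops the first cepstral coefficient assumes that 13 coefficients are computed per frame and the first is discarded to leave 12; the paper states only that the first component is excluded.

```python
import math


def hamming(n_points):
    """Return the n_points samples of a Hamming window (equation (2))."""
    if n_points < 1:
        raise ValueError("window length must be at least 1")
    if n_points == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (n_points - 1))
            for n in range(n_points)]


def window_frame(frame):
    """Multiply one speech frame, sample by sample, by a Hamming window."""
    return [x * w for x, w in zip(frame, hamming(len(frame)))]


def drop_first_coefficient(cepstrum):
    """Discard the first cepstral coefficient of a frame.

    Assumption: 13 coefficients in, 12 coefficients out.
    """
    return list(cepstrum[1:])
```

The window tapers each frame towards 0.08 at its edges, which reduces the spectral leakage caused by cutting the signal into frames.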

C. Gender based Language Modelling: Language modelling has been done separately, in the form of VQ codebooks and GMMs, for each gender of each of the five languages, using the MFCCs extracted above. Two models, one male and one female, are trained for each language using the gender-wise segregated data, which gives 10 language models for 5 languages. It has been observed that gender based language models give better results than mixed gender speech models for both VQ and GMM based back ends. Hence we obtain an improvement in language identification performance and also detect the gender of the speaker, which can be useful for certain gender sensitive applications.

Gender based language models are of special help where the database is not elaborate, i.e. it does not have adequate representation of both genders and does not cover the linguistic aspects of the languages completely. In such cases, it has been found experimentally that a test sample may associate itself with a back end language model which has more representation of its own gender. It may ignore its actual back end language model, because too few linguistic aspects are captured to differentiate among the languages while the pitch difference between the test sample and the back end language model, caused by the gender difference, is significant.
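The decision step with gender based models can be sketched for a VQ back end as follows. This is an illustration under stated assumptions rather than the authors' code: each model is a codebook keyed by a (language, gender) pair, the test features are scored by average squared Euclidean distortion, and the codebook with the lowest distortion gives both the language and the gender. The function names and the toy 2-dimensional codebooks are hypothetical; the real system uses 12-dimensional MFCC vectors and 10 codebooks.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def average_distortion(features, codebook):
    """Mean distance from each feature vector to its nearest centroid."""
    if not features:
        raise ValueError("no feature vectors supplied")
    total = sum(min(squared_distance(f, c) for c in codebook) for f in features)
    return total / len(features)


def identify(features, codebooks):
    """Return the (language, gender) key of the best matching codebook.

    codebooks maps (language, gender) -> list of centroid vectors.
    """
    return min(codebooks, key=lambda key: average_distortion(features, codebooks[key]))


# Toy data, invented purely for illustration.
toy_codebooks = {
    ("Tamil", "male"): [[0.0, 0.0], [1.0, 0.0]],
    ("Tamil", "female"): [[5.0, 5.0], [6.0, 5.0]],
    ("Urdu", "male"): [[0.0, 8.0], [1.0, 8.0]],
    ("Urdu", "female"): [[8.0, 0.0], [9.0, 0.0]],
}
toy_features = [[5.1, 4.9], [5.9, 5.2]]
language, gender = identify(toy_features, toy_codebooks)
```

With a GMM back end the structure would be the same, except that each model would be scored by the log-likelihood of the test features and the highest score would be selected.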

V. RESULTS

In this paper language identification has been carried out using three approaches. In the first approach, the sampling rates of the test language data and the back end language models differ, and the back end language models have been created from concatenated speech samples containing speech of both genders. In the second approach, sampling rate compensation has been carried out using re-sampling, but the back end language models are the same as before, i.e. gender based segregation has not been carried out. In the third approach, sampling rate compensation has been used together with gender based language models. Test samples of 20 seconds have been taken as input for the system.

The performance of the system was above 97 % when either Kannada or Tamil was excluded from the study. Since these two languages are spoken in regions of close geographical proximity and belong to the same language family, the overall performance of the system came down when both of them were included in the study. The results for the three approaches are compared in Table II and Table III.

TABLE II. LID SYSTEM PERFORMANCE WITH THREE DIFFERENT APPROACHES (VQ CODEBOOK) FOR TAMIL, KANNADA, URDU, KASHMIRI AND MANIPURI

LID system accuracy (%) with variation in no. of centroids; 20 sec test language data

Approach                                                          8      16     32     64     128    256
Variable sampling rate                                            25.2   22.8   27.2   28.1   30.2   30.6
Sampling rate compensated (one mixed gender model per language)   54.1   52.9   56.3   61.9   63.4   60.4
Sampling rate compensated (gender based models, 2 per language)   71.1   74.4   79.1   83.4   85.7   81.1

TABLE III. LID SYSTEM PERFORMANCE WITH THREE DIFFERENT APPROACHES (GMM) FOR TAMIL, KANNADA, URDU, KASHMIRI AND MANIPURI

LID system accuracy (%) with variation in no. of Gaussian components; 30 sec test language data

Approach                                                          8      16     32     64     128    256
Variable sampling rate                                            28.5   27.3   30.5   32.0   32.7   30.7
Sampling rate compensated (one model per language)                82.2   82.8   84.3   87.4   92.8   95.1
Sampling rate compensated and gender based segregation            87.5   88.6   94.3   96.5   98.1   98.7

Sufficient text independent test samples were fed as input to the system and based on the results obtained with every language an average performance of the system was calculated.
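One way to compute such an average is sketched below. The paper does not state the exact averaging procedure, so this sketch assumes that accuracy is first computed per language and the per-language figures are then averaged with equal weight. The function names and the example trial list are hypothetical and do not reproduce any result from the tables.

```python
def per_language_accuracy(trials):
    """Return accuracy in percent for each true language.

    trials is a list of (true_language, predicted_language) pairs.
    """
    totals = {}
    correct = {}
    for true_lang, predicted in trials:
        totals[true_lang] = totals.get(true_lang, 0) + 1
        if predicted == true_lang:
            correct[true_lang] = correct.get(true_lang, 0) + 1
    return {lang: 100.0 * correct.get(lang, 0) / totals[lang] for lang in totals}


def average_accuracy(trials):
    """Average the per-language accuracies with equal weight per language."""
    accuracies = per_language_accuracy(trials)
    if not accuracies:
        raise ValueError("no trials supplied")
    return sum(accuracies.values()) / len(accuracies)


# Invented trials: Tamil is confused with Kannada once, Urdu is always correct.
example_trials = [
    ("Tamil", "Tamil"),
    ("Tamil", "Kannada"),
    ("Urdu", "Urdu"),
    ("Urdu", "Urdu"),
]
example_average = average_accuracy(example_trials)
```

Averaging per language rather than over all trials keeps a language with many test samples from dominating the reported figure.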


VI. CONCLUSION

Sampling rate compensation can easily be used to improve speaker identification systems. The re-sampling technique is equally useful with VQ and GMM based systems, and significant improvements can be achieved by its use. The results improve significantly as the number of centroids and Gaussian components is increased; however, the advantage is not very evident beyond 128 centroids and 128 Gaussian components. For best performance of the language identification system, one should consider languages from different language families. Languages from the same language family, or spoken in geographically close places, may deteriorate the overall system performance. Similar results can be achieved for speaker identification applications, but with far fewer centroids and Gaussian components. The computational complexity of the proposed technique is higher, as more stages have been introduced into the system.

In future work, it is proposed to study systems that use suitable spectral compensation techniques and to confirm their effect using the results achieved.

ACKNOWLEDGEMENT

We are thankful to Mrs Kamini Malhotra, Scientific Analysis Group, India, Mr Dhruv Agrawal and Mr Madhur D. Upadhayay, IIT Delhi, for their help in carrying out this work.

REFERENCES

[1] Y. K. Muthusamy, E. Barnard and R. A. Cole, “Reviewing Automatic Language Identification,” IEEE Signal Processing Magazine, October 1994.

[2] Eliathamby Ambikairajah, Haizhou Li, Liang Wang, Bo Yin and Vidhyasaharan Sethu, “Language Identification: A Tutorial,” IEEE Circuits and Systems Magazine, Second Quarter 2011.

[3] Qu Dan, Wang Bingxi and Wei Xin, “Language Identification using Vector Quantisation,” ICSP 2002 Proceedings.

[4] M. A. Zissman, “Comparison of four approaches to automatic language identification of telephone speech,” IEEE Trans. Speech and Audio Processing, vol. 4, p. 31, 1996.

[5] D. A. Reynolds and R. C. Rose, “Robust Text Independent Speaker Identification using Gaussian Mixture Speaker Models,” IEEE Trans. Speech and Audio Processing, vol. 3, pp. 72-83, 1995.

[6] E. S. Parris and M. J. Carey, “Language independent gender identification,” in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing, 1996, vol. 2, pp. 385-388.

[7] Kamini Malhotra and Anu Khosla, “Automatic Identification of Gender and Accent in Spoken Hindi Utterances with Regional Indian Accents,” Spoken Language Technology Workshop, IEEE, 2008, pp. 309-312.

[8] I. R. Titze, Principles of Voice Production, Prentice Hall, 1994.

[9] J. Benesty, M. Sondhi and Y. Huang, Springer Handbook of Speech Processing. New York: Springer Verlag, 2007.

[10] H. Wakita, “Normalization of vowels by vocal tract length and its application to vowel identification,” IEEE Trans. Acoust., Speech, Signal Processing, vol. 25, pp. 183-192, 1977.

[11] S. Davis and P. Mermelstein, “Comparison of Parametric representation for monosyllabic word recognition in continuously spoken sentences,” IEEE Trans. Acoustics, Speech and Signal Processing, vol. 28, no. 4, 1980.

[12] Ali Zulfiqar, Aslam Muhammad and Martinez Enriquez A. M., “A Speaker Identification system using MFCC features with VQ technique,” Third International Symposium on Intelligent Information Technology Application, 2009, pp. 115-118.