
Robust Language And Speaker Identification Using Image Processing Techniques Combined With PCA

Deepak Joshi, Madhur Deo Upadhayay and Shiv Dutt Joshi
Indian Institute of Technology

New Delhi, India

Abstract- Spoken language and speaker identification have been attracting researchers across the globe for the past several decades. Language identification shares certain similarities with speaker identification; however, the two differ in certain aspects. Both language identification and speaker identification have so far been addressed with feature extraction techniques such as MFCC, PLP and LPCC. In this paper, a new feature extraction technique is proposed: the Radon transform (RT) is used for feature extraction after obtaining the spectrogram from the speech sample. PCA is used to achieve dimension reduction and to reduce the computational complexity. The performance of the proposed method has been compared with existing identification techniques in the field of language and speaker identification.

Keywords- Radon transform, Principal component analysis, speaker identification, language identification.

I. INTRODUCTION

Language and speaker identification are interesting and challenging tasks which are useful for security agencies, agencies involved in emergency services, industries like travel and tourism, biometric systems, etc. [1]. Languages differ from each other in the different ways they are spoken, owing to differences in their grammar and related characteristics. Similarly, the speech of two individuals differs due to differences in their vocal tract shapes, larynx sizes and other voice-producing organs, and in their learned behaviour, i.e. intonation, rhythm, vocabulary, etc. Language and speaker identification systems share a large number of similarities in problem formulation, system approach and methodologies. Both problems can be formulated either for identification or for verification requirements. One system can be used as the other by carrying out intelligent changes in certain basic processing parameters. The majority of language and speaker identification systems can be divided into a front end and a back end. The front end of the system deals with pre-processing of the speech signal followed by feature extraction from the speech signal. The back end deals with preparation of language models using one of the various language modelling techniques [2]. The majority of speaker and language identification systems use Mel Frequency Cepstral Coefficients (MFCC), Perceptual Linear Prediction (PLP) and Linear Prediction Cepstral Coefficients (LPCC) for feature extraction [2], [3]. MFCC is generally used for identification tasks because of its robustness in such tasks [4], [5]. The back-end language/speaker models are most of the time built using either the Vector Quantisation (VQ) technique or Gaussian Mixture Modelling (GMM) [6], [7], [8]. GMM has various variations, of which GMM with a diagonal covariance matrix is the most widely used due to its computational simplicity.

A spectrogram is a very important information source for speech signals. It is widely used for speech analysis as it represents the contextual variations in an excellent manner [9]. The squared magnitude of the time-varying spectral characteristics of a speech sample, when displayed graphically, is termed a spectrogram [10]. Information pertaining to the energy, pitch, fundamental frequency, formants and timing can be extracted by careful study of spectrograms. Studies indicate that the rich acoustic features contained in a spectrogram are of great help in speaker identification tasks [11]. Spectrograms have been successfully used for identification tasks up to an accuracy of 93% [12]. The majority of the information required to discriminate between speakers and languages is present in the spectrograms. Spectrograms are also useful for carrying out tasks like emotion identification. Different systems used for different identification tasks study each parameter separately and then combine them into a feature vector.

Studies have highlighted that information useful for carrying out identification tasks can also be extracted in the visual domain [9], [11], [12] rather than adhering to the conventional audio domain. Prior experience of studying spectrograms helps us interpret new spectrograms in order to utilize them for carrying out identification tasks successfully. The researcher needs to be well acquainted with the parameter variations which lead to changes in the image, and must analyze these changes in conjunction with prior background knowledge of phonetic and acoustic parameters. The changes in the picture are to be analyzed in terms of one's knowledge of acoustic, phonetic and linguistic changes. These three factors are primarily responsible for any kind of change in the image, and a researcher who can analyze the changes with respect to these factors is capable of carrying out the identification task with great accuracy.

An ideal language or speaker identification system should not be biased towards any particular language or speaker. The computation time should not be very large, which directly relates to the complexity of the system, i.e. the system should not be very complex. With an increase in the number of target languages or speakers and a reduction in test sample duration, the performance of the system should not degrade.

In this paper we carry out the task of speaker and language identification after formulating it as a problem of pattern matching of images. This is a relatively new technique and is also computationally more efficient. Here we use the spectrograms of the speech samples to compute Radon projections in different directions. Principal Component Analysis (PCA) is used to carry out dimension reduction in order to make the task computationally efficient. The technique is immune to the effects of session variation and additive noise. It has been shown empirically that the technique is computationally efficient and is text independent.

In the remainder of this paper, Section II explains the proposed method with details about the Radon transform and Principal Component Analysis. Section III gives the details pertaining to the database, the experiments carried out and a comparison of results. Section IV presents the conclusions based on the performance evaluation.

II. PROPOSED METHOD

In this paper we approach the problem of speaker and language identification as a problem of pattern matching of images. Ideally, in pattern matching tasks, the features derived from samples of different classes should not show any resemblance to each other, whereas the features extracted from different samples of the same class should be close. In other words, the extracted features should have the least or negligible intra-class variance, while the inter-class variance should be large. The features should also not be affected by the location, size and orientation of the pattern.

The task commences with the pre-processing of the speech signal to obtain a good spectrogram. The Radon transform and PCA are then used for compact feature extraction, and finally pattern matching methods are used for identification of the language or speaker. These steps are covered in detail in the subsequent sections. Fig. 1 shows the proposed technique.

Fig. 1. Proposed technique using Radon Transform with PCA (block diagram: features, reference languages/speakers, pattern matching, identified language/speaker).

A. Spectrogram

To obtain a good spectrogram we start with the pre-processing of the speech sample, where we first carry out pre-emphasis. Pre-emphasis is carried out in order to boost the higher frequencies, which otherwise have a very low intensity because of the effects of the human vocal tract [13], [14]. Pre-emphasis for an input speech waveform x(n) is achieved by passing the waveform through a first-order filter with transfer function

$H(z) = 1 - a z^{-1}$. We can boost the magnitude by approximately 32 dB if $a$ is chosen to be 0.95.

Though the speech signal is a non-stationary signal, over a short duration it can be considered stationary. In order to exploit this quasi-stationary nature of speech signals, we frame the speech signal into segments of 20 ms with an overlap of 10 ms. This overlap helps to prevent loss of information. In order to get rid of the effects due to abrupt beginnings and endings of the frames, we carry out windowing of the frames. We have used the Hamming window in this work, which is a very popular windowing technique. The Hamming window is multiplied with every frame:

$x_i(n) = y_i(n)\, w(n), \quad i = 1, 2, \ldots, M$

where $w(n) = 0.54 - 0.46 \cos\left(\dfrac{2\pi n}{T - 1}\right)$ and $T$ is the number of samples in each frame.

In order to obtain the spectrogram, we take the Fourier transform of every frame, which gives an estimate of the short-term frequency content of the signal. In other words, the spectrogram is the squared magnitude of the time-dependent Fourier transform versus time. In this paper, the spectrogram obtained from speech has been treated as an image. The variations between the images are due to changes in various parameters. These changes in speech images are similar to scene changes in the world around us, and we can make use of them with suitable image processing techniques. The systems presently in use may at times consider a feature appearing at an inappropriate place as being absent, which is not the actual case. Spectrograms are useful in such cases.
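As an illustrative aid (not the authors' implementation), the pre-processing chain described above, i.e. pre-emphasis with a = 0.95, 20 ms frames with 10 ms overlap, Hamming windowing and the squared-magnitude short-time Fourier transform, can be sketched as follows; the function name, default parameters and return layout are assumptions of this sketch.

```python
import numpy as np

def spectrogram_image(x, fs, a=0.95, frame_ms=20, hop_ms=10):
    """Spectrogram of a speech signal x sampled at fs Hz (minimal sketch)."""
    # Pre-emphasis: first-order filter H(z) = 1 - a z^-1 boosts the high frequencies.
    x = np.append(x[0], x[1:] - a * x[:-1])

    # 20 ms frames with 10 ms hop, exploiting the quasi-stationary nature of speech.
    frame_len = int(fs * frame_ms / 1000)
    hop_len = int(fs * hop_ms / 1000)
    n_frames = 1 + (len(x) - frame_len) // hop_len

    # Hamming window: w(n) = 0.54 - 0.46 cos(2*pi*n / (T - 1)).
    window = np.hamming(frame_len)
    frames = np.stack([x[i * hop_len : i * hop_len + frame_len] * window
                       for i in range(n_frames)])

    # Squared magnitude of the short-time Fourier transform = spectrogram.
    spec = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    return spec.T  # rows: frequency bins, columns: time frames
```

The resulting array can then be treated as a grey-scale image for the Radon transform stage described next.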

Fig. 2. Spectrogram for the English word "COUNTRY" (frequency versus time).

B. Radon Transform

The Radon transform is a fundamental tool in many areas. It takes into account the parameterization of lines and then evaluates integrals of an image along these lines. The Radon transform's power lies in its ability to capture the directional features of an image. For a given spectrogram it carries out the addition of the pixel intensity values in a particular direction at a specific displacement [14]. The Radon transform of a 2-D signal f(x, y) is given by

$$R(r, \theta) = \Re[f(x, y)] = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f(x, y)\, \delta(r - x\cos\theta - y\sin\theta)\, dx\, dy$$

where $\theta \in (0, \pi)$ is the angle between the distance vector and the x-axis [15], $\Re$ denotes the Radon transform operator and $r$ gives the distance between a line and the origin. The Radon transform captures the variations in the image due to changes in energy, pitch, fundamental frequency, formants, etc. These features are initially projected onto different orientation slices. The Radon projections are then obtained by summing all the intensity values of the pixels that are contained in the circle surrounding the pattern to be recognized and that lie on the line perpendicular to the ridge. For any given ridge, every pixel contained in the circle is projected onto it along the perpendicular direction. The entire process gives rise to one Radon slice in the Radon domain. Thus we compute the Radon projections of a given spectrogram in various orientations. For every projection, the variations of the pixel intensities are preserved evenly even when the pixels are far from the origin; hence the method does not weight intensity variations by their location in the image. The Radon transform is linear by definition, so geometric features such as straight lines or curves are made explicit by the transform, which concentrates the energy of the image into a few high-valued coefficients in the transformed domain. Here, we have started our work by considering the Radon projections in seven different orientations.
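For illustration, one way to compute such Radon projections from a spectrogram image is sketched below using scikit-image's radon function. The evenly spaced angles and the log/normalisation step are assumptions of this sketch, since the exact angles and image scaling are not specified in the paper.

```python
import numpy as np
from skimage.transform import radon

def radon_features(spectrogram, n_orientations=7):
    """Radon projections of a spectrogram image at n_orientations angles in [0, 180) degrees."""
    # Log-compress and normalise the spectrogram to [0, 1] (an assumption, common for speech images).
    img = np.log1p(spectrogram)
    img = (img - img.min()) / (img.max() - img.min() + 1e-12)

    # Evenly spaced projection angles over the half-circle.
    theta = np.linspace(0.0, 180.0, n_orientations, endpoint=False)

    # Each column of the sinogram is one Radon projection: the sum of pixel
    # intensities along parallel lines at that orientation.
    projections = radon(img, theta=theta, circle=False)
    return projections.flatten()
```

The flattened projection vector is high-dimensional, which is why PCA (Section II-C) is applied afterwards.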

It has been noticed empirically that, for the same speaker, the Radon projections are similar, as the acoustic features are the same and hence the spectrograms are similar. The same holds for languages, where the phonetic inventory is repeated for a particular language. Thus the intra-class variance is minimal, and for different speakers or different languages the inter-class variance has been observed to be quite large.

Fig. 3(a) and 3(b) show two different sentences spoken by a single speaker under different noise conditions. The speech sample in Fig. 3(a) was recorded under the noise of a ceiling fan, and the speech sample in Fig. 3(b) was recorded under the noise of a fan coupled with the noise of a generator in the near vicinity. From the spectrograms and the Radon transforms we can conclude that, due to the same size and shape of the vocal tract, the Radon transforms of the speech samples are similar despite the fact that different sentences were spoken. Hence, we can say that intra-class variance is minimized for the same speaker. From Fig. 3(a) and 3(b) we can also infer that the Radon transform is insensitive to the effects of noise, as the noise conditions for the two samples were chosen to be entirely different.


Fig. 3. Speech samples: (a) speaker 'X' speaking a sentence in English with the noise of a fan as ambient noise; (b) speaker 'X' speaking a different sentence with the noise of a fan and a generator as ambient noise. The figure on the left is the input sound, and the figure on the right is the Radon transform.

Similarly, we observe from Fig. 4(a) and Fig. 4(b) that when two different speakers utter the same sentence, the Radon transforms differ from each other due to the differences in vocal tract shape and size. Thus we can say that with the Radon transform the inter-class variances are maximized. Also, for different languages, the prosodic and phonotactic properties of the language, together with its syntax, give enough cues to the Radon transform to discriminate one language from another.


Fig. 4. Speech samples: (a) speaker 'X' speaking a sentence in English with the noise of a fan as ambient noise; (b) speaker 'V' speaking the same sentence under the same ambient noise. The figure on the left is the input sound, and the figure on the right is the Radon transform.

When we take into account the Root Mean Square Error (RMSE) between the Radon transforms in the two cases, we find that for intra-class (same-speaker) comparisons the value is small, whereas for inter-class (different-speaker) comparisons the value is large. The effects of bias and variance are both accounted for when computing the RMSE.
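A sketch of this RMSE comparison, extended to a simple minimum-RMSE decision over stored reference features, is shown below. The dictionary-based interface and function names are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def rmse(a, b):
    """Root mean square error between two Radon feature vectors of equal length."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return np.sqrt(np.mean((a - b) ** 2))

def identify(test_features, references):
    """Return the label whose stored feature vector gives the smallest RMSE.
    `references` maps a speaker/language label to its reference feature vector."""
    return min(references, key=lambda label: rmse(test_features, references[label]))

# Example usage (hypothetical data):
# references = {"speaker_A": feat_a, "speaker_B": feat_b}
# print(identify(test_feat, references))
```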

C. Principal Component Analysis (PCA)

Principal Component Analysis (PCA) takes higher-dimensional data as input. It utilizes the dependencies between the variables to represent the data in a lower-dimensional form without losing much information. Here, PCA has been applied on the Radon projections to obtain lower-dimensional feature vectors. In this work, Radon projections were used to evaluate the performance of the technique. When the number of Radon projections was taken to be less than 7, the performance was poor, especially for the language identification task.
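As a minimal sketch of this dimension-reduction step, PCA can be fitted on the stacked Radon projection vectors of the training samples and then applied to test samples. The number of retained components below is an assumption, since the paper does not state it.

```python
import numpy as np
from sklearn.decomposition import PCA

def reduce_features(radon_matrix, n_components=20):
    """Project Radon feature vectors (one training sample per row) onto their
    first n_components principal components."""
    pca = PCA(n_components=n_components)
    reduced = pca.fit_transform(radon_matrix)  # rows: samples, columns: principal components
    return pca, reduced

# A test vector is then projected with pca.transform(test_vector.reshape(1, -1))
# before the RMSE-based pattern matching against the stored reference features.
```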

III. DATABASE AND RESULTS

A 30-speaker database was created and used for the speaker identification task. The database had 15 male and 15 female speakers. Testing for speaker identification was done in both text-dependent and text-independent cases. For language identification, 7 Indian languages were chosen. The samples in the language database were selected in such a manner that they covered the linguistic aspects of the languages adequately. The languages used for the identification task are Bengali, Punjabi, Tamil, Kannada, Manipuri, Urdu and Kashmiri. Each language database was created using speech from 5 different speakers staying in different geographical locations so as to cover the different dialects of the language.

Table I gives a performance comparison for text-dependent and text-independent speaker identification between the popular techniques of MFCC with VQ, MFCC with GMM and the proposed technique. Table II gives the performance comparison for the language identification task using the 7 Indian language database.

In both identification tasks we have considered the effect of test sample duration on accuracy. While conducting the evaluation, it was found that the training and testing time is lower for the proposed technique. It was also observed that the performance of the proposed system is comparable to that of the MFCC with GMM systems and is better than that of the MFCC with VQ systems.

TABLE I. PERFORMANCE COMPARISON FOR SPEAKER IDENTIFICATION TASK (IN %).

METHOD USED                            | TEXT DEPENDENT            | TEXT INDEPENDENT
                                       | 5 s test  | 10 s test     | 5 s test  | 10 s test
MFCC with VQ                           |    93     |    95         |    86     |    91
MFCC with GMM                          |    94     |    98         |    88     |    95
Proposed technique (Radon transform)   |    93     |    95         |    92     |    94


TABLE II. PERFORMANCE COMPARISON FOR LANGUAGE IDENTIFICATION TASK (IN %).

METHOD USED (7-language database)      | 10 s test sample | 40 s test sample
MFCC with VQ                           |        79        |        85
MFCC with GMM                          |        87        |        92
Proposed technique (Radon transform)   |        85        |        89

Fig. 5 gives the performance accuracy (in %) as a function of the number of Radon projections. For speaker identification, the best performance is obtained for fewer than 10 Radon projections; for the language identification task, the best performance is obtained for around 50 Radon projections. Hence, there is a small difference between the language and speaker identification tasks, and the number of Radon projections to be used for a task must be selected after deliberate study.

Fig. 5. Performance accuracy (in %) as a function of the number of Radon projections, for speaker identification and language identification.

IV. CONCLUSIONS

The proposed technique approaches a speech processing task as an image processing task. It has produced results comparable to those of the most common techniques for speaker/language identification. Its computation time is also lower than that of the two techniques widely in use.

The spectrogram represents information pertaining to the pitch, fundamental frequency, energy, formants and timing of the speech signal in the form of a pattern. The Radon transform carries out summation of the pixel values in the spectrogram along a straight line in a particular direction at a specific displacement. It captures the language/speaker-specific features from the spectrogram. The use of PCA brings down the computation time to a great extent.

REFERENCES

[1] Y. K. Muthusamy, E. Barnard and R. A. Cole, "Reviewing automatic language identification," IEEE Signal Processing Magazine, Oct. 1994.
[2] D. Jurafsky and J. Martin, Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition, 2nd ed., New Jersey: Prentice Hall, 2008.
[3] L. Rabiner and B. Juang, Fundamentals of Speech Recognition, New Jersey: Prentice Hall, 1993.
[4] R. J. Mammone, X. Zhang and R. P. Ramachandran, "Robust speaker recognition: a feature-based approach," IEEE Signal Processing Magazine, pp. 58-71, 1996.
[5] D. A. Reynolds, "Experimental evaluation of features for robust speaker identification," IEEE Transactions on Speech and Audio Processing, pp. 639-643, 1994.
[6] Qu Dan, Wang Bingxi and Wei Xin, "Language identification using vector quantisation," in Proc. ICSP, 2002.
[7] M. A. Zissman, "Comparison of four approaches to automatic language identification of telephone speech," IEEE Trans. Speech and Audio Processing, vol. 4, p. 31, 1996.
[8] D. A. Reynolds and R. C. Rose, "Robust text-independent speaker identification using Gaussian mixture speaker models," IEEE Trans. Speech and Audio Processing, vol. 3, pp. 72-83, 1995.
[9] R. A. Cole, A. I. Rudnicky and V. W. Zue, "Performance of an expert spectrogram reader," Journal of the Acoustical Society of America, pp. 81-87, 1979.
[10] T. F. Quatieri, Discrete-Time Speech Signal Processing: Principles and Practice, Massachusetts: Prentice Hall, 2002.
[11] V. W. Zue, "An expert spectrogram reader: a knowledge-based approach to speech recognition," in Proc. ICASSP, Japan, pp. 1197-1200, 1986.
[12] R. A. Cole, A. I. Rudnicky, V. W. Zue and D. R. Reddy, "Speech as patterns on paper," in Perception and Production of Fluent Speech, Erlbaum, 1980.
[13] T. Kinnunen and H. Li, "An overview of text-independent speaker recognition: from features to supervectors," Speech Communication, pp. 12-40, 2010.
[14] G. Beylkin, "Discrete Radon transform," IEEE Transactions on Acoustics, Speech, and Signal Processing, pp. 162-172, 1987.
[15] K. Jafari-Khouzani and H. Soltanian-Zadeh, "Rotation-invariant multiresolution texture analysis using Radon and wavelet transforms," IEEE Transactions on Image Processing, pp. 783-794, 2005.
