
DNN-based Speech Recognition System Dealing with Motor State as Auxiliary Information of DNN for Head Shaking Robot

Moa Lee and Joon-Hyuk Chang∗

Abstract— In this paper, a deep neural network (DNN)-based integrated background noise suppression and acoustic modeling scheme for speech recognition is proposed, in which the on/off state of the motor of a head shaking robot is employed as relevant auxiliary information at the DNN input. Since the motor sound generated when the robot is moving or shaking its head severely degrades speech recognition accuracy, we propose to use the motor on/off state as additional information when designing the DNN-based recognition system. Our speech recognition algorithm consists of two parts: a feature mapping model for feature enhancement and an acoustic model for phoneme recognition. As for the feature mapping, a stacked DNN is designed for precise feature enhancement, such that the lower DNN and upper DNN are trained separately and then combined, after which the motor state is plugged into both the lower and upper DNNs in addition to the noisy input speech. Then, the acoustic model is trained on top of the feature enhancement model, in which the motor state is again used as an augmented feature. The proposed technique to suppress the acoustic and motor noises was evaluated in terms of the phoneme error rate (PER) and showed a significant improvement over the conventional system.

I. INTRODUCTION

Automatic speech recognition (ASR) technology is one of the most effective means of communicating with people for artificial intelligence (AI)-based social robots that perform actions or respond according to human commands. Using ASR technology, many social robots have been developed, such as SoftBank Corp.'s emotional robot Pepper [1], MIT's home robot JIBO [2], and Intel Corp.'s Jimmy [3]. In this regard, much research on speech recognition as an essential technology for robots has recently been carried out. However, in real environments where noise or reverberation exists around the robot, ASR often exhibits drastic performance degradation. In particular, the degradation is very severe for a robot with a driving motor: when the motor-driven robot rotates, speech recognition performance deteriorates drastically due to the added motor noise.

Recently, state-of-the-art DNN algorithms have been applied to speech enhancement for ASR, and techniques showing robust performance in various external noise and reverberation environments have been studied. These algorithms, referred to as feature enhancement, can be classified into two categories: spectral mapping-based and spectral masking-based methods. First, the spectral

*Corresponding Author

Moa Lee and Joon-Hyuk Chang are with the Department of Electronic Engineering, Hanyang University, Seoul, 04763, Republic of Korea (e-mail: [email protected]).

Fig. 1: Head shaking robot (JIBO)

TABLE I: Hardware specifications of JIBO [2]

Hardware    Specifications
Sensors     360-degree sound localization
Movement    3 full-revolute axes
Sound       2 premium speakers
Processor   High-end ARM-based mobile

mapping-based speech enhancement method learns non-linear mapping functions directly in order to estimate the clean speech features [4, 5, 6, 7]. Next, the spectral masking-based speech enhancement algorithm finds a mapping function that obtains time-frequency (TF) masks from the noisy speech features, and the TF masks are then employed to recover the clean speech features indirectly [8, 9, 10]. As another line of work, models for extracting robust acoustic features have been proposed to improve speech recognition performance [11, 12]. However, these techniques have focused on pre-processing that removes ambient noise generated around the robot. Research on the recognition performance degradation caused by the motor noise generated inside the robot is therefore very limited. Specifically, when a social robot responding to human commands shakes its head, a strong driving noise is generated in addition to the motor noise and the fan noise. These noises are captured in the microphone input along with the user's voice, which further degrades speech recognition performance.

In order to solve the above problem, we propose a novel DNN-based speech recognition algorithm that mitigates the performance degradation caused by the robot's self-generated noise. First, we introduce a technique that feeds the operation state of the driving motor inside the robot to the DNN input as auxiliary information, in addition to the noise-contaminated speech.


Fig. 2: Speech recognition system architecture using auxiliary information

The on/off information of the robot's driving motor can be added to the DNN input efficiently and in real time because it is inherently known inside the robot. The motivation behind our paper is that this augmented information helps the DNNs learn much better.

Specifically, in order to build the speech recognizer for robots, we separately design a feature mapping model to enhance the noisy features and an acoustic model to represent the relationship between the audio signal and the phonemes. As for the feature mapping, we basically adopt the stacked DNN algorithm as in [9] by connecting two DNNs in a cascade fashion. We then learn the acoustic model by taking the enhanced features that pass through the feature mapping models as input. In other words, the proposed speech recognition structure uses a total of three DNNs, and binary information indicating the operation state of the motor is used as auxiliary information for all three DNN inputs. The three DNNs responsible for feature mapping and acoustic modeling are later jointly trained for further performance improvement. We use the JIBO robot shown in Fig. 1 to record and evaluate the motor-driven noise of a real robot; a brief specification is given in Table I. We compared the proposed algorithm with a baseline that does not use the motor state information, and the proposed algorithm showed better speech recognition performance.

This paper is organized as follows. Section II describes the stacked DNN structure used for feature enhancement, and Section III explains the proposed algorithm using auxiliary information. The experimental results are shown in Section IV, and the conclusion is given in Section V.

II. STACKED DNN FOR FEATURE ENHANCEMENT

In this section, we explain the mapping-based speech enhancement method that estimates the spectral-based features of clean speech from noisy speech. We optimize the parameters by minimizing the mean squared error between the predicted spectral features x̂ and the target spectral features x, as in Eq. (1):

E_r = \frac{1}{N} \sum_{n=1}^{N} \left\| \hat{x}\left(y_{n-\tau}^{n+\tau}, \mathbf{W}, \mathbf{b}\right) - x_n \right\|_2^2 \qquad (1)

where x̂(y_{n-τ}^{n+τ}, W, b) and x_n are the output and target feature vectors at sample index n, respectively, and N is the minibatch size. y_n is the input feature vector and, in general, is the feature vector of noisy speech; y_{n-τ}^{n+τ} denotes the input frames from n−τ to n+τ. W and b represent the learned weight and bias parameters. For y, we use mel-frequency cepstral coefficients (MFCCs) as the input feature in this work. In order to enhance the performance of the feature mapping model, a stacked DNN structure is adopted instead of a single DNN. As shown in Fig. 2, two feature mapping DNNs are used for feature enhancement. The input of the first DNN is the spectral-based features extracted from noisy speech. In the second DNN, the enhanced features that have passed through the first DNN are concatenated with the noisy features. Both DNNs estimate the clean speech features. The stacked DNN is better optimized for the feature mapping task than a single DNN because its parameters are learned across two DNNs, resulting in higher performance [9]. The results of feature enhancement based on the single DNN and the stacked DNN are shown in Fig. 3.
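To make the feature mapping objective concrete, the following is a minimal sketch of one feature mapping DNN and a training step implementing Eq. (1). It is written in PyTorch purely for illustration and under our own assumptions: the paper trains its models with the Kaldi toolkit, the context width `context` and the class and function names here are hypothetical, and only the hidden layer count and unit size reported in Section IV-B are carried over.

```python
import torch
import torch.nn as nn

class FeatureMappingDNN(nn.Module):
    """One feature mapping DNN: maps a window of noisy MFCC frames
    (optionally augmented with auxiliary features) to one clean MFCC frame."""
    def __init__(self, feat_dim=39, context=5, aux_dim=0, hidden=1024, n_hidden=3):
        super().__init__()
        in_dim = feat_dim * (2 * context + 1) + aux_dim
        layers = []
        for _ in range(n_hidden):
            layers += [nn.Linear(in_dim, hidden), nn.ReLU()]
            in_dim = hidden
        layers.append(nn.Linear(hidden, feat_dim))  # estimate of the clean feature x_n
        self.net = nn.Sequential(*layers)

    def forward(self, y_context):
        # y_context corresponds to y_{n-tau}^{n+tau} in Eq. (1), flattened per frame.
        return self.net(y_context)

def train_step(model, optimizer, y_context, x_clean):
    """One minibatch update minimizing E_r = (1/N) * sum_n ||x_hat_n - x_n||_2^2."""
    optimizer.zero_grad()
    x_hat = model(y_context)
    loss = ((x_hat - x_clean) ** 2).sum(dim=1).mean()  # Eq. (1)
    loss.backward()
    optimizer.step()
    return loss.item()
```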


III. PROPOSED SPEECH RECOGNITION SYSTEM USING AUXILIARY INFORMATION

In conventional speech recognition, the feature vectors y are extracted only from the human voice captured at the microphone input. In this study, we additionally incorporate status information obtained from the robot itself, alongside the feature vectors extracted from the input noisy speech, for robust speech recognition. Most social robots that interact with humans have their own noise sources, such as motor, fan, and movement noise. Unlike external noise, whose occurrence is difficult to predict, the on/off information of the internal noise can be obtained from the robot itself. Therefore, it is possible to utilize the motor state in both training and testing for speech recognition. In this paper, in addition to the spectral-based features extracted from the speech, auxiliary features containing the robot status information are concatenated and used as the augmented input y_a during both training and testing. The operation status of the robot is classified into the basic operation state, in which only the fan and motor are turned on (motor off), and the state in which the robot shakes its head (motor on) according to a human command. Our robot transmits the auxiliary information with the default operation state as "state off" and the moving state as "state on", as shown in Fig. 2.

Our speech recognizer consists of a feature mapping model that stacks two DNNs and an acoustic model based on a single DNN, which predicts the phone sequence as the target. The auxiliary information is used for training all of the DNNs. The first DNN of the front end estimates the clean speech features from the concatenation of the noisy speech features and the auxiliary features. The second DNN estimates the clean speech features once more from the concatenation of the enhanced output of the first DNN with the first DNN's input features. Finally, the input of the back-end DNN, which is the acoustic model, is created by concatenating the auxiliary features with the enhanced features from the front end. During the test phase, the microphone input, the robot status information, and the learned model parameters are used in the same way as during training.
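The data flow described above can be summarized in the following sketch. It is illustrative only: `dnn1`, `dnn2`, and `acoustic_dnn` are assumed to be trained per-frame models, context windows are omitted, and the function names are our own, not from the paper.

```python
import numpy as np

def one_hot_motor_state(motor_on: bool) -> np.ndarray:
    """Auxiliary feature: one-hot encoding of the motor state
    ('state off' = default fan/motor idle, 'state on' = head shaking)."""
    return np.array([0.0, 1.0]) if motor_on else np.array([1.0, 0.0])

def recognize(noisy_mfcc, motor_on, dnn1, dnn2, acoustic_dnn):
    """Per-frame data flow of the proposed system:
    every DNN input is augmented with the auxiliary motor-state feature."""
    aux = one_hot_motor_state(motor_on)

    # Front-end DNN 1: noisy features + auxiliary -> first clean estimate.
    x1 = dnn1(np.concatenate([noisy_mfcc, aux]))

    # Front-end DNN 2: first estimate + original noisy input + auxiliary
    # -> refined clean estimate (stacked feature mapping).
    x2 = dnn2(np.concatenate([x1, noisy_mfcc, aux]))

    # Back-end acoustic model: enhanced features + auxiliary -> phone posteriors.
    phone_posteriors = acoustic_dnn(np.concatenate([x2, aux]))
    return phone_posteriors
```

The key point is that the same two-dimensional one-hot motor-state vector is appended to the input of every DNN, at training and test time alike.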

IV. EXPERIMENTAL SETUP AND RESULTS

A. Speech data

In order to simulate our system in a noisy environment, we first recorded the noise generated by the motor and the fan, which is the noise present in the default operating state of the robot (state off), and the noise generated when the robot moves in response to the user's command (state on). We then mixed these noises with clean speech to build the training and test datasets. We used the TIMIT database [13] for our experiments. Noisy speech data at 5, 10, 15, and 20 dB were generated for both state off and state on, and these data were used for both training and testing. As a result, the number of utterances used for training is 3,696 × 8 = 29,568, and the number of utterances used for cross validation is 400 × 8 = 3,200. The test was evaluated with 192 noisy speech utterances for each dataset (5, 10, 15, and 20 dB in state on/off). We used the Kaldi toolkit [14] for both training and testing. The 13th-order MFCCs, which are spectral-based features, were used as input, together with their delta and delta-delta coefficients. We used a frame size of 25 ms and a step size of 10 ms for feature extraction. In addition, one-hot encoding was applied to the auxiliary feature. For the acoustic model, the output of the utterances is mapped to 48 phonemes.

Fig. 3: Spectrogram comparison: (a) original clean speech signal, (b) microphone input speech signal (state-on noise at 5 dB), (c) enhanced by the single DNN-based algorithm, (d) enhanced by the stacked DNN-based algorithm.
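As a rough illustration of this data preparation, the sketch below mixes recorded robot noise with clean speech at a target SNR and extracts 39-dimensional MFCC+delta+delta-delta features with the one-hot auxiliary feature appended. It uses NumPy and librosa as stand-ins; the actual experiments were run with the Kaldi toolkit, and the sample rate and helper names here are our assumptions.

```python
import numpy as np
import librosa

def mix_at_snr(clean, noise, snr_db):
    """Mix robot noise into clean speech at a target SNR (in dB)."""
    # Tile or trim the noise to the length of the clean utterance.
    if len(noise) < len(clean):
        noise = np.tile(noise, int(np.ceil(len(clean) / len(noise))))
    noise = noise[:len(clean)]
    # Scale the noise so that 10*log10(P_clean / P_noise) == snr_db.
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    noise = noise * np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    return clean + noise

def extract_features(wave, sr=16000, motor_on=False):
    """13 MFCCs + delta + delta-delta (25 ms window, 10 ms step),
    each frame augmented with the one-hot motor-state auxiliary feature."""
    mfcc = librosa.feature.mfcc(y=wave, sr=sr, n_mfcc=13,
                                n_fft=int(0.025 * sr),
                                win_length=int(0.025 * sr),
                                hop_length=int(0.010 * sr))
    feats = np.vstack([mfcc,
                       librosa.feature.delta(mfcc),
                       librosa.feature.delta(mfcc, order=2)])  # (39, T)
    aux = np.array([[0.0], [1.0]]) if motor_on else np.array([[1.0], [0.0]])
    aux = np.repeat(aux, feats.shape[1], axis=1)                # (2, T)
    return np.vstack([feats, aux]).T                            # (T, 41)
```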

B. Model Architecture

The feature mapping model used in this paper has three hidden layers, and the stacked DNN model was built by stacking two such feature mapping DNNs. The initial learning rates of the two DNNs are 0.000001 and 0.000005, respectively, and the momentum is 0.9. The speech features enhanced by the stacked DNN were used to train the subsequent acoustic model DNN, which has five hidden layers and an initial learning rate of 0.0008. Thus, a total of three DNNs were trained, and the auxiliary feature was used to train all of them, as shown in Fig. 2. All layers were trained using 1,024 hidden units with the ReLU activation function [15]. For the evaluation of the proposed model, a model without auxiliary information was used as the baseline; this model was trained in the same way.
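For concreteness, the configuration above might be instantiated as follows. This is a hedged sketch, not the authors' implementation: the input dimensionalities ignore context windows, the momentum of the acoustic model optimizer is assumed to match that of the feature mapping DNNs, and the paper's models were actually trained with Kaldi.

```python
import torch
import torch.nn as nn

def mlp(in_dim, out_dim, n_hidden, hidden=1024):
    """Fully connected network with ReLU hidden layers, as used for all three DNNs."""
    layers, d = [], in_dim
    for _ in range(n_hidden):
        layers += [nn.Linear(d, hidden), nn.ReLU()]
        d = hidden
    layers.append(nn.Linear(d, out_dim))
    return nn.Sequential(*layers)

FEAT_DIM, AUX_DIM, N_PHONES = 39, 2, 48  # MFCC+deltas, motor state, phone targets

# Front-end feature mapping DNNs: three hidden layers each.
dnn1 = mlp(FEAT_DIM + AUX_DIM, FEAT_DIM, n_hidden=3)
dnn2 = mlp(FEAT_DIM + FEAT_DIM + AUX_DIM, FEAT_DIM, n_hidden=3)
# Back-end acoustic model: five hidden layers, 48 phoneme outputs.
acoustic = mlp(FEAT_DIM + AUX_DIM, N_PHONES, n_hidden=5)

# Optimizers with the initial learning rates reported in the paper.
opt1 = torch.optim.SGD(dnn1.parameters(), lr=1e-6, momentum=0.9)
opt2 = torch.optim.SGD(dnn2.parameters(), lr=5e-6, momentum=0.9)
opt3 = torch.optim.SGD(acoustic.parameters(), lr=8e-4, momentum=0.9)  # momentum assumed
```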

C. Results

Examples of the results of the feature mapping models based on both the single DNN and the stacked DNN are shown in Fig. 3 and Fig. 4. The phone error rate (PER) performance on the TIMIT dataset is shown in Table II.


TABLE II: PER (phone error rate) results (%)

                                         Fan noise                    Head shaking (motor-driven) noise
Feature mapping DNN   Features           5 dB   10 dB  15 dB  20 dB   5 dB   10 dB  15 dB  20 dB   Avg.
Baseline
  X (none)            MFCC               27.7   25.2   24.4   23.8    28.5   25.9   24.4   23.8    25.4
  Single DNN          MFCC               27.0   24.9   23.3   22.7    27.0   23.4   22.7   22.6    24.2
  Stacked DNN         MFCC               26.4   24.6   23.1   22.5    26.9   23.4   23.1   22.5    24.0
Proposed
  X (none)            MFCC + Auxiliary   27.6   25.1   24.2   23.5    27.8   25.2   24.1   23.1    25.0
  Single DNN          MFCC + Auxiliary   26.5   24.6   23.3   22.7    26.7   23.2   22.7   22.5    24.0
  Stacked DNN         MFCC + Auxiliary   26.4   24.5   23.3   22.7    26.5   23.1   22.4   22.5    23.9

We compared our model using the auxiliary features against the model without the auxiliary features for three configurations: (1) the model without feature mapping, (2) the model with single DNN-based feature mapping, and (3) the model with stacked DNN-based feature mapping. As a result, the models using auxiliary information performed better than all of the corresponding conventional models. With auxiliary information, the feature mapping model using the stacked DNN showed the highest performance. Thus, we can see that the motor state is a relevant feature that makes the DNNs learn efficiently for the head shaking robot.

V. CONCLUSIONS

In this paper, we trained a speech recognition model that is robust to the internal noise of a social robot by using state information from the robot as an auxiliary feature. We used the head shaking robot JIBO for the experiments. Unlike a conventional speech recognition system, we used the information obtained from the robot itself as an auxiliary feature, in addition to the speech information obtained from the microphones. Furthermore, the feature mapping model used a stacked DNN to further improve performance. Experimental results show that the stacked DNN is a suitable method for feature enhancement. The proposed model outperformed all the baseline models without auxiliary features, and the highest performance was observed for the model using the stacked DNN with the auxiliary information in the feature mapping model. The algorithm proposed in this paper enables robust speech recognition even for robots with internal noise such as JIBO.

ACKNOWLEDGMENT

This work was supported by the Technology Innovation Program (10076583, Development of free-running speech recognition technologies for embedded robot system) funded by the Ministry of Trade, Industry & Energy (MOTIE, Korea).

REFERENCES

[1] B. Wang, "IBM putting Watson into SoftBank Pepper robot," Next Big Future, 2016.

[2] P. Rane, V. Mhatre, and L. Kurup, "Study of a home robot: Jibo," International Journal of Engineering Research and Technology, vol. 3, no. 10, pp. 490-493, Oct. 2014.

[3] 21st Century Robot [Website]. (2018, Feb. 26). https://www.21stcenturyrobot.com

[4] X. Feng, Y. Zhang, and J. Glass, "Speech feature denoising and dereverberation via deep autoencoders for noisy reverberant speech recognition," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., 2014.

[5] D. Yu, L. Deng, J. Droppo, J. Wu, Y. Gong, and A. Acero, "A minimum-mean-square-error noise reduction algorithm on mel-frequency cepstra for robust speech recognition," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., 2008.

[6] Y. Xu, J. Du, L. R. Dai, and C. H. Lee, "A regression approach to speech enhancement based on deep neural networks," IEEE Trans. Audio, Speech, Lang. Process., vol. 23, no. 1, pp. 7-19, Jan. 2015.

[7] Y. X. Wang, A. Narayanan, D. L. Wang, and B. Ramabhadran, "Auto-encoder bottleneck features using deep belief networks," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., 2012.

[8] A. Narayanan and D. L. Wang, "Ideal ratio mask estimation using deep neural networks for robust speech recognition," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., 2013.

[9] H. Seo, M. Lee, and J.-H. Chang, "Integrated acoustic echo and background noise suppression based on stacked deep neural networks," Applied Acoustics, vol. 133, pp. 194-201, Apr. 2018.

[10] D. S. Williamson, Y. X. Wang, and D. L. Wang, "Complex ratio masking for monaural speech separation," IEEE Trans. Audio, Speech, Lang. Process., vol. 24, no. 3, pp. 483-492, 2016.

[11] T. N. Sainath, B. Kingsbury, and B. Ramabhadran, "Auto-encoder bottleneck features using deep belief networks," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., 2012.

[12] T. N. Sainath et al., "Factored spatial and spectral multichannel raw waveform CLDNNs," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., 2016.

[13] J. S. Garofolo, L. F. Lamel, W. M. Fisher, J. G. Fiscus, D. S. Pallett, and N. L. Dahlgren, "TIMIT acoustic phonetic continuous speech corpus," Linguistic Data Consortium, 1993.


[14] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz, J. Silovsky, G. Stemmer, and K. Vesely, "The Kaldi speech recognition toolkit," in Proc. IEEE Workshop on Automatic Speech Recognition and Understanding, 2011.

[15] G. E. Dahl, T. N. Sainath, and G. E. Hinton, "Improving deep neural networks for LVCSR using rectified linear units and dropout," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., 2013.
