
ROBUST SELF-TRAINING SYSTEM FOR SPOKEN QUERY INFORMATION RETRIEVAL USING PITCH RANGE VARIATIONS

Yacine Benahmed, LARIHS Laboratory, Université de Moncton, Campus de Shippagan, E8S 1P6, Canada, email: [email protected]

Sid-Ahmed Selouani, LARIHS Laboratory, Université de Moncton, Campus de Shippagan, E8S 1P6, Canada, email: [email protected]

Abstract

This paper presents an Automatic User Profile Building and Training (AUPB&T) system using voice pitch variation for speech recognition engines. The problem with current ASR engines is that their vocabularies are usually only suited for general usage. Another problem with current ASR engines is that there is no easy means for visually challenged users to train the engine to improve its performance.

Our proposed solution consists of a system that accepts a user's documents and favorite web pages. These documents are then parsed and their words added to the ASR engine's lexicon. Next, the system uses those documents to start an ASR training session. The training is completed automatically by using a high-quality text-to-speech (TTS) natural voice. In order to overcome the problem of the limited number of high-quality natural TTS voices available, we propose to integrate voice pitch variation during the training phase of AUPB&T, which will cover a broader range of user variability.

The results of our experiments using standard ASR and TTS engines show that the AUPB&T system using pitch variation improved the recognition rate for an unknown speaker β.

Keywords: speech recognition; robustness; text-to-speech; pitch variation; human-computer interaction.

1. Introduction

With the advent of the ever-growing broadband communication market and easier-to-use development tools, an emerging vision is to provide consumers with a more natural way to interact with computer applications through the use of conversational interfaces. The maturity of current Text-To-Speech (TTS) and Automatic Speech Recognition (ASR) engines allows us to include them in our systems, either as part of a multimodal interaction interface or as a completely standalone interface.

This paper presents a system allowing automatic user profile building and training (AUPB&T) for speech recognition. The problem with current-generation ASR engines is that their vocabularies are usually only suited for general usage. Hence, visually challenged users have no easy means of enriching the engine's vocabulary and training it to their voice. This, as well as the fact that other users generally shun this type of input method because of the long and somewhat finicky training process, encouraged us to propose a way to get around


those major problems. Our proposed solution is a system that accepts a user's documents and favorite web pages (URLs) and feeds them to a TTS module in order to automatically build and train a user profile. The system parses these documents and web pages, and adds the words it finds to the system's lexicon. The acquired lexicon is then used to launch an automatic training session using a high-quality, natural-sounding synthesized voice. In order to overcome the problem of the limited number of high-quality natural TTS voices available, we propose to integrate pitch range variation during the training phase of AUPB&T. This should theoretically cover a broader range of user variability. By doing this, the system builds its model from several artificial voices and consequently improves speech recognition for a given user. The results of our experiments, using commercially available ASR and TTS engines, show that the AUPB&T system using pitch variation yielded a greater improvement in recognition rate over the baseline system, for an unknown user β, than the AUPB&T system without pitch variation did. This was achieved without constraining the user to long and fastidious manual training sessions.

The outline of this paper is as follows. Section 2 deals with the user profile building and automatic training (AUPB&T) system, and discusses some of its advantages and functionalities. Section 3 introduces the pitch variation technique. Section 4 describes the experimental setup and the evaluation of the AUPB&T systems using pitch variation. Finally, in Section 5, we conclude and discuss our results.

2. AUPB&T System

Due to the general vocabulary provided by recognition engines, we needed to find a way to expand that vocabulary to suit the needs of individual users and tailor it to their day-to-day vocabulary, be it science, art, literature, etc. To get around this problem we built an automatic user profile building and training (AUPB&T) system. The system adds the words found in documents and/or URLs given by the user, and then initiates an automatic training session during which the user does not need to talk to the system, since a natural synthesized voice is used instead. Of course, to accommodate visually impaired users, we provide a means to control the system using voice commands. As shown in Figure 1, the AUPB&T system is


built in a modular fashion to facilitate the addition of new functionalities.

Figure 1. The AUPB&T system diagram.

2.1. Main Module of AUPB&T

The main module of AUPB&T controls the fetching of documents through the web browser engine and the file accessor engine. Once the two engines have finished opening their documents, they pass them to their respective parsers. The main module is also in charge of telling the parser module that it can pass its text to the lexicon module and, after that is done, of adding the words to the trainer module dictionary.

2.2. Voice Control Module

The voice control module is used to tell the main module which documents should be opened and from where to get the web documents. Two grammars are required for this task. The first one is the command grammar, which contains all the vocabulary necessary to control the application. The second one is the spelling grammar, used to spell out the paths and URLs of the documents needed for profile building and training.

2.3. Parser and Lexicon Module

The parser module contains two engines and a temporary storage area. The first engine is an HTML parser used to filter out garbage text. The second one opens raw text documents as well as Microsoft Word documents, and eventually PDF documents. Once the parsing is done, the documents are appended to the storage area for later processing.

The lexicon module converts the text coming from the parser module into its phonetic representation [1]. After this is done, each word is added to the speech recognition engine with its corresponding phonetic code.
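As an illustration only, the following Python sketch shows how such a parse-and-add flow might look. The TextExtractor class and the to_phonemes callback are hypothetical stand-ins of ours; the actual system relies on its own HTML/Word parsers and on SpeechStudio's pronunciation control [1] for the phonetic codes.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Keep only the text nodes of a page, discarding markup
    (the 'garbage text' the paper's HTML parser filters out)."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

    def text(self):
        return " ".join(self.chunks)

def build_lexicon(html_pages, to_phonemes):
    """Collect the unique words of the fetched pages and pair each word
    with a phonetic code. `to_phonemes` is a hypothetical grapheme-to-
    phoneme lookup standing in for the lexicon module's converter."""
    words = set()
    for page in html_pages:
        extractor = TextExtractor()
        extractor.feed(page)
        words.update(w.strip(".,;:!?\"()").lower()
                     for w in extractor.text().split())
    return {w: to_phonemes(w) for w in words if w}
```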

2.4. Trainer Module

The trainer module is used to automatically train the speech recognition engine. After the lexicon module has finished processing, the text from the storage area of the parser module is passed to the trainer module, which normalizes it and then opens a training session. Once the training session is opened, the text is passed to the TTS engine, which reads the whole text until the training session is over.
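A rough sketch of the normalization step follows; the exact rules are our assumption, since the paper only says the trainer "normalizes" the text, and Section 4.1 notes that special characters, numerals, and punctuation were stripped.

```python
import re

def normalize(text: str) -> str:
    """Prepare parsed text for the training session. Assumed rules: keep
    only letters, apostrophes, and whitespace, then collapse whitespace
    and lowercase the result."""
    text = re.sub(r"[^A-Za-z'\s]", " ", text)  # drop $, &, *, digits, punctuation
    return re.sub(r"\s+", " ", text).strip().lower()
```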

3. Pitch Variation

The fundamental frequency, f0, is the dominant sound frequency produced by the vocal cords [2]. Pitch variation can thus be defined as the variation of f0. Through our pitch variation system, we aim to transform the perceived speaker identity by converting the speaker's prosodic characteristics. When considering pitch contours, most systems only transform the pitch by simple scaling [3]. Typically, voice conversion aims at transforming an utterance spoken by a given source speaker in such a way that it is perceived to be spoken by another target speaker.
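In the simple scaling case this amounts to f0'(t) = α · f0(t), where α is a constant multiplier; the notation here is ours, introduced only to make the scaling explicit.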

In our case, the goal of using pitch variation during speech recognition training is to cover a broader range of voices in order to obtain a more tolerant speech model. The following algorithm was used to vary the pitch of Paul's voice: it consists of averaging the fundamental frequency of the voice and then applying a multiplier to it. We implemented pitch variation using SAPI TTS XML [4]

tag support, and NeoSpeech's [5] support for pitch variation. It is important to note that not all natural TTS voices support pitch variation, so it is important to find one that explicitly does. SAPI supports pitch variation through the <pitch absmiddle="n"> tag, which controls the relative pitch of the voice. The <pitch> tag supports a scale of -10 to 10, where -10 is a "low"-pitched voice and 10 is a "high"-pitched voice. We implemented it in our training session by equally dividing the training text passed to the TTS engine into different pitch sections.
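A minimal sketch of this splitting step is shown below, in Python; the number of sections and the particular pitch levels are illustrative choices of ours, since the paper states only that the text was divided equally among pitch sections.

```python
def pitch_annotated_sections(text, pitch_levels=(-10, -5, 0, 5, 10)):
    """Split `text` into len(pitch_levels) equal word chunks and wrap
    each chunk in the SAPI 5 <pitch absmiddle="n"> XML tag."""
    words = text.split()
    n = len(pitch_levels)
    chunk = max(1, len(words) // n)
    sections = []
    for i, level in enumerate(pitch_levels):
        start = i * chunk
        end = len(words) if i == n - 1 else (i + 1) * chunk  # last chunk takes the remainder
        body = " ".join(words[start:end])
        sections.append('<pitch absmiddle="%d">%s</pitch>' % (level, body))
    return sections
```

Each section would then be sent to the TTS engine while the recognizer's training session is open, so that the acoustic model hears the same text at several pitches.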


Figure 2. Excerpt from the multi-subject test document, in this case an excerpt from Shakespeare's Hamlet [6].

4. Experimental Setup & Results

To find out how effective custom profiling and training using pitch variation could be for an average user, experiments were set up as follows.

4.1. Text Material and Voice Platforms

To evaluate the systems, we used two separate machines: one, referred to as the baseline, which did not receive any training and was used to establish the "no training" performance of the system, and a second one on which we ran training sessions to determine the efficacy of our training method.

We used two documents: one for training, consisting of 1338 words covering several subject matters, and a second one for testing, consisting of 1909 words, of which 25% came from the training document. Figure 2 shows an excerpt from the testing document.

We tested the system with two speakers: the first, Paul, a voice from NeoSpeech that supports pitch variation, and the second, a (young male) human speaker called β, who did not participate in the training process.

To ensure a consistent set of results with β, we recorded him reading the test document. We also refrained from normalizing the volume of the wave file, so as to reproduce a normal usage scenario.

As for the recognition engine, we used the free SAPI 5.1-compatible Microsoft English (U.S.) v6.1 Recognizer [4]. It should be noted that almost all special characters ($, &, *, etc.), numerical characters, and punctuation marks were stripped from the text during testing.

4.2. Performance Measurements and Methodology

To compare the original text and the recognized text, we created an algorithm that places each text's words into a separate array. The algorithm searches the recognized-text array for the nth word of the original text's array within a 7-word range starting from the index i of the previously found word (see Figure 3 for more explanation). If word n is not found, then we increment i by one so as to avoid stalling and skewing the results.

Figure 3. Example of how the algorithm calculates the omission rate.

The four following measures were used to evaluate the speech recognition performance of our systems. Omission count is the number of words from the original text that were not recognized. Insertion count is the difference between the number of words in the recognized text and the number of words in the original text. Position count is the number of words in the recognized text that do not appear in the same order as the ones in the original text. Finally, the percentage word recognition rate is also considered, calculated as (total original words - total omissions) / total original words * 100.
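The Python sketch below shows one literal reading of these definitions; the names are ours, the 7-word window follows the description above, and the position count is omitted because the paper does not fully specify how out-of-order words are matched.

```python
def score(original: str, recognized: str, window: int = 7):
    """Count omissions and insertions and compute the word recognition
    rate as defined above (position errors are not reproduced here)."""
    orig = original.split()
    reco = recognized.split()
    i = 0            # index just past the previously matched word
    omissions = 0
    for word in orig:
        span = reco[i:i + window]   # search the next `window` recognized words
        if word in span:
            i += span.index(word) + 1
        else:
            omissions += 1
            i += 1                  # advance anyway, to avoid stalling
    insertions = len(reco) - len(orig)
    rate = (len(orig) - omissions) / len(orig) * 100
    return omissions, insertions, rate
```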

The methodology we used for testing was as follows: the text was tested three times with the same voice (background adaptation was unchecked), and the average was recorded.

For our experimental results, note that in the tables and graphs, untrained systems are referenced as NT, and trained ones as Tnpv, where npv stands for no pitch variation, or Tpv, where pv stands for pitch variation.

4.3. Experiment Results

The first phase of the experiment consisted of testing the document with the two voices, Paul and β, on the baseline system; as expected, the results averaged an 80% recognition rate, which is standard for today's recognition systems. The second phase consisted of passing the training document once through the AUPB&T system with no pitch variation, using Paul's voice. The document was then tested with Paul. As Figure 4 shows, the recognition rate improved to 88%, an improvement of 8%.



Table 1. Comparative results giving omission, insertion, and position errors of the recognizer without training (NT) and with a one-pass use of the AUPB&T system, with (Tpv) and without (Tnpv) pitch variation, in the context of a multi-subject spoken query consisting of 1909 words.

Voices        Omission   Insertion   Position
Paul (NT)        372        137          21
β (NT)           378        108          42
Paul (Tnpv)      223         44          49
β (Tnpv)         296         78          37
Paul (Tpv)       240          9          36
β (Tpv)          274        109          36
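As a consistency check (our arithmetic, not the paper's): with 1909 test words, the Paul (Tnpv) row gives a recognition rate of (1909 - 223) / 1909 * 100 ≈ 88.3%, matching the 88% reported in the text, and the β (Tpv) row gives (1909 - 274) / 1909 * 100 ≈ 85.6%, matching the reported 86%.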

Finally, to test the efficiency of the system (with no pitch variation) for a regular (unknown) user, the β speaker's recording of the test document was played on the trained system. This yielded an 84% recognition rate, an improvement of 4% in continuous speech recognition. The third phase of the experiment consisted of passing the training document once through the AUPB&T system with pitch variation, using Paul's voice. Again, the document was tested with Paul. As Figure 4 shows, the recognition rate improved to 87%, an improvement of 7%, slightly less than without pitch variation. Finally, to test the efficiency of the system (with pitch variation) for a regular (unknown) user, β's recording of the test document was played on the trained system. This yielded an 86% recognition rate, an improvement of 6%, which is a 2% improvement over the basic AUPB&T system. This result is important to us, as it validates our hypothesis that adding pitch variation to our AUPB&T system improves continuous speech recognition even more for users who do not necessarily have a voice similar to the TTS voice.

5. Conclusion

In this paper we have presented the Automatic User Profile Building and Training (AUPB&T) system that we previously built, as well as an added improvement that consists of varying the pitch of the synthetic voice during training. The goal of this improvement was to cover a broader range of fundamental frequencies for the user's voice model. With this system, we believe that we can improve the accessibility of speech technologies for visually challenged users as well as for the general public. There are two main advantages to our system. The first is the ability to improve speech recognition without having to spend time reading text. The second is that the system uses the user's own documents to launch the training session, which results in a tailored user profile. Those who do not have the ability or the patience to deal with the manual training process (reading the training text) should find it especially attractive.

Our experiments showed that the AUPB&T system with voice pitch variation averaged an improvement of 6% in continuous speech recognition performance, a consistent 2% increase over the AUPB&T system that did not use voice pitch variation. This result is significant to us: it confirms our hypothesis that varying the pitch of the TTS voice during training yields a greater improvement in continuous speech recognition for an unknown speaker β.

Figure 4. Percentage word recognition rate obtained by the Paul natural TTS voice and by speaker β, who did not participate in training, for the untrained (NT) and trained (Tnpv, Tpv) systems.

A future use of the AUPB&T system will be to integrate it into an e-learning platform. It will be useful in this context, as the vocabulary used in a learning setting is quite often specialized. This will allow students to experience good recognition performance for their specialized vocabulary.

Finally, the AUPB&T system using voice pitch variation is a useful tool for improving continuous speech recognition performance for the visually challenged, as well as for users who do not want to go through long manual training sessions.

References

[1] SpeechStudio Inc., "Pronunciation Control," http://www.speechstudio.com

[2] Ceyssens, T., Verhelst, M., and Wambacq, P., "A strategy for pitch conversion and its evaluation," Proc. 3rd IEEE Benelux Signal Processing Symposium (SPS-2002), Leuven, Belgium, 2002.

[3] Filipsson, M., "Speech Analysis Tutorial," http://www.ling.lu.se/research/speechtutorial/tutorial.html

[4] Microsoft Corporation, "Microsoft Speech Application Software Development Kit (SASDK), Version 1.0," http://www.microsoft.com/downloads

[5] NextUp Technologies, LLC, "NeoSpeech TTS Voices," http://www.nextup.com/neospeech.html

[6] Shakespeare, W., "Hamlet," http://www.shakespeare-online.com/plays/hamlet_3_2.html
