[IEEE 2006 Canadian Conference on Electrical and Computer Engineering - Ottawa, ON, Canada...
ROBUST SELF-TRAINING SYSTEM FOR SPOKEN QUERY INFORMATION
RETRIEVAL USING PITCH RANGE VARIATIONS
Yacine Benahmed, LARIHS Laboratory, Université de Moncton, Campus de Shippagan, E8S 1P6, Canada,
email: [email protected]
Abstract. This paper presents an Automatic User Profile Building and Training (AUPB&T) system using voice pitch variation for speech recognition engines. The problem with current ASR engines is that their vocabularies are usually only suited for general usage. Another problem with current ASR engines is that there is no easy means for visually challenged users to train the engine to improve its performance.

Our proposed solution consists of a system that will accept a user's documents and favorite web pages. These documents will then be parsed and their words added to the ASR engine's lexicon. Next, the system uses those documents to start an ASR training session. The training will be completed automatically by using a high quality text-to-speech (TTS) natural voice. In order to overcome the problem of the limited number of high quality natural TTS voices available, we propose to integrate voice pitch variation during the training phase of AUPB&T, which will cover a broader range of user variability.

The results of our experiments using standard ASR and TTS engines show that the AUPB&T system using pitch variation improved the recognition rate for an unknown speaker β.

Keywords: speech recognition; robustness; text-to-speech; pitch variation; human-computer interaction.
1. Introduction
With the advent of the ever growing broadband communication market and easier-to-use development tools, an emerging vision is to provide consumers with a more natural way to interact with computer applications through the use of conversational interfaces. The maturity of current Text-To-Speech (TTS) and Automatic Speech Recognition (ASR) engines allows us to include them in our systems as either a part of a multimodal interaction interface or as a completely stand-alone interface.
This paper presents a system allowing automatic user profile building and training (AUPB&T) for speech recognition. The problem with current-generation ASR engines is that their vocabularies are usually only suited for general usage. Hence, visually challenged users have no easy means of enriching and training the engine to their voice. This, as well as the fact that other users generally shun this type of input method because of the long and somewhat picky training process, encouraged us to propose a way to get around
1-4244-0038-4 ©2006 IEEE. CCECE/CCGEI, Ottawa, May 2006
Sid-Ahmed Selouani, LARIHS Laboratory, Université de Moncton, Campus de Shippagan, E8S 1P6, Canada,
email: [email protected]
those major problems. Our proposed solution is a system that will accept a user's documents and favorite web pages (URLs), and will feed them to a TTS module in order to automatically build and train a user profile. The system parses these documents and web pages, and adds the found words to the system's lexicon. The acquired lexicon is then used to launch an automatic training session using a high quality, natural-sounding synthesized voice. In order to overcome the problem of the limited number of high quality natural TTS voices available, we propose to integrate pitch range variation during the training phase of AUPB&T. This will theoretically allow the covering of a broader range of user variability. By doing this, the system will build its model based on different artificial voices and consequently improve speech recognition for a given user. The results of our experiments, using commercially available ASR and TTS, show that the AUPB&T system using pitch variation resulted in a greater improvement in recognition rate for an unknown speaker β on a baseline system than the AUPB&T system without pitch variation. This was achieved without constraining the user to long and fastidious manual training sessions.
The outline of this paper is as follows. Section 2 deals with the user profile and automatic training (AUPB&T) system, and discusses some of this system's advantages and functionalities. Section 3 introduces the pitch variation technique. Section 4 proceeds with the description of the experimental setup and the evaluation of the AUPB&T systems using pitch variation. Finally, in Section 5, we conclude and discuss our results.
2. AUPB&T System
Due to the general vocabulary provided by recognition engines, we needed to find a way to expand that vocabulary to suit the needs of individual users and tailor it to their day-to-day vocabulary, be it science, art, literature, etc. To get around this problem we built an automatic user profile building and training (AUPB&T) system. The system adds the words found in documents and/or URLs given by the user, and then initiates an automatic training session where the user does not need to talk to the system, since a natural synthesized voice is used instead. Of course, to accommodate visually impaired users, we provide a means to control the system using voice commands. As shown by Figure 1, the AUPB&T system is
built in a modular fashion to facilitate the addition of new functionalities.

Figure 1. The AUPB&T system diagram

2.1. Main Module of AUPB&T

The main module of AUPB&T controls the fetching of the documents through the web browser engine and the file accessor engine. Once the two engines have finished opening their documents, they pass them to their respective parsers. The main module is also in charge of telling the parser module that it can pass its text to the lexicon module and, after it is done, of adding the words to the trainer module dictionary.

2.2. Voice Control Module

The voice control module is used to tell the main module which documents should be opened, and from where to get the web documents. Two grammars are required for this task; the first one is the command grammar, which has all the necessary vocabulary to control the application. The second one is the spelling grammar, used to spell out the paths and URLs of the documents needed for the profile building and training.

2.3. Parser and Lexicon Module

The parser module contains two engines and a temporary storage area. The first engine is an HTML parser used to filter out garbage text. The second one is used to open raw text documents as well as Microsoft Word documents, and eventually PDF documents. Once the parsing is done, the documents are appended to the storage area for later processing.

The lexicon module is used to convert the text coming from the parser module to its phonetic representation [1]. After this is done, each word is added to the speech recognition engine with its corresponding phonetic code.

2.4. Trainer Module

The trainer module is used to automatically train the speech recognition engine. After the lexicon module is done processing, the text from the storage area of the parser module is passed to the trainer module, which normalizes it and then opens a training session. Once the training session is opened, the text is passed to the TTS engine, which reads the whole text until the training session is over.
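As an illustration of the parser and lexicon modules described above, the following Python sketch filters visible text out of an HTML page and maps each new word to a phonetic code. The class and function names are hypothetical, and the letter-spelling converter is only a stand-in marking where the real grapheme-to-phoneme tool [1] would plug in.

```python
import re
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Keeps only visible text from an HTML page, skipping script/style 'garbage'."""
    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip = 0
    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1
    def handle_data(self, data):
        if not self._skip and data.strip():
            self.chunks.append(data.strip())

def parse_document(source, is_html):
    """Parser module: return clean text from an HTML page or a raw text document."""
    if is_html:
        extractor = TextExtractor()
        extractor.feed(source)
        return " ".join(extractor.chunks)
    return source

def build_lexicon(text, to_phonemes):
    """Lexicon module: map each new word to its phonetic code.
    `to_phonemes` stands in for the real grapheme-to-phoneme converter [1]."""
    lexicon = {}
    for word in set(re.findall(r"[A-Za-z']+", text.lower())):
        lexicon[word] = to_phonemes(word)
    return lexicon

# Usage with a trivial stand-in G2P function (spells the word letter by letter):
page = "<html><body><p>Spoken query retrieval</p><script>x=1</script></body></html>"
text = parse_document(page, is_html=True)
lexicon = build_lexicon(text, to_phonemes=lambda w: " ".join(w))
```

In the actual system, the resulting word-to-phoneme pairs would then be registered with the speech recognition engine before the trainer module takes over.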
3. Pitch Variation
The fundamental frequency, f0, is the dominating sound frequency produced by the vocal cords [2]. Pitch variation can thus be defined as the variation of f0. Through our pitch variation system, we aim to transform the perceived speaker identity by converting the speaker's prosodic characteristics. When considering pitch contours, most systems only transform the pitch by simple scaling [3]. Typically, voice conversion aims at transforming an utterance spoken by a given source speaker in such a way that it is perceived to be spoken by another target speaker.
In our case, the goal of using pitch variation during speech recognition training is to cover a broader range of voices in order to obtain a more tolerant speech model. The following algorithm was used to vary the pitch of Paul's voice. It consists of averaging the fundamental frequency of the voice and then applying a multiplier to it.

We implemented pitch variation using SAPI TTS XML [4] tag support, and NeoSpeech's [5] support for pitch variation. It is important to note that not all natural TTS voices support pitch variation, so it is important to find one that explicitly does. The way SAPI supports pitch variation is through the use of the <pitch absmiddle="n"> tag, which is used to control the relative pitch of the voice. The <pitch> tag supports a scale of -10 to 10, where -10 is a "low" pitched voice and 10 is a "high" pitched voice. We implemented it in our training session by equally dividing the training text that was passed to the TTS engine into different pitch sections.
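The equal division of the training text into pitch sections can be illustrated with a short sketch that wraps successive slices of the text in SAPI 5 <pitch absmiddle="n"> tags. The function name and the choice of five pitch levels are assumptions for illustration; the tag syntax and the -10 to 10 scale follow the description above.

```python
def pitch_sections(text, levels=(-10, -5, 0, 5, 10)):
    """Split the training text into len(levels) equal word slices and wrap each
    slice in a SAPI 5 XML <pitch absmiddle="n"> tag (scale -10 .. 10)."""
    words = text.split()
    per = max(1, -(-len(words) // len(levels)))  # ceiling division
    sections = []
    for i, level in enumerate(levels):
        chunk = " ".join(words[i * per:(i + 1) * per])
        if chunk:
            sections.append('<pitch absmiddle="%d">%s</pitch>' % (level, chunk))
    return "\n".join(sections)

# Usage: the marked-up text is what gets passed to the TTS engine.
markup = pitch_sections("to be or not to be that is the question")
```

Each training pass thus exposes the recognizer to the same lexical material spoken at several fundamental frequencies, which is the intended broadening of the voice model.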
Figure 2. Excerpt from the multi-subject test document, in this case an excerpt from Shakespeare's Hamlet [6].
4. Experimental Setup & Results
To find out how effective custom profiling and training using pitch variation could be for an average user, experiments were set up as follows.
4.1. Text Material and Voice Platforms
To evaluate the systems, we used two separate machines: one, referred to as the baseline, which did not receive any training and was used to establish the "no training" performance of the system, and a second one on which we ran training sessions to determine the efficacy of our training method.

We used two documents: one for training, which consisted of 1338 words covering several subject matters, and a second one for testing, consisting of 1909 words, of which 25% came from the training document. Figure 2 shows an excerpt from the testing document.

We tested the system with two speakers: the first one, Paul, a voice from NeoSpeech which supports pitch variation, and the second (young male) speaker, called β, being a human speaker who did not participate in the training process. To ensure a consistent set of results with β, we recorded him reading the test document. We also refrained from normalizing the volume of the wave file so as to reproduce a normal usage scenario.

As for the recognition engine, we used the free SAPI 5.1 compatible Microsoft English (U.S.) v6.1 Recognizer [4]. It should be noted that almost all special characters ($, &, *, etc.), numerical characters, and punctuation marks were stripped from the text while testing.
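The stripping of special characters, numerals, and punctuation mentioned above can be approximated by a single regular-expression pass. This is a minimal sketch; the exact expression and the decision to keep apostrophes are illustrative assumptions rather than the actual preprocessing.

```python
import re

def normalize(text):
    """Remove special characters ($, &, *, ...), digits, and punctuation,
    keeping only letters, apostrophes, and whitespace, then collapse spaces."""
    text = re.sub(r"[^A-Za-z'\s]", " ", text)
    return " ".join(text.split())

cleaned = normalize("Hamlet, Act 3 Scene 2: 'Speak the speech' & more!")
# → "Hamlet Act Scene 'Speak the speech' more"
```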
4.2. Performance Measurements and Methodology
For comparing the original text and the recognized text, we created an algorithm that places each text's words into a separate array. The algorithm will search the recognized text array for the nth word of the original text's array in a 7-word range starting from the index i of the previously found word (see Figure 3 for more explanation). If word n is not found, then we increment i by one so as to avoid stalling and skewing the results.

Figure 3. Example of how the algorithm calculates the omission rate

The four following measures were used to evaluate the
speech recognition performance of our systems. Omission count is the number of words from the original text that were not recognized. Insertion count is the difference between the number of words in the recognized text and the number of words in the original text. Position count is the number of words in the recognized text that do not appear in the same order as the ones from the original text. Finally, the percentage word recognition rate is also considered, calculated as (total original words - total omissions) / total original words × 100.
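The comparison algorithm and two of the four measures can be sketched as follows. Only the 7-word search window, the increment-on-miss rule, and the recognition-rate formula come from the text; the function name is an assumption, and the position-count bookkeeping is omitted from this sketch.

```python
def score(original, recognized, window=7):
    """Count omissions with a forward 7-word search window, then derive the
    insertion count and the percentage word recognition rate."""
    orig = original.split()
    rec = recognized.split()
    omissions = 0
    i = 0  # index of the previously found word in the recognized array
    for word in orig:
        span = rec[i:i + window]
        if word in span:
            i += span.index(word)  # jump to the match
        else:
            omissions += 1
            i += 1                 # avoid stalling and skewing the results
    insertions = len(rec) - len(orig)
    rate = (len(orig) - omissions) / len(orig) * 100
    return omissions, insertions, rate

# Usage on a toy pair: "not" is omitted by the recognizer.
om, ins, rate = score("to be or not to be", "to be or to be now")
```

On this toy input the sketch reports one omission and a recognition rate of 5/6, i.e. about 83%.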
The methodology we used for testing is as follows: the text was tested three times with the same voice (background adaptation was unchecked), and the average was recorded.
For our experiment results, note that in the tables and graphs untrained systems are referenced as NT, and the trained ones are referenced as Tnpv, where npv stands for no pitch variation, and Tpv, where pv stands for pitch variation.
4.3. Experiment Results
The first phase of the experiment consisted of testing the document with the two voices, Paul and β, on the baseline system and, as expected, the results averaged an 80% recognition rate, which is standard for today's recognition systems. The second phase consisted of passing the training document once through the AUPB&T system with no pitch variation using Paul's voice. The document was then tested with Paul. As Figure 4 shows, the recognition rate improved to 88%, which is an improvement of 8%.
Table 1. Comparative results giving omission, insertion and position errors of the recognizer without training (NT) and with a one-pass use of the AUPB&T system with and without pitch variation, in the context of a multi-subject spoken query consisting of 1909 words.

| Voice | Omission | Insertion | Position |
|---|---|---|---|
| Paul (NT) | 372 | 137 | 21 |
| β (NT) | 378 | 108 | 42 |
| Paul (Tnpv) | 223 | 44 | 49 |
| β (Tnpv) | 296 | 78 | 37 |
| Paul (Tpv) | 240 | 9 | 36 |
| β (Tpv) | 274 | 109 | 36 |
Finally, to test the efficiency of the system (with no pitch variation) for a regular (unknown) user, the β speaker's recording of the unknown-words document was played on the trained system. This yielded an 84% recognition rate; an improvement of 4% in continuous speech recognition rate is achieved. The third phase of the experiment consisted of passing the training document once through the AUPB&T system with pitch variation using Paul's voice. Again, the document was tested with Paul. As Figure 4 shows, the recognition rate improved to 87%, which is an improvement of 7%, slightly less than without pitch variation. Finally, to test the efficiency of the system (with pitch variation) for a regular (unknown) user, β's recording of the unknown-words document was played on the trained system. This yielded an 86% recognition rate. An improvement of 6%, which is a 2% improvement over the basic AUPB&T system, in continuous speech recognition rate is achieved. This result is important to us as it validates our hypothesis that adding pitch variation to our AUPB&T system will improve continuous speech recognition even more for users that do not necessarily have a voice similar to the TTS voice.
5. Conclusion
In this paper we have presented the Automatic User Profile Building and Training (AUPB&T) system that we previously built, as well as an added improvement which consists of varying the pitch of the synthetic voice during training. The goal of our improvement was to cover a broader range of fundamental frequencies for the user's voice model. With this system, we believe that we can improve the accessibility of speech technologies to visually challenged users as well as to the general public. There are two main advantages to our system. The first one is the ability to improve speech recognition without having to spend time reading text. The second one is that the system uses the user's documents to launch the training session. This in fact results in a tailored user profile. Those who do not have the ability or the patience to deal with the manual training process (reading the training text) should find it especially attractive.
Our experiments showed that the AUPB&T system with voice pitch variation averaged an improvement of 6% in continuous speech recognition performance. This is a consistent 2% performance increase over the AUPB&T system that did not use voice pitch variation. This result is significant to us: it confirms our hypothesis that varying the pitch of the TTS voice during training should yield a greater improvement in continuous speech recognition for an unknown speaker β.

Figure 4. Percentage of word recognition rate (NT, Tnpv, Tpv) obtained by the Paul natural TTS voice and by the β speaker who did not participate in training.

A future use of the AUPB&T system will be to integrate it into an e-learning platform. It will be useful in this context as the vocabulary used in a learning context is quite often specialized. This will allow students to experience good recognition performance for their specialized vocabulary.

Finally, the AUPB&T system using voice pitch variation is a great tool for improving continuous speech recognition performance for the visually challenged, as well as for users who do not want to go through the long manual training session.
References
[1] SpeechStudio Inc., "Pronunciation Control", http://www.speechstudio.com
[2] Ceyssens, T., Verhelst, W., and Wambacq, P., "A strategy for pitch conversion and its evaluation", Proc. 3rd IEEE Benelux Signal Processing Symposium (SPS-2002), Leuven, Belgium, 2002.
[3] Filipsson, M., "Speech Analysis Tutorial", http://www.ling.lu.se/research/speechtutorial/tutorial.html
[4] Microsoft Corporation, "Microsoft Speech Application Software Development Kit (SASDK) Version 1.0", http://www.microsoft.com/downloads
[5] NextUp Technologies, LLC, "NeoSpeech TTS Voices", http://www.nextup.com/neospeech.html
[6] Shakespeare, W., "Hamlet", http://www.shakespeare-online.com/plays/hamlet_3_2.html