
Concatenation of syllable by anchor frame to improve Naturalness in speech synthesis for Marathi language (India)

Pravin M Ghate
National Institute of Electronics & Information Technology
Dr. Babasaheb Ambedkar Marathwada University, Aurangabad, India
[email protected]

Suresh Shirbahadurkar
Zeal College of Engineering & Research, Narhe, Pune
Savitribai Phule Pune University, Pune, India
[email protected]

May 26, 2018

International Journal of Pure and Applied Mathematics, Volume 118 No. 24 2018, ISSN: 1314-3395 (on-line version), url: http://www.acadpubl.eu/hub/, Special Issue

Abstract

Speech synthesis is the conversion of text into a speech waveform. It is useful for applications such as car navigation, announcements in railway stations, response services in telecommunications, and e-mail reading. Current state-of-the-art text-to-speech systems for Indian languages that use concatenative speech synthesis produce intelligible speech but lack the prosody of natural utterances. The most widely used approach to speech synthesis is based on the concatenation of segments and is called the concatenation technique. This method uses pre-recorded units of speech, which preserve the naturalness and intelligibility of the Marathi language. The quality of the synthetic speech is a direct function of the available syllables. In the context of Indian languages, syllable units are found to be a much better choice than units like the phone, diphone and half phone [1]. For good-quality synthesis, all syllable units of the language should be present and the syllables should join well. An issue not addressed in previous work on syllable-based synthesizers for Indian languages is the concatenation of the syllables. This work addresses two problems in speech synthesis. One is improving the naturalness of synthetic speech in syllable-based speech synthesis. The other is improving the quality of the syllables used for concatenation, in order to achieve more flexible speech synthesis. To deal with the former problem, we focus on two factors: (1) an algorithm for segmenting speech into syllables according to linguistic rules and (2) concatenation of syllables. In this paper, a novel method for improving the quality of syllables for the Marathi language is proposed. This method combines two speech units such as syllables to generate better synthetic sound in terms of naturalness and intelligibility [1]. Experimental work was carried out on a speech database for the Marathi language collected from the Language Technologies Research Center (LTRC), which is one of the centers under the International Institute of Information Technology (IIIT), Hyderabad. The performance of the proposed algorithm has been evaluated with the Mean Opinion Score (MOS) test (ITU-T Recommendation P.85, http://www.itu.int/rec/T-RECP.85-199406-I/en, 1994), which is a test of the quality of synthetic speech [3]. We have also proposed a segmentation algorithm for the speech signal. The performance of the proposed algorithms has been evaluated on parameters such as the number of sample points in a speech syllable, pitch, energy of speech and frames of the speech signal [4]. The results of the proposed algorithms are encouraging, owing to the selection of suitable features, as compared to previous methods. Analysis has been carried out for the proposed algorithms on a database of syllables generated from word sounds. This research work aims to implement the concatenation approach (using syllables as speech units) to reduce the database size and to prepare a database of syllables so that they can be reused for synthesizing new words.

Key Words: Concatenation, Marathi Phonology, Speech Synthesis, syllable, TTS

1 Introduction

Speech processing technology has been a mainstream area of research for more than 50 years. The ultimate goal of speech research is to build systems that mimic human capabilities in understanding, generating and coding speech for a range of human-to-human and human-to-machine interactions. Speech synthesis is the technique of artificially generating human speech [1]. A system that performs this synthesis is called a synthesizer. A text-to-speech system converts normal text in the language for which it is designed into speech. It is now widely used in audio reading devices for blind people. The basic block diagram is shown in Figure 1.1.

Figure 1.1: Basic block diagram of speech synthesis

2 System development

In the last few years, however, the use of text-to-speech conversion technology has grown far beyond the disabled community to become a major adjunct to the rapidly growing use of digital voice storage for voice mail and voice response systems. The system development is shown in Figure 1.2.


Figure 1.2: Proposed basic block diagram of the system for concatenation of syllables

The process of converting a sentence into syllables is called segmentation. Segmentation is performed according to the linguistic rules of the Marathi language. The system implements two major algorithms:

1. Segmentation algorithm
2. Syllable concatenation algorithm

The system has two inputs: a speech signal and text. These two inputs are combined according to the algorithm and the parameters under consideration. The accepted speech signal and text are processed further for storage in the data corpus, and the text is mapped to the sound units stored during pre-processing. Figure 1.2 shows the complete system development work in three stages. In the first stage the system accepts a spoken sentence in Marathi on one side and the corresponding Marathi text on the other. In the last stage speech is generated. The outcome of the whole algorithm is to generate the syllables and to concatenate them using signal processing. Detailed experimentation with these algorithms is covered in the following sections.

3 Algorithm for syllable concatenation to form the word

This algorithm concatenates syllables with the aim of naturalness and intelligibility. Our earlier work on speech synthesis covered the conversion of the speech signal into syllables [5]. Nevertheless, audible artifacts are present due to discontinuities in pitch, energy, and formant trajectories at the joining point of the units. Two algorithms are proposed for syllable concatenation to improve the quality of syllables for the Marathi language.

1. Concatenation of syllables using the number of samples
2. Concatenation of syllables using the pitch of the frame at the joint

The first algorithm concatenates two syllables using the number of samples.

Step 1: Take two syllables from the database (let S1 be the first syllable and S2 the second syllable)
Step 2: Find the length of the syllables (S1 and S2)
Step 3: Take the last 1000 samples of the first syllable (S1)
Step 4: Take the first 1000 samples of the second syllable (S2)
Step 5: Find the mean of the selected points
Step 6: Replace the joint point with the mean points
Step 7: Generate the speech after concatenation

In this paper, we present some minimal signal modification techniques for reducing these artifacts. After a sentence is segmented into syllables, the syllables are stored in the data corpus of syllables. It is extremely difficult to make a machine that sounds identical to a human. Hence even the best text-to-speech (TTS) algorithm sounds robotic unless human speech itself is involved in it. However, it is not possible to create a database of each and every word possible in a language. Syllable-based concatenative speech synthesis (CSS) forms new words from existing words and syllables in the database. The most important qualities of a speech synthesis system are naturalness and intelligibility. The present Marathi TTS system is capable of automatically producing speech by storing small segments of speech and concatenating them when required. The objective of this algorithm is to increase naturalness and intelligibility at the joint point.
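The seven steps of the first algorithm can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function name `join_by_sample_mean` is hypothetical, syllables are assumed to be plain lists of sample amplitudes, and Step 6 is read here as inserting the element-wise mean of the two boundary segments between the syllables, which is one possible interpretation of the text.

```python
def join_by_sample_mean(s1, s2, n=1000):
    """Join two syllables by inserting the mean of their boundary samples.

    s1, s2 : sequences of sample amplitudes (assumed representation).
    n      : number of boundary samples taken from each syllable (Steps 3-4).
    """
    # Step 2: the syllable lengths bound how many samples can be taken.
    n = min(n, len(s1), len(s2))
    # Step 3: last n samples of the first syllable.
    tail = s1[len(s1) - n:]
    # Step 4: first n samples of the second syllable.
    head = s2[:n]
    # Step 5: element-wise mean of the selected points.
    bridge = [(a + b) / 2.0 for a, b in zip(tail, head)]
    # Steps 6-7: place the mean points at the joint (assumed: inserted
    # between the syllables) and return the concatenated signal.
    return list(s1) + bridge + list(s2)


if __name__ == "__main__":
    # Tiny toy signals; real syllables would contain thousands of samples.
    print(join_by_sample_mean([0, 0, 2, 4], [6, 8, 0, 0], n=2))
```

The guard on `n` is an addition for short test signals; with real syllables of several thousand samples it has no effect.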


Figure 1.3: Concatenation of two syllables using smoothing and filtering

In both cases, minimal changes are made to the original signal. Among the different methods of speech improvement, pitch modification is one of the simplest. If the pitch is modified at the concatenation joint, the glitch or spectral mismatch can be reduced [6]. This section explains how syllable concatenation techniques are implemented to make the resulting speech output more natural. The aim of this work is to increase the naturalness of text-to-speech synthesis by comparing pre-recorded speech (original speech) with the output of the concatenation algorithm and improving the speech quality. The system produces synthetic speech with the desired pitch-scale and time-scale modification for an arbitrary input speech signal so that there is no spectral mismatch at the point of concatenation. The second algorithm concatenates syllables using the pitch of the frames at the joint. The synthesized speech was clearly natural sounding, but there were audible artifacts causing a drop in the perceived overall quality. In this work we have attempted to understand the causes of these artifacts and present methods for improving the quality of the synthesized speech by applying minimal prosodic modifications.


4 Algorithm: Concatenation of two syllables using Anchor Frame

Step 1: Take two syllables from the database (let S1 be the first syllable and S2 the second syllable)
Step 2: Convert the syllables into frames
Step 3: Take the frame containing the last two highest pitch values of the first syllable
Step 4: Take the frame containing the first two highest pitch values of the second syllable
Step 5: Interpolate both syllables (S1 and S2)
Step 6: Store the values of the syllables (S1 and S2) in matrices (M1 for the first matrix and M2 for the second matrix)
Step 7: Check the sizes of the matrices and normalize them to the same size
Step 8: Take the average of the two matrices
Step 9: Generate a new frame, called the anchor frame, from the average of the matrices
Step 10: Insert the anchor frame between the two syllables (S1 and S2) for smoothing

Figure 1.4: Second algorithm for concatenation of two syllables using Anchor Frame


The syllable is the unit used for concatenation when generating a word. Each syllable is further divided into frames of 10 to 15 msec, so the two syllables taken from the database each consist of many frames. Of these frames, consider the last frame of the first syllable and the first frame of the second syllable. Store the values of the two maximum pitch points of each of the two frames and interpolate. The new frame generated from the two frames is called the anchor frame.
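The anchor frame procedure can be sketched along the following lines. This is a simplified illustration under stated assumptions, not the authors' code: all function names are hypothetical, syllables are plain lists of samples, the pitch-based selection of Steps 3 and 4 is replaced by simply taking the last frame of S1 and the first frame of S2, linear interpolation stands in for the unspecified interpolation method, and the `upsample` factor of 2 is an arbitrary choice.

```python
def split_frames(signal, sample_rate, frame_ms=12):
    """Step 2: cut a syllable into consecutive frames of frame_ms milliseconds."""
    frame_len = max(1, sample_rate * frame_ms // 1000)
    return [signal[i:i + frame_len] for i in range(0, len(signal), frame_len)]


def resample_linear(frame, length):
    """Step 5: linearly interpolate a frame to the requested number of samples."""
    if length == 1 or len(frame) == 1:
        return [float(frame[0])] * length
    step = (len(frame) - 1) / (length - 1)
    out = []
    for i in range(length):
        pos = i * step
        lo = int(pos)
        hi = min(lo + 1, len(frame) - 1)
        frac = pos - lo
        out.append(frame[lo] * (1.0 - frac) + frame[hi] * frac)
    return out


def anchor_frame(last_frame, first_frame, upsample=2):
    """Steps 6-9: bring both frames to one size and average them."""
    size = upsample * max(len(last_frame), len(first_frame))
    m1 = resample_linear(last_frame, size)   # M1, normalized size
    m2 = resample_linear(first_frame, size)  # M2, normalized size
    return [(a + b) / 2.0 for a, b in zip(m1, m2)]


def join_with_anchor(s1, s2, sample_rate, frame_ms=12, upsample=2):
    """Step 10: insert the anchor frame between the two (non-empty) syllables."""
    last = split_frames(s1, sample_rate, frame_ms)[-1]
    first = split_frames(s2, sample_rate, frame_ms)[0]
    return list(s1) + anchor_frame(last, first, upsample) + list(s2)


if __name__ == "__main__":
    # Toy example: 1000 Hz sampling and 2 ms frames give 2-sample frames.
    print(join_with_anchor([0, 0, 0, 2], [2, 4, 4, 4], 1000, frame_ms=2, upsample=1))
```

A pitch detector would be needed to select frames by their pitch values as the steps describe; it is omitted here to keep the sketch self-contained.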

5 Experimental results: concatenation of syllables with smoothing and filtering of the signal

The objective of this research is the smooth concatenation of syllables to generate good-quality synthesized speech. Experiments were carried out on the concatenation of two syllables using smoothing and filtering. The suggested procedure is to take the last 1000 points from the first syllable and the first 1000 points from the second syllable, take the mean of these samples, and insert the mean values between the two syllables selected from the database.

5.1 Filtering and windowing

To join two syllables, we first take the mean using additive smoothing. We then apply a Hanning window to find the joint point. The syllables are taken from the speech corpus and joined according to the concatenation algorithm. The following equation generates the coefficients of a Hanning window:

w(n) = 0.5(1 − cos(2πn/N)), 0 ≤ n ≤ N (1)

The window length is L = N + 1. Here N = 100 points: the last 100 points of syllable 1 and the first 100 points of syllable 2 are used. We selected 100 points because this gave the better quality of output speech. Let us consider two syllables S1 and S2. The novel contribution of this research is the proposed concatenation algorithm: when the two syllables are joined with it, the naturalness of the signal increases.
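As a small illustration of equation (1), the coefficients for N = 100 can be generated as below. The function names `hanning_window` and `apply_window` are hypothetical, and multiplying a segment by the window is shown only as a typical use; the paper does not spell out how the window is applied at the joint.

```python
import math


def hanning_window(N=100):
    """Coefficients w(n) = 0.5 * (1 - cos(2*pi*n/N)) for 0 <= n <= N."""
    return [0.5 * (1.0 - math.cos(2.0 * math.pi * n / N)) for n in range(N + 1)]


def apply_window(segment):
    """Multiply a segment of L = N + 1 samples (L >= 2) by the window."""
    w = hanning_window(len(segment) - 1)
    return [x * c for x, c in zip(segment, w)]


if __name__ == "__main__":
    w = hanning_window(100)
    # The window has L = N + 1 = 101 coefficients, rising from 0 at the
    # edges to 1 at the centre.
    print(len(w), w[0], w[50])
```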


Concatenation of syllables: (Syllable-1) + (Syllable-2). For the experiment we took syllables from the processed speech corpus and performed the concatenation after filtering and smoothing. One example is shown in Figure 1.5. We have two words and , extract the and and form the new word .

Example 1

Figure 1.5: Concatenation of syllables

Figure 1.6: (a) Concatenation of syllables before processing; (b) x-axis: number of samples, y-axis: amplitude of the signal


Figure 1.7: (a) Concatenation of syllables after processing; (b) x-axis: number of samples, y-axis: amplitude of the signal

In many experiments in science, the true signal amplitudes (y-axis values) change rather smoothly as a function of the x-axis values, whereas many kinds of noise are seen as rapid, random changes in amplitude from point to point within the signal. In such cases one may attempt to reduce the noise by a process called smoothing. In smoothing, the data points of a signal are modified so that individual points that are higher than the immediately adjacent points (presumably because of noise) are reduced, and points that are lower than the adjacent points are increased.
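A moving average is one common way to carry out the smoothing described above. The sketch below is an assumption for illustration only: the paper does not state which smoothing filter or width was used, and the name `moving_average` is hypothetical.

```python
def moving_average(signal, width=3):
    """Replace each point by the mean of itself and its neighbours.

    Near the edges the window is truncated to the samples that exist.
    """
    half = width // 2
    out = []
    for i in range(len(signal)):
        lo = max(0, i - half)
        hi = min(len(signal), i + half + 1)
        window = signal[lo:hi]
        out.append(sum(window) / len(window))
    return out


if __name__ == "__main__":
    # A single spike is lowered and its neighbours are raised.
    print(moving_average([0, 0, 9, 0, 0], width=3))
```

A point higher than its neighbours is pulled down and the lower neighbours are pulled up, which is the behaviour the paragraph describes.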

Table 1.1: Average of 10 points showing smoothing of the joint between syllables

6 Experimental result of anchor frame concatenation algorithm for syllables

The second algorithm, the anchor frame concatenation algorithm, is used to join two syllables. In this method an anchor frame is generated from the joint of the two syllables, using the last segment of the first syllable, which contains its two highest pitch values, and the first segment of the second syllable, which contains its two highest pitch values. All the values are stored in matrices and used to create a new frame called the anchor frame. This new frame is used to join the syllables smoothly. The following are some examples of syllable concatenation using the anchor frame.

Table 1.2: Some examples of syllable concatenation


The following diagrams show the results of the anchor frame concatenation algorithm for syllables. Following the flow chart, the waveform is shown at every stage of the process.

Example 1

Figure 1.8: Original waveform of and

Figure 1.8 shows the waveforms of the syllables taken from the database for concatenation to form the word. In this diagram the x-axis shows the number of samples and the y-axis shows the amplitude of the signal. The waveforms of the first and second syllables are both shown. Figure 1.9 shows the frames of the speech signal in different colours for further processing.


Figure 1.9: Framing of original speech ( + )

To remove the discontinuity at the joint of the syllables and to maintain a smooth effect in the output speech, interpolation of the signal is carried out on two frames. The process of interpolation is shown in Figure 1.10, which shows the last frame of the first syllable and the first frame of the second syllable before and after interpolation.

Figure 1.10: Interpolation of speech frame ( + )

Interpolation increases the number of samples in both frames. By selecting the last two pitch values of the first syllable and the first two pitch values of the second syllable, a new frame is formed. This frame is called the anchor frame and is shown in Figure 1.11.

Figure 1.11: Anchor Frame of

The anchor frame is inserted between the two syllables so that a new word is formed, as shown in Figure 1.12.


Figure 1.12: Inserting the anchor frame in Syllable

7 Testing Methodologies for Intelligibility and Naturalness

A GUI (graphical user interface) has been designed for the concatenation of syllables and for evaluating the quality of synthesized speech; it is shown in Figure 1.13. The GUI consists of two small windows, which accept the input for concatenation, and a play button for listening to the sound. As the experiment concerns speech synthesis, the window accepts text, the accepted text is mapped to the syllable database, and the text is converted into speech. The second task is to concatenate the syllables to form a word. In the same GUI, another window shows the energy of each syllable and the boundaries of the syllables.

For example, consider syllable 1 and syllable 2. The text of these two syllables is entered in the window.

Figure 1.13: (a) Input text; (b) Original speech signal for and ; concatenation of two syllables


Some examples of syllable concatenation using the anchor frame concatenation algorithm:

Table 1.3: List of syllables for formation of the word

8 Conclusion

The paper explains two algorithms for the concatenation of syllables. The first algorithm uses the maximum and minimum points: it takes the last 1000 points of the first syllable and the first 1000 points of the second syllable, and signal processing is carried out at the joint point to smooth the signal. The results, however, are not up to the mark. We therefore propose the second algorithm, in which each syllable is divided into frames of 12 msec. The two syllables to be combined are taken from the database. The last frame of the first syllable and the first frame of the second syllable are taken, and the points from these two frames are used to create a new anchor frame.

Acknowledgment

The authors would like to thank the management of JSPM's Rajarshi Shahu College of Engineering, Tathawade, for their continuous motivation and support.

References

[1] Hemant A. Patil, Tanvina B. Patel and Nirmesh J. Shah, "A Syllable-Based Framework for Unit Selection Synthesis in 13 Indian Languages," IEEE Xplore, 13 January 2014, Gurgaon, India.


[2] Ashwin Bellur, K. Badri Narayan, Raghava Krishnan K. and Hema A. Murthy, "Prosody modeling for Syllable-Based Concatenative Speech Synthesis of Hindi and Tamil," 978-1-61284-091-8/11/$26.00 ©2011 IEEE.

[3] International Telecommunication Union, ITU-T Recommendation P.85, http://www.itu.int/rec/T-REC-P.85-199406-I/en, 1994.

[4] T. Nagarajan, Hema A. Murthy and Rajesh M. Hegde, "Segmentation of speech into syllable-like units," EUROSPEECH 2003, Geneva.

[5] Venugopalakrishna Y. R., Vinodh M. V., Hema A. Murthy and C. S. Ramalingam, "Methods For Improving The Quality Of Syllable Based Speech Synthesis," 978-1-4244-3472-5/08/$25.00 2008 IEEE.

[6] M. Plumpe and S. Meredith, "Which is more important in a concatenative text to speech system: pitch, duration or spectral discontinuity?" Microsoft Research, Redmond, USA, pp. 193-206, 1998.
