Turkish Journal of Physiotherapy and Rehabilitation; 32(3)
ISSN 2651-4451 | e-ISSN 2651-446X
www.turkjphysiotherrehabil.org 5662
PERFORMANCE ANALYSIS OF FUJISAKI INTONATION MODEL IN
KANNADA SPEECH SYNTHESIS
Sadashiva Chakrasali1, Indira K2, Sunitha Y N3, Chandrashekar M Patil4 1&2Electronics & Communication Engg., Ramaiah Institute of Technology, Bengaluru, VTU, India
3Electronics & Communication Engg., SJB Institute of Technology, Bengaluru, VTU, India 4Electronics & Communication Engg., Vidya Vardhaka College of Engg., Mysuru, VTU, India
[email protected], [email protected], [email protected], [email protected]
ABSTRACT:
Speech synthesis is the process of generating a human-like voice by a machine or system. A system that synthesizes artificial speech corresponding to text input is called a Text-To-Speech (TTS) system. In a TTS system, the naturalness and intelligibility of the synthesized speech depend mainly on the prosodic models used. This work concentrates on building a Fujisaki intonation model using the Festival framework for Kannada, one of the important languages of southern India. The performance of the TTS system is analyzed by measuring the Mel Cepstral Distortion (MCD) score with and without the Fujisaki intonation model. The MCD score of synthesized speech without the intonation model is in the range 3.52 to 5.02 dB, whereas with the intonation model it is in the range 1.62 to 2.43 dB. It is also observed that the intonation model improves intelligibility and naturalness to a significant extent.
Keywords: Fujisaki model, TTS synthesis, phrase accent commands, MCD score.
I. INTRODUCTION
The recent advancements in the field of speech synthesis are vast for international languages, but limited for Indian languages such as Kannada. In [1], Sadashiva Chakrasali et al. developed a Hidden Markov Model (HMM) based Kannada speech synthesizer using the Festival framework for simple declarative sentences; the synthesized speech is shaky and lacks naturalness. Mixdorff et al. [2], [3] have shown that the quality (naturalness and intelligibility) of synthesized speech can be enhanced by incorporating prosody models. Prosody involves the intonation (pitch pattern), duration and gain of speech segments. The intonation pattern is the variation of pitch frequency (F0) with respect to time; intonation conveys information about the periodicity of the glottal pulse source for voiced speech sounds. Different utterances may have different intonation patterns for the same phoneme segment, depending on the nature or type of utterance. Madhukumar et al. [4] developed a linear intonation model for simple declarative sentences of Hindi with restricted content and function words to predict intermediate peaks and valleys; this model fails to capture sudden rises and falls of pitch. Tanvina Patel et al. [5] have shown that naturalness can be improved by modeling the rise and fall in pitch through the phrase and accent commands of the Fujisaki model for Gujarati. These factors motivated the authors to incorporate the Fujisaki model into Kannada speech synthesis. In this work, a Fujisaki model is developed and incorporated for synthesizing simple declarative sentences with multiple content and function words, and the TTS system is analyzed using the MCD score with and without the intonation model. This paper is organized as follows: Section 2 gives a brief overview of the Fujisaki model and the extraction of its parameters; synthesis using the Festival framework is discussed in Section 3; and the analysis of results and conclusion are presented in Section 4.
II. FUJISAKI INTONATION MODEL
The Fujisaki model provides a method for generating the fundamental frequency (F0) variations of natural speech with high accuracy. The model decomposes a given pitch contour into components [6], [7]: a base frequency Fb, phrase components and accent components. The Fujisaki model represents a given F0 contour on a logarithmic scale (ln F0) by superimposing these three components. Fig. 1 shows the process of generating the pitch contour F0. The phrase components are the impulse responses of a second-order critically damped system excited by impulses called phrase commands. The accent components are the responses of a second-order critically damped system when excited
by rectangular pulses of different durations and amplitudes called accent commands. A constant value Fb for the utterance is superimposed on these two components to give the basic model of the pitch contour F0 for that utterance.
Fig. 1: Block diagram for pitch contour generation.
\ln F_0(t) = \ln F_b + \sum_{i=1}^{I} A_{pi} \, G_{pi}(t - T_{pi}) + \sum_{j=1}^{J} A_{aj} \left\{ G_{aj}(t - T_{aj1}) - G_{aj}(t - T_{aj2}) \right\}    (1)

where

G_{pi}(t) = \alpha_i^2 \, t \, \exp(-\alpha_i t)  for t ≥ 0, and 0 for t < 0

G_{aj}(t) = \min\left[ 1 - (1 + \beta_j t) \exp(-\beta_j t), \; \theta_j \right]  for t ≥ 0, and 0 for t < 0
Fb: Base frequency of the utterance
I: number of phrase commands in the utterance
J: number of accent commands in the utterance
Api: amplitude of ith phrase command in the utterance
Aaj: amplitude of jth accent command in the utterance
Tpi: instant of occurrence of ith phrase command in the utterance
Taj1: starting time of jth accent command in the utterance
Taj2: ending time of jth accent command in the utterance
αi: natural angular frequency of phrase control mechanism of ith phrase command
βj: natural angular frequency of accent control mechanism of jth accent command
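The superposition in Eq. (1) can be made concrete in code. The following is a minimal pure-Python sketch (not the authors' implementation); the packing of the commands as tuples is our own convention for illustration.

```python
import math

def Gp(t, alpha):
    # Phrase component: impulse response of a second-order critically damped system
    return alpha ** 2 * t * math.exp(-alpha * t) if t >= 0 else 0.0

def Ga(t, beta, theta=0.9):
    # Accent component: step response, ceiling-limited at theta
    return min(1.0 - (1.0 + beta * t) * math.exp(-beta * t), theta) if t >= 0 else 0.0

def f0(t, Fb, phrases, accents):
    """Evaluate Eq. (1): ln F0(t) = ln Fb + phrase terms + accent terms.
    phrases: list of (Ap, Tp, alpha); accents: list of (Aa, Ta1, Ta2, beta)."""
    ln_f0 = math.log(Fb)
    for Ap, Tp, alpha in phrases:
        ln_f0 += Ap * Gp(t - Tp, alpha)
    for Aa, Ta1, Ta2, beta in accents:
        ln_f0 += Aa * (Ga(t - Ta1, beta) - Ga(t - Ta2, beta))
    return math.exp(ln_f0)
```

Before the first command, f0 simply returns Fb; a phrase command raises the contour smoothly and lets it decay, while an accent command adds a local rise over its [Ta1, Ta2] interval.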
2.1 Modeling Procedure
To extract the Fujisaki model parameters from a given pitch contour of an utterance, the first step is to suppress micro-prosodic variations caused by individual speaker influence. This is achieved by interpolating the pitch contour using the cubic spline interpolation technique [6], [8], [9], which also makes it possible to take derivatives of the High Frequency Contour (HFC) and the Low Frequency Contour (LFC). To extract the HFC, the interpolated pitch contour is passed through a high pass filter, and the filter output is taken as the HFC. The HFC is subtracted from the interpolated pitch contour to obtain the LFC, and the model parameters are then extracted as explained in detail in the following subsections. Fig. 2 shows an original utterance analyzed using the PRAAT tool: pane 1 shows the variation of the signal amplitude with respect to time, and pane 2 shows the spectrogram, with a broken solid line indicating the pitch contour of the utterance. The unfilled regions in the solid line indicate unvoiced frames (unvoiced speech segments). In voiced regions one pitch frequency is identified per frame; one frame corresponds to 5 milliseconds with 50% overlap. The word annotation of the utterance is shown in pane 3.
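As a concrete illustration of the interpolation step, here is a minimal natural cubic spline in pure Python (a sketch under our own conventions, not the MATLAB routine the authors used): the voiced-frame times go in xs, their F0 values in ys, and the returned function can then fill the unvoiced gaps.

```python
def natural_cubic_spline(xs, ys):
    """Natural cubic spline through the knots (xs, ys); returns a callable f(x).
    xs must be strictly increasing."""
    n = len(xs) - 1
    h = [xs[i + 1] - xs[i] for i in range(n)]
    # Tridiagonal solve for second-derivative coefficients (natural ends: c0 = cn = 0)
    alpha = [0.0] * (n + 1)
    for i in range(1, n):
        alpha[i] = 3 * (ys[i + 1] - ys[i]) / h[i] - 3 * (ys[i] - ys[i - 1]) / h[i - 1]
    l, mu, z = [1.0] * (n + 1), [0.0] * (n + 1), [0.0] * (n + 1)
    for i in range(1, n):
        l[i] = 2 * (xs[i + 1] - xs[i - 1]) - h[i - 1] * mu[i - 1]
        mu[i] = h[i] / l[i]
        z[i] = (alpha[i] - h[i - 1] * z[i - 1]) / l[i]
    c = [0.0] * (n + 1)
    b, d = [0.0] * n, [0.0] * n
    for j in range(n - 1, -1, -1):
        c[j] = z[j] - mu[j] * c[j + 1]
        b[j] = (ys[j + 1] - ys[j]) / h[j] - h[j] * (c[j + 1] + 2 * c[j]) / 3
        d[j] = (c[j + 1] - c[j]) / (3 * h[j])

    def f(x):
        # Locate the segment containing x (clamped to the outer segments)
        j = next((k for k in range(n) if x < xs[k + 1]), n - 1)
        dx = x - xs[j]
        return ys[j] + b[j] * dx + c[j] * dx ** 2 + d[j] * dx ** 3

    return f
```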
Fig. 2: Original speech utterance and its pitch contour (/Philippines ganaraajya aagneya Eshyadalliruva ondu dweepagala desha/ — The Republic of the Philippines is a constituent of islands, situated in South East Asia).
Fig. 3: (i) Cubic interpolation of the pitch contour (ii) HFC and LFC of the interpolated pitch contour (iii) Extracted phrase and accent commands of an utterance (/Philippines ganaraajya aagneya Eshyadalliruva ondu dweepagala desha/).
Fig. 4: (i) Cubic interpolation of the pitch contour (ii) HFC (iii) LFC of the interpolated pitch contour (iv) Extracted phrase and accent commands of an utterance (/Janapada tagna janapada academy prashasti, rajyotsava prashasti modalada prashastigalu labisive/ — The folk expert has received the Folk Academy award, the Rajyotsava award and other awards).
High pass filtering and component separation: The Fujisaki model considers the given F0 contour on a logarithmic scale, comprising the three components explained earlier. The slow variation in the F0 contour is mainly due to the phrase component, whereas the accent component causes the rapid variations. To separate the accent component from the phrase component, the interpolated contour is passed through a high pass filter with a cutoff frequency of 0.5 Hz [6], [9]. The output of the high pass filter is taken as the High Frequency Contour (HFC), and the LFC is obtained by subtracting the HFC from the interpolated contour, as indicated in Fig. 3 and Fig. 4.
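The separation step can be sketched as follows. Since the paper does not specify the filter design, a simple moving-average smoother stands in here for the low-pass branch of the 0.5 Hz filter, with the HFC taken as the residual so that HFC + LFC reconstructs the contour exactly, as in the paper's decomposition.

```python
def separate(ln_f0, frame_rate=200.0, cutoff=0.5):
    """Split an interpolated ln F0 contour (one value per frame) into HFC and LFC.
    A moving average over one cutoff period approximates the low-pass branch;
    the high-frequency contour is the residual, so HFC + LFC == contour."""
    win = max(1, int(frame_rate / cutoff))  # samples per cutoff period
    half = win // 2
    lfc = []
    for i in range(len(ln_f0)):
        lo, hi = max(0, i - half), min(len(ln_f0), i + half + 1)
        lfc.append(sum(ln_f0[lo:hi]) / (hi - lo))
    hfc = [x - s for x, s in zip(ln_f0, lfc)]
    return hfc, lfc
```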
2.2 Extraction of Fujisaki Model Parameters
An approximate derivative is applied to the HFC to extract the accent commands, i.e. accent commands are placed between consecutive minima in the HFC. The accent command amplitude Aa equals the value needed to reach the maximum F0 (on the logarithmic scale) between consecutive minima. The global minimum of the LFC indicates the base frequency Fb. Since the onset of a new phrase command is characterized by a local minimum in the phrase component, the LFC is searched for local minima, applying a minimum distance threshold of 1 second between consecutive phrase commands. To initialize the magnitude Ap assigned to each phrase command, the part of the LFC after the potential onset time Tp is searched for the next local maximum; Ap is then calculated in proportion to the F0 at this point, taking the contributions of preceding commands into account. The remaining model parameters α, β and θ are set to 1/s, 20/s and 0.9 respectively [11].
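The local-minimum search with the 1-second spacing rule might look like this (a sketch with hypothetical names; the 5 ms frame period follows Section 2.1):

```python
def phrase_onsets(lfc, frame_period=0.005, min_gap=1.0):
    """Candidate phrase-command onset times Tp (in seconds): local minima of
    the LFC separated by at least min_gap seconds (one LFC value per frame)."""
    onsets = []
    for i in range(1, len(lfc) - 1):
        if lfc[i] < lfc[i - 1] and lfc[i] <= lfc[i + 1]:
            t = i * frame_period
            # enforce the minimum distance threshold between phrase commands
            if not onsets or t - onsets[-1] >= min_gap:
                onsets.append(t)
    return onsets
```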
III. IMPLEMENTATION
This work is carried out using different tools: PRAAT, MATLAB and the Festival framework. The PRAAT tool is used to extract the pitch contours of all the utterances used for training. Interpolation of the pitch contours to fill the unvoiced regions is done using MATLAB, as is extraction of the HFC and LFC of the interpolated pitch contours. Phrase and accent commands are extracted manually. Prior to these steps, the speech utterances must be preprocessed to remove unwanted noise. Preprocessing and the subsequent steps carried out using the Festival framework are explained in the following subsections.
3.1 Speech Data Pre-processing
Speech data of about 200 utterances, out of 696 utterances with transcripts, is acquired from the CMU (Carnegie Mellon University) Indic dataset and used for training. The utterances are spoken by a female news reader, are noise free, and are sampled at 16 K samples/sec. Since the speech data is clean, it needs no preprocessing operations such as noise filtering or resampling. The database is well balanced, with approximately 2000 words, 14 vowels (including hrasva and deergha swaras, i.e. short and long vowels) and 34 consonants. The dataset also includes syllables of the forms \CV\, \CCV\ and \CCCV\, as indicated in Table 1 (C – consonant, V – vowel).
Table 1: Balanced Kannada Corpus Composition

    Corpus Level    Type       Quantity
    Syllables       vowels     30%
                    /CV/       40%
                    /CCV/      25%
                    /CCCV/     05%
3.2 Labelling and Baum - Welch Iteration
One of the most crucial steps in the development of a TTS system is labelling: the process of aligning the phoneme transcription with the exact timing of the wav file. The start, end, or duration of the phoneme can be used as timing information. A label file with the extension .lab is used to store labelled data. An utterance file (.utt) is a similar file that contains more information: duration factors, stress levels, labels, and utterances sorted by id. In tree format, it also holds token relations, syllable relations, word relations, and segment relations. These utterance files are essential for building a voice using HTS (HMM-based Speech Synthesis System).
Baum-Welch algorithm: This algorithm, also known as the Forward-Backward algorithm, is used to automatically estimate the parameters of an HMM. It is an instance of the Expectation Maximization (EM) algorithm. The steps involved are as follows:
• Assign initial probabilities and time durations for each phoneme
• Calculate the probability of each transition and emission being used
• Using those estimates, re-estimate the likelihood
• Repeat steps 2 and 3 until convergence
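The steps above can be made concrete with one re-estimation pass for a toy discrete HMM (a pure-Python illustration under our own naming; the actual system trains continuous-density HMMs via HTS):

```python
def forward(obs, A, B, pi):
    # alpha[t][i]: probability of obs[0..t] ending in state i
    N = len(pi)
    alpha = [[pi[i] * B[i][obs[0]] for i in range(N)]]
    for t in range(1, len(obs)):
        alpha.append([B[j][obs[t]] * sum(alpha[t - 1][i] * A[i][j] for i in range(N))
                      for j in range(N)])
    return alpha

def backward(obs, A, B):
    # beta[t][i]: probability of obs[t+1..] given state i at time t
    N, T = len(A), len(obs)
    beta = [[1.0] * N for _ in range(T)]
    for t in range(T - 2, -1, -1):
        for i in range(N):
            beta[t][i] = sum(A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] for j in range(N))
    return beta

def baum_welch_step(obs, A, B, pi):
    """One EM re-estimation of (A, B, pi) from a single observation sequence."""
    N, T, M = len(pi), len(obs), len(B[0])
    al, be = forward(obs, A, B, pi), backward(obs, A, B)
    Z = sum(al[T - 1][i] for i in range(N))  # sequence likelihood
    gamma = [[al[t][i] * be[t][i] / Z for i in range(N)] for t in range(T)]
    xi = [[[al[t][i] * A[i][j] * B[j][obs[t + 1]] * be[t + 1][j] / Z
            for j in range(N)] for i in range(N)] for t in range(T - 1)]
    new_pi = gamma[0][:]
    new_A = [[sum(xi[t][i][j] for t in range(T - 1)) /
              sum(gamma[t][i] for t in range(T - 1)) for j in range(N)] for i in range(N)]
    new_B = [[sum(gamma[t][i] for t in range(T) if obs[t] == k) /
              sum(gamma[t][i] for t in range(T)) for k in range(M)] for i in range(N)]
    return new_A, new_B, new_pi
```

Each pass preserves the stochastic constraints (rows of A and B, and pi, still sum to one) while moving the parameters toward a local likelihood maximum.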
Table 2: Baum-Welch iteration for the word JANAPADA.

    Before Baum-Welch (approximation)   After Baum-Welch   Labels
    0.11                                0.15               Pau
    0.22                                0.18               J
    0.33                                0.29               A
    0.44                                0.325              nB
    0.55                                0.435              A
    0.66                                0.485              p
    0.77                                0.515              A
    0.88                                0.545              dB
    0.99                                0.635              A
Table 2 shows the labelled data and time durations in seconds after Baum-Welch iteration for the word JANAPADA. A dummy label file is used to assign the initial times for this iteration. There are three HMM states for each phoneme. The state names are represented by English phoneme equivalents with a postfix number indicating one of the three states (silence states excluded). For example, j_1, j_2 and j_3 represent the three states of the letter ‘j’ (ಜ). These state names serve as the foundation for representing phoneme qualities in all labels, utterances, and sentences.
3.3 Building Model
The main task of this work is to predict the Fujisaki phrase and accent command parameters using the Festival framework [12, 13, 14, 15]. For phrase commands, Api and Tpi must be predicted; for accent commands, Aaj, Taj1 and Taj2. The accuracy of the model is mainly determined by the quality and quantity of the available data. In this work we employ a data-driven method, the classification and regression tree (CART) model. The CART model is a binary tree constructed by passing in the different parameters of the feature vectors; the feature vectors used for the CART model are indicated in Table 3. The tree is constructed with the Wagon tool, available in the Edinburgh Speech Tools library.
Phrasing: Prosodic phrasing is mainly determined by the lung capacity of the speaker, so it is speaker dependent. A test is conducted on each word to check whether it is at the end of a word or at the end of the utterance. The CART returns either B or BB, indicating a small break (end of word) or a bigger break (end of utterance) respectively. This tree is made available in festival/lib/phrase.scm. To implement this tree we need basic information such as whether the utterance begins with a content word or a function word, and the number of words in the utterance. This information is obtained from the function gpos, which uses the word list in the lisp variable guess_pos to determine the basic category of a word. All the features required to predict phrase breaks are listed and included in a file phrbrk.feats; in effect, gpos is an alternative to phrbrk.feats. The shell script for training the tree model is then run.
Table 3: Features considered for building the Fujisaki model.

Phonological features
  For accent parameters: nature of the sentence; type of the syllable; accent level of the syllable; accent level of the previous syllable; accent level of the succeeding syllable; weight of the nucleus.
  For phrase parameters: nature of the sentence; type of the syllable at the phrase command; accent level of the syllable at the phrase command.

Positional features
  For accent parameters: position of the syllable from the beginning of the utterance; position of the syllable from the end of the utterance; number of syllables in the utterance; position of the nucleus in the accented syllable; number of phonemes in the accented syllable.
  For phrase parameters: position of the syllable in the utterance at the phrase command; position of the nucleus in the syllable at the phrase command; number of phonemes in the syllable at the phrase command.

Contextual features
  For accent parameters: previous syllable type; succeeding syllable type; duration of the nucleus in the accented syllable; previous syllable duration; succeeding syllable duration; sentence duration.
  For phrase parameters: duration of the nucleus in the syllable at the phrase command; duration of the syllable at the phrase command; duration of the utterance; baseline frequency.
Accenting: Once the training data is labelled with accent classifications, a CART model is trained using Wagon. All the features essential for predicting accent components are collected with dumpfeats. The tree is then trained using the command traintest accent.data; the training data can be further split using the command traintest accent.data.train.
3.4 CART Model
CART is a technique for representing parameter dependencies when making a decision or achieving a goal. Classification involves decision points at the tree's intermediate nodes, while regression uses the node values to arrive at the best possible result; CART's leaf nodes define the output variables. Fig. 5 shows a prototype decision tree, in which the values A, B and C are estimated by an entropy function. A CART model for the pitch frequency F0 of the first state of (ರ) is given in the appendix.
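To make the tree-walking mechanics concrete, here is a minimal CART evaluator in the spirit of the Festival dump in the appendix. The tuple encoding and feature names are our own simplification (Festival's actual format is the Lisp s-expression shown there); leaves carry a (standard deviation, mean) pair and the mean is returned as the prediction.

```python
def predict(tree, feats):
    """Walk a CART: internal nodes are ((feature, op, value), yes_tree, no_tree)
    with op 'is' (equality) or '<' (threshold); leaves are (stddev, mean)."""
    while len(tree) == 3:
        (feat, op, val), yes, no = tree
        if op == "is":
            tree = yes if feats.get(feat) == val else no
        else:  # "<" threshold question
            tree = yes if feats.get(feat, 0.0) < val else no
    return tree[1]  # predicted mean

# A tiny tree analogous to the appendix: a break question, then a threshold question
toy = (("cg_break", "is", 0),
       (("phone_rindex", "<", 5.4), (12.9, 178.9), (11.4, 175.3)),
       (25.3, 186.2))
```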
Fig. 5: Decision tree for F0 prediction.
The CART framework for speech synthesis is developed in the Festival framework. Training involves building three types of CART models essential for speech synthesis:
• Spectral/MCEP CART tree: to obtain statistics of MFCC relations between phonemes
• Duration CART tree: to obtain duration statistics of each phoneme
• F0 CART tree: to predict pitch at the start, middle and end of each phoneme
IV. RESULTS AND DISCUSSION
Mel Cepstral Distortion (MCD) score: An objective analysis of the synthesized utterance can be done by measuring the MCD score with respect to the original utterance. It calculates a logarithmic deviation between the mel cepstral parameters of the original and synthesized utterances, as given below.
\mathrm{MCD} = \frac{10\sqrt{2}}{\ln 10} \sqrt{ \sum_{d} \left( mc_d^{(t)} - mc_d^{(e)} \right)^2 }
where,
mc_d^{(t)}: mel cepstral parameter of the original utterance
mc_d^{(e)}: mel cepstral parameter of the synthesized utterance
d: index of the mel cepstral parameter array
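The per-frame computation is straightforward; a minimal sketch (the averaging over frames in the utterance-level helper is a common convention, not spelled out in the paper):

```python
import math

def mcd_frame(mc_t, mc_e):
    """Mel Cepstral Distortion (dB) between one original frame mc_t and one
    synthesized frame mc_e (equal-length lists of mel cepstral coefficients)."""
    sq = sum((a - b) ** 2 for a, b in zip(mc_t, mc_e))
    return (10.0 * math.sqrt(2.0) / math.log(10.0)) * math.sqrt(sq)

def mcd(frames_t, frames_e):
    # Average the frame-wise distortion over an utterance
    return sum(mcd_frame(t, e) for t, e in zip(frames_t, frames_e)) / len(frames_t)
```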
Normally the MCD of synthesized speech with respect to the original speech is considered acceptable in the range 4.5 to 6 dB. In our earlier work, synthesis without an intonation model achieved MCD scores in the range 3.52 to 5.02 dB, yet the resulting speech was shaky and unclear even though the MCD score was in the acceptable range. The present work with the Fujisaki intonation model yields MCD scores in the range 1.62 to 2.43 dB, and the synthesized speech is natural and intelligible.
The comparison of MCD scores of synthesized speech with and without the intonation model is given in Table 4. The spectra and pitch contours of the original speech and of the synthesized speech with and without the intonation model are shown in Figs. 6, 7 and 8.
V. CONCLUSION
The naturalness of the synthesized speech is improved by incorporating intonation models: the MCD score is reduced by approximately 2 to 3 dB by introducing the Fujisaki intonation model. The pitch contours of the original and synthesized speech are depicted in Figs. 6, 7 and 8. From Fig. 8 it is clear that the pitch contours of the speech synthesized with the intonation model are very close to those of the original utterances.
Table 4: Comparison of MCD scores of synthesized speech with and without the intonation model

    Utterance   MCD (dB)            MCD (dB)         Utterance   MCD (dB)            MCD (dB)
                without intonation  with intonation              without intonation  with intonation
    1           3.52                1.68             6           4.65                1.89
    2           4.16                2.10             7           4.80                2.27
    3           3.87                1.78             8           4.21                1.65
    4           4.32                2.43             9           4.84                2.29
    5           4.34                1.62             10          4.56                1.67
Fig. 6: Spectrum and pitch contour of original signal
Fig. 7: Spectrum and pitch contour of synthesized speech without intonation model.
Fig. 8: Spectrum and pitch contour of synthesized speech with Fujisaki intonation model.
REFERENCES
1. Sadashiva Chakrasali, K. Indira, Shashank Sharma, Srinivas N. M. and Varun S. S., “HMM Based Kannada Speech Synthesis using Festvox”, International Journal of Recent Technology and Engineering, Vol. 8, Issue 3, 2019.
2. H. Mixdorff and D. Mehnert, “Exploring the Naturalness of Several German High Quality Text to Speech Systems”, Eurospeech-99, Vol. 4, Budapest, Hungary, 1999.
3. H. Mixdorff and O. Jokisch, “Evaluating the Quality of an Integrated Model of German Prosody”, International Journal of Speech Technology, Vol. 6, 2003.
4. Madhukumar A. S., S. Rajendran and B. Yegnanarayana, “Intonation Component of a Text to Speech System for Hindi”, Computer Speech and Language, Vol. 7, 1993.
5. Tanvina B. Patel and Hemant A. Patil, “Analysis of Natural and Synthetic Speech Using Fujisaki Model”, International Conference on Acoustics, Speech and Signal Processing (ICASSP-16), 2016.
6. K. Hirose and J. Tao, “Speech Prosody in Speech Synthesis: Modeling and Generation of Prosody for High Quality and Flexible Speech Synthesis”, Springer, 2015. DOI 10.1007/978-3-662-45258-5
7. H. Fujisaki and K. Hirose, “Analysis of Voice Fundamental Frequency Contours for Declarative Sentences of Japanese”, Journal of the Acoustical Society of Japan, Vol. 5, 1984.
8. Shuichi Narusawa, Nobuaki Minematsu, K. Hirose and H. Fujisaki, “Automatic Extraction of Model Parameters from Fundamental Frequency Contours of English Utterances”, 7th International Conference on Spoken Language Processing (ICSLP-02), INTERSPEECH, 2002.
9. H. Mixdorff, “A Novel Approach to the Fully Automatic Extraction of Fujisaki Model Parameters”, ICASSP-2000, Vol. 1, pp. 1281-1284, Istanbul, Turkey, 2000.
10. Heiga Zen, Takashi Nose, Junichi Yamagishi, Shinji Sako, Takashi Masuko, Alan W. Black and Keiichi Tokuda, “The HMM-based Speech Synthesis System (HTS), Version 2.0”, 6th ISCA Workshop on Speech Synthesis, Bonn, Germany, August 22-24, 2007.
11. Zied Mnasri, Fatouma Boukadida and Noureddine Ellouze, “F0 Contour Modeling for Arabic Text-to-Speech Synthesis Using Fujisaki Parameters and Neural Networks”, Signal Processing: An International Journal, Vol. 4, Issue 6, 2010.
12. Festvox Project by the CMU speech group, http://festvox.org/.
13. Festival Speech Synthesis System, http://www.cstr.ed.ac.uk/projects/festival/.
14. Speech Processing Toolkit (Reference Manual), http://sptk.sourceforge.net/
15. Edinburgh Speech Tools, www.cstr.ed.ac.uk/projects/speech_tools/
Appendix: CART Model
((R:mcep_link.parent.R:segstate.parent.R:SylStructure.parent.parent.R:Word.n.gpos is 0)
((R:mcep_link.parent.R:segstate.parent.R:SylStructure.parent.lisp_cg_break is 0)
((R:mcep_link.parent.R:segstate.parent.n.ph_vlng is l)
((18.7296 196.493))
((R:mcep_link.parent.R:segstate.parent.p.ph_vfront is 2)
((lisp_cg_phone_rindex < 5.4)
((12.9037 178.856))
((11.3727 175.31)))
((25.295 186.158))))
((R:mcep_link.parent.R:segstate.parent.p.ph_vrnd is -)
((lisp_cg_state_index < 3.2)
((R:mcep_link.parent.R:segstate.parent.p.ph_vlng is l)
((26.2228 147.423))
((26.0702 155.828)))
((34.5121 144.413)))
((24.4693 164.616))))
((R:mcep_link.parent.R:segstate.parent.R:SylStructure.parent.parent.R:Word.p.gpos is 0)
((R:mcep_link.parent.R:segstate.parent.R:SylStructure.parent.position_type is initial)
((42.4542 241.75))
((R:mcep_link.parent.R:segstate.parent.n.ph_vfront is 2)
((45.9084 279.239))
((37.7051 297.472))))
((lisp_cg_position_in_phrasep < 0.397115)
((lisp_cg_position_in_phrasep < 0.201972)
((37.5981 264.055))
((R:mcep_link.parent.R:segstate.parent.p.ph_vheight is 2)
((R:mcep_link.parent.lisp_cg_duration < 0.0229999)
((29.286 238.109))
((21.8523 221.828)))
((lisp_cg_position_in_phrasep < 0.299511)
((26.5455 246.759))
((33.5854 235.865)))))
((lisp_cg_position_in_phrase < 0.694799)
((R:mcep_link.parent.R:segstate.parent.p.ph_vfront is 2)
((lisp_cg_position_in_phrasep < 1.24516)
((R:mcep_link.parent.R:segstate.parent.n.ph_vfront is 2)
((27.6879 223.165))
((30.7741 216.27)))
((22.1676 209.98)))
((R:mcep_link.parent.R:segstate.parent.R:SylStructure.parent.lisp_cg_break is 0)
((R:mcep_link.parent.R:segstate.parent.R:SylStructure.parent.R:Syllable.p.lisp_cg_break is 1)
((34.328 225.308))
((25.6653 216.684)))
((36.7985 246.606))))
((R:mcep_link.parent.R:segstate.parent.n.ph_vlng is s)
((R:mcep_link.parent.R:segstate.parent.p.ph_vfront is 2)
((21.9999 195.928))
((23.6648 205.232)))
((25.1052 212.615)))))))
;; RMSE 29.6230 Correlation is 0.8006 Mean (abs) Error 21.6313 (20.2403)