Incremental TTS for Japanese - NAIST...Phoneme identity, Part-of-speech tag, relative pitch...

Incremental TTS for Japanese LanguageTOM OYA YA N AGITA 1 , S A K RIA NI S A K TI 1 , 2 , S ATOS HI NA K A M URA 1 , 2

1 N A R A I N S T I T U T E O F S C I E N C E A N D T E C H N O L O G Y , J A P A N

2 R I K E N , C E N T E R F O R A D V A N C E D I N T E L L I G E N C E P R O J E C T A I P , J A P A N

Background

Text-To-Speech (TTS) needs a full sentence as input

2018/9/192018©TOMOYA YANAGITA, AHC-LAB, NAIST,

INERSPEECH2018 2

“私の名前は柳田です。”

(I am Yanagita.)TTS

If text length is long, TTS has to wait till the end and cause a delay

“今日私はインクリメンタル音声合成について話しますが、もちろん皆さんご存知な方がおおいと ……”

(Today, I will talk about incremental

speech synthesis, but I think that there

are many people who know ……)

No response yet

(waiting the end

of sentence)

Require to synthesize speech in smaller chunks → Incremental TTS (ITTS)

Related Works

INERSPEECH2018 3

Limited existing works on ITTS (Mostly based on HMM)

Problems:• Some contextual linguistic features become unknown

• Speech quality may deteriorate compared to standard HMM-TTS

Related Works

INERSPEECH2018 4

[Baumann et al., 2014]Analysis the impact of potentially missing features on the quality of the estimated prosody.• Investigation only base on word-by-word synthesis extension.

-> German, English

Related Works

INERSPEECH2018 5

[Pouget et al., 2015]

HMM training strategy for ITTS• French

-> Word-by-words synthesis (potentially with a delay of one word)

[Baumann et al., 2014]Analysis the impact of potentially missing features on the quality of the estimated prosody.• Investigation only base on word-by-word synthesis extension.

-> German, English

Research Objectives

INERSPEECH2018 6

Most investigations focus only on German, English, French • Focus on Word-by-word synthesis and its improvement.

• ITTS for tonal language has not been studied yet.

Example of tonal language → Japanese• Major linguistic features: phoneme, accent phrase, and breath group

(several accent phrases)

• Only part-of-speech (POS) tag is used as word-level information

• Tonal feature is important

• It may uses longer unit than word as linguistic features and synthesis

Necessary to investigates the effect on speech quality in

linguistic and temporal locality choices in tonal language

Proposed Approach

INERSPEECH2018 7

Investigation the quality of synthesized speech on

various linguistic and temporal locality

• To estimate the optimum synthesis unit

Investigation of chunk connection

• To maintain the smoothness between chunks

• To produce more natural speech

Investigates the effect on speech quality in linguistic and

temporal locality choices for a Japanese ITTS system

Japanese Linguistic Information

INERSPEECH2018 8

Sentence

a ra yu ru ge N j i ts u wo m a g e t a no daPhoneme …

あらゆる曲げたのだげんじつを， …

POS1Word POS* POS2POS

3 POS10 POS

Breath group 1Breath group … Breath group 2

Accent phrase…

High pitch

Low pitch

(Bridge)

しHigh pitchLow pitch

(chopsticks)

はし(ha-shi)

Example of different accent type

• Word has each

accent type.

• Different accent type,

Different meanings.

*Part-Of-Speech tag

Japanese Linguistic Information

INERSPEECH2018 9

Sentence

a ra yu ru ge N j i ts u wo m a g e t a no daPhoneme …

あらゆる曲げたのだげんじつを， …

POS1Word POS POS2POS

3 POS10 POS

Breath group 1Breath group … Breath group 2

Accent phrase…

High pitch

Low pitch

にひゃく

Example of connecting accent type (accent phrase)

メートル

(two hundred)

ひゃ

(meter)

くートル

->(two hundred meters)

ひゃくートルメ

＋にひゃくメートル

Linguistic Features Locality

Use only “current” and “past” linguistic informationRemove “next” contextual information to unknown

Acoustic

featuresHMM

INERSPEECH2018 10

Phoneme

Accent phrase

Breath group

Train HMM models

Investigate several possible linguistic locality choices

Experimental Set-up

Setting Details

Dataset ATR 503 phonetically balanced sentences (HTS)（Training 450 sentences・Test 53 sentences）

Acoustic

F0 (1dim), Mel cepstrum coefficient (39 dim),

Aperiodic component (5 dim), and dynamic feature

Analysis

method

STRAIGHT [Kawahara et al. 1999]

Linguistic

Information

Phoneme identity, Part-of-speech tag, relative pitch position, # of accent

phrases and moras, # of breath group, position of breath group, etc.

Objective

evaluation

Mean opinion score of naturalness

(1: very bad, 2:bad, 3:normal, 4:good, 5:Very good)

16 Japanese native speakers, each 15 samples

INERSPEECH2018 11

Subjective Evaluation

Pho+Pos Pho+Pos+Acc Pho+Acc+Bre Baseline TTS

95% confidence interval

*p<0.001

Linguistic Phoneme+POSThe naturalness close to bad (MOS=2)

Linguistic Phoneme+POS+accent phraseThe naturalness close to normal (MOS=3)

INERSPEECH2018 12

Pho: phoneme

Pos: word POS tagAcc: accent phrase

Bre: breath group

Based on the results,

we decided to accent phrase as a synthesis unit

F0 Sequence Between Accent Phrases

INERSPEECH2018 13

Sentence unit synthesis Accent phrase unit synthesis

Problem• Prosody breaks occurs when using only current accent phrase

Chunk connection approach for smoothing[Timo et al. ,2012]

Proposed Approach

INERSPEECH2018 14

Investigation the quality of synthesized speech on

various linguistic and temporal locality

• To estimate the optimum synthesis unit

Investigation of chunk connection

• To maintain the smoothness between chunks

• To produce more natural speech

Investigates the effect on speech quality in linguistic and

temporal locality choices for a Japanese ITTS system

Method of Chunk Connection

INERSPEECH2018 15

Unit1 TTS

Unit2 TTS

Unit3 TTS

Synthesized

speech

Phrase with past all phrases

Unit1 TTS

Unit2 TTS

Unit3 TTS

Unit2Unit1

Phrase with past one phrase

Unit1 TTS

Unit2 TTS

Unit3 TTS

Phrase with past/next one phrase

Unit2 TTS

Unit3 TTS

Unit2Unit1

Unit3 TTSUnit2

Phrase by phrase

Subjective Evaluation

*p<0.00195%confidence interval

INERSPEECH2018 16

Connecting one past and next accent phrase chunks

could improve naturalness

F0 Sequence Between Accent Phrases

F0 sequences can be smoother by waiting for one more chunk

before starting the synthesis process

INERSPEECH2018 17

ConclusionJapanese language as tonal language

• Accent phrase (tonal information) unit required as synthesis unit

• It’s effective to wait for one accent phrase for improving quality

Future work• Implementation of full-pledge Japanese ITTS System

• Experiment of other tonal language (i.e., Chinese..)

• DNN based incremental TTS

INERSPEECH2018 18

Conclusion

End of slide

Thank you

INERSPEECH2018 20

Mel cepstrum distortion

Subjective result ofInvestigation of linguistic Information LocalityThe result tends to improve when linguistic information is added.

INERSPEECH2018 21

# of linguistic Info. ManyFew

[cent]

Pho Pho

+AccPho

Pho：Linguistic set of phoneme

Pos：Linguistic set of word

Acc：Linguistic set of accent phrase

Bre：Linguistic set of breath group

# of linguistic Info. ManyFew

Incremental TTS for Japanese - NAIST...Phoneme identity, Part-of-speech tag, relative pitch...

Documents

Transcript of Incremental TTS for Japanese - NAIST...Phoneme identity, Part-of-speech tag, relative pitch...

Dragons Breath

Yoga Breath

KNIFE BREATH.

Natural Breath

Gut Microbes and Irritable Bowel Syndrome...Jejunum Ileum . 4 Diagnosis of SIBO Type of Test Specific Test Breath testing Lactulose Breath Test 13C Xylose Breath Test Glucose Breath

Bad Breath Check: 5 Ways to Smell Your Breath

Benefits of Time-Correlated and Breath- Triggered MR ......Benefits of Time-Correlated and Breath- Triggered MR Acquisition in Treatment Position for Accurate Liver Lesion Contouring

REFLEX Breath

Bad Breath Treatment In Chennai | How to Cure Bad Breath?

Breath takingphoto

missouri breath alcohol program breath alcohol operator manual

Information Network 1 Routing (1) - NAIST

Breath of Life-text-MF - Breath Meditation · The Breath of Life The Practice of Breath Meditation According to Hindu, Buddhist, Taoist, Jewish and Christian Traditions Abbot George

Breath Testing for Prosecutors Breath Testing for Prosecutors

Air Handling Units Makes every breath a high quality breath.

Breath takingphotos

Nara Institute of Science and Technology - NAIST

Top Global University Project （Type B NAIST at Global campus with culturally diverse faculty, staff, and students. 【Summary of project】 NAIST will strive for global excellence

Breath & Imagination

Breath Science