Incremental TTS for Japanese - NAIST...Phoneme identity, Part-of-speech tag, relative pitch...


Page 1: Incremental TTS for Japanese - NAIST...Phoneme identity, Part-of-speech tag, relative pitch position, # of accent phrases and moras, # of breath group, position of breath group, etc.

Incremental TTS for Japanese Language

Tomoya Yanagita 1, Sakriani Sakti 1,2, Satoshi Nakamura 1,2

1 Nara Institute of Science and Technology, Japan

2 RIKEN, Center for Advanced Intelligence Project AIP, Japan

Page 2

Background

Text-To-Speech (TTS) needs a full sentence as input

2018/9/19 2018 © TOMOYA YANAGITA, AHC-LAB, NAIST, INTERSPEECH 2018

“私の名前は柳田です。” (I am Yanagita.) → TTS

If the input text is long, TTS has to wait until the end of the sentence, which causes a delay.

“今日私はインクリメンタル音声合成について話しますが、もちろん皆さんご存知な方がおおいと ……” (Today, I will talk about incremental speech synthesis, but I think that there are many people who know ……) → TTS: no response yet (waiting for the end of the sentence)

Speech therefore has to be synthesized in smaller chunks → Incremental TTS (ITTS)
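The delay argument above can be illustrated with a small sketch. It assumes hypothetical arrival times for each text chunk and ignores the synthesis time itself; the function name `first_audio_time` and the `lookahead` parameter are illustrative and do not come from the slides.

```python
def first_audio_time(arrival_times, lookahead=None):
    """Earliest time (in seconds) at which synthesis of the first chunk can start.

    arrival_times: times at which each text chunk becomes available (hypothetical).
    lookahead: number of future chunks to wait for; None means waiting for the
               whole sentence, as a conventional TTS system does.
    """
    if not arrival_times:
        raise ValueError("at least one chunk is required")
    last = len(arrival_times) - 1
    index = last if lookahead is None else min(lookahead, last)
    return arrival_times[index]


# Hypothetical example: four chunks arriving over about three seconds.
arrivals = [0.5, 1.2, 2.0, 3.1]
full_sentence = first_audio_time(arrivals)      # waits for the last chunk
incremental = first_audio_time(arrivals, 0)     # starts on the first chunk
one_chunk_wait = first_audio_time(arrivals, 1)  # waits for one more chunk
```

Under these assumed timings, the full-sentence system cannot start before the last chunk arrives, while an incremental system can start as soon as its required chunks are available.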

Page 3

Related Works

Limited existing work on ITTS (mostly HMM-based)

Problems:
• Some contextual linguistic features become unknown
• Speech quality may deteriorate compared to standard HMM-TTS

Page 4

Related Works

[Baumann et al., 2014] Analysis of the impact of potentially missing features on the quality of the estimated prosody.
• Investigation based only on a word-by-word synthesis extension.
-> German, English

Page 5

Related Works

[Pouget et al., 2015] HMM training strategy for ITTS.
• Word-by-word synthesis (potentially with a delay of one word)
-> French

Page 6

Research Objectives

Most investigations focus only on German, English, and French
• They focus on word-by-word synthesis and its improvement.
• ITTS for tonal languages has not been studied yet.

Example of a tonal language → Japanese
• Major linguistic features: phoneme, accent phrase, and breath group (several accent phrases)
• Only the part-of-speech (POS) tag is used as word-level information
• Tonal features are important
• It may use a unit longer than a word as the linguistic feature and synthesis unit.

It is necessary to investigate the effect of linguistic and temporal locality choices on speech quality in a tonal language

Page 7

Proposed Approach

Investigate the effect of linguistic and temporal locality choices on speech quality for a Japanese ITTS system

1st: Investigation of the quality of synthesized speech under various linguistic and temporal localities
• To estimate the optimum synthesis unit

2nd: Investigation of chunk connection
• To maintain smoothness between chunks
• To produce more natural speech

Page 8

Japanese Linguistic Information

[Figure: hierarchy of linguistic information for an example sentence]
Sentence: あらゆる げんじつを , … 曲げたのだ (roughly: "bent every reality …")
Phoneme: a ra yu ru ge N ji tsu wo … ma ge ta no da
Word (POS*): POS 1, POS 2, POS 3, …, POS 10, POS 11, POS 12, POS 13, …
Accent phrase: …
Breath group: Breath group 1, …, Breath group 2

*POS: part-of-speech tag

Example of different accent types: はし (ha-shi) means either "bridge" or "chopsticks", depending on the high/low pitch pattern.
• Each word has its own accent type.
• Different accent types give different meanings.

Page 9

Japanese Linguistic Information

Example of connecting accent types (accent phrase):
にひゃく (two hundred) + メートル (meter) -> にひゃくメートル (two hundred meters)

[Figure: high/low pitch patterns of the two words on their own and of the combined accent phrase]

->

Page 10

Linguistic Features Locality

Use only “current” and “past” linguistic information
Set “next” contextual information to unknown

[Figure: linguistic features (phoneme, word, accent phrase, breath group) → train HMM models → acoustic features]

Investigate several possible linguistic locality choices
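One way to picture this locality restriction is as masking of context features before they reach the HMM. The sketch below is an assumption about the representation, not the authors' implementation: the feature names, the `next_` prefix, and the `xx` placeholder are all hypothetical.

```python
UNKNOWN = "xx"  # hypothetical placeholder for an unknown context value


def mask_future_context(features):
    """Return a copy of `features` with every "next_*" entry set to UNKNOWN.

    Entries describing the current or past context are left untouched.
    """
    return {
        name: UNKNOWN if name.startswith("next_") else value
        for name, value in features.items()
    }


# Hypothetical full-context features for one phoneme.
full_context = {
    "prev_phoneme": "a",
    "cur_phoneme": "r",
    "next_phoneme": "a",
    "cur_pos": "noun",
    "next_pos": "particle",
}
incremental_context = mask_future_context(full_context)
```

Different locality choices would then correspond to masking at different levels (word, accent phrase, or breath group) rather than only at the phoneme level shown here.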

Page 11

Experimental Set-up

Setting: Details

Dataset: ATR 503 phonetically balanced sentences (HTS) (training: 450 sentences, test: 53 sentences)

Acoustic features: F0 (1 dim), mel-cepstrum coefficients (39 dim), aperiodic components (5 dim), and dynamic features

Analysis method: STRAIGHT [Kawahara et al. 1999]

Linguistic information: phoneme identity, part-of-speech tag, relative pitch position, # of accent phrases and moras, # of breath groups, position of breath group, etc.

Subjective evaluation: mean opinion score (MOS) of naturalness (1: very bad, 2: bad, 3: normal, 4: good, 5: very good); 16 native Japanese speakers, 15 samples each
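As a rough illustration of how a MOS and its 95% confidence interval can be obtained from listener ratings, here is a minimal sketch. The slides do not say how the interval was computed; the normal approximation (z = 1.96) and the ratings below are assumptions for illustration only.

```python
import math
import statistics


def mos_with_confidence_interval(ratings, z=1.96):
    """Return (mean opinion score, half-width of the confidence interval).

    Uses the normal approximation: half-width = z * s / sqrt(n),
    where s is the sample standard deviation of the ratings.
    """
    if len(ratings) < 2:
        raise ValueError("at least two ratings are required")
    mean = statistics.mean(ratings)
    half_width = z * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mean, half_width


# Hypothetical ratings on the 1-5 naturalness scale.
ratings = [3, 4, 2, 3, 3, 4, 3, 2]
mos, half_width = mos_with_confidence_interval(ratings)
```

The interval is then reported as `mos ± half_width`; with few ratings, a t-distribution quantile would be a more careful choice than 1.96.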

Page 12

Subjective Evaluation

[Figure: MOS (scale 1.0 to 5.0) with 95% confidence intervals for Pho+Pos, Pho+Pos+Acc, Pho+Acc+Bre, and baseline TTS; *p<0.001]

Pho: phoneme
Pos: word POS tag
Acc: accent phrase
Bre: breath group

Linguistic features phoneme+POS: naturalness close to bad (MOS=2)

Linguistic features phoneme+POS+accent phrase: naturalness close to normal (MOS=3)

Based on the results, we decided to use the accent phrase as the synthesis unit

Page 13

F0 Sequence Between Accent Phrases

[Figure: F0 sequences for sentence-unit synthesis and accent-phrase-unit synthesis]

Problem
• Prosody breaks occur when using only the current accent phrase unit

Chunk connection approach for smoothing [Timo et al., 2012]

Page 15

Method of Chunk Connection

[Figure: each accent phrase unit (Unit1, Unit2, Unit3, …) is passed to TTS together with a different amount of neighboring context to produce the synthesized speech]

Connection methods:
• Phrase by phrase
• Phrase with all past phrases
• Phrase with one past phrase
• Phrase with one past and one next phrase
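The four connection methods can be read as different context windows around the accent phrase being synthesized. The sketch below only selects which units are handed to TTS together; the strategy names and the function `context_window` are hypothetical, and how the current phrase's audio is cut out of the synthesized result is not covered by the slides.

```python
def context_window(units, i, strategy):
    """Return the accent phrase units passed to TTS when synthesizing units[i]."""
    if not 0 <= i < len(units):
        raise IndexError("unit index out of range")
    if strategy == "phrase_by_phrase":
        start, end = i, i + 1                      # current phrase only
    elif strategy == "past_all":
        start, end = 0, i + 1                      # all past phrases + current
    elif strategy == "past_one":
        start, end = max(0, i - 1), i + 1          # one past phrase + current
    elif strategy == "past_next_one":
        start, end = max(0, i - 1), min(len(units), i + 2)  # one past, one next
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return units[start:end]


# Hypothetical sentence made of four accent phrase units.
units = ["Unit1", "Unit2", "Unit3", "Unit4"]
window = context_window(units, 2, "past_next_one")
```

Only the last strategy needs a unit that has not been spoken yet, which is why it adds a delay of one accent phrase.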

Page 16

Subjective Evaluation

[Figure: MOS (scale 1.0 to 5.0) with 95% confidence intervals for the chunk connection methods; *p<0.001]

Connecting one past and one next accent phrase chunk could improve naturalness

Page 17

F0 Sequence Between Accent Phrases

F0 sequences can be made smoother by waiting for one more chunk before starting the synthesis process

Page 18

Conclusion

Japanese as a tonal language
• The accent phrase (tonal information) is required as the synthesis unit
• It is effective to wait for one accent phrase to improve quality

Future work
• Implementation of a full-fledged Japanese ITTS system
• Experiments on other tonal languages (e.g., Chinese)
• DNN-based incremental TTS

Page 19

End of slide

Thank you

Q & A


Page 21

Objective results of the investigation of linguistic information locality

The results tend to improve as more linguistic information is added.

[Figure: two bar charts, F0 [cent] (axis 0 to 300) and mel-cepstrum distortion (axis 0 to 4.5), for feature sets ranging from Pho to Pho+Pos+Acc+Bre, ordered from few to many linguistic features]

Pho: linguistic set of phoneme

Pos: linguistic set of word

Acc: linguistic set of accent phrase

Bre: linguistic set of breath group
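The slides do not define the two measures plotted here. As a hedged sketch, the code below uses common definitions: a pitch difference in cents, 1200 · |log2(f_syn / f_ref)|, and a per-frame mel-cepstral distortion, (10 / ln 10) · sqrt(2 · Σ (c_ref − c_syn)²). Whether the authors used exactly these forms (for example, which coefficients were included or how frames were averaged) is an assumption.

```python
import math


def f0_difference_cents(f_ref, f_syn):
    """Absolute pitch difference between two F0 values (Hz), in cents."""
    if f_ref <= 0 or f_syn <= 0:
        raise ValueError("F0 values must be positive (voiced frames only)")
    return 1200.0 * abs(math.log2(f_syn / f_ref))


def mel_cepstral_distortion(c_ref, c_syn):
    """Mel-cepstral distortion (dB) between two coefficient vectors of one frame."""
    if len(c_ref) != len(c_syn):
        raise ValueError("coefficient vectors must have the same length")
    squared_error = sum((r - s) ** 2 for r, s in zip(c_ref, c_syn))
    return (10.0 / math.log(10)) * math.sqrt(2.0 * squared_error)


# Hypothetical values: one octave apart, and a single differing coefficient.
octave = f0_difference_cents(100.0, 200.0)
mcd = mel_cepstral_distortion([1.0, 0.5], [0.0, 0.5])
```

In practice both measures would be averaged over the aligned frames of the test sentences.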