Incremental TTS for Japanese - NAIST...Phoneme identity, Part-of-speech tag, relative pitch...
![Page 1: Incremental TTS for Japanese - NAIST...Phoneme identity, Part-of-speech tag, relative pitch position, # of accent phrases and moras, # of breath group, position of breath group, etc.](https://reader035.fdocuments.us/reader035/viewer/2022062415/5fcc1e398186ff0306770cf4/html5/thumbnails/1.jpg)
Incremental TTS for Japanese Language
Tomoya Yanagita 1, Sakriani Sakti 1,2, Satoshi Nakamura 1,2
1 Nara Institute of Science and Technology, Japan
2 RIKEN, Center for Advanced Intelligence Project AIP, Japan
![Page 2: Incremental TTS for Japanese - NAIST...Phoneme identity, Part-of-speech tag, relative pitch position, # of accent phrases and moras, # of breath group, position of breath group, etc.](https://reader035.fdocuments.us/reader035/viewer/2022062415/5fcc1e398186ff0306770cf4/html5/thumbnails/2.jpg)
Background
Text-To-Speech (TTS) needs a full sentence as input
2018/9/19, 2018 © TOMOYA YANAGITA, AHC-LAB, NAIST, INTERSPEECH 2018
“私の名前は柳田です。”
(I am Yanagita.) → TTS
If the input text is long, TTS has to wait until the end of the sentence, which causes a delay
“今日私はインクリメンタル音声合成について話しますが、もちろん皆さんご存知な方がおおいと ……”
(Today, I will talk about incremental speech synthesis, but I think that there are many people who know ……)
→ TTS: no response yet (waiting for the end of the sentence)
Speech needs to be synthesized in smaller chunks → Incremental TTS (ITTS)
![Page 3: Incremental TTS for Japanese - NAIST...Phoneme identity, Part-of-speech tag, relative pitch position, # of accent phrases and moras, # of breath group, position of breath group, etc.](https://reader035.fdocuments.us/reader035/viewer/2022062415/5fcc1e398186ff0306770cf4/html5/thumbnails/3.jpg)
Related Works
Existing work on ITTS is limited (mostly based on HMM)
Problems:
• Some contextual linguistic features become unknown
• Speech quality may deteriorate compared to standard HMM-TTS
![Page 4: Incremental TTS for Japanese - NAIST...Phoneme identity, Part-of-speech tag, relative pitch position, # of accent phrases and moras, # of breath group, position of breath group, etc.](https://reader035.fdocuments.us/reader035/viewer/2022062415/5fcc1e398186ff0306770cf4/html5/thumbnails/4.jpg)
[Baumann et al., 2014]
Analysis of the impact of potentially missing features on the quality of the estimated prosody.
• Investigation based only on a word-by-word synthesis extension.
→ German, English
![Page 5: Incremental TTS for Japanese - NAIST...Phoneme identity, Part-of-speech tag, relative pitch position, # of accent phrases and moras, # of breath group, position of breath group, etc.](https://reader035.fdocuments.us/reader035/viewer/2022062415/5fcc1e398186ff0306770cf4/html5/thumbnails/5.jpg)
[Pouget et al., 2015]
HMM training strategy for ITTS
• French
→ Word-by-word synthesis (potentially with a delay of one word)
![Page 6: Incremental TTS for Japanese - NAIST...Phoneme identity, Part-of-speech tag, relative pitch position, # of accent phrases and moras, # of breath group, position of breath group, etc.](https://reader035.fdocuments.us/reader035/viewer/2022062415/5fcc1e398186ff0306770cf4/html5/thumbnails/6.jpg)
Research Objectives
Most investigations focus only on German, English, and French
• They focus on word-by-word synthesis and its improvement.
• ITTS for tonal languages has not been studied yet.
Example of a tonal language → Japanese
• Major linguistic features: phoneme, accent phrase, and breath group (several accent phrases)
• Only the part-of-speech (POS) tag is used as word-level information
• Tonal features are important
• It may use a unit longer than a word, both for linguistic features and as the synthesis unit.
It is necessary to investigate the effect of linguistic and temporal locality choices on speech quality in a tonal language
![Page 7: Incremental TTS for Japanese - NAIST...Phoneme identity, Part-of-speech tag, relative pitch position, # of accent phrases and moras, # of breath group, position of breath group, etc.](https://reader035.fdocuments.us/reader035/viewer/2022062415/5fcc1e398186ff0306770cf4/html5/thumbnails/7.jpg)
Proposed Approach
Investigate the effect of linguistic and temporal locality choices on speech quality for a Japanese ITTS system
1st: Investigation of the quality of synthesized speech under various linguistic and temporal localities
• To estimate the optimum synthesis unit
2nd: Investigation of chunk connection
• To maintain smoothness between chunks
• To produce more natural speech
![Page 8: Incremental TTS for Japanese - NAIST...Phoneme identity, Part-of-speech tag, relative pitch position, # of accent phrases and moras, # of breath group, position of breath group, etc.](https://reader035.fdocuments.us/reader035/viewer/2022062415/5fcc1e398186ff0306770cf4/html5/thumbnails/8.jpg)
Japanese Linguistic Information
[Figure: hierarchy of Japanese linguistic information for an example sentence]
Sentence: あらゆる げんじつ を, … 曲げたのだ
Phoneme: a ra yu ru ge N j i ts u wo … m a g e t a no da
Word POS*: POS1, POS2, POS3, … POS10, POS11, POS12, POS13
Accent phrase: …
Breath group: Breath group 1, … Breath group 2
Example of different accent types: はし (ha-shi)
• Depending on which mora carries the high pitch and which the low pitch, はし means (bridge) or (chopsticks).
• Each word has its own accent type.
• A different accent type gives a different meaning.
*Part-Of-Speech tag
![Page 9: Incremental TTS for Japanese - NAIST...Phoneme identity, Part-of-speech tag, relative pitch position, # of accent phrases and moras, # of breath group, position of breath group, etc.](https://reader035.fdocuments.us/reader035/viewer/2022062415/5fcc1e398186ff0306770cf4/html5/thumbnails/9.jpg)
Japanese Linguistic Information
[Figure: sentence hierarchy (sentence, phoneme, word POS, accent phrase, breath group) for あらゆる げんじつ を, … 曲げたのだ, with high-pitch and low-pitch marks]
Example of connecting accent types (accent phrase)
にひゃく (two hundred) + メートル (meter) → にひゃくメートル (two hundred meters)
[Pitch diagrams: high/low pitch per mora for each word alone and for the connected phrase]
![Page 10: Incremental TTS for Japanese - NAIST...Phoneme identity, Part-of-speech tag, relative pitch position, # of accent phrases and moras, # of breath group, position of breath group, etc.](https://reader035.fdocuments.us/reader035/viewer/2022062415/5fcc1e398186ff0306770cf4/html5/thumbnails/10.jpg)
Linguistic Features Locality
Use only “current” and “past” linguistic information
Set “next” contextual information to unknown
[Diagram: linguistic features → HMM → acoustic features]
Linguistic levels: phoneme, word, accent phrase, breath group
Train HMM models
Investigate several possible linguistic locality choices
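The idea of keeping "current" and "past" context while setting "next" context to unknown can be sketched as follows. This is a minimal illustration, not the authors' implementation: the dictionary layout, the level names, the `mask_next_context` helper, and the `"xx"` placeholder are all assumptions made for this example.

```python
from typing import Dict

# Placeholder chosen for this sketch to mark an unknown value (assumption).
UNKNOWN = "xx"

# One context entry per linguistic level, each with past/current/next slots.
Context = Dict[str, Dict[str, str]]


def mask_next_context(context: Context) -> Context:
    """Return a copy in which every level's 'next' slot is set to UNKNOWN."""
    masked: Context = {}
    for level, slots in context.items():
        masked[level] = dict(slots)  # copy so the input is left untouched
        if "next" in masked[level]:
            masked[level]["next"] = UNKNOWN
    return masked


# Hypothetical full-sentence context for one phoneme.
full_context: Context = {
    "phoneme": {"past": "a", "current": "r", "next": "a"},
    "word": {"past": UNKNOWN, "current": "POS1", "next": "POS2"},
    "accent_phrase": {"past": UNKNOWN, "current": "AP1", "next": "AP2"},
    "breath_group": {"past": UNKNOWN, "current": "BG1", "next": "BG2"},
}

incremental_context = mask_next_context(full_context)
print(incremental_context)
```

Restricting the loop to a subset of levels would give the different locality choices the slide refers to, for example keeping the next phoneme but hiding the next accent phrase.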
![Page 11: Incremental TTS for Japanese - NAIST...Phoneme identity, Part-of-speech tag, relative pitch position, # of accent phrases and moras, # of breath group, position of breath group, etc.](https://reader035.fdocuments.us/reader035/viewer/2022062415/5fcc1e398186ff0306770cf4/html5/thumbnails/11.jpg)
Experimental Set-up
| Setting | Details |
| --- | --- |
| Dataset | ATR 503 phonetically balanced sentences (HTS); training: 450 sentences, test: 53 sentences |
| Acoustic features | F0 (1 dim), mel-cepstral coefficients (39 dim), aperiodic components (5 dim), and dynamic features |
| Analysis method | STRAIGHT [Kawahara et al., 1999] |
| Linguistic information | Phoneme identity, part-of-speech tag, relative pitch position, # of accent phrases and moras, # of breath groups, position of breath group, etc. |
| Subjective evaluation | Mean opinion score (MOS) of naturalness (1: very bad, 2: bad, 3: normal, 4: good, 5: very good); 16 native Japanese speakers, 15 samples each |
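For readers who want to reproduce this kind of evaluation, the sketch below computes a mean opinion score with a 95% confidence interval. The slides do not say how the intervals were computed, so the normal approximation (z = 1.96) is an assumption, and the ratings are made-up values, not the study's data.

```python
import math
import statistics
from typing import Sequence, Tuple


def mos_with_ci(scores: Sequence[float], z: float = 1.96) -> Tuple[float, float, float]:
    """Return (mean, lower, upper) using a normal-approximation interval."""
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two ratings")
    mean = statistics.mean(scores)
    # Standard error of the mean from the sample standard deviation.
    half_width = z * statistics.stdev(scores) / math.sqrt(n)
    return mean, mean - half_width, mean + half_width


# Hypothetical naturalness ratings on the 1 to 5 scale.
ratings = [3, 4, 3, 2, 4, 3, 3, 5, 2, 3]
mean, low, high = mos_with_ci(ratings)
print(f"MOS = {mean:.2f} (95% CI {low:.2f} to {high:.2f})")
```

With few ratings per condition, a Student's t quantile would give a slightly wider interval than z = 1.96.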
![Page 12: Incremental TTS for Japanese - NAIST...Phoneme identity, Part-of-speech tag, relative pitch position, # of accent phrases and moras, # of breath group, position of breath group, etc.](https://reader035.fdocuments.us/reader035/viewer/2022062415/5fcc1e398186ff0306770cf4/html5/thumbnails/12.jpg)
Subjective Evaluation
[Bar chart: MOS (1.0 to 5.0) for Pho+Pos, Pho+Pos+Acc, Pho+Acc+Bre, and baseline TTS; error bars show 95% confidence intervals; * p < 0.001]
• Phoneme + POS linguistic features: naturalness is close to bad (MOS = 2)
• Phoneme + POS + accent phrase linguistic features: naturalness is close to normal (MOS = 3)
Pho: phoneme
Pos: word POS tag
Acc: accent phrase
Bre: breath group
Based on the results, we decided to use the accent phrase as the synthesis unit
![Page 13: Incremental TTS for Japanese - NAIST...Phoneme identity, Part-of-speech tag, relative pitch position, # of accent phrases and moras, # of breath group, position of breath group, etc.](https://reader035.fdocuments.us/reader035/viewer/2022062415/5fcc1e398186ff0306770cf4/html5/thumbnails/13.jpg)
F0 Sequence Between Accent Phrases
[Figure: F0 contours for sentence-unit synthesis vs. accent-phrase-unit synthesis]
Problem
• Prosody breaks occur when using only the current accent phrase unit
Chunk connection approach for smoothing [Timo et al., 2012]
![Page 14: Incremental TTS for Japanese - NAIST...Phoneme identity, Part-of-speech tag, relative pitch position, # of accent phrases and moras, # of breath group, position of breath group, etc.](https://reader035.fdocuments.us/reader035/viewer/2022062415/5fcc1e398186ff0306770cf4/html5/thumbnails/14.jpg)
![Page 15: Incremental TTS for Japanese - NAIST...Phoneme identity, Part-of-speech tag, relative pitch position, # of accent phrases and moras, # of breath group, position of breath group, etc.](https://reader035.fdocuments.us/reader035/viewer/2022062415/5fcc1e398186ff0306770cf4/html5/thumbnails/15.jpg)
Method of Chunk Connection
[Diagram: four ways of feeding accent phrase units (Unit1, Unit2, Unit3, …) to TTS to produce the synthesized speech]
• Phrase by phrase
• Phrase with all past phrases
• Phrase with one past phrase
• Phrase with one past and one next phrase
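The four chunk connection variants differ in which neighbouring units are passed to TTS together with the current unit. The sketch below shows one reading of the diagram; the function name, the strategy identifiers, and the idea of returning the position of the current unit inside the TTS input (so that only its portion of the output would be kept) are assumptions for illustration, not details taken from the slides.

```python
from typing import List, Tuple

STRATEGIES = ("phrase_by_phrase", "past_all", "past_one", "past_next_one")


def tts_input_for(units: List[str], i: int, strategy: str) -> Tuple[List[str], int]:
    """Return the units fed to TTS for target unit i and the target's index in that input."""
    if not 0 <= i < len(units):
        raise IndexError("target unit index out of range")
    if strategy == "phrase_by_phrase":
        start, end = i, i + 1                                # current unit only
    elif strategy == "past_all":
        start, end = 0, i + 1                                # all past units + current
    elif strategy == "past_one":
        start, end = max(0, i - 1), i + 1                    # one past unit + current
    elif strategy == "past_next_one":
        start, end = max(0, i - 1), min(len(units), i + 2)   # one past + current + one next
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return units[start:end], i - start


units = ["Unit1", "Unit2", "Unit3", "Unit4"]
for name in STRATEGIES:
    print(name, tts_input_for(units, 2, name))
```

Only the last variant needs a unit that has not been spoken yet, which is why it adds a delay of one accent phrase before synthesis can start.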
![Page 16: Incremental TTS for Japanese - NAIST...Phoneme identity, Part-of-speech tag, relative pitch position, # of accent phrases and moras, # of breath group, position of breath group, etc.](https://reader035.fdocuments.us/reader035/viewer/2022062415/5fcc1e398186ff0306770cf4/html5/thumbnails/16.jpg)
Subjective Evaluation
[Bar chart: MOS (1.0 to 5.0) for the chunk connection methods; error bars show 95% confidence intervals; * p < 0.001]
Connecting one past and one next accent phrase chunk could improve naturalness
![Page 17: Incremental TTS for Japanese - NAIST...Phoneme identity, Part-of-speech tag, relative pitch position, # of accent phrases and moras, # of breath group, position of breath group, etc.](https://reader035.fdocuments.us/reader035/viewer/2022062415/5fcc1e398186ff0306770cf4/html5/thumbnails/17.jpg)
F0 Sequence Between Accent Phrases
F0 sequences can be smoother by waiting for one more chunk
before starting the synthesis process
![Page 18: Incremental TTS for Japanese - NAIST...Phoneme identity, Part-of-speech tag, relative pitch position, # of accent phrases and moras, # of breath group, position of breath group, etc.](https://reader035.fdocuments.us/reader035/viewer/2022062415/5fcc1e398186ff0306770cf4/html5/thumbnails/18.jpg)
Conclusion
Japanese as a tonal language
• The accent phrase (tonal information) is required as the synthesis unit
• Waiting for one accent phrase is effective for improving quality
Future work
• Implementation of a full-fledged Japanese ITTS system
• Experiments on other tonal languages (e.g., Chinese)
• DNN-based incremental TTS
![Page 19: Incremental TTS for Japanese - NAIST...Phoneme identity, Part-of-speech tag, relative pitch position, # of accent phrases and moras, # of breath group, position of breath group, etc.](https://reader035.fdocuments.us/reader035/viewer/2022062415/5fcc1e398186ff0306770cf4/html5/thumbnails/19.jpg)
End of slide
Thank you
Q & A
![Page 20: Incremental TTS for Japanese - NAIST...Phoneme identity, Part-of-speech tag, relative pitch position, # of accent phrases and moras, # of breath group, position of breath group, etc.](https://reader035.fdocuments.us/reader035/viewer/2022062415/5fcc1e398186ff0306770cf4/html5/thumbnails/20.jpg)
![Page 21: Incremental TTS for Japanese - NAIST...Phoneme identity, Part-of-speech tag, relative pitch position, # of accent phrases and moras, # of breath group, position of breath group, etc.](https://reader035.fdocuments.us/reader035/viewer/2022062415/5fcc1e398186ff0306770cf4/html5/thumbnails/21.jpg)
Objective results of the investigation of linguistic information locality
The results tend to improve as linguistic information is added.
[Two bar charts, each ordered from few to many linguistic features. F0 [cent], axis 0 to 300: Pho, Pho+Pos, Pho+Pos+Acc, Pho+Acc+Bre, Pho+Pos+Acc+Bre. Mel cepstrum distortion, axis 0 to 4.5: Pho, Pho+Pos, Pho+Acc, Pho+Acc+Bre, Pho+Pos+Acc+Bre.]
Pho: linguistic set of phoneme
Pos: linguistic set of word
Acc: linguistic set of accent phrase
Bre: linguistic set of breath group
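The slides do not define the two objective measures, so the sketch below uses their commonly used definitions as an assumption: mel-cepstral distortion in dB averaged over time-aligned frames, and root-mean-square F0 error in cents over frames where both signals are voiced. Function names and the example values are hypothetical.

```python
import math
from typing import Sequence


def mel_cepstral_distortion(ref: Sequence[Sequence[float]],
                            syn: Sequence[Sequence[float]]) -> float:
    """Average per-frame mel-cepstral distortion in dB (frames assumed time-aligned)."""
    if len(ref) != len(syn) or not ref:
        raise ValueError("need the same non-zero number of frames")
    const = 10.0 / math.log(10.0)
    total = 0.0
    for r, s in zip(ref, syn):
        total += const * math.sqrt(2.0 * sum((a - b) ** 2 for a, b in zip(r, s)))
    return total / len(ref)


def f0_rmse_cents(ref_f0: Sequence[float], syn_f0: Sequence[float]) -> float:
    """RMS F0 error in cents over frames where both values are voiced (> 0 Hz)."""
    errors = [1200.0 * math.log2(s / r)
              for r, s in zip(ref_f0, syn_f0) if r > 0 and s > 0]
    if not errors:
        raise ValueError("no frames voiced in both sequences")
    return math.sqrt(sum(e * e for e in errors) / len(errors))


# Made-up two-frame example; 0.0 marks an unvoiced frame.
print(mel_cepstral_distortion([[0.1, 0.2], [0.0, 0.1]], [[0.1, 0.3], [0.1, 0.1]]))
print(f0_rmse_cents([120.0, 0.0, 130.0], [125.0, 110.0, 128.0]))
```

The energy-related 0th cepstral coefficient is conventionally left out of the distortion, so callers would pass coefficients from index 1 onward.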