Segmental encoding of prosodic categories: A perception study through speech synthesis Kyuchul Yoon,...
-
Upload
job-stephens -
Category
Documents
-
view
217 -
download
0
Transcript of Segmental encoding of prosodic categories: A perception study through speech synthesis Kyuchul Yoon,...
Segmental encoding of prosodic categories:
A perception study through speech synthesis
Kyuchul Yoon, Mary Beckman & Chris Brew
August 19, 2005 Segmental encoding of prosodic categories
2
Contents• An overview
Allophonic variationsSegmental positionsWord-initial vs. word-internal positions in K-ToBI frameworkAllophonic variations: an extended viewProduction studies on Korean and other languagesNeed for a perception study, but how?
• A diphone-based speech synthesis system for KoreanConventional diphones vs. prosodically-sensitive ones
• A listening experimentDesign and synthesis of test stimuliResults & conclusion
August 19, 2005 Segmental encoding of prosodic categories
3
Allophonic variations
• Defined mostly in terms of neighboring segments.e.g. Allophones of /t/ in English
/t/
[t] [th] [ʔ] [ɾ]
“stop” “top” “kitten” “little”
An overview
August 19, 2005 Segmental encoding of prosodic categories
4
Segmental positions
• Determined in most cases within a word
by its
1. neighboring segments and
2. word boundaries, i.e. word-initial/final
An overview
August 19, 2005 Segmental encoding of prosodic categories
5
Word-initial vs. word-internal positions in K(orean)-ToBI (Prosody labeling conventions)
IP: Intonational Phrase H: high toneAP: Accentual Phrase L: low toneW: Prosodic Word (PW) T: tone (could be H or L)σ: syllable %: boundary tone (e.g. H%, L%, HL%,
etc.)
An overview
Tone & Break Indices
August 19, 2005 Segmental encoding of prosodic categories
6
Word-initial positions in K-ToBI
An overview
August 19, 2005 Segmental encoding of prosodic categories
7
Word-initial positions in K-ToBI
An overview
word-initial
word-final
August 19, 2005 Segmental encoding of prosodic categories
8
Word-initial positions in K-ToBI
An overview
PW-initialAP-initialIP-initial
PW-initialAP-initial
PW-initial
PW-medial
Three types of word-initial positions in K-ToBI !
August 19, 2005 Segmental encoding of prosodic categories
9
Allophonic variations:an extended view
• Defined mostly in terms of neighboring segments.
• Need to be examined with respect to its prosodic constituency in K-ToBI.
An overview
August 19, 2005 Segmental encoding of prosodic categories
10
Productions studieson Korean and other languages
• Korean
Jun (’93,’98): lenis stop voicing, obstruent nasalization, VOT of /ph/
Cho & Keating (’01): segmental properties of /t, th, t*, n/
Kim (’01): segmental properties of /sh, s*/
Yoon (’03): subsegmental durations of /sh, s*/
• Other languagesSmith (’97): American /z/
Pierrehumbert & Talkin (’92), Pierrehumbert (’95): English /h/ and /ʔ/
Fougeron (’01): French segments /t, k, s, l, n, i, a/
Keating et al. (’98): /t, n/ of Korean, English, French & Taiwanese
An overview
August 19, 2005 Segmental encoding of prosodic categories
11
Productions studieson Korean and other languages – summary of results
• KoreanAP is the domain of lenis stop voicing, post-obstruent tensing (Jun).
IP is the domain of obstruent nasalization (Jun).VOT of /ph/: AP-initial > PW-initial > PW-medial (Jun).Consonants initial to higher prosodic domains are ‘stronger’ (Cho, Keating, Kim).Non-uniform variations in durations of subsegmental units (Yoon).
• Other languagesAmerican English /z/ is devoiced differently in different positions (Smith).English /h/ and /ʔ/ produced differently in different word-/phrase-level prosody. (P & T)Articulation of initial segments varied depending on the prosodic level of the
constituent, i.e. initial to an IP, AP, W or syllable. (Fougeron)There is phrasal/prosodic conditioning of articulation across the four languages.
(Keating et al.)
An overview
August 19, 2005 Segmental encoding of prosodic categories
12
Need for a perception study, but how?
• As the production studies show, Korean speakers seem to encode prosodic categories, i.e. IP, AP, PW, etc., in domain-initial segments.
• Then what about listeners? Do they decode the encodings?Are the encodings perceptible?How do we test it?
• One way to test it is to use a concatenative TTS system so that one can synthesize sentences by manipulating phone-sized units, i.e. diphones.
An overview
August 19, 2005 Segmental encoding of prosodic categories
13
Need for a perception study, but how?
Key idea: Synthesize a set of two sentences, differing only in terms of their domain-initial segment compositions.
An overview
IP-initial
AP-initial
PW-initial
PW-medial
August 19, 2005 Segmental encoding of prosodic categories
14
Need for a perception study, but how?
Test stimuli:1st set: good AP: composed of prosodically appropriate synthetic units
bad AP: composed of prosodically inappropriate units (Replace with )
2nd set: good PW: composed of prosodically appropriate synthetic units bad PW: composed of prosodically inappropriate units (Replace with )
An overview
IP-initial
AP-initial
PW-initial
PW-medial
August 19, 2005 Segmental encoding of prosodic categories
15
Diphones
• Text-to-speech (TTS) synthesis systems
• Diphones, prosodically sensitive
• Festival Speech Synthesis System (University of Edinburgh)
A diphone-based speech synthesis system
August 19, 2005 Segmental encoding of prosodic categories
16
Text-to-speech (TTS) synthesis systems
• A system that automatically generates speech given a particular natural language text; the speech produced should be both comprehensible and natural sounding.
• Two main components; NLP module (natural language processing)DSP module (digital signal processing)
• NLP module: an elaborate text analysis system
input text sequences of phones + prosodic organization.
• DSP module:symbolic input from NLP natural sounding speech
A diphone-based speech synthesis system
August 19, 2005 Segmental encoding of prosodic categories
17
Diphones
• Phone-sized synthesis units.
• Parametric representations of short chunks (usually extending from the middle of one phone to the middle of the immediately following one) of audio signal extracted from a cache of recorded sentences that can be re-combined by a TTS system to produce a novel synthesized word or sentence.
• Avoid the need for modeling phone-to-phone transitions of natural speech signal. For example, a diphone i-u contains the second half of [i] and the first half of [u]. A p-a diphone contains the second half of [p] and the first half of [a].
• Prosodically sensitive diphones: Each diphone is stored as four different versions, i.e. three versions initial to an IP, AP or PW, and one version medial to a PW. (NB: A conventional diphone is stored as one version)
A diphone-based speech synthesis system
August 19, 2005 Segmental encoding of prosodic categories
18
Diphones
A diphone-based speech synthesis system
IP-initial <p-a
AP-initial [p-a
PW-initial {p-a
PW-medialp-a
6,503 prosodic diphones needed to synthesize any Korean utterance.
예 ) < 바다로 ] [ 바닷가로 >… #-< ㅂ , < ㅂ - ㅏ , ㅏ - ㄷ , ㄷ - ㅏ , ㅏ - ㄹ , ㄹ - ㅗ ], ㅗ ]-[ ㅂ , [ ㅂ -ㅏ , …
August 19, 2005 Segmental encoding of prosodic categories
19
Festival Speech Synthesis System(University of Edinburgh, http://www.festvox.org)
• A free software multi-lingual speech synthesis workbench.
• An open architecture for research in speech synthesis.
• Primarily developed under Unix/Linux/FreeBSD/Solaris and ported to Windows.
• Developed for conventional diphone-based systems, but can be modified to accommodate our prosodically sensitive diphones.
• Consult Yoon (’05) for how we created a prototype system.
A diphone-based speech synthesis system
August 19, 2005 Segmental encoding of prosodic categories
20
Design & synthesis of test stimuli
• 96 stimuli (phrases) synthesized from the Festival system (Durations and F0 contours copied from natural utterances).
• All were composed of either two AP’s or two PW’s.
• All contained one target site, where an AP/PW-initial segment was replaced with a PW-medial segment.
24 good AP: phrases with intact diphones.24 bad AP : phrases whose target site segment (AP-initial segment)
was replaced with a PW-medial segment 24 good PW: phrases with intact diphones24 bad PW : phrases whose target site segment (PW-initial segment)
was replaced with a PW-medial segment
A listening experiment
August 19, 2005 Segmental encoding of prosodic categories
21
Design & synthesis of test stimuli
• Synthesis of a sample stimulus (Praat script)
< 삼성차의 ] [ 가치는 >
A listening experiment
natural utterance
diphone sequences from Festival
fundamental frequency (F0) contour and segmental durationscopied from natural utterance
intensity contour copied from natural utterance
• Prototype system lacks duration & F0 generation module Get help from natural utterances.
August 19, 2005 Segmental encoding of prosodic categories
22
Design & synthesis of test stimuli• Sample stimuli
< 그의 ] [ 발언은 > target site segment: /p/
A listening experiment
August 19, 2005 Segmental encoding of prosodic categories
23
Design & synthesis of test stimuli• More sample stimuli
A listening experiment
target segment good AP bad AP good PW bad PW
/p/
/t/
/k/
/ph/
/th/
/t*/
/tʃ/
/tʃh/
/sh/
August 19, 2005 Segmental encoding of prosodic categories
24
Results & conclusion• 80 listeners (37 women and 43 men):
native speakers of Korean, average age of 30.6, grew up in Korea until at least 18 years old.
• Two types of tests in three tasksIntelligibility: dictation task wrote down what they heard in hangulNaturalness: rating & preference task rate one version wrt/ the other and choose one over the other
• Statistical analyses showed that listeners performed better in the dictation task with “good” versions of the stimuli. They also liked/rated better the “good” versions.
• Segmental encoding of prosodic domains/categories is perceptible to Korean listeners.
A listening experiment