Presentation 3 : Classification of Speech...
Phonetic Realization
The role of phonemes is to provide a link between the
orthography of written language and the speech signal that
is produced when the word is spoken
It is important to understand how speech will be realized in
practice in order to design the best speech processing
systems
Larry -> /L(/l/) AE(/ae/) R(/r/) IY(/i/)/ (in ARPAbet notation)
How old are you -> /X H AW Z OW L D Z AA R Z Y UW X/
Yemek yedim -> X y e m e k Z y e d i m X
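The mapping from orthography to a phoneme string can be sketched as a pronunciation lexicon lookup. This is a minimal illustration, not a real lexicon: LEXICON and transcribe are hypothetical names, the entries are read off the examples above, and the interpretation of X as utterance-boundary silence and Z as a word boundary is an assumption based on where those symbols appear.

```python
# Hypothetical mini-lexicon: orthographic word -> phoneme symbols,
# copied from the examples on the slide.
LEXICON = {
    "larry": ["L", "AE", "R", "IY"],
    "how": ["H", "AW"],
    "old": ["OW", "L", "D"],
    "are": ["AA", "R"],
    "you": ["Y", "UW"],
}


def transcribe(sentence, boundary="X", word_sep="Z"):
    """Return the phoneme sequence for a sentence.

    Assumption: `boundary` marks the start/end of the utterance and
    `word_sep` marks the gap between words, as the slide seems to do.
    """
    phones = [boundary]
    for i, word in enumerate(sentence.lower().split()):
        if i > 0:
            phones.append(word_sep)
        phones.extend(LEXICON[word])  # raises KeyError for unknown words
    phones.append(boundary)
    return phones


print(" ".join(transcribe("How old are you")))
```

Real systems use large pronunciation dictionaries or letter-to-sound rules in place of the hand-written table.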
Phonemes in American English
[Figure: classification tree of the phonemes in American English]
Vowels: front, center, back
Diphthongs
Semivowels: glides, liquids
Consonants: nasals, stops (voiced, unvoiced), fricatives (voiced, unvoiced), whispers, affricates
Phonemes in American English
Each phoneme can be classified as either
a continuant or a non-continuant
Continuant sounds
They are produced by a fixed (non-time varying) vocal
tract configuration
Vowels, fricatives and the nasals
Non-continuant sounds
They are produced by a changing vocal tract
configuration
Diphthongs, semivowels, stops, affricates
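The continuant / non-continuant split can be written down as a small lookup. This is only a sketch: the two class sets come from the lists above, while PHONEME_CLASS is a hypothetical table holding one example phoneme per class.

```python
# Sound classes from the slide, grouped by vocal tract behaviour.
CONTINUANT_CLASSES = {"vowel", "fricative", "nasal"}
NON_CONTINUANT_CLASSES = {"diphthong", "semivowel", "stop", "affricate"}

# Hypothetical example table: one phoneme per class, for illustration only.
PHONEME_CLASS = {
    "IY": "vowel",
    "S": "fricative",
    "M": "nasal",
    "AY": "diphthong",
    "W": "semivowel",
    "P": "stop",
    "CH": "affricate",
}


def is_continuant(phoneme):
    """True if the phoneme is produced with a fixed vocal tract shape."""
    sound_class = PHONEME_CLASS[phoneme]
    if sound_class in CONTINUANT_CLASSES:
        return True
    if sound_class in NON_CONTINUANT_CLASSES:
        return False
    raise ValueError(f"unclassified sound class: {sound_class}")


for ph in ("IY", "S", "AY", "P"):
    print(ph, "continuant" if is_continuant(ph) else "non-continuant")
```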
Vowels
Generally have the longest duration in natural speech
However, they carry very little linguistic information about the
orthography of the sentence
Vowels
Vowels are produced by exciting a fixed vocal tract with quasi-periodic
pulses of air caused by vibration of the vocal cords (at a
certain fundamental frequency); they are voiced sounds
We use the term quasi because perfect periodicity is never
achieved
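The idea of quasi-periodicity can be illustrated by generating glottal pulse instants whose spacing wobbles slightly around the nominal pitch period. This is a toy sketch: the function name, the 100 Hz fundamental and the 2% jitter are assumptions chosen for illustration, not values from the slides.

```python
import random


def glottal_pulse_times(f0_hz, duration_s, jitter=0.02, seed=0):
    """Return pulse instants (in seconds) with slightly perturbed periods.

    Assumption: each period deviates from 1/f0 by a random fraction of
    at most +/- `jitter`; real jitter statistics are more complex.
    """
    rng = random.Random(seed)
    period = 1.0 / f0_hz
    t = 0.0
    times = []
    while t < duration_s:
        times.append(t)
        t += period * (1.0 + rng.uniform(-jitter, jitter))
    return times


times = glottal_pulse_times(f0_hz=100.0, duration_s=0.1)
periods_ms = [1000.0 * (b - a) for a, b in zip(times, times[1:])]
# The periods cluster around 10 ms but are never all identical.
print([round(p, 3) for p in periods_ms])
```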
The front (/IY IH EH AE/), center (/AA ER AH-AX AO/) and back
(/UW UH OW/) subgroups are defined primarily according to the
position of the tongue
The jaw, the lips and, to a small extent, the velum also influence the
resulting sound
Vowels
IPA symbols and some typical words for the vowels
Vocal tract configuration (cross-sectional area) shapes the
spectrum, defines the formant (resonant) frequencies
Vowels
The plot shows first formant frequency vs. second formant
frequency
Observe the vowel triangle in the figure
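The corners of the vowel triangle can be made concrete with a nearest-neighbour lookup in the (F1, F2) plane. This is a rough sketch: the formant values are approximate averages often quoted for adult male speakers and are assumptions here, not data read from the figure; CORNER_VOWELS and nearest_corner_vowel are hypothetical names.

```python
import math

# Assumed approximate (F1, F2) averages in Hz for the three corner vowels.
CORNER_VOWELS = {
    "IY": (270.0, 2290.0),  # high front, as in "beet"
    "AA": (730.0, 1090.0),  # low, as in "father"
    "UW": (300.0, 870.0),   # high back, as in "boot"
}


def nearest_corner_vowel(f1_hz, f2_hz):
    """Return the corner vowel closest to the measured formant pair."""
    return min(
        CORNER_VOWELS,
        key=lambda v: math.hypot(
            f1_hz - CORNER_VOWELS[v][0], f2_hz - CORNER_VOWELS[v][1]
        ),
    )


print(nearest_corner_vowel(280.0, 2250.0))
print(nearest_corner_vowel(700.0, 1100.0))
print(nearest_corner_vowel(320.0, 900.0))
```

A plain Euclidean distance in Hz is used for simplicity; a perceptual frequency scale would be a more careful choice.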
Diphthongs
Vocal cords vibrate; voiced sounds
A diphthong is a gliding monosyllabic speech item that starts at or near the
articulatory position for the initial (first) vowel and moves to or toward the
position for the final (second) vowel in the diphthong
They are produced by varying the vocal tract smoothly between the vowel
configurations for the initial and final vowels
/EY/ in bay, /AY/ in buy, /AW/ in how, /OY/ in boy
Distinctive Features of Sounds
Classify the phonemes other than the vowels and diphthongs by distinctive
features
Place of articulation
Bilabial (at the lips) /p (pat)/, /b (bet)/, /m (met)/, /w (wit)/
Labiodental (between the lips and the front of the teeth) /f (fat)/, /v (vat)/
Dental (at the teeth) /th (thing)/, /dh (that)/
Alveolar (the front of the palate) /t (ten)/, /d (debt)/, /s (sat)/, /z (zoo)/, /n (net)/, /l (let)/
Palatal (middle of the palate) /sh (shut)/, /zh (azure)/, /r (rent)/
Velar (at the velum) /k (kit)/, /g (get)/, /nx (sing)/
Pharyngeal (at the end of the pharynx) /h (hat)/
Manner of articulation
Glide (smooth motion of the articulators) /w/, /l/, /r/, /y/
Nasal (lowered velum) /m/, /n/, /nx/
Stop (totally constricted vocal tract blocking air flow) /p/, /t/, /k/, /b/, /d/, /g/
Fricative (turbulent sound source due to a high degree of constriction in the
vocal tract) /f/, /th/, /s/, /sh/, /v/, /dh/, /z/, /zh/, /h/
Voicing (vocal cords vibrating throughout the sound) /b/, /d/, /g/, /v/, /dh/, /z/, /zh/, /m/, /n/, /nx/, /w/, /l/,
/r/, /y/
Mixed source (vocal cords vibrating but turbulence produced at a constriction in the vocal tract) /jh/, /ch
(church)/
Whispered (turbulent air source at the glottis) /h/
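The distinctive features above can be stored as lookup tables and queried. The sketch below finds the phoneme pairs that share place and manner of articulation and differ only in voicing. The tables are transcribed from the lists above, with the nasal in sing written as nx; the helper names invert and voicing_pairs are hypothetical.

```python
# Feature tables transcribed from the slide.
PLACE = {
    "bilabial": ["p", "b", "m", "w"],
    "labiodental": ["f", "v"],
    "dental": ["th", "dh"],
    "alveolar": ["t", "d", "s", "z", "n", "l"],
    "palatal": ["sh", "zh", "r"],
    "velar": ["k", "g", "nx"],
    "pharyngeal": ["h"],
}
MANNER = {
    "glide": ["w", "l", "r", "y"],
    "nasal": ["m", "n", "nx"],
    "stop": ["p", "t", "k", "b", "d", "g"],
    "fricative": ["f", "th", "s", "sh", "v", "dh", "z", "zh", "h"],
}
VOICED = {"b", "d", "g", "v", "dh", "z", "zh",
          "m", "n", "nx", "w", "l", "r", "y"}


def invert(table):
    """Turn {feature: [phonemes]} into {phoneme: feature}."""
    return {ph: feat for feat, phones in table.items() for ph in phones}


PLACE_OF = invert(PLACE)
MANNER_OF = invert(MANNER)


def voicing_pairs():
    """Return (unvoiced, voiced) pairs with the same place and manner."""
    pairs = []
    for unvoiced in sorted(PLACE_OF):
        if unvoiced in VOICED or unvoiced not in MANNER_OF:
            continue
        for voiced in sorted(VOICED):
            same_place = PLACE_OF.get(voiced) == PLACE_OF[unvoiced]
            same_manner = MANNER_OF.get(voiced) == MANNER_OF[unvoiced]
            if same_place and same_manner:
                pairs.append((unvoiced, voiced))
    return pairs


print(voicing_pairs())
```

The result recovers the stop pairs p-b, t-d, k-g and the fricative pairs f-v, th-dh, s-z, sh-zh.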
Semi-vowels
They have a vowel-like nature
Vocal cords vibrate, voiced sounds
There are two categories: Glides (/w/ we, /y/ you) and liquids
(/r/ read, /l/ let)
Acoustic characteristics of semi-vowels are strongly
influenced by the context
They are best described as transitional, vowel-like sounds,
hence similar to vowels and diphthongs
Nasals
The nasal consonants are voiced sounds
Production mechanism:
The oral tract is totally constricted
The velum is lowered so that the air mainly flows through the nasal
tract
Some typical words for the nasal sounds
/m/ me /n/ no /nx/ sing
They are distinguished by the place where the oral tract is
constricted
Produce the sounds and observe the place of the oral tract constriction
Fricatives
Can be both voiced and unvoiced
Unvoiced fricatives
Vocal cords are relaxed and not vibrating
Produced by exciting the vocal tract by a steady air flow that becomes
turbulent in the region of a constriction in the vocal tract
The location of the constriction serves to determine which sound is produced
Some typical words for the unvoiced fricatives
/f/ for /th/ thin /s/ see /sh/ she
Fricatives
Voiced fricatives
The place of constriction for each of the voiced/unvoiced fricative pairs (f-v,
th-dh, s-z, sh-zh) is essentially the same
Two excitation sources are involved in the production of the voiced
fricatives
The vocal cords are vibrating, one excitation at the glottis
Since the vocal tract is constricted at some point forward of the glottis, the air flow becomes
turbulent in the neighborhood of the constriction
Giving a noise-like component in addition to the voiced-like component
Some typical words for the voiced fricatives
/v/ vote /dh/ then /z/ zoo /zh/ azure
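The difference between the excitation of a vowel, an unvoiced fricative and a voiced fricative can be sketched as a pulse train, a noise source, or the sum of both. This is a strong simplification under stated assumptions: in real speech the noise arises at the constriction rather than at the glottis, and the sampling rate, fundamental frequency and noise gain below are illustrative values, not taken from the slides.

```python
import random


def excitation(n_samples, fs_hz=8000, f0_hz=100, voiced=True,
               noisy=True, noise_gain=0.3, seed=0):
    """Return a toy excitation signal as a list of floats.

    voiced=True adds one unit pulse per pitch period (glottal source);
    noisy=True adds Gaussian noise (stand-in for turbulence).
    """
    rng = random.Random(seed)
    period = round(fs_hz / f0_hz)  # samples between glottal pulses
    signal = []
    for n in range(n_samples):
        pulse = 1.0 if voiced and n % period == 0 else 0.0
        noise = noise_gain * rng.gauss(0.0, 1.0) if noisy else 0.0
        signal.append(pulse + noise)
    return signal


vowel_like = excitation(160, voiced=True, noisy=False)   # e.g. /IY/
s_like = excitation(160, voiced=False, noisy=True)       # e.g. /s/
z_like = excitation(160, voiced=True, noisy=True)        # e.g. /z/
print(sum(1 for x in vowel_like if x == 1.0), "pulses in 20 ms")
```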
Fricatives
Produce voiced/unvoiced fricative pairs (such as the voiced
fricative /v/ and its unvoiced counterpart /f/) and observe the vibration
of the vocal cords
While producing the sounds, place a finger on your Adam’s
apple
Stops (Plosives)
Stops are transient, non-continuant sounds
Can be both voiced and unvoiced
Unvoiced plosives (/p/, /t/, /k/)
Mechanism
1) Complete closure of the oral tract and build-up of air pressure behind that closure;
during closure, no sound is radiated from the lips; the silent period during the closure
is called the stop gap
2) Sudden release of air pressure and generation of the burst (‘impulsive’ source),
which excites the oral cavity
3) Generation of aspiration due to turbulence at the open vocal cords
4) Onset of the following vowel about 30-50 msec after the burst; voice onset time (VOT):
the difference between the time of the burst and the onset of the following vowel
Plosives (Stops)
Voiced plosives (/b/, /d/, /g/)
Mechanism: different from the unvoiced stops
The vocal cords can vibrate during the oral tract constriction
There is little or no aspiration after the burst, but the vocal cords continue to
vibrate
There is a much shorter delay between the burst and the voicing of the following
vowel
The constriction can occur at the front (/p/, /b/), center (/t/, /d/) or
back (/k/, /g/) of the oral tract
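The onset time described above is a simple difference of two time stamps, and its size separates the two stop classes. The sketch below assumes the burst and vowel onset times are already known (for example from hand labelling). The 25 ms decision threshold is an illustrative assumption sitting below the 30-50 msec range quoted for unvoiced stops; it is not a value from the slides.

```python
def voice_onset_time_ms(burst_ms, vowel_onset_ms):
    """Onset time: vowel onset minus burst time, in milliseconds."""
    return vowel_onset_ms - burst_ms


def guess_stop_voicing(vot_ms, threshold_ms=25.0):
    """Crude voiced/unvoiced guess from the onset time.

    Assumption: the threshold is hypothetical and would need tuning
    on real data for a given language and speaker population.
    """
    return "unvoiced" if vot_ms >= threshold_ms else "voiced"


# Hypothetical measurements: (burst time, vowel onset time) in ms.
for burst, onset in [(120.0, 160.0), (120.0, 128.0)]:
    vot = voice_onset_time_ms(burst, onset)
    print(f"VOT = {vot:.0f} ms -> {guess_stop_voicing(vot)}")
```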
Whispers
Production mechanism:
Turbulent flow is produced at the glottis rather than at a vocal tract
constriction; Remember that the fricatives are produced by the vocal
tract constriction
The glottis is open and there is no vocal cord vibration; Unvoiced
sound
The characteristics of /h/ are invariably those of the vowel
that follows it; Vocal tract assumes the position for the
following vowel during the production of /h/
The only whisper in English is /h/, as in the word he
Affricates
Produced by varying the vocal tract rapidly from a plosive to a fricative
The unvoiced affricate is /ch/, as in the word chew; it is the concatenation of the
unvoiced stop /t/ followed by the unvoiced fricative /sh/
The voiced counterpart is /jh/, as in the word just; it is the concatenation of the
voiced stop /d/ followed by the voiced fricative /zh/
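The stop-plus-fricative view of affricates can be expressed as a rewrite of the phoneme sequence. This is a small sketch: AFFRICATE_PARTS holds the two decompositions stated above, and expand_affricates is a hypothetical helper name. The phoneme string used for chew is an assumed transcription.

```python
# Decompositions stated on the slide: affricate -> (stop, fricative).
AFFRICATE_PARTS = {
    "ch": ("t", "sh"),
    "jh": ("d", "zh"),
}


def expand_affricates(phonemes):
    """Replace each affricate by its stop and fricative components."""
    expanded = []
    for ph in phonemes:
        expanded.extend(AFFRICATE_PARTS.get(ph, (ph,)))
    return expanded


# Assumed transcription of "chew": /ch uw/
print(expand_affricates(["ch", "uw"]))
```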
Transitional Speech Sounds
Articulation is the movement of the tongue, lips, jaw and other
speech organs (the articulators) in order to make speech sounds
Some sounds are ‘stationary’ in the sense that
The underlying articulators hold an almost fixed configuration
The resulting formants appear nearly constant
Some sounds are ‘non-stationary’
They are defined by the rapid transition across two articulatory states
Diphthongs, semi-vowels and affricates are transitional speech
sounds in English
In conversational speech, our speech anatomy cannot move to a
desired position instantaneously and thus past positions influence
the present
Co-articulation
Co-articulation is the influence of the articulation of one sound on the articulation of another
sound in the same utterance
Thus, the articulatory states are blended
In speech recognition systems, generally
Instead of training a context-independent monophone model for a phoneme
Ahmet -> /e/ model
We train a context-dependent triphone model
Ahmet -> m-e+t (/e/ phoneme in the context of /m/ and /t/)
Training robust context-dependent models requires much more training data:
hundreds, perhaps thousands, of hours of training speech
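The left-centre+right labelling used above can be generated mechanically from a phoneme sequence. The sketch below rests on a few assumptions: Ahmet is segmented letter by letter into a, h, m, e, t; phones at the word edges simply drop the missing context; and the inventory size of 40 phonemes is a round illustrative number. The function names are hypothetical.

```python
def to_triphones(phones):
    """Return context-dependent labels in left-centre+right notation."""
    labels = []
    for i, centre in enumerate(phones):
        label = centre
        if i > 0:
            label = f"{phones[i - 1]}-{label}"   # add left context
        if i < len(phones) - 1:
            label = f"{label}+{phones[i + 1]}"   # add right context
        labels.append(label)
    return labels


def possible_triphones(n_phonemes):
    """Upper bound on distinct triphone models for a phoneme inventory."""
    return n_phonemes ** 3


print(to_triphones(["a", "h", "m", "e", "t"]))
# ['a+h', 'a-h+m', 'h-m+e', 'm-e+t', 'e-t']
print(possible_triphones(40))
# 64000
```

The cubic growth in the number of possible models, compared with one model per phoneme in the monophone case, is one way to see why context-dependent training needs so much more speech.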