Presentation 3 : Classification of Speech Sounds Kocaeli University Speech Processing


Phonetic Realization

The role of phonemes is to provide a link between the orthography of written language and the speech signal that is produced when the word is spoken

It is important to understand how speech will be realized in practice in order to design the best speech processing systems

Larry -> /L(/l/) AE(/ae/) R(/r/) IY(/i/)/ (in ARPAbet notation)

How old are you -> /X H AW Z OW L D Z AA R Z Y UW X/

Yemek yedim -> X y e m e k Z y e d i m X
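The mapping from written words to phoneme strings shown above can be sketched as a dictionary lookup. This is a minimal illustration, not a real lexicon: the `LEXICON` table and the `transcribe` helper are hypothetical names, the entries simply copy the slide's symbols as written, and the slide's X and Z boundary markers are left out.

```python
# Toy pronunciation lexicon. The entries copy the slide's examples as written;
# a real system would use a full pronunciation dictionary instead.
LEXICON = {
    "larry": ["L", "AE", "R", "IY"],
    "how": ["H", "AW"],
    "old": ["OW", "L", "D"],
    "are": ["AA", "R"],
    "you": ["Y", "UW"],
}


def transcribe(sentence, lexicon=LEXICON):
    """Return the flat phoneme sequence for a whitespace-separated sentence."""
    phones = []
    for word in sentence.lower().split():
        if word not in lexicon:
            raise KeyError(f"no pronunciation listed for {word!r}")
        phones.extend(lexicon[word])
    return phones


print(transcribe("How old are you"))
```

Words missing from the table raise an error instead of being guessed, which keeps the sketch honest about what it knows.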

Phonemes in American English

Phonemes (in American English)

Vowels: Front, Center, Back

Diphthongs

Semivowels: Glides, Liquids

Consonants: Nasals, Stops (Voiced, Unvoiced), Fricatives (Voiced, Unvoiced), Whispers, Affricates

Phonemes in American English

Each phoneme can be classified as either a continuant or a non-continuant

Continuant sounds

They are produced by a fixed (non-time varying) vocal

tract configuration

Vowels, fricatives and the nasals

Non-continuant sounds

They are produced by a changing vocal tract

configuration

Diphthongs, semivowels, stops, affricates
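The continuant / non-continuant split above can be written down as a small lookup. This is only a sketch: `PHONEME_CLASS` lists a few example phonemes per class (an illustrative subset chosen here, not the full inventory), and `is_continuant` is a hypothetical helper name.

```python
# Sound classes grouped as on the slide.
CONTINUANT_CLASSES = {"vowel", "fricative", "nasal"}
NON_CONTINUANT_CLASSES = {"diphthong", "semivowel", "stop", "affricate"}

# A few example phonemes per class (illustrative subset only).
PHONEME_CLASS = {
    "IY": "vowel", "AE": "vowel",
    "S": "fricative", "V": "fricative",
    "M": "nasal", "N": "nasal",
    "AY": "diphthong", "OY": "diphthong",
    "W": "semivowel", "L": "semivowel",
    "P": "stop", "G": "stop",
    "CH": "affricate", "JH": "affricate",
}


def is_continuant(phoneme):
    """True if the phoneme is produced with a fixed vocal tract configuration."""
    sound_class = PHONEME_CLASS[phoneme]
    if sound_class in CONTINUANT_CLASSES:
        return True
    if sound_class in NON_CONTINUANT_CLASSES:
        return False
    raise ValueError(f"class {sound_class!r} is not covered by this sketch")


print(is_continuant("S"), is_continuant("P"))
```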

Vowels

Vowels generally have the longest duration in natural speech

However, they carry very little linguistic information about the orthography of the sentence

Vowels

Vowels are produced by exciting a fixed vocal tract with quasi-periodic pulses of air caused by vibration of the vocal cords (at a certain fundamental frequency); they are voiced sounds

We use the term quasi because perfect periodicity is never achieved
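To make "quasi-periodic" concrete, the sketch below generates pulse instants whose spacing wobbles slightly around the nominal pitch period. The function name `glottal_pulse_times`, the uniform random model and the 1% jitter are assumptions made for illustration; they are not taken from the slides.

```python
import random


def glottal_pulse_times(f0_hz, duration_s, jitter=0.01, seed=0):
    """Return pulse instants (in seconds) of a quasi-periodic pulse train.

    Each period is the nominal 1/f0 scaled by a random factor within
    +/- `jitter` (an assumed, illustrative amount of irregularity).
    """
    rng = random.Random(seed)
    period = 1.0 / f0_hz
    times = []
    t = 0.0
    while t < duration_s:
        times.append(t)
        t += period * (1.0 + rng.uniform(-jitter, jitter))
    return times


pulses = glottal_pulse_times(f0_hz=100.0, duration_s=0.1)
periods = [later - earlier for earlier, later in zip(pulses, pulses[1:])]
print(min(periods), max(periods))
```

Setting `jitter=0.0` gives a perfectly periodic train, which is the idealization that real voiced speech never quite reaches.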

The front (/IY IH EH AE/), center (/AA ER AH-AX AO/) and back

(/UW UH OW/) subgroups are defined primarily according to the

position of the tongue

The jaw, the lips and, to a small extent, the velum also influence the resulting sound

Vowels

Table: IPA symbols and some typical words for the vowels

The vocal tract configuration (cross-sectional area) shapes the spectrum and defines the formant (resonant) frequencies
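As a rough illustration of how tract geometry sets the formants, the simplest textbook idealization treats the vocal tract as a uniform lossless tube, closed at the glottis and open at the lips, whose resonances fall at odd multiples of c / (4L). This model is not part of the slides, and the default length (17.5 cm) and speed of sound (350 m/s) are assumed round numbers.

```python
def uniform_tube_formants(length_m=0.175, speed_of_sound=350.0, count=3):
    """Resonances (Hz) of a uniform tube closed at one end, open at the other.

    Uses F_k = (2k - 1) * c / (4 * L) for k = 1..count. Real vowels deviate
    from this because the cross-sectional area is not uniform.
    """
    return [
        (2 * k - 1) * speed_of_sound / (4.0 * length_m)
        for k in range(1, count + 1)
    ]


print(uniform_tube_formants())
```

With these defaults the resonances come out near 500, 1500 and 2500 Hz, roughly a neutral vowel; a shorter tube shifts all of them upward.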

Vowels

The plot shows first formant frequency vs. second formant

frequency

Observe the vowel triangle in the figure
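The corners of the vowel triangle can be used as reference points in the F1-F2 plane. The sketch below labels a measured (F1, F2) pair with the nearest corner vowel. The corner values are approximate adult-male averages as commonly quoted from Peterson and Barney (1952) and should be treated as illustrative; plain Euclidean distance in Hz is an assumption chosen for simplicity, and `nearest_corner_vowel` is a hypothetical helper.

```python
import math

# Approximate (F1, F2) averages in Hz for adult male speakers, as commonly
# quoted from Peterson and Barney (1952). Illustrative values only.
CORNER_VOWELS = {
    "IY": (270.0, 2290.0),
    "AA": (730.0, 1090.0),
    "UW": (300.0, 870.0),
}


def nearest_corner_vowel(f1_hz, f2_hz):
    """Return the corner vowel whose (F1, F2) point is closest to the input."""
    point = (f1_hz, f2_hz)
    return min(CORNER_VOWELS, key=lambda v: math.dist(point, CORNER_VOWELS[v]))


print(nearest_corner_vowel(290.0, 2200.0))
```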

Diphthongs

Vocal cords vibrate; voiced sound

A diphthong is a gliding monosyllabic speech item that starts at or near the articulatory position for the initial (first) vowel and moves to or toward the position for the final (second) vowel

They are produced by varying the vocal tract smoothly between the vowel configurations appropriate to the initial and final vowels

/EY/ in bay, /AY/ in buy, /AW/ in how, /OY/ in boy
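The smooth movement between two vowel configurations can be pictured as a formant track gliding from one target to another. The sketch below uses straight-line interpolation, which is an assumed simplification (real glides are not linear and often stop short of the final target). The example targets, roughly /AA/-like to /IY/-like for /AY/, are illustrative numbers and not measurements from the slides.

```python
def formant_glide(start, end, n_frames):
    """Linearly interpolate an (F1, F2) pair from `start` to `end`.

    Returns a list of n_frames tuples; the first equals `start` and the
    last equals `end`.
    """
    if n_frames < 2:
        raise ValueError("need at least two frames")
    track = []
    for i in range(n_frames):
        w = i / (n_frames - 1)  # 0.0 at the first frame, 1.0 at the last
        track.append(tuple((1.0 - w) * s + w * e for s, e in zip(start, end)))
    return track


# Illustrative /AY/-like glide: F1 falls while F2 rises.
for f1, f2 in formant_glide((730.0, 1090.0), (270.0, 2290.0), n_frames=5):
    print(round(f1), round(f2))
```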

Distinctive Features of Sounds

Classify the phonemes other than the vowels and diphthongs by distinctive

features

Place of articulation

Bilabial (at the lips) /p (pat)/, /b (bet)/, /m (met)/, /w (wit)/

Labiodental (between the lips and the front of the teeth) /f (fat)/, /v (vat)/

Dental (at the teeth) /th (thing)/, /dh (that)/

Alveolar (the front of the palate) /t (ten)/, /d (debt)/, /s (sat)/, /z (zoo)/, /n (net)/, /l (let)/

Palatal (middle of the palate) /sh (shut)/, /zh (azure)/, /r (rent)/

Velar (at the velum) /k (kit)/, /g (get)/, /nx (sing)/

Pharyngeal (at the end of the pharynx) /h (hat)/

Manner of articulation

Glide (smooth motion of the articulators) /w/, /l/, /r/, /y/

Nasal (lowered velum) /m/, /n/, /nx/

Stop (totally constricted vocal tract blocking air flow) /p/, /t/, /k/, /b/, /d/, /g/

Fricative (turbulent sound source due to a high degree of constriction in the vocal tract; the vocal cords vibrate only for the voiced fricatives) /f/, /th/, /s/, /sh/, /v/, /dh/, /z/, /zh/, /h/

Voicing (vocal cords are vibrating throughout the sound) /b/, /d/, /g/, /v/, /dh/, /z/, /zh/, /m/, /n/, /nx/, /w/, /l/, /r/, /y/

Mixed source (vocal cords vibrating but turbulence produced at constriction in vocal tract) /jh/, /ch (church)/

Whispered (turbulent air source at the glottis) /h/
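The place, manner and voicing lists above can be combined into a per-phoneme feature lookup. The tables below copy the slide's groupings; the `features` helper and its dictionary layout are hypothetical. Only the glide, nasal, stop and fricative manners are encoded, and /y/ has no place because the slide lists none for it.

```python
PLACE = {
    "bilabial": ["p", "b", "m", "w"],
    "labiodental": ["f", "v"],
    "dental": ["th", "dh"],
    "alveolar": ["t", "d", "s", "z", "n", "l"],
    "palatal": ["sh", "zh", "r"],
    "velar": ["k", "g", "nx"],
    "pharyngeal": ["h"],
}
MANNER = {
    "glide": ["w", "l", "r", "y"],
    "nasal": ["m", "n", "nx"],
    "stop": ["p", "t", "k", "b", "d", "g"],
    "fricative": ["f", "th", "s", "sh", "v", "dh", "z", "zh", "h"],
}
VOICED = {"b", "d", "g", "v", "dh", "z", "zh", "m", "n", "nx", "w", "l", "r", "y"}


def _group_of(table, phoneme):
    """Return the label of the group containing `phoneme`, or None."""
    for label, members in table.items():
        if phoneme in members:
            return label
    return None


def features(phoneme):
    """Look up place, manner and voicing for one consonant symbol."""
    return {
        "place": _group_of(PLACE, phoneme),
        "manner": _group_of(MANNER, phoneme),
        "voiced": phoneme in VOICED,
    }


print(features("z"))
```

Pairs such as /s/ and /z/ then differ in exactly one feature (voicing), which is the point of describing consonants this way.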

Distinctive Features of Sounds

Semi-vowels

They have a vowel-like nature

Vocal cords vibrate, voiced sounds

There are two categories: Glides (/w/ we, /y/ you) and liquids

(/r/ read, /l/ let)

Acoustic characteristics of semi-vowels are strongly

influenced by the context

They are best described as transitional, vowel-like sounds,

hence similar to vowels and diphthongs

Nasals

The nasal consonants are voiced sounds

Production mechanism;

The oral tract is totally constricted

The velum is lowered so that the air mainly flows through the nasal

tract

Some typical words for the nasal sounds

/m/ me /n/ no /nx/ sing

They are distinguished by the place where the oral tract is

constricted

Produce the sounds and observe the place of the oral tract constriction

Fricatives

Can be either voiced or unvoiced

Unvoiced fricatives

Vocal cords are relaxed and not vibrating

Produced by exciting the vocal tract by a steady air flow that becomes

turbulent in the region of a constriction in the vocal tract

The location of the constriction serves to determine which sound is produced

Some typical words for the unvoiced fricatives

/f/ for /th/ thin /s/ see /sh/ she

Fricatives

Voiced fricatives

The place of constriction for each of the voiced/unvoiced fricative pairs (f-v,

th-dh, s-z, sh-zh) is essentially the same

Two excitation sources are involved in the production of the voiced

fricatives

The vocal cords are vibrating, one excitation at the glottis

Since the vocal tract is constricted at some point forward of the glottis, the air flow becomes

turbulent in the neighborhood of the constriction

Giving a noise-like component in addition to the voiced-like component
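The two-source idea for voiced fricatives can be sketched by adding a periodic pulse train (the glottal component) to random noise (the turbulent component). This is a deliberately crude model: in the real vocal tract the two sources sit at different places and are shaped differently, and the sampling rate, pitch, noise gain and Gaussian noise used here are all assumed values.

```python
import random


def voiced_fricative_excitation(n_samples, fs_hz=8000, f0_hz=100,
                                noise_gain=0.3, seed=0):
    """Return a list of samples: unit pulses every pitch period plus noise.

    noise_gain=0.0 leaves only the periodic (voiced) component; dropping
    the pulses instead would leave the unvoiced, noise-only case.
    """
    rng = random.Random(seed)
    period = round(fs_hz / f0_hz)  # pitch period in samples
    signal = []
    for n in range(n_samples):
        pulse = 1.0 if n % period == 0 else 0.0
        noise = noise_gain * rng.gauss(0.0, 1.0)
        signal.append(pulse + noise)
    return signal


excitation = voiced_fricative_excitation(n_samples=400)
print(len(excitation))
```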

Some typical words for the voiced fricatives

/v/ vote /dh/ then /z/ zoo /zh/ azure

Fricatives

Produce voiced/unvoiced fricative pairs (such as the voiced fricative /v/ and its unvoiced counterpart /f/) and observe the vibration of the vocal cords

While producing the sounds, place a finger on your Adam’s apple

Stops (Plosives)

Stops are transient, non-continuant sounds

Can be either voiced or unvoiced

Unvoiced plosives (/p/, /t/, /k/)

Mechanism

1) Complete closure of the oral tract and build-up of air pressure behind the closure; during closure, no sound is radiated from the lips; the silent period during the closure is called the stop gap

2) Sudden release of the air pressure and generation of the burst (‘impulsive’ source), which excites the oral cavity

3) Generation of aspiration due to turbulence at the open vocal cords

4) Onset of the following vowel about 30-50 ms after the burst; the voice onset time (VOT) is the difference between the time of the burst and the onset of the following vowel

Plosives (Stops)

Voiced plosives (/b/, /d/, /g/)

Mechanism; Different from the unvoiced stops

The vocal cords can vibrate during the oral tract constriction

There is little or no aspiration after the burst, but the vocal cords continue to

vibrate

There is a much shorter delay between the burst and the voicing of the following vowel

The constriction can occur at the front (/p/, /b/), center (/t/, /d/) or

back (/k/, /g/) of the oral tract
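The delay between the burst and the onset of the following vowel, described above for both stop types, can be turned into a simple voicing cue. In the sketch below the 25 ms decision threshold is an assumption chosen to sit between the "much shorter" voiced delay and the 30-50 ms unvoiced range; both function names are hypothetical, and the burst and voicing times are assumed to come from some earlier analysis step.

```python
def voice_onset_time_ms(burst_time_s, voicing_onset_s):
    """Delay from the burst to the onset of voicing, in milliseconds."""
    return (voicing_onset_s - burst_time_s) * 1000.0


def guess_stop_voicing(vot_ms, threshold_ms=25.0):
    """Label a stop from its delay alone (assumed threshold, illustrative)."""
    return "unvoiced" if vot_ms >= threshold_ms else "voiced"


vot = voice_onset_time_ms(burst_time_s=0.100, voicing_onset_s=0.140)
print(round(vot), guess_stop_voicing(vot))
```

A single fixed threshold is only a first approximation; the delay also varies with place of articulation, speaker and speaking rate.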

Whispers

Production mechanism;

Turbulent flow is produced at the glottis rather than at a vocal tract

constriction; Remember that the fricatives are produced by the vocal

tract constriction

The glottis is open and there is no vocal cord vibration; Unvoiced

sound

The characteristics of /h/ are invariably those of the vowel

that follows it; Vocal tract assumes the position for the

following vowel during the production of /h/

The only whisper in English is /h/, as in the word he

Affricates

Affricates are produced by a rapid transition of the vocal tract from a plosive to a fricative configuration

The unvoiced affricate is /ch/, as in the word chew; it is the concatenation of the unvoiced stop /t/ and the unvoiced fricative /sh/

The voiced counterpart is /jh/, as in the word just; it is the concatenation of the voiced stop /d/ and the voiced fricative /zh/
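The stop-plus-fricative view of affricates can be expressed as a small expansion table. The helper `expand_affricates` is hypothetical and only illustrates the decomposition; treating an affricate as a plain concatenation is an approximation, and phoneme sets such as ARPAbet keep /ch/ and /jh/ as single units.

```python
# Each affricate written as (stop, fricative), following the slide.
AFFRICATE_PARTS = {
    "ch": ("t", "sh"),
    "jh": ("d", "zh"),
}


def expand_affricates(phones):
    """Replace each affricate by its stop + fricative pair; keep the rest."""
    expanded = []
    for phone in phones:
        expanded.extend(AFFRICATE_PARTS.get(phone, (phone,)))
    return expanded


print(expand_affricates(["ch", "uw"]))
```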

Transitional Speech Sounds

Articulation is the movement of the tongue, lips, jaw and other

speech organs (the articulators) in order to make speech sounds

Some sounds are ‘stationary’ in the sense that

The underlying articulators hold an almost fixed configuration

The resulting formants appear nearly constant

Some sounds are ‘non-stationary’

They are defined by the rapid transition across two articulatory states

Diphthongs, semi-vowels and affricates are transitional speech

sounds in English

In conversational speech, our speech anatomy cannot move to a

desired position instantaneously and thus past positions influence

the present

Co-articulation

Co-articulation is the influence of the articulation of one sound on the articulation of another sound in the same utterance

Thus, the articulatory states are blended

In speech recognition systems, we generally do not train a context-independent monophone model for each phoneme

Ahmet -> /e/ model

Instead, we train a context-dependent triphone model

Ahmet -> m-e+t (/e/ phoneme in the context of /m/ and /t/)

Training robust context-dependent models requires much more training data: hundreds, maybe thousands, of hours of training speech
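The left-center+right labels used above can be generated mechanically from a phoneme sequence. The sketch below is a minimal version: `to_triphones` is a hypothetical helper, the letter-by-letter transcription of "Ahmet" is an assumption that extends the slide's m-e+t example, and phones at the utterance edges simply drop the missing context.

```python
def to_triphones(phones):
    """Turn a phoneme sequence into left-center+right context labels."""
    labels = []
    last = len(phones) - 1
    for i, center in enumerate(phones):
        label = center
        if i > 0:
            label = f"{phones[i - 1]}-{label}"  # add left context
        if i < last:
            label = f"{label}+{phones[i + 1]}"  # add right context
        labels.append(label)
    return labels


print(to_triphones(["a", "h", "m", "e", "t"]))
```

With an inventory of N phonemes there can be up to N^3 distinct triphones, compared with N monophones, which is one way to see why so much more training data is needed.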

Distinctive features of the phonemes

Questions?

Thank you!