Overview ► Recall ► What are sound features? ► Feature detection and extraction ► Features...


Page 1: Overview ► Recall ► What are sound features? ► Feature detection and extraction ► Features in Sphinx III.

Overview

► Recall
► What are sound features?
► Feature detection and extraction
► Features in Sphinx III

Page 2:

Recall:

► The speech signal is a 'slowly' time-varying signal.
► There are a number of linguistically distinct speech sounds (phonemes) in a language.
► It is possible to represent the speech signal as a 3D spectrogram: the speech intensity in the different frequency bands over time.
► Most SR systems rely heavily on vowel recognition to achieve high performance (vowels are long in duration and spectrally well defined, and therefore easily recognized).
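
The spectrogram mentioned above can be sketched in a few lines of NumPy. This is only an illustration under assumed settings (25 ms Hamming-windowed frames, 10 ms hop, a synthetic 440 Hz tone instead of real speech); the function name `spectrogram` and all parameter values are hypothetical and not taken from the slides.

```python
import numpy as np

def spectrogram(signal, sample_rate, frame_ms=25.0, hop_ms=10.0):
    """Return (times, freqs, power_db): intensity per frequency band over time."""
    frame_len = int(sample_rate * frame_ms / 1000)
    hop_len = int(sample_rate * hop_ms / 1000)
    window = np.hamming(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop_len
    # Cut the signal into overlapping, windowed frames.
    frames = np.stack([
        signal[i * hop_len : i * hop_len + frame_len] * window
        for i in range(n_frames)
    ])
    # Power spectrum of every frame, on a log (dB) scale.
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    power_db = 10.0 * np.log10(power + 1e-10)
    times = np.arange(n_frames) * hop_len / sample_rate
    freqs = np.fft.rfftfreq(frame_len, d=1.0 / sample_rate)
    return times, freqs, power_db

# Assumed test input: one second of a 440 Hz tone sampled at 16 kHz.
sr = 16000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 440 * t)
times, freqs, power_db = spectrogram(tone, sr)
peak_hz = freqs[power_db.mean(axis=0).argmax()]  # strongest frequency band
```

The three returned arrays are the three axes of the 3D picture: time, frequency band, and intensity.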

Page 3:

Speech sounds and features

Examples:
► Vowels (a, u, …)
► Diphthongs (e.g. ay as in guy, …)
► Semivowels (w, l, r, y)
► Nasal consonants (m, n)
► Unvoiced fricatives (f, s)
► Voiced fricatives (v, th, z)
► Voiced stops (b, d, g) and unvoiced stops (p, t, k)

► They all have their own characteristics (features)

Page 4:

ASR Stages

1) Speech analysis system: provides an appropriate spectral representation of the characteristics of the time-varying speech signal.

2) Feature detection stage: converts the spectral measurements to a set of features that describe the broad acoustic properties of the different phonetic units (e.g. nasality, frication, formant locations, voiced-unvoiced classification, ratios of high- and low-frequency energy, etc.).

3) Segmentation and labeling phase: finds stable regions and then labels each segmented region according to how well the features within that region match those of individual phonetic units.

4) Final output of the recognizer: the word or word sequence that best matches the labeled segments.

Page 5:

Feature detection (and extraction)

► A speech segment contains certain characteristics: features.

► Different segments of speech contain different features, specific to the kind of segment.

► The goal is to classify a speech segment into one of several broad speech classes (e.g. via a binary tree: compact/diffuse, acute/grave, long/short, high/low frequency, etc.).

► Ideally, the feature vectors for a given word should be the same regardless of the way in which the word has been uttered.
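
One node of such a binary tree, the high/low frequency decision, might be sketched as follows. The split frequency (2000 Hz), the decision threshold (1.0), and the function names are assumptions chosen for illustration, not values from the lecture; a real system would tune them on data.

```python
import numpy as np

def high_low_energy_ratio(frame, sample_rate, split_hz=2000.0):
    """Ratio of spectral energy above split_hz to energy below it (assumed split)."""
    spectrum = np.abs(np.fft.rfft(frame * np.hamming(len(frame)))) ** 2
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)
    low = spectrum[freqs < split_hz].sum()
    high = spectrum[freqs >= split_hz].sum()
    return high / (low + 1e-12)  # small constant avoids division by zero

def classify_high_low(frame, sample_rate, threshold=1.0):
    """One binary decision of the tree: 'high' or 'low' frequency segment."""
    ratio = high_low_energy_ratio(frame, sample_rate)
    return "high" if ratio > threshold else "low"

# Assumed test frames: 25 ms of a 300 Hz tone and of a 5000 Hz tone at 16 kHz.
sr = 16000
t = np.arange(400) / sr
low_frame = np.sin(2 * np.pi * 300 * t)
high_frame = np.sin(2 * np.pi * 5000 * t)
labels = (classify_high_low(low_frame, sr), classify_high_low(high_frame, sr))
```

Further nodes (compact/diffuse, acute/grave, long/short) would each need their own measurement and threshold, and chaining such decisions gives the broad class of the segment.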

Page 6:

Last week: Mel-Frequency Cepstrum Coefficients

► The Fourier Transform extracts the frequency components of a signal in the time domain.

► The frequency domain is filtered/sliced into mel-spaced bands, from which 12 coefficients (MFCCs) are calculated.

► MFCCs use the log-spectrum of the speech signal. The logarithmic nature of the technique is significant since the human auditory system perceives sound on a logarithmic scale above certain frequencies.
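
A minimal sketch of this computation for a single frame is given below: Fourier transform, mel-spaced triangular filters, logarithm, then a cosine transform that yields 12 coefficients. The mel formula used is one commonly quoted approximation, and the filter count (26), FFT size (512), and all function names are assumptions for illustration; they are not necessarily what the lecture or Sphinx uses. Note that the number of filter bands need not equal the number of coefficients kept.

```python
import numpy as np

def hz_to_mel(f):
    # Commonly used approximation of the mel scale (assumed here).
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters whose centers are evenly spaced on the mel scale."""
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        left, center, right = bins[i], bins[i + 1], bins[i + 2]
        for k in range(left, center):        # rising edge
            fbank[i, k] = (k - left) / (center - left)
        for k in range(center, right):       # falling edge
            fbank[i, k] = (right - k) / (right - center)
    return fbank

def mfcc(frame, sample_rate, n_filters=26, n_ceps=12, n_fft=512):
    """Return coefficients C1..C12 for one frame (energy is handled separately)."""
    windowed = frame * np.hamming(len(frame))
    power = np.abs(np.fft.rfft(windowed, n=n_fft)) ** 2
    log_energies = np.log(mel_filterbank(n_filters, n_fft, sample_rate) @ power + 1e-10)
    # DCT-II of the log filterbank energies, skipping the 0th coefficient.
    n = np.arange(n_filters)
    k = np.arange(1, n_ceps + 1)[:, None]
    basis = np.cos(np.pi * k * (n + 0.5) / n_filters)
    return basis @ log_energies

# Assumed test frame: 25 ms of a 440 Hz tone at 16 kHz.
sr = 16000
frame = np.sin(2 * np.pi * 440 * np.arange(400) / sr)
coeffs = mfcc(frame, sr)
```

The logarithm appears twice: in the mel warping of the frequency axis and in the log of the band energies, which is where the perceptual motivation in the last bullet enters.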

Page 7:

Acoustic Modeling: Feature Extraction

[Figure: block diagram of the front end. Blocks: Input Speech, Fourier Transform, Cepstral Analysis, Perceptual Weighting, Time Derivative (twice). Outputs: Energy + Mel-Spaced Cepstrum; Delta Energy + Delta Cepstrum; Delta-Delta Energy + Delta-Delta Cepstrum.]

• MFCCs are beautiful, because they incorporate knowledge of the nature of speech sounds in the measurement of the features.

• They utilize rudimentary models of human perception.

• Fourier Transform: time domain → frequency domain

• The frequency domain is sliced into mel-spaced bands, from which 12 MFCCs are calculated.

• Include absolute energy and 12 spectral measurements.

• Time derivatives model spectral change.

Page 8:

What 'to do' with the MFCCs:

► A speech recognizer can be built using the energy values (time domain) and 12 MFCCs (frequency domain), plus the first- and second-order derivatives of those coefficients.

  13  absolute energy (1) and MFCCs (12)
  13  delta: first-order derivatives of the 13 absolute coefficients
  13  delta-delta: second-order derivatives of the 13 absolute coefficients
  --------------------------------
  39  total: basic MFCC front end

► The derivatives are useful because they provide information about the spectral change.

► This total of 39 coefficients provides information about the different features in that segment.

► The feature measurements of the segments are stored in so-called 'feature vectors', which can be used in the next stage of the speech recognition (e.g. a Hidden Markov Model).
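
The 13 + 13 + 13 = 39 bookkeeping can be sketched as follows. The derivative here is a simple central difference over neighboring frames, which is an assumption made for brevity; actual front ends, Sphinx included, may use a different window. The names `delta` and `stack_39` and the random input are hypothetical.

```python
import numpy as np

def delta(features):
    """Central difference along time: (x[t+1] - x[t-1]) / 2, edge frames repeated."""
    padded = np.pad(features, ((1, 1), (0, 0)), mode="edge")
    return (padded[2:] - padded[:-2]) / 2.0

def stack_39(static13):
    """Append delta and delta-delta columns to (n_frames, 13) static features."""
    d = delta(static13)   # first-order derivatives
    dd = delta(d)         # second-order derivatives
    return np.hstack([static13, d, dd])

# Assumed input: 100 frames of 13 static values (energy + 12 MFCCs), random here.
rng = np.random.default_rng(0)
static13 = rng.normal(size=(100, 13))
features = stack_39(static13)  # one 39-dimensional feature vector per frame
```

Each row of `features` is one feature vector in the sense used above, ready to be passed to the next recognition stage.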

Page 9:

In Sphinx III: computation of feature vectors

► feat_s2mfc2feat
► feat_s2mfc2feat_block

1. MFC file is read
2. Initialization: defining the kind of input->feature conversion desired (there are some differences between Sphinx II and Sphinx III)
3. Feature vectors are computed for the entire segment specified (feat_s2mfc2feat and feat_s2mfc2feat_block)

In Sphinx, the streams of features are stored in the feature vectors as follows:

► CEP: C1-C12
► DCEP: D1-D12
► Energy values: C0, D0, DD0
► D2CEP: DD1-DD12
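
The four streams listed above can be illustrated by regrouping a 39-dimensional vector. This sketch assumes the input is ordered as C0-C12, D0-D12, DD0-DD12; that ordering, the function name `split_streams`, and the dictionary key `ENERGY` are assumptions for illustration only and do not describe how Sphinx itself lays out memory.

```python
import numpy as np

def split_streams(vec39):
    """Regroup [C0..C12, D0..D12, DD0..DD12] into the four streams."""
    vec39 = np.asarray(vec39)
    if vec39.shape[-1] != 39:
        raise ValueError("expected 39 coefficients per frame")
    static, d, dd = vec39[..., 0:13], vec39[..., 13:26], vec39[..., 26:39]
    return {
        "CEP": static[..., 1:],    # C1-C12
        "DCEP": d[..., 1:],        # D1-D12
        "ENERGY": np.stack([static[..., 0], d[..., 0], dd[..., 0]], axis=-1),  # C0, D0, DD0
        "D2CEP": dd[..., 1:],      # DD1-DD12
    }

# Assumed input: the values 0..38, so each value equals its own position.
streams = split_streams(np.arange(39.0))
```

With this input, the `ENERGY` stream picks out positions 0, 13, and 26, which makes the regrouping easy to check by eye.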

Page 10:

► So, at this point in the speech recognition process, you have stored feature vectors for the entire speech segment you are looking at, providing the necessary information about what kinds of features are in that segment.

► Now, the feature stream can be analyzed using a Hidden Markov Model (HMM).

[Figure: Input speech → Feature Extraction Modules (frication, burst, voicing, round, nasal, glide) → Feature Vector (a1, a2, …, a5, a6, …) → words ("one", "two", "oh", …). Other labels in the figure: "Concat.", "Train". Caption: The feature stream is analyzed using a Hidden Markov Model (HMM).]
