Distinctive Feature Detection For Automatic Speech Recognition

Jun Hou
Prof. Lawrence Rabiner
Dr. Sorin Dusan
CAIP, ECE Dept., Rutgers University

Sep. 13, 2004

Outline
  The history of Automatic Speech Recognition
  Current Feature Detection Technologies
  ASAT – Automatic Speech Attribute Transcription
  Distinctive Feature Detection, as a part of ASAT
  Proposed Work schedule

The Evolution of Speech Recognition

Data-driven (1980’s, 1990’s and 2000’s) vs. knowledge-driven (1960’s, 1970’s)

Figure 1 S-curve limits ASR technology advances (C.-H. Lee)

The gap between Human Speech Recognition (HSR) and Automatic Speech Recognition (ASR) is still very large

Is HMM the end of the line? Or is there somewhere else to go?

Problems with Signals To Be Recognized
  No two utterances of the same linguistic content are ever the same (often they are not even close in their waveforms or spectral characteristics)
    Speaker variation
    Speaking style
    Background environment
    etc.

Statistical Methods

Figure 2 State-of-the-art HMM-based systems (C.-H. Lee)

Typical approaches: HMM and ANN

Statistical Methods
  Top-down approach. Higher level knowledge guides the processing primarily at the lower levels.
  Incremental discrimination to get refined results (e.g., better stop consonant discrimination)
  Utterance verification – Confidence measures to approximately estimate the reliability of the result, often on a word-by-word basis
  Errors inevitable, mainly when the measured features fall into the overlapped region of the different pdfs
  Data driven => Sensitive to training data, both the amount and type
  Robustness problem – Sensitive to speaking environment and transmission characteristics of the medium
  No explicit use of acoustic or phonetic knowledge
  No clear calculation of the required size of the training data set
  High computational cost when the size of statistical patterns is large

HMM Issues
  Sequential model
  Assumes frame independence – blindly treats frames with equal importance; more or less okay when using cepstral features
  No higher level (linguistic) knowledge used in acoustic modeling
  Etc.

Figure 3 HMM diagram (frame times t_(i-1), t_i, t_(i+1), t_(i+2))
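
The frame-independence assumption can be written out explicitly. The notation below is the usual textbook one, not taken from these slides: observations o_t, hidden states q_t, initial probabilities \pi, transition probabilities a, emission densities b, and model parameters \lambda. Given the state sequence, each frame contributes one emission term that depends only on the current state, which is why every frame carries equal weight.

```latex
P(O, Q \mid \lambda) = \pi_{q_1}\, b_{q_1}(o_1) \prod_{t=2}^{T} a_{q_{t-1} q_t}\, b_{q_t}(o_t),
\qquad
P(O \mid Q, \lambda) = \prod_{t=1}^{T} b_{q_t}(o_t)
```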

ANN Issues
  No meaningful representation of the internal nodes
  Lots of uncertainty as to what processing is happening
  Computationally expensive
  Hard to train; virtually impossible to guarantee convergence at true minimum solution
  Etc.

Figure 4 ANN diagram (input layer, hidden layer(s), output layer)

Knowledge Based Methods
  Bottom-up approach. Uses acoustic-phonetic knowledge at all levels of processing.
  Temporal features are critical in discriminating some speech sounds, e.g., voice onset time (VOT) in stop detection
  Spectral features are critical in discriminating other speech sounds, e.g., fricatives from spectral energy concentrations
  Learn information in temporal and spectral domains using both static and dynamic features

Problems with Knowledge-Based Methods
  The knowledge of the acoustic properties of phonetic units is not complete. Hard to cover all the rules.
  The knowledge of phonetic properties of acoustic units is not complete.
  Pronunciation models explain the formation of waveforms from vocal tract shapes, but no clear reverse knowledge exists.
  The choice of features is not optimal in a well defined and meaningful sense.
  The design of sound classifiers is not optimal. No well-defined automatic tuning methods exist.

Feature Extraction – Ali et al.

Feature Extraction (Jakobson)
1. Total energy
2. Spectral Center of Gravity (SCG)
3. Duration
4. Low, medium and high frequency energy
5. Formant transitions
6. Silence detection
7. Voicing detection
8. Rate of change of energy in various frequency bands
9. Rate of change of SCG
10. Most prominent peak frequency
11. Rate of change of the most prominent peak frequency
12. Zero-crossing rate
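
Three of the listed measurements (total energy, spectral center of gravity, zero-crossing rate) can be sketched for a single frame as below. This is a minimal illustration and not the auditory-based front end of Ali et al.; it uses a plain DFT power spectrum, and the function names, the 8 kHz sample rate and the test tone are assumptions made for the example.

```python
import cmath
import math


def total_energy(frame):
    """Sum of squared sample values in the frame."""
    return sum(x * x for x in frame)


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)


def power_spectrum(frame):
    """Naive DFT power spectrum for bins 0..N/2 (O(N^2), fine for a sketch)."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        s = sum(x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(frame))
        spectrum.append(abs(s) ** 2)
    return spectrum


def spectral_center_of_gravity(frame, sample_rate):
    """Power-weighted mean frequency of the frame, in Hz."""
    spectrum = power_spectrum(frame)
    n = len(frame)
    total = sum(spectrum)
    if total == 0:
        return 0.0
    return sum((k * sample_rate / n) * p for k, p in enumerate(spectrum)) / total


# Hypothetical test frame: 64 samples of a 1 kHz tone at an assumed 8 kHz rate.
sample_rate = 8000
frame = [math.sin(2 * math.pi * 1000 * (i + 0.5) / sample_rate) for i in range(64)]

energy = total_energy(frame)
zcr = zero_crossing_rate(frame)
scg = spectral_center_of_gravity(frame, sample_rate)
```

For a pure tone the center of gravity sits at the tone frequency; for fricatives it would move toward wherever the spectral energy is concentrated. The rate-of-change items in the list would follow by differencing these values across successive frames.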

Auditory-Based Front End Processing

Feature Extraction
  Utterance Segmentation (silence, obstruents, sonorants)
  Fine Utterance Classification into four categories
    Sonorants – fine identification
    Stops – voiced and unvoiced
    Fricatives – voiced and unvoiced
    Silence
  Excellent performance for stops and fricatives

Feature Extraction

Figure 5 Block diagram of the System
Figure 6 Block diagram of the front-end

Feature Extraction
  Fricative classification
    Voicing detection
      DUP – The Duration of the Unvoiced Portion
    Place of articulation detection
      MDP – The Most Dominant Peak from the synchrony detector
      MNSS – The Maximum Normalized Spectral Slope
      SCG – The Spectral Center of Gravity
      MDSS – The Most Dominant Spectral Slope
      DRHF – The Dominant Relative to the Highest Filters

Feature Extraction
  Stop detection
    Voicing detection
      Prevoicing
      VOT
      Closure duration
    Place of articulation detection
      BF – Burst Frequency
      The second formant of the following vowel
      MNSS
      DRHF, LINP (most prominent peak of the synchrony response after being laterally inhibited by the higher 10 filters)
      Formant transitions before and after the stop
      The voicing decision

Landmark Detection
  Landmark Detection – Junija, et al., PhD Thesis Proposal
    Manner landmarks are used, whereas place and voicing are extracted using the locations provided by the manner landmarks
  Three steps:
    Location of manner landmarks
    Analysis of landmarks for place and voicing phonetic features
    Matching phonetic features to features of words or sentence representations
  Two manner landmarks
    Defined by abrupt change, e.g., burst landmark for stop consonants, vowel onset point
    Defined by the most prominent manifestation of a manner phonetic feature, e.g., a point of maximum low energy in a vowel

Landmark Detection
  Recognition of 5 broad classes
    Vowel
    Stop
    Fricative
    Sonorant consonant
    Silence

Table 1 Broad manner classification of English phonemes

Use Support Vector Machines (SVM) to segment TIMIT data into binary classes

Results of 2 different feature organizations are reported:

Parallel – discriminate each feature against all other features

Hierarchical – distinguish the features using a probabilistic hierarchy
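
The two organizations can be contrasted with a small decision-logic sketch. It assumes the classifier scores are already available (no SVM is trained here), and the binary feature tree used for the hierarchical case (speech, sonorant, syllabic, continuant) and the 0.5 thresholds are assumptions chosen for illustration; the slides do not list the actual tree.

```python
CLASSES = ["vowel", "stop", "fricative", "sonorant consonant", "silence"]


def classify_parallel(scores):
    """Parallel organization: one detector per class (that class vs. all others).

    `scores` maps each class name to its detector output; the highest score wins.
    """
    return max(CLASSES, key=lambda c: scores[c])


def classify_hierarchical(p):
    """Hierarchical organization: a chain of binary decisions.

    `p` maps a (hypothetical) binary feature name to the estimated
    probability that the feature is present in the current frame.
    """
    if p["speech"] < 0.5:
        return "silence"
    if p["sonorant"] >= 0.5:
        return "vowel" if p["syllabic"] >= 0.5 else "sonorant consonant"
    return "fricative" if p["continuant"] >= 0.5 else "stop"


# Made-up scores for one frame.
parallel_scores = {"vowel": 0.1, "stop": 0.7, "fricative": 0.4,
                   "sonorant consonant": 0.2, "silence": 0.05}
hierarchy_probs = {"speech": 0.9, "sonorant": 0.2, "syllabic": 0.1, "continuant": 0.8}

parallel_label = classify_parallel(parallel_scores)
hierarchical_label = classify_hierarchical(hierarchy_probs)
```

In the hierarchical case a wrong decision near the root cannot be undone further down the tree, whereas the parallel detectors are evaluated independently of one another.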

Landmark Detection

Table 2 Landmarks extracted for each of the manner classes and knowledge based acoustic measurements

Landmark Detection

Table 3 Acoustic Parameters used in broad class segmentation

Landmark Detection
  Compare the organizations of SVMs

Figure 7 Parallel SVM organization
Figure 8 Hierarchical SVM organization

Landmark Detection

Compare classification results

Table 4 Results of parallel SVM organization
Table 5 Results of hierarchical SVM organization

Landmark Detection
Discussion

1. Combine landmarks with acoustic parameters

2. The gap between correctness and accuracy is due mainly to insertions of sonorant consonants and stops

3. Performance gap between hierarchical SVM and parallel SVM architectures: cause not yet identified; possibly, wrong classification at the upper level of the hierarchical architecture propagates errors to the lower level

4. Isolated or connected word recognition
  Use Finite State Automata (FSA) to constrain the segmentation paths
  Doesn’t allow the use of a probabilistic language model
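
Why insertions open a gap between correctness and accuracy can be shown with the commonly used scoring definitions, assumed here because the slides do not spell them out: correctness ignores insertions, accuracy penalizes them. The counts in the example are invented for illustration and are not results from Tables 4 or 5.

```python
def percent_correct(n_ref, deletions, substitutions):
    """Percent of reference labels recognized correctly; insertions are ignored."""
    return 100.0 * (n_ref - deletions - substitutions) / n_ref


def percent_accuracy(n_ref, deletions, substitutions, insertions):
    """Same as percent_correct, but every inserted label also counts as an error."""
    return 100.0 * (n_ref - deletions - substitutions - insertions) / n_ref


# Invented counts: 1000 reference segments, 50 deleted, 100 substituted, 80 inserted.
correct = percent_correct(1000, 50, 100)
accuracy = percent_accuracy(1000, 50, 100, 80)
gap = correct - accuracy  # equals 100 * insertions / n_ref
```

With these numbers correctness is 85% and accuracy is 77%, so the 8-point gap is exactly the insertion rate.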

Landmark Detection – ANN
  Benoit Launay, et al.
    Train Artificial Neural Network to map short-term spectral features to the posterior probability of some distinctive features
    Feed features into HMM

ASAT – Automatic Speech Attribute Transcription
  Knowledge-based, data driven approach

Figure 9 Bottom-up ASAT based on speech attribute detection, event merging and evidence verification (C.-H. Lee)

NEW!

Distinctive Feature Detection

Figure 10 Distinctive Feature Detection (block diagram: the Speech Signal feeds Attribute Detectors 1 to M; their outputs are combined (Attributes Combination: Linear, ANN, K-L, etc.) in Feature Detectors 1 to N, which output Features 1 to N. The diagram is annotated with six questions: 1. What attributes? 2. How to measure them? 3. What features? 4. How to combine the attributes to form features? 5. What outputs? 6. How to compute them?)

Attributes and Features in ASAT – Issues to be Resolved
  Q1: What attributes?
  Q2: How to measure them?
  Q3: What features?
  Q4: How to combine the attributes to form features?
  Q5: What outputs?
  Q6: How to compute the outputs?
  Q7: Why use them?

Q1: What attributes?
  Different set of attributes for each feature
    MFCC and their derivatives, Energy in specific spectral ranges, Zero Crossing Rate, Formant Frequency, ratio of spectral peaks, etc.
    VOT, energy onset, energy offset, etc.
  Refer to those attributes in Ali’s paper
  Find other indicative attributes in spectral graph, cepstral graph, etc.
  Find other significant characteristics in waveforms
  Find characteristics inside/between the time and frequency domains

Q2: How to measure them?
  Observe and analyze the speech signal in both time and frequency domains, e.g., filter bank analysis
  Data mining of meaningful “patterns”
  Enhance distinctive attributes, eliminate confusing attributes – better ways to measure things
  Find the relations of attributes inside a frame, e.g., between prominent attributes, weak attributes.
  Experiments needed to find distinguishing attributes for each acoustic feature
  Calculate correlation between attributes in succeeding frames
  Calculate information redundancy for different attributes
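
The correlation between attributes in succeeding frames could be measured, for example, with a lagged Pearson correlation between two per-frame attribute tracks. This is only one possible measure; the function names and the toy tracks are assumptions for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant track carries no correlation information
    return cov / math.sqrt(var_x * var_y)


def lagged_correlation(track_a, track_b, lag=1):
    """Correlate attribute A at frame t with attribute B at frame t + lag (lag >= 0)."""
    if lag == 0:
        return pearson(track_a, track_b)
    return pearson(track_a[:-lag], track_b[lag:])


# Toy per-frame tracks: track_b repeats track_a one frame later.
track_a = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
track_b = [9.0, 1.0, 2.0, 3.0, 4.0, 5.0]

same_frame = lagged_correlation(track_a, track_b, lag=0)
next_frame = lagged_correlation(track_a, track_b, lag=1)
```

A value near 1 at lag 1 indicates that the second attribute is largely predictable from the first one frame earlier, which is one sign of redundancy between the two.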

Q2: How to measure them?
  Topology of attribute organization
    Parallel Organization – ASAT Organization
    Graph Organization
      Hierarchical – Junija et al. (features)
      Eliminate redundancy in computation
      One attribute may trigger the test of existence of other attributes
    Combined organization, i.e., sequential and graph methods combined

Q3: What features?
  Features available in current acoustic-phonetic area: binary distinctive features
  Distinctive features are related to:
    Voicing
      Whether the vocal folds vibrate or not
    Place of articulation
      The particular articulator that is used (glottis, soft palate, lips, etc.)
    Manner of articulation
      How that articulator is used to produce the sound

Q3: What features?
  Initial list of twelve pairs of distinctive features
1. Vocalic/non-vocalic
2. Consonantal/non-consonantal
3. Interrupted/continuant
4. Checked/unchecked
5. Strident/mellow
6. Voiced/unvoiced
7. Compact/diffuse
8. Grave/acute
9. Flat/plain
10. Sharp/plain
11. Tense/lax
12. Nasal/oral

  English is characterized by 9 pairs of these features

Q3: What features?
  Need to detect all relevant features to perform automatic speech recognition at the phonetic level
  Acoustic-phonetic features are intuitively plausible, but there might exist other good features obtained from data mining and/or clustering techniques
  We can optimize (how we do it is unclear) and obtain the minimum necessary set of speech distinctive features
  May use attributes directly and together with features when calculating the outputs from the detectors

Q4: How to compute or estimate the features?
  Develop combination methods and optimize them to get better combination of attributes to form meaningful features, and select the best features for phonemes and possibly larger acoustic units
  Possible combination algorithms:
    Linearly weighted average
    ANN
    K-L
    Fuzzy integral seems promising, compared with ANN (cf. Chang & Greenberg’s paper)
  Prominent attributes characterize features. The existence of some particular attributes may help to further define the feature or features.
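
The simplest of the listed combination methods, the linearly weighted average, can be sketched as follows. The attribute scores, the weights and the 0.5 decision threshold are made-up values; how the weights would actually be chosen or optimized is left open in the slides.

```python
def combine_linear(attribute_scores, weights):
    """Weighted average of attribute detector scores.

    Weights are normalized by their sum, so they need not add up to 1.
    """
    if len(attribute_scores) != len(weights):
        raise ValueError("need exactly one weight per attribute score")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(w * s for w, s in zip(weights, attribute_scores)) / total_weight


# Hypothetical scores from three attribute detectors for one feature,
# with the first (most prominent) attribute weighted twice as heavily.
scores = [0.9, 0.6, 0.2]
weights = [2.0, 1.0, 1.0]

feature_score = combine_linear(scores, weights)  # (1.8 + 0.6 + 0.2) / 4
feature_present = feature_score >= 0.5           # assumed decision threshold
```

Giving a larger weight to a prominent attribute is one direct way to express the idea that prominent attributes characterize a feature while the remaining ones refine it.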

Q5: What outputs?

Study the acoustic-phonetic theories and establish models that best describe the production of sound signals

Study each acoustic class and find their differences and relations

Modified features? Phonemes? Phoneme-like units?

Q6: How to compute the outputs?
  Study acoustical variation during pronunciation, find common characteristics and distinguishing characteristics for acoustic-phonetic variations
  Score the outputs of the feature detectors using probabilities or likelihood measures of the presence of these distinctive features
  Other methods???
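
One way to score a detector output with a likelihood measure is a log-likelihood ratio between a "feature present" model and a "feature absent" model, optionally turned into a posterior probability. The sketch below assumes one-dimensional Gaussian models with invented parameters; the slides do not commit to any particular model.

```python
import math


def gaussian_log_pdf(x, mean, std):
    """Log density of a 1-D Gaussian at x."""
    return -0.5 * math.log(2 * math.pi * std * std) - (x - mean) ** 2 / (2 * std * std)


def log_likelihood_ratio(x, present, absent):
    """log p(x | present) - log p(x | absent); each model is a (mean, std) pair."""
    return gaussian_log_pdf(x, *present) - gaussian_log_pdf(x, *absent)


def posterior_present(x, present, absent, prior=0.5):
    """Posterior probability that the feature is present, via Bayes' rule."""
    log_odds = log_likelihood_ratio(x, present, absent) + math.log(prior / (1 - prior))
    return 1.0 / (1.0 + math.exp(-log_odds))


# Invented models for one detector measurement.
present_model = (1.0, 1.0)   # (mean, std) when the feature is present
absent_model = (-1.0, 1.0)   # (mean, std) when the feature is absent

llr_at_one = log_likelihood_ratio(1.0, present_model, absent_model)
post_at_one = posterior_present(1.0, present_model, absent_model)
post_at_zero = posterior_present(0.0, present_model, absent_model)
```

A measurement midway between the two models gives a posterior of 0.5, i.e., no evidence either way; scores of this kind could then be passed on to later merging and verification stages.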

Q7: Why use them?
  We have no other choice at this time
  These attributes and features may be far from optimal, but they are well motivated by acoustic-phonetic theories
  Will consider other ideas, as they are developed

Evaluation
  Evaluation criteria for attributes, features
    Mutual information (cf. Hasegawa-Johnson’s paper)
    Entropy (e.g., traditional Shannon Entropy, Rényi Entropy, cf. Cachin’s paper)
    Perplexity, like that used in language modeling
    False acceptance rate, false rejection rate
    Other criteria???
  Use those criteria to find correlations between attributes, as well as between features
  Gradually minimize the mutual information between attributes/features, e.g., by gradient descent, and get the minimum sets of attributes and features
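
The listed criteria have standard definitions that can be sketched for discrete (e.g., binary or quantized) detector outputs. The code below uses base-2 logarithms and plug-in estimates from counts; the function names and the toy data are assumptions for illustration, and perplexity is taken here simply as 2 to the power of the entropy of a distribution.

```python
import math
from collections import Counter


def shannon_entropy(probs):
    """H = -sum p log2 p, in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def renyi_entropy(probs, alpha):
    """H_alpha = log2(sum p^alpha) / (1 - alpha); reduces to Shannon at alpha = 1."""
    if alpha == 1:
        return shannon_entropy(probs)
    return math.log2(sum(p ** alpha for p in probs if p > 0)) / (1 - alpha)


def perplexity(probs):
    """2 ** entropy: the effective number of equally likely outcomes."""
    return 2 ** shannon_entropy(probs)


def mutual_information(xs, ys):
    """Plug-in estimate of I(X; Y) in bits from paired discrete observations."""
    n = len(xs)
    joint, px, py = Counter(zip(xs, ys)), Counter(xs), Counter(ys)
    return sum((c / n) * math.log2(c * n / (px[x] * py[y]))
               for (x, y), c in joint.items())


def false_accept_reject_rates(scores, labels, threshold):
    """Return (FAR, FRR); labels are True where the feature is truly present."""
    accepted = [s >= threshold for s in scores]
    negatives = sum(1 for l in labels if not l)
    positives = sum(1 for l in labels if l)
    false_accepts = sum(1 for a, l in zip(accepted, labels) if a and not l)
    false_rejects = sum(1 for a, l in zip(accepted, labels) if not a and l)
    far = false_accepts / negatives if negatives else 0.0
    frr = false_rejects / positives if positives else 0.0
    return far, frr


# Toy binary attribute tracks: a duplicate pair and an independent pair.
mi_redundant = mutual_information([0, 1, 0, 1], [0, 1, 0, 1])
mi_independent = mutual_information([0, 0, 1, 1], [0, 1, 0, 1])
far, frr = false_accept_reject_rates([0.9, 0.4, 0.6, 0.2],
                                     [True, True, False, False], 0.5)
```

Two attributes with high mutual information are largely redundant, so one of them is a candidate for removal when searching for a minimum set.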

Segmentation of Speech
  Study how humans segment different portions of speech, e.g., spectrum reading
  Multiple segmentations are possible, and thus we might want to search through a range of segmentation candidates to find the best result
  Collect the segments with high confidence scores
  Use other knowledge sources to help clarify the segments with poor scores

Training and Testing
  Database – TIMIT and/or Vic corpus
  Divide the database into separate training and testing sets
  Training
    (1) On the training set
    (2) On the training set + testing set – is this meaningful or proper?
    Find the difference between (1) and (2), and the generalization ability of the features to out of task data
  Test performance on the testing set

Training and Testing
  Training
    Study differences between isolated words, connected words, continuous and spontaneous speech
    Try not to depend solely on the training data, but instead find rules that adapt to the data and can be applied to more general environments
    Try not to diffuse the model as more data is added

Training and Testing
  Testing
    Find reasons why the detectors failed
    Observe error patterns
    Did the error patterns emerge due to different reasons? If so, re-examine previous steps, and combine the different information sources in ways that are less sensitive to the observed error patterns

Work Schedule
  First year:
    Set up the structure for the ASAT system
    Define the most reasonable starting set of acoustic attributes and phonetic features
    Look at a range of ways of combining evidence from the acoustic attributes to create the phonetic features
    Evaluate the baseline performance of the system on a given training and testing set of data – most probably using TIMIT
    Baseline alternative approaches, especially front ends, including auditory models and standard speech recognition features