Hybrid Systems for Continuous Speech Recognition Issac Alphonso [email protected] Institute...

Hybrid Systems for Continuous Speech Recognition

Issac Alphonso
[email protected]

Institute for Signal and Information Processing

Mississippi State University

Abstract

Statistical techniques based on hidden Markov models (HMMs) with Gaussian emission densities have dominated the signal processing and pattern recognition literature for the past 20 years. However, HMMs suffer from an inability to learn discriminative information and are prone to over-fitting and over-parameterization. Recent work in machine learning has focused on models, such as the support vector machine (SVM), that automatically control generalization and parameterization as part of the overall optimization process. SVMs have been shown to provide significant improvements in performance on small pattern recognition tasks compared to a number of conventional approaches.

In this presentation, I will describe some of the work that I have done in implementing a kernel-based speech recognition system (this is based on work done by Aravind Ganapathiraju). I will then describe our work in using kernel-based machines as acoustic models in large vocabulary speech recognition systems. Finally, I will show that SVMs perform better than Gaussian mixture-based HMMs in open-loop recognition.

Bio

Issac Alphonso is an M.S. graduate from the Department of Electrical and Computer Engineering at Mississippi State University (MSU) under the supervision of Dr. Joe Picone. He has been a member of the Institute for Signal and Information Processing (ISIP) at MSU since 1997. Mr. Alphonso's work as a graduate student has revolved around exploring new acoustic modeling techniques for continuous speech recognition systems. His most recent work has been in the implementation of a hybrid hierarchical decoder that employs kernel-based techniques like support vector machines, which replace the underlying Gaussian distribution in hidden Markov models. His thesis work looks at a new network training framework that reduces the complexity of the training process, while retaining the robustness of the expectation-maximization based supervised training framework.

Outline

• What we do and how we fit in the big picture
• The acoustic modeling problem for speech
• Structural risk minimization
• Support vector classifiers
• Coupling vector machines to ASR systems
• Proof of concept and experiments

Technology

• Focus: speech recognition
• First public-domain LVCSR system
• Goal: Accelerate research
• Extensibility, Modular (C++, Java)
• Easy to Use (Docs, Tutorials, Toolkits)
• Benefit: Technology
• Standard benchmarks

Research: Matlab, Octave, Python

Research: Rapid Prototyping, “Fair” Evaluations, Ease of Use, Lightweight Programming

Efficiency: Memory, Hyper-real time training, Parallel processing, Data intensive

ASR: HTK, SPHINX, CSLU

ISIP: IFCs, Java Apps, Toolkits

Approach

ASR Problem

• Front-end maintains information important for modeling in a reduced parameter set
• Language model typically predicts a small set of next words based on knowledge of a finite number of previous words (N-grams)
• Search engine uses knowledge sources and models to choose amongst competing hypotheses
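The N-gram idea in the language model bullet can be sketched with a bigram model, where the next word is predicted from a single previous word. This is a minimal illustration, not the language model used in the ISIP system; the function names and the toy corpus are made up for the example.

```python
from collections import Counter, defaultdict


def train_bigram(sentences):
    """Count word pairs so each word maps to a Counter of its successors."""
    successors = defaultdict(Counter)
    for sentence in sentences:
        words = ["<s>"] + sentence.split() + ["</s>"]
        for prev, nxt in zip(words, words[1:]):
            successors[prev][nxt] += 1
    return successors


def predict_next(successors, prev, top_k=3):
    """Return up to top_k (word, probability) pairs that follow `prev`."""
    counts = successors.get(prev)
    if not counts:
        return []
    total = sum(counts.values())
    return [(word, count / total) for word, count in counts.most_common(top_k)]


# Toy digit-string corpus (hypothetical data, for illustration only).
corpus = ["three two five eight", "three two two five", "five eight three two"]
model = train_bigram(corpus)
print(predict_next(model, "three"))
print(predict_next(model, "two"))
```

Restricting the history to a finite number of words is what keeps the set of candidate next words small enough for the search engine to explore.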

Acoustic Confusability

Requires reasoning under uncertainty!

• Regions of overlap represent classification error

• Reduce overlap by introducing acoustic and linguistic context

Comparison of “aa” in “lOck” and “iy” in “bEAt” for SWB

Probabilistic Formulation
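The formula on this slide did not survive in the transcript. The block below gives the standard Bayesian formulation of speech recognition, offered as an assumption about what the slide showed. The symbols are chosen here for illustration: $A$ is the acoustic observation sequence, $W$ a candidate word sequence, $P(A \mid W)$ the acoustic model and $P(W)$ the language model.

```latex
\hat{W} = \arg\max_{W} P(W \mid A)
        = \arg\max_{W} \frac{P(A \mid W)\, P(W)}{P(A)}
        = \arg\max_{W} P(A \mid W)\, P(W)
```

The denominator $P(A)$ does not depend on $W$, so it can be dropped from the maximization.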

Acoustic Modeling - HMMs

• HMMs model temporal variation in the transition probabilities of the state machine
• GMM emission densities are used to account for variations in speaker, accent, and pronunciation
• Sharing model parameters is a common strategy to reduce complexity

[Figure: HMM state sequence s0, s1, s2, s3, s4; example word sequence THREE TWO FIVE EIGHT]
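How an HMM combines transition probabilities with emission densities can be sketched with the forward algorithm. This is a simplified example under stated assumptions: a three-state left-to-right model, one-dimensional observations, and a single Gaussian per state instead of the GMMs described on the slide. All parameter values are invented.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian at x."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def forward_likelihood(obs, trans, means, variances):
    """P(obs | model) for an HMM that starts in state 0 (forward algorithm)."""
    n = len(means)
    alpha = [0.0] * n
    alpha[0] = gaussian_pdf(obs[0], means[0], variances[0])
    for x in obs[1:]:
        # The comprehension reads the previous alpha before it is reassigned.
        alpha = [
            sum(alpha[i] * trans[i][j] for i in range(n))
            * gaussian_pdf(x, means[j], variances[j])
            for j in range(n)
        ]
    return sum(alpha)


# Left-to-right topology: each state may loop or advance to the next state.
trans = [[0.6, 0.4, 0.0],
         [0.0, 0.6, 0.4],
         [0.0, 0.0, 1.0]]
means = [0.0, 3.0, 6.0]
variances = [1.0, 1.0, 1.0]

rising = [0.1, 0.2, 2.9, 3.1, 6.0]
falling = list(reversed(rising))
print(forward_likelihood(rising, trans, means, variances))
print(forward_likelihood(falling, trans, means, variances))
```

The rising sequence matches the temporal order encoded by the transitions, so it scores far higher than the same values presented in reverse.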

Hierarchical Search

• Each node in the hierarchy can dynamically expand to explore sub-networks at the next level.
• HMMs are employed at the lowest level of the search hierarchy.
• Word networks can generalize to unseen pronunciation variants in the data.

Statistical Models

• Each state in the HMM is associated with a statistical model (except the non-emitting state and stop).
• The statistical model can implement any pdf that follows a defined interface contract.
• The statistical model can transparently take the form of a GMM or SVM.
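The interface contract idea can be sketched as an abstract base class that every state-level model implements. The class and method names below are hypothetical and do not reflect the actual ISIP class hierarchy; the point is only that the decoder calls one method and does not need to know whether a GMM or an SVM sits behind it.

```python
import math
from abc import ABC, abstractmethod


class StatisticalModel(ABC):
    """Hypothetical contract: return a log-domain score for a feature vector."""

    @abstractmethod
    def log_score(self, x):
        raise NotImplementedError


class DiagonalGmm(StatisticalModel):
    """Mixture of Gaussians with diagonal covariances."""

    def __init__(self, weights, means, variances):
        self.weights, self.means, self.variances = weights, means, variances

    def log_score(self, x):
        total = 0.0
        for w, mu, var in zip(self.weights, self.means, self.variances):
            log_pdf = sum(
                -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
                for xi, m, v in zip(x, mu, var)
            )
            total += w * math.exp(log_pdf)
        return math.log(total)


class SigmoidSvm(StatisticalModel):
    """Linear SVM whose distance is mapped to a posterior by a sigmoid."""

    def __init__(self, w, b, slope=-1.0, offset=0.0):
        self.w, self.b, self.slope, self.offset = w, b, slope, offset

    def log_score(self, x):
        distance = sum(wi * xi for wi, xi in zip(self.w, x)) + self.b
        # log(1 / (1 + exp(slope * distance + offset)))
        return -math.log1p(math.exp(self.slope * distance + self.offset))


# The caller treats both models identically.
models = [DiagonalGmm([1.0], [[0.0, 0.0]], [[1.0, 1.0]]), SigmoidSvm([1.0, 0.0], 0.0)]
scores = [model.log_score([0.0, 0.0]) for model in models]
print(scores)
```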

Maximum Likelihood Training

• Data-driven modeling supervised only from a word-level transcription
• Approach: maximum likelihood estimation
• The EM algorithm is used to improve our estimates:

  $P(\text{Data} \mid \hat{\lambda}) \ge P(\text{Data} \mid \lambda)$ if $Q(\lambda, \hat{\lambda}) \ge Q(\lambda, \lambda)$

• Guaranteed convergence to local maximum
• No guard against overfitting!
• Computationally efficient training algorithms (Forward-Backward) have been crucial
• Decision trees are used to optimize parameter sharing, minimize system complexity and integrate additional linguistic knowledge
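The convergence property of EM can be observed on a small problem. The sketch below fits a two-component one-dimensional Gaussian mixture and records the log-likelihood after every iteration; the data, the initial parameters and the variance floor are all invented for the illustration, and a full HMM trainer would use Forward-Backward to compute the state occupancies instead.

```python
import math


def normal_pdf(x, mean, var):
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def log_likelihood(data, weights, means, variances):
    return sum(
        math.log(sum(w * normal_pdf(x, m, v)
                     for w, m, v in zip(weights, means, variances)))
        for x in data
    )


def em_step(data, weights, means, variances):
    """One EM iteration for a 1-D Gaussian mixture."""
    # E-step: posterior probability of each component for each sample.
    resp = []
    for x in data:
        joint = [w * normal_pdf(x, m, v)
                 for w, m, v in zip(weights, means, variances)]
        total = sum(joint)
        resp.append([j / total for j in joint])
    # M-step: re-estimate parameters from the weighted samples.
    new_w, new_m, new_v = [], [], []
    for c in range(len(weights)):
        n_c = sum(r[c] for r in resp)
        mean = sum(r[c] * x for r, x in zip(resp, data)) / n_c
        var = sum(r[c] * (x - mean) ** 2 for r, x in zip(resp, data)) / n_c
        new_w.append(n_c / len(data))
        new_m.append(mean)
        new_v.append(max(var, 1e-6))  # small floor to avoid a collapsed variance
    return new_w, new_m, new_v


data = [-2.1, -1.9, -2.0, -1.5, 1.8, 2.2, 2.0, 2.5]
params = ([0.5, 0.5], [-1.0, 1.0], [1.0, 1.0])
history = [log_likelihood(data, *params)]
for _ in range(20):
    params = em_step(data, *params)
    history.append(log_likelihood(data, *params))
print(history[0], history[-1])
```

The log-likelihood never decreases from one iteration to the next, but nothing in the procedure stops the variances from shrinking onto the training samples, which is the overfitting risk noted on the slide.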

Drawbacks of Current Approach

ML Convergence does not translate to optimal classification

Error from incorrect modeling assumptions

Finding the optimal decision boundary requires only one parameter!

Drawbacks of Current Approach

Data not separable by a hyperplane – nonlinear classifier is needed

Gaussian MLE models tend toward the center of mass – overtraining leads to poor generalization

Structural Risk Minimization

• The VC dimension is a measure of the complexity of the learning machine
• Higher VC dimension gives a looser bound on the actual risk, thus penalizing a more complex model (Vapnik)
• Expected risk:

  $R(\alpha) = \int \frac{1}{2} \, |y - f(x, \alpha)| \, dP(x, y)$

• Not possible to estimate $P(x, y)$
• Empirical risk:

  $R_{emp}(\alpha) = \frac{1}{2l} \sum_{i=1}^{l} |y_i - f(x_i, \alpha)|$

• Related by the VC dimension, $h$:

  $R(\alpha) \le R_{emp}(\alpha) + f(h)$

• Approach: choose the machine that gives the least upper bound on the actual risk

[Figure: empirical risk, VC confidence and the bound on the expected risk plotted against VC dimension, with the optimum marked on the expected risk curve]
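The selection rule on this slide can be made concrete by evaluating the bound for a few candidate machines. The sketch below uses the form of the VC confidence term given in Burges's tutorial, which holds with probability $1 - \eta$; the candidate machines, their VC dimensions and their empirical risks are invented numbers.

```python
import math


def vc_confidence(h, l, eta=0.05):
    """VC confidence term for VC dimension h and l training samples (h < l)."""
    return math.sqrt((h * (math.log(2.0 * l / h) + 1.0) - math.log(eta / 4.0)) / l)


def risk_bound(empirical_risk, h, l, eta=0.05):
    """Upper bound on the expected risk: empirical risk plus VC confidence."""
    return empirical_risk + vc_confidence(h, l, eta)


# (name, VC dimension, empirical risk): hypothetical values for illustration.
candidates = [
    ("simple", 5, 0.20),
    ("medium", 50, 0.05),
    ("complex", 500, 0.01),
]
num_samples = 10000

for name, h, r_emp in candidates:
    print(name, round(risk_bound(r_emp, h, num_samples), 4))

best = min(candidates, key=lambda c: risk_bound(c[2], c[1], num_samples))
print("selected:", best[0])
```

The most complex machine has the lowest empirical risk, yet its large VC confidence term makes its bound the loosest, so the rule selects the intermediate machine.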

Support Vector Machines

Hyperplanes C0-C2 achieve zero empirical risk. C0 generalizes optimally

The data points that define the boundary are called support vectors

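Optimization: Separable Data

• Hyperplane: $x \cdot w + b = 0$
• Constraints: $y_i (x_i \cdot w + b) - 1 \ge 0$
• Quadratic optimization of a Lagrange functional minimizes risk criterion (maximizes margin). Only a small portion become support vectors
• Final classifier: $f(x) = \sum_{i \in \text{SVs}} \alpha_i y_i (x_i \cdot x) + b$

[Figure: class 1 and class 2 separated by candidate hyperplanes C0, C1 and C2; the margin hyperplanes H1 and H2, the normal vector w and the origin are marked; C0 is the optimal classifier]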
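The separable-case constraints and the margin can be checked numerically for a given hyperplane. The sketch below does not solve the quadratic optimization; it only verifies a hand-picked candidate $(w, b)$ on a toy data set, and the helper names are assumptions made for the example.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def satisfies_constraints(points, labels, w, b):
    """True if y_i (x_i . w + b) - 1 >= 0 holds for every training point."""
    return all(y * (dot(x, w) + b) - 1.0 >= -1e-12 for x, y in zip(points, labels))


def margin_width(w):
    """Distance between the margin hyperplanes H1 and H2."""
    return 2.0 / math.sqrt(dot(w, w))


def support_vectors(points, labels, w, b, tol=1e-9):
    """Points that lie exactly on H1 or H2 (constraint met with equality)."""
    return [x for x, y in zip(points, labels)
            if abs(y * (dot(x, w) + b) - 1.0) <= tol]


points = [(2.0, 0.0), (3.0, 1.0), (-2.0, 0.0), (-3.0, -1.0)]
labels = [1, 1, -1, -1]

wide = ((0.5, 0.0), 0.0)    # hand-picked candidate with the widest margin
narrow = ((1.0, 0.0), 0.0)  # also separates the data, with a smaller margin

for w, b in (wide, narrow):
    print(satisfies_constraints(points, labels, w, b), margin_width(w),
          support_vectors(points, labels, w, b))
```

Both candidates achieve zero empirical risk, but only the first touches the closest points of each class, and those two points are its support vectors.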

SVMs as Nonlinear Classifiers

• Data for practical applications typically not separable using a hyperplane in the original input feature space
• Transform data to higher dimension where hyperplane classifier is sufficient to model decision surface:

  $\Phi : \mathbb{R}^n \to \mathbb{R}^N$

• Kernels used for this transformation:

  $K(x_i, x_j) = \Phi(x_i) \cdot \Phi(x_j)$

• Final classifier:

  $f(x) = \sum_{i \in \text{SVs}} \alpha_i y_i K(x_i, x) + b$
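The kernel form of the final classifier can be evaluated directly once the support vectors and their multipliers are known. The sketch below uses an RBF kernel on an XOR-style layout that no hyperplane in the input space can separate; the multipliers and the bias are hand-picked for illustration, not the result of training.

```python
import math


def rbf_kernel(u, v, gamma=0.5):
    """K(u, v) = exp(-gamma * ||u - v||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))


def svm_decision(x, support_vectors, alphas, labels, bias, kernel=rbf_kernel):
    """f(x) = sum over support vectors of alpha_i * y_i * K(x_i, x) + b."""
    return sum(a * y * kernel(sv, x)
               for sv, a, y in zip(support_vectors, alphas, labels)) + bias


# XOR-style layout: opposite corners share a class.
support_vectors = [(1.0, 1.0), (-1.0, -1.0), (1.0, -1.0), (-1.0, 1.0)]
labels = [1, 1, -1, -1]
alphas = [1.0, 1.0, 1.0, 1.0]  # hand-picked, not trained
bias = 0.0

print(svm_decision((0.9, 0.9), support_vectors, alphas, labels, bias))
print(svm_decision((0.9, -0.9), support_vectors, alphas, labels, bias))
```

The kernel is evaluated in the input space, so the higher-dimensional mapping never has to be computed explicitly.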

Experimental Progression

• Proof of concept on speech classification using the Deterding vowel corpus
• Coupling the SVM classifier to ASR system
• Results on the OGI Alphadigits corpus

Vowel Classification

Deterding Vowel Data: 11 vowels spoken in “h*d” context; 10 log area parameters; 528 train, 462 test

Approach                   % Error   # Parameters
SVM: Polynomial Kernels    49%
K-Nearest Neighbor         44%
Gaussian Node Network      44%
SVM: RBF Kernels           35%       83 SVs
Separable Mixture Models   30%

Coupling to ASR

• Data size: 30 million frames of data in training set
  • Solution: Segmental phone models
• Source for Segmental Data:
  • Solution: Use HMM system in bootstrap procedure
  • Could also build a segment-based decoder
• Probabilistic decoder coupling:
  • SVMs: Sigmoid-fit posterior

[Figure: phone sequence “hh aw aa r y uw”; a segment of k frames is divided into regions (region 1: 0.3*k frames), labeled mean region 1, mean region 2 and mean region 3]
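The segmental conversion suggested by the figure labels can be sketched as follows: a variable-length phone segment is split into three regions, and the mean frame of each region is concatenated into one fixed-length vector. Only the 0.3 proportion for region 1 appears in the source; the 0.4 and 0.3 proportions for the other regions, and the function name, are assumptions.

```python
def segmental_features(frames, proportions=(0.3, 0.4, 0.3)):
    """Concatenate the per-region mean vectors of a variable-length segment."""
    k = len(frames)
    dim = len(frames[0])
    # Cumulative region boundaries, rounded to whole frames.
    bounds = [0]
    covered = 0.0
    for p in proportions:
        covered += p
        bounds.append(round(covered * k))
    bounds[-1] = k
    features = []
    for start, end in zip(bounds, bounds[1:]):
        # Very short segments can yield an empty region; reuse the nearest frame.
        region = frames[start:end] or [frames[min(start, k - 1)]]
        features.extend(
            sum(frame[d] for frame in region) / len(region) for d in range(dim)
        )
    return features


# Ten two-dimensional frames (toy values standing in for mel-cepstra).
segment = [[float(i), 2.0 * i] for i in range(10)]
print(segmental_features(segment))
print(len(segmental_features(segment[:7])))
```

Every segment maps to the same number of features regardless of its duration, which lets one classifier input stand in for a whole phone instead of each of its frames.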

Coupling to ASR System

[Figure: block diagram with the blocks HMM RECOGNITION, SEGMENTAL CONVERTER and HYBRID DECODER; labeled data flows: Features (Mel-Cepstra), Segment Information, N-best List, Segmental Features, Hypothesis]

N-Best Rescoring

• A word-internal N-gram decoder is used to generate the N-best word-graphs.
• The word-graphs come with the HMM and LM score, which is used in the rescoring process.
• The SVM score which is computed during rescoring is used as an additional knowledge source.
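Rescoring with an additional knowledge source can be sketched as re-ranking a list of hypotheses by a weighted sum of scores. The field names, the weight and the score values below are hypothetical; the source does not state how the scores were actually combined.

```python
def rescore_nbest(hypotheses, svm_weight=1.0):
    """Re-rank hypotheses by HMM+LM log-score plus a weighted SVM log-score."""
    return sorted(
        hypotheses,
        key=lambda h: h["hmm_lm_score"] + svm_weight * h["svm_score"],
        reverse=True,
    )


# Toy N-best list in log domain (invented scores); the HMM prefers the second entry.
nbest = [
    {"words": "A 1 9 B 4 E", "hmm_lm_score": -120.0, "svm_score": -8.0},
    {"words": "A 1 9 D 4 E", "hmm_lm_score": -119.5, "svm_score": -12.0},
    {"words": "A 1 9 B 4 G", "hmm_lm_score": -123.0, "svm_score": -9.5},
]

hmm_best = max(nbest, key=lambda h: h["hmm_lm_score"])
rescored_best = rescore_nbest(nbest)[0]
print(hmm_best["words"], "->", rescored_best["words"])
```

Setting `svm_weight` to zero reproduces the original HMM ranking, which makes the contribution of the extra knowledge source easy to isolate.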

Alphadigit Recognition

• OGI Alphadigits: continuous, telephone bandwidth letters and numbers (“A19B4E”)
• 3329 utterances using 10-best lists generated by the HMM decoder
• SVMs require a sigmoid posterior estimate to produce likelihoods; sigmoid parameters estimated from large held-out set
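The sigmoid posterior estimate can be sketched as a two-parameter fit that maps SVM distances to probabilities. The parametric form $1 / (1 + \exp(A f + B))$ follows Platt's proposal; the fitting loop below is plain gradient descent on the cross-entropy, a simplification chosen for brevity and not the procedure used in the reported experiments. The held-out distances and labels are invented.

```python
import math


def sigmoid_posterior(distance, a, b):
    """P(class = +1 | distance) = 1 / (1 + exp(a * distance + b))."""
    z = a * distance + b
    if z >= 0.0:
        e = math.exp(-z)
        return e / (1.0 + e)  # algebraically equal, avoids overflow for large z
    return 1.0 / (1.0 + math.exp(z))


def fit_sigmoid(distances, targets, learning_rate=0.1, iterations=2000):
    """Fit (a, b) by gradient descent on the negative log-likelihood."""
    a, b = 0.0, 0.0
    n = len(distances)
    for _ in range(iterations):
        grad_a = grad_b = 0.0
        for f, t in zip(distances, targets):
            # Derivative of the loss with respect to (a * f + b) is (t - p).
            error = t - sigmoid_posterior(f, a, b)
            grad_a += error * f
            grad_b += error
        a -= learning_rate * grad_a / n
        b -= learning_rate * grad_b / n
    return a, b


# Toy held-out set: SVM distances with 1 for in-class and 0 for out-of-class.
held_out_distances = [-2.0, -1.5, -1.0, -0.2, 0.3, 0.8, 1.2, 2.5]
held_out_targets = [0, 0, 0, 1, 0, 1, 1, 1]

a, b = fit_sigmoid(held_out_distances, held_out_targets)
print(a, b)
print(sigmoid_posterior(2.0, a, b), sigmoid_posterior(-2.0, a, b))
```

A negative fitted slope means larger positive distances map to posteriors closer to one, giving the decoder a probability-like score it can combine with HMM and LM scores.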

SVM Alphadigit Recognition

Transcription   Segmentation   SVM     HMM
N-best          Hypothesis     11.0%   11.9%
N-best+Ref      Reference      3.3%    6.3%

• HMM system is cross-word state-tied triphones with 16 mixtures of Gaussian models
• SVM system has monophone models with segmental features
• System combination experiment yields another 1% reduction in error

Summary

• We are the first speech group to apply kernel machines to the acoustic modeling problem
• Performance exceeds that of HMM/GMM system, with a bit of HMM interaction
• Algorithms for increased data sizes are key

Acknowledgments

• Collaborators: Naveen Parihar and Joe Picone at Mississippi State
• Consultants: Aravind Ganapathiraju (Conversay) and Jonathan Hamaker (Microsoft)

References

A. Ganapathiraju, “Support Vector Machines for Speech Recognition,” Ph.D. Dissertation, Department of Electrical and Computer Engineering, Mississippi State University, January 2002.

J. Platt, “Fast Training of Support Vector Machines using Sequential Minimal Optimization,” Advances in Kernel Methods, MIT Press, 1998.

V.N. Vapnik, “Statistical Learning Theory,” John Wiley, New York, NY, USA, 1998.

C.J.C. Burges, “A Tutorial on Support Vector Machines for Pattern Recognition,” AT&T Bell Laboratories, November 1999.

Accomplishments

• Developed a set of Java-based graphical tools used to demonstrate fundamental concepts in signal processing and speech recognition: http://www.isip.msstate.edu/projects/speech/software/demonstrations/applets/
• Developed a set of Tcl-Tk based graphical tools used to transcribe, segment and analyze speech recognition databases: http://www.isip.msstate.edu/projects/speech/software/legacy/
• Developed a generalized network-based speech recognition trainer, which is part of my master's thesis work.
• Developed a hybrid HMM/SVM system used to rescore N-best word-graphs, which is based on work by Aravind Ganapathiraju.
• Worked as part of a team to design and implement a public-domain HMM-based speech recognition system: http://www.isip.msstate.edu/projects/speech/software/
