
Design of Keyword Spotting System Based on Segmental Time Warping of Quantized Features

Presented by: Piush Karmacharya

Thesis Advisor: Dr. R. Yantorno

Committee Members: Dr. Joseph Picone, Dr. Dennis Silage

Research effort partially sponsored by the US Air Force Research Laboratories, Rome, NY
Speech Processing Laboratory, Temple University

Outline
- Introduction: What (Problem Definition), Why (Research Motivation), How (Techniques and Challenges)
- Related Research
- System Design
- Results
- Future Work

Introduction
- Keyword spotting is a branch of the broader field of speech recognition: identify a keyword in a stream of written text or audio (recorded or real time), and locate it if present
- Confusion matrix: True Positive (Hits), True Negative, False Negative (Misses), False Positive (False Alarms, FA)

Introduction (cont.)
- Speaker dependent: high accuracy, limited application
- Speaker independent: lower accuracy, wide application
- Performance evaluation: hits, misses and false alarms; Receiver Operating Characteristic (hits vs. FA)
- Design objective: maximize hits while keeping false alarms low
- Accuracy

Outline
- Introduction: What (Problem Definition), Why (Research Motivation), How (Techniques)
- Related Research
- System Design
- Results
- Future Work

Motivation
- Speech is the most general form of human communication
- Information is embedded in redundant words: "I would like to have french toast for breakfast."
- Non-intentional sounds: cough, exclamation, noise
- Efficient human-machine interface
- Applications: audio document retrieval, surveillance systems, voice commands/dialing

Outline
- Introduction: What (Problem Definition), Why (Research Motivation), How (Techniques and Challenges)
- Related Research
- System Design
- Results
- Future Work

Challenges
- Similar (1, 3, 6; 13, 14) and different keywords
- Variation in length (keyword 4: around 1,500 samples; keyword 14: around 4,100 samples)
- Different instances of the keyword REALLY from different speakers

/KINGLISHA/ (/Hey you have to speak English for some English/)

Common Approaches
- Template Based Approach
- Hidden Markov Models
- Neural Networks
- Hybrid Methods
- Discriminative Methods

Template Matching
- Started in the 1970s; one (or more) keyword templates available
- Search string: the keyword; search space: the utterance
- Flexible time search: Dynamic Time Warping (H. Sakoe, S. Chiba, 1971)
- Suitable for small-scale applications
- Drawbacks: must segment the utterance into isolated words; fails to learn from the existing speech data

Dynamic Time Warping
- Time-stretch/compress one signal so that it aligns with the other signal
- Extremely efficient time-series similarity measure; minimizes the effects of shifting and distortion
- A prototype of the keyword is stored as a template and compared to each word in the incoming utterance

Dynamic Time Warping
- Reference and test keywords arranged along two sides of a grid: template keyword on the vertical axis, test keyword on the horizontal
- Each block in the grid holds the distance between the corresponding feature vectors
- Best match: the path through the grid that minimizes cumulative distance
- But the number of possible paths increases exponentially with length!

DTW Constraints
- Monotonic condition: no backward steps
- Continuity condition: no breaks in the path
- Adjustment window: the optimal path does not wander away from the diagonal
- Boundary condition: starting/ending points fixed
- Constraints can be manipulated as desired (e.g., for connected word recognition [Myers, C.; Rabiner, L.; Rosenberg, A.; 1980])
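
The recurrence with these constraints can be sketched in a few lines of Python. This is an illustrative toy on scalar sequences (a real system would compare MFCC frames, and `band` here stands in for the Sakoe-Chiba adjustment window):

```python
def dtw_distance(a, b, band=None):
    """Cumulative DTW distance between two 1-D sequences.

    Monotonicity and continuity are enforced by the step pattern
    (insert / delete / match), the boundary condition by anchoring
    D[0][0] and reading D[n][m], and the adjustment window by the
    optional Sakoe-Chiba band of half-width `band`.
    """
    n, m = len(a), len(b)
    INF = float("inf")
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        lo, hi = 1, m
        if band is not None:  # adjustment-window constraint
            lo, hi = max(1, i - band), min(m, i + band)
        for j in range(lo, hi + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # monotonic + continuous step pattern
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]
```

Without the band the table is O(n·m); the band cuts the search space back to roughly linear in the sequence length.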

Hidden Markov Models
- Lawrence R. Rabiner, 1988
- Statistical model: hidden states / observable outputs
- Emission probability p(x|q1); transition probability p(q2|q1)

- First-order Markov process: the probability of the next state depends only on the current state
- Infer the output given the underlying system; estimate the most likely system from an observed sequence of outputs
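
As a toy illustration of emission and transition probabilities, the forward algorithm below computes the likelihood of an observation sequence under a small hypothetical two-state HMM (all state names and numbers are made up for this sketch, not taken from the thesis):

```python
# Hypothetical 2-state HMM with observable symbols "a" and "b".
trans = {"q1": {"q1": 0.7, "q2": 0.3},   # transition p(next | current)
         "q2": {"q1": 0.4, "q2": 0.6}}
emit  = {"q1": {"a": 0.9, "b": 0.1},     # emission p(observation | state)
         "q2": {"a": 0.2, "b": 0.8}}
start = {"q1": 0.5, "q2": 0.5}

def forward_likelihood(obs):
    """Likelihood of an observation sequence, summed over all
    hidden-state paths (the forward algorithm)."""
    alpha = {q: start[q] * emit[q][obs[0]] for q in start}
    for o in obs[1:]:
        alpha = {q: sum(alpha[p] * trans[p][q] for p in alpha) * emit[q][o]
                 for q in trans}
    return sum(alpha.values())
```

Because the recursion only ever looks one step back, the first-order Markov assumption keeps the computation linear in the sequence length.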

HMM KWS Implementation
- Large Vocabulary Continuous Speech Recognizer (LVCSR)
- Model non-keywords using garbage/filler models
- Limitations: a large amount of training data is required; the training data must be transcribed at word and/or phone level; transcribed data costs time and money; not available in all languages

Neural Networks
- Late 90s; a classifier that learns from existing data
- Multiple layers of interconnected nodes (neurons); different weights assigned to inputs, updated in every iteration
- Requires a large amount of transcribed data for training
- Hybrid systems: HMM/NN
- Discriminative approaches: Support Vector Machines

Outline
- Introduction: What (Problem Definition), Why (Research Motivation), How (Techniques and Challenges)
- Related Research
- System Design
- Results
- Future Work

Related Work
- Segmental Dynamic Time Warping (Alex S. Park, James R. Glass, 2008): segmentation into isolated words not required; choose the starting point and adjustment window size
- Proposed breaking words into smaller acoustic units: speech is a sequence of sounds
- Acoustic units as phonemes (Timothy J. Hazen, Wade Shen, Christopher White, 2009)
- Acoustic units as Gaussian Mixture Models (GMMs) (Yaodong Zhang, James R. Glass, 2009)

Related Work
- Phonetic posteriorgrams: phonemes as acoustic units
- Gaussian posteriorgrams: GMMs as acoustic units
- Posteriorgram: probability vector representing the posterior probabilities of a set of classes for a speech frame
- Every speech frame is associated with one or more phonemes

/AA//SH//ER/

Outline
- Introduction: What (Problem Definition), Why (Research Motivation), How (Techniques and Challenges)
- Related Research
- System Design
- Results
- Future Work

Research Methodology
- Acoustic unit: mean of the cluster (simple K-means clustering)
- Likelihood: Euclidean distance from the cluster centroid
- Segmental Dynamic Time Warping for keyword detection; covariance information not required
- Corpus:
  - CallHome database: 30 minutes of stereo conversations over a telephone channel
  - Switchboard database: 2,400 two-sided telephone conversations among 543 speakers

Steps
- Three stages: training, keyword template processing, keyword detection
- Training: training speech → feature extraction → K-means clustering → trained clusters → distance matrix
- Training speech: diverse sounds, diverse speakers

Speech Processing
- Short-time stationary: divide speech into short frames (20 ms with 5 ms spacing)

Pre-Emphasis (High-Pass Filter)
- s'[n] = s[n] - α·s[n-1], α = 0.95
- The speech spectrum falls off at high frequencies; pre-emphasis emphasizes the higher formants
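
The filter is one line of code; a minimal sketch (the first sample is passed through unchanged here, a common but not universal convention):

```python
def pre_emphasis(signal, alpha=0.95):
    """First-order high-pass filter: s'[n] = s[n] - alpha * s[n-1].
    Boosts high frequencies to compensate for the spectral roll-off
    of speech."""
    return [signal[0]] + [signal[n] - alpha * signal[n - 1]
                          for n in range(1, len(signal))]
```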

Mel Filter Banks
- Model human perception: multiply the spectrum with mel-spaced filter banks

MFCC Extraction
- Speech signal → pre-emphasis → windowing → FFT → mel-scaling → log → DCT → MFCC
- Feature vector: MFCC (24 filter banks)
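
The pipeline above can be sketched for a single pre-emphasized frame. This is a minimal illustration with assumed parameters (8 kHz sampling, 24 filters, 13 cepstra), not the exact front end used in the thesis; a real implementation would add liftering, deltas and energy terms:

```python
import numpy as np

def mfcc_frame(frame, sample_rate=8000, n_filters=24, n_ceps=13):
    """Windowing -> FFT -> mel filter bank -> log -> DCT for one frame."""
    frame = np.asarray(frame, dtype=float)
    N = len(frame)
    # Windowing: Hamming window reduces spectral leakage at frame edges
    spectrum = np.abs(np.fft.rfft(frame * np.hamming(N))) ** 2
    # Mel scale: finer resolution at low frequencies, as in human hearing
    def hz_to_mel(f):
        return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel_to_hz(m):
        return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = np.linspace(0.0, hz_to_mel(sample_rate / 2.0), n_filters + 2)
    bins = np.floor(N * mel_to_hz(mel_pts) / sample_rate).astype(int)
    # Triangular mel filters applied to the power spectrum
    fbank = np.zeros(n_filters)
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, right):
            if k < center and center > left:
                w = (k - left) / (center - left)
            elif k >= center and right > center:
                w = (right - k) / (right - center)
            else:
                w = 0.0
            fbank[i - 1] += w * spectrum[k]
    log_fbank = np.log(fbank + 1e-10)   # log compresses the dynamic range
    # DCT-II decorrelates the log energies into cepstral coefficients
    n = np.arange(n_filters)
    return np.array([np.sum(log_fbank *
                            np.cos(np.pi * q * (2 * n + 1) / (2 * n_filters)))
                     for q in range(n_ceps)])
```

A 20 ms frame at 8 kHz is 160 samples; stacking the resulting 13-dimensional vectors over time gives the feature sequence fed to clustering.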

K-Means Clustering
- The feature space is populated by the entire training data
1. Select k random cluster centers
2. Each data point finds the center it is closest to and associates itself with it
3. Each cluster finds the centroid of the points it owns
4. The centroid is updated with the new mean
5. Repeat steps 2-4 until convergence
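
The steps above can be sketched directly; a toy version on 1-D points for clarity (the real features are 39-dimensional MFCC vectors, and a fixed iteration count stands in for a convergence test):

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain K-means: alternate assignment and centroid update."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)      # step 1: random initial centers
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                 # step 2: assign to nearest center
            i = min(range(k), key=lambda c: abs(p - centers[c]))
            clusters[i].append(p)
        # steps 3-4: recompute each centroid as the mean of its points
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers
```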

Distance Matrix
- According to the new centers, the entire training set can be re-classified: for each data point, find the new nearest cluster center and classify it as a member of that cluster
- A feature far away from its centroid might fall into an adjacent cluster
- Likelihood measure: Euclidean distance
- Vectors in regions 3, 4 and 5 are closer to region 1 than to region 6
- A 2-D distance matrix optimizes the detection process

Distance Matrix - D
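
A minimal sketch of building D: pairwise Euclidean distances between the trained centroids, so that at detection time two frames are compared by looking up their cluster indices instead of recomputing vector distances:

```python
def centroid_distance_matrix(centers):
    """D[i][j] = Euclidean distance between centroids i and j.
    Precomputing D turns every frame-to-frame comparison in DTW
    into a single table lookup on cluster indices."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    return [[dist(a, b) for b in centers] for a in centers]
```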

Keyword Templates
- Keywords → feature extraction (MFCC feature vectors) → vector quantization (each frame associated with a cluster) → 1-D string of cluster indices
- 1-D template(s) stored in a folder
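
The quantization step is just a nearest-centroid lookup; a minimal sketch (squared distance suffices for the argmin):

```python
def quantize(frames, centers):
    """Vector quantization: map each feature frame to the index of its
    nearest cluster centroid, yielding a 1-D string of cluster indices."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return [min(range(len(centers)), key=lambda i: sq_dist(f, centers[i]))
            for f in frames]
```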

Keyword Detection
- The speech utterance is divided into overlapping segments (not isolated words)
- The warping distance for each segment is computed separately

Keyword Detection Pipeline
Speech → feature extraction → vector quantization (1-D cluster indices) → segmental DTW → decision logic → keyword detected

Distance Plot
- Keyword: C1-C2-C4-C6-C1; utterance: C2-C4-C5-C6-C1-C3-C4
- Keyword on the vertical axis; utterance on the horizontal axis
- Each cell: a distance measure, shown in grayscale (dark = low distance, bright = large distance)
- The minimum-distance path marks a candidate keyword

Segmental DTW
- The speech utterance is divided into overlapping segments; choose the starting point
- Adjustment window constraint: segment span R (= 3), segment width 2R + 1, segment overlap R
- Each segment has its own warping distance score; candidate keywords are the ones with low warping distance
- Precision error: 2R

Segment-1, Segment-2, Segment-3: distance scores

S1 = (0 + 0 + 0 + 7 + 9) / 5 = 3.2
S2 = (0 + 5 + 0 + 7 + 9) / 5 = 4.2
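
The segmental search can be sketched as follows. This is a simplified illustration, not the thesis implementation: the hop is chosen so that consecutive segments of width 2R+1 overlap by R frames, the per-cell cost is a lookup in the centroid distance matrix D, and the score is normalized by segment length so segments are comparable:

```python
def segment_scores(keyword, utterance, D, R=3):
    """Run DTW of a quantized keyword template against overlapping
    segments of the quantized utterance; return (start, score) pairs.
    `keyword` and `utterance` are 1-D lists of cluster indices."""
    INF = float("inf")

    def dtw(a, b):
        n, m = len(a), len(b)
        T = [[INF] * (m + 1) for _ in range(n + 1)]
        T[0][0] = 0.0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                cost = D[a[i - 1]][b[j - 1]]   # table lookup, not vector math
                T[i][j] = cost + min(T[i - 1][j], T[i][j - 1],
                                     T[i - 1][j - 1])
        return T[n][m] / max(n, m)             # length-normalized score

    width, hop = 2 * R + 1, R + 1              # overlap = width - hop = R
    scores = []
    for start in range(0, max(1, len(utterance) - width + 1), hop):
        scores.append((start, dtw(keyword, utterance[start:start + width])))
    return scores
```

Segments whose score falls below a threshold (or the lowest-scoring segments overall) become candidate keyword locations.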

Outline
- Introduction: What (Problem Definition), Why (Research Motivation), How (Techniques and Challenges)
- Related Research
- System Design
- Results
- Future Work

Results
- Some templates fail to produce a low distance at the keyword location
- The average score can be used with a threshold

Distortion Score for keyword UNIVERSITY

Decision Logic
- N/2 voting-based approach (N = number of templates available)
- Take the top ten lowest-distance segments for each keyword template
- Count the frequency of occurrence of each segment
- A segment among the top ten scorers for more than half the templates is considered the keyword
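
The voting rule above can be sketched as (a minimal illustration; the input format, a dict of segment start → warping distance per template, is an assumption for the sketch):

```python
from collections import Counter

def vote_detect(per_template_scores, top=10):
    """N/2 voting: each template votes for its `top` lowest-distance
    segments; a segment voted for by more than half of the N templates
    is reported as a keyword hit."""
    n = len(per_template_scores)
    votes = Counter()
    for scores in per_template_scores:
        best = sorted(scores, key=scores.get)[:top]
        votes.update(best)
    return [seg for seg, v in votes.items() if v > n / 2]
```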

Experimental Setup
- Feature vector: 13 MFCC + 13 delta + 13 delta-delta = 39 features
- 24 filter banks; 20 ms frames with overlap
- Cluster sizes: 64, 128, 256
- Training data: 14 speakers (10 male, 4 female) × 5 mins = 70 mins
- Segment span R = 2 to 20
- Number of keywords: 14
- Test utterances: 10 sec to 2 min
- Keyword location: cut-off precision error 30%

Keyword Statistics
- Are longer keywords easier to detect?
- Higher variance → lower detection rate
- Question: syllables vs. phonemes

Keyword    | Syllables | Phonemes | Avg Length (secs) | Length Variance | Variance %
University | 5         | 10       | 0.62              | 0.09            | 14
Computer   | 3         | 8        | 0.38              | 0.08            | 21
College    | 2         | 5        | 0.45              | 0.09            | 20
English    | 2         | 6        | 0.39              | 0.07            | 17
Language   | 2         | 7        | 0.44              | 0.10            | 22
Program    | 2         | 7        | 0.55              | 0.13            | 23
School     | 1         | 4        | 0.40              | 0.09            | 22
Something  | 2         | 6        | 0.42              | 0.11            | 26
Student    | 2         | 7        | 0.37              | 0.11            | 29

[http://www.howmanysyllables.com/, http://www.speech.cs.cmu.edu/cgi-bin/cmudict]

Operating Characteristic
- Hits vs. segment span R
- Smaller R: more restrictive; larger R: more flexible but larger precision error
- Maximum hits at R = 5-7
- Compared to results of S-DTW on Gaussian posteriorgrams for speech pattern discovery [Y. Zhang, J. R. Glass, 2010]

Operating Characteristic
- Misses vs. R: small R is restrictive; larger R is flexible but admits more noise; minimum misses at R = 5-7

- False alarms vs. R: small R gives fewer false alarms; large R is flexible, more FA

Operating Characteristic
- Speed vs. R: number of segments = (UL - margin - 1)/R + 1 (UL = utterance length)
- Smaller R: more segments, more processing time; larger R: fewer segments, less time
- For R = 5: 1 minute of utterance takes 5 secs per keyword template, so 12 templates are possible in real time
- 1 hr of speech takes 10 mins on 200 CPUs using GP and graph clustering on S-DTW segments [Y. Zhang and J. R. Glass, 2010]

Execution time per keyword template per minute of utterance

Results
- Results vary for different keywords
- Frequency of use of the word matters more than length (University/Relationship vs. Something)
- Pronunciation is context dependent

Keyword      | Accuracy
University   | 0.89
Conversation | 0.80
Tomorrow     | 0.75
English      | 0.71
Computer     | 0.67
Relationship | 0.67
School       | 0.63
College      | 0.63
Language     | 0.62
Student      | 0.62
Zero         | 0.60
Program      | 0.57
Funny        | 0.57
Something    | 0.45

[ H. Ketabdar, J. Vepa, S Bengio and H. Bourlard, 2006]

Future Work
- Implement a relevance feedback technique so that generic templates are assigned higher weights after every iteration [Hazen T.J., Shen W., White C.M., 2009]
- Retrain the clusters for different environments
- Test on more data with refined keyword templates (isolating keywords from the speech data was time consuming and required several iterations)
- Use a model keyword instead of several keyword templates [*Olakunle]

Thank You

Backup Slides

Model Keyword
- Develop a model keyword from all available keyword templates
- Implement Self-Organizing Maps (SOMs): cluster grouping is random in K-means clustering; data belonging to the same clusters are grouped together in an SOM

System Design
- Vector quantization: quantize data into finite clusters; the training data used to populate the feature space need not be transcribed; features for the same sound fall into the same cluster; reduces dimension (feature vectors reduced to a codebook)
- Likelihood estimation: accounts for data that might fall just outside the cluster
- Segmental Dynamic Time Warping: DTW requires fixed ends, i.e., segmentation of the utterance into isolated words; instead, divide the utterance into segments (not necessarily words) and compute a distortion score for each segment using DTW

HMM for Speech Recognition
- Each word is a sequence of unobservable states with certain emission probabilities (features) and transition probabilities (to the next state)
- Estimate the model for each word in the training vocabulary
- For each test keyword, the model that maximizes the likelihood is selected as a match (Viterbi algorithm)
- Grammatical constraints are applied to improve recognition accuracy
- Large Vocabulary Continuous Speech Recognizer (LVCSR); model non-keywords using garbage/filler models
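
A sketch of the Viterbi step: instead of summing over paths (as the forward algorithm does), it keeps only the single most likely hidden-state path. The two-state model and its numbers are hypothetical, chosen only to make the example runnable:

```python
# Hypothetical 2-state HMM with observable symbols "a" and "b".
trans = {"q1": {"q1": 0.7, "q2": 0.3},
         "q2": {"q1": 0.4, "q2": 0.6}}
emit  = {"q1": {"a": 0.9, "b": 0.1},
         "q2": {"a": 0.2, "b": 0.8}}
start = {"q1": 0.5, "q2": 0.5}

def viterbi(obs):
    """Most likely hidden-state path for an observation sequence."""
    # delta[q] = (best probability of any path ending in q, that path)
    delta = {q: (start[q] * emit[q][obs[0]], [q]) for q in start}
    for o in obs[1:]:
        delta = {
            q: max(((prob * trans[p][q] * emit[q][o], path + [q])
                    for p, (prob, path) in delta.items()),
                   key=lambda t: t[0])
            for q in trans
        }
    return max(delta.values(), key=lambda t: t[0])[1]
```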

Phonetic Posteriorgrams
- Each element represents the posterior probability of a specific phonetic class for a specific time frame
- Can be computed directly from the frame-based acoustic likelihood scores for each phonetic class at each time frame

Gaussian Posteriorgrams
- Time vs. class matrix representation
- Each dimension of the feature vector is approximated by a weighted sum of Gaussians (GMM), parameterized by the mixture weights, mean vectors, and covariance matrices
- A Gaussian posteriorgram is a probability vector representing the posterior probabilities of a set of Gaussian components for a speech frame
- The GMM can be computed over unsupervised training data instead of using a phonetic recognizer

Gaussian Posteriorgram (GP)
S = (s1, s2, ..., sn)
GP(S) = (q1, q2, ..., qn)
qi = ( P(C1|si), P(C2|si), ..., P(Cm|si) )
Ci: i-th Gaussian component of a GMM; m: number of Gaussian components
Difference between two GPs: D(p, q) = -log(p · q)
DTW is used to find low-distortion segments
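
The per-frame distance D(p, q) = -log(p · q) is a one-liner; a minimal sketch on plain probability vectors:

```python
import math

def gp_distance(p, q):
    """Distance between two Gaussian-posteriorgram frames:
    the negative log of their inner product. Identical peaked
    posteriors give 0; disjoint support would give infinity."""
    dot = sum(a * b for a, b in zip(p, q))
    return -math.log(dot)
```

Plugging this in as the DTW cell cost recovers the S-DTW search over posteriorgram sequences.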