Voice Recognition
![Page 1: Voice Recognition](https://reader033.fdocuments.us/reader033/viewer/2022061108/544fa0e6af7959000a8b5cbe/html5/thumbnails/1.jpg)
VOICE RECOGNITION
AMRITA MORE – 416, AASHNA PARIKH – 417
![Page 2: Voice Recognition](https://reader033.fdocuments.us/reader033/viewer/2022061108/544fa0e6af7959000a8b5cbe/html5/thumbnails/2.jpg)
INTRODUCTION
• A user gives a predefined voice instruction to the system through a microphone; the system understands the command and executes the required function.
• This lets the user operate Windows by voice, without using a keyboard or mouse.
![Page 3: Voice Recognition](https://reader033.fdocuments.us/reader033/viewer/2022061108/544fa0e6af7959000a8b5cbe/html5/thumbnails/3.jpg)
KEY TERMS
• Speaking modes:
  o Isolated words
  o Continuous speech
• Vocabulary size
• Language model
• Acoustic model
• Dictionary
![Page 4: Voice Recognition](https://reader033.fdocuments.us/reader033/viewer/2022061108/544fa0e6af7959000a8b5cbe/html5/thumbnails/4.jpg)
REALIZATION OF MANDARIN SPEECH RECOGNITION SYSTEM USING
SPHINX
• Mandarin is the main language of China, spoken by 855 million native speakers.
• Mandarin Continuous Digit Recognition System:
  o A small-vocabulary speech recognition system with only ten recognition targets, the digits 0–9.
• This technique builds the speech recognition system using Sphinx, including PocketSphinx, SphinxTrain, and CMUCLMTK.
![Page 5: Voice Recognition](https://reader033.fdocuments.us/reader033/viewer/2022061108/544fa0e6af7959000a8b5cbe/html5/thumbnails/5.jpg)
SPHINX
• Sphinx is a set of Java classes used in the background to recognize voice.
• It is open source (developed at CMU).
• Sphinx is built on JSAPI (the Java Speech API).
• It uses the HMM algorithm and BNF-style grammars.
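As a hedged illustration of the BNF-style grammars Sphinx consumes (the grammar name and word list here are invented for this sketch, not taken from the slides), a minimal JSGF grammar for spoken digits might look like:

```
#JSGF V1.0;
grammar digits;
public <digit> = zero | one | two | three | four |
                 five | six | seven | eight | nine ;
```

The recognizer constrains its search to utterances this grammar can produce, which is what makes small-vocabulary command systems tractable.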
![Page 6: Voice Recognition](https://reader033.fdocuments.us/reader033/viewer/2022061108/544fa0e6af7959000a8b5cbe/html5/thumbnails/6.jpg)
OVERALL PROCESSING
![Page 7: Voice Recognition](https://reader033.fdocuments.us/reader033/viewer/2022061108/544fa0e6af7959000a8b5cbe/html5/thumbnails/7.jpg)
FEATURE EXTRACTION
• It generates a set of 51-dimensional feature vectors that represent important characteristics of the speech signal.
• It converts the speech waveform into some type of parametric representation.
• A wide range of possibilities exists for parametrically representing the speech signal, such as LPC (Linear Predictive Coding) and MFCC (Mel-Frequency Cepstral Coefficients).
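As a minimal sketch of how such a front end starts (the signal, frame sizes, and pre-emphasis coefficient below are common illustrative choices, not values from the slides; a real MFCC pipeline would continue with windowing, FFT, mel filterbanks, and a DCT):

```python
# Toy first steps of feature extraction: pre-emphasis and framing.

def pre_emphasis(signal, alpha=0.97):
    """Boost high frequencies: y[n] = x[n] - alpha * x[n-1]."""
    return [signal[0]] + [signal[n] - alpha * signal[n - 1]
                          for n in range(1, len(signal))]

def frame(signal, frame_len=400, hop=160):
    """Split into overlapping frames (e.g. 25 ms windows, 10 ms hop at 16 kHz)."""
    return [signal[i:i + frame_len]
            for i in range(0, len(signal) - frame_len + 1, hop)]

samples = [float(n % 50) for n in range(1600)]  # stand-in waveform
frames = frame(pre_emphasis(samples))
```

Each frame would then be reduced to one feature vector, giving the sequence of vectors the recognizer actually sees.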
[Diagram: speech data → feature extraction; text data → training; the trained acoustic model and language model feed the recognition engine, which produces the output.]
![Page 8: Voice Recognition](https://reader033.fdocuments.us/reader033/viewer/2022061108/544fa0e6af7959000a8b5cbe/html5/thumbnails/8.jpg)
Improved Acoustic Model Training
• SphinxTrain is the acoustic model training tool.
![Page 9: Voice Recognition](https://reader033.fdocuments.us/reader033/viewer/2022061108/544fa0e6af7959000a8b5cbe/html5/thumbnails/9.jpg)
Language Model Training
[Diagram: the CMUCLMTK pipeline.
  Text → text2wfreq → word frequencies → wfreq2vocab → vocab
  Text + vocab → text2idngram → id N-gram → idngram2lm → ARPA language model
  lm3g2dmp converts the ARPA model to a binary dump (arpa.dmp); binlm2arpa converts a binary model back to ARPA.]
![Page 10: Voice Recognition](https://reader033.fdocuments.us/reader033/viewer/2022061108/544fa0e6af7959000a8b5cbe/html5/thumbnails/10.jpg)
POCKETSPHINX
• Decoding engine.
• It is used as a set of libraries that provide the core speech recognition functions.
• The input is an audio file in WAV format; the final recognition output is displayed as text.
![Page 11: Voice Recognition](https://reader033.fdocuments.us/reader033/viewer/2022061108/544fa0e6af7959000a8b5cbe/html5/thumbnails/11.jpg)
HIDDEN MARKOV MODEL (HMM)
• The real world has structures and processes which have (or produce) observable outputs:
o Usually sequential (process unfolds over time)
o Cannot see the event producing the output
Example: speech signals
![Page 12: Voice Recognition](https://reader033.fdocuments.us/reader033/viewer/2022061108/544fa0e6af7959000a8b5cbe/html5/thumbnails/12.jpg)
HMM Background
• Basic theory developed and published in the 1960s and 70s
• No widespread understanding or application until the late 80s
• A few reasons:
  – The theory was published in mathematics journals that were not widely read by practicing engineers
  – There was insufficient tutorial material for readers to understand and apply the concepts
![Page 13: Voice Recognition](https://reader033.fdocuments.us/reader033/viewer/2022061108/544fa0e6af7959000a8b5cbe/html5/thumbnails/13.jpg)
HMM Overview
• Machine learning method
• Makes use of state machines
• Based on probabilistic model
• Can only observe output from states, not the states themselves
  – Example: speech recognition
    • Observed: acoustic signals
    • Hidden states: phonemes (the distinctive sounds of a language)
![Page 14: Voice Recognition](https://reader033.fdocuments.us/reader033/viewer/2022061108/544fa0e6af7959000a8b5cbe/html5/thumbnails/14.jpg)
HMM Components
• A set of states (x’s)
• A set of possible output symbols (y’s)
• A state transition matrix (a’s): probability of making a transition from one state to the next
• An output emission matrix (b’s): probability of emitting/observing a symbol at a particular state
• An initial probability vector:
  o Probability of starting at a particular state
  o Not shown; sometimes assumed to be 1
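The components above are enough to compute the probability of an observation sequence via the forward algorithm. A minimal sketch (the states, symbols, and all probabilities below are invented for illustration):

```python
# Minimal HMM: states, output symbols, transition matrix A,
# emission matrix B, initial vector pi. forward() sums over all
# hidden state paths to get P(observations | model).

def forward(obs, states, pi, A, B):
    """Return P(obs) under the model."""
    alpha = {s: pi[s] * B[s][obs[0]] for s in states}
    for o in obs[1:]:
        alpha = {s: sum(alpha[p] * A[p][s] for p in states) * B[s][o]
                 for s in states}
    return sum(alpha.values())

states = ["hot", "cold"]
pi = {"hot": 0.6, "cold": 0.4}
A = {"hot":  {"hot": 0.7, "cold": 0.3},
     "cold": {"hot": 0.4, "cold": 0.6}}
B = {"hot":  {"1": 0.2, "2": 0.4, "3": 0.4},
     "cold": {"1": 0.5, "2": 0.4, "3": 0.1}}

p = forward(["3", "1", "3"], states, pi, A, B)
```

In speech recognition the hidden states would be phonemes and the observations acoustic feature vectors, but the recursion is the same.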
![Page 15: Voice Recognition](https://reader033.fdocuments.us/reader033/viewer/2022061108/544fa0e6af7959000a8b5cbe/html5/thumbnails/15.jpg)
Observable Markov Model Example
• Weather:
  o Once each day the weather is observed:
    State 1: rainy
    State 2: cloudy
    State 3: sunny
  o What is the probability that the weather for the next 7 days will be: sun, sun, rain, rain, sun, cloudy, sun?
  o Each state corresponds to a physical, observable event.

State transition probabilities (row = today, column = tomorrow):

|        | Rainy | Cloudy | Sunny |
|--------|-------|--------|-------|
| Rainy  | 0.4   | 0.3    | 0.3   |
| Cloudy | 0.2   | 0.6    | 0.2   |
| Sunny  | 0.1   | 0.1    | 0.8   |
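Because this model is fully observable, the sequence probability is just a product of transition probabilities. Assuming (as an illustration) that day 1 is sunny with probability 1:

```python
# Probability of an observed weather sequence in the observable
# Markov model above. Rows of A are "from" states, columns "to".
A = {
    "rainy":  {"rainy": 0.4, "cloudy": 0.3, "sunny": 0.3},
    "cloudy": {"rainy": 0.2, "cloudy": 0.6, "sunny": 0.2},
    "sunny":  {"rainy": 0.1, "cloudy": 0.1, "sunny": 0.8},
}

def sequence_probability(days, A):
    """Take P(day 1) = 1 and multiply transition probabilities along the chain."""
    p = 1.0
    for prev, cur in zip(days, days[1:]):
        p *= A[prev][cur]
    return p

seq = ["sunny", "sunny", "rainy", "rainy", "sunny", "cloudy", "sunny"]
p = sequence_probability(seq, A)
# 0.8 * 0.1 * 0.4 * 0.3 * 0.1 * 0.2 ≈ 1.92e-4
```

In a *hidden* Markov model the same product would additionally be weighted by emission probabilities and summed over state paths.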
![Page 16: Voice Recognition](https://reader033.fdocuments.us/reader033/viewer/2022061108/544fa0e6af7959000a8b5cbe/html5/thumbnails/16.jpg)
Common HMM Types
• Ergodic (fully connected):
  o Every state of the model can be reached in a single step from every other state of the model
• Bakis (left-right):
  o As time increases, states proceed from left to right
![Page 17: Voice Recognition](https://reader033.fdocuments.us/reader033/viewer/2022061108/544fa0e6af7959000a8b5cbe/html5/thumbnails/17.jpg)
HMM Advantages
• Advantages:
o Effective
o Can handle variations in record structure:
  – Optional fields
  – Varying field ordering
![Page 18: Voice Recognition](https://reader033.fdocuments.us/reader033/viewer/2022061108/544fa0e6af7959000a8b5cbe/html5/thumbnails/18.jpg)
HMM Uses
• Speech recognition: recognizing spoken words and phrases
• Text processing: parsing raw records into structured records
• Bioinformatics: protein sequence prediction
• Financial:
  o Stock market forecasts (price pattern prediction)
  o Comparison shopping services
![Page 19: Voice Recognition](https://reader033.fdocuments.us/reader033/viewer/2022061108/544fa0e6af7959000a8b5cbe/html5/thumbnails/19.jpg)
THE LEXICAL ACCESS COMPONENT OF THE CMU CONTINUOUS SPEECH RECOGNITION SYSTEM
• The CMU Lexical Access System hypothesizes words from a phonetic dictionary.
• Word hypotheses are anchored on syllabic nuclei and are generated independently for different parts of the utterance.

Example:

| Word      | Syllabic nucleus |
|-----------|------------------|
| cat [kæt] | [æ]              |
![Page 20: Voice Recognition](https://reader033.fdocuments.us/reader033/viewer/2022061108/544fa0e6af7959000a8b5cbe/html5/thumbnails/20.jpg)
[Word Hypothesizer system diagram: Front End → Coarse Labeler and Lattice Integrator → Anchor Generator → Matcher (with the Lexicon) → Verifier → Parser.]
![Page 21: Voice Recognition](https://reader033.fdocuments.us/reader033/viewer/2022061108/544fa0e6af7959000a8b5cbe/html5/thumbnails/21.jpg)
MATCHING ENGINE
• Words are hypothesized by matching an input sequence of labels against the stored representations of the possible pronunciations.
• It uses the beam search algorithm, a modified best-first search strategy.
• Beam search can simultaneously search paths of different lengths.
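A toy sketch of the beam idea (the label scores below are invented; a real matcher scores phonetic labels against pronunciation networks): at each step only the best few partial hypotheses survive, instead of all of them.

```python
import math

# Toy beam search over per-step label scores: prune to the
# `beam_width` best partial paths at every time step.

def beam_search(step_scores, beam_width=2):
    """step_scores: list of {label: prob} dicts, one per time step.
    Returns (best_path, best_log_prob)."""
    beam = [([], 0.0)]  # (path, cumulative log-probability)
    for scores in step_scores:
        candidates = [(path + [label], lp + math.log(p))
                      for path, lp in beam
                      for label, p in scores.items()]
        candidates.sort(key=lambda c: c[1], reverse=True)
        beam = candidates[:beam_width]  # prune to the beam
    return beam[0]

steps = [{"k": 0.7, "g": 0.3},
         {"ae": 0.6, "eh": 0.4},
         {"t": 0.8, "d": 0.2}]
path, logp = beam_search(steps)
```

Pruning trades exactness for speed: a hypothesis that starts poorly but ends well can fall out of the beam, which is why beam width is a tuning parameter.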
![Page 22: Voice Recognition](https://reader033.fdocuments.us/reader033/viewer/2022061108/544fa0e6af7959000a8b5cbe/html5/thumbnails/22.jpg)
THE LEXICON
• The lexicon (dictionary) is stored in the form of a phonetic network.
• The sources of pronunciations that have been used:
  o An on-line phonetic dictionary, such as the Shop Dictionary.
  o A letter-to-sound compiler (the Talk System).
• The current CMU lexicon is constructed using a base of over 150 rules covering several types of phenomena:
  o Including co-articulatory phenomena.
  o Front-end characteristics.
![Page 23: Voice Recognition](https://reader033.fdocuments.us/reader033/viewer/2022061108/544fa0e6af7959000a8b5cbe/html5/thumbnails/23.jpg)
ANCHOR GENERATION
• To eliminate unnecessary matches, the voice recognition system uses syllable anchors to select locations in an utterance where words are to be hypothesized.
• The anchor generation algorithm is straightforward and is based on the following reasoning:
  o Words are composed of syllables, and every syllable contains a vocalic center.
  o Word divisions cannot occur inside a vocalic center.
  o The coarse labeler provides information about vocalic, non-vocalic, and silent regions.
• The algorithm is implemented in such a way that the “best” hypotheses will be generated.
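The reasoning above suggests a very simple sketch: treat each maximal vocalic run in the coarse labels as an anchor region. The per-frame label alphabet here ('V' vocalic, 'N' non-vocalic, 'S' silence) is an assumption for illustration, not the system's actual encoding:

```python
# Toy anchor generator: return each maximal vocalic run in the
# coarse-label string as an anchor region (start, end), end exclusive.

def anchors(labels):
    regions, start = [], None
    for i, lab in enumerate(labels):
        if lab == "V" and start is None:
            start = i                      # vocalic run begins
        elif lab != "V" and start is not None:
            regions.append((start, i))     # vocalic run ends
            start = None
    if start is not None:
        regions.append((start, len(labels)))
    return regions

regs = anchors("SSNNVVVNNSVVNN")
```

Each region then becomes a location where the matcher is invoked, which is exactly how unnecessary matches elsewhere in the utterance are avoided.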
![Page 24: Voice Recognition](https://reader033.fdocuments.us/reader033/viewer/2022061108/544fa0e6af7959000a8b5cbe/html5/thumbnails/24.jpg)
ANCHORS ARE USED IN THE SYSTEM IN TWO MODES:
Single anchor:
o In single anchor mode, anchors of different lengths are generated and the matcher is invoked separately for each one. Although this procedure is simple, it is inefficient.
Multiple anchor:
o Multiple anchor mode reduces the computation and also reduces the number of hypotheses generated.
![Page 25: Voice Recognition](https://reader033.fdocuments.us/reader033/viewer/2022061108/544fa0e6af7959000a8b5cbe/html5/thumbnails/25.jpg)
COARSE LABELER
• The coarse labeling algorithm is based on the ZAPDASH (Zero-crossing And Peak to peak amplitude of Differenced And Smoothed data) algorithm.
• The algorithm is robust and speaker independent, and operates reliably over a large dynamic range.
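A sketch of the two raw measurements ZAPDASH is named for, computed per frame (the frame size is an arbitrary choice here, and the thresholding that turns these numbers into vocalic/non-vocalic/silence labels is omitted):

```python
# Per-frame zero-crossing counts and peak-to-peak amplitudes,
# the raw inputs to a ZAPDASH-style coarse labeler.

def zero_crossings(frame):
    """Count sign changes between adjacent samples."""
    return sum(1 for a, b in zip(frame, frame[1:])
               if (a >= 0) != (b >= 0))

def peak_to_peak(frame):
    """Amplitude range within the frame."""
    return max(frame) - min(frame)

def coarse_measurements(signal, frame_len=160):
    frames = [signal[i:i + frame_len]
              for i in range(0, len(signal) - frame_len + 1, frame_len)]
    return [(zero_crossings(f), peak_to_peak(f)) for f in frames]
```

Intuitively, high amplitude with few zero crossings suggests a vocalic region, many zero crossings suggest frication, and low amplitude suggests silence; both measures are cheap and largely level-independent, which fits the robustness claim above.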
![Page 26: Voice Recognition](https://reader033.fdocuments.us/reader033/viewer/2022061108/544fa0e6af7959000a8b5cbe/html5/thumbnails/26.jpg)
PHONETIC LATTICE INTEGRATOR
• The phonetic labels produced by the front-end are grouped into four separate lattices: vowels, fricatives, closures and stops.
• The role of the integrator is to combine these separate streams and produce a single lattice consisting of non-overlapping segments.
• The integrator maps the label space used by the front-end into the label space used in the lexicon.
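A toy sketch of the integration step (the scoring scheme and labels are invented; the real integrator also remaps label spaces): cut the time axis at every segment boundary and keep the best-scoring label in each slice, so the output segments cannot overlap.

```python
# Toy lattice integrator: combine possibly overlapping labeled
# segments into one sequence of non-overlapping segments.

def integrate(segments):
    """segments: list of (start, end, label, score).
    Returns non-overlapping (start, end, label) slices."""
    cuts = sorted({t for s, e, _, _ in segments for t in (s, e)})
    out = []
    for a, b in zip(cuts, cuts[1:]):
        covering = [seg for seg in segments if seg[0] <= a and seg[1] >= b]
        if covering:
            best = max(covering, key=lambda seg: seg[3])  # highest score wins
            out.append((a, b, best[2]))
    return out

segs = [(0, 10, "vowel", 0.9), (8, 14, "fricative", 0.7)]
merged = integrate(segs)
```

Where the vowel and fricative segments overlap (frames 8–10), the higher-scoring vowel label wins, and the result is a single non-overlapping lattice as described above.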
![Page 27: Voice Recognition](https://reader033.fdocuments.us/reader033/viewer/2022061108/544fa0e6af7959000a8b5cbe/html5/thumbnails/27.jpg)
JUNCTION VERIFIER
• The verifier examines junctures between words and determines whether those words can be connected together in sequence.
• The verifier deals with three classes of junctures:
  o Abutments
  o Gaps
  o Overlaps
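Treating each word hypothesis as a (start, end) time interval, the three classes reduce to a comparison of the first word's end against the second word's start. A minimal sketch (a real verifier would additionally apply per-class acoustic checks, omitted here):

```python
# Classify the juncture between two word hypotheses given as
# (start, end) intervals, with `second` starting at or after `first`.

def juncture(first, second):
    if first[1] == second[0]:
        return "abutment"   # words touch exactly
    if first[1] < second[0]:
        return "gap"        # unexplained region between them
    return "overlap"        # hypotheses claim the same frames
```

For example, hypotheses ending at frame 50 and starting at frame 50 abut; a start at frame 55 leaves a gap; a start at frame 45 overlaps.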
![Page 28: Voice Recognition](https://reader033.fdocuments.us/reader033/viewer/2022061108/544fa0e6af7959000a8b5cbe/html5/thumbnails/28.jpg)
CONCLUSION
• This is not nearly detailed enough to actually write a speech recognizer, but it exposes the basic concepts.
• The basic concepts covered here for implementing speech recognition are:
  1. Sphinx
  2. The Lexical Access System
  3. Hidden Markov Models
• Real-life implementations of these techniques are still in the development phase, while some have been successfully launched.
• Example: Winvoice, using Sphinx.
![Page 29: Voice Recognition](https://reader033.fdocuments.us/reader033/viewer/2022061108/544fa0e6af7959000a8b5cbe/html5/thumbnails/29.jpg)
REFERENCES
• Alexander I. Rudnicky, Lynn K. Baumeister, Kevin H. DeGraaf, “The Lexical Access Component of the CMU Continuous Speech Recognition System,” pp. 376–379, 1987, IEEE.
• Yun Wang, Xueying Zhang, “Realization of Mandarin Continuous Digits Speech Recognition,” pp. 378–380, 2010, IEEE.
• Todd A. Stephenson, “Speech Recognition with Auxiliary Information,” pp. 189–203, 2004, IEEE.