SPEECH DESCRIPTORS GENERATION SOFTWARE UTILIZED FOR CLASSIFICATION AND RECOGNITION PURPOSES
Lukasz Laszko ([email protected])
Department of Biomedical Engineering, Faculty of Electronics, Telecommunications and Informatics,
Gdansk University of Technology
Goals
1. Create software components for speech signal descriptors extraction
- low-level descriptors utilized in ASR
- high-level descriptors utilized in SDR
2. Compare different speech recognition engines
- define and describe comparison criteria
- analyze different ASR methodologies
3. Create software components for spoken content descriptors indexing and retrieval
- MPEG-7 SCD extraction
- SCD comparison methods
4. Provide a sample SDR-based medical application (suggestion)
Definitions
spoken content - a piece of information consisting of the actual words spoken in the speech segments of an audio stream. As speech represents the primary means of human communication, a significant amount of the usable information enclosed in audiovisual documents may reside in the spoken content.
speech recognition - converts spoken words to machine-readable input (for example, the binary code for a string of character codes). The term voice recognition may also be used to refer to speech recognition, but more precisely refers to speaker recognition, which attempts to identify the person speaking, as opposed to what is being said.
automatic speech recognition (ASR) - an automated computer solution for speech recognition routines.
spoken document / content retrieval (SDR) - application of the SpokenContent tool which aims at retrieving information from speech signals based on their extracted content.
Definitions
Acoustic Model - a file that contains statistical representations of each distinct sound that makes up a spoken word (called phonemes). It must contain the sounds for each word used in your grammar (or language model).
Phoneme - in human language, the smallest structural unit that distinguishes meaning.
Speech Recognition Engine (SRE)
Components:
- Speech Decoder
- Grammar - contains sets of predefined combinations of words
- Language Model - a very large list of words and their probability of occurrence in a given sequence (LVCSR)
- Acoustic Model - a statistical representation of the distinct sounds that make up each word in the Language Model or Grammar
Types of recognition engines:
- Connected Word Recognition (CWR)
- Large-Vocabulary Continuous Speech Recognition (LVCSR)
- Automatic Phonetic Transcription (APT)
- Keyword Spotting (KS)
Speech Recognition Engine (SRE)
A parametric representation X (called acoustic observation) of speech acoustic properties is extracted from the input signal A. The acoustic observation X is matched against a set of predefined acoustic models. Each model represents one of the symbols used by the system for describing the spoken language of the application (e.g. words, syllables or phonemes). The best-scoring models determine the output sequence of symbols.
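A toy sketch of this matching step, assuming each acoustic model is reduced to a single 1-D Gaussian over a scalar observation (an illustrative simplification; real engines score HMMs over feature vectors, but the best-scoring-model selection is the same idea):

```python
import math

# Hypothetical models: (mean, variance) of a single acoustic feature per symbol.
models = {"a": (700.0, 100.0), "i": (300.0, 80.0), "u": (350.0, 60.0)}

def log_score(x, mean, var):
    # Log-likelihood of observation x under a 1-D Gaussian model
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def recognize(observations):
    # For each observation, output the symbol of the best-scoring model
    return [max(models, key=lambda s: log_score(x, *models[s])) for x in observations]

symbols = recognize([690.0, 310.0])
```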
Automatic speech recognition – acoustic analysis
1. The analogue signal is first digitized. The sampling rate depends on the particular application requirements. The most common sampling rate is 16 kHz (one sample every 62.5 µs).
2. A high-pass filter, also called pre-emphasis, is often used to emphasize the high frequencies.
3. The digital signal is segmented into successive, regularly spaced time intervals called acoustic frames. Time frames overlap each other. Typically, a frame duration is between 20 and 40 ms, with an overlap of 50%.
4. Each frame is multiplied by a windowing function (e.g. Hanning).
5. The frequency spectrum of each single frame is obtained through a Fourier transform.
6. A vector of coefficients x, called an observation vector, is extracted from the spectrum. It is a compact representation of the spectral properties of the frame.
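The framing, windowing and spectrum steps (3-6) can be sketched as follows; the 16 kHz rate, 25 ms frame and 50% overlap are example values picked from the ranges above:

```python
import numpy as np

def analyze(signal, fs=16000, frame_ms=25, overlap=0.5):
    """Split a digitized signal into overlapping Hanning-windowed frames
    and return the magnitude spectrum of each frame (steps 3-6 above)."""
    frame_len = int(fs * frame_ms / 1000)           # 400 samples at 16 kHz
    hop = int(frame_len * (1 - overlap))            # 50% overlap -> hop of 200
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    spectra = []
    for i in range(n_frames):
        frame = signal[i * hop : i * hop + frame_len] * window
        spectra.append(np.abs(np.fft.rfft(frame)))  # basis of the observation vector
    return np.array(spectra)

# One second of a 440 Hz tone sampled at 16 kHz
t = np.arange(16000) / 16000.0
spectra = analyze(np.sin(2 * np.pi * 440 * t))
```

Each row of `spectra` would then be reduced further (e.g. to LPCCs or MFCCs) to form the compact observation vector.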
Automatic speech recognition – acoustic analysis
Coefficient vector types:
- linear prediction cepstrum coefficients (LPCCs)
- mel-frequency cepstral coefficients (MFCCs)
Cepstrum is the result of taking the Fourier transform (FT) of the decibel spectrum as if it were a signal. Its name was derived by reversing the first four letters of "spectrum". There are a complex cepstrum and a real cepstrum.
Definitions: * mathematically: cepstrum of signal = FT(log(|FT(the signal)|) + j2πm), where m is the integer required to properly unwrap the angle or imaginary part of the complex log function
* algorithmically: signal → FT → abs() → log → phase unwrapping → FT → cepstrum
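A minimal sketch of this chain for the real cepstrum, which skips the phase-unwrapping step (needed only for the complex cepstrum); the final transform is taken as an inverse FT here, a common convention. For a periodic, harmonic-rich signal the cepstrum peaks at a quefrency equal to the pitch period:

```python
import numpy as np

def real_cepstrum(signal):
    """Real cepstrum: FT -> abs() -> log -> inverse FT."""
    spectrum = np.abs(np.fft.fft(signal))
    # Small epsilon avoids log(0) at bins with no energy
    return np.fft.ifft(np.log(spectrum + 1e-12)).real

t = np.arange(1024)
# Harmonic-rich periodic signal with a fundamental period of 64 samples
x = sum(np.sin(2 * np.pi * k * t / 64) for k in range(1, 6))
c = real_cepstrum(x)
# Peak quefrency (searched away from the origin) recovers the pitch period
period = int(np.argmax(c[20:100])) + 20
```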
Automatic speech recognition – decoding
Hidden Markov model (HMM):
+ statistical model
+ speech signal modeled as a piecewise stationary or short-time stationary signal
+ can be trained automatically

Dynamic time warping (DTW):
+ historically used for speech recognition
+ displaced by HMM
+ measures similarity between two sequences which may vary in time or speed
+ speech-speed independent
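A minimal DTW sketch over 1-D sequences, showing the speed-independence property (real recognizers warp sequences of feature vectors, not scalars):

```python
import numpy as np

def dtw_distance(a, b):
    """Classic dynamic-time-warping distance: minimum-cost monotonic
    alignment of two sequences, so sequences that differ only in
    speed score as similar."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)   # cumulative cost matrix
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three allowed predecessor paths
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

slow = [1, 1, 2, 2, 3, 3]   # same pattern "spoken" at half speed
fast = [1, 2, 3]
```

Here `dtw_distance(slow, fast)` is 0: the warping path absorbs the speed difference entirely.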
Automatic speech recognition – existing software
Free software:
- XVoice
- CVoiceControl/kVoiceControl
- Open Mind Speech
- GVoice
- ISIP
- CMU Sphinx
- Julius
- Ears
- NICO ANN Toolkit
- Myers' Hidden Markov Model Software
Commercial Software:
- IBM ViaVoice
- Microsoft SAPI
- Vocalis Speechware
- Babel Technologies
- SpeechWorks
- Nuance
- Abbot
- Entropic
Automatic speech recognition – performance measurements
Types of errors:
• Substitution errors - when a symbol in the reference transcription was substituted with a different one in the recognized transcription.
• Deletion errors - when a reference symbol has been omitted in the recognized transcription.
• Insertion errors - when the system recognized a symbol not contained in the reference transcription.
Measurements:
• Error rate
• Accuracy
LVCSR => Accuracy > 90%
IVR => Error Rate ~= 40%
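The error rate is typically computed by aligning the recognized transcription against the reference with a word-level edit distance, which counts exactly the three error types above; a minimal sketch:

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference length,
    via word-level Levenshtein alignment."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                      # i deletion errors
    for j in range(len(hyp) + 1):
        d[0][j] = j                      # j insertion errors
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)

# One substitution ("the" -> "a") out of six reference words
wer = word_error_rate("the cat sat on the mat", "the cat sat on a mat")
```

Accuracy is then simply 1 minus the error rate.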
MPEG-7
MPEG-7 uses the following tools:
Descriptor (D): a representation of a feature, defined syntactically and semantically. A single object may be described by several descriptors.
Description Schemes (DS): specify the structure and semantics of the relations between their components; these components can be descriptors (D) or description schemes (DS).
Description Definition Language (DDL): based on XML, used to define the structural relations between descriptors. It allows the creation and modification of description schemes and also the creation of new descriptors (D).
System tools: deal with binarization, synchronization, transport and storage of descriptors, as well as Intellectual Property protection.
MPEG-7 is a multimedia content description standard. This description will be associated with the content itself, to allow fast and efficient searching for material that is of interest to the user. MPEG-7 is formally called Multimedia Content Description Interface. Thus, it is not a standard which deals with the actual encoding of moving pictures and audio, like MPEG-1, MPEG-2 and MPEG-4. It uses XML to store metadata, and can be attached to timecode in order to tag particular events, or synchronise lyrics to a song, for example.
MPEG-7 - Spoken content description
MPEG-7 Spoken Content Document (SCD)
* SpokenContentHeader
  - WordLexicon
  - PhoneLexicon
  - DescriptionMetadata
  - SpeakerInfo
* SpokenContentLattice
  - Blocks
  - Nodes
  - Links (reference, probability, nodeOffset, acousticScore)
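A rough illustration of this structure, built with Python's ElementTree. The element and attribute names are taken from the list above, but the real MPEG-7 schema (namespaces, required attributes, lexicon contents) is considerably richer than this sketch:

```python
import xml.etree.ElementTree as ET

# Skeleton of an SCD document; tag names follow the header/lattice
# outline above, not the full normative MPEG-7 schema.
root = ET.Element("SpokenContent")
header = ET.SubElement(root, "SpokenContentHeader")
ET.SubElement(header, "WordLexicon")
ET.SubElement(header, "PhoneLexicon")
ET.SubElement(header, "SpeakerInfo")
lattice = ET.SubElement(root, "SpokenContentLattice")
block = ET.SubElement(lattice, "Block")
node = ET.SubElement(block, "Node")
# Each link carries the recognizer's scores for one lattice edge
ET.SubElement(node, "Link",
              {"probability": "0.9", "nodeOffset": "1", "acousticScore": "-42.0"})
xml_text = ET.tostring(root, encoding="unicode")
```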
Spoken content retrieval – general system structure
Spoken content retrieval – implementation
System components (Service Oriented Architecture):
- SDR client
- SDR server
- ASR server
- SDR DB
- External systems
Spoken content retrieval – implementation
ASR Server architecture (status: implemented):
- ASR Engine: CMU Sphinx-4
- Metadata database with ORM mapper
- Multithread execution pool: Java Concurrency Framework
- Web service: JAX-WS 2.1 with WSIT on Apache Jetty 6
- Network: SOAP + MTOM over HTTPS
Spoken content retrieval – implementation
SDR Server architecture (status: under development):
- External services agents
- ASR Connector
- Services FrontEnd
- Diagnostic portal
- Data Access Logic / Workflow Runtime
- Data Connector
- Indexing Service
- Search Service
- Indexing Workflow
- Search Workflow
- SCD Database
- Network
Spoken content retrieval – implementation
SDR Client architecture:
- Spoken Content Recording
- Media Converter
DICOM voice Q&R – indexing
Components:
- DICOM Client with Voice annotation/query plug-in
- Spoken Content Query converter to DICOM Query/Retrieve
- Spoken Content Descriptors
- Clinical Image and Object Management Server
Indexing flow: the client requests an undescribed image from the PACS (CI&OM server); a spoken description of the new image is recorded as an audio file (wav or mp3) and converted into an MPEG-7 SCD document.
DICOM voice Q&R – query
Components:
- DICOM Client with Voice annotation/query plug-in
- Spoken Content Query converter to DICOM Query/Retrieve
- Spoken Content Descriptors
- Clinical Image and Object Management Server
Query flow: a spoken or text query (an audio file, wav or mp3, or query text) is converted to Spoken Content Descriptors and matched against stored documents; the best matching SCD document yields the annotated image location, which is fetched via DICOM Q&R and returned as the search result.
Execution steps – table of contents
0 Abstract
1 Introduction
2 Automatic Speech Recognition and Spoken Document Retrieval
3 System structure
4 Voice and speech sources
5 Speech signal features extraction and recognition details
6 Spoken content description
7 Document query
8 Bibliography
9 Appendixes
Bibliography
Hyoung-Gook Kim, Nicolas Moreau, Thomas Sikora
MPEG-7 Audio and Beyond: Audio Content Indexing and Retrieval
John Wiley & Sons Ltd, 2005
Gopala Krishna A
Building ASR and TTS Systems: Building ASR Systems using Sphinx
Carnegie Mellon University, 2007
Arthur Chan, Evandro Gouvêa, Rita Singh
The Hieroglyphs: Building Speech Applications Using CMU Sphinx
Carnegie Mellon University, 2007
Lee Begeja, Bernard Renger, Murat Saraclar, David Gibbon, Zhu Liu, Behzad Shahraray
A System for Searching and Browsing Spoken Communications
AT&T Labs – Research, 2004
Frank Seide, Peng Yu, Chengyuan Ma, and Eric Chang
Vocabulary-Independent Search in Spontaneous Speech
Microsoft Research Asia, 2004
Ciprian Chelba
Spoken Document Retrieval and Browsing
Google, 2007
Jason Price
Oracle Database 11g SQL: Master SQL and PL/SQL in the Oracle Database
Oracle Press, 2008
http://en.wikipedia.org/wiki/Speech_recognition
http://tldp.org/HOWTO/Speech-Recognition-HOWTO/software.html