SPEECH DESCRIPTORS GENERATION SOFTWARE UTILIZED FOR CLASSIFICATION AND RECOGNITION PURPOSES
Lukasz Laszko ([email protected])
Department of Biomedical Engineering, Faculty of Electronics, Telecommunications and Informatics,
Gdansk University of Technology
Goals
1. Create software components for speech signal descriptors extraction
- low-level descriptors utilized in ASR
- high-level descriptors utilized in SDR
2. Compare different speech recognition engines
- define and describe comparison criteria
- analyze different ASR methodologies
3. Create software components for spoken content descriptors indexing and retrieval
- MPEG-7 SCD extraction
- SCD comparison methods
4. Provide a sample SDR-based medical application (suggestion)
Definitions
spoken content - a piece of information consisting of the actual words spoken in the speech segments of an audio stream. As speech represents the primary means of human communication, a significant amount of the usable information enclosed in audiovisual documents may reside in the spoken content.
speech recognition - converts spoken words to machine-readable input (for example, the binary code for a string of character codes). The term voice recognition may also be used to refer to speech recognition, but more precisely refers to speaker recognition, which attempts to identify the person speaking, as opposed to what is being said.
automatic speech recognition (ASR) - an automated computer solution for speech recognition routines.
spoken document / content retrieval (SDR) - application of the SpokenContent tool which aims at retrieving information from speech signals based on their extracted content.
Definitions
Acoustic Model - a file that contains statistical representations of each distinct sound that makes up a spoken word (called phonemes). It must contain the sounds for each word used in your grammar (or language model).
Phoneme - in human language, the smallest structural unit that distinguishes meaning.
Speech Recognition Engine (SRE)
Components:
- Speech Decoder
- Grammar - contains sets of predefined combinations of words
- Language Model - a very large list of words and their probability of occurrence in a given sequence (LVCSR)
- Acoustic Model - a statistical representation of the distinct sounds that make up each word in the Language Model or Grammar
Types of recognition engines:
- Connected Word Recognition (CWR)
- Large-Vocabulary Continuous Speech Recognition (LVCSR)
- Automatic Phonetic Transcription (APT)
- Keyword Spotting (KS)
Speech Recognition Engine (SRE)
A parametric representation X (called acoustic observation) of speech acoustic properties is extracted from the input signal A. The acoustic observation X is matched against a set of predefined acoustic models. Each model represents one of the symbols used by the system for describing the spoken language of the application (e.g. words, syllables or phonemes). The best-scoring models determine the output sequence of symbols.
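A toy sketch of this matching step, assuming each acoustic model is reduced to a single 1-D Gaussian over a scalar observation (an illustrative simplification; real engines score HMMs over feature vectors, but the best-scoring-model selection is the same idea):

```python
import math

# Hypothetical models: (mean, variance) of a single acoustic feature per symbol.
models = {"a": (700.0, 100.0), "i": (300.0, 80.0), "u": (350.0, 60.0)}

def log_score(x, mean, var):
    # Log-likelihood of observation x under a 1-D Gaussian model
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def recognize(observations):
    # For each observation, output the symbol of the best-scoring model
    return [max(models, key=lambda s: log_score(x, *models[s])) for x in observations]

symbols = recognize([690.0, 310.0])
```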
Automatic speech recognition – acoustic analysis
1. The analogue signal is first digitized. The sampling rate depends on the particular application requirements. The most common sampling rate is 16 kHz (one sample every 62.5 µs).
2. A high-pass filter, also called pre-emphasis, is often used to emphasize the high frequencies.
3. The digital signal is segmented into successive, regularly spaced time intervals called acoustic frames. Time frames overlap each other. Typically, a frame duration is between 20 and 40 ms, with an overlap of 50%.
4. Each frame is multiplied by a windowing function (e.g. Hanning).
5. The frequency spectrum of each single frame is obtained through a Fourier transform.
6. A vector of coefficients x, called an observation vector, is extracted from the spectrum. It is a compact representation of the spectral properties of the frame.
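The framing, windowing and spectrum steps (3-6) can be sketched as follows; the 16 kHz rate, 25 ms frame and 50% overlap are example values picked from the ranges above:

```python
import numpy as np

def analyze(signal, fs=16000, frame_ms=25, overlap=0.5):
    """Split a digitized signal into overlapping Hanning-windowed frames
    and return the magnitude spectrum of each frame (steps 3-6 above)."""
    frame_len = int(fs * frame_ms / 1000)           # 400 samples at 16 kHz
    hop = int(frame_len * (1 - overlap))            # 50% overlap -> hop of 200
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    spectra = []
    for i in range(n_frames):
        frame = signal[i * hop : i * hop + frame_len] * window
        spectra.append(np.abs(np.fft.rfft(frame)))  # basis of the observation vector
    return np.array(spectra)

# One second of a 440 Hz tone sampled at 16 kHz
t = np.arange(16000) / 16000.0
spectra = analyze(np.sin(2 * np.pi * 440 * t))
```

Each row of `spectra` would then be reduced further (e.g. to LPCCs or MFCCs) to form the compact observation vector.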
Automatic speech recognition – acoustic analysis
Coefficient vector types:
- linear prediction cepstrum coefficients (LPCCs)
- mel-frequency cepstral coefficients (MFCCs)
Cepstrum is the result of taking the Fourier transform (FT) of the decibel spectrum as if it were a signal. Its name was derived by reversing the first four letters of "spectrum". There are a complex cepstrum and a real cepstrum.
Definitions: * mathematically: cepstrum of signal = FT(log(|FT(the signal)|) + j2πm), where m is the integer required to properly unwrap the angle or imaginary part of the complex log function
* algorithmically: signal → FT → abs() → log → phase unwrapping → FT → cepstrum
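A minimal sketch of this chain for the real cepstrum, which skips the phase-unwrapping step (needed only for the complex cepstrum); the final transform is taken as an inverse FT here, a common convention. For a periodic, harmonic-rich signal the cepstrum peaks at a quefrency equal to the pitch period:

```python
import numpy as np

def real_cepstrum(signal):
    """Real cepstrum: FT -> abs() -> log -> inverse FT."""
    spectrum = np.abs(np.fft.fft(signal))
    # Small epsilon avoids log(0) at bins with no energy
    return np.fft.ifft(np.log(spectrum + 1e-12)).real

t = np.arange(1024)
# Harmonic-rich periodic signal with a fundamental period of 64 samples
x = sum(np.sin(2 * np.pi * k * t / 64) for k in range(1, 6))
c = real_cepstrum(x)
# Peak quefrency (searched away from the origin) recovers the pitch period
period = int(np.argmax(c[20:100])) + 20
```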
Automatic speech recognition – decoding
Hidden Markov model (HMM):
+ statistical model
+ speech signal modeled as a piecewise stationary or short-time stationary signal
+ can be trained automatically

Dynamic time warping (DTW):
+ historically used for speech recognition
+ displaced by HMM
+ measures similarity between two sequences which may vary in time or speed
+ speech-speed independent
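A minimal DTW sketch over 1-D sequences, showing the speed-independence property (real recognizers warp sequences of feature vectors, not scalars):

```python
import numpy as np

def dtw_distance(a, b):
    """Classic dynamic-time-warping distance: minimum-cost monotonic
    alignment of two sequences, so sequences that differ only in
    speed score as similar."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)   # cumulative cost matrix
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three allowed predecessor paths
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

slow = [1, 1, 2, 2, 3, 3]   # same pattern "spoken" at half speed
fast = [1, 2, 3]
```

Here `dtw_distance(slow, fast)` is 0: the warping path absorbs the speed difference entirely.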
Automatic speech recognition – existing software
Free software:
- XVoice
- CVoiceControl/kVoiceControl
- Open Mind Speech
- GVoice
- ISIP
- CMU Sphinx
- Julius
- Ears
- NICO ANN Toolkit
- Myers' Hidden Markov Model Software
Commercial Software:
- IBM ViaVoice
- Microsoft SAPI
- Vocalis Speechware
- Babel Technologies
- SpeechWorks
- Nuance
- Abbot
- Entropic
Automatic speech recognition – performance measurements
Types of errors:
• Substitution errors - when a symbol in the reference transcription was substituted with a different one in the recognized transcription.
• Deletion errors - when a reference symbol has been omitted in the recognized transcription.
• Insertion errors - when the system recognized a symbol not contained in the reference transcription.
Measurements:
• Error rate
• Accuracy
LVCSR => Accuracy > 90%
IVR => Error Rate ~= 40%
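The error rate is typically computed by aligning the recognized transcription against the reference with a word-level edit distance, which counts exactly the three error types above; a minimal sketch:

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference length,
    via word-level Levenshtein alignment."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                      # i deletion errors
    for j in range(len(hyp) + 1):
        d[0][j] = j                      # j insertion errors
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)

# One substitution ("the" -> "a") out of six reference words
wer = word_error_rate("the cat sat on the mat", "the cat sat on a mat")
```

Accuracy is then simply 1 minus the error rate.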
MPEG-7
MPEG-7 uses the following tools:
Descriptor (D): a representation of a feature, defined syntactically and semantically. A single object may be described by several descriptors.
Description Schemes (DS): specify the structure and semantics of the relations between their components; these components can be descriptors (D) or description schemes (DS).
Description Definition Language (DDL): based on XML, used to define the structural relations between descriptors. It allows the creation and modification of description schemes and also the creation of new descriptors (D).
System tools: deal with binarization, synchronization, transport and storage of descriptors, as well as Intellectual Property protection.
MPEG-7 is a multimedia content description standard. This description will be associated with the content itself, to allow fast and efficient searching for material that is of interest to the user. MPEG-7 is formally called Multimedia Content Description Interface. Thus, it is not a standard which deals with the actual encoding of moving pictures and audio, like MPEG-1, MPEG-2 and MPEG-4. It uses XML to store metadata, and can be attached to timecode in order to tag particular events, or synchronise lyrics to a song, for example.
MPEG-7 - Spoken content description
MPEG-7 Spoken Content Document (SCD)
* SpokenContentHeader
  - WordLexicon
  - PhoneLexicon
  - DescriptionMetadata
  - SpeakerInfo
* SpokenContentLattice
  - Blocks
  - Nodes
  - Links (reference, probability, nodeOffset, acousticScore)
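A rough illustration of this structure, built with Python's ElementTree. The element and attribute names are taken from the list above, but the real MPEG-7 schema (namespaces, required attributes, lexicon contents) is considerably richer than this sketch:

```python
import xml.etree.ElementTree as ET

# Skeleton of an SCD document; tag names follow the header/lattice
# outline above, not the full normative MPEG-7 schema.
root = ET.Element("SpokenContent")
header = ET.SubElement(root, "SpokenContentHeader")
ET.SubElement(header, "WordLexicon")
ET.SubElement(header, "PhoneLexicon")
ET.SubElement(header, "SpeakerInfo")
lattice = ET.SubElement(root, "SpokenContentLattice")
block = ET.SubElement(lattice, "Block")
node = ET.SubElement(block, "Node")
# Each link carries the recognizer's scores for one lattice edge
ET.SubElement(node, "Link",
              {"probability": "0.9", "nodeOffset": "1", "acousticScore": "-42.0"})
xml_text = ET.tostring(root, encoding="unicode")
```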
Spoken content retrieval – general system structure
Spoken content retrieval – implementation
System components (Service Oriented Architecture):
- SDR client
- SDR server
- ASR server
- SDR DB
- External systems
Spoken content retrieval – implementation
ASR Server architecture (status: implemented):
- ASR Engine: CMU Sphinx-4
- Metadata database with ORM mapper
- Multithread execution pool: Java Concurrency Framework
- Web service: JAX-WS 2.1 with WSIT on Apache Jetty 6
- Network: SOAP + MTOM over HTTPS
Spoken content retrieval – implementation
SDR Server architecture (status: under development):
- External services agents
- ASR Connector
- Services FrontEnd
- Diagnostic portal
- Data Access Logic / Workflow Runtime
- Data Connector
- Indexing Service
- Search Service
- Indexing Workflow
- Search Workflow
- SCD Database
- Network
Spoken content retrieval – implementation
SDR Client architecture:
- Spoken Content Recording
- Media Converter
DICOM voice Q&R – indexing
Components:
- DICOM Client with Voice annotation/query plug-in
- Spoken Content Query converter to DICOM Query/Retrieve
- Spoken Content Descriptors
- Clinical Image and Object Management Server
Indexing flow: the client requests an undescribed image from the PACS (CI&OM server); a spoken description of the new image is recorded as an audio file (wav or mp3) and converted into an MPEG-7 SCD document.
DICOM voice Q&R – query
Components:
- DICOM Client with Voice annotation/query plug-in
- Spoken Content Query converter to DICOM Query/Retrieve
- Spoken Content Descriptors
- Clinical Image and Object Management Server
Query flow: a spoken or text query (an audio file, wav or mp3, or query text) is converted to Spoken Content Descriptors and matched against stored documents; the best matching SCD document yields the annotated image location, which is fetched via DICOM Q&R and returned as the search result.
Execution steps – table of contents
0 Abstract
1 Introduction
2 Automatic Speech Recognition and Spoken Document Retrieval
3 System structure
4 Voice and speech sources
5 Speech signal features extraction and recognition details
6 Spoken content description
7 Document query
8 Bibliography
9 Appendixes
Bibliography
Hyoung-Gook Kim, Nicolas Moreau, Thomas Sikora
MPEG-7 Audio and Beyond: Audio Content Indexing and Retrieval
John Wiley & Sons Ltd, 2005
Gopala Krishna A
Building ASR and TTS Systems: Building ASR Systems using Sphinx
Carnegie Mellon University, 2007
Arthur Chan, Evandro Gouvêa, Rita Singh
The Hieroglyphs: Building Speech Applications Using CMU Sphinx
Carnegie Mellon University, 2007
Lee Begeja, Bernard Renger, Murat Saraclar, David Gibbon, Zhu Liu, Behzad Shahraray
A System for Searching and Browsing Spoken Communications
AT&T Labs – Research, 2004
Frank Seide, Peng Yu, Chengyuan Ma, and Eric Chang
Vocabulary-Independent Search in Spontaneous Speech
Microsoft Research Asia, 2004
Ciprian Chelba
Spoken Document Retrieval and Browsing
Google, 2007
Jason Price
Oracle Database 11g SQL: Master SQL and PL/SQL in the Oracle Database
Oracle Press, 2008
http://en.wikipedia.org/wiki/Speech_recognition
http://tldp.org/HOWTO/Speech-Recognition-HOWTO/software.html