INVITED PAPER

An Information-Extraction Approach to Speech Processing: Analysis, Detection, Verification, and Recognition

This paper presents an integrated detection and verification approach to information extraction from speech that can be used for speech analysis, and recognition of speech, speakers, and languages.

By Chin-Hui Lee, Fellow IEEE, and Sabato Marco Siniscalchi, Member IEEE
ABSTRACT | The field of automatic speech recognition (ASR) has enjoyed more than 30 years of technology advances due to
the extensive utilization of the hidden Markov model (HMM)
framework and a concentrated effort by the speech community
to make available a vast amount of speech and language
resources, known today as the Big Data Paradigm. State-of-the-
art ASR systems achieve a high recognition accuracy for well-
formed utterances of a variety of languages by decoding
speech into the most likely sequence of words among all possi-
ble sentences represented by a finite-state network (FSN) ap-
proximation of all the knowledge sources required by the ASR
task. However, the ASR problem is still far from being solved
because not all information available in the speech knowledge
hierarchy can be directly integrated into the FSN to improve
the ASR performance and enhance system robustness. It is
believed that some of the current issues of integrating various
knowledge sources in top-down integrated search can be partially addressed by processing techniques that take advantage
of the full set of acoustic and language information in speech. It
has long been postulated that human speech recognition (HSR)
determines the linguistic identity of a sound based on detected
evidence that exists at various levels of the speech knowledge
hierarchy, ranging from acoustic phonetics to syntax and
semantics. This calls for a bottom-up attribute detection and
knowledge integration framework that links speech processing
with information extraction, by spotting speech cues with a
bank of attribute detectors, weighting and combining acoustic
evidence to form cognitive hypotheses, and verifying these
theories until a consistent recognition decision can be reached.
The recently proposed automatic speech attribute transcrip-
tion (ASAT) framework is an attempt to mimic some HSR
capabilities with asynchronous speech event detection followed by bottom-up knowledge integration and verification. In
the last few years, ASAT has demonstrated good potential and
has been applied to a variety of existing applications in speech
processing and information extraction.
KEYWORDS | Acoustic phonetics; automatic speech attribute transcription (ASAT); automatic speech recognition (ASR);
cross-language phone recognition; knowledge integration;
lattice rescoring; place and manner of articulation; speech
attribute detection
I. INTRODUCTION
It is instructive to examine some of the key developments in automatic speech recognition (ASR) that have occurred
in the past few decades and contemplate new directions
that might lead to better system designs. The ASR problem
Manuscript received May 18, 2012; revised September 13, 2012; accepted January 4,
2013. Date of publication February 7, 2013; date of current version April 17, 2013.
The ASAT project was supported by the National Science Foundation (NSF)
Information Technology Research (ITR) Program under Contract IIS-04-27113. Part
of S. M. Siniscalchi's ASAT-related work was supported by the Spoken Information
Retrieval by Knowledge Utilization in Statistical Speech Processing (SIRKUS) project
through Prof. T. Svendsen at the Norwegian University of Science and Technology
(Trondheim, Norway).
C.-H. Lee is with the School of Electrical and Computer Engineering, Georgia Institute of Technology, Atlanta, GA 30332 USA (e-mail: [email protected]).
S. M. Siniscalchi is with the Faculty of Architecture and Engineering, University of Enna Kore, Enna 94100, Italy (e-mail: [email protected]).
Digital Object Identifier: 10.1109/JPROC.2013.2238591
Vol. 101, No. 5, May 2013 | Proceedings of the IEEE 1089
0018-9219/$31.00 © 2013 IEEE
is still far from solved, judging from the limited deployment of products and services worldwide, the fragile nature of performance robustness for state-of-the-art ASR systems, and the slowdown of recent progress in performance improvement, now that the amount of training data to learn acoustic and language models for ASR is no longer the only major technology concern in designing ASR systems. Nevertheless, it is also clear that tremendous technology advances have been made since the speech community adopted an information-theoretic perspective
of channel modeling for encoding speech from language,
and formulated the ASR problem as a channel decoding
paradigm [1].
Taking advantage of the sequential nature of speech,
and combining an efficient stage-by-stage decoding strat-
egy with dynamic programming (DP, e.g., [2] and [3]) and
a Markov inferencing framework [4]–[6], a continuous speech recognition algorithm was first developed in [7] to
deliver good performance. The ease of learning speech and
language models from data triggered almost four decades
of rapid technology progress for ASR based on this integrated pattern modeling and decoding framework later
known as hidden Markov models (HMMs, e.g., [8]). The
same automatic data learning paradigm has been extended
to quite a few machine learning problems in the last two decades. One of the most notable accomplishments was
the development of a statistical machine translation (MT)
framework originated by a group of ASR researchers at
IBM [9], which also spun off many recent MT research and
application activities and other similar statistical natural
language processing (NLP) efforts (e.g., [10] and [11]).
The aforementioned statistical pattern matching approach to ASR is considered a paradigm shift from the traditional speech science perspective of crafting heuristic
rules manually based on expert observations from limited
data and local optimization, which is sometimes known as
a bottom-up knowledge integration process. In contrast to
traditional knowledge-rich approaches, the current knowl-
edge-ignorant or knowledge-implicit modeling framework
[12]–[14] relies on collecting a large amount of speech and text examples and learning the model parameters without the need to use detailed knowledge about a target language. It offers an advantage for automatic model learning
from a large collection of data via a rigorous mathematical
formulation and global optimization by using all the avail-
able knowledge sources at the same time, known as top-down knowledge integration, ready for DP-based optimal
decoding [15].
During the transition to the new paradigm in the 1970s, an intensive effort in applying acoustic and linguistic knowledge sources to speech recognition in the
Advanced Research Projects Agency (ARPA) Speech
Understanding Project [16] was witnessed. Many notable
examples were documented [17], [18]. Nonetheless, expert
knowledge was required to design even a simple ASR
system, which made the ASR technology hard to access.
Furthermore, robustness to adverse conditions was never addressed in a serious manner. Much of the knowledge accumulated in these studies, e.g., [16]–[18], was not fully explored in current HMM-based systems. Moving into the
1980s and 1990s, the dominance of data-driven learning
approaches to speech modeling was witnessed. A number
of techniques, including vector quantization (VQ) [19],
HMM [8], self-organizing map (SOM) [20], and artificial
neural network (ANN) [21], [22], have been successfully adopted.
After many years of concentrated deliberation, the
speech community has come a long way from the learning
stage in the 1970s, and made a tremendous drive in data-
driven approaches in the 1980s, 1990s, and 2000s. A
continuous stream of performance improvement and
increasing task complexity has thus been observed. For
more detail, the reader is referred to a special issue of the Proceedings of the IEEE on Spoken Language Processing published in August 2000 [23]. A number of books [24]–[34] have also been published. However, it is also safe to
argue that the technology progress has slowed down in
recent years. Most research groups are searching for the
next trend to move ASR forward. This phenomenon is
known as the S-Curve in learning, illustrated by the curve labeled in solid circles in Fig. 1. The community would generally agree that the fragile nature of ASR system design
will require new technological breakthroughs before con-
versational systems really become a ubiquitous user inter-
face mode, being able to compete with conventional
graphical user interface with point-and-click devices like
mouses or touch-sensitive screens.
In examining Fig. 1 closely, we can roughly divide the
ASR technology progress into three periods: 1) before the 1970s, labeled in blue circles, the speech community enjoyed a vast creation of speech knowledge sources (e.g., [35]–[39]); 2) between the 1970s and the 2010s, labeled in
green circles, data-driven models dominated four decades
of fast advances with HMM playing the role of a paradigm
shift and emerging as the leading framework used in
almost all modern ASR systems; and 3) beyond the 2010s,
labeled in pink circles, one can envision an imaginary path in which the speech community may be waiting for another paradigm shift to take place by exploring knowledge-
rich modeling [12]. However, this paradigm shift should
still leverage on data-driven automated learning from big
language resources. By setting a human performance
capability ceiling shown in a horizontal line in the upper
part of Fig. 1, it is noted that there is still a big gap between
the current state-of-the-art and a human speech recognition (HSR) system. In order to address the suprahuman
performance goal set by IBM a few years ago [40], the
speech community will need fast technology progress
again, similar to what the community had enjoyed in the
last four decades. This may be a good time to call for a
paradigm shift and reexamine what can be done to carry
the community forward.
There have been many attempts to find the next para-
digm shift. One of them is exploiting the distinctive
features of speech (e.g., [16] and [37]) to form words along
with the use of lexical access to a dictionary of words as demonstrated in spectrogram reading by human experts [41] in the MIT Summit system [42]. Using bottom-up
knowledge integration, HSR also performed much better
than ASR in most benchmark tests [43], [44]. The speech
cues, or attributes, or events, can thus serve as acoustic
landmarks [42], [45]–[47], sometimes referred to as
islands of reliability in the ocean of competing speech
events, to improve bottom-up knowledge integration. Recently, a detection approach to ASR, called automatic
speech attribute transcription (ASAT) [13], was proposed
in another attempt to address research issues in HSR and
spectrogram reading through bottom-up attribute detection and stage-by-stage knowledge integration. We will
collectively refer to this set of viewpoints as an informa-
tion-extraction perspective to extract useful acoustic and
linguistic information for the purposes of speech recognition and understanding. ASAT also facilitates a modular
strategy so that various researchers can collaborate by
contributing their best detectors or knowledge modules to
plug-n-play into the overall system design.
The rest of the paper is organized as follows. In
Section II, we briefly review the statistical pattern recog-
nition approach to ASR. We describe the single most important technique that has helped advance the state of the art of ASR, namely hidden Markov modeling of speech,
and discuss current ASR capabilities. We next address
several ASR technology limitations in Section III. A list of active ASR research challenges in robustness, decision strategies, and utterance verification is also presented. These challenges lead to a bottom-up speech attribute
detection framework followed by a stage-by-stage knowl-
edge integration process to be discussed in Section IV. To
enhance the current capabilities and alleviate some of the
limitations of HMM-based ASR, a bottom-up detection approach to ASR, called ASAT, is presented in Section V. A survey of ASAT-based speech processing applications
and their advantages over the conventional top-down
approach to ASR are highlighted in Section VI. In
Section VII, we present a critical look at new research
directions and future work opportunities through estab-
lishing the collaborative ASR Community of the 21st Century. Finally, we conclude our findings in Section VIII.
II. STATE-OF-THE-ART TOP-DOWN ASR
Modern ASR system design is based on a statistical pattern
matching framework that is motivated by representing
spoken utterances as stochastic patterns (e.g., [7], [15],
[25]–[27], and [48]) and formulating an information-theoretical perspective of speech generation, acquisition, and transmission (e.g., [15]). Many good studies on acoustic modeling are available (e.g., [49]–[52]). Equally as many papers are concerned with language modeling (e.g., [53]–[58]). In the following sections, a brief overview on the current statistical approach to ASR is presented.
A. Statistical Pattern Recognition Theory
Starting with a message M from a message source, a sequence of words W is formed through a linguistic channel. Different word sequences may often convey the same
message. It is then followed by an articulatory channel that
converts the discrete word sequence into a continuous
speech signal S. Speaker effect, which accounts for a major portion of the speech variabilities including speech
production difference, accent, dialect, speaking rate, etc.,
is added at this point. Additional speech distortion is
Fig. 1. The S-Curve of ASR technology progress: 1) before the 1970s: Vast creation of speech knowledge sources; 2) from the 1970s to the 2010s: Data-driven model learning with rich speech and language data resources; and 3) beyond the 2010s: What to do to sustain fast technology progress?
introduced when the signal passes through the acoustic channel that includes the speaking environment, interfering noise, and the transducers used to capture the speech
signal. This acoustic realization A is then passed through some transmission channel before it reaches an ASR system as an observed signal X (e.g., [59]).
For real-world practical problems, it is difficult to char-
acterize the intermediate channels, such as articulatory,
acoustic, and transmission channels, which are lumped together as a noisy channel. The noisy channel model is usually formulated as follows: 1) the joint distribution p(W, X) is decomposed into two components p(X|W) and P(W), known as an acoustic model (AM) and a language model (LM), respectively; 2) the forms of p(X|W) and P(W) are assumed to be parametric probability density functions (pdfs), i.e., p_Λ(X|W) and P_Θ(W), respectively; and 3) the parameters Λ and Θ are estimated from some training data. With these simplifications, the most popular way to solve the ASR problem is to use the well-known plug-in maximum a posteriori (MAP) decision rule (e.g., [60]–[62])

    Ŵ = argmax_{W ∈ Ω} P(W|X) = argmax_{W ∈ Ω} p_Λ̂(X|W) P_Θ̂(W)^L    (1)

where Λ̂ and Θ̂ are the estimated parameters obtained during training, Ŵ is the recognized sentence from decoding, and Ω is the set of valid candidate word sequences to be searched during testing. This decision rule, derived from the optimal Bayes decision rule, is also widely used in many other pattern recognition applications. In the above equation, L, commonly known as a language model multiplier, is used to balance the AM and LM contributions to the overall probability, due to the unknown true distributions and the use of a likelihood function p_Λ̂(X|W) to compute the acoustic probability.
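In the log domain, rule (1) amounts to picking the candidate maximizing the acoustic log-likelihood plus L times the LM log-probability. A minimal sketch of this selection; the candidate sentences, all scores, and the LM weight below are illustrative, not taken from the paper:

```python
import math

def map_decode(hypotheses, lm_weight):
    """Plug-in MAP rule in the log domain: pick the word sequence W
    maximizing log p(X|W) + L * log P(W), with L the LM multiplier."""
    best_w, best_score = None, -math.inf
    for w, (am_loglik, lm_logprob) in hypotheses.items():
        score = am_loglik + lm_weight * lm_logprob  # balance AM vs. LM
        if score > best_score:
            best_w, best_score = w, score
    return best_w, best_score

# Hypothetical log scores for three candidate sentences.
cands = {
    "it may curb":  (-120.0, -6.0),
    "and maker of": (-118.0, -7.5),
    "it make herb": (-130.0, -9.0),
}
w, s = map_decode(cands, lm_weight=10.0)  # -> ("it may curb", -180.0)
```

Note how the LM weight decides the outcome: with lm_weight near zero, the acoustically closer "and maker of" would win instead.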
A block diagram of this top-down approach to ASR is
shown in Fig. 2. The feature extraction module provides the
acoustic feature vectors used to characterize the spectral
properties of the time-varying speech signal, typically in
terms of time-synchronous cepstral analysis [63]. Heteroscedastic linear discriminant analysis (HLDA) [64] is routinely used in modern ASR systems as a method for learning
projections of high-dimensional acoustic representations
into lower dimensional spaces. The information provided
by the AM, LM, and word models is then used to evaluate
the similarity between the input feature vector sequence
(corresponding to a portion of the input speech) and a set of
acoustic word models for all words in the vocabulary to
determine which words were most likely spoken.
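The frame-synchronous DP evaluation behind this matching can be sketched with a toy Viterbi decoder; the two-state model and every log score below are invented for illustration only:

```python
def viterbi(obs_loglik, trans_loglik, init_loglik):
    """Frame-synchronous DP: keep, per state, the best partial-path
    log score at each frame, then backtrack through the stored
    predecessors to recover the most likely state sequence."""
    T, N = len(obs_loglik), len(init_loglik)
    delta = [init_loglik[j] + obs_loglik[0][j] for j in range(N)]
    psi = [[0] * N]                      # backpointers per frame
    for t in range(1, T):
        new_delta, back = [0.0] * N, [0] * N
        for j in range(N):
            best = max(range(N), key=lambda i: delta[i] + trans_loglik[i][j])
            back[j] = best
            new_delta[j] = delta[best] + trans_loglik[best][j] + obs_loglik[t][j]
        delta = new_delta
        psi.append(back)
    state = max(range(N), key=lambda j: delta[j])
    path = [state]
    for t in range(T - 1, 0, -1):        # backtrack to frame 0
        state = psi[t][state]
        path.append(state)
    return path[::-1]

# Toy two-state model: frame 0 favors state 0, frames 1-2 favor state 1.
init = [0.0, -3.0]
trans = [[-0.1, -2.3], [-2.3, -0.1]]
obs = [[0.0, -2.0], [-2.0, 0.0], [-2.0, 0.0]]
path = viterbi(obs, trans, init)  # -> [0, 1, 1]
```

A production FSN decoder adds beam pruning and word-level backpointers, but the per-frame maximization is the same.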
B. Three Key HMM Advances
Top-down HMM modeling is considered one of the
most fruitful areas in characterizing speech and language
in recent years. Key advances [59] can be summarized in
three broad categories to be discussed in the following.
1) Detailed Modeling: Maximum-likelihood (ML) estimation has been extensively developed following Baum's original formulation
[5]. Software packages, such as HTK [65] and GMTK [66],
are available now in the public domain to establish acoustic
models with hundreds of thousands of Gaussian mixture components, and support language models with hundreds of millions of n-gram probabilities. The previous limitation imposed by the curse of dimensionality, widely known in
the pattern recognition community, was alleviated with many advanced modeling techniques that take parameter
sharing into account, such as the commonly used tied-state
tree learning strategy [67] in subphone modeling [68], [69].
2) Adaptive Modeling: A significant drop in performance is often observed when an ASR system is used in an operating condition that is different from the training condition. Adaptation algorithms try to automatically tune a given set of HMMs to a new test environment using a
limited, but representative set of new data, commonly
referred to as adaptation data. There exist two major
adaptation approaches: the transformation-based approach
and the Bayesian approach. The best known example of
transformation-based adaptation is the maximum-
likelihood linear regression (MLLR) framework [70]. The
feature-space MLLR (fMLLR) [71] extends MLLR and has proven to be highly effective as a method for unsupervised
adaptation. In Bayesian learning (e.g., [72]), prior den-
sities need to be assumed and MAP estimates are obtained
for the HMM parameters. When the adaptation data size is
limited, structural maximum a posteriori (SMAP) adaptation [73] improves the efficiency of MAP estimation. Correlated HMMs with online adaptation were also shown to be both efficient and effective [74] by considering HMM parameter correlations.
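As a rough illustration of the transformation-based route, the core MLLR update rewrites every Gaussian mean as mu' = A mu + b with one transform shared across components; estimating (A, b) from adaptation data by maximum likelihood is omitted here, and the numbers below are made up:

```python
def mllr_adapt_means(means, A, b):
    """Apply a global MLLR mean transform mu' = A mu + b to every
    Gaussian mean; sharing one (A, b) across all components pools
    the scarce adaptation data."""
    adapted = []
    for mu in means:
        adapted.append([sum(A[i][j] * mu[j] for j in range(len(mu))) + b[i]
                        for i in range(len(b))])
    return adapted

# Two hypothetical 2-D means and an illustrative transform.
means = [[1.0, 0.0], [0.0, 2.0]]
A = [[1.1, 0.0], [0.0, 0.9]]
b = [0.5, -0.2]
new_means = mllr_adapt_means(means, A, b)  # ~ [[1.6, -0.2], [0.5, 1.6]]
```

With more adaptation data, the single global transform is typically replaced by one transform per regression class of Gaussians.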
3) Discriminative Modeling: Due to inaccurate model assumptions and limited training data, maximizing the
likelihood can be quite different from minimizing the pro-
bability of recognition errors, which is the ultimate goal in
ASR. Using a learning criterion that is consistent with ASR
Fig. 2. A typical block diagram of a continuous speech recognitionsystem with integrated search via a finite state network
representation of all the key task constraints, such as AM and LM.
Lee and Siniscalchi: An Information-Extraction Approach to Speech Processing
1092 Proceedings of the IEEE | Vol. 101, No. 5, May 2013
-
objectives, the minimum classification error (MCE) [75], [76] learning for HMM has been shown to be quite
effective in improving the model separation, system accu-
racy, and performance robustness. Risk-based optimiza-
tion, such as maximum mutual information (MMI) (e.g.,
[77]), and minimum phone error (MPE) [78], has also
proven quite effective in reducing the error rates. The
MMI and MPE objective functions have produced good results when extended to feature-space MMI (fMMI) [79] and feature-space MPE (fMPE) [80].
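The MCE idea can be sketched as a smoothed error count: a misclassification measure compares the correct-class discriminant score against a soft maximum over competitors, and a sigmoid maps it to a differentiable loss. All scores and smoothing constants below are illustrative:

```python
import math

def mce_loss(g_correct, g_competing, eta=2.0, gamma=1.0):
    """Smoothed MCE loss: the misclassification measure d compares
    the correct-class score with a soft maximum of competitor scores;
    a sigmoid then yields a differentiable 0/1 error count."""
    m = len(g_competing)
    soft_max = (1.0 / eta) * math.log(
        sum(math.exp(eta * g) for g in g_competing) / m)
    d = -g_correct + soft_max            # positive d means an error
    return 1.0 / (1.0 + math.exp(-gamma * d))

# Well-separated correct class -> loss near 0; confusable -> near 0.5.
low = mce_loss(g_correct=5.0, g_competing=[-1.0, -2.0])
high = mce_loss(g_correct=0.0, g_competing=[0.0, -0.5])
```

Minimizing this loss over training tokens, rather than the likelihood, is what aligns the training objective with recognition error.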
III. TECHNOLOGY LIMITATIONS AND CHALLENGES
Although many successful implementations of commercial
products and services for many different languages have
been witnessed, ASR technology is still rather fragile. Unless the users follow strict protocols that are consistent
with the speaking styles of the speaker population, trans-
ducer and channel characteristics of the training data con-
ditions, and the background acoustic environments, the
high accuracies obtained with HMM-based systems cannot
often be maintained across adverse conditions. This robustness issue limits the wide deployment of spoken language systems. Three major technical challenges are illustrated as follows.
A. Challenges in Model Estimation and Robustness
Since it is not practical to collect a large set of speech
and text examples by a large population over all possible
combinations of signal conditions, it is likely that mismatch between training and testing conditions is a major source of errors for conventional pattern matching systems. A state-of-the-art system may perform poorly when the test
data are collected under a totally different signal condi-
tion. Regarding the possible mismatches, both linguistic
and acoustic mismatches (e.g., [81]) might occur.
The mismatch can be conceptually viewed in the signal,
feature, or model space as shown in Fig. 3 where a
maximum-likelihood stochastic matching framework was
proposed to address the ASR robustness issues caused by this mismatch [82]. The mismatch can be modeled by a distortion D1 in the signal space and handled by speech enhancement. A feature-space distortion D2 can also be considered, and feature compensation can be performed
(e.g., [83] and [84]). Finally, the mismatched situation can
be handled in the model space with a transformation D3 that maps the trained models into the test environment via
adaptation [59].
1) Inconsistency in Language Modeling: A linguistic mismatch is mainly caused by incomplete task specifications,
inadequate knowledge representations, insufficient train-
ing data, etc. For example, task model and vocabulary
usage heavily influence the efficacy of the training process.
As mentioned before, out-of-vocabulary (OOV) words not
specified in a task vocabulary are major sources of recog-
nition errors. For syllabic languages, such as Mandarin,
there is a major problem in consistently defining words.
For example, a four-character word can be broken down
into two two-character words or even into four single-
character words. If all possible combinations of word segments are considered in the LM, it may cause biases in
computing word probabilities. In another evaluation on
the 5000-word Wall Street Journal (WSJ) task, we found
that a 4% word error rate (WER) can be achieved with the
trigram language model. However, the WER went as high
as 70% when no language constraints were used [85]. It is
clear that the choices of LMs and language weights in most
unfamiliar situations will be hard when the task definition is incomplete, e.g., in the case of spontaneous speech to be
discussed later.
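The WER figures quoted above are obtained by aligning each hypothesis against its reference transcription with a word-level edit distance; a minimal sketch, with example strings that are our own rather than from the WSJ evaluation:

```python
def word_error_rate(ref, hyp):
    """WER = (substitutions + deletions + insertions) / #reference
    words, via Levenshtein alignment over word tokens."""
    r, h = ref.split(), hyp.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i                      # delete all reference words
    for j in range(len(h) + 1):
        d[0][j] = j                      # insert all hypothesis words
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # match/substitution
    return d[len(r)][len(h)] / len(r)

# Three substitutions over five reference words -> WER of 0.6.
wer = word_error_rate("it may curb the demand", "and maker of the demand")
```

Because insertions count against the hypothesis, WER can exceed 100% for very noisy output, which is why the unconstrained-decoding figure above is so large.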
2) Inconsistency in Acoustic Modeling: An acoustic mismatch between training and testing arises from various
sources, including differences in desired speaking formats
and signal realizations. For a given task, speech models
trained based on task-dependent data usually outperform models trained with task-independent data. Similarly,
speech models trained based on speakers with normal
speaking rate will usually encounter problems for fast and
slow talkers. Another major source of acoustic mismatch
derives from varying signal conditions. For example,
changes in transducers, channels, speaking environments,
speaker population, echoes and reverberations, and com-
binations of them, all contribute to performance degradation. In addition to the previously discussed linguistic and
acoustic mismatches, model incorrectness and estimation
error also cause robustness problems for ASR.
3) Need for Collaborative ASR: The robustness problem of current ASR systems might be solved by combining the
different approaches developed by different members in
the speech community.
Fig. 3. Mismatch in training and testing: The two starred blocks indicate that the model obtained in training in the upper panel and
the features obtained in testing in the lower panel give acoustic
mismatches if the testing environments are very different from the
training conditions, resulting in a system with operating pair
mismatches shown.
In contrast to the model-based pattern matching approach to extracting information from speech, a collection
of signal-based algorithms needs to be developed in order
to detect acoustic landmarks, such as vowels, glides, and
fricatives, in adverse conditions. They could serve to select good data segments and to design signal-specific speech enhancement, feature compensation, and model adaptation algorithms for reliable information extraction. Attribute-specific features, such as voice onset time (VOT) for discriminating voiced against unvoiced stops [86], were developed and used for designing robust attribute detectors. Since there are many speech attributes and
acoustic conditions to be dealt with, a collective expertise
will be needed in order to address different combinations
of robustness issues. It is well known that no single ro-
bustness technique is capable of handling a wide range of
adverse conditions. This again is a good opportunity to develop collaborative efforts to solve this diverse problem.
B. Challenges in Search Strategy
Significant progress has been made in developing efficient and effective search algorithms in the last few years
[87]. In the future, it seems reasonable to assume that a
hybrid search strategy, which combines a modular search
with a multipass decision, will be used extensively for large vocabulary recognition tasks. Good delayed decision strategies in each decoding stage are required to minimize
errors caused by hard decisions. As an example, the N-best search paradigm (e.g., [88]), including the generation of
multiple-theory lattices, is an ideal way for integrating
multiple knowledge sources. Such a strategy fuses multiple
hypotheses to rescore a preliminary set of candidate digit
strings with higher level constraints like a digit check sum [89], detailed crossword unit models, and long-term language models. It has also been used to provide competing
string hypotheses for discriminative training and for com-
bining multiple acoustic models to reduce recognition
errors [90]. As another example, confusion networks have
been proposed in [91] to represent alternative hypotheses
distilled from a word lattice and improve the accuracy of
the speech recognition system. We expect to see more use of the multipass search paradigm to find preliminary hypotheses in a top-down strategy by incorporating high-
level knowledge sources that cannot be integrated easily
into the finite-state network (FSN) representation for
frame-based DP search. By combining a good multipass
search strategy with utterance verification (e.g., [92])
strategies, more flexible and efficient designs in large
vocabulary continuous speech recognition (LVCSR) spoken language systems are possible.
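Rescoring an N-best list with a knowledge source that could not be compiled into the first-pass FSN, such as a digit check sum, can be sketched as follows; the toy divisibility constraint, penalty value, and weight are illustrative, not from [89]:

```python
def rescore_nbest(nbest, extra_score, weight):
    """Delayed decision: rerank an N-best list by adding a weighted
    score from a knowledge source (e.g., a check sum or long-span LM)
    left out of the first-pass FSN search."""
    return max(nbest, key=lambda h: h[1] + weight * extra_score(h[0]))

def checksum_score(digit_string):
    """Toy constraint: valid strings have a digit sum divisible by 3."""
    return 0.0 if sum(int(t) for t in digit_string.split()) % 3 == 0 else -5.0

# First-pass scores slightly favor an invalid string; rescoring fixes it.
nbest = [("1 2 4", -10.0), ("1 2 3", -10.5)]
best = rescore_nbest(nbest, checksum_score, weight=1.0)  # -> ("1 2 3", -10.5)
```

The same mechanism accommodates any late knowledge source: crossword acoustic models, long-span LMs, or prosodic features, as long as it can score a complete hypothesis.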
As already mentioned, a recognition decision is often
made by jointly considering all the knowledge sources in
the integrated approach to ASR shown in Fig. 2. In principle,
this search strategy achieves the highest performance if all
the knowledge sources are completely characterized and
fully integrated using the speech knowledge hierarchy in
the linguistic structure of acoustics, lexicon, syntax, and semantics. This is the commonly adopted search strategy
in speech recognition today. However, there are a number
of problems with the integrated approach, because not all
knowledge sources can be completely characterized and
properly integrated.
For LVCSR tasks, the compiled FSN is often too large
and therefore becomes computationally expensive to find
the best sentence through a huge and ever-expanding search space. Thus, all knowledge sources must remain
simple in order to efficiently combine them into a single
search space. In particular, this has inhibited progress at
the linguistic level, and almost all LVCSR systems employ nonoptimal linguistic components such as static lexicons (lexicalization of morphological processes) and n-gram LMs that force the decoding process to generate hypotheses that sometimes conflict with the acoustic constraints. Two WSJ examples, illustrated in the following subsections, demonstrate how modular search can
correct wrong recognition results obtained with current
top-down, HMM-based systems.
Both examples highlight the importance of bottom-up attribute detection and stage-by-stage knowledge integration, which are two key topics to be discussed throughout the paper. We will come back to this central theme later in Section IV.
1) Inconsistency With Attributes in Integrated Search: The first WSJ example incorporates correct low-level information from speech attributes. Specifically, it has been observed that a conventional LVCSR system evaluated on the WSJ task often confuses the word "safra" with the phrase "stock for." Nonetheless, recognizing the word "stock" requires the presence of two stop sounds /t/ and /k/ in the
region of a vowel. This can be checked by visually inspect-
ing the spectrogram in the upper panel of Fig. 4, which
does not show the presence of stop sounds before and after
the middle vowel. Moreover, the frame-wise time evolu-
tion of the output posterior probabilities (generated by a
bank of ANN-based detectors for manner of articulation)
displayed in the lower panel of Fig. 4, known as a posteriogram [93], clearly indicates that there are no stop
events in the area where the mistake occurred, and it also
signals the presence of a glide (/r/ in this case) followed by
a vowel at the end of the time-span under analysis. If this
information could be properly extracted and integrated
into search, these errors could be avoided. We will come
back to this example again later in Section VI-B in which
this particular utterance is corrected by combining attribute detection scores with the log-likelihood scores in
attribute-based lattice rescoring.
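A simplified view of such attribute-based rescoring: add a weighted sum of log detector posteriors, over the frames where a hypothesis demands an attribute, to the arc's HMM log-likelihood, so that a hypothesis requiring stop sounds where none are detected is penalized. The posterior values, likelihoods, and weight below are invented for illustration:

```python
import math

def attribute_rescore(hmm_loglik, attr_posteriors, weight=0.5):
    """Combine an arc's HMM log-likelihood with log posteriors from
    attribute detectors (e.g., ANN manner-of-articulation outputs)
    over the frames where the hypothesis demands that attribute."""
    attr_score = sum(math.log(max(p, 1e-10)) for p in attr_posteriors)
    return hmm_loglik + weight * attr_score

# Hypothetical stop-sound posteriors for the frames where "stock"
# requires /t/ and /k/: weak detector evidence penalizes that path.
stock_score = attribute_rescore(-50.0, [0.02, 0.01, 0.03])
safra_score = attribute_rescore(-52.0, [0.90, 0.85, 0.95])
# safra_score > stock_score despite the weaker first-pass likelihood
```

The flooring of posteriors at 1e-10 keeps a single zero-probability frame from vetoing an otherwise plausible arc.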
2) Inconsistency With Prosody in Integrated Search: In the second WSJ example, correct suprasegmental information from pitch and duration is used. Specifically, suprasegmental information, such as prosody, and language
constraints, such as morphosyntactic language models,
cannot be easily cast into the FSN specification when per-
forming topdown knowledge integration. However, po-
tential errors can be corrected by using suprasegmental
pitch contours and duration features, as demonstrated by a visual inspection of Fig. 5. In the top panel, the waveform for the WSJ sentence "if the Fed pushes the dollar higher, it may curb the demand for U.S. exports" in the time span between 2.72 and 3.32 s is displayed to show a three-word
recognition error occurring when using the same HMM-
based system as the example in Fig. 4.
Specifically, the phrase "it may curb" is misrecognized as "and maker of." The panel below shows the frame energy, whereas the F0 contour is shown in the third panel. The recognized phone and word sequences are reported in
the fourth and fifth panels, respectively. The reference
phone and word transcriptions are displayed in the sixth
and seventh panels, respectively. Knowledge-based analy-
sis of the second and third plots reveals two inconsisten-
cies in recognizing the middle word maker in the phrase,
namely: 1) the F0 for the segment ker is too high withrespect to that for the preceding segment ma that puts
a strong stressed syllable in the middle of the word; and
2) the glottal closure of 60 ms of the stop sound in
maker is too long. It should be a stop gap of an un-
voiced stop as in the correct but misrecognized word
curb instead. Better F0 estimation will enhance this ca-pability (e.g., [94]). A recent study also demonstrated that
the performance of Mandarin LVCSR can be significantlyenhanced by incorporating prosodic information, such as
break models and pitch [95], [96].
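A minimal sketch of how such prosodic consistency checks might be automated is given below. The syllable representation, the thresholds (F0 jump ratio, 40-ms closure limit), and the flag names are illustrative assumptions, not values from the systems cited above.

```python
def prosody_flags(syllables, f0_jump_ratio=1.3, max_closure_ms=40.0):
    """Flag simple prosodic inconsistencies in a hypothesized word.

    syllables: list of dicts with a mean 'f0' (Hz) and a stop
    'closure_ms' (None when the syllable contains no stop).
    Thresholds here are illustrative, not trained values.
    """
    flags = []
    for prev, cur in zip(syllables, syllables[1:]):
        # 1) an unexpectedly strong F0 rise mid-word suggests a stress
        #    pattern inconsistent with the hypothesized word
        if cur["f0"] > f0_jump_ratio * prev["f0"]:
            flags.append("f0_jump")
    for syl in syllables:
        # 2) an overlong glottal closure contradicts the short stop gap
        #    expected of an unvoiced stop
        if syl.get("closure_ms") and syl["closure_ms"] > max_closure_ms:
            flags.append("long_closure")
    return flags

# "ma-ker" with a high F0 on "ker" and a 60-ms closure, as in Fig. 5
print(prosody_flags([{"f0": 110.0, "closure_ms": None},
                     {"f0": 180.0, "closure_ms": 60.0}]))
```

Both inconsistencies discussed in the text would be flagged for this toy input, and a well-formed hypothesis would return no flags.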
3) Need for Bottom-Up Information Extraction: The top-down integrated framework also hampers the definition of
generic knowledge sources that can be used in different
domains. As a result, applications for a new knowledge
domain need to be built almost from scratch. In addition, the effectiveness of the integrated search diminishes when
dealing with unconstrained speech input, since more com-
plex language models are needed for handling spontaneous
speech phenomena along with much larger lexicons. On
the other hand, for the modular approach shown in Fig. 6,
the recognized sentence can be obtained by performing
unit matching, lexical matching, and syntactic and seman-
tic analysis in a sequential manner. As long as the interface between the adjacent decoding modules is completely
specified, each module can be designed and tested
separately.
4) Need for Collaborative ASR: Collaborative research among different groups working on different components
of the system can be carried out to improve the overall
system performance because there are many pieces of information, or acoustic cues, to be extracted and utilized.
In the meantime, modular approaches are usually more
computationally tractable than integrated approaches.
However, one of the major limitations with the modular
approach is that hard decisions are often made in each
decoding stage without knowing the constraints imposed
by the other knowledge sources. Decision errors are therefore likely to propagate from one decoding stage to the next, and the accumulated errors are likely to cause search
errors unless care is taken to minimize hard decision
errors at every processing stage.
C. Challenges in Spontaneous Speech Processing
Although low word error rates have been achieved in
many LVCSR tasks, the high accuracy usually does not
1The authors would like to thank Dr. C.-Y. Chiang of the National Chiao Tung University (NCTU, Hsinchu, Taiwan) for creating this example.
Fig. 4. Spectrogram (upper panel) and posteriogram (lower panel) for the sentence numbered 446c0210 of the Nov92 test set with focus on the area where the errors occur. A conventional LVCSR system misrecognizes the word "safra" and generates the transcription "stock for." In the second panel, the time evolution of the posterior probabilities, namely a posteriogram, of manner of articulation shows that there are no plosive events in the time span under analysis. Furthermore, wrong word recognition occurs although correct manner of articulation detection can be performed.
extend to recognizing spontaneous speech. An example is
the Switchboard task [97], which has attracted quite a bit of
research attention for almost 20 years. In the early 1990s, a high error rate of over 40% had been reported. However, a
steady decrease in word error rate led to those in the range
of 13%-16% for conversational telephone speech and broadcast news by 2006 [98]-[100]. Nevertheless, these
results are still rather high when compared with the re-
cognition performance on read speech. In spontaneous
speech, ill-formed utterances are often observed that
cannot be completely characterized, even if a large amount of training speech data was collected to build the n-gram language models.
The plug-in MAP decoder that recognizes W^ in (1) finds the best sentence in a set of competing sentences. How-
ever, there are many practical difficulties with this design.
First, the candidate set is usually of a finite size, and it is not possible to include all sentences. Second, the quality of
the recognition result is not properly quantified because the right-hand side of (1) is only computing a relative
difference of competing word strings. Since speech sounds
are inherently ambiguous, we need to at least ask the
following three questions: 1) why should we accept W^ as
the recognized string?; 2) why should we accept some
words in W^ while rejecting others?; and 3) can we assign a value to measure the confidence of our acceptance?
These three issues lead researchers to study three new
but closely related topics, namely: 1) keyword recognition
and non-keyword rejection (e.g., [101]); 2) utterance veri-
fication (UV) at both the string and word levels (e.g., [92]
and [102]); and 3) confidence measures (CMs) or con-
fidence scoring (e.g., [103]). Although the above three
research areas cannot be solved using the classical classification formulation shown in (1), the theory of statistical pattern verification and hypothesis testing provides a
framework to tie these three topics in a unified manner
(e.g., [104]). Verification of speech events, such as in
attributes, could also follow the same theory and design.
We will come back to this critical research area in further
depth later in Section V. Due to incomplete task specifi-
cations in defining most speech recognition and under-
standing tasks, a top-down knowledge integration approach alone usually cannot maintain consistency
with all the knowledge sources in the recognized sentence.
A partial understanding of any given utterance, i.e., know-
ing which part of an utterance to process and to ignore, is
very critical in order to handle spontaneous speech. We
will come back later to discuss this important issue in
Section IV-B.
1) Partial Understanding Through Event Detection: One major problem related to characterizing spontaneous
speech with conventional acoustic and language models
is the set of so-called incomplete specification issues, such
as partial words, hesitation, telephone ringing, baby cry-
ing, door slamming, TV interferences, out-of-vocabulary
words, out-of-grammar sentences, and out-of-task (OOT)
Fig. 6. A typical modular-search ASR system.
Fig. 5. Prosodic analysis of the WSJ sentence "if the Fed pushes the dollar higher, it may curb the demand for U.S. exports." The first panel shows the waveform in the time frame between 2.72 and 3.32 s, where a recognition error occurs. Specifically, "it may curb" is recognized as "and maker of."
The second panel shows the frame energy, whereas the F0 is shown in the third panel. The recognized phone and word sequences are reported
in the fourth and fifth panels, respectively. The reference phone and word transcriptions are displayed in the sixth and seventh panels,
respectively. Two inconsistencies: 1) the F0 for the segment "ker" is too high with respect to that for the preceding segment "ma"; and 2) the glottal closure of the stop sound in "maker" is too long.
utterance constructions, which are commonly observed in spontaneous speech (e.g., [105]).
2) Need for Collaborative ASR: Many techniques have been developed to reduce the word error rate for the
Switchboard task for almost 20 years based on conven-
tional LVCSR approaches. The amount of effort needed to
study another spontaneous speech recognition task with an
increased complexity, or for a different language, could be very high. It is time to adopt detection-based techniques
that can be task independent and language universal, such
as key phrase detection, sound-specific filler modeling,
extraneous speech rejection, and attribute modeling,
through partial understanding. By utilizing the aforemen-
tioned modular strategy, we might be able to divide up the
big spontaneous LVCSR task into a set of smaller and manageable problems so that researchers with knowledge-rich algorithms can help.
IV. BOTTOM-UP DETECTION FOLLOWED BY KNOWLEDGE INTEGRATION
Better modeling of the linguistic, articulatory, acoustic,
transmission, and noise channels missing in the current ASR formulation may enhance the current level of ASR
performance. Moreover, a knowledge integration process
with detected cues and evidence is often used in HSR and
from experience in spectrogram reading. This seems to
point to the need for a bottom-up paradigm and leads to
ASAT principles. We provide some justifications of this
perspective in the following.
A. Bottom-Up Knowledge Integration
The currently prevailing top-down HMM-based systems use a large volume of speech and text training data.
However, once the models are trained, they become a black box in that they provide very little diagnostic information to
pinpoint why the models work well in one instance and
then fail badly in other situations. For example, it is often
clear from the pitch contour that there are only three digits in an unknown input utterance, but somehow a four-digit
sequence scores best among the competing strings and is
recognized. Instead of using such a top-down, integrated search approach, recent ASR systems rely on N-best strings or word lattices to hypothesize multiple theories. The re-
cognized sentence is then obtained by rescoring these
strings using additional knowledge sources, e.g., phone
and segment lattices have been proposed [42], [106]. However, phone recognition is often error-prone and be-
comes a limiting factor for further technology develop-
ment. The bottom-up knowledge integration approach
would become feasible if speech cues are more reliably
detected.
A block diagram of a detection-based ASR system is
shown in Fig. 7. The input speech signal is first processed
by a bank of feature detectors aiming at events that are
relevant to the recognition task. An event lattice is then
produced with each element time-marked and scored. The
detectors do not have to be synchronized and, therefore,
the framework is flexible in embracing both short-term detectors, e.g., for the VOT, and long-term detectors, e.g.,
for pitch contours. Once meaningful events have been de-
tected, the event merger proposes larger events by merging
smaller events and the theory verifier computes their
corresponding confidence scores and prunes unlikely
theories. This process could be repeated until all the avail-
able knowledge sources are incorporated and evaluated.
The recognized string is then the sequence of words that scores the best against all possible knowledge sources.
This bottom-up detection approach to ASR has a num-
ber of advantages, namely: 1) it provides plenty of diagnos-
tic information; 2) it is easy to compare the quality of
individual detectors and an ensemble of detectors by pro-
perly designing feature-specific evaluation sets aiming at
these events; 3) individual event detectors are often easier
to design and perfect than the whole system; 4) it takes advantage of many years of research in speech and lang-
uage sciences, as well as statistical modeling; 5) it offers a
quantitative way for an objective performance evaluation; and, most importantly, 6) it sets up an open framework
for the community to work collaboratively, something that
has not been done enough in the last 30 years. In the
following, we briefly present two preliminary studies that
also demonstrate the feasibility and effectiveness of such a bottom-up approach.
B. Key-Phrase Detection and Verification
Several spoken dialog systems [28] have been evalu-
ated in real-world applications. These systems use finite
state grammars to accept typical user utterances because
there is no data available to train statistical language models for the specific tasks. The use of a rigid grammar
is effective for typical in-grammar (IG) utterances. How-
ever, in real-world environments, we have observed wide utterance variations inherent in a large user population, which are, therefore, not covered by the task grammars,
even though they had been iteratively tuned by developers
during the trial period. Even in apparently simple subtasks
Fig. 7. A detection-based speech recognition framework.
such as asking for date or time, around 20% of the user
utterances turned out to be out-of-grammar (OOG).
These samples include extraneous words, hesitations, re-
petitions, and unexpected expressions. In some cases, we
even observe many utterances that are totally irrelevant
or OOT.
Most of such spontaneous utterances contain some key phrases that are task related and may lead to partial or full
understanding. Other samples, not relevant to the task,
should be rejected. This suggests a detection-based ap-
proach to flexible speech recognition and understanding
that is designed to detect semantically significant parts and
reject irrelevant portions. In a domain-specific form filling
or information retrieval task, the system is capable of interpreting the input with only key phrases. Therefore, the approach based on detection is attractive. In [102], it was found that a com-
bined key-phrase recognition and verification strategy
worked well especially for ill-formed utterances. The com-
bined detection-verification approach improved the se-
mantic accuracy from 5% to 30% over the conventional
techniques with rigid grammatical constraints, especially
for ill-formed OOT utterances.
C. Knowledge-Based Feature Representation in LVCSR
We now show that the use of acoustic phonetics and
contextual variability in the representation of the speech signal is indeed very useful for improving LVCSR. The system
proposed in [107] is similar to a well-known feature ex-
traction scheme called Tandem [108] that was extended to
LVCSR [109]. Data-driven multilayer perceptron (MLP)
detectors were used to measure the presence or absence of
distinctive features directly from the short-time MFCC and
limited temporal information. These phonetic distinctive
features were used as feature vectors to build a set of context-dependent phone HMMs. Experiments were per-
formed on the WSJ task with the following feature con-
figurations: 1) baseline with 39 MFCC features; 2) 61 features: the 61-dimensional feature vectors (1 energy coefficient + 60 Karhunen-Loève (KL) transformed features) were used to build triphone HMMs; 3) 44 KL-transformed phone features; 4) 61+44 features; and 5) 61+44 features plus MFCC. We point out that the first- and second-order derivatives were not used for distinctive and phone features.
Experimental results were obtained with the 5000- and 20 000-word tasks on the Nov92 test set. Trigram language
models were used, and all WERs are listed in the second
row of Table 1. Furthermore, systems 4) and 5) were
obtained by combining with ROVER [110], which essen-
tially corresponds to a majority vote decision. The results
given in the second to last row correspond to about 20%
and 10% relative improvements over our best MFCC base-
lines, on the 5000- and 20 000-word tasks, respectively. These very encouraging results seem to indicate that acoustic
phonetic features can help reduce the WERs. In the
bottom row, we report a WER of 6.6% for the 20 000-word
task obtained with a template-based system [111], and the
state-of-the-art-result for the 5000-word task [112].
V. AUTOMATIC SPEECH ATTRIBUTE TRANSCRIPTION
The speech signal contains a rich set of information that
facilitates human auditory perception and communication
beyond a simple linguistic interpretation of the spoken
input. In order to bridge the performance gap between ASR
and HSR systems, the narrow notion of speech-to-text in
ASR has to be expanded to incorporate all related infor-
mation embedded in speech utterances. This collection of information includes a set of fundamental speech sounds
with their linguistic interpretations, a speaker profile en-
compassing gender, accent, emotional state and other
speaker characteristics, the speaking environment, etc.
Collectively, we call this superset of speech information the
attributes of speech. It is expected that directly addressing
these issues will improve ASR performance as well as
speaker recognition, language identification, speech perception, and speech synthesis. The human-based model of
speech processing suggests a candidate framework for
developing next-generation speech processing techniques
that have the potential to go beyond the current limitations
of existing ASR systems.
Based on the aforementioned set of speech attributes,
ASR can be extended to ASAT, which is a process that goes
beyond the current simple notion of word transcription. ASAT promises to be knowledge-rich and capable of incor-
porating multiple levels of information in the knowledge
hierarchy into attribute detection, evidence verification,
and integration, as shown in Fig. 8. The top panel illus-
trates the front-end processing, which consists of an
ensemble of speech analysis and parametrization modules.
In addition, the bottom panels demonstrate a possible
stage-by-stage back-end knowledge integration process. These two key system components will be described in
more detail in the following. Since speech processing in
ASAT is highly parallel, a collaborative community effort
can be built around a common sharable platform to
enable a modular ASR paradigm that facilitates a tight
coupling of interdisciplinary studies of speech science
and processing.
Table 1 Word Error Rates (%) for Various Feature Sets and Combinations on WSJ Nov92, 5000 and 20 000
A. Front-End Attribute Detection Processing
An event detector converts an input speech signal into a
time series that describes the level of presence (or level of
activity) of a particular property of an attribute, or event, in
the input speech utterance over time. This function can compute the a posteriori probability, or the log-likelihood ratio (LLR), of the particular attribute. The LLR involves the
calculation of two likelihoods: one pertaining to the target
model and the other the contrast model. The bank of
detectors consists of a number of such attribute detectors,
each being individually and optimally designed for the
detection of a particular event. These attribute properties
are often stochastic in nature and are relevant to information to be extracted and needed to perform speech analysis and other functions, such as ASR. One key feature of
the detection-based approach is that the outputs of the
detectors do not have to be synchronized in time and,
therefore, the system is flexible enough to allow a direct
integration of both short-term detectors, e.g., for detecting
VOT, and long-term detectors, e.g., for detecting pitch
contours, syllables, and particular word sequences. The conventional frame-synchronous constraints of most tradi-
tional ASR systems are thus relaxed in the ASAT system to
accommodate asynchronous attribute detection as shown
in Fig. 9. In the ASAT framework, different parameters at
different frame rates can be utilized and combined to de-
sign attribute-specific event detectors beyond the current
MFCC features obtained with frame-synchronous speech
analysis.
Speech parametrization has been discussed in many
textbooks (e.g., [113]). For ASAT, the parameters can be
sample based, such as a zero-crossing rate, or frame based,
such as MFCCs. Speech analysis can be performed in the
temporal domain, providing features such as VOT, or in
the spectral domain, such as short time energies in differ-
ent frequency bands. Both long-term and short-term ana-
lysis can be compared and contrasted. Biologically inspired
and perceptually motivated signal analyses are considered
as promising parameter extraction directions [114], [115]
because the ASAT paradigm supports parameter extraction
at different frame rates for designing a range of attribute
detectors. Once a collection of speech parameters Ft is obtained, they can be used to perform attribute detection, which is a critical component in the ASAT paradigm as shown in the upper panel of Fig. 8. Attributes can be used
as cues or landmarks in speech [45], [47] in order to
identify the islands of reliability for making local acous-
tic and linguistic decisions, such as energy concentration
regions and phrase boundaries, without extensive speech
modeling. A few clear examples are readily visible in
most spectrogram plots, e.g., the vowel and fricative re-
gions in Fig. 4.
An attribute detection example was demonstrated in
[86] to discriminate voiced and unvoiced stops using VOT
for two-pass English letter recognition. In the first stage, a
Fig. 9. A bank of speech attribute detectors: Each can take different parameters as inputs and generate a value between 0 and 1 over time
to indicate the presence or absence of the specific attribute.
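The behavior summarized in the Fig. 9 caption, a bank of independent detectors each emitting a value between 0 and 1 over time, can be sketched with simple logistic detectors. The feature dimensions, weights, and attribute names below are made-up illustrations, not trained models.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def detector_bank(frames, detectors):
    """Apply a bank of independent attribute detectors to a feature stream.

    frames: list of feature vectors; detectors: dict mapping an attribute
    name to a (weights, bias) pair. Each detector emits one activity
    value in (0, 1) per frame; because the detectors are independent,
    their outputs need not be synchronized across the bank.
    """
    curves = {}
    for name, (w, b) in detectors.items():
        curves[name] = [sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
                        for x in frames]
    return curves

# Two toy detectors over 2-D features (weights invented for illustration)
dets = {"voiced": ([4.0, 0.0], -1.0), "fricative": ([0.0, 4.0], -1.0)}
curves = detector_bank([[0.9, 0.1], [0.1, 0.9]], dets)
```

Each resulting curve is a per-frame activity function of the kind the event merger consumes downstream.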
Fig. 8. ASAT. (a) Speech analysis ensemble followed by a bank of attribute detectors to produce an attribute lattice. (b) Stage-by-stage knowledge integration from speech attributes to recognized sentences.
conventional recognizer was used to produce a list of multiple candidates. To further discriminate some of the
minimal pairs, such as the English letters /d/ and /t/, a
VOT-based detector [86] can be used in the second stage to
provide a detailed discrimination. It was shown that the
VOT temporal feature produces a pair of curves with better
discrimination (i.e., with more separation between them)
than those obtained with spectral features alone. By reor-
dering candidates according to VOT, the two-stage recognizer gave an error rate 50% less than that obtained in a
state-of-the-art ASR system [116].
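A toy version of this two-stage idea is sketched below: first-pass candidates are reordered by their consistency with a measured VOT. The 30-ms voiced/unvoiced boundary, the letter sets, and the scores are assumptions for illustration, not values taken from [86] or [116].

```python
def reorder_by_vot(candidates, measured_vot_ms, boundary_ms=30.0):
    """Second-pass reordering of confusable letter candidates using VOT.

    candidates: list of (letter, first_pass_score). Voiced stops (e.g.,
    the letter D) typically show a short voice onset time and unvoiced
    stops (e.g., T) a long one; the 30-ms boundary is an assumption.
    """
    voiced = {"D", "B", "G"}

    def consistency(letter):
        # 1.0 when the letter's voicing agrees with the measured VOT
        is_voiced = letter in voiced
        short_vot = measured_vot_ms < boundary_ms
        return 1.0 if is_voiced == short_vot else 0.0

    # Rank by VOT consistency first, then by the first-pass score.
    return sorted(candidates,
                  key=lambda c: (consistency(c[0]), c[1]), reverse=True)

# The first pass slightly prefers T, but a 12-ms VOT indicates a
# voiced stop, so D is promoted to the top of the list.
print(reorder_by_vot([("T", -5.0), ("D", -5.4)], measured_vot_ms=12.0))
```

A long measured VOT would instead leave the unvoiced candidate on top, mirroring the detailed second-stage discrimination described above.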
B. Back-End Knowledge Integration Processing
Another critical component in the ASAT paradigm is the back-end processing shown in the bottom panel in
Fig. 8. An event merger takes the set of detected lower level
events as input and attempts to infer the presence of higher level units (e.g., a phone or a word). Those higher level
units are then validated by the evidence verifier to produce
a refined and partially integrated lattice of event hypoth-
eses to be fed back for further event merger and knowledge
integration. This iterative information fusion process al-
ways uses the original event activity functions as the raw
cues. A terminating strategy can be instituted by utilizing
all the supported attributes.
The procedure produces the evidence needed for a final
decision, including a recognized sentence. Each activity
function can be modeled by a corresponding neural system.
Both activation levels and firing rates have been used in
neural encoding and neuron combinations to encode tem-
poral information. Simulating perception of temporal
events is of particular interest in auditory perception of
speech. New techniques are needed to accomplish this form of lattice parsing. Conditional random field (CRF)
[117] is a mathematical framework that can be used to
describe sequences of symbols (such as phones or words) in
terms of input features, e.g., local phonetic attribute detec-
tions. CRF has been utilized in a number of ASAT-related
studies (e.g., [118]-[120]).
To make use of the detected features, we must combine
them in a way that we can produce word hypotheses. In essence, this boils down to three problems: 1) combining
multiple estimates of the same event to build a stronger
hypothesis; 2) combining estimates of different events to
form a new, higher level event with similar time bounda-
ries; and 3) combining estimates of events sequentially to
form longer term hypotheses. Note that these problems are
somewhat independent of the level of modeling: while the
canonical bottom-up processing sequence would be to combine multiple estimates of each feature, and then com-
bine the features into phones and then words (and word
sequences), we envision a highly parallel paradigm that is
flexible enough, for example, to combine a feature-based
phone detector with a directly estimated phone detector. In
principle, a 20 000-word ASR system can be realized with a
set of 20 000 single-keyword detectors [121].
Combining evidence for the same linguistic unit [problem 1)] has been the focus of techniques such as
multistream acoustic modeling (e.g., [122] and [123]) and
recognition hypothesis combination [110]. In addition,
stochastic combination to form strong verifiers from a
collection of weak detectors, such as boosting (e.g., [124]
and [125]), is a useful tool to combine low-level events into
high-level evidences.
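The following sketch shows one simple way weak detector scores could be combined into a stronger verifier: a weighted sum of per-detector log-odds. It is a simplified stand-in for the boosting schemes cited above, with made-up weights rather than learned ones.

```python
import math

def boosted_verify(weak_scores, alphas, threshold=0.0):
    """Combine weak detector decisions into one strong verifier.

    weak_scores: per-detector scores in (0, 1); alphas: detector
    weights (e.g., as learned by a boosting procedure). Each score is
    mapped to a signed vote via its log-odds, then the votes are
    combined by a weighted sum and compared with a threshold.
    """
    total = sum(a * math.log(s / (1.0 - s))
                for a, s in zip(alphas, weak_scores))
    return total > threshold, total

# Two mildly confident detectors and one weak dissenter still yield
# an overall acceptance of the hypothesized high-level event.
accepted, score = boosted_verify([0.8, 0.6, 0.4], [1.0, 0.5, 0.5])
```

Scores of 0.5 contribute nothing (zero log-odds), so uninformative detectors are naturally ignored by this combination rule.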
C. Event and Evidence Verification
Verification of patterns is often formulated as a statis-
tical hypothesis testing problem [126] as follows: given a
test pattern, one wants to test the null hypothesis against
the alternative hypothesis. Event verification, a critical
ASAT component, can be formulated in a similar way. For
most practical verification problems in real-world speech
and language modeling, a set of training examples is used to estimate the parameters of the distributions of the null
and alternative hypotheses. The two competing hypotheses
and their overlap indicate the two types of error known as
miss detection and false alarm errors [126]. A generalized
log-likelihood ratio (GLLR) was proposed as a way to
measure a separation between models of competing
hypotheses [127].
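Under a Gaussian assumption for the score distributions, the log-likelihood ratio and the resulting separation between the null and alternative hypotheses can be sketched as follows. The models and the Monte Carlo separation measure are illustrative choices, not the GLLR definition of [127].

```python
import math, random

def gaussian_logpdf(x, mu, sigma):
    # Log-density of a 1-D Gaussian at x
    return (-0.5 * math.log(2 * math.pi * sigma**2)
            - (x - mu)**2 / (2 * sigma**2))

def gllr(x, target, anti):
    # Log-likelihood ratio of a test score x under the target model
    # and the competing (anti) model, both Gaussian (mu, sigma) here.
    return gaussian_logpdf(x, *target) - gaussian_logpdf(x, *anti)

def separation(target, anti, n=10000, seed=0):
    """Monte Carlo estimate of how well the LLR separates the two
    hypotheses: average of the fraction of target draws with positive
    LLR and the fraction of anti draws with negative LLR."""
    rng = random.Random(seed)
    hit_t = sum(gllr(rng.gauss(*target), target, anti) > 0
                for _ in range(n))
    hit_a = sum(gllr(rng.gauss(*anti), target, anti) < 0
                for _ in range(n))
    return 0.5 * (hit_t + hit_a) / n

# Well-separated hypothesis models yield a separation close to 1;
# heavily overlapping ones approach chance.
print(separation((2.0, 1.0), (-2.0, 1.0)))
```

Miss detections and false alarms correspond to the two failing cases counted inside `separation`, so shrinking the overlap of the two models reduces both error types at once.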
The verification performance is often evaluated as a combination of the two types of errors. The related topic of
CMs (e.g., [103]) has also been intensively studied by
many researchers recently, e.g., [92] and [102]. This is due
to an increasing number of applications being developed
and deployed in the past few years. In order to have intelligent or humanlike interactions in these dialogs, it is
important to attach to each event a value that indicates
how confident the ASR system is about accepting the recognized event. This number, often referred to as a CM,
serves as a reference guide for the dialog system to provide
an appropriate response to its users just like an intelligent
human being is expected to do when interacting with
others.
Fig. 10 shows an example of how to use the GLLR plots.
Specifically, Fig. 10 displays in the top left, bottom left,
and top right, three sets of distribution curves for detecting the three corresponding phones /w/, /ah/, and /n/, in the
word one. Here the ARPABET [128], used in the ARPA
Speech Understanding Research (SUR) project, is adopted
to denote the phonetic symbols used throughout this paper.
By approximating the three sets of curves with Gaussian
densities, the Gaussian curves for detecting the word one
can be composed as shown in the bottom right of Fig. 10. It
is noted that words are in general easier to detect than phones, because the composed competing Gaussian curves
show a better separation, or equivalently less overlap.
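The composition argument can be made concrete under the Gaussian approximation: if the phone-level LLR scores are independent Gaussians, the word-level score is their sum, whose separation (measured here by a d'-style quantity) exceeds that of any single phone. The numeric values below are toy illustrations, not the curves of Fig. 10.

```python
import math

def dprime(pos, neg):
    # Separation between two Gaussians (mu, sigma): a d'-style measure.
    (m1, s1), (m0, s0) = pos, neg
    return abs(m1 - m0) / math.sqrt(0.5 * (s1**2 + s0**2))

def compose(phone_models):
    """Compose per-phone LLR score distributions into word-level ones.

    Assuming the phone scores are independent Gaussians, their sum
    (the word score) is Gaussian with summed means and variances.
    """
    def add(models):
        mu = sum(m for m, _ in models)
        sigma = math.sqrt(sum(s**2 for _, s in models))
        return mu, sigma
    pos = add([p for p, _ in phone_models])
    neg = add([n for _, n in phone_models])
    return pos, neg

# (target, competing) Gaussian LLR models for /w/, /ah/, /n/ (toy values)
phones = [((1.0, 1.0), (-1.0, 1.0))] * 3
word_pos, word_neg = compose(phones)
# Word-level separation exceeds each single phone's separation.
print(dprime(word_pos, word_neg), dprime((1.0, 1.0), (-1.0, 1.0)))
```

Because means add linearly while standard deviations add in quadrature, the composed word-level separation grows roughly with the square root of the number of phones, which is the "less overlap" effect noted above.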
D. Speech Attribute Detection
A possible implementation of the ASAT detection-based front-end is shown in Fig. 11. It consists of two
main blocks: 1) a bank of attribute detectors that can
produce detection results in terms of a confidence score;
and 2) an evidence merger that combines low-level events
(attribute scores) into higher level evidence, such as
phone posteriors. The append module, shown in Fig. 11,
stacks together the outputs delivered by the attribute detectors for a given input and generates a supervector of
attribute detection scores. This feature vector is then fed
into the merger. In summary, the system shown in Fig. 11
maps acoustic features (e.g., short-time spectral features
or temporal pattern features) into phone posterior
probabilities.
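A minimal sketch of this front-end path, an append module that stacks detector scores into a supervector and a merger that maps it to phone posteriors, is given below. The single-layer merger and its weights are illustrative stand-ins for the trained MLP described in the text.

```python
import math

def append_scores(detector_outputs):
    # The append module: stack the per-detector score vectors for one
    # input frame into a single supervector.
    vec = []
    for scores in detector_outputs:
        vec.extend(scores)
    return vec

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def merge_to_phone_posteriors(supervector, weights, biases):
    """A one-layer stand-in for the MLP evidence merger: map attribute
    scores to phone posteriors. A real merger would have hidden layers
    and trained parameters; these are illustrative."""
    logits = [sum(w * x for w, x in zip(row, supervector)) + b
              for row, b in zip(weights, biases)]
    return softmax(logits)

sv = append_scores([[0.9, 0.1], [0.2]])        # two detectors, 3 scores
post = merge_to_phone_posteriors(
    sv, weights=[[2.0, 0.0, 0.0], [0.0, 2.0, 0.0]], biases=[0.0, 0.0])
```

The softmax output sums to one, so the merger's output can be read directly as a posterior distribution over the hypothesized phones.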
In our studies, phonetic features, such as manner and
place of articulation, are used as the speech events of in-terest. The motivations behind this choice are: articulator-
motivated features improve robustness toward noise at
low signal-to-noise ratios [129]; they improved recognition
of hyper-articulated speech and in the presence of
different speaking styles [130]; they are reliably detected
[130]; and they carry linguistic information [129]. Table 2
shows the phonetic features used in the experiments re-
ported in the next sections. Silence is also used to represent the absence of any speech activity. This feature
inventory is clearly not complete. Nevertheless, this set
could always be expanded.
Attributes should be stochastic in nature and the cor-
responding detectors designed with data-driven modeling
techniques. The goal of each detector is to analyze a speech
segment and produce a confidence score or a posterior
probability that pertains to some acoustic-phonetic attri-
bute. Generally speaking, both frame- and segment-based
data-driven techniques can be used for speech event detection. Frame-based detectors can be realized in several
ways, e.g., with ANNs [22], Gaussian mixture models
(GMMs) [131], and support vector machines (SVMs)
[132]. One of the advantages with ANN-based detectors is
that the output scores can simulate the posterior proba-
bilities of an attribute given the speech signal. On the other
hand, segment-based detectors are more reliable in spot-
ting segments of speech [133]. Segment-based detectors can be built by combining frame-based detectors or with
segment models, such as HMMs, which have already been
shown effective for ASR [8]. Time-delay neural networks
(TDNNs) were also shown to be effective in designing
segment-based attribute classifiers [134]. The reader is re-
ferred to a recent Ph.D. dissertation [135] detailing the
process of building accurate TDNN-based classifiers for all
the attributes of interest.
1) Frame-Based Attribute Detectors: In the case of frame-based design, each detector is realized with three MLPs
organized in a hierarchical structure [136] similar to a way
of modeling long-term energy trajectories, referred to as
Fig. 10. Verifying a sequence of sequential hypotheses for the word one (bottom right) based on evidence of verifying the three phones, /w/ (top left), /ah/ (bottom left), and /n/ (top right), in the word.
TempoRAl Patterns (TRAP)-based features [137], [138]. In
ASAT, sub-band energy trajectories arranged into split-
temporal context as described in [139] are used. In the
experiments, all MLPs are trained to compute the attribute posterior probabilities [136].
2) Segment-Based Attribute Detectors: When segment-based detectors, e.g., HMMs, are used to categorize a seg-
ment of speech into attribute classes, either the log-likelihood or the log-likelihood ratio can be adopted as the detector score. In
our ASAT framework, an LLR-based score is used to measure the goodness-of-fit between a speech segment and the corresponding speech feature because it has already
proven useful in rejecting wrong hypotheses in several
speech tasks [102], [104], [140], [141].
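A deliberately simplified version of such a duration-normalized LLR score, with single Gaussians standing in for the target and competing attribute HMMs, is sketched below; all parameter values are illustrative, not trained.

```python
import math

def seg_loglik(frames, mu, sigma):
    # Log-likelihood of a 1-D feature segment under a single Gaussian,
    # a deliberately simplified stand-in for an attribute HMM.
    return sum(-0.5 * math.log(2 * math.pi * sigma**2)
               - (x - mu)**2 / (2 * sigma**2) for x in frames)

def segment_llr(frames, target, anti):
    """Duration-normalized LLR used as a goodness-of-fit score between
    a speech segment and a hypothesized attribute. Normalizing by the
    segment length keeps scores comparable across segment durations."""
    return (seg_loglik(frames, *target)
            - seg_loglik(frames, *anti)) / len(frames)

# A segment matching the target model receives a positive LLR score;
# one matching the competing model would receive a negative score.
print(segment_llr([0.9, 1.1, 1.0], target=(1.0, 0.5), anti=(-1.0, 0.5)))
```

Positive scores support accepting the hypothesized attribute and negative scores support rejecting it, which is exactly the role the LLR plays in the rejection tasks cited above.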
In Fig. 12, we show the typical detection curves of
manner of articulation, for the utterance numbered
442c013 (whose transcript is THAT'S FINE) of the SI-84 set [142]. Detection curves of place of articulation can also
be generated. Nonetheless, manner of articulation is often easier to detect than the place of articulation in both
spectrogram plots and MLP- or HMM-based detection
plots. This is mainly due to the fact that manners can
usually be clearly distinguished in their attribute beha-
viors. A collaborative effort can easily be envisioned where
researchers with many years of experience in specific
topics, e.g., stop sounds [143] and fricatives [144], can
provide their best detector modules to show their superior
performance to other competing modules.
Furthermore, score plots can be used to compare detector performance. If we use 0.5 as a threshold to accept a detected event, then most of the attributes in the utterance are correctly detected. Regions with scores below 0.5 usually indicate low confidence, exhibiting either type I or type II errors as discussed earlier. In the example shown in Fig. 12, two sets of detector score plots for the same manner of articulation for the short speech segment THAT'S FINE are displayed. The curves with a zigzag shape were obtained with the MLP-based detectors, while the red curves with straight-line scores were generated by conventional three-state attribute HMMs. It is clear that the HMM-based segment detectors perform perfectly in this case, while the MLP-based detectors show quite a bit of time-varying scores.
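The 0.5 accept/reject rule mentioned above amounts to segmenting each frame-level score curve at the threshold. The sketch below is illustrative only; the function name and the toy score curve are assumptions, not ASAT internals.

```python
import numpy as np

def detected_regions(scores, threshold=0.5):
    """Return (start, end) frame-index pairs (end exclusive) where the
    detector score stays at or above the acceptance threshold."""
    above = (np.asarray(scores) >= threshold).astype(int)
    # Pad with zeros so rising/falling edges at the borders are caught
    edges = np.flatnonzero(np.diff(np.concatenate(([0], above, [0]))))
    return [(int(s), int(e)) for s, e in zip(edges[::2], edges[1::2])]

# Toy manner-detector curve over nine 10-ms frames
scores = [0.1, 0.2, 0.7, 0.9, 0.8, 0.3, 0.6, 0.6, 0.2]
print(detected_regions(scores))  # [(2, 5), (6, 8)]
```

Frames inside the returned regions correspond to accepted events; frames outside them are the low-confidence regions where type I or type II errors tend to occur.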
E. Evidence Merger
The bank of detectors provides evidence of a particular speech event. Bits of evidence at the phonetic feature level are combined to form evidence at a higher level. The focus of this section is on how to generate higher level evidence at the subword level. There exist several methods to generate evidence at the subword level from articulatory events. For example, starting with manner and place of articulation, a product lattice of degree two may be generated, and a constrained search may be performed over this lattice to generate phone-level information [145]. CRFs [118] and segmental CRFs (SCRFs) [146] have also been used to generate phone sequences by combining articulatory features. In our framework, all of the detector outputs are combined with a feedforward MLP, which has a single hidden layer. In a recent work, we demonstrated that phone accuracies can be boosted using a deep
Table 2 List of Speech Attributes Used in the ASAT Experiments
Fig. 11. A preliminary implementation of the ASAT detection-based frontend. Each attribute detector analyzes any given input frame and produces a posterior probability score. The Append module stacks together the attribute posterior probabilities. The merger delivers phone posterior probabilities.
neural network (DNN) [147]–[149], as shown in [150]. Fig. 13 provides a schematic representation of the event merger.
By merging the attribute detector outputs and feeding them into the attribute-to-phone mapping merger shown in Fig. 13, we can produce frame-based posterior probabilities, one for each phone of interest, and form a frame-based feature vector. A penalized logistic regression with HMM-based regressors [151] has also been employed to combine information generated by a bank of segment-based attribute detectors [152], obtaining remarkable results on a phone classification task.
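A single-hidden-layer merger of this kind reduces to one matrix-vector forward pass per frame. The sketch below uses randomly initialized weights standing in for trained ones, and the layer sizes (21 attributes, 100 hidden units, 40 phones) are chosen only for illustration, not taken from the ASAT configuration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_attributes, n_hidden, n_phones = 21, 100, 40

# Random weights stand in for a trained attribute-to-phone merger
W1 = rng.standard_normal((n_hidden, n_attributes)) * 0.1
b1 = np.zeros(n_hidden)
W2 = rng.standard_normal((n_phones, n_hidden)) * 0.1
b2 = np.zeros(n_phones)

def merge(attribute_posteriors):
    """Map one frame's stacked attribute posteriors to phone posteriors."""
    h = np.tanh(W1 @ attribute_posteriors + b1)  # hidden layer
    z = W2 @ h + b2
    z -= z.max()                                 # for numerical stability
    p = np.exp(z)
    return p / p.sum()                           # softmax over phones

frame = rng.uniform(size=n_attributes)  # one frame of detector outputs
phone_post = merge(frame)
print(phone_post.shape)  # (40,)
```

Applying `merge` to every 10-ms frame yields exactly the frame-based phone posterior vectors described above, which can then be stacked into a posteriogram or passed to a backend decoder.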
VI. ASAT APPLICATIONS
The ASAT detector frontend has been used with success as a key component for several attribute-based speech applications, namely: lattice rescoring [153], [154], language-universal phone recognition [155], and bottom-up, stage-by-stage LVCSR [156], [157]. Most of the results are preliminary studies because the ASAT-related effort is quite recent. We hope to inspire other new applications. A discussion of the new insights that may be offered by the detector score plots, together with the posteriogram plots, is presented first. Finally, a spoken language identification system based on attribute detectors is presented in [158]–[160].
A. Visible Speech Analysis Through Attribute Detection
1) Detection Score Plots for MLP: We now analyze detection score plots more closely. Fig. 14 displays a longer sentence than that shown in Fig. 12. Although the detection curves are not as smooth as those in the previous example, the correct transcript can still be obtained by following the evolution of the event detection process over time. This outcome is also in line with spectrogram reading by trained experts based on knowledge of acoustic phonetics (e.g., [41]). The detector scores here are normalized between 0 and 1, ranging from the absence of an acoustic property to the full presence of a speech cue. The value of
Fig. 13. A possible implementation of the ASAT merger. It is trained using the output of the bank of attribute detectors and generates phone posterior probabilities. Either a shallow MLP network or a DNN can be used.
Fig. 12. Detection curves of manner of articulation for the sentence numbered 442c013 (whose transcript is THAT'S FINE) of the SI-84 data set [142]. The curves in blue were generated using an ANN, whereas the curves in red were generated using segment-based HMM detectors.
these detection scores is a good indication of the activity levels of the speech events of interest. Therefore, it provides a new visualization tool in addition to the conventional spectrogram plot shown in the top panel of Fig. 14.
Error analysis has always played a crucial role in providing diagnostic information for improving ASR algorithms. With the extracted speech cue information revealed in the new visualization tool, insight can also be developed into understanding human speech. It could also provide a good tool for offering speech insights to a new generation of students and researchers.
For example, we can see sound transition behavior clearly displayed in the region from segment 6 to segment 8, going from phone /eh/ to /aa/, with a rising activity from the preceding vowel into the glide sound /l/ in segment 7, then falling away into the following vowel. We can also observe the overlapping nature of a nasalized vowel at the end of segment 8 and the beginning of segment 9. The double stop sound regions in segments 13 and 14 are also signaled. The large overlapping region for the two-candidate segment 21 indicates that the glide sound /r/ heavily influences articulation in its surrounding phones, with a low-level vowel activity showing up between segments 21 and 22 on the detector plot for the vowel manner.
It is clear that the detector score plots displayed in Fig. 14 provide a rich set of information not commonly available to researchers who are not trained experts in spectrogram reading. It also reinforces the additional advantages we intend to exploit in the information-extraction perspective we have highlighted throughout this paper.
2) Posteriogram Plots for MLP and DNN: We plot in Fig. 15 the time evolution of the estimated frame posterior probabilities for phones, or phone posteriogram [93], for the same short utterance used in Fig. 12. Instead of displaying the CM value between 0 and 1 as in the detector score plots, an intensity similar to that of spectrogram plots is displayed, showing darker regions for higher posterior probabilities at each 10-ms frame for the corresponding phone. On the vertical axis, 40 phones are listed, starting with the phone /aa/ as in the word pot at the top, and finishing with the phone /zh/ as in the word treasury at the bottom. It is clearly visible that the silence unit stands out in the beginning and ending parts of the utterance in both plots, shown in the upper and lower panels, respectively. A DNN is practically an MLP with many layers (seven hidden layers have been used in our studies), where the pretraining algorithm proposed for deep belief networks [147] has been applied before training the MLP [148], [150], which has a single hidden layer.
The DNN posteriograms are often sharper than the MLP ones, indicating that the top candidate phones, marked with the darkest region at each vertical time snapshot, have less competition from other phones. The blurry nature of some regions in the plots indicates that some phones are confusable at that time frame. This posteriogram can be displayed together with the detector score plots shown in Fig. 12 to gain insight about the goodness of the attribute-to-phone mapping. For example, in the initial part of the
Fig. 14. Detection curves of manner of articulation for the sentence numbered 440c20t (RATES FELL ON SHORT TERM TREASURY BILLS) of the SI-84 data set [142]. The correct transcript can still be recovered by following the time evolution of the detection of the attribute events.
phone /ay/, it would be misrecognized as phone /aw/, but it could be corrected because the duration of a diphthong /aw/ cannot be so short. Finally, the DNN-based posteriogram is not as smooth as, and less noisy than, that obtained using the MLP with a single hidden layer, and therefore more reliable than that produced by the MLP-based detectors. With the seven phones, together with the two silence segments, clearly labeled in the bottom panel, it is possible to observe the correct transcript by following the black lines (the time evolution of the top phone posterior probabilities). Furthermore, it is easy to identify the possible sources of confusion and devise techniques to address them.
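Tracing the black lines of top phone posteriors described above amounts to an argmax over each frame column of the posteriogram. The sketch below is illustrative; the toy phone set and matrix are assumptions, not data from Fig. 15.

```python
import numpy as np

def top_phone_path(posteriogram, phone_labels):
    """posteriogram: (n_phones, n_frames) matrix of frame posteriors.
    Returns the most probable phone label at each 10-ms frame."""
    best = np.argmax(posteriogram, axis=0)
    return [phone_labels[i] for i in best]

# Toy five-phone, five-frame posteriogram (columns sum to ~1)
phones = ["sil", "dh", "ae", "t", "s"]
post = np.array([
    [0.90, 0.10, 0.10, 0.00, 0.10],  # sil
    [0.05, 0.80, 0.20, 0.10, 0.00],  # dh
    [0.02, 0.05, 0.60, 0.20, 0.10],  # ae
    [0.02, 0.03, 0.05, 0.60, 0.20],  # t
    [0.01, 0.02, 0.05, 0.10, 0.60],  # s
])
print(top_phone_path(post, phones))  # ['sil', 'dh', 'ae', 't', 's']
```

Frames where the runner-up posterior is close to the maximum correspond to the blurry, confusable regions discussed above, so the margin between the top two posteriors per column is itself a useful frame-level confidence cue.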
B. Attribute-Based Lattice Rescoring
Lattice rescoring is reported for an LVCSR task only. The readers are referred to [154] for further insights on other tasks.
1) Rescoring Technique: The rescoring algorithm aims to integrate the confidence scores generated by the ASAT detection-based frontend into the word lattice on an arc-by-arc basis. Rescoring is carried out as a linear combination of the log-likelihood acoustic score generated by the baseline LVCSR system and the logarithm of the phoneme posterior probability, properly discounted by the phoneme prior probability, generated by the detector-based frontend.
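This arc-level combination can be sketched in one line of arithmetic. The interpolation weight lam, the function name, and the numeric values below are all hypothetical, not the settings used in our experiments.

```python
import math

def rescore_arc(hmm_loglik, phone_posterior, phone_prior, lam=0.9):
    """Combine the baseline HMM acoustic log-likelihood with the
    detector-based log phone posterior discounted by the log phone
    prior (i.e., a scaled log-likelihood under Bayes' rule)."""
    detector_term = math.log(phone_posterior) - math.log(phone_prior)
    return lam * hmm_loglik + (1.0 - lam) * detector_term

# An arc whose phone the detectors strongly support gets a better score
strong = rescore_arc(-120.0, phone_posterior=0.90, phone_prior=0.025)
weak = rescore_arc(-120.0, phone_posterior=0.05, phone_prior=0.025)
print(strong > weak)  # True
```

Subtracting the log prior converts the posterior into a quantity proportional to the likelihood, so the detector term and the HMM acoustic score are combined on comparable footing.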
2) Experimental Setup: The experiments are performed using the 5000-word speaker-independent WSJ (5k-WSJ0) corpus [142]. The SI-84 data set (7077 utterances from 84 speakers, i.e., 15.3 h of speech material) is used for training. The testing material is again the Nov92 set. ML estimation [5], [131], [161] is adopted to find the parameters of a first HMM baseline. A second HMM baseline is then
Fig. 15. Posteriogram plots for the sentence numbered 442c0213 (THAT'S FINE) of the SI-84 data set [142]. (a) MLP-based posteriogram plot. (b) DNN-based posteriogram plot.
obtained using MMI with the ML HMMs of the first system as seed models. A trigram language model within the 5k-WSJ0 vocabulary is used in decoding.
3) Experimental Results: Table 3 shows the performance of the ML-based baseline system, in terms of WER, on the Nov92 task. These results are comparable with the results reported in [162]–[164]. In the second row, the performance of the rescored system on the same task is shown when both the bank of detectors and the merger are trained with the phonetically rich TIMIT corpus [165], [166]. Moreover, the WERs of the MMI-based baseline and rescored systems are shown in the last two rows, respectively. The results indicate that the rescored system achieves better performance than the conventional decoding scheme in both cases. We noticed that several hypotheses have the same error patterns for both HMM systems. These incorrectly recognized words are typically characterized by an acoustic trajectory that did not strictly observe the underlying acoustic-phonetic constraints. The lack of the required acoustic-phonetic evidence for the wrong hypotheses could be signaled by attribute detectors, and this information could then be used to penalize the corresponding phones during the rescoring and subsequent decoding steps.
The phonetic-based correction concept is now illustrated on the Nov92 sentence numbered 446c0210. The correct sequence of words for this utterance is: The company said its European banking affiliate Safra Republic plans to raise more than four hundred fifty million dollars through an international offering. However, the baseline ML and MMI systems produce the phrase stock for instead of the word Safra. Fig. 4 shows the spectrogram for this utterance at the location where the error occurred with the ML- or MMI-based system.
In Fig. 4, correctly recognizing the word stock requires the presence of two stop sounds, /t/ and /k/, in the region surrounding the middle vowel. But from the spectrogram, it can be easily seen that there is a lack of articulatory evidence to support this decoded word. The stop detector signals these mistakes, and the correct sentence is decoded after rescoring.
C. Cross-Language Attribute Detection
We now report our studies on language-independent attribute detection, which were extended to phone recognition with minimal available data for the target language in [155]. English manner attribute scores have been effectively incorporated into Mandarin LVCSR to improve performance by lattice rescoring in a cross-language manner as well [96].
1) Language-Universal Knowledge Source Definition: Fundamental speech attributes, such as voicing, nasality, and frication, could be identified from a particular language and shared across many different languages, so they could also be used to derive a universal set of speech units. A small number of these universal units could then be used to model speech sounds. It is worth noting that these phonetic features (attributes) have already been used to identify a common knowledge source in several studies (e.g., [167] and [168]), yet these features were often employed within a knowledge-based phoneme mapping procedure to: 1) produce an expanded phoneme set to cover speech sounds in multiple languages (e.g., [168]); or 2) find the mapping between language-dependent (or -independent) acoustic models and the new target acoustic models (e.g., [167]) for decoding purposes.
2) Experimental Setup: The stories part of the OGI multilanguage telephone speech corpus [169] is employed in our investigation. The amount of transcribed data is only about 1 h per language, which is significantly smaller than the usual amount of data used to train multilingual ASR systems, e.g., [170]. This corpus has phonetic transcriptions for six different languages: English (ENG), German (GER), Hindi (HIN), Japanese (JAP), Mandarin (MAN), and Spanish (SPA). Three subsets, namely training, validation, and test sets, are formed using the data available for each language. Table 4 shows the amount of available data for each subset and the number of language-dependent phone units. Each attribute detector is designed within the MLP framework as described in Section V-D1. Performance is reported as in [171].
3) Language-Dependent Attribute Detection: Language-specific data (Table 4) are used to train, validate, and test each detector. Language-dependent attribute accuracies are found to be comparable across languages and attributes. This implies that attribute classification could be reliably obtained for a variety of languages. Furthermore, good attribute accuracies could be achieved for several attributes, such as vowel (92%) and continuant (90%),
Table 4 The OGI Stories Corpus in Terms of Amount of Data (in Hours) and Number of Phonemes Used for Each Language
Table 3 WER, in Percentage, on the Nov92 Task. Rescoring Was Applied to Both the ML- and MMI-Based Baseline Systems, Trained on the SI-84 Material of the WSJ0 Corpus
across languages. The full list of accuracies and insights
can be found in [155].
4) Cross-Language and Language-Universal Attribute Detection: The detectors of a specific language are now tested on the data of the other languages. Fig. 16 shows the attribute accuracy rates on the MAN data. The connected line highlights results obtained with Mandarin-based detectors. From Fig. 16, we can make several observations. First, detection across languages is less reliable than in the language-dependent cases for several attributes, but the drop in performance is not particularly severe. Attribute accuracies are comparable across all languages for the vowel class. There are also cases in which cross-language detection outperformed in-language detection, e.g., round. This indicates that the round detector trained on a non-Mandarin language performed as well as the language-dependent Mandarin detector. A similar trend is observed for all of the languages used in our studies.
By pooling together all training materials from the six languages, a new language-independent data set could be formed. Then, a single bank of (universal) detectors could be trained on this new data set. We observed that better attribute accuracies could be attained for quite a few attributes, e.g., vowel, fricative, round, and mid, with only a minor degradation for the worst performing detectors.
D. ASAT-Based Bottom-Up LVCSR
As an initial attempt at implementing a bottom-up LVCSR system, the hybrid ANN/HMM approach has been modified to explicitly represent and manipulate the search space at various points in the decoding process [156] using weighted finite state machines (WFSMs) [172]. ASR is then accomplished in a bottom-up fashion by performing backend lexical access and syntax knowledge integration over the output of our detection-based frontend, which generates frame-level speech attribute detection scores and phone posterior probabilities. Decoupled recognition is made possible by two main factors: 1) high-accuracy detection of acoustic information in order to generate high-quality lattices at every stage of the acoustic and linguistic information processing; and 2) low-error pruning of the generated lattices in order to reduce search errors likely to occur when trying to minimize the possibility of memory overflow in using the AT&T WFSM tool.
1) Detection-Based LVCSR With WFSMs: LVCSR is accomplished by building on the frame-level evidence gathered at the output of the detection-based frontend shown in Fig. 11. The first step is to represent the output of the detection-based frontend, for a given utterance, as an acceptor F. In practice, F is a graph with a number of states that equals the length of the input sentence (in frames), and a number of edges between each pair of states that equals the output dimension of the merger (i.e., the number of even