Speech processing research paper 4


description

This is an IEEE paper that is not accessible without a paid IEEE account.

Transcript of Speech processing research paper 4

  • INVITED PAPER

    An Information-Extraction Approach to Speech Processing: Analysis, Detection, Verification, and Recognition. This paper presents an integrated detection and verification approach to

    information extraction from speech that can be used for speech analysis, and

    recognition of speech, speakers, and languages.

    By Chin-Hui Lee, Fellow IEEE, and Sabato Marco Siniscalchi, Member IEEE

    ABSTRACT | The field of automatic speech recognition (ASR) has enjoyed more than 30 years of technology advances due to

    the extensive utilization of the hidden Markov model (HMM)

    framework and a concentrated effort by the speech community

    to make available a vast amount of speech and language

    resources, known today as the Big Data Paradigm. State-of-the-

    art ASR systems achieve a high recognition accuracy for well-

    formed utterances of a variety of languages by decoding

    speech into the most likely sequence of words among all possi-

    ble sentences represented by a finite-state network (FSN) ap-

    proximation of all the knowledge sources required by the ASR

    task. However, the ASR problem is still far from being solved

    because not all information available in the speech knowledge

    hierarchy can be directly integrated into the FSN to improve

    the ASR performance and enhance system robustness. It is

    believed that some of the current issues of integrating various

    knowledge sources in top–down integrated search can be par-

    tially addressed by processing techniques that take advantage

    of the full set of acoustic and language information in speech. It

    has long been postulated that human speech recognition (HSR)

    determines the linguistic identity of a sound based on detected

    evidence that exists at various levels of the speech knowledge

    hierarchy, ranging from acoustic phonetics to syntax and

    semantics. This calls for a bottom–up attribute detection and

    knowledge integration framework that links speech processing

    with information extraction, by spotting speech cues with a

    bank of attribute detectors, weighting and combining acoustic

    evidence to form cognitive hypotheses, and verifying these

    theories until a consistent recognition decision can be reached.

    The recently proposed automatic speech attribute transcrip-

    tion (ASAT) framework is an attempt to mimic some HSR

    capabilities with asynchronous speech event detection fol-

    lowed by bottom–up knowledge integration and verification. In

    the last few years, ASAT has demonstrated good potential and

    has been applied to a variety of existing applications in speech

    processing and information extraction.

    KEYWORDS | Acoustic phonetics; automatic speech attribute transcription (ASAT); automatic speech recognition (ASR);

    cross-language phone recognition; knowledge integration;

    lattice rescoring; place and manner of articulation; speech

    attribute detection

    I. INTRODUCTION

    It is instructive to examine some of the key developments in automatic speech recognition (ASR) that have occurred

    in the past few decades and contemplate new directions

    that might lead to better system designs. The ASR problem

    Manuscript received May 18, 2012; revised September 13, 2012; accepted January 4,

    2013. Date of publication February 7, 2013; date of current version April 17, 2013.

    The ASAT project was supported by the National Science Foundation (NSF)

    Information Technology Research (ITR) Program under Contract IIS-04-27113. Part

    of S. M. Siniscalchi's ASAT-related work was supported by the Spoken Information

    Retrieval by Knowledge Utilization in Statistical Speech Processing (SIRKUS) project

    through Prof. T. Svendsen at the Norwegian University of Science and Technology

    (Trondheim, Norway).

    C.-H. Lee is with the School of Electrical and Computer Engineering, Georgia Institute of Technology, Atlanta, GA 30332 USA (e-mail: [email protected]).

    S. M. Siniscalchi is with the Faculty of Architecture and Engineering, University of Enna "Kore," Enna 94100, Italy (e-mail: [email protected]).

    Digital Object Identifier: 10.1109/JPROC.2013.2238591

    Vol. 101, No. 5, May 2013 | Proceedings of the IEEE 1089 | 0018-9219/$31.00 © 2013 IEEE

  • is still far from solved, judging from the limited deployment of products and services worldwide, the

    fragile nature of performance robustness for the state-of-

    the-art ASR systems, and the slowdown of recent progress

    in performance improvement when the amount of training

    data to learn acoustic and language models for ASR is no

    longer the only major technology concern in designing

    ASR systems. Nevertheless, it is also clear that tremendous

    technology advances have been made since the speech community adopted an information-theoretic perspective

    of channel modeling for encoding speech from language,

    and formulated the ASR problem as a channel decoding

    paradigm [1].

    Taking advantage of the sequential nature of speech,

    and combining an efficient stage-by-stage decoding strat-

    egy with dynamic programming (DP, e.g., [2] and [3]) and

    a Markov inferencing framework [4]–[6], a continuous speech recognition algorithm was first developed in [7] to

    deliver good performance. The ease of learning speech and

    language models from data triggered almost four decades

    of rapid technology progress for ASR based on this integ-

    rated pattern modeling and decoding framework later

    known as hidden Markov models (HMMs, e.g., [8]). The

    same automatic data learning paradigm has been extended

    to quite a few machine learning problems in the last two decades. One of the most notable accomplishments was

    the development of a statistical machine translation (MT)

    framework originated by a group of ASR researchers at

    IBM [9], which also spun off many recent MT research and

    application activities and other similar statistical natural

    language processing (NLP) efforts (e.g., [10] and [11]).

    The aforementioned statistical pattern matching ap-

    proach to ASR is considered a paradigm shift from the traditional speech science perspective of crafting heuristic

    rules manually based on expert observations from limited

    data and local optimization, which is sometimes known as

    a bottom–up knowledge integration process. In contrast to

    traditional knowledge-rich approaches, the current knowl-

    edge-ignorant or knowledge-implicit modeling framework

    [12]–[14] relies on collecting a large amount of speech and

    text examples and learning the model parameters without the need to use detailed knowledge about a target lang-

    uage. It offers an advantage for automatic model learning

    from a large collection of data via a rigorous mathematical

    formulation and global optimization by using all the avail-

    able knowledge sources at the same time, known as top–

    down knowledge integration, ready for DP-based optimal

    decoding [15].

    During the transition to the new paradigm in the 1970s, an intensive effort in applying acoustic and lin-

    guistic knowledge sources to speech recognition in the

    Advanced Research Projects Agency (ARPA) Speech

    Understanding Project [16] was witnessed. Many notable

    examples were documented [17], [18]. Nonetheless, expert

    knowledge was required to design even a simple ASR

    system, which made the ASR technology hard to access.

    Furthermore, robustness to adverse conditions was never addressed in a serious manner. Much of the knowledge

    accumulated in these studies, e.g., [16]–[18], was not fully

    explored in current HMM-based systems. Moving into the

    1980s and 1990s, the dominance of data-driven learning

    approaches to speech modeling was witnessed. A number

    of techniques, including vector quantization (VQ) [19],

    HMM [8], self-organizing map (SOM) [20], and artificial

    neural network (ANN) [21], [22], have been successfully adopted.

    After many years of concentrated deliberation, the

    speech community has come a long way from the learning

    stage in the 1970s, and made a tremendous drive in data-

    driven approaches in the 1980s, 1990s, and 2000s. A

    continuous stream of performance improvement and

    increasing task complexity has thus been observed. For

    more detail, the reader is referred to a special issue of the Proceedings of the IEEE on Spoken Language Processing

    published in August 2000 [23]. A number of books [24]–

    [34] have also been published. However, it is also safe to

    argue that the technology progress has slowed down in

    recent years. Most research groups are searching for the

    next trend to move ASR forward. This phenomenon is

    known as the S-Curve in learning illustrated in the curve labeled in solid circles in Fig. 1. The community would generally agree that the fragile nature of ASR system design

    will require new technological breakthroughs before con-

    versational systems really become a ubiquitous user inter-

    face mode, being able to compete with conventional

    graphical user interface with point-and-click devices like

    mice or touch-sensitive screens.

    In examining Fig. 1 closely, we can roughly divide the

    ASR technology progress into three periods: 1) before the 1970s, labeled in blue circles, the speech community en-

    joyed a vast creation of speech knowledge sources (e.g.,

    [35]–[39]); 2) between the 1970s and the 2010s, labeled in

    green circles, data-driven models dominated four decades

    of fast advances with HMM playing the role of a paradigm

    shift and emerging as the leading framework used in

    almost all modern ASR systems; and 3) beyond the 2010s,

    labeled in pink circles, one can envision an imaginary path in which the speech community may be waiting for anoth-

    er paradigm shift to take place by exploring knowledge-

    rich modeling [12]. However, this paradigm shift should

    still leverage on data-driven automated learning from big

    language resources. By setting a human performance

    capability ceiling shown in a horizontal line in the upper

    part of Fig. 1, it is noted that there is still a big gap between

    the current state-of-the-art and a human speech recognition (HSR) system. In order to address the suprahuman

    performance goal set by IBM a few years ago [40], the

    speech community will need fast technology progress

    again, similar to what the community had enjoyed in the

    last four decades. This may be a good time to call for a

    paradigm shift and reexamine what can be done to carry

    the community forward.


  • There have been many attempts to find the next para-

    digm shift. One of them is exploiting the distinctive

    features of speech (e.g., [16] and [37]) to form words along

    with the use of lexical access to a dictionary of words as demonstrated in spectrogram reading by human experts

    [41] in the MIT Summit system [42]. Using bottom–up

    knowledge integration, HSR also performed much better

    than ASR in most benchmark tests [43], [44]. The speech

    cues, or attributes, or events, can thus serve as acoustic

    landmarks [42], [45]–[47], sometimes referred to as

    "islands of reliability" in the ocean of competing speech

    events, to improve bottom–up knowledge integration. Recently, a detection approach to ASR, called automatic

    speech attribute transcription (ASAT) [13], was proposed

    in another attempt to address research issues in HSR and

    spectrogram reading through bottom–up attribute detec-

    tion and stage-by-stage knowledge integration. We will

    collectively refer to this set of viewpoints as an informa-

    tion-extraction perspective to extract useful acoustic and

    linguistic information for the purposes of speech recognition and understanding. ASAT also facilitates a modular

    strategy so that various researchers can collaborate by

    contributing their best detectors or knowledge modules to

    plug-n-play into the overall system design.

    The rest of the paper is organized as follows. In

    Section II, we briefly review the statistical pattern recog-

    nition approach to ASR. We describe the single most im-

    portant technique that has helped advance the state of the art of ASR, namely hidden Markov modeling of speech,

    and discuss current ASR capabilities. We next address

    several ASR technology limitations in Section III. A list of

    active ASR research challenges in robustness, decision

    strategies, and utterance verification is also presented.

    These challenges lead to a bottom–up speech attribute

    detection framework followed by a stage-by-stage knowl-

    edge integration process to be discussed in Section IV. To

    enhance the current capabilities and alleviate some of the

    limitations of HMM-based ASR, a bottom–up detection

    approach to ASR, called ASAT, is presented in Section V. A survey of ASAT-based speech processing applications

    and their advantages over the conventional top–down

    approach to ASR are highlighted in Section VI. In

    Section VII, we present a critical look at new research

    directions and future work opportunities through estab-

    lishing the collaborative ASR Community of the 21st Cen-

    tury. Finally, we conclude our findings in Section VIII.

    II. STATE-OF-THE-ART TOP–DOWN ASR

    Modern ASR system design is based on a statistical pattern

    matching framework that is motivated by representing

    spoken utterances as stochastic patterns (e.g., [7], [15],

    [25]–[27], and [48]) and formulating an information-theo-

    retical perspective of speech generation, acquisition, and

    transmission (e.g., [15]). Many good studies on acoustic modeling are available (e.g., [49]–[52]). Equally as many

    papers are concerned with language modeling (e.g., [53]–

    [58]). In the following sections, a brief overview on the

    current statistical approach to ASR is presented.

    A. Statistical Pattern Recognition Theory

    Starting with a message M from a message source, a

    sequence of words W is formed through a linguistic channel. Different word sequences may often convey the same

    message. It is then followed by an articulatory channel that

    converts the discrete word sequence into a continuous

    speech signal S. Speaker effect, which accounts for a major portion of the speech variabilities including speech

    production difference, accent, dialect, speaking rate, etc.,

    is added at this point. Additional speech distortion is

    Fig. 1. The S-Curve of ASR technology progress: 1) before the 1970s: Vast creation of speech knowledge sources; 2) from the 1970s to the 2010s: Data-driven model learning with rich speech and language data resources; and 3) beyond the 2010s: What to do to sustain

    fast technology progress?


  • introduced when the signal passes through the acoustic channel that includes the speaking environment, interfer-

    ing noise, and the transducers used to capture the speech

    signal. This acoustic realization A is then passed through some transmission channel before it reaches an ASR system as an observed signal X (e.g., [59]).

    For real-world practical problems, it is difficult to char-

    acterize the intermediate channels, such as articulatory,

    acoustic, and transmission channels, which are lumped together as a noisy channel. The noisy channel model is

    usually formulated as follows: 1) the joint distribution

    p(W, X) is decomposed into two components p(X|W) and P(W), known as an acoustic model (AM) and a language model (LM), respectively; 2) the forms of p(X|W) and P(W) are assumed parametric probability density functions (pdfs), i.e., $p_\Lambda(X|W)$ and $P_\Gamma(W)$, respectively; and 3) the parameters $\Lambda$ and $\Gamma$ are estimated from some training data. With these simplifications, the most popular way to solve

    the ASR problem is to use the well-known plug-in

    maximum a posteriori (MAP) decision rule (e.g., [60]–[62])

    $$\hat{W} = \arg\max_{W \in \Omega} P(W|X) = \arg\max_{W \in \Omega} p_{\hat{\Lambda}}(X|W)\, P_{\hat{\Gamma}}(W)^{\lambda_L} \quad (1)$$

    where $\hat{\Lambda}$ and $\hat{\Gamma}$ are the estimated parameters obtained during training, $\hat{W}$ is the recognized sentence from decoding, and $\Omega$ is the set of valid candidate word sequences to be searched during testing. This decision rule, derived

    from the optimal Bayes decision rule, is also widely used in

    many other pattern recognition applications. In the above

    equation, $\lambda_L$, commonly known as a language model multiplier, is used to balance the AM and LM contributions

    to the overall probability due to unknown distributions,

    and the use of a likelihood function $p_{\hat{\Lambda}}(X|W)$ to compute the acoustic probability.
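    The following minimal sketch illustrates the plug-in MAP rule in (1) in the log domain; the candidate hypotheses, their scores, and the value of the language model multiplier are invented purely for illustration and do not come from the paper.

```python
import numpy as np

# Minimal sketch of the plug-in MAP rule in (1), in the log domain:
# score(W) = log p(X|W; AM) + lambda_L * log P(W; LM).
# The hypotheses and scores below are made up for illustration only.
candidates = {
    "stock for": {"am_loglik": -4210.3, "lm_logprob": -12.7},
    "safra":     {"am_loglik": -4205.8, "lm_logprob": -16.2},
    "stuck for": {"am_loglik": -4221.4, "lm_logprob": -14.9},
}
lambda_L = 15.0  # language model multiplier balancing AM and LM contributions

def plug_in_map_decode(cands, lm_weight):
    # Pick the word sequence maximizing the weighted total log score.
    scored = {w: s["am_loglik"] + lm_weight * s["lm_logprob"] for w, s in cands.items()}
    best = max(scored, key=scored.get)
    return best, scored

best_hyp, scores = plug_in_map_decode(candidates, lambda_L)
print(best_hyp, scores)
```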

    A block diagram of this top–down approach to ASR is

    shown in Fig. 2. The feature extraction module provides the

    acoustic feature vectors used to characterize the spectral

    properties of the time-varying speech signal, typically in

    terms of time-synchronous cepstral analysis [63]. Hetero-

    scedastic linear discriminant analysis (HLDA) [64] is routinely used in modern ASR systems as a method for learning

    projections of high-dimensional acoustic representations

    into lower dimensional spaces. The information provided

    by the AM, LM, and word models is then used to evaluate

    the similarity between the input feature vector sequence

    (corresponding to a portion of the input speech) and a set of

    acoustic word models for all words in the vocabulary to

    determine which words were most likely spoken.
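    As a concrete illustration of the front end just described, the sketch below computes a conventional 39-dimensional MFCC (plus delta and delta-delta) feature stream and applies a learned linear projection in the spirit of HLDA. It assumes the open-source librosa package, which is not mentioned in the paper, and the projection matrix A would have to be estimated separately from training data.

```python
import numpy as np
import librosa  # one common open-source option; any cepstral front end would do


def extract_features(wav_path, n_mfcc=13):
    """Frame-synchronous cepstral analysis: 13 MFCCs plus delta and delta-delta,
    i.e., the conventional 39-dimensional feature vector mentioned in the text."""
    y, sr = librosa.load(wav_path, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=400, hop_length=160)  # 25 ms / 10 ms frames
    d1 = librosa.feature.delta(mfcc)
    d2 = librosa.feature.delta(mfcc, order=2)
    return np.vstack([mfcc, d1, d2]).T            # shape (frames, 39)


def project(feats, A):
    """Apply a learned linear projection (an HLDA/LDA-style transform, assumed
    given) to map high-dimensional features into a lower dimensional space."""
    return feats @ A.T
```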

    B. Three Key HMM Advances

    Top–down HMM modeling is considered one of the

    most fruitful areas in characterizing speech and language

    in recent years. Key advances [59] can be summarized in

    three broad categories to be discussed in the following.

    1) Detailed Modeling: ML estimation has been extensively developed following Baum's original formulation

    [5]. Software packages, such as HTK [65] and GMTK [66],

    are available now in the public domain to establish acoustic

    models with hundreds of thousands of Gaussian mixture com-

    ponents, and support language models with hundreds of

    millions of n-gram probabilities. The previous limitation imposed by the curse of dimensionality, widely known in

    the pattern recognition community, was alleviated with many advanced modeling techniques that take parameter

    sharing into account, such as the commonly used tied-state

    tree learning strategy [67] in subphone modeling [68], [69].

    2) Adaptive Modeling: A significant drop in performance is often observed when an ASR system is used in an ope-

    rating condition that is different from the training condi-

    tion. Adaptation algorithms try to automatically tune a given set of HMMs to a new test environment using a

    limited, but representative set of new data, commonly

    referred to as adaptation data. There exist two major

    adaptation approaches: the transformation-based approach

    and the Bayesian approach. The best known example of

    transformation-based adaptation is the maximum-

    likelihood linear regression (MLLR) framework [70]. The

    feature-space MLLR (fMLLR) [71] extends MLLR and has proven to be highly effective as a method for unsupervised

    adaptation. In Bayesian learning (e.g., [72]), prior den-

    sities need to be assumed and MAP estimates are obtained

    for the HMM parameters. When the adaptation data size is

    limited, structural maximum a posteriori (SMAP) adaptation [73] improves the efficiency of MAP estimation. Cor-

    related HMMs with online adaptation were also shown to

    be both efficient and effective [74] by considering HMM parameter correlations.
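    A minimal sketch of the Bayesian flavor of adaptation described above is given below: Gaussian mean vectors are interpolated between a speaker-independent prior and statistics gathered from a small amount of adaptation data. This is only a simple relevance-MAP update for means, not the full MLLR, SMAP, or correlated-HMM machinery cited in the text; all inputs are assumed to come from a forced alignment.

```python
import numpy as np

def map_adapt_means(prior_means, frames, posteriors, tau=10.0):
    """MAP-style adaptation of Gaussian mean vectors.

    prior_means: (K, D) speaker-independent means (the prior).
    frames:      (T, D) adaptation data.
    posteriors:  (T, K) component occupancy probabilities from a forced alignment.
    tau:         prior weight; a small tau trusts the adaptation data more.
    """
    occ = posteriors.sum(axis=0)                 # (K,) soft counts per component
    weighted_sum = posteriors.T @ frames         # (K, D) data statistics
    # Interpolate between the data mean and the prior mean for each component.
    return (weighted_sum + tau * prior_means) / (occ[:, None] + tau)
```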

    3) Discriminative Modeling: Due to inaccurate model assumptions and limited training data, maximizing the

    likelihood can be quite different from minimizing the pro-

    bability of recognition errors, which is the ultimate goal in

    ASR. Using a learning criterion that is consistent with ASR

    Fig. 2. A typical block diagram of a continuous speech recognition system with integrated search via a finite state network

    representation of all the key task constraints, such as AM and LM.


  • objectives, the minimum classification error (MCE) [75], [76] learning for HMM has been shown to be quite

    effective in improving the model separation, system accu-

    racy, and performance robustness. Risk-based optimiza-

    tion, such as maximum mutual information (MMI) (e.g.,

    [77]), and minimum phone error (MPE) [78], has also

    proven quite effective in reducing the error rates. The

    MMI and MPE objective functions have produced

    good results when extended to feature-space MMI (fMMI) [79] and feature-space MPE (fMPE) [80].
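    The sketch below shows the smoothed misclassification measure at the heart of MCE training: the correct-class score is compared against a soft maximum over competing scores and passed through a sigmoid, so the loss approximates an error count while remaining differentiable. The per-class scores are assumed to come from whatever discriminant functions (e.g., HMM log-likelihoods) the system uses; the smoothing constants are illustrative.

```python
import numpy as np

def mce_loss(scores, correct_idx, eta=2.0, gamma=1.0):
    """Smoothed MCE loss for one training token.

    scores:      (C,) discriminant scores (e.g., per-class log-likelihoods).
    correct_idx: index of the correct class.
    eta:         controls how strongly the competing classes are weighted.
    gamma:       slope of the sigmoid smoothing of the 0/1 error count.
    """
    g_correct = scores[correct_idx]
    competitors = np.delete(scores, correct_idx)
    # Soft-max style anti-discriminant over the competing classes.
    g_anti = np.log(np.mean(np.exp(eta * competitors))) / eta
    d = -g_correct + g_anti                       # misclassification measure
    return 1.0 / (1.0 + np.exp(-gamma * d))       # close to 1 if misclassified
```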

    III. TECHNOLOGY LIMITATIONS AND CHALLENGES

    Although many successful implementations of commercial

    products and services for many different languages have

    been witnessed, ASR technology is still rather fragile. Unless the users follow strict protocols that are consistent

    with the speaking styles of the speaker population, trans-

    ducer and channel characteristics of the training data con-

    ditions, and the background acoustic environments, the

    high accuracies obtained with HMM-based systems cannot

    often be maintained across adverse conditions. This ro-

    bustness issue limits a wide deployment of spoken lan-

    guage systems. Three major technical challenges are illustrated as follows.

    A. Challenges in Model Estimation and Robustness

    Since it is not practical to collect a large set of speech

    and text examples by a large population over all possible

    combinations of signal conditions, it is likely that mis-

    match of training and testing conditions is a major source

    of errors for conventional pattern matching systems. A state-of-the-art system may perform poorly when the test

    data are collected under a totally different signal condi-

    tion. Regarding the possible mismatches, both linguistic

    and acoustic mismatches (e.g., [81]) might occur.

    The mismatch can be conceptually viewed in the signal,

    feature, or model space as shown in Fig. 3 where a

    maximum-likelihood stochastic matching framework was

    proposed to address the ASR robustness issues caused by this mismatch [82]. The mismatch can be modeled by a

    distortion in the signal space D1 and handled by speech enhancement. A feature space distortion D2 can also be considered, and feature compensation can be performed

    (e.g., [83] and [84]). Finally, the mismatched situation can

    be handled in the model space with a transformation D3 that maps the trained models into the test environment via

    adaptation [59].
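    One of the simplest feature-space (D2) compensations is cepstral mean and variance normalization; the sketch below is given only to make the idea of removing a fixed channel or transducer bias concrete, and is not the stochastic matching algorithm of [82].

```python
import numpy as np

def cmvn(features, eps=1e-8):
    """Cepstral mean and variance normalization over one utterance.

    features: (T, D) cepstral feature vectors.
    Removing the per-utterance mean cancels a fixed channel/transducer bias;
    scaling by the standard deviation reduces gain differences between
    training and testing conditions.
    """
    mu = features.mean(axis=0)
    sigma = features.std(axis=0)
    return (features - mu) / (sigma + eps)
```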

    1) Inconsistency in Language Modeling: A linguistic mismatch is mainly caused by incomplete task specifications,

    inadequate knowledge representations, insufficient train-

    ing data, etc. For example, task model and vocabulary

    usage heavily influence the efficacy of the training process.

    As mentioned before, out-of-vocabulary (OOV) words not

    specified in a task vocabulary are major sources of recog-

    nition errors. For syllabic languages, such as Mandarin,

    there is a major problem in consistently defining words.

    For example, a four-character word can be broken down

    into two two-character words or even into four single-

    character words. If all possible combinations of word segments are considered in the LM, it may cause biases in

    computing word probabilities. In another evaluation on

    the 5000-word Wall Street Journal (WSJ) task, we found

    that a 4% word error rate (WER) can be achieved with the

    trigram language model. However, the WER went as high

    as 70% when no language constraints were used [85]. It is

    clear that the choices of LMs and language weights in most

    unfamiliar situations will be hard when the task definition is incomplete, e.g., in the case of spontaneous speech to be

    discussed later.

    2) Inconsistency in Acoustic Modeling: An acoustic mismatch between training and testing arises from various

    sources, including differences in desired speaking formats

    and signal realizations. For a given task, speech models

    trained based on task-dependent data usually outperform models trained with task-independent data. Similarly,

    speech models trained based on speakers with normal

    speaking rate will usually encounter problems for fast and

    slow talkers. Another major source of acoustic mismatch

    derives from varying signal conditions. For example,

    changes in transducers, channels, speaking environments,

    speaker population, echoes and reverberations, and com-

    binations of them, all contribute to performance degradation. In addition to the previously discussed linguistic and

    acoustic mismatches, model incorrectness and estimation

    error also cause robustness problems for ASR.

    3) Need for Collaborative ASR: The robustness problem of current ASR systems might be solved by combining the

    different approaches developed by different members in

    the speech community.

    Fig. 3. Mismatch in training and testing: The two starred blocks indicate that the model obtained in training in the upper panel and

    the features obtained in testing in the lower panel give acoustic

    mismatches if the testing environments are very different from the

    training conditions, resulting in a system with operating pair

    mismatches shown.


  • In contrast to the model-based pattern matching approach to extracting information from speech, a collection

    of signal-based algorithms needs to be developed in order

    to detect acoustic landmarks, such as vowels, glides, and

    fricatives, in adverse conditions. They could serve to select

    good data segments and to design signal-specific speech

    enhancement, feature compensation, and model adapta-

    tion algorithms for reliable information extraction. Attri-

    bute-specific features, such as voice onset time (VOT) for discriminating voiced against unvoiced stops [86], were

    developed and used for designing robust attribute de-

    tectors. Since there are many speech attributes and

    acoustic conditions to be dealt with, a collective expertise

    will be needed in order to address different combinations

    of robustness issues. It is well known that no single ro-

    bustness technique is capable of handling a wide range of

    adverse conditions. This again is a good opportunity to develop collaborative efforts to solve this diverse problem.

    B. Challenges in Search Strategy

    Significant progress has been made in developing effi-

    cient and effective search algorithms in the last few years

    [87]. In the future, it seems reasonable to assume that a

    hybrid search strategy, which combines a modular search

    with a multipass decision, will be used extensively for large vocabulary recognition tasks. Good delayed decision strat-

    egies in each decoding stage are required to minimize

    errors caused by hard decisions. As an example, the N-best search paradigm (e.g., [88]), including the generation of

    multiple-theory lattices, is an ideal way for integrating

    multiple knowledge sources. Such a strategy fuses multiple

    hypotheses to rescore a preliminary set of candidate digit

    strings with higher level constraints like a digit check sum [89], detailed crossword unit models, and long-term lang-

    uage models. It has also been used to provide competing

    string hypotheses for discriminative training and for com-

    bining multiple acoustic models to reduce recognition

    errors [90]. As another example, confusion networks have

    been proposed in [91] to represent alternative hypotheses

    distilled from word lattices and improve the accuracy of

    the speech recognition system. We expect to see more use of the multipass search paradigm to find preliminary hy-

    potheses in a top–down strategy by incorporating high-

    level knowledge sources that cannot be integrated easily

    into the finite state machine (FSN) representation for

    frame-based DP search. By combining a good multipass

    search strategy with utterance verification (e.g., [92])

    strategies, more flexible and efficient designs in large

    vocabulary continuous speech recognition (LVCSR) spoken language systems are possible.
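    The sketch below illustrates the delayed-decision idea behind N-best (or lattice) rescoring: first-pass hypotheses keep their AM and LM scores, and an additional knowledge source is only folded in before the final choice. The hypotheses, scores, and weights are invented for illustration, and the "knowledge" score stands in for any later-stage source such as an attribute-consistency or prosody model.

```python
# Each first-pass hypothesis keeps its AM/LM scores (delayed decision).
# All numbers below are illustrative placeholders, not measured values.
nbest = [
    {"words": "and maker of", "am": -5120.4, "lm": -21.3, "knowledge": -8.1},
    {"words": "it may curb",  "am": -5123.9, "lm": -22.0, "knowledge": -2.4},
    {"words": "it make her",  "am": -5131.2, "lm": -25.6, "knowledge": -5.9},
]

def rescore(hypotheses, lm_weight=15.0, knowledge_weight=10.0):
    # Combine all scores only at the end, avoiding an early hard decision.
    def total(h):
        return h["am"] + lm_weight * h["lm"] + knowledge_weight * h["knowledge"]
    return max(hypotheses, key=total)

print(rescore(nbest)["words"])
```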

    As already mentioned, a recognition decision is often

    made by jointly considering all the knowledge sources in

    the integrated approach ASR shown in Fig. 2. In principle,

    this search strategy achieves the highest performance if all

    the knowledge sources are completely characterized and

    fully integrated using the speech knowledge hierarchy in

    the linguistic structure of acoustics, lexicon, syntax, and semantics. This is the commonly adopted search strategy

    in speech recognition today. However, there are a number

    of problems with the integrated approach, because not all

    knowledge sources can be completely characterized and

    properly integrated.

    For LVCSR tasks, the compiled FSN is often too large

    and therefore becomes computationally expensive to find

    the best sentence through a huge and ever-expanding search space. Thus, all knowledge sources must remain

    simple in order to efficiently combine them into a single

    search space. In particular, this has inhibited progress at

    the linguistic level, and almost all LVCSR systems em-

    ployed nonoptimal linguistic components such as static

    lexicons (lexicalization of morphological processes) and

    n-gram LMs that force the decoding process to generate hypotheses that sometimes conflict with the acoustic constraints. Two WSJ examples illustrated in the fol-

    lowing subsections demonstrate how modular search can

    correct wrong recognition results obtained with current

    top–down, HMM-based systems.

    Both examples highlight the importance of bottom–up

    attribute detection and stage-by-stage knowledge integ-

    ration, which are two key topics to be discussed through-

    out the paper. We will come back to this central theme later in Section IV.

    1) Inconsistency With Attributes in Integrated Search: The first WSJ example incorporates correct low-level informa-

    tion from speech attributes. Specifically, it has been ob-

    served that a conventional LVCSR system evaluated on the

    WSJ task often confuses the word "safra" with the phrase

    "stock for." Nonetheless, recognizing the word "stock" requires the presence of two stop sounds /t/ and /k/ in the

    region of a vowel. This can be checked by visually inspect-

    ing the spectrogram in the upper panel of Fig. 4, which

    does not show the presence of stop sounds before and after

    the middle vowel. Moreover, the frame-wise time evolu-

    tion of the output posterior probabilities (generated by a

    bank of ANN-based detectors for manner of articulation)

    displayed in the lower panel of Fig. 4, known as a posteriogram [93], clearly indicated that there are no stop

    events in the area where the mistake occurred, and it also

    signals the presence of a glide (/r/ in this case) followed by

    a vowel at the end of the time-span under analysis. If this

    information could be properly extracted and integrated

    into search, these errors can be avoided. We will come

    back to this example again later in Section VI-B in which

    this particular utterance is corrected by combining attribute detection scores with the log-likelihood scores in

    attribute-based lattice rescoring.
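    A minimal sketch of how a posteriogram such as the one in Fig. 4 can be formed is given below: raw per-frame activations from a bank of manner-of-articulation detectors are normalized with a softmax into posterior probabilities, whose time evolution can then be inspected (e.g., to check for missing stop evidence). The detector outputs here are random placeholders standing in for actual ANN-based detector scores.

```python
import numpy as np

MANNER_CLASSES = ["vowel", "stop", "fricative", "nasal", "glide", "silence"]

def posteriogram(detector_scores):
    """Turn per-frame detector activations into a posteriogram.

    detector_scores: (T, C) raw scores from a bank of attribute detectors,
    one column per manner class; a per-frame softmax yields posterior
    probabilities whose time evolution can be plotted as in Fig. 4.
    """
    z = detector_scores - detector_scores.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

# Example: 100 placeholder frames, then check how much "stop" evidence
# exists in a region before accepting a word hypothesis that requires one.
post = posteriogram(np.random.randn(100, len(MANNER_CLASSES)))
stop_evidence = post[40:60, MANNER_CLASSES.index("stop")].max()
print("max stop posterior in region:", stop_evidence)
```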

    2) Inconsistency With Prosody in Integrated Search: In the second WSJ example, correct suprasegmental information

    from pitch and duration is used. Specifically, supraseg-

    mental information, such as prosody, and language


  • constraints, such as morphosyntactic language models,

    cannot be easily cast into the FSN specification when per-

    forming topdown knowledge integration. However, po-

    tential errors can be corrected by using suprasegmental

    pitch contours and duration features, as demonstrated by a visual inspection of Fig. 5.¹ In the top panel, the waveform

    for the WSJ sentence "if the Fed pushes the dollar higher, it may curb the demand for U.S. exports" in the time span

    recognition error occurring when using the same HMM-

    based system as the example in Fig. 4.

    Specifically, the phrase "it may curb" is misrecognized as "and maker of." The panel below shows the frame energy, whereas the F0 contour is shown in the third panel. The recognized phone and word sequences are reported in

    the fourth and fifth panels, respectively. The reference

    phone and word transcriptions are displayed in the sixth

    and seventh panels, respectively. Knowledge-based analy-

    sis of the second and third plots reveals two inconsisten-

    cies in recognizing the middle word "maker" in the phrase,

    namely: 1) the F0 for the segment "ker" is too high with respect to that for the preceding segment "ma" that puts

    a strong stressed syllable in the middle of the word; and

    2) the glottal closure of 60 ms of the stop sound in

    "maker" is too long. It should be a stop gap of an un-

    voiced stop as in the correct but misrecognized word

    "curb" instead. Better F0 estimation will enhance this capability (e.g., [94]). A recent study also demonstrated that

    the performance of Mandarin LVCSR can be significantly enhanced by incorporating prosodic information, such as

    break models and pitch [95], [96].

    3) Need for Bottom–Up Information Extraction: The top–down integrated framework also hampers the definition of

    generic knowledge sources that can be used in different

    domains. As a result, applications for a new knowledge

    domain need to be built almost from scratch. In addition, the effectiveness of the integrated search diminishes when

    dealing with unconstrained speech input, since more com-

    plex language models are needed for handling spontaneous

    speech phenomena along with much larger lexicons. On

    the other hand, for the modular approach shown in Fig. 6,

    the recognized sentence can be obtained by performing

    unit matching, lexical matching, and syntactic and seman-

    tic analysis in a sequential manner. As long as the interface between the adjacent decoding modules is completely

    specified, each module can be designed and tested

    separately.

    4) Need for Collaborative ASR: Collaborative research among different groups working on different components

    of the system can be carried out to improve the overall

    system performance because there are many pieces of information, or acoustic cues, to be extracted and utilized.

    In the meantime, modular approaches are usually more

    computationally tractable than integrated approaches.

    However, one of the major limitations with the modular

    approach is that hard decisions are often made in each

    decoding stage without knowing the constraints imposed

    by the other knowledge sources. Decision errors are there-

    fore likely to propagate from one decoding stage to the next, and the accumulated errors are likely to cause search

    errors unless care is taken to minimize hard decision

    errors at every processing stage.

    C. Challenges in Spontaneous Speech Processing

    Although low word error rates have been achieved in

    many LVCSR tasks, the high accuracy usually does not

    ¹The authors would like to thank Dr. C.-Y. Chiang of the National Chiao Tung University (NCTU, Hsinchu, Taiwan) for creating this example.

    Fig. 4. Spectrogram (upper panel) and posteriogram (lower panel) for the sentence numbered 446c0210 of the Nov92 test set with focus on the area where the errors occur. A conventional LVCSR system misrecognizes the word "safra" and generates the transcription

    "stock for." In the second panel, the time evolution of the posterior probabilities, namely a posteriogram, of manner of articulation

    shows that there are no plosive events in the time span under analysis. Furthermore, wrong word recognition occurs although

    correct manner of articulation detection can be performed.


  • extend to recognizing spontaneous speech. An example is

    the Switchboard task [97], which has attracted quite a bit of

    research attention for almost 20 years. In the early 1990s, a high error rate of over 40% had been reported. However, a

    steady decrease in word error rate led to those in the range

    of 13%–16% for conversational telephone speech and

    broadcast news by 2006 [98]–[100]. Nevertheless, these

    results are still rather high when compared with the re-

    cognition performance on read speech. In spontaneous

    speech, ill-formed utterances are often observed that

    cannot be completely characterized, even if a large amount of training speech data was collected to build the n-gram language models.

    The plug-in MAP decoder that recognizes $\hat{W}$ in (1) finds the best sentence in a set of competing sentences. How-

    ever, there are many practical difficulties with this design.

    First, the candidate set is usually of a finite size, and it is not possible to include all sentences. Second, the quality of

    the recognition result is not properly quantified because the right-hand side of (1) is only computing a relative

    difference of competing word strings. Since speech sounds

    are inherently ambiguous, we need to at least ask the

    following three questions: 1) why should we accept $\hat{W}$ as

    the recognized string?; 2) why should we accept some

    words in $\hat{W}$ while rejecting others?; and 3) can we assign a value to measure the confidence of our acceptance?

    These three issues lead researchers to study three new

    but closely related topics, namely: 1) keyword recognition

    and non-keyword rejection (e.g., [101]); 2) utterance veri-

    fication (UV) at both the string and word levels (e.g., [92]

    and [102]); and 3) confidence measures (CMs) or con-

    fidence scoring (e.g., [103]). Although the above three

    research areas cannot be solved using the classical classi-

    fication formulation shown in (1), the theory of statistical pattern verification and hypothesis testing provides a

    framework to tie these three topics in a unified manner

    (e.g., [104]). Verification of speech events, such as in

    attributes, could also follow the same theory and design.

    We will come back to this critical research area in further

    depth later in Section V. Due to incomplete task specifi-

    cations in defining most speech recognition and under-

    standing tasks, a top–down knowledge integration approach alone usually could not maintain the consistency

    with all the knowledge sources in the recognized sentence.

    A partial understanding of any given utterance, i.e., know-

    ing which part of an utterance to process and to ignore, is

    very critical in order to handle spontaneous speech. We

    will come back later to discuss this important issue in

    Section IV-B.
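    A minimal sketch of the hypothesis-testing view of utterance verification is shown below: a duration-normalized log-likelihood ratio between the hypothesized word model and an anti (background) model is compared to a threshold and mapped to a confidence score. All model scores and the threshold are illustrative, not values from the cited systems.

```python
import numpy as np

def verify(loglik_target, loglik_anti, n_frames, threshold=0.5):
    """Accept or reject a recognized word via a likelihood-ratio test.

    loglik_target: log-likelihood of the frames under the hypothesized word model.
    loglik_anti:   log-likelihood under an anti/background model.
    The duration-normalized LLR acts as a confidence measure (CM).
    """
    llr = (loglik_target - loglik_anti) / max(n_frames, 1)
    confidence = 1.0 / (1.0 + np.exp(-llr))   # map to (0, 1) for reporting
    return llr >= threshold, confidence

accepted, cm = verify(loglik_target=-512.3, loglik_anti=-540.8, n_frames=43)
print(accepted, round(cm, 3))
```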

    1) Partial Understanding Through Event Detection: One major problem related to characterizing spontaneous

    speech with conventional acoustic and language models

    is the set of so-called incomplete specification issues, such

    as partial words, hesitation, telephone ringing, baby cry-

    ing, door slamming, TV interferences, out-of-vocabulary

    words, out-of-grammar sentences, and out-of-task (OOT)

    Fig. 6. A typical modular-search ASR system.

    Fig. 5. Prosodic analysis of the WSJ sentence "if the Fed pushes the dollar higher, it may curb the demand for U.S. exports." The first panel shows the waveform in the time frame between 2.72 and 3.32 s, where a recognition error occurs. Specifically, "it may curb" is recognized as "and maker of."

    The second panel shows the frame energy, whereas the F0 is shown in the third panel. The recognized phone and word sequences are reported

    in the fourth and fifth panels, respectively. The reference phone and word transcriptions are displayed in the sixth and seventh panels,

    respectively. Two inconsistencies: 1) the F0 for the segment "ker" is too high with respect to that for the preceding segment "ma"; and

    2) glottal closure of the stop sound in "maker" is too long.


  • utterance constructions, which are commonly observed in spontaneous speech (e.g., [105]).

    2) Need for Collaborative ASR: Many techniques have been developed to reduce the word error rate for the

    Switchboard task for almost 20 years based on conven-

    tional LVCSR approaches. The amount of effort needed to

    study another spontaneous speech recognition task with an

    increased complexity, or for a different language could be very high. It is time to adopt detection-based techniques

    that can be task independent and language universal, such

    as key phrase detection, sound-specific filler modeling,

    extraneous speech rejection, and attribute modeling,

    through partial understanding. By utilizing the aforemen-

    tioned modular strategy, we might be able to divide up the

    big spontaneous LVCSR task to a set of smaller and man-

    ageable problems so that researchers with knowledge-rich algorithms can help.

    IV. BOTTOM–UP DETECTION FOLLOWED BY KNOWLEDGE INTEGRATION

    Better modeling of the linguistic, articulatory, acoustic,

    transmission, and noise channels missing in the current ASR formulation may enhance the current level of ASR

    performance. Moreover, a knowledge integration process

    with detected cues and evidence is often used in HSR and

    from experience in spectrogram reading. This seems to

    point to the need for a bottom–up paradigm and leads to

    ASAT principles. We provide some justifications of this

    perspective in the following.

    A. Bottom–Up Knowledge Integration

    The currently prevailing top–down HMM-based sys-

    tems use a large volume of speech and text training data.

    However, once the models are trained it becomes a black

    box in that it provides very little diagnostic information to

    pinpoint why the models work well in one instance and

    then fail badly in other situations. For example, it is often

    clear from the pitch contour that there are only three digits in an unknown input utterance, but somehow a four-digit

    sequence scores best among the competing strings and is

    recognized. Instead of using such a top–down, integrated

    search approach, recent ASR systems rely on N-best strings or word lattices to hypothesize multiple theories. The re-

    cognized sentence is then obtained by rescoring these

    strings using additional knowledge sources, e.g., phone

    and segment lattices have been proposed [42], [106]. However, phone recognition is often error-prone and be-

    comes a limiting factor for further technology develop-

    ment. The bottom–up knowledge integration approach

    would become feasible if speech cues are more reliably

    detected.

    A block diagram of a detection-based ASR system is

    shown in Fig. 7. The input speech signal is first processed

    by a bank of feature detectors aiming at events that are

    relevant to the recognition task. An event lattice is then

    produced with each element time-marked and scored. The

    detectors do not have to be synchronized and, therefore,

    the framework is flexible in embracing both short-term detectors, e.g., for the VOT, and long-term detectors, e.g.,

    for pitch contours. Once meaningful events have been de-

    tected, the event merger proposes larger events by merging

    smaller events and the theory verifier computes their

    corresponding confidence scores and prunes unlikely

    theories. This process could be repeated until all the avail-

    able knowledge sources are incorporated and evaluated.

    The recognized string is then the sequence of words that scores the best against all possible knowledge sources.
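    The toy sketch below mirrors the flow of Fig. 7: asynchronous, time-marked attribute events from the detector bank are merged into a larger theory whose confidence reflects its weakest supporting evidence. The event structure and merging rule are hypothetical simplifications of the event merger and theory verifier described above.

```python
from dataclasses import dataclass

@dataclass
class Event:
    label: str      # detected attribute, e.g., "stop", "voiced"
    start: float    # seconds
    end: float
    score: float    # detector confidence in [0, 1]

def merge_events(events, required, min_score=0.5):
    """Propose a larger event (e.g., a phone theory) if all required attributes
    are detected with enough confidence; a real merger would additionally
    check that the supporting time spans overlap before merging."""
    hits = [e for e in events if e.label in required and e.score >= min_score]
    if set(e.label for e in hits) != set(required):
        return None
    start = min(e.start for e in hits)
    end = max(e.end for e in hits)
    conf = min(e.score for e in hits)       # weakest supporting evidence
    return Event(label="+".join(sorted(required)), start=start, end=end, score=conf)

# Illustrative detector outputs; a /t/ theory needs both "stop" and "unvoiced".
bank_output = [Event("stop", 0.42, 0.47, 0.81), Event("unvoiced", 0.41, 0.48, 0.74)]
print(merge_events(bank_output, required={"stop", "unvoiced"}))
```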

    This bottom–up detection approach to ASR has a num-

    ber of advantages, namely: 1) it provides plenty of diagnos-

    tic information; 2) it is easy to compare the quality of

    individual detectors and an ensemble of detectors by pro-

    perly designing feature-specific evaluation sets aiming at

    these events; 3) individual event detectors are often easier

    to design and perfect than the whole system; 4) it takes advantage of many years of research in speech and lang-

    uage sciences, as well as statistical modeling; 5) it offers a

    quantitative way for an objective performance evaluation;

    and, most importantly, 6) it sets up an open framework

    for the community to work collaboratively, something that

    has not been done enough in the last 30 years. In the

    following, we briefly present two preliminary studies that

    also demonstrate the feasibility and effectiveness of such a bottom–up approach.

    B. Key-Phrase Detection and Verification

    Several spoken dialog systems [28] have been evalu-

    ated in real-world applications. These systems use finite

    state grammars to accept typical user utterances because

    there is no data available to train statistical language models for the specific tasks. The use of a rigid grammar

    is effective for typical in-grammar (IG) utterances. How-

    ever, in real-world environments, we have observed wide

    utterance variation inherent in a large user population,

    that is, therefore, not covered by the task grammars,

    even though they had been iteratively tuned by developers

    during the trial period. Even in apparently simple subtasks

    Fig. 7. A detection-based speech recognition framework.


  • such as asking for date or time, around 20% of the user

    utterances turned out to be out-of-grammar (OOG).

    These samples include extraneous words, hesitations, re-

    petitions, and unexpected expressions. In some cases, we

    even observe many utterances that are totally irrelevant

    or OOT.

    Most of such spontaneous utterances contain some key phrases that are task related and may lead to partial or full

    understanding. Other samples, not relevant to the task,

    should be rejected. This suggests a detection-based ap-

    proach to flexible speech recognition and understanding

    that is designed to detect semantically significant parts and

    reject irrelevant portions. In a domain-specific form filling

    or information retrieval task, it is capable of interpreting

    with only key phrases. Therefore, the approach based on detection is attractive. In [102], it was found that a com-

    bined key-phrase recognition and verification strategy

    worked well especially for ill-formed utterances. The com-

    bined detection-verification approach improved the se-

    mantic accuracy from 5% to 30% over the conventional

    techniques with rigid grammatical constraints, especially

    for ill-formed OOT utterances.

    C. Knowledge-Based Feature Representation in LVCSR

    We now show that the use of acoustic phonetics and

    contextual variability in the representation of the speech signal is indeed very useful to improve LVCSR. The system

    proposed in [107] is similar to a well-known feature ex-

    traction scheme called Tandem [108] that was extended to

    LVCSR [109]. Data-driven multilayer perceptron (MLP)

    detectors were used to measure the presence or absence of

    distinctive features directly from the short-time MFCC and

    limited temporal information. These phonetic distinctive

    features were used as feature vectors to build a set of context-dependent phone HMMs. Experiments were per-

    formed on the WSJ task with the following feature con-

    figurations: 1) baseline with 39 MFCC features; 2) 60+1 features: the 61-dimensional feature vectors (1 energy

    coefficient + 60 Karhunen–Loève (KL) transformed fea-

    tures) were used as features to build triphone HMMs;

    3) 44 KL-transformed phone features; 4) 61+44 features; and 5) 61+44 features plus MFCC. We point out that the first- and second-order derivatives were not used for distinc-

    tive and phone features.
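    The sketch below shows, under simplifying assumptions, how such detector outputs can be turned into HMM features in a Tandem-like setup: per-frame posteriors are log-compressed, decorrelated with a Karhunen–Loève (PCA) transform estimated on training data, and optionally concatenated with MFCCs as in configuration 5). The transform estimation and feature dimensions are illustrative, not the exact recipe of [107]–[109].

```python
import numpy as np

def kl_transform(train_logposts, n_keep):
    """Estimate a Karhunen-Loeve (PCA) transform from training log-posteriors."""
    mu = train_logposts.mean(axis=0)
    cov = np.cov(train_logposts - mu, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1][:n_keep]
    return mu, eigvecs[:, order]            # keep the leading components

def tandem_features(posteriors, mu, proj, mfcc=None, floor=1e-10):
    """Log + KL-transform MLP posteriors; optionally append MFCCs (config 5)."""
    logp = np.log(posteriors + floor)
    feats = (logp - mu) @ proj              # (T, n_keep) decorrelated features
    if mfcc is not None:
        feats = np.hstack([feats, mfcc])    # concatenate with cepstral features
    return feats
```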

    Experimental results were obtained with the 5000 and 20 000 words on the Nov92 test set. Trigram language

    models were used, and all WERs are listed in the second

    row of Table 1. Furthermore, systems 4) and 5) were

    obtained by combining with ROVER [110], which essen-

    tially corresponds to a majority vote decision. The results

    given in the second to last row correspond to about 20%

    and 10% relative improvements over our best MFCC base-

    lines, on the 5000 and 20 000 tasks, respectively. These very encouraging results seem to indicate that acoustic

    phonetic features can help reduce the WERs. In the

    bottom row, we report a WER of 6.6% for the 20 000-word

    task obtained with a template-based system [111], and the

    state-of-the-art result for the 5000-word task [112].
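    The following is a much-simplified stand-in for the ROVER combination used above: real ROVER first aligns the system outputs into a word transition network by dynamic programming; here the hypotheses are assumed to be already aligned word by word, so combination reduces to a per-slot majority vote. The example word sequences are invented.

```python
from collections import Counter

def majority_vote(hypotheses):
    """Combine equally long, pre-aligned word sequences by voting per position.

    hypotheses: list of word lists, e.g., the outputs of several systems.
    Real ROVER also handles insertions/deletions through dynamic-programming
    alignment, which is omitted in this sketch.
    """
    assert len(set(len(h) for h in hypotheses)) == 1, "sketch assumes aligned outputs"
    combined = []
    for slot in zip(*hypotheses):
        word, _ = Counter(slot).most_common(1)[0]   # most frequent word wins
        combined.append(word)
    return combined

print(majority_vote([["it", "may", "curb"],
                     ["it", "make", "curb"],
                     ["it", "may", "curb"]]))
```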

    V. AUTOMATIC SPEECH ATTRIBUTE TRANSCRIPTION

    The speech signal contains a rich set of information that

    facilitates human auditory perception and communication

    beyond a simple linguistic interpretation of the spoken

    input. In order to bridge the performance gap between ASR

    and HSR systems, the narrow notion of speech-to-text in

    ASR has to be expanded to incorporate all related infor-

    mation embedded in speech utterances. This collection of information includes a set of fundamental speech sounds

    with their linguistic interpretations, a speaker profile en-

    compassing gender, accent, emotional state and other

    speaker characteristics, the speaking environment, etc.

    Collectively, we call this superset of speech information the

    attributes of speech. It is expected that directly addressing

    these issues will improve ASR performance as well as

    speaker recognition, language identification, speech perception, and speech synthesis. The human-based model of

    speech processing suggests a candidate framework for

    developing next-generation speech processing techniques

    that have the potential to go beyond the current limitations

    of existing ASR systems.

    Based on the aforementioned set of speech attributes,

    ASR can be extended to ASAT, which is a process that goes

    beyond the current simple notion of word transcription. ASAT promises to be knowledge-rich and capable of incor-

    porating multiple levels of information in the knowledge

    hierarchy into attribute detection, evidence verification,

    and integration, as shown in Fig. 8. The top panel illus-

    trates the frontend processing, which consists of an

    ensemble of speech analysis and parametrization modules.

    In addition, the bottom panels demonstrate a possible

stage-by-stage backend knowledge integration process. These two key system components will be described in

    more detail in the following. Since speech processing in

    ASAT is highly parallel, a collaborative community effort

    can be built around a common sharable platform to

    enable a modular ASR paradigm that facilitates a tight

    coupling of interdisciplinary studies of speech science

    and processing.

Table 1 Word Error Rates (%) for Various Feature Sets and Combinations on WSJ Nov92, 5000 and 20 000 Words


A. Front-End Attribute Detection Processing

An event detector converts an input speech signal into a

    time series that describes the level of presence (or level of

    activity) of a particular property of an attribute, or event, in

the input speech utterance over time. This function can compute the a posteriori probability, or the log-likelihood ratio (LLR), of the particular attribute. The LLR involves the

    calculation of two likelihoods: one pertaining to the target

    model and the other the contrast model. The bank of

    detectors consists of a number of such attribute detectors,

    each being individually and optimally designed for the

    detection of a particular event. These attribute properties

are often stochastic in nature and are relevant to information to be extracted and needed to perform speech ana-

    lysis and other functions, such as ASR. One key feature of

    the detection-based approach is that the outputs of the

    detectors do not have to be synchronized in time and,

    therefore, the system is flexible enough to allow a direct

    integration of both short-term detectors, e.g., for detecting

    VOT, and long-term detectors, e.g., for detecting pitch

contours, syllables, and particular word sequences. The conventional frame-synchronous constraints of most tradi-

    tional ASR systems are thus relaxed in the ASAT system to

    accommodate asynchronous attribute detection as shown

    in Fig. 9. In the ASAT framework, different parameters at

    different frame rates can be utilized and combined to de-

    sign attribute-specific event detectors beyond the current

    MFCC features obtained with frame-synchronous speech

analysis.

Speech parametrization has been discussed in many

    textbooks (e.g., [113]). For ASAT, the parameters can be

    sample based, such as a zero-crossing rate, or frame based,

    such as MFCCs. Speech analysis can be performed in the

    temporal domain, providing features such as VOT, or in

    the spectral domain, such as short time energies in differ-

    ent frequency bands. Both long-term and short-term ana-

    lysis can be compared and contrasted. Biologically inspired

    and perceptually motivated signal analyses are considered

    as promising parameter extraction directions [114], [115]

    because the ASAT paradigm supports parameter extraction

    at different frame rates for designing a range of attribute

detectors. Once a collection of speech parameters Ft is obtained, they can be used to perform attribute detection,

which is a critical component in the ASAT paradigm as shown in the upper panel of Fig. 8. Attributes can be used

    as cues or landmarks in speech [45], [47] in order to

    identify the islands of reliability for making local acous-

    tic and linguistic decisions, such as energy concentration

    regions and phrase boundaries, without extensive speech

    modeling. A few clear examples are readily visible in

    most spectrogram plots, e.g., the vowel and fricative re-

gions in Fig. 4.

An attribute detection example was demonstrated in

    [86] to discriminate voiced and unvoiced stops using VOT

    for two-pass English letter recognition. In the first stage, a

Fig. 9. A bank of speech attribute detectors: Each can take different parameters as inputs and generate a value between 0 and 1 over time

    to indicate the presence or absence of the specific attribute.

Fig. 8. ASAT. (a) Speech analysis ensemble followed by a bank of attribute detectors to produce an attribute lattice. (b) Stage-by-stage knowledge integration from speech attributes to recognized sentences.


conventional recognizer was used to produce a list of multiple candidates. To further discriminate some of the

minimal pairs, such as the English letters /d/ and /t/, a

    VOT-based detector [86] can be used in the second stage to

    provide a detailed discrimination. It was shown that the

    VOT temporal feature produces a pair of curves with better

    discrimination (i.e., with more separation between them)

    than those obtained with spectral features alone. By reor-

dering candidates according to VOT, the two-stage recognizer gave an error rate 50% less than that obtained in a

    state-of-the-art ASR system [116].
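
A minimal sketch of this second-pass idea is given below, assuming Gaussian VOT models for voiced and unvoiced stops; the means, variances, and interpolation weight are illustrative placeholders and not the detector of [86].

```python
import math

def gaussian_loglik(x, mean, std):
    """Log density of a univariate Gaussian."""
    return -0.5 * math.log(2 * math.pi * std ** 2) - (x - mean) ** 2 / (2 * std ** 2)

def vot_llr(vot_ms, voiced=(15.0, 8.0), unvoiced=(70.0, 20.0)):
    """LLR of 'voiced stop' vs 'unvoiced stop' given a VOT in milliseconds.

    The (mean, std) pairs are illustrative assumptions, not fitted models.
    """
    return gaussian_loglik(vot_ms, *voiced) - gaussian_loglik(vot_ms, *unvoiced)

def rerank(nbest, vot_ms, weight=0.1):
    """Second-pass reordering of a first-pass N-best list of letter hypotheses.

    Only the voiced/unvoiced stop pair is adjusted by the VOT evidence.
    """
    llr = vot_llr(vot_ms)
    rescored = []
    for letter, score in nbest:
        if letter == "d":
            score += weight * llr        # positive LLR favors the voiced stop
        elif letter == "t":
            score -= weight * llr        # negative LLR favors the unvoiced stop
        rescored.append((letter, score))
    return sorted(rescored, key=lambda p: p[1], reverse=True)

# A short VOT pushes /d/ above /t/ despite a lower first-pass score.
print(rerank([("t", -1.0), ("d", -1.2)], vot_ms=12.0))
```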

B. Back-End Knowledge Integration Processing

Another critical component in the ASAT paradigm is

    the backend processing shown in the bottom panel in

    Fig. 8. An event merger takes the set of detected lower level

events as input and attempts to infer the presence of higher level units (e.g., a phone or a word). Those higher level

    units are then validated by the evidence verifier to produce

    a refined and partially integrated lattice of event hypoth-

    eses to be fed back for further event merger and knowledge

    integration. This iterative information fusion process al-

    ways uses the original event activity functions as the raw

    cues. A terminating strategy can be instituted by utilizing

all the supported attributes.

The procedure produces the evidence needed for a final

    decision, including a recognized sentence. Each activity

    function can be modeled by a corresponding neural system.

    Both activation levels and firing rates have been used in

    neural encoding and neuron combinations to encode tem-

    poral information. Simulating perception of temporal

    events is of particular interest in auditory perception of

speech. New techniques are needed to accomplish this form of lattice parsing. Conditional random field (CRF)

    [117] is a mathematical framework that can be used to

    describe sequences of symbols (such as phones or words) in

    terms of input features, e.g., local phonetic attribute detec-

    tions. CRF has been utilized in a number of ASAT-related

studies (e.g., [118]–[120]).
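
As one hedged illustration of how such a sequence model can consume local attribute detections, the sketch below trains a linear-chain CRF on toy frame-level attribute scores, assuming the third-party sklearn-crfsuite package; the feature names, scores, and labels are invented for illustration, and the cited studies use their own feature sets and toolkits.

```python
import sklearn_crfsuite

def frame_features(attr_scores):
    """attr_scores: list of per-frame dicts, e.g. {'vowel': 0.9, 'nasal': 0.1}."""
    return [dict(scores, bias=1.0) for scores in attr_scores]

# Toy training sequence: two vowel-like frames followed by a nasal-like frame.
X_train = [frame_features([
    {"vowel": 0.9, "nasal": 0.1, "fricative": 0.0},
    {"vowel": 0.8, "nasal": 0.2, "fricative": 0.1},
    {"vowel": 0.2, "nasal": 0.9, "fricative": 0.0},
])]
y_train = [["ah", "ah", "n"]]

# Linear-chain CRF mapping attribute detections to a phone label sequence.
crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=50)
crf.fit(X_train, y_train)
print(crf.predict(X_train))   # e.g. [['ah', 'ah', 'n']]
```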

    To make use of the detected features, we must combine

them in a way that we can produce word hypotheses. In essence, this boils down to three problems: 1) combining

    multiple estimates of the same event to build a stronger

    hypothesis; 2) combining estimates of different events to

    form a new, higher level event with similar time bounda-

    ries; and 3) combining estimates of events sequentially to

    form longer term hypotheses. Note that these problems are

    somewhat independent of the level of modeling: while the

canonical bottom-up processing sequence would be to combine multiple estimates of each feature, and then com-

    bine the features into phones and then words (and word

    sequences), we envision a highly parallel paradigm that is

    flexible enough, for example, to combine a feature-based

    phone detector with a directly estimated phone detector. In

    principle, a 20 000-word ASR system can be realized with a

    set of 20 000 single-keyword detectors [121].

Combining evidence for the same linguistic unit [problem 1)] has been the focus of techniques such as

    multistream acoustic modeling (e.g., [122] and [123]) and

    recognition hypothesis combination [110]. In addition,

    stochastic combination to form strong verifiers from a

    collection of weak detectors, such as boosting (e.g., [124]

    and [125]), is a useful tool to combine low-level events into

high-level evidence.
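
A simple instance of problem 1), combining several detectors' estimates of the same event, is sketched below as a weighted average in the log-odds domain; this is only one plausible fusion rule and is far simpler than the multistream and boosting schemes cited above.

```python
import numpy as np

def combine_same_event(posteriors, weights=None):
    """Fuse several detectors' posterior tracks for the SAME event.

    posteriors: array of shape (n_detectors, n_frames) with values in (0, 1).
    The fusion is a weighted average of log odds (a weighted geometric
    mean of the odds), mapped back to a posterior with a sigmoid.
    """
    p = np.clip(np.asarray(posteriors, dtype=float), 1e-6, 1.0 - 1e-6)
    if weights is None:
        weights = np.full(p.shape[0], 1.0 / p.shape[0])
    log_odds = np.log(p) - np.log1p(-p)           # per-detector log odds
    fused = np.tensordot(weights, log_odds, axes=1)
    return 1.0 / (1.0 + np.exp(-fused))           # back to a posterior track

two_detectors = [[0.7, 0.9, 0.4], [0.6, 0.8, 0.5]]
print(combine_same_event(two_detectors))
```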

C. Event and Evidence Verification

Verification of patterns is often formulated as a statis-

    tical hypothesis testing problem [126] as follows: given a

    test pattern, one wants to test the null hypothesis against

    the alternative hypothesis. Event verification, a critical

    ASAT component, can be formulated in a similar way. For

    most practical verification problems in real-world speech

and language modeling, a set of training examples is used to estimate the parameters of the distributions of the null

    and alternative hypotheses. The two competing hypotheses

    and their overlap indicate the two types of error known as

    miss detection and false alarm errors [126]. A generalized

    log-likelihood ratio (GLLR) was proposed as a way to

    measure a separation between models of competing

    hypotheses [127].
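
The sketch below illustrates the two error types for a score-based verifier under the simplifying assumption that the scores are Gaussian under each hypothesis; the means, variances, and thresholds are placeholders, not fitted models.

```python
import math

def norm_cdf(x, mean, std):
    """Cumulative distribution function of a univariate Gaussian."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))

def verification_errors(threshold, target=(2.0, 1.0), impostor=(-2.0, 1.0)):
    """Type I / type II errors for an LLR-style verifier with Gaussian scores.

    target, impostor: (mean, std) of the score under the two hypotheses;
    scores above the threshold are accepted as the target event.
    """
    miss = norm_cdf(threshold, *target)                   # target rejected
    false_alarm = 1.0 - norm_cdf(threshold, *impostor)    # competitor accepted
    return miss, false_alarm

for th in (-1.0, 0.0, 1.0):
    m, fa = verification_errors(th)
    print(f"threshold={th:+.1f}  miss={m:.3f}  false_alarm={fa:.3f}")
```

Sweeping the threshold in this way traces the tradeoff between the two error types that the GLLR separation measure summarizes.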

The verification performance is often evaluated as a combination of the two types of errors. The related topic of

    CMs (e.g., [103]) has also been intensively studied by

    many researchers recently, e.g., [92] and [102]. This is due

    to an increasing number of applications being developed

and deployed in the past few years. In order to have

    intelligent or humanlike interactions in these dialogs, it is

    important to attach to each event a value that indicates

how confident the ASR system is about accepting the recognized event. This number, often referred to as a CM,

    serves as a reference guide for the dialog system to provide

    an appropriate response to its users just like an intelligent

    human being is expected to do when interacting with

    others.

    Fig. 10 shows an example of how to use the GLLR plots.

    Specifically, Fig. 10 displays in the top left, bottom left,

and top right, three sets of distribution curves for detecting the three corresponding phones /w/, /ah/, and /n/, in the

    word one. Here the ARPABET [128], used in the ARPA

    Speech Understanding Research (SUR) project, is adopted

    to denote the phonetic symbols used throughout this paper.

    By approximating the three sets of curves with Gaussian

    densities, the Gaussian curves for detecting the word one

    can be composed as shown in the bottom right of Fig. 10. It

is noted that words are in general easier to detect than phones, because the composed competing Gaussian curves

    show a better separation, or equivalently less overlap.
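
As a back-of-the-envelope illustration of why composition helps, suppose (a simplifying assumption not made explicitly in the text) that the per-phone GLLR scores $s_i$ are independent and Gaussian with the same variance under both hypotheses:
\[
s_{\mathrm{word}}=\sum_{i=1}^{N}s_i,\qquad s_i\sim\mathcal{N}(\mu_i,\sigma_i^{2})\ \text{(target)},\quad s_i\sim\mathcal{N}(\nu_i,\sigma_i^{2})\ \text{(competitor)}.
\]
The word-level separation, measured in standard deviations, is then
\[
d'_{\mathrm{word}}=\frac{\sum_{i=1}^{N}(\mu_i-\nu_i)}{\sqrt{\sum_{i=1}^{N}\sigma_i^{2}}},
\]
which, for $N$ phones with identical per-phone separation $\Delta$ and variance $\sigma^{2}$, equals $\sqrt{N}\,\Delta/\sigma$, i.e., $\sqrt{N}$ times the per-phone separation, consistent with the reduced overlap seen in the bottom-right panel of Fig. 10.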

D. Speech Attribute Detection

A possible implementation of the ASAT detection-

    based frontend is shown in Fig. 11. It consists of two

    main blocks: 1) a bank of attribute detectors that can


produce detection results in terms of a confidence score;

    and 2) an evidence merger that combines low-level events

    (attribute scores) into higher level evidence, such as

    phone posteriors. The append module, shown in Fig. 11,

stacks together the outputs delivered by the attribute detectors for a given input and generates a supervector of

    attribute detection scores. This feature vector is then fed

    into the merger. In summary, the system shown in Fig. 11

    maps acoustic features (e.g., short-time spectral features

    or temporal pattern features) into phone posterior

    probabilities.
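
A minimal sketch of the append and merger stages of Fig. 11 is given below, assuming scikit-learn is available; the detector scores and phone labels are random placeholders standing in for real attribute detector outputs and transcriptions.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
n_frames, n_detectors, n_phones = 500, 21, 40

# One column of scores per attribute detector (e.g., vowel, nasal, ...).
detector_scores = [rng.random((n_frames, 1)) for _ in range(n_detectors)]

# "Append": stack the detector outputs into one supervector per frame.
supervectors = np.hstack(detector_scores)                  # (n_frames, n_detectors)
phone_labels = rng.integers(0, n_phones, size=n_frames)    # placeholder targets

# "Merger": a single-hidden-layer MLP mapping supervectors to phone posteriors.
merger = MLPClassifier(hidden_layer_sizes=(100,), max_iter=200)
merger.fit(supervectors, phone_labels)
phone_posteriors = merger.predict_proba(supervectors)      # (n_frames, phone classes seen)
```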

    In our studies, phonetic features, such as manner and

place of articulation, are used as the speech events of interest. The motivations behind this choice are: articulator-

    motivated features improve robustness toward noise at

    low signal-to-noise ratios [129]; they improved recognition

    of hyper-articulated speech and in the presence of

    different speaking styles [130]; they are reliably detected

    [130]; and they carry linguistic information [129]. Table 2

    shows the phonetic features used in the experiments re-

ported in the next sections. Silence is also used to represent the absence of any speech activity. This feature

    inventory is clearly not complete. Nevertheless, this set

    could always be expanded.

    Attributes should be stochastic in nature and the cor-

    responding detectors designed with data-driven modeling

    techniques. The goal of each detector is to analyze a speech

    segment and produce a confidence score or a posterior

    probability that pertains to some acoustic-phonetic attri-

    bute. Generally speaking, both frame- and segment-based

data-driven techniques can be used for speech event detection. Frame-based detectors can be realized in several

    ways, e.g., with ANNs [22], Gaussian mixture models

    (GMMs) [131], and support vector machines (SVMs)

    [132]. One of the advantages with ANN-based detectors is

    that the output scores can simulate the posterior proba-

    bilities of an attribute given the speech signal. On the other

    hand, segment-based detectors are more reliable in spot-

ting segments of speech [133]. Segment-based detectors can be built by combining frame-based detectors or with

    segment models, such as HMMs, which have already been

    shown effective for ASR [8]. Time-delay neural networks

    (TDNNs) were also shown to be effective in designing

    segment-based attribute classifiers [134]. The reader is re-

    ferred to a recent Ph.D. dissertation [135] detailing the

    process of building accurate TDNN-based classifiers for all

    the attributes of interest.
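
For instance, a frame-based attribute detector scored by an LLR can be sketched with a target GMM and an anti-model GMM, assuming scikit-learn; the data below are random placeholders, and an analogous segment-level score would replace the per-frame GMM likelihoods with HMM likelihoods.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
nasal_frames = rng.normal(0.5, 1.0, size=(400, 13))    # placeholder MFCC frames with the attribute
other_frames = rng.normal(-0.5, 1.0, size=(400, 13))   # placeholder frames without it

target = GaussianMixture(n_components=4, covariance_type="diag").fit(nasal_frames)
anti = GaussianMixture(n_components=4, covariance_type="diag").fit(other_frames)

def nasal_llr(frames):
    """Per-frame log-likelihood ratio of 'nasal' versus 'not nasal'."""
    return target.score_samples(frames) - anti.score_samples(frames)

test = rng.normal(0.5, 1.0, size=(10, 13))
print(nasal_llr(test))    # positive values indicate presence of the attribute
```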

    1) Frame-Based Attribute Detectors: In the case of frame-based design, each detector is realized with three MLPs

    organized in a hierarchical structure [136] similar to a way

    of modeling long-term energy trajectories, referred to as

Fig. 10. Verifying a sequence of hypotheses for the word one (bottom right) based on evidence from verifying the three phones, /w/ (top left), /ah/ (bottom left), and /n/ (top right), in the word.


TempoRAl Patterns (TRAP)-based features [137], [138]. In

    ASAT, sub-band energy trajectories arranged into split-

    temporal context as described in [139] are used. In the

experiments, all MLPs are trained to compute the attribute posterior probabilities [136].
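
The hierarchical structure can be mimicked, purely schematically, by the three scikit-learn MLPs below: one for the left-context trajectory, one for the right-context trajectory, and a merging MLP on their outputs; the dimensions and data are placeholders, and the actual detectors of [136], [139] operate on sub-band energy trajectories with their own topologies.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(2)
n_frames, context, n_bands = 1000, 15, 19            # +/-15 frames, 19 sub-bands (assumed)
left = rng.random((n_frames, context * n_bands))     # left-context band trajectories
right = rng.random((n_frames, context * n_bands))    # right-context band trajectories
is_vowel = rng.integers(0, 2, size=n_frames)         # placeholder attribute labels

# Two context MLPs followed by a third MLP that merges their posteriors.
left_mlp = MLPClassifier(hidden_layer_sizes=(50,), max_iter=200).fit(left, is_vowel)
right_mlp = MLPClassifier(hidden_layer_sizes=(50,), max_iter=200).fit(right, is_vowel)
merged_in = np.hstack([left_mlp.predict_proba(left), right_mlp.predict_proba(right)])
merger_mlp = MLPClassifier(hidden_layer_sizes=(50,), max_iter=200).fit(merged_in, is_vowel)
vowel_posterior = merger_mlp.predict_proba(merged_in)[:, 1]   # per-frame attribute posterior
```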

    2) Segment-Based Attribute Detectors: When segment-based detectors, e.g., HMMs, are used to categorize a seg-

ment of speech into attribute classes, either the log-likelihood

    or log-likelihood ratio can be adopted as detector scores. In

    our ASAT framework, a LLR-based score is used to

measure the goodness-of-fit between a speech segment and the corresponding speech feature because it has already

    proven useful in rejecting wrong hypotheses in several

    speech tasks [102], [104], [140], [141].

    In Fig. 12, we show the typical detection curves of

    manner of articulation, for the utterance numbered

442c013 (whose transcript is THAT'S FINE) of the SI-84 set [142]. Detection curves of place of articulation can also

be generated. Nonetheless, manner of articulation is often easier to detect than the place of articulation in both

    spectrogram plots and MLP- or HMM-based detection

    plots. This is mainly due to the fact that manners can

    usually be clearly distinguished in their attribute beha-

    viors. A collaborative effort can easily be envisioned where

    researchers with many years of experience in specific

    topics, e.g., stop sounds [143] and fricatives [144], can

    provide their best detector modules to show their superior

    performance to other competing modules.

    Furthermore, score plots can be used to compare de-

tector performance. If we use 0.5 as a threshold to accept a detected event, then most of the attributes in the utterance

    are correctly detected. Regions with scores below 0.5

    usually indicate low confidence exhibiting either type I or

    type II errors as discussed earlier. In the example shown in

    Fig. 12, two sets of detector score plots for the same manner

of articulation for a short speech segment THAT'S FINE are displayed. The curves with a zigzag shape are obtained

with the MLP-based detectors, while the curves in red with straight-line scores were generated by conventional three-

    state attribute HMMs. It is clear that HMM-based segment

    detectors perform perfectly in this case, while the MLP-

    based detectors show quite a bit of time-varying scores.

E. Evidence Merger

The bank of detectors provides evidence of a particular

    speech event. Bits of evidence at the phonetic feature level

    are combined together to form evidence at a higher level.

    The focus of this section is on how to generate higher level

    evidence at a subword level. There exist several methods to

    generate evidence at a subword level from articulatory

    events. For example, starting with manner and place of

    articulation, a product lattice of degree two may be gener-

ated, and a constrained search may be performed over this lattice to generate phone-level information [145].

    CRFs [118] and segmental CRF (SCRF) [146] have also

    been used to generate phone sequences by combining

    articulatory features. In our framework, all of the detec-

    tor outputs are combined with a feedforward MLP, which

    has a single hidden layer. In a recent work, we demon-

    strated that phone accuracies can be boosted using a deep

    Table 2 List of Speech Attributes Used in the ASAT Experiments

Fig. 11. A preliminary implementation of the ASAT detection-based frontend. Each attribute detector analyzes any given input frame and produces a posterior probability score. The Append module stacks together attribute posterior probabilities. The merger delivers

    phone posterior probabilities.


neural network (DNN) [147]–[149] as shown in [150].

    Fig. 13 provides a schematic representation of the event

merger.

By merging the attribute detector outputs and feeding

    them into the attribute-to-phone mapping merger shown

    in Fig. 13, we can produce frame-based posterior proba-

    bilities, one for each phone of interest, and form a frame-

    based feature vector. A penalized logistic regression with

    HMM-based regressors [151] has also been employed to

    combine information generated by a bank of segment-

based attribute detectors [152], obtaining remarkable results on a phone classification task.

    VI. ASAT APPLICATIONS

    The ASAT detector frontend has been used with success

    as a key component for several attribute-based speech ap-

plications, namely: lattice rescoring [153], [154], language-universal phone recognition [155], and bottom-up,

    stage-by-stage LVCSR [156], [157]. Most of the results

    are preliminary studies because the ASAT-related effort is

    quite recent. We hope to inspire other new applications.

    A discussion of the new insights that may be offered by

    the detector score plots together with the posteriogram

    plots is presented first. Finally, a spoken language identi-

fication system based on attribute detectors is presented in [158]–[160].

A. Visible Speech Analysis Through Attribute Detection

1) Detection Score Plots for MLP: We now analyze detection score plots more closely. Fig. 14 displays a longer

    sentence than that shown in Fig. 12. Although the

detection curves are not as smooth as those in the previous example, the correct transcript can still be obtained by

    following the evolution of the event detection process over

    time. This outcome is also in line with spectrogram reading

    by trained experts based on knowledge in acoustic phone-

    tics (e.g., [41]). The detector scores here are normalized

    between 0 and 1, ranging from an absence of an acoustic

    property to the full presence of a speech cue. The value of

Fig. 13. A possible implementation of the ASAT merger. It is trained using the output of the bank of attribute detectors and generates

phone posterior probabilities. Either a shallow MLP network or a

    DNN can be used.

Fig. 12. Detection curves of manner of articulation for the sentence numbered 442c013 (whose transcript is THAT'S FINE) of the SI-84 data set [142]. The curves in blue were generated using an ANN, whereas the curves in red were generated using segment-based HMM detectors.


these detection scores is a good indication of the activity

levels for the speech events of interest. Therefore, it provides a new visualization tool in addition to the

    conventional spectrogram plot shown in the top panel of

    Fig. 14.

    Error analysis has always played a crucial role in

    providing diagnostic information to improving ASR algo-

    rithms. With the extracted speech cue information re-

    vealed in the new visualization tool, insight can also be

developed in understanding human speech. It could also provide a good tool to offer speech insights to a new

    generation of students and researchers.

    For example, we can see sound transition behavior

    clearly displayed in the region from segment 6 to

    segment 8, going from phone /eh/ to /aa/ with a rising

    activity from the preceding vowel into the glide sound /l/

    in segment 7, then falling away into the following vowel.

We can also observe the overlapping nature of the nasalized vowel at the end of segment 8 and the beginning of

    segment 9. The double stop sound regions in segments 13

and 14 are also signaled. The large overlapping region for

    the two-candidate segment 21 indicates that the glide

    sound /r/ heavily influences articulation in its surround-

    ing phones with a low level vowel activity showing up

    between segments 21 and 22 on the detector plot for the

vowel manner.

It is clear that the detector score plots displayed in

    Fig. 14 provide a rich set of information not commonly

available to researchers who are not expertly trained in

spectrogram reading. It also reinforces additional ad-

    vantages we intend to exploit in the information-

    extraction perspective we have highlighted throughout

    this paper.

2) Posteriogram Plots for MLP and DNN: We plot in Fig. 15 the time evolution of the estimated frame posterior

    probabilities for phones, or phone posteriogram [93], for

    the same short utterance used in Fig. 12. Instead of dis-

    playing the CM value between 0 and 1 as in the detector

    score plots, an intensity similar to spectrogram plots is

displayed, showing darker regions for higher posterior probabilities at each 10-ms frame for the corresponding

    phone. On the vertical axis, 40 phones are listed starting

    with the phone /aa/ as in the word pot at the top, and

    finishing with the phone /zh/ as in the word treasury at

    the bottom. It is clearly visible that the silence unit

    stands out in the beginning and ending parts of the utter-

    ance for both plots, shown in the upper and lower panels,

respectively. A DNN is practically an MLP with many layers (seven hidden layers have been used in our studies),

    where the pretraining algorithm proposed for deep belief

    networks [147] has been applied before training the MLP

    [148], [150], which has a single hidden layer.
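
For readers who wish to reproduce such a display, the recipe below renders a posteriogram with matplotlib from a (phones x frames) posterior matrix; the matrix here is random and only stands in for the merger output at a 10-ms frame rate, and only a subset of the 40 phones is listed.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(3)
phones = ["aa", "ae", "ah", "ay", "dh", "f", "n", "s", "t", "zh"]   # subset of the 40 phones
posteriors = rng.dirichlet(np.ones(len(phones)), size=200).T        # (phones, frames), columns sum to 1

plt.imshow(posteriors, aspect="auto", cmap="gray_r", origin="upper")  # darker = higher posterior
plt.yticks(range(len(phones)), phones)
plt.xlabel("frame index (10-ms steps)")
plt.ylabel("phone")
plt.title("Posteriogram (darker regions indicate higher posterior probability)")
plt.show()
```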

    The DNN posteriograms are often sharper than the

    MLP ones indicating the top candidate phones with the

    darkest region at each vertical time snap shot have less

competition from other phones. The blurry nature in some regions on the plots indicates that some phones are confu-

    sable at that time frame. This posteriogram can be dis-

    played together with the detector score plots shown in

    Fig. 12 to gain insight about the goodness of attribute-to-

    phone mapping. For example, in the initial part of the

Fig. 14. Detection curves of manner of articulation for the sentence numbered 440c20t (RATES FELL ON SHORT TERM TREASURY BILLS) of the SI-84 data set [142]. The correct transcript can still be recovered by following the time evolution of detection of the attribute events.


phone /ay/, it would be misrecognized as phone /aw/,

    but it could be corrected because the duration of a diph-

thong /aw/ cannot be so short. Finally, the DNN-based posteriogram exhibits less noise than that obtained using the MLP with a single hidden layer, and is therefore more reliable than that produced by the MLP-based detectors. With the seven phones together with the two

    silence segments clearly labeled in the bottom panel, it is

    possible to observe the correct transcripts by following the

    black lines (time evolution of the top phone posterior

    probabilities). Furthermore, it is easy to identify the possible

    sources of confusion, and devise techniques to address them.

B. Attribute-Based Lattice Rescoring

Lattice rescoring is reported for an LVCSR task only. The

    readers are referred to [154] for further insights on other

    tasks.

1) Rescoring Technique: The rescoring algorithm aims to integrate the confidence scores generated by the ASAT

    detection-based frontend into the word lattice on an arc-

    by-arc basis. Rescoring is carried out as a linear combina-

    tion of the log-likelihood acoustic score generated by the

baseline LVCSR system and the logarithm of the phoneme posterior probability, properly discounted by the pho-

    neme prior probability, generated by the detector-based

    frontend.
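
In symbols, one plausible form of the arc-level combination (the interpolation weight $\lambda$ is an assumption here, not a detail given in the text) is
\[
\tilde{s}(\mathrm{arc}) \;=\; \log p_{\mathrm{HMM}}(\mathbf{X}_{\mathrm{arc}} \mid q)\;+\;\lambda\,\bigl[\log P_{\mathrm{ASAT}}(q \mid \mathbf{X}_{\mathrm{arc}})\;-\;\log P(q)\bigr],
\]
where $q$ is the phoneme associated with the arc, $\mathbf{X}_{\mathrm{arc}}$ denotes the frames aligned to it, and the bracketed term is the detector-based posterior discounted by the phoneme prior, i.e., a scaled likelihood by Bayes' rule.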

2) Experimental Setup: The experiments are performed using the 5000-word speaker-independent WSJ (5k-WSJ0)

    corpus [142]. The SI-84 data set (7077 utterances from 84

speakers, i.e., 15.3 h of speech material) is used for training. The testing material is again the Nov92 set. ML esti-

    mation [5], [131], [161] is adopted to find the parameters of

    a first HMM baseline. A second HMM baseline is then

Fig. 15. Posteriogram plots for the sentence numbered 442c0213 (THAT'S FINE) of the SI-84 data set [142]. (a) MLP-based posteriogram plot. (b) DNN-based posteriogram plot.


obtained using MMI with the ML HMMs of the first sys-

tem as seed models. A trigram language model within the 5k-WSJ0 vocabulary is used in decoding.

3) Experimental Results: Table 3 shows the performance of the ML-based baseline system, in terms of WER, on the

    Nov92 task. These results are comparable with the results

reported in [162]–[164]. In the second row, the perfor-

    mance of the rescored system on the same task is shown

when both the bank of detectors and the merger are trained with the phonetically rich TIMIT corpus [165], [166]. More-

    over, the WERs of the MMI-based baseline and rescored

    systems are shown in the last two rows, respectively. The

    results indicate that the rescored system achieves better

performance than the conventional decoding scheme in

    both cases. We noticed that several hypotheses have the

    same error patterns for both HMM systems. These incor-

rectly recognized words are typically characterized by an acoustic trajectory that did not strictly observe the under-

    lying acoustic phonetic constraints. The lack of the required

    acoustic-phonetic evidence for the wrong hypotheses could

    be signaled by attribute detectors, and this information

    could then be used to penalize the corresponding phones

    during the rescoring and subsequent decoding steps.

    The phonetic-based correction concept is now illus-

trated on the Nov92 sentence numbered 446c0210. The correct sequence of words for this utterance is: The company said its European banking affiliate Safra Republic plans to raise more than four hundred fifty million dollars through an international offering. However, the baseline ML and MMI systems produce the phrase stock for, instead of the word Safra. Fig. 4 shows the spectrogram for this utterance at the location where the error occurred with the ML- or

MMI-based system.

In Fig. 4, correctly recognizing the word stock requires

    the presence of two stop sounds /t/ and /k/ in the region

    surrounding the middle vowel. But from the spectrogram,

    it can be easily seen that there is a lack of articulatory

    evidence to support this decoded word. The stop detector

    signals these mistakes, and the correct sentence is decoded

    after rescoring.

C. Cross-Language Attribute Detection

We now report our studies on language-independent

    attribute detection, which were extended to phone

recognition with minimal available data for the target language in [155]. English manner attribute scores have

    been effectively incorporated into Mandarin LVCSR to

    improve performance by lattice rescoring in a cross-lang-

    uage manner as well [96].

1) Language-Universal Knowledge Source Definition: Fundamental speech attributes, such as voicing, nasality, and

frication, could be identified from a particular language and shared across many different languages, so they could

    also be used to derive a universal set of speech units. A

    small number of these universal units could be then used

    to model speech sounds. It is worth noting that these

    phonetic features (attributes) have already been used to

    identify a common knowledge source in several studies

    (e.g., [167] and [168]), yet these features were often

employed within a knowledge-based phoneme mapping procedure to: 1) produce an expanded phoneme set to

    cover speech sounds in multiple languages (e.g., [168]);

    or 2) find the mapping between language-dependent (or

    independent) acoustic models and the new target acoustic

    models (e.g., [167]) for decoding purposes.

2) Experimental Setup: The stories part of the OGI multilanguage telephone speech corpus [169] is employed in our investigation. The amount of transcribed data is only

    about 1 h per language, which is significantly smaller than

    the usual amount of data used to train multilingual ASR

    systems, e.g., [170]. This corpus has phonetic transcrip-

    tions for six different languages: English (ENG), German

    (GER), Hindi (HIN), Japanese (JAP), Mandarin (MAN),

    and Spanish (SPA). Three subsets, namely training, vali-

dation, and test sets, are formed using the data available for each language. Table 4 shows the amount of available data

    for each subset and the number of language-dependent

    phone units. Each attribute detector is designed within the

    MLP framework as described in Section V-D1. Perfor-

    mance is reported as in [171].

3) Language-Dependent Attribute Detection: Language-specific data (Table 4) are used to train, validate, and test each detector. Language-dependent attribute accuracies

    are found to be comparable across languages and attri-

    butes. This implies that attribute classification could be

    reliably obtained for a variety of languages. Furthermore,

    good attribute accuracies could be achieved for several

    attributes, such as vowel (92%) and continuant (90%),

Table 4 The OGI Stories Corpus in Terms of Amount of Data (in Hours) and Number of Phonemes Used per Language

Table 3 WER, in Percentage, on the Nov92 Task. Rescoring Was Applied to Both the ML- and MMI-Based Baseline Systems, Trained on the SI-84

    Material of the WSJ0 Corpus


across languages. The full list of accuracies and insights

    can be found in [155].

4) Cross-Language and Language-Universal Attribute Detection: The detectors of a specific language are now tested on the data of the other languages. Fig. 16 shows the

    attribute accuracy rates on the MAN data. The connected

    line highlights results obtained with Mandarin-based de-

    tectors. From Fig. 16, we can make several observations.

    First, detection across languages is less reliable than in the

    language-dependent cases for several attributes, but the

drop in performance is not particularly severe. Attribute accuracies are comparable across all languages for the

    vowel class. There are also cases for which cross-language

    detection outperformed in-language detection, e.g., round.

    This indicates that the round detector trained on a non-

    Mandarin language performed as well as the language-

    dependent Mandarin detector. A similar trend is observed

    for all of the languages used in our studies.

By pooling together all training materials from the six languages, a new language-independent data set could be

formed. Then, a single bank of (universal) detectors could be trained on this new data set. We observed that

    better attribute accuracies could be attained for quite a few

    attributes, e.g., vowel, fricative, round, and mid, with only

    a minor degradation for the worst performing detectors.

D. ASAT-Based Bottom-Up LVCSR

As an initial attempt at implementing a bottom-up

LVCSR system, the hybrid ANN/HMM approach has been

    modified to explicitly represent and manipulate the search

    space at various points in the decoding process [156] using

    weighted finite state machines (WFSMs) [172]. ASR is

then accomplished in a bottom-up fashion by performing

    backend lexical access and syntax knowledge integration

    over the output of our detection-based frontend, which

    generates frame-level speech attribute detection scores

and phone posterior probabilities. Decoupled recognition is made possible by two main factors: 1) high-accuracy

    detection of acoustic information in order to generate

    high-quality lattices at every stage of the acoustic and

    linguistic information processing; and 2) low-error prun-

    ing of the generated lattices in order to reduce search

    errors likely to occur when trying to minimize the possi-

    bility of memory overflow in using the AT&T WFSM tool.

1) Detection-Based LVCSR With WFSMs: LVCSR is accomplished by building upon the frame-level evidence

    gathered at the output of the detection-based frontend

    shown in Fig. 11. The first step is to represent the output

    of the detection-based frontend, for a given utterance,

as an acceptor F. In practice, F is a graph with a number of states that equals the length of the input

sentence (in frames), and a number of edges between each pair of states that equals the output dimension of the

    merger (i.e., the number of even