INVITED PAPER

An Information-Extraction Approach to Speech Processing: Analysis, Detection, Verification, and Recognition

This paper presents an integrated detection and verification approach to information extraction from speech that can be used for speech analysis, and recognition of speech, speakers, and languages.

By Chin-Hui Lee, Fellow IEEE, and Sabato Marco Siniscalchi, Member IEEE
ABSTRACT | The field of automatic speech recognition (ASR) has enjoyed more than 30 years of technology advances due to
the extensive utilization of the hidden Markov model (HMM)
framework and a concentrated effort by the speech community
to make available a vast amount of speech and language
resources, known today as the Big Data Paradigm. State-of-the-
art ASR systems achieve a high recognition accuracy for well-
formed utterances of a variety of languages by decoding
speech into the most likely sequence of words among all possi-
ble sentences represented by a finite-state network (FSN) ap-
proximation of all the knowledge sources required by the ASR
task. However, the ASR problem is still far from being solved
because not all information available in the speech knowledge
hierarchy can be directly integrated into the FSN to improve
the ASR performance and enhance system robustness. It is
believed that some of the current issues of integrating various
knowledge sources in top-down integrated search can be partially addressed by processing techniques that take advantage
of the full set of acoustic and language information in speech. It
has long been postulated that human speech recognition (HSR)
determines the linguistic identity of a sound based on detected
evidence that exists at various levels of the speech knowledge
hierarchy, ranging from acoustic phonetics to syntax and
semantics. This calls for a bottom-up attribute detection and
knowledge integration framework that links speech processing
with information extraction, by spotting speech cues with a
bank of attribute detectors, weighting and combining acoustic
evidence to form cognitive hypotheses, and verifying these
theories until a consistent recognition decision can be reached.
The recently proposed automatic speech attribute transcrip-
tion (ASAT) framework is an attempt to mimic some HSR
capabilities with asynchronous speech event detection followed by bottom-up knowledge integration and verification. In
the last few years, ASAT has demonstrated good potential and
has been applied to a variety of existing applications in speech
processing and information extraction.
KEYWORDS | Acoustic phonetics; automatic speech attribute transcription (ASAT); automatic speech recognition (ASR);
cross-language phone recognition; knowledge integration;
lattice rescoring; place and manner of articulation; speech
attribute detection
I. INTRODUCTION
It is instructive to examine some of the key developments in automatic speech recognition (ASR) that have occurred
in the past few decades and contemplate new directions
that might lead to better system designs. The ASR problem
Manuscript received May 18, 2012; revised September 13, 2012; accepted January 4,
2013. Date of publication February 7, 2013; date of current version April 17, 2013.
The ASAT project was supported by the National Science Foundation (NSF)
Information Technology Research (ITR) Program under Contract IIS-04-27113. Part
of S. M. Siniscalchi's ASAT-related work was supported by the Spoken Information
Retrieval by Knowledge Utilization in Statistical Speech Processing (SIRKUS) project
through Prof. T. Svendsen at the Norwegian University of Science and Technology
(Trondheim, Norway).
C.-H. Lee is with the School of Electrical and Computer Engineering, Georgia Institute of Technology, Atlanta, GA 30332 USA (e-mail: [email protected]).
S. M. Siniscalchi is with the Faculty of Architecture and Engineering, University of Enna Kore, Enna 94100, Italy (e-mail: [email protected]).
Digital Object Identifier: 10.1109/JPROC.2013.2238591
Vol. 101, No. 5, May 2013 | Proceedings of the IEEE 1089
0018-9219/$31.00 © 2013 IEEE
is still far from solved, judging from the limited deployment of products and services worldwide, the fragile nature of performance robustness for state-of-the-art ASR systems, and the slowdown of recent progress in performance improvement, now that the amount of training data to learn acoustic and language models for ASR is no longer the only major technology concern in designing ASR systems. Nevertheless, it is also clear that tremendous technology advances have been made since the speech community adopted an information-theoretic perspective
of channel modeling for encoding speech from language,
and formulated the ASR problem as a channel decoding
paradigm [1].
Taking advantage of the sequential nature of speech,
and combining an efficient stage-by-stage decoding strat-
egy with dynamic programming (DP, e.g., [2] and [3]) and
a Markov inferencing framework [4]–[6], a continuous speech recognition algorithm was first developed in [7] to
deliver good performance. The ease of learning speech and
language models from data triggered almost four decades
of rapid technology progress for ASR based on this integrated pattern modeling and decoding framework later
known as hidden Markov models (HMMs, e.g., [8]). The
same automatic data learning paradigm has been extended
to quite a few machine learning problems in the last two decades. One of the most notable accomplishments was
the development of a statistical machine translation (MT)
framework originated by a group of ASR researchers at
IBM [9], which also spun off many recent MT research and
application activities and other similar statistical natural
language processing (NLP) efforts (e.g., [10] and [11]).
The aforementioned statistical pattern matching approach to ASR is considered a paradigm shift from the traditional speech science perspective of crafting heuristic
rules manually based on expert observations from limited
data and local optimization, which is sometimes known as
a bottom-up knowledge integration process. In contrast to
traditional knowledge-rich approaches, the current knowl-
edge-ignorant or knowledge-implicit modeling framework
[12]–[14] relies on collecting a large amount of speech and text examples and learning the model parameters without the need to use detailed knowledge about a target language. It offers an advantage for automatic model learning
from a large collection of data via a rigorous mathematical
formulation and global optimization by using all the avail-
able knowledge sources at the same time, known as top-down knowledge integration, ready for DP-based optimal
decoding [15].
During the transition to the new paradigm in the 1970s, an intensive effort in applying acoustic and linguistic knowledge sources to speech recognition in the
Advanced Research Projects Agency (ARPA) Speech
Understanding Project [16] was witnessed. Many notable
examples were documented [17], [18]. Nonetheless, expert
knowledge was required to design even a simple ASR
system, which made the ASR technology hard to access.
Furthermore, robustness to adverse conditions was never addressed in a serious manner. Much of the knowledge accumulated in these studies, e.g., [16]–[18], was not fully explored in current HMM-based systems. Moving into the
1980s and 1990s, the dominance of data-driven learning
approaches to speech modeling was witnessed. A number
of techniques, including vector quantization (VQ) [19],
HMM [8], self-organizing map (SOM) [20], and artificial
neural network (ANN) [21], [22], have been successfully adopted.
After many years of concentrated deliberation, the
speech community has come a long way from the learning
stage in the 1970s, and made a tremendous drive in data-
driven approaches in the 1980s, 1990s, and 2000s. A
continuous stream of performance improvement and
increasing task complexity has thus been observed. For
more detail, the reader is referred to a special issue of the Proceedings of the IEEE on Spoken Language Processing published in August 2000 [23]. A number of books [24]–[34] have also been published. However, it is also safe to
argue that the technology progress has slowed down in
recent years. Most research groups are searching for the
next trend to move ASR forward. This phenomenon is
known as the S-Curve in learning, illustrated by the curve labeled in solid circles in Fig. 1. The community would generally agree that the fragile nature of ASR system design
will require new technological breakthroughs before con-
versational systems really become a ubiquitous user inter-
face mode, being able to compete with conventional
graphical user interface with point-and-click devices like
mouses or touch-sensitive screens.
In examining Fig. 1 closely, we can roughly divide the
ASR technology progress into three periods: 1) before the 1970s, labeled in blue circles, the speech community enjoyed a vast creation of speech knowledge sources (e.g., [35]–[39]); 2) between the 1970s and the 2010s, labeled in
green circles, data-driven models dominated four decades
of fast advances with HMM playing the role of a paradigm
shift and emerging as the leading framework used in
almost all modern ASR systems; and 3) beyond the 2010s,
labeled in pink circles, one can envision an imaginary path in which the speech community may be waiting for another paradigm shift to take place by exploring knowledge-
rich modeling [12]. However, this paradigm shift should
still leverage on data-driven automated learning from big
language resources. By setting a human performance
capability ceiling shown in a horizontal line in the upper
part of Fig. 1, it is noted that there is still a big gap between
the current state-of-the-art and a human speech recognition (HSR) system. In order to address the suprahuman
performance goal set by IBM a few years ago [40], the
speech community will need fast technology progress
again, similar to what the community had enjoyed in the
last four decades. This may be a good time to call for a
paradigm shift and reexamine what can be done to carry
the community forward.
There have been many attempts to find the next para-
digm shift. One of them is exploiting the distinctive
features of speech (e.g., [16] and [37]) to form words along
with the use of lexical access to a dictionary of words as demonstrated in spectrogram reading by human experts [41] in the MIT Summit system [42]. Using bottom-up
knowledge integration, HSR also performed much better
than ASR in most benchmark tests [43], [44]. The speech
cues, or attributes, or events, can thus serve as acoustic
landmarks [42], [45]–[47], sometimes referred to as
islands of reliability in the ocean of competing speech
events, to improve bottom-up knowledge integration. Recently, a detection approach to ASR, called automatic
speech attribute transcription (ASAT) [13], was proposed
in another attempt to address research issues in HSR and
spectrogram reading through bottom-up attribute detection and stage-by-stage knowledge integration. We will
collectively refer to this set of viewpoints as an informa-
tion-extraction perspective to extract useful acoustic and
linguistic information for the purposes of speech recognition and understanding. ASAT also facilitates a modular
strategy so that various researchers can collaborate by
contributing their best detectors or knowledge modules to
plug-n-play into the overall system design.
The rest of the paper is organized as follows. In
Section II, we briefly review the statistical pattern recog-
nition approach to ASR. We describe the single most important technique that has helped advance the state of the art of ASR, namely hidden Markov modeling of speech,
and discuss current ASR capabilities. We next address
several ASR technology limitations in Section III. A list of active ASR research challenges in robustness, decision strategies, and utterance verification is also presented. These challenges lead to a bottom-up speech attribute
detection framework followed by a stage-by-stage knowl-
edge integration process to be discussed in Section IV. To
enhance the current capabilities and alleviate some of the
limitations of HMM-based ASR, a bottom-up detection approach to ASR, called ASAT, is presented in Section V. A survey of ASAT-based speech processing applications
and their advantages over the conventional top-down
approach to ASR are highlighted in Section VI. In
Section VII, we present a critical look at new research
directions and future work opportunities through estab-
lishing the collaborative ASR Community of the 21st Century. Finally, we conclude our findings in Section VIII.
II. STATE-OF-THE-ART TOP-DOWN ASR
Modern ASR system design is based on a statistical pattern
matching framework that is motivated by representing
spoken utterances as stochastic patterns (e.g., [7], [15],
[25]–[27], and [48]) and formulating an information-theoretical perspective of speech generation, acquisition, and transmission (e.g., [15]). Many good studies on acoustic modeling are available (e.g., [49]–[52]). Equally as many papers are concerned with language modeling (e.g., [53]–[58]). In the following sections, a brief overview on the current statistical approach to ASR is presented.
A. Statistical Pattern Recognition Theory
Starting with a message M from a message source, a sequence of words W is formed through a linguistic channel. Different word sequences may often convey the same
message. It is then followed by an articulatory channel that
converts the discrete word sequence into a continuous
speech signal S. Speaker effect, which accounts for a major portion of the speech variabilities including speech
production difference, accent, dialect, speaking rate, etc.,
is added at this point. Additional speech distortion is
Fig. 1. The S-Curve of ASR technology progress: 1) before the 1970s: Vast creation of speech knowledge sources; 2) from the 1970s to the 2010s: Data-driven model learning with rich speech and language data resources; and 3) beyond the 2010s: What to do to sustain fast technology progress?
introduced when the signal passes through the acoustic channel that includes the speaking environment, interfering noise, and the transducers used to capture the speech
signal. This acoustic realization A is then passed through some transmission channel before it reaches an ASR system as an observed signal X (e.g., [59]).
For real-world practical problems, it is difficult to char-
acterize the intermediate channels, such as articulatory,
acoustic, and transmission channels, which are lumped together as a noisy channel. The noisy channel model is usually formulated as follows: 1) the joint distribution p(W, X) is decomposed into two components p(X|W) and P(W), known as an acoustic model (AM) and a language model (LM), respectively; 2) the forms of p(X|W) and P(W) are assumed to be parametric probability density functions (pdfs), i.e., p_Λ(X|W) and P_Θ(W), respectively; and 3) the parameters Λ and Θ are estimated from some training data. With these simplifications, the most popular way to solve the ASR problem is to use the well-known plug-in maximum a posteriori (MAP) decision rule (e.g., [60]–[62])

    Ŵ = argmax_{W ∈ Ω} P(W|X) = argmax_{W ∈ Ω} p_Λ̂(X|W) P_Θ̂(W)^L    (1)

where Λ̂ and Θ̂ are the estimated parameters obtained during training, Ŵ is the recognized sentence from decoding, and Ω is the set of valid candidate word sequences to be searched during testing. This decision rule, derived from the optimal Bayes decision rule, is also widely used in many other pattern recognition applications. In the above equation, L, commonly known as a language model multiplier, is used to balance the AM and LM contributions to the overall probability, due to the unknown true distributions and the use of a likelihood function p_Λ̂(X|W) to compute the acoustic probability.
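In the log domain, rule (1) amounts to picking the candidate maximizing the acoustic log-likelihood plus L times the LM log-probability. A minimal sketch of this selection; the candidate sentences, all scores, and the LM weight below are illustrative, not taken from the paper:

```python
import math

def map_decode(hypotheses, lm_weight):
    """Plug-in MAP rule in the log domain: pick the word sequence W
    maximizing log p(X|W) + L * log P(W), with L the LM multiplier."""
    best_w, best_score = None, -math.inf
    for w, (am_loglik, lm_logprob) in hypotheses.items():
        score = am_loglik + lm_weight * lm_logprob  # balance AM vs. LM
        if score > best_score:
            best_w, best_score = w, score
    return best_w, best_score

# Hypothetical log scores for three candidate sentences.
cands = {
    "it may curb":  (-120.0, -6.0),
    "and maker of": (-118.0, -7.5),
    "it make herb": (-130.0, -9.0),
}
w, s = map_decode(cands, lm_weight=10.0)  # -> ("it may curb", -180.0)
```

Note how the LM weight decides the outcome: with lm_weight near zero, the acoustically closer "and maker of" would win instead.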
A block diagram of this top-down approach to ASR is
shown in Fig. 2. The feature extraction module provides the
acoustic feature vectors used to characterize the spectral
properties of the time-varying speech signal, typically in
terms of time-synchronous cepstral analysis [63]. Heteroscedastic linear discriminant analysis (HLDA) [64] is routinely used in modern ASR systems as a method for learning
projections of high-dimensional acoustic representations
into lower dimensional spaces. The information provided
by the AM, LM, and word models is then used to evaluate
the similarity between the input feature vector sequence
(corresponding to a portion of the input speech) and a set of
acoustic word models for all words in the vocabulary to
determine which words were most likely spoken.
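The frame-synchronous DP evaluation behind this matching can be sketched with a toy Viterbi decoder; the two-state model and every log score below are invented for illustration only:

```python
def viterbi(obs_loglik, trans_loglik, init_loglik):
    """Frame-synchronous DP: keep, per state, the best partial-path
    log score at each frame, then backtrack through the stored
    predecessors to recover the most likely state sequence."""
    T, N = len(obs_loglik), len(init_loglik)
    delta = [init_loglik[j] + obs_loglik[0][j] for j in range(N)]
    psi = [[0] * N]                      # backpointers per frame
    for t in range(1, T):
        new_delta, back = [0.0] * N, [0] * N
        for j in range(N):
            best = max(range(N), key=lambda i: delta[i] + trans_loglik[i][j])
            back[j] = best
            new_delta[j] = delta[best] + trans_loglik[best][j] + obs_loglik[t][j]
        delta = new_delta
        psi.append(back)
    state = max(range(N), key=lambda j: delta[j])
    path = [state]
    for t in range(T - 1, 0, -1):        # backtrack to frame 0
        state = psi[t][state]
        path.append(state)
    return path[::-1]

# Toy two-state model: frame 0 favors state 0, frames 1-2 favor state 1.
init = [0.0, -3.0]
trans = [[-0.1, -2.3], [-2.3, -0.1]]
obs = [[0.0, -2.0], [-2.0, 0.0], [-2.0, 0.0]]
path = viterbi(obs, trans, init)  # -> [0, 1, 1]
```

A production FSN decoder adds beam pruning and word-level backpointers, but the per-frame maximization is the same.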
B. Three Key HMM Advances
Top-down HMM modeling is considered one of the
most fruitful areas in characterizing speech and language
in recent years. Key advances [59] can be summarized in
three broad categories to be discussed in the following.
1) Detailed Modeling: Maximum-likelihood (ML) estimation has been extensively developed following Baum's original formulation
[5]. Software packages, such as HTK [65] and GMTK [66],
are available now in the public domain to establish acoustic
models with hundreds of thousands of Gaussian mixture components, and support language models with hundreds of millions of n-gram probabilities. The previous limitation imposed by the curse of dimensionality, widely known in
the pattern recognition community, was alleviated with many advanced modeling techniques that take parameter
sharing into account, such as the commonly used tied-state
tree learning strategy [67] in subphone modeling [68], [69].
2) Adaptive Modeling: A significant drop in performance is often observed when an ASR system is used in an operating condition that is different from the training condition. Adaptation algorithms try to automatically tune a given set of HMMs to a new test environment using a
limited, but representative set of new data, commonly
referred to as adaptation data. There exist two major
adaptation approaches: the transformation-based approach
and the Bayesian approach. The best known example of
transformation-based adaptation is the maximum-
likelihood linear regression (MLLR) framework [70]. The
feature-space MLLR (fMLLR) [71] extends MLLR and has proven to be highly effective as a method for unsupervised
adaptation. In Bayesian learning (e.g., [72]), prior den-
sities need to be assumed and MAP estimates are obtained
for the HMM parameters. When the adaptation data size is
limited, structural maximum a posteriori (SMAP) adaptation [73] improves the efficiency of MAP estimation. Correlated HMMs with online adaptation were also shown to be both efficient and effective [74] by considering HMM parameter correlations.
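As a rough illustration of the transformation-based route, the core MLLR update rewrites every Gaussian mean as mu' = A mu + b with one transform shared across components; estimating (A, b) from adaptation data by maximum likelihood is omitted here, and the numbers below are made up:

```python
def mllr_adapt_means(means, A, b):
    """Apply a global MLLR mean transform mu' = A mu + b to every
    Gaussian mean; sharing one (A, b) across all components pools
    the scarce adaptation data."""
    adapted = []
    for mu in means:
        adapted.append([sum(A[i][j] * mu[j] for j in range(len(mu))) + b[i]
                        for i in range(len(b))])
    return adapted

# Two hypothetical 2-D means and an illustrative transform.
means = [[1.0, 0.0], [0.0, 2.0]]
A = [[1.1, 0.0], [0.0, 0.9]]
b = [0.5, -0.2]
new_means = mllr_adapt_means(means, A, b)  # ~ [[1.6, -0.2], [0.5, 1.6]]
```

With more adaptation data, the single global transform is typically replaced by one transform per regression class of Gaussians.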
3) Discriminative Modeling: Due to inaccurate model assumptions and limited training data, maximizing the
likelihood can be quite different from minimizing the pro-
bability of recognition errors, which is the ultimate goal in
ASR. Using a learning criterion that is consistent with ASR
Fig. 2. A typical block diagram of a continuous speech recognitionsystem with integrated search via a finite state network
representation of all the key task constraints, such as AM and LM.
Lee and Siniscalchi: An Information-Extraction Approach to Speech Processing
1092 Proceedings of the IEEE | Vol. 101, No. 5, May 2013
-
objectives, the minimum classification error (MCE) [75], [76] learning for HMM has been shown to be quite
effective in improving the model separation, system accu-
racy, and performance robustness. Risk-based optimiza-
tion, such as maximum mutual information (MMI) (e.g.,
[77]), and minimum phone error (MPE) [78], has also
proven quite effective in reducing the error rates. The
MMI and MPE objective functions have produced good results when extended to feature-space MMI (fMMI) [79] and feature-space MPE (fMPE) [80].
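The MCE idea can be sketched as a smoothed error count: a misclassification measure compares the correct-class discriminant score against a soft maximum over competitors, and a sigmoid maps it to a differentiable loss. All scores and smoothing constants below are illustrative:

```python
import math

def mce_loss(g_correct, g_competing, eta=2.0, gamma=1.0):
    """Smoothed MCE loss: the misclassification measure d compares
    the correct-class score with a soft maximum of competitor scores;
    a sigmoid then yields a differentiable 0/1 error count."""
    m = len(g_competing)
    soft_max = (1.0 / eta) * math.log(
        sum(math.exp(eta * g) for g in g_competing) / m)
    d = -g_correct + soft_max            # positive d means an error
    return 1.0 / (1.0 + math.exp(-gamma * d))

# Well-separated correct class -> loss near 0; confusable -> near 0.5.
low = mce_loss(g_correct=5.0, g_competing=[-1.0, -2.0])
high = mce_loss(g_correct=0.0, g_competing=[0.0, -0.5])
```

Minimizing this loss over training tokens, rather than the likelihood, is what aligns the training objective with recognition error.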
III. TECHNOLOGY LIMITATIONS AND CHALLENGES
Although many successful implementations of commercial
products and services for many different languages have
been witnessed, ASR technology is still rather fragile. Unless the users follow strict protocols that are consistent
with the speaking styles of the speaker population, trans-
ducer and channel characteristics of the training data con-
ditions, and the background acoustic environments, the
high accuracies obtained with HMM-based systems cannot
often be maintained across adverse conditions. This robustness issue limits the wide deployment of spoken language systems. Three major technical challenges are illustrated as follows.
A. Challenges in Model Estimation and Robustness
Since it is not practical to collect a large set of speech
and text examples by a large population over all possible
combinations of signal conditions, it is likely that mismatch between training and testing conditions is a major source of errors for conventional pattern matching systems. A state-of-the-art system may perform poorly when the test
data are collected under a totally different signal condi-
tion. Regarding the possible mismatches, both linguistic
and acoustic mismatches (e.g., [81]) might occur.
The mismatch can be conceptually viewed in the signal,
feature, or model space as shown in Fig. 3 where a
maximum-likelihood stochastic matching framework was
proposed to address the ASR robustness issues caused by this mismatch [82]. The mismatch can be modeled by a distortion D1 in the signal space and handled by speech enhancement. A feature-space distortion D2 can also be considered, and feature compensation can be performed
(e.g., [83] and [84]). Finally, the mismatched situation can
be handled in the model space with a transformation D3 that maps the trained models into the test environment via
adaptation [59].
1) Inconsistency in Language Modeling: A linguistic mismatch is mainly caused by incomplete task specifications,
inadequate knowledge representations, insufficient train-
ing data, etc. For example, task model and vocabulary
usage heavily influence the efficacy of the training process.
As mentioned before, out-of-vocabulary (OOV) words not
specified in a task vocabulary are major sources of recog-
nition errors. For syllabic languages, such as Mandarin,
there is a major problem in consistently defining words.
For example, a four-character word can be broken down
into two two-character words or even into four single-
character words. If all possible combinations of word segments are considered in the LM, it may cause biases in
computing word probabilities. In another evaluation on
the 5000-word Wall Street Journal (WSJ) task, we found
that a 4% word error rate (WER) can be achieved with the
trigram language model. However, the WER went as high
as 70% when no language constraints were used [85]. It is
clear that the choices of LMs and language weights in most
unfamiliar situations will be hard when the task definition is incomplete, e.g., in the case of spontaneous speech to be
discussed later.
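The WER figures quoted above are obtained by aligning each hypothesis against its reference transcription with a word-level edit distance; a minimal sketch, with example strings that are our own rather than from the WSJ evaluation:

```python
def word_error_rate(ref, hyp):
    """WER = (substitutions + deletions + insertions) / #reference
    words, via Levenshtein alignment over word tokens."""
    r, h = ref.split(), hyp.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i                      # delete all reference words
    for j in range(len(h) + 1):
        d[0][j] = j                      # insert all hypothesis words
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # match/substitution
    return d[len(r)][len(h)] / len(r)

# Three substitutions over five reference words -> WER of 0.6.
wer = word_error_rate("it may curb the demand", "and maker of the demand")
```

Because insertions count against the hypothesis, WER can exceed 100% for very noisy output, which is why the unconstrained-decoding figure above is so large.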
2) Inconsistency in Acoustic Modeling: An acoustic mismatch between training and testing arises from various
sources, including differences in desired speaking formats
and signal realizations. For a given task, speech models
trained based on task-dependent data usually outperform models trained with task-independent data. Similarly,
speech models trained based on speakers with normal
speaking rate will usually encounter problems for fast and
slow talkers. Another major source of acoustic mismatch
derives from varying signal conditions. For example,
changes in transducers, channels, speaking environments,
speaker population, echoes and reverberations, and com-
binations of them, all contribute to performance degradation. In addition to the previously discussed linguistic and
acoustic mismatches, model incorrectness and estimation
error also cause robustness problems for ASR.
3) Need for Collaborative ASR: The robustness problem of current ASR systems might be solved by combining the
different approaches developed by different members in
the speech community.
Fig. 3. Mismatch in training and testing: The two starred blocks indicate that the model obtained in training in the upper panel and
the features obtained in testing in the lower panel give acoustic
mismatches if the testing environments are very different from the
training conditions, resulting in a system with operating pair
mismatches shown.
In contrast to the model-based pattern matching approach to extracting information from speech, a collection
of signal-based algorithms needs to be developed in order
to detect acoustic landmarks, such as vowels, glides, and
fricatives, in adverse conditions. They could serve to select good data segments and to design signal-specific speech enhancement, feature compensation, and model adaptation algorithms for reliable information extraction. Attribute-specific features, such as voice onset time (VOT) for discriminating voiced against unvoiced stops [86], were developed and used for designing robust attribute detectors. Since there are many speech attributes and
acoustic conditions to be dealt with, a collective expertise
will be needed in order to address different combinations
of robustness issues. It is well known that no single ro-
bustness technique is capable of handling a wide range of
adverse conditions. This again is a good opportunity to develop collaborative efforts to solve this diverse problem.
B. Challenges in Search Strategy
Significant progress has been made in developing efficient and effective search algorithms in the last few years
[87]. In the future, it seems reasonable to assume that a
hybrid search strategy, which combines a modular search
with a multipass decision, will be used extensively for large vocabulary recognition tasks. Good delayed decision strategies in each decoding stage are required to minimize
errors caused by hard decisions. As an example, the N-best search paradigm (e.g., [88]), including the generation of
multiple-theory lattices, is an ideal way for integrating
multiple knowledge sources. Such a strategy fuses multiple
hypotheses to rescore a preliminary set of candidate digit
strings with higher level constraints like a digit check sum [89], detailed crossword unit models, and long-term language models. It has also been used to provide competing
string hypotheses for discriminative training and for com-
bining multiple acoustic models to reduce recognition
errors [90]. As another example, confusion networks have
been proposed in [91] to represent alternative hypotheses
distilled from a word lattice and improve the accuracy of
the speech recognition system. We expect to see more use of the multipass search paradigm to find preliminary hypotheses in a top-down strategy by incorporating high-
level knowledge sources that cannot be integrated easily
into the finite-state network (FSN) representation for
frame-based DP search. By combining a good multipass
search strategy with utterance verification (e.g., [92])
strategies, more flexible and efficient designs in large
vocabulary continuous speech recognition (LVCSR) spoken language systems are possible.
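Rescoring an N-best list with a knowledge source that could not be compiled into the first-pass FSN, such as a digit check sum, can be sketched as follows; the toy divisibility constraint, penalty value, and weight are illustrative, not from [89]:

```python
def rescore_nbest(nbest, extra_score, weight):
    """Delayed decision: rerank an N-best list by adding a weighted
    score from a knowledge source (e.g., a check sum or long-span LM)
    left out of the first-pass FSN search."""
    return max(nbest, key=lambda h: h[1] + weight * extra_score(h[0]))

def checksum_score(digit_string):
    """Toy constraint: valid strings have a digit sum divisible by 3."""
    return 0.0 if sum(int(t) for t in digit_string.split()) % 3 == 0 else -5.0

# First-pass scores slightly favor an invalid string; rescoring fixes it.
nbest = [("1 2 4", -10.0), ("1 2 3", -10.5)]
best = rescore_nbest(nbest, checksum_score, weight=1.0)  # -> ("1 2 3", -10.5)
```

The same mechanism accommodates any late knowledge source: crossword acoustic models, long-span LMs, or prosodic features, as long as it can score a complete hypothesis.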
As already mentioned, a recognition decision is often
made by jointly considering all the knowledge sources in
the integrated approach to ASR shown in Fig. 2. In principle,
this search strategy achieves the highest performance if all
the knowledge sources are completely characterized and
fully integrated using the speech knowledge hierarchy in
the linguistic structure of acoustics, lexicon, syntax, and semantics. This is the commonly adopted search strategy
in speech recognition today. However, there are a number
of problems with the integrated approach, because not all
knowledge sources can be completely characterized and
properly integrated.
For LVCSR tasks, the compiled FSN is often too large
and therefore becomes computationally expensive to find
the best sentence through a huge and ever-expanding search space. Thus, all knowledge sources must remain
simple in order to efficiently combine them into a single
search space. In particular, this has inhibited progress at
the linguistic level, and almost all LVCSR systems employ nonoptimal linguistic components such as static lexicons (lexicalization of morphological processes) and n-gram LMs that force the decoding process to generate hypotheses that sometimes conflict with the acoustic constraints. Two WSJ examples, illustrated in the following subsections, demonstrate how modular search can
correct wrong recognition results obtained with current
top-down, HMM-based systems.
Both examples highlight the importance of bottom-up attribute detection and stage-by-stage knowledge integration, which are two key topics to be discussed throughout the paper. We will come back to this central theme later in Section IV.
1) Inconsistency With Attributes in Integrated Search: The first WSJ example incorporates correct low-level information from speech attributes. Specifically, it has been observed that a conventional LVCSR system evaluated on the WSJ task often confuses the word "safra" with the phrase "stock for." Nonetheless, recognizing the word "stock" requires the presence of two stop sounds /t/ and /k/ in the
region of a vowel. This can be checked by visually inspect-
ing the spectrogram in the upper panel of Fig. 4, which
does not show the presence of stop sounds before and after
the middle vowel. Moreover, the frame-wise time evolu-
tion of the output posterior probabilities (generated by a
bank of ANN-based detectors for manner of articulation)
displayed in the lower panel of Fig. 4, known as a posteriogram [93], clearly indicates that there are no stop
events in the area where the mistake occurred, and it also
signals the presence of a glide (/r/ in this case) followed by
a vowel at the end of the time-span under analysis. If this
information could be properly extracted and integrated
into search, these errors could be avoided. We will come
back to this example again later in Section VI-B in which
this particular utterance is corrected by combining attribute detection scores with the log-likelihood scores in
attribute-based lattice rescoring.
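A simplified view of such attribute-based rescoring: add a weighted sum of log detector posteriors, over the frames where a hypothesis demands an attribute, to the arc's HMM log-likelihood, so that a hypothesis requiring stop sounds where none are detected is penalized. The posterior values, likelihoods, and weight below are invented for illustration:

```python
import math

def attribute_rescore(hmm_loglik, attr_posteriors, weight=0.5):
    """Combine an arc's HMM log-likelihood with log posteriors from
    attribute detectors (e.g., ANN manner-of-articulation outputs)
    over the frames where the hypothesis demands that attribute."""
    attr_score = sum(math.log(max(p, 1e-10)) for p in attr_posteriors)
    return hmm_loglik + weight * attr_score

# Hypothetical stop-sound posteriors for the frames where "stock"
# requires /t/ and /k/: weak detector evidence penalizes that path.
stock_score = attribute_rescore(-50.0, [0.02, 0.01, 0.03])
safra_score = attribute_rescore(-52.0, [0.90, 0.85, 0.95])
# safra_score > stock_score despite the weaker first-pass likelihood
```

The flooring of posteriors at 1e-10 keeps a single zero-probability frame from vetoing an otherwise plausible arc.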
2) Inconsistency With Prosody in Integrated Search: In the second WSJ example, correct suprasegmental information from pitch and duration is used. Specifically, suprasegmental information, such as prosody, and language
constraints, such as morphosyntactic language models,
cannot be easily cast into the FSN specification when per-
forming topdown knowledge integration. However, po-
tential errors can be corrected by using suprasegmental
pitch contours and duration features, as demonstrated by a visual inspection of Fig. 5. In the top panel, the waveform for the WSJ sentence "if the Fed pushes the dollar higher, it may curb the demand for U.S. exports" in the time span between 2.72 and 3.32 s is displayed to show a three-word
recognition error occurring when using the same HMM-
based system as the example in Fig. 4.
Specifically, the phrase "it may curb" is misrecognized as "and maker of." The panel below shows the frame energy, whereas the F0 contour is shown in the third panel. The recognized phone and word sequences are reported in
the fourth and fifth panels, respectively. The reference
phone and word transcriptions are displayed in the sixth
and seventh panels, respectively. Knowledge-based analy-
sis of the second and third plots reveals two inconsisten-
cies in recognizing the middle word maker in the phrase,
namely: 1) the F0 for the segment ker is too high withrespect to that for the preceding segment ma that puts
a strong stressed syllable in the middle of the word; and
2) the glottal closure of 60 ms of the stop sound in
maker is too long. It should be a stop gap of an un-
voiced stop as in the correct but misrecognized word
curb instead. Better F0 estimation will enhance this ca-pability (e.g., [94]). A recent study also demonstrated that
the performance of Mandarin LVCSR can be significantlyenhanced by incorporating prosodic information, such as
break models and pitch [95], [96].
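A minimal sketch of how such prosodic consistency checks might be automated is given below. The syllable representation, the thresholds (F0 jump ratio, 40-ms closure limit), and the flag names are illustrative assumptions, not values from the systems cited above.

```python
def prosody_flags(syllables, f0_jump_ratio=1.3, max_closure_ms=40.0):
    """Flag simple prosodic inconsistencies in a hypothesized word.

    syllables: list of dicts with a mean 'f0' (Hz) and a stop
    'closure_ms' (None when the syllable contains no stop).
    Thresholds here are illustrative, not trained values.
    """
    flags = []
    for prev, cur in zip(syllables, syllables[1:]):
        # 1) an unexpectedly strong F0 rise mid-word suggests a stress
        #    pattern inconsistent with the hypothesized word
        if cur["f0"] > f0_jump_ratio * prev["f0"]:
            flags.append("f0_jump")
    for syl in syllables:
        # 2) an overlong glottal closure contradicts the short stop gap
        #    expected of an unvoiced stop
        if syl.get("closure_ms") and syl["closure_ms"] > max_closure_ms:
            flags.append("long_closure")
    return flags

# "ma-ker" with a high F0 on "ker" and a 60-ms closure, as in Fig. 5
print(prosody_flags([{"f0": 110.0, "closure_ms": None},
                     {"f0": 180.0, "closure_ms": 60.0}]))
```

Both inconsistencies discussed in the text would be flagged for this toy input, and a well-formed hypothesis would return no flags.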
3) Need for Bottom-Up Information Extraction: The top-down integrated framework also hampers the definition of
generic knowledge sources that can be used in different
domains. As a result, applications for a new knowledge
domain need to be built almost from scratch. In addition, the effectiveness of the integrated search diminishes when
dealing with unconstrained speech input, since more com-
plex language models are needed for handling spontaneous
speech phenomena along with much larger lexicons. On
the other hand, for the modular approach shown in Fig. 6,
the recognized sentence can be obtained by performing
unit matching, lexical matching, and syntactic and seman-
tic analysis in a sequential manner. As long as the interface between the adjacent decoding modules is completely
specified, each module can be designed and tested
separately.
4) Need for Collaborative ASR: Collaborative research among different groups working on different components
of the system can be carried out to improve the overall
system performance because there are many pieces of information, or acoustic cues, to be extracted and utilized.
In the meantime, modular approaches are usually more
computationally tractable than integrated approaches.
However, one of the major limitations with the modular
approach is that hard decisions are often made in each
decoding stage without knowing the constraints imposed
by the other knowledge sources. Decision errors are therefore likely to propagate from one decoding stage to the next, and the accumulated errors are likely to cause search
errors unless care is taken to minimize hard decision
errors at every processing stage.
C. Challenges in Spontaneous Speech Processing
Although low word error rates have been achieved in
many LVCSR tasks, the high accuracy usually does not
1The authors would like to thank Dr. C.-Y. Chiang of the National Chiao Tung University (NCTU, Hsinchu, Taiwan) for creating this example.
Fig. 4. Spectrogram (upper panel) and posteriogram (lower panel) for the sentence numbered 446c0210 of the Nov92 test set with focus on the area where the errors occur. A conventional LVCSR system misrecognizes the word "safra" and generates the transcription "stock for." In the second panel, the time evolution of the posterior probabilities, namely a posteriogram, of manner of articulation shows that there are no plosive events in the time span under analysis. Furthermore, wrong word recognition occurs although correct manner of articulation detection can be performed.
extend to recognizing spontaneous speech. An example is
the Switchboard task [97], which has attracted quite a bit of
research attention for almost 20 years. In the early 1990s, a high error rate of over 40% had been reported. However, a
steady decrease in word error rate led to those in the range
of 13%-16% for conversational telephone speech and broadcast news by 2006 [98]-[100]. Nevertheless, these
results are still rather high when compared with the re-
cognition performance on read speech. In spontaneous
speech, ill-formed utterances are often observed that
cannot be completely characterized, even if a large amount of training speech data was collected to build the n-gram language models.
The plug-in MAP decoder that recognizes W^ in (1) finds the best sentence in a set of competing sentences. How-
ever, there are many practical difficulties with this design.
First, the candidate set is usually of a finite size, and it is not possible to include all sentences. Second, the quality of
the recognition result is not properly quantified because the right-hand side of (1) is only computing a relative
difference of competing word strings. Since speech sounds
are inherently ambiguous, we need to at least ask the
following three questions: 1) why should we accept W^ as
the recognized string?; 2) why should we accept some
words in W^ while rejecting others?; and 3) can we assign a value to measure the confidence of our acceptance?
These three issues lead researchers to study three new
but closely related topics, namely: 1) keyword recognition
and non-keyword rejection (e.g., [101]); 2) utterance veri-
fication (UV) at both the string and word levels (e.g., [92]
and [102]); and 3) confidence measures (CMs) or con-
fidence scoring (e.g., [103]). Although the above three
research areas cannot be solved using the classical classification formulation shown in (1), the theory of statistical pattern verification and hypothesis testing provides a
framework to tie these three topics in a unified manner
(e.g., [104]). Verification of speech events, such as in
attributes, could also follow the same theory and design.
We will come back to this critical research area in further
depth later in Section V. Due to incomplete task specifi-
cations in defining most speech recognition and under-
standing tasks, a top-down knowledge integration approach alone usually cannot maintain consistency
with all the knowledge sources in the recognized sentence.
A partial understanding of any given utterance, i.e., know-
ing which part of an utterance to process and to ignore, is
very critical in order to handle spontaneous speech. We
will come back later to discuss this important issue in
Section IV-B.
1) Partial Understanding Through Event Detection: One major problem related to characterizing spontaneous
speech with conventional acoustic and language models
is the set of so-called incomplete specification issues, such
as partial words, hesitation, telephone ringing, baby cry-
ing, door slamming, TV interferences, out-of-vocabulary
words, out-of-grammar sentences, and out-of-task (OOT)
Fig. 6. A typical modular-search ASR system.
Fig. 5. Prosodic analysis of the WSJ sentence "if the Fed pushes the dollar higher, it may curb the demand for U.S. exports." The first panel shows the waveform in the time frame between 2.72 and 3.32 s, where a recognition error occurs. Specifically, "it may curb" is recognized as "and maker of."
The second panel shows the frame energy, whereas the F0 is shown in the third panel. The recognized phone and word sequences are reported
in the fourth and fifth panels, respectively. The reference phone and word transcriptions are displayed in the sixth and seventh panels,
respectively. Two inconsistencies: 1) the F0 for the segment "ker" is too high with respect to that for the preceding segment "ma"; and 2) the glottal closure of the stop sound in "maker" is too long.
utterance constructions, which are commonly observed in spontaneous speech (e.g., [105]).
2) Need for Collaborative ASR: Many techniques have been developed to reduce the word error rate for the
Switchboard task for almost 20 years based on conven-
tional LVCSR approaches. The amount of effort needed to
study another spontaneous speech recognition task with an
increased complexity, or for a different language, could be very high. It is time to adopt detection-based techniques
that can be task independent and language universal, such
as key phrase detection, sound-specific filler modeling,
extraneous speech rejection, and attribute modeling,
through partial understanding. By utilizing the aforemen-
tioned modular strategy, we might be able to divide up the
big spontaneous LVCSR task into a set of smaller and manageable problems so that researchers with knowledge-rich algorithms can help.
IV. BOTTOM-UP DETECTION FOLLOWED BY KNOWLEDGE INTEGRATION
Better modeling of the linguistic, articulatory, acoustic,
transmission, and noise channels missing in the current ASR formulation may enhance the current level of ASR
performance. Moreover, a knowledge integration process
with detected cues and evidence is often used in HSR and
from experience in spectrogram reading. This seems to
point to the need for a bottom-up paradigm and leads to
ASAT principles. We provide some justifications of this
perspective in the following.
A. Bottom-Up Knowledge Integration
The currently prevailing top-down HMM-based systems use a large volume of speech and text training data.
However, once the models are trained, they become a black box in that they provide very little diagnostic information to
pinpoint why the models work well in one instance and
then fail badly in other situations. For example, it is often
clear from the pitch contour that there are only three digits in an unknown input utterance, but somehow a four-digit
sequence scores best among the competing strings and is
recognized. Instead of using such a top-down, integrated search approach, recent ASR systems rely on N-best strings or word lattices to hypothesize multiple theories. The re-
cognized sentence is then obtained by rescoring these
strings using additional knowledge sources, e.g., phone
and segment lattices have been proposed [42], [106]. However, phone recognition is often error-prone and be-
comes a limiting factor for further technology develop-
ment. The bottom-up knowledge integration approach
would become feasible if speech cues are more reliably
detected.
A block diagram of a detection-based ASR system is
shown in Fig. 7. The input speech signal is first processed
by a bank of feature detectors aiming at events that are
relevant to the recognition task. An event lattice is then
produced with each element time-marked and scored. The
detectors do not have to be synchronized and, therefore,
the framework is flexible in embracing both short-term detectors, e.g., for the VOT, and long-term detectors, e.g.,
for pitch contours. Once meaningful events have been de-
tected, the event merger proposes larger events by merging
smaller events and the theory verifier computes their
corresponding confidence scores and prunes unlikely
theories. This process could be repeated until all the avail-
able knowledge sources are incorporated and evaluated.
The recognized string is then the sequence of words that scores the best against all possible knowledge sources.
This bottom-up detection approach to ASR has a num-
ber of advantages, namely: 1) it provides plenty of diagnos-
tic information; 2) it is easy to compare the quality of
individual detectors and an ensemble of detectors by pro-
perly designing feature-specific evaluation sets aiming at
these events; 3) individual event detectors are often easier
to design and perfect than the whole system; 4) it takes advantage of many years of research in speech and lang-
uage sciences, as well as statistical modeling; 5) it offers a
quantitative way for an objective performance evaluation; and, most importantly, 6) it sets up an open framework
for the community to work collaboratively, something that
has not been done enough in the last 30 years. In the
following, we briefly present two preliminary studies that
also demonstrate the feasibility and effectiveness of such a bottom-up approach.
B. Key-Phrase Detection and Verification
Several spoken dialog systems [28] have been evalu-
ated in real-world applications. These systems use finite
state grammars to accept typical user utterances because
there is no data available to train statistical language models for the specific tasks. The use of a rigid grammar
is effective for typical in-grammar (IG) utterances. How-
ever, in real-world environments, we have observed wide utterance variations inherent in a large user population, which are, therefore, not covered by the task grammars,
even though they had been iteratively tuned by developers
during the trial period. Even in apparently simple subtasks
Fig. 7. A detection-based speech recognition framework.
such as asking for date or time, around 20% of the user
utterances turned out to be out-of-grammar (OOG).
These samples include extraneous words, hesitations, re-
petitions, and unexpected expressions. In some cases, we
even observe many utterances that are totally irrelevant
or OOT.
Most of such spontaneous utterances contain some key phrases that are task related and may lead to partial or full
understanding. Other samples, not relevant to the task,
should be rejected. This suggests a detection-based ap-
proach to flexible speech recognition and understanding
that is designed to detect semantically significant parts and
reject irrelevant portions. In a domain-specific form filling
or information retrieval task, the system is capable of interpreting the input with only key phrases. Therefore, the approach based on detection is attractive. In [102], it was found that a com-
bined key-phrase recognition and verification strategy
worked well especially for ill-formed utterances. The com-
bined detection-verification approach improved the se-
mantic accuracy from 5% to 30% over the conventional
techniques with rigid grammatical constraints, especially
for ill-formed OOT utterances.
C. Knowledge-Based Feature Representation in LVCSR
We now show that the use of acoustic phonetics and
contextual variability in the representation of the speech signal is indeed very useful for improving LVCSR. The system
proposed in [107] is similar to a well-known feature ex-
traction scheme called Tandem [108] that was extended to
LVCSR [109]. Data-driven multilayer perceptron (MLP)
detectors were used to measure the presence or absence of
distinctive features directly from the short-time MFCC and
limited temporal information. These phonetic distinctive
features were used as feature vectors to build a set of context-dependent phone HMMs. Experiments were per-
formed on the WSJ task with the following feature con-
figurations: 1) baseline with 39 MFCC features; 2) 61 features: the 61-dimensional feature vectors (1 energy coefficient + 60 Karhunen-Loève (KL) transformed features) were used to build triphone HMMs; 3) 44 KL-transformed phone features; 4) 61+44 features; and 5) 61+44 features plus MFCC. We point out that the first- and second-order derivatives were not used for distinctive and phone features.
Experimental results were obtained with the 5000- and 20 000-word tasks on the Nov92 test set. Trigram language
models were used, and all WERs are listed in the second
row of Table 1. Furthermore, systems 4) and 5) were
obtained by combining with ROVER [110], which essen-
tially corresponds to a majority vote decision. The results
given in the second to last row correspond to about 20%
and 10% relative improvements over our best MFCC base-
lines, on the 5000- and 20 000-word tasks, respectively. These very encouraging results seem to indicate that acoustic
phonetic features can help reduce the WERs. In the
bottom row, we report a WER of 6.6% for the 20 000-word
task obtained with a template-based system [111], and the
state-of-the-art-result for the 5000-word task [112].
V. AUTOMATIC SPEECH ATTRIBUTE TRANSCRIPTION
The speech signal contains a rich set of information that
facilitates human auditory perception and communication
beyond a simple linguistic interpretation of the spoken
input. In order to bridge the performance gap between ASR
and HSR systems, the narrow notion of speech-to-text in
ASR has to be expanded to incorporate all related infor-
mation embedded in speech utterances. This collection of information includes a set of fundamental speech sounds
with their linguistic interpretations, a speaker profile en-
compassing gender, accent, emotional state and other
speaker characteristics, the speaking environment, etc.
Collectively, we call this superset of speech information the
attributes of speech. It is expected that directly addressing
these issues will improve ASR performance as well as
speaker recognition, language identification, speech perception, and speech synthesis. The human-based model of
speech processing suggests a candidate framework for
developing next-generation speech processing techniques
that have the potential to go beyond the current limitations
of existing ASR systems.
Based on the aforementioned set of speech attributes,
ASR can be extended to ASAT, which is a process that goes
beyond the current simple notion of word transcription. ASAT promises to be knowledge-rich and capable of incor-
porating multiple levels of information in the knowledge
hierarchy into attribute detection, evidence verification,
and integration, as shown in Fig. 8. The top panel illus-
trates the front-end processing, which consists of an
ensemble of speech analysis and parametrization modules.
In addition, the bottom panels demonstrate a possible
stage-by-stage back-end knowledge integration process. These two key system components will be described in
more detail in the following. Since speech processing in
ASAT is highly parallel, a collaborative community effort
can be built around a common sharable platform to
enable a modular ASR paradigm that facilitates a tight
coupling of interdisciplinary studies of speech science
and processing.
Table 1 Word Error Rates (%) for Various Feature Sets and Combinations on WSJ Nov92, 5000 and 20 000
A. Front-End Attribute Detection Processing
An event detector converts an input speech signal into a
time series that describes the level of presence (or level of
activity) of a particular property of an attribute, or event, in
the input speech utterance over time. This function can compute the a posteriori probability, or the log-likelihood ratio (LLR), of the particular attribute. The LLR involves the
calculation of two likelihoods: one pertaining to the target
model and the other the contrast model. The bank of
detectors consists of a number of such attribute detectors,
each being individually and optimally designed for the
detection of a particular event. These attribute properties
are often stochastic in nature and are relevant to information to be extracted and needed to perform speech analysis and other functions, such as ASR. One key feature of
the detection-based approach is that the outputs of the
detectors do not have to be synchronized in time and,
therefore, the system is flexible enough to allow a direct
integration of both short-term detectors, e.g., for detecting
VOT, and long-term detectors, e.g., for detecting pitch
contours, syllables, and particular word sequences. The conventional frame-synchronous constraints of most tradi-
tional ASR systems are thus relaxed in the ASAT system to
accommodate asynchronous attribute detection as shown
in Fig. 9. In the ASAT framework, different parameters at
different frame rates can be utilized and combined to de-
sign attribute-specific event detectors beyond the current
MFCC features obtained with frame-synchronous speech
analysis.
Speech parametrization has been discussed in many
textbooks (e.g., [113]). For ASAT, the parameters can be
sample based, such as a zero-crossing rate, or frame based,
such as MFCCs. Speech analysis can be performed in the
temporal domain, providing features such as VOT, or in
the spectral domain, such as short time energies in differ-
ent frequency bands. Both long-term and short-term ana-
lysis can be compared and contrasted. Biologically inspired
and perceptually motivated signal analyses are considered
as promising parameter extraction directions [114], [115]
because the ASAT paradigm supports parameter extraction
at different frame rates for designing a range of attribute
detectors. Once a collection of speech parameters Ft is obtained, they can be used to perform attribute detection, which is a critical component in the ASAT paradigm as shown in the upper panel of Fig. 8. Attributes can be used
as cues or landmarks in speech [45], [47] in order to
identify the islands of reliability for making local acous-
tic and linguistic decisions, such as energy concentration
regions and phrase boundaries, without extensive speech
modeling. A few clear examples are readily visible in
most spectrogram plots, e.g., the vowel and fricative re-
gions in Fig. 4.
An attribute detection example was demonstrated in
[86] to discriminate voiced and unvoiced stops using VOT
for two-pass English letter recognition. In the first stage, a
Fig. 9. A bank of speech attribute detectors: Each can take different parameters as inputs and generate a value between 0 and 1 over time
to indicate the presence or absence of the specific attribute.
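The behavior summarized in the Fig. 9 caption, a bank of independent detectors each emitting a value between 0 and 1 over time, can be sketched with simple logistic detectors. The feature dimensions, weights, and attribute names below are made-up illustrations, not trained models.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def detector_bank(frames, detectors):
    """Apply a bank of independent attribute detectors to a feature stream.

    frames: list of feature vectors; detectors: dict mapping an attribute
    name to a (weights, bias) pair. Each detector emits one activity
    value in (0, 1) per frame; because the detectors are independent,
    their outputs need not be synchronized across the bank.
    """
    curves = {}
    for name, (w, b) in detectors.items():
        curves[name] = [sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
                        for x in frames]
    return curves

# Two toy detectors over 2-D features (weights invented for illustration)
dets = {"voiced": ([4.0, 0.0], -1.0), "fricative": ([0.0, 4.0], -1.0)}
curves = detector_bank([[0.9, 0.1], [0.1, 0.9]], dets)
```

Each resulting curve is a per-frame activity function of the kind the event merger consumes downstream.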
Fig. 8. ASAT. (a) Speech analysis ensemble followed by a bank of attribute detectors to produce an attribute lattice. (b) Stage-by-stage knowledge integration from speech attributes to recognized sentences.
conventional recognizer was used to produce a list of multiple candidates. To further discriminate some of the
minimal pairs, such as the English letters /d/ and /t/, a
VOT-based detector [86] can be used in the second stage to
provide a detailed discrimination. It was shown that the
VOT temporal feature produces a pair of curves with better
discrimination (i.e., with more separation between them)
than those obtained with spectral features alone. By reor-
dering candidates according to VOT, the two-stage recognizer gave an error rate 50% less than that obtained in a
state-of-the-art ASR system [116].
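A toy version of this two-stage idea is sketched below: first-pass candidates are reordered by their consistency with a measured VOT. The 30-ms voiced/unvoiced boundary, the letter sets, and the scores are assumptions for illustration, not values taken from [86] or [116].

```python
def reorder_by_vot(candidates, measured_vot_ms, boundary_ms=30.0):
    """Second-pass reordering of confusable letter candidates using VOT.

    candidates: list of (letter, first_pass_score). Voiced stops (e.g.,
    the letter D) typically show a short voice onset time and unvoiced
    stops (e.g., T) a long one; the 30-ms boundary is an assumption.
    """
    voiced = {"D", "B", "G"}

    def consistency(letter):
        # 1.0 when the letter's voicing agrees with the measured VOT
        is_voiced = letter in voiced
        short_vot = measured_vot_ms < boundary_ms
        return 1.0 if is_voiced == short_vot else 0.0

    # Rank by VOT consistency first, then by the first-pass score.
    return sorted(candidates,
                  key=lambda c: (consistency(c[0]), c[1]), reverse=True)

# The first pass slightly prefers T, but a 12-ms VOT indicates a
# voiced stop, so D is promoted to the top of the list.
print(reorder_by_vot([("T", -5.0), ("D", -5.4)], measured_vot_ms=12.0))
```

A long measured VOT would instead leave the unvoiced candidate on top, mirroring the detailed second-stage discrimination described above.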
B. Back-End Knowledge Integration Processing
Another critical component in the ASAT paradigm is the back-end processing shown in the bottom panel in
Fig. 8. An event merger takes the set of detected lower level
events as input and attempts to infer the presence of higher level units (e.g., a phone or a word). Those higher level
units are then validated by the evidence verifier to produce
a refined and partially integrated lattice of event hypoth-
eses to be fed back for further event merger and knowledge
integration. This iterative information fusion process al-
ways uses the original event activity functions as the raw
cues. A terminating strategy can be instituted by utilizing
all the supported attributes.
The procedure produces the evidence needed for a final
decision, including a recognized sentence. Each activity
function can be modeled by a corresponding neural system.
Both activation levels and firing rates have been used in
neural encoding and neuron combinations to encode tem-
poral information. Simulating perception of temporal
events is of particular interest in auditory perception of
speech. New techniques are needed to accomplish this form of lattice parsing. Conditional random field (CRF)
[117] is a mathematical framework that can be used to
describe sequences of symbols (such as phones or words) in
terms of input features, e.g., local phonetic attribute detec-
tions. CRF has been utilized in a number of ASAT-related
studies (e.g., [118]-[120]).
To make use of the detected features, we must combine
them in a way that we can produce word hypotheses. In essence, this boils down to three problems: 1) combining
multiple estimates of the same event to build a stronger
hypothesis; 2) combining estimates of different events to
form a new, higher level event with similar time bounda-
ries; and 3) combining estimates of events sequentially to
form longer term hypotheses. Note that these problems are
somewhat independent of the level of modeling: while the
canonical bottom-up processing sequence would be to combine multiple estimates of each feature, and then com-
bine the features into phones and then words (and word
sequences), we envision a highly parallel paradigm that is
flexible enough, for example, to combine a feature-based
phone detector with a directly estimated phone detector. In
principle, a 20 000-word ASR system can be realized with a
set of 20 000 single-keyword detectors [121].
Combining evidence for the same linguistic unit [problem 1)] has been the focus of techniques such as
multistream acoustic modeling (e.g., [122] and [123]) and
recognition hypothesis combination [110]. In addition,
stochastic combination to form strong verifiers from a
collection of weak detectors, such as boosting (e.g., [124]
and [125]), is a useful tool to combine low-level events into
high-level evidences.
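The following sketch shows one simple way weak detector scores could be combined into a stronger verifier: a weighted sum of per-detector log-odds. It is a simplified stand-in for the boosting schemes cited above, with made-up weights rather than learned ones.

```python
import math

def boosted_verify(weak_scores, alphas, threshold=0.0):
    """Combine weak detector decisions into one strong verifier.

    weak_scores: per-detector scores in (0, 1); alphas: detector
    weights (e.g., as learned by a boosting procedure). Each score is
    mapped to a signed vote via its log-odds, then the votes are
    combined by a weighted sum and compared with a threshold.
    """
    total = sum(a * math.log(s / (1.0 - s))
                for a, s in zip(alphas, weak_scores))
    return total > threshold, total

# Two mildly confident detectors and one weak dissenter still yield
# an overall acceptance of the hypothesized high-level event.
accepted, score = boosted_verify([0.8, 0.6, 0.4], [1.0, 0.5, 0.5])
```

Scores of 0.5 contribute nothing (zero log-odds), so uninformative detectors are naturally ignored by this combination rule.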
C. Event and Evidence Verification
Verification of patterns is often formulated as a statis-
tical hypothesis testing problem [126] as follows: given a
test pattern, one wants to test the null hypothesis against
the alternative hypothesis. Event verification, a critical
ASAT component, can be formulated in a similar way. For
most practical verification problems in real-world speech
and language modeling, a set of training examples is used to estimate the parameters of the distributions of the null
and alternative hypotheses. The two competing hypotheses
and their overlap indicate the two types of error known as
miss detection and false alarm errors [126]. A generalized
log-likelihood ratio (GLLR) was proposed as a way to
measure a separation between models of competing
hypotheses [127].
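Under a Gaussian assumption for the score distributions, the log-likelihood ratio and the resulting separation between the null and alternative hypotheses can be sketched as follows. The models and the Monte Carlo separation measure are illustrative choices, not the GLLR definition of [127].

```python
import math, random

def gaussian_logpdf(x, mu, sigma):
    # Log-density of a 1-D Gaussian at x
    return (-0.5 * math.log(2 * math.pi * sigma**2)
            - (x - mu)**2 / (2 * sigma**2))

def gllr(x, target, anti):
    # Log-likelihood ratio of a test score x under the target model
    # and the competing (anti) model, both Gaussian (mu, sigma) here.
    return gaussian_logpdf(x, *target) - gaussian_logpdf(x, *anti)

def separation(target, anti, n=10000, seed=0):
    """Monte Carlo estimate of how well the LLR separates the two
    hypotheses: average of the fraction of target draws with positive
    LLR and the fraction of anti draws with negative LLR."""
    rng = random.Random(seed)
    hit_t = sum(gllr(rng.gauss(*target), target, anti) > 0
                for _ in range(n))
    hit_a = sum(gllr(rng.gauss(*anti), target, anti) < 0
                for _ in range(n))
    return 0.5 * (hit_t + hit_a) / n

# Well-separated hypothesis models yield a separation close to 1;
# heavily overlapping ones approach chance.
print(separation((2.0, 1.0), (-2.0, 1.0)))
```

Miss detections and false alarms correspond to the two failing cases counted inside `separation`, so shrinking the overlap of the two models reduces both error types at once.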
The verification performance is often evaluated as a combination of the two types of errors. The related topic of
CMs (e.g., [103]) has also been intensively studied by
many researchers recently, e.g., [92] and [102]. This is due
to an increasing number of applications being developed
and deployed in the past few years. In order to have intelligent or humanlike interactions in these dialogs, it is
important to attach to each event a value that indicates
how confident the ASR system is about accepting the recognized event. This number, often referred to as a CM,
serves as a reference guide for the dialog system to provide
an appropriate response to its users just like an intelligent
human being is expected to do when interacting with
others.
Fig. 10 shows an example of how to use the GLLR plots.
Specifically, Fig. 10 displays in the top left, bottom left,
and top right, three sets of distribution curves for detecting the three corresponding phones /w/, /ah/, and /n/, in the
word one. Here the ARPABET [128], used in the ARPA
Speech Understanding Research (SUR) project, is adopted
to denote the phonetic symbols used throughout this paper.
By approximating the three sets of curves with Gaussian
densities, the Gaussian curves for detecting the word one
can be composed as shown in the bottom right of Fig. 10. It
is noted that words are in general easier to detect than phones, because the composed competing Gaussian curves
show a better separation, or equivalently less overlap.
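The composition argument can be made concrete under the Gaussian approximation: if the phone-level LLR scores are independent Gaussians, the word-level score is their sum, whose separation (measured here by a d'-style quantity) exceeds that of any single phone. The numeric values below are toy illustrations, not the curves of Fig. 10.

```python
import math

def dprime(pos, neg):
    # Separation between two Gaussians (mu, sigma): a d'-style measure.
    (m1, s1), (m0, s0) = pos, neg
    return abs(m1 - m0) / math.sqrt(0.5 * (s1**2 + s0**2))

def compose(phone_models):
    """Compose per-phone LLR score distributions into word-level ones.

    Assuming the phone scores are independent Gaussians, their sum
    (the word score) is Gaussian with summed means and variances.
    """
    def add(models):
        mu = sum(m for m, _ in models)
        sigma = math.sqrt(sum(s**2 for _, s in models))
        return mu, sigma
    pos = add([p for p, _ in phone_models])
    neg = add([n for _, n in phone_models])
    return pos, neg

# (target, competing) Gaussian LLR models for /w/, /ah/, /n/ (toy values)
phones = [((1.0, 1.0), (-1.0, 1.0))] * 3
word_pos, word_neg = compose(phones)
# Word-level separation exceeds each single phone's separation.
print(dprime(word_pos, word_neg), dprime((1.0, 1.0), (-1.0, 1.0)))
```

Because means add linearly while standard deviations add in quadrature, the composed word-level separation grows roughly with the square root of the number of phones, which is the "less overlap" effect noted above.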
D. Speech Attribute Detection
A possible implementation of the ASAT detection-based front-end is shown in Fig. 11. It consists of two
main blocks: 1) a bank of attribute detectors that can
produce detection results in terms of a confidence score;
and 2) an evidence merger that combines low-level events
(attribute scores) into higher level evidence, such as
phone posteriors. The append module, shown in Fig. 11,
stacks together the outputs delivered by the attribute detectors for a given input and generates a supervector of
attribute detection scores. This feature vector is then fed
into the merger. In summary, the system shown in Fig. 11
maps acoustic features (e.g., short-time spectral features
or temporal pattern features) into phone posterior
probabilities.
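A minimal sketch of this front-end path, an append module that stacks detector scores into a supervector and a merger that maps it to phone posteriors, is given below. The single-layer merger and its weights are illustrative stand-ins for the trained MLP described in the text.

```python
import math

def append_scores(detector_outputs):
    # The append module: stack the per-detector score vectors for one
    # input frame into a single supervector.
    vec = []
    for scores in detector_outputs:
        vec.extend(scores)
    return vec

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def merge_to_phone_posteriors(supervector, weights, biases):
    """A one-layer stand-in for the MLP evidence merger: map attribute
    scores to phone posteriors. A real merger would have hidden layers
    and trained parameters; these are illustrative."""
    logits = [sum(w * x for w, x in zip(row, supervector)) + b
              for row, b in zip(weights, biases)]
    return softmax(logits)

sv = append_scores([[0.9, 0.1], [0.2]])        # two detectors, 3 scores
post = merge_to_phone_posteriors(
    sv, weights=[[2.0, 0.0, 0.0], [0.0, 2.0, 0.0]], biases=[0.0, 0.0])
```

The softmax output sums to one, so the merger's output can be read directly as a posterior distribution over the hypothesized phones.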
In our studies, phonetic features, such as manner and
place of articulation, are used as the speech events of in-terest. The motivations behind this choice are: articulator-
motivated features improve robustness toward noise at
low signal-to-noise ratios [129]; they improved recognition
of hyper-articulated speech and in the presence of
different speaking styles [130]; they are reliably detected
[130]; and they carry linguistic information [129]. Table 2
shows the phonetic features used in the experiments re-
ported in the next sections. Silence is also used to represent the absence of any speech activity. This feature
inventory is clearly not complete. Nevertheless, this set
could always be expanded.
Attributes should be stochastic in nature and the cor-
responding detectors designed with data-driven modeling
techniques. The goal of each detector is to analyze a speech
segment and produce a confidence score or a posterior
probability that pertains to some acoustic-phonetic attri-
bute. Generally speaking, both frame- and segment-based
data-driven techniques can be used for speech event detection. Frame-based detectors can be realized in several
ways, e.g., with ANNs [22], Gaussian mixture models
(GMMs) [131], and support vector machines (SVMs)
[132]. One of the advantages with ANN-based detectors is
that the output scores can simulate the posterior proba-
bilities of an attribute given the speech signal. On the other
hand, segment-based detectors are more reliable in spot-
ting segments of speech [133]. Segment-based detectors can be built by combining frame-based detectors or with
segment models, such as HMMs, which have already been
shown effective for ASR [8]. Time-delay neural networks
(TDNNs) were also shown to be effective in designing
segment-based attribute classifiers [134]. The reader is re-
ferred to a recent Ph.D. dissertation [135] detailing the
process of building accurate TDNN-based classifiers for all
the attributes of interest.
1) Frame-Based Attribute Detectors: In the case of frame-based design, each detector is realized with three MLPs
organized in a hierarchical structure [136] similar to a way
of modeling long-term energy trajectories, referred to as
Fig. 10. Verifying a sequence of sequential hypotheses for the word one (bottom right) based on evidence of verifying the three phones, /w/ (top left), /ah/ (bottom left), and /n/ (top right), in the word.
TempoRAl Patterns (TRAP)-based features [137], [138]. In
ASAT, sub-band energy trajectories arranged into split-
temporal context as described in [139] are used. In the
experiments, all MLPs are trained to compute the attribute posterior probabilities [136].
2) Segment-Based Attribute Detectors: When segment-based detectors, e.g., HMMs, are used to categorize a seg-
ment of speech into attribute classes, either the log-likelihood or the log-likelihood ratio can be adopted as the detector score. In
our ASAT framework, an LLR-based score is used to measure the goodness-of-fit between a speech segment and the corresponding speech feature because it has already
proven useful in rejecting wrong hypotheses in several
speech tasks [102], [104], [140], [141].
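A deliberately simplified version of such a duration-normalized LLR score, with single Gaussians standing in for the target and competing attribute HMMs, is sketched below; all parameter values are illustrative, not trained.

```python
import math

def seg_loglik(frames, mu, sigma):
    # Log-likelihood of a 1-D feature segment under a single Gaussian,
    # a deliberately simplified stand-in for an attribute HMM.
    return sum(-0.5 * math.log(2 * math.pi * sigma**2)
               - (x - mu)**2 / (2 * sigma**2) for x in frames)

def segment_llr(frames, target, anti):
    """Duration-normalized LLR used as a goodness-of-fit score between
    a speech segment and a hypothesized attribute. Normalizing by the
    segment length keeps scores comparable across segment durations."""
    return (seg_loglik(frames, *target)
            - seg_loglik(frames, *anti)) / len(frames)

# A segment matching the target model receives a positive LLR score;
# one matching the competing model would receive a negative score.
print(segment_llr([0.9, 1.1, 1.0], target=(1.0, 0.5), anti=(-1.0, 0.5)))
```

Positive scores support accepting the hypothesized attribute and negative scores support rejecting it, which is exactly the role the LLR plays in the rejection tasks cited above.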
In Fig. 12, we show the typical detection curves of
manner of articulation, for the utterance numbered
442c013 (whose transcript is THAT'S FINE) of the SI-84 set [142]. Detection curves of place of articulation can also
be generated. Nonetheless, manner of articulation is often easier to detect than the place of articulation in both
spectrogram plots and MLP- or HMM-based detection
plots. This is mainly due to the fact that manners can
usually be clearly distinguished in their attribute beha-
viors. A collaborative effort can easily be envisioned where
researchers with many years of experience in specific
topics, e.g., stop sounds [143] and fricatives [144], can
provide their best detector modules to show their superior
performance to other competing modules.
Furthermore, score plots can be used to compare detector performance. If we use 0.5 as a threshold to accept a detected event, then most of the attributes in the utterance are correctly detected. Regions with scores below 0.5 usually indicate low confidence, exhibiting either type I or type II errors as discussed earlier. In the example shown in Fig. 12, two sets of detector score plots for the same manner of articulation for the short speech segment THAT'S FINE are displayed. The curves with a zigzag shape were obtained with the MLP-based detectors, while the red curves with straight-line scores were generated by conventional three-state attribute HMMs. It is clear that the HMM-based segment detectors perform perfectly in this case, while the MLP-based detectors show quite a bit of time-varying scores.
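The 0.5 accept/reject rule mentioned above amounts to segmenting each frame-level score curve at the threshold. The sketch below is illustrative only; the function name and the toy score curve are assumptions, not ASAT internals.

```python
import numpy as np

def detected_regions(scores, threshold=0.5):
    """Return (start, end) frame-index pairs (end exclusive) where the
    detector score stays at or above the acceptance threshold."""
    above = (np.asarray(scores) >= threshold).astype(int)
    # Pad with zeros so rising/falling edges at the borders are caught
    edges = np.flatnonzero(np.diff(np.concatenate(([0], above, [0]))))
    return [(int(s), int(e)) for s, e in zip(edges[::2], edges[1::2])]

# Toy manner-detector curve over nine 10-ms frames
scores = [0.1, 0.2, 0.7, 0.9, 0.8, 0.3, 0.6, 0.6, 0.2]
print(detected_regions(scores))  # [(2, 5), (6, 8)]
```

Frames inside the returned regions correspond to accepted events; frames outside them are the low-confidence regions where type I or type II errors tend to occur.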
E. Evidence Merger
The bank of detectors provides evidence of a particular speech event. Bits of evidence at the phonetic feature level are combined to form evidence at a higher level. The focus of this section is on how to generate higher level evidence at the subword level. There exist several methods to generate evidence at the subword level from articulatory events. For example, starting with manner and place of articulation, a product lattice of degree two may be generated, and a constrained search may be performed over this lattice to generate phone-level information [145]. CRFs [118] and segmental CRFs (SCRFs) [146] have also been used to generate phone sequences by combining articulatory features. In our framework, all of the detector outputs are combined with a feedforward MLP, which has a single hidden layer. In a recent work, we demonstrated that phone accuracies can be boosted using a deep
Table 2 List of Speech Attributes Used in the ASAT Experiments
Fig. 11. A preliminary implementation of the ASAT detection-based frontend. Each attribute detector analyzes any given input frame and produces a posterior probability score. The Append module stacks together the attribute posterior probabilities. The merger delivers phone posterior probabilities.
neural network (DNN) [147]–[149], as shown in [150]. Fig. 13 provides a schematic representation of the event merger.
By merging the attribute detector outputs and feeding them into the attribute-to-phone mapping merger shown in Fig. 13, we can produce frame-based posterior probabilities, one for each phone of interest, and form a frame-based feature vector. A penalized logistic regression with HMM-based regressors [151] has also been employed to combine information generated by a bank of segment-based attribute detectors [152], obtaining remarkable results on a phone classification task.
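A single-hidden-layer merger of this kind reduces to one matrix-vector forward pass per frame. The sketch below uses randomly initialized weights standing in for trained ones, and the layer sizes (21 attributes, 100 hidden units, 40 phones) are chosen only for illustration, not taken from the ASAT configuration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_attributes, n_hidden, n_phones = 21, 100, 40

# Random weights stand in for a trained attribute-to-phone merger
W1 = rng.standard_normal((n_hidden, n_attributes)) * 0.1
b1 = np.zeros(n_hidden)
W2 = rng.standard_normal((n_phones, n_hidden)) * 0.1
b2 = np.zeros(n_phones)

def merge(attribute_posteriors):
    """Map one frame's stacked attribute posteriors to phone posteriors."""
    h = np.tanh(W1 @ attribute_posteriors + b1)  # hidden layer
    z = W2 @ h + b2
    z -= z.max()                                 # for numerical stability
    p = np.exp(z)
    return p / p.sum()                           # softmax over phones

frame = rng.uniform(size=n_attributes)  # one frame of detector outputs
phone_post = merge(frame)
print(phone_post.shape)  # (40,)
```

Applying `merge` to every 10-ms frame yields exactly the frame-based phone posterior vectors described above, which can then be stacked into a posteriogram or passed to a backend decoder.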
VI. ASAT APPLICATIONS
The ASAT detector frontend has been used with success as a key component for several attribute-based speech applications, namely: lattice rescoring [153], [154], language-universal phone recognition [155], and bottom-up, stage-by-stage LVCSR [156], [157]. Most of the results are preliminary studies because the ASAT-related effort is quite recent. We hope to inspire other new applications. A discussion of the new insights that may be offered by the detector score plots, together with the posteriogram plots, is presented first. Finally, a spoken language identification system based on attribute detectors is presented in [158]–[160].
A. Visible Speech Analysis Through Attribute Detection
1) Detection Score Plots for MLP: We now analyze detection score plots more closely. Fig. 14 displays a longer sentence than that shown in Fig. 12. Although the detection curves are not as smooth as those in the previous example, the correct transcript can still be obtained by following the evolution of the event detection process over time. This outcome is also in line with spectrogram reading by trained experts based on knowledge of acoustic phonetics (e.g., [41]). The detector scores here are normalized between 0 and 1, ranging from the absence of an acoustic property to the full presence of a speech cue. The value of
Fig. 13. A possible implementation of the ASAT merger. It is trained using the output of the bank of attribute detectors and generates phone posterior probabilities. Either a shallow MLP network or a DNN can be used.
Fig. 12. Detection curves of manner of articulation for the sentence numbered 442c013 (whose transcript is THAT'S FINE) of the SI-84 data set [142]. The curves in blue were generated using an ANN, whereas the curves in red were generated using segment-based HMM detectors.
these detection scores is a good indication of the activity levels of the speech events of interest. Therefore, it provides a new visualization tool in addition to the conventional spectrogram plot shown in the top panel of Fig. 14.
Error analysis has always played a crucial role in providing diagnostic information for improving ASR algorithms. With the extracted speech cue information revealed in the new visualization tool, insight can also be developed into understanding human speech. It could also provide a good tool for offering speech insights to a new generation of students and researchers.
For example, we can see sound transition behavior clearly displayed in the region from segment 6 to segment 8, going from phone /eh/ to /aa/, with a rising activity from the preceding vowel into the glide sound /l/ in segment 7, then falling away into the following vowel. We can also observe the overlapping nature of a nasalized vowel at the end of segment 8 and the beginning of segment 9. The double stop sound regions in segments 13 and 14 are also signaled. The large overlapping region for the two-candidate segment 21 indicates that the glide sound /r/ heavily influences articulation in its surrounding phones, with a low-level vowel activity showing up between segments 21 and 22 on the detector plot for the vowel manner.
It is clear that the detector score plots displayed in Fig. 14 provide a rich set of information not commonly available to researchers who are not trained experts in spectrogram reading. It also reinforces the additional advantages we intend to exploit in the information-extraction perspective we have highlighted throughout this paper.
2) Posteriogram Plots for MLP and DNN: We plot in Fig. 15 the time evolution of the estimated frame posterior probabilities for phones, or phone posteriogram [93], for the same short utterance used in Fig. 12. Instead of displaying the CM value between 0 and 1 as in the detector score plots, an intensity similar to that of spectrogram plots is displayed, showing darker regions for higher posterior probabilities at each 10-ms frame for the corresponding phone. On the vertical axis, 40 phones are listed, starting with the phone /aa/ as in the word pot at the top, and finishing with the phone /zh/ as in the word treasury at the bottom. It is clearly visible that the silence unit stands out in the beginning and ending parts of the utterance in both plots, shown in the upper and lower panels, respectively. A DNN is practically an MLP with many layers (seven hidden layers have been used in our studies), where the pretraining algorithm proposed for deep belief networks [147] has been applied before training the MLP [148], [150], which has a single hidden layer.
The DNN posteriograms are often sharper than the MLP ones, indicating that the top candidate phones, marked with the darkest region at each vertical time snapshot, have less competition from other phones. The blurry nature of some regions in the plots indicates that some phones are confusable at that time frame. This posteriogram can be displayed together with the detector score plots shown in Fig. 12 to gain insight about the goodness of the attribute-to-phone mapping. For example, in the initial part of the
Fig. 14. Detection curves of manner of articulation for the sentence numbered 440c20t (RATES FELL ON SHORT TERM TREASURY BILLS) of the SI-84 data set [142]. The correct transcript can still be recovered by following the time evolution of the detection of the attribute events.
phone /ay/, it would be misrecognized as phone /aw/, but it could be corrected because the duration of a diphthong /aw/ cannot be so short. Finally, the DNN-based posteriogram is not as smooth as, and less noisy than, that obtained using the MLP with a single hidden layer, and therefore more reliable than that produced by the MLP-based detectors. With the seven phones, together with the two silence segments, clearly labeled in the bottom panel, it is possible to observe the correct transcript by following the black lines (the time evolution of the top phone posterior probabilities). Furthermore, it is easy to identify the possible sources of confusion and devise techniques to address them.
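Tracing the black lines of top phone posteriors described above amounts to an argmax over each frame column of the posteriogram. The sketch below is illustrative; the toy phone set and matrix are assumptions, not data from Fig. 15.

```python
import numpy as np

def top_phone_path(posteriogram, phone_labels):
    """posteriogram: (n_phones, n_frames) matrix of frame posteriors.
    Returns the most probable phone label at each 10-ms frame."""
    best = np.argmax(posteriogram, axis=0)
    return [phone_labels[i] for i in best]

# Toy five-phone, five-frame posteriogram (columns sum to ~1)
phones = ["sil", "dh", "ae", "t", "s"]
post = np.array([
    [0.90, 0.10, 0.10, 0.00, 0.10],  # sil
    [0.05, 0.80, 0.20, 0.10, 0.00],  # dh
    [0.02, 0.05, 0.60, 0.20, 0.10],  # ae
    [0.02, 0.03, 0.05, 0.60, 0.20],  # t
    [0.01, 0.02, 0.05, 0.10, 0.60],  # s
])
print(top_phone_path(post, phones))  # ['sil', 'dh', 'ae', 't', 's']
```

Frames where the runner-up posterior is close to the maximum correspond to the blurry, confusable regions discussed above, so the margin between the top two posteriors per column is itself a useful frame-level confidence cue.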
B. Attribute-Based Lattice Rescoring
Lattice rescoring is reported for an LVCSR task only. The readers are referred to [154] for further insights on other tasks.
1) Rescoring Technique: The rescoring algorithm aims to integrate the confidence scores generated by the ASAT detection-based frontend into the word lattice on an arc-by-arc basis. Rescoring is carried out as a linear combination of the log-likelihood acoustic score generated by the baseline LVCSR system and the logarithm of the phoneme posterior probability, properly discounted by the phoneme prior probability, generated by the detector-based frontend.
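This arc-level combination can be sketched in one line of arithmetic. The interpolation weight lam, the function name, and the numeric values below are all hypothetical, not the settings used in our experiments.

```python
import math

def rescore_arc(hmm_loglik, phone_posterior, phone_prior, lam=0.9):
    """Combine the baseline HMM acoustic log-likelihood with the
    detector-based log phone posterior discounted by the log phone
    prior (i.e., a scaled log-likelihood under Bayes' rule)."""
    detector_term = math.log(phone_posterior) - math.log(phone_prior)
    return lam * hmm_loglik + (1.0 - lam) * detector_term

# An arc whose phone the detectors strongly support gets a better score
strong = rescore_arc(-120.0, phone_posterior=0.90, phone_prior=0.025)
weak = rescore_arc(-120.0, phone_posterior=0.05, phone_prior=0.025)
print(strong > weak)  # True
```

Subtracting the log prior converts the posterior into a quantity proportional to the likelihood, so the detector term and the HMM acoustic score are combined on comparable footing.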
2) Experimental Setup: The experiments are performed using the 5000-word speaker-independent WSJ (5k-WSJ0) corpus [142]. The SI-84 data set (7077 utterances from 84 speakers, i.e., 15.3 h of speech material) is used for training. The testing material is again the Nov92 set. ML estimation [5], [131], [161] is adopted to find the parameters of a first HMM baseline. A second HMM baseline is then
Fig. 15. Posteriogram plots for the sentence numbered 442c0213 (THAT'S FINE) of the SI-84 data set [142]. (a) MLP-based posteriogram plot. (b) DNN-based posteriogram plot.
obtained using MMI with the ML HMMs of the first system as seed models. A trigram language model within the 5k-WSJ0 vocabulary is used in decoding.
3) Experimental Results: Table 3 shows the performance of the ML-based baseline system, in terms of WER, on the Nov92 task. These results are comparable with the results reported in [162]–[164]. In the second row, the performance of the rescored system on the same task is shown when both the bank of detectors and the merger are trained with the phonetically rich TIMIT corpus [165], [166]. Moreover, the WERs of the MMI-based baseline and rescored systems are shown in the last two rows, respectively. The results indicate that the rescored system achieves better performance than the conventional decoding scheme in both cases. We noticed that several hypotheses have the same error patterns for both HMM systems. These incorrectly recognized words are typically characterized by an acoustic trajectory that did not strictly observe the underlying acoustic-phonetic constraints. The lack of the required acoustic-phonetic evidence for the wrong hypotheses could be signaled by attribute detectors, and this information could then be used to penalize the corresponding phones during the rescoring and subsequent decoding steps.
The phonetic-based correction concept is now illustrated on the Nov92 sentence numbered 446c0210. The correct sequence of words for this utterance is: The company said its European banking affiliate Safra Republic plans to raise more than four hundred fifty million dollars through an international offering. However, the baseline ML and MMI systems produce the phrase stock for instead of the word Safra. Fig. 4 shows the spectrogram for this utterance at the location where the error occurred with the ML- or MMI-based system.
In Fig. 4, correctly recognizing the word stock requires the presence of two stop sounds, /t/ and /k/, in the region surrounding the middle vowel. But from the spectrogram, it can be easily seen that there is a lack of articulatory evidence to support this decoded word. The stop detector signals these mistakes, and the correct sentence is decoded after rescoring.
C. Cross-Language Attribute Detection
We now report our studies on language-independent attribute detection, which were extended to phone recognition with minimal available data for the target language in [155]. English manner attribute scores have been effectively incorporated into Mandarin LVCSR to improve performance by lattice rescoring in a cross-language manner as well [96].
1) Language-Universal Knowledge Source Definition: Fundamental speech attributes, such as voicing, nasality, and frication, could be identified from a particular language and shared across many different languages, so they could also be used to derive a universal set of speech units. A small number of these universal units could then be used to model speech sounds. It is worth noting that these phonetic features (attributes) have already been used to identify a common knowledge source in several studies (e.g., [167] and [168]), yet these features were often employed within a knowledge-based phoneme mapping procedure to: 1) produce an expanded phoneme set to cover speech sounds in multiple languages (e.g., [168]); or 2) find the mapping between language-dependent (or -independent) acoustic models and the new target acoustic models (e.g., [167]) for decoding purposes.
2) Experimental Setup: The stories part of the OGI multilanguage telephone speech corpus [169] is employed in our investigation. The amount of transcribed data is only about 1 h per language, which is significantly smaller than the usual amount of data used to train multilingual ASR systems, e.g., [170]. This corpus has phonetic transcriptions for six different languages: English (ENG), German (GER), Hindi (HIN), Japanese (JAP), Mandarin (MAN), and Spanish (SPA). Three subsets, namely training, validation, and test sets, are formed using the data available for each language. Table 4 shows the amount of available data for each subset and the number of language-dependent phone units. Each attribute detector is designed within the MLP framework as described in Section V-D1. Performance is reported as in [171].
3) Language-Dependent Attribute Detection: Language-specific data (Table 4) are used to train, validate, and test each detector. Language-dependent attribute accuracies are found to be comparable across languages and attributes. This implies that attribute classification could be reliably obtained for a variety of languages. Furthermore, good attribute accuracies could be achieved for several attributes, such as vowel (92%) and continuant (90%),
Table 4 The OGI Stories Corpus in Terms of Amount of Data (in Hours) and Number of Phonemes Used for Each Language
Table 3 WER, in Percentage, on the Nov92 Task. Rescoring Was Applied to Both the ML- and MMI-Based Baseline Systems, Trained on the SI-84 Material of the WSJ0 Corpus
across languages. The full list of accuracies and insights
can be found in [155].
4) Cross-Language and Language-Universal Attribute Detection: The detectors of a specific language are now tested on the data of the other languages. Fig. 16 shows the attribute accuracy rates on the MAN data. The connected line highlights results obtained with Mandarin-based detectors. From Fig. 16, we can make several observations. First, detection across languages is less reliable than in the language-dependent cases for several attributes, but the drop in performance is not particularly severe. Attribute accuracies are comparable across all languages for the vowel class. There are also cases in which cross-language detection outperformed in-language detection, e.g., round. This indicates that the round detector trained on a non-Mandarin language performed as well as the language-dependent Mandarin detector. A similar trend is observed for all of the languages used in our studies.
By pooling together all training materials from the six languages, a new language-independent data set could be formed. Then, a single bank of (universal) detectors could be trained on this new data set. We observed that better attribute accuracies could be attained for quite a few attributes, e.g., vowel, fricative, round, and mid, with only a minor degradation for the worst performing detectors.
D. ASAT-Based Bottom-Up LVCSR
As an initial attempt at implementing a bottom-up LVCSR system, the hybrid ANN/HMM approach has been modified to explicitly represent and manipulate the search space at various points in the decoding process [156] using weighted finite state machines (WFSMs) [172]. ASR is then accomplished in a bottom-up fashion by performing backend lexical access and syntax knowledge integration over the output of our detection-based frontend, which generates frame-level speech attribute detection scores and phone posterior probabilities. Decoupled recognition is made possible by two main factors: 1) high-accuracy detection of acoustic information in order to generate high-quality lattices at every stage of the acoustic and linguistic information processing; and 2) low-error pruning of the generated lattices in order to reduce search errors likely to occur when trying to minimize the possibility of memory overflow in using the AT&T WFSM tool.
1) Detection-Based LVCSR With WFSMs: LVCSR is accomplished by building on the frame-level evidence gathered at the output of the detection-based frontend shown in Fig. 11. The first step is to represent the output of the detection-based frontend, for a given utterance, as an acceptor F. In practice, F is a graph with a number of states that equals the length of the input sentence (in frames), and a number of edges between each pair of states that equals the output dimension of the merger (i.e., the number of even