Outline
• LVCSR
– Building a Highly Accurate Mandarin Speech Recognizer
Univ. of Washington, SRI, ICSI, NTU
– Development of the 2007 RWTH Mandarin LVCSR System RWTH
– The TITECH Large Vocabulary WFST Speech Recognition System Tokyo Institute of Technology
– Development of a Phonetic System for Large Vocabulary Arabic Speech Recognition Cambridge
– Uncertainty in Training Large Vocabulary Speech Recognizers (Focus on Graphical Model)
Univ. of Washington
– Advances in Arabic Broadcast News Transcription at RWTH RWTH
– The IBM 2007 Speech Transcription System for European Parliamentary Speeches (Focus on Language Adaptation)
IBM, Univ. of Southern California
– An Algorithm for Fast Composition of Weighted Finite-State Transducers Univ. of Saarland, Univ. of Karlsruhe
Outline (cont.)
– A Mandarin Lecture Speech Transcription System for Speech Summarization Univ. of Science and Technology, Hong Kong
Outline (cont.)
• Spoken Document Retrieval and Summarization
– Fast Audio Search using Vector-Space Modeling
IBM
– Soundbite Identification Using Reference and Automatic Transcripts of Broadcast News Speech
Univ. of Texas at Dallas
– A System for Speech Driven Information Retrieval Universidad de Valladolid
– SPEECHFIND for CDP: Advances in Spoken Document Retrieval for the U.S. Collaborative Digitization Program
Univ. of Texas at Dallas
– A Study of Lattice-Based Spoken Term Detection for Chinese Spontaneous Speech (Spoken Term Detection)
Microsoft Research Asia
Outline (cont.)
• Speaker Diarization
– A Never-Ending Learning System for On-Line Speaker Diarization
NICT-ATR
– Multiple Feature Combination to Improve Speaker Diarization of Telephone Conversations
Centre de Recherche Informatique de Montreal
– Efficient Use of Overlap Information in Speaker Diarization Univ. of Washington
• Others
– SENSEI: Spoken English Assessment for Call Center Agents
IBM
– The LIMSI QAST Systems: Comparison Between Human and Automatic Rules Generation for Question-Answering on Speech Transcriptions
LIMSI
– Topic Identification from Audio Recordings using Word and Phone Recognition Lattices
MIT
Reference
• [Ref 1] X. Lei, et al., “Improved tone modeling for Mandarin broadcast news speech recognition,” in Proc. Interspeech, 2006
• [Ref 2] F. Valente, H. Hermansky, “Combination of acoustic classifiers based on Dempster-Shafer theory of evidence,” in Proc. ICASSP, 2007
• [Ref 3] A. Zolnay, et al., “Acoustic feature combination for robust speech recognition,” in Proc. ICASSP, 2005
• [Ref 4] F. Wessel, et al., “Explicit word error minimization using word hypothesis posterior probabilities,” in Proc. ICASSP, 2001
• Acoustic Data
– 866 hours of speech data collected by LDC (training)
Mandarin Hub4 (30 hours), TDT4 (89 hours), and GALE Year 1 (747 hours) corpora for training the acoustic models
Spanning 1997 through July 2006, from shows on CCTV, RFA, NTDTV, PHOENIX, ANHUI, and so on
– Test on three different test sets: the DARPA EARS RT-04 evaluation set (eval04), the DARPA GALE 2006 evaluation set (eval06), and the GALE 2007 development set (dev07)
Corpora Description
Corpora Description (cont.)
• Text Corpora
– The transcriptions of the acoustic training data, the LDC Mandarin Gigaword corpus, GALE-related Chinese web text releases, and so on (1 billion words)
• Lexicon
– Step 1: Start from the BBN-modified LDC Chinese word lexicon and manually augment it with a few thousand new words, both Chinese and English (70,000 words)
– Step 2: Re-segment the text corpora using the longest-first match algorithm and train a unigram LM (choose the most frequent 60,000 words)
– Step 3: Use ML word segmentation on the training text to extract the out-of-vocabulary (OOV) words
– Step 4: Retrain the N-gram LMs using modified Kneser-Ney smoothing
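The longest-first match segmentation used in Step 2 can be sketched as a greedy forward maximum match; the toy lexicon and input string below are illustrative, not from the paper:

```python
def longest_match_segment(text, vocab, max_word_len=4):
    """Greedy longest-first match: at each position take the longest
    vocabulary word that matches; single characters are the fallback,
    so unknown characters survive as OOV candidates."""
    words, i = [], 0
    while i < len(text):
        for length in range(min(max_word_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in vocab:
                words.append(candidate)
                i += length
                break
    return words

vocab = {"中国", "人民", "银行", "人民银行"}
print(longest_match_segment("中国人民银行", vocab))  # → ['中国', '人民银行']
```

The single-character fallback is what makes the Step 3 OOV extraction possible: any character run that never matches a lexicon entry falls out as isolated characters.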
Acoustic Systems
• Create two subsystems having approximately the same error rate but with error behaviors as different as possible, so that they compensate for each other
• System ICSI
– Phoneme Set
70 phones for pronunciations; additionally, one phone designated for silence and another for noises, laughter, and unknown foreign speech (context-independent)
– Front-end features (74 dimensions per frame): 13-dim MFCC and its first- and second-order derivatives; spline-smoothed pitch feature and its first- and second-order derivatives; 32-dim phoneme-posterior features generated by multi-layer perceptrons (MLP)
Acoustic Systems (cont.)
• Spline-smoothed pitch feature [Ref 1]
– Since pitch is present only in voiced segments, F0 needs to be interpolated in unvoiced regions to avoid variance problems in recognition
Interpolate the F0 contour with a piecewise cubic Hermite interpolating polynomial (PCHIP)
PCHIP interpolation has no overshoots and less oscillation than conventional spline interpolation
Take the log of F0
Moving window normalization (MWN): subtract the moving average over a long-span window (1-2 secs) to normalize out phrase-level intonation effects
5-point moving average (MA) smoothing: reduces the noise in the F0 features
[Figure: raw F0 contour vs. final smoothed feature for the utterance 符合中美二國根本利益 ("in keeping with the fundamental interests of both China and the U.S.")]
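The pipeline above can be sketched with SciPy's PCHIP interpolator; the frame rate, window lengths, and toy F0 track are illustrative assumptions, not the paper's settings:

```python
import numpy as np
from scipy.interpolate import PchipInterpolator

def smooth_pitch(f0, frames_per_sec=100, mwn_sec=1.5, ma_points=5):
    """Spline-smoothed pitch sketch: (1) PCHIP-interpolate F0 through
    unvoiced (zero) frames, (2) take the log, (3) subtract a long-span
    moving average (MWN), (4) short moving-average smoothing."""
    f0 = np.asarray(f0, dtype=float)
    voiced = np.flatnonzero(f0 > 0)
    # 1) fill unvoiced regions; PCHIP avoids the overshoots of a cubic spline
    f0_filled = PchipInterpolator(voiced, f0[voiced])(np.arange(len(f0)))
    # 2) log compression
    logf0 = np.log(f0_filled)
    # 3) moving-window normalization removes phrase-level intonation
    win = max(1, min(int(mwn_sec * frames_per_sec), len(f0)))
    logf0 = logf0 - np.convolve(logf0, np.ones(win) / win, mode="same")
    # 4) short moving average reduces residual noise
    return np.convolve(logf0, np.ones(ma_points) / ma_points, mode="same")

f0 = [0.0, 0.0, 100.0, 110.0, 0.0, 0.0, 120.0, 125.0, 0.0, 0.0]
print(smooth_pitch(f0).shape)  # → (10,)
```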
Acoustic Systems (cont.)
• MLP features
– Provide discriminative phonetic information at the frame level
– Three main steps:
For each frame, concatenate its neighboring 9 frames of PLP and pitch features as the input to an MLP (43*9 inputs, 15,000 hidden units, and 71 output units); each output unit models the likelihood of the central frame belonging to a certain phone (Tandem phoneme posterior features); the noise phone is excluded (it is not a very discriminable class)
Next, they separately construct a two-stage MLP, where the first stage contains 19 MLPs and the second stage one MLP
Each MLP in the first stage, with 60 hidden units, identifies a different class of phonemes based on the log energy of a different critical band across a long temporal context (51 frames ~ 0.5 seconds)
The second-stage MLP then combines the information from all of the hidden units (60*19) of the first stage to make a grand judgment on the phoneme identity of the central frame (8,000 hidden units) (HATs phoneme posterior features)
Finally, the 71-dim Tandem and HATs posterior vectors are combined using the Dempster-Shafer algorithm
A logarithm is then applied to the combined posteriors, followed by principal component analysis (PCA) (de-correlation and dimensionality reduction)
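The final log + PCA step can be sketched as below, assuming the combined (e.g. Dempster-Shafer merged) posterior vectors are already given; the frame count and output dimension are illustrative:

```python
import numpy as np

def log_pca_features(posteriors, out_dim=32):
    """Apply a log to phone posteriors, then project onto the top
    `out_dim` principal components (de-correlation + reduction)."""
    logp = np.log(np.asarray(posteriors) + 1e-10)  # floor avoids log(0)
    centered = logp - logp.mean(axis=0)
    # PCA via SVD of the centered (frames x phones) matrix
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:out_dim].T

rng = np.random.default_rng(0)
posteriors = rng.dirichlet(np.ones(71), size=100)  # 100 frames, 71 phones
print(log_pca_features(posteriors).shape)  # → (100, 32)
```

In a real system the PCA basis would be estimated once on training data and reused, rather than recomputed per utterance as in this sketch.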
Acoustic Systems (cont.)
• System PLP
– The system uses 42-dimensional features: static PLP features with first- and second-order derivatives
– To compete with the ICSI model, which has a stronger feature representation, an fMPE feature transform is learned for the PLP model
The fMPE transform is trained by computing the high-dimensional Gaussian posteriors of 5 neighboring frames, given a 3500x32 cross-word tri-phone ML-trained model with an SAT transform (3500*32*5 = 560K)
– To tackle spontaneous speech, a few diphthongs are additionally introduced in the PLP model
Acoustic Systems (cont.)
• Acoustic model in more detail
– Decision-tree based HMM state clustering: 3,500 shared states, each with 128 Gaussians
– A cross-word tri-phone model with the ICSI features is trained with an MPE objective function
– SAT feature transform based on 1-class constrained MLLR
Decoding Architecture (cont.)
• Acoustic Segmentation
– The segmenter is run with a finite-state grammar
– It uses broad phonetic knowledge of Mandarin and models the input recording with five words:
silence, noise, a Mandarin syllable with a voiceless initial, a Mandarin syllable with a voiced initial, and a non-Mandarin word
Each pronunciation phone (bg, rej, I1, I2, F, forgn) is modeled by a 3-state HMM with 300 Gaussians per state
The minimum speech duration is reduced to 60 ms
Decoding Architecture (cont.)
• Auto Speaker Clustering
– Uses Gaussian mixture models of static MFCC features and K-means clustering
• Search with Trigrams and Cross Adaptation
– The decoding is composed of three trigram recognition passes:
ICSI-SI: the speaker-independent (SI) within-word tri-phone MPE-trained ICSI model with a highly pruned trigram LM quickly gives a good initial adaptation hypothesis
PLP-Adapt: use the ICSI hypothesis to learn the speaker-dependent SAT transform and to perform MLLR adaptation per speaker on the cross-word tri-phone SAT+fMPE MPE-trained PLP model
ICSI-Adapt: use the top-1 PLP hypothesis to adapt the cross-word tri-phone SAT MPE-trained ICSI model
Decoding Architecture (cont.)
• Topic-Based Language Model Adaptation
– Uses a Latent Dirichlet Allocation (LDA) topic model
During decoding, the topic mixture weights are inferred dynamically for each utterance
The top few most relevant topics above a threshold are then selected, and their weights in θ are used to interpolate with the topic-independent N-gram background language model
The words in w are weighted by an N-best-list derived confidence measure, including words not only from the utterance being rescored but also from surrounding utterances in the same story chunk via a decay factor
The adapted n-gram is then used to rescore the N-best list
When the entire system is applied to eval06, the CER is 15.3%
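The topic interpolation step can be sketched with toy unigram LMs; the threshold, topic names, weights, and vocabulary below are illustrative, not the paper's:

```python
def adapted_prob(word, background_lm, topic_lms, theta, threshold=0.1):
    """P_adapt(w) = (1 - sum_t w_t) * P_bg(w) + sum_t w_t * P_t(w),
    keeping only topics whose inferred LDA weight in theta exceeds
    `threshold`; the remaining mass stays on the background model."""
    top = {t: w for t, w in theta.items() if w > threshold}
    mass = sum(top.values())
    p = (1.0 - mass) * background_lm.get(word, 0.0)
    for t, w in top.items():
        p += w * topic_lms[t].get(word, 0.0)
    return p

bg = {"the": 0.5, "game": 0.2, "market": 0.3}
topics = {"sports": {"game": 0.9, "the": 0.1},
          "finance": {"market": 0.8, "the": 0.2}}
theta = {"sports": 0.6, "finance": 0.05}  # only "sports" passes the threshold
print(round(adapted_prob("game", bg, topics, theta), 2))  # → 0.62
```

A production version would interpolate full N-gram probabilities rather than unigrams, but the mixing arithmetic is the same.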
Corpora Description
• Phoneme Set
– The phoneme set is a subset of SAMPA-C: 14 vowels and 26 consonants
– Tone information is included following the main-vowel principle; tones 3 and 5 are merged for all vowels; for the phoneme @, tones 1 and 2 are merged; the resulting phoneme set consists of 81 tonemes plus an additional garbage phone and silence
• Lexicon
– Based on the LC-Star Mandarin lexicon (96k words)
– Unknown words are segmented into sequences of known words by applying a longest-match segmenter
• Language models are the same as those of Univ. of Washington and SRI
– Recognition experiments: pruned 4-gram LMs
– Word graph rescoring: full LMs
Acoustic Modeling
• The final system consists of four independent subsystems
– System 1 (s1)
MFCC features (+ segment-wise CMVN); for each frame, its neighboring 9 frames are concatenated and projected to a 45-dimensional feature space (by LDA); the tone feature and its first and second derivatives are also appended to the feature vector
– System 2 (s2) and system 3 (s3) are equal to s1 except for the base features: s2 uses PLP features, s3 uses gammatone cepstral coefficients
– System 4 (s4) starts with the same acoustic front-end as s1, but the features are augmented with phoneme posterior features produced by a neural network
The inputs of the net are multiple-time-resolution features (MRASTA); the dimension of the phoneme posterior features is reduced by PCA to 24
Acoustic Modeling
• Acoustic Training
– The acoustic models for all systems are based on tri-phones with cross-word context, modeled by 3-state left-to-right HMMs
– Decision-tree based state tying is applied (4,500 generalized tri-phone states)
– The filter-banks of the MFCC and PLP feature extraction are normalized by applying 2-pass VTLN (not for the s3 system)
– Speaker variations are compensated by using SAT/CMLLR
– Additionally, in recognition, MLLR is applied to update the means of the AMs
– MPE is used for discriminative AM training
System Development
• Acoustic Feature Combination
– The literature contains several ways to combine different feature streams:
Concatenate the individual feature vectors; feed the feature streams into a single LDA; perform the integration in a log-linear model
• With little data, log-linear model combination gives a nice improvement over the simple concatenation approach
• But with more training data the benefit declines, and for the 870-hour setup there is no improvement at all
System Development (cont.)
• Consensus Decoding and System Combination
– min.fWER (minimum time frame error) based consensus decoding
– min.fWER combination
– ROVER with confidence scores
• The approximated character boundary times work as well as boundaries derived from a forced alignment
• For almost all experiments, there is no difference between minimizing WER and minimizing CER
• Only ROVER seems to benefit from switching to characters
Decoding Framework
• Multi-Pass recognition
– Pass 1: no adaptation
– Pass 2: 2-pass VTLN
– Pass 3: SAT/CMLLR
– Pass 4: MLLR
– Pass 5: LM rescoring
• The five passes result in an overall reduction in CER of about 10% relative for eval06 and about 9% for dev07
• The MPE trained models give a further reduction in the CER resulting in a 12% to 15% relative decrease over all passes
• Adding the 358 hours of extra data to the MPE training slightly decreases the CER and the total relative improvement is about 16% for both corpora
• LM.v2 (4-gram) outperforms LM.v1 (5-gram) by about 0.8% absolute in CER, consistently across all systems and passes
Introduction
• The goal is to build a fast, scalable, flexible decoder that operates on weighted finite-state transducer (WFST) search spaces
• WFSTs provide a common and natural representation for HMMs, context dependency, pronunciation dictionaries, grammars, and alternative recognition outputs
• Within the WFST paradigm, all the knowledge sources in the search space are combined to form a static search network
– The composition often happens off-line before decoding, and there exist powerful operations to manipulate and optimise the search networks
– The fully composed networks can often be very large, so large amounts of memory can be required at both composition and decode time
Solution: on-the-fly composition of the network, disk-based search networks, and so on
Evaluations
• Evaluations were carried out using the Corpus of Spontaneous Japanese (CSJ)
– contains a total of 228 hours of training data from 953 lectures
– 38-dimensional feature vectors with a 10ms frame rate and 25ms window size
– The language model was a back-off trigram with a vocabulary of 25k words
• On the testing data, the language model perplexity was 57.8 and the out-of-vocabulary rate was 0.75%
– 2328 utterances spanning 10 lectures
• The experiments were conducted on a 2.40GHz Intel Core2 machine with 2GB of memory and an Nvidia 8800GTX graphics processor running Linux
Evaluations (cont.)
• HLevel and CLevel WFST Evaluations
– CLevel (C ∘ L ∘ G)
– HLevel (H ∘ C ∘ L ∘ G), where C: context dependency, L: lexicon, G: language model, H: acoustic HMMs
• Recognition experiments were run with the beam width varied from 100 to 200
• CLevel – 2.1M states and 4.3M arcs, requiring 150-170 MB of memory
• HLevel – 6.2M states and 7.7M arcs, requiring 330-400 MB of memory
• Julius – required 60-100 MB of memory
• For narrow beams the HLevel decoder was slightly faster and achieved marginally higher accuracy, showing that the better-optimized HLevel networks can be used with small overhead using singleton arcs
Evaluations (cont.)
• Multiprocessor Evaluations
– The decoder was additionally run in multi-threaded mode using one and two threads to take advantage of both cores in the processor
The multi-threaded decoder with two threads achieves higher accuracy for the same beam than the single-threaded decoder
This is because in parts of the decoding each thread prunes against a local best cost rather than the absolute best cost at that point in time
Introduction
• The authors present a two-stage method for fast audio search and spoken term detection
– A vector-space modeling approach retrieves a short list of candidate audio segments for a query (word lattice based)
– The list of candidate segments is then searched using a word-based index for known words and a phone-based index for out-of-vocabulary words
Lattice-Based Indexing for VSM
• For vector-space modeling, it is necessary to extract an unordered list of terms of interest from each document in the database
– raw count, TF/IDF, etc.
• To accomplish this for lattices, we can extract the expected counts of each term
• The training documents are represented using reference transcripts, instead of lattices or the 1-best output of a recognizer
• The unordered list of terms is extracted from the most frequently occurring 1-gram tokens in the training documents
• However, this does not account for OOV terms in a query
ETC(w_i, d_j) = Σ_{l ∈ L_j} P(l | X_j) · C_l(w_i)
where L_j is the complete set of paths in the lattice and C_l(w_i) is the count of term w_i in path l
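The expected term count can be sketched directly from that definition, here over an explicitly enumerated toy path list (a real system would compute it with a forward-backward pass over the lattice):

```python
from collections import Counter

def expected_term_counts(paths):
    """paths: list of (posterior, [words]) pairs, posteriors summing to 1.
    Returns ETC(w) = sum_l P(l|X) * C_l(w) for every term w."""
    etc = Counter()
    for prob, words in paths:
        for word, count in Counter(words).items():
            etc[word] += prob * count
    return dict(etc)

lattice = [(0.5, ["fast", "audio", "search"]),
           (0.5, ["fast", "auto", "search"])]
counts = expected_term_counts(lattice)
print(counts["fast"], counts["audio"])  # → 1.0 0.5
```

Words shared by all competing paths keep a full count of 1.0, while uncertain words ("audio" vs. "auto") get fractional counts, which is exactly what the VSM indexing needs.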
Experimental Results
• Experiment Setup
– Broadcast news audio search task: 2.79 hours / 1107 query terms / 1408 segments
– Two ASR systems
ASR System 1: 250K vocabulary, SI+SA decode; 6000 quinphone context-dependent states, 250K Gaussians
ASR System 2: 30K vocabulary, SI-only decode; 3000 triphone context-dependent states, 30K Gaussians
Both systems use a 4-gram language model built from a 54M n-gram corpus
Introduction
• Soundbite identification in broadcast news is important for locating information
– useful for question answering, mining opinions of a particular person, and enriching speech recognition output with quotation marks
• This paper presents a systematic study of this problem under a classification framework
– Problem formulation for classification
– Feature extraction
– The effect of using automatic speech recognition (ASR) output
– Automatic sentence boundary detection
Classification Framework for Soundbite Identification (cont.)
• Problem formulation
– Binary classification: soundbite versus not
– Three-way classification: anchor, reporter, or soundbite
• Feature Extraction (each speech turn is represented as a feature vector)
– Lexical features
LF-1: unigram and bigram features in the first and last sentences of the current speech turn, for speaker roles
LF-2: unigram and bigram features from the last sentence of the previous turn and from the first sentence of the following turn, capturing functional transitions among different speakers
– Structural features: number of words in the current speech turn; number of sentences in the current speech turn; average number of words per sentence in the current speech turn
Classification Framework for Soundbite Identification (cont.)
• Feature Weighting
– Notation
N is the number of speech turns in the training collection; M is the total number of features; f_ik is the frequency of feature φ_i in the k-th speech turn
n_i denotes the number of speech turns containing feature φ_i
F(φ_i) is the frequency of feature φ_i in the collection
w_ik is the weight assigned to feature φ_i in the k-th turn
– Frequency Weighting: w_ik = f_ik
– TF-IDF Weighting: w_ik = f_ik · log(N / n_i)
– TF-IWF Weighting: w_ik = f_ik · log(Σ_{j=1..M} F(φ_j) / F(φ_i))
– Entropy Weighting: w_ik = log(f_ik + 1.0) · (1 - entropy_i), where
entropy_i = -(1 / log N) Σ_{j=1..N} (f_ij / f_i) log(f_ij / f_i), with f_i = Σ_j f_ij
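Two of the weighting schemes can be sketched as below; the counts in the examples are illustrative, not from the TDT4 data:

```python
import math

def tfidf_weight(f_ik, N, n_i):
    """TF-IDF: w_ik = f_ik * log(N / n_i)."""
    return f_ik * math.log(N / n_i)

def entropy_weight(f_counts, k):
    """Entropy weighting: w_ik = log(f_ik + 1.0) * (1 - entropy_i), with
    entropy_i = -(1/log N) * sum_j (f_ij / f_i) * log(f_ij / f_i),
    where f_counts[j] = f_ij and f_i = sum_j f_ij."""
    N, f_i = len(f_counts), sum(f_counts)
    h = -sum((f / f_i) * math.log(f / f_i) for f in f_counts if f > 0) / math.log(N)
    return math.log(f_counts[k] + 1.0) * (1.0 - h)

# A feature seen 3 times in turn k and present in 10 of 100 turns:
print(round(tfidf_weight(3, 100, 10), 3))         # → 6.908
# A feature spread uniformly over turns carries no information:
print(round(entropy_weight([2, 2, 2, 2], 0), 3))  # → 0.0
```

The entropy term is normalized to [0, 1], so features concentrated in a few turns get weights close to their raw log frequency, while uniformly spread features are zeroed out.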
Experimental Results
• Experimental Setup– TDT4 Mandarin broadcast news data
335 news shows
– Performance measures: accuracy, precision, recall, F-measure
Experimental Results
• Comparison of Different Weighting Methods
– Methods using global information generally perform much better than those using only local information
– Different problem formulations seem to prefer different weighting methods
– Entropy-based weighting is more theoretically motivated and seems to be a promising weighting choice
• Contribution of Different Types of Features
– Adding contextual features improves the performance
– Removing low-frequency features (i.e., Cutoff-1) helps classification
Experimental Results
• Impact of Using ASR Output
– Speech recognition errors hurt system performance
– Automatic sentence boundary detection degrades performance even more
• The three-way classification strategy generally outperforms the binary setup
REF: human transcripts; ASR_RSB: ASR output with reference (manual) sentence segmentation; ASR_ASB: ASR output with automatic sentence segmentation
Introduction
• Speech-driven information retrieval is a more difficult task than text-based information retrieval
– Spoken queries contain less redundancy to overcome speech recognition errors; longer queries are more robust to errors than shorter ones
• Three types of errors affect retrieval performance
– out-of-vocabulary (OOV) words
– errors produced by words in a foreign language
– regular speech recognition errors
• Solutions
– OOV problem: two-pass strategy to adapt the lexicon and LMs
– Foreign words problem: add the pronunciations of foreign words to the pronunciation lexicon
Experimental Setup
• Corpus
– CLEF'01 (Cross-Language Evaluation Forum) Spanish monolingual test suite
The evaluation set includes a document collection, a set of topics, and relevance judgments
215,738 documents from the year 1994 from the EFE newswire agency (511 MB)
49 topics, each with three parts: a brief title statement, a one-sentence description, and a more complex narrative
– Queries were formulated from the description field of each topic
– 10 different speakers read the queries (5 male and 5 female)
• Baseline System
– The ASR best hypothesis was processed by the IR engine to obtain the list of documents relevant to that query (top 1000 most relevant docs)
Three types of error:
Type I: errors produced by OOV words; Type II: errors caused by words in a foreign language; Type III: regular speech recognition errors
Experiment Results
• To reduce Type I errors
– Vocabulary adaptation
Create a list with every word that appears in the documents retrieved in the first pass (the average number of words per document is about 27,000)
Add the most frequent words from the general vocabulary until a vocabulary of 60,000 words is reached
– Language Model Adaptation
Train a new LM with the documents obtained in the first pass
Interpolate this new LM with the general LM, using the adapted vocabulary (linear interpolation, weight 0.5)
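The vocabulary adaptation step can be sketched as follows; the toy documents, word list, and tiny target size are illustrative stand-ins for the 60,000-word setup:

```python
def adapt_vocabulary(retrieved_docs, general_vocab_by_freq, target=60000):
    """Keep every word from the first-pass retrieved documents, then fill
    up with the most frequent general-vocabulary words (assumed sorted by
    descending frequency) until the target size is reached."""
    vocab = {w for doc in retrieved_docs for w in doc.split()}
    for word in general_vocab_by_freq:
        if len(vocab) >= target:
            break
        vocab.add(word)
    return vocab

docs = ["elecciones europeas en junio", "resultados de las elecciones"]
general = ["de", "la", "que", "el", "en", "los"]
print(len(adapt_vocabulary(docs, general, target=10)))  # → 10
```

This mirrors the two-pass idea: query-relevant (possibly previously OOV) words are guaranteed a slot, and general-language coverage fills the remainder.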
Experiment Results (cont.)
• Inclusion of foreign word pronunciations
– Map English phonemes to Spanish ones
– Include the pronunciations of English words in the pronunciation lexicon (from the CMU pronouncing dictionary; 8,891 English words added)
– A one-pass strategy with alternate pronunciations reduced the number of Type II errors; however, some new Type III errors appeared
• Final System
– Combines the two-pass strategy with foreign-word modeling
In the first pass, obtain the 1000 documents most relevant to the query (using the pronunciation lexicon that includes English word pronunciations)
Then adapt the vocabulary and the LM, and expand the pronunciation lexicon to include English word pronunciations
Introduction
• SpeechFind is an SDR system serving as the platform for several audio indexing and retrieval programs across the United States
– the National Gallery of the Spoken Word (NGSW), the Collaborative Digitization Program (CDP)
• The system includes the following modules
– An audio spider and transcoder
automatically fetches available audio archives from a range of servers and converts the incoming audio files into the designated audio formats
parses the metadata and extracts relevant information into a “rich” transcript database to guide future information retrieval
– Spoken document transcriber, which includes an audio segmenter and transcriber
– Transcription database
– An online publicly accessible search engine responsible for information retrieval tasks
a web-based user interface, search and index engines
Structure of CDP Audio Corpus
• The structure of the CDP audio corpus
– CDP audio files include interviews, discussions/debates, and lectures, each with 2-5 participants
– The recorded audio documents are spontaneously articulated, with many overlapping speakers and burst noise events such as clapping, laughing, etc.
– Recordings were conducted from the 1960s to the 2000s and held in libraries, offices, classrooms, and homes
Transcript Verification Process with CDP
• An online web-interface was developed in order to improve the quality of the ASR-generated transcripts
• The transcript verification process is as follows– Automatic Transcription
– Online Verification
– Model Enhancement
Transcript Improvement via Feature/Model Enhancement
• Speech/Feature Enhancement
• Lexicon Update and Language Model Adaptation
• Acoustic Model Adaptation Using a Selective Training Set
– Document-dependent acoustic conditions: speaker-dependent characteristics, time-varying/short-term background noise and channel interference, and others
– Cross-document acoustic conditions: gender/age/accent-dependent speech traits and the background noise/channel distortions observed broadly
Notes
• The Dempster-Shafer (DS) Theory of Evidence allows representation and combination of different measures of evidence [Ref 2] [back]
Piecewise cubic Hermite interpolating polynomial (PCHIP)
Matlab code:
x = -3:3; y = [-1 -1 -1 0 1 1 1];
t = -3:.01:3;
p = pchip(x,y,t);
s = spline(x,y,t);
plot(x,y,'o',t,p,'-',t,s,'-.')
legend('data','pchip','spline',4)
Log-Linear Model
• Different acoustic features are combined indirectly via the log-linear combination of acoustic probabilities [Ref 3][back]
• In log-linear model combination, the posterior probability has the following form:
P(W | X) = exp( Σ_i λ_i g_i(W, X) ) / Σ_{W'} exp( Σ_i λ_i g_i(W', X) )
• The feature functions are defined as
– Language model: g_LM(W, X) = log P(W)
– Acoustic model (one per acoustic feature stream f_i): g_AM,f_i(W, X) = log P(X_f_i | W)
• This yields the decision rule
W_opt = argmax_W [ P(W)^λ_LM · Π_i P(X_f_i | W)^λ_f_i ]
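The combination can be sketched as a weighted softmax over hypothesis scores; the hypothesis names, feature values, and λ weights below are illustrative:

```python
import math

def log_linear_posteriors(hyp_scores, lambdas):
    """hyp_scores: {W: [g_1(W,X), g_2(W,X), ...]} feature-function values
    (e.g. an LM log-probability and per-stream acoustic log-probabilities).
    Returns P(W|X) = exp(sum_i l_i g_i) / sum_W' exp(sum_i l_i g_i)."""
    scores = {w: sum(l * g for l, g in zip(lambdas, gs))
              for w, gs in hyp_scores.items()}
    m = max(scores.values())  # subtract the max to stabilize the softmax
    exps = {w: math.exp(s - m) for w, s in scores.items()}
    z = sum(exps.values())
    return {w: e / z for w, e in exps.items()}

hyps = {"W1": [-10.0, -50.0], "W2": [-12.0, -48.0]}  # [LM, AM] log-probs
post = log_linear_posteriors(hyps, lambdas=[1.0, 0.5])
print(max(post, key=post.get))  # → W1
```

In practice the normalization over all W' is intractable and only relative scores matter for the argmax, so decoders work with the unnormalized weighted sum.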
Minimum Time Frame Error (MTFE)
• Time frame errors are caused either by word deletions, insertions, and substitutions or by differing word boundaries [Ref 4] [back]
• MTFE aims to overcome the mismatch between Bayes' decision rule, which minimizes the expected sentence error rate (the standard approach), and the word error rate, which is used to assess the performance of speech recognition systems
• The decision rule is rewritten accordingly
Composition
• Composition is the transducer operation for combining different levels of representation [Ref 5] [back]
– e.g., a pronunciation lexicon can be composed with a word-level grammar to produce a phone-to-word transducer whose word sequences are restricted to the grammar
[Figure: grammar G, lexicon L, and their composition G ∘ L]