Acoustic Modeling for Multi-Language, Multi-Style, Multi-Channel Automatic Speech Recognition
Mark Hasegawa-Johnson
Yuxiao Hu, Dennis Lin, Xiaodan Zhuang, Jui-Ting Huang, Xi Zhou, Zhen Li, and Thomas Huang, including also the research results of
Laehoon Kim and Harsh Sharma
University of Illinois
Motivation: Applications in a Multilingual Society
News Hound: Find all TV news segments, in any language, mentioning “Barack Obama”
Language Learner: Transcribe learner's accented speech; tell him which words sound accented
Broadcaster/Podcaster: Automatically transcribe “man on the street” interviews in a multilingual city (LA, Singapore)
Problems
- Physical variability: noise, echo, talker
- Imprecise categories: dependent on context
- Content variability: language, topic, dialect, style
Method: Transform and Infer (ubiquitous methodology in ASR; see, e.g., Jelinek, 1976)
Signal transforms
Classifier transforms
Likelihood vector: b_i = p(observation_t | state_t = i)
Best label sequence = argmax p(label_1, ..., label_T | observation_1, ..., observation_T)
Inference Algorithm: a parametric model of p(state_1, ..., state_T, label_1, ..., label_T)
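To make the likelihood vector concrete, here is a minimal sketch (not the authors' code; the function and variable names are hypothetical) that computes b_i = p(observation_t | state_t = i) with one diagonal-covariance Gaussian mixture per state:

import numpy as np

def log_gaussian_diag(x, mean, var):
    # Log density of a diagonal-covariance Gaussian at observation x
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

def likelihood_vector(obs_t, gmms):
    # b[i] = p(obs_t | state_t = i); each GMM is a dict with
    # 'weights' [n_mix], 'means' [n_mix, dim], 'vars' [n_mix, dim]
    b = np.empty(len(gmms))
    for i, gmm in enumerate(gmms):
        logps = [np.log(w) + log_gaussian_diag(obs_t, m, v)
                 for w, m, v in zip(gmm['weights'], gmm['means'], gmm['vars'])]
        b[i] = np.exp(np.logaddexp.reduce(logps))  # sum over mixture components
    return b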
Signal Transforms: transforms determined by a physical model of the signal
A good signal model tells you a lot. Reverberation model: y[n] = v[n] + Σ_m h[m] x[n-m]
x[n] produced by a human vocal tract, designed for efficient processing by a human auditory system
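As a sketch of that reverberation model (illustrative only, not the cited denoising work): the observation y is the clean signal x convolved with a room impulse response h, plus additive noise v; h and the noise level below are hypothetical stand-ins.

import numpy as np

def reverberate(x, h, noise_std=0.01, rng=None):
    # Simulate y[n] = v[n] + sum_m h[m] x[n-m]: convolutional
    # distortion by impulse response h plus additive noise v[n]
    rng = rng or np.random.default_rng(0)
    y = np.convolve(x, h)
    y += noise_std * rng.standard_normal(len(y))
    return y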
A good signal transform improves the accuracy of all classifiers:
- Denoising: correct for additive noise
- Dereverberation: correct for convolutional noise
- Perceptual frequency warping: hear what humans hear (a common warp is sketched below)
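One widely used perceptual frequency warp is the mel scale (an assumption here; the slides do not say which warp this system uses):

import numpy as np

def hz_to_mel(f_hz):
    # Mel warp: roughly linear below 1 kHz, logarithmic above,
    # approximating human pitch perception
    return 2595.0 * np.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

# Example: center frequencies of a 10-filter mel filterbank over 0-8 kHz
centers = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(8000.0), 12)[1:-1])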
Denoising Example (Kim et al., 2006)
Classifier Transforms: compute a precise and accurate estimate of p(obs_t | state_t)
Robust Machine Learning: from a limited amount of training data, learn parameterized probability models that are as precise as possible, with a known upper bound on generalization error
Methods that trade off precision and generalization:
- Decorrelate the signal measurements: PCA, DCT (see the sketch after this list)
- Select the most informative features from an inventory: AdaBoost
- Train a linear or nonlinear function z_t = f(y_t) that discriminates among the training examples from different classes and has known upper bounds on generalization error (SVM, ANN)
- Train another nonlinear function p(z_t | state_t) with the same properties
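A minimal sketch of the first method, PCA decorrelation (hypothetical names, not the authors' feature pipeline):

import numpy as np

def pca_decorrelate(Y, n_components):
    # Decorrelate feature vectors (rows of Y) by projecting onto
    # the top principal components estimated from training data
    Y0 = Y - Y.mean(axis=0)                 # center the features
    cov = np.cov(Y0, rowvar=False)          # sample covariance
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues ascending
    W = eigvecs[:, ::-1][:, :n_components]  # top components first
    return Y0 @ W                           # decorrelated features z_t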
Inference: integrate information to choose the best global label set
Labels = variables that matter globally
- Speech recognition: what words were spoken?
- Information retrieval: which segment best matches the query?
- Language learning: where's the error?
States = variables that can be classified locally
- May be scalar, e.g., q_t = sub-phoneme
- May be vector, e.g., q_t = [vector of articulatory states]
Inference algorithm = parametric model of p(states, labels)
- Scalar states: Hidden Markov Model, Finite State Transducer
- Vector states: Dynamic Bayesian Network, Conditional Random Field
Example: Language-Independent Phone Recognition (Huang et al., in preparation)
Voice activity detection → perceptual frequency warping → Gaussian mixtures
Likelihood vector: b_i = p(observation_t | state_t = i)
Best label sequence = argmax p(phone_1, ..., phone_T | observation_1, ..., observation_T)
Inference Algorithm: Hidden Markov Model with token passing, modeling p(state_1, ..., state_T, phone_1, ..., phone_T)
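Token passing implements the max-product (Viterbi) recursion over this HMM; a minimal log-domain sketch of that recursion follows (array names are hypothetical, and the real recognizer also weaves in the triphone bigram language model):

import numpy as np

def viterbi(log_b, log_A, log_pi):
    # Best state sequence argmax p(state_1..T | obs_1..T)
    # log_b:  [T, N] frame log-likelihoods log p(obs_t | state_t = i)
    # log_A:  [N, N] log transition probabilities
    # log_pi: [N]    log initial-state probabilities
    T, N = log_b.shape
    delta = log_pi + log_b[0]
    back = np.zeros((T, N), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_A      # scores[from_state, to_state]
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_b[t]
    path = [int(delta.argmax())]             # trace back the best path
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]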
A Language-Independent Phone Set (Consonants)
Plus secondary articulations (glottis, pharynx, palate, lips), sequences, and syllabics
A Language-Independent Phone Set (Vowels)
- Nasalized: ^~, >~, @~, &~, a~, A~, …
- Long: >:, a:, A:, e:, E:, i:, I:, o:, u:, 3r:
- Retroflexed: 3r, 4r, &r
- Diphthongs: >i, aU, Au, ei, eI, Ei, eu, Eu, ia, ie, io, iu, oi, oU, ua, uax, ui, uo, aI, i>, iE, ue, uE (>u, AI, axI~, axU~)
Training Data
10 languages, 11 corpora: Arabic, Croatian, English, Japanese, Mandarin, Portuguese, Russian, Spanish, Turkish, Urdu
95 hours of speech, sampled from a larger set of corpora; mixed styles of speech: broadcast, read, and spontaneous
Summary of Corpora
Dictionaries (Hasegawa-Johnson and Fleck, http://www.isle.uiuc.edu/dict/)
Urdu example: orthographic transcriptions contain no vowels!
- Ruleset #1 (letters to phones): q = ق, k = ک, g = گ, ...
- Is a diacriticized version available on the web? No: صاحب; Yes: صاعق
- Ruleset #2 (diacritics to phones): A, u, ligature, ...
- Resulting phonetic transcriptions: /sAh{SV}b{SV}/, /sA!iqƏ/
Context-Dependent Phones (labeling convention sketched after this list)
Triphones: when is a /t/ not a /t/?
- “writer”: /t/ is unusual; call it /aI-t+3r/
- “a tree”: /t/ is unusual; call it /&-t+r/
- “that soup”: /t/ is unusual; call it /ae-t+s/
Lexical stress: /i/ in “reek” is longer than in “recover”; call them /r-i+k'/ vs. /r-i+k/
Punctuation, an easy-to-transcribe proxy for prosody: /n/ in “I'm done.” is 2X as long as /n/ in “Done yet?”; call them /^-n+{PERIOD}/ vs. /^-n+j/
Language, dialect, style: /o/ in “atone”: call it /t-o+n%eng/; /o/ in あとに: call it /t-o+n%jap/
Gender: handled differently (speaker adaptation)
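A minimal sketch of how such labels could be assembled (the left-center+right notation follows the slides; the function itself is hypothetical):

def triphone_label(left, phone, right, stressed=False, lang=None):
    # Build a context-dependent label: left context '-' center '+'
    # right context, plus optional stress mark and language tag
    label = f"{left}-{phone}+{right}"
    if stressed:
        label += "'"          # lexical stress, as in /r-i+k'/
    if lang:
        label += f"%{lang}"   # language tag, as in /t-o+n%eng/
    return label

print(triphone_label("aI", "t", "3r"))            # aI-t+3r
print(triphone_label("t", "o", "n", lang="eng"))  # t-o+n%eng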
^’-A+b%eng, ^’-A+b’%eng, >-A+d%cmn, …
Decision Tree State Tying
Categories for decision tree questions:
- Distinctive phone features (manner/place of articulation) of right or left context
- Language identity
- Dialect identity (L1 vs. L2)
- Lexical stress
- Punctuation mark
^’-A+b%engL2, ^’-A+b’%engL2, >-A+d%cmn, …
Each leaf node contains at least 3.5 seconds of training data
Phone Recognition Experiment (Huang et al., in preparation)
Language-independent triphone bigram language model
Standard classifier transforms (PLP+d+dd, CDHMM, 11-17 Gaussians)
Vocabulary size: top 60K most frequent triphones (since 140K is too many!); the remaining infrequent triphones are mapped back to their center monophones
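A minimal sketch of that back-off, assuming labels in the left-center+right notation above (hypothetical helper, not the actual system):

def backoff_to_monophone(label, vocabulary):
    # Keep a triphone label if it is in the 60K-triphone vocabulary;
    # otherwise strip the contexts and return the center phone
    if label in vocabulary:
        return label
    return label.split("-")[-1].split("+")[0]  # 'aI-t+3r' -> 't'

vocab = {"aI-t+3r"}
print(backoff_to_monophone("aI-t+3r", vocab))  # aI-t+3r
print(backoff_to_monophone("&-t+r", vocab))    # t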
Recognition Results (Huang et al., in preparation)
Test set: 50 sentences per corpus
Example: Language-Independent Speech Information Retrieval (Zhuang et al., in preparation)
Voice activity detection → perceptual frequency warping → Gaussian mixtures
Likelihood vector: b_i = p(observation_t | state_t = i)
Retrieval ranking = E(count(query) | segment observations)
Inference Algorithm: Finite State Transducer built from ASR lattices, computing E(count(query) | observations)
Information Retrieval: Standard Methods
Task description: given a query, find the “most relevant” segments in a database.
Published algorithms:
- EXACT MATCH: segment = argmin d(query, segment). Fast.
- SUMMARY STATISTICS: segment = argmax p(query | segment), with no concept of “word order”. Good for text, e.g., Google, Yahoo, etc.
- TRANSFORM AND INFER: segment = argmax p(query | segment) ≈ E(count(query) | segment); word order matters. Flexible, but slow... (sketched below)
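A minimal sketch of the expected-count score, computed here by brute force over an n-best list rather than with the finite state transducer over the full lattice that the actual system uses (names hypothetical):

def expected_query_count(query, nbest):
    # E[count(query) | observations], approximated over an n-best
    # list of (posterior probability, phone sequence) pairs
    def occurrences(seq, pattern):
        n = len(pattern)
        return sum(seq[i:i + n] == pattern for i in range(len(seq) - n + 1))
    return sum(p * occurrences(seq, query) for p, seq in nbest)

nbest = [(0.6, ["g", "r", "oU", "T"]),
         (0.4, ["g", "r", "oU", "s"])]
print(expected_query_count(["g", "r", "oU"], nbest))  # 1.0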
Language-Independent IR: The Star Challenge
A Multi-Language Multi-Media Broadcast News Retrieval Competition, sponsored by A*STAR
Elimination rounds, June-August 2008:
- Three rounds, each of 48 hours' duration
- 56 teams entered from around the world
- 5 teams selected for the Grand Finals
Grand Finals: 10/23/2008, Singapore
Star Challenge Tasks
- VT1, VT2: Given an image category (e.g., “crowd,” “sports,” “keyboard”), find examples
- AT1: Given an IPA phoneme sequence (example: /ɻogutʃA/), find audio segments
- AT2: Given a waveform containing a word or word sequence in any language, find audio segments containing the same word
- AT1+VT2: Find a specified video class whose speech contains a given IPA sequence (e.g., “man monologue” + /groʊɵ/)
Star Challenge: Simplified Results
Team (method)                                  Round 1       Round 3       Grand Final
                                               (verified)    (hearsay)     (verified)
National U. Singapore (EXACT MATCH)                ?             5              1
NII and IRISA (EXACT MATCH)                        ?             2              2
University of Illinois (TRANSFORM AND INFER)       4             2              3
Beijing University (TRANSFORM AND INFER)           ?             1              4
Rounds 1 and 3: 48,000 CPU hours. Round 1: English, 20 queries. Round 3: English and Mandarin, 3 queries each.
Grand Final: 6 CPU hours. English, Mandarin, Malay, and Tamil, 2 queries each.
Open Research Areas
When does “Transform and Infer” help?
- Round 3 (1000 CPUs, 48 hours): the best algorithms were “transform and infer”
- Grand Final (3 CPUs, 2 hours): the best algorithms were “exact match”
Open research area #1: complexity. “Inference algorithm”: user constraints → simplified classifier; improved transforms and improved classifiers allow the use of a less-constrained user interface.
Open research area #2: accuracy.
Existence Proof: ASR can beat Human Listeners (Sharma et al., in preparation)
Word recognition accuracy (%) by talker (M/F = gender):

Talker ID                                 M09   M05   M06   F02   M07   F03   M04
Human listeners (unfamiliar,
unlimited vocabulary)                      86    58    39    29    28     6     2
ASR: digits                                85    90    93    94   100    74    46
ASR: letters                               97    77    77    70    86    42    19
ASR: 55 words                              90    63    72    73    81    40    14
ASR: 155 words                             47    50    36    43    44    22     6
The task: speech of talkers with gross motor disability (cerebral palsy). Familiar listeners in familiar situations understand most of what they say... and ASR can also be talker-dependent and vocabulary-constrained.
Open Research Areas
Remove the constraints! ASR can beat a human listener if the ASR knows more than the human (e.g., knows the talker and the vocabulary).
Better knowledge = better signal models, better classifiers, better inference.
Thank You!
Questions?
Decision Tree State Tying (Odell, Woodland and Young, 1994)
1. Divide each IPA phone into three temporally sequential “states,” /i/ -> /i/onset, /i/center, /i/offset
2. Start with one model for each state. Create a statistical model p(acoustics|state) using training data
3. Ask yes-no questions about context variables: left phone, right phone, lexical stress, language ID
• If p(acoustics|state, yes) ≠ p(acoustics|state, no), split the training data into two groups: the “yes” examples vs. the “no” examples. If many such questions exist, choose the best. Repeat this process as long as each group contains enough training data examples.
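A minimal sketch of that greedy splitting loop, using a likelihood-gain criterion over single diagonal-Gaussian leaf models (all names hypothetical; the actual system follows Odell, Woodland and Young, 1994):

import numpy as np

def leaf_loglik(X):
    # Log-likelihood of data rows X under one ML-fit diagonal Gaussian
    X = np.asarray(X)
    var = X.var(axis=0) + 1e-6
    return -0.5 * len(X) * np.sum(np.log(2 * np.pi * var) + 1.0)

def split_node(examples, questions, min_count):
    # examples:  list of (context_dict, acoustic_vector) pairs
    # questions: list of (name, predicate_on_context) pairs
    # Greedily pick the question with the largest likelihood gain;
    # recurse while both halves keep at least min_count examples
    best = None
    for name, ask in questions:
        yes = [e for e in examples if ask(e[0])]
        no = [e for e in examples if not ask(e[0])]
        if len(yes) < min_count or len(no) < min_count:
            continue
        gain = (leaf_loglik([x for _, x in yes]) +
                leaf_loglik([x for _, x in no]) -
                leaf_loglik([x for _, x in examples]))
        if best is None or gain > best[0]:
            best = (gain, name, yes, no)
    if best is None:
        return examples  # leaf: a tied-state cluster
    _, name, yes, no = best
    return {name: (split_node(yes, questions, min_count),
                   split_node(no, questions, min_count))}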