Presented by: Fang-Hui Chu
A Survey of Boosting HMM Acoustic Model Training
Introduction
• The No Free Lunch Theorem states that
  – There is no single learning algorithm that always induces the most accurate learner in every domain
• Learning is an ill-posed problem: with finite data, each algorithm converges to a different solution and fails under different circumstances
  – Although the performance of a learner may be fine-tuned, there are still instances on which even the best learner is not accurate enough
• The idea is:
  – There may be another learner that is accurate on these instances
  – By suitably combining multiple learners, accuracy can be improved
Introduction
• There is no point in combining learners that always make similar decisions
  – The aim is to find a set of base-learners that differ in their decisions so that they complement each other
• There are different ways to combine multiple base-learners to generate the final output:
  – Multiexpert combination methods
    • Voting and its variants
    • Mixture of experts
    • Stacked generalization
  – Multistage combination methods
    • Cascading
Voting
• The simplest way to combine multiple classifiers
  – It corresponds to taking a linear combination of the learners:
    $$ y = \sum_{j=1}^{L} w_j d_j, \quad w_j \ge 0 \ \forall j \ \text{and} \ \sum_{j=1}^{L} w_j = 1 $$
  – This is also known as ensembles and linear opinion pools
  – The name voting comes from its use in classification; when there are K outputs:
    $$ y_i = \sum_{j=1}^{L} w_j d_{ji}, \quad i = 1, \dots, K $$
    • if $w_j = 1/L$, called plurality voting
    • if $K = 2$ and $w_j = 1/L$, called majority voting
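The voting rule above can be sketched in a few lines. This is only an illustration: the function name `weighted_vote` and the toy vote matrix are assumptions, not part of the slides.

```python
from typing import Sequence


def weighted_vote(votes: Sequence[Sequence[float]], weights: Sequence[float]) -> int:
    """Return the class i that maximises y_i = sum_j w_j * d_ji.

    votes[j][i] is learner j's vote (or score) for class i.
    """
    if any(w < 0 for w in weights) or abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must be non-negative and sum to 1")
    num_classes = len(votes[0])
    y = [sum(w * d[i] for w, d in zip(weights, votes)) for i in range(num_classes)]
    return max(range(num_classes), key=y.__getitem__)


# Hypothetical example: three learners, three classes, one-hot votes.
votes = [[1, 0, 0], [0, 1, 0], [0, 1, 0]]
print(weighted_vote(votes, [1 / 3] * 3))      # equal weights: plurality voting
print(weighted_vote(votes, [0.6, 0.2, 0.2]))  # unequal weights can change the winner
```

With equal weights the class with the most votes wins; giving the first learner weight 0.6 lets it outvote the other two.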
Bagging
• Bagging is a voting method whereby base-learners are made different by training them over slightly different training sets
  – The training sets are generated by bootstrap
    • Given a training set X of size N, we draw N instances randomly from X with replacement
• In bagging, generating complementary base-learners is left to chance and to the instability of the learning method
  – A learning algorithm is unstable if small changes in the training set cause a large difference in the generated learner
    • e.g. decision trees, multilayer perceptrons, condensed nearest neighbor
• Bagging is short for Bootstrap aggregating
Breiman, L. 1996. “Bagging Predictors.” Machine Learning 24, 123-140
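A minimal sketch of the bootstrap step and the final vote, assuming base-learners are plain Python callables. The names `bootstrap_sample` and `bagged_predict` are illustrative and do not come from Breiman's paper.

```python
import random
from collections import Counter


def bootstrap_sample(X, rng):
    """Draw len(X) instances from X uniformly at random, with replacement."""
    n = len(X)
    return [X[rng.randrange(n)] for _ in range(n)]


def bagged_predict(learners, x):
    """Combine the base-learners' decisions for x by simple voting."""
    votes = Counter(learner(x) for learner in learners)
    return votes.most_common(1)[0][0]


# Hypothetical usage: each replicate would be handed to the same
# (unstable) training algorithm to obtain one base-learner.
rng = random.Random(0)
data = list(range(10))
replicates = [bootstrap_sample(data, rng) for _ in range(3)]
print(replicates)
```

Because sampling is with replacement, a replicate usually repeats some instances and omits others, which is what makes the base-learners differ.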
Boosting
• In boosting, we actively try to generate complementary base-learners by training the next learner on the mistakes of the previous learners
• The original boosting algorithm (Schapire 1990) combines three weak learners to generate a strong learner
  – “Weak” and “strong” in the sense of the probably approximately correct (PAC) learning model
• Disadvantage
  – It requires a very large training sample
Schapire, R.E. 1990. “The Strength of Weak Learnability.” Machine Learning 5, 197-227
[Figure: training sets X1, X2, X3 and base-learners d1, d2, d3]
AdaBoost
• AdaBoost, short for adaptive boosting, uses the same training set over and over, so the training set need not be large; it can also combine an arbitrary number of base-learners, not just three
• The idea is to modify the probabilities of drawing the instances as a function of the error
  – The probability of a correctly classified instance is decreased, and then a new sample set is drawn from the original sample according to these modified probabilities
  – This focuses more on the instances misclassified by the previous learner
• Schapire et al. explain that the success of AdaBoost is due to its property of increasing the margin
Schapire et al. 1998. “Boosting the Margin: A New Explanation for the Effectiveness of Voting Methods.” Annals of Statistics 26, 1651-1686
Freund and Schapire. 1996. “Experiments with a New Boosting Algorithm.” In ICML 13, 148-156
AdaBoost.M2 (Freund and Schapire, 1997)
[Algorithm listing not recovered; surviving fragment: if $\epsilon_j$ reaches 1/2, stop]
Freund and Schapire. 1997. “A decision-theoretic generalization of on-line learning and an application to boosting” Journal of Computer and System Sciences 55, 119-139
Evolution of Boosting Algo.
[Timeline figure, 1996 to 2006; model types shown in the figure: Neural Network, GMM, HMM]
• ICSLP 96, G. Cook & T. Robinson, “Boosting the Performance of Connectionist LVSR”
• EuroSpeech 97, G. Cook et al., “Ensemble Methods for Connectionist Acoustic Modeling”
• ICASSP 99, H. Schwenk, “Using Boosting to Improve a Hybrid HMM/Neural Network Speech Recognizer”
• ICASSP 00, G. Zweig & M. Padmanabhan, “Boosting Gaussian Mixtures in An LVCSR System”
• ICASSP 02, C. Meyer, “Utterance-Level Boosting of HMM Speech Recognition”
• ICASSP 02, I. Zitouni et al., “Combination of Boosting and Discriminative Training for Natural Language Call Steering Systems”
• ICASSP 03, R. Zhang & A. Rudnicky, “Improving the Performance of An LVCSR System Through Ensembles of Acoustic Models”
• EuroSpeech 03, R. Zhang & A. Rudnicky, “Comparative Study of Boosting and Non-Boosting Training for Constructing Ensembles of Acoustic Models”
• ICASSP 04, C. Dimitrakakis & S. Bengio, “Boosting HMMs with An Application to Speech Recognition”
• ICSLP 04, R. Zhang & A. Rudnicky, “A Frame Level Boosting Training Scheme for Acoustic Modeling”
• ICSLP 04, R. Zhang & A. Rudnicky, “Optimizing Boosting with Discriminative Criteria”
• ICSLP 04, R. Zhang & A. Rudnicky, “Apply N-Best List Re-Ranking to Acoustic Model Combinations of Boosting Training”
• EuroSpeech 05, R. Zhang et al., “Investigations on Ensemble Based Semi-Supervised Acoustic Model Training”
• ICSLP 06, R. Zhang & A. Rudnicky, “Investigations of Issues for Using Multiple Acoustic Models to Improve CSR”
• SpeechCom 06, C. Meyer & H. Schramm, “Boosting HMM Acoustic Models in LVCSR”
Improving The Performance of An LVCSR System Through Ensembles of Acoustic Models
ICASSP 2003
Rong Zhang and Alexander I. Rudnicky
Language Technologies Institute, School of Computer Science
Carnegie Mellon University
Bagging vs. Boosting
• Bagging
  – In each round, bagging randomly selects a number of examples from the original training set and produces a new single classifier based on the selected subset
  – The final classifier is built by choosing the hypothesis best agreed on by the single classifiers
• Boosting
  – In boosting, the single classifiers are trained iteratively in such a fashion that hard-to-classify examples are given increasing emphasis
  – A parameter that measures the classifier’s importance is determined from its classification accuracy
  – The final hypothesis is the weighted majority vote of the single classifiers
Algorithms
• The first algorithm is based on the intuition that an incorrectly recognized utterance should receive more attention in training
• If the weight of an utterance is 2.6, we first add two copies of the utterance to the new training set, and then add its third copy with probability 0.6
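The fractional-weight rule in the last bullet can be sketched as follows. This is a minimal illustration of that one resampling step only; the function name `resample_by_weight` and the toy utterance IDs are assumptions.

```python
import math
import random


def resample_by_weight(utterances, weights, rng):
    """Copy each utterance floor(w) times, plus one more copy with probability frac(w)."""
    new_set = []
    for utt, w in zip(utterances, weights):
        copies = math.floor(w)
        if rng.random() < w - copies:  # e.g. w = 2.6: third copy with probability 0.6
            copies += 1
        new_set.extend([utt] * copies)
    return new_set


rng = random.Random(0)
# "utt_b" has weight 2.6, so it appears two or three times in the new set.
print(resample_by_weight(["utt_a", "utt_b"], [1.0, 2.6], rng))
```

In expectation each utterance appears exactly as many times as its weight, so the new training set reflects the weights without needing a weighted trainer.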
Algorithms
• The exponential growth of the training set is a severe problem for Algorithm 1
• Algorithm 2 is proposed to address this problem
Algorithms
• Algorithms 1 and 2 do not measure how important a model is relative to the others
  – A good model should play a more important role than a bad one
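$$ L = \sum_{i} \exp\Big( \sum_{t=1}^{T} c_t\, e_t(x_i) \Big) $$
[The piecewise definition of $e_t(x_i)$, which is 0 in the “otherwise” case, is not recoverable]
$$ L = \sum_{i} \exp\Big( \sum_{t=1}^{T-1} c_t\, e_t(x_i) \Big) \exp\big( c_T\, e_T(x_i) \big) = \sum_{i} w_i^{T} \exp\big( c_T\, e_T(x_i) \big), \quad w_i^{T} = \exp\Big( \sum_{t=1}^{T-1} c_t\, e_t(x_i) \Big) $$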
Experiments
• Corpus : CMU Communicator system
• Experimental results :
Comparative Study of Boosting and Non-Boosting Training for Constructing Ensembles of Acoustic Models
Rong Zhang and Alexander I. Rudnicky
Language Technologies Institute, CMU
EuroSpeech 2003
Non-Boosting method
• Bagging
  – is a commonly used method in the machine learning field
  – randomly selects a number of examples from the original training set and produces a new single classifier
  – in this paper, it is called a non-Boosting method
• Based on the intuition:
  – A misrecognized utterance should receive more attention in the successive training
Algorithms
λ is a parameter that prevents the size of the training set from being too large.
Experiments
• The corpus:
  – Training set: 31248 utterances; Test set: 1689 utterances
A Frame Level Boosting Training Scheme for Acoustic Modeling
ICSLP 2004
Rong Zhang and Alexander I. Rudnicky
Language Technologies Institute, School of Computer Science
Carnegie Mellon University
Introduction
• In the current Boosting algorithm, the utterance is the basic unit used for acoustic model training
• Our analysis shows that there are two notable weaknesses in this setting:
  – First, the objective function of the current Boosting algorithm is designed to minimize utterance error instead of word error
  – Second, in the current algorithm, an utterance is treated as a single unit for resampling
• This paper proposes a frame level Boosting training scheme for acoustic modeling to address these two problems
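Frame Level Boosting Training Scheme
• The metric used in Boosting training is the frame level conditional probability $P(a \mid x, t)$ (word level), estimated from the N-best list:
  $$ P(a \mid x, t) = \frac{\sum_{h \in \text{NBest},\, h_t = a} P(h, x)}{\sum_{h \in \text{NBest}} P(h, x)} $$
  where $h_t$ is the label that hypothesis $h$ assigns to frame $t$
• Objective function:
  $$ L = \sum_{i=1}^{N} \sum_{t=1}^{T} \sum_{a \ne a_{i,t}} \exp\big( P(a \mid x_i, t) - P(a_{i,t} \mid x_i, t) \big) $$
• $\sum_{a \ne a_{i,t}} \exp\big( P(a \mid x_i, t) - P(a_{i,t} \mid x_i, t) \big)$ is the pseudo loss for frame $t$, which describes the degree of confusion of this frame for recognition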
Frame Level Boosting Training Scheme
• Training scheme:
  – How to resample the frame level training data?
    • Duplicate each frame $x_{i,t}$ a number of times that depends on the frame, and create a new utterance for acoustic model training
Experiments
• Corpus : CMU Communicator system
• Experimental results :
Boosting HMM acoustic models in large vocabulary speech recognition
Carsten Meyer, Hauke Schramm
Philips Research Laboratories, Germany
SPEECH COMMUNICATION 2006
Utterance approach for boosting in ASR
• An intuitive way of applying boosting to HMM speech recognition is at the utterance level
  – Thus, boosting is used to improve upon an initial ranking of candidate word sequences
• The utterance approach has two advantages:
  – First, it is directly related to the sentence error rate
  – Second, it is computationally much less expensive than boosting applied at the level of feature vectors
Utterance approach for boosting in ASR
• In the utterance approach, we define the input pattern $x_i$ to be the sequence of feature vectors corresponding to the entire utterance $i$
• $y$ denotes one possible candidate word sequence of the speech recognizer, $y_i$ being the correct word sequence for utterance $i$
• The a posteriori confidence measure is calculated on the basis of the N-best list $L_i$ for utterance $i$:
  $$ h_t(x_i, y) = \frac{p_t(x_i \mid y)\, p(y)}{\sum_{z \in L_i} p_t(x_i \mid z)\, p(z)} $$
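A minimal sketch of an N-best confidence of this kind: the scores of the candidates are normalised over the list. It assumes acoustic and language model scores are available as log-probabilities, and the function name `nbest_confidence` is hypothetical.

```python
import math


def nbest_confidence(nbest):
    """nbest maps each candidate y to (log p_t(x|y), log p(y)).

    Returns h_t(x, y) = p_t(x|y) p(y) / sum_z p_t(x|z) p(z) for every y in the list.
    """
    scores = {y: am + lm for y, (am, lm) in nbest.items()}
    top = max(scores.values())  # subtract the maximum for numerical stability
    norm = sum(math.exp(s - top) for s in scores.values())
    return {y: math.exp(s - top) / norm for y, s in scores.items()}


# Hypothetical two-entry N-best list with made-up log scores.
print(nbest_confidence({"call home": (-120.0, -4.0), "call hum": (-121.0, -6.0)}))
```

The confidences sum to one over the list, so a candidate missing from the list implicitly gets zero mass.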
Utterance approach for boosting in ASR
• Based on the confidence values and the AdaBoost.M2 algorithm, we calculate an utterance weight $w_t(i)$ for each training utterance $i$
• Subsequently, the weights are used in maximum likelihood and discriminative training of the Gaussian mixture models:
  $$ F_{ML,t} = \sum_{i=1}^{N} w_t(i) \log p(x_i \mid y_i) $$
  $$ F_{MMI,t} = \sum_{i=1}^{N} w_t(i) \log \frac{p(x_i \mid y_i)}{\sum_{y} p(x_i \mid y)\, p(y)} $$
Utterance approach for boosting in ASR
• Some problems are encountered when applying it to large-scale continuous speech applications:
  – N-best lists of reasonable length (e.g. N=100) generally contain only a tiny fraction of the possible classification results
• This has two consequences:
  – In training, it may lead to sub-optimal utterance weights
  – In recognition, Eq. (1) cannot be applied appropriately
  $$ h_t(x_i, y) = \frac{p_t(x_i \mid y)\, p(y)}{\sum_{z \in L_i} p_t(x_i \mid z)\, p(z)} $$
  $$ H(x) = \arg\max_{y \in Y} \sum_{t=1}^{T} \ln\Big( \frac{1}{\beta_t} \Big) h_t(x, y) \qquad (1) $$
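As a rough illustration of Eq. (1), the sketch below combines per-iteration confidences with weights ln(1/β_t). The function name `combine_hypotheses` and the toy values are assumptions; a candidate absent from one iteration's N-best list simply contributes nothing for that iteration, which is the limitation the slide points out.

```python
import math


def combine_hypotheses(confidences, betas):
    """confidences[t][y] = h_t(x, y); betas[t] = beta_t, assumed to lie in (0, 1).

    Returns argmax_y sum_t ln(1 / beta_t) * h_t(x, y).
    """
    total = {}
    for h_t, beta in zip(confidences, betas):
        weight = math.log(1.0 / beta)
        for y, conf in h_t.items():
            total[y] = total.get(y, 0.0) + weight * conf
    return max(total, key=total.get)


# Hypothetical confidences from two boosting iterations.
h = [{"a": 0.6, "b": 0.4}, {"a": 0.3, "b": 0.7}]
print(combine_hypotheses(h, [0.5, 0.1]))  # the second classifier has the smaller beta
```

A smaller β_t gives a larger weight, so the iteration with the lower pseudo-loss dominates the vote.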
Utterance approach for CSR--Training
• Training
  – A convenient strategy to reduce the complexity of the classification task and to provide more meaningful N-best lists consists in “chopping” the training data
  – For long sentences, this simply means inserting additional sentence break symbols at silence intervals of a given minimum length
  – This reduces the number of possible classifications of each sentence “fragment”, so that the resulting N-best lists should cover a sufficiently large fraction of hypotheses
Utterance approach for CSR--Decoding
• Decoding: lexical approach for model combination
  – A single pass decoding setup, where the combination of the boosted acoustic models is realized at a lexical level
  – The basic idea is to add a new pronunciation model by “replicating” the set of phoneme symbols in each boosting iteration (e.g. by appending the suffix “_t” to the phoneme symbol)
  – The new phoneme symbols (“au”, “au_1”, “au_2”, …) represent the underlying acoustic model of boosting iteration t
Utterance approach for CSR--Decoding
• Decoding: lexical approach for model combination (cont.)
  – Add to each phonetic transcription in the decoding lexicon a new transcription using the corresponding phoneme set (“sic_a”, “sic_1 a_1”, …)
  – Use the reweighted training data to train the boosted classifier $M_t$
  – Decoding is then performed using the extended lexicon and the set of acoustic models, weighted by their unigram prior probabilities, which are estimated on the training data (weighted summation)
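The lexicon extension described on the two decoding slides can be sketched as below. The lexicon layout (word mapped to a list of phoneme sequences) and the name `extend_lexicon` are assumptions made for illustration; only the “_t” suffix convention comes from the slides.

```python
def extend_lexicon(lexicon, num_iterations):
    """Add, for every boosting iteration t, a pronunciation variant whose
    phoneme symbols carry the suffix "_t".

    lexicon: dict mapping a word to a list of pronunciations,
             each pronunciation being a list of phoneme symbols.
    """
    extended = {}
    for word, pronunciations in lexicon.items():
        variants = [list(p) for p in pronunciations]  # base model: unsuffixed symbols
        for t in range(1, num_iterations + 1):
            variants.extend([f"{ph}_{t}" for ph in p] for p in pronunciations)
        extended[word] = variants
    return extended


# Hypothetical one-word lexicon with two boosting iterations.
print(extend_lexicon({"out": [["au", "t"]]}, 2))
```

Each suffixed symbol set selects the acoustic model of one boosting iteration, so a standard single-pass decoder can combine the models through the pronunciation variants alone.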
In more detail
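[Figure: block diagram of boosting iteration t. Labels: training corpus; phonetically transcribed training corpus ($M_t$); utterance weights $w_t(i)$; ML/MMI training; models $M_1, M_2, \dots, M_t$; lexicon extended with pronunciation variants using the suffix “_t” (“sic_a”, “sic_1 a_1”, …); unweighted model combination and weighted model combination]
Decoding
$$ \hat{w}_1^N = \arg\max_{w_1^N} p(w_1^N \mid x_1^M) = \arg\max_{w_1^N} p(w_1^N, x_1^M) $$
where
$$ p(w_1^N, x_1^M) = \sum_{v_1^N \in R(w_1^N)} p(w_1^N, v_1^N, x_1^M) $$
$$ = \sum_{v_1^N \in R(w_1^N)} p(w_1^N)\, p(v_1^N \mid w_1^N)\, p(x_1^M \mid v_1^N) $$
$$ = \sum_{v_1^N \in R(w_1^N)} \prod_{i=1}^{N} p(w_i \mid w_{i-m}^{i-1})\, p(v_i \mid w_i)\, p(x_i \mid v_i) $$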
In more detail
Weighted model combination
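• Word level model combination
$$ p(w_1^N, x_1^M) = \sum_{t=1}^{T} \sum_{v_1^N \in R(w_1^N)} p(w_1^N, v_1^N, t, x_1^M) $$
$$ = \sum_{t=1}^{T} \sum_{v_1^N \in R(w_1^N)} p(w_1^N, t)\, p(v_1^N \mid w_1^N, t)\, p(x_1^M \mid v_1^N, t) $$
$$ = \sum_{t=1}^{T} \sum_{v_1^N \in R(w_1^N)} \prod_{i=1}^{N} p(w_i \mid w_{i-m}^{i-1})\, p(v_i \mid w_i, t)\, p(t)\, p(x_i \mid v_i, t) $$
where, for simplicity, $p(w_1^N, t) = p(w_1^N)\, p(t)$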
Experiments
• Isolated word recognition
  – Telephone-bandwidth large vocabulary isolated word recognition
  – SpeechDat(II) German material
• Continuous speech recognition
  – Professional dictation and Switchboard
Isolated word recognition
• Database:
  – Training corpus: 18k utterances (4.3h) of city, company, first and family names
  – Evaluations:
    • LILI test corpus: 10k single word utterances (3.5h); 10k words lexicon; (matched conditions)
    • Names corpus: an inhouse collection of 676 utterances (0.5h); two different decoding lexica: 10k lex, 190k lex; (acoustic conditions are matched, whereas there is a lexical mismatch)
    • Office corpus: 3.2k utterances (1.5h), recorded over microphone in clean conditions; 20k lexicon; (an acoustic mismatch to the training conditions)
Isolated word recognition
• Boosting ML models
Isolated word recognition
• Combining boosting and discriminative training
– The experiments in isolated word recognition showed that boosting may improve the best test error rates
Continuous speech recognition
• Database
  – Professional dictation
    • An inhouse data collection of real-life recordings of medical reports
    • The acoustic training corpus consists of about 58h of data
    • Evaluations were carried out on two test corpora:
      – Development corpus consists of 5.0h of speech
      – Evaluation corpus consists of 3.3h of speech
  – Switchboard
    • Spontaneous conversations recorded over telephone lines; 57h (73h) of male (female) speech
    • Evaluation corpus:
      – About 1h (0.5h) of male (female) speech
Continuous speech recognition
• Professional dictation:
• Switchboard:
Conclusions
• In this paper, a boosting approach that can be applied to any HMM-based speech recognizer was presented and evaluated
• The increased recognizer complexity, and thus the decoding effort, of the boosted systems is a major drawback compared to other training techniques such as discriminative training
Probably Approximately Correct Learning
• We would like our hypothesis to be approximately correct, namely, that the error probability be bounded by some value
• We would also like to be confident in our hypothesis, in that we want to know that it will be correct most of the time, so we want it to be probably correct as well
• Given a class, $C$, and examples drawn from some unknown but fixed probability distribution, $p(x)$, we want to find the number of examples, $N$, such that with probability at least $1 - \delta$, the hypothesis $h$ has error at most $\epsilon$, for arbitrary $\delta \le 1/2$ and $\epsilon > 0$:
  $$ P\{ C \Delta h \le \epsilon \} \ge 1 - \delta $$
  where $C \Delta h$ is the region of difference between $C$ and $h$
Probably Approximately Correct Learning
• How many training examples N should we have, such that with probability at least 1 ‒ δ, h has error at most ε?
[Figure labels: most specific hypothesis, S; most general hypothesis, G; any h ∈ H between S and G is consistent, and together these make up the version space]
• Each strip is at most ε/4
• Pr that we miss a strip: 1 ‒ ε/4
• Pr that N instances miss a strip: (1 ‒ ε/4)^N
• Pr that N instances miss 4 strips: at most 4(1 ‒ ε/4)^N
• We require 4(1 ‒ ε/4)^N ≤ δ; using (1 ‒ x) ≤ exp(‒x),
• 4exp(‒εN/4) ≤ δ, which gives N ≥ (4/ε)log(4/δ)
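A small worked example of the bound N ≥ (4/ε)log(4/δ), using the natural logarithm as the exp step implies. The function name `pac_sample_size` and the chosen ε and δ are illustrative assumptions.

```python
import math


def pac_sample_size(epsilon, delta):
    """Smallest integer N with N >= (4 / epsilon) * ln(4 / delta)."""
    if not (0 < epsilon < 1 and 0 < delta < 1):
        raise ValueError("epsilon and delta must lie in (0, 1)")
    return math.ceil((4.0 / epsilon) * math.log(4.0 / delta))


n = pac_sample_size(0.1, 0.05)
print(n)
# Check the condition the bound was derived from: 4 (1 - eps/4)^N <= delta.
print(4 * (1 - 0.1 / 4) ** n <= 0.05)
```

For error at most 0.1 with confidence 0.95, the bound asks for 176 examples; it grows linearly in 1/ε but only logarithmically in 1/δ.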
The Boosting Approach to Machine Learning: An Overview
Robert E. Schapire
AT&T Labs, USA
MSRI Workshop on Nonlinear Estimation and Classification, 2002
Abstract
• This paper gives an overview of some recent work on boosting, including:
  – Analyses of AdaBoost’s training error and generalization error
  – Boosting’s connection to game theory and linear programming
  – The relationship between boosting and logistic regression
  – Extensions of AdaBoost for multiclass classification problems
  – Methods of incorporating human knowledge into boosting
References
• Freund and Schapire. 1997. “A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting.” Journal of Computer and System Sciences 55, 119-139
• Meir and Rätsch. 2003. “An Introduction to Boosting and Leveraging.” In Advanced Lectures on Machine Learning (LNAI 2600), 118-183
Introduction
• Boosting is based on the observation that
  – finding many rough rules of thumb can be a lot easier than finding a single, highly accurate prediction rule
• Two fundamental questions:
  – How should each distribution be chosen on each round?
  – How should the weak rules be combined into a single rule?
A method for finding rough rules of thumb is called a “weak” or “base” learning algorithm
AdaBoost algorithm
AdaBoost algorithm cont.
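• The base learner’s job is to find a base classifier $h_t : X \to \mathbb{R}$ appropriate for the distribution $D_t$
• In the binary case, the base learner’s aim is then to minimize the error
  $$ \epsilon_t = \Pr_{i \sim D_t} \big[ h_t(x_i) \ne y_i \big] $$
• AdaBoost chooses a parameter $\alpha_t \in \mathbb{R}$ that intuitively measures the importance that it assigns to $h_t$:
  $$ \alpha_t = \frac{1}{2} \ln\Big( \frac{1 - \epsilon_t}{\epsilon_t} \Big) $$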
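A compact sketch of binary AdaBoost with labels in {-1, +1}, following the choices of ε_t and α_t above. The base learner here is a one-dimensional threshold stump, which is an assumption made for brevity; the names `train_stump`, `adaboost` and `predict` are illustrative.

```python
import math


def train_stump(X, y, D):
    """Base learner: threshold stump on 1-D inputs minimising the D-weighted error."""
    best = None
    for thr in sorted(set(X)):
        for polarity in (1, -1):
            preds = [polarity if x >= thr else -polarity for x in X]
            err = sum(d for d, p, label in zip(D, preds, y) if p != label)
            if best is None or err < best[0]:
                best = (err, thr, polarity)
    return best


def adaboost(X, y, rounds):
    m = len(X)
    D = [1.0 / m] * m  # D_1(i) = 1/m
    ensemble = []
    for _ in range(rounds):
        eps, thr, pol = train_stump(X, y, D)
        if eps >= 0.5:  # base classifier no better than random: stop
            break
        eps = max(eps, 1e-12)  # guard against log of infinity when eps is 0
        alpha = 0.5 * math.log((1.0 - eps) / eps)
        ensemble.append((alpha, thr, pol))
        # D_{t+1}(i) = D_t(i) exp(-alpha_t y_i h_t(x_i)) / Z_t
        D = [d * math.exp(-alpha * label * (pol if x >= thr else -pol))
             for d, x, label in zip(D, X, y)]
        Z = sum(D)
        D = [d / Z for d in D]
    return ensemble


def predict(ensemble, x):
    """H(x) = sign(sum_t alpha_t h_t(x))."""
    f = sum(alpha * (pol if x >= thr else -pol) for alpha, thr, pol in ensemble)
    return 1 if f >= 0 else -1


# Toy data that no single stump can separate.
X = [1, 2, 3, 4, 5, 6]
y = [1, 1, -1, -1, 1, 1]
model = adaboost(X, y, rounds=3)
print([predict(model, x) for x in X])
```

On this toy set one stump always misclassifies two points, while the weighted vote of three stumps reproduces all six labels.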
Analyzing the training error
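• The most basic theoretical property of AdaBoost concerns its ability to reduce the training error
• The training error of the final classifier is bounded as follows:
  $$ \frac{1}{m} \big| \{ i : H(x_i) \ne y_i \} \big| \le \frac{1}{m} \sum_i \exp\big( -y_i f(x_i) \big) = \prod_t Z_t $$
  where we define
  $$ f(x) = \sum_t \alpha_t h_t(x) $$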
Detail derivation
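$$ \frac{1}{m} \big| \{ i : H(x_i) \ne y_i \} \big| \le \frac{1}{m} \sum_i \exp\big( -y_i f(x_i) \big) \qquad \text{since } e^{-y_i f(x_i)} \ge 1 \text{ if } y_i f(x_i) \le 0 $$
$$ = \frac{1}{m} \sum_i \exp\Big( -y_i \sum_t \alpha_t h_t(x_i) \Big) $$
$$ = \frac{1}{m} \sum_i \prod_{t=1}^{T} \exp\big( -\alpha_t y_i h_t(x_i) \big) $$
$$ = \sum_i D_1(i) \exp\big( -\alpha_1 y_i h_1(x_i) \big) \prod_{t=2}^{T} \exp\big( -\alpha_t y_i h_t(x_i) \big) $$
$$ = Z_1 \sum_i D_2(i) \prod_{t=2}^{T} \exp\big( -\alpha_t y_i h_t(x_i) \big) $$
$$ = Z_1 Z_2 \sum_i D_3(i) \prod_{t=3}^{T} \exp\big( -\alpha_t y_i h_t(x_i) \big) $$
$$ = \prod_t Z_t $$
using $D_1(i) = 1/m$ and
$$ D_2(i) = \frac{D_1(i) \exp\big( -\alpha_1 y_i h_1(x_i) \big)}{Z_1} $$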
Analyzing the training error cont.
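• The training error can be reduced most rapidly by choosing $\alpha_t$ and $h_t$ on each round to minimize
  $$ Z_t = \sum_i D_t(i) \exp\big( -\alpha_t y_i h_t(x_i) \big) $$
  since
  $$ \frac{1}{m} \big| \{ i : H(x_i) \ne y_i \} \big| \le \prod_t Z_t $$
• In the case of binary classifiers,
  $$ \prod_t Z_t = \prod_t \Big[ 2 \sqrt{\epsilon_t (1 - \epsilon_t)} \Big] = \prod_t \sqrt{1 - 4\gamma_t^2} \le \exp\Big( -2 \sum_t \gamma_t^2 \Big) $$
  where $\epsilon_t = \frac{1}{2} - \gamma_t$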
Analyzing the training error cont.
• Thus, if each base classifier is slightly better than random, so that $\gamma_t \ge \gamma$ for some $\gamma > 0$, then the training error drops exponentially fast in T:
  $$ \frac{1}{m} \big| \{ i : H(x_i) \ne y_i \} \big| \le \prod_t Z_t \le \exp\Big( -2 \sum_t \gamma_t^2 \Big) \le \exp\big( -2 T \gamma^2 \big) $$
• AdaBoost is in fact a procedure for finding a linear combination f of base classifiers which attempts to minimize
  $$ \sum_i \exp\big( -y_i f(x_i) \big) = \sum_i \exp\Big( -y_i \sum_t \alpha_t h_t(x_i) \Big) $$
AdaBoost is doing a kind of steepest descent search to minimize the above expression, where the search is constrained at each step to follow coordinate directions
Mason et al. 1999. “Boosting Algorithms as Gradient Descent.” In Advances in Neural Information Processing Systems 12, 2000
Detail derivation
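$$ Z_t = \sum_i D_t(i) \exp\big( -\alpha_t y_i h_t(x_i) \big) $$
$$ = \sum_{i \in C} D_t(i) \exp\big( -\alpha_t y_i h_t(x_i) \big) + \sum_{i \in E} D_t(i) \exp\big( -\alpha_t y_i h_t(x_i) \big) $$
$$ = \sum_{i \in C} D_t(i) \exp( -\alpha_t ) + \sum_{i \in E} D_t(i) \exp( \alpha_t ) $$
$$ = (1 - \epsilon_t)\, e^{-\alpha_t} + \epsilon_t\, e^{\alpha_t} $$
$$ = \dots $$
where C and E are the sets of correctly and incorrectly classified examples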