
DISCRIMINATIVE TRAINING OF STREAM WEIGHTS IN A MULTI-STREAM HMM AS A LINEAR PROGRAMMING PROBLEM

by

NG, YIK LUN

A Thesis Submitted toThe Hong Kong University of Science and Technology

in Partial Fulfillment of the Requirements for

the Degree of Master of Philosophy

in Computer Science and Engineering

January 2008, Hong Kong

Copyright © by NG, Yik Lun 2008


Authorization

I hereby declare that I am the sole author of the thesis.

I authorize the Hong Kong University of Science and Technology to lend this

thesis to other institutions or individuals for the purpose of scholarly research.

I further authorize the Hong Kong University of Science and Technology to

reproduce the thesis by photocopying or by other means, in total or in part, at the

request of other institutions or individuals for the purpose of scholarly research.

NG, YIK LUN


DISCRIMINATIVE TRAINING OF STREAM WEIGHTS IN A MULTI-STREAM HMM AS A LINEAR PROGRAMMING PROBLEM

by

NG, YIK LUN

This is to certify that I have examined the above M.Phil. thesis

and have found that it is complete and satisfactory in all respects,

and that any and all revisions required by

the thesis examination committee have been made.

DR. BRIAN KAN-WING MAK, THESIS SUPERVISOR

PROF. LIONEL NI, HEAD OF DEPARTMENT

Department of Computer Science and Engineering

15 January 2008


ACKNOWLEDGMENTS

I would like to express my sincere thanks to Dr. Brian Mak for his supervision throughout my MPhil study. He taught me not only the concepts of speech technology, but also the ways to think critically and to present ideas in a precise and organized manner. I would also like to thank Dr. Dit-Yan Yeung and Dr. Huamin Qu for serving on my thesis panel.

I would like to express my gratitude to my colleagues, including Harry Lai, Jeff Au Yeung, Kimo Lai, Ka-Keung Wong and Yang Xi. I learnt a lot from them in the past two years. I would also like to thank Ivor Tsang, as he helped me solve some technical problems in my thesis.

Finally, I would like to thank my parents for their patience and encouragement during my study. Their love and care give me the energy to work hard in my research life.


TABLE OF CONTENTS

Title Page i

Authorization Page ii

Signature Page iii

Acknowledgments iv

Table of Contents v

List of Figures vii

List of Tables viii

Abstract ix

Chapter 1 Introduction 1

1.1 Background 1

1.2 Outline of the Thesis 3

Chapter 2 Background 4

2.1 Introduction 4

2.2 Review of multi-stream hidden Markov model (HMM) 4

2.3 Stream weight estimation methods 6

2.3.1 Discriminative estimation methods 7

2.3.2 Bayesian estimation methods 11

2.3.3 Heuristics 15

2.4 Discriminative training as a constrained optimization problem in metric learning 19

2.5 Conclusion 21

Chapter 3 Discriminative Training of Stream Weights in a Multi-Stream HMM as a Linear Programming Problem 22

3.1 Introduction 22

3.2 With Complete Knowledge of the Feasible Region Based on Frame Recognition Correctness 23


3.2.1 The Basic Requirement 23

3.2.2 LP Form 23

3.2.3 Discussion 24

3.3 With Incomplete Knowledge of the Feasible Region Based on Word Recognition Correctness 25

3.3.1 Iterative LP Optimization 25

3.3.2 Discussion 26

3.4 Conclusion 27

Chapter 4 Experimental Evaluation 28

4.1 Introduction 28

4.2 Setup 28

4.2.1 Resource Management Corpus 28

4.2.2 Acoustic Modeling 28

4.3 Implementation Issues 29

4.4 Experiment 1: LP Optimization with Complete Knowledge of the Feasible Region Based on Frame Recognition Correctness 31

4.4.1 Effect of Tying Weights, Biases, and Slack Variables 32

4.4.2 Effect of More Training Frames 33

4.4.3 Remarks 34

4.5 Experiment 2: Iterative LP with Incomplete Knowledge of the Feasible Region Based on Word Recognition Correctness 34

4.5.1 Experiment 2.1: Single LP iteration 35

4.5.2 Experiment 2.2: Iterative LP optimization 36

4.6 Significant Tests 42

4.7 Summary and Discussion 42

Chapter 5 Conclusion and Future Work 46

5.1 Contributions 46

5.2 Future Work 47

References 48

Appendix A Notations in this thesis 54

Appendix B Maximum entropy estimation 55

Appendix C Significant Tests 58


LIST OF FIGURES

2.1 An example of HMM 5

2.2 Change in shape of sigmoid function as γ changes 8

2.3 Change in position of sigmoid function as θ changes 8

2.4 Bayes classification error in a 2-class classification problem 13

4.1 Effect of ∆wmax on iterative LP optimization using state-dependent weights. 37
4.2 Effect of ∆vmax on iterative LP optimization using phoneme-dependent weights and phoneme-dependent biases and ∆wmax = 0.01. 38
4.3 Effect of ∆vmax on iterative LP optimization using phoneme-dependent weights and phoneme-dependent biases and ∆wmax = 0.005. 39
4.4 Effect of ∆vmax on iterative LP optimization using phoneme-dependent weights and state-dependent biases and ∆wmax = 0.01. 39
4.5 Effect of ∆vmax on iterative LP optimization using phoneme-dependent weights and state-dependent biases and ∆wmax = 0.005. 40
4.6 Effect of ∆vmax on iterative LP optimization using state-dependent weights and phoneme-dependent biases and ∆wmax = 0.01. 40
4.7 Effect of ∆vmax on iterative LP optimization using state-dependent weights and phoneme-dependent biases and ∆wmax = 0.005. 41
4.8 Effect of ∆vmax on iterative LP optimization using state-dependent weights and state-dependent biases and ∆wmax = 0.01. 41
4.9 Effect of ∆vmax on iterative LP optimization using state-dependent weights and state-dependent biases and ∆wmax = 0.005. 42


LIST OF TABLES

4.1 Word accuracy of the baseline models. All models are monophone models with 10 Gaussian mixtures per state. 29
4.2 Effect of tying stream weights, bias, and slack variables on the LP solution of stream weight estimation based on frame recognition correctness. 3,600 training frames were used; the sum of weights at any state was set to 4.0. 32
4.3 Effect of more training data on the LP solution of stream weight estimation based on frame recognition correctness. Global stream weights, global biases, frame-dependent slack variables, and zero margin were used. 33
4.4 Effect of tying stream weights, bias, and slack variables and using margin on the LP solution of stream weight estimation based on frame recognition correctness. 57,600 training frames were used; the sum of weights at any state was set to 4.0. 34
4.5 Effect of tying stream weights, bias, and slack variables on the LP solution of stream weight estimation based on word recognition correctness (single LP iteration). 3,990 training utterances and 50-best utterance hypotheses of the baseline 4-stream model were used; the sum of weights at any state was set to 4.0. 35
4.6 Best word accuracies of the LP optimization with various stream weight tying and ∆wmax within 5 iterations. 37
4.7 Word accuracy of the baseline models and our method based on frame and word recognition correctness. All models are monophone models with 10 Gaussian mixtures per state. 43
5.1 Word accuracy of the baseline models and our method. All models are monophone models with 10 Gaussian mixtures per state. 46

C.1 Significant tests 59


DISCRIMINATIVE TRAINING OF STREAM WEIGHTS IN A MULTI-STREAM HMM AS A LINEAR PROGRAMMING PROBLEM

by

NG, YIK LUN

Department of Computer Science and Engineering

The Hong Kong University of Science and Technology

ABSTRACT

The hidden Markov model (HMM) is a commonly used statistical model for pattern classification. One way to incorporate multiple information sources under the HMM framework is to use a multi-stream HMM. The state log-likelihood of a multi-stream HMM is usually computed as a linear combination of the stream log-likelihoods using a set of stream weights. The estimation of the stream weights is important because it can greatly affect the performance of the multi-stream HMM. Various estimation methods have been proposed. Some pose the estimation of stream weights as an optimization problem, and various objective functions such as minimum classification error and maximum entropy have been tried. In this thesis, we cast the estimation of stream weights as a linear programming (LP) problem. The LP formulation is very flexible, allowing various degrees of tying of the stream weights. It also de-couples the estimation of stream weights from the recognition system, so that the estimation may be done by any commonly available and efficient LP solver. In practice, however, we may not have complete knowledge of the feasible region, since it is constructed from a limited number of competing hypotheses generated from the current acoustic model. We investigate an iterative LP optimization algorithm in which additional constraints on the parameters being optimized are imposed. We evaluate our LP formulation in automatic speech recognition using the Resource Management recognition task. It is found that the stream weights of a 4-stream HMM found via our LP formulation reduce the word error rate (WER) of the baseline system by 17%, and the WER obtained with stream weights found by an extensive brute-force grid search by 8.75%.


CHAPTER 1

INTRODUCTION

1.1 Background

A common approach to improving pattern classification performance is to make use of multiple information sources. There are three approaches to combining these information sources: (1) in the concatenative approach, the features are concatenated into a single feature vector and classification is performed on the values of these combined features; (2) in the parallel approach, the probability scores of the feature streams are combined via some combination function; (3) the hypotheses generated by separate features are combined to get the final hypotheses (as in ROVER [1]). The first two approaches integrate multiple features during acoustic modeling (early integration). The last one, concerning the late integration of multiple hypotheses, will not be discussed in this thesis. The parallel approach is commonly used when either the multiple features are believed to be independent of each other, or the concatenative approach results in a feature vector whose dimension is too large to model efficiently. Its applications include the discrete hidden Markov model (DHMM) [2, 3, 4], multi-band speech recognition [5, 6], audio-visual speech recognition [7, 8, 9], and so on.

In this thesis, we are only interested in using the hidden Markov model (HMM) for automatic speech recognition (ASR). The parallel approach for the HMM is the multi-stream HMM. In most multi-stream HMMs, the state probability density function (pdf) is modeled as a factored pdf and the state log-likelihood as a linear combination of the per-stream state log-likelihoods using stream weights. The performance of a multi-stream HMM depends highly on the stream weights: better stream weights can greatly improve performance. Therefore, the estimation of stream weights is crucial to the multi-stream HMM. When there are only two streams, the weights may be found through a brute-force grid search. However, when there are more than two streams, numerical estimation of the stream weights is generally required. The stream weights can be estimated based on various


heuristics such as the signal-to-noise ratio [10] and mutual information [11]. On the other hand, their estimation can be formulated as an optimization problem. It is obvious that these stream weights cannot be estimated using the maximum-likelihood approach, as it would simply give all the weight to the most probable stream. Instead, different cost functions such as minimum classification error (MCE) [7, 8] and maximum entropy [12] have been explored. Potamianos et al. [13] present some theoretical results about the ratio of the stream weights in a 2-stream model, based on a study of the Bayes classification error in 2-stream models; the results are correct within the limits of the assumptions made in the study.

In most previous work in the area of speech recognition, the estimation of stream weights is tightly coupled with the recognition system. Although such approaches solve the problems addressed in the literature, a generic framework for solving such problems is always preferred. In this thesis, we formulate the stream weight estimation as a linear programming (LP) [14] problem, and tap into commonly available optimization solvers for computing the solution. Our approach has several advantages over the others:

• as the linear programming problem is convex, global optimum/optima is/are

guaranteed (though it may not be unique);

• the approach is very flexible, and one may experiment with using global

stream weights, phoneme-dependent stream weights, or even state-specific

stream weights. The formulation can also be used for minimizing frame,

word or utterance recognition errors. The flexibility is only limited by the

available computing resources;

• the study of linear programming is very mature and many efficient solvers

are available;

• there is no need to implement the optimization routine for each specific

recognition system.

LP solutions are supposed to be globally optimal, so LP needs complete knowledge of the feasible region (the region of valid solutions in LP) in order to search for the global solution. Unfortunately, sometimes we may not have complete knowledge of the feasible region. We only have some of the competing hypotheses in


the relevant ASR problem and thus they form part of the feasible region. These

competing hypotheses are generated with specific stream weights and thus they

will be different for different stream weights. Hence, the globally optimal solution found using the competing hypotheses generated by the current model may not be the true optimal solution. In light of this, we devise an iterative

LP optimization algorithm and impose additional constraints on how much the

parameters being optimized can be changed in each iteration.

We evaluate our stream weight estimation method using the Resource Man-

agement (RM) recognition task. The stream weights of a 4-stream HMM found via our LP formulation reduce the word error rate (WER) of the baseline system by 17%, and the WER obtained with stream weights found by an extensive brute-force grid search by 8.75%.

1.2 Outline of the Thesis

In Chapter 2, various methods of combining streams in a multi-stream classifier

are discussed. A discriminative training method using a constrained optimization formulation for weight estimation in the area of metric learning is also presented, as it motivates this thesis.

In Chapter 3, the formulation of our stream weight estimation method is

presented. The formulation is based on the recognition correctness at the frame

level and word level.

In Chapter 4, the experimental results of our stream weight estimation method

are presented. We evaluate our LP optimization algorithm on the RM recognition

task.

Finally, in Chapter 5, we present the contributions of this thesis to automatic speech recognition. We also suggest some directions for future development of

our LP formulation.


CHAPTER 2

BACKGROUND

2.1 Introduction

This chapter surveys methods of linearly combining multiple streams of information in a multi-stream classifier. Firstly, the current approaches to stream weight estimation are reviewed. Secondly, discriminative training using a constrained optimization formulation for weight estimation in the area of metric learning is presented, as it motivates this thesis. The notations used in this chapter can be found in Appendix A.

2.2 Review of multi-stream hidden Markov model (HMM)

A hidden Markov model is a statistical model of a Markov process whose hidden parameters have to be determined from the observable data. It has been widely used in pattern recognition applications such as speech recognition [15, 16, 17, 18, 19, 20], handwriting recognition [21, 22], bioinformatics [23, 24, 25] and financial prediction [26, 27].

An example of an HMM is shown in Figure 2.1. $q_i$ is a state of the HMM and $a_{ij}$ is the state transition probability for the transition from state $i$ to state $j$. An HMM also contains the initial state probabilities and the state output probabilities (for discrete observable symbols) or state probability density functions (for continuous observable data). The initial state probability $\pi_j$ is the probability that the model starts in state $j$. The state output probability/probability density function (pdf) of the observation vector at time $t$ (denoted $\mathbf{x}_t$) at state $j$ of the HMM $\lambda$ is given by:

\[ b_j(\mathbf{x}_t) = \begin{cases} P(\mathbf{x}_t \mid q_t = j, \lambda) & \text{for discrete observable symbols,} \\ p(\mathbf{x}_t \mid q_t = j, \lambda) & \text{for continuous observable data.} \end{cases} \tag{2.1} \]


Figure 2.1: An example of HMM (a fully connected 3-state HMM with states $q_1$, $q_2$, $q_3$ and transition probabilities $a_{ij}$)

The pdf is defined as
\[ p(x) = \frac{\partial P(x)}{\partial x}, \tag{2.2} \]
where $P(x_1) = \Pr(x \le x_1)$ is the distribution function. So the probability of $x$ falling in the interval $(a, b]$ is
\[ \Pr(a < x \le b) = \int_a^b p(x)\,dx. \tag{2.3} \]

In a multi-stream HMM with K streams, the state output probability/pdf is represented by a factored pdf as follows:
\[ b_j(\mathbf{x}_t) = c_j \prod_{k=1}^{K} b_j^{(k)}(\mathbf{x}_t^{(k)})^{w_j^{(k)}}, \tag{2.4} \]
or equivalently in the log domain as
\[ \log b_j(\mathbf{x}_t) = \log c_j + \sum_{k=1}^{K} w_j^{(k)} \log b_j^{(k)}(\mathbf{x}_t^{(k)}), \tag{2.5} \]
where $\mathbf{x}_t^{(k)}$ is the feature vector of the kth stream; $b_j^{(k)}(\mathbf{x}_t^{(k)})$ is the observation probability of $\mathbf{x}_t^{(k)}$ in state $j$; $w_j^{(k)}$ is the weight of the kth stream in state $j$, with the constraint that
\[ \sum_{k=1}^{K} w_j^{(k)} = 1; \tag{2.6} \]


$c_j$ is the normalization factor that makes the right-hand side of Eqn. (2.4) a true probability density function. Theoretically, we should have
\[ \int_{\mathbf{x}_t^{(1)}} \int_{\mathbf{x}_t^{(2)}} \cdots \int_{\mathbf{x}_t^{(K)}} c_j \prod_{k=1}^{K} b_j^{(k)}(\mathbf{x}_t^{(k)})^{w_j^{(k)}} \, d\mathbf{x}_t^{(1)} \, d\mathbf{x}_t^{(2)} \cdots d\mathbf{x}_t^{(K)} = 1. \tag{2.7} \]
However, that would render the problem intractable. In this thesis, we will not pursue this requirement; instead, we will treat $\log c_j$ as a bias for each state, and $\log b_j(\mathbf{x}_t)$ should be treated more as a likelihood score than a strict probability term.

Furthermore, with the following variable substitutions:
\[ \mathbf{w}_j = [w_j^{(1)}, w_j^{(2)}, \ldots, w_j^{(K)}]', \tag{2.8} \]
\[ \mathbf{z}_{jt} = [\log b_j^{(1)}(\mathbf{x}_t^{(1)}), \log b_j^{(2)}(\mathbf{x}_t^{(2)}), \ldots, \log b_j^{(K)}(\mathbf{x}_t^{(K)})]', \tag{2.9} \]
\[ v_j = \log c_j, \tag{2.10} \]
Eqn. (2.5) can be expressed in vector form as
\[ \log b_j(\mathbf{x}_t) = \mathbf{w}_j' \mathbf{z}_{jt} + v_j. \tag{2.11} \]
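As a concrete illustration of Eqn. (2.11), the following minimal NumPy sketch computes the state log-likelihood of one frame as the weighted sum of hypothetical per-stream log-likelihoods plus the state bias; all numerical values are made up for illustration.

```python
import numpy as np

# Hypothetical per-stream log-likelihoods z_jt of one state j at one frame t
# (K = 4 streams), stream weights w_j, and bias v_j = log c_j.
z_jt = np.array([-12.3, -15.1, -9.8, -11.0])   # log b_j^(k)(x_t^(k)), k = 1..K
w_j  = np.array([1.2, 0.8, 1.1, 0.9])          # stream weights (sum to a constant)
v_j  = -0.5                                    # state bias, log c_j

# Eqn. (2.11): log b_j(x_t) = w_j' z_jt + v_j
log_b = w_j @ z_jt + v_j
print(log_b)
```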

2.3 Stream weight estimation methods

In this section, we will investigate some current approaches to estimating stream

weights. Stream weight estimation methods can be categorized as discriminative

estimation methods, Bayesian estimation methods and heuristics. To describe

them formally, we first define:

$X$: an observation vector sequence,
$\Lambda$: a set of HMM models,
$\lambda$: an HMM model,
$\mathbf{x}_t$: an observation vector at time $t$, where $X = [\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_T]$,
$\mathbf{x}_t^{(k)}$: the kth stream of the observation vector $\mathbf{x}_t$, where $\mathbf{x}_t = [\mathbf{x}_t^{(1)}, \mathbf{x}_t^{(2)}, \ldots, \mathbf{x}_t^{(K)}]$,
$K$: the total number of streams in the HMM,
$T$: the total number of frames in the observation vector sequence $X$,
$J$: the total number of states in all the HMMs in $\Lambda$,
$w_j^{(k)}$: the stream weight for the kth stream of state $j$,
$\mathbf{w}_j$: the stream weight vector for state $j$, where $\mathbf{w}_j = [w_j^{(1)}, w_j^{(2)}, \ldots, w_j^{(K)}]'$,
$y_t$: the truth state of the observation vector $\mathbf{x}_t$ at time $t$.

In automatic speech recognition (ASR), a speech signal (sound wave) is usually segmented into frames, and features are extracted from each frame to form an

observation vector. Thus, a word consists of a number of frames, and an utterance

consists of a number of words.

2.3.1 Discriminative estimation methods

Minimum classification error (MCE) training

The objective of MCE training is to minimize the classification error in the train-

ing data. The MCE criterion has previously been applied to finding stream weights in

the audio-visual speech recognition problem by Potamianos et al. [7] and Naka-

mura et al. [8]. Cerisara et al. [6] also tried to optimize the sub-band weights

with this criterion in the multi-band speech recognition problem.

In MCE training, a discriminant function d(X) is defined in terms of the clas-

sification error using the N -best hypotheses given the training data X. Mathe-

matically,

\[ d(X) = \log \left\{ \frac{1}{N} \sum_{n=1}^{N} e^{\eta L_n(X)} \right\}^{1/\eta} - L_c(X), \tag{2.12} \]

where Ln(X) is the log likelihood of X in the nth-best hypothesis, and Lc(X) is

the log likelihood of X in the truth. Thus, d(X) measures the distance between a

smoothed mis-recognized “hypothesis” (denoted by the first term of d(X)) and the

truth. η is the parameter to control the contribution of the competing hypotheses.

When η tends to infinity, the log likelihood of the smoothed “hypothesis” is

dominated by the log likelihood of the best hypothesis.

d(X) is turned into a soft error count with value between 0 and 1 by using the

sigmoid function:

\[ \mathrm{Sigmoid}(d) = \frac{1}{1 + e^{-\gamma(d+\theta)}}, \tag{2.13} \]

where γ controls the slope of the sigmoid function while θ determines its position

along the axis of abscissa. Figures 2.2 and 2.3 illustrate the change in shape and

position of the sigmoid function as γ and θ change. The slope and position of the

sigmoid function together control the trainable region in the MCE training. If the

7

Page 18: DISCRIMINATIVE TRAINING OF STREAM WEIGHTS IN A ...mak/PG-Thesis/mphil-thesis-benny.pdfIn this thesis, we are only interested in using hidden Markov model (HMM) for automatic speech

Figure 2.2: Change in shape of sigmoid function as γ changes (γ = 0.3 and γ = 1.0)

Figure 2.3: Change in position of sigmoid function as θ changes (θ = 0, θ = −10 and θ = 10)


slope is too steep, the trainable region will be very small. If the position is too far away from most of the $d(X)$ values, there will be too few training samples in the trainable region. Both of these cases will result in poor training due to insufficient effective training data. On the other hand, if the slope is too flat, MCE training will be very slow. Liu et al. give a detailed explanation of how to choose these parameters appropriately for speaker recognition [28].

The expected misclassification error is given by
\[ \varepsilon = E_X[\mathrm{Sigmoid}(d(X))]. \tag{2.14} \]
Since the problem is non-linear due to the use of the sigmoid function, it is usually minimized by the generalized probabilistic descent (GPD) algorithm. In the ith iteration of the GPD algorithm, the stream weight $w_j^{(k)}$ is updated as follows:
\[ w_j^{(k)}(i+1) = w_j^{(k)}(i) - \epsilon_i \frac{\partial \varepsilon}{\partial w_j^{(k)}}, \tag{2.15} \]
where $w_j^{(k)}(i)$ is the value of $w_j^{(k)}$ in the ith iteration, and $\epsilon_i > 0$ is the learning rate in the ith iteration. To ensure convergence of $w_j^{(k)}$, two constraints are required: $\sum_{i=1}^{\infty} \epsilon_i = \infty$ and $\sum_{i=1}^{\infty} \epsilon_i^2 < \infty$.
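The following short Python sketch illustrates the soft error count of Eqn. (2.13) and a single GPD update of Eqn. (2.15); the gradient value is a hypothetical placeholder, since computing it in practice requires the full model and the N-best hypotheses.

```python
import numpy as np

def sigmoid(d, gamma=1.0, theta=0.0):
    """Soft error count of Eqn. (2.13)."""
    return 1.0 / (1.0 + np.exp(-gamma * (d + theta)))

# One GPD step of Eqn. (2.15) for a single stream weight, using a hypothetical
# gradient value of the expected soft error with respect to that weight.
w = 0.5                       # current stream weight w_j^(k)(i)
grad_eps = -0.12              # hypothetical gradient d(epsilon)/d(w)
lr = 0.05                     # learning rate epsilon_i (must satisfy the GPD conditions)
w_new = w - lr * grad_eps     # w_j^(k)(i+1)
print(sigmoid(-2.0), w_new)
```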

The MCE training can minimize the word classification error which is the ob-

jective of the recognition task. However, there are several disadvantages. Firstly,

the training is tightly coupled with the recognition system. Secondly, MCE train-

ing performs corrective training, which means only incorrect data (hypotheses) are

used for training. Thirdly, there are several parameters to tune, such as the learn-

ing rate for the gradient descent and the sigmoid function parameters. The values

of these parameters will affect the convergence and the efficiency of the algorithm

and so they need to be tuned carefully. Lastly, the GPD algorithm does not

guarantee to give globally optimal stream weights.

Maximum entropy (MAXENT) estimation

To explain the concept of maximum entropy, we first consider what entropy is.

The term “entropy” in this thesis refers to the information entropy. Entropy is

a measure of the uncertainty associated with a random variable. For a random


variable D with N possible outcomes {d1, . . . , dN}, the entropy is defined as

\[ H(D) = -\sum_{i=1}^{N} p(d_i) \log p(d_i), \tag{2.16} \]

where $p(d_i)$ is the probability that $d_i$ occurs. When the entropy is 0, there is no uncertainty about the random variable.

Maximum entropy modeling is a method that makes the least biased inference

given certain information. In other words, entropy is maximized subject to the

constraints of the information. It is commonly used in natural language processing

[29].

Gravier et al. applied the maximum entropy criterion for estimating the stream weights [12]. It is equivalent to maximizing the posterior log probability of

a fixed alignment (state sequence) given the training vectors (see Appendix B).

The stream weights $w_j^{(k)}$ are chosen to maximize the cost function:
\[ \log P(Y \mid X) = \sum_{t=1}^{T} \left( \log b_{y_t}(\mathbf{x}_t) - \log \sum_{j=1}^{J} b_j(\mathbf{x}_t) \right), \tag{2.17} \]
where $y_t$ is the truth state for $\mathbf{x}_t$, $Y$ is the true state sequence $y_1, \ldots, y_T$, $J$ is the total number of states for all HMMs, and $\log b_j(\mathbf{x}_t) = \sum_{k=1}^{K} w_j^{(k)} \log b_j^{(k)}(\mathbf{x}_t^{(k)})$.

This method is fast and requires no parameter tuning. It is a convex optimization problem with a unique solution. Generic optimization techniques such as BFGS can be used to solve it. However, it minimizes frame classification errors, which may not reflect the actual word classification errors.
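For illustration, the sketch below maximizes the criterion of Eqn. (2.17) with global (state-independent) stream weights using SciPy's BFGS optimizer on randomly generated per-stream log-likelihoods; the synthetic data and the unconstrained treatment of the weights are simplifying assumptions, not part of the original method.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import logsumexp

# Hypothetical per-frame, per-state, per-stream log-likelihoods
# z[t, j, k] = log b_j^(k)(x_t^(k)), and the truth state y[t] of each frame.
rng = np.random.default_rng(0)
T, J, K = 50, 5, 4
z = rng.normal(-10.0, 2.0, size=(T, J, K))
y = rng.integers(0, J, size=T)

def neg_log_posterior(w):
    """Negative of the MAXENT criterion of Eqn. (2.17) with global stream weights."""
    scores = z @ w                                    # (T, J): log b_j(x_t)
    return -np.sum(scores[np.arange(T), y] - logsumexp(scores, axis=1))

res = minimize(neg_log_posterior, x0=np.full(K, 1.0 / K), method="BFGS")
print(res.x)   # estimated global stream weights (unconstrained in this sketch)
```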

Linear discriminant analysis (LDA)

LDA [30] is a method used in statistics and machine learning. It aims to find the

linear combination of features which best separates two or more classes of objects.

The resulting combination can be used as a linear classifier or for dimensionality

reduction.

Iwano et al. tried to use LDA to estimate stream weights in multi-band speech

recognition [31]. For each stream k, the frame-averaged log likelihoods of all word

strings are computed with all the word HMM models and all these string-and-

model-specific log likelihoods are plotted as the coordinate z(k). LDA is applied


to separate the correct (where the word string matches the word model) from the incorrect log likelihoods with the discriminant function:
\[ a_0 + \sum_{k=1}^{K} a_k z^{(k)} = 0, \tag{2.18} \]

and $a_k$ may be used as the stream weight for stream $k$. However, the problems are that the $a_k$'s may be negative and $\sum_{k=1}^{K} a_k$ may not equal 1. One solution is to reset all negative values to 0 and normalize the sum to 1:
\[ a_i' = \begin{cases} a_i & (a_i \ge 0), \\ 0 & \text{otherwise,} \end{cases} \tag{2.19} \]
\[ w^{(k)} = \frac{a_k'}{\sum_{i=1}^{K} a_i'}, \tag{2.20} \]

where w(k) is the global stream weight for stream k.
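A minimal sketch of the post-processing in Eqns. (2.19)-(2.20), assuming the LDA coefficients $a_k$ are already available; the numbers are hypothetical.

```python
import numpy as np

def lda_to_stream_weights(a):
    """Convert LDA discriminant coefficients a_k into stream weights following
    Eqns. (2.19)-(2.20): clip negative values to zero, then normalize to sum 1."""
    a = np.asarray(a, dtype=float)
    a_clipped = np.where(a >= 0.0, a, 0.0)
    return a_clipped / a_clipped.sum()

# Hypothetical LDA coefficients for K = 4 streams (the bias a_0 is not used here)
print(lda_to_stream_weights([0.8, -0.2, 0.5, 0.3]))   # -> [0.5, 0., 0.3125, 0.1875]
```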

LDA is a simple method for estimating stream weights and can achieve good results in multi-band speech recognition [31]. However, it is hard to apply LDA to estimate phoneme- or state-dependent stream weights, which are believed to perform better than global stream weights.

2.3.2 Bayesian estimation methods

Bayesian estimation methods are methods based on Bayes' Rule [32]. Bayes' Rule is a result in probability theory, and is defined as:

\[ P(g_j \mid x) = \frac{P(x \mid g_j) P(g_j)}{P(x)} = \frac{P(x \mid g_j) P(g_j)}{\sum_{i=1}^{J} P(x \mid g_i) P(g_i)}, \tag{2.21} \]

where $P(g_j)$ is the prior probability that the $j$th class ($g_j$) occurs, $P(x)$ is the probability that the feature $x$ is observed, $P(x \mid g_j)$ is the likelihood of observing feature $x$ given that it belongs to class $g_j$, $P(g_j \mid x)$ is the posterior probability that class $g_j$ occurs given that we observe the feature $x$, and $J$ is the total number of classes.

The Bayes decision rule can be used to determine the class of an input feature

in a classification problem. For a two-class problem with input feature x, the

Bayes decision rule is:

\[ P(g_1 \mid x) \underset{g_2}{\overset{g_1}{\gtrless}} P(g_2 \mid x) \tag{2.22} \]


\[ \Rightarrow \frac{P(x \mid g_1) P(g_1)}{P(x)} \underset{g_2}{\overset{g_1}{\gtrless}} \frac{P(x \mid g_2) P(g_2)}{P(x)} \tag{2.23} \]
\[ \Rightarrow P(x \mid g_1) P(g_1) \underset{g_2}{\overset{g_1}{\gtrless}} P(x \mid g_2) P(g_2). \tag{2.24} \]

In general, the Bayes rule will select the class with the highest value of P (x|gj)P (gj)

among all the classes. Figure 2.4 is an illustration of using the Bayes decision

rule to solve a two-class classification problem; the dotted line is the decision

boundary. The Bayes classification error for the two-class problem is

\[
\begin{aligned}
\varepsilon_{\mathrm{Bayes}} &= E[\mathrm{error} \mid x] && (2.25) \\
&= \int \min[P(g_1 \mid x), P(g_2 \mid x)]\, p(x)\, dx && (2.26) \\
&= \int \min[P(g_1) P(x \mid g_1), P(g_2) P(x \mid g_2)]\, dx && (2.27) \\
&= P(g_1) \int_{R_2} P(x \mid g_1)\, dx + P(g_2) \int_{R_1} P(x \mid g_2)\, dx && (2.28) \\
&= P(g_1) P(\varepsilon_1 \mid x) + P(g_2) P(\varepsilon_2 \mid x), && (2.29)
\end{aligned}
\]

where Rj is the decision region for class gj (region in the feature space that will

give class gj as the decision under the Bayes decision rule).
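As a tiny numerical illustration of the decision rule in Eqn. (2.24), with made-up likelihoods and priors:

```python
# Two-class Bayes decision rule, Eqn. (2.24); all values are hypothetical.
p_x_given_g1, p_g1 = 0.30, 0.40   # p(x|g1), P(g1)
p_x_given_g2, p_g2 = 0.15, 0.60   # p(x|g2), P(g2)

score_g1 = p_x_given_g1 * p_g1    # 0.12
score_g2 = p_x_given_g2 * p_g2    # 0.09
decision = "g1" if score_g1 > score_g2 else "g2"
print(decision)                   # -> g1
```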

Optimal stream weight estimation by minimizing estimation error

Potamianos et al. studied the stream weight optimization problem from the perspective of the Bayes classification error [13]. They suggest that there are usually

some estimation or modeling errors once we choose a parametric distribution

(such as Gaussian) to model the distribution of data. Therefore, the actual

classification error εtotal is the sum of the Bayes classification error εBayes and the

estimation/modeling error εestimation. It is difficult to obtain the minimum Bayes

error unless we can find out the actual decision boundary of the true distribution

of the two classes. Without further knowledge of the true distribution, they try to

find the stream weights that minimize the estimation error with the assumption

that the increase in Bayes error by using these stream weights is small.

Mathematically, the actual classification error is

\[ \varepsilon_{\mathrm{total}} = \varepsilon_{\mathrm{Bayes}} + \varepsilon_{\mathrm{estimation}}, \tag{2.30} \]


Figure 2.4: Bayes classification error in a 2-class classification problem

and εestimation is assumed to follow a normal distribution N(ε; 0, σ2).

To evaluate the estimation error $\varepsilon_{\mathrm{estimation}}$, the estimation error for class $g_i$ and stream $k$ is defined as
\[ \varepsilon_{ik} = P(g_i \mid x^{(k)}, \lambda^{(k)}) - P(g_i \mid x^{(k)}), \tag{2.31} \]
where $\lambda^{(k)}$ is the model for the distribution in stream $k$, and $P(g_i \mid x^{(k)}, \lambda^{(k)})$ is the estimated value of the true distribution $P(g_i \mid x^{(k)})$. For simplicity, it is assumed that $\varepsilon_{ik} \sim N(\varepsilon; 0, \sigma_{ik}^2)$.

The corresponding Bayes decision rule for the two-stream two-class problem is
\[ P(g_1 \mid \mathbf{x}, \lambda) \underset{g_2}{\overset{g_1}{\gtrless}} P(g_2 \mid \mathbf{x}, \lambda) \tag{2.32} \]
\[ \Rightarrow \prod_{k=1}^{2} \left[ P(g_1 \mid x^{(k)}, \lambda^{(k)}) \right]^{w^{(k)}} \underset{g_2}{\overset{g_1}{\gtrless}} \prod_{k=1}^{2} \left[ P(g_2 \mid x^{(k)}, \lambda^{(k)}) \right]^{w^{(k)}} \tag{2.33} \]
\[ \Rightarrow \prod_{k=1}^{2} \left[ P(g_1 \mid x^{(k)}) + \varepsilon_{1k} \right]^{w^{(k)}} \underset{g_2}{\overset{g_1}{\gtrless}} \prod_{k=1}^{2} \left[ P(g_2 \mid x^{(k)}) + \varepsilon_{2k} \right]^{w^{(k)}} \tag{2.34} \]
\[ \Rightarrow \prod_{k=1}^{2} \left[ p(x^{(k)} \mid g_1) P(g_1) + \varepsilon_{1k} P(x^{(k)}) \right]^{w^{(k)}} \underset{g_2}{\overset{g_1}{\gtrless}} \prod_{k=1}^{2} \left[ p(x^{(k)} \mid g_2) P(g_2) + \varepsilon_{2k} P(x^{(k)}) \right]^{w^{(k)}}, \tag{2.35} \]
where $\lambda$ is the multi-stream model for the distribution of the data and $w^{(k)}$ is the global stream weight for stream $k$ with the constraint $\sum_{k=1}^{K} w^{(k)} = 1$.

With some assumptions, it can be shown that the estimation error is given by
\[ \varepsilon_{\mathrm{estimation}} \approx 2 \left[ p(x^{(1)} \mid g_1) P(g_1) \right]^{w^{(1)}} \left[ p(x^{(2)} \mid g_1) P(g_1) \right]^{w^{(2)}} \left[ w^{(1)}(\varepsilon_{11} - \varepsilon_{21}) + w^{(2)}(\varepsilon_{12} - \varepsilon_{22}) \right], \tag{2.36} \]
and its variance is
\[ \sigma^2 \approx 4 [P(g_1)]^2 \left[ p(x^{(1)} \mid g_1) \right]^{2w^{(1)}} \left[ p(x^{(2)} \mid g_1) \right]^{2w^{(2)}} \sum_{i=1}^{2} \sum_{k=1}^{2} (w^{(k)})^2 \sigma_{ik}^2, \tag{2.37} \]
where $\sigma_{ik}^2$ is the variance of the estimation error $\varepsilon_{ik}$.

By minimizing $\sigma^2$ we can minimize the estimation error. Note that the value of $\sigma^2$ depends on the stream pdfs $p(x^{(k)} \mid g_1)$ and the stream estimation error variances $\sum_{i=1}^{2} \sigma_{ik}^2$. Further investigation was done by holding one of the two conditions fixed.

Case 1: Assume $p(x^{(1)} \mid g_1) \approx p(x^{(2)} \mid g_1)$. Then
\[ \sigma^2 \approx 4 [P(g_1)]^2 \left[ p(x^{(1)} \mid g_1) \right]^{2(w^{(1)} + w^{(2)})} \sum_{i=1}^{2} \sum_{k=1}^{2} (w^{(k)})^2 \sigma_{ik}^2. \tag{2.38} \]
Minimizing $\sigma^2$ by setting $\partial \sigma^2 / \partial w^{(1)} = 0$ (with $w^{(2)} = 1 - w^{(1)}$) gives
\[ 2 w^{(1)} \sum_{i=1}^{2} \sigma_{i1}^2 - 2 (1 - w^{(1)}) \sum_{i=1}^{2} \sigma_{i2}^2 = 0, \tag{2.39} \]
\[ \frac{w^{(1)}}{w^{(2)}} = \frac{\sum_{i=1}^{2} \sigma_{i2}^2}{\sum_{i=1}^{2} \sigma_{i1}^2}. \tag{2.40} \]

Case 2: Assume $\sum_{i=1}^{2} \sigma_{i1}^2 = \sum_{i=1}^{2} \sigma_{i2}^2$. Then
\[ \sigma^2 \approx 4 [P(g_1)]^2 \left[ p(x^{(1)} \mid g_1) \right]^{2w^{(1)}} \left[ p(x^{(2)} \mid g_1) \right]^{2w^{(2)}} \left( \sum_{k=1}^{2} (w^{(k)})^2 \right) \left( \sum_{i=1}^{2} \sigma_{i1}^2 \right). \tag{2.41} \]
Minimizing $\sigma^2$ by setting $\partial \sigma^2 / \partial w^{(1)} = 0$, it can be shown that an approximation for the ratio of the stream weights is given by:
\[ \frac{w^{(1)}}{w^{(2)}} \approx \frac{p(x^{(2)} \mid g_1)}{p(x^{(1)} \mid g_1)} \quad \text{for } -1.5 \le \frac{p(x^{(1)} \mid g_1)}{p(x^{(2)} \mid g_1)} \le 1.5. \tag{2.42} \]

Assuming Eqns. (2.40) and (2.42) still hold without the assumptions made in the two cases, we have
\[ \frac{w^{(1)}}{w^{(2)}} \approx \frac{p(x^{(2)} \mid g_1) \sum_{i=1}^{2} \sigma_{i2}^2}{p(x^{(1)} \mid g_1) \sum_{i=1}^{2} \sigma_{i1}^2}. \tag{2.43} \]

The challenge is how to obtain the stream pdf for a certain class, i.e. $p(x^{(k)} \mid g_1)$, and the pdf estimation error variances for each stream, i.e. $\sum_{i=1}^{2} \sigma_{ik}^2$.

An approximation of the stream pdf ratio is also proposed in the paper:
\[ \frac{p(x^{(2)} \mid g_1)}{p(x^{(1)} \mid g_1)} \approx \frac{p(x^{(2)} \mid g_1, \lambda^{(2)})}{p(x^{(1)} \mid g_1, \lambda^{(1)})}, \tag{2.44} \]
which is proportional to the ratio of the single-stream classification errors. Thus we have
\[ \frac{w^{(1)}}{w^{(2)}} \propto \frac{\text{classification error for stream 2}}{\text{classification error for stream 1}}. \tag{2.45} \]

This matches our intuition that a stream with a higher classification error should have a lower stream weight. This Bayesian method is able to give optimal stream weights under its assumptions, but further investigation is needed to determine the ratio of the stream pdf estimation error variances.

2.3.3 Heuristics

Stream weights can be estimated by various heuristics. The stream weights found by these methods are not proven to be optimal in any sense, but they give satisfactory results in their specific problems.

Accuracy weighting and SNR weighting

Bourlard et al. proposed accuracy weighting, which weighs each stream by the recognition accuracy of the corresponding single stream [5]:
\[ w^{(k)} \propto \mathrm{Acc}^{(k)}, \tag{2.46} \]


where w(k) is the global stream weight for stream k and Acc(k) is the recognition

accuracy of stream k.

On the other hand, Bourlard et al. [5] and Okawa et al. [10] both tried to weigh the streams by the signal-to-noise ratio (SNR) of the single streams in multi-band speech recognition:
\[ w^{(k)} \propto \mathrm{SNR}^{(k)}, \tag{2.47} \]

where SNR(k) is the estimated signal-to-noise ratio of stream k.

Weighting according to similarity to true state sequence

Li et al. proposed to weigh streams according to the differences between the

posterior probabilities of the states in the true state sequence and the other

states of each single stream [33] which is defined as:

\[ d_j^{(k)}(t) = \left( P^{(\mathrm{true})}(s_j^{(k)} \mid \mathbf{x}_t^{(k)}) - P(s_j^{(k)} \mid \mathbf{x}_t^{(k)}) \right)^2, \tag{2.48} \]
where $\mathbf{x}_t^{(k)}$ is the observation vector of stream $k$ at time $t$, $s_j^{(k)}$ is the kth stream of state $j$,
\[ P^{(\mathrm{true})}(s_j^{(k)} \mid \mathbf{x}_t^{(k)}) = \begin{cases} 1 & \text{if state } j \text{ is the truth state at time } t \text{ of stream } k, \\ 0 & \text{otherwise,} \end{cases} \tag{2.49} \]
and $P(s_j^{(k)} \mid \mathbf{x}_t^{(k)})$ is the posterior probability of stream $k$ of state $j$ at time $t$.
The combined difference over all training data is
\[ d_j^{(k)} = \sum_{t=1}^{T} d_j^{(k)}(t). \tag{2.50} \]
The stream weight for stream $k$ of state $j$ is defined as
\[ w_j^{(k)} \propto e^{-d_j^{(k)}/C}, \tag{2.51} \]
where $C$ is a parameter controlling the emphasis of the effect of $d_j^{(k)}$ on the stream weight. A smaller $C$ will give a larger weight for a small $d_j^{(k)}$, i.e. the more reliable stream.
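A minimal NumPy sketch of Eqns. (2.48)-(2.51), assuming the per-stream state posteriors are already available as an array; the shapes, values and the choice of C are hypothetical.

```python
import numpy as np

def similarity_stream_weights(post, truth, C=1.0):
    """Stream weights from Eqns. (2.48)-(2.51).
    post[k, t, j] : posterior P(s_j^(k) | x_t^(k)) of state j for stream k at time t
    truth[t]      : index of the truth state at time t
    Returns unnormalized state-dependent weights w[k, j] ~ exp(-d_j^(k) / C)."""
    K, T, J = post.shape
    target = np.zeros((T, J))
    target[np.arange(T), truth] = 1.0                     # P^(true), Eqn. (2.49)
    d = ((target[None, :, :] - post) ** 2).sum(axis=1)    # d_j^(k), Eqns. (2.48)-(2.50)
    return np.exp(-d / C)                                 # Eqn. (2.51), up to normalization

# Hypothetical posteriors for K = 2 streams, T = 3 frames, J = 2 states
post = np.array([[[0.9, 0.1], [0.8, 0.2], [0.3, 0.7]],
                 [[0.6, 0.4], [0.5, 0.5], [0.4, 0.6]]])
print(similarity_stream_weights(post, truth=np.array([0, 0, 1])))
```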


Conditional entropy weighting and mutual information weighting

Okawa et al. applied the conditional entropy weighting to find the stream weights

in multi-band speech recognition [10]. Mathematically, the conditional entropy

of the whole state space given a sub-band observation vector is

\[ H(S \mid \mathbf{x}_t^{(k)}) = -\sum_{j=1}^{J} P(s_j \mid \mathbf{x}_t^{(k)}) \log P(s_j \mid \mathbf{x}_t^{(k)}), \tag{2.52} \]
and
\[ P(s_j \mid \mathbf{x}_t^{(k)}) \approx \frac{p(\mathbf{x}_t^{(k)} \mid s_j)}{\sum_{i=1}^{J} p(\mathbf{x}_t^{(k)} \mid s_i)}, \tag{2.53} \]
provided that all prior probabilities $P(s_j)$ are the same. $S$ denotes the whole state space, $s_j$ is state $j$ in $S$, $J$ is the number of states in $S$, and $\mathbf{x}_t^{(k)}$ is the observation vector at time $t$ in sub-band (stream) $k$. Since entropy measures uncertainty, the sub-band weights are suggested to be
\[ w^{(k)} \propto \frac{1}{\sum_{t=1}^{T} H(S \mid \mathbf{x}_t^{(k)})}. \tag{2.54} \]
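The following sketch illustrates the conditional-entropy weighting of Eqns. (2.52)-(2.54), assuming equal state priors and hypothetical per-state likelihoods; the final normalization so that the weights sum to one is an added convention, not part of the original proportionality.

```python
import numpy as np

def entropy_stream_weights(lik):
    """Conditional-entropy weighting, Eqns. (2.52)-(2.54).
    lik[k, t, j] : likelihood p(x_t^(k) | s_j) of state j for stream k at time t.
    Returns global weights w^(k) proportional to 1 / sum_t H(S | x_t^(k))."""
    post = lik / lik.sum(axis=2, keepdims=True)    # Eqn. (2.53), equal priors assumed
    H = -(post * np.log(post)).sum(axis=2)         # H(S | x_t^(k)), Eqn. (2.52)
    w = 1.0 / H.sum(axis=1)                        # Eqn. (2.54), up to a constant
    return w / w.sum()                             # normalize so weights sum to 1

# Hypothetical likelihoods for K = 2 streams, T = 3 frames, J = 3 states
lik = np.abs(np.random.default_rng(1).normal(1.0, 0.3, size=(2, 3, 3)))
print(entropy_stream_weights(lik))
```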

On the other hand, Okawa et al. proposed an extension to conditional entropy weighting: mutual information weighting [11]. Intuitively, a sub-band with a higher capability of reducing the state entropy after receiving an observation sequence deserves more weight. Mathematically, the state entropy given the observation sequence is
\[ H(S \mid X^{(k)}) = -\sum_{t=1}^{T} \sum_{j=1}^{J} P(s_j; \mathbf{x}_t^{(k)}) \log P(s_j \mid \mathbf{x}_t^{(k)}), \tag{2.55} \]
where $X^{(k)} = \mathbf{x}_1^{(k)}, \ldots, \mathbf{x}_T^{(k)}$ is the observation sequence of sub-band $k$. The state entropy without the observation sequence is
\[ H(S) = -\sum_{j=1}^{J} P(s_j) \log P(s_j), \tag{2.56} \]
and the difference between the state entropies is
\[ I(S; X^{(k)}) = H(S) - H(S \mid X^{(k)}). \tag{2.57} \]
Thus, the stream weight for stream $k$ is estimated as
\[ w^{(k)} \propto I(S; X^{(k)}). \tag{2.58} \]


Likelihood ratio maximization and likelihood value normalization

Tamura et al. proposed the likelihood ratio maximization [9] and likelihood value

normalization [34] to estimate the stream weights. Likelihood ratio maximization,

as its name suggests, tries to maximize the likelihood ratio of the first hypothesis to other

hypotheses. Let Xt be the output word from a decoder at time t and Xm be the

mth word in the dictionary, where $M$ is the vocabulary size. The stream weights $\{w_{X_m}^{(k)}\}$ are estimated to maximize
\[ \sum_{t=1}^{T} \sum_{m=1}^{M} \left\{ \log b_{X_t}(\mathbf{x}_t) - \log b_{X_m}(\mathbf{x}_t) \right\}^2, \tag{2.59} \]
where $b_{X_m}(\mathbf{x}_t)$ is the likelihood of $\mathbf{x}_t$ for the word model of the word $X_m$, with $b_{X_m}(\mathbf{x}_t) = w_{X_m}^{(1)} b_{X_m}^{(1)}(\mathbf{x}_t^{(1)}) + w_{X_m}^{(2)} b_{X_m}^{(2)}(\mathbf{x}_t^{(2)})$ for a 2-stream model. It can be shown that the variation of $w_{X_r}^{(1)}$, denoted by $\Delta w_{X_r}^{(1)}$, is given by
\[ \Delta w_{X_r}^{(1)} = \frac{A + B}{\sum_{t=1, X_t = X_r}^{T} M d_{X_r}(\mathbf{x}_t) + \sum_{t=1, X_t \neq X_r}^{T} d_{X_r}(\mathbf{x}_t)}, \tag{2.60} \]
where
\[ A = \sum_{t=1, X_t = X_r}^{T} \left\{ M \log b_{X_r}(\mathbf{x}_t) - \sum_{m=1}^{M} \log b_{X_m}(\mathbf{x}_t) \right\}, \tag{2.61} \]
\[ B = \sum_{t=1, X_t \neq X_r}^{T} \left\{ \log b_{X_r}(\mathbf{x}_t) - \log b_{X_t}(\mathbf{x}_t) \right\}, \tag{2.62} \]
and $d_{X_r}(\mathbf{x}_t) = \log b_{X_r}^{(1)}(\mathbf{x}_t^{(1)}) - \log b_{X_r}^{(2)}(\mathbf{x}_t^{(2)})$, where $b_{X_r}^{(k)}(\mathbf{x}_t^{(k)})$ is the likelihood of stream $k$ of $\mathbf{x}_t$ for the word model of $X_r$.

Likelihood value normalization aims to equalize the mean values of the log likelihood for all HMMs. The mean value of the log likelihoods for stream 1 of the HMM of word $X_r$ is
\[ \mu_{X_r}^{(1)} = \frac{1}{T} \sum_{t=1}^{T} \log b_{X_r}^{(1)}(\mathbf{x}_t^{(1)}). \tag{2.63} \]
Thus, the stream weight of word $X_r$ for stream 1 is
\[ w_{X_r}^{(1)} = \frac{\frac{1}{M} \sum_{m=1}^{M} \mu_{X_m}^{(1)}}{\mu_{X_r}^{(1)}}. \tag{2.64} \]
The mean values for all HMMs are then normalized to the global mean $\frac{1}{M} \sum_{m=1}^{M} \mu_{X_m}^{(1)}$.

The above two methods estimate the stream weight for only one stream. The second stream weight is estimated as
\[ w_{X_r}^{(2)} = 1 - w_{X_r}^{(1)}. \tag{2.65} \]
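A minimal sketch of likelihood value normalization, Eqns. (2.63)-(2.65), assuming the per-word mean log-likelihoods of stream 1 have already been computed; the values are hypothetical.

```python
import numpy as np

def likelihood_normalization_weights(mu_stream1):
    """Likelihood value normalization, Eqns. (2.63)-(2.65).
    mu_stream1[m] : mean log-likelihood of stream 1 under the HMM of word X_m.
    Returns (w1, w2), the per-word weights of stream 1 and stream 2."""
    mu = np.asarray(mu_stream1, dtype=float)
    w1 = mu.mean() / mu                      # Eqn. (2.64)
    w2 = 1.0 - w1                            # Eqn. (2.65)
    return w1, w2

# Hypothetical mean stream-1 log-likelihoods for a 4-word vocabulary
print(likelihood_normalization_weights([-40.0, -55.0, -48.0, -52.0]))
```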

2.4 Discriminative training as a constrained optimization problem in metric learning

Constrained optimization has been widely used in many research fields, and recently it has been used in the area of metric learning. A metric (or distance function) is a function which defines a distance between elements of a set, and learning the metric is an important problem.

Frome et al. formulated the metric learning of the image retrieval and classifi-

cation problem as a constrained optimization problem [35]. In visual recognition,

one is given some exemplar images (“focal images”) in the training set, and for

each focal image some positive examples (in the same category as the exemplar)

and negative examples (from all other categories). The problem is to learn a

metric to model the perceptual distance between the focal image and any other

image. The principle to train the distance function is to have smaller perceptual

distances between the focal image and the positive examples than between the

focal image and the negative examples. The metric can then be used for image

retrieval and classification.

The remaining problem is how to define the distance function between the

focal image and a particular image. The paper uses patch-based features to capture the similarity of that image to other images. The patches used are centered at some edge points sampled from the image. The geometric blur features [36] are used as the shape patch-based features; two different scales of geometric blur features are used. The color patch-based features are computed with a color histogram for a patch. Therefore, there are N = 3 different types of features. For each type of feature, at most K0 (400 in the paper) patches are selected, resulting in a maximum of K = N × K0 features for an image. The elementary distance function for the kth patch of the focal image F


to another image I is then defined as

\[ d_F^{(k)}(F, I) = \min_{\mathbf{x}_I^{(k_1)} \in \{S_F^{(k)}\}} \sqrt{\| \mathbf{x}_F^{(k)} - \mathbf{x}_I^{(k_1)} \|^2}, \tag{2.66} \]
where $\mathbf{x}_I^{(k)}$ is the $k$th patch feature of image $I$ and $S_F^{(k)}$ is the feature type that $\mathbf{x}_F^{(k)}$ belongs to (e.g. geometric blur features of the same scale). $\mathbf{x}_I^{(k_1)} \in \{S_F^{(k)}\}$ means any patch feature in image $I$ that belongs to the same feature type as the $k$th patch feature of the focal image $F$. The distance is defined so as to find the perceptual distance between the most similar patch features of the two images.

Since there are at most K features for a focal image, the way to combine them into a distance function is to combine the elementary distance functions linearly

with some weights:

\[ D_F(F, I) = \sum_{k=1}^{K} w_F^{(k)} d_F^{(k)}(F, I) = \langle \mathbf{w}_F \cdot \mathbf{d}_F(F, I) \rangle. \tag{2.67} \]

If we have some image Ii which is more similar to the focal image F than another

image Ij, ideally

\[ D_F(F, I_j) > D_F(F, I_i). \tag{2.68} \]

The weights $\mathbf{w}_F$ can be estimated using the following maximal margin formulation:
\[ \arg\min_{\mathbf{w}_F, \boldsymbol{\xi}} \; \frac{1}{2} \|\mathbf{w}_F\|^2 + C \sum_{ij} \xi_{ij} \tag{2.69} \]
\[ \text{s.t.:} \quad \forall (i, j) \in T_F, \quad \langle \mathbf{w}_F \cdot (\mathbf{d}_F(F, I_j) - \mathbf{d}_F(F, I_i)) \rangle \ge 1 - \xi_{ij}, \tag{2.70} \]
\[ \xi_{ij} \ge 0, \tag{2.71} \]
\[ w_F^{(k)} \ge 0, \tag{2.72} \]

where ξij is the slack variable for the constraint relating images Ii and Ij, C ≥ 0 is

the parameter that controls the importance of the slack variables in the objective

function and TF is the set containing the pairs of images satisfying Eqn.(2.68).

From this formulation, we can see that the estimation of the weights in the

distance function can be transformed into a constrained optimization problem

given the set of constraints that we want to enforce. The linear weights in the distance function are similar to the stream weights of the HMM, but the elementary distance functions are non-negative, whereas in our case they are replaced by the discriminant function between two recognition hypotheses, which can be both positive and negative. A similar formulation for estimating the stream weights is presented in Chapter 3.

2.5 Conclusion

Unlike many other HMM parameters, stream weights cannot be estimated using the maximum-likelihood approach, as it would simply give all the weight to the most probable stream. Several methods have been proposed to estimate stream weights. Stream weights estimated using heuristics cannot be proved optimal.

Bayesian methods are developed to find the optimal stream weights in terms

of Bayes classification error, but they make many assumptions that have not

been fully justified, such as using the normal distribution for the estimation er-

rors. Various discriminative methods have their deficiencies. Estimation method

based on LDA can only estimate global stream weights. Maximum entropy can

achieve the optimal stream weight in the sense of information theory, but it deals

with each frame independently, which makes it hard to minimize the word classi-

fication error. MCE training is the best among the above discriminative methods

as it can minimize the word classification error and estimate the various types

of stream weights (global/phoneme/state-dependent stream weights), but it is

tightly coupled with the recognition system. MCE training also suffers from the

problem of corrective training, and some system parameters have to be tuned for

the convergence of the algorithm. In addition, the solution obtained by MCE

training is just a local optimum. A new framework is needed to tackle the above

problems.


CHAPTER 3

DISCRIMINATIVE TRAINING OF STREAM WEIGHTS IN A MULTI-STREAM HMM AS A LINEAR PROGRAMMING PROBLEM

3.1 Introduction

The goal of this chapter is to present the formulation of our stream weight esti-

mation method. Several approaches for estimating the stream weights have been

described in Section 2.3. Some of them have limited applications, such as the signal-to-noise ratio (SNR) weighting for noisy environments. Some methods are

based on heuristics and thus optimality is usually not achieved. Discriminative

methods using criteria in classification errors (MCE), information theory (MAX-

ENT) and statistics (LDA) have been explored. MCE training is good at minimizing the word classification error, so it is the most common method for estimating stream weights. Unfortunately, the training procedure is usually incorporated into the recognition system for efficiency, so it has to be re-written for each different recognition system. One characteristic of MCE training is that it performs corrective training. We cannot conclude that corrective training is bad in MCE training, but making use of only the incorrect data is in general not good for an algorithm that aims to give a globally optimal solution. Some important system parameters in MCE training need to be chosen carefully, and the solution obtained by MCE training is only locally optimal. These problems have been addressed by the MAXENT method, but MAXENT optimizes the stream weights from the perspective of information theory and handles each frame independently. Our method is developed to overcome the shortcomings of MCE and MAXENT.

Our method [37] involves solving a linear programming (LP) problem. For

simplicity, we will formulate the estimation of stream weights and biases of a

multi-stream HMM by first considering the recognition correctness at the frame

level, and then extend the formulation to the word level.


3.2 With Complete Knowledge of the Feasible Region Based on Frame Recognition Correctness

Continuing with the multi-stream HMM formulation in Section 2.2, we have

\[ \mathbf{w}_j = [w_j^{(1)}, w_j^{(2)}, \ldots, w_j^{(K)}]', \tag{3.1} \]
\[ \mathbf{z}_{jt} = [\log b_j^{(1)}(\mathbf{x}_t^{(1)}), \log b_j^{(2)}(\mathbf{x}_t^{(2)}), \ldots, \log b_j^{(K)}(\mathbf{x}_t^{(K)})]', \tag{3.2} \]
\[ v_j = \log c_j, \tag{3.3} \]
where $\mathbf{w}_j$ is the stream weight vector for state $j$, $\mathbf{z}_{jt}$ is the vector of the log likelihoods of the feature $\mathbf{x}_t$ in state $j$, and $v_j$ is the bias. The log likelihood of $\mathbf{x}_t$ in state $j$ is
\[ \log b_j(\mathbf{x}_t) = \mathbf{w}_j' \mathbf{z}_{jt} + v_j. \tag{3.4} \]

3.2.1 The Basic Requirement

For each training frame $\mathbf{x}_t = [\mathbf{x}_t^{(1)'}, \ldots, \mathbf{x}_t^{(K)'}]'$ belonging to the truth state $y_t$, we would like its probability computed by the truth state to be greater than its probability computed by any other competing state. That is,
\[ \forall j \neq y_t, \quad \log b_{y_t}(\mathbf{x}_t) - \log b_j(\mathbf{x}_t) \ge 0 \tag{3.5} \]
\[ \Rightarrow (\mathbf{w}_{y_t}' \mathbf{z}_{y_t t} - \mathbf{w}_j' \mathbf{z}_{jt}) + (v_{y_t} - v_j) \ge 0. \tag{3.6} \]
To allow possible "noise" in the training data, we may relax the requirement by introducing slack variables $\xi_{tj} \ge 0$, and require
\[ (\mathbf{w}_{y_t}' \mathbf{z}_{y_t t} - \mathbf{w}_j' \mathbf{z}_{jt}) + (v_{y_t} - v_j) + \xi_{tj} \ge 0. \tag{3.7} \]

The slack variables basically implement the hinge loss function so that their

values for correctly recognized frames are zero, and their values for incorrectly

recognized frames are positive. From another point of view, the slack variables

are a measure of frame recognition errors.

3.2.2 LP Form

From Eqn. (3.7), we may formulate the estimation of the stream weight vector

$\mathbf{w}_j$ as a standard LP problem as follows:
\[ \min_{\mathbf{w}_j, v_j} \sum_t \sum_{j \neq y_t} \xi_{tj} \tag{3.8} \]
such that
\[ \forall t, \forall j \neq y_t, \quad (\mathbf{w}_{y_t}' \mathbf{z}_{y_t t} - \mathbf{w}_j' \mathbf{z}_{jt}) + (v_{y_t} - v_j) + \xi_{tj} \ge 0, \tag{3.9} \]
\[ \forall t, \forall j \neq y_t, \quad \xi_{tj} \ge 0, \tag{3.10} \]
\[ \forall j, \quad \sum_{k=1}^{K} w_j^{(k)} = \text{constant}, \tag{3.11} \]
\[ \forall j, \forall k, \quad w_j^{(k)} \ge 0. \tag{3.12} \]
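To make the formulation concrete, the following sketch builds and solves a simplified version of this LP with SciPy's linprog: the stream weights are tied globally (so the biases cancel out of Eqn. (3.9)), and the per-stream log-likelihoods and truth states are randomly generated stand-ins for real data. It is an illustration of the structure of the LP, not the experimental setup used later in this thesis.

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical per-frame, per-state, per-stream log-likelihoods
# z[t, j, k] = log b_j^(k)(x_t^(k)), and the truth state y[t] of each frame.
rng = np.random.default_rng(0)
T, J, K = 40, 3, 4
z = rng.normal(-10.0, 2.0, size=(T, J, K))
y = rng.integers(0, J, size=T)

# One inequality per (frame, competing state):
#   w'(z_{y_t,t} - z_{j,t}) + xi_{tj} >= 0   <=>   -w'(z_{y_t,t} - z_{j,t}) - xi_{tj} <= 0
pairs = [(t, j) for t in range(T) for j in range(J) if j != y[t]]
n_slack = len(pairs)

A_ub = np.zeros((n_slack, K + n_slack))
for i, (t, j) in enumerate(pairs):
    A_ub[i, :K] = -(z[t, y[t]] - z[t, j])    # weight part of the constraint (Eqn. 3.9)
    A_ub[i, K + i] = -1.0                    # slack variable xi_{tj}
b_ub = np.zeros(n_slack)

c = np.concatenate([np.zeros(K), np.ones(n_slack)])           # minimize sum of slacks (Eqn. 3.8)
A_eq = np.concatenate([np.ones(K), np.zeros(n_slack)])[None]  # sum of weights = constant (Eqn. 3.11)
b_eq = [float(K)]
bounds = [(0, None)] * (K + n_slack)                          # w >= 0 (Eqn. 3.12), xi >= 0 (Eqn. 3.10)

res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
print("estimated global stream weights:", res.x[:K])
```

With state-dependent weights, the variable vector simply grows to one weight vector and one bias per state (and the bias differences reappear in the constraints); the structure of the LP is unchanged.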

3.2.3 Discussion

The first thing one may notice is that we formulate the problem with state-dependent weights $\mathbf{w}_j$, state-dependent biases $v_j$, and frame-and-state-dependent errors $\xi_{tj}$. However, one may tie these parameters to provide various degrees of smoothing for the problem at hand.

One may compare this method with MCE training. They are similar in that they both try to minimize the classification error, but MCE uses corrective

training and this method does not. In MCE training the soft error count is

usually defined for the incorrect data and correct data will not contribute to the

error count. Nevertheless, whether a datum is correct or incorrect is actually

based on the current stream weight. In our method, we do not start our LP with

any particular stream weight, so we cannot determine which data are correct or

incorrect before running our algorithm. Therefore, all training frames are used.

The LP solver will find optimal weights that will increase the number of correctly

recognized frames, and reduce the log likelihood difference between the correct

state and competing states for the incorrectly recognized frames.

The final thing one may notice is that, in the basic setting, the sum of the stream weights is set to 1, whereas in the LP formulation we only require the sum to be a constant. This is just a matter of scaling, as we can divide Eqn. (3.9) by the constant value.


3.3 With Incomplete Knowledge of the Feasible Region Based on Word Recognition Correctness

The LP formulation in Eqns. (3.8)-(3.12) can easily be extended to consider recognition accuracy at the word level. For the ith instance $X_{mi}$ of the word $X_m$, $m = 1, \ldots, M$, where $M$ is the vocabulary size, we would like its probability given by the HMM $\lambda_m$ of the word $X_m$ to be greater than that of all its competing hypotheses. That is, if the function $T(\cdot)$ maps an instance of a word in an utterance to its time span, $\bar{\lambda}_m$ represents the models in a competing hypothesis, $\bar{y}_t$ represents the state in the competing hypothesis at time $t$, and we ignore the contribution of the likelihoods due to the transition probabilities, we have
\[ \forall i, \forall m, \quad \log P(X_{mi} \mid \lambda_m) - \log P(X_{mi} \mid \bar{\lambda}_m) \ge 0 \tag{3.13} \]
\[ \Rightarrow \sum_{t \in T(X_{mi})} \left[ (\mathbf{w}_{y_t}' \mathbf{z}_{y_t t} - \mathbf{w}_{\bar{y}_t}' \mathbf{z}_{\bar{y}_t t}) + (v_{y_t} - v_{\bar{y}_t}) \right] \ge 0. \tag{3.14} \]

The corresponding LP problem with R competing hypotheses is
\[ \min_{\mathbf{w}_j, v_j} \sum_i \sum_m \sum_{r=1}^{R} \xi_{mir} \tag{3.15} \]
such that
\[ \forall i, \forall m, \forall r, \quad \sum_{t \in T(X_{mi})} \left[ (\mathbf{w}_{y_t}' \mathbf{z}_{y_t t} - \mathbf{w}_{\bar{y}_t}' \mathbf{z}_{\bar{y}_t t}) + (v_{y_t} - v_{\bar{y}_t}) \right] + \xi_{mir} \ge 0, \tag{3.16} \]
\[ \forall i, \forall m, \forall r, \quad \xi_{mir} \ge 0, \tag{3.17} \]
\[ \forall j, \quad \sum_{k=1}^{K} w_j^{(k)} = \text{constant}, \tag{3.18} \]
\[ \forall j, \forall k, \quad w_j^{(k)} \ge 0. \tag{3.19} \]

3.3.1 Iterative LP Optimization

It is computationally and memory-wise infeasible to implement the formulation described in Eqns. (3.15)-(3.19) because there is an exponential number of competing hypotheses for a word instance $X_{mi}$. For a word instance consisting of $T_{X_{mi}}$ frames and an HMM set of $J$ states, there are $J^{T_{X_{mi}}}$ different combinations of state sequences, and each state sequence corresponds to a competing hypothesis. Practically, in speech recognition, we only generate a number of competing

hypotheses (N-best hypotheses [38]) using the current models. These competing

hypotheses form the feasible region in which the LP optimization gives a globally

optimal solution. As the number of competing hypotheses increases, we have

more knowledge of the feasible region and the actual objective function.

Unfortunately, the competing hypotheses generated by the current models may be different from those generated by the new set of models obtained from the LP solution. In other words, the feasible region changes when we change the stream weights and biases, and the LP solution may not be optimal in the new feasible region. Thus, unless we have complete knowledge of the feasible region, the globally optimal solution given by LP optimization is only correct with respect to the feasible region created by the current set of competing hypotheses. Other discriminative training methods such as MCE have the same problem, but they overcome it by moving slowly toward a local optimum in an iterative algorithm. Here, we investigate an iterative LP optimization approach: in each iteration, the competing hypotheses are generated by the current model, and the LP optimization is performed with additional constraints that control the amount of change in w and v:

\[
\forall j, \forall k, \quad \Delta w_j^{(k)} \leq \Delta w_{\max}; \qquad (3.20)
\]
\[
\forall j, \quad \Delta v_j \leq \Delta v_{\max}. \qquad (3.21)
\]

Then, a new model is obtained with the new w and v, which is then used to generate a new set of competing hypotheses, and the algorithm is repeated. By carefully controlling how much the parameters being optimized can change in each iteration (through ∆w_max and ∆v_max here), it is hoped that the globally optimal solution of each LP iteration will converge to a locally optimal solution.
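A minimal sketch of this loop is given below. It assumes a hypothetical N-best decoder generate_nbest() and a hypothetical LP routine solve_word_level_lp() for Eqns. (3.15 - 3.19); neither name comes from the thesis software. Only the way the extra box constraints (3.20) and (3.21) restrict each update is spelled out:

```python
import numpy as np

def iterative_lp(w, v, dw_max, dv_max, n_iter, generate_nbest, solve_word_level_lp):
    """Iterative word-level LP: re-decode, re-solve, with bounded parameter changes."""
    for _ in range(n_iter):
        hyps = generate_nbest(w, v)   # N-best competing hypotheses from the current model
        # the word-level LP of Eqns. (3.15 - 3.19) is solved together with the box
        # constraints (3.20) and (3.21), so each new weight/bias must stay within
        # [w - dw_max, w + dw_max] and [v - dv_max, v + dv_max]
        w, v = solve_word_level_lp(hyps,
                                   w_bounds=(w - dw_max, w + dw_max),
                                   v_bounds=(v - dv_max, v + dv_max))
    return w, v
```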

3.3.2 Discussion

One may be curious that LP usually gives global optimal solution, but our itera-

tive LP optimization method does not. The iterative LP optimization method is

similar to MCE training that they both make use of N-best hypotheses to define

the classification error. The N-best hypotheses provide only partial information


about the whole hypothesis space, so neither method can guarantee a globally optimal solution.

Another issue with the iterative LP method is whether it will converge to a locally optimal solution. MCE training uses a gradient-based method which is proven to converge to a local optimum. Our iterative LP method finds the globally optimal solution with incomplete information in each LP iteration, but this solution may fall outside the new feasible region defined by the updated hypotheses. Even if the solution lies inside that feasible region, since it was not computed from it, optimality within that region is still not guaranteed. Therefore, convergence to a local optimum needs further investigation and will not be discussed in this thesis.

Even if the iterative LP method converges to a locally optimal solution, like other optimization methods that give only locally optimal solutions, different initial points (initial stream weights) may lead to different final solutions, and these final solutions may give varied performance. Therefore, choosing initial stream weights that lead to a good final solution is important for the iterative LP method.

3.4 Conclusion

This chapter presented our stream weight estimation method using LP, based on both frame-level and word-level recognition correctness. The frame-level method has the advantage of giving globally optimal stream weights, but because of its formulation it can only minimize the frame recognition error. The word-level method aims to minimize the word recognition error, which the frame-level method does not address directly, but it sacrifices the global optimality of the stream weights since the iterative LP method has to be used in practice. The initial stream weights have to be chosen carefully as they affect the final result of the iterative LP method.


CHAPTER 4

EXPERIMENTAL EVALUATION

4.1 Introduction

In this chapter, we present the experimental results of our stream weight estimation method. We evaluated our linear programming algorithm in the determination of stream weights and biases for a 4-stream continuous density HMM (CDHMM) in a medium-vocabulary speech recognition system. Although a single-stream CDHMM should perform better than a multi-stream CDHMM for this task, and the 4 streams are clearly not independent, as a preliminary study we only want to use such a simple setting to show that the new algorithm can estimate better stream weights and biases for the 4-stream CDHMM than if equal stream weights and global biases were used.

4.2 Setup

4.2.1 Resource Management Corpus

The Resource Management Corpus (RM1) [39] was chosen for this study. Only

the speaker-independent (SI) training set was used for training the various SI

models. It consists of 3990 utterances from 109 speakers. Evaluation was done

on the 300 utterances in the SI Feb’91 test set. The standard word-pair grammar

with a perplexity of 60 was used during decoding. All model training and decoding

was performed using HTK software [40].

4.2.2 Acoustic Modeling

We extracted conventional 39-dimensional MFCC vectors every 10 ms over a window of 25 ms. The (1-stream) SI model consists of 47 monophones plus the

silence and short pause (short pause is tied with the middle state of silence).

Each of them was modeled as a continuous density HMM (CDHMM) which is


strictly left-to-right and has three states with 10 Gaussian mixture components per state.

Table 4.1: Word accuracy of the baseline models. All models are monophone models with 10 Gaussian mixtures per state.

    CDHMM                                                    Word Accuracy
    1-stream                                                 93.16%
    4-stream, global stream weights found by grid search     92.23%
    4-stream, global weights, ∀j, ∀k, w_j^(k) = 1             91.43%

Each MFCC vector was split into 4 streams, consisting of the static MFCCs, delta MFCCs, delta-delta MFCCs, and energies respectively. Each 4-stream CDHMM was derived from the corresponding 1-stream CDHMM assuming that all stream weights w_j^(k) are equal to 1. Starting from a 1-stream 1-mixture CDHMM, we split it into 4 streams to form a 4-stream 1-mixture CDHMM, which was re-trained with 4 iterations of the EM algorithm. Then the number of mixtures was grown by one and the model was re-trained. The procedure was repeated until each HMM state had 10 mixture components. We also tried to locate the optimal stream weights using an extensive grid search as follows: each stream weight was first varied from 0.8 to 1.2 in steps of 0.1, so that a total of 5^4 = 625 different weight combinations were searched. Based on the results, promising (but not all) points in the grid were further explored in the range of 0.7 to 1.5, again in steps of 0.1. The best global weight vector found by grid search is w = [0.8, 0.7, 1.0, 1.5]. Table 4.1 gives the word recognition accuracies of the 1-stream baseline HMM, the 4-stream baseline HMM using equal stream weights of 1.0, and the same 4-stream HMM using the stream weights found by grid search. It can be seen that the better stream weights found by the grid search reduce the word error rate (WER) of the 4-stream baseline system by 9.33% relative.
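The coarse stage of such a grid search can be sketched as follows; decode_and_score() is a stand-in (not part of the thesis software) for decoding the test set with the given global weights and returning the word accuracy, and is faked here so the sketch runs:

```python
import itertools

def decode_and_score(weights):
    # stand-in for running the recognizer with these global stream weights
    # (e.g. via HVite/HResults) and returning the word accuracy on the test set;
    # a dummy score is returned here so the example is self-contained
    return -sum((w - 1.0) ** 2 for w in weights)

grid = [0.8, 0.9, 1.0, 1.1, 1.2]                       # 5 values per stream -> 5**4 = 625 points
best = max(itertools.product(grid, repeat=4), key=decode_and_score)
print(best)
```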

4.3 Implementation Issues

One problem with the LP optimization method is its large number of constraints, and thus its high memory consumption and long computational time. Moreover, it is possible that the LP solver cannot solve an LP with so many constraints. The major problem with common LP algorithms such as the simplex algorithm and the interior-point method is that they make use of all the constraints at the same


time; using only part of them cannot, in general, give the globally optimal solution.

To deal with this problem, we decided to use a variation of the cutting-plane method [41]. The cutting-plane method is mainly used for solving integer programming and non-linear convex programming problems. It tries to locate the optimal solution by repeatedly adding cutting-planes that eliminate half-spaces which cannot contain the optimal solution. In our LP optimization, the linear constraints are the cutting-planes. The algorithm we use is as follows:

Let S be the set of constraints (cutting-planes), each of which can be written in the form a′u ≤ b, where u = [w′, v′, ξ′]′. The algorithm starts by randomly selecting a subset S1 from S. Let P0 be the feasible region of S1, and randomly select an initial point u(0) from P0.

Step 1. Set i := 0, S2 := S − S1, Feasible := false and Optimal := false.
Repeat
    Step 2. If u(i) is feasible for all the cutting-planes in S2,
                set Pi+1 := Pi and Feasible := true;
            else, add all the cutting-planes in S2 that make u(i) infeasible:
                Pi+1 := Pi ∩ {u | a′u ≤ b}, where a′u(i) > b,
            remove those cutting-planes from S2:
                S2 := S2 − {u | a′u ≤ b}, where a′u(i) > b,
            and set Feasible := false.
    Step 3. If Feasible = true and Optimal = true,
                quit and return u(i) as the optimal solution.
    Step 4. Run LP with Pi+1 using the LP solver.
    Step 5. If there is no feasible solution from the solver,
                quit and return no solution;
            else, set u(i+1) := solution from the solver and Optimal := true.
    Step 6. Set i := i + 1.

The algorithm terminates when Pi+1 = ∅, which means there is no feasible solution to the LP problem, or when we find the optimal solution to the LP problem. Note that the algorithm as described starts with a random subset of constraints and a random initial point u(0), but this is not necessary in practice. The initial polyhedron P0 can be the whole hyperspace of u, and thus any u(0) can be used as the initial point. It is possible that such an initial point u(0) is feasible for all the constraints, in which case we would solve the LP with a P0 that does not contain any constraints. However, in practice it is not common to have an initial point which is feasible


for all the constraints if we select the initial values of the slack variables as ξ = 0. Therefore, in the actual implementation of this algorithm we start with the whole set of constraints in Step 1 (i.e. S2 = S) and a designated initial point.
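The following sketch shows one possible realization of this strategy with an off-the-shelf LP solver (scipy's linprog is used here instead of Mosek, and the function and variable names are our own, not the thesis implementation). It repeatedly solves the LP over a subset of the inequality constraints a′u ≤ b, adds whichever constraints the current solution violates, and stops when the subset solution satisfies all of them:

```python
import numpy as np
from scipy.optimize import linprog

def solve_by_constraint_subsets(c, A, b, A_eq, b_eq):
    """Minimize c'u s.t. A u <= b, A_eq u = b_eq, u >= 0, adding constraints lazily."""
    remaining = set(range(A.shape[0]))     # constraints not yet given to the solver (S2)
    active = []                            # constraints currently in the LP
    u = np.zeros(A.shape[1])               # designated initial point (e.g. all slack variables = 0)
    optimal = False
    while True:
        violated = [i for i in remaining if A[i] @ u > b[i] + 1e-9]
        if not violated and optimal:       # u satisfies every constraint and is optimal for the subset
            return u
        active.extend(violated)
        remaining.difference_update(violated)
        res = linprog(c,
                      A_ub=A[active] if active else None,
                      b_ub=b[active] if active else None,
                      A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
        if not res.success:                # the full LP has no feasible solution
            return None
        u, optimal = res.x, True
```

In the worst case every constraint ends up in the active set, which corresponds to the situation discussed next.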

In this algorithm, we separate the LP problem into two parts: the feasibility of a solution and its optimality. The optimal solution of the LP problem must be feasible for all the constraints in that LP problem. Here, Pi specifies the feasible region in the ith iteration of the algorithm. However, feasibility alone is not enough for an optimal solution, so we use an LP solver to find the optimal solution in Pi. Notice that this solution is both feasible and optimal with respect to Pi, but Pi is the feasible region of only some of the constraints. Therefore, we cannot say this solution is optimal for the full LP problem unless it is also feasible for all the constraints. Our approach is thus to find an optimal solution to a smaller LP problem that contains part of the constraints; this solution is also optimal for the LP problem containing all the constraints if it is feasible for the constraints that are not in the smaller LP problem.

In the worst case, all the constraints are needed to find the optimal solution. Several methods have been proposed to avoid this situation. They try to find the "center" of the feasible region so that roughly half of the search space can be eliminated in each step. These methods include the center-of-gravity algorithm [42], the maximum-volume-ellipsoid method [43, 44], the Chebyshev center method [45], and the analytic center cutting-plane method [46, 47, 48, 49]. They are efficient algorithms for finding the optimal solution. However, in this thesis we employ the LP solver to find the optimal solution, because solving a large LP problem by first solving a smaller LP problem is a more generic approach.

4.4 Experiment 1: LP Optimization with Complete Knowledge of the Feasible Region Based on Frame Recognition Correctness

Thirty-six seconds, or 3,600 frames, of speech were randomly selected from the training set so that the number of training frames for each truth state was the same. The truth state of a frame was obtained from forced alignments of the training utterances using the baseline 4-stream 10-mixture CDHMMs with uniform stream weights of 1 and biases of 0. The log-likelihood vectors z_jt were then


computed for each frame for all 144 possible states (3 states × 47 monophones + 3 states for silence). Thus, each truth-state likelihood had 143 competing state likelihoods at each frame.

The sum of weights at any state, Σ_{k=1}^K w_j^(k), was set to 4. Various tying schemes of the weights and biases at the global, phoneme, and state levels were considered. For the slack variables, we also tried no tying at all, or tying at the frame, phoneme, and state levels. Frame-level tying of the slack variables means that for a training frame x_t, its log-likelihood from the truth state has to be better than that of all competing states, both the nearby and the farthest ones, by the same amount. Finally, the LP problem was solved by the Mosek software [50] using the interior-point method.
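To make the setup concrete, the sketch below builds the frame-level LP for the simplest tying scheme (global stream weights, a global bias that cancels out, and frame-tied slack variables) and solves it with scipy's linprog rather than Mosek. The log-likelihood array is a random stand-in for the real z vectors, and the sizes are toy values rather than the 3,600 frames and 144 states used here:

```python
import numpy as np
from scipy.optimize import linprog

K, T, J = 4, 50, 10                       # streams, frames, states (toy sizes)
rng = np.random.default_rng(0)
z = rng.normal(size=(T, J, K))            # z[t, j] = per-stream log-likelihoods of state j at frame t
truth = rng.integers(0, J, size=T)        # truth state of each frame (from forced alignment)

rows, rhs = [], []
for t in range(T):
    for j in range(J):
        if j == truth[t]:
            continue
        # constraint: w'(z[t, truth] - z[t, j]) + xi_t >= 0
        #   rewritten for linprog as  -(z[t, truth] - z[t, j])' w - xi_t <= 0
        row = np.zeros(K + T)
        row[:K] = -(z[t, truth[t]] - z[t, j])
        row[K + t] = -1.0
        rows.append(row)
        rhs.append(0.0)

c = np.concatenate([np.zeros(K), np.ones(T)])          # minimize the sum of the slack variables
A_eq = np.zeros((1, K + T)); A_eq[0, :K] = 1.0          # sum of the stream weights = 4
res = linprog(c, A_ub=np.array(rows), b_ub=np.array(rhs),
              A_eq=A_eq, b_eq=[4.0], bounds=(0, None), method="highs")
print("stream weights:", res.x[:K])
```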

4.4.1 Effect of Tying Weights, Biases, and Slack Variables

Table 4.2 presents the effect of tying the slack variables, biases, and weights using 3,600 training frames. It can be seen that the best accuracy of 92.19% is obtained by using frame-dependent slack variables, a global bias, and global weights.

Table 4.2: Effect of tying stream weights, bias, and slack variables on the LP solution of stream weight estimation based on frame recognition correctness. 3,600 training frames were used; the sum of weights at any state was set to 4.0. Rows combine weight tying (w, w_p, or w_j), bias tying (v, v_p, or v_j), and slack-variable tying (ξ_tj, ξ_t, ξ_p, or ξ_j); the word accuracies of the ten combinations tested are 91.43%, 92.19%, 91.51%, 91.91%, 92.19%, 91.38%, 90.30%, 92.19%, 91.14%, and 90.86%.


4.4.2 Effect of More Training Frames

Since the number of free variables (global stream weights and biases) is small, only 36 seconds of training frames already give a good solution. We checked whether more training data would help by randomly selecting more training frames from the RM training set while keeping the distribution of frames equal over the states. Results are shown in Table 4.3 for global stream weights and biases. It may be seen that the performance does not improve much with more training frames. Doubling or quadrupling the number of training frames only improves the performance from 92.19% to 92.23%, which is the best result we obtained and is the same as the result obtained through the grid search. The stream weights obtained with 7,200 or 14,400 training frames are very similar: [0.868, 0.910, 0.803, 1.42] and [0.861, 0.895, 0.824, 1.42] respectively. This shows that, for this task, the greatest discriminative power seems to come from the normalized energies and their time derivatives, followed by the delta MFCCs, the static MFCCs, and then the delta-delta MFCCs. It is worth noting that the stream weights found by our LP method are quite different from those obtained with the grid search. This may not be surprising as the two methods are very different: our grid search is a cheating experiment in which we tried to locate the set of stream weights that gives the best performance on the test data, whereas the LP formulation tries to minimize the frame errors.

We also checked whether more training data would help improve the performance of stream weight and bias tying schemes other than the global ones. Table 4.4 shows the results of tying the slack variables, biases, and weights using 57,600 training frames. It shows that more training data benefits the other stream weight and bias tying schemes, but global stream weights and biases are still the best.

Table 4.3: Effect of more training data on the LP solution of stream weight estimation based on frame recognition correctness. Global stream weights, global biases, frame-dependent slack variables, and zero margin were used.

    #Training Frames    Word Accuracy
    3,600               92.19%
    7,200               92.23%
    14,400              92.23%
    28,800              92.19%
    57,600              92.07%
    460,800             92.15%


Table 4.4: Effect of tying stream weights, bias, and slack variables and using a margin on the LP solution of stream weight estimation based on frame recognition correctness. 57,600 training frames were used; the sum of weights at any state was set to 4.0. Rows combine weight tying (w, w_p, or w_j), bias tying (v, v_p, or v_j), and slack-variable tying (ξ_tj, ξ_t, ξ_p, or ξ_j); the word accuracies of the ten combinations tested are 91.55%, 92.07%, 91.51%, 91.22%, 92.07%, 91.87%, 92.07%, 92.07%, 91.18%, and 91.67%.

4.4.3 Remarks

It is surprising that the phoneme-level and state-level stream weights give worse results than global stream weights in Table 4.2. Even when we use more training data, they still do not give better results than the global ones. The same phenomenon was observed in the maximum entropy training of stream weights [12], even though improvement in the cost function was observed during the training procedure. More research is needed in this area.

4.5 Experiment 2: Iterative LP with Incomplete Knowledge of the Feasible Region Based on Word Recognition Correctness

The LP formulation of Experiment 1 is based on frame recognition correctness. It does not match exactly the common performance measure in ASR, which is word recognition accuracy, though we usually expect increasing frame accuracy to give non-decreasing word accuracy. Here, we repeated the estimation of stream weights by formulating the LP optimization based on word recognition correctness.


4.5.1 Experiment 2.1: Single LP iteration

We first investigated the conjecture that the feasible region constructed from an N-best list generated by a given model is incomplete. Only one iteration of the LP optimization was run, as in Experiment 1. For this experiment, 50-best utterance hypotheses were generated using the baseline stream weights. It is estimated that there are on average 5 distinct word hypotheses for each word instance. There was no constraint on ∆w_max or ∆v_max. The results are shown in Table 4.5. It is observed that tying the slack variables at the word-instance level is better than no tying at all in most cases, and that global stream weights achieve better results than phoneme-dependent or state-dependent weights. The best result of 92.15% is obtained with global stream weights, phoneme-level biases, and word-instance slack variables.

Table 4.5: Effect of tying stream weights, bias, and slack variables on the LP solution of stream weight estimation based on word recognition correctness (single LP iteration). 3,990 training utterances and 50-best utterance hypotheses of the baseline 4-stream model were used; the sum of weights at any state was set to 4.0. Rows combine weight tying (w, w_p, or w_j), bias tying (v, v_p, or v_j), and slack-variable tying (ξ_mir or ξ_mi); the word accuracies reported are 91.87%, 90.66%, 89.49%, 91.83%, 92.15%, 91.87%, 91.87%, 89.33%, 81.92%, 91.83%, 91.30%, 89.61%, 91.83%, 92.15%, 91.87%, 91.30%, 91.26%, 90.94%, 89.61%, 89.25%, and 89.45%.


4.5.2 Experiment 2.2: Iterative LP optimization

Experiment 2.1 was repeated with the iterative LP algorithm described in Section 3.3.1 using 50-best competing hypotheses, and constraining the change in all stream weights to be less than ∆w_max and in all biases to be less than ∆v_max. Intuitively, small values of ∆w_max and ∆v_max are always good because the resulting weights are less likely to fall outside their feasible region; however, it will then take many iterations to reach better stream weights. On the other hand, large ∆w_max and ∆v_max may cause the resulting weights to fall outside their feasible region, so there is a tradeoff between efficiency and performance. Since it is anticipated that ∆w_max and ∆v_max will be small compared with their dynamic ranges, we chose to use the best solution of the frame-level LP (using 7,200 training frames, global stream weights and biases, and frame-level slack variables), w = [0.868, 0.910, 0.803, 1.42], as the initial stream weights, so that we do not need many iterations before we can see substantial improvement in the result.

Effect of various ∆w_max

We first investigated the effect of various ∆w_max on iterative LP optimization. State-dependent stream weights, global stream biases, and word-instance slack variables were used. Global stream biases were used so that we did not need to consider the effect of ∆v_max at the same time. The results with various values of ∆w_max are plotted in Figure 4.1. It is found that the iterative LP algorithm effectively improves the estimation of the stream weights. A smaller value of ∆w_max such as 0.01 or 0.005 gives better performance within a few iterations. The highest word accuracy obtained is 92.79%, which is better than the word accuracy of the baseline system with global stream weights found by grid search. We can also observe that good results are already obtained within the first 5 iterations. Table 4.6 shows the best results with various ∆w_max using global, phoneme-dependent, and state-dependent stream weights within the first 5 iterations. We can see that the difference between using ∆w_max = 0.01 and 0.005 is not significant for any of the stream weight tying schemes. Also, phoneme-dependent and state-dependent weights give better results than global stream weights.


[Figure: word accuracy (%) against iteration (1 to 10) for ∆w_max = +∞, 0.02, 0.01, and 0.005.]
Figure 4.1: Effect of ∆w_max on iterative LP optimization using state-dependent weights.

Table 4.6: Best word accuracies of the LP optimization with various stream weight tying and ∆w_max within 5 iterations.

    Weight Tying      ∆w_max    Best Word Accuracy in 5 iterations
    w  (global)       0.01      92.15%
    w  (global)       0.005     92.23%
    w_p (phoneme)     0.01      92.55%
    w_p (phoneme)     0.005     92.63%
    w_j (state)       0.01      92.79%
    w_j (state)       0.005     92.63%


[Figure: word accuracy (%) against iteration (1 to 5) for ∆v_max = +∞, 0.5, 0.125, and 0.03.]
Figure 4.2: Effect of ∆v_max on iterative LP optimization using phoneme-dependent weights and phoneme-dependent biases and ∆w_max = 0.01.

Effect of various ∆v_max

We then studied the effect of various ∆v_max for phoneme-dependent and state-dependent biases. The value of a global stream bias is unimportant in our formulation as it cancels out in Eqn. (3.16). The results for phoneme-dependent stream weights with various biases are shown in Figures 4.2, 4.3, 4.4 and 4.5, and those for state-dependent stream weights with various biases are shown in Figures 4.6, 4.7, 4.8 and 4.9. We can see that for both phoneme-dependent and state-dependent weights and biases, when there is no constraint on the value of ∆v (∆v_max = +∞), the results are worse than those with a small ∆v_max. The problem is more serious for state-dependent biases after the first 2 iterations. This matches our hypothesis that the stream weights found by LP may fall outside their feasible region, and this is more likely to happen when the parameters are free to change their values. For the other three values of ∆v_max, the performance is about the same, and they do not outperform the configurations with global biases.


[Figure: word accuracy (%) against iteration (1 to 5) for ∆v_max = +∞, 0.5, 0.125, and 0.03.]
Figure 4.3: Effect of ∆v_max on iterative LP optimization using phoneme-dependent weights and phoneme-dependent biases and ∆w_max = 0.005.

[Figure: word accuracy (%) against iteration (1 to 5) for ∆v_max = +∞, 0.5, 0.125, and 0.03.]
Figure 4.4: Effect of ∆v_max on iterative LP optimization using phoneme-dependent weights and state-dependent biases and ∆w_max = 0.01.


[Figure: word accuracy (%) against iteration (1 to 5) for ∆v_max = +∞, 0.5, 0.125, and 0.03.]
Figure 4.5: Effect of ∆v_max on iterative LP optimization using phoneme-dependent weights and state-dependent biases and ∆w_max = 0.005.

[Figure: word accuracy (%) against iteration (1 to 5) for ∆v_max = +∞, 0.5, 0.125, and 0.03.]
Figure 4.6: Effect of ∆v_max on iterative LP optimization using state-dependent weights and phoneme-dependent biases and ∆w_max = 0.01.


[Figure: word accuracy (%) against iteration (1 to 5) for ∆v_max = +∞, 0.5, 0.125, and 0.03.]
Figure 4.7: Effect of ∆v_max on iterative LP optimization using state-dependent weights and phoneme-dependent biases and ∆w_max = 0.005.

[Figure: word accuracy (%) against iteration (1 to 5) for ∆v_max = +∞, 0.5, 0.125, and 0.03.]
Figure 4.8: Effect of ∆v_max on iterative LP optimization using state-dependent weights and state-dependent biases and ∆w_max = 0.01.


[Figure: word accuracy (%) against iteration (1 to 5) for ∆v_max = +∞, 0.5, 0.125, and 0.03.]
Figure 4.9: Effect of ∆v_max on iterative LP optimization using state-dependent weights and state-dependent biases and ∆w_max = 0.005.

4.6 Significance Tests

To compare the various stream weight estimation methods, statistical significance tests were run on the recognition performance of the systems. Software developed by the National Institute of Standards and Technology (NIST) was used to conduct the tests, at a two-tailed 5% significance level. The results are shown in Table C.1 in Appendix C. From the results, we can see that the frame-level LP and the word-level LP with constraints on the stream weights and biases are significantly better than using uniform stream weights of unity as in the 4-stream baseline model.

4.7 Summary and Discussion

We investigated the LP optimization method based on frame-level and word-level recognition correctness. The results are summarized in Table 4.7. It is shown that this method achieves satisfactory results compared with the multi-stream baseline model. The frame-level LP optimization method estimates stream weights that give results comparable to those found by brute-force grid search, while the word-level LP makes a further improvement on the result, starting with the solution found


Table 4.7: Word accuracy of the baseline models and our method based on frame and word recognition correctness. All models are monophone models with 10 Gaussian mixtures per state.

    CDHMM                                                                   Word Accuracy
    Iterative word-level LP with state-dependent stream weights
      and global biases (no constraint on ∆w)                               90.50%
    4-stream, global weights, ∀j, ∀k, w_j^(k) = 1                            91.43%
    4-stream, global stream weights found by grid search                    92.23%
    Frame-level LP with global stream weights and biases                    92.23%
    Iterative word-level LP with global stream weights and biases
      (with constraint on ∆w)                                               92.23%
    Iterative word-level LP with state-dependent stream weights
      and global biases (with constraint on ∆w)                             92.79%
    Iterative word-level LP with state-dependent stream weights
      and state-dependent biases (with constraints on ∆w and ∆v)            92.91%
    1-stream                                                                93.16%

from the frame-level LP.

The frame-level LP and the word-level LP correspond to two different optimization strategies. The frame-level LP is able to capture complete knowledge of the feasible region because the number of competing states (competing hypotheses) for a frame is relatively small compared with that for a word instance. However, optimization is done only on frames and thus the word error may not be minimized. On the other hand, the word-level LP is designed to minimize the word error, but it can only make use of some of the competing hypotheses because there are too many of them, so it has only incomplete knowledge of the feasible region. The frame-level LP can therefore be used to estimate the initial stream weights, which can then be fine-tuned by the word-level LP.

One worry with the LP optimization is its scalability. It can be a problem if we have a lot of training data and/or many states in the HMM set (e.g., triphone models). This problem is more serious for the frame-level LP because the number of constraints is proportional to the number of training frames and the number of states. A medium-vocabulary speech corpus such as RM has about 1 million frames in the training set, and the LP problem can be very large if we use all of them. Luckily, from the results of using various amounts of training frames (see Table 4.3), we can conclude that a small number of training frames (e.g., 7,200 frames) already gives quite good results. Moreover, not all the constraints are


needed if we employ an iterative way of solving the LP problem as proposed in

Section 4.3.

One surprising finding is that in the frame-level LP, state-dependent stream weights perform worse than global stream weights. A similar result is obtained in the word-level LP without constraints on the change in stream weights and biases (∆w_max = +∞ and ∆v_max = +∞). Nevertheless, after we impose the constraints on the change of the stream weights, state-dependent stream weights give a further improvement over global stream weights. This, however, does not imply that the constraints on the change of stream weights would be useful for the frame-level LP: the frame-level LP already has complete knowledge of the feasible region, and its feasible region does not depend on the stream weights and biases, so whether those constraints are used or not does not affect it. Further research is needed to investigate the cause.

The limits on how much the parameters may change in each LP iteration (∆w_max and ∆v_max) have to be chosen to balance training efficiency and convergence performance. Small values of ∆w_max and ∆v_max give better convergence performance after some iterations, but if the values are too small it will take too long to achieve a good result. On the other hand, larger values make the training fast, but convergence cannot be obtained easily.

While state-dependent stream weights are shown to give better results than global stream weights in the iterative word-level LP, phoneme-dependent or state-dependent biases do not show significant improvement over global biases. What we should notice is the interaction between the stream weights and the stream biases. The stream weights can take a different value for each stream, while a stream bias is a single value independent of the streams. When we tie the stream weights and biases to the same degree (e.g., at the phoneme level), the stream weights have more freedom than the stream biases, as the former have a separate parameter for each stream and each tying unit, while the latter have only a single value for each tying unit. We can say the stream biases are more constrained than the stream weights, and thus their effect is usually shadowed by the stream weights. However, if the tying of the stream weights is very tight (e.g., global stream weights), the stream biases can be useful (see the result with global stream weights and phoneme-dependent biases in Table 4.5).


Lastly, we have not used a development set or cross-validation for determining the best parameter tying scheme and the values controlling the change of the parameters in each LP iteration (for the iterative word-level LP). In this thesis we aim at exploring the potential of our LP method; further experiments will be run with a development set or cross-validation.


CHAPTER 5

CONCLUSION AND FUTURE WORK

This thesis casts the problem of stream weight estimation in a multi-stream hidden Markov model (HMM) into a standard linear programming (LP) optimization framework. We analyze the problem of the incomplete knowledge of the feasible region that can be constructed from competing hypotheses in a practical ASR system, and propose an iterative LP optimization algorithm. In the following sections, I summarize the contributions of this thesis to the ASR community and suggest possible future extensions of our LP optimization framework.

5.1 Contributions

The most significant contribution of this thesis is the formulation of stream weight estimation in a multi-stream HMM as an LP problem. We show that our new algorithm is effective in estimating stream weights on the medium-vocabulary Resource Management (RM) speech corpus. We compare the best result obtained by our method with the 4-stream baseline model and with the stream weights found by brute-force grid search in Table 5.1. Our method reduces the word error rate (WER) of the baseline system by 17% relative, and the WER of the system with stream weights found by grid search by 8.75% relative. From the results of the significance tests in Appendix C, our method is significantly better than the 4-stream baseline model.

Table 5.1: Word accuracy of the baseline models and our method. All models are monophone models with 10 Gaussian mixtures per state.

    CDHMM                                                    Word Accuracy
    4-stream, global weights, ∀j, ∀k, w_j^(k) = 1             91.43%
    4-stream, global stream weights found by grid search     92.23%
    Our method                                               92.91%
    1-stream                                                 93.16%

Another contribution is the investigation of the worse performance of state-

dependent stream weights in frame-level LP optimization. This phenomenon is


also observed in the maximum entropy training of stream weights [12], which also tries to obtain a globally optimal solution. Gradient-based methods such as MCE training do not have this problem, however. We hypothesize that the stream weights found by LP may fall outside their feasible region, and we verify this hypothesis with our experiments.

5.2 Future Work

We have shown that our stream weight estimation method achieves satisfactory results. We would like to see the effect of using more competing hypotheses, and lattices may be used as a more compact representation of the N-best hypotheses to save memory and computation time. We would also like to apply the method to large-vocabulary continuous speech recognition (LVCSR) and see if it still works for this more realistic problem.

The RM speech recognition task used in this thesis is only a preliminary study. Our LP formulation can be applied to any linear function in ASR, for example in audio-visual ASR and in the tuning of scaling factors (insertion penalty and grammar factor) [51] in speech recognition. We would like to apply our method to the product HMM in audio-visual ASR. The high-density discrete HMM (HDDHMM) [4] is another application for stream weight estimation, because the concatenative approach of combining streams results in a feature vector whose dimension is too large to model efficiently.

MCE training is the most common method for stream weight estimation. Our method should also be compared with MCE training in future work.


REFERENCES

[1] J. G. Fiscus, “A post-processing system to yield reduced word error rates:

Recognizer output voting error reduction (ROVER),” in Proceedings of the

IEEE Automatic Speech Recognition and Understanding Workshop, 1997,

pp. 347–354.

[2] V. N. Gupta, M. Lennig, and P. Mermelstein, “Integration of acoustic information in a large vocabulary word recognizer,” in Proceedings of the IEEE

International Conference on Acoustics, Speech, and Signal Processing, 1987,

pp. 697–700.

[3] K. F. Lee, “Context-dependent phonetic hidden Markov models for

speaker-independent continuous speech recognition,” IEEE Transactions on

Acoustics, Speech and Signal Processing, vol. 38, no. 4, pp. 599–609, April

1990.

[4] Brian Mak, S. K. Au Yeung, Y. P. Lai, and M. Siu, “High-density discrete

HMM with the use of scalar quantization indexing,” in Proceedings of the

European Conference on Speech Communication and Technology, Sept 2005.

[5] H. Bourlard and S. Dupont, “A new ASR approach based on independent

processing and recombination of partial frequency bands,” in Proceedings of

the International Conference on Spoken Language Processing, October 1996.

[6] C. Cerisara, J. P. Haton, J. F. Mari, and D. Fohr, “A recombination model

for multi-band speech recognition,” in Proceedings of the IEEE International

Conference on Acoustics, Speech, and Signal Processing, vol. II, 1998, pp.

717–720.

[7] G. Potamianos and H. P. Graf, “Discriminative training of HMM stream

exponents for audio-visual speech recognition,” in Proceedings of the IEEE

International Conference on Acoustics, Speech, and Signal Processing, 1998,

pp. 3733–3736.


[8] S. Nakamura, K. Kumatani, and S. Tamura, “Robust bi-modal speech recognition based on state synchronous modeling and stream weight optimization,” in Proceedings of the IEEE International Conference on Acoustics,

Speech, and Signal Processing, vol. 1, 2002, pp. 309–312.

[9] S. Tamura, K. Iwano, and S. Furui, “A stream-weight optimization method

for audio-visual recognition using multi-stream HMMs,” in Proceedings of the

IEEE International Conference on Acoustics, Speech, and Signal Processing,

vol. 1, 2004, pp. 857–860.

[10] S. Okawa, E. Bocchieri, and A. Potamianos, “Multi-band speech recognition

in noisy environments,” in Proceedings of the IEEE International Conference

on Acoustics, Speech, and Signal Processing, vol. II, May 1998, pp. 641–644.

[11] S. Okawa, T. Nakajima, and K. Shirai, “A recombination strategy for multi-

band speech recognition based on mutual information criterion,” in Proceedings of the European Conference on Speech Communication and Technology,

vol. 2, Sept 1999, pp. 603–606.

[12] G. Gravier, S. Axelrod, G. Potamianos, and C. Neti, “Maximum entropy

and MCE based HMM stream weight estimation for audio-visual ASR,” in

Proceedings of the IEEE International Conference on Acoustics, Speech, and

Signal Processing, vol. 1, 2002, pp. 853–856.

[13] A. Potamianos, E. Sanchez-Soto, and K. Daoudi, “Stream weight computation for multi-stream classifiers,” in Proceedings of the IEEE International

Conference on Acoustics, Speech, and Signal Processing, vol. 1, 2006, pp.

353–356.

[14] A. Schrijver, Theory of Linear and Integer Programming. John Wiley &

Sons, 1998.

[15] L. R. Rabiner, “A tutorial on hidden Markov models and selected applications in speech recognition,” in Proceedings of the IEEE, vol. 77, no. 2, February 1989, pp. 257–286.

[16] L. R. Rabiner and B. H. Juang, “An introduction to hidden Markov models,”

IEEE ASSP Magazine, vol. 3, no. 1, pp. 4–16, January 1986.


[17] X. Huang and M. A. Jack, “Semi-continuous hidden Markov models for

speech signals,” Journal of Computer Speech and Language, vol. 3, no. 3, pp.

239–251, July 1989.

[18] J. R. Bellegarda and D. Nahamoo, “Tied mixture continuous parameter

modeling for speech recognition,” IEEE Transactions on Acoustics, Speech

and Signal Processing, vol. 38, no. 12, pp. 2033–2045, December 1990.

[19] S. Takahashi, K. Aikawa, and S. Sagayama, “Discrete mixture HMM,” in

Proceedings of the IEEE International Conference on Acoustics, Speech, and

Signal Processing, vol. 2, April 1997, pp. 971–974.

[20] S. Takahashi and S. Sagayama, “Four-level tied-structure for efficient representation of acoustic modeling,” in Proceedings of the IEEE International

Conference on Acoustics, Speech, and Signal Processing, vol. 1, 1995, pp.

520–523.

[21] J. Hu, M. K. Brown, and W. Turin, “HMM based on-line handwriting recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence,

vol. 18, no. 10, pp. 1039–1045, 1996.

[22] C. Bahlmann and H. Burkhardt, “Measuring HMM similarity with the Bayes

probability of error and its application to online handwriting recognition,”

in Proceedings of the International Conference on Document Analysis and

Recognition, 2001, pp. 406–411.

[23] R. Durbin, S. Eddy, A. Krogh, and G. Mitchison, Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press, 1999.

[24] S. R. Eddy, “Profile hidden Markov models,” Bioinformatics, vol. 14, pp.

755–763, 1998.

[25] C. B. K. Karplus and R. Hughey, “Hidden Markov models for detecting

remote protein homologies,” Bioinformatics, vol. 14, pp. 846–856, 1998.

[26] Y. Bengio, V. P. Lauzon, and R. Ducharme, “Experiments on the application

of IOHMM’s to model financial returns series,” IEEE Transactions on Neural

Networks, vol. 12, pp. 113–123, 2001.


[27] M. R. Hassan and B. Nath, “Stock market forecasting using hidden Markov

model: a new approach,” in Proceedings of Intelligent Systems Design and

Applications, 2005, pp. 192–196.

[28] C. S. Liu, C. H. Lee, W. Chou, B. H. Juang, and A. E. Rosenberg, “A study

on minimum error discriminative training for speaker recognition,” Journal

of Acoustical Society of America, pp. 637–648, January 1995.

[29] A. Berger, S. Della Pietra, and V. Della Pietra, “A maximum entropy approach to

natural language processing,” Computational Linguistics, vol. 22, no. 1, 1996.

[30] R. A. Fisher, “The use of multiple measurements in taxonomic problems,”

Annals of Eugenics, 1936.

[31] K. Iwano, K. Kojima, and S. Furui, “A weight estimation method using

LDA for multi-band speech recognition,” in Proceedings of the International

Conference on Spoken Language Processing, 2006, pp. 2534–2537.

[32] T. Bayes, “An essay towards solving a problem in the doctrine of chances,”

Philosophical Transactions, 1763.

[33] X. Li and R. M. Stern, “Training of stream weights for the decoding of speech

using parallel feature streams,” in Proceedings of the IEEE International

Conference on Acoustics, Speech, and Signal Processing, vol. 1, 2003, pp.

832–835.

[34] S. Tamura, K. Iwano, and S. Furui, “A stream-weight optimization method

for multi-stream HMMs based on likelihood value normalization,” in Proceedings of the IEEE International Conference on Acoustics, Speech, and

Signal Processing, vol. 1, 2005, pp. 469–472.

[35] A. Frome, Y. Singer, and J. Malik, “Image retrieval and recognition using

local distance function,” in Proceedings of Neural Information Processing

Systems (NIPS), 2006.

[36] A. Berg, T. Berg, and J. Malik, “Shape matching and object recognition

using low distortion correspondence,” in CVPR, 2005.


[37] Brian Mak and Benny Ng (Ng Yik Lun), “Discriminative training by iterative

linear programming optimization,” in Proceedings of the IEEE International

Conference on Acoustics, Speech, and Signal Processing, 2008, to appear.

[38] R. M. Schwartz and Y. L. Chow, “The N-best algorithm: An efficient and exact procedure for finding the N most likely hypotheses,” in Proceedings of the

IEEE International Conference on Acoustics, Speech, and Signal Processing,

1990.

[39] P. Price, W. M. Fisher, J. Bernstein, and D. S. Pallett, “The DARPA 1000-

word resource management database for continuous speech recognition,” in

Proceedings of the IEEE International Conference on Acoustics, Speech, and

Signal Processing, 1988.

[40] Steve Young et al., The HTK Book (Version 3.2). University of Cambridge,

2002.

[41] J. E. Kelley, “The cutting-plane method for solving convex programs,” Journal of the SIAM, vol. 8, pp. 703–712, 1960.

[42] A. Levin, “An algorithm of minimization of convex functions,” Soviet Mathematics Doklady, vol. 160, pp. 1244–1247, 1965.

[43] A. Tarasov, L. G. Khachiyan, and I. Erlich, “The method of inscribed ellipsoids,” Soviet Mathematics Doklady, vol. 37, 1988.

[44] L. G. Khachiyan and M. J. Todd, “On the complexity of approximating the

maximal inscribed ellipsoid for a polytope,” Mathematical Programming,

vol. 61, pp. 137–160, 1993.

[45] J. Elzinga and T. G. Moore, “A central cutting plane algorithm for the

convex programming problem,” Mathematical Programming, vol. 8, pp. 134–

145, 1975.

[46] D. S. Atkinson and P. M. Vaidya, “A cutting plane algorithm for convex programming that uses analytic centers,” Mathematical Programming, vol. 69,

pp. 1–43, 1995.

[47] Y. Nesterov, “Cutting plane algorithms from analytic centers: efficiency

estimates,” Mathematical Programming, vol. 69, pp. 149–176, 1995.


[48] A. Altman and K. C. Kiwiel, “A note on some analytic center cutting plane

methods for convex feasibility and minimization problems,” Computational

Optimization and Applications, vol. 5, pp. 175–180, 1996.

[49] K. C. Kiwiel, “Efficiency of the analytic center cutting plane method for

convex minimization,” SIAM Journal on Optimization, vol. 7, no. 2, pp.

336–346, 1997.

[50] [Online]. Available: http://www.mosek.com

[51] T. Emori, Y. Onishi, and K. Shinoda, “Automatic estimation of scaling

factors among probabilistic models in speech recognition,” in Interspeech,

2007, pp. 1453–1456.

[52] W. Karush, “Minima of functions of several variables with inequalities as

side constraints,” Master’s thesis, Dept. of Mathematics, Univ. of Chicago,

1939.

[53] H. W. Kuhn and A. W. Tucker, “Nonlinear programming,” in Proceedings

of 2nd Berkeley Symposium, 1951, pp. 481–492.


APPENDIX A

NOTATIONS IN THIS THESIS

K:            number of streams in a multi-stream HMM
J:            total number of states in an HMM system
T:            total number of frames in the training data
x_t:          an observation vector at time t, t = 1, . . . , T
x_t^(k):      kth stream of the observation vector at time t, t = 1, . . . , T
X:            observation sequence of the training data
y_t:          the HMM state that generates the observation vector x_t at time t
b_j(x_t):     state observation probability/likelihood of the observation vector x_t at state j
b_j^(k)(x_t^(k)): state observation probability/likelihood of the kth stream x_t^(k) of the observation vector at time t at state j
w_j^(k):      stream weight of the kth stream in state j of an HMM set
w_j:          stream weight vector for state j of an HMM set, where w_j = [w_j^(1), w_j^(2), . . . , w_j^(K)]′
λ_m:          HMM for the word m
Λ:            set of HMM models


APPENDIX B

MAXIMUM ENTROPY ESTIMATION

The detailed formulation of maximum entropy estimation can be found in Berger et al. [29]. We reproduce it here for easier reference. To explain the concept of maximum entropy, we first define p as the underlying probability distribution and p̃ as the empirical probability distribution. Assume we have (x_t, y_t), t = 1, . . . , T, in the training data, where x_t is the observation vector at time t and y_t is the HMM state that generates x_t. We have

\[
\tilde{p}(x, y) = \frac{1}{T} \times \text{number of times that } (x, y) \text{ occurs in the training data} \qquad (B.1)
\]

and

\[
\tilde{p}(x) = \frac{1}{T} \times \text{number of times that } x \text{ occurs in the training data.} \qquad (B.2)
\]

The goal is to find the probability distribution p with maximum entropy, subject to some constraints derived from the training data.

The entropy that we want to maximize is the conditional entropy of the posterior probability p(y|x), which is given by

\[
H(p) = -\sum_{x,y} \tilde{p}(x)\, p(y|x) \log p(y|x). \qquad (B.3)
\]

A more common notation for the conditional entropy H(p) is H(Y|X), where Y and X are random variables with joint distribution p̃(x) p(y|x). The notation H(p) is used to emphasize the dependence of the entropy on the probability distribution p.

The entropy H(p) is maximized subject to some constraints based on the statistics of the training data. To express the statistics, we define the indicator function, or feature, as

\[
f(x, y) =
\begin{cases}
1 & \text{if } (x, y) \text{ appears in the training data} \\
0 & \text{otherwise.}
\end{cases}
\qquad (B.4)
\]


The expected value of the feature f_i in the training data is given by

\[
\tilde{E}[f_i] = \sum_{x,y} \tilde{p}(x, y)\, f_i(x, y). \qquad (B.5)
\]

The expected value of f_i with respect to p(y|x) is

\[
E[f_i] = \sum_{x,y} \tilde{p}(x)\, p(y|x)\, f_i(x, y). \qquad (B.6)
\]

We constrain this expected value to be the same as the expected value of f_i in the training data:

\[
E[f_i] = \tilde{E}[f_i] \qquad (B.7)
\]
\[
\sum_{x,y} \tilde{p}(x)\, p(y|x)\, f_i(x, y) = \sum_{x,y} \tilde{p}(x, y)\, f_i(x, y). \qquad (B.8)
\]

We call this a constraint in the maximum entropy estimation. There are at most T constraints when all the (x, y) pairs in the training data are distinct.

We can set up the maximum entropy estimation as a constrained optimization problem:

\[
\text{find } p^* = \arg\max_{p} H(p) \qquad (B.9)
\]

subject to the constraints E[f_i] = Ẽ[f_i], ∀i. This is the primal problem.

For each feature f_i, we introduce a Lagrange multiplier λ_i and define the Lagrangian Λ(p, λ) by

\[
\Lambda(p, \lambda) = H(p) + \sum_i \lambda_i \big( E[f_i] - \tilde{E}[f_i] \big). \qquad (B.10)
\]

Holding λ fixed, we compute the unconstrained maximum of the Lagrangian Λ(p, λ). We denote by p_λ the p at which Λ(p, λ) achieves its maximum, and by Ψ(λ) the maximum value:

\[
p_\lambda = \arg\max_{p} \Lambda(p, \lambda) \qquad (B.11)
\]
\[
\Psi(\lambda) = \Lambda(p_\lambda, \lambda). \qquad (B.12)
\]

It can be shown that

\[
p_\lambda(y|x) = \frac{1}{Z_\lambda(x)} \exp\Big( \sum_i \lambda_i f_i(x, y) \Big) \qquad (B.13)
\]
\[
\Psi(\lambda) = -\sum_{x} \tilde{p}(x) \log Z_\lambda(x) + \sum_i \lambda_i \tilde{E}[f_i] \qquad (B.14)
\]


where Z_λ(x) is a normalizing constant that makes Σ_y p_λ(y|x) = 1 for all x:

\[
Z_\lambda(x) = \sum_{y} \exp\Big( \sum_i \lambda_i f_i(x, y) \Big) \qquad (B.15)
\]
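As a small numerical illustration of Eqns. (B.13) and (B.15), the following toy snippet (with arbitrary stand-in feature values, not data from this thesis) computes Z_λ(x) and the resulting posterior p_λ(y|x) for one fixed x:

```python
import numpy as np

lam = np.array([0.5, -1.0, 2.0])          # Lagrange multipliers, one per feature f_i
f = np.array([[1, 0, 1],                  # f[y, i] = f_i(x, y) for a fixed x and 4 candidate y's
              [0, 1, 0],
              [1, 1, 0],
              [0, 0, 1]])
scores = f @ lam                          # sum_i lam_i f_i(x, y) for each y
Z = np.exp(scores).sum()                  # Z_lambda(x), Eqn. (B.15)
p = np.exp(scores) / Z                    # p_lambda(y | x), Eqn. (B.13)
assert np.isclose(p.sum(), 1.0)           # the posterior is properly normalized
```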

The corresponding dual problem is to

\[
\text{find } \lambda^* = \arg\max_{\lambda} \Psi(\lambda). \qquad (B.16)
\]

By the KKT theorem [52, 53], the primal and dual solutions are related as

\[
p^* = p_{\lambda^*}. \qquad (B.17)
\]

The log probability L_{p̃}(p_λ) of the empirical distribution p̃ as predicted by the distribution p_λ is defined by

\[
L_{\tilde{p}}(p_\lambda) = \log \prod_{x,y} p_\lambda(y|x)^{\tilde{p}(x,y)} = \sum_{x,y} \tilde{p}(x, y) \log p_\lambda(y|x). \qquad (B.18)
\]

We can check that the dual function Ψ(λ) is in fact just this log probability of the exponential distribution p_λ; that is,

\[
\Psi(\lambda) = L_{\tilde{p}}(p_\lambda). \qquad (B.19)
\]

So we can conclude that the probability distribution p* with maximum entropy is the probability distribution of the form p_λ(y|x) that maximizes the posterior probability of the training data.


APPENDIX C

SIGNIFICANCE TESTS

In the significance tests, various stream weight estimation methods are compared. The abbreviations of the stream weight estimation methods and of the tests are summarized as follows:

SI1: 1-stream baseline model.

SI4: 4-stream baseline model.

GRID: 4-stream brute-force grid search.

LP-FR: frame-level LP with global stream weights and biases.

LP-WD-NO-CON: iterative word-level LP with state-dependent stream weights

and global biases (no constraint on ∆w).

LP-WD-W-B: iterative word-level LP with global stream weights and biases

(with constraint on ∆w).

LP-WD-WJ-B: iterative word-level LP with state-dependent stream weights

and global biases (with constraint on ∆w).

LP-WD-WJ-BJ: iterative word-level LP with state-dependent stream weights

and state-dependent biases (with constraint on ∆w and ∆v).

MP: Matched Pair Sentence Segment (Word Error) Test.

SP: Signed Paired Comparison (Speaker Word Accuracy Rate) Test.

WI: Wilcoxon Signed Rank (Speaker Word Accuracy Rate) Test.

MN: McNemar (Sentence Error) Test.


Table C.1: Significance tests. Each entry compares the row system with the column system under the four tests: "same" means no significant difference at the 5% level; otherwise the named system is significantly better.

LP-WD-NO-CON vs.
    SI4:           MP: same,           SP: same,  WI: same,  MN: SI4
    GRID:          MP: GRID,           SP: same,  WI: same,  MN: GRID
    LP-FR:         MP: LP-FR,          SP: same,  WI: same,  MN: LP-FR
    LP-WD-W-B:     MP: LP-WD-W-B,      SP: same,  WI: same,  MN: LP-WD-W-B
    LP-WD-WJ-B:    MP: LP-WD-WJ-B,     SP: same,  WI: same,  MN: LP-WD-WJ-B
    LP-WD-WJ-BJ:   MP: LP-WD-WJ-BJ,    SP: same,  WI: same,  MN: LP-WD-WJ-BJ
    SI1:           MP: SI1,            SP: same,  WI: same,  MN: SI1

SI4 vs.
    GRID:          MP: GRID,           SP: same,  WI: same,  MN: same
    LP-FR:         MP: LP-FR,          SP: same,  WI: same,  MN: LP-FR
    LP-WD-W-B:     MP: LP-WD-W-B,      SP: same,  WI: same,  MN: LP-WD-W-B
    LP-WD-WJ-B:    MP: LP-WD-WJ-B,     SP: same,  WI: same,  MN: LP-WD-WJ-B
    LP-WD-WJ-BJ:   MP: LP-WD-WJ-BJ,    SP: same,  WI: same,  MN: LP-WD-WJ-BJ
    SI1:           MP: SI1,            SP: same,  WI: same,  MN: same

GRID vs.
    LP-FR:         MP: same,           SP: same,  WI: same,  MN: same
    LP-WD-W-B:     MP: same,           SP: same,  WI: same,  MN: same
    LP-WD-WJ-B:    MP: same,           SP: same,  WI: same,  MN: same
    LP-WD-WJ-BJ:   MP: LP-WD-WJ-BJ,    SP: same,  WI: same,  MN: same
    SI1:           MP: same,           SP: same,  WI: same,  MN: same

LP-FR vs.
    LP-WD-W-B:     MP: same,  SP: same,  WI: same,  MN: same
    LP-WD-WJ-B:    MP: same,  SP: same,  WI: same,  MN: same
    LP-WD-WJ-BJ:   MP: same,  SP: same,  WI: same,  MN: same
    SI1:           MP: same,  SP: same,  WI: same,  MN: same

LP-WD-W-B vs.
    LP-WD-WJ-B:    MP: same,  SP: same,  WI: same,  MN: same
    LP-WD-WJ-BJ:   MP: same,  SP: same,  WI: same,  MN: same
    SI1:           MP: same,  SP: same,  WI: same,  MN: same

LP-WD-WJ-B vs.
    LP-WD-WJ-BJ:   MP: same,  SP: same,  WI: same,  MN: same
    SI1:           MP: same,  SP: same,  WI: same,  MN: same

LP-WD-WJ-BJ vs.
    SI1:           MP: same,  SP: same,  WI: same,  MN: same