
Page 1:

Landmark-Based Speech Recognition:

Spectrogram Reading, Support Vector Machines,

Dynamic Bayesian Networks, and Phonology

Mark Hasegawa-Johnson, [email protected]

University of Illinois at Urbana-Champaign, USA

Page 2:

Lecture 9. Learning in Bayesian Networks

• Learning via Global Optimization of a Criterion
• Maximum-likelihood learning
  – The Expectation Maximization algorithm
  – Solution for discrete variables using Lagrange multipliers
  – General solution for continuous variables
  – Example: Gaussian PDF
  – Example: Mixture Gaussian
  – Example: Bourlard-Morgan NN-DBN Hybrid
  – Example: BDFK NN-DBN Hybrid
• Discriminative learning criteria
  – Maximum Mutual Information
  – Minimum Classification Error

Page 3:

What is Learning?

Imagine that you are a student who needs to learn how to propagate belief in a junction tree.

• Level 1 Learning (Rule-Based): I tell you the algorithm. You memorize it.

• Level 2 Learning (Category Formation): You observe examples (FHMM). You memorize them. From the examples, you build a cognitive model of each of the steps (moralization, triangulation, cliques, sum-product).

• Level 3 Learning (Performance): You try a few problems. When you fail, you optimize your understanding of all components of the cognitive model in order to minimize the probability of future failures.

Page 4:

What is Machine Learning?

• Level 1 Learning (Rule-Based): Programmer tells the computer how to behave. This is not usually called “machine learning.”

• Level 2 Learning (Category Formation): The program is given a numerical model of each category (e.g., a PDF, or a geometric model). Parameters of the numerical model are adjusted in order to represent the category.

• Level 3 Learning (Performance): All parameters in a complex system are simultaneously adjusted in order to optimize a global performance metric.

Page 5:

Learning Criteria

Page 6:

Optimization Methods

Page 7:

Maximum Likelihood Learning in a Dynamic Bayesian Network

• Given: a particular model structure
• Given: a set of training examples for that model, (b_m, o_m), 1 ≤ m ≤ M
• Estimate all model parameters (p(b|a), p(c|a), …) in order to maximize Σ_m log p(b_m, o_m | Θ)
• Recognition is Nested within Training: at each step of the training algorithm, we need to compute p(b_m, o_m, a_m, …, q_m) for every training token, using the sum-product algorithm.

[Figure: example Bayesian network with nodes a, b, c, d, e, f, n, o, q; the observed variables are b and o.]
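A clarifying note not on the original slide: the quantity the sum-product algorithm must deliver at each training step is the posterior over the hidden variables given the observed ones,

  p(a_m, c_m, d_m, …, q_m | b_m, o_m, Θ) = p(b_m, o_m, a_m, …, q_m | Θ) / p(b_m, o_m | Θ),

and the normalizer p(b_m, o_m | Θ) is exactly the per-token likelihood whose log the training criterion sums.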

Page 8:

Baum’s Theorem (Baum and Eagon, Bull. Am. Math. Soc., 1967)
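The equations on this slide are not preserved in the transcript. As a hedged restatement of the result the Summary cites (“argmax E[log(p)] = argmax[p]”): define the auxiliary function

  Q(Θ′, Θ) = E[ log p(observed, hidden | Θ′) | observed, Θ ].

Then Q(Θ′, Θ) ≥ Q(Θ, Θ) implies p(observed | Θ′) ≥ p(observed | Θ): any re-estimate that increases the expected complete-data log-likelihood cannot decrease the likelihood itself, so repeatedly maximizing Q climbs the likelihood.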

Page 9:

Expectation Maximization (EM)
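The slide’s equations are not preserved; the standard two-step formulation, consistent with the surrounding slides, is:

  E-step: with the current parameters Θ, run the sum-product algorithm on each training token to obtain the posterior over its hidden variables, and form Q(Θ′, Θ) = Σ_m E[ log p(b_m, o_m, hidden_m | Θ′) | b_m, o_m, Θ ].
  M-step: Θ ← argmax over Θ′ of Q(Θ′, Θ).

By Baum’s theorem, each iteration does not decrease Σ_m log p(b_m, o_m | Θ).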

Page 10:

EM for a Discrete-Variable Bayesian Network


Page 11:

EM for a Discrete-Variable Bayesian Network

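The equations on Pages 10–11 are not preserved in this transcript. As a hedged reconstruction of the standard result: for a discrete-variable network the Q function decomposes over the conditional probability tables, so the contribution of a factor such as p(c|a) is

  Σ_m Σ_a Σ_c p(A_m = a, C_m = c | b_m, o_m, Θ) log p′(c | a),

and the E-step therefore reduces to accumulating, for every child variable, the expected co-occurrence counts of the child with its parents under the current parameters.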

Page 12:

Solution: Lagrangian Method
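The derivation itself is not preserved; a minimal sketch of the standard argument, writing γ_m(a,c) = p(A_m = a, C_m = c | b_m, o_m, Θ): maximize the Q contribution above subject to Σ_c p(c|a) = 1 by introducing a Lagrange multiplier λ_a for each parent value,

  ∂/∂p(c|a) [ Σ_m γ_m(a,c) log p(c|a) + λ_a (1 − Σ_c′ p(c′|a)) ] = 0,

which gives Σ_m γ_m(a,c) / p(c|a) = λ_a, and after normalizing,

  p̂(c|a) = Σ_m γ_m(a,c) / Σ_m Σ_c′ γ_m(a,c′).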

Page 13:

The EM Algorithm for a Large Training Corpus
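A minimal Python sketch of one corpus-level EM iteration for the discrete-variable case, assuming a hypothetical routine posterior_marginals that stands in for the junction-tree / sum-product inference of the earlier lectures (illustrative code, not code from the course):

# One EM iteration: accumulate expected counts over the whole corpus, then
# normalize them (the closed form from the Lagrangian method on Page 12).
from collections import defaultdict

def em_iteration(cpts, corpus, families, posterior_marginals):
    """cpts[child][(parent_values, child_value)] = current probability.
    corpus: list of evidence dicts, e.g. {'b': 3, 'o': 7}, one per token.
    families: dict mapping each child variable to its tuple of parents.
    posterior_marginals: hypothetical inference routine returning the joint
    posterior over a requested set of variables given one token's evidence."""
    counts = defaultdict(lambda: defaultdict(float))

    # E-step: expected counts E[#(parents = u, child = v) | evidence, theta]
    for evidence in corpus:
        for child, parents in families.items():
            marg = posterior_marginals(cpts, evidence, parents + (child,))
            for assignment, prob in marg.items():
                u, v = assignment[:-1], assignment[-1]
                counts[child][(u, v)] += prob

    # M-step: normalize over the child value for each parent configuration
    new_cpts = {}
    for child, child_counts in counts.items():
        totals = defaultdict(float)
        for (u, v), n in child_counts.items():
            totals[u] += n
        new_cpts[child] = {(u, v): n / totals[u] for (u, v), n in child_counts.items()}
    return new_cpts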

Page 14:

EM for Continuous Observations (Liporace, IEEE Trans. Inf. Th., 1982)

Page 15:

Solution: Lagrangian Method

Page 16:

Example: Gaussian (Liporace, IEEE Trans. Inf. Th., 1982)
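The re-estimation formulas themselves are not reproduced in the transcript; the standard maximum-likelihood updates for a Gaussian observation density, presumably what this slide derived, are

  μ̂_q = Σ_t γ_t(q) o_t / Σ_t γ_t(q)
  Σ̂_q = Σ_t γ_t(q) (o_t − μ̂_q)(o_t − μ̂_q)ᵀ / Σ_t γ_t(q),

where γ_t(q) = p(q_t = q | observations, Θ) is the state posterior computed in the E-step.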

Page 17:

Example: Mixture Gaussian (Juang, Levinson, and Sondhi, IEEE Trans. Inf. Th., 1986)
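Again as a hedged reminder of the standard result (the slide’s own equations are not preserved): with component posteriors

  γ_t(q,k) = p(q_t = q, k_t = k | observations, Θ) = γ_t(q) · c_qk N(o_t; μ_qk, Σ_qk) / Σ_k′ c_qk′ N(o_t; μ_qk′, Σ_qk′),

the mixture weights update as ĉ_qk = Σ_t γ_t(q,k) / Σ_t γ_t(q), and the component means and covariances update exactly as in the single-Gaussian case with γ_t(q,k) in place of γ_t(q).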

Page 18:

Example: Bourlard-Morgan Hybrid (Morgan and Bourlard, IEEE Sign. Proc. Magazine 1995)
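A hedged one-line summary of the hybrid idea, for readers without the original figure: the neural network is trained to estimate state posteriors p(q | o_t), which are converted into scaled likelihoods for use inside the DBN/HMM by dividing by the class priors,

  p(o_t | q) ∝ p(q | o_t) / p(q),

which is why the next slide is concerned with the choice of priors.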

Page 19:

Pseudo-Priors and Training Priors

Page 20:

Training the Hybrid Model Using the EM Algorithm

Page 21:

The Solution: Q Back-Propagation
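The details of Pages 20–22 are not preserved; a hedged reading based on the slide titles: the only terms of the Q function that depend on the network weights have the form Σ_t Σ_q γ_t(q) log p(q | o_t, W), so maximizing Q with respect to the network amounts to cross-entropy training with the E-step posteriors γ_t(q) as soft targets, carried out by ordinary error backpropagation; the EM and gradient-ascent loops can then be interleaved (a generalized EM procedure) rather than running backpropagation to convergence inside every M-step.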

Page 22:

Merging the EM and Gradient Ascent Loops

Page 23:

Example: BDFK Hybrid (Bengio, De Mori, Flammia, and Kompe, Spe. Comm. 1992)

Page 24:

The Q Function for a BDFK Hybrid

Page 25:

The EM Algorithm for a BDFK Hybrid

Page 26:

Discriminative Learning Criteria

Page 27:

Maximum Mutual Information
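The defining equation is not preserved in the transcript; the usual form of the criterion, written with the WT / W notation of Page 33, is

  F_MMI(Θ) = Σ_m [ log p(O_m | WT_m, Θ) P(WT_m) − log Σ_W p(O_m | W, Θ) P(W) ],

i.e., the log-probability of the correct transcription minus the log-probability summed over all competing hypotheses; with the language model P(W) held fixed, maximizing F_MMI maximizes the empirical mutual information between the observation sequences and the word strings.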

Page 28:

Maximum Mutual Information

Page 29:

Maximum Mutual Information

Page 30:

Maximum Mutual Information

Page 31:

An EM-Like Algorithm for MMI
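A hedged sketch of the kind of update usually meant by an “EM-like” MMI algorithm (extended Baum-Welch); the slide’s own equations are not preserved, so this may differ in detail. For a discrete parameter p(b|a),

  p̂(b|a) = ( c_num(a,b) − c_den(a,b) + D·p(b|a) ) / ( c_num(a) − c_den(a) + D ),

where c_num are expected counts accumulated with the true transcription WT clamped, c_den are expected counts accumulated over the competing hypotheses (the denominator of F_MMI), and D is a constant chosen large enough to keep every re-estimated probability nonnegative.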

Page 32:

An EM-Like Algorithm for MMI

Page 33:

MMI for Databases with Different Kinds of Transcription

• If every word’s start and end times are labeled, then WT is the true word label, and W* is the label of the false word (or words!) with maximum modeled probability.

• If the start and end times of individual words are not known, then WT is the true word sequence. W* may be computed as the best path (or paths) through a word lattice or N-best list. (Schlüter, Macherey, Müller, and Ney, Spe. Comm. 2001)

Page 34:

Minimum Classification Error (McDermott and Katagiri, Comput. Speech Lang. 1994)

• Define empirical risk as “the number of word tokens for which the wrong HMM has higher log-likelihood than the right HMM”

• This risk definition has two nonlinearities:
  – The zero-one loss function, u(x). Replace it with a differentiable loss function ℓ(x) (typically a sigmoid).
  – The max. Replace it with a “softmax” function, log(exp(a) + exp(b) + exp(c)).

• Differentiate the result; train all HMM parameters using error backpropagation.
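As a hedged illustration of the two smoothing substitutions in the bullets above (not code from the course), a minimal Python sketch, assuming g_true and g_wrong are per-token log-likelihoods of the correct and competing HMMs:

import numpy as np

def mce_loss(g_true, g_wrong, alpha=1.0):
    # "softmax" replacement for the max over competitors: log(exp(a)+exp(b)+...)
    # (for very negative log-likelihoods, use a log-sum-exp trick such as
    # scipy.special.logsumexp to avoid underflow)
    g_best_wrong = np.log(np.sum(np.exp(np.asarray(g_wrong, dtype=float))))
    # misclassification measure: positive when some wrong model beats the right one
    d = g_best_wrong - g_true
    # differentiable (sigmoid) replacement for the zero-one loss u(d)
    return 1.0 / (1.0 + np.exp(-alpha * d))

# example: the correct HMM scores -100, two competitors score -103 and -105
print(mce_loss(-100.0, [-103.0, -105.0]))   # about 0.04: a small, nonzero loss

Because the smoothed loss is differentiable in the HMM scores, its gradient can be pushed back into all HMM parameters by error backpropagation, as the final bullet describes.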

Page 35:

Summary

• What is Machine Learning?
  – Choose an optimality criterion.
  – Find an algorithm that will adjust model parameters to optimize the criterion.
• Maximum Likelihood
  – Baum’s theorem: argmax E[log(p)] = argmax[p]
  – Apply directly to discrete, Gaussian, and mixture-Gaussian models.
  – Nest within error backpropagation for the Bourlard-Morgan and BDFK hybrids.
• Discriminative Criteria
  – Maximum Mutual Information (MMI)
  – Minimum Classification Error (MCE)