
Landmark-Based Speech Recognition:

Spectrogram Reading, Support Vector Machines,

Dynamic Bayesian Networks, and Phonology

Mark Hasegawa-Johnson, jhasegaw@uiuc.edu

University of Illinois at Urbana-Champaign, USA

Lecture 9. Learning in Bayesian Networks

• Learning via Global Optimization of a Criterion
• Maximum-likelihood learning
– The Expectation Maximization algorithm
– Solution for discrete variables using Lagrangian multipliers
– General solution for continuous variables
– Example: Gaussian PDF
– Example: Mixture Gaussian
– Example: Bourlard-Morgan NN-DBN Hybrid
– Example: BDFK NN-DBN Hybrid
• Discriminative learning criteria
– Maximum Mutual Information
– Minimum Classification Error

What is Learning?

Imagine that you are a student who needs to learn how to propagate belief in a junction tree.

• Level 1 Learning (Rule-Based): I tell you the algorithm. You memorize it.

• Level 2 Learning (Category Formation): You observe examples (FHMM). You memorize them. From the examples, you build a cognitive model of each of the steps (moralization, triangulation, cliques, sum-product).

• Level 3 Learning (Performance): You try a few problems. When you fail, you optimize your understanding of all components of the cognitive model in order to minimize the probability of future failures.

What is Machine Learning?

• Level 1 Learning (Rule-Based): The programmer tells the computer how to behave. This is not usually called “machine learning.”

• Level 2 Learning (Category Formation): The program is given a numerical model of each category (e.g., a PDF, or a geometric model). Parameters of the numerical model are adjusted in order to represent the category.

• Level 3 Learning (Performance): All parameters in a complex system are simultaneously adjusted in order to optimize a global performance metric.

Learning Criteria

Optimization Methods

Maximum Likelihood Learning in a Dynamic Bayesian Network

• Given: a particular model structure
• Given: a set of training examples for that model, (b_m, o_m), 1 ≤ m ≤ M
• Estimate all model parameters (p(b|a), p(c|a), …) in order to maximize Σ_m log p(b_m, o_m | Λ) (formalized after the figure below)
• Recognition is Nested within Training: at each step of the training algorithm, we need to compute p(b_m, o_m, a_m, …, q_m) for every training token, using the sum-product algorithm.

[Figure: example Bayesian network with variables a, b, c, d, e, f, n, o, q; the training data are the observed pairs (b, o)]
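A compact formalization of the criterion on this slide, since the original equation was garbled in the transcript (Λ denotes the complete set of conditional probability tables):

\[
\hat{\Lambda} \;=\; \arg\max_{\Lambda} \sum_{m=1}^{M} \log p(b_m, o_m \mid \Lambda),
\qquad
p(b_m, o_m \mid \Lambda) \;=\; \sum_{a, c, d, e, f, \ldots, q} p(a, b_m, c, d, e, f, \ldots, q, o_m \mid \Lambda),
\]

where the marginalization over the hidden variables is exactly the sum-product computation mentioned in the last bullet.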

Baum’s Theorem (Baum and Eagon, Bull. Am. Math. Soc., 1967)
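The equations for this slide did not survive the transcript; the form of the result used in the rest of the lecture (and repeated in the summary as “argmax E[log(p)] = argmax[p]”) is the monotonicity guarantee behind EM, sketched here under assumed notation, with h standing for the hidden variables:

\[
Q(\Lambda', \Lambda) \;=\; \sum_{m} E\big[\log p(o_m, h \mid \Lambda') \,\big|\, o_m, \Lambda\big],
\qquad
Q(\Lambda', \Lambda) \ge Q(\Lambda, \Lambda)
\;\Longrightarrow\;
\sum_m \log p(o_m \mid \Lambda') \ge \sum_m \log p(o_m \mid \Lambda).
\]

In words: any re-estimate that increases the expected complete-data log-likelihood cannot decrease the data likelihood.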

Expectation Maximization (EM)

EM for a Discrete-Variable Bayesian Network

[Figure: the same example Bayesian network as above]

EM for a Discrete-Variable Bayesian Network

[Figure: the same example Bayesian network as above]

Solution: Lagrangian Method
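The slide’s equations are missing from the transcript; the standard outcome of the Lagrangian derivation, sketched here under assumed notation, is that each conditional probability table is re-estimated as a ratio of expected counts. Maximizing Q subject to the sum-to-one constraints (one Lagrange multiplier per parent configuration) gives, for any variable v with parents pa(v),

\[
\hat{p}\big(v = j \mid \mathrm{pa}(v) = i\big) \;=\;
\frac{\sum_{m=1}^{M} P\big(v = j,\, \mathrm{pa}(v) = i \mid \text{observations}_m, \Lambda\big)}
{\sum_{m=1}^{M} \sum_{j'} P\big(v = j',\, \mathrm{pa}(v) = i \mid \text{observations}_m, \Lambda\big)},
\]

where the posteriors are computed with the sum-product algorithm under the current parameters Λ.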

The EM Algorithm for a Large Training Corpus
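A minimal Python sketch of this loop for a toy two-variable network a → b (a hidden, b observed). The variable names and the toy model are illustrative assumptions, not from the original slides; the structure is the point: accumulate expected counts over the whole corpus in the E-step, then renormalize in the M-step.

```python
import numpy as np

def em_discrete_bn(observations, num_a, num_b, num_iters=50, seed=0):
    """EM for a toy discrete Bayesian network a -> b, with a hidden and b observed.

    Parameters are p(a) and p(b|a); each EM iteration accumulates expected
    counts over the whole training corpus, then renormalizes them.
    """
    rng = np.random.default_rng(seed)
    p_a = rng.dirichlet(np.ones(num_a))                 # p(a)
    p_b_given_a = rng.dirichlet(np.ones(num_b), num_a)  # rows: p(b | a)

    for _ in range(num_iters):
        count_a = np.zeros(num_a)
        count_ab = np.zeros((num_a, num_b))
        for b in observations:
            # E-step for one token: posterior p(a | b) under current parameters
            joint = p_a * p_b_given_a[:, b]              # p(a, b)
            posterior = joint / joint.sum()              # p(a | b)
            count_a += posterior
            count_ab[:, b] += posterior
        # M-step: normalized expected counts (the Lagrangian solution)
        p_a = count_a / count_a.sum()
        p_b_given_a = count_ab / count_ab.sum(axis=1, keepdims=True)

    return p_a, p_b_given_a

# Example usage with a small synthetic corpus of observed b values
data = [0, 0, 1, 1, 1, 2, 3, 3, 3, 3]
p_a, p_b_given_a = em_discrete_bn(data, num_a=2, num_b=4)
print(p_a)
print(p_b_given_a)
```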

EM for Continuous Observations (Liporace, IEEE Trans. Inf. Th., 1982)

Solution: Lagrangian Method

Example: Gaussian (Liporace, IEEE Trans. Inf. Th., 1982)
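The slide’s equations are again missing; the standard Liporace-style re-estimates, sketched under assumed notation (γ_m(q) is the posterior probability that hidden state q generated observation o_m under the current parameters), are

\[
\hat{\mu}_q = \frac{\sum_m \gamma_m(q)\, o_m}{\sum_m \gamma_m(q)},
\qquad
\hat{\Sigma}_q = \frac{\sum_m \gamma_m(q)\, (o_m - \hat{\mu}_q)(o_m - \hat{\mu}_q)^{\mathsf{T}}}{\sum_m \gamma_m(q)} .
\]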

Example: Mixture Gaussian (Juang, Levinson, and Sondhi, IEEE Trans. Inf. Th., 1986)

Example: Bourlard-Morgan Hybrid (Morgan and Bourlard, IEEE Sign. Proc. Magazine 1995)

Pseudo-Priors and Training Priors
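A one-line reconstruction of the idea behind the “pseudo-priors” (the slide’s own equations did not survive the transcript): the network estimates posteriors P(q | o), and dividing by the class priors P(q) estimated from the training alignment turns them into scaled likelihoods usable as HMM observation scores,

\[
p(o \mid q) \;=\; \frac{P(q \mid o)\, p(o)}{P(q)} \;\propto\; \frac{P(q \mid o)}{P(q)},
\]

since p(o) is the same for every state at a given frame.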

Training the Hybrid Model Using the EM Algorithm

The Solution: Q Back-Propagation

Merging the EM and Gradient Ascent Loops

Example: BDFK Hybrid (Bengio, De Mori, Flammia, and Kompe, Spe. Comm. 1992)

The Q Function for a BDFK Hybrid

The EM Algorithm for a BDFK Hybrid

Discriminative Learning Criteria

Maximum Mutual Information

Maximum Mutual Information

Maximum Mutual Information

Maximum Mutual Information
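The criterion itself is missing from this transcript; a standard per-corpus statement, under assumed notation (W_T is the reference transcription of training token m, Λ the acoustic model parameters), is

\[
F_{\mathrm{MMI}}(\Lambda) \;=\; \sum_{m=1}^{M} \log
\frac{p(O_m \mid W_T, \Lambda)\, P(W_T)}
{\sum_{W} p(O_m \mid W, \Lambda)\, P(W)} ,
\]

i.e., the log posterior probability of the reference transcription given the acoustics; with the language model P(W) held fixed, maximizing it is equivalent to maximizing the mutual information between W and O.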

An EM-Like Algorithm for MMI

An EM-Like Algorithm for MMI

MMI for Databases with Different Kinds of Transcription

• If every word’s start and end times are labeled, then W_T is the true word label, and W* is the label of the false word (or words!) with maximum modeled probability.

• If the start and end times of individual words are not known, then W_T is the true word sequence. W* may be computed as the best path (or paths) through a word lattice or N-best list. (Schlüter, Macherey, Müller, and Ney, Spe. Comm. 2001)
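A minimal Python sketch of the per-token computation described above, under assumed notation (the scores are joint log-probabilities log p(O, W); the function name and toy numbers are illustrative, not from the original slides):

```python
import numpy as np
from scipy.special import logsumexp

def mmi_token_objective(logp_true, logp_competitors):
    """Per-token MMI-style objective: log-posterior of the true transcription.

    logp_true        -- log p(O, W_T) for the reference transcription W_T
    logp_competitors -- log p(O, W) for competing hypotheses W (e.g., from an
                        N-best list or word lattice); W* is the best of these.
    """
    denominator = logsumexp([logp_true] + list(logp_competitors))
    w_star = int(np.argmax(logp_competitors))  # index of the strongest competitor
    return logp_true - denominator, w_star

# Example: the reference barely beats the best competitor
objective, w_star = mmi_token_objective(-41.2, [-41.9, -44.0, -47.3])
print(objective, w_star)
```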

Minimum Classification Error (McDermott and Katagiri, Comput. Speech Lang. 1994)

• Define empirical risk as “the number of word tokens for which the wrong HMM has higher log-likelihood than the right HMM”

• This risk definition has two nonlinearities:
– The zero-one loss function, u(x). Replace it with a differentiable loss function ℓ(x) (typically a sigmoid).
– The max. Replace it with a “softmax” function, log(exp(a)+exp(b)+exp(c)) (see the sketch after this slide).

• Differentiate the result; train all HMM parameters using error backpropagation.
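A minimal Python sketch of the smoothed MCE loss for one token, under assumed notation (the g values are per-class HMM log-likelihoods; a sigmoid replaces the zero-one loss and logsumexp replaces the max). This is an illustrative sketch of the idea, not the exact McDermott-Katagiri formulation.

```python
import numpy as np
from scipy.special import logsumexp, expit

def mce_loss(log_likelihoods, true_class, alpha=1.0, eta=10.0):
    """Smoothed minimum-classification-error loss for a single token.

    log_likelihoods -- per-class (per-HMM) log-likelihoods g_k
    true_class      -- index of the correct class
    alpha           -- sharpness of the softmax that replaces the max
    eta             -- slope of the sigmoid that replaces the zero-one loss
    """
    g = np.asarray(log_likelihoods, dtype=float)
    g_true = g[true_class]
    competitors = np.delete(g, true_class)
    # Soft maximum over competing classes: (1/alpha) * log sum exp(alpha * g_j)
    soft_max_competitor = logsumexp(alpha * competitors) / alpha
    # Misclassification measure: positive when a wrong model scores higher
    d = soft_max_competitor - g_true
    # Differentiable surrogate for the zero-one loss
    return expit(eta * d)

# Example: the correct model (class 0) wins, so the loss is near zero
print(mce_loss([-120.0, -135.0, -128.0], true_class=0))
```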

Summary

• What is Machine Learning?
– Choose an optimality criterion
– Find an algorithm that will adjust the model parameters to optimize that criterion

• Maximum Likelihood
– Baum’s theorem: argmax E[log(p)] = argmax[p]
– Apply directly to discrete, Gaussian, and mixture-Gaussian models
– Nest within error back-propagation (EBP) for the Bourlard-Morgan and BDFK hybrids

• Discriminative Criteria
– Maximum Mutual Information (MMI)
– Minimum Classification Error (MCE)