The Infinite Hidden Markov Model

Matthew J. Beal

Gatsby Computational Neuroscience Unit, University College London

http://www.gatsby.ucl.ac.uk/~beal

Mar 2002

work with Zoubin Ghahramani and Carl E. Rasmussen

http://www.gatsby.ucl.ac.uk/

Abstract

We show that it is possible to extend hidden Markov models to have a countably infinite number of hidden states. By using the theory of Dirichlet processes we can implicitly integrate out the infinitely many transition parameters, leaving only three hyperparameters which can be learned from data. These three hyperparameters define a hierarchical Dirichlet process capable of capturing a rich set of transition dynamics. The three hyperparameters control the time scale of the dynamics, the sparsity of the underlying state-transition matrix, and the expected number of distinct hidden states in a finite sequence. In this framework it is also natural to allow the alphabet of emitted symbols to be infinite: consider, for example, symbols being possible words appearing in English text.

Overview

• Introduce the idea of an HMM with an infinite number of states and symbols.

• Make use of the infinite Dirichlet process, and generalise this to a two-level hierarchical Dirichlet process over transitions and states.

• Explore the properties of the prior over trajectories.

• Present inference and learning procedures for the infinite HMM.

Example: Infinite Gaussian Mixtures

Following Neal (1991), Rasmussen (2000) showed that it is possible to do inference in countably infinite mixtures of Gaussians.

P(x_1, …, x_N | π, μ, Σ) = ∏_{i=1}^{N} ∑_{j=1}^{K} π_j N(x_i | μ_j, Σ_j)

                         = ∑_s P(s, x | π, μ, Σ) = ∑_s ∏_{i=1}^{N} ∏_{j=1}^{K} [π_j N(x_i | μ_j, Σ_j)]^{δ(s_i, j)}

Joint distribution of indicators is multinomial

P(s_1, …, s_N | π) = ∏_{j=1}^{K} π_j^{n_j},     n_j = ∑_{i=1}^{N} δ(s_i, j).

Mixing proportions are given symmetric Dirichlet prior

P(π | β) = [Γ(β) / Γ(β/K)^K] ∏_{j=1}^{K} π_j^{β/K − 1}

Infinite Mixtures (continued)

Joint distribution of indicators is multinomial

P(s_1, …, s_N | π) = ∏_{j=1}^{K} π_j^{n_j},     n_j = ∑_{i=1}^{N} δ(s_i, j).

Mixing proportions are given a conjugate, symmetric Dirichlet prior

P(π | β) = [Γ(β) / Γ(β/K)^K] ∏_{j=1}^{K} π_j^{β/K − 1}

Integrating out the mixing proportions we obtain

P(s_1, …, s_N | β) = ∫ dπ P(s_1, …, s_N | π) P(π | β) = [Γ(β) / Γ(N + β)] ∏_{j=1}^{K} [Γ(n_j + β/K) / Γ(β/K)]

This yields a Dirichlet Process over indicator variables.

• The probability of a set of states is only a function of the indicator variables.

• Analytical tractability of this integral means we can work with the finite number of indicators rather than an infinite-dimensional mixing proportion.
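The marginal above can be checked numerically. The sketch below is not from the slides; the function name and toy counts are illustrative. It evaluates the closed form with log-gamma functions and compares it against a Monte Carlo average of ∏_j π_j^{n_j} under π ~ Dirichlet(β/K, …, β/K).

```python
import numpy as np
from scipy.special import gammaln

def log_marginal_indicators(counts, beta):
    """log P(s_1, ..., s_N | beta) under a symmetric Dirichlet(beta/K) prior on pi."""
    counts = np.asarray(counts, dtype=float)
    K, N = counts.size, counts.sum()
    return (gammaln(beta) - gammaln(N + beta)
            + np.sum(gammaln(counts + beta / K) - gammaln(beta / K)))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    counts, beta = np.array([3, 1, 0, 2]), 1.5             # toy indicator counts n_j, K = 4
    pis = rng.dirichlet(np.full(counts.size, beta / counts.size), size=200_000)
    mc = np.log(np.mean(np.prod(pis ** counts, axis=1)))   # Monte Carlo estimate of the marginal
    print(log_marginal_indicators(counts, beta), mc)       # the two numbers should agree closely
```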

Dirichlet Process Conditional Probabilities

Conditional Probabilities: Finite K

P(s_i = j | s_{−i}, β) = (n_{−i,j} + β/K) / (N − 1 + β)

where s_{−i} denotes all indices except i, and n_{−i,j} is the total number of observations of indicator j excluding the ith. DP: more populous classes are more likely to be joined.

Conditional Probabilities: Infinite K

Taking the limit as K → ∞ yields the conditionals

P(s_i = j | s_{−i}, β) =  n_{−i,j} / (N − 1 + β)    if j is represented
                          β / (N − 1 + β)           for all j not represented

Left-over mass β ⇒ countably infinite number of indicator settings.

Gibbs sampling from posterior of indicators is easy!
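A minimal sketch of one such Gibbs sweep (illustrative, not the authors' code), assuming a user-supplied per-cluster log-likelihood `loglik(i, c)` that also accepts `c = None` as the prior predictive for a brand-new cluster; the common factor 1/(N − 1 + β) cancels on normalisation and is omitted.

```python
import numpy as np

def crp_gibbs_sweep(s, beta, loglik, rng):
    """One Gibbs sweep over indicators s (numpy int array of cluster labels).

    loglik(i, c): log p(x_i | data currently assigned to cluster c);
    loglik(i, None): log prior-predictive of x_i under a brand-new cluster.
    """
    for i in range(len(s)):
        s[i] = -1                                        # hold point i out of its cluster
        labels, counts = np.unique(s[s >= 0], return_counts=True)
        logp = [np.log(c) + loglik(i, lab) for lab, c in zip(labels, counts)]
        logp.append(np.log(beta) + loglik(i, None))      # unrepresented-cluster mass beta
        logp = np.asarray(logp)
        p = np.exp(logp - logp.max()); p /= p.sum()
        k = rng.choice(len(p), p=p)
        s[i] = labels[k] if k < len(labels) else (labels.max() + 1 if len(labels) else 0)
    return s
```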

Infinite Hidden Markov Models

[Graphical model: a chain of hidden states s_1 → s_2 → s_3 → … → s_T, each emitting a symbol y_1, y_2, y_3, …, y_T.]

Motivation: We want to model data with HMMs without worrying about overfitting, picking the number of states, picking architectures...

Review of Hidden Markov Models (HMMs)

Generative graphical model: hidden states st, emitted symbols yt

[Graphical model: a chain of hidden states s_1 → s_2 → s_3 → … → s_T, each emitting a symbol y_1, y_2, y_3, …, y_T.]

The hidden state evolves as a Markov process

P(s_{1:T} | A) = P(s_1 | π_0) ∏_{t=1}^{T−1} P(s_{t+1} | s_t),     P(s_{t+1} = j | s_t = i) = A_ij,     i, j ∈ {1, …, K}.

Observation model: e.g. discrete symbols y_t from an alphabet, produced according to an emission matrix, P(y_t = ℓ | s_t = i) = E_iℓ.
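For concreteness, a brief sketch of this finite-K generative process (illustrative parameter values; not from the slides):

```python
import numpy as np

def sample_hmm(pi0, A, E, T, rng):
    """Sample (s_1..s_T, y_1..y_T) from a finite HMM with initial distribution pi0,
    transition matrix A (A[i, j] = P(s_{t+1}=j | s_t=i)) and emission matrix E
    (E[i, l] = P(y_t=l | s_t=i))."""
    s = np.empty(T, dtype=int)
    y = np.empty(T, dtype=int)
    s[0] = rng.choice(len(pi0), p=pi0)
    for t in range(T):
        if t > 0:
            s[t] = rng.choice(A.shape[1], p=A[s[t - 1]])   # Markov dynamics
        y[t] = rng.choice(E.shape[1], p=E[s[t]])           # discrete emission
    return s, y

rng = np.random.default_rng(0)
A = np.array([[0.9, 0.1], [0.2, 0.8]])
E = np.array([[0.7, 0.3], [0.1, 0.9]])
states, symbols = sample_hmm(np.array([0.5, 0.5]), A, E, T=20, rng=rng)
```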

Limitations of the standard HMM

There are some important limitations of the standard estimation procedure and the model definition of HMMs:

• Maximum likelihood methods do not consider the complexity of the model, leading to over/underfitting.

• The model structure has to be specified in advance. In many scenarios it is unreasonable to believe a priori that the data was generated from a fixed number of discrete states.

• We don’t know what form the state transition matrix should take.

• We might not want to restrict ourselves to a finite alphabet of symbols.

Previous authors (MacKay 1997; Stolcke and Omohundro 1993) have attempted to approximate a Bayesian analysis.

Generative model for hidden state

Propose a transition to s_{t+1} conditional on the current state s_t. Existing transitions are more probable, thus giving rise to typical trajectories.

Example transition count matrix (rows: current state s_t; columns: next state s_{t+1}):

n_ij =
[  3  17   0  14 ]
[ 19   2   7   0 ]
[  0   3   1   8 ]
[ 11   7   4   3 ]

Given current state i, the next state s_{t+1} is chosen as follows:

existing transition (j ≠ i):  n_ij / (∑_j n_ij + β + α)
self-transition (j = i):      (n_ii + α) / (∑_j n_ij + β + α)
oracle:                       β / (∑_j n_ij + β + α)

If the oracle is chosen, propose according to the oracle occupancies: previously chosen oracle states are more probable.

Example oracle count vector (over next states s_{t+1}):

n^o_j = [ 4  0  9  11 ]

existing state j:  n^o_j / (∑_j n^o_j + γ)
new state:         γ / (∑_j n^o_j + γ)
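A sketch of this two-level transition process as a sampler (illustrative; the dictionary-based count tables and the function name are assumptions, not the authors' code). Repeatedly calling it produces trajectories of the kind shown on the next slide.

```python
import numpy as np

def sample_next_state(i, n, n_oracle, alpha, beta, gamma, rng):
    """n[i][j]: transition counts, n_oracle[j]: oracle counts (plain dicts)."""
    row = n.setdefault(i, {})
    denom = sum(row.values()) + alpha + beta
    candidates = sorted(set(row) | {i})
    # self-transition gets n_ii + alpha, existing transitions get n_ij, the oracle gets beta
    probs = [(row.get(j, 0) + (alpha if j == i else 0)) / denom for j in candidates]
    probs.append(beta / denom)
    k = rng.choice(len(probs), p=np.array(probs) / sum(probs))
    if k < len(candidates):
        j = candidates[k]
    else:
        # Oracle: draw from its own occupancies n^o_j, or create a brand-new state
        o_denom = sum(n_oracle.values()) + gamma
        o_states = sorted(n_oracle)
        o_probs = [n_oracle[j] / o_denom for j in o_states] + [gamma / o_denom]
        ko = rng.choice(len(o_probs), p=np.array(o_probs) / sum(o_probs))
        j = o_states[ko] if ko < len(o_states) else max(o_states + [i]) + 1
        n_oracle[j] = n_oracle.get(j, 0) + 1
    row[j] = row.get(j, 0) + 1                             # record the chosen transition
    return j

# Example: a 50-step trajectory sampled from the prior
rng = np.random.default_rng(1)
n, n_o = {}, {}
traj = [0]
for _ in range(50):
    traj.append(sample_next_state(traj[-1], n, n_o, alpha=1.0, beta=1.0, gamma=10.0, rng=rng))
```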

Trajectories under the Prior

[Four trajectories sampled from the prior: explorative (α = 0.1, β = 1000, γ = 100); repetitive (α = 0, β = 0.1, γ = 100); self-transitioning (α = 2, β = 2, γ = 20); ramping (α = 1, β = 1, γ = 10000).]

Just 3 hyperparameters provide:

• slow/fast dynamics (α)
• sparse/dense transition matrices (β)
• many/few states (γ)
• left→right structure, with multiple interacting cycles

Real data: Lewis Carroll’s Alice’s Adventures in Wonderland

[Plot: word identity (0 to 2500) against word position in the text (0 to 25,000) for Alice’s Adventures in Wonderland.]

With a finite alphabet, a model would assign zero likelihood to a test sequence containing any symbols not present in the training set(s). In iHMMs, at each time step the hidden state s_t emits a symbol y_t, which can possibly come from an infinite alphabet.

state s_t → state s_{t+1} : transition process, counts n_ij

state s_t → symbol y_t : emission process, counts m_iq

Infinite HMMs

[Graphical model: a chain of hidden states s_1 → s_2 → s_3 → … → s_T, each emitting a symbol y_1, y_2, y_3, …, y_T.]

Approach: Countably infinite hidden states. Deal with both transition and emission processes using a two-level hierarchical Dirichlet process.

Transition process (current state i → next state j):

self-transition (j = i):        (n_ii + α) / (∑_j n_ij + β + α)
existing transition (j ≠ i):    n_ij / (∑_j n_ij + β + α)
oracle:                         β / (∑_j n_ij + β + α)
    oracle → existing state j:  n^o_j / (∑_j n^o_j + γ)
    oracle → new state:         γ / (∑_j n^o_j + γ)

Emission process (state i → symbol q):

existing emission:               m_iq / (∑_q m_iq + β^e)
oracle:                          β^e / (∑_q m_iq + β^e)
    oracle → existing symbol q:  m^o_q / (∑_q m^o_q + γ^e)
    oracle → new symbol:         γ^e / (∑_q m^o_q + γ^e)

Gibbs sampling over the states is possible, while all parameters are implicitly integrated out; only five hyperparameters need to be inferred (Beal, Ghahramani, and Rasmussen, 2001).
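The emission side can be sketched in the same way as the transition sampler shown earlier (illustrative code, not the authors' implementation): state i reuses an existing emission with probability proportional to m_iq, or consults the emission oracle with mass β^e, which in turn may invent a brand-new symbol with mass γ^e.

```python
import numpy as np

def sample_symbol(i, m, m_oracle, beta_e, gamma_e, rng):
    """m[i][q]: emission counts, m_oracle[q]: emission-oracle counts (plain dicts)."""
    row = m.setdefault(i, {})
    denom = sum(row.values()) + beta_e
    symbols = sorted(row)
    probs = [row[q] / denom for q in symbols] + [beta_e / denom]   # last entry: oracle
    k = rng.choice(len(probs), p=np.array(probs) / sum(probs))
    if k < len(symbols):
        q = symbols[k]
    else:
        # Emission oracle: an existing symbol in proportion to m^o_q, or a new symbol
        o_denom = sum(m_oracle.values()) + gamma_e
        o_syms = sorted(m_oracle)
        o_probs = [m_oracle[q] / o_denom for q in o_syms] + [gamma_e / o_denom]
        ko = rng.choice(len(o_probs), p=np.array(o_probs) / sum(o_probs))
        q = o_syms[ko] if ko < len(o_syms) else (max(o_syms) + 1 if o_syms else 0)
        m_oracle[q] = m_oracle.get(q, 0) + 1
    row[q] = row.get(q, 0) + 1
    return q
```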

Inference and learning

For a given observed sequence...

Inference: What is the distribution over hidden state trajectories?

In this model, we can straightforwardly sample the hidden state using Gibbs sampling. The Gibbs probabilities are

P(s_t | s_{−t}, y_{1:T}, α, β, γ, β^e, γ^e)

Unfortunately, exact calculation takes in general O(T^2). We can use a sampling argument to reduce this to O(T).

Learning: What can we infer about the transition and emission processes?

Involves updating the hyperparameters α, β, γ, β^e and γ^e.

Q. How do we learn the transition and emission parameters?
A. With infinite Dirichlet priors, they are represented non-parametrically as count matrices, capturing countably infinite state and observable spaces.

Hyperparameter updates

In a standard Dirichlet process with concentration β, the likelihood of observing counts n_j is

P({n_j}_{j=1,…,K} | β) ∝ [ (β/β) · (1/(1+β)) ⋯ ((n_1−1)/(n_1−1+β)) ] ⋯ [ (β/(n−n_K+β)) · (1/(n−n_K+1+β)) ⋯ ((n_K−1)/(n−1+β)) ]

                        ∝ β^K Γ(β) / Γ(n + β),     where n = ∑_j n_j

More specifically in this model, the likelihood from (α, β) is approximately*

P({n_i1, …, n_ii, …, n_iK} | α, β) = β^{K(i)−1} · [Γ(α + β) / Γ(∑_j n_ij + α + β)] · [Γ(n_ii + α) / Γ(α)]

All counts are their relevant expectations** over Gibbs sweeps.
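As an illustration of how such a likelihood can drive a hyperparameter update (a sketch under assumed details, not the authors' scheme), the term β^K Γ(β)/Γ(n+β) can be evaluated on a grid of candidate β values and combined with a prior to give a simple discretised posterior from which β is resampled:

```python
import numpy as np
from scipy.special import gammaln

def log_dp_concentration_lik(K, n, beta):
    """log of beta^K * Gamma(beta) / Gamma(n + beta)."""
    return K * np.log(beta) + gammaln(beta) - gammaln(n + beta)

counts = np.array([3, 17, 14, 19, 2, 7])        # toy represented-state counts n_j
K, n = len(counts), counts.sum()
grid = np.linspace(0.01, 20.0, 2000)            # candidate beta values
logp = log_dp_concentration_lik(K, n, grid)     # add a log prior over beta here if desired
post = np.exp(logp - logp.max()); post /= post.sum()
beta_new = np.random.default_rng(0).choice(grid, p=post)
```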

A toy example: ascending-descending

ABCDEFEDCBABCDEFEDCBABCDEFEDCBABCDEFEDCB...

This requires minimally 10 states to capture.

[Plot: number of represented states K (log scale, 10^0 to 10^2) against Gibbs sweeps (0 to 100).]

Synthetic examples: expansive & compressive HMMs

[Figure: true transition and emission matrices, alongside count-matrix pairs n(1), m(1), n(80), m(80), n(150), m(150) (top row) and n(1), m(1), n(100), m(100), n(230), m(230) (bottom row).]

True and learned transition and emission probabilities/count matrices, up to permutation of the hidden states; lighter boxes correspond to higher values.

(top row) Expansive HMM. Count matrix pairs {n, m} are displayed after {1, 80, 150} sweeps of Gibbs sampling.

(bottom row) Compressive HMM. Similar to the top row, displaying count matrices after {1, 100, 230} sweeps of Gibbs sampling.

See hmm2.avi and hmm3.avi

Real data: Alice

• Trained on the 1st chapter (10787 characters: A…Z, 〈space〉, 〈period〉) = 2046 words.

• iHMM initialized with a random sequence of 30 states. α = 0; β = β^e = γ = γ^e = 1.

• 1000 Gibbs sweeps (=several hours in Matlab).

• n matrix starts out full, ends up sparse (14% full).

200 character fantasies...

1: LTEMFAODEOHADONARNL SAE UDSEN DTTET ETIR NM H VEDEIPH L.SHYIPFADBOMEBEGLSENTEI GN HEOWDA EELE HEFOMADEE IS AL THWRR KH TDAAAC CHDEE OIGWOHRBOOLEODT DSECT M OEDPGTYHIHNOL CAEGTR.ROHA NOHTR.L

250: AREDIND DUW THE JEDING THE BUBLE MER.FION SO COR.THAN THALD THEBATHERSTHWE ICE WARLVE I TOMEHEDS I LISHT LAKT ORTH.A CEUT.INY OBER.GERDPOR GRIEN THE THIS FICE HIGE TO SO.A REMELDLE THEN.SHILD TACE G

500: H ON ULY KER IN WHINGLE THICHEY TEIND EARFINK THATH IN ATS GOAPAT.FO ANICES IN RELL A GOR ARGOR PEN EUGUGTTHT ON THIND NOW BE WIT ORANND YOADE WAS FOUE CAIT DOND SEAS HAMBER ANK THINK ME.HES URNDEY

1000: BUT.THOUGHT ANGERE SHERE ACRAT OR WASS WILE DOOF SHE.WAS ABBOREGLEAT DOING ALIRE AT TOO AMIMESSOF ON SHAM LUZDERY AMALT ANDING A BUPLABUT THE LIDTIND BEKER HAGE FEMESETIMEY BUT NOTE GD I SO CALL OVE

Alice Results: Number of States and Hyperparameters

[Two plots against Gibbs sweeps (0 to 1000): left, the number of represented states K (approximately 34 to 50); right, the hyperparameters β, γ, β^e and γ^e (approximately 0.8 to 1.8).]

Likelihood calculation: The infinite-state particle filter

• The likelihood for a particular observable sequence of symbols involves intractable sums over the possible hidden state trajectories.

• Integrating out the parameters in any HMM induces long-range dependencies between states.

– We cannot use standard tricks like dynamic programming.
– The number of distinct states can grow with the sequence length as new states are generated. Starting with K distinct states, at time t there could be K + t possible distinct states, making the total number of trajectories over the entire length of the sequence (K + T)!/K!.

We propose estimating the likelihood of a test sequence given a learned model using particle filtering.

The idea is to start with some number of particles R distributed on the represented hidden states according to the final state marginal from the training sequence (some of the R may fall onto new states).

The infinite-state particle filter (cont.)

Starting from the set of particles {s_t^1, …, s_t^R}, the tables from the training sequences {n, n^o, m, m^o}, and t = 1, the recursive procedure is as specified below, where P(s_t | y_1, …, y_{t−1}) ≈ (1/R) ∑_r δ(s_t, s_t^r):

1. Compute l_t^r = P(y_t | s_t = s_t^r) for each particle r.
2. Calculate l_t = (1/R) ∑_r l_t^r ≈ P(y_t | y_1, …, y_{t−1}).
3. Resample R particles s_t^r ∼ (1/∑_{r′} l_t^{r′}) ∑_{r′} l_t^{r′} δ(s_t, s_t^{r′}).
4. Update transition and emission tables n^r, m^r for each particle.
5. For each r sample the forward dynamics: s_{t+1}^r ∼ P(s_{t+1} | s_t = s_t^r, n^r, m^r); this may cause particles to land on novel states. Update n^r and m^r.
6. If t < T, go to step 1 with t = t + 1.

The log likelihood of the test sequence is computed as ∑_t log l_t.

Since it is a discrete state space, with much of the probability mass concentrated on the represented states, it is feasible to use O(K) particles.
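A compact sketch of the particle-filter recursion above (illustrative; the emission predictive, the forward dynamics and the count-table bookkeeping of step 4 are left as user-supplied callables, which could for example be backed by the count-based samplers sketched earlier):

```python
import copy
import numpy as np

def ihmm_test_loglik(y, particles, emit_lik, trans_sample, rng):
    """y: test symbol sequence.  particles: list of R (state, tables) pairs, where
    'tables' holds that particle's copy of the count matrices {n, n_o, m, m_o}.
    emit_lik(state, symbol, tables) -> P(y_t | s_t);
    trans_sample(state, symbol, tables, rng) -> next state (possibly novel), and is
    assumed to update the particle's tables with the observed (s_t, y_t, s_{t+1})."""
    total = 0.0
    for y_t in y:
        # 1-2. per-particle likelihoods l_t^r and their average l_t ~ P(y_t | y_1..y_{t-1})
        w = np.array([emit_lik(s, y_t, tab) for s, tab in particles])
        total += np.log(w.mean())
        # 3. resample R particles in proportion to l_t^r (each keeps its own table copy)
        idx = rng.choice(len(particles), size=len(particles), p=w / w.sum())
        particles = [copy.deepcopy(particles[i]) for i in idx]
        # 4-5. update tables and sample the forward dynamics, possibly creating new states
        particles = [(trans_sample(s, y_t, tab, rng), tab) for s, tab in particles]
    return total                                 # = sum_t log l_t
```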

Conclusions

• It is possible to extend HMMs to a countably infinite number of hidden states using a hierarchical Dirichlet process.

• The resulting model is a non-parametric Bayesian HMM.

• With just three hyperparameters, prior trajectories demonstrate rich, appealing structure.

• We can use the same process for emissions.

• Tools:

– O(T) approximate Gibbs sampler for inference.
– Equations for hyperparameter learning.
– Particle filtering scheme for likelihood evaluation.

• Presented empirical results on toy data sets and Alice.

Some References

1. Attias, H. (1999) Inferring parameters and structure of latent variable models by variational Bayes. Proc. 15th Conference on Uncertainty in Artificial Intelligence.

2. Barber, D. and Bishop, C. M. (1998) Ensemble Learning for Multi-Layer Networks. Advances in Neural Information Processing Systems 10.

3. Bishop, C. M. (1999) Variational principal components. Proceedings of the Ninth International Conference on Artificial Neural Networks, ICANN'99, pp. 509–514.

4. Beal, M. J., Ghahramani, Z. and Rasmussen, C. E. (2001) The Infinite Hidden Markov Model. To appear in NIPS 2001.

5. Ghahramani, Z. and Beal, M. J. (1999) Variational inference for Bayesian mixtures of factor analysers. In Neural Information Processing Systems 12.

6. Ghahramani, Z. and Beal, M. J. (2000) Propagation algorithms for variational Bayesian learning. In Neural Information Processing Systems 13.

7. Hinton, G. E. and van Camp, D. (1993) Keeping neural networks simple by minimizing the description length of the weights. In Proc. 6th Annual Workshop on Computational Learning Theory, pp. 5–13. ACM Press, New York, NY.

8. MacKay, D. J. C. (1995) Probable networks and plausible predictions: a review of practical Bayesian methods for supervised neural networks. Network: Computation in Neural Systems 6: 469–505.

9. Miskin, J. and MacKay, D. J. C. (2000) Ensemble learning independent component analysis for blind separation and deconvolution of images. In Advances in Independent Component Analysis, M. Girolami (ed.), pp. 123–141, Springer, Berlin.

10. Neal, R. M. (1991) Bayesian mixture modeling by Monte Carlo simulation. Technical Report CRG-TR-91-2, Dept. of Computer Science, University of Toronto, 23 pages.

11. Neal, R. M. (1994) Priors for infinite networks. Technical Report CRG-TR-94-1, Dept. of Computer Science, University of Toronto, 22 pages.

12. Rasmussen, C. E. (1996) Evaluation of Gaussian Processes and other Methods for Non-Linear Regression. Ph.D. thesis, Graduate Department of Computer Science, University of Toronto.

13. Rasmussen, C. E. (1999) The Infinite Gaussian Mixture Model. Advances in Neural Information Processing Systems 12, S. A. Solla, T. K. Leen and K.-R. Müller (eds.), pp. 554–560, MIT Press (2000).

14. Rasmussen, C. E. and Ghahramani, Z. (2000) Occam's Razor. Advances in Neural Information Processing Systems 13, MIT Press (2001).

15. Rasmussen, C. E. and Ghahramani, Z. (2001) Infinite Mixtures of Gaussian Process Experts. In NIPS 2001.

16. Ueda, N. and Ghahramani, Z. (2000) Optimal model inference for Bayesian mixtures of experts. IEEE Neural Networks for Signal Processing, Sydney, Australia.

17. Waterhouse, S., MacKay, D. J. C. and Robinson, T. (1996) Bayesian methods for mixtures of experts. In D. S. Touretzky, M. C. Mozer and M. E. Hasselmo (eds.), Advances in Neural Information Processing Systems 8. Cambridge, MA: MIT Press.

18. Williams, C. K. I. and Rasmussen, C. E. (1996) Gaussian processes for regression. In Advances in Neural Information Processing Systems 8, ed. by D. S. Touretzky, M. C. Mozer and M. E. Hasselmo.