
Developments of Hidden Markov Models

by

Chandima Karunanayake

30th March, 2004

Developments:

• Estimating the Order (Number of Hidden States) of a Hidden Markov Model

• Application of Decision Tree to HMM

A Hidden Markov Model consists of

1. A sequence of states {Xt | t ∈ T} = {X1, X2, ..., XT}, and
2. A sequence of observations {Yt | t ∈ T} = {Y1, Y2, ..., YT}.

Some basic problems, given the observations {Y1, Y2, ..., YT}:

1. Determine the sequence of states {X1, X2, ..., XT}.
2. Determine (or estimate) the parameters of the stochastic process that is generating the states and the observations.

Estimating the Order (Number of Hidden States) of a Hidden Markov Model

Finite mixture models

A finite mixture model takes the form

F(y) = \sum_{j=1}^{m} \alpha_j f(y, \theta_j)

Example: Poisson mixture model with m = 3 components

The density function of the Poisson mixture model:

F(y) = \alpha_1 f(y, \lambda_1) + \alpha_2 f(y, \lambda_2) + \alpha_3 f(y, \lambda_3)
     = \alpha_1 \frac{\lambda_1^y e^{-\lambda_1}}{y!} + \alpha_2 \frac{\lambda_2^y e^{-\lambda_2}}{y!} + \alpha_3 \frac{\lambda_3^y e^{-\lambda_3}}{y!}

i.e. a mixture of Poi(λ1), Poi(λ2) and Poi(λ3) with mixing weights α1, α2, α3.
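A minimal sketch of this three-component Poisson mixture density in Python; the weights and rates below are illustrative placeholders, not estimates from any data in this talk.

from scipy.stats import poisson

def poisson_mixture_pmf(y, weights, rates):
    # F(y) = sum_j alpha_j * f(y, lambda_j), the mixture density above
    return sum(a * poisson.pmf(y, lam) for a, lam in zip(weights, rates))

weights = [0.5, 0.3, 0.2]   # alpha_1, alpha_2, alpha_3 (must sum to 1) -- illustrative
rates = [1.0, 4.0, 9.0]     # lambda_1, lambda_2, lambda_3 -- illustrative
print([round(poisson_mixture_pmf(y, weights, rates), 4) for y in range(8)])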

Estimation of the number of components of a finite mixture model

• AIC - Akaike Information Criterion: choose m to maximize AIC(m) = l_m - d_m
• BIC - Bayesian Information Criterion: choose m to maximize BIC(m) = l_m - (\log n) d_m / 2

Most commonly used, but not justified theoretically.

l_m - log likelihood with m components
d_m - the number of free parameters in the model
m - the number of components
n - sample size
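A minimal sketch of applying these two rules, assuming the log likelihood l_m has already been computed for each candidate m (the numbers below are made up for illustration):

import math

def aic(loglik, d):
    # AIC(m) = l_m - d_m
    return loglik - d

def bic(loglik, d, n):
    # BIC(m) = l_m - (log n) * d_m / 2
    return loglik - 0.5 * d * math.log(n)

n = 90  # e.g. 3 patients x 30 monthly observations
candidates = {1: (-210.4, 1), 2: (-195.7, 3), 3: (-194.9, 5)}  # m: (l_m, d_m), illustrative values
print(max(candidates, key=lambda m: aic(*candidates[m])))      # m chosen by AIC
print(max(candidates, key=lambda m: bic(*candidates[m], n)))   # m chosen by BIC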

Solution

Penalized likelihood methods - only for a finite number of states

• Penalized minimum distance method (Chen & Kalbfleisch, 1996)
• Consistent estimate of the number of components in a finite mixture model

Chen & Kalbfleisch idea
+
The stationary HMMs form a class of finite mixture models with a Markovian property

Penalized minimum distance method to estimate the number of hidden states in an HMM (MacKay, 2002)

Penalized Distance

Let {f(x, θ) : θ ∈ Θ} be a family of density functions and let G(θ) be a finite distribution function on Θ. Then the distribution function of a finite mixture model is

F(x, G) = \sum_{j=1}^{k} p_j F(x, \theta_j)

The mixing distribution is

G(\theta) = \sum_{j=1}^{k} p_j I(\theta_j \le \theta)

The penalized distance is calculated in the following way:

D(F_n, F(x, G)) = d(F_n, F(x, G)) - C_n \sum_{j=1}^{k} \log p_j

where d(F_n, F(x, G)) is the distance measure and -C_n \sum_{j=1}^{k} \log p_j is the penalty term.

C_n - a sequence of positive constants; Chen & Kalbfleisch used C_n = 0.01 n^{-1/2} \log n, where n is the number of observations. The penalty penalizes the overfitting of subpopulations which have an estimated probability close to zero and which differ only very slightly.

The empirical distribution function is

F_n(x) = \frac{1}{n} \sum_{i=1}^{n} I(X_i \le x)

Different distance measures d(F_1, F_2) can be used:
• The Kolmogorov-Smirnov distance
• The Cramer-von Mises distance
• The Kullback-Leibler distance
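A minimal sketch of the penalized distance for a candidate two-component Poisson mixture, using the Kolmogorov-Smirnov distance and the C_n of Chen & Kalbfleisch. The toy data and candidate parameters are illustrative only; in practice the distance is minimized over the mixture parameters for each candidate number of components, as in the MS application below.

import numpy as np
from scipy.stats import poisson

def penalized_distance(data, probs, rates):
    n = len(data)
    C_n = 0.01 * np.log(n) / np.sqrt(n)                  # C_n = 0.01 n^(-1/2) log n
    grid = np.arange(0, data.max() + 1)
    F_n = np.array([np.mean(data <= x) for x in grid])   # empirical distribution function
    F_G = sum(p * poisson.cdf(grid, lam) for p, lam in zip(probs, rates))  # mixture CDF F(x, G)
    ks = np.max(np.abs(F_n - F_G))                       # Kolmogorov-Smirnov distance
    return ks - C_n * np.sum(np.log(probs))              # distance measure + penalty term

data = np.array([4, 3, 4, 7, 1, 1, 0, 1, 3, 2, 1, 4, 2, 0, 2, 1, 2, 3, 1, 4])  # toy counts
print(round(penalized_distance(data, probs=np.array([0.6, 0.4]), rates=[2.5, 6.3]), 4))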

Application to Multiple Sclerosis Lesion Count Data

Patients afflicted with relapsing-remitting multiple sclerosis (MS) experience lesions on the brain stem, with symptoms typically worsening and improving in a somewhat cyclic fashion.
• Reasonable to assume that the distribution of the lesion counts depends on the patient's underlying disease activity.
• The sequence of disease states is hidden.
• Three patients, each of whom has monthly MRI scans for a period of 30 months.

Proposed model:

Y_it | Z_it ~ Poisson(μ_{0,Z_it})

Y_it - the number of lesions observed on patient i at time t
Z_it - the associated disease state (unobserved)
μ_{0,Z_it} - distinct Poisson means, one for each disease state

Results: penalized minimum distances for different numbers of hidden states

Number of states   Estimated Poisson means          Minimum distance
1                  4.03                             0.1306
2                  2.48, 6.25                       0.0608
3                  2.77, 2.62, 7.10                 0.0639
4                  2.05, 2.96, 3.53, 7.75           0.0774
5                  1.83, 3.21, 3.40, 3.58, 8.35     0.0959

The penalized minimum distance is smallest for two hidden states, so a two-state model is selected.

Estimates of the parameters of the hidden process

Initial probability matrix:

\hat{\pi}_0 = [0.594, 0.406]

Transition probability matrix:

\hat{P}_0 = [ 0.619  0.381
              0.558  0.442 ]

The performance of the penalized minimum distance method depends on:

• Number of components
• Sample size
• Separation of components
• Proportion of time in each state

1. Application of Decision Tree to HMM

[Diagram: the observed data sequence ... O_{t-1}, O_t, O_{t+1} ... is Viterbi-labeled with states L_j, and the labeled data are passed to a decision tree, which produces the output probabilities Pr(L_j, q_t = s_i).]

The Simulated Hidden Markov model for the Multiple Sclerosis Lesion Count Data (Laverty et al., 2002)

Transition probability matrix:
            State 1   State 2
  State 1   0.619     0.381
  State 2   0.558     0.442

Initial probability matrix:
  State 1   State 2
  0.594     0.406

Mean vector:
  State 1   State 2
  2.48      6.25
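A minimal sketch of simulating a sequence from this two-state HMM with Poisson emissions, using the parameters above. Laverty et al. (2002) describe an Excel-based simulation; this Python version is only an illustration, and the seed and sequence length are arbitrary.

import numpy as np

rng = np.random.default_rng(seed=1)
init = np.array([0.594, 0.406])                 # initial probabilities for states 1, 2
trans = np.array([[0.619, 0.381],
                  [0.558, 0.442]])              # transition probability matrix
means = np.array([2.48, 6.25])                  # Poisson mean for each state

def simulate_hmm(T=30):
    states, counts = [], []
    s = rng.choice(2, p=init)                   # draw the initial hidden state
    for _ in range(T):
        states.append(s + 1)                    # report states as 1 and 2
        counts.append(rng.poisson(means[s]))    # lesion count given the current state
        s = rng.choice(2, p=trans[s])           # move to the next hidden state
    return states, counts

states, counts = simulate_hmm()
print(list(zip(counts, states)))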

Simulated data (number of lesion counts and state):

Counts  State    Counts  State    Counts  State
4       2        1       1        3       2
3       2        4       2        4       2
4       2        2       2        7       2
7       2        0       1        0       1
1       1        2       2        5       2
1       1        1       1        3       2
0       1        2       1        4       2
1       1        3       2        6       2
3       1        1       1        4       2
2       1        4       2        1       2

How this works: tree construction

Greedy Tree Construction Algorithm

Step 0: Start with all labeled data.
Step 1: While the stopping condition is unmet, do:
Step 2:   Find the best split threshold over all thresholds and dimensions.
Step 3:   Send the data to the left or right child depending on the threshold test.
Step 4: Recursively repeat steps 1-4 for the left and right children.
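A minimal sketch of this greedy construction for one-dimensional labeled counts. The stopping rule (pure or very small node), the splitting rule (fewest misclassifications after the split) and the labeling rule (majority class) are simple illustrative choices, not necessarily those used in the talk.

from collections import Counter

def grow_tree(data, min_size=2):
    """data: list of (value, label) pairs; returns a nested dict tree."""
    labels = [lab for _, lab in data]
    values = sorted({v for v, _ in data})
    # Stopping rule: the node is pure, too small, or cannot be split further
    if len(set(labels)) == 1 or len(data) < min_size or len(values) < 2:
        return {"leaf": Counter(labels).most_common(1)[0][0]}   # labeling rule: majority class
    best = None
    for thr in values[:-1]:                                     # candidate thresholds
        left = [lab for v, lab in data if v <= thr]
        right = [lab for v, lab in data if v > thr]
        # Splitting rule: count misclassifications under each child's majority label
        left_major = Counter(left).most_common(1)[0][0]
        right_major = Counter(right).most_common(1)[0][0]
        err = sum(lab != left_major for lab in left) + sum(lab != right_major for lab in right)
        if best is None or err < best[0]:
            best = (err, thr)
    thr = best[1]
    return {"threshold": thr,
            "left": grow_tree([(v, l) for v, l in data if v <= thr], min_size),
            "right": grow_tree([(v, l) for v, l in data if v > thr], min_size)}

# Lesion counts paired with their states (first column of the simulated data above)
data = [(4, 2), (3, 2), (4, 2), (7, 2), (1, 1), (1, 1), (0, 1), (1, 1), (3, 1), (2, 1)]
print(grow_tree(data))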

Three rules characterize a tree-growing strategy:

A splitting rule: determines where the decision threshold is placed, given the data in a node.

A stopping rule: determines when recursion ends. This is the rule that determines whether a node is a leaf node.

A labeling rule: assigns some value or class label to every leaf node. For the tree considered here, leaves will be associated (labeled) with the state-conditional output probabilities used in the HMM.

Splitting Rules

Entropy criterion: the attribute with the highest information gain (Info-Gain) is used to select the split.

The entropy of the set T (units are in bits):

Info(T) = - \sum_{i=1}^{m} \frac{freq(C_i, T)}{|T|} \log_2 \frac{freq(C_i, T)}{|T|}

where |T| is the size of T. After T has been partitioned into subsets T_1, ..., T_k by a test on attribute X:

Info_x(T) = \sum_{i=1}^{k} \frac{|T_i|}{|T|} Info(T_i)

Gain(X) = Info(T) - Info_x(T)
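A minimal sketch of the information-gain computation, for a binary split of the labeled lesion counts at a candidate threshold (the data are the first column of the simulated table above; threshold 2 matches the decision rule shown later):

import math
from collections import Counter

def info(labels):
    """Entropy of a set of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain(data, threshold):
    """Information gain of splitting (value, label) pairs at value <= threshold."""
    labels = [lab for _, lab in data]
    left = [lab for v, lab in data if v <= threshold]
    right = [lab for v, lab in data if v > threshold]
    info_x = sum(len(part) / len(data) * info(part) for part in (left, right) if part)
    return info(labels) - info_x

data = [(4, 2), (3, 2), (4, 2), (7, 2), (1, 1), (1, 1), (0, 1), (1, 1), (3, 1), (2, 1)]
print(round(gain(data, 2), 3))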

GINI criterion: the attribute giving the smallest value of the GINI index is used to select the split.

The GINI criterion for a split into L new nodes is calculated by the following formula:

G(L) = \frac{1}{N} \sum_{l=1}^{L} N_l \left( 1 - \sum_{w=1}^{K} \left( \frac{N_{wl}}{N_l} \right)^2 \right)

where
N - the number of observations in the initial node
N_{wl} - the number of observations of the wth class which correspond to the lth node
N_l - the number of observations appropriate to the lth new node
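A matching sketch of the GINI criterion for the same binary split, following the formula above (over candidate splits, the one with the smallest value would be chosen):

from collections import Counter

def gini_split(data, threshold):
    """GINI criterion for splitting (value, label) pairs at value <= threshold."""
    left = [lab for v, lab in data if v <= threshold]
    right = [lab for v, lab in data if v > threshold]
    N = len(data)
    total = 0.0
    for node in (left, right):
        if node:
            impurity = 1 - sum((c / len(node)) ** 2 for c in Counter(node).values())
            total += len(node) * impurity        # N_l * (1 - sum_w (N_wl / N_l)^2)
    return total / N

data = [(4, 2), (3, 2), (4, 2), (7, 2), (1, 1), (1, 1), (0, 1), (1, 1), (3, 1), (2, 1)]
print(round(gini_split(data, 2), 3))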

Decision tree for the lesion count data:

[Diagram: the root node splits on the lesion count; Count <= 2 leads to State 1 and Count > 2 leads to State 2.]

Decision rule:

If count <= 2 then classification = State 1; else classification = State 2.
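Applying the rule to a sequence of counts is a one-liner; a minimal sketch:

def classify(count):
    return 1 if count <= 2 else 2   # State 1 if the count is at most 2, otherwise State 2

print([classify(c) for c in [4, 3, 4, 7, 1, 1, 0, 1, 3, 2]])   # counts from the first column above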

Decision Tree classification of states

Counts  State  DT class    Counts  State  DT class    Counts  State  DT class
4       2      2           1       1      1           3       2      2
3       2      2           4       2      2           4       2      2
4       2      2           2       2      1           7       2      2
7       2      2           0       1      1           0       1      1
1       1      1           2       2      1           5       2      2
1       1      1           1       1      1           3       2      2
0       1      1           2       1      1           4       2      2
1       1      1           3       2      2           6       2      2
3       1      1           1       1      1           4       2      2
2       1      1           4       2      2           1       2      1

(Counts - number of lesion counts; State - simulated state; DT class - classification according to the decision tree)

Given the state, the state-conditional output probability at time t and state S_i is

Pr(O_t | q_t = S_i)

From the labeled data we can estimate the probabilities that a given state emitted a certain observation.
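A minimal sketch of that estimation step, using the labeled (count, state) pairs from the first column of the table above and simple relative frequencies within each state (a hypothetical, smoothing-free estimator):

from collections import Counter, defaultdict

pairs = [(4, 2), (3, 2), (4, 2), (7, 2), (1, 1), (1, 1), (0, 1), (1, 1), (3, 1), (2, 1)]
by_state = defaultdict(list)
for count, state in pairs:
    by_state[state].append(count)

# emission[s][o] estimates Pr(O_t = o | q_t = state s) by a relative frequency
emission = {s: {o: c / len(obs) for o, c in Counter(obs).items()} for s, obs in by_state.items()}
print(emission)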

2. Application of Decision Tree to HMM

[Diagram: the observed data sequence ... O_{t-1}, O_t, O_{t+1} ... is passed to a decision tree, which selects the simplest possible model for the given data.]

Decision Tree

The splitting criterion can depend on several things:

• Type of observed data (independent/autoregressive)
• Type of the transition probabilities (balanced/unbalanced among the states)
• Separation of components (well separated or close together)

[Diagram: a tree over these factors. The root is the observed data; the Durbin-Watson test splits it into independent and autoregressive cases; the next level splits on whether the transition probabilities are balanced or unbalanced; the leaves indicate the separation of components (S - well separated, C - close together), with some combinations marked '?' (undetermined).]
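As a concrete note on the first split, the Durbin-Watson statistic can be computed directly; a minimal sketch, applied here to the mean-centered counts as a rough check for serial correlation (the classical test is defined on regression residuals, and the toy values are illustrative):

import numpy as np

def durbin_watson(x):
    # DW = sum of squared successive differences / sum of squares about the mean
    r = np.asarray(x, dtype=float)
    r = r - r.mean()
    return np.sum(np.diff(r) ** 2) / np.sum(r ** 2)

counts = [4, 3, 4, 7, 1, 1, 0, 1, 3, 2]   # toy sequence; values near 2 suggest little serial correlation
print(round(durbin_watson(counts), 3))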

Advantages of Decision Tree

• Trees can handle high-dimensional spaces gracefully.

• Because of their hierarchical nature, finding a tree-based output probability given the output is extremely fast.

• Trees can cope with categorical as well as continuous data.

Disadvantages of Decision Tree

• The set of class boundaries is relatively inelegant (rough).

• A decision tree model is non-parametric and has many more free parameters than a parametric model of similar power. It therefore requires more storage, and a large amount of training data is needed to obtain good estimates.

References:

Foote, J.T., Decision-Tree Probability Modeling for HMM Speech Recognition, Ph.D. Thesis, Division of Engineering, Brown University, RI, USA, 1993.

Kantardzic, M., Data Mining: Concepts, Models, Methods, and Algorithms, Wiley, New York; Chichester, 2003.

Laverty, W.H., Miket, M.J. and Kelly, I.W., Simulation of hidden Markov models with Excel, The Statistician, Vol. 51, Part 1, pp. 31-40, 2002.

MacKay, R.J., Estimating the order of a hidden Markov model, The Canadian Journal of Statistics, Vol. 30, pp. 573-589, 2002.

Thank you to Prof. M.J. Miket and to my supervisor, Prof. W.H. Laverty, for giving me valuable advice and encouragement to make this presentation a success.