HMM, MEMM and CRF
40-957 Special Topics in Artificial Intelligence:
Probabilistic Graphical Models
Sharif University of Technology
Soleymani
Spring 2014
Sequence labeling
Taking a set of interrelated instances $x_1, \dots, x_T$ collectively and jointly labeling them.
We get as input a sequence of observations $X = x_{1:T}$ and need to label it with a joint label $Y = y_{1:T}$.
Generalization of mixture models for sequential data
[Jordan]
[Figure: an HMM chain $Y_1 \to Y_2 \to \dots \to Y_{T-1} \to Y_T$, with an observation $X_t$ emitted from each state $Y_t$.]
$Y$: states (latent variables)
$X$: observations
HMM examples
Part-of-speech tagging
Speech recognition
Example (POS tags): Students/NNS are/VBP expected/VBN to/TO study/VB
HMM: probabilistic model
Transition probabilities between states:
$A_{ij} \equiv P(Y_t = j \mid Y_{t-1} = i)$
Initial state distribution (start probabilities in the different states):
$\pi_i \equiv P(Y_1 = i)$
Observation model: emission probabilities associated with each state:
$P(X_t \mid Y_t, \theta)$
HMM: probabilistic model
Transition probabilities between states:
$P(Y_t \mid Y_{t-1} = i) \sim \mathrm{Mult}(A_{i1}, \dots, A_{iM}) \quad \forall i \in \text{states}$
Initial state distribution:
$P(Y_1) \sim \mathrm{Mult}(\pi_1, \dots, \pi_M)$
Observation model: emission probabilities associated with each state
Discrete observations: $P(X_t \mid Y_t = i) \sim \mathrm{Mult}(B_{i,1}, \dots, B_{i,K}) \quad \forall i \in \text{states}$
General: $P(X_t \mid Y_t = i) = f(\cdot \mid \theta_i)$
$Y$: states (latent variables); $X$: observations
Inference problems in sequential data
Decoding: $\operatorname{argmax}_{y_1, \dots, y_T} P(y_1, \dots, y_T \mid x_1, \dots, x_T)$
Evaluation:
Filtering: $P(y_t \mid x_1, \dots, x_t)$
Smoothing: $P(y_{t'} \mid x_1, \dots, x_t)$ for $t' < t$
Prediction: $P(y_{t'} \mid x_1, \dots, x_t)$ for $t' > t$
Some questions
$P(y_t \mid x_1, \dots, x_t) = ?$ → Forward algorithm
$P(x_1, \dots, x_T) = ?$ → Forward algorithm
$P(y_t \mid x_1, \dots, x_T) = ?$ → Forward-Backward algorithm
How do we adjust the HMM parameters?
Complete data: each training sample includes a state sequence and the corresponding observation sequence
Incomplete data: each training sample includes only an observation sequence
Forward algorithm
$\alpha_t(i) = P(x_1, \dots, x_t, y_t = i)$
Initialization:
$\alpha_1(i) = P(x_1, y_1 = i) = P(x_1 \mid y_1 = i)\, P(y_1 = i), \quad i = 1, \dots, M$
Iterations: $t = 2$ to $T$
$\alpha_t(i) = \left[\sum_j \alpha_{t-1}(j)\, P(y_t = i \mid y_{t-1} = j)\right] P(x_t \mid y_t = i), \quad i = 1, \dots, M$
[Figure: the chain $Y_1 \to \dots \to Y_T$ with forward messages $\alpha_1(\cdot), \alpha_2(\cdot), \dots, \alpha_T(\cdot)$.]
In message-passing terms, $\alpha_t(i) = m_{t-1 \to t}(i)$.
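As a concrete reference, here is a minimal NumPy sketch of the forward pass for a discrete HMM; the array names `A`, `B`, `pi` follow the conventions above and are our own, and the messages are left unscaled for clarity (in practice one scales them or works in log space to avoid underflow).

```python
import numpy as np

def forward(x, A, B, pi):
    """Forward recursion for a discrete HMM (a sketch).
    x: observation indices; A[i, j] = P(y_t = j | y_{t-1} = i);
    B[i, k] = P(x_t = k | y_t = i); pi[i] = P(y_1 = i)."""
    T, M = len(x), len(pi)
    alpha = np.zeros((T, M))
    alpha[0] = pi * B[:, x[0]]                    # initialization
    for t in range(1, T):
        alpha[t] = (alpha[t-1] @ A) * B[:, x[t]]  # sum over previous state j
    # alpha[t, i] is the 0-based analogue of alpha_{t+1}(i) in the slides
    return alpha
```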
Backward algorithm
$\beta_t(i) = P(x_{t+1}, \dots, x_T \mid y_t = i)$
Initialization:
$\beta_T(i) = 1, \quad i \in \text{states}$
Iterations: $t = T$ down to 2
$\beta_{t-1}(i) = \sum_j \beta_t(j)\, P(y_t = j \mid y_{t-1} = i)\, P(x_t \mid y_t = j), \quad i \in \text{states}$
[Figure: the chain with backward messages $\beta_T(\cdot) = 1, \beta_{T-1}(\cdot), \dots, \beta_1(\cdot)$.]
In message-passing terms, $\beta_t(i)$ is the backward message arriving at node $t$, $m_{t+1 \to t}(i)$.
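A matching sketch of the backward pass, under the same conventions as the forward sketch above (again unscaled, so suitable only for short sequences):

```python
import numpy as np

def backward(x, A, B):
    """Backward recursion for a discrete HMM (a sketch; conventions as
    in the forward pass above)."""
    T, M = len(x), A.shape[0]
    beta = np.ones((T, M))                        # beta_T(i) = 1
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, x[t+1]] * beta[t+1])  # sum over next state j
    # beta[t, i] is the 0-based analogue of beta_{t+1}(i) in the slides
    return beta
```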
Forward-backward algorithm
$\alpha_t(i) = \left[\sum_j \alpha_{t-1}(j)\, P(y_t = i \mid y_{t-1} = j)\right] P(x_t \mid y_t = i)$, with $\alpha_1(i) = P(x_1 \mid y_1 = i)\, P(y_1 = i)$
$\beta_{t-1}(i) = \sum_j \beta_t(j)\, P(y_t = j \mid y_{t-1} = i)\, P(x_t \mid y_t = j)$, with $\beta_T(i) = 1$
$P(x_1, x_2, \dots, x_T) = \sum_i \alpha_T(i) = \sum_i \alpha_1(i)\, \beta_1(i)$
$P(y_t = i \mid x_1, x_2, \dots, x_T) = \dfrac{\alpha_t(i)\, \beta_t(i)}{\sum_j \alpha_T(j)}$
where
$\alpha_t(i) \equiv P(x_1, x_2, \dots, x_t, y_t = i)$
$\beta_t(i) \equiv P(x_{t+1}, x_{t+2}, \dots, x_T \mid y_t = i)$
Forward-backward algorithm
This will be used in expectation maximization to train an HMM.
$P(y_t = i \mid x_1, \dots, x_T) = \dfrac{P(x_1, \dots, x_T, y_t = i)}{P(x_1, \dots, x_T)} = \dfrac{\alpha_t(i)\, \beta_t(i)}{\sum_{j=1}^{M} \alpha_T(j)}$
[Figure: the chain with forward messages $\alpha_1(\cdot), \dots, \alpha_T(\cdot)$ and backward messages $\beta_T(\cdot) = 1, \dots, \beta_1(\cdot)$ combined at each node.]
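As a usage example, combining the two sketches above (assuming the `forward` and `backward` functions are in scope, and that the observation sequence `x` and parameters `A`, `B`, `pi` are given) yields the likelihood and the smoothing posterior:

```python
alpha = forward(x, A, B, pi)        # from the earlier sketch
beta = backward(x, A, B)
likelihood = alpha[-1].sum()        # P(x_1, ..., x_T)
gamma = alpha * beta / likelihood   # gamma[t, i] = P(y_t = i | x_1..x_T)
```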
Decoding Problem
Choose the state sequence that is most probable given the observations:
$\operatorname{argmax}_{y_1, \dots, y_t} P(y_1, \dots, y_t \mid x_1, \dots, x_t)$
Viterbi algorithm:
Define the auxiliary variable $\delta$:
$\delta_t(i) = \max_{y_1, \dots, y_{t-1}} P(y_1, y_2, \dots, y_{t-1}, y_t = i, x_1, x_2, \dots, x_t)$
$\delta_t(i)$: probability of the most probable path ending in state $y_t = i$
Recursive relation:
$\delta_t(i) = \max_{j = 1, \dots, M} \delta_{t-1}(j)\, P(y_t = i \mid y_{t-1} = j)\, P(x_t \mid y_t = i)$
Decoding Problem: Viterbi algorithm
Initialization: for $i = 1, \dots, M$
$\delta_1(i) = P(x_1 \mid y_1 = i)\, P(y_1 = i)$
$\psi_1(i) = 0$
Iterations: $t = 2, \dots, T$; for $i = 1, \dots, M$
$\delta_t(i) = \max_j \delta_{t-1}(j)\, P(y_t = i \mid y_{t-1} = j)\, P(x_t \mid y_t = i)$
$\psi_t(i) = \operatorname{argmax}_j \delta_{t-1}(j)\, P(y_t = i \mid y_{t-1} = j)$
Final computation:
$P^* = \max_{j = 1, \dots, M} \delta_T(j)$
$y_T^* = \operatorname{argmax}_{j = 1, \dots, M} \delta_T(j)$
Traceback of the state sequence: $t = T - 1$ down to 1
$y_t^* = \psi_{t+1}(y_{t+1}^*)$
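A minimal sketch of Viterbi decoding, using the same array conventions as the forward-pass sketch (unscaled probabilities; log space would be used in practice):

```python
import numpy as np

def viterbi(x, A, B, pi):
    """Viterbi decoding for a discrete HMM (a sketch; conventions as in
    the forward pass above). Returns the most probable state sequence."""
    T, M = len(x), len(pi)
    delta = np.zeros((T, M))
    psi = np.zeros((T, M), dtype=int)
    delta[0] = pi * B[:, x[0]]                 # initialization
    for t in range(1, T):
        scores = delta[t-1][:, None] * A       # scores[j, i]: from j to i
        psi[t] = scores.argmax(axis=0)         # best predecessor of each i
        delta[t] = scores.max(axis=0) * B[:, x[t]]
    y = [int(delta[-1].argmax())]              # best final state
    for t in range(T - 1, 0, -1):              # traceback
        y.append(int(psi[t, y[-1]]))
    return y[::-1]
```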
Max-product algorithm
$m_{ji}^{\max}(x_i) = \max_{x_j} \psi(x_j)\, \psi(x_i, x_j) \prod_{k \in \mathcal{N}(j) \setminus i} m_{kj}^{\max}(x_j)$
Viterbi is max-product on the HMM chain: $\delta_t(i) = m_{t-1,t}^{\max}(i) \times \psi(x_i)$
HMM Learning
Supervised learning: we have a set of data samples, each containing a pair of sequences (one the observation sequence and the other the state sequence)
Unsupervised learning: we have a set of data samples, each containing only a sequence of observations
HMM supervised learning by MLE
Initial state probability:
$\pi_i = P(y_1 = i), \quad 1 \le i \le M$
State transition probability:
$A_{ij} = P(y_{t+1} = j \mid y_t = i), \quad 1 \le i, j \le M$
Emission probability (discrete observations):
$B_{ik} = P(x_t = k \mid y_t = i), \quad 1 \le i \le M,\ 1 \le k \le K$
[Figure: the HMM chain annotated with $\pi$ on $Y_1$, $A$ on the transitions, and $B$ on the emissions.]
HMM: supervised parameter learning by MLE
$P(D \mid \theta) = \prod_{n=1}^{N} P(y_1^{(n)} \mid \pi) \prod_{t=2}^{T} P(y_t^{(n)} \mid y_{t-1}^{(n)}, A) \prod_{t=1}^{T} P(x_t^{(n)} \mid y_t^{(n)}, B)$
$\hat{\pi}_i = \dfrac{\sum_{n=1}^{N} I(y_1^{(n)} = i)}{N}$
$\hat{A}_{ij} = \dfrac{\sum_{n=1}^{N} \sum_{t=2}^{T} I(y_{t-1}^{(n)} = i,\ y_t^{(n)} = j)}{\sum_{n=1}^{N} \sum_{t=2}^{T} I(y_{t-1}^{(n)} = i)}$
$\hat{B}_{ik} = \dfrac{\sum_{n=1}^{N} \sum_{t=1}^{T} I(y_t^{(n)} = i,\ x_t^{(n)} = k)}{\sum_{n=1}^{N} \sum_{t=1}^{T} I(y_t^{(n)} = i)}$
(Discrete observations.)
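These estimators are just normalized counts, as the following sketch shows (the function name and data layout are our own; states and symbols are assumed to be 0-indexed integers):

```python
import numpy as np

def hmm_mle(sequences, M, K):
    """Supervised MLE for a discrete HMM from labeled sequences (a sketch).
    sequences: list of (x, y) pairs of equal-length integer arrays,
    with states y in 0..M-1 and symbols x in 0..K-1."""
    pi = np.zeros(M)
    A = np.zeros((M, M))
    B = np.zeros((M, K))
    for x, y in sequences:
        pi[y[0]] += 1                     # initial-state count
        for t in range(1, len(y)):
            A[y[t-1], y[t]] += 1          # transition counts
        for t in range(len(y)):
            B[y[t], x[t]] += 1            # emission counts
    # normalize counts into probabilities
    pi /= pi.sum()
    A /= A.sum(axis=1, keepdims=True)
    B /= B.sum(axis=1, keepdims=True)
    return pi, A, B
```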
Learning
Problem: how do we construct an HMM given only observations?
Find $\theta = (A, B, \pi)$ maximizing $P(X^{(1)}, \dots, X^{(N)} \mid \theta)$
Incomplete data
EM algorithm
HMM learning by EM (Baum-Welch)
$\theta^{old} = (\pi^{old}, A^{old}, B^{old})$
E-Step:
$\gamma_{i,t}^{n} = P(y_t^{(n)} = i \mid X^{(n)}; \theta^{old})$
$\xi_{ij,t}^{n} = P(y_{t-1}^{(n)} = i,\ y_t^{(n)} = j \mid X^{(n)}; \theta^{old})$
M-Step:
$\pi_i^{new} = \dfrac{\sum_{n=1}^{N} \gamma_{i,1}^{n}}{N}, \quad i = 1, \dots, M$
$A_{ij}^{new} = \dfrac{\sum_{n=1}^{N} \sum_{t=2}^{T} \xi_{ij,t}^{n}}{\sum_{n=1}^{N} \sum_{t=1}^{T-1} \gamma_{i,t}^{n}}, \quad i, j = 1, \dots, M$
$B_{ik}^{new} = \dfrac{\sum_{n=1}^{N} \sum_{t=1}^{T} \gamma_{i,t}^{n}\, I(x_t^{(n)} = k)}{\sum_{n=1}^{N} \sum_{t=1}^{T} \gamma_{i,t}^{n}}, \quad k = 1, \dots, K$
Baum-Welch algorithm (Baum, 1972)
Discrete observations
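A sketch of one EM iteration for a single observation sequence, under the same array conventions as earlier; the messages are unscaled for clarity, so this is only suitable for short sequences:

```python
import numpy as np

def baum_welch_step(X, A, B, pi):
    """One EM iteration of Baum-Welch for a single discrete observation
    sequence X (a sketch). Returns updated (A, B, pi)."""
    X = np.asarray(X)
    T, M = len(X), A.shape[0]
    # E-step: forward and backward messages
    alpha = np.zeros((T, M))
    beta = np.zeros((T, M))
    alpha[0] = pi * B[:, X[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t-1] @ A) * B[:, X[t]]
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, X[t+1]] * beta[t+1])
    Z = alpha[-1].sum()                      # P(x_1..x_T)
    gamma = alpha * beta / Z                 # gamma[t, i] = P(y_t = i | X)
    # xi[t, i, j] = P(y_t = i, y_{t+1} = j | X), for t = 0..T-2
    xi = (alpha[:-1, :, None] * A[None, :, :] *
          (B[:, X[1:]].T * beta[1:])[:, None, :]) / Z
    # M-step: normalized expected counts
    pi_new = gamma[0]
    A_new = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    B_new = np.zeros_like(B)
    for k in range(B.shape[1]):
        B_new[:, k] = gamma[X == k].sum(axis=0)
    B_new /= gamma.sum(axis=0)[:, None]
    return A_new, B_new, pi_new
```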
HMM learning by EM (Baum-Welch)
With Gaussian emission probabilities $P(x_t \mid y_t = i) = \mathcal{N}(x_t \mid \mu_i, \Sigma_i)$, the E-step and the $\pi$ and $A$ updates are as on the previous slide, and the emission updates become:
$\mu_i^{new} = \dfrac{\sum_{n=1}^{N} \sum_{t=1}^{T} \gamma_{i,t}^{n}\, x_t^{(n)}}{\sum_{n=1}^{N} \sum_{t=1}^{T} \gamma_{i,t}^{n}}$
$\Sigma_i^{new} = \dfrac{\sum_{n=1}^{N} \sum_{t=1}^{T} \gamma_{i,t}^{n}\, (x_t^{(n)} - \mu_i^{new})(x_t^{(n)} - \mu_i^{new})^{\top}}{\sum_{n=1}^{N} \sum_{t=1}^{T} \gamma_{i,t}^{n}}$
Baum-Welch algorithm (Baum, 1972)
Forward-backward algorithm for E-step
$P(y_{t-1}, y_t \mid x_1, \dots, x_T)$
$= \dfrac{P(x_1, \dots, x_T \mid y_{t-1}, y_t)\, P(y_{t-1}, y_t)}{P(x_1, \dots, x_T)}$
$= \dfrac{P(x_1, \dots, x_{t-1} \mid y_{t-1})\, P(x_t \mid y_t)\, P(x_{t+1}, \dots, x_T \mid y_t)\, P(y_t \mid y_{t-1})\, P(y_{t-1})}{P(x_1, \dots, x_T)}$
$= \dfrac{\alpha_{t-1}(y_{t-1})\, P(x_t \mid y_t)\, P(y_t \mid y_{t-1})\, \beta_t(y_t)}{\sum_{j=1}^{M} \alpha_T(j)}$
HMM shortcomings
In modeling the joint distribution $P(Y, X)$, the HMM ignores many dependencies between the observations $x_1, \dots, x_T$ (like most generative models, it must simplify the structure).
In the sequence labeling task, we need to classify an observation sequence using the conditional probability $P(Y \mid X)$; however, the HMM learns the joint distribution $P(Y, X)$ while using only $P(Y \mid X)$ for labeling.
[Figure: the HMM chain $Y_1 \to \dots \to Y_T$ with emissions $X_1, \dots, X_T$.]
Maximum Entropy Markov Model (MEMM)
$P(y_{1:T} \mid x_{1:T}) = P(y_1 \mid x_{1:T}) \prod_{t=2}^{T} P(y_t \mid y_{t-1}, x_{1:T})$
$P(y_t \mid y_{t-1}, x_{1:T}) = \dfrac{\exp\left(\boldsymbol{w}^{\top} \boldsymbol{f}(y_t, y_{t-1}, x_{1:T})\right)}{Z(y_{t-1}, x_{1:T}, \boldsymbol{w})}$
where $Z(y_{t-1}, x_{1:T}, \boldsymbol{w}) = \sum_{y_t} \exp\left(\boldsymbol{w}^{\top} \boldsymbol{f}(y_t, y_{t-1}, x_{1:T})\right)$
Discriminative model:
Only models $P(Y \mid X)$ and completely ignores modeling $P(X)$
Maximizes the conditional likelihood $P(Y_D \mid X_D, \theta)$
[Figure: the HMM chain contrasted with the MEMM chain, in which each $Y_t$ conditions on $Y_{t-1}$ and on the whole observation sequence $X_{1:T}$.]
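The per-state normalization above is just a softmax over the next label, as in this sketch (the function name is ours; `f` is any user-supplied feature function returning a vector for a given $(y_t, y_{t-1}, x)$):

```python
import numpy as np

def memm_transition(w, f, y_prev, x, labels):
    """P(y_t | y_{t-1}, x_{1:T}) in an MEMM: a per-state softmax over
    feature scores (a sketch)."""
    scores = np.array([w @ f(y, y_prev, x) for y in labels])
    scores -= scores.max()            # subtract max for numerical stability
    p = np.exp(scores)
    return p / p.sum()                # local (per-state) normalization
```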
Feature function
The feature function $\boldsymbol{f}(y_t, y_{t-1}, x_{1:T})$ can take account of relations between both the data and the label space.
However, its components are often indicator functions showing the absence or presence of a feature.
The weight $w_i$ captures how closely $f_i(y_t, y_{t-1}, x_{1:T})$ is related to the label.
MEMM disadvantages
Later observations in the sequence have no effect on the posterior probability of the current state:
the model does not allow for any smoothing;
the model is incapable of going back and changing its prediction about earlier observations.
The label bias problem:
there are cases where a given observation is not useful in predicting the next state of the model;
if a state has a unique outgoing transition, the observation is useless.
Label bias problem in MEMM
Label bias problem: states with fewer outgoing arcs are preferred.
In decoding, states whose transition distributions have lower entropy are preferred over others.
MEMMs should probably be avoided in cases where many transitions are close to deterministic.
Extreme case: when there is only one outgoing arc, it does not matter what the observation is.
The source of this problem: the probabilities of the outgoing arcs are normalized separately for each state,
i.e., the transition probabilities out of any state must sum to 1.
Solution: do not normalize probabilities locally.
From MEMM to CRF
From local probabilities to local potentials:
$P(Y \mid X) = \dfrac{1}{Z(X)} \prod_{t=1}^{T} \psi(y_{t-1}, y_t, X) = \dfrac{1}{Z(X, \boldsymbol{w})} \prod_{t=1}^{T} \exp\left(\boldsymbol{w}^{\top} \boldsymbol{f}(y_t, y_{t-1}, X)\right)$
The CRF is a discriminative model (like the MEMM) that allows dependence between each state and the entire observation sequence.
It uses a global normalizer $Z(X, \boldsymbol{w})$, which overcomes the label bias problem of the MEMM.
The MEMM uses an exponential model for each state, while the CRF has a single exponential model for the joint probability of the entire label sequence.
$X \equiv x_{1:T}$, $Y \equiv y_{1:T}$
[Figure: the CRF chain $Y_1 - Y_2 - \dots - Y_T$ with every $Y_t$ connected to the whole observation sequence $X_{1:T}$.]
CRF: conditional distribution
$P(Y \mid X) = \dfrac{1}{Z(X, \boldsymbol{w})} \exp\left(\sum_{t=1}^{T} \boldsymbol{w}^{\top} \boldsymbol{f}(y_t, y_{t-1}, X)\right) = \dfrac{1}{Z(X, \boldsymbol{w})} \exp\left(\sum_{t=1}^{T} \sum_{i} w_i f_i(y_t, y_{t-1}, X)\right)$
$Z(X, \boldsymbol{w}) = \sum_{Y} \exp\left(\sum_{t=1}^{T} \boldsymbol{w}^{\top} \boldsymbol{f}(y_t, y_{t-1}, X)\right)$
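Although $Z(X, \boldsymbol{w})$ sums over exponentially many label sequences, on a chain it can be computed by a forward recursion in log space. A minimal sketch, assuming the edge scores $\boldsymbol{w}^{\top}\boldsymbol{f}$ have been precomputed into an array `log_psi` with all unary scores folded into the edges (these names are ours):

```python
import numpy as np
from scipy.special import logsumexp

def crf_log_partition(log_psi):
    """log Z(X, w) for a linear-chain CRF (a sketch).
    log_psi[t, i, j]: score of labels (y = i, y' = j) on the t-th edge
    of the chain, with unary scores folded into the edges."""
    T1, M, _ = log_psi.shape              # T1 = T - 1 edges
    log_alpha = np.zeros(M)               # log-space forward message
    for t in range(T1):
        # log-sum-exp over the previous label i
        log_alpha = logsumexp(log_alpha[:, None] + log_psi[t], axis=0)
    return logsumexp(log_alpha)
```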
CRF: MAP inference
Given the CRF parameters $\boldsymbol{w}$, find the $Y^*$ that maximizes $P(Y \mid X)$:
$Y^* = \operatorname{argmax}_{Y} \exp\left(\sum_{t=1}^{T} \boldsymbol{w}^{\top} \boldsymbol{f}(y_t, y_{t-1}, X)\right)$
$Z(X)$ is not a function of $Y$ and so has been ignored.
The max-product algorithm can be used for this MAP inference problem,
the same as the Viterbi decoding used in HMMs.
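A sketch of the corresponding max-product (Viterbi) decoding in log space, using the same `log_psi` convention as in the partition-function sketch above:

```python
import numpy as np

def crf_viterbi(log_psi):
    """MAP label sequence for a linear-chain CRF via max-product (a sketch).
    log_psi[t, i, j]: edge scores with unaries folded in, as above."""
    T1, M, _ = log_psi.shape
    delta = np.zeros(M)                    # best score ending in each label
    back = np.zeros((T1, M), dtype=int)    # backpointers
    for t in range(T1):
        scores = delta[:, None] + log_psi[t]   # (previous i, next j)
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0)
    y = [int(delta.argmax())]              # best final label
    for t in range(T1 - 1, -1, -1):        # traceback
        y.append(int(back[t, y[-1]]))
    return y[::-1]
```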
CRF: inference
Exact inference for 1-D chain CRFs:
$\psi(y_t, y_{t-1}) = \exp\left(\boldsymbol{w}^{\top} \boldsymbol{f}(y_t, y_{t-1}, X)\right)$
[Figure: the clique chain $(Y_1, Y_2) - (Y_2, Y_3) - \dots - (Y_{T-1}, Y_T)$ with separators $Y_2, \dots, Y_{T-1}$.]
CRF: learning
Maximum conditional likelihood:
$\boldsymbol{w}^* = \operatorname{argmax}_{\boldsymbol{w}} \prod_{n=1}^{N} P(Y^{(n)} \mid X^{(n)}, \boldsymbol{w})$
$\prod_{n=1}^{N} P(Y^{(n)} \mid X^{(n)}, \boldsymbol{w}) = \prod_{n=1}^{N} \dfrac{1}{Z(X^{(n)}, \boldsymbol{w})} \exp\left(\sum_{t=1}^{T} \boldsymbol{w}^{\top} \boldsymbol{f}(y_t^{(n)}, y_{t-1}^{(n)}, X^{(n)})\right)$
$L(\boldsymbol{w}) = \ln \prod_{n=1}^{N} P(Y^{(n)} \mid X^{(n)}, \boldsymbol{w}) = \sum_{n=1}^{N} \left[\sum_{t=1}^{T} \boldsymbol{w}^{\top} \boldsymbol{f}(y_t^{(n)}, y_{t-1}^{(n)}, X^{(n)}) - \ln Z(X^{(n)}, \boldsymbol{w})\right]$
$\nabla_{\boldsymbol{w}} L(\boldsymbol{w}) = \sum_{n=1}^{N} \left[\sum_{t=1}^{T} \boldsymbol{f}(y_t^{(n)}, y_{t-1}^{(n)}, X^{(n)}) - \nabla_{\boldsymbol{w}} \ln Z(X^{(n)}, \boldsymbol{w})\right]$
$\nabla_{\boldsymbol{w}} \ln Z(X^{(n)}, \boldsymbol{w}) = E_{P(Y \mid X^{(n)}, \boldsymbol{w})}\left[\sum_{t=1}^{T} \boldsymbol{f}(y_t, y_{t-1}, X^{(n)})\right]$
The gradient of the log-partition function in an exponential family is the expectation of the sufficient statistics.
CRF: learning
$\nabla_{\boldsymbol{w}} \ln Z(X^{(n)}, \boldsymbol{w}) = E_{P(Y \mid X^{(n)}, \boldsymbol{w})}\left[\sum_{t=1}^{T} \boldsymbol{f}(y_t, y_{t-1}, X^{(n)})\right]$
$= \sum_{t=1}^{T} E_{P(Y \mid X^{(n)}, \boldsymbol{w})}\left[\boldsymbol{f}(y_t, y_{t-1}, X^{(n)})\right]$
$= \sum_{t=1}^{T} \sum_{y_t, y_{t-1}} P(y_t, y_{t-1} \mid X^{(n)}, \boldsymbol{w})\, \boldsymbol{f}(y_t, y_{t-1}, X^{(n)})$
How do we find the above expectations?
$P(y_t, y_{t-1} \mid X^{(n)}, \boldsymbol{w})$ must be computed for all $t = 2, \dots, T$.
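On a chain, these pairwise marginals come from one forward and one backward message pass, as in this log-space sketch (same `log_psi` convention as before; the function name is ours):

```python
import numpy as np
from scipy.special import logsumexp

def chain_pairwise_marginals(log_psi):
    """P(y_t, y_{t+1} | X, w) for every edge of a linear-chain CRF (a sketch).
    log_psi[t, i, j]: edge scores with unary scores folded in."""
    T1, M, _ = log_psi.shape                 # T1 = T - 1 edges
    log_a = np.zeros((T1 + 1, M))            # forward messages
    log_b = np.zeros((T1 + 1, M))            # backward messages
    for t in range(T1):
        log_a[t+1] = logsumexp(log_a[t][:, None] + log_psi[t], axis=0)
    for t in range(T1 - 1, -1, -1):
        log_b[t] = logsumexp(log_psi[t] + log_b[t+1][None, :], axis=1)
    logZ = logsumexp(log_a[-1])
    # edge marginal: alpha_t(i) * psi_t(i, j) * beta_{t+1}(j) / Z
    return np.exp(log_a[:-1, :, None] + log_psi +
                  log_b[1:, None, :] - logZ)
```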
CRF: learning
Inference to find $P(y_t, y_{t-1} \mid X^{(n)}, \boldsymbol{w})$: the junction tree algorithm.
Initialization of the clique potentials:
$\psi(y_t, y_{t-1}) = \exp\left(\boldsymbol{w}^{\top} \boldsymbol{f}(y_t, y_{t-1}, X^{(n)})\right)$, separator potentials $\psi(y_t) = 1$
After calibration:
$P(y_t, y_{t-1} \mid X^{(n)}, \boldsymbol{w}) = \dfrac{\psi^*(y_t, y_{t-1})}{\sum_{y_t, y_{t-1}} \psi^*(y_t, y_{t-1})}$
[Figure: the clique chain $(Y_1, Y_2) - (Y_2, Y_3) - \dots - (Y_{T-1}, Y_T)$ with separators $Y_2, \dots, Y_{T-1}$.]
CRF learning: gradient ascent
$\boldsymbol{w}^{(t+1)} = \boldsymbol{w}^{(t)} + \eta\, \nabla_{\boldsymbol{w}} L(\boldsymbol{w}^{(t)})$
In each gradient step, for each data sample, we run inference to find $P(y_t, y_{t-1} \mid X^{(n)}, \boldsymbol{w})$, which is required for computing the feature expectation:
$\nabla_{\boldsymbol{w}} \ln Z(X^{(n)}, \boldsymbol{w}) = E_{P(Y \mid X^{(n)}, \boldsymbol{w}^{(t)})}\left[\sum_{t'} \boldsymbol{f}(y_{t'}, y_{t'-1}, X^{(n)})\right] = \sum_{t'} \sum_{y_{t'}, y_{t'-1}} P(y_{t'}, y_{t'-1} \mid X^{(n)}, \boldsymbol{w})\, \boldsymbol{f}(y_{t'}, y_{t'-1}, X^{(n)})$
$\nabla_{\boldsymbol{w}} L(\boldsymbol{w}) = \sum_{n=1}^{N} \left[\sum_{t'=1}^{T} \boldsymbol{f}(y_{t'}^{(n)}, y_{t'-1}^{(n)}, X^{(n)}) - \nabla_{\boldsymbol{w}} \ln Z(X^{(n)}, \boldsymbol{w})\right]$
Summary
Discriminative vs. generative:
In cases where we have many correlated features, discriminative models (MEMM and CRF) are often better;
they avoid the challenge of explicitly modeling the distribution over the features.
But if only limited training data are available, the stronger bias of the generative model may dominate, and generative models may be preferred.
Learning:
HMMs and MEMMs are much more easily learned.
The CRF requires an iterative gradient-based approach, which is considerably more expensive:
inference must be run separately for every training sequence.
MEMM vs. CRF (label bias problem of the MEMM):
In many cases, CRFs are likely to be the safer choice (particularly where many transitions are close to deterministic), but the computational cost may be prohibitive for large data sets.