HMM, MEMM and CRF - SHARIF UNIVERSITY OF...

36
HMM, MEMM and CRF 40-957 Special Topics in Artificial Intelligence: Probabilistic Graphical Models Sharif University of Technology Soleymani Spring 2014

Transcript of HMM, MEMM and CRF - SHARIF UNIVERSITY OF...

Page 1: HMM, MEMM and CRF - SHARIF UNIVERSITY OF TECHNOLOGYce.sharif.edu/courses/92-93/2/ce957-1/resources...HMM, MEMM and CRF 40-957 Special Topics in Artificial Intelligence: Probabilistic

HMM, MEMM and CRF

40-957 Special Topics in Artificial Intelligence:

Probabilistic Graphical Models

Sharif University of Technology

Soleymani

Spring 2014

Page 2: HMM, MEMM and CRF - SHARIF UNIVERSITY OF TECHNOLOGYce.sharif.edu/courses/92-93/2/ce957-1/resources...HMM, MEMM and CRF 40-957 Special Topics in Artificial Intelligence: Probabilistic

Sequence labeling

2

Taking collective a set of interrelated instances ๐’™1, โ€ฆ , ๐’™๐‘‡

and jointly labeling them

We get as input a sequence of observations ๐‘ฟ = ๐’™1:๐‘‡ and need

to label them with some joint label ๐’€ = ๐‘ฆ1:๐‘‡

Page 3: HMM, MEMM and CRF - SHARIF UNIVERSITY OF TECHNOLOGYce.sharif.edu/courses/92-93/2/ce957-1/resources...HMM, MEMM and CRF 40-957 Special Topics in Artificial Intelligence: Probabilistic

Generalization of mixture models for

sequential data

3

[Jordan]

๐‘

๐‘‹

๐‘Œ1 ๐‘Œ2 ๐‘Œ๐‘‡

๐‘‹1 ๐‘‹2 ๐‘‹๐‘‡

โ€ฆ ๐‘Œ๐‘‡โˆ’1

๐‘‹๐‘‡โˆ’1

Y: states (latent variables)

X: observations

Page 4: HMM, MEMM and CRF - SHARIF UNIVERSITY OF TECHNOLOGYce.sharif.edu/courses/92-93/2/ce957-1/resources...HMM, MEMM and CRF 40-957 Special Topics in Artificial Intelligence: Probabilistic

HMM examples

4

Part-of-speech-tagging

Speech recognition

๐‘๐‘๐‘ƒ ๐‘‰๐ต๐‘ ๐‘‰๐ต ๐‘‡๐‘œ

Students are

๐‘‰๐ต๐‘

expected to study

Page 5: HMM, MEMM and CRF - SHARIF UNIVERSITY OF TECHNOLOGYce.sharif.edu/courses/92-93/2/ce957-1/resources...HMM, MEMM and CRF 40-957 Special Topics in Artificial Intelligence: Probabilistic

HMM: probabilistic model

5

Transitional probabilities: transition probabilities between

states

๐ด๐‘–๐‘— โ‰ก ๐‘ƒ(๐‘Œ๐‘ก = ๐‘—|๐‘Œ๐‘กโˆ’1 = ๐‘–)

Initial state distribution: start probabilities in different

states

๐œ‹๐‘– โ‰ก ๐‘ƒ(๐‘Œ1 = ๐‘–)

Observation model: Emission probabilities associated with

each state

๐‘ƒ(๐‘‹๐‘ก|๐‘Œ๐‘ก , ๐šฝ)

Page 6: HMM, MEMM and CRF - SHARIF UNIVERSITY OF TECHNOLOGYce.sharif.edu/courses/92-93/2/ce957-1/resources...HMM, MEMM and CRF 40-957 Special Topics in Artificial Intelligence: Probabilistic

HMM: probabilistic model

6

Transitional probabilities: transition probabilities between

states

๐‘ƒ(๐‘Œ๐‘ก|๐‘Œ๐‘กโˆ’1 = ๐‘–)~๐‘€๐‘ข๐‘™๐‘ก๐‘–(๐ด๐‘–1, โ€ฆ , ๐ด๐‘–๐‘€) โˆ€๐‘– โˆˆ ๐‘ ๐‘ก๐‘Ž๐‘ก๐‘’๐‘ 

Initial state distribution: start probabilities in different

states

๐‘ƒ(๐‘Œ1)~๐‘€๐‘ข๐‘™๐‘ก๐‘–(๐œ‹1, โ€ฆ , ๐œ‹๐‘€)

Observation model: Emission probabilities associated with

each state

Discrete observations: ๐‘ƒ(๐‘‹๐‘ก|๐‘Œ๐‘ก = ๐‘–)~๐‘€๐‘ข๐‘™๐‘ก๐‘–(๐ต๐‘–,1, โ€ฆ , ๐ต๐‘–,๐พ) โˆ€๐‘– โˆˆ ๐‘ ๐‘ก๐‘Ž๐‘ก๐‘’๐‘ 

General: ๐‘ƒ ๐‘‹๐‘ก ๐‘Œ๐‘ก = ๐‘– = ๐‘“(. |๐œฝ๐‘–)

๐‘Œ: states (latent variables)

๐‘‹: observations

Page 7: HMM, MEMM and CRF - SHARIF UNIVERSITY OF TECHNOLOGYce.sharif.edu/courses/92-93/2/ce957-1/resources...HMM, MEMM and CRF 40-957 Special Topics in Artificial Intelligence: Probabilistic

Inference problems in sequential data

7

Decoding: argmax๐‘ฆ1,โ€ฆ,๐‘ฆ๐‘‡

๐‘ƒ ๐‘ฆ1, โ€ฆ , ๐‘ฆ๐‘‡ ๐‘ฅ1, โ€ฆ , ๐‘ฅ๐‘‡

Evaluation

Filtering: ๐‘ƒ ๐‘ฆ๐‘ก ๐‘ฅ1, โ€ฆ , ๐‘ฅ๐‘ก

Smoothing: ๐‘กโ€ฒ < ๐‘ก, ๐‘ƒ ๐‘ฆ๐‘กโ€ฒ ๐‘ฅ1, โ€ฆ , ๐‘ฅ๐‘ก

Prediction: ๐‘กโ€ฒ > ๐‘ก, ๐‘ƒ ๐‘ฆ๐‘กโ€ฒ ๐‘ฅ1, โ€ฆ , ๐‘ฅ๐‘ก

Page 8: HMM, MEMM and CRF - SHARIF UNIVERSITY OF TECHNOLOGYce.sharif.edu/courses/92-93/2/ce957-1/resources...HMM, MEMM and CRF 40-957 Special Topics in Artificial Intelligence: Probabilistic

Some questions

8

๐‘ƒ ๐‘ฆ๐‘ก|๐‘ฅ1, โ€ฆ , ๐‘ฅ๐‘ก =?

Forward algorithm

๐‘ƒ ๐‘ฅ1, โ€ฆ , ๐‘ฅ๐‘‡ =?

Forward algorithm

๐‘ƒ ๐‘ฆ๐‘ก|๐‘ฅ1, โ€ฆ , ๐‘ฅ๐‘‡ =?

Forward-Backward algorithm

How do we adjust the HMM parameters:

Complete data: each training data includes a state sequence and

the corresponding observation sequence

Incomplete data: each training data includes only an observation

sequence

Page 9: HMM, MEMM and CRF - SHARIF UNIVERSITY OF TECHNOLOGYce.sharif.edu/courses/92-93/2/ce957-1/resources...HMM, MEMM and CRF 40-957 Special Topics in Artificial Intelligence: Probabilistic

Forward algorithm

9

๐›ผ๐‘ก ๐‘– = ๐‘ƒ ๐‘ฅ1, โ€ฆ , ๐‘ฅ๐‘ก , ๐‘Œ๐‘ก = ๐‘–

๐›ผ๐‘ก ๐‘– = ๐›ผ๐‘กโˆ’1 ๐‘— ๐‘ƒ ๐‘Œ๐‘ก = ๐‘–|๐‘Œ๐‘กโˆ’1 = ๐‘— ๐‘ƒ ๐‘ฅ๐‘ก| ๐‘Œ๐‘ก = ๐‘–

๐‘—

Initialization:

๐›ผ1 ๐‘– = ๐‘ƒ ๐‘ฅ1, ๐‘Œ1 = ๐‘– = ๐‘ƒ ๐‘ฅ1|๐‘Œ1 = ๐‘– ๐‘ƒ ๐‘Œ1 = ๐‘–

Iterations: ๐‘ก = 2 to ๐‘‡

๐›ผ๐‘ก ๐‘– = ๐›ผ๐‘กโˆ’1 ๐‘— ๐‘ƒ ๐‘Œ๐‘ก = ๐‘–|๐‘Œ๐‘กโˆ’1 = ๐‘— ๐‘ƒ ๐‘ฅ๐‘ก|๐‘Œ๐‘ก = ๐‘–๐‘—

๐‘–, ๐‘— = 1, โ€ฆ , ๐‘€

๐‘Œ1 ๐‘Œ2 ๐‘Œ๐‘‡

๐‘‹1 ๐‘‹2 ๐‘‹๐‘‡

โ€ฆ ๐‘Œ๐‘‡โˆ’1

๐‘‹๐‘‡โˆ’1

๐›ผ1(. ) ๐›ผ2(. ) ๐›ผ๐‘‡โˆ’1(. ) ๐›ผ๐‘‡(. ) ๐›ผ๐‘ก ๐‘– = ๐‘š๐‘กโˆ’1โ†’๐‘ก(๐‘–)

Page 10: HMM, MEMM and CRF - SHARIF UNIVERSITY OF TECHNOLOGYce.sharif.edu/courses/92-93/2/ce957-1/resources...HMM, MEMM and CRF 40-957 Special Topics in Artificial Intelligence: Probabilistic

Backward algorithm

10

๐›ฝ๐‘ก ๐‘– = ๐‘š๐‘กโ†’๐‘กโˆ’1(๐‘–) = ๐‘ƒ ๐‘ฅ๐‘ก+1, โ€ฆ , ๐‘ฅ๐‘‡|๐‘Œ๐‘ก = ๐‘–

๐›ฝ๐‘กโˆ’1 ๐‘– = ๐›ฝ๐‘ก ๐‘— ๐‘ƒ ๐‘Œ๐‘ก = ๐‘—|๐‘Œ๐‘กโˆ’1 = ๐‘– ๐‘ƒ ๐‘ฅ๐‘ก| ๐‘Œ๐‘ก = ๐‘–

๐‘—

Initialization:

๐›ฝ๐‘‡ ๐‘– = 1

Iterations: ๐‘ก = ๐‘‡ down to 2

๐›ฝ๐‘กโˆ’1 ๐‘– = ๐›ฝ๐‘ก ๐‘— ๐‘ƒ ๐‘Œ๐‘ก = ๐‘—|๐‘Œ๐‘กโˆ’1 = ๐‘– ๐‘ƒ ๐‘ฅ๐‘ก| ๐‘Œ๐‘ก = ๐‘–๐‘—

๐‘–, ๐‘— โˆˆ ๐‘ ๐‘ก๐‘Ž๐‘ก๐‘’๐‘ 

๐‘Œ1 ๐‘Œ2 ๐‘Œ๐‘‡

๐‘‹1 ๐‘‹2 ๐‘‹๐‘‡

โ€ฆ ๐‘Œ๐‘‡โˆ’1

๐‘‹๐‘‡โˆ’1

๐›ฝ๐‘‡ . = 1 ๐›ฝ๐‘‡โˆ’1 . ๐›ฝ2 . ๐›ฝ1 .

๐›ฝ๐‘ก ๐‘– = ๐‘š๐‘กโ†’๐‘กโˆ’1(๐‘–)

Page 11: HMM, MEMM and CRF - SHARIF UNIVERSITY OF TECHNOLOGYce.sharif.edu/courses/92-93/2/ce957-1/resources...HMM, MEMM and CRF 40-957 Special Topics in Artificial Intelligence: Probabilistic

Forward-backward algorithm

11

๐›ผ๐‘ก ๐‘– = ๐›ผ๐‘กโˆ’1 ๐‘— ๐‘ƒ ๐‘Œ๐‘ก = ๐‘–|๐‘Œ๐‘กโˆ’1 = ๐‘— ๐‘ƒ ๐‘ฅ๐‘ก| ๐‘Œ๐‘ก = ๐‘–

๐‘—

๐›ผ1 ๐‘– = ๐‘ƒ ๐‘ฅ1, ๐‘Œ1 = ๐‘– = ๐‘ƒ ๐‘ฅ1|๐‘Œ1 = ๐‘– ๐‘ƒ ๐‘Œ1 = ๐‘–

๐›ฝ๐‘กโˆ’1 ๐‘– = ๐›ฝ๐‘ก ๐‘— ๐‘ƒ ๐‘Œ๐‘ก = ๐‘—|๐‘Œ๐‘กโˆ’1 = ๐‘– ๐‘ƒ ๐‘ฅ๐‘ก| ๐‘Œ๐‘ก = ๐‘–

๐‘—

๐›ฝ๐‘‡ ๐‘– = 1

๐‘ƒ ๐‘ฅ1, ๐‘ฅ2, โ€ฆ , ๐‘ฅ๐‘‡ = ๐›ผ๐‘‡ ๐‘–

๐‘–

๐›ฝ๐‘‡ ๐‘– = ๐›ผ๐‘‡ ๐‘–

๐‘–

๐‘ƒ ๐‘Œ๐‘ก = ๐‘–|๐‘ฅ1, ๐‘ฅ2, โ€ฆ , ๐‘ฅ๐‘‡ =๐›ผ๐‘ก(๐‘–)๐›ฝ๐‘ก(๐‘–)

๐›ผ๐‘‡ ๐‘–๐‘–

๐›ผ๐‘ก(๐‘–) โ‰ก ๐‘ƒ ๐‘ฅ1, ๐‘ฅ2, โ€ฆ , ๐‘ฅ๐‘ก , ๐‘Œ๐‘ก = ๐‘–

๐›ฝ๐‘ก(๐‘–) โ‰ก ๐‘ƒ ๐‘ฅ๐‘ก+1, ๐‘ฅ๐‘ก+2, โ€ฆ , ๐‘ฅ๐‘‡|๐‘Œ๐‘ก = ๐‘–

Page 12: HMM, MEMM and CRF - SHARIF UNIVERSITY OF TECHNOLOGYce.sharif.edu/courses/92-93/2/ce957-1/resources...HMM, MEMM and CRF 40-957 Special Topics in Artificial Intelligence: Probabilistic

Forward-backward algorithm

12

This will be used for expectation maximization to train a

HMM

๐‘ƒ ๐‘Œ๐‘ก = ๐‘– ๐‘ฅ1, โ€ฆ , ๐‘ฅ๐‘‡ =๐‘ƒ ๐‘ฅ1,โ€ฆ,๐‘ฅ๐‘‡,๐‘Œ๐‘ก=๐‘–

๐‘ƒ ๐‘ฅ1,โ€ฆ,๐‘ฅ๐‘‡=

๐›ผ๐‘ก(๐‘–)๐›ฝ๐‘ก(๐‘–)

๐›ผ๐‘‡(๐‘—)๐‘๐‘—=1

๐‘Œ1 ๐‘Œ2 ๐‘Œ๐‘‡

๐‘‹1 ๐‘‹2 ๐‘‹๐‘‡

โ€ฆ ๐‘Œ๐‘‡โˆ’1

๐‘‹๐‘‡โˆ’1

๐›ฝ๐‘‡ . = 1

๐›ผ1(. ) ๐›ผ2(. ) ๐›ผ๐‘‡โˆ’1(. ) ๐›ผ๐‘‡(. )

๐›ฝ๐‘‡โˆ’1 . ๐›ฝ2 . ๐›ฝ1 .

Page 13: HMM, MEMM and CRF - SHARIF UNIVERSITY OF TECHNOLOGYce.sharif.edu/courses/92-93/2/ce957-1/resources...HMM, MEMM and CRF 40-957 Special Topics in Artificial Intelligence: Probabilistic

Decoding Problem

13

Choose state sequence to maximize the observations:

argmax๐‘ฆ1,โ€ฆ,๐‘ฆ๐‘ก

๐‘ƒ ๐‘ฆ1, โ€ฆ , ๐‘ฆ๐‘ก ๐‘ฅ1, โ€ฆ , ๐‘ฅ๐‘ก

Viterbi algorithm:

Define auxiliary variable ๐›ฟ:

๐›ฟ๐‘ก ๐‘– = max๐‘ฆ1,โ€ฆ,๐‘ฆ๐‘กโˆ’1

๐‘ƒ(๐‘ฆ1, ๐‘ฆ2, โ€ฆ , ๐‘ฆ๐‘กโˆ’1, ๐‘Œ๐‘ก = ๐‘–, ๐‘ฅ1, ๐‘ฅ2, โ€ฆ , ๐‘ฅ๐‘ก)

๐›ฟ๐‘ก(๐‘–): probability of the most probable path ending in state ๐‘Œ๐‘ก = ๐‘–

Recursive relation:

๐›ฟ๐‘ก ๐‘– = max๐‘—

๐›ฟ๐‘กโˆ’1 ๐‘— ๐‘ƒ(๐‘Œ๐‘ก = ๐‘–|๐‘Œ๐‘กโˆ’1 = ๐‘—) ๐‘ƒ ๐‘ฅ๐‘ก ๐‘Œ๐‘ก = ๐‘–

๐›ฟ๐‘ก ๐‘– = max๐‘—=1,โ€ฆ,๐‘

๐›ฟ๐‘กโˆ’1 ๐‘ก ๐‘ƒ ๐‘Œ๐‘ก = ๐‘– ๐‘Œ๐‘กโˆ’1 = ๐‘— ๐‘ƒ(๐‘ฅ๐‘ก|๐‘Œ๐‘ก = ๐‘–)

Page 14: HMM, MEMM and CRF - SHARIF UNIVERSITY OF TECHNOLOGYce.sharif.edu/courses/92-93/2/ce957-1/resources...HMM, MEMM and CRF 40-957 Special Topics in Artificial Intelligence: Probabilistic

Decoding Problem: Viterbi algorithm

14

Initialization ๐›ฟ1 ๐‘– = ๐‘ƒ ๐‘ฅ1|๐‘Œ1 = ๐‘– ๐‘ƒ ๐‘Œ1 = ๐‘–

๐œ“1 ๐‘– = 0

Iterations: ๐‘ก = 2, โ€ฆ , ๐‘‡

๐›ฟ๐‘ก ๐‘– = max๐‘—

๐›ฟ๐‘กโˆ’1 ๐‘— ๐‘ƒ ๐‘Œ๐‘ก = ๐‘– ๐‘Œ๐‘กโˆ’1 = ๐‘— ๐‘ƒ ๐‘ฅ๐‘ก ๐‘Œ๐‘ก = ๐‘–

๐œ“๐‘ก ๐‘– = argmax๐‘—

๐›ฟ๐‘ก ๐‘— ๐‘ƒ ๐‘Œ๐‘ก = ๐‘– ๐‘Œ๐‘กโˆ’1 = ๐‘—

Final computation:

๐‘ƒโˆ— = max๐‘—=1,โ€ฆ,๐‘

๐›ฟ๐‘‡ ๐‘—

๐‘ฆ๐‘‡โˆ— = argmax

๐‘—=1,โ€ฆ,๐‘๐›ฟ๐‘‡ ๐‘—

Traceback state sequence: ๐‘ก = ๐‘‡ โˆ’ 1 down to 1 ๐‘ฆ๐‘ก

โˆ— = ๐œ“๐‘ก+1 ๐‘ฆ๐‘ก+1โˆ—

๐‘– = 1, โ€ฆ , ๐‘€

๐‘– = 1, โ€ฆ , ๐‘€

Page 15: HMM, MEMM and CRF - SHARIF UNIVERSITY OF TECHNOLOGYce.sharif.edu/courses/92-93/2/ce957-1/resources...HMM, MEMM and CRF 40-957 Special Topics in Artificial Intelligence: Probabilistic

Max-product algorithm

15

๐‘š๐‘—๐‘–๐‘š๐‘Ž๐‘ฅ ๐‘ฅ๐‘– = max

๐‘ฅ๐‘—

๐œ™ ๐‘ฅ๐‘— ๐œ™ ๐‘ฅ๐‘– , ๐‘ฅ๐‘— ๐‘š๐‘˜๐‘—๐‘š๐‘Ž๐‘ฅ(๐‘ฅ๐‘—)

๐‘˜โˆˆ๐’ฉ(๐‘—)\๐‘–

๐›ฟ๐‘ก ๐‘– = ๐‘š๐‘กโˆ’1,๐‘ก๐‘š๐‘Ž๐‘ฅ ร— ๐œ™ ๐‘ฅ๐‘–

Page 16: HMM, MEMM and CRF - SHARIF UNIVERSITY OF TECHNOLOGYce.sharif.edu/courses/92-93/2/ce957-1/resources...HMM, MEMM and CRF 40-957 Special Topics in Artificial Intelligence: Probabilistic

HMM Learning

16

Supervised learning: When we have a set of data samples,

each of them containing a pair of sequences (one is the

observation sequence and the other is the state sequence)

Unsupervised learning: When we have a set of data

samples, each of them containing a sequence of observations

Page 17: HMM, MEMM and CRF - SHARIF UNIVERSITY OF TECHNOLOGYce.sharif.edu/courses/92-93/2/ce957-1/resources...HMM, MEMM and CRF 40-957 Special Topics in Artificial Intelligence: Probabilistic

HMM supervised learning by MLE

17

Initial state probability:

๐œ‹๐‘– = ๐‘ƒ ๐‘Œ1 = ๐‘– , 1 โ‰ค ๐‘– โ‰ค ๐‘€

State transition probability:

๐ด๐‘—๐‘– = ๐‘ƒ ๐‘Œ๐‘ก+1 = ๐‘– ๐‘Œ๐‘ก = ๐‘— , 1 โ‰ค ๐‘–, ๐‘— โ‰ค ๐‘€

State transition probability:

๐ต๐‘–๐‘˜ = ๐‘ƒ ๐‘‹๐‘ก = ๐‘˜ ๐‘Œ๐‘ก = ๐‘– , 1 โ‰ค ๐‘˜ โ‰ค ๐พ

๐‘Œ1 ๐‘Œ2 ๐‘Œ๐‘‡ ๐‘Œ๐‘‡โˆ’1 โ€ฆ

๐… ๐‘จ

๐‘‹2 ๐‘‹1 ๐‘‹๐‘‡โˆ’1 ๐‘‹๐‘‡

๐‘ฉ

Discrete

observations

Page 18: HMM, MEMM and CRF - SHARIF UNIVERSITY OF TECHNOLOGYce.sharif.edu/courses/92-93/2/ce957-1/resources...HMM, MEMM and CRF 40-957 Special Topics in Artificial Intelligence: Probabilistic

HMM: supervised parameter learning by MLE

18

๐‘ƒ ๐’Ÿ|๐œฝ = ๐‘ƒ ๐‘ฆ1(๐‘›)

๐… ๐‘ƒ(๐‘ฆ๐‘ก(๐‘›)

|๐‘ฆ๐‘กโˆ’1(๐‘›)

, ๐‘จ)๐‘‡

๐‘ก=2 ๐‘ƒ(๐’™๐‘ก

(๐‘›)|๐‘ฆ๐‘ก

(๐‘›), ๐‘ฉ)

๐‘‡

๐‘ก=1

๐‘

๐‘›=1

๐ด ๐‘–๐‘— = ๐ผ ๐‘ฆ๐‘กโˆ’1

(๐‘›)= ๐‘–, ๐‘ฆ๐‘ก

(๐‘›)= ๐‘—๐‘‡

๐‘ก=2๐‘๐‘›=1

๐ผ ๐‘ฆ๐‘กโˆ’1(๐‘›)

= ๐‘–๐‘‡๐‘ก=2

๐‘๐‘›=1

๐œ‹ ๐‘– = ๐ผ ๐‘ฆ1

(๐‘›)= ๐‘–๐‘

๐‘›=1

๐‘

๐ต ๐‘–๐‘˜ = ๐ผ ๐‘ฆ๐‘ก

(๐‘›)= ๐‘–, ๐‘ฅ๐‘ก

(๐‘›)= ๐‘˜๐‘‡

๐‘ก=1๐‘๐‘›=1

๐ผ ๐‘ฆ๐‘ก(๐‘›)

= ๐‘–๐‘‡๐‘ก=1

๐‘๐‘›=1

Discrete

observations

Page 19: HMM, MEMM and CRF - SHARIF UNIVERSITY OF TECHNOLOGYce.sharif.edu/courses/92-93/2/ce957-1/resources...HMM, MEMM and CRF 40-957 Special Topics in Artificial Intelligence: Probabilistic

Learning

19

Problem: how to construct an HHM given only observations?

Find ๐œฝ = (๐‘จ, ๐‘ฉ, ๐…), maximizing ๐‘ƒ(๐’™1, โ€ฆ , ๐’™๐‘‡|๐œฝ)

Incomplete data

EM algorithm

Page 20: HMM, MEMM and CRF - SHARIF UNIVERSITY OF TECHNOLOGYce.sharif.edu/courses/92-93/2/ce957-1/resources...HMM, MEMM and CRF 40-957 Special Topics in Artificial Intelligence: Probabilistic

HMM learning by EM (Baum-Welch)

20

๐œฝ๐‘œ๐‘™๐‘‘ = ๐…๐‘œ๐‘™๐‘‘ , ๐‘จ๐‘œ๐‘™๐‘‘ , ๐šฝ๐‘œ๐‘™๐‘‘

E-Step:

๐›พ๐‘›,๐‘ก๐‘– = ๐‘ƒ ๐‘Œ๐‘ก

(๐‘›)= ๐‘˜ ๐‘ฟ(๐‘›); ๐œฝ๐‘œ๐‘™๐‘‘

๐œ‰๐‘›,๐‘ก๐‘–,๐‘—

= ๐‘ƒ ๐‘Œ๐‘กโˆ’1(๐‘›)

= ๐‘–, ๐‘Œ๐‘ก(๐‘›)

= ๐‘— ๐’™(๐‘›); ๐œฝ๐‘œ๐‘™๐‘‘

M-Step:

๐œ‹๐‘–๐‘›๐‘’๐‘ค =

๐›พ๐‘›,1๐‘–๐‘

๐‘›=1

๐‘

๐ด๐‘–,๐‘—๐‘›๐‘’๐‘ค =

๐œ‰๐‘›,๐‘ก๐‘–,๐‘—๐‘‡

๐‘ก=2๐‘๐‘›=1

๐›พ๐‘›,๐‘ก๐‘–๐‘‡โˆ’1

๐‘ก=1๐‘๐‘›=1

๐ต๐‘–,๐‘˜๐‘›๐‘’๐‘ค =

๐›พ๐‘›,๐‘ก๐‘– ๐ผ ๐‘‹๐‘ก

(๐‘›)=๐‘–, ๐‘‡

๐‘ก=1๐‘๐‘›=1

๐ผ ๐‘‹๐‘ก(๐‘›)

=๐‘–๐‘‡๐‘ก=1

๐‘๐‘›=1

๐‘–, ๐‘— = 1, โ€ฆ , ๐‘€

๐‘–, ๐‘— = 1, โ€ฆ , ๐‘€

Baum-Welch algorithm (Baum, 1972)

Discrete observations

Page 21: HMM, MEMM and CRF - SHARIF UNIVERSITY OF TECHNOLOGYce.sharif.edu/courses/92-93/2/ce957-1/resources...HMM, MEMM and CRF 40-957 Special Topics in Artificial Intelligence: Probabilistic

HMM learning by EM (Baum-Welch)

21

๐œฝ๐‘œ๐‘™๐‘‘ = ๐…๐‘œ๐‘™๐‘‘ , ๐‘จ๐‘œ๐‘™๐‘‘ , ๐šฝ๐‘œ๐‘™๐‘‘

E-Step:

๐›พ๐‘›,๐‘ก๐‘– = ๐‘ƒ ๐‘Œ๐‘ก

(๐‘›)= ๐‘˜ ๐‘ฟ(๐‘›); ๐œฝ๐‘œ๐‘™๐‘‘

๐œ‰๐‘›,๐‘ก๐‘–,๐‘—

= ๐‘ƒ ๐‘Œ๐‘กโˆ’1(๐‘›)

= ๐‘–, ๐‘Œ๐‘ก(๐‘›)

= ๐‘— ๐’™(๐‘›); ๐œฝ๐‘œ๐‘™๐‘‘

M-Step:

๐œ‹๐‘–๐‘›๐‘’๐‘ค =

๐›พ๐‘›,1๐‘–๐‘

๐‘›=1

๐‘

๐ด๐‘–,๐‘—๐‘›๐‘’๐‘ค =

๐œ‰๐‘›,๐‘ก๐‘–,๐‘—๐‘‡

๐‘ก=2๐‘๐‘›=1

๐›พ๐‘›,๐‘ก๐‘–๐‘‡โˆ’1

๐‘ก=1๐‘๐‘›=1

๐ต๐‘–,๐‘˜๐‘›๐‘’๐‘ค =

๐›พ๐‘›,๐‘ก๐‘– ๐ผ ๐‘‹๐‘ก

(๐‘›)=๐‘–, ๐‘‡

๐‘ก=1๐‘๐‘›=1

๐ผ ๐‘‹๐‘ก(๐‘›)

=๐‘–๐‘‡๐‘ก=1

๐‘๐‘›=1

๐‘–, ๐‘— = 1, โ€ฆ , ๐‘€

๐‘–, ๐‘— = 1, โ€ฆ , ๐‘€

๐๐‘–๐‘›๐‘’๐‘ค =

๐›พ๐‘›,๐‘ก๐‘– ๐’™๐‘ก

(๐‘›)๐‘‡๐‘ก=1

๐‘๐‘›=1

๐›พ๐‘›,๐‘ก๐‘–๐‘‡

๐‘ก=1๐‘๐‘›=1

๐œฎ๐‘–๐‘›๐‘’๐‘ค =

๐›พ๐‘›,๐‘ก๐‘– ๐’™๐‘ก

(๐‘›)โˆ’ ๐๐‘–

๐‘›๐‘’๐‘ค ๐’™๐‘ก(๐‘›)

โˆ’ ๐๐‘–๐‘›๐‘’๐‘ค

๐‘‡๐‘‡๐‘ก=1

๐‘๐‘›=1

๐›พ๐‘›,๐‘ก๐‘–๐‘‡

๐‘ก=1๐‘๐‘›=1

Assumption: Gaussian emission

probabilities

Baum-Welch algorithm (Baum, 1972)

Page 22: HMM, MEMM and CRF - SHARIF UNIVERSITY OF TECHNOLOGYce.sharif.edu/courses/92-93/2/ce957-1/resources...HMM, MEMM and CRF 40-957 Special Topics in Artificial Intelligence: Probabilistic

Forward-backward algorithm for E-step

22

๐‘ƒ ๐‘ฆ๐‘กโˆ’1, ๐‘ฆ๐‘ก ๐’™1, โ€ฆ , ๐’™๐‘‡

=๐‘ƒ ๐’™1, โ€ฆ , ๐’™๐‘‡ ๐‘ฆ๐‘กโˆ’1, ๐‘ฆ๐‘ก)๐‘ƒ ๐‘ฆ๐‘กโˆ’1, ๐‘ฆ๐‘ก

๐‘ƒ(๐’™1, โ€ฆ , ๐’™๐‘‡)

=๐‘ƒ ๐’™1, โ€ฆ , ๐’™๐‘กโˆ’1 ๐‘ฆ๐‘กโˆ’1)๐‘ƒ ๐’™๐‘ก ๐‘ฆ๐‘ก)๐‘ƒ ๐’™๐‘ก+1, โ€ฆ , ๐’™๐‘‡ ๐‘ฆ๐‘ก)๐‘ƒ ๐‘ฆ๐‘ก ๐‘ฆ๐‘กโˆ’1)๐‘ƒ ๐‘ฆ๐‘กโˆ’1

๐‘ƒ(๐’™1, โ€ฆ , ๐’™๐‘‡)

=๐›ผ๐‘กโˆ’1(๐‘ฆ๐‘กโˆ’1)๐‘ƒ ๐’™๐‘ก ๐‘ฆ๐‘ก)๐‘ƒ ๐‘ฆ๐‘ก ๐‘ฆ๐‘กโˆ’1)๐›ฝ๐‘ก ๐‘ฆ๐‘ก

๐›ผ๐‘‡(๐‘–)๐‘€๐‘–=1

Page 23: HMM, MEMM and CRF - SHARIF UNIVERSITY OF TECHNOLOGYce.sharif.edu/courses/92-93/2/ce957-1/resources...HMM, MEMM and CRF 40-957 Special Topics in Artificial Intelligence: Probabilistic

HMM shortcomings

23

In modeling the joint distribution ๐‘ƒ(๐’€, ๐‘ฟ), HMM ignores many

dependencies between observations ๐‘‹1, โ€ฆ , ๐‘‹๐‘‡ (similar to

most generative models that need to simplify the structure)

In the sequence labeling task, we need to classify an

observation sequence using the conditional probability ๐‘ƒ(๐’€|๐‘ฟ)

However, HMM learns a joint distribution ๐‘ƒ(๐’€, ๐‘ฟ) while uses only

๐‘ƒ(๐’€|๐‘ฟ) for labeling

๐‘Œ1 ๐‘Œ2 ๐‘Œ๐‘‡

๐‘‹1 ๐‘‹2 ๐‘‹๐‘‡

โ€ฆ ๐‘Œ๐‘‡โˆ’1

๐‘‹๐‘‡โˆ’1

Page 24: HMM, MEMM and CRF - SHARIF UNIVERSITY OF TECHNOLOGYce.sharif.edu/courses/92-93/2/ce957-1/resources...HMM, MEMM and CRF 40-957 Special Topics in Artificial Intelligence: Probabilistic

Maximum Entropy Markov Model (MEMM)

24

๐‘ƒ ๐’š1:๐‘‡ ๐’™1:๐‘‡ = ๐‘ƒ ๐‘ฆ1 ๐’™1:๐‘‡ ๐‘ƒ ๐‘ฆ๐‘ก ๐‘ฆ๐‘กโˆ’1, ๐’™1:๐‘‡

๐‘‡

๐‘ก=2

๐‘ƒ ๐‘ฆ๐‘ก ๐‘ฆ๐‘กโˆ’1, ๐’™1:๐‘‡ =exp ๐’˜๐‘‡๐’‡ ๐‘ฆ๐‘ก , ๐‘ฆ๐‘กโˆ’1, ๐’™1:๐‘‡

exp ๐’˜๐‘‡๐’‡ ๐‘ฆ๐‘ก, ๐‘ฆ๐‘กโˆ’1, ๐’™1:๐‘‡๐‘ฆ๐‘ก

Discriminative model

Only models ๐‘ƒ(๐’€|๐‘ฟ) and completely ignores modeling ๐‘ƒ(๐‘ฟ) Maximizes the conditional likelihood ๐‘ƒ(๐’Ÿ๐’€|๐’Ÿ๐‘ฟ, ๐œฝ)

๐‘Œ1 ๐‘Œ2 ๐‘Œ๐‘‡

๐‘‹1 ๐‘‹2 ๐‘‹๐‘‡

โ€ฆ ๐‘Œ๐‘‡โˆ’1

๐‘‹๐‘‡โˆ’1

๐‘Œ1 ๐‘Œ2 ๐‘Œ๐‘‡

๐‘ฟ1:๐‘‡

โ€ฆ ๐‘Œ๐‘‡โˆ’1

๐‘ ๐‘ฆ๐‘กโˆ’1, ๐’™1:๐‘‡ , ๐’˜ = exp ๐’˜๐‘‡๐’‡ ๐‘ฆ๐‘ก, ๐‘ฆ๐‘กโˆ’1, ๐’™1:๐‘‡

๐‘ฆ๐‘ก

Page 25: HMM, MEMM and CRF - SHARIF UNIVERSITY OF TECHNOLOGYce.sharif.edu/courses/92-93/2/ce957-1/resources...HMM, MEMM and CRF 40-957 Special Topics in Artificial Intelligence: Probabilistic

Feature function

25

Feature function ๐’‡ ๐‘ฆ๐‘ก, ๐‘ฆ๐‘กโˆ’1, ๐’™1:๐‘‡ can take account of

relations between both data and label space

However, they are often indicator functions showing absence

or presence of a feature

๐‘ค๐‘– captures how closely ๐‘“ ๐‘ฆ๐‘ก, ๐‘ฆ๐‘กโˆ’1, ๐’™1:๐‘‡ is related with

the label

Page 26: HMM, MEMM and CRF - SHARIF UNIVERSITY OF TECHNOLOGYce.sharif.edu/courses/92-93/2/ce957-1/resources...HMM, MEMM and CRF 40-957 Special Topics in Artificial Intelligence: Probabilistic

MEMM disadvantages

26

The later observation in the sequence has absolutely no effect

on the posterior probability of the current state

model does not allow for any smoothing.

The model is incapable of going back and changing its prediction

about the earlier observations.

The label bias problem

there are cases that a given observation is not useful in

predicting the next state of the model.

if a state has a unique out-going transition, the given observation is

useless

Page 27: HMM, MEMM and CRF - SHARIF UNIVERSITY OF TECHNOLOGYce.sharif.edu/courses/92-93/2/ce957-1/resources...HMM, MEMM and CRF 40-957 Special Topics in Artificial Intelligence: Probabilistic

Label bias problem in MEMM

27

Label bias problem: states with fewer arcs are preferred

Preference of states with lower entropy of transitions over others in decoding

MEMMs should probably be avoided in cases where many transitions are close to deterministic

Extreme case: When there is only one outgoing arc, it does not matter what the observation is

The source of this problem: Probabilities of outgoing arcs normalized separately for each state

sum of transition probability for any state has to sum to 1

Solution: Do not normalize probabilities locally

Page 28: HMM, MEMM and CRF - SHARIF UNIVERSITY OF TECHNOLOGYce.sharif.edu/courses/92-93/2/ce957-1/resources...HMM, MEMM and CRF 40-957 Special Topics in Artificial Intelligence: Probabilistic

From MEMM to CRF

28

From local probabilities to local potentials

๐‘ƒ ๐’€ ๐‘ฟ =1

๐‘(๐‘ฟ) ๐œ™ ๐‘ฆ๐‘กโˆ’1, ๐‘ฆ๐‘ก , ๐‘ฟ

๐‘‡

๐‘ก=1

=1

๐‘(๐‘ฟ, ๐’˜) exp ๐’˜๐‘‡๐‘“ ๐‘ฆ๐‘ก, ๐‘ฆ๐‘กโˆ’1, ๐’™

๐‘‡

๐‘ก=1

CRF is a discriminative model (like MEMM) can dependence between each state and the entire observation sequence

uses global normalizer ๐‘(๐‘ฟ, ๐’˜) that overcomes the label bias problem of MEMM

MEMM use an exponential model for each state while CRF have a single

exponential model for the joint probability of the entire label sequence

๐‘Œ1 ๐‘Œ2 ๐‘Œ๐‘‡

๐‘ฟ1:๐‘‡

โ€ฆ ๐‘Œ๐‘‡โˆ’1

๐‘ฟ โ‰ก ๐’™1:๐‘‡

๐’€ โ‰ก ๐‘ฆ1:๐‘‡

Page 29: HMM, MEMM and CRF - SHARIF UNIVERSITY OF TECHNOLOGYce.sharif.edu/courses/92-93/2/ce957-1/resources...HMM, MEMM and CRF 40-957 Special Topics in Artificial Intelligence: Probabilistic

CRF: conditional distribution

29

๐‘ƒ ๐’š ๐’™ =1

๐‘(๐’™, ๐€)exp ๐€๐‘‡๐’‡ ๐‘ฆ๐‘– , ๐‘ฆ๐‘–โˆ’1, ๐’™

๐‘‡

๐‘–=1

=1

๐‘(๐’™, ๐€)exp ๐œ†๐‘˜๐‘“๐‘˜ ๐‘ฆ๐‘– , ๐‘ฆ๐‘–โˆ’1, ๐’™

๐‘˜

๐‘‡

๐‘–=1

๐‘ ๐’™, ๐€ = exp ๐€๐‘‡๐’‡ ๐‘ฆ๐‘– , ๐‘ฆ๐‘–โˆ’1, ๐’™

๐‘‡

๐‘–=1๐’š

Page 30: HMM, MEMM and CRF - SHARIF UNIVERSITY OF TECHNOLOGYce.sharif.edu/courses/92-93/2/ce957-1/resources...HMM, MEMM and CRF 40-957 Special Topics in Artificial Intelligence: Probabilistic

CRF: MAP inference

30

Given CRF parameters ๐€, find the ๐’šโˆ— that maximizes ๐‘ƒ ๐’š ๐’™ :

๐’šโˆ— = argmax๐’š

exp ๐€๐‘‡๐’‡ ๐‘ฆ๐‘– , ๐‘ฆ๐‘–โˆ’1, ๐’™

๐‘‡

๐‘–=1

๐‘(๐’™) is not a function of ๐’š and so has been ignored

Max-product algorithm can be used for this MAP inference

problem

Same as Viterbi decoding used in HMMs

Page 31: HMM, MEMM and CRF - SHARIF UNIVERSITY OF TECHNOLOGYce.sharif.edu/courses/92-93/2/ce957-1/resources...HMM, MEMM and CRF 40-957 Special Topics in Artificial Intelligence: Probabilistic

CRF: inference

31

Exact inference for 1-D chain CRFs

๐œ™ ๐‘ฆ๐‘– , ๐‘ฆ๐‘–โˆ’1 = exp ๐€๐‘‡๐’‡ ๐‘ฆ๐‘– , ๐‘ฆ๐‘–โˆ’1, ๐’™

๐‘Œ1 ๐‘Œ2 ๐‘Œ๐‘‡

๐‘ฟ1:๐‘‡

โ€ฆ ๐‘Œ๐‘‡โˆ’1

๐‘Œ1, ๐‘Œ2

๐‘Œ2, ๐‘Œ3

โ€ฆ ๐‘Œ๐‘‡โˆ’1, ๐‘Œ๐‘‡ ๐‘Œ2 ๐‘Œ๐‘‡โˆ’1

Page 32: HMM, MEMM and CRF - SHARIF UNIVERSITY OF TECHNOLOGYce.sharif.edu/courses/92-93/2/ce957-1/resources...HMM, MEMM and CRF 40-957 Special Topics in Artificial Intelligence: Probabilistic

CRF: learning

32

๐€โˆ— = argmax๐€

๐‘ƒ ๐’š(๐‘›) ๐’™(๐‘›), ๐€๐‘

๐‘›=1

๐‘ƒ ๐’š(๐‘›) ๐’™(๐‘›), ๐€๐‘

๐‘›=1=

1

๐‘(๐’™(๐‘›), ๐€)exp ๐€๐‘‡๐’‡ ๐‘ฆ๐‘–

(๐‘›), ๐‘ฆ๐‘–โˆ’1

(๐‘›), ๐’™(๐‘›)

๐‘‡

๐‘–=1

๐‘

๐‘›=1

= exp ๐€๐‘‡๐’‡ ๐‘ฆ๐‘–(๐‘›)

, ๐‘ฆ๐‘–โˆ’1(๐‘›)

, ๐’™(๐‘›) โˆ’ ln ๐‘(๐’™(๐‘›), ๐€)

๐‘‡

๐‘–=1

๐‘

๐‘›=1

๐ฟ ๐€ = ln ๐‘ƒ ๐’š(๐‘›) ๐’™(๐‘›), ๐€๐‘

๐‘›=1

๐›ป๐€๐ฟ ๐€ = ๐’‡ ๐‘ฆ๐‘–(๐‘›)

, ๐‘ฆ๐‘–โˆ’1(๐‘›)

, ๐’™(๐‘›) โˆ’ ๐›ป๐€ ln ๐‘(๐’™(๐‘›), ๐€)

๐‘‡

๐‘–=1

๐‘

๐‘›=1

๐›ป๐€ ln ๐‘(๐’™(๐‘›), ๐€) = ๐‘ƒ(๐’š|๐’™ ๐‘› , ๐€) ๐’‡ ๐‘ฆ๐‘– , ๐‘ฆ๐‘–โˆ’1, ๐’™ ๐‘›

๐‘‡

๐‘–=1๐’š

Maximum Conditional Likelihood

๐œฝ = argmax๐œฝ

๐‘ƒ(๐’Ÿ๐‘Œ|๐’Ÿ๐‘‹, ๐œฝ)

Gradient of the log-partition function in an exponential family is

the expectation of the sufficient statistics.

Page 33: HMM, MEMM and CRF - SHARIF UNIVERSITY OF TECHNOLOGYce.sharif.edu/courses/92-93/2/ce957-1/resources...HMM, MEMM and CRF 40-957 Special Topics in Artificial Intelligence: Probabilistic

CRF: learning

33

๐›ป๐€ ln ๐‘(๐’™(๐‘›), ๐€) = ๐‘ƒ(๐’š|๐’™ ๐‘› , ๐€) ๐’‡ ๐‘ฆ๐‘– , ๐‘ฆ๐‘–โˆ’1, ๐’™ ๐‘›

๐‘‡

๐‘–=1๐’š

= ๐‘ƒ(๐’š|๐’™ ๐‘› , ๐€)๐’‡ ๐‘ฆ๐‘– , ๐‘ฆ๐‘–โˆ’1, ๐’™ ๐‘›

๐’š

๐‘‡

๐‘–=1

= ๐‘ƒ(๐‘ฆ๐‘– , ๐‘ฆ๐‘–โˆ’1|๐’™ ๐‘› , ๐€)๐’‡ ๐‘ฆ๐‘– , ๐‘ฆ๐‘–โˆ’1, ๐’™

๐‘ฆ๐‘–,๐‘ฆ๐‘–โˆ’1

๐‘‡

๐‘–=1

How do we find the above expectations?

๐‘ƒ(๐‘ฆ๐‘– , ๐‘ฆ๐‘–โˆ’1|๐’™ ๐‘› , ๐€) must be computed for all ๐‘– = 2, โ€ฆ , ๐‘‡

Page 34: HMM, MEMM and CRF - SHARIF UNIVERSITY OF TECHNOLOGYce.sharif.edu/courses/92-93/2/ce957-1/resources...HMM, MEMM and CRF 40-957 Special Topics in Artificial Intelligence: Probabilistic

CRF: learning

Inference to find ๐‘ƒ(๐‘ฆ๐‘– , ๐‘ฆ๐‘–โˆ’1|๐’™ ๐‘› , ๐€)

34

Junction tree algorithm

Initialization of clique potentials:

๐œ“ ๐‘ฆ๐‘– , ๐‘ฆ๐‘–โˆ’1 = exp ๐€๐‘‡๐’‡ ๐‘ฆ๐‘– , ๐‘ฆ๐‘–โˆ’1, ๐’™ ๐‘›

๐œ™ ๐‘ฆ๐‘– = 1

After calibration:๐œ“โˆ— ๐‘ฆ๐‘– , ๐‘ฆ๐‘–โˆ’1

๐‘ƒ ๐‘ฆ๐‘– , ๐‘ฆ๐‘–โˆ’1|๐’™(๐‘›), ๐€ =๐œ“โˆ— ๐‘ฆ๐‘– , ๐‘ฆ๐‘–โˆ’1

๐œ“โˆ— ๐‘ฆ๐‘– , ๐‘ฆ๐‘–โˆ’1๐‘ฆ๐‘–,๐‘ฆ๐‘–โˆ’1

๐‘Œ1, ๐‘Œ2

๐‘Œ2, ๐‘Œ3

โ€ฆ ๐‘Œ๐‘‡โˆ’1, ๐‘Œ๐‘‡ ๐‘Œ2 ๐‘Œ๐‘‡โˆ’1

Page 35: HMM, MEMM and CRF - SHARIF UNIVERSITY OF TECHNOLOGYce.sharif.edu/courses/92-93/2/ce957-1/resources...HMM, MEMM and CRF 40-957 Special Topics in Artificial Intelligence: Probabilistic

CRF learning: gradient descent

35

๐€๐‘ก = ๐€๐‘ก + ๐œ‚๐›ป๐€๐ฟ ๐€๐‘ก

In each gradient step, for each data sample, we use inference

to find ๐‘ƒ(๐‘ฆ๐‘– , ๐‘ฆ๐‘–โˆ’1|๐’™ ๐‘› , ๐€) required for computing feature expectation:

๐›ป๐€ ln ๐‘(๐’™(๐‘›), ๐€) = ๐ธ๐‘ƒ(๐’š|๐’™ ๐‘› ,๐€๐‘ก) ๐’‡ ๐‘ฆ๐‘– , ๐‘ฆ๐‘–โˆ’1, ๐’™ ๐‘›

= ๐‘ƒ(๐‘ฆ๐‘– , ๐‘ฆ๐‘–โˆ’1|๐’™ ๐‘› , ๐€)๐’‡ ๐‘ฆ๐‘– , ๐‘ฆ๐‘–โˆ’1, ๐’™ ๐‘›

๐‘ฆ๐‘–,๐‘ฆ๐‘–โˆ’1

๐›ป๐€๐ฟ ๐€ = ๐’‡ ๐‘ฆ๐‘–(๐‘›)

, ๐‘ฆ๐‘–โˆ’1(๐‘›)

, ๐’™(๐‘›) โˆ’ ๐›ป๐€ ln ๐‘(๐’™(๐‘›), ๐€)

๐‘‡

๐‘–=1

๐‘

๐‘›=1

Page 36: HMM, MEMM and CRF - SHARIF UNIVERSITY OF TECHNOLOGYce.sharif.edu/courses/92-93/2/ce957-1/resources...HMM, MEMM and CRF 40-957 Special Topics in Artificial Intelligence: Probabilistic

Summary

36

Discriminative vs. generative

In cases where we have many correlated features, discriminative models (MEMM and CRF) are often better

avoid the challenge of explicitly modeling the distribution over features.

but, if only limited training data are available, the stronger bias of the generative model may dominate and these models may be preferred.

Learning

HMMs and MEMMs are much more easily learned.

CRF requires an iterative gradient-based approach, which is considerably more expensive

inference must be run separately for every training sequence

MEMM vs. CRF (Label bias problem of MEMM)

In many cases, CRFs are likely to be a safer choice (particularly in cases where many transitions are close to deterministic), but the computational cost may be prohibitive for large data sets.