Transcript of: Homotopy-based Semi-Supervised Hidden Markov Models for Sequence Labeling

Page 1: Homotopy-based Semi-Supervised Hidden Markov Models for Sequence Labeling

Gholamreza Haffari, Anoop Sarkar
Presenter: Milan Tofiloski
Natural Language Lab, Simon Fraser University

Page 2: Outline

• Motivation & Contributions
• Experiments
• Homotopy method
• More experiments

Page 3: Maximum Likelihood Principle

• Find the parameter setting of the joint input-output model that maximizes the probability of the given data:

  θ* = argmax_θ  Σ_{(x,y) ∈ L} log P_θ(x, y)  +  Σ_{x ∈ U} log Σ_y P_θ(x, y)

• L : labeled data
• U : unlabeled data

Page 4: Deficiency of MLE

• Usually |U| >> |L|, so the unlabeled term dominates and MLE effectively just maximizes Σ_{x ∈ U} log P(x)
• Which means the input-output relationship is ignored when estimating the parameters!
  – MLE focuses on modeling the input distribution P(x)
  – But we are interested in modeling the joint distribution P(x, y)

Page 5: Remedy for the Deficiency

• Balance the effect of the labeled and unlabeled data by weighting the two terms:

  (1 − λ) Σ_{(x,y) ∈ L} log P_θ(x, y)  +  λ Σ_{x ∈ U} log Σ_y P_θ(x, y)

• Find the λ which maximally takes advantage of the labeled and unlabeled data
• MLE is recovered as a special case of this weighted objective
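As a toy illustration of this weighted objective (my own minimal model, not the paper's HMM: y ~ Bernoulli(0.5) and x | y ~ Bernoulli(theta[y]); `lam` plays the role of λ):

```python
import math

# Toy model (an assumption for illustration): y ~ Bernoulli(0.5),
# x | y ~ Bernoulli(theta[y]).  The objective below is
#   (1 - lam) * sum_L log P(x, y)  +  lam * sum_U log P(x).
def log_joint(theta, x, y):
    px = theta[y] if x == 1 else 1.0 - theta[y]
    return math.log(0.5 * px)

def log_marginal(theta, x):
    # log P(x) = log sum_y P(x, y)
    return math.log(sum(math.exp(log_joint(theta, x, y)) for y in (0, 1)))

def objective(theta, labeled, unlabeled, lam):
    ll_lab = sum(log_joint(theta, x, y) for x, y in labeled)
    ll_unlab = sum(log_marginal(theta, x) for x in unlabeled)
    return (1.0 - lam) * ll_lab + lam * ll_unlab

labeled = [(1, 1), (0, 0), (1, 1)]      # pairs (x, y)
unlabeled = [1, 1, 0, 1, 0, 0, 0, 1]    # x only
theta = [0.2, 0.8]
print(objective(theta, labeled, unlabeled, 0.0))   # supervised term only
print(objective(theta, labeled, unlabeled, 1.0))   # unsupervised term only
```

Setting lam = 0 recovers the purely supervised objective, while lam = 1 ignores the labels entirely; the question the talk addresses is where in between to sit.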

Page 6: An Experiment with HMM

(Figure: MLE performance; lower is better)

• MLE can hurt the performance
• Balancing the labeled- and unlabeled-data terms is beneficial

Page 7: Our Contributions

1. Introducing a principled way to choose λ for HMMs in sequence labeling (tagging) tasks
2. Introducing an efficient dynamic programming algorithm to compute second-order statistics in HMMs

Page 8: Outline

• Motivation & Contributions
• Experiments
• Homotopy method
• More experiments

Page 9: Task

• Field segmentation in information extraction
• 13 tag fields: AUTHOR, TITLE, …

Example (each tag row labels the token row below it):

  EDITOR EDITOR EDITOR EDITOR EDITOR EDITOR TITLE
  A . Elmagarmid , editor . Transaction

  TITLE TITLE TITLE TITLE TITLE TITLE PUB
  Models for Advanced Database Applications , Morgan

  PUB PUB PUB DATE DATE
  - Kaufmann , 1992 .

Page 10: Experimental Setup

• Use an HMM with 13 states
  – Freeze the transition (state -> state) probabilities to what has been observed in the labeled data
  – Use the homotopy method to learn just the emission (state -> alphabet) probabilities
  – Apply additive smoothing to the initial values of the emission and transition probabilities
• Data statistics:
  – Average seq. length: 36.7
  – Average number of segments in a seq.: 5.4
  – Size of labeled/unlabeled data is 300/700

Page 11: Baselines

• Held-out: put aside part of the labeled data as a held-out set, and use it to choose λ
• Oracle: choose λ based on test data using per-position accuracy
• Supervised: forget about the unlabeled data, and just use the labeled data

Page 12: Homotopy vs Baselines

(Figure: accuracy of homotopy vs the baselines; higher is better)

• Decoding by the sequence of most probable states; see the paper for more results
• Even very small values of λ can be useful: here homotopy picks λ = .004, while supervised corresponds to λ = 0

Page 13: Outline

• Motivation & Contributions
• Experiments
• Homotopy method
• More experiments

Page 14: Path of Solutions

• Look at the solution θ(λ) as λ changes from 0 to 1
• Choose the best λ based on the path

(Figure: a solution path, showing a discontinuity and a bifurcation)

Page 15: EM for HMM

• Consider the state -> state and state -> observation events in our HMM
• To find the parameter values which (locally) maximize the objective function for a fixed λ:

  Repeat until convergence: θ ← EM(θ)
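What one round of the EM(θ) update can look like for a fixed λ, sketched on a toy two-coin model rather than an HMM (my simplification: y ~ Bernoulli(0.5), x | y ~ Bernoulli(theta[y]); in the HMM the analogous expected event counts come from forward-backward):

```python
def em_step(theta, labeled, unlabeled, lam):
    """One lam-weighted EM round: labeled items contribute hard counts
    with weight (1 - lam); unlabeled items contribute soft (posterior)
    counts with weight lam; the M-step renormalizes."""
    num = [0.0, 0.0]   # weighted count of x == 1 under each y
    den = [0.0, 0.0]   # weighted count of each y
    for x, y in labeled:
        num[y] += (1.0 - lam) * x
        den[y] += 1.0 - lam
    for x in unlabeled:
        # E-step: posterior P(y | x) under the current theta
        joint = [0.5 * (theta[y] if x == 1 else 1.0 - theta[y]) for y in (0, 1)]
        z = joint[0] + joint[1]
        for y in (0, 1):
            post = joint[y] / z
            num[y] += lam * post * x
            den[y] += lam * post
    # M-step: re-estimate each coin's bias
    return [num[y] / den[y] for y in (0, 1)]

theta = [0.3, 0.7]
for _ in range(50):            # "repeat until convergence"
    theta = em_step(theta, [(1, 1), (0, 0)], [1, 1, 1, 0], lam=0.5)
print(theta)
```

With lam = 0 the unlabeled data has no effect and one step already lands on the supervised estimate; for lam > 0 the iteration settles at a fixed point θ = EM(θ), which is exactly the object the next slide exploits.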

Page 16: Fixed Points of EM

• Useful fact: each EM round is an update map θ ← EM(θ)
• At the fixed points, θ − EM(θ) = 0, i.e. the fixed points are roots of θ − EM(θ)
• This is similar to using homotopy for root finding
  – The same numerical techniques should be applicable here

Page 17: Homotopy for Root Finding

• To find a root of G(θ):
  – start from a root of a simple problem F(θ)
  – trace the roots of the intermediate problems H(θ, λ) = (1 − λ) F(θ) + λ G(θ) while morphing F into G
• To find the (θ, λ) which satisfy the above:
  – Differentiating H(θ(λ), λ) = 0 along the path gives a differential equation
  – Numerically solve the resulting differential eqn.
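A 1-D sketch of this recipe (F and G are my illustrative choices, and I step λ discretely with Newton correction rather than solving the differential equation):

```python
# Start at the root of an easy F, then track the root of
#   H(x, lam) = (1 - lam) * F(x) + lam * G(x)
# as lam goes 0 -> 1, Newton-correcting after each small lam step.
def F(x):  return x - 1.0          # easy problem, root at x = 1
def dF(x): return 1.0
def G(x):  return x**3 - 2.0       # target problem, root at 2 ** (1/3)
def dG(x): return 3.0 * x**2

def trace_root(steps=100, newton_iters=5):
    x = 1.0                        # the known root of F
    for k in range(1, steps + 1):
        lam = k / steps
        for _ in range(newton_iters):          # corrector on H(., lam) = 0
            h  = (1 - lam) * F(x) + lam * G(x)
            dh = (1 - lam) * dF(x) + lam * dG(x)
            x -= h / dh
    return x

print(trace_root())                # ~ 2 ** (1/3) = 1.2599...
```

At lam = 1 the intermediate problem has morphed entirely into G, so the traced point is a root of the hard problem, reached via a chain of easy local corrections.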


Page 18: Solving the Differential Eqn

• The path-tangent condition is M · v = 0, where M is the Jacobian of the EM map
• Repeat until λ = 1:
  – Update (θ, λ) in a proper direction parallel to v = Kernel(M)
  – Update M
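The kernel-following update can be illustrated on the same kind of 1-D homotopy (my toy stand-in: here M is the 1 x 2 Jacobian [dH/dx, dH/dlam], whose kernel has a closed form; for the full parameter vector one would extract the null direction numerically, e.g. via an SVD of M):

```python
import math

# H(x, lam) = (1 - lam) * (x - 1) + lam * (x**3 - 2).
# Moving parallel to the kernel of M = [dH/dx, dH/dlam]
# keeps H(x, lam) = 0 to first order (an Euler path-follower).
def follow_path(step=1e-3):
    x, lam = 1.0, 0.0                       # start at the root of F
    while lam < 1.0:                        # "repeat until lam = 1"
        a = (1 - lam) + lam * 3.0 * x**2    # dH/dx
        b = (x**3 - 2.0) - (x - 1.0)        # dH/dlam = G(x) - F(x)
        n = math.hypot(a, b)
        v = (-b / n, a / n)                 # kernel direction of M = [a, b]
        if v[1] < 0:                        # orient the step so lam increases
            v = (-v[0], -v[1])
        x, lam = x + step * v[0], lam + step * v[1]
    return x

print(follow_path())                        # close to the root of G, 2 ** (1/3)
```

The Euler stepping here is the simplest choice; the conclusion slide's "predictor-corrector" suggestion is the standard upgrade to this loop.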

Page 19: Jacobian of EM

• So, we need to compute the covariance matrix of the events (challenging for HMMs; done with a forward-backward-style algorithm)
• The entry in row i and column j of the covariance matrix is the covariance of the counts of events i and j under the model's posterior
• See the paper for details
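What that covariance looks like, brute-forced on a tiny HMM by enumerating all state sequences (toy numbers of my choosing; the point of the paper's algorithm is precisely to avoid this exponential enumeration):

```python
from itertools import product

# Tiny 2-state HMM and one observation sequence (illustrative values).
start = [0.5, 0.5]
T = [[0.7, 0.3], [0.4, 0.6]]          # transition probabilities
E = [[0.9, 0.1], [0.2, 0.8]]          # emission probabilities
x = [0, 1, 1]                         # observed sequence

def joint(y):                         # P(x, y) for one state sequence y
    p = start[y[0]] * E[y[0]][x[0]]
    for t in range(1, len(x)):
        p *= T[y[t - 1]][y[t]] * E[y[t]][x[t]]
    return p

seqs = list(product((0, 1), repeat=len(x)))
z = sum(joint(y) for y in seqs)
post = {y: joint(y) / z for y in seqs}          # posterior P(y | x)

def count(y, event):                  # c_e(y): occurrences of transition (i, j)
    i, j = event
    return sum(1 for t in range(1, len(y)) if (y[t - 1], y[t]) == (i, j))

def cov(a, b):                        # Cov(c_a, c_b) = E[c_a c_b] - E[c_a] E[c_b]
    e_ab = sum(p * count(y, a) * count(y, b) for y, p in post.items())
    e_a = sum(p * count(y, a) for y, p in post.items())
    e_b = sum(p * count(y, b) for y, p in post.items())
    return e_ab - e_a * e_b

print(cov((0, 1), (0, 1)))            # variance of the 0 -> 1 transition count
```

The quadratic terms E[c_a c_b] are the "second-order statistics" of contribution 2; the next slide's dynamic program computes them without enumerating the exponentially many sequences.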

Page 20: Expected Quadratic Counts for HMM

• Dynamic programming algorithm to efficiently compute the expected quadratic counts (EQC)
• Pre-compute a table Zx for each sequence x
• Having the table Zx, the EQC can be computed efficiently
  – The time complexity depends on K, the number of states in the HMM (see paper for more details)

(Figure: trellis between positions i and j of the sequence, with states k1 and k2)

Page 21: How to Choose λ Based on the Path

• monotone: the first point at which the monotonicity of the path changes
• maxEnt: choose the λ for which the model has maximum entropy on the unlabeled data
• minEig: when solving the diff. eqn., consider the minimum singular value of the matrix M; across rounds, choose the λ for which the minimum singular value is the smallest
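The maxEnt rule reduces to an argmax over candidate λ's; in this sketch the candidates and the predictive distributions their models would induce on the unlabeled data are made-up numbers, standing in for models read off the solution path:

```python
import math

def entropy(dist):
    # Shannon entropy (nats) of one predictive distribution
    return -sum(p * math.log(p) for p in dist if p > 0)

# Hypothetical candidates: lam -> list of P(y | x) over unlabeled x's.
candidates = {
    0.004: [[0.5, 0.5], [0.6, 0.4]],
    0.2:   [[0.9, 0.1], [0.8, 0.2]],
    0.8:   [[0.99, 0.01], [0.97, 0.03]],
}
# maxEnt: pick the lam whose model is most uncertain on unlabeled data.
best = max(candidates, key=lambda lam: sum(entropy(d) for d in candidates[lam]))
print(best)   # -> 0.004, the highest-entropy (least overcommitted) model
```

Under these toy numbers the smallest λ wins, echoing the later slide where the λ's selected by maxEnt are much smaller than those of minEig and monotone.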


Page 22: Outline

• Motivation & Contributions
• Experiments
• Homotopy method
• More experiments

Page 23: Varying the Size of Unlab Data

(Figure: performance vs unlabeled-data size; size of the labeled data: 100)

• The three homotopy-based methods outperform EM
• maxEnt outperforms minEig and monotone
• minEig and monotone have similar performances

Page 24: Picked λ Values

(Figure: the λ values picked by each method)

Page 25: Picked λ Values

• EM gives a higher weight to the unlabeled data than the homotopy-based methods do
• The λ's selected by maxEnt are much smaller than those selected by minEig and monotone
• The λ's selected by minEig and monotone are close

Page 26: Conclusion and Future Work

• Using EM can hurt performance in the case |L| << |U|
• Proposed a method to alleviate this problem for HMMs on seq. labeling tasks
• To speed up the method:
  – Use sampling to find an approximation to the covariance matrix
  – Use faster methods for recovering the solution path, e.g. predictor-corrector

Page 27: Questions?

Page 28: Is Oracle Outperformed by Homotopy?

• No!
  – The performance measure used to select λ in the oracle method may be different from that used in comparing homotopy and oracle
  – The decoding alg. used in the oracle method may be different from that used in comparing homotopy and oracle

Page 29: Why Not Just Fix λ?

• This ad hoc way of setting λ has two drawbacks:
  – It may still hurt the performance: the proper λ may be much smaller than the fixed value
  – In some situations the right choice of λ may be a big value; fixing λ at a small value is very conservative and does not fully take advantage of the available unlabeled data

Page 30: Homotopy vs Baselines

(Figure: our method vs the baselines under both decoders; higher is better; see the paper for more results)

– Viterbi decoding: most probable sequence of states
– SMS decoding: sequence of most probable states