1
Homotopy-based Semi-Supervised Hidden Markov Models for Sequence Labeling
Gholamreza Haffari, Anoop Sarkar
Presenter: Milan Tofiloski
Natural Language Lab
Simon Fraser University
2
Outline
• Motivation & Contributions
• Experiments
• Homotopy method
• More experiments
3
Maximum Likelihood Principle
• Parameter setting for the joint probability of input-output which maximizes the likelihood of the given data:
  Θ* = argmax_Θ  Σ_{(x,y) ∈ L} log P(x,y; Θ)  +  Σ_{x ∈ U} log Σ_y P(x,y; Θ)   (1)
• L : labeled data
• U : unlabeled data
4
Deficiency of MLE
• Usually |U| >> |L|, so the objective is dominated by the unlabeled-data term Σ_{x ∈ U} log P(x; Θ)
• Which means the relationship of input-output is ignored when estimating the parameters!
  – MLE focuses on modeling the input distribution P(x)
  – But we are interested in modeling the joint distribution P(x,y)
5
Remedy for the Deficiency
• Balance the effect of labeled and unlabeled data:
  Θ*(λ) = argmax_Θ  (1 − λ) Σ_{(x,y) ∈ L} log P(x,y; Θ)  +  λ Σ_{x ∈ U} log P(x; Θ)
• Find the λ which maximally takes advantage of labeled and unlabeled data
• MLE corresponds to one particular fixed λ (which, since |U| >> |L|, puts most of the weight on the unlabeled data)
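As a concrete sketch of the balanced objective above (the function and variable names, and all numbers, are illustrative, not from the paper), the λ-interpolated objective can be computed from per-example log-likelihoods:

```python
def interpolated_objective(ll_labeled, ll_unlabeled, lam):
    """(1 - lam) * labeled log-likelihood + lam * unlabeled log-likelihood.

    lam = 0 ignores the unlabeled data (purely supervised);
    lam = 1 ignores the labeled data (purely unsupervised).
    """
    return (1.0 - lam) * sum(ll_labeled) + lam * sum(ll_unlabeled)

# Toy per-example log-likelihoods (illustrative numbers only).
ll_L = [-2.0, -3.0]          # log P(x, y; theta) for labeled examples
ll_U = [-1.0, -1.5, -2.0]    # log P(x; theta) for unlabeled examples

supervised = interpolated_objective(ll_L, ll_U, 0.0)   # -> -5.0
balanced = interpolated_objective(ll_L, ll_U, 0.5)     # -> -4.75
```

Sweeping `lam` from 0 to 1 trades off the two data sources; the rest of the talk is about choosing that value in a principled way.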
6
An experiment with HMM
[Figure: MLE performance as λ varies; lower is better]
• MLE can hurt the performance
• Balancing the labeled- and unlabeled-data related terms is beneficial
7
Our Contributions
1. Introducing a principled way to choose λ for HMMs in sequence labeling (tagging) tasks
2. Introducing an efficient dynamic programming algorithm to compute second-order statistics in HMMs
8
Outline
• Motivation & Contributions
• Experiments
• Homotopy method
• More experiments
9
Task
• Field segmentation in information extraction
• 13 tag fields: AUTHOR, TITLE, …
EDITOR EDITOR EDITOR EDITOR EDITOR EDITOR TITLE
A . Elmagarmid , editor . Transaction
TITLE TITLE TITLE TITLE TITLE TITLE PUB
Models for Advanced Database Applications , Morgan
PUB PUB PUB DATE DATE
- Kaufmann , 1992 .
10
Experimental Setup
• Use an HMM with 13 states
  – Freeze the transition (state -> state) probabilities to what has been observed in the labeled data
  – Use the Homotopy method to learn just the emission (state -> alphabet) probabilities
  – Do additive smoothing for the initial values of the emission and transition probabilities
• Data statistics:
  – Average seq. length: 36.7
  – Average number of segments in a seq: 5.4
  – Size of labeled/unlabeled data: 300/700
11
Baselines
• Held-out: put aside part of the labeled data as a held-out set, and use it to choose λ
• Oracle: choose λ based on the test data using per-position accuracy
• Supervised: forget about the unlabeled data, and just use the labeled data
12
Homotopy vs Baselines
[Figure: Homotopy vs the baselines; higher is better]
• Sequence-of-most-probable-states decoding; see paper for more results
• Even very small values of λ can be useful: homotopy picks λ = .004, while supervised corresponds to λ = 0
13
Outline
• Motivation & Contributions
• Experiments
• Homotopy method
• More experiments
14
Path of Solutions
• Look at the solutions Θ(λ) as λ changes from 0 to 1
• Choose the best λ based on the path
[Figure: solution path, which can exhibit discontinuities and bifurcations]
15
EM for HMM
• Let e be a state -> state or state -> observation event in our HMM
• To find the parameter values Θ which (locally) maximize the objective function for a fixed λ:
  Repeat until convergence:  Θ ← EM_λ(Θ)
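The λ-weighted EM update can be sketched on a model much simpler than an HMM. The toy below (a two-coin mixture; the function name, data, and λ = 0.5 are all illustrative, not from the paper) weights observed-event counts from labeled data by (1 − λ) and posterior-weighted counts from unlabeled data by λ, then renormalizes:

```python
def em_lambda_step(theta, labeled, unlabeled, lam):
    """One lambda-weighted EM update for a two-coin mixture.

    theta = (pi, p0, p1): prob. of picking coin 0, and each coin's heads prob.
    labeled:   list of (coin, heads, tosses) with the coin identity observed.
    unlabeled: list of (heads, tosses) with the coin identity latent.
    """
    pi, p = theta[0], (theta[1], theta[2])
    picks, heads, tosses = [1e-9, 1e-9], [0.0, 0.0], [1e-9, 1e-9]
    for z, h, n in labeled:                     # observed events, weight (1 - lam)
        picks[z] += 1.0 - lam
        heads[z] += (1.0 - lam) * h
        tosses[z] += (1.0 - lam) * n
    for h, n in unlabeled:                      # latent events, weight lam
        lik = [(pi if z == 0 else 1.0 - pi) * p[z] ** h * (1.0 - p[z]) ** (n - h)
               for z in (0, 1)]
        total = lik[0] + lik[1]
        for z in (0, 1):                        # posterior responsibility
            r = lam * lik[z] / total
            picks[z] += r
            heads[z] += r * h
            tosses[z] += r * n
    return (picks[0] / (picks[0] + picks[1]),
            heads[0] / tosses[0], heads[1] / tosses[1])

# Iterate EM_lambda to (approximate) convergence on made-up data.
theta = (0.5, 0.6, 0.3)
labeled = [(0, 8, 10), (1, 2, 10)]
unlabeled = [(9, 10), (1, 10), (8, 10)]
for _ in range(500):
    theta = em_lambda_step(theta, labeled, unlabeled, lam=0.5)
```

At convergence, `theta` is (approximately) a fixed point of the EM_λ operator; for an HMM the same structure holds, with the expected counts coming from Forward-Backward.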
16
Fixed Points of EM
• Useful fact: at the fixed points of EM, Θ = EM_λ(Θ), so EM_λ(Θ) − Θ = 0
• This is similar to using Homotopy for root finding
  – The same numerical techniques should be applicable here
17
Homotopy for Root Finding
• To find a root of G(·):
  – start from a root of a simple problem F(·)
  – trace the roots of the intermediate problems H(λ, ·) = (1 − λ) F(·) + λ G(·) while morphing F into G
• To find the (λ, Θ) which satisfy H(λ, Θ) = 0:
  – Setting the derivative to zero gives a differential equation
  – Numerically solve the resulting differential equation
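The recipe above can be sketched on a scalar problem (F, G, the step count, and the Newton clean-up are illustrative choices, not from the paper): trace the root of H(λ, x) = (1 − λ) F(x) + λ G(x) from the known root of F at λ = 0 to the sought root of G at λ = 1, integrating dx/dλ = −(∂H/∂λ) / (∂H/∂x):

```python
def trace_root(steps=2000):
    """Homotopy continuation for G(x) = x**3 - 2, starting from F(x) = x - 1."""
    x, lam, dlam = 1.0, 0.0, 1.0 / steps      # x = 1 is the root of F
    for _ in range(steps):
        dH_dlam = -(x - 1.0) + (x ** 3 - 2.0)          # dH/dlambda
        dH_dx = (1.0 - lam) + lam * 3.0 * x ** 2       # dH/dx
        x += dlam * (-dH_dlam / dH_dx)                 # Euler step along the path
        lam += dlam
    # A few Newton corrections on G itself to clean up the Euler error.
    for _ in range(5):
        x -= (x ** 3 - 2.0) / (3.0 * x ** 2)
    return x

root = trace_root()   # close to 2 ** (1/3), about 1.2599
```

In the talk's setting, F is a problem solved by the supervised model (λ = 0) and G is the fully semi-supervised fixed-point equation; the continuation variable λ is exactly the interpolation weight.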
18
Solving the Differential Eqn
• The differential equation takes the form M · v = 0, where M involves the Jacobian of EM_λ
• Repeat until λ reaches 1:
  – Update (λ, Θ) in a proper direction parallel to v = Kernel(M)
  – Update M
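For intuition about the kernel step, consider the underdetermined case of two equations in three unknowns (two parameters plus λ): the null space of a full-rank 2×3 matrix M is spanned by the cross product of its rows. A minimal sketch, with made-up matrix entries:

```python
def kernel_direction(m):
    """Null-space direction of a full-rank 2x3 matrix: cross product of its rows."""
    (a1, a2, a3), (b1, b2, b3) = m
    return (a2 * b3 - a3 * b2, a3 * b1 - a1 * b3, a1 * b2 - a2 * b1)

M = [[1.0, 2.0, 3.0],
     [0.0, 1.0, 4.0]]
v = kernel_direction(M)
# M @ v == 0: v is the direction in which (Theta, lambda) can move
# while keeping the homotopy equations satisfied to first order.
```

In general one would instead take the right singular vector of M with the smallest singular value (e.g. via an SVD), which also yields the minimum singular value that the minEig selection rule inspects.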
19
Jacobian of EM_λ
• So, we need to compute the covariance matrix of the events
• The entry in the i-th row and j-th column of the covariance matrix is the covariance of events e_i and e_j
• Computing this is challenging for HMMs; it builds on Forward-Backward
See the paper for details
20
Expected Quadratic Counts for HMM
• Dynamic programming algorithm to efficiently compute the expected quadratic counts of pairs of events
• Pre-compute a table Zx for each sequence x
• Having the table Zx, the EQC can be computed efficiently
  – The time complexity depends on K, the number of states in the HMM (see paper for more details)
[Figure: a pair of events k1, k2 occurring at positions i, i+1, …, j of the sequence x]
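What the dynamic program computes can be checked against brute-force enumeration on a tiny HMM. The sketch below (2 states, a length-3 observation; all parameters are made-up numbers, not from the paper) sums c(e1)·c(e2) over every hidden state sequence, weighted by its posterior, which is the quantity the Zx tables make efficient:

```python
from itertools import product

# A tiny 2-state HMM over a binary alphabet (illustrative parameters).
start = [0.6, 0.4]
trans = [[0.7, 0.3], [0.2, 0.8]]
emit = [[0.9, 0.1], [0.3, 0.7]]
obs = [0, 1, 1]

def joint(states):
    """P(states, obs) for one hidden state sequence."""
    p = start[states[0]] * emit[states[0]][obs[0]]
    for t in range(1, len(obs)):
        p *= trans[states[t - 1]][states[t]] * emit[states[t]][obs[t]]
    return p

def expected_quadratic_count(e1, e2):
    """E[c(e1) * c(e2) | obs], for transition events given as (i, j) pairs."""
    total = sum(joint(s) for s in product((0, 1), repeat=len(obs)))
    eqc = 0.0
    for s in product((0, 1), repeat=len(obs)):
        c1 = sum(1 for t in range(len(obs) - 1) if (s[t], s[t + 1]) == e1)
        c2 = sum(1 for t in range(len(obs) - 1) if (s[t], s[t + 1]) == e2)
        eqc += (joint(s) / total) * c1 * c2
    return eqc

eqc = expected_quadratic_count((0, 1), (1, 1))
```

Enumeration is exponential in the sequence length; the paper's table-based algorithm avoids that. Covariances then follow as Cov(c(e1), c(e2)) = E[c1·c2] − E[c1]·E[c2].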
21
How to Choose λ based on the Path
• monotone: the first point at which the monotonicity of the path changes
• maxEnt: choose the λ for which the model has maximum entropy on the unlabeled data
• minEig: when solving the diff eqn, consider the minimum singular value of the matrix M; across rounds, choose the λ for which the minimum singular value is the smallest
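The maxEnt rule can be sketched as follows (the candidate λ values and the label posteriors are made-up numbers for illustration only): among the models along the path, pick the λ whose distribution over labels on the unlabeled data has the highest entropy:

```python
import math

def entropy(dist):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in dist if p > 0.0)

# Hypothetical average label posteriors on the unlabeled data
# at several lambda values along the solution path.
posteriors = {
    0.004: [0.30, 0.35, 0.35],   # nearly uniform -> high entropy
    0.100: [0.60, 0.25, 0.15],
    0.700: [0.90, 0.07, 0.03],   # peaked -> low entropy
}

best_lam = max(posteriors, key=lambda lam: entropy(posteriors[lam]))
# best_lam == 0.004: the flattest posterior wins under the maxEnt rule
```

This matches the later observation that maxEnt tends to select much smaller λ values than minEig and monotone.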
22
Outline
• Motivation & Contributions
• Experiments
• Homotopy method
• More experiments
23
Varying the Size of Unlab Data
[Figure: accuracy vs size of the unlabeled data; size of the labeled data: 100]
• The three Homotopy-based methods outperform EM
• maxEnt outperforms minEig and monotone
• minEig and monotone have similar performances
24
Picked λ Values
[Figure: the λ values picked by each method]
25
Picked λ Values
• EM gives higher weight to unlabeled data compared to the Homotopy-based methods
• The λ values selected by maxEnt are much smaller than those selected by minEig and monotone
• minEig and monotone are close
26
Conclusion and Future Work
• Using EM can hurt performance in the case |L| << |U|
• Proposed a method to alleviate this problem for HMMs in seq. labeling tasks
• To speed up the method:
  – Use sampling to find an approximation to the covariance matrix
  – Use faster methods for recovering the solution path, e.g. predictor-corrector
27
Questions?
28
Is Oracle outperformed by Homotopy?
• No!
  – The performance measure used for selecting λ in the oracle method may be different from that used when comparing homotopy and oracle
  – The decoding algorithm used in the oracle may be different from that used when comparing homotopy and oracle
29
Why not set λ to some fixed value?
• This ad hoc way of setting λ has two drawbacks:
  – It still may hurt the performance; the proper λ may be much smaller than that
  – In some situations, the right choice of λ may be a big value; a conservative fixed λ does not fully take advantage of the available unlabeled data
30
Homotopy vs Baselines
– Viterbi Decoding: most-probable-sequence-of-states decoding
– SMS Decoding: sequence-of-most-probable-states decoding
[Figure: our method vs the baselines under both decodings; higher is better (see the paper for more results)]