Semi-Supervised Learning of Sequence Models via Method of Moments

EMNLP 2016 (Empirical Methods in Natural Language Processing), November 1-6, 2016, Austin, Texas

Zita Marinho (IST, University of Lisbon; Robotics Institute, CMU)
Shay B. Cohen (School of Informatics, University of Edinburgh)
André F. T. Martins (IT, IST, University of Lisbon; Unbabel)
Noah A. Smith (Computer Science & Eng., University of Washington)

[email protected] [email protected] [email protected] [email protected]
Sequence Labeling

[Figure: the sentence "Herb fights like a ninja ." as words w1...w6 with labels y1...y6]

observed data: {w1, w2, ..., w6}    labels: {y1, y2, ..., y6}

Several label assignments are possible, e.g.:

  N    V   Pre  Det  N  .
  N    V   V    Det  N  .
  ADJ  N   V    Det  N  .
  ?    ?   ?    ?    ?  ?

With K labels there are K^6 possible assignments for this 6-word sentence.
Hidden Markov Model

[Figure: HMM with hidden labels y1...y6 emitting words w1...w6]

Learn parameters:
  transitions  p(yt | yt-1)
  emissions    p(wt | yt)

• supervised learning
• unsupervised/semi-supervised learning (this talk)
• the model can be extended to include features (Berg-Kirkpatrick et al., Painless Unsupervised Learning with Features, NAACL-HLT 2010)
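The supervised route above can be made concrete with relative-frequency counting. This is a minimal sketch, not the talk's exact setup: the corpus format, the add-k smoothing constant, and the `<start>`/`<stop>` symbols are illustrative assumptions.

```python
from collections import defaultdict

def estimate_hmm(tagged_sentences, smoothing=0.1):
    """Supervised HMM estimation: relative-frequency counts with add-k smoothing.

    `tagged_sentences` is a list of [(word, tag), ...] lists -- a toy stand-in
    for a labeled corpus (names and smoothing are illustrative assumptions).
    """
    trans = defaultdict(lambda: defaultdict(float))
    emit = defaultdict(lambda: defaultdict(float))
    tags, words = set(), set()
    for sent in tagged_sentences:
        prev = "<start>"
        for word, tag in sent:
            trans[prev][tag] += 1.0   # count for p(y_t | y_{t-1})
            emit[tag][word] += 1.0    # count for p(w_t | y_t)
            tags.add(tag); words.add(word)
            prev = tag
        trans[prev]["<stop>"] += 1.0

    # normalize smoothed counts into conditional distributions
    def normalize(table, support):
        out = {}
        for key, row in table.items():
            total = sum(row.values()) + smoothing * len(support)
            out[key] = {s: (row.get(s, 0.0) + smoothing) / total for s in support}
        return out

    return normalize(trans, tags | {"<stop>"}), normalize(emit, words)

transitions, emissions = estimate_hmm([[("Herb", "N"), ("fights", "V")]])
```

The unsupervised and semi-supervised cases, where the tags are missing, are exactly what the rest of the talk addresses.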
Maximum Likelihood Estimation (MLE)
• exact inference is hard
• EM is sensitive to local optima (depends on initialization)
• EM is expensive on large datasets (several inference passes)

Method of Moments estimation (MoM)
• computationally efficient
• no local optima
• one pass over the data
Hidden Markov Model: MLE vs. MoM

                         via MLE   via MoM
  unsupervised HMM          ✓         ✓
  semi-supervised HMM       ✓         ?
  feature HMM               ✓         ?

MoM estimators exist for the unsupervised case:
  Arora et al., A Practical Algorithm for Topic Modeling with Provable Guarantees, ICML 2013.
  Cohen, Stratos, Collins, Foster and Ungar, Spectral Learning of Latent-Variable PCFGs: Algorithms and Sample Complexity, JMLR 2014.
Outline: Learning sequence models via MoM

1. Learn HMM models via MoM
2. Relax the notion of anchors
3. Solve a QP
4. Extend to feature HMM
5. Experiments
Method of Moments: key insights

1. Conditional Independence: infer the label by looking at the context
2. Anchor Trick: learn a proxy for labels with anchors
1. Conditional Independence

[Figure: HMM over the tweet "hehe its gonna b a good day" (words w1...w7, labels y1...y7, between start and stop); a second example tweet is ":) wait now I am goin 2"]

The context of a word is its neighborhood, context = { w-1, w+1 }, modeled log-linearly.

The context reveals the label of a word, e.g. for "like":

  ... tasted like chimichangas ...   →  yt = adp
  ... i like fajitas ...             →  yt = verb

"You shall know a word by the company it keeps." (Firth, 1957)

Key assumption:  word ⟂ context | label
2. Anchor Trick

An anchor word is unambiguous: it can only take a single label (Arora et al., A Practical Algorithm for Topic Modeling with Provable Guarantees, ICML 2013). For example, all instances of "be" are verbs:

  p(label ≠ verb | be) = 0,    p(verb | be) = 1

More than one anchor word per label gives less biased context estimates, e.g.:

  verb = { b, be, are, is, am, have, going }
How to find anchors?

• a small labeled corpus
• a small lexicon, e.g.:
    noun: Austin, airport, playground
    verb: am, be, is, are, go, make, made, become
    pron: he, it, she
    adp:  so, on, of
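One way to read anchors off a small labeled corpus, as a hedged sketch: keep words that are frequent and (nearly) always carry a single label. The `min_count` and `purity` thresholds below are illustrative choices, not the talk's exact selection rule.

```python
from collections import Counter, defaultdict

def find_anchors(tagged_words, min_count=5, purity=0.99):
    """Pick anchor candidates from a small labeled corpus.

    `tagged_words` is a list of (word, label) pairs; a word is kept as an
    anchor for a label if it is frequent and almost always takes that label.
    """
    label_counts = defaultdict(Counter)
    for word, label in tagged_words:
        label_counts[word][label] += 1

    anchors = defaultdict(list)
    for word, counts in label_counts.items():
        total = sum(counts.values())
        label, top = counts.most_common(1)[0]
        # frequent enough, and (nearly) deterministic label -> anchor
        if total >= min_count and top / total >= purity:
            anchors[label].append(word)
    return dict(anchors)
```

On real data one would also cap the number of anchors per label and prefer high-frequency words, since anchor context estimates get sharper with more occurrences.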
EMNLP 16 | Semi-supervised sequence labeling with MoM | Method of Moments 21
co-occurrences in dataunlabeled Method of moments
Andrew fights like Jet Li.
eat Fruit like cherry.
Ann sings like me.
Children like ice-cream.
wt
context
wt-1 wt+1 wt+2
EMNLP 16 | Semi-supervised sequence labeling with MoM | Method of Moments 22
Andrew fights like Jet Li.
eat Fruit like cherry.
Ann sings like me.
Children like ice-cream.
wt
context
wt-1 wt+1 wt+2
Method of moments
like
Child
ren
cher
ry
ice-cr
eam
fights
a Jet
me.context
wor
d love
there
will
Q p(context | word)
EMNLP 16 | Semi-supervised sequence labeling with MoM | Method of Moments 23
Let there be love.
Bill will be a ninja.
Method of moments
like
Child
ren
cher
ry
ice-cr
eam
fights
a Jet
me.context
wor
d love
be
there
will
Q p(context | word)
Method of Moments: factorization

By conditional independence (word ⟂ context | label):

  p(context | word) = Σ_labels p(label | word) p(context | label)

In matrix form, Q = Γ R, where
  Q  (word × context):  Q[w, c] = p(c | w)
  Γ  (word × label):    Γ[w, y] = p(y | w)
  R  (label × context): R[y, c] = p(c | y)

By the anchor trick, each row of R can be estimated directly from the contexts of that label's anchor words:

  p(context | label) = p(context | anchors of label)
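Building Q is just counting. A minimal sketch, under simplifying assumptions: the context here is only the previous and next word (the talk also uses wt+2), and `vocab` / `contexts` are word-to-index maps chosen by the caller.

```python
import numpy as np
from collections import Counter

def build_Q(sentences, vocab, contexts):
    """Empirical word-context matrix with Q[w, c] ~ p(context = c | word = w).

    `sentences` is a list of token lists; `vocab` and `contexts` map strings
    to row/column indices. Context = previous and next word (an illustrative
    simplification of the talk's wt-1, wt+1, wt+2 context).
    """
    counts = Counter()
    for sent in sentences:
        for t, w in enumerate(sent):
            left = sent[t - 1] if t > 0 else "<s>"
            right = sent[t + 1] if t + 1 < len(sent) else "</s>"
            for c in (left, right):
                counts[(w, c)] += 1

    Q = np.zeros((len(vocab), len(contexts)))
    for (w, c), n in counts.items():
        if w in vocab and c in contexts:
            Q[vocab[w], contexts[c]] = n
    # normalize each word's row into a (partial) context distribution
    row = Q.sum(axis=1, keepdims=True)
    return Q / np.maximum(row, 1.0)
```

Rows sum to at most 1 (exactly 1 when every observed context is in the context map); the anchor rows of this same matrix give the estimate of R.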
Method of Moments: solve a QP

For each word, its context distribution q (a row of Q) must be a mixture of the anchor context distributions (the rows of R), with mixture weights γ = p(label | word):

  γ = argmin_γ || q - R γ ||² + λ || γsup - γ ||²
      s.t.  0 ≤ γ ≤ 1,   Σ_labels γ = 1

where q is estimated from unlabeled data and γsup from labeled data. The QP is solved independently per word type, in milliseconds.
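The constrained least-squares problem above is a small QP over the simplex. A simple stand-in solver (not necessarily the one used in the talk) is projected gradient with Euclidean projection onto the probability simplex, which enforces both constraints at once.

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection onto the probability simplex (sort-based method)."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > css)[0][-1]
    theta = css[rho] / (rho + 1.0)
    return np.maximum(v - theta, 0.0)

def solve_gamma(q, R, gamma_sup=None, lam=0.0, steps=500, lr=None):
    """Minimize ||q - R^T gamma||^2 + lam * ||gamma_sup - gamma||^2 over the
    simplex by projected gradient -- an illustrative QP solver.

    q: context distribution of one word (len = #contexts)
    R: anchor context distributions, one row per label (labels x contexts)
    """
    K = R.shape[0]
    if lr is None:
        # step size from the Lipschitz constant of the gradient
        lr = 1.0 / (2 * np.linalg.norm(R @ R.T, 2) + 2 * lam + 1e-12)
    g = np.full(K, 1.0 / K)
    for _ in range(steps):
        grad = 2 * R @ (R.T @ g - q)
        if gamma_sup is not None:
            grad += 2 * lam * (g - gamma_sup)
        g = project_simplex(g - lr * grad)
    return g
```

Because the problem is tiny (dimension = number of labels) and solved per word type, any off-the-shelf QP solver would also do.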
HMM Learning

From the γ coefficients (γ = p(label | word)), recover the HMM parameters:

  p(label) = Σ_words p(label | word) p(word)

Bayes' Rule then gives the observation matrix:

  p(word | label) = p(label | word) p(word) / p(label)

The transition matrix is estimated from labeled data only.
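The two steps above (label marginal, then Bayes' rule) are a couple of lines of numpy. Here `Gamma` is the word-by-label matrix of γ coefficients and `p_word` the unigram word distribution; the names are illustrative.

```python
import numpy as np

def observation_matrix(Gamma, p_word):
    """Recover p(word | label) from Gamma[w, y] = p(label = y | word = w)
    and the word marginal p(word), via Bayes' rule.

    Returns (O, p_label) where O[w, y] = p(word = w | label = y);
    columns of O sum to 1.
    """
    p_label = Gamma.T @ p_word                        # p(y) = sum_w p(y|w) p(w)
    O = Gamma * p_word[:, None] / p_label[None, :]    # p(w|y) = p(y|w) p(w) / p(y)
    return O, p_label
```

In practice one would guard against labels with near-zero marginal probability before dividing.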
Experiments: semi-supervised Twitter POS tagging

• Twitter dataset, ≈ 200k words, 12 Universal POS tags
• 2.7M unlabeled tweets; 100-1000 labeled tweets

  hehe its gonna b    a   good day
  x    prt verb  verb det adj  noun

Slav Petrov et al., A Universal Part-of-Speech Tagset, 2011.
Owoputi et al., Improved Part-of-Speech Tagging for Online Conversational Text with Word Clusters, 2013.
Twitter POS tagging, 150 labeled training sequences (HMM tagging accuracy):

  HMM            71.7
  EM             77.2
  self-training  78.2
  AHMM           84.3
Twitter POS tagging, 1000 labeled training sequences (HMM tagging accuracy):

  HMM            81.1
  EM             83.1
  self-training  86.1
  AHMM           88.0
Extend to features

[Figure: HMM over the tweet ":) wait now I am goin 2" with log-linear emissions over word features Φ(word)]

Word features Φ(word): is upper, is title, is digit, is url, starts with #, is emoticon.

Berg-Kirkpatrick et al., Painless Unsupervised Learning with Features, NAACL-HLT 2010.
1. Conditional Independence (feature version)

[Figure: the same tweet, now with word features Φ(word) and context features ψ(context)]

  Φ(word) ⟂ ψ(context) | label
Log-linear model: moments with features

The moment matrices generalize to feature expectations:

  Q ∝ E[ψ(context) × Φ(word)] / E[Φ(word)]       (in place of p(context | word))
  Γ = E[Φ(word) | label] p(label) / E[Φ(word)]   (in place of p(label | word))
  R = E[ψ(context) | label]                      (in place of p(context | label))

and again Q = Γ R, with R estimated from anchors.
Log-linear model: solve a QP

  γ = argmin_γ || q - R γ ||² + λ || γsup - γ ||²
      s.t.  Σ_labels γ = 1

• solved per feature dimension Φj
Log-linear model: learn parameters

From γ, the feature analog of Bayes' Rule gives the mean parameters:

  μy = E[Φ(X) | Y = y] = γ E[Φ(word)] / p(label)

Fenchel-Legendre duality then recovers the canonical parameters θy (and the partition function Zy) by solving a maxent problem:

  θy = argmax_θ  θᵀ μy - log Zy,     Zy = Σ_w exp(θᵀ Φ(w))
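The maxent step above can be sketched with plain gradient ascent on the dual objective, using the fact that its gradient is μy minus the model's expected feature vector. The step size and iteration count below are illustrative, and `Phi` stacks Φ(w) for every word w.

```python
import numpy as np

def fit_canonical(mu_y, Phi, steps=2000, lr=0.1):
    """Recover canonical parameters theta_y from mean parameters mu_y by
    maximizing theta^T mu_y - log Z_y, with Z_y = sum_w exp(theta^T Phi[w]).

    Phi: (num_words x num_features) matrix of word feature vectors.
    """
    theta = np.zeros(Phi.shape[1])
    for _ in range(steps):
        logits = Phi @ theta
        p = np.exp(logits - logits.max())
        p /= p.sum()                       # model distribution over words
        theta += lr * (mu_y - Phi.T @ p)   # gradient: mu_y - E_theta[Phi]
    return theta
```

The objective is concave, so gradient ascent converges to the global maximizer; with one-hot word indicator features this reduces to matching the emission distribution p(word | label = y) to μy.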
Algorithm

1. compute moments Q
2. find anchors → R
3. solve QP → Γ and mean parameters μy = E[Φ(X) | Y = y]
4. solve maxent problem → canonical parameters θy

Supervision (labeled data) enters in the anchor-finding and QP steps. Step timings range from seconds (the QPs) and tens of seconds (anchor finding) to tens of minutes (moments, maxent); the whole pipeline runs in a few hours.
Twitter POS tagging, 150 labeled training sequences (feature HMM tagging accuracy):

  HMM            81.8
  EM             81.8
  self-training  83.4
  AHMM           85.3
Twitter POS tagging, 1000 labeled training sequences (feature HMM tagging accuracy):

  HMM            89.1
  EM             89.1
  self-training  89.4
  AHMM           89.1
Twitter POS tagging: tagging accuracy vs. labeled training size

[Figure: tagging accuracy (0.70-0.95) as a function of the number of labeled sequences (0-1000), comparing feature HMM, HMM, and anchor FHMM]
Twitter POS tagging, 1000 training sequences (training time, hours):

  Brown Clusters  42.0
  EM              14.9
  self-training   10.3
  AHMM             3.8
Conclusions
• MoM algorithm for semi-supervised learning
• flexible method (easy to add supervision)
• fast to train (only one pass over the data)
• particularly good with little supervision
Thank you [email protected]
Support for this research was provided by the Portuguese Science and Technology Foundation (FCT) and CMU Portugal Program, grant SFRH/BD/52015/2012. This work has also been partially supported by the European Union under H2020 project SUMMA, grant 688139, and by FCT, through contracts UID/EEA/50008/2013, through the LearnBig project (PTDC/EEISII/7092/2014), and the GoLocal project (grant CMUPERI/TIC/0046/2014).