Semi-Supervised Learning of Sequence Models via Method of Moments

EMNLP 2016 (Empirical Methods in Natural Language Processing), November 1-6, 2016, Austin, Texas

Zita Marinho (IST, University of Lisbon; Robotics Institute, CMU)
Shay B. Cohen (School of Informatics, University of Edinburgh)
André F. T. Martins (IT, IST, University of Lisbon; Unbabel)
Noah A. Smith (Computer Science & Eng., University of Washington)

[email protected] [email protected] [email protected] [email protected]
Sequence Labeling

[Figure: the sentence "Herb fights like a ninja ." as words w1...w6 with labels y1...y6]

observed data: {w1, w2, ..., w6}    labels: {y1, y2, ..., y6}

Several label assignments are possible, e.g.:

  N    V   Pre  Det  N  .
  N    V   V    Det  N  .
  ADJ  N   V    Det  N  .
  ?    ?   ?    ?    ?  ?

With K labels there are K^6 possible assignments for this 6-word sentence.
Hidden Markov Model

[Figure: HMM with hidden labels y1...y6 emitting words w1...w6]

Learn parameters:
  transitions  p(yt | yt-1)
  emissions    p(wt | yt)

• supervised learning
• unsupervised/semi-supervised learning (this talk)
• the model can be extended to include features (Berg-Kirkpatrick et al., Painless Unsupervised Learning with Features, NAACL-HLT 2010)
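The supervised route above can be made concrete with relative-frequency counting. This is a minimal sketch, not the talk's exact setup: the corpus format, the add-k smoothing constant, and the `<start>`/`<stop>` symbols are illustrative assumptions.

```python
from collections import defaultdict

def estimate_hmm(tagged_sentences, smoothing=0.1):
    """Supervised HMM estimation: relative-frequency counts with add-k smoothing.

    `tagged_sentences` is a list of [(word, tag), ...] lists -- a toy stand-in
    for a labeled corpus (names and smoothing are illustrative assumptions).
    """
    trans = defaultdict(lambda: defaultdict(float))
    emit = defaultdict(lambda: defaultdict(float))
    tags, words = set(), set()
    for sent in tagged_sentences:
        prev = "<start>"
        for word, tag in sent:
            trans[prev][tag] += 1.0   # count for p(y_t | y_{t-1})
            emit[tag][word] += 1.0    # count for p(w_t | y_t)
            tags.add(tag); words.add(word)
            prev = tag
        trans[prev]["<stop>"] += 1.0

    # normalize smoothed counts into conditional distributions
    def normalize(table, support):
        out = {}
        for key, row in table.items():
            total = sum(row.values()) + smoothing * len(support)
            out[key] = {s: (row.get(s, 0.0) + smoothing) / total for s in support}
        return out

    return normalize(trans, tags | {"<stop>"}), normalize(emit, words)

transitions, emissions = estimate_hmm([[("Herb", "N"), ("fights", "V")]])
```

The unsupervised and semi-supervised cases, where the tags are missing, are exactly what the rest of the talk addresses.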
Maximum Likelihood Estimation (MLE)
• exact inference is hard
• EM is sensitive to local optima (depends on initialization)
• EM is expensive on large datasets (several inference passes)

Method of Moments estimation (MoM)
• computationally efficient
• no local optima
• one pass over the data
Hidden Markov Model: MLE vs. MoM

                         via MLE   via MoM
  unsupervised HMM          ✓         ✓
  semi-supervised HMM       ✓         ?
  feature HMM               ✓         ?

MoM estimators exist for the unsupervised case:
  Arora et al., A Practical Algorithm for Topic Modeling with Provable Guarantees, ICML 2013.
  Cohen, Stratos, Collins, Foster and Ungar, Spectral Learning of Latent-Variable PCFGs: Algorithms and Sample Complexity, JMLR 2014.
Outline: Learning sequence models via MoM

1. Learn HMM models via MoM
2. Relax the notion of anchors
3. Solve a QP
4. Extend to feature HMM
5. Experiments
Method of Moments: key insights

1. Conditional Independence: infer the label by looking at the context
2. Anchor Trick: learn a proxy for labels with anchors
1. Conditional Independence

[Figure: HMM over the tweet "hehe its gonna b a good day" (words w1...w7, labels y1...y7, between start and stop); a second example tweet is ":) wait now I am goin 2"]

The context of a word is its neighborhood, context = { w-1, w+1 }, modeled log-linearly.

The context reveals the label of a word, e.g. for "like":

  ... tasted like chimichangas ...   →  yt = adp
  ... i like fajitas ...             →  yt = verb

"You shall know a word by the company it keeps." (Firth, 1957)

Key assumption:  word ⟂ context | label
2. Anchor Trick

An anchor word is unambiguous: it can only take a single label (Arora et al., A Practical Algorithm for Topic Modeling with Provable Guarantees, ICML 2013). For example, all instances of "be" are verbs:

  p(label ≠ verb | be) = 0,    p(verb | be) = 1

More than one anchor word per label gives less biased context estimates, e.g.:

  verb = { b, be, are, is, am, have, going }
How to find anchors?

• a small labeled corpus
• a small lexicon, e.g.:
    noun: Austin, airport, playground
    verb: am, be, is, are, go, make, made, become
    pron: he, it, she
    adp:  so, on, of
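One way to read anchors off a small labeled corpus, as a hedged sketch: keep words that are frequent and (nearly) always carry a single label. The `min_count` and `purity` thresholds below are illustrative choices, not the talk's exact selection rule.

```python
from collections import Counter, defaultdict

def find_anchors(tagged_words, min_count=5, purity=0.99):
    """Pick anchor candidates from a small labeled corpus.

    `tagged_words` is a list of (word, label) pairs; a word is kept as an
    anchor for a label if it is frequent and almost always takes that label.
    """
    label_counts = defaultdict(Counter)
    for word, label in tagged_words:
        label_counts[word][label] += 1

    anchors = defaultdict(list)
    for word, counts in label_counts.items():
        total = sum(counts.values())
        label, top = counts.most_common(1)[0]
        # frequent enough, and (nearly) deterministic label -> anchor
        if total >= min_count and top / total >= purity:
            anchors[label].append(word)
    return dict(anchors)
```

On real data one would also cap the number of anchors per label and prefer high-frequency words, since anchor context estimates get sharper with more occurrences.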
EMNLP 16 | Semi-supervised sequence labeling with MoM | Method of Moments 21
co-occurrences in dataunlabeled Method of moments
Andrew fights like Jet Li.
eat Fruit like cherry.
Ann sings like me.
Children like ice-cream.
wt
context
wt-1 wt+1 wt+2
EMNLP 16 | Semi-supervised sequence labeling with MoM | Method of Moments 22
Andrew fights like Jet Li.
eat Fruit like cherry.
Ann sings like me.
Children like ice-cream.
wt
context
wt-1 wt+1 wt+2
Method of moments
like
Child
ren
cher
ry
ice-cr
eam
fights
a Jet
me.context
wor
d love
there
will
Q p(context | word)
EMNLP 16 | Semi-supervised sequence labeling with MoM | Method of Moments 23
Let there be love.
Bill will be a ninja.
Method of moments
like
Child
ren
cher
ry
ice-cr
eam
fights
a Jet
me.context
wor
d love
be
there
will
Q p(context | word)
Method of Moments: factorization

By conditional independence (word ⟂ context | label):

  p(context | word) = Σ_labels p(label | word) p(context | label)

In matrix form, Q = Γ R, where
  Q  (word × context):  Q[w, c] = p(c | w)
  Γ  (word × label):    Γ[w, y] = p(y | w)
  R  (label × context): R[y, c] = p(c | y)

By the anchor trick, each row of R can be estimated directly from the contexts of that label's anchor words:

  p(context | label) = p(context | anchors of label)
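Building Q is just counting. A minimal sketch, under simplifying assumptions: the context here is only the previous and next word (the talk also uses wt+2), and `vocab` / `contexts` are word-to-index maps chosen by the caller.

```python
import numpy as np
from collections import Counter

def build_Q(sentences, vocab, contexts):
    """Empirical word-context matrix with Q[w, c] ~ p(context = c | word = w).

    `sentences` is a list of token lists; `vocab` and `contexts` map strings
    to row/column indices. Context = previous and next word (an illustrative
    simplification of the talk's wt-1, wt+1, wt+2 context).
    """
    counts = Counter()
    for sent in sentences:
        for t, w in enumerate(sent):
            left = sent[t - 1] if t > 0 else "<s>"
            right = sent[t + 1] if t + 1 < len(sent) else "</s>"
            for c in (left, right):
                counts[(w, c)] += 1

    Q = np.zeros((len(vocab), len(contexts)))
    for (w, c), n in counts.items():
        if w in vocab and c in contexts:
            Q[vocab[w], contexts[c]] = n
    # normalize each word's row into a (partial) context distribution
    row = Q.sum(axis=1, keepdims=True)
    return Q / np.maximum(row, 1.0)
```

Rows sum to at most 1 (exactly 1 when every observed context is in the context map); the anchor rows of this same matrix give the estimate of R.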
Method of Moments: solve a QP

For each word, its context distribution q (a row of Q) must be a mixture of the anchor context distributions (the rows of R), with mixture weights γ = p(label | word):

  γ = argmin_γ || q - R γ ||² + λ || γsup - γ ||²
      s.t.  0 ≤ γ ≤ 1,   Σ_labels γ = 1

where q is estimated from unlabeled data and γsup from labeled data. The QP is solved independently per word type, in milliseconds.
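The constrained least-squares problem above is a small QP over the simplex. A simple stand-in solver (not necessarily the one used in the talk) is projected gradient with Euclidean projection onto the probability simplex, which enforces both constraints at once.

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection onto the probability simplex (sort-based method)."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > css)[0][-1]
    theta = css[rho] / (rho + 1.0)
    return np.maximum(v - theta, 0.0)

def solve_gamma(q, R, gamma_sup=None, lam=0.0, steps=500, lr=None):
    """Minimize ||q - R^T gamma||^2 + lam * ||gamma_sup - gamma||^2 over the
    simplex by projected gradient -- an illustrative QP solver.

    q: context distribution of one word (len = #contexts)
    R: anchor context distributions, one row per label (labels x contexts)
    """
    K = R.shape[0]
    if lr is None:
        # step size from the Lipschitz constant of the gradient
        lr = 1.0 / (2 * np.linalg.norm(R @ R.T, 2) + 2 * lam + 1e-12)
    g = np.full(K, 1.0 / K)
    for _ in range(steps):
        grad = 2 * R @ (R.T @ g - q)
        if gamma_sup is not None:
            grad += 2 * lam * (g - gamma_sup)
        g = project_simplex(g - lr * grad)
    return g
```

Because the problem is tiny (dimension = number of labels) and solved per word type, any off-the-shelf QP solver would also do.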
HMM Learning

From the γ coefficients (γ = p(label | word)), recover the HMM parameters:

  p(label) = Σ_words p(label | word) p(word)

Bayes' Rule then gives the observation matrix:

  p(word | label) = p(label | word) p(word) / p(label)

The transition matrix is estimated from labeled data only.
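The two steps above (label marginal, then Bayes' rule) are a couple of lines of numpy. Here `Gamma` is the word-by-label matrix of γ coefficients and `p_word` the unigram word distribution; the names are illustrative.

```python
import numpy as np

def observation_matrix(Gamma, p_word):
    """Recover p(word | label) from Gamma[w, y] = p(label = y | word = w)
    and the word marginal p(word), via Bayes' rule.

    Returns (O, p_label) where O[w, y] = p(word = w | label = y);
    columns of O sum to 1.
    """
    p_label = Gamma.T @ p_word                        # p(y) = sum_w p(y|w) p(w)
    O = Gamma * p_word[:, None] / p_label[None, :]    # p(w|y) = p(y|w) p(w) / p(y)
    return O, p_label
```

In practice one would guard against labels with near-zero marginal probability before dividing.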
Experiments: semi-supervised Twitter POS tagging

• Twitter dataset, ≈ 200k words, 12 Universal POS tags
• 2.7M unlabeled tweets; 100-1000 labeled tweets

  hehe its gonna b    a   good day
  x    prt verb  verb det adj  noun

Slav Petrov et al., A Universal Part-of-Speech Tagset, 2011.
Owoputi et al., Improved Part-of-Speech Tagging for Online Conversational Text with Word Clusters, 2013.
Twitter POS tagging, 150 labeled training sequences (HMM tagging accuracy):

  HMM            71.7
  EM             77.2
  self-training  78.2
  AHMM           84.3
Twitter POS tagging, 1000 labeled training sequences (HMM tagging accuracy):

  HMM            81.1
  EM             83.1
  self-training  86.1
  AHMM           88.0
Extend to features

[Figure: HMM over the tweet ":) wait now I am goin 2" with log-linear emissions over word features Φ(word)]

Word features Φ(word): is upper, is title, is digit, is url, starts with #, is emoticon.

Berg-Kirkpatrick et al., Painless Unsupervised Learning with Features, NAACL-HLT 2010.
1. Conditional Independence (feature version)

[Figure: the same tweet, now with word features Φ(word) and context features ψ(context)]

  Φ(word) ⟂ ψ(context) | label
Log-linear model: moments with features

The moment matrices generalize to feature expectations:

  Q ∝ E[ψ(context) × Φ(word)] / E[Φ(word)]       (in place of p(context | word))
  Γ = E[Φ(word) | label] p(label) / E[Φ(word)]   (in place of p(label | word))
  R = E[ψ(context) | label]                      (in place of p(context | label))

and again Q = Γ R, with R estimated from anchors.
Log-linear model: solve a QP

  γ = argmin_γ || q - R γ ||² + λ || γsup - γ ||²
      s.t.  Σ_labels γ = 1

• solved per feature dimension Φj
Log-linear model: learn parameters

From γ, the feature analog of Bayes' Rule gives the mean parameters:

  μy = E[Φ(X) | Y = y] = γ E[Φ(word)] / p(label)

Fenchel-Legendre duality then recovers the canonical parameters θy (and the partition function Zy) by solving a maxent problem:

  θy = argmax_θ  θᵀ μy - log Zy,     Zy = Σ_w exp(θᵀ Φ(w))
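The maxent step above can be sketched with plain gradient ascent on the dual objective, using the fact that its gradient is μy minus the model's expected feature vector. The step size and iteration count below are illustrative, and `Phi` stacks Φ(w) for every word w.

```python
import numpy as np

def fit_canonical(mu_y, Phi, steps=2000, lr=0.1):
    """Recover canonical parameters theta_y from mean parameters mu_y by
    maximizing theta^T mu_y - log Z_y, with Z_y = sum_w exp(theta^T Phi[w]).

    Phi: (num_words x num_features) matrix of word feature vectors.
    """
    theta = np.zeros(Phi.shape[1])
    for _ in range(steps):
        logits = Phi @ theta
        p = np.exp(logits - logits.max())
        p /= p.sum()                       # model distribution over words
        theta += lr * (mu_y - Phi.T @ p)   # gradient: mu_y - E_theta[Phi]
    return theta
```

The objective is concave, so gradient ascent converges to the global maximizer; with one-hot word indicator features this reduces to matching the emission distribution p(word | label = y) to μy.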
Algorithm

1. compute moments Q
2. find anchors → R
3. solve QP → Γ and mean parameters μy = E[Φ(X) | Y = y]
4. solve maxent problem → canonical parameters θy

Supervision (labeled data) enters in the anchor-finding and QP steps. Step timings range from seconds (the QPs) and tens of seconds (anchor finding) to tens of minutes (moments, maxent); the whole pipeline runs in a few hours.
Twitter POS tagging, 150 labeled training sequences (feature HMM tagging accuracy):

  HMM            81.8
  EM             81.8
  self-training  83.4
  AHMM           85.3
Twitter POS tagging, 1000 labeled training sequences (feature HMM tagging accuracy):

  HMM            89.1
  EM             89.1
  self-training  89.4
  AHMM           89.1
Twitter POS tagging: tagging accuracy vs. labeled training size

[Figure: tagging accuracy (0.70-0.95) as a function of the number of labeled sequences (0-1000), comparing feature HMM, HMM, and anchor FHMM]
Twitter POS tagging, 1000 training sequences (training time, hours):

  Brown Clusters  42.0
  EM              14.9
  self-training   10.3
  AHMM             3.8
Conclusions
• MoM algorithm for semi-supervised learning
• flexible method (easy to add supervision)
• fast to train (only one pass over the data)
• particularly good with little supervision
Thank you [email protected]
Support for this research was provided by the Portuguese Science and Technology Foundation (FCT) and CMU Portugal Program, grant SFRH/BD/52015/2012. This work has also been partially supported by the European Union under H2020 project SUMMA, grant 688139, and by FCT, through contracts UID/EEA/50008/2013, through the LearnBig project (PTDC/EEISII/7092/2014), and the GoLocal project (grant CMUPERI/TIC/0046/2014).