
CS 562: Empirical Methods in Natural Language Processing

Nov 2010

Liang Huang (lhuang@isi.edu)

Unit 3: Structured Learning (part 2)
Discriminative Models: Perceptron & CRF


A Quiz on Forest


Generative models

$$\hat{t} = \arg\max_t P(t \mid w) = \arg\max_t \frac{P(w, t)}{P(w)} = \arg\max_t P(w, t)$$

• more work: we must learn to generate $w$

• but can handle incomplete data: $P(w) = \sum_t P(w, t)$


The generative story

SOURCE-CHANNEL (HMM):

$$P(w, t) = P(t) \times P(w \mid t) \approx P(t_1) \prod_{i=2}^{n} P(t_i \mid t_{i-1}) \, P(\text{stop} \mid t_n) \times \prod_{i=1}^{n} P(w_i \mid t_i)$$

Direct conditional model:

$$P(t \mid w) \approx \prod_{i=1}^{n} P(t_i \mid w_i)$$

DEFICIENT combination (you generated $t_i$ twice!):

$$P(t \mid w) \approx \prod_{i=1}^{n} P(t_i \mid w_i) \times P(t_1) \prod_{i=2}^{n} P(t_i \mid t_{i-1}) \, P(\text{stop} \mid t_n)$$


Perceptron is ...

• an extremely simple algorithm

• almost universally applicable

• and works very well in practice


vanilla perceptron (Rosenblatt, 1958) → structured perceptron (Collins, 2002)

the man bit the dog
DT  NN  VBD  DT  NN


Generic Perceptron

• online learning: one example at a time

• learning by doing

• find the best output under the current weights

• update weights at mistakes

[diagram: x_i → inference → z_i; compare z_i with gold y_i; update weights w, looped]
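In code, the loop is as small as the slide suggests. A minimal sketch in Python, assuming an inference routine `decode(x, w)` and a sparse feature extractor `phi(x, y)` (both names are illustrative, not from the slides):

    from collections import defaultdict

    def perceptron_train(data, decode, phi, epochs=5):
        """Generic (structured) perceptron: learning by doing.

        data:   list of (x, y) pairs (input, gold output)
        decode: inference routine, z = argmax over GEN(x) under weights w
        phi:    feature extractor, sparse dict {feature: count}
        """
        w = defaultdict(float)
        for _ in range(epochs):
            for x, y in data:
                z = decode(x, w)              # best output under current weights
                if z != y:                    # update weights at mistakes
                    for f, v in phi(x, y).items():
                        w[f] += v             # reward features of the gold output
                    for f, v in phi(x, z).items():
                        w[f] -= v             # penalize features of the wrong guess
        return w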


Example: POS Tagging

• gold-standard y: DT NN VBD DT NN
  (the man bit the dog)

• current output z: DT NN NN DT NN
  (the man bit the dog)

• assume only two feature classes:

• tag bigrams (t_{i-1}, t_i)

• word/tag pairs (t_i → w_i)

• update: w ← w + Φ(x, y) − Φ(x, z)

• weights ++: (NN, VBD), (VBD, DT), (VBD → bit)

• weights --: (NN, NN), (NN, DT), (NN → bit)
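To make the arithmetic concrete, here is the same update computed with Python Counters; `phi` is a toy extractor for exactly these two feature classes (in a real tagger the two classes would be namespaced to avoid collisions). Shared features cancel, leaving exactly the ++/-- lists above:

    from collections import Counter

    def phi(words, tags):
        """Tag-bigram features (t_{i-1}, t_i) and tag->word features (t_i, w_i)."""
        feats = Counter(zip(tags, tags[1:]))   # tag bigrams
        feats.update(zip(tags, words))         # word/tag pairs
        return feats

    x = "the man bit the dog".split()
    y = ["DT", "NN", "VBD", "DT", "NN"]   # gold-standard
    z = ["DT", "NN", "NN", "DT", "NN"]    # current output

    print(phi(x, y) - phi(x, z))   # ++: (NN,VBD) (VBD,DT) (VBD,bit)
    print(phi(x, z) - phi(x, y))   # --: (NN,NN) (NN,DT) (NN,bit)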


Structured Perceptron

[diagram: the same inference/update loop, now over structured inputs x_i and structured outputs y_i, z_i, with weights w]


Efficiency vs. Expressiveness

• the inference (argmax) must be efficient

• either the search space GEN(x) is small, or it factors into local decisions

• features must be local to y (but can be global to x)

• e.g., a bigram tagger factors over tag pairs, yet may look at all input words (cf. CRFs)

[diagram: inference as z = argmax over y ∈ GEN(x); mistakes update the weights w]
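With factored (bigram) features, the argmax is exact Viterbi. A minimal sketch, reusing the weight keys from the POS example earlier (the `viterbi` interface is illustrative):

    def viterbi(words, tags, w):
        """Exact argmax for a bigram tagger: the score factors over positions,
        so dynamic programming searches GEN(x) in O(n * |tags|^2)."""
        # best[t] = (score, sequence) of the best tagging ending in tag t
        best = {t: (w.get((t, words[0]), 0.0), [t]) for t in tags}
        for word in words[1:]:
            step = {}
            for t in tags:
                score, seq = max(
                    (s + w.get((tp, t), 0.0), sq) for tp, (s, sq) in best.items()
                )
                step[t] = (score + w.get((t, word), 0.0), seq + [t])
            best = step
        return max(best.values())[1]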


Averaged Perceptron

• more stable and accurate results

• approximation of the voted perceptron (Freund & Schapire, 1999)

• output the average of all intermediate weight vectors:

$$\bar{w} = \frac{1}{j+1} \sum_{j'=0}^{j} W_{j'}$$


Averaging Tricks

• efficient averaging without an extra pass over the weights per example: Daume (2006, PhD thesis); a sketch follows below
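The figure is lost in this transcript, but the flavor of the trick can be sketched (this is the idea, not Daume's exact pseudocode): alongside w, keep an accumulator u that records each update scaled by the example counter c; the average of all intermediate weight vectors is then recovered in one shot at the end.

    from collections import defaultdict

    def lazy_update(w, u, delta, c):
        """Apply perceptron update `delta` at example counter `c`."""
        for f, v in delta.items():
            w[f] += v
            u[f] += c * v        # remember *when* this update happened

    def averaged(w, u, c):
        """Average of the intermediate weight vectors: w - u / c
        (up to indexing conventions, which vary)."""
        return {f: w[f] - u[f] / c for f in w}

    # usage: w, u = defaultdict(float), defaultdict(float)
    # call lazy_update on each mistake; call averaged once at the end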


Do we need smoothing?

• smoothing is much easier in discriminative models

• just make sure that for each feature template, its subset templates are also included

• e.g., to include (t0, w0, w-1) you must also include (t0, w0), (t0, w-1), (w0, w-1) (see the sketch below)

• and maybe also (t0, t-1), because t is less sparse than w
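As a sketch, the template set with its subsets spelled out (template names are illustrative):

    def features(words, tags, i):
        """Feature template (t0, w0, w-1) plus its backoff subsets."""
        t0, w0 = tags[i], words[i]
        w1 = words[i - 1] if i > 0 else "<s>"
        t1 = tags[i - 1] if i > 0 else "<s>"
        return [
            ("t0_w0_w-1", t0, w0, w1),   # the full template
            ("t0_w0", t0, w0),           # its subset templates ...
            ("t0_w-1", t0, w1),
            ("w0_w-1", w0, w1),
            ("t0_t-1", t0, t1),          # less sparse: tags instead of words
        ]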


Comparison with Other Models


From HMM to MEMM

• HMM: joint distribution $P(w, t)$

• MEMM: locally normalized (per-state conditional), $P(t \mid w) = \prod_i P(t_i \mid t_{i-1}, w_i)$


Label Bias Problem

• bias towards states with fewer outgoing transitions

• extreme case: a state with a single outgoing transition passes all its probability mass along, regardless of the observation

• a problem with all locally normalized models


Conditional Random Fields

• globally normalized (no label bias problem)

• but training requires expected feature counts

• (related to the fractional counts in EM)

• need to use the Inside-Outside algorithm (sum)

• Perceptron just needs Viterbi (max)


Experiments


Experiments: Tagging

• (almost) identical features from (Ratnaparkhi, 1996)

• trigram tagger: current tag t_i, previous tags t_{i-1}, t_{i-2}

• current word w_i and its spelling features

• surrounding words w_{i-1}, w_{i+1}, w_{i-2}, w_{i+2}, ...


Experiments: NP Chunking

• B-I-O scheme

• features:

• unigram model

• surrounding words and POS tags

Rockwell/B International/I Corp./I 's/B Tulsa/I unit/I said/O it/B signed/O a/B tentative/I agreement/I ...
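A minimal sketch of reading NP chunks back off the B-I-O tags (the inverse direction works the same way):

    def bio_to_chunks(words, tags):
        """Collect the NP spans encoded by B-I-O tags."""
        chunks, start = [], None
        for i, tag in enumerate(tags + ["O"]):        # sentinel flushes the last chunk
            if tag != "I" and start is not None:      # a chunk just ended
                chunks.append(" ".join(words[start:i]))
                start = None
            if tag == "B":
                start = i
        return chunks

    words = "Rockwell International Corp. 's Tulsa unit said it signed".split()
    tags = ["B", "I", "I", "B", "I", "I", "O", "B", "O"]
    print(bio_to_chunks(words, tags))
    # ['Rockwell International Corp.', "'s Tulsa unit", 'it']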


Experiments: NP Chunking (results)

• (Sha and Pereira, 2003) trigram tagger

• voted perceptron: 94.09% vs. CRF: 94.38%


Other NLP Applications

• dependency parsing (McDonald et al., 2005)

• parse reranking (Collins)

• phrase-based translation (Liang et al., 2006)

• word segmentation

• ... and many many more ...


Theory


Vanilla Perceptron

[three figure-only slides]

Convergence Theorem

• data is separable if and only if perceptron converges

• the number of updates is bounded by $(R/\gamma)^2$

• $\gamma$ is the margin; $R = \max_i \|x_i\|$

• this result generalizes to the structured perceptron, with $R = \max_i \|\Phi(x_i, y_i) - \Phi(x_i, z_i)\|$

• also in the Collins paper: theorems for non-separable cases and generalization bounds
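The argument behind the $(R/\gamma)^2$ bound is short; here is the standard sketch for the vanilla binary case (not spelled out on the slide), where $u$ is a unit vector separating the data with margin $\gamma$ and each mistake on $(x, y)$, $y \in \{-1, +1\}$, triggers the update $w \leftarrow w + yx$:

$$w_k \cdot u \ge k\gamma \qquad \text{(each update gains at least } \gamma \text{ along } u\text{)}$$

$$\|w_k\|^2 \le kR^2 \qquad \text{(a mistake means } y(w \cdot x) \le 0\text{, so the squared norm grows by at most } \|x\|^2 \le R^2\text{)}$$

$$k\gamma \;\le\; w_k \cdot u \;\le\; \|w_k\| \;\le\; \sqrt{k}\,R \quad\Longrightarrow\quad k \le (R/\gamma)^2$$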


Perceptron Conclusion

• a very simple framework that works with many structured problems, and works very well in practice

• all you need is (fast) 1-best inference

• much simpler than CRFs and SVMs

• can be applied to parsing, translation, etc.

• generalization bounds depend on separability

• not on the (exponential) size of the search space

• limitation 1: features must be local on y (though they can be non-local on x)

• limitation 2: no probabilistic interpretation (vs. CRF)


From HMM to CRF

(Sutton and McCallum, 2007) -- recommended reading


Maximum likelihood

$$L = \log \prod_{j=1}^{N} P(t^{j} \mid w^{j}) = \log \prod_{j=1}^{N} \frac{1}{Z(w^{j})} \exp \sum_{i=1}^{n} \lambda_i f_i(w^{j}, t^{j})$$

$$= \log \prod_{j=1}^{N} \frac{\exp\left(\sum_{i=1}^{n} \lambda_i f_i(w^{j}, t^{j})\right)}{\sum_{t} \exp\left(\sum_{i=1}^{n} \lambda_i f_i(w^{j}, t)\right)}$$

$$= \sum_{j=1}^{N} \left[ \sum_{i=1}^{n} \lambda_i f_i(w^{j}, t^{j}) - \log \sum_{t} \exp \sum_{i=1}^{n} \lambda_i f_i(w^{j}, t) \right]$$


Optimizing likelihood

$$\frac{\partial L}{\partial \lambda_i} = \sum_{j=1}^{N} \left[ f_i(w^{j}, t^{j}) - E[f_i(w^{j}, t)] \right]$$

Gradient ascent:

$$\lambda_i \leftarrow \lambda_i + \alpha \sum_{j=1}^{N} \left[ f_i(w^{j}, t^{j}) - E[f_i(w^{j}, t)] \right]$$

Stochastic gradient ascent: for each $(w^{j}, t^{j})$,

$$\lambda_i \leftarrow \lambda_i + \alpha \left[ f_i(w^{j}, t^{j}) - E[f_i(w^{j}, t)] \right]$$
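A sketch of the stochastic update in code, assuming a routine `expected_counts(x, lambdas)` that computes $E[f_i]$ by forward-backward (see the next slide) and `feats(x, t)` that counts features on the gold analysis (both names are illustrative):

    def sgd_epoch(data, lambdas, expected_counts, feats, alpha=0.1):
        """One pass of stochastic gradient ascent on the CRF log-likelihood."""
        for x, t in data:
            observed = feats(x, t)                    # f_i(w^j, t^j)
            expected = expected_counts(x, lambdas)    # E[f_i(w^j, t)]
            for f in set(observed) | set(expected):
                lambdas[f] = lambdas.get(f, 0.0) + alpha * (
                    observed.get(f, 0.0) - expected.get(f, 0.0)
                )
        return lambdas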


Expected feature counts

$$E[f_{V,\text{loves}}] = \sum \frac{1}{Z} \, \alpha[F] \, \mu(V, \text{loves}) \, \beta[G], \qquad Z = \alpha[\text{END}]$$

[diagram: trellis edge from state F to state G, scored by $\mu(V, \text{loves})$, with forward weight $\alpha[F]$ and backward weight $\beta[G]$]
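A sketch of this computation in log space for a linear chain; `score(tp, t, word)` plays the role of $\log \mu$ (a toy interface, not from the slides). Summing the returned edge posteriors over the edges where a feature fires gives its expected count $E[f]$:

    import math

    def logsumexp(xs):
        m = max(xs)
        return m + math.log(sum(math.exp(x - m) for x in xs))

    def edge_posteriors(words, tags, score):
        """p(t' -> t at position i) = alpha[t'] * mu * beta[t] / Z."""
        n = len(words)
        # forward: alpha[i][t] = total log weight of paths ending in t at i
        alpha = [{t: score(None, t, words[0]) for t in tags}]
        for i in range(1, n):
            alpha.append({t: logsumexp([alpha[i - 1][tp] + score(tp, t, words[i])
                                        for tp in tags]) for t in tags})
        # backward: beta[i][t] = total log weight of paths from t at i to the end
        beta = [None] * n
        beta[n - 1] = {t: 0.0 for t in tags}
        for i in range(n - 2, -1, -1):
            beta[i] = {tp: logsumexp([score(tp, t, words[i + 1]) + beta[i + 1][t]
                                      for t in tags]) for tp in tags}
        logZ = logsumexp(list(alpha[n - 1].values()))   # Z = alpha[END]
        return {(i, tp, t): math.exp(alpha[i - 1][tp] + score(tp, t, words[i])
                                     + beta[i][t] - logZ)
                for i in range(1, n) for tp in tags for t in tags}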


CRF : SGD : Perceptron

• CRF = conditional random fields = structured maxent

• SGD = stochastic gradient descent = online CRF

• CRF => (online) => SGD => (viterbi) => Perceptron

• update rules (very similar):

• CRF: batch, expectation E[...]

• SGD: one example at a time, expectation E[...]

• Perceptron: one example at a time, Viterbi 1-best in place of the expectation
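Restating the three update rules side by side, in the notation of the "Optimizing likelihood" slide (the perceptron line substitutes the Viterbi best $\hat{t}^j$ for the expectation, which is the standard reading of the chain above):

$$\text{CRF (batch):}\qquad \lambda_i \leftarrow \lambda_i + \alpha \sum_{j=1}^{N}\Big[f_i(w^j, t^j) - E[f_i(w^j, t)]\Big]$$

$$\text{SGD (online):}\qquad \lambda_i \leftarrow \lambda_i + \alpha \Big[f_i(w^j, t^j) - E[f_i(w^j, t)]\Big]$$

$$\text{Perceptron (online, Viterbi):}\qquad \lambda_i \leftarrow \lambda_i + \Big[f_i(w^j, t^j) - f_i(w^j, \hat{t}^j)\Big],\quad \hat{t}^j = \arg\max_t \sum_i \lambda_i f_i(w^j, t)$$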
