Transcript of Unit 3: Structured Learning (part 2) · CS 562: Empirical Methods in Natural Language Processing · Nov 2010
CS 562: Empirical Methods in Natural Language Processing
Nov 2010
Liang Huang (lhuang@isi.edu)

Unit 3: Structured Learning (part 2)
Discriminative Models: Perceptron & CRF
Discriminative Models

A Quiz on Forest

Generative models

$$\hat{t} = \arg\max_t P(t \mid w) = \arg\max_t \frac{P(w, t)}{P(w)} = \arg\max_t P(w, t)$$

• more work to learn to generate w
• but can handle incomplete data: $P(w) = \sum_t P(w, t)$
The generative story

Source-channel (HMM):

$$P(w, t) = P(t)\, P(w \mid t) \approx P(t_1) \prod_{i=2}^{n} P(t_i \mid t_{i-1})\, P(\mathrm{stop} \mid t_n) \times \prod_{i=1}^{n} P(w_i \mid t_i)$$

A per-word conditional model:

$$P(t \mid w) \approx \prod_{i=1}^{n} P(t_i \mid w_i)$$

A DEFICIENT combination (you generated $t_i$ twice!):

$$P(t \mid w) \approx \prod_{i=1}^{n} P(t_i \mid w_i) \times P(t_1) \prod_{i=2}^{n} P(t_i \mid t_{i-1})\, P(\mathrm{stop} \mid t_n)$$
Perceptron is ...

• an extremely simple algorithm
• almost universally applicable
• and works very well in practice

vanilla perceptron (Rosenblatt, 1958)
structured perceptron (Collins, 2002)

the man bit the dog
DT  NN  VBD  DT  NN
Generic Perceptron

• online learning: one example at a time
• learning by doing:
  • find the best output under the current weights
  • update weights at mistakes

(diagram: example x_i goes through inference under the weights w to produce z_i, which is compared against the gold y_i to update the weights)
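A minimal sketch of this loop in Python, assuming user-supplied helpers `decode` (the argmax inference) and `phi` (the feature extractor); the names are illustrative, not from the slides:

```python
from collections import Counter

def perceptron_train(data, decode, phi, epochs=5):
    w = Counter()                       # feature weights
    for _ in range(epochs):
        for x, y in data:               # online: one example at a time
            z = decode(x, w)            # inference under the current weights
            if z != y:                  # update only on mistakes
                w.update(phi(x, y))     # w += Phi(x, y)
                w.subtract(phi(x, z))   # w -= Phi(x, z)
    return w
```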
Example: POS Tagging

• gold-standard:   DT NN VBD DT NN
                   the man bit the dog
• current output:  DT NN NN DT NN
                   the man bit the dog
• assume only two feature classes:
  • tag bigrams t_{i-1} t_i
  • word/tag pairs w_i
• weights ++: (NN, VBD) (VBD, DT) (VBD → bit)
• weights --: (NN, NN) (NN, DT) (NN → bit)
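The slides write the gold sequence as y and the current output as z, with feature vectors Φ(x, y) and Φ(x, z); the update adds Φ(x, y) and subtracts Φ(x, z). A small worked check in Python (illustrative feature names) showing that this difference is exactly the ++/-- features listed above:

```python
from collections import Counter

def phi(words, tags):
    feats = Counter()
    prev = "<s>"
    for w, t in zip(words, tags):
        feats[("bigram", prev, t)] += 1     # tag bigram t_{i-1} t_i
        feats[("word/tag", w, t)] += 1      # word/tag pair
        prev = t
    return feats

words = "the man bit the dog".split()
gold  = "DT NN VBD DT NN".split()           # y
wrong = "DT NN NN DT NN".split()            # z (current output)

delta = phi(words, gold)
delta.subtract(phi(words, wrong))
print({f: c for f, c in delta.items() if c != 0})
# +1: ('bigram','NN','VBD'), ('bigram','VBD','DT'), ('word/tag','bit','VBD')
# -1: ('bigram','NN','NN'),  ('bigram','NN','DT'),  ('word/tag','bit','NN')
```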
Structured Perceptron

(same online loop as the generic perceptron: example x_i → inference under weights w → prediction z_i, compared against the gold y_i → weight update)
Efficiency vs. Expressiveness

• the inference (argmax) must be efficient
  • either the search space GEN(x) is small, or it is factored
• features must be local to y (but can be global to x)
  • e.g. a bigram tagger, whose features can still look at all input words (cf. CRFs)

(diagram: inference computes z = argmax over y ∈ GEN(x) under the current weights w; a Viterbi sketch of this argmax for a bigram tagger is given below)
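A sketch of that argmax in Python, assuming a helper `score(w, words, i, prev, tag)` that returns the weighted feature score of putting `tag` at position i right after `prev` (and that may look at any input word); names are illustrative:

```python
def viterbi(words, tagset, w, score):
    n = len(words)
    best = [dict() for _ in range(n)]   # best[i][t] = (best score, backpointer)
    for t in tagset:
        best[0][t] = (score(w, words, 0, "<s>", t), None)
    for i in range(1, n):
        for t in tagset:
            best[i][t] = max((best[i-1][p][0] + score(w, words, i, p, t), p)
                             for p in tagset)
    tags = [max(best[-1], key=lambda t: best[-1][t][0])]   # best final tag
    for i in range(n - 1, 0, -1):                          # follow backpointers
        tags.append(best[i][tags[-1]][1])
    return list(reversed(tags))
```

Because the features factor over adjacent tags, this dynamic program is exact and runs in time linear in the sentence length and quadratic in the tagset size.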
Averaged Perceptron

• more stable and accurate results
• approximation of the voted perceptron (Freund & Schapire, 1999)
• output the average of all intermediate weight vectors:

$$\bar{w} = \frac{1}{j+1} \sum_{j'=0}^{j} W_{j'}$$
Averaging Tricks

• Daume (2006, PhD thesis); a lazy-update version of the trick is sketched below
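One common form of the lazy averaging trick, sketched under the same assumed helpers as before (`decode`, `phi`); instead of summing the full weight vector after every example, it keeps a second vector of time-weighted updates and recovers the average at the end:

```python
from collections import Counter

def averaged_perceptron(data, decode, phi, epochs=5):
    w = Counter()          # current weights
    u = Counter()          # time-weighted sum of updates
    c = 1                  # example counter
    for _ in range(epochs):
        for x, y in data:
            z = decode(x, w)
            if z != y:
                delta = Counter(phi(x, y))
                delta.subtract(phi(x, z))
                for f, v in delta.items():
                    w[f] += v
                    u[f] += c * v       # remember *when* each update happened
            c += 1
    # average of all intermediate weight vectors (up to the usual off-by-one)
    return Counter({f: w[f] - u[f] / c for f in w})
```

This touches only the features that actually fire in an update, instead of every feature at every example, which is what makes averaging cheap for sparse feature vectors.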
Do we need smoothing?

• smoothing is much easier in discriminative models
• just make sure that for each feature template, its subset templates are also included
  • e.g., to include (t0 w0 w-1) you must also include (t0 w0), (t0 w-1), (w0 w-1)
  • and maybe also (t0 t-1), because tags are less sparse than words
  • (a small sketch of such templates follows below)
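A sketch of the "include the subset templates" idea in Python; the template names and padding symbols are illustrative:

```python
def smoothed_templates(words, tags, i):
    w0  = words[i]
    wm1 = words[i-1] if i > 0 else "<s>"
    t0  = tags[i]
    tm1 = tags[i-1] if i > 0 else "<s>"
    return [
        ("t0 w0 w-1", t0, w0, wm1),   # the full conjunction ...
        ("t0 w0",     t0, w0),        # ... plus all of its subset templates
        ("t0 w-1",    t0, wm1),
        ("w0 w-1",    w0, wm1),
        ("t0 t-1",    t0, tm1),       # tags are less sparse than words
    ]
```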
Comparison with Other Models
From HMM to MEMM

• HMM: joint distribution
• MEMM: locally normalized (per-state conditional)
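In standard notation (a reconstruction; the slide's own formulas did not survive extraction), the contrast is between a joint model and a product of locally normalized conditionals:

$$P_{\text{HMM}}(w, t) = \prod_{i=1}^{n} P(t_i \mid t_{i-1})\, P(w_i \mid t_i)
\qquad\text{vs.}\qquad
P_{\text{MEMM}}(t \mid w) = \prod_{i=1}^{n} P(t_i \mid t_{i-1}, w)$$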
Label Bias Problem

• bias towards states with fewer outgoing transitions
• a problem with all locally normalized models
Conditional Random Fields

• globally normalized (no label bias problem)
• but training requires expected feature counts
  • (related to the fractional counts in EM)
  • need to use the Inside-Outside algorithm (sum)
• the perceptron just needs Viterbi (max)
Experiments
Experiments: Tagging

• (almost) identical features from (Ratnaparkhi, 1996)
  • trigram tagger: current tag ti, previous tags ti-1, ti-2
  • current word wi and its spelling features
  • surrounding words wi-1, wi+1, wi-2, wi+2, ...
  • (a sketch of such templates follows below)
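A rough sketch of feature templates in this spirit; the exact template set from (Ratnaparkhi, 1996) is not reproduced here, and the spelling features shown (prefixes, suffixes, digit test) are illustrative assumptions:

```python
def trigram_tagger_features(words, tags, i):
    get = lambda seq, j, pad: seq[j] if 0 <= j < len(seq) else pad
    w0, t0 = words[i], tags[i]
    return [
        ("t0 t-1 t-2",   t0, get(tags, i-1, "<s>"), get(tags, i-2, "<s>")),
        ("t0 w0",        t0, w0),
        ("t0 suffix3",   t0, w0[-3:]),                       # spelling features
        ("t0 prefix3",   t0, w0[:3]),
        ("t0 has-digit", t0, any(c.isdigit() for c in w0)),
        ("t0 w-1",       t0, get(words, i-1, "<s>")),
        ("t0 w+1",       t0, get(words, i+1, "</s>")),
        ("t0 w-2",       t0, get(words, i-2, "<s>")),
        ("t0 w+2",       t0, get(words, i+2, "</s>")),
    ]
```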
Experiments: NP Chunking

• B-I-O scheme
• features:
  • unigram model
  • surrounding words and POS tags

Example (B-I-O tag per word):
  Rockwell International Corp. → B I I
  's Tulsa unit → B I I
  said → O,  it → B,  signed → O
  a tentative agreement → B I I
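A small Python sketch of reading the NP chunks back off a B-I-O sequence, using the slide's example sentence and tags:

```python
words = ("Rockwell International Corp. 's Tulsa unit said it "
         "signed a tentative agreement").split()
tags  = "B I I B I I O B O B I I".split()

chunks, current = [], []
for w, t in zip(words, tags):
    if t == "B":                      # B starts a new chunk
        if current:
            chunks.append(current)
        current = [w]
    elif t == "I":                    # I continues the current chunk
        current.append(w)
    else:                             # O closes any open chunk
        if current:
            chunks.append(current)
        current = []
if current:
    chunks.append(current)

print([" ".join(c) for c in chunks])
# ['Rockwell International Corp.', "'s Tulsa unit", 'it', 'a tentative agreement']
```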
Experiments: NP Chunking (results)

• (Sha and Pereira, 2003) trigram tagger
• voted perceptron: 94.09% vs. CRF: 94.38%
Other NLP Applications

• dependency parsing (McDonald et al., 2005)
• parse reranking (Collins)
• phrase-based translation (Liang et al., 2006)
• word segmentation
• ... and many many more ...
Theory
Vanilla Perceptron

(three slides of figures; only the title survived extraction)
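For reference, the standard binary perceptron update (Rosenblatt, 1958) that these slides build on, as a minimal Python sketch with dense lists as vectors; names are illustrative:

```python
def vanilla_perceptron(data, dim, epochs=10):
    w = [0.0] * dim
    for _ in range(epochs):
        for x, y in data:                                       # y is +1 or -1
            if y * sum(wi * xi for wi, xi in zip(w, x)) <= 0:   # mistake
                w = [wi + y * xi for wi, xi in zip(w, x)]       # w += y * x
    return w
```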
Convergence Theorem

• data is separable if and only if the perceptron converges
• the number of updates is bounded by $(R/\gamma)^2$
  • $\gamma$ is the margin; $R = \max_i \|x_i\|$
• this result generalizes to the structured perceptron, with $R = \max_i \|\Phi(x_i, y_i) - \Phi(x_i, z_i)\|$
• also in Collins' paper: theorems for non-separable cases and generalization bounds
Perceptron Conclusion

• a very simple framework that can work with many structured problems and that works very well
• all you need is (fast) 1-best inference
  • much simpler than CRFs and SVMs
  • can be applied to parsing, translation, etc.
• generalization bounds depend on separability
  • not on the (exponential) size of the search space
• limitation 1: features must be local on y (but can be non-local on x)
• limitation 2: no probabilistic interpretation (vs. CRF)
From HMM to CRF

(Sutton and McCallum, 2007) -- recommended reading
Maximum likelihood

$$
\begin{aligned}
L &= \log \prod_{j=1}^{N} P(t_j \mid w_j) \\
  &= \log \prod_{j=1}^{N} \frac{1}{Z(w_j)} \exp \sum_{i=1}^{n} \lambda_i f_i(w_j, t_j) \\
  &= \log \prod_{j=1}^{N} \frac{\exp\left(\sum_{i=1}^{n} \lambda_i f_i(w_j, t_j)\right)}{\sum_{t} \exp\left(\sum_{i=1}^{n} \lambda_i f_i(w_j, t)\right)} \\
  &= \sum_{j=1}^{N} \left( \sum_{i=1}^{n} \lambda_i f_i(w_j, t_j) - \log \sum_{t} \exp \sum_{i=1}^{n} \lambda_i f_i(w_j, t) \right)
\end{aligned}
$$
Optimizing likelihood

$$\frac{\partial L}{\partial \lambda_i} = \sum_{j=1}^{N} \left( f_i(w_j, t_j) - \mathbf{E}[f_i(w_j, t)] \right)$$

Gradient ascent:

$$\lambda_i \leftarrow \lambda_i + \alpha \sum_{j=1}^{N} \left( f_i(w_j, t_j) - \mathbf{E}[f_i(w_j, t)] \right)$$

Stochastic gradient ascent: for each $(w_j, t_j)$,

$$\lambda_i \leftarrow \lambda_i + \alpha \left( f_i(w_j, t_j) - \mathbf{E}[f_i(w_j, t)] \right)$$
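The stochastic step, sketched in Python; `phi` and `expected_counts` are assumed helpers (the expectation would come from forward-backward, next slide), and the names are illustrative:

```python
def sgd_step(lam, words, gold_tags, phi, expected_counts, alpha=0.1):
    observed = phi(words, gold_tags)          # f_i(w_j, t_j)
    expected = expected_counts(lam, words)    # E[f_i(w_j, t)] under the model
    for f in set(observed) | set(expected):   # push toward observed, away from expected
        lam[f] = lam.get(f, 0.0) + alpha * (observed.get(f, 0.0)
                                            - expected.get(f, 0.0))
    return lam
```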
Expected feature counts

For a feature that fires on the lattice edge from state F to state G, tagging "loves" as V:

$$\mathbf{E}[f_{V,\text{loves}}] = \sum \frac{1}{Z}\, \alpha[F]\; \mu(V, \text{loves})\; \beta[G], \qquad Z = \alpha[\mathrm{END}]$$

(diagram: forward score $\alpha[F]$ into F, edge potential $\mu(V, \text{loves})$, backward score $\beta[G]$ out of G)
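A sketch of this forward-backward computation for a linear chain, assuming an edge-potential helper `pot(i, prev, tag)` and a predicate `feature_fires(i, prev, tag)`; names are illustrative:

```python
def expected_count(words, tagset, pot, feature_fires):
    n = len(words)
    # forward scores: alpha[i][t] = total weight of all paths ending in tag t at i
    alpha = [{t: pot(0, "<s>", t) for t in tagset}]
    for i in range(1, n):
        alpha.append({t: sum(alpha[i-1][p] * pot(i, p, t) for p in tagset)
                      for t in tagset})
    # backward scores: beta[i][t] = total weight of all paths from tag t at i to the end
    beta = [dict() for _ in range(n)]
    beta[n-1] = {t: 1.0 for t in tagset}
    for i in range(n - 2, -1, -1):
        beta[i] = {t: sum(pot(i+1, t, q) * beta[i+1][q] for q in tagset)
                   for t in tagset}
    Z = sum(alpha[n-1][t] for t in tagset)   # partition function (alpha at END)
    # expected count = sum over edges where the feature fires of
    # alpha[in-state] * edge potential * beta[out-state] / Z
    # (the initial edge from <s> is analogous and omitted here for brevity)
    total = 0.0
    for i in range(1, n):
        for p in tagset:
            for t in tagset:
                if feature_fires(i, p, t):
                    total += alpha[i-1][p] * pot(i, p, t) * beta[i][t] / Z
    return total
```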
CRF : SGD : Perceptron

• CRF = conditional random fields = structured maxent
• SGD = stochastic gradient descent = online CRF
• CRF => (online) => SGD => (Viterbi) => Perceptron
• update rules (very similar; written out below):
  • CRF: batch, expectation E[...]
  • SGD: one example at a time, expectation E[...]
  • Perceptron: one example at a time, and Viterbi (the 1-best output stands in for the expectation)
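The three update rules side by side, in the notation of the earlier slides (a reconstruction, since the slide shows them only in words):

$$\text{CRF (batch):}\quad \lambda_i \leftarrow \lambda_i + \alpha \sum_{j=1}^{N}\bigl(f_i(w_j, t_j) - \mathbf{E}[f_i(w_j, t)]\bigr)$$

$$\text{SGD (one example at a time):}\quad \lambda_i \leftarrow \lambda_i + \alpha \bigl(f_i(w_j, t_j) - \mathbf{E}[f_i(w_j, t)]\bigr)$$

$$\text{Perceptron (one example, Viterbi):}\quad \lambda_i \leftarrow \lambda_i + f_i(w_j, t_j) - f_i(w_j, z_j), \qquad z_j = \arg\max_{t}\ \sum_i \lambda_i f_i(w_j, t)$$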