Source: pageperso.lif.univ-mrs.fr/~alexis.nasr/Ens/M2/collins-perceptron.pdf
Parameter Estimation:
The Perceptron Algorithm
Collins Perceptron
What we have done so far...
• packed forest
• as a general representation for many NLP problems
• formalized as a weighted hypergraph
• DP algorithms for 1-best and k-best on hypergraphs
Example: Baoweier yu Shalong juxing le huitan → "Powell held a meeting with Sharon"
Big Q: where do the weights come from?
A Quiz
Perceptron is ...
• an extremely simple algorithm
• almost universally applicable
• and works very well in practice
• vanilla perceptron (Rosenblatt, 1958)
• structured perceptron (Collins, 2002)
• e.g. the man bit the dog → DT NN VBD DT NN
Generic Perceptron
• online-learning: one example at a time
• learning by doing
• find the best output under the current weights
• update weights at mistakes
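[Figure: training loop. Input $x_i$ → inference under the current weights $w$ → output $z_i$; $z_i$ is compared with the gold $y_i$ → update weights → $w$.]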
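The loop described on this slide can be sketched in a few lines. This is a minimal illustration, not the deck's own code: the helper names `gen` (candidate outputs for an input) and `phi` (feature map returning feature counts), the brute-force argmax, and the toy data are all assumptions made for the example.

```python
from collections import defaultdict

def perceptron_train(data, gen, phi, epochs=5):
    """Generic perceptron over (x, y) pairs.

    gen(x) enumerates candidate outputs; phi(x, y) returns a dict
    mapping feature -> count. Both are hypothetical helpers.
    """
    w = defaultdict(float)

    def score(x, cand):
        return sum(w[f] * v for f, v in phi(x, cand).items())

    for _ in range(epochs):
        mistakes = 0
        for x, y in data:
            # inference: best output under the current weights
            z = max(gen(x), key=lambda cand: score(x, cand))
            if z != y:
                # update at mistakes: reward gold features, penalize predicted ones
                for f, v in phi(x, y).items():
                    w[f] += v
                for f, v in phi(x, z).items():
                    w[f] -= v
                mistakes += 1
        if mistakes == 0:  # an error-free pass: nothing left to learn
            break
    return w

# Toy data (assumed for illustration): one indicator feature per (word, label).
LABELS = ["pos", "neg"]
data = [("good", "pos"), ("bad", "neg"), ("great", "pos"), ("awful", "neg")]

def gen(x):
    return LABELS

def phi(x, y):
    return {(x, y): 1}

w = perceptron_train(data, gen, phi)

def predict(x):
    return max(gen(x), key=lambda y: w[(x, y)])

print([predict(x) for x, _ in data])
```

Enumerating `gen(x)` only works when the candidate set is small; for structured outputs the argmax has to come from a dynamic program instead.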
Example: POS Tagging
• gold-standard: DT NN VBD DT NN
• the man bit the dog
• current output: DT NN NN DT NN
• the man bit the dog
• assume only two feature classes
• tag bigrams ($t_{i-1}$, $t_i$)
• word/tag pairs ($w_i$, $t_i$)
• weights ++: (NN, VBD) (VBD, DT) (VBD→bit)
• weights --: (NN, NN) (NN, DT) (NN→bit)
[Slide annotations: x is the input sentence, y the gold-standard tag sequence, z the current output; the ++ features occur only in $\Phi(x, y)$ and the -- features only in $\Phi(x, z)$.]
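The ++ and -- features in the tagging example can be checked mechanically by subtracting the feature counts of the current output from those of the gold standard. The sketch below assumes exactly the two feature classes named on the slide (tag bigrams and word/tag pairs, with no start or stop symbols); the function name `phi` and the feature encoding are illustrative choices.

```python
from collections import Counter

def phi(words, tags):
    """Count tag-bigram and word/tag features for one tagged sentence."""
    feats = Counter()
    for prev, cur in zip(tags, tags[1:]):
        feats[("bigram", prev, cur)] += 1
    for word, tag in zip(words, tags):
        feats[("word/tag", word, tag)] += 1
    return feats

words = "the man bit the dog".split()
gold = "DT NN VBD DT NN".split()   # y
pred = "DT NN NN DT NN".split()    # z

phi_y, phi_z = phi(words, gold), phi(words, pred)

# Phi(x, y) - Phi(x, z), keeping only features whose counts differ
delta = {
    f: phi_y[f] - phi_z[f]
    for f in set(phi_y) | set(phi_z)
    if phi_y[f] != phi_z[f]
}

for feat, change in sorted(delta.items()):
    print(feat, "++" if change > 0 else "--")
```

Features shared by both sequences, such as the (DT, NN) bigram, cancel out, so only the six features listed on the slide are touched by the update.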
Structured Perceptron
[Figure: the same loop for structured outputs. $x_i$ → inference → $z_i$; $z_i$ and $y_i$ → update weights → $w$.]
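In the deck's notation, one step of the structured perceptron can be written as follows. This is a sketch of the standard formulation in Collins (2002), with $\Phi$ the feature map and $\mathrm{GEN}(x_i)$ the candidate set; the weights change only when the prediction $z_i$ differs from the gold output $y_i$.

```latex
z_i = \arg\max_{z \in \mathrm{GEN}(x_i)} \mathbf{w} \cdot \Phi(x_i, z)
\qquad
\text{if } z_i \neq y_i:\quad
\mathbf{w} \leftarrow \mathbf{w} + \Phi(x_i, y_i) - \Phi(x_i, z_i)
```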
Efficiency vs. Expressiveness
• the inference (argmax) must be efficient
• either the search space GEN(x) is small, or factored
• features must be local to y (but can be global to x)
• e.g. bigram tagger, but look at all input words (cf. CRFs)
[Figure: the inference/update loop, with the inference step computing $\arg\max_{y \in \mathrm{GEN}(x)}$ over the candidate outputs y for input x.]
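For a bigram tagger, the factored argmax is the Viterbi algorithm. The sketch below is illustrative rather than the deck's implementation: `score_emit` receives the whole input and a position (so it may look at any word, i.e. global in x), while `score_trans` sees only two adjacent tags (local in y). The weight table is made up for the example.

```python
def viterbi(words, tags, score_emit, score_trans):
    """Return the highest-scoring tag sequence under a bigram-factored model."""
    if not words:
        return []
    # best[i][t]: best score of any tag prefix that ends with tag t at position i
    best = [{t: score_emit(words, 0, t) for t in tags}]
    back = []
    for i in range(1, len(words)):
        row, ptr = {}, {}
        for t in tags:
            prev = max(tags, key=lambda p: best[i - 1][p] + score_trans(p, t))
            row[t] = best[i - 1][prev] + score_trans(prev, t) + score_emit(words, i, t)
            ptr[t] = prev
        best.append(row)
        back.append(ptr)
    # follow back-pointers from the best final tag
    path = [max(tags, key=lambda t: best[-1][t])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]

# Hypothetical weights, chosen so that the gold tagging scores highest.
W = {
    ("bigram", "NN", "VBD"): 1.0, ("bigram", "VBD", "DT"): 1.0,
    ("word/tag", "the", "DT"): 1.0, ("word/tag", "man", "NN"): 1.0,
    ("word/tag", "bit", "VBD"): 1.0, ("word/tag", "dog", "NN"): 1.0,
}

def score_emit(words, i, tag):
    return W.get(("word/tag", words[i], tag), 0.0)

def score_trans(prev, tag):
    return W.get(("bigram", prev, tag), 0.0)

words = "the man bit the dog".split()
print(viterbi(words, ["DT", "NN", "VBD"], score_emit, score_trans))
```

The running time is linear in sentence length and quadratic in the number of tags, instead of exponential in sentence length as with brute-force enumeration of GEN(x).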
What about tree-to-string?
Averaged Perceptron
• more stable and accurate results
• approximation of voted perceptron (Freund & Schapire, 1999)
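[Algorithm annotations: $j \leftarrow 0$ at the start; $j \leftarrow j + 1$ after each example; the output is $\sum_j W_j$, the sum of the weight vectors $W_j$ from every step.]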
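A minimal sketch of averaging, assuming the common formulation in which the weight vector is accumulated after every example (whether or not it triggered an update) and the accumulated sum is divided by the number of steps. The helpers `gen` and `phi` and the toy data are hypothetical, and the naive accumulation here costs time proportional to the number of features on every step; practical implementations usually defer it.

```python
from collections import defaultdict

def averaged_perceptron(data, gen, phi, epochs=5):
    w = defaultdict(float)      # current weights W_j
    total = defaultdict(float)  # running sum of W_j over all steps j
    j = 0
    for _ in range(epochs):
        for x, y in data:
            z = max(gen(x), key=lambda c: sum(w[f] * v for f, v in phi(x, c).items()))
            if z != y:
                for f, v in phi(x, y).items():
                    w[f] += v
                for f, v in phi(x, z).items():
                    w[f] -= v
            # accumulate after every example, not only after mistakes
            for f, v in w.items():
                total[f] += v
            j += 1
    if j == 0:
        return {}
    return {f: s / j for f, s in total.items()}

# Toy data (assumed for illustration).
LABELS = ["pos", "neg"]
data = [("good", "pos"), ("bad", "neg"), ("great", "pos"), ("awful", "neg")]

def gen(x):
    return LABELS

def phi(x, y):
    return {(x, y): 1}

avg = averaged_perceptron(data, gen, phi, epochs=2)
print(avg[("bad", "neg")], avg[("awful", "neg")])
```

Weights learned early contribute to more of the sum than weights learned late, which damps the effect of the last few updates.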
Comparison with Other Models
from HMM to MEMM
• HMM: joint distribution
• MEMM: locally normalized (per-state conditional)
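For reference, the two models factor as follows in their standard first-order forms, with x the word sequence and y the tag sequence. The slide itself does not spell these out, so treat this as a sketch of the usual textbook definitions.

```latex
\text{HMM:}\quad p(x, y) = \prod_{i} p(y_i \mid y_{i-1})\, p(x_i \mid y_i)
\qquad
\text{MEMM:}\quad p(y \mid x) = \prod_{i} p(y_i \mid y_{i-1}, x)
```

Each MEMM factor is normalized over the next tag only, which is the local normalization the slide refers to.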
Label Bias Problem
• bias towards states with fewer outgoing transitions
• a problem with all locally normalized models
Conditional Random Fields
• globally normalized (no label bias problem)
• but training requires expected feature counts
• (related to the fractional counts in EM)
• need to use Inside-Outside algorithm (sum)
• Perceptron just needs Viterbi (max)
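The contrast between sum and max shows up directly in the training signal. As a sketch using the standard log-linear CRF definition (with $\Phi$ and $\mathrm{GEN}$ as in the deck's notation), the CRF gradient subtracts an expectation over all candidates, while the perceptron update subtracts the features of the single best candidate:

```latex
p(y \mid x) = \frac{\exp\big(\mathbf{w} \cdot \Phi(x, y)\big)}
                   {\sum_{y' \in \mathrm{GEN}(x)} \exp\big(\mathbf{w} \cdot \Phi(x, y')\big)}

\nabla_{\mathbf{w}} \log p(y_i \mid x_i)
  = \Phi(x_i, y_i) - \mathbb{E}_{y' \sim p(\cdot \mid x_i)}\big[\Phi(x_i, y')\big]

\text{perceptron update:}\quad
  \Phi(x_i, y_i) - \Phi(x_i, z_i),
  \qquad z_i = \arg\max_{z \in \mathrm{GEN}(x_i)} \mathbf{w} \cdot \Phi(x_i, z)
```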
Experiments
Experiments: Tagging
• (almost) identical features from (Ratnaparkhi, 1996)
• trigram tagger: current tag $t_i$, previous tags $t_{i-1}$, $t_{i-2}$
• current word $w_i$ and its spelling features
• surrounding words $w_{i-1}$, $w_{i+1}$, $w_{i-2}$, $w_{i+2}$, ...
Experiments: NP Chunking
• B-I-O scheme
• features:
• unigram model
• surrounding words and POS tags
Example: Rockwell/B International/I Corp./I 's/B Tulsa/I unit/I said/O it/B signed/O a/B tentative/I agreement/I ...
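To make the B-I-O scheme concrete, the sketch below turns the tag sequence from the example sentence back into noun-phrase chunks. The function name and the handling of a stray I tag (ignored) are assumptions for illustration.

```python
def bio_to_chunks(tags):
    """Return (start, end) spans, end exclusive, for each B followed by I's."""
    chunks, start = [], None
    for i, tag in enumerate(tags):
        if tag in ("B", "O") and start is not None:
            chunks.append((start, i))  # a B or O closes the open chunk
            start = None
        if tag == "B":
            start = i                  # B opens a new chunk
        # an I with no open chunk is ignored (assumes well-formed input)
    if start is not None:
        chunks.append((start, len(tags)))
    return chunks

words = "Rockwell International Corp. 's Tulsa unit said it signed a tentative agreement".split()
tags = "B I I B I I O B O B I I".split()

nps = [" ".join(words[s:e]) for s, e in bio_to_chunks(tags)]
print(nps)
```

The B tag is what lets two adjacent chunks ("Rockwell International Corp." and "'s Tulsa unit") be kept apart without any O between them.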
Experiments: NP Chunking
• results
• (Sha and Pereira, 2003) trigram tagger
• voted perceptron: 94.09% vs. CRF: 94.38%
Other NLP Applications
• dependency parsing (McDonald et al., 2005)
• parse reranking (Collins)
• phrase-based translation (Liang et al., 2006)
• word segmentation
• ... and many many more ...
Theory
Vanilla Perceptron
Convergence Theorem
• Data is separable if and only if perceptron converges
• number of updates is bounded by $(R/\gamma)^2$
• $\gamma$ is the margin; $R = \max_i \|x_i\|$
• This result generalizes to structured perceptron, with $R = \max_i \|\Phi(x_i, y_i) - \Phi(x_i, z_i)\|$
• Also in the paper: theorems for non-separable cases and generalization bounds
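The bound can be checked numerically on a small separable data set. The sketch below is an illustration for the binary (vanilla) case only: the points and the unit vector `u` are made up, and the margin is computed with respect to that particular `u`, which is enough for the bound to apply.

```python
import math

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def perceptron_updates(points, max_epochs=1000):
    """Binary perceptron; returns the number of updates before an error-free pass."""
    w = [0.0] * len(points[0][0])
    updates = 0
    for _ in range(max_epochs):
        clean = True
        for x, y in points:
            if y * dot(w, x) <= 0:  # mistake (or zero score)
                w = [wi + y * xi for wi, xi in zip(w, x)]
                updates += 1
                clean = False
        if clean:
            return updates
    raise RuntimeError("did not converge within max_epochs")

# Toy data (assumed for illustration), separable by the unit vector u.
points = [((2.0, 1.0), 1), ((1.0, 3.0), 1), ((-1.0, -2.0), -1),
          ((-3.0, 0.5), -1), ((0.5, 0.5), 1)]
u = (1 / math.sqrt(2), 1 / math.sqrt(2))

gamma = min(y * dot(u, x) for x, y in points)      # margin achieved by u
R = max(math.sqrt(dot(x, x)) for x, _ in points)   # largest input norm
bound = (R / gamma) ** 2

updates = perceptron_updates(points)
print(updates, "<=", bound)
```

Note that the bound involves only R and the margin; neither the number of examples nor the dimensionality appears in it.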
Conclusion
• a very simple framework that can work with many structured problems and that works very well
• all you need is (fast) 1-best inference
• much simpler than CRFs and SVMs
• can be applied to parsing, translation, etc.
• generalization bounds depend on separability
• not the (exponential) size of the search space
• extensions: MIRA, k-best MIRA, ...
• major limitation: only local features