CS 6840: Natural Language Processing
Razvan C. Bunescu
School of Electrical Engineering and Computer Science
Sequence Tagging with HMMs: Part of Speech Tagging
Part of Speech (POS) Tagging
• Annotate each word in a sentence with its POS:
  – noun, verb, adjective, adverb, pronoun, preposition, interjection, …

She/PRP promised/VBD to/TO back/VB the/DT bill/NN
Parts of Speech
• Lexical categories that are defined based on:
  – Syntactic function:
    • nouns can occur with determiners: a goat.
    • nouns can take possessives: IBM’s annual revenue.
    • most nouns can occur in the plural: goats.
  – Morphological function:
    • many verbs can be composed with the prefix “un”.
• There are tendencies toward semantic coherence:
  – nouns often refer to “people, places, or things”.
  – adjectives often refer to properties.
POS: Closed Class vs. Open Class
• Closed Class:
  – relatively fixed membership.
  – usually function words:
    • short, common words which have a structuring role in grammar.
  – Prepositions: of, in, by, on, under, over, …
  – Auxiliaries: may, can, will, had, been, should, …
  – Pronouns: I, you, she, mine, his, them, …
  – Determiners: a, an, the, which, that, …
  – Conjunctions: and, but, or (coord.), as, if, when (subord.), …
  – Particles: up, down, on, off, …
  – Numerals: one, two, three, third, …
POS: Open Class vs. Closed Class
• Open Class:
  – new members are continually added:
    • to fax, to google, futon, …
  – English has 4: Nouns, Verbs, Adjectives, Adverbs.
    • Many languages have these 4, but not all (e.g. Korean).
  – Nouns: people, places, or things
  – Verbs: actions and processes
  – Adjectives: properties or qualities
  – Adverbs: a hodge-podge
    • Unfortunately, John walked home extremely slowly yesterday.
    • directional, locative, temporal, degree, manner, …
POS: Open vs. Closed Classes
• Open Class: new members are continually added.
1. Annie: Do you love me?
   Alvy: Love is too weak a word for what I feel... I lurve you. Y'know, I loove you, I, I luff you. There are two f's. I have to invent... Of course I love you. (Annie Hall)

2. 'Twas brillig, and the slithy toves
   Did gyre and gimble in the wabe;
   All mimsy were the borogoves,
   And the mome raths outgrabe.

   "Beware the Jabberwock, my son!
   The jaws that bite, the claws that catch!
   Beware the Jubjub bird, and shun
   The frumious Bandersnatch!"

   (Jabberwocky, Lewis Carroll)
Parts of Speech: Granularity
• Grammatical sketch of Greek [Dionysius Thrax, c. 100 B.C.]:
  – 8 tags: noun, verb, pronoun, preposition, adjective, conjunction, participle, and article.
• Brown corpus [Francis, 1979]:– 87 tags.
• Penn Treebank [Marcus et al., 1993]:– 45 tags.
• British National Corpus (BNC) [Garside et al., 1997]:– C5 tagset: 61 tags.– C7 tagset: 146 tags.
We will focus on the Penn Treebank POS tags.
Penn Treebank POS Tagset
Penn Treebank POS tags
• Selected from the original 87 tags of the Brown corpus:
  ⇒ lost finer distinctions between lexical categories.
1) Prepositions and subordinating conjunctions:
  – after/CS spending/VBG a/AT day/NN at/IN the/AT palace/NN
  – after/IN a/AT wedding/NN trip/NN to/IN Hawaii/NNP ./.
2) Infinitive to and prepositional to:
  – to/TO give/VB priority/NN to/IN teachers/NNS
3) Adverbial nouns:
  – Brown: Monday/NR, home/NR, west/NR, tomorrow/NR
  – PTB: Monday/NNP, (home, tomorrow, west)/(NN, RB)
POS Tagging ≡ POS Disambiguation
• Words often have more than one POS tag, e.g. back:
  – the back/JJ door
  – on my back/NN
  – win the voters back/RB
  – promised to back/VB the bill
• Brown corpus statistics [DeRose, 1988]:
  – 11.5% of English word types are ambiguous.
  – 40% of all word occurrences are ambiguous.
  – most are easy to disambiguate:
    • the tags are not equally likely, i.e. low tag entropy (e.g. the word table).
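The notion of tag entropy can be made concrete with a few lines of Python. This is a sketch with invented counts, not actual Brown corpus statistics:

```python
import math

def tag_entropy(tag_counts):
    """Entropy (in bits) of a word's tag distribution; low entropy
    means one tag dominates and disambiguation is easy."""
    total = sum(tag_counts.values())
    return -sum((c / total) * math.log2(c / total)
                for c in tag_counts.values())

# Hypothetical counts: "table" is almost always NN (low entropy),
# while a genuinely ambiguous word splits evenly (high entropy).
skewed = {"NN": 199, "VB": 1}
balanced = {"NN": 100, "VB": 100}
```

A word like table with a heavily skewed distribution is correctly tagged by simply picking its majority tag.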
POS Tag Ambiguity
POS Tagging ≡ POS Disambiguation
• Some distinctions are difficult even for humans:
  – Mrs./NNP Shaefer/NNP never/RB got/VBD around/RP to/TO joining/VBG
  – All/DT we/PRP gotta/VBN do/VB is/VBZ go/VB around/IN the/DT corner/NN
  – Chateau/NNP Petrus/NNP costs/VBZ around/RB 250/CD
• Use heuristics [Santorini, 1990]:
  – She told off/RP her friends → She told her friends off/RP
  – She stepped off/IN the train → *She stepped the train off/IN
How Difficult is POS Tagging?
• Most current tagging algorithms: ~96%–97% accuracy on the Penn Treebank tagset.
  – Current state of the art: 97.55% tagging accuracy. How good is this?
    • Bidirectional LSTM-CRF Models for Sequence Tagging [Huang, Xu, Yu, 2015].
  – Human Ceiling: how well do humans do?
    • human annotators: about 96%–97% [Marcus et al., 1993].
    • when allowed to discuss tags, consensus is 100% [Voutilainen, 1995].
  – Most Frequent Class Baseline:
    • 90%–91% on the 87-tag Brown tagset [Charniak et al., 1993].
    • 93.69% on the 45-tag Penn Treebank, with an unknown word model [Toutanova et al., 2003].
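The Most Frequent Class baseline is simple enough to sketch in Python. The toy corpus below is invented for illustration; a real baseline would train on the full treebank and use a dedicated unknown-word model:

```python
from collections import Counter, defaultdict

def train_mfc(tagged_sentences):
    """Most Frequent Class baseline: remember each word's most common tag."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word][tag] += 1
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}

def tag_mfc(model, words, default="NN"):
    # Unknown words fall back to a default tag.
    return [model.get(w, default) for w in words]

corpus = [[("the", "DT"), ("back", "NN")],
          [("the", "DT"), ("back", "JJ"), ("door", "NN")],
          [("the", "DT"), ("back", "NN"), ("hurts", "VBZ")]]
model = train_mfc(corpus)
```

Because most ambiguous words have one dominant tag (low tag entropy), this baseline already reaches the 90%+ range.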
POS Tagging Methods
• Rule Based:
  – Rules are designed by human experts based on linguistic knowledge.
• Machine Learning:
  – Trained on data that has been manually labeled by humans.
  – Rule learning:
    • Transformation Based Learning (TBL).
  – Sequence tagging:
    • Hidden Markov Models (HMM).
    • Maximum Entropy (Logistic Regression).
    • Sequential Conditional Random Fields (CRF).
    • Recurrent Neural Networks (RNN):
      – bidirectional, with a CRF layer (BI-LSTM-CRF).
POS Tagging: Rule Based
1) Start with a dictionary.
2) Assign all possible tags to words from the dictionary.
3) Write rules by hand to selectively remove tags, leaving the correct tag for each word.
POS Tagging: Rule Based
1) Start with a dictionary:
she: PRP
promised: VBN, VBD
to: TO
back: VB, JJ, RB, NN
the: DT
bill: NN, VB
… for the ~100,000 words of English.
POS Tagging: Rule Based
2) Assign every possible tag:
She: PRP
promised: VBN, VBD
to: TO
back: VB, JJ, RB, NN
the: DT
bill: NN, VB
POS Tagging: Rule Based
3) Write rules to eliminate incorrect tags.
  – Eliminate VBN if VBD is an option when VBN|VBD follows “<S> PRP”.
She: PRP
promised: VBD (VBN eliminated by the rule)
to: TO
back: VB, JJ, RB, NN
the: DT
bill: NN, VB
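The three steps can be sketched in Python. The lexicon and the rule encoding below are illustrative simplifications of the slide's example, not a complete constraint-grammar tagger:

```python
# Step 1: a (tiny, illustrative) dictionary of possible tags per word.
LEXICON = {
    "she": {"PRP"}, "promised": {"VBN", "VBD"}, "to": {"TO"},
    "back": {"VB", "JJ", "RB", "NN"}, "the": {"DT"}, "bill": {"NN", "VB"},
}

def assign_candidates(words):
    # Step 2: assign every possible tag to each word.
    return [set(LEXICON[w.lower()]) for w in words]

def rule_vbn_vbd(candidates):
    """Step 3 (one hand-written rule): eliminate VBN if VBD is also an
    option for a word that follows a sentence-initial pronoun (<S> PRP)."""
    if len(candidates) > 1 and candidates[0] == {"PRP"}:
        if {"VBN", "VBD"} <= candidates[1]:
            candidates[1].discard("VBN")
    return candidates

words = ["She", "promised", "to", "back", "the", "bill"]
cands = rule_vbn_vbd(assign_candidates(words))
```

A full rule-based tagger would apply many such rules until each word is left with a single tag.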
POS Tagging as Sequence Labeling
• Sequence Labeling:
  – Tokenization and Sentence Segmentation.
  – Part of Speech Tagging.
  – Information Extraction:
    • Named Entity Recognition
  – Shallow Parsing.
  – Semantic Role Labeling.
  – DNA Analysis.
  – Music Segmentation.
• Solved using ML models for classification:
  – Token-level vs. Sequence-level.
Sequence Labeling
• Sentence Segmentation:
Mr. Burns is Homer Simpson’s boss. He is very rich.
• Tokenization:
Mr. Burns is Homer Simpson’s boss. He is very rich.
(each position is labeled I or O to mark sentence and token boundaries)
Sequence Labeling
• Information Extraction:
  – Named Entity Recognition
Drug/O giant/O Pfizer/I Inc./I has/O reached/O an/O agreement/O to/O buy/O the/O
private/O biotechnology/O firm/O Rinat/I Neuroscience/I Corp./I
Sequence Labeling
• Information Extraction:
  – Text Segmentation into topical sections.

Vine covered cottage , near Contra Costa Hills . 2 bedroom house ,
modern kitchen and dishwasher . No pets allowed . $ 1050 / month

[Haghighi & Klein, NAACL ‘06]
Sequence Labeling
• Information Extraction:
  – Segmenting classifieds into topical sections:

Vine covered cottage , near Contra Costa Hills . 2 bedroom house ,
modern kitchen and dishwasher . No pets allowed . $ 1050 / month

  – Section labels: Features, Neighborhood, Size, Restrictions, Rent.

[Haghighi & Klein, NAACL ‘06]
Sequence Labeling
• Semantic Role Labeling:
  – For each clause, determine the semantic role played by each noun phrase that is an argument to the verb:

John drove Mary from Athens to Columbus in his Toyota Prius.
The hammer broke the window.

  • agent
  • patient
  • source
  • destination
  • instrument
Sequence Labeling
• DNA Analysis:
  – transcription factor binding sites.
  – promoters.
  – introns, exons, …
AATGCGCTAACGTTCGATACGAGATAGCCTAAGAGTCA
Sequence Labeling
• Music Analysis:
  – segmentation into “musical phrases”
[Romeo & Juliet, Nino Rota]
Sequence Labeling as Classification
1) Classify each token individually into one of a number of classes:
  – Token represented as a vector of features extracted from context.
  – To build the classification model, use general ML algorithms:
    • Maximum Entropy (i.e. Logistic Regression).
    • Support Vector Machines (SVMs).
    • Perceptrons.
    • Winnow.
    • Naïve Bayes, Bayesian Networks.
    • Decision Trees.
    • k-Nearest Neighbor, …
A Maximum Entropy Model for POS Tagging
• Represent each position i in the text as φ(t, h_i) = {φ_k(t, h_i)}:
  – t is the potential POS tag at position i.
  – h_i is the history/context of position i:

    h_i = {w_i, w_{i+1}, w_{i+2}, w_{i-1}, w_{i-2}, t_{i-1}, t_{i-2}}

  – φ(t, h_i) is a vector of features φ_k(t, h_i), for k = 1..K, e.g.:

    φ_k(t, h_i) = 1 if suffix(w_i) = “ing” and t = VBG, 0 otherwise

    (we want the corresponding w_k to be large)

• Represent the “unnormalized” score of a tag t as:

    score(t, h_i) = w^T φ(t, h_i) = Σ_{k=1..K} w_k φ_k(t, h_i)

[Ratnaparkhi, EMNLP’96]
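A sketch of how such indicator features and the linear score might be computed. The feature names and weights below are invented for illustration and are not Ratnaparkhi's actual templates:

```python
def features(tag, history):
    """Sparse binary feature vector phi(t, h_i), represented as the set
    of active feature names (illustrative templates only)."""
    w = history["w_i"]
    active = {f"word={w}&tag={tag}",
              f"prev_tag={history['t_prev']}&tag={tag}"}
    if w.endswith("ing"):
        active.add(f"suffix=ing&tag={tag}")
    return active

def score(weights, tag, history):
    # w^T phi(t, h_i): sum the weights of the active binary features.
    return sum(weights.get(f, 0.0) for f in features(tag, history))

# Hypothetical learned weights: the -ing suffix strongly indicates VBG.
weights = {"suffix=ing&tag=VBG": 2.0, "suffix=ing&tag=NN": 0.5}
h = {"w_i": "joining", "t_prev": "TO"}
```

Representing φ as a set of active names keeps the vector sparse; only features that fire at position i contribute to the score.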
A Maximum Entropy Model for POS Tagging [Ratnaparkhi, EMNLP’96]
A Maximum Entropy Model for POS Tagging [Ratnaparkhi, EMNLP’96]

(the slide shows the feature templates and the non-zero features for position 3)
A Maximum Entropy Model for POS Tagging [Ratnaparkhi, EMNLP’96]

(the slide shows the non-zero features for position 4)
A Maximum Entropy Model for POS Tagging
• How do we learn the weights w?
  – Train on manually annotated data (supervised learning) [Ratnaparkhi, EMNLP’96].
• What does it mean to “train w on an annotated corpus”?
  – Probabilistic Discriminative Models:
    • Maximum Entropy (Logistic Regression).
  – Distribution Free Methods:
    • (Average) Perceptrons [Collins, ACL 2002].
    • Support Vector Machines (SVMs).
A Maximum Entropy Model for POS Tagging
• Probabilistic Discriminative Model:
  ⇒ need to transform score(t, h_i) into a probability p(t | h_i):

    p(t | h_i) = exp(w^T φ(t, h_i)) / Σ_{t'} exp(w^T φ(t', h_i))

• Training using:
  – Maximum Likelihood (ML).
  – Maximum A Posteriori (MAP) with a Gaussian prior on w.
• Inference (i.e. Testing):

    t̂_i = argmax_{t ∈ T} p(t | h_i) = argmax_{t ∈ T} exp(w^T φ(t, h_i)) = argmax_{t ∈ T} w^T φ(t, h_i)

[Ratnaparkhi, EMNLP’96]
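The softmax normalization and the argmax shortcut can be sketched as follows, with toy scores standing in for w^T φ(t, h_i):

```python
import math

def softmax_tags(scores):
    """p(t | h_i) = exp(score(t, h_i)) / sum_t' exp(score(t', h_i))."""
    exps = {t: math.exp(s) for t, s in scores.items()}
    z = sum(exps.values())
    return {t: e / z for t, e in exps.items()}

def best_tag(scores):
    # The normalizer z is the same for every tag, and exp() is monotone,
    # so argmax over p(t | h_i) equals argmax over the raw linear scores.
    return max(scores, key=scores.get)

scores = {"VBG": 2.0, "NN": 0.5, "DT": 0.0}  # toy values
probs = softmax_tags(scores)
```

In practice the probabilities are only needed for training; at test time the raw scores suffice for picking the best tag.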
A Maximum Entropy Model for POS Tagging
• Inference: we need to do a forward traversal of the input sequence:
[Ratnaparkhi, EMNLP’96]
[Animation by Ray Mooney, UT Austin]
John saw the saw and decided to take it to the table.
→ classifier predicts NNP (for “John”)
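The forward traversal amounts to a greedy left-to-right loop in which earlier predictions feed the context of later ones. The stand-in classifier below is a toy, not a trained MaxEnt model:

```python
def greedy_tag(words, classify):
    """Forward traversal: tag left to right, feeding previous predictions
    back into the history of the next position."""
    tags = []
    for w in words:
        history = {"word": w, "prev_tags": tuple(tags[-2:])}
        tags.append(classify(history))
    return tags

def toy_classify(h):
    # Hypothetical rule: "saw" is VBD right after a proper noun, else NN.
    lexicon = {"John": "NNP", "the": "DT", "saw": None}
    t = lexicon.get(h["word"], "NN")
    if t is None:
        t = "VBD" if h["prev_tags"][-1:] == ("NNP",) else "NN"
    return t

tags = greedy_tag(["John", "saw", "the", "saw"], toy_classify)
```

Note how the two occurrences of "saw" receive different tags purely because of the previously predicted context.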
The classifier then moves forward one position at a time, each predicted tag becoming part of the context for the next prediction, until the sentence is fully tagged:

John/NNP saw/VBD the/DT saw/NN and/CC decided/VBD to/TO take/VB it/PRP to/IN the/DT table/NN .
A Maximum Entropy Model for POS Tagging
• Some POS tags would be easier to disambiguate backward; what can we do?
  – Use a backward traversal, with backward features … but then we lose the forward information.

[Ratnaparkhi, EMNLP’96]
[Animation by Ray Mooney, UT Austin]
Sequence Labeling as Classification
1) Classify each token individually into one of a number of classes.
2) Classify all tokens jointly into one of a number of classes:
  – Hidden Markov Models.
  – Conditional Random Fields.
  – Structural SVMs.
  – Discriminatively Trained HMMs [Collins, EMNLP’02].
  – Bi-directional RNNs / LSTM-CRFs.

    t̂_1 … t̂_n = argmax_{t_1 … t_n} λ^T φ(t_1, …, t_n, w_1, …, w_n)
Hidden Markov Models
• Probabilistic Generative Models:
    t̂_1 … t̂_n = argmax_{t_1 … t_n} p(t_1, …, t_n | w_1, …, w_n)
             = argmax_{t_1 … t_n} p(w_1, …, w_n | t_1, …, t_n) p(t_1, …, t_n)

(the first factor uses the state emission probabilities, the second the state transition probabilities)
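Under this generative view, scoring one candidate tag sequence is just a product of emission and transition probabilities. A sketch with toy probabilities, assumed for illustration:

```python
def hmm_joint(words, tags, emit, trans, start="<S>"):
    """p(w_1..n | t_1..n) * p(t_1..n) under the HMM factorization:
    a product of emission probs p(w_i|t_i) and transition probs p(t_i|t_{i-1})."""
    p = 1.0
    prev = start
    for w, t in zip(words, tags):
        p *= trans[(prev, t)] * emit[(t, w)]
        prev = t
    return p

# Toy probability tables (assumed, not estimated from a corpus).
trans = {("<S>", "DT"): 0.5, ("DT", "NN"): 0.8}
emit = {("DT", "the"): 0.6, ("NN", "bill"): 0.01}
p = hmm_joint(["the", "bill"], ["DT", "NN"], emit, trans)
```

In a real tagger both tables would be estimated from tagged training data by counting and normalizing.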
Hidden Markov Models: Assumptions
1) A word event depends only on its POS tag:

    p(w_1, …, w_n | t_1, …, t_n) = ∏_{i=1..n} p(w_i | t_i)

2) A tag event depends only on the previous tag:

    p(t_1, …, t_n) = ∏_{i=1..n} p(t_i | t_{i-1})

⇒ POS tagging is:

    t̂_1 … t̂_n = argmax_{t_1 … t_n} ∏_{i=1..n} p(w_i | t_i) p(t_i | t_{i-1})
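This argmax over all tag sequences can be computed efficiently with dynamic programming (the Viterbi algorithm) rather than by enumerating the exponentially many sequences. A minimal sketch with assumed toy probabilities:

```python
def viterbi(words, tagset, emit, trans, start="<S>"):
    """Viterbi decoding: argmax over tag sequences of
    prod_i p(w_i|t_i) * p(t_i|t_{i-1}), in O(n * |T|^2) time."""
    # best[t] = (probability of the best tag path ending in t, that path)
    best = {t: (trans.get((start, t), 0.0) * emit.get((t, words[0]), 0.0), [t])
            for t in tagset}
    for w in words[1:]:
        new = {}
        for t in tagset:
            prob, path = max(((best[p][0] * trans.get((p, t), 0.0), best[p][1])
                              for p in tagset), key=lambda x: x[0])
            new[t] = (prob * emit.get((t, w), 0.0), path + [t])
        best = new
    prob, path = max(best.values(), key=lambda x: x[0])
    return path, prob

# Toy tables, assumed for illustration only.
tagset = {"DT", "NN", "VB"}
trans = {("<S>", "DT"): 0.6, ("DT", "NN"): 0.7, ("DT", "VB"): 0.1, ("NN", "VB"): 0.4}
emit = {("DT", "the"): 0.5, ("NN", "bill"): 0.02, ("VB", "bill"): 0.01}
best_path, best_score = viterbi(["the", "bill"], tagset, emit, trans)
```

Unlike the greedy left-to-right classifier, Viterbi considers every tag sequence jointly, so an early low-probability choice can still win if it enables much better choices later.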
Interlude
Tales of HMMs
Structured Data
• For many applications, the i.i.d. assumption does not hold:
  – pixels in images of real objects.
  – hyperlinked web pages.
  – cross-citations in scientific papers.
  – entities in social networks.
  – sequences of words/letters in text.
  – successive time frames in speech.
  – sequences of base pairs in DNA.
  – musical notes in a tonal melody.
  – daily values of a particular stock.
⇒ Many of these (text, speech, DNA, melodies, stock values) are Sequential Data.
Probabilistic Graphical Models
• PGMs use a graph for compactly:
  1. Encoding a complex distribution over a multi-dimensional space.
  2. Representing a set of independencies that hold in the distribution.
  – Properties 1 and 2 are, in a “deep sense”, equivalent.
• Probabilistic Graphical Models:
  – Directed:
    • i.e. Bayesian Networks, a.k.a. Belief Networks.
  – Undirected:
    • i.e. Markov Random Fields.
Probabilistic Graphical Models
• Directed PGMs:
  – Bayesian Networks:
    • Dynamic Bayesian Networks:
      – State Observation Models:
        » Hidden Markov Models.
        » Linear Dynamical Systems (Kalman filters).
• Undirected PGMs:
  – Markov Random Fields (MRF):
    • Conditional Random Fields (CRF):
      – Sequential CRFs.
Bayesian Networks
• A Bayesian Network structure G is a directed acyclic graph whose nodes X1, X2, ..., Xn represent random variables and whose edges correspond to “direct influences” between nodes:
  – Let Pa(Xi) denote the parents of Xi in G;
  – Let NonDescend(Xi) denote the variables in the graph that are not descendants of Xi.
  – Then G encodes the following set of conditional independence assumptions, called the local independencies:

    For each Xi in G:  Xi ⊥ NonDescend(Xi) | Pa(Xi)
Bayesian Networks
1. Because Xi ⊥ NonDescend(Xi) | Pa(Xi), it follows that:

   P(X1, X2, ..., Xn) = ∏_{i=1}^{n} P(Xi | Pa(Xi))

2. More generally, d-separation:
   1. Two sets of nodes X and Y are conditionally independent given a set of nodes E (X ⊥ Y | E) if X and Y are d-separated by E.
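A quick numerical sketch of the chain-rule factorization P(X1, …, Xn) = ∏ P(Xi | Pa(Xi)) may help; the three-node network A → B, A → C and all its probability tables below are hypothetical, made-up values:

```python
# Sketch: joint probability via the BN factorization P(X1..Xn) = prod_i P(Xi | Pa(Xi)).
# Hypothetical 3-node network A -> B, A -> C, so Pa(B) = Pa(C) = {A}.

p_a = {0: 0.6, 1: 0.4}                      # P(A)
p_b_given_a = {0: {0: 0.7, 1: 0.3},         # P(B | A)
               1: {0: 0.2, 1: 0.8}}
p_c_given_a = {0: {0: 0.9, 1: 0.1},         # P(C | A)
               1: {0: 0.5, 1: 0.5}}

def joint(a, b, c):
    """P(A=a, B=b, C=c) = P(a) * P(b|a) * P(c|a), per the local independencies."""
    return p_a[a] * p_b_given_a[a][b] * p_c_given_a[a][c]

# Because each factor is a proper conditional distribution, the joint
# sums to 1 over all assignments:
total = sum(joint(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1))
```

The factorization needs only 1 + 2 + 2 parameters here instead of 2³ − 1 for an explicit joint table, which is the “compact encoding” the slide refers to.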
Sequential Data
Q: How can we model sequential data?
1) Ignore sequential aspects and treat the observations as i.i.d.
2) Relax the i.i.d. assumption by using a Markov model.
1)  x1   …   xt-1   xt   xt+1   …   xT   (no edges: independent observations)

2)  x1 → … → xt-1 → xt → xt+1 → … → xT   (first-order Markov chain)
Markov Models
• X = x1, …, xT is a sequence of random variables.
• S = {s1, …, sN} is a state space, i.e. xt takes values from S.

1) Limited Horizon:
   P(x_{t+1} = s_k | x_1, …, x_t) = P(x_{t+1} = s_k | x_t)

2) Stationarity:
   P(x_{t+1} = s_k | x_t) = P(x_2 = s_k | x_1)

⇒ X is said to be a Markov chain.
Markov Models: Parameters
• S = {s1, …, sN} are the visible states.

• P = {pi} are the initial state probabilities:
   p_i = P(x_1 = s_i)

• A = {aij} are the state transition probabilities:
   a_ij = P(x_{t+1} = s_j | x_t = s_i)

  P→ x1 —A→ … —A→ xt-1 —A→ xt —A→ xt+1 —A→ … —A→ xT
Markov Models as DBNs
• A Markov Model is a Dynamic Bayesian Network:
  1. B0 = P is the initial distribution over states:

       P→ x1

  2. B→ = A is the 2-time-slice Bayesian Network (2-TBN):

       xt —A→ xt+1

  – The unrolled DBN (Markov model) over T time steps:

       P→ x1 —A→ … —A→ xt-1 —A→ xt —A→ xt+1 —A→ … —A→ xT
Markov Models: Inference
p(X) = p(x1, …, xT)
     = p(x1) ∏_{t=1}^{T-1} P(x_{t+1} | x_t)
     = π_{x1} ∏_{t=1}^{T-1} a_{x_t x_{t+1}}

  P→ x1 —A→ … —A→ xt-1 —A→ xt —A→ xt+1 —A→ … —A→ xT

• Exercise: compute p(t,a,p)
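As a sketch of the product p(X) = π_{x1} ∏ a_{x_t x_{t+1}}, here is a toy chain over the states {'t', 'a', 'p'} of the exercise; every probability value below is made up:

```python
# Sketch: p(X) = pi_{x1} * prod_{t} a_{x_t x_{t+1}} for a small, made-up
# Markov chain over the states {'t', 'a', 'p'}.

pi = {'t': 0.3, 'a': 0.5, 'p': 0.2}               # initial probabilities P
A = {'t': {'t': 0.1, 'a': 0.6, 'p': 0.3},         # transition probabilities A
     'a': {'t': 0.4, 'a': 0.2, 'p': 0.4},         # (each row sums to 1)
     'p': {'t': 0.5, 'a': 0.4, 'p': 0.1}}

def chain_prob(xs):
    """Probability of the state sequence xs under (pi, A)."""
    p = pi[xs[0]]
    for prev, cur in zip(xs, xs[1:]):
        p *= A[prev][cur]
    return p

p_tap = chain_prob(['t', 'a', 'p'])   # pi_t * a_ta * a_ap
```

Under these made-up parameters, p(t,a,p) = 0.3 · 0.6 · 0.4 = 0.072; with real parameters the same two-line loop answers the exercise.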
mth Order Markov Models
• First order Markov model:

   p(X) = p(x1) ∏_{t=1}^{T-1} P(x_{t+1} | x_t)

• Second order Markov model:

   p(X) = p(x1) p(x2 | x1) ∏_{t=2}^{T-1} P(x_{t+1} | x_t, x_{t-1})

• mth order Markov model:

   p(X) = p(x1) p(x2 | x1) … p(x_m | x_1, …, x_{m-1}) ∏_{t=m}^{T-1} P(x_{t+1} | x_{t-m+1}, …, x_t)
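A minimal sketch of a higher-order chain, here second order (m = 2): each next state is conditioned on the previous two. The two-letter alphabet and all table values are made up:

```python
# Sketch: a 2nd-order Markov model over letters {'a', 'b'}; all numbers made up.

p1 = {'a': 0.5, 'b': 0.5}                              # p(x1)
p2 = {('a', 'a'): 0.4, ('a', 'b'): 0.6,                # p(x2 | x1)
      ('b', 'a'): 0.7, ('b', 'b'): 0.3}
p3 = {('a', 'a', 'a'): 0.1, ('a', 'a', 'b'): 0.9,      # p(x_{t+1} | x_{t-1}, x_t)
      ('a', 'b', 'a'): 0.8, ('a', 'b', 'b'): 0.2,
      ('b', 'a', 'a'): 0.5, ('b', 'a', 'b'): 0.5,
      ('b', 'b', 'a'): 0.6, ('b', 'b', 'b'): 0.4}

def second_order_prob(xs):
    """p(X) = p(x1) p(x2|x1) * prod_{t>=3} p(x_t | x_{t-2}, x_{t-1})."""
    p = p1[xs[0]] * p2[(xs[0], xs[1])]
    for t in range(2, len(xs)):
        p *= p3[(xs[t - 2], xs[t - 1], xs[t])]
    return p

p_abb = second_order_prob(['a', 'b', 'b'])   # 0.5 * 0.6 * 0.2
```

The m leading factors handle the first m states; after that every factor conditions on exactly the previous m states, mirroring the general formula above.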
Markov Models
• (Visible) Markov Models:
  – Developed by Andrei A. Markov [Markov, 1913]:
    • modeling the letter sequences in Pushkin’s “Eugene Onegin”.

• Hidden Markov Models:
  – The states are hidden (latent) variables.
  – The states probabilistically generate surface events, or observations.
  – Efficient training using Expectation Maximization (EM):
    • Maximum Likelihood (ML) when tagged data is available.
  – Efficient inference using the Viterbi algorithm.
Hidden Markov Models (HMMs)
• Probabilistic directed graphical models:
  – Hidden states (shown in brown).
  – Visible observations (shown in lavender).
  – Arrows model probabilistic (in)dependencies.

  x1 → … → xt-1 → xt → xt+1 → … → xT
  ↓          ↓      ↓     ↓        ↓
  o1         ot-1   ot    ot+1     oT
HMMs: Parameters
• S = {s1, …, sN} is the set of states.
• K = {k1, …, kM} = {1, …, M} is the observation alphabet.
• X = x1, …, xT is a sequence of states.
• O = o1, …, oT is a sequence of observations.

  x1 → … → xt-1 → xt → xt+1 → … → xT
  ↓          ↓      ↓     ↓        ↓
  o1         ot-1   ot    ot+1     oT
HMMs: Parameters
• P = {pi}, i∈S, are the initial state probabilities.

• A = {aij}, i,j∈S, are the state transition probabilities.

• B = {bik}, i∈S, k∈K, are the symbol emission probabilities:

   b_ik = P(o_t = k | x_t = s_i)

  P→ x1 —A→ … —A→ xt-1 —A→ xt —A→ xt+1 —A→ … —A→ xT
      ↓B        ↓B       ↓B      ↓B          ↓B
      o1        ot-1     ot      ot+1        oT
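The full parameter set µ = (A, B, P) can be sketched concretely; the toy model below has N = 2 hidden states and M = 3 observation symbols, and every number is made up:

```python
# Sketch: HMM parameters mu = (A, B, Pi) for a toy model with N = 2 hidden
# states and M = 3 observation symbols (all values are made up).
Pi = [0.6, 0.4]                    # Pi[i]   = P(x1 = s_i)
A = [[0.7, 0.3],                   # A[i][j] = P(x_{t+1} = s_j | x_t = s_i)
     [0.4, 0.6]]
B = [[0.5, 0.4, 0.1],              # B[i][k] = P(o_t = k | x_t = s_i)
     [0.1, 0.3, 0.6]]

# Sanity check: Pi and every row of A and B is a probability distribution.
rows_ok = (abs(sum(Pi) - 1.0) < 1e-9
           and all(abs(sum(row) - 1.0) < 1e-9 for row in A)
           and all(abs(sum(row) - 1.0) < 1e-9 for row in B))
```

Row i of A is the transition distribution out of state s_i, and row i of B is the emission distribution of s_i, matching the a_ij and b_ik definitions above.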
Hidden Markov Models as DBNs
• A Hidden Markov Model is a Dynamic Bayesian Network:
  1. B0 = P is the initial distribution over states:

       P→ x1

  2. B→ = A is the 2-time-slice Bayesian Network (2-TBN), with emissions B:

       xt —A→ xt+1
                 ↓B
                ot+1

  – The unrolled DBN (HMM) over T time steps (prev. slide).
HMMs: Inference and Training
• Three fundamental questions:
  1) Given a model µ = (A, B, P), compute the probability of a given observation sequence, i.e. p(O|µ) (Forward-Backward).
  2) Given a model µ and an observation sequence O, compute the most likely hidden state sequence (Viterbi):

       X̂ = argmax_X P(X | O, µ)

  3) Given an observation sequence O, find the model µ = (A, B, P) that best explains the observed data (EM).
     • Given observation and state sequences O, X, find µ (ML).
HMMs: Decoding
1) Given a model µ = (A, B, P), compute the probability of a given observation sequence O = o1, …, oT, i.e. p(O|µ).

  x1 → … → xt-1 → xt → xt+1 → … → xT
  ↓          ↓      ↓     ↓        ↓
  o1         ot-1   ot    ot+1     oT
HMMs: Decoding

  x1 → … → xt-1 → xt → xt+1 → … → xT
  ↓          ↓      ↓     ↓        ↓
  o1         ot-1   ot    ot+1     oT

P(O | X, µ) = b_{x1 o1} b_{x2 o2} … b_{xT oT}

P(X | µ) = π_{x1} a_{x1 x2} a_{x2 x3} … a_{x_{T-1} x_T}

P(O, X | µ) = P(O | X, µ) P(X | µ)

p(O | µ) = Σ_X P(O | X, µ) P(X | µ)
HMMs: Decoding
  x1 → … → xt-1 → xt → xt+1 → … → xT
  ↓          ↓      ↓     ↓        ↓
  o1         ot-1   ot    ot+1     oT

p(O | µ) = Σ_{x1 … xT} π_{x1} b_{x1 o1} ∏_{t=1}^{T-1} a_{x_t x_{t+1}} b_{x_{t+1} o_{t+1}}

Time complexity?
HMMs: Forward Procedure
• Define:

   α_i(t) = P(o1 … ot, x_t = i | µ)

• Then the solution is:

   p(O | µ) = Σ_{i=1}^{N} α_i(T)

  x1 → … → xt-1 → xt → xt+1 → … → xT
  ↓          ↓      ↓     ↓        ↓
  o1         ot-1   ot    ot+1     oT
HMMs: Decoding

  x1 → … → xt-1 → xt → xt+1 → … → xT
  ↓          ↓      ↓     ↓        ↓
  o1         ot-1   ot    ot+1     oT

α_j(t+1) = P(o1 … ot+1, x_{t+1} = j)
         = P(o1 … ot+1 | x_{t+1} = j) P(x_{t+1} = j)
         = P(o1 … ot | x_{t+1} = j) P(o_{t+1} | x_{t+1} = j) P(x_{t+1} = j)
         = P(o1 … ot, x_{t+1} = j) P(o_{t+1} | x_{t+1} = j)
HMMs: Decoding

  x1 → … → xt-1 → xt → xt+1 → … → xT
  ↓          ↓      ↓     ↓        ↓
  o1         ot-1   ot    ot+1     oT

α_j(t+1) = P(o1 … ot, x_{t+1} = j) P(o_{t+1} | x_{t+1} = j)
         = Σ_{i=1…N} P(o1 … ot, x_t = i, x_{t+1} = j) P(o_{t+1} | x_{t+1} = j)
         = Σ_{i=1…N} P(o1 … ot, x_{t+1} = j | x_t = i) P(x_t = i) P(o_{t+1} | x_{t+1} = j)
         = Σ_{i=1…N} P(o1 … ot, x_t = i) P(x_{t+1} = j | x_t = i) P(o_{t+1} | x_{t+1} = j)
         = Σ_{i=1…N} α_i(t) a_{ij} b_{j o_{t+1}}
The Forward Procedure
1. Initialization:

   α_i(1) = π_i b_{i o_1},   1 ≤ i ≤ N

2. Recursion:

   α_j(t+1) = Σ_{i=1…N} α_i(t) a_{ij} b_{j o_{t+1}},   1 ≤ j ≤ N, 1 ≤ t < T

3. Termination:

   p(O | µ) = Σ_{i=1}^{N} α_i(T)
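The three steps above can be sketched directly in code; the 2-state, 3-symbol model and the observation sequence below are the same made-up toy values used earlier, not values from the lecture:

```python
# Sketch of the Forward Procedure for p(O | mu), for a toy 2-state,
# 3-symbol HMM (all parameter values are made up).
Pi = [0.6, 0.4]
A = [[0.7, 0.3],
     [0.4, 0.6]]
B = [[0.5, 0.4, 0.1],
     [0.1, 0.3, 0.6]]
O = [0, 2, 1]                                # observation sequence over {0, 1, 2}

def forward(Pi, A, B, O):
    N, T = len(Pi), len(O)
    alpha = [[0.0] * N for _ in range(T)]
    for i in range(N):                       # 1. initialization
        alpha[0][i] = Pi[i] * B[i][O[0]]
    for t in range(T - 1):                   # 2. recursion
        for j in range(N):
            alpha[t + 1][j] = (sum(alpha[t][i] * A[i][j] for i in range(N))
                               * B[j][O[t + 1]])
    return sum(alpha[T - 1])                 # 3. termination

likelihood = forward(Pi, A, B, O)
```

The trellis makes the cost O(N²·T), instead of the O(N^T) cost of summing over all state sequences explicitly.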
The Forward Procedure: Trellis Computation
(Trellis figure: each target state sj at time t+1 collects a contribution α_i(t) · a_ij · b_{j o_{t+1}} from every source state s1 … sN at time t, and α_j(t+1) is their sum Σ.)
HMMs: Backward Procedure
• Define:

   β_i(t) = P(o_{t+1} … o_T | x_t = i, µ)

• Then the solution is:

   p(O | µ) = Σ_{i=1}^{N} π_i b_{i o_1} β_i(1)

  x1 → … → xt-1 → xt → xt+1 → … → xT
  ↓          ↓      ↓     ↓        ↓
  o1         ot-1   ot    ot+1     oT
The Backward Procedure
1. Initialization:

   β_i(T) = 1,   1 ≤ i ≤ N

2. Recursion:

   β_i(t) = Σ_{j=1…N} a_{ij} b_{j o_{t+1}} β_j(t+1),   1 ≤ i ≤ N, 1 ≤ t < T

3. Termination:

   p(O | µ) = Σ_{i=1}^{N} π_i b_{i o_1} β_i(1)
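The backward recursion can be sketched the same way; the parameters below are the same made-up toy model as before:

```python
# Sketch of the Backward Procedure for p(O | mu), on a toy 2-state,
# 3-symbol HMM (all parameter values are made up).
Pi = [0.6, 0.4]
A = [[0.7, 0.3],
     [0.4, 0.6]]
B = [[0.5, 0.4, 0.1],
     [0.1, 0.3, 0.6]]
O = [0, 2, 1]

def backward(Pi, A, B, O):
    N, T = len(Pi), len(O)
    beta = [[1.0] * N for _ in range(T)]     # 1. initialization: beta_i(T) = 1
    for t in range(T - 2, -1, -1):           # 2. recursion, right to left
        for i in range(N):
            beta[t][i] = sum(A[i][j] * B[j][O[t + 1]] * beta[t + 1][j]
                             for j in range(N))
    # 3. termination
    return sum(Pi[i] * B[i][O[0]] * beta[0][i] for i in range(N))

likelihood = backward(Pi, A, B, O)
```

On the same toy model and observation sequence, this returns exactly the likelihood that the forward procedure computes, as the combination identity requires.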
HMMs: Decoding
• Forward Procedure:

   p(O | µ) = Σ_{i=1}^{N} α_i(T)

• Backward Procedure:

   p(O | µ) = Σ_{i=1}^{N} π_i b_{i o_1} β_i(1)

• Combination:

   p(O | µ) = Σ_{i=1}^{N} α_i(t) β_i(t),   for any 1 ≤ t ≤ T

  x1 → … → xt-1 → xt → xt+1 → … → xT
  ↓          ↓      ↓     ↓        ↓
  o1         ot-1   ot    ot+1     oT
HMMs: Inference and Training
• Three fundamental questions:
  1) Given a model µ = (A, B, P), compute the probability of a given observation sequence, i.e. p(O|µ) (Forward-Backward).
  2) Given a model µ and an observation sequence O, compute the most likely hidden state sequence (Viterbi):

       X̂ = argmax_X P(X | O, µ)

  3) Given an observation sequence O, find the model µ = (A, B, P) that best explains the observed data (EM).
     • Given observation and state sequences O, X, find µ (ML).
Best State Sequence with Viterbi Algorithm
  x1 → … → xt-1 → xt → xt+1 → … → xT
  ↓          ↓      ↓     ↓        ↓
  o1         ot-1   ot    ot+1     oT

X̂ = argmax_X p(X | O, µ)
  = argmax_X p(X, O | µ)
  = argmax_{x1,…,xT} p(x1, …, xT, o1, …, oT | µ)

Time complexity?
The Viterbi Algorithm
  x1 → … → xt-1 → xt → xt+1 → … → xT
  ↓          ↓      ↓     ↓        ↓
  o1         ot-1   ot    ot+1     oT

X̂ = argmax_{x1,…,xT} p(x1, …, xT, o1, …, oT | µ)

p(X̂) = max_{x1,…,xT} p(x1, …, xT, o1, …, oT | µ)

• The probability of the most probable path that leads to xt = j:

   δ_j(t) = max_{x1 … x_{t-1}} p(x1 … x_{t-1}, o1 … o_{t-1}, x_t = j, o_t)

p(X̂) = max_{1≤j≤N} δ_j(T)
The Viterbi Algorithm

• The probability of the most probable path that leads to xt = j:

   δ_j(t) = max_{x1 … x_{t-1}} p(x1 … x_{t-1}, o1 … o_{t-1}, x_t = j, o_t)

• It can be shown that:

   δ_j(t+1) = max_{1≤i≤N} δ_i(t) a_{ij} b_{j o_{t+1}}

   Compare with:  α_j(t+1) = Σ_{i=1…N} α_i(t) a_{ij} b_{j o_{t+1}}

  x1 → … → xt-1 → xt → xt+1 → … → xT
  ↓          ↓      ↓     ↓        ↓
  o1         ot-1   ot    ot+1     oT
The Viterbi Algorithm: Trellis Computation
(Trellis figure: each target state sj at time t+1 receives a score δ_i(t) · a_ij · b_{j o_{t+1}} from every source state s1 … sN at time t, and δ_j(t+1) is their max; the max replaces the Σ of the forward procedure.)
The Viterbi Algorithm
1. Initialization:

   δ_j(1) = π_j b_{j o_1}
   ψ_j(1) = 0

2. Recursion:

   δ_j(t+1) = max_{1≤i≤N} δ_i(t) a_{ij} b_{j o_{t+1}}
   ψ_j(t+1) = argmax_{1≤i≤N} δ_i(t) a_{ij} b_{j o_{t+1}}

3. Termination:

   p(X̂) = max_{1≤j≤N} δ_j(T)
   x̂_T = argmax_{1≤j≤N} δ_j(T)

4. State sequence backtracking:

   x̂_t = ψ_{t+1}(x̂_{t+1})

Time complexity?
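The four steps above can be sketched in code; the parameters and the observation sequence are the same made-up toy values used for the forward procedure:

```python
# Sketch of the Viterbi algorithm on a toy 2-state, 3-symbol HMM
# (all parameter values are made up).
Pi = [0.6, 0.4]
A = [[0.7, 0.3],
     [0.4, 0.6]]
B = [[0.5, 0.4, 0.1],
     [0.1, 0.3, 0.6]]
O = [0, 2, 1]

def viterbi(Pi, A, B, O):
    N, T = len(Pi), len(O)
    delta = [[0.0] * N for _ in range(T)]
    psi = [[0] * N for _ in range(T)]
    for j in range(N):                                   # 1. initialization
        delta[0][j] = Pi[j] * B[j][O[0]]
    for t in range(T - 1):                               # 2. recursion
        for j in range(N):
            scores = [delta[t][i] * A[i][j] for i in range(N)]
            psi[t + 1][j] = max(range(N), key=lambda i: scores[i])
            delta[t + 1][j] = max(scores) * B[j][O[t + 1]]
    best_prob = max(delta[T - 1])                        # 3. termination
    path = [max(range(N), key=lambda j: delta[T - 1][j])]
    for t in range(T - 1, 0, -1):                        # 4. backtracking
        path.append(psi[t][path[-1]])
    return list(reversed(path)), best_prob

best_path, best_prob = viterbi(Pi, A, B, O)
```

Like the forward procedure, it runs in O(N²·T): the only changes are max/argmax in place of the sum, plus the ψ backpointers for recovering X̂.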
HMMs: Inference and Training
• Three fundamental questions:
  1) Given a model µ = (A, B, P), compute the probability of a given observation sequence, i.e. p(O|µ) (Forward-Backward).
  2) Given a model µ and an observation sequence O, compute the most likely hidden state sequence (Viterbi).
  3) Given an observation sequence O, find the model µ = (A, B, P) that best explains the observed data (EM).
     • Given observation and state sequences O, X, find µ (ML).
Parameter Estimation with Maximum Likelihood
• Given observation and state sequences O, X, find µ = (A, B, P):

   µ̂ = argmax_µ p(O, X | µ)

• Initial state probabilities:

   π_i = p(x_1 = s_i)        π̂_i = C(x_1 = s_i) / |X|

• Transition probabilities:

   a_ij = p(x_{t+1} = s_j | x_t = s_i)        â_ij = C(x_t = s_i, x_{t+1} = s_j) / C(x_t = s_i)

• Emission probabilities:

   b_ik = p(o_t = k | x_t = s_i)        b̂_ik = C(o_t = k, x_t = s_i) / C(x_t = s_i)

Exercise: rewrite to use Laplace smoothing.
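The count-ratio estimates can be sketched with plain counters; the tag and word sequences below are a toy made-up example, and no smoothing is applied (the Laplace-smoothed version is left as the exercise):

```python
from collections import Counter

# Sketch: ML (relative-frequency) estimates from one tagged sequence (X, O).
# The tags and words are made up; no smoothing is applied.
X = ['N', 'V', 'N', 'N', 'V']                 # state (tag) sequence
O = ['dog', 'runs', 'cat', 'dog', 'sleeps']   # observation sequence

trans = Counter(zip(X, X[1:]))       # C(x_t = s_i, x_{t+1} = s_j)
from_counts = Counter(X[:-1])        # C(x_t = s_i) over transition positions
emit = Counter(zip(X, O))            # C(o_t = k, x_t = s_i)
state_counts = Counter(X)            # C(x_t = s_i) over all positions

a_hat = {(i, j): n / from_counts[i] for (i, j), n in trans.items()}
b_hat = {(i, k): n / state_counts[i] for (i, k), n in emit.items()}
```

For example, 'N' is followed by 'V' in 2 of its 3 transition positions, so â_{N,V} = 2/3, and 'N' emits "dog" in 2 of its 3 occurrences, so b̂_{N,dog} = 2/3.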
Parameter Estimation with Expectation Maximization
• Given observation sequences O, find µ = (A, B, P):

   µ̂ = argmax_µ p(O | µ)

• There is no known analytic method to find the solution.

• Locally maximize p(O|µ) using iterative hill-climbing:
  ⇒ the Baum-Welch or Forward-Backward algorithm:
   – Given a model µ and an observation sequence, update the model parameters to µ̂ to better fit the observations.
   – A special case of the Expectation Maximization method.
The Baum-Welch Algorithm (EM)
[E] Assume µ is known, compute the “hidden” parameters ξ, γ:

1) ξ_t(i, j) = the probability of being in state si at time t and state sj at time t+1:

   ξ_t(i, j) = α_i(t) a_ij b_{j o_{t+1}} β_j(t+1) / Σ_{m=1…N} α_m(t) β_m(t)

2) γ_t(i) = the probability of being in state si at time t:

   γ_t(i) = Σ_{j=1…N} ξ_t(i, j)

• Σ_{t=1}^{T-1} ξ_t(i, j) = expected number of transitions from si to sj.
• Σ_{t=1}^{T-1} γ_t(i) = expected number of transitions from si.
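The E-step quantities can be sketched on top of the forward/backward recursions; the 2-state, 3-symbol model below is the same made-up toy model used earlier:

```python
# Sketch of the E-step quantities xi_t(i, j) and gamma_t(i) for a toy
# 2-state, 3-symbol HMM (all parameter values are made up).
Pi = [0.6, 0.4]
A = [[0.7, 0.3],
     [0.4, 0.6]]
B = [[0.5, 0.4, 0.1],
     [0.1, 0.3, 0.6]]
O = [0, 2, 1]
N, T = len(Pi), len(O)

# Forward and backward values, as in the earlier slides.
alpha = [[0.0] * N for _ in range(T)]
beta = [[1.0] * N for _ in range(T)]
for i in range(N):
    alpha[0][i] = Pi[i] * B[i][O[0]]
for t in range(T - 1):
    for j in range(N):
        alpha[t + 1][j] = (sum(alpha[t][i] * A[i][j] for i in range(N))
                           * B[j][O[t + 1]])
for t in range(T - 2, -1, -1):
    for i in range(N):
        beta[t][i] = sum(A[i][j] * B[j][O[t + 1]] * beta[t + 1][j]
                         for j in range(N))

# xi_t(i, j) = alpha_i(t) a_ij b_{j o_{t+1}} beta_j(t+1) / sum_m alpha_m(t) beta_m(t)
xi = []
for t in range(T - 1):
    z = sum(alpha[t][m] * beta[t][m] for m in range(N))   # = p(O | mu)
    xi.append([[alpha[t][i] * A[i][j] * B[j][O[t + 1]] * beta[t + 1][j] / z
                for j in range(N)] for i in range(N)])

# gamma_t(i) = sum_j xi_t(i, j)
gamma = [[sum(xi[t][i]) for i in range(N)] for t in range(T - 1)]
```

Each ξ_t(i, j) matrix and each γ_t vector sums to 1, since both are posterior distributions over states given the full observation sequence.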
The Baum-Welch Algorithm
[M] Re-estimate µ̂ using expectations of ξ, γ:

   π̂_i = γ_1(i)

   â_ij = Σ_{t=1}^{T-1} ξ_t(i, j) / Σ_{t=1}^{T-1} γ_t(i)

   b̂_ik = Σ_{t: o_t = k} γ_t(i) / Σ_{t=1}^{T} γ_t(i)

• Baum has proven that p(O | µ̂) ≥ p(O | µ).
The Baum-Welch Algorithm
1. Start with some (random) model µ = (A, B, P).

2. [E step] Compute ξ_t(i, j), γ_t(i) and their expectations.

3. [M step] Compute the ML estimate µ̂.

4. Set µ = µ̂ and repeat from 2. until convergence.
HMMs
• Three fundamental questions:
  1) Given a model µ = (A, B, P), compute the probability of a given observation sequence, i.e. p(O|µ) (Forward/Backward).
  2) Given a model µ and an observation sequence O, compute the most likely hidden state sequence (Viterbi).
  3) Given an observation sequence O, find the model µ = (A, B, P) that best explains the observed data (Baum-Welch, or EM).
     • Given observation and state sequences O, X, find µ (ML).
Supplemental Reading
• Sections 7.1, 7.2, 7.3, and 7.4 from Eisenstein.
• Chapter 8 in Jurafsky & Martin:
  – https://web.stanford.edu/~jurafsky/slp3/8.pdf
• Appendix A in Jurafsky & Martin:
  – https://web.stanford.edu/~jurafsky/slp3/A.pdf
POS Disambiguation: Context
“Here's a movie where you forgive the preposterous because it takes you to the perplexing.”
[Source Code, by Roger Ebert, March 31, 2011]
“The good, the bad, and the ugly”
“The young and the restless”
“The bold and the beautiful”