Conditional Random Fields. Sequence Labeling: The Problem Given a sequence (in NLP, words), assign...
![Page 1: Conditional Random Fields. Sequence Labeling: The Problem Given a sequence (in NLP, words), assign appropriate labels to each word. For example, POS tagging:](https://reader036.fdocuments.us/reader036/viewer/2022062421/56649e0a5503460f94af24c6/html5/thumbnails/1.jpg)
Conditional Random Fields
![Page 2: Conditional Random Fields. Sequence Labeling: The Problem Given a sequence (in NLP, words), assign appropriate labels to each word. For example, POS tagging:](https://reader036.fdocuments.us/reader036/viewer/2022062421/56649e0a5503460f94af24c6/html5/thumbnails/2.jpg)
Sequence Labeling: The Problem
• Given a sequence (in NLP, words), assign appropriate labels to each word.
• For example, POS tagging:
The cat sat on the mat .
DT  NN  VBD IN DT  NN  .
![Page 3: Conditional Random Fields. Sequence Labeling: The Problem Given a sequence (in NLP, words), assign appropriate labels to each word. For example, POS tagging:](https://reader036.fdocuments.us/reader036/viewer/2022062421/56649e0a5503460f94af24c6/html5/thumbnails/3.jpg)
Sequence Labeling: The Problem
• Given a sequence (in NLP, words), assign appropriate labels to each word.
• Another example, partial parsing (aka chunking):
The  cat  sat  on   the  mat
B-NP I-NP B-VP B-PP B-NP I-NP
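As a rough illustration of how per-word labels encode chunks, the sketch below decodes the tags above into labeled spans. It assumes the common B-/I- convention (B- opens a chunk, I- continues it); the function name `bio_to_chunks` is hypothetical and not from the slides.

```python
def bio_to_chunks(words, tags):
    """Group words into (chunk_type, text) spans from B-/I- style tags."""
    chunks = []
    current_type, current_words = None, []
    for word, tag in zip(words, tags):
        starts_new = tag.startswith("B-") or (
            tag.startswith("I-") and tag[2:] != current_type
        )
        if starts_new:
            # Close the open chunk (if any) and start a new one.
            if current_words:
                chunks.append((current_type, " ".join(current_words)))
            current_type, current_words = tag[2:], [word]
        elif tag.startswith("I-"):
            # Continue the chunk that is currently open.
            current_words.append(word)
        else:
            # Any other tag (for example "O") is outside every chunk.
            if current_words:
                chunks.append((current_type, " ".join(current_words)))
            current_type, current_words = None, []
    if current_words:
        chunks.append((current_type, " ".join(current_words)))
    return chunks


words = "The cat sat on the mat".split()
tags = ["B-NP", "I-NP", "B-VP", "B-PP", "B-NP", "I-NP"]
chunks = bio_to_chunks(words, tags)
```

Here `chunks` groups "The cat" and "the mat" as NP spans, with "sat" and "on" as single-word VP and PP spans.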
![Page 4: Conditional Random Fields. Sequence Labeling: The Problem Given a sequence (in NLP, words), assign appropriate labels to each word. For example, POS tagging:](https://reader036.fdocuments.us/reader036/viewer/2022062421/56649e0a5503460f94af24c6/html5/thumbnails/4.jpg)
Sequence Labeling: The Problem
• Given a sequence (in NLP, words), assign appropriate labels to each word.
• Another example, relation extraction:
The   cat   sat   on    the   mat
B-Arg I-Arg B-Rel I-Rel B-Arg I-Arg
![Page 5: Conditional Random Fields. Sequence Labeling: The Problem Given a sequence (in NLP, words), assign appropriate labels to each word. For example, POS tagging:](https://reader036.fdocuments.us/reader036/viewer/2022062421/56649e0a5503460f94af24c6/html5/thumbnails/5.jpg)
The CRF Equation
• A CRF model consists of
 – F = <f1, …, fk>, a vector of “feature functions”
 – θ = <θ1, …, θk>, a vector of weights for each feature function.
• Let O = <o1, …, oT> be an observed sentence
• Let A = <a1, …, aT> be the latent variables.

$$P(A = y \mid O) = \frac{\exp\big(\theta \cdot F(y, O)\big)}{\sum_{y'} \exp\big(\theta \cdot F(y', O)\big)}$$

• This is the same as the Maximum Entropy equation!
![Page 6: Conditional Random Fields. Sequence Labeling: The Problem Given a sequence (in NLP, words), assign appropriate labels to each word. For example, POS tagging:](https://reader036.fdocuments.us/reader036/viewer/2022062421/56649e0a5503460f94af24c6/html5/thumbnails/6.jpg)
CRF Equation, standard format
• Note that the denominator depends on O, but not on y (it’s marginalizing over y).
• Typically, we write

$$P(A = y \mid O) = \frac{1}{Z(O)} \exp\big(\theta \cdot F(y, O)\big)$$

where

$$Z(O) = \sum_{y'} \exp\big(\theta \cdot F(y', O)\big)$$
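To make the normalizer concrete, here is a minimal sketch that evaluates the CRF equation by brute force on a three-word input. The label set, the two feature functions, and the weights are all made up for illustration; summing over every label sequence like this is only feasible for tiny examples.

```python
import itertools
import math

LABELS = ["D", "N", "V"]  # hypothetical label set


def features(y, obs):
    """Hypothetical feature vector F(y, O) with two counts."""
    # f1: how often a D label is immediately followed by an N label
    f_det_noun = sum(1 for a, b in zip(y, y[1:]) if a == "D" and b == "N")
    # f2: how often the word "the" is labeled D
    f_the_det = sum(1 for lab, w in zip(y, obs) if w.lower() == "the" and lab == "D")
    return [f_det_noun, f_the_det]


def score(theta, y, obs):
    """The dot product theta . F(y, O)."""
    return sum(t * f for t, f in zip(theta, features(y, obs)))


def partition(theta, obs):
    """Z(O): sum of exp(score) over every possible label sequence."""
    return sum(
        math.exp(score(theta, y, obs))
        for y in itertools.product(LABELS, repeat=len(obs))
    )


def prob(theta, y, obs):
    """P(A = y | O) = exp(theta . F(y, O)) / Z(O)."""
    return math.exp(score(theta, y, obs)) / partition(theta, obs)


theta = [1.5, 2.0]  # made-up weights
obs = ["the", "cat", "sat"]
total = sum(prob(theta, y, obs) for y in itertools.product(LABELS, repeat=len(obs)))
```

Because Z(O) sums over the same sequences that `prob` is evaluated on, `total` comes out as 1 up to floating-point error.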
![Page 7: Conditional Random Fields. Sequence Labeling: The Problem Given a sequence (in NLP, words), assign appropriate labels to each word. For example, POS tagging:](https://reader036.fdocuments.us/reader036/viewer/2022062421/56649e0a5503460f94af24c6/html5/thumbnails/7.jpg)
Making Structured Predictions
![Page 8: Conditional Random Fields. Sequence Labeling: The Problem Given a sequence (in NLP, words), assign appropriate labels to each word. For example, POS tagging:](https://reader036.fdocuments.us/reader036/viewer/2022062421/56649e0a5503460f94af24c6/html5/thumbnails/8.jpg)
Aside: Structured prediction vs. Text Classification
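Recall: max. ent. for text classification:

$$\arg\max_c P(A = c \mid O = \text{doc}) = \arg\max_c \frac{1}{Z(\text{doc})} \exp\big(\theta \cdot F(c, \text{doc})\big) = \arg\max_c \theta \cdot F(c, \text{doc})$$

CRFs for sequence labeling:

$$\arg\max_y P(A = y \mid O) = \arg\max_y \frac{1}{Z(O)} \exp\big(\theta \cdot F(y, O)\big) = \arg\max_y \theta \cdot F(y, O)$$

What’s the difference?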
![Page 9: Conditional Random Fields. Sequence Labeling: The Problem Given a sequence (in NLP, words), assign appropriate labels to each word. For example, POS tagging:](https://reader036.fdocuments.us/reader036/viewer/2022062421/56649e0a5503460f94af24c6/html5/thumbnails/9.jpg)
Aside: Structured prediction vs. Text Classification
Two (related) differences, both for the sake of efficiency:
1) Feature functions in CRFs are restricted to graph parts (described later)
2) We can’t do brute force to compute the argmax. Instead, we do Viterbi.
![Page 10: Conditional Random Fields. Sequence Labeling: The Problem Given a sequence (in NLP, words), assign appropriate labels to each word. For example, POS tagging:](https://reader036.fdocuments.us/reader036/viewer/2022062421/56649e0a5503460f94af24c6/html5/thumbnails/10.jpg)
Finding the Best Sequence
Best sequence is

$$\arg\max_y P(A = y \mid O) = \arg\max_y \frac{1}{Z(O)} \exp\big(\theta \cdot F(y, O)\big) = \arg\max_y \theta \cdot F(y, O)$$

Recall from HMM discussion: If there are K possible states for each yi variable, and N total yi variables, then there are $K^N$ possible settings for y. So brute force can’t find the best sequence. Instead, we resort to a Viterbi-like dynamic program.
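The blow-up is easy to see in a small sketch. The brute-force search below enumerates every label sequence and counts how many it scores; the scoring function `toy_score` is a made-up stand-in for θ·F(y, O).

```python
import itertools


def brute_force_argmax(score_fn, labels, n):
    """Try every one of the len(labels)**n label sequences of length n."""
    best_y, best_score, tried = None, float("-inf"), 0
    for y in itertools.product(labels, repeat=n):
        tried += 1
        s = score_fn(y)
        if s > best_score:
            best_y, best_score = y, s
    return best_y, tried


def toy_score(y):
    """Hypothetical score: one point for each adjacent pair of differing labels."""
    return sum(1.0 for a, b in zip(y, y[1:]) if a != b)


best, tried = brute_force_argmax(toy_score, ["A", "B", "C"], 5)
```

With K = 3 labels and N = 5 positions the loop already scores 3^5 = 243 sequences; with the 36 Penn Treebank tags and a 20-word sentence the count would be 36^20, which is why a dynamic program is needed.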
![Page 11: Conditional Random Fields. Sequence Labeling: The Problem Given a sequence (in NLP, words), assign appropriate labels to each word. For example, POS tagging:](https://reader036.fdocuments.us/reader036/viewer/2022062421/56649e0a5503460f94af24c6/html5/thumbnails/11.jpg)
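Viterbi Algorithm
[Figure: chain of label nodes A1 … At-1, At=j above observation nodes o1 … ot-1, ot, ot+1 … oT]

$$\delta_j(t) = \max_{y_1 \ldots y_{t-1}} \theta \cdot F(y_1 \ldots y_{t-1},\; o_1 \ldots o_{t-1},\; y_t = j,\; o_t)$$

The state sequence which maximizes the score of seeing the observations to time t-1, landing in state j at time t, and seeing the observation at time t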
![Page 12: Conditional Random Fields. Sequence Labeling: The Problem Given a sequence (in NLP, words), assign appropriate labels to each word. For example, POS tagging:](https://reader036.fdocuments.us/reader036/viewer/2022062421/56649e0a5503460f94af24c6/html5/thumbnails/12.jpg)
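Viterbi Algorithm
[Figure: chain of nodes x1 … xt-1, xt, xt+1 … xT above observation nodes o1 … ot-1, ot, ot+1 … oT]

$$\hat{X}_T = \arg\max_i \delta_i(T)$$

$$\hat{X}_t = \psi_{\hat{X}_{t+1}}(t+1)$$

$$P(\hat{X}) = \max_i \delta_i(T)$$

Compute the most likely state sequence by working backwards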
![Page 13: Conditional Random Fields. Sequence Labeling: The Problem Given a sequence (in NLP, words), assign appropriate labels to each word. For example, POS tagging:](https://reader036.fdocuments.us/reader036/viewer/2022062421/56649e0a5503460f94af24c6/html5/thumbnails/13.jpg)
Viterbi Algorithm
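Recursive Computation

$$\delta_j(t+1) = \max_i \delta_i(t)\, a_{ij}\, b_{j o_{t+1}}$$

$$\psi_j(t+1) = \arg\max_i \delta_i(t)\, a_{ij}\, b_{j o_{t+1}}$$

[Figure: chain of label nodes A1 … At-1, At=j, At+1 above observation nodes o1 … ot-1, ot, ot+1 … oT]

$$\delta_j(t) = \max_{y_1 \ldots y_{t-1}} \theta \cdot F(y_1 \ldots y_{t-1},\; o_1 \ldots o_{t-1},\; y_t = j,\; o_t)$$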
??!
??!
![Page 14: Conditional Random Fields. Sequence Labeling: The Problem Given a sequence (in NLP, words), assign appropriate labels to each word. For example, POS tagging:](https://reader036.fdocuments.us/reader036/viewer/2022062421/56649e0a5503460f94af24c6/html5/thumbnails/14.jpg)
Feature functions and Graph parts
To make efficient computation (dynamic programs) possible, we restrict the feature functions to:
Graph parts (or just parts): A feature function that counts how often a particular configuration occurs for a clique in the CRF graph.
Clique: a set of completely connected nodes in a graph. That is, each node in the clique has an edge connecting it to every other node in the clique.
![Page 15: Conditional Random Fields. Sequence Labeling: The Problem Given a sequence (in NLP, words), assign appropriate labels to each word. For example, POS tagging:](https://reader036.fdocuments.us/reader036/viewer/2022062421/56649e0a5503460f94af24c6/html5/thumbnails/15.jpg)
Clique Example
The cliques in a linear chain CRF are the set of individual nodes, and the set of pairs of consecutive nodes.
[Figure: linear-chain CRF with label nodes 1 to 6, each above its observation node x1 to x6]
![Page 16: Conditional Random Fields. Sequence Labeling: The Problem Given a sequence (in NLP, words), assign appropriate labels to each word. For example, POS tagging:](https://reader036.fdocuments.us/reader036/viewer/2022062421/56649e0a5503460f94af24c6/html5/thumbnails/16.jpg)
Clique Example
The cliques in a linear chain CRF are the set of individual nodes, and the set of pairs of consecutive nodes.
[Figure: linear-chain CRF with label nodes 1 to 6 over observation nodes x1 to x6; individual node cliques highlighted]
![Page 17: Conditional Random Fields. Sequence Labeling: The Problem Given a sequence (in NLP, words), assign appropriate labels to each word. For example, POS tagging:](https://reader036.fdocuments.us/reader036/viewer/2022062421/56649e0a5503460f94af24c6/html5/thumbnails/17.jpg)
Clique Example
The cliques in a linear chain CRF are the set of individual nodes, and the set of pairs of consecutive nodes.
[Figure: linear-chain CRF with label nodes 1 to 6 over observation nodes x1 to x6; pair-of-node cliques highlighted]
![Page 18: Conditional Random Fields. Sequence Labeling: The Problem Given a sequence (in NLP, words), assign appropriate labels to each word. For example, POS tagging:](https://reader036.fdocuments.us/reader036/viewer/2022062421/56649e0a5503460f94af24c6/html5/thumbnails/18.jpg)
Clique Example
For non-linear-chain CRFs (something we won’t normally consider in this class), you can get larger cliques:
[Figure: CRF with label nodes 1 to 6 plus an extra node 5’, over observation nodes x1 to x6; larger cliques highlighted]
![Page 19: Conditional Random Fields. Sequence Labeling: The Problem Given a sequence (in NLP, words), assign appropriate labels to each word. For example, POS tagging:](https://reader036.fdocuments.us/reader036/viewer/2022062421/56649e0a5503460f94af24c6/html5/thumbnails/19.jpg)
Graph part as Feature Function Example
Graph parts are feature functions p(y,x) that count how many cliques have a particular configuration.
For example, p(y,x) = count of [yi = Noun].
Here, y2 and y6 are both Nouns, so p(y,x) = 2.
[Figure: CRF with labels y1=D, y2=N, y3=V, y4=D, y5=A, y6=N over observation nodes x1 to x6]
![Page 20: Conditional Random Fields. Sequence Labeling: The Problem Given a sequence (in NLP, words), assign appropriate labels to each word. For example, POS tagging:](https://reader036.fdocuments.us/reader036/viewer/2022062421/56649e0a5503460f94af24c6/html5/thumbnails/20.jpg)
Graph part as Feature Function Example
For a pair-of-nodes example, p(y,x) = count of [yi = Noun,yi+1=Verb]
Here, y2 is a Noun and y3 is a Verb, so p(y,x) = 1.
[Figure: CRF with labels y1=D, y2=N, y3=V, y4=D, y5=A, y6=N over observation nodes x1 to x6]
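The two counts from these examples can be reproduced with a short sketch. The function names are hypothetical; the label sequence is the one shown in the figure.

```python
def count_single(y, label):
    """Part over single-node cliques: count of [y_i = label]."""
    return sum(1 for yi in y if yi == label)


def count_pair(y, first, second):
    """Part over pair-of-node cliques: count of [y_i = first, y_{i+1} = second]."""
    return sum(1 for a, b in zip(y, y[1:]) if a == first and b == second)


y = ["D", "N", "V", "D", "A", "N"]
noun_count = count_single(y, "N")          # y2 and y6 are Nouns
noun_verb_count = count_pair(y, "N", "V")  # only y2, y3 match
```

Each part only ever looks at one clique at a time, which is what later allows the score to be broken up position by position.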
![Page 21: Conditional Random Fields. Sequence Labeling: The Problem Given a sequence (in NLP, words), assign appropriate labels to each word. For example, POS tagging:](https://reader036.fdocuments.us/reader036/viewer/2022062421/56649e0a5503460f94af24c6/html5/thumbnails/21.jpg)
Features can depend on the whole observation
In a CRF, each feature function can depend on x, in addition to a clique in y
Normally, we draw a CRF like this:
[Figure: an HMM and a CRF, each drawn with label nodes 1 to 6 over observation nodes x1 to x6]
![Page 22: Conditional Random Fields. Sequence Labeling: The Problem Given a sequence (in NLP, words), assign appropriate labels to each word. For example, POS tagging:](https://reader036.fdocuments.us/reader036/viewer/2022062421/56649e0a5503460f94af24c6/html5/thumbnails/22.jpg)
Features can depend on the whole observation
In a CRF, each feature function can depend on x, in addition to a clique in y
But really, it’s more like this:
[Figure: an HMM and a CRF, each with label nodes 1 to 6 and observation nodes x1 to x6]
This would cause problems for a generative model, but in a conditional model, x is always a fixed constant. So we can still calculate relevant algorithms like Viterbi efficiently.
![Page 23: Conditional Random Fields. Sequence Labeling: The Problem Given a sequence (in NLP, words), assign appropriate labels to each word. For example, POS tagging:](https://reader036.fdocuments.us/reader036/viewer/2022062421/56649e0a5503460f94af24c6/html5/thumbnails/23.jpg)
Graph part as Feature Function Example
An example part including x: p(y,x) = count of [yi = A or D,yi+1=N,x2=cat]
Here, y1 is a D and y2 is a N, plus y5 is a A and y6 is a N, plus x2=cat, so p(y,x) = 2.
Notice that the clique y5-y6 is allowed to depend on x2.
[Figure: CRF with labels y1=D, y2=N, y3=V, y4=D, y5=A, y6=N over the words The, cat, chased, the, tiny, fly]
![Page 24: Conditional Random Fields. Sequence Labeling: The Problem Given a sequence (in NLP, words), assign appropriate labels to each word. For example, POS tagging:](https://reader036.fdocuments.us/reader036/viewer/2022062421/56649e0a5503460f94af24c6/html5/thumbnails/24.jpg)
Graph part as Feature Function Example
A more usual example including x: p(y,x) = count of [yi = A or D, yi+1=N, xi+1=cat]
Here, y1 is a D and y2 is a N, plus x2=cat, so p(y,x)=1.
[Figure: CRF with labels y1=D, y2=N, y3=V, y4=D, y5=A, y6=N over the words The, cat, chased, the, tiny, fly]
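Both parts that involve x can be checked with a few lines of code. This is only a sketch with hypothetical function names; positions are 0-indexed in the code, so x2 in the slides is `x[1]`.

```python
def part_fixed_x2(y, x):
    """Count of [y_i in {A, D}, y_{i+1} = N, x_2 = cat]; x_2 is a fixed position."""
    if x[1] != "cat":
        return 0
    return sum(1 for a, b in zip(y, y[1:]) if a in ("A", "D") and b == "N")


def part_local(y, x):
    """Count of [y_i in {A, D}, y_{i+1} = N, x_{i+1} = cat]; x is read at the clique."""
    return sum(
        1
        for i in range(len(y) - 1)
        if y[i] in ("A", "D") and y[i + 1] == "N" and x[i + 1] == "cat"
    )


y = ["D", "N", "V", "D", "A", "N"]
x = ["The", "cat", "chased", "the", "tiny", "fly"]
fixed_count = part_fixed_x2(y, x)  # both D-N and A-N cliques count
local_count = part_local(y, x)     # only the clique sitting on "cat" counts
```

The first part lets the clique y5-y6 depend on a distant word, while the second only fires where the word under the clique is "cat".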
![Page 25: Conditional Random Fields. Sequence Labeling: The Problem Given a sequence (in NLP, words), assign appropriate labels to each word. For example, POS tagging:](https://reader036.fdocuments.us/reader036/viewer/2022062421/56649e0a5503460f94af24c6/html5/thumbnails/25.jpg)
The CRF Equation, with Parts
• A CRF model consists of
 – P = <p1, …, pk>, a vector of parts
 – θ = <θ1, …, θk>, a vector of weights for each part.
• Let O = <o1, …, oT> be an observed sentence
• Let A = <a1, …, aT> be the latent variables.

$$P(A = y \mid O) = \frac{\exp\big(\theta \cdot P(y, O)\big)}{Z(O)}$$
![Page 26: Conditional Random Fields. Sequence Labeling: The Problem Given a sequence (in NLP, words), assign appropriate labels to each word. For example, POS tagging:](https://reader036.fdocuments.us/reader036/viewer/2022062421/56649e0a5503460f94af24c6/html5/thumbnails/26.jpg)
Viterbi Algorithm – 2nd Try
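$$\delta_j(t) = \max_{y_1 \ldots y_{t-1}} \theta \cdot P(y_1 \ldots y_{t-1},\; y_t = j,\; o)$$

Recursive Computation

$$\delta_j(t+1) = \max_i \Big[ \delta_i(t) + \theta_{one} \cdot P_{one}(y_{t+1} = j,\, o) + \theta_{pair} \cdot P_{pair}(y_t = i,\, y_{t+1} = j,\, o) \Big]$$

$$\psi_j(t+1) = \arg\max_i \Big[ \delta_i(t) + \theta_{one} \cdot P_{one}(y_{t+1} = j,\, o) + \theta_{pair} \cdot P_{pair}(y_t = i,\, y_{t+1} = j,\, o) \Big]$$

The terms add because the score θ·P is a sum over parts, each touching a single node or a pair of consecutive nodes.

[Figure: chain of label nodes A1 … At-1, At=j, At+1 above observation nodes o1 … ot-1, ot, ot+1 … oT]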
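The recursion can be sketched in a few lines of Python. This is an illustrative implementation under stated assumptions, not the course's reference code: `node_score` and `pair_score` are hypothetical stand-ins for the single-node and pair-of-node weighted part sums, the toy weights are invented, and time steps are 0-indexed.

```python
def viterbi(obs, labels, node_score, pair_score):
    """Return the best label sequence and its score under additive part scores."""
    T = len(obs)
    # delta[t][j]: best score of any prefix that ends with label j at position t
    delta = [{j: node_score(j, 0, obs) for j in labels}]
    # psi[t][j]: the label at position t-1 on that best prefix
    psi = [{}]
    for t in range(1, T):
        delta.append({})
        psi.append({})
        for j in labels:
            best_i = max(
                labels, key=lambda i: delta[t - 1][i] + pair_score(i, j, t, obs)
            )
            delta[t][j] = (
                delta[t - 1][best_i]
                + pair_score(best_i, j, t, obs)
                + node_score(j, t, obs)
            )
            psi[t][j] = best_i
    # Backtrace from the best final label.
    last = max(labels, key=lambda j: delta[T - 1][j])
    path = [last]
    for t in range(T - 1, 0, -1):
        path.append(psi[t][path[-1]])
    path.reverse()
    return path, delta[T - 1][last]


def node_score(j, t, obs):
    """Toy single-node score (made-up weights)."""
    w = obs[t].lower()
    if w == "the" and j == "D":
        return 2.0
    if w.endswith("ed") and j == "V":
        return 1.5
    return 0.0


def pair_score(i, j, t, obs):
    """Toy pair score for labels (y_{t-1} = i, y_t = j) (made-up weights)."""
    return 1.0 if (i, j) in {("D", "N"), ("N", "V")} else 0.0


labels = ["D", "N", "V"]
obs = ["The", "cat", "chased", "the", "fly"]
path, best = viterbi(obs, labels, node_score, pair_score)
```

The work is proportional to T·K² score evaluations rather than K^T, because each position only compares the K possible predecessors of each label.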
![Page 27: Conditional Random Fields. Sequence Labeling: The Problem Given a sequence (in NLP, words), assign appropriate labels to each word. For example, POS tagging:](https://reader036.fdocuments.us/reader036/viewer/2022062421/56649e0a5503460f94af24c6/html5/thumbnails/27.jpg)
Supervised Parameter Estimation
![Page 28: Conditional Random Fields. Sequence Labeling: The Problem Given a sequence (in NLP, words), assign appropriate labels to each word. For example, POS tagging:](https://reader036.fdocuments.us/reader036/viewer/2022062421/56649e0a5503460f94af24c6/html5/thumbnails/28.jpg)
Conditional Training
• Given a set of observations o and the correct labels y for each, determine the best θ:

$$\hat{\theta} = \arg\max_\theta P(y \mid o, \theta)$$

• Because the CRF equation is just a special form of the maximum entropy equation, we can train it exactly the same way:
 – Determine the gradient
 – Step in the direction of the gradient
 – Repeat until convergence
![Page 29: Conditional Random Fields. Sequence Labeling: The Problem Given a sequence (in NLP, words), assign appropriate labels to each word. For example, POS tagging:](https://reader036.fdocuments.us/reader036/viewer/2022062421/56649e0a5503460f94af24c6/html5/thumbnails/29.jpg)
Recall: Training a ME model
Training is an optimization problem: find the value for λ that maximizes the conditional log-likelihood of the training data:

$$CLL(Train) = \sum_{(c,d) \in Train} \log P(c \mid d) = \sum_{(c,d) \in Train} \Big( \sum_i \lambda_i f_i(c, d) - \log Z(d) \Big)$$
![Page 30: Conditional Random Fields. Sequence Labeling: The Problem Given a sequence (in NLP, words), assign appropriate labels to each word. For example, POS tagging:](https://reader036.fdocuments.us/reader036/viewer/2022062421/56649e0a5503460f94af24c6/html5/thumbnails/30.jpg)
Recall: Training a ME model
Optimization is normally performed using some form of gradient descent (here ascent, since CLL is being maximized):
0) Initialize λ0 to 0
1) Compute the gradient: ∇CLL
2) Take a step in the direction of the gradient: λi+1 = λi + α ∇CLL
3) Repeat until CLL doesn’t improve: stop when |CLL(λi+1) – CLL(λi)| < ε
![Page 31: Conditional Random Fields. Sequence Labeling: The Problem Given a sequence (in NLP, words), assign appropriate labels to each word. For example, POS tagging:](https://reader036.fdocuments.us/reader036/viewer/2022062421/56649e0a5503460f94af24c6/html5/thumbnails/31.jpg)
Recall: Training a ME model
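Computing the gradient:

$$\begin{aligned}
\frac{\partial}{\partial \lambda_i} CLL(Train)
&= \frac{\partial}{\partial \lambda_i} \sum_{(c,d) \in Train} \Big( \sum_k \lambda_k f_k(c, d) - \log Z(d) \Big) \\
&= \frac{\partial}{\partial \lambda_i} \sum_{(c,d) \in Train} \Big( \sum_k \lambda_k f_k(c, d) - \log \sum_{c'} \exp \sum_k \lambda_k f_k(c', d) \Big) \\
&= \sum_{(c,d) \in Train} \Big( f_i(c, d) - \sum_{c'} f_i(c', d)\, \frac{\exp \sum_k \lambda_k f_k(c', d)}{\sum_{c''} \exp \sum_k \lambda_k f_k(c'', d)} \Big) \\
&= \sum_{(c,d) \in Train} \Big( f_i(c, d) - E_{P}\big[ f_i(c', d) \big] \Big)
\end{aligned}$$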
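The derivation says each gradient component is the observed feature count minus the count expected under the model. A minimal sketch of that computation, together with the update λi+1 = λi + α ∇CLL, is below; the two-class toy data set, the feature function `feats`, and the step size are all invented for illustration.

```python
import math


def cll_and_gradient(lam, train, classes, feats):
    """Return CLL(Train) and its gradient: observed minus expected feature counts."""
    cll = 0.0
    grad = [0.0] * len(lam)
    for c, d in train:
        scores = {k: sum(l * f for l, f in zip(lam, feats(k, d))) for k in classes}
        log_z = math.log(sum(math.exp(s) for s in scores.values()))
        cll += scores[c] - log_z
        observed = feats(c, d)
        for i in range(len(lam)):
            grad[i] += observed[i]
        for k in classes:
            p_k = math.exp(scores[k] - log_z)  # model probability P(k | d)
            f_k = feats(k, d)
            for i in range(len(lam)):
                grad[i] -= p_k * f_k[i]
    return cll, grad


def train_maxent(train, classes, feats, n_feats, alpha=0.5, eps=1e-6, max_iter=1000):
    """Plain gradient ascent; stops when CLL changes by less than eps."""
    lam = [0.0] * n_feats
    cll, grad = cll_and_gradient(lam, train, classes, feats)
    for _ in range(max_iter):
        new_lam = [l + alpha * g for l, g in zip(lam, grad)]
        new_cll, new_grad = cll_and_gradient(new_lam, train, classes, feats)
        converged = abs(new_cll - cll) < eps
        lam, cll, grad = new_lam, new_cll, new_grad
        if converged:
            break
    return lam, cll


def feats(c, d):
    """Hypothetical features f(c, d) for a two-class toy problem."""
    return [
        1.0 if c == "sports" and "ball" in d else 0.0,
        1.0 if c == "politics" and "vote" in d else 0.0,
    ]


classes = ["sports", "politics"]
train = [
    ("sports", ["ball", "game"]),
    ("politics", ["vote", "law"]),
    ("sports", ["ball", "vote"]),
]
initial_cll, initial_grad = cll_and_gradient([0.0, 0.0], train, classes, feats)
lam, final_cll = train_maxent(train, classes, feats, n_feats=2)
```

For text classification the expectation is a cheap sum over a handful of classes. For a CRF the same expectation ranges over all label sequences, which is why it cannot be computed by enumeration.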
![Page 32: Conditional Random Fields. Sequence Labeling: The Problem Given a sequence (in NLP, words), assign appropriate labels to each word. For example, POS tagging:](https://reader036.fdocuments.us/reader036/viewer/2022062421/56649e0a5503460f94af24c6/html5/thumbnails/32.jpg)
Recall: Training a ME model
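Computing the gradient:

$$\frac{\partial}{\partial \lambda_i} CLL(Train) = \frac{\partial}{\partial \lambda_i} \sum_{(c,d) \in Train} \Big( \sum_k \lambda_k f_k(c, d) - \log Z(d) \Big) = \sum_{(c,d) \in Train} \Big( f_i(c, d) - E_{P}\big[ f_i(c', d) \big] \Big)$$

The hard part for CRFs: the expected feature count $E_{P}[\cdot]$.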
![Page 33: Conditional Random Fields. Sequence Labeling: The Problem Given a sequence (in NLP, words), assign appropriate labels to each word. For example, POS tagging:](https://reader036.fdocuments.us/reader036/viewer/2022062421/56649e0a5503460f94af24c6/html5/thumbnails/33.jpg)
Training a CRF: Expected feature counts
• … (sorry, ran out of time)
![Page 34: Conditional Random Fields. Sequence Labeling: The Problem Given a sequence (in NLP, words), assign appropriate labels to each word. For example, POS tagging:](https://reader036.fdocuments.us/reader036/viewer/2022062421/56649e0a5503460f94af24c6/html5/thumbnails/34.jpg)
CRFs vs. HMMs
![Page 35: Conditional Random Fields. Sequence Labeling: The Problem Given a sequence (in NLP, words), assign appropriate labels to each word. For example, POS tagging:](https://reader036.fdocuments.us/reader036/viewer/2022062421/56649e0a5503460f94af24c6/html5/thumbnails/35.jpg)
Generative (Joint Probability) Models
• HMMs are generative models: That is, they can compute the joint probability P(sentence, hidden-states)
• From a generative model, one can compute
 – Conditional models P(sentence | hidden-states) and P(hidden-states | sentence)
 – Marginal models P(sentence) and P(hidden-states)
• For sequence labeling, we want P(hidden-states | sentence)
![Page 36: Conditional Random Fields. Sequence Labeling: The Problem Given a sequence (in NLP, words), assign appropriate labels to each word. For example, POS tagging:](https://reader036.fdocuments.us/reader036/viewer/2022062421/56649e0a5503460f94af24c6/html5/thumbnails/36.jpg)
Discriminative (Conditional) Models
• Most often, people are most interested in the conditional probability P(hidden-states | sentence). For example, this is the distribution needed for sequence labeling.
• Discriminative (also called conditional) models directly represent the conditional distribution P(hidden-states | sentence)
 – These models cannot tell you the joint distribution, marginals, or other conditionals.
 – But they’re quite good at this particular conditional distribution.
![Page 37: Conditional Random Fields. Sequence Labeling: The Problem Given a sequence (in NLP, words), assign appropriate labels to each word. For example, POS tagging:](https://reader036.fdocuments.us/reader036/viewer/2022062421/56649e0a5503460f94af24c6/html5/thumbnails/37.jpg)
Discriminative vs. Generative

| | HMM (generative) | CRF (discriminative) |
|---|---|---|
| Marginal, or language model: P(sentence) | Forward algorithm or Backward algorithm, linear in length of sentence | Can’t do it. |
| Find optimal label sequence | Viterbi, linear in length of sentence | Viterbi, linear in length of sentence |
| Supervised parameter estimation | Bayesian learning, easy and fast | Convex optimization, can be quite slow |
| Unsupervised parameter estimation | Baum-Welch (non-convex optimization), slow but doable | Very difficult, and requires making extra assumptions. |
| Feature functions | Parents and children in the graph. Restrictive! | Arbitrary functions of a latent state and any portion of the observed nodes |
![Page 38: Conditional Random Fields. Sequence Labeling: The Problem Given a sequence (in NLP, words), assign appropriate labels to each word. For example, POS tagging:](https://reader036.fdocuments.us/reader036/viewer/2022062421/56649e0a5503460f94af24c6/html5/thumbnails/38.jpg)
CRFs vs. HMMs, a closer look
It’s possible to convert an HMM into a CRF:

Set p_{prior,state}(y,x) = count[y1 = state]
Set θ_{prior,state} = log P_HMM(y1 = state) = log π_state

Set p_{trans,state1,state2}(y,x) = count[yi = state1, yi+1 = state2]
Set θ_{trans,state1,state2} = log P_HMM(yi+1 = state2 | yi = state1) = log A_{state1,state2}

Set p_{obs,state,word}(y,x) = count[yi = state, xi = word]
Set θ_{obs,state,word} = log P_HMM(xi = word | yi = state) = log B_{state,word}
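The conversion can be checked numerically: with these parts and weights, the CRF score θ·P(y, x) should equal the HMM's log joint probability. The sketch below uses a made-up two-state HMM; the helper names and the probability tables are hypothetical.

```python
import math

# Hypothetical toy HMM: prior pi, transitions A, emissions B
pi = {"D": 0.7, "N": 0.3}
A = {"D": {"D": 0.1, "N": 0.9}, "N": {"D": 0.6, "N": 0.4}}
B = {"D": {"the": 0.9, "cat": 0.1}, "N": {"the": 0.2, "cat": 0.8}}


def hmm_to_crf(pi, A, B):
    """Build CRF weights theta as logs of the HMM probabilities."""
    theta = {}
    for s, p in pi.items():
        theta[("prior", s)] = math.log(p)
    for s1, row in A.items():
        for s2, p in row.items():
            theta[("trans", s1, s2)] = math.log(p)
    for s, row in B.items():
        for w, p in row.items():
            theta[("obs", s, w)] = math.log(p)
    return theta


def part_counts(y, x):
    """Count how often each prior, transition, and observation part fires."""
    counts = {}

    def bump(key):
        counts[key] = counts.get(key, 0) + 1

    bump(("prior", y[0]))
    for a, b in zip(y, y[1:]):
        bump(("trans", a, b))
    for s, w in zip(y, x):
        bump(("obs", s, w))
    return counts


def crf_score(theta, y, x):
    """theta . P(y, x)"""
    return sum(theta[key] * n for key, n in part_counts(y, x).items())


def hmm_log_joint(pi, A, B, y, x):
    """log P_HMM(y, x) computed directly from the HMM tables."""
    lp = math.log(pi[y[0]]) + math.log(B[y[0]][x[0]])
    for t in range(1, len(y)):
        lp += math.log(A[y[t - 1]][y[t]]) + math.log(B[y[t]][x[t]])
    return lp


theta = hmm_to_crf(pi, A, B)
y = ["D", "N", "D", "N"]
x = ["the", "cat", "the", "cat"]
score = crf_score(theta, y, x)
log_joint = hmm_log_joint(pi, A, B, y, x)
```

Since exp(score) equals P_HMM(y, x), normalizing by Z(x) recovers the HMM's conditional P(y | x), so the converted CRF labels sequences exactly as the HMM would.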
![Page 39: Conditional Random Fields. Sequence Labeling: The Problem Given a sequence (in NLP, words), assign appropriate labels to each word. For example, POS tagging:](https://reader036.fdocuments.us/reader036/viewer/2022062421/56649e0a5503460f94af24c6/html5/thumbnails/39.jpg)
CRF vs. HMM, a closer look
If we convert an HMM to a CRF, all of the CRF parameters θ will be logs of probabilities. Therefore, they will all be between –∞ and 0.
Notice: CRF parameters can be between –∞ and +∞.
So, how do HMMs and CRFs compare in terms of bias and variance (as sequence labelers)?
 – HMMs have more bias
 – CRFs have more variance
![Page 40: Conditional Random Fields. Sequence Labeling: The Problem Given a sequence (in NLP, words), assign appropriate labels to each word. For example, POS tagging:](https://reader036.fdocuments.us/reader036/viewer/2022062421/56649e0a5503460f94af24c6/html5/thumbnails/40.jpg)
Comparing feature functions
The biggest advantage of CRFs over HMMs is that they can handle overlapping features.

For example, for POS tagging, using words as features (like xi=“the” or xj=“jogging”) is quite useful.
However, it’s often also useful to use “orthographic” features, like “the word ends in –ing” or “the word starts with a capital letter.”
These features overlap: some words end in “ing”, some don’t.
• Generative models have to include parameters in the model for predicting when features will overlap.
• Discriminative models don’t: they can simply use the features.
![Page 41: Conditional Random Fields. Sequence Labeling: The Problem Given a sequence (in NLP, words), assign appropriate labels to each word. For example, POS tagging:](https://reader036.fdocuments.us/reader036/viewer/2022062421/56649e0a5503460f94af24c6/html5/thumbnails/41.jpg)
CRF Example
A CRF POS Tagger for English
![Page 42: Conditional Random Fields. Sequence Labeling: The Problem Given a sequence (in NLP, words), assign appropriate labels to each word. For example, POS tagging:](https://reader036.fdocuments.us/reader036/viewer/2022062421/56649e0a5503460f94af24c6/html5/thumbnails/42.jpg)
Vocabulary
We need to determine the set of possible word types V.
Let V = {all types in 1 million tokens of Wall Street Journal text, which we’ll use for training} ∪ {UNKNOWN} (for word types we haven’t seen)
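A minimal sketch of this vocabulary construction, using a tiny made-up token list in place of the Wall Street Journal training text; the helper names are hypothetical.

```python
UNKNOWN = "UNKNOWN"


def build_vocab(training_tokens):
    """V = all word types seen in training, plus the UNKNOWN placeholder."""
    return set(training_tokens) | {UNKNOWN}


def lookup(word, vocab):
    """Map a word to itself if it is in V, otherwise to UNKNOWN."""
    return word if word in vocab else UNKNOWN


training_tokens = ["the", "cat", "sat", "on", "the", "mat"]  # stand-in corpus
vocab = build_vocab(training_tokens)
mapped = [lookup(w, vocab) for w in ["the", "dog", "sat"]]
```

Here "dog" was never seen in training, so it is replaced by the UNKNOWN type before any word features are computed.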
![Page 43: Conditional Random Fields. Sequence Labeling: The Problem Given a sequence (in NLP, words), assign appropriate labels to each word. For example, POS tagging:](https://reader036.fdocuments.us/reader036/viewer/2022062421/56649e0a5503460f94af24c6/html5/thumbnails/43.jpg)
L = Label Set
Standard Penn Treebank tagset
Number Tag Description
1. CC Coordinating conjunction
2. CD Cardinal number
3. DT Determiner
4. EX Existential there
5. FW Foreign word
6. IN Preposition or subordinating conjunction
7. JJ Adjective
8. JJR Adjective, comparative
9. JJS Adjective, superlative
10. LS List item marker
11. MD Modal
12. NN Noun, singular or mass
13. NNS Noun, plural
14. NNP Proper noun, singular
15. NNPS Proper noun, plural
16. PDT Predeterminer
17. POS Possessive ending
![Page 44: Conditional Random Fields. Sequence Labeling: The Problem Given a sequence (in NLP, words), assign appropriate labels to each word. For example, POS tagging:](https://reader036.fdocuments.us/reader036/viewer/2022062421/56649e0a5503460f94af24c6/html5/thumbnails/44.jpg)
L = Label Set
Number Tag Description
18. PRP Personal pronoun
19. PRP$ Possessive pronoun
20. RB Adverb
21. RBR Adverb, comparative
22. RBS Adverb, superlative
23. RP Particle
24. SYM Symbol
25. TO to
26. UH Interjection
27. VB Verb, base form
28. VBD Verb, past tense
29. VBG Verb, gerund or present participle
30. VBN Verb, past participle
31. VBP Verb, non-3rd person singular present
32. VBZ Verb, 3rd person singular present
33. WDT Wh-determiner
34. WP Wh-pronoun
35. WP$ Possessive wh-pronoun
36. WRB Wh-adverb
![Page 45: Conditional Random Fields. Sequence Labeling: The Problem Given a sequence (in NLP, words), assign appropriate labels to each word. For example, POS tagging:](https://reader036.fdocuments.us/reader036/viewer/2022062421/56649e0a5503460f94af24c6/html5/thumbnails/45.jpg)
CRF Features

| Feature Type | | Description |
|---|---|---|
| Prior | k | yi = k |
| Transition | k, k’ | yi = k and yi+1 = k’ |
| Word | k, w | yi = k and xi = w |
| | k, w | yi = k and xi-1 = w |
| | k, w | yi = k and xi+1 = w |
| | k, w, w’ | yi = k and xi = w and xi-1 = w’ |
| | k, w, w’ | yi = k and xi = w and xi+1 = w’ |
| Orthography: Suffix | s in {“ing”, “ed”, “ogy”, “s”, “ly”, “ion”, “tion”, “ity”, …} and k | yi = k and xi ends with s |
| Orthography: Punctuation | k | yi = k and xi is capitalized |
| | k | yi = k and xi is hyphenated |
| | k | yi = k and xi contains a period |
| | k | yi = k and xi is ALL CAPS |
| | k | yi = k and xi contains a digit (0-9) |
| | … | |
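As a rough sketch of how the word and orthography rows might be turned into code, the function below lists which observation tests fire at position i; pairing each with a label k gives the single-node features. The string names for the features are invented here, the suffix list is only the portion shown in the table, and the two-word (w, w’) features are left out for brevity.

```python
SUFFIXES = ("ing", "ed", "ogy", "s", "ly", "ion", "tion", "ity")


def observation_features(x, i):
    """Names of the observation tests that fire for word x[i]."""
    w = x[i]
    fired = ["word=" + w]
    if i > 0:
        fired.append("prev_word=" + x[i - 1])
    if i + 1 < len(x):
        fired.append("next_word=" + x[i + 1])
    for s in SUFFIXES:
        if w.endswith(s):
            fired.append("suffix=" + s)
    if w[:1].isupper():
        fired.append("capitalized")
    if "-" in w:
        fired.append("hyphenated")
    if "." in w:
        fired.append("has_period")
    if w.isupper():
        fired.append("all_caps")
    if any(ch.isdigit() for ch in w):
        fired.append("has_digit")
    return fired


def node_features(k, x, i):
    """Pair each fired observation test with the candidate label y_i = k."""
    return [(k, name) for name in observation_features(x, i)]


x = ["Jogging", "is", "fun"]
fired = observation_features(x, 0)
vbg_features = node_features("VBG", x, 0)
```

Several tests fire at once for "Jogging" (the word itself, the "ing" suffix, capitalization), which is exactly the kind of overlap a CRF can use directly.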