Learning for Structured Prediction
Linear Methods For Sequence Labeling: Hidden Markov Models vs Structured Perceptron
Ivan Titov
Last Time: Structured Prediction
1. Selecting a feature representation $\varphi(x, y)$: it should be sufficient to discriminate correct structures from incorrect ones, and it should be possible to decode with it (see (3))
2. Learning: which error function to optimize on the training set, for example the margin condition $w \cdot \varphi(x, y^*) - \max_{y' \in \mathcal{Y}(x),\, y' \neq y^*} w \cdot \varphi(x, y') > \gamma$, and how to make it efficient (see (3))
3. Decoding: $\hat{y} = \arg\max_{y' \in \mathcal{Y}(x)} w \cdot \varphi(x, y')$; dynamic programming for simpler representations? Approximate search for more powerful ones?
We illustrated all these challenges on the example of dependency parsing.
Here $x$ is an input (e.g., a sentence) and $y$ is an output (e.g., a syntactic tree).
Outline
Sequence labeling / segmentation problems: settings and example problems (part-of-speech tagging, named entity recognition, gesture recognition)
Hidden Markov Models: standard definition + maximum likelihood estimation; general view: as a representative of linear models
Perceptron and Structured Perceptron algorithms / motivations
Decoding with the linear model
Discussion: discriminative vs. generative
Sequence Labeling Problems
Definition:
Input: sequences of variable length, $x = (x_1, x_2, \ldots, x_{|x|})$, $x_i \in X$
Output: every position is labeled, $y = (y_1, y_2, \ldots, y_{|x|})$, $y_i \in Y$
Examples: part-of-speech tagging, named-entity recognition, shallow parsing ("chunking"), gesture recognition from video streams, ...
Example:
x = John carried a tin can .
y = NNP VBD DT NN NN .
Part-of-speech Tagging
Labels: NNP – proper singular noun; VBD – verb, past tense; DT – determiner; NN – singular noun; MD – modal; . – final punctuation
Consider:
x = John carried a tin can .
y = NNP VBD DT NN NN or MD? .
If you just predict the most frequent tag for each word, you will make a mistake here.
In fact, even knowing that the previous word is a noun is not enough:
x = Tin can cause poisoning ...
y = NN MD VB NN ...
One needs to model interactions between labels to successfully resolve ambiguities, so this should be tackled as a structured prediction problem.
Named Entity Recognition
[ORG Chelsea], despite their name, are not based in [LOC Chelsea], but in neighbouring [LOC Fulham].
Not as trivial as it may seem; consider:
[PERS Bill Clinton] will not embarrass [PERS Chelsea] at her wedding   (Chelsea can be a person too!)
Tiger failed to make a birdie   (Is it an animal or a person?)
Encoding example (BIO encoding):
x = Bill Clinton embarrassed Chelsea at her wedding at Astor Courts in the South Course ...
y = B-PERS I-PERS O B-PERS O O O O B-LOC I-LOC ...
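To make the encoding concrete, here is a minimal sketch (my own illustration, not from the slides) that converts entity spans given as token offsets into BIO tags; the span format and the function name bio_encode are assumptions made for this example.

def bio_encode(tokens, entities):
    """Convert entity spans to BIO tags.

    tokens:   list of words, e.g. ["Bill", "Clinton", "embarrassed", ...]
    entities: list of (start, end, type) token spans, end exclusive,
              e.g. [(0, 2, "PERS"), (3, 4, "PERS"), (8, 10, "LOC")]
    """
    tags = ["O"] * len(tokens)
    for start, end, etype in entities:
        tags[start] = "B-" + etype             # first token of the entity
        for i in range(start + 1, end):
            tags[i] = "I-" + etype             # tokens inside the entity
    return tags

tokens = "Bill Clinton embarrassed Chelsea at her wedding at Astor Courts".split()
print(bio_encode(tokens, [(0, 2, "PERS"), (3, 4, "PERS"), (8, 10, "LOC")]))
# ['B-PERS', 'I-PERS', 'O', 'B-PERS', 'O', 'O', 'O', 'O', 'B-LOC', 'I-LOC']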
Vision: Gesture Recognition
Given a sequence of frames in a video, annotate each frame with a gesture type.
Types of gestures: flip back, shrink vertically, expand vertically, double back, point and back, expand horizontally.
It is hard to predict gestures from each frame in isolation; you need to exploit relations between frames and gesture types.
[Figures from (Wang et al., CVPR 06)]
Outline
Sequence labeling / segmentation problems: settings and example problems (part-of-speech tagging, named entity recognition, gesture recognition)
Hidden Markov Models: standard definition + maximum likelihood estimation; general view: as a representative of linear models
Perceptron and Structured Perceptron algorithms / motivations
Decoding with the linear model
Discussion: discriminative vs. generative
Hidden Markov Models
We will consider the part-of-speech (POS) tagging example.
A "generative" model, i.e.:
Model: introduce a parameterized model $P(x, y \mid \theta)$ of how both words and tags are generated
Learning: use a labeled training set to estimate the most likely parameters $\theta$ of the model
Decoding: $\hat{y} = \arg\max_{y'} P(x, y' \mid \theta)$
x = John carried a tin can .
y = NNP VBD DT NN NN .
Hidden Markov Models
A simplistic state diagram for noun phrases (N – number of tags, M – vocabulary size):
States correspond to POS tags; words are emitted independently from each POS tag.
Parameters (to be estimated from the training set):
Transition probabilities $P(y^{(t)} \mid y^{(t-1)})$: an [N x N] matrix
Emission probabilities $P(x^{(t)} \mid y^{(t)})$: an [N x M] matrix
Stationarity assumption: these probabilities do not depend on the position t in the sequence.
[State diagram: states $, Det, Adj and Noun with transition probabilities between them, and emission probabilities attached to the states, e.g. Det: {0.5: the, 0.5: a}, Adj: {0.01: red, 0.01: hungry, ...}, Noun: {0.01: herring, 0.01: dog, ...}; example generation: "a hungry dog"]
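As a side illustration of this generative story (not part of the slides), one could sample from such an HMM roughly as follows; the nested-dict parameter format and the function names are assumptions made for this sketch.

import random

def sample_from_hmm(transitions, emissions, max_len=20):
    """Sample a (words, tags) pair from an HMM given as nested dicts of
    probabilities, with "$" as the start/stop state (a toy generative story)."""
    def draw(dist):
        r, acc = random.random(), 0.0
        for item, p in dist.items():
            acc += p
            if r <= acc:
                return item
        return item                            # guard against rounding error

    words, tags = [], []
    state = draw(transitions["$"])             # select the tag for the first word
    while state != "$" and len(words) < max_len:
        words.append(draw(emissions[state]))   # draw a word from this state
        tags.append(state)
        state = draw(transitions[state])       # select the next state (or stop at "$")
    return words, tags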
Hidden Markov Models
Representation as an instantiation of a graphical model (N – number of tags, M – vocabulary size):
States correspond to POS tags; words are emitted independently from each POS tag.
Parameters (to be estimated from the training set):
Transition probabilities $P(y^{(t)} \mid y^{(t-1)})$: an [N x N] matrix
Emission probabilities $P(x^{(t)} \mid y^{(t)})$: an [N x M] matrix
Stationarity assumption: these probabilities do not depend on the position t in the sequence.
[Graphical model: a chain y(1) -> y(2) -> y(3) -> y(4) -> ... with each x(t) generated from the corresponding y(t); an arrow means that in the generative story x(4) is generated from some P(x(4) | y(4)); e.g. y(1) = Det emits x(1) = a, y(2) = Adj emits x(2) = hungry, y(3) = Noun emits x(3) = dog]
Hidden Markov Models: Estimation
N – the number of tags, M – vocabulary size. Parameters (to be estimated from the training set):
Transition probabilities $a_{ji} = P(y^{(t)} = i \mid y^{(t-1)} = j)$, A – an [N x N] matrix
Emission probabilities $b_{ik} = P(x^{(t)} = k \mid y^{(t)} = i)$, B – an [N x M] matrix
Training corpus:
x(1) = (In, an, Oct., 19, review, of, ...), y(1) = (IN, DT, NNP, CD, NN, IN, ...)
x(2) = (Ms., Haag, plays, Elianti, .), y(2) = (NNP, NNP, VBZ, NNP, .)
...
x(L) = (The, company, said, ...), y(L) = (DT, NN, VBD, NNP, .)
How do we estimate the parameters using maximum likelihood estimation?
You can probably guess what these estimates should be.
Hidden Markov Models: Estimation
Parameters (to be estimated from the training set):
Transition probabilities $a_{ji} = P(y^{(t)} = i \mid y^{(t-1)} = j)$, A – an [N x N] matrix
Emission probabilities $b_{ik} = P(x^{(t)} = k \mid y^{(t)} = i)$, B – an [N x M] matrix
Training corpus: $(x^{(l)}, y^{(l)})$, $l = 1, \ldots, L$. Write down the probability of the corpus according to the HMM:
$P(\{x^{(l)}, y^{(l)}\}_{l=1}^{L}) = \prod_{l=1}^{L} P(x^{(l)}, y^{(l)}) = \prod_{l=1}^{L} a_{\$, y_1^{(l)}} \Big( \prod_{t=1}^{|x^{(l)}|-1} b_{y_t^{(l)}, x_t^{(l)}} \, a_{y_t^{(l)}, y_{t+1}^{(l)}} \Big) \, b_{y_{|x^{(l)}|}^{(l)}, x_{|x^{(l)}|}^{(l)}} \, a_{y_{|x^{(l)}|}^{(l)}, \$}$
(select the tag for the first word; draw a word from this state; select the next state; draw the last word; transit into the $ state)
$= \prod_{i,j=1}^{N} a_{i,j}^{CT(i,j)} \prod_{i=1}^{N} \prod_{k=1}^{M} b_{i,k}^{CE(i,k)}$
CT(i,j) is the number of times tag i is followed by tag j; here we assume that $ is a special tag which precedes and succeeds every sentence.
CE(i,k) is the number of times word k is emitted by tag i.
Hidden Markov Models: Estimation
Maximize:
$P(\{x^{(l)}, y^{(l)}\}_{l=1}^{L}) = \prod_{i,j=1}^{N} a_{i,j}^{CT(i,j)} \prod_{i=1}^{N} \prod_{k=1}^{M} b_{i,k}^{CE(i,k)}$
(CT(i,j) is the number of times tag i is followed by tag j; CE(i,k) is the number of times word k is emitted by tag i)
Equivalently, maximize the logarithm of this:
$\log P(\{x^{(l)}, y^{(l)}\}_{l=1}^{L}) = \sum_{i=1}^{N} \Big( \sum_{j=1}^{N} CT(i,j) \log a_{i,j} + \sum_{k=1}^{M} CE(i,k) \log b_{i,k} \Big)$
subject to probabilistic constraints:
$\sum_{j=1}^{N} a_{i,j} = 1$, $\sum_{k=1}^{M} b_{i,k} = 1$, for $i = 1, \ldots, N$
Or, we can decompose it into 2N independent optimization tasks:
For transitions, for $i = 1, \ldots, N$: $\max_{a_{i,1}, \ldots, a_{i,N}} \sum_{j=1}^{N} CT(i,j) \log a_{i,j}$ s.t. $\sum_{j=1}^{N} a_{i,j} = 1$
For emissions, for $i = 1, \ldots, N$: $\max_{b_{i,1}, \ldots, b_{i,M}} \sum_{k=1}^{M} CE(i,k) \log b_{i,k}$ s.t. $\sum_{k=1}^{M} b_{i,k} = 1$
Hidden Markov Models: Estimation
For transitions (for some i), the constrained optimization task is:
$\max_{a_{i,1}, \ldots, a_{i,N}} \sum_{j=1}^{N} CT(i,j) \log a_{i,j}$ s.t. $1 - \sum_{j=1}^{N} a_{i,j} = 0$
Lagrangian:
$L(a_{i,1}, \ldots, a_{i,N}, \lambda) = \sum_{j=1}^{N} CT(i,j) \log a_{i,j} + \lambda \times \big(1 - \sum_{j=1}^{N} a_{i,j}\big)$
Find critical points of the Lagrangian by solving the system of equations:
$\frac{\partial L}{\partial \lambda} = 1 - \sum_{j=1}^{N} a_{i,j} = 0$
$\frac{\partial L}{\partial a_{ij}} = \frac{CT(i,j)}{a_{ij}} - \lambda = 0 \;\Rightarrow\; a_{ij} = \frac{CT(i,j)}{\lambda}$
which gives
$P(y_t = j \mid y_{t-1} = i) = a_{i,j} = \frac{CT(i,j)}{\sum_{j'} CT(i,j')}$
Similarly, for emissions:
$P(x_t = k \mid y_t = i) = b_{i,k} = \frac{CE(i,k)}{\sum_{k'} CE(i,k')}$
The maximum likelihood solution is just normalized counts of events. It is always like this for generative models if all the labels are visible in training.
(I ignore "smoothing" to handle rare or unseen word-tag combinations; it is outside the scope of this seminar.)
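Since the maximum-likelihood solution is just normalized counts, it is easy to sketch in code. The following is my own illustration (not from the slides), assuming the corpus is a list of (words, tags) pairs and ignoring smoothing.

from collections import defaultdict

def estimate_hmm(corpus):
    """Maximum-likelihood HMM parameters from a tagged corpus.

    corpus: list of (words, tags) pairs, e.g.
            [(["a", "hungry", "dog"], ["Det", "Adj", "Noun"]), ...]
    Returns (transitions, emissions) as nested dicts of probabilities.
    Uses "$" as the special boundary tag; no smoothing."""
    ct = defaultdict(lambda: defaultdict(int))   # CT(i, j): tag i followed by tag j
    ce = defaultdict(lambda: defaultdict(int))   # CE(i, k): tag i emits word k
    for words, tags in corpus:
        padded = ["$"] + tags + ["$"]
        for prev, cur in zip(padded, padded[1:]):
            ct[prev][cur] += 1
        for word, tag in zip(words, tags):
            ce[tag][word] += 1

    # Normalize the counts row by row, exactly as in the derivation above.
    transitions = {i: {j: c / sum(row.values()) for j, c in row.items()}
                   for i, row in ct.items()}
    emissions = {i: {k: c / sum(row.values()) for k, c in row.items()}
                 for i, row in ce.items()}
    return transitions, emissions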
HMMs as Linear Models
We will talk about the decoding algorithm slightly later; let us first generalize the Hidden Markov Model.
Decoding:
x = John carried a tin can .
y = ? ? ? ? ? ?
$\hat{y} = \arg\max_{y'} P(x, y' \mid A, B) = \arg\max_{y'} \log P(x, y' \mid A, B)$
$\log P(x, y' \mid A, B) = \sum_{t=1}^{|x|} \log b_{y'_t, x_t} + \sum_{t=1}^{|x|+1} \log a_{y'_{t-1}, y'_t}$ (with $y'_0 = y'_{|x|+1} = \$$)
$= \sum_{i=1}^{N} \sum_{j=1}^{N} CT(y', i, j) \times \log a_{i,j} + \sum_{i=1}^{N} \sum_{k=1}^{M} CE(x, y', i, k) \times \log b_{i,k}$
where CT(y', i, j) is the number of times tag i is followed by tag j in the candidate y', and CE(x, y', i, k) is the number of times tag i corresponds to word k in (x, y').
But this is just a linear model!
Scoring: Example
(x, y') = "John carried a tin can ." tagged "NNP VBD DT NN NN ."
$w_{ML} = (\log b_{NNP, John},\ \log b_{NNP, Mary},\ \ldots,\ \log a_{NNP, VBD},\ \log a_{NN, .},\ \log a_{MD, .},\ \ldots)$
$\varphi(x, y') = (1,\ 0,\ \ldots,\ 1,\ 1,\ 0,\ \ldots)$
with one unary feature per tag-word pair (NNP:John, NNP:Mary, ...) whose value is CE(x, y', i, k), and one edge feature per tag pair (NNP-VBD, NN-., MD-., ...) whose value is CT(y', i, j).
Their inner product is exactly $\log P(x, y' \mid A, B)$:
$w_{ML} \cdot \varphi(x, y') = \sum_{i=1}^{N} \sum_{j=1}^{N} CT(y', i, j) \times \log a_{i,j} + \sum_{i=1}^{N} \sum_{k=1}^{M} CE(x, y', i, k) \times \log b_{i,k}$
But maybe there are other (and better?) ways to estimate $w$, especially when we know that the HMM is not a faithful model of reality?
It is not only a theoretical question! (We'll talk about that in a moment.)
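As a hedged sketch of this feature view (my own illustration, not from the slides): the feature map counts tag-word and tag-tag events, and with the weights set to the HMM log-probabilities the dot product recovers $\log P(x, y' \mid A, B)$. The function names and the tuple feature names are assumptions made for this sketch.

from collections import Counter

def hmm_features(words, tags):
    """Feature counts phi(x, y') for the HMM-as-linear-model view:
    unary tag-word features CE(x, y', i, k) and edge tag-tag features
    CT(y', i, j), with "$" as the boundary tag."""
    feats = Counter()
    padded = ["$"] + tags + ["$"]
    for prev, cur in zip(padded, padded[1:]):
        feats[("edge", prev, cur)] += 1
    for word, tag in zip(words, tags):
        feats[("unary", tag, word)] += 1
    return feats

def score(w, feats):
    """Linear score w . phi(x, y'); with w holding the HMM log-probabilities
    this equals log P(x, y' | A, B)."""
    return sum(w.get(f, 0.0) * v for f, v in feats.items())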
Feature View
Basically, we define features which correspond to edges in the graph.
[Graphical model: chain y(1) - y(2) - y(3) - y(4) - ... with x(1), x(2), x(3), x(4), ... attached; the x nodes are shaded because they are visible (both in training and testing)]
Generative Modeling
For a very large dataset (asymptotic analysis):
If the data is generated from some "true" HMM, then (if the training set is sufficiently large) we are guaranteed to have an optimal tagger.
Otherwise, the HMM will (generally) not correspond to an optimal linear classifier.
Discriminative methods, which minimize the error more directly, are guaranteed (under some fairly general conditions) to converge to an optimal linear classifier.
For smaller training sets: generative classifiers converge faster to their optimal error [Ng & Jordan, NIPS 01].
The real case: the HMM is a coarse approximation of reality.
[Figure: errors on a regression dataset (predicting housing prices in the Boston area) as a function of the number of training examples, comparing a discriminative classifier with a generative model]
Outline
Sequence labeling / segmentation problems: settings and example problems (part-of-speech tagging, named entity recognition, gesture recognition)
Hidden Markov Models: standard definition + maximum likelihood estimation; general view: as a representative of linear models
Perceptron and Structured Perceptron algorithms / motivations
Decoding with the linear model
Discussion: discriminative vs. generative
Perceptron
Let us start with a binary classification problem: $y \in \{+1, -1\}$.
For binary classification the prediction rule is $\hat{y} = \mathrm{sign}(w \cdot \varphi(x))$ (break ties (0) in some deterministic way).
Perceptron algorithm, given a training set $\{x^{(l)}, y^{(l)}\}_{l=1}^{L}$ and a learning rate $\eta > 0$:

w = 0                                       // initialize
do
    err = 0
    for l = 1 .. L                          // over the training examples
        if ( y(l) (w . phi(x(l))) < 0 )     // if mistake
            w += eta y(l) phi(x(l))         // update
            err++                           // count errors
        endif
    endfor
while ( err > 0 )                           // repeat until no errors
return w
Linear Classification
Linearly separable case, "a perfect" classifier:
[Figure: points of two classes in the plane with axes $\varphi(x)_1$ and $\varphi(x)_2$, separated by the hyperplane $(w \cdot \varphi(x) + b) = 0$ with normal vector $w$; adapted from Dan Roth's class at UIUC]
Linear functions are often written as $\hat{y} = \mathrm{sign}(w \cdot \varphi(x) + b)$, but we can drop the explicit bias $b$ by assuming that $\varphi(x)_0 = 1$ for any x.
Perceptron: Geometric Interpretation
(a sequence of slides illustrating the update step on a figure)

if ( y(l) (w . phi(x(l))) < 0 )     // if mistake
    w += eta y(l) phi(x(l))         // update
endif
Perceptron: Algebraic Interpretation

if ( y(l) (w . phi(x(l))) < 0 )     // if mistake
    w += eta y(l) phi(x(l))         // update
endif

We want the update to increase $y^{(l)} \big( w \cdot \varphi(x^{(l)}) \big)$; if the increase is large enough, there will be no misclassification.
Let's see that this is what happens after the update:
$y^{(l)} \big( (w + \eta y^{(l)} \varphi(x^{(l)})) \cdot \varphi(x^{(l)}) \big) = y^{(l)} \big( w \cdot \varphi(x^{(l)}) \big) + \eta (y^{(l)})^2 \big( \varphi(x^{(l)}) \cdot \varphi(x^{(l)}) \big)$
Here $(y^{(l)})^2 = 1$ and $\varphi(x^{(l)}) \cdot \varphi(x^{(l)})$ is a squared norm, hence $> 0$.
So the perceptron update moves the decision hyperplane towards the misclassified $\varphi(x^{(l)})$.
Perceptron
The perceptron algorithm, obviously, can only converge if the training set is linearly separable
If it is, it is guaranteed to converge in a finite number of iterations, depending on how well the two classes are separated (Novikoff, 1962).
Averaged Perceptron
A small modification: do not run until convergence, and return the average of all intermediate weight vectors.

w = 0, wP = 0                               // initialize
for k = 1 .. K                              // for a fixed number of iterations
    for l = 1 .. L                          // over the training examples
        if ( y(l) (w . phi(x(l))) < 0 )     // if mistake
            w += eta y(l) phi(x(l))         // update, eta > 0
        endif
        wP += w                             // sum of w over the course of training (note: after the endif)
    endfor
endfor
return wP / (K L)                           // averaged weight vector

More stable in training: a vector w which survived more iterations without updates is more similar to the resulting vector wP / (K L), as it was added a larger number of times.
Structured Perceptron
Now consider the structured problem: $\hat{y} = \arg\max_{y' \in \mathcal{Y}(x)} w \cdot \varphi(x, y')$.
Perceptron algorithm, given a training set $\{x^{(l)}, y^{(l)}\}_{l=1}^{L}$:

w = 0                                                        // initialize
do
    err = 0
    for l = 1 .. L                                           // over the training examples
        y_hat = argmax_{y' in Y(x(l))} w . phi(x(l), y')     // model prediction
        if ( w . phi(x(l), y_hat) > w . phi(x(l), y(l)) )    // if mistake
            w += eta ( phi(x(l), y(l)) - phi(x(l), y_hat) )  // update
            err++                                            // count errors
        endif
    endfor
while ( err > 0 )                                            // repeat until no errors
return w

The update pushes the correct sequence up and the incorrectly predicted one down.
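A hedged sketch of the structured perceptron for tagging (my own illustration, not from the slides). It assumes the hmm_features map from the earlier sketch and a decode(words, w) function (e.g., the Viterbi sketch below), and it uses the common variant that updates whenever the predicted sequence differs from the gold one.

from collections import defaultdict

def structured_perceptron(corpus, decode, n_iters=10, eta=1.0):
    """Structured perceptron for sequence labeling.

    corpus: list of (words, gold_tags) pairs
    decode: function (words, w) -> highest-scoring tag sequence under weights w
    Weights are a dict from feature to value; features come from hmm_features
    (tag-word and tag-tag counts), as in the earlier sketch."""
    w = defaultdict(float)
    for _ in range(n_iters):
        for words, gold_tags in corpus:
            pred_tags = decode(words, w)              # model prediction
            if pred_tags != gold_tags:                # if mistake
                for f, v in hmm_features(words, gold_tags).items():
                    w[f] += eta * v                   # push the correct sequence up
                for f, v in hmm_features(words, pred_tags).items():
                    w[f] -= eta * v                   # push the predicted one down
    return dict(w)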
Structured Perceptron: Algebraic Interpretation

if ( w . phi(x(l), y_hat) > w . phi(x(l), y(l)) )    // if mistake
    w += eta ( phi(x(l), y(l)) - phi(x(l), y_hat) )  // update
endif

We want the update to increase $w \cdot \big( \varphi(x^{(l)}, y^{(l)}) - \varphi(x^{(l)}, \hat{y}) \big)$; if the increase is large enough, $y^{(l)}$ will be scored above $\hat{y}$.
Clearly this is achieved, as this product is increased by $\eta \, \| \varphi(x^{(l)}, y^{(l)}) - \varphi(x^{(l)}, \hat{y}) \|^2$.
There might be other $y' \in \mathcal{Y}(x^{(l)})$ scored above $y^{(l)}$, but we will deal with them in the next iterations.
Structured Perceptron
Positives:
Very easy to implement
Often achieves respectable results
Like other discriminative techniques, it does not make assumptions about the generative process
Additional features can be integrated easily, as long as decoding is tractable
Drawbacks:
"Good" discriminative algorithms should optimize some measure which is closely related to the expected testing results: what the perceptron is doing on non-linearly separable data is not clear
However, for the averaged (voted) version there is a generalization bound which characterizes its generalization properties (Freund & Schapire, 98)
Later, we will consider more advanced learning algorithms.
Outline
Sequence labeling / segmentation problems: settings and example problems (part-of-speech tagging, named entity recognition, gesture recognition)
Hidden Markov Models: standard definition + maximum likelihood estimation; general view: as a representative of linear models
Perceptron and Structured Perceptron algorithms / motivations
Decoding with the linear model
Discussion: discriminative vs. generative
Decoding with the Linear Model
Decoding: $\hat{y} = \arg\max_{y' \in \mathcal{Y}(x)} w \cdot \varphi(x, y')$
Again a linear model with the following edge features (a generalization of an HMM).
In fact, the algorithm does not depend on the features of the input (they do not need to be local).
[Graphical model: chain y(1) - y(2) - y(3) - y(4) - ..., first drawn with each y(t) connected to its own x(t), then with every y(t) connected to the entire input x]
Decoding with the Linear Model
Decoding: $\hat{y} = \arg\max_{y' \in \mathcal{Y}(x)} w \cdot \varphi(x, y')$
Let's change notation: edge scores $f_t(y_{t-1}, y_t, x)$, which for the HMM roughly correspond to $\log a_{y_{t-1}, y_t} + \log b_{y_t, x_t}$.
Defined for t = 0 too ("start" feature: $y_0 = \$$); start/stop symbol information ($) can be encoded with them as well.
Decode: $\hat{y} = \arg\max_{y' \in \mathcal{Y}(x)} \sum_{t=1}^{|x|} f_t(y'_{t-1}, y'_t, x)$
Decoding: a dynamic programming algorithm, the Viterbi algorithm.
Viterbi Algorithm
Decoding: $\hat{y} = \arg\max_{y' \in \mathcal{Y}(x)} \sum_{t=1}^{|x|} f_t(y'_{t-1}, y'_t, x)$
Loop invariant (for $t = 1, \ldots, |x|$): score_t[y] is the score of the highest-scoring sequence up to position t that ends with tag y, and prev_t[y] is the previous tag on this sequence.
Init: score_0[$] = 0, score_0[y] = $-\infty$ for all other y
Recomputation (for $t = 1, \ldots, |x|$):
$\mathrm{prev}_t[y] = \arg\max_{y'} \, \mathrm{score}_{t-1}[y'] + f_t(y', y, x)$
$\mathrm{score}_t[y] = \mathrm{score}_{t-1}[\mathrm{prev}_t[y]] + f_t(\mathrm{prev}_t[y], y, x)$
Return: retrace the prev pointers starting from $\arg\max_y \mathrm{score}_{|x|}[y]$
Time complexity: $O(N^2 |x|)$
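The recursion above might look as follows in code. This is my own sketch (not from the slides), and the edge-score interface f(t, y_prev, y, words) is an assumption made for the example.

def viterbi(words, tags, f):
    """Viterbi decoding: argmax over tag sequences of sum_t f(t, y_{t-1}, y_t, x).

    words: the input sequence x
    tags:  the tag set Y (not including the boundary tag "$")
    f:     edge-score function f(t, y_prev, y, words) for positions t = 1..|x|
    Runs in O(N^2 |x|) time."""
    score = {"$": 0.0}   # score_0: only the boundary tag is reachable
    back = []            # back[t-1][y] = best previous tag for tag y at position t
    for t in range(1, len(words) + 1):
        new_score, prev = {}, {}
        for y in tags:
            best = max(score, key=lambda yp: score[yp] + f(t, yp, y, words))
            prev[y] = best
            new_score[y] = score[best] + f(t, best, y, words)
        back.append(prev)
        score = new_score
    # Retrace the prev pointers starting from the best final tag.
    y = max(score, key=score.get)
    out = [y]
    for prev in reversed(back[1:]):
        y = prev[y]
        out.append(y)
    return list(reversed(out))

For the HMM of the earlier slides, f(t, y_prev, y, words) would return $\log a_{y_{prev}, y} + \log b_{y, x_t}$; for the structured perceptron it is the sum of the weights of the edge and unary features active at position t.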
Outline
Sequence labeling / segmentation problems: settings and example problems (part-of-speech tagging, named entity recognition, gesture recognition)
Hidden Markov Models: standard definition + maximum likelihood estimation; general view: as a representative of linear models
Perceptron and Structured Perceptron algorithms / motivations
Decoding with the linear model
Discussion: discriminative vs. generative
Recap: Sequence Labeling
Hidden Markov Models: how to estimate them
Discriminative models: how to learn with the structured perceptron
Both learning algorithms result in a linear model
How to label with the linear models (Viterbi decoding)
Discriminative vs Generative
Generative models:
Cheap to estimate: simply normalized counts
Hard to integrate complex features: need to come up with a generative story, and this story may be wrong
Do not result in an optimal classifier when the model assumptions are wrong (i.e., always)
Discriminative models:
More expensive to learn: need to run decoding (here, Viterbi) during training, usually multiple times per example
Easy to integrate features: though some features may make decoding intractable
Usually less accurate on small datasets
(Not necessarily the case for generative models with latent variables)
Reminders
Speakers: slides are due about a week before the talk; meetings with me before/after this point will normally be needed
Reviewers: reviews are accepted only before the day we consider the topic
Everyone: references to the papers to read are at GoogleDocs
These slides (and previous ones) will be online today; speakers: send me the final version of your slides too
Next time: Lea about Models of Parsing, PCFGs vs general WCFGs (Michael Collins' book chapter)