CS388: Natural Language Processing
Greg Durrett
Lecture 5: CRFs
Administrivia
‣ Project 1 is out, sample writeups on website
‣ Mini 1 grading underway
Recall: HMMs
‣ Observations O (= input x); output Q (sequence of states) = labels y
[Figure: chain y1 → y2 → … → yn, with each yi emitting xi]
‣ Joint model: P(y, x) = P(y_1) ∏_{i=2}^{n} P(y_i | y_{i-1}) ∏_{i=1}^{n} P(x_i | y_i)
‣ Inference problem: argmax_y P(y | x) = argmax_y P(y, x) / P(x)
‣ Viterbi: score_i(s) = max_{y_{i-1}} P(s | y_{i-1}) P(x_i | s) score_{i-1}(y_{i-1})
‣ Training: maximum likelihood estimation (with smoothing)
Recall: Viterbi Algorithm
‣ a_0: initial state distribution; a_ij: probability of i→j transition; b_j(o_t): probability of emitting symbol o_t from state j
slide credit: Ray Mooney
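The recurrence score_i(s) = max_{y_{i-1}} P(s | y_{i-1}) P(x_i | s) score_{i-1}(y_{i-1}) can be implemented directly. A minimal sketch with backpointers; the dictionary-based representation is illustrative, not from the lecture:

```python
def viterbi(obs, states, init, trans, emit):
    """Most likely state sequence under an HMM.
    init[s] = P(y1 = s), trans[sp][s] = P(s | sp), emit[s][o] = P(o | s)."""
    # score[s] holds score_i(s); back remembers the argmax predecessor at each step
    score = {s: init[s] * emit[s][obs[0]] for s in states}
    back = []
    for o in obs[1:]:
        prev, score, ptr = score, {}, {}
        for s in states:
            # score_i(s) = max_{y_{i-1}} P(s | y_{i-1}) P(x_i | s) score_{i-1}(y_{i-1})
            best = max(states, key=lambda sp: prev[sp] * trans[sp][s])
            ptr[s] = best
            score[s] = prev[best] * trans[best][s] * emit[s][o]
        back.append(ptr)
    # follow backpointers from the best final state
    path = [max(states, key=lambda s: score[s])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path))
```

For real-length sequences the products underflow; store log probabilities and replace products with sums (see the implementation tips at the end).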
Viterbi/HMMs: Other Resources
‣ Lecture notes from my undergrad course (posted online)
‣ Eisenstein Chapter 7.3, but the notation covers a more general case than what's discussed for HMMs
‣ Jurafsky + Martin 8.4.5
This Lecture
‣ Named entity recognition (NER)
‣ CRFs: model (+ features for NER), inference, learning
‣ (if time) Beam search
Named Entity Recognition

Barack Obama will travel to Hangzhou today for the G20 meeting.
B-PER  I-PER  O  O  O  B-LOC  O  O  O  B-ORG  O
(PERSON: Barack Obama; LOC: Hangzhou; ORG: G20)
‣ BIO tagset: begin, inside, outside
‣ Sequence of tags: should we use an HMM?
‣ Why might an HMM not do so well here?
‣ Lots of O's
‣ Insufficient features/capacity with multinomials (especially for unks)
CRFs

Where we're going
‣ Flexible discriminative model for tagging tasks that can use arbitrary features of the input. Similar to logistic regression, but structured

Barack Obama will travel to Hangzhou today for the G20 meeting.
B-PER  I-PER
Curr_word=Barack & Label=B-PER
Next_word=Obama & Label=B-PER
Curr_word_starts_with_capital=True & Label=B-PER
Posn_in_sentence=1st & Label=B-PER
Label=B-PER & Next-Label=I-PER
…
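Indicator features like these are typically generated by a feature-extraction function over (position, label) pairs. A sketch; the feature-name conventions below are illustrative stand-ins for the slide's:

```python
def extract_emission_features(words, i, label):
    """Indicator features for tagging position i of `words` with `label`.
    Feature names here are illustrative, not the exact ones from the slide."""
    w = words[i]
    feats = [
        f"Curr_word={w}&Label={label}",
        f"Curr_word_starts_with_capital={w[0].isupper()}&Label={label}",
        f"Posn_in_sentence={i}&Label={label}",
    ]
    if i + 1 < len(words):  # context feature: the next word
        feats.append(f"Next_word={words[i+1]}&Label={label}")
    return feats
```

Each feature is just a string key into a sparse weight vector; the model never enumerates the feature space up front.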
HMMs, Formally
‣ HMMs are expressible as Bayes nets (factor graphs)
[Figure: chain y1 → y2 → … → yn, with each yi emitting xi]
‣ This reflects the following decomposition:
P(y, x) = P(y_1) P(x_1 | y_1) P(y_2 | y_1) P(x_2 | y_2) …
‣ Locally normalized model: each factor is a probability distribution that normalizes
Conditional Random Fields
‣ HMMs: P(y, x) = P(y_1) P(x_1 | y_1) P(y_2 | y_1) P(x_2 | y_2) …
‣ CRFs: discriminative models with the following globally-normalized form:
P(y | x) = (1/Z) ∏_k exp(ϕ_k(x, y))
where Z is the normalizer and each ϕ_k can be any real-valued scoring function of its arguments
‣ Special case: linear feature-based potentials ϕ_k(x, y) = w^T f_k(x, y):
P(y | x) = (1/Z) exp( ∑_{k=1}^{n} w^T f_k(x, y) )
‣ Looks like our single weight vector multiclass logistic regression model
HMMs vs. CRFs
‣ Naive Bayes : logistic regression :: HMMs : CRFs
local vs. global normalization <-> generative vs. discriminative
(locally normalized discriminative models do exist (MEMMs))
P(y | x) = (1/Z) exp( ∑_{k=1}^{n} w^T f_k(x, y) )
[Figure: factor graph with tags y1, y2 over x1, x2, x3 and feature factors f1, f2, f3]
‣ HMMs: in the standard setup, emissions consider one word at a time
‣ CRFs: features over many words simultaneously, non-independent features (e.g., suffixes and prefixes), doesn't have to be a generative model
‣ Conditional model: x's are observed
Problem with CRFs
P(y | x) = (1/Z) exp( ∑_{k=1}^{n} w^T f_k(x, y) )
[Figure: factor graph with tags y1, y2 over x1, x2, x3 and feature factors f1, f2, f3]
‣ Normalizing constant: Z = ∑_{y'} exp( ∑_{k=1}^{n} w^T f_k(x, y') )
‣ Inference: y_best = argmax_{y'} exp( ∑_{k=1}^{n} w^T f_k(x, y') )
‣ If y consists of 5 variables with 30 values each, how expensive are these? (Naively, both the sum and the argmax range over all 30^5 ≈ 24 million label sequences)
‣ Need to constrain the form of our CRFs to make it tractable
Sequential CRFs
‣ Notation: omit x from the factor graph entirely (implicit), but every feature function connects to it
[Figure: chain y1 — y2 — … — yn with transition factors ϕ_t between adjacent tags and an emission factor ϕ_e on each tag]
‣ Two types of factors: transitions (look at adjacent y's, but not x) and emissions (look at y and all of x)
Sequential CRF (one form):
P(y | x) = (1/Z) ∏_{i=2}^{n} exp(ϕ_t(y_{i-1}, y_i)) ∏_{i=1}^{n} exp(ϕ_e(y_i, i, x))
Features for NER

Feature Functions
[Figure: chain y1 — y2 — … — yn with factors ϕ_t and ϕ_e]
‣ Phis are flexible (can be an NN with 1B+ parameters). Here: sparse linear functions
ϕ_t(y_{i-1}, y_i) = w^T f_t(y_{i-1}, y_i)
ϕ_e(y_i, i, x) = w^T f_e(y_i, i, x)
P(y | x) = (1/Z) ∏_{i=2}^{n} exp(ϕ_t(y_{i-1}, y_i)) ∏_{i=1}^{n} exp(ϕ_e(y_i, i, x))
P(y | x) ∝ exp w^T [ ∑_{i=2}^{n} f_t(y_{i-1}, y_i) + ∑_{i=1}^{n} f_e(y_i, i, x) ]
(looks like Mini 1 features)
Basic Features for NER

Barack Obama will travel to Hangzhou today for the G20 meeting.
(tags on "to Hangzhou": O  B-LOC)
P(y | x) ∝ exp w^T [ ∑_{i=2}^{n} f_t(y_{i-1}, y_i) + ∑_{i=1}^{n} f_e(y_i, i, x) ]
Transitions: f_t(y_{i-1}, y_i) = Ind[y_{i-1} & y_i], e.g. = Ind[O & B-LOC]
Emissions: f_e(y_6, 6, x) = Ind[B-LOC & Current word = Hangzhou], Ind[B-LOC & Prev word = to]
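With these linear potentials, scoring a labeled sentence is just summing the weights of the indicator features that fire. A sketch using hypothetical feature-name conventions (only a current-word emission and a transition feature, to keep it short):

```python
def sequence_score(weights, words, labels):
    """w^T [sum_i f_t(y_{i-1}, y_i) + sum_i f_e(y_i, i, x)] for one labeling.
    `weights` maps sparse feature names (illustrative) to their weights;
    unseen features contribute 0."""
    score = 0.0
    for i, (w, y) in enumerate(zip(words, labels)):
        score += weights.get(f"Curr_word={w}&Label={y}", 0.0)       # emission f_e
        if i > 0:
            score += weights.get(f"Trans={labels[i-1]}->{y}", 0.0)  # transition f_t
    return score
```

Exponentiating this score (and dividing by Z) would give P(y | x); for decoding and learning we only ever need the scores themselves.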
Emission Features for NER
ϕ_e(y_i, i, x)
Leicestershire is a nice place to visit… → LOC
I took a vacation to Boston → LOC
Apple released a new version… → ORG
According to the New York Times… → ORG
Texas governor Greg Abbott said → LOC, PER
Leonardo DiCaprio won an award… → PER
Emission Features for NER
‣ Context features (can't use in HMM!)
‣ Words before/after
‣ Tags before/after
‣ Word features (can use in HMM)
‣ Capitalization
‣ Word shape
‣ Prefixes/suffixes
‣ Lexical indicators
‣ Gazetteers
‣ Word clusters
(slide examples: Leicestershire, Boston, "Apple released a new version…", "According to the New York Times…")
CRFs Outline
‣ Model: P(y | x) = (1/Z) ∏_{i=2}^{n} exp(ϕ_t(y_{i-1}, y_i)) ∏_{i=1}^{n} exp(ϕ_e(y_i, i, x))
P(y | x) ∝ exp w^T [ ∑_{i=2}^{n} f_t(y_{i-1}, y_i) + ∑_{i=1}^{n} f_e(y_i, i, x) ]
‣ Inference
‣ Learning
Inference and Learning in CRFs

Computing (arg)maxes
[Figure: chain y1 — y2 — … — yn with factors ϕ_t and ϕ_e]
P(y | x) = (1/Z) ∏_{i=2}^{n} exp(ϕ_t(y_{i-1}, y_i)) ∏_{i=1}^{n} exp(ϕ_e(y_i, i, x))
‣ argmax_y P(y | x): can use Viterbi exactly as in the HMM case
‣ exp(ϕ_t(y_{i-1}, y_i)) and exp(ϕ_e(y_i, i, x)) play the role of the Ps now, same dynamic program:
max_{y_1,…,y_n} e^{ϕ_t(y_{n-1},y_n)} e^{ϕ_e(y_n,n,x)} ··· e^{ϕ_e(y_2,2,x)} e^{ϕ_t(y_1,y_2)} e^{ϕ_e(y_1,1,x)}
= max_{y_2,…,y_n} e^{ϕ_t(y_{n-1},y_n)} e^{ϕ_e(y_n,n,x)} ··· e^{ϕ_e(y_2,2,x)} max_{y_1} e^{ϕ_t(y_1,y_2)} e^{ϕ_e(y_1,1,x)}
= max_{y_3,…,y_n} e^{ϕ_t(y_{n-1},y_n)} e^{ϕ_e(y_n,n,x)} ··· max_{y_2} e^{ϕ_t(y_2,y_3)} e^{ϕ_e(y_2,2,x)} max_{y_1} e^{ϕ_t(y_1,y_2)} score_1(y_1)
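Because only the argmax matters, the exp's can be dropped and the same dynamic program run directly on the log-potentials. A sketch assuming ϕ_e and ϕ_t are given as precomputed score tables (an assumed representation, not from the lecture):

```python
def crf_viterbi(phi_e, phi_t):
    """argmax_y sum_i phi_e[i][y_i] + sum_{i>=2} phi_t[y_{i-1}][y_i].
    phi_e: n x S table of emission log-potentials;
    phi_t: S x S table of transition log-potentials. Returns tag indices."""
    n, S = len(phi_e), len(phi_e[0])
    score = list(phi_e[0])          # score_1(s) = phi_e(s, 1, x)
    back = []
    for i in range(1, n):
        new, ptr = [], []
        for s in range(S):
            # score_i(s) = max_{s'} score_{i-1}(s') + phi_t(s', s), plus emission
            best = max(range(S), key=lambda sp: score[sp] + phi_t[sp][s])
            ptr.append(best)
            new.append(score[best] + phi_t[best][s] + phi_e[i][s])
        score = new
        back.append(ptr)
    path = [max(range(S), key=lambda s: score[s])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path))
```

Working in log space also sidesteps the underflow issue that the HMM version has with long products of probabilities.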
Inference in General CRFs
[Figure: chain y1 — y2 — … — yn with factors ϕ_t and ϕ_e]
‣ Can do efficient inference in any tree-structured CRF
‣ Max-product algorithm: generalization of Viterbi to arbitrary tree-structured graphs (sum-product is the generalization of forward-backward)
CRFs Outline
‣ Model: P(y | x) = (1/Z) ∏_{i=2}^{n} exp(ϕ_t(y_{i-1}, y_i)) ∏_{i=1}^{n} exp(ϕ_e(y_i, i, x))
P(y | x) ∝ exp w^T [ ∑_{i=2}^{n} f_t(y_{i-1}, y_i) + ∑_{i=1}^{n} f_e(y_i, i, x) ]
‣ Inference: argmax P(y|x) from Viterbi
‣ Learning
Training CRFs
‣ Logistic regression: P(y | x) ∝ exp w^T f(x, y)
‣ CRF: P(y | x) ∝ exp w^T [ ∑_{i=2}^{n} f_t(y_{i-1}, y_i) + ∑_{i=1}^{n} f_e(y_i, i, x) ]
‣ Maximize L(y*, x) = log P(y* | x)
‣ Gradient is completely analogous to logistic regression:
∂/∂w L(y*, x) = ∑_{i=2}^{n} f_t(y*_{i-1}, y*_i) + ∑_{i=1}^{n} f_e(y*_i, i, x) − E_y[ ∑_{i=2}^{n} f_t(y_{i-1}, y_i) + ∑_{i=1}^{n} f_e(y_i, i, x) ]
(the expectation ranges over all label sequences y: looks intractable!)
Training CRFs
∂/∂w L(y*, x) = ∑_{i=2}^{n} f_t(y*_{i-1}, y*_i) + ∑_{i=1}^{n} f_e(y*_i, i, x) − E_y[ ∑_{i=2}^{n} f_t(y_{i-1}, y_i) + ∑_{i=1}^{n} f_e(y_i, i, x) ]
‣ Let's focus on the emission feature expectation:
E_y[ ∑_{i=1}^{n} f_e(y_i, i, x) ] = ∑_{y ∈ Y} P(y | x) [ ∑_{i=1}^{n} f_e(y_i, i, x) ]
= ∑_{i=1}^{n} ∑_{y ∈ Y} P(y | x) f_e(y_i, i, x)
= ∑_{i=1}^{n} ∑_s P(y_i = s | x) f_e(s, i, x)
Forward-Backward Algorithm
‣ How do we compute these marginals P(y_i = s | x)?
P(y_i = s | x) = ∑_{y_1,…,y_{i-1},y_{i+1},…,y_n} P(y | x)
‣ What did Viterbi compute? P(y_max | x) = max_{y_1,…,y_n} P(y | x)
‣ Can compute marginals with dynamic programming as well, using forward-backward
Forward-Backward Algorithm
slide credit: Dan Klein
P(y_3 = 2 | x) = (sum of all paths through state 2 at time 3) / (sum of all paths)
‣ Easiest and most flexible to do one pass to compute the forward quantities (α) and one to compute the backward quantities (β)
Forward-Backward Algorithm
‣ Initial: α_1(s) = exp(ϕ_e(s, 1, x))
‣ Recurrence: α_t(s_t) = ∑_{s_{t-1}} α_{t-1}(s_{t-1}) exp(ϕ_t(s_{t-1}, s_t)) exp(ϕ_e(s_t, t, x))
‣ Same as Viterbi but summing instead of maxing!
‣ These quantities get very small! Store everything as log probabilities
Forward-Backward Algorithm
‣ Initial: β_n(s) = 1
‣ Recurrence: β_t(s_t) = ∑_{s_{t+1}} β_{t+1}(s_{t+1}) exp(ϕ_t(s_t, s_{t+1})) exp(ϕ_e(s_{t+1}, t+1, x))
‣ Big difference from the forward pass: count the emission for the next timestep (not the current one)
Forward-Backward Algorithm
α_1(s) = exp(ϕ_e(s, 1, x));  α_t(s_t) = ∑_{s_{t-1}} α_{t-1}(s_{t-1}) exp(ϕ_t(s_{t-1}, s_t)) exp(ϕ_e(s_t, t, x))
β_n(s) = 1;  β_t(s_t) = ∑_{s_{t+1}} β_{t+1}(s_{t+1}) exp(ϕ_t(s_t, s_{t+1})) exp(ϕ_e(s_{t+1}, t+1, x))
P(s_3 = 2 | x) = α_3(2) β_3(2) / ∑_i α_3(i) β_3(i)
‣ What is the denominator here? P(x)
‣ Does this explain why beta is what it is?
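Putting the α and β recurrences together gives the tag marginals. A sketch in probability space for clarity (real implementations should work in log space, as noted above); ϕ_e and ϕ_t are assumed to be given as score tables:

```python
import math

def forward_backward(phi_e, phi_t):
    """Tag marginals P(y_i = s | x) for a linear-chain CRF.
    phi_e[i][s] and phi_t[sp][s] are log-potentials."""
    n, S = len(phi_e), len(phi_e[0])
    alpha = [[0.0] * S for _ in range(n)]
    beta = [[0.0] * S for _ in range(n)]
    for s in range(S):
        alpha[0][s] = math.exp(phi_e[0][s])   # alpha_1(s) = exp(phi_e(s, 1, x))
        beta[n - 1][s] = 1.0                  # beta_n(s) = 1
    for i in range(1, n):                     # forward recurrence
        for s in range(S):
            alpha[i][s] = sum(alpha[i-1][sp] * math.exp(phi_t[sp][s])
                              for sp in range(S)) * math.exp(phi_e[i][s])
    for i in range(n - 2, -1, -1):            # backward: emission of the NEXT step
        for s in range(S):
            beta[i][s] = sum(beta[i+1][sn] * math.exp(phi_t[s][sn])
                             * math.exp(phi_e[i+1][sn]) for sn in range(S))
    # P(y_i = s | x) = alpha_i(s) beta_i(s) / sum_{s'} alpha_i(s') beta_i(s')
    marginals = []
    for i in range(n):
        prods = [alpha[i][s] * beta[i][s] for s in range(S)]
        Z = sum(prods)
        marginals.append([p / Z for p in prods])
    return marginals
```

A useful sanity check (used again in the debugging tips): ∑_s α_i(s) β_i(s) equals the same constant Z at every position i.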
Computing Marginals
[Figure: chain y1 — y2 — … — yn with factors ϕ_t and ϕ_e]
P(y | x) = (1/Z) ∏_{i=2}^{n} exp(ϕ_t(y_{i-1}, y_i)) ∏_{i=1}^{n} exp(ϕ_e(y_i, i, x))
‣ Normalizing constant: Z = ∑_y ∏_{i=2}^{n} exp(ϕ_t(y_{i-1}, y_i)) ∏_{i=1}^{n} exp(ϕ_e(y_i, i, x))
(Z for CRFs, P(x) for HMMs)
‣ For both HMMs and CRFs:
P(y_i = s | x) = forward_i(s) backward_i(s) / ∑_{s'} forward_i(s') backward_i(s')
‣ The denominator is analogous to P(x) for HMMs
Posteriors vs. Probabilities
P(y_i = s | x) = forward_i(s) backward_i(s) / ∑_{s'} forward_i(s') backward_i(s')
‣ Posterior is derived from the parameters and the data (conditioned on x!)
‣ HMM: P(x_i | y_i) and P(y_i | y_{i-1}) are model parameters (usually multinomial distributions); P(y_i | x) and P(y_{i-1}, y_i | x) are inferred quantities from forward-backward
‣ CRF: P(x_i | y_i) and P(y_i | y_{i-1}) are undefined (the model is by definition conditioned on x); P(y_i | x) and P(y_{i-1}, y_i | x) are inferred quantities from forward-backward
Training CRFs
‣ For emission features: gold features − expected features under model
∂/∂w L(y*, x) = ∑_{i=1}^{n} f_e(y*_i, i, x) − ∑_{i=1}^{n} ∑_s P(y_i = s | x) f_e(s, i, x)
‣ Transition features: need to compute P(y_i = s_1, y_{i+1} = s_2 | x), using forward-backward as well
‣ …but you can build a pretty good system without learned transition features (use heuristic weights, or just enforce constraints like B-PER -> I-ORG is illegal)
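The emission-feature gradient (gold features minus expected features) can be accumulated sparsely. A sketch; the marginals are assumed to come from forward-backward, and `feat_fn` is any function returning indicator feature names (illustrative):

```python
def emission_gradient(words, gold_labels, labels, marginals, feat_fn):
    """Gradient of log P(y* | x) w.r.t. w, restricted to emission features:
    sum_i f_e(y*_i, i, x) - sum_i sum_s P(y_i = s | x) f_e(s, i, x).
    `marginals[i][s]` is P(y_i = s | x); sparse features have weight 1 each."""
    grad = {}
    for i, y_star in enumerate(gold_labels):
        for f in feat_fn(words, i, y_star):        # gold features: +1
            grad[f] = grad.get(f, 0.0) + 1.0
        for s, label in enumerate(labels):         # expected features: -P(y_i = s | x)
            for f in feat_fn(words, i, label):
                grad[f] = grad.get(f, 0.0) - marginals[i][s]
    return grad
```

Note that a feature fired by the gold label mostly cancels: its net gradient is 1 − P(gold label at that position), which shrinks toward 0 as the model gets the position right.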
CRFs Outline
‣ Model: P(y | x) = (1/Z) ∏_{i=2}^{n} exp(ϕ_t(y_{i-1}, y_i)) ∏_{i=1}^{n} exp(ϕ_e(y_i, i, x))
P(y | x) ∝ exp w^T [ ∑_{i=2}^{n} f_t(y_{i-1}, y_i) + ∑_{i=1}^{n} f_e(y_i, i, x) ]
‣ Inference: argmax P(y|x) from Viterbi
‣ Learning: run forward-backward to compute posterior probabilities; then
∂/∂w L(y*, x) = ∑_{i=1}^{n} f_e(y*_i, i, x) − ∑_{i=1}^{n} ∑_s P(y_i = s | x) f_e(s, i, x)
Pseudocode

for each epoch
    for each example
        extract features on each emission and transition (lookup in cache)
        compute potentials phi based on features + weights
        compute marginal probabilities with forward-backward
        accumulate gradient over all emissions and transitions
Implementation Tips for CRFs
‣ Caching is your friend! Cache feature vectors especially
‣ Try to reduce redundant computation, e.g. if you compute both the gradient and the objective value, don't rerun the dynamic program
‣ If things are too slow, run a profiler and see where time is being spent. Forward-backward should take most of the time
‣ Exploit sparsity where possible, especially in feature vectors and gradients
‣ Do all dynamic program computation in log space to avoid underflow
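The standard log-space primitive for the sums in forward-backward is logsumexp; a minimal implementation:

```python
import math

def logsumexp(xs):
    """log(sum_i exp(x_i)), computed stably: shift by the max so the largest
    exponent becomes exp(0) = 1 and nothing underflows to zero."""
    m = max(xs)
    if m == float("-inf"):   # all terms are exp(-inf) = 0
        return m
    return m + math.log(sum(math.exp(x - m) for x in xs))
```

With this, α and β are stored as logs and every "sum of products" in the recurrences becomes a logsumexp of sums.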
Debugging Tips for CRFs
‣ Hard to know whether inference, learning, or the model is broken!
‣ Compute the objective: is optimization working?
‣ Learning: is the objective going down? Try to fit 1 example / 10 examples. Are you applying the gradient correctly?
‣ Inference: check the gradient computation (most likely place for bugs)
‣ Is ∑_s forward_i(s) backward_i(s) the same for all i?
‣ Do probabilities normalize correctly + look "reasonable"? (Nearly uniform when untrained, then slowly converging to the right thing)
‣ If the objective is going down but model performance is bad:
‣ Inference: check performance if you decode the training set