Conditional Random Fields

Advanced Statistical Methods in NLP
Ling 572

February 9, 2012

Roadmap

Graphical models
- Modeling independence
- Models revisited
- Generative & discriminative models

Conditional random fields
- Linear-chain models
- Skip-chain models

Preview

Conditional random fields
- Undirected graphical model
  - Due to Lafferty, McCallum, and Pereira, 2001
- Discriminative model
  - Supports integration of rich feature sets
- Allows a range of dependency structures
  - Linear-chain, skip-chain, general
  - Can encode long-distance dependencies
- Used in diverse NLP sequence labeling tasks:
  - Named entity recognition, coreference resolution, etc.

Graphical Models

Graphical model
- Simple, graphical notation for conditional independence
- Probabilistic model where:
  - Graph structure denotes conditional independence between random variables
  - Nodes: random variables
  - Edges: dependency relations between random variables
- Model types:
  - Bayesian Networks
  - Markov Random Fields

Modeling (In)dependence

Bayesian network
- Directed acyclic graph (DAG)
  - Nodes = random variables
  - Arc ~ directly influences, conditional dependency
- Arcs = child depends on parent(s)
  - No arcs = independent (0 incoming: only a priori)
  - Parents of X = π(X)
  - For each X, need P(X | π(X))

Example I

[Figure: example Bayesian network from Russell & Norvig, AIMA]


Simple Bayesian Network (MCBN1)

Graph: A → B, A → C, B → D, C → D, C → E

Node   Depends on      Need        Truth table size
A      only a priori   P(A)        2
B      A               P(B|A)      2*2
C      A               P(C|A)      2*2
D      B, C            P(D|B,C)    2*2*2
E      C               P(E|C)      2*2

Holmes Example (Pearl)

Holmes is worried that his house will be burgled. For the time period of interest, there is a 10^-4 a priori chance of this happening, and Holmes has installed a burglar alarm to try to forestall this event. The alarm is 95% reliable in sounding when a burglary happens, but also has a false positive rate of 1%. Holmes’ neighbor, Watson, is 90% sure to call Holmes at his office if the alarm sounds, but he is also a bit of a practical joker and, knowing Holmes’ concern, might (30%) call even if the alarm is silent. Holmes’ other neighbor Mrs. Gibbons is a well-known lush and often befuddled, but Holmes believes that she is four times more likely to call him if there is an alarm than not.

Holmes Example: Model

There are four binary random variables:
- B: whether Holmes’ house has been burgled
- A: whether his alarm sounded
- W: whether Watson called
- G: whether Gibbons called

Graph: B → A, A → W, A → G

Holmes Example: Tables

P(B):
B=#t     B=#f
0.0001   0.9999

P(A|B):
B     A=#t   A=#f
#t    0.95   0.05
#f    0.01   0.99

P(W|A):
A     W=#t   W=#f
#t    0.90   0.10
#f    0.30   0.70

P(G|A):
A     G=#t   G=#f
#t    0.40   0.60
#f    0.10   0.90
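The tables above are enough to answer queries such as "how likely is a burglary, given that Watson called?". The following is a minimal sketch of inference by enumeration, assuming the network structure implied by the tables (A depends on B; W and G depend on A). The function names are illustrative, not part of the original slides.

```python
from itertools import product

# Conditional probability tables copied from the slide (True = #t, False = #f).
P_B = {True: 0.0001, False: 0.9999}
P_A_GIVEN_B = {True: {True: 0.95, False: 0.05}, False: {True: 0.01, False: 0.99}}
P_W_GIVEN_A = {True: {True: 0.90, False: 0.10}, False: {True: 0.30, False: 0.70}}
P_G_GIVEN_A = {True: {True: 0.40, False: 0.60}, False: {True: 0.10, False: 0.90}}


def joint(b, a, w, g):
    """P(B=b, A=a, W=w, G=g) = P(b) P(a|b) P(w|a) P(g|a)."""
    return P_B[b] * P_A_GIVEN_B[b][a] * P_W_GIVEN_A[a][w] * P_G_GIVEN_A[a][g]


def posterior_burglary(evidence):
    """P(B=#t | evidence), summing the joint over all unobserved variables.

    `evidence` maps variable names ("A", "W", "G") to observed booleans.
    """
    totals = {True: 0.0, False: 0.0}
    for b, a, w, g in product([True, False], repeat=4):
        assignment = {"B": b, "A": a, "W": w, "G": g}
        if all(assignment[name] == value for name, value in evidence.items()):
            totals[b] += joint(b, a, w, g)
    return totals[True] / (totals[True] + totals[False])


if __name__ == "__main__":
    print(posterior_burglary({"W": True}))
    print(posterior_burglary({"W": True, "G": True}))
```

Enumeration is exponential in the number of variables, so it is only practical for tiny networks like this one.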

Bayes’ Nets: Markov Property

Bayes’ nets:
- Satisfy the local Markov property
  - Variables are conditionally independent of their non-descendants given their parents




Simple Bayesian Network (MCBN1)

Graph: A → B, A → C, B → D, C → D, C → E

P(A,B,C,D,E) = P(A) P(B|A) P(C|A) P(D|B,C) P(E|C)

There exist algorithms for training and inference on BNs.
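The factorization can be read mechanically off the parent sets: one term per node, conditioned on that node's parents. A small sketch of this, assuming binary variables as in the truth-table counts given earlier (the helper names are illustrative):

```python
# Parent sets for the five-node example network on the slides.
PARENTS = {"A": [], "B": ["A"], "C": ["A"], "D": ["B", "C"], "E": ["C"]}


def factorization(parents):
    """Return the joint distribution written as a product of local terms."""
    terms = []
    for node, pa in parents.items():
        terms.append(f"P({node}|{','.join(pa)})" if pa else f"P({node})")
    return "".join(terms)


def table_size(node, parents, arity=2):
    """Entries in the table for P(node | parents), assuming `arity` values per variable."""
    return arity ** (1 + len(parents[node]))


if __name__ == "__main__":
    print("P(" + ",".join(PARENTS) + ") = " + factorization(PARENTS))
    for node in PARENTS:
        print(node, table_size(node, PARENTS))
```

The factored form needs 2 + 4 + 4 + 8 + 4 = 22 table entries, against 2^5 = 32 for the unfactored joint over five binary variables.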

Naïve Bayes Model

Bayes’ net: conditional independence of features given class

Graph: Y → f1, Y → f2, Y → f3, ..., Y → fk

Hidden Markov Model

Bayesian network where:
- yt depends on yt-1
- xt depends on yt

Graph: y1 → y2 → y3 → ... → yk, with yt → xt for each t
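Read as a Bayesian network, the HMM's joint probability is a product of one transition term and one emission term per position. The sketch below illustrates this with a two-tag, two-word toy model; all of the probabilities are hypothetical values chosen for illustration.

```python
from itertools import product

# Hypothetical toy parameters, chosen only for illustration.
START = {"N": 0.6, "V": 0.4}                   # P(y1)
TRANS = {"N": {"N": 0.3, "V": 0.7},            # P(yt | yt-1)
         "V": {"N": 0.8, "V": 0.2}}
EMIT = {"N": {"time": 0.5, "flies": 0.5},      # P(xt | yt)
        "V": {"time": 0.6, "flies": 0.4}}


def hmm_joint(tags, words):
    """P(x, y) = P(y1) P(x1|y1) * product over t > 1 of P(yt|yt-1) P(xt|yt)."""
    prob = START[tags[0]] * EMIT[tags[0]][words[0]]
    for t in range(1, len(words)):
        prob *= TRANS[tags[t - 1]][tags[t]] * EMIT[tags[t]][words[t]]
    return prob


def total_mass(length):
    """Sum of P(x, y) over every tag and word sequence of the given length."""
    vocab = ["time", "flies"]
    return sum(
        hmm_joint(tags, words)
        for tags in product(START, repeat=length)
        for words in product(vocab, repeat=length)
    )


if __name__ == "__main__":
    print(hmm_joint(["N", "V"], ["time", "flies"]))
    print(total_mass(2))
```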

Generative Models

Both Naïve Bayes and HMMs are generative models

"We use the term generative model to refer to a directed graphical model in which the outputs topologically precede the inputs, that is, no x in X can be a parent of an output y in Y." (Sutton & McCallum, 2006)
- State y generates an observation (instance) x

Maximum Entropy and linear-chain Conditional Random Fields (CRFs) are, respectively, their discriminative model counterparts

Markov Random Fields

aka Markov Network
- Graphical representation of probabilistic model
  - Undirected graph
  - Can represent cyclic dependencies (vs. DAG in Bayesian Networks; can represent induced dependencies)
- Also satisfy local Markov property: P(X | all other variables) = P(X | ne(X)), where ne(X) are the neighbors of X

Factorizing MRFs

Many MRFs can be analyzed in terms of cliques
- Clique: in an undirected graph G(V,E), a clique is a subset of the vertices in V such that every pair of vertices vi, vj in the subset is joined by an edge E(vi,vj)
- A maximal clique cannot be extended
- A maximum clique is the largest clique in G

Example due to F. Xia: graph over vertices A, B, C, D, E, illustrating a clique, a maximal clique, and a maximum clique
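The three definitions are easy to confuse, so a brute-force check may help. This sketch reuses the slide's vertex names, but the edge set is an assumption made for illustration; the edges of the original figure are not recoverable from the text.

```python
from itertools import combinations

# Hypothetical undirected graph on the slide's vertex names; the edge set is an
# assumption made for illustration, not the one drawn on the slide.
VERTICES = {"A", "B", "C", "D", "E"}
EDGES = {("A", "B"), ("A", "C"), ("B", "C"), ("C", "D"), ("D", "E")}


def adjacent(u, v):
    return (u, v) in EDGES or (v, u) in EDGES


def is_clique(nodes):
    """Every pair of vertices in `nodes` is joined by an edge."""
    return all(adjacent(u, v) for u, v in combinations(nodes, 2))


def is_maximal_clique(nodes):
    """A clique to which no further vertex can be added."""
    nodes = set(nodes)
    return is_clique(nodes) and not any(
        is_clique(nodes | {v}) for v in VERTICES - nodes
    )


def maximum_clique():
    """Largest clique, found by brute force from the biggest subsets down."""
    for size in range(len(VERTICES), 0, -1):
        for subset in combinations(sorted(VERTICES), size):
            if is_clique(subset):
                return set(subset)
    return set()


if __name__ == "__main__":
    print(is_clique({"A", "B"}), is_maximal_clique({"A", "B"}))
    print(is_maximal_clique({"C", "D"}))
    print(maximum_clique())
```

In this assumed graph, {A, B} is a clique but not maximal (C can be added), {C, D} is maximal but not maximum, and {A, B, C} is the maximum clique.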

MRFs

Given an undirected graph G(V,E), random vars: X
- Cliques over G: cl(G)

[Figure: example graph over vertices A, B, C, D, E]
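The formula for these slides did not survive in the text. The standard clique factorization of an MRF, which is presumably what was shown, is p(x) = (1/Z) ∏_{c ∈ cl(G)} φ_c(x_c), where each φ_c is a non-negative potential on clique c and Z sums the product over all assignments. The sketch below computes it by brute force on a hypothetical three-variable chain with made-up potentials.

```python
from itertools import product

# Hypothetical chain A - B - C over binary variables; its maximal cliques are
# the two edges. The graph and potentials are illustrative assumptions.
VARIABLES = ["A", "B", "C"]
CLIQUES = [("A", "B"), ("B", "C")]


def potential(values):
    """Hypothetical clique potential: favor cliques whose variables agree."""
    return 2.0 if len(set(values)) == 1 else 1.0


def unnormalized(assignment):
    """Product of clique potentials for one full assignment (a dict)."""
    score = 1.0
    for clique in CLIQUES:
        score *= potential(tuple(assignment[v] for v in clique))
    return score


def distribution():
    """Normalize by Z, the sum of unnormalized scores over all assignments."""
    assignments = [
        dict(zip(VARIABLES, vals)) for vals in product([0, 1], repeat=len(VARIABLES))
    ]
    z = sum(unnormalized(a) for a in assignments)
    return {tuple(a[v] for v in VARIABLES): unnormalized(a) / z for a in assignments}


if __name__ == "__main__":
    for assignment, prob in distribution().items():
        print(assignment, prob)
```

Unlike the conditional tables of a Bayesian network, the potentials need not be probabilities; only the normalized product is.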


Conditional Random Fields

Definition due to Lafferty et al., 2001:

Let G = (V,E) be a graph such that Y = (Y_v)_{v ∈ V}, so that Y is indexed by the vertices of G. Then (X,Y) is a conditional random field in case, when conditioned on X, the random variables Y_v obey the Markov property with respect to the graph: p(Y_v | X, Y_w, w ≠ v) = p(Y_v | X, Y_w, w ∼ v), where w ∼ v means that w and v are neighbors in G.

A CRF is a Markov Random Field globally conditioned on the observation X.

Linear-Chain CRF

CRFs can have arbitrary graphical structure, but...
- Most common form is linear chain
  - Supports sequence modeling
  - Many sequence labeling NLP problems: Named Entity Recognition (NER), Coreference
- Similar to combining HMM sequence with MaxEnt model
  - Supports sequence structure like HMM, but HMMs can’t do rich feature structure
  - Supports rich, overlapping features like MaxEnt, but MaxEnt doesn’t directly support sequence labeling

Discriminative & Generative

[Figure: model perspectives (Sutton & McCallum)]

Linear-Chain CRFs

Feature functions:
- In MaxEnt: f: X × Y → {0,1}
  - e.g. fj(x,y) = 1 if x=“rifle” and y=talk.politics.guns, 0 otherwise
- In CRFs: f: Y × Y × X × T → R
  - e.g. fk(yt,yt-1,x,t) = 1 if yt=V and yt-1=N and xt=“flies”, 0 otherwise
  - Frequently an indicator function, for efficiency
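To make the feature-function signature concrete, the sketch below implements the slide's example feature and plugs it into the usual linear-chain form p(y|x) = exp(Σ_t Σ_k λ_k f_k(yt, yt-1, x, t)) / Z(x). The second feature, the weights, and the start label "<s>" are hypothetical, and Z(x) is computed by brute force, which only works for very short inputs.

```python
import math
from itertools import product

TAGS = ["N", "V"]


def f_noun_verb_flies(y_t, y_prev, x, t):
    """The slide's example: 1 if yt=V, yt-1=N, and xt is "flies"; 0 otherwise."""
    return 1.0 if y_t == "V" and y_prev == "N" and x[t] == "flies" else 0.0


def f_time_noun(y_t, y_prev, x, t):
    """Hypothetical second feature: the word "time" tagged as N."""
    return 1.0 if y_t == "N" and x[t] == "time" else 0.0


FEATURES = [f_noun_verb_flies, f_time_noun]
WEIGHTS = [1.5, 1.0]  # hypothetical lambda values


def score(y, x):
    """Sum over positions t and features k of lambda_k * f_k(yt, yt-1, x, t)."""
    total = 0.0
    for t in range(len(x)):
        y_prev = y[t - 1] if t > 0 else "<s>"  # assumed start-of-sequence label
        total += sum(w * f(y[t], y_prev, x, t) for w, f in zip(WEIGHTS, FEATURES))
    return total


def conditional(x):
    """p(y|x) = exp(score(y, x)) / Z(x), with Z(x) computed by brute force."""
    sequences = list(product(TAGS, repeat=len(x)))
    z = sum(math.exp(score(y, x)) for y in sequences)
    return {y: math.exp(score(y, x)) / z for y in sequences}


if __name__ == "__main__":
    for y, p in conditional(["time", "flies"]).items():
        print(y, round(p, 4))
```

Because each feature sees the whole observation sequence x as well as the position t, features may overlap and look at arbitrary context, which is the "rich feature" advantage over an HMM.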


Linear-chain CRFs: Training & Decoding

Training:
- Learn λj
- Approach similar to MaxEnt: e.g. L-BFGS

Decoding:
- Compute label sequence that optimizes P(y|x)
- Can use approaches like HMM, e.g. Viterbi
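Since Z(x) does not depend on y, the label sequence that maximizes P(y|x) is the one with the highest total feature score, and Viterbi finds it with dynamic programming. A minimal sketch follows; `local_score` is a hypothetical stand-in for the weighted feature sum at one position, and "<s>" is an assumed start label.

```python
from itertools import product

TAGS = ["N", "V"]
START = "<s>"  # assumed start-of-sequence label


def local_score(y_t, y_prev, x, t):
    """Hypothetical weighted feature sum for one position (sum_k lambda_k f_k)."""
    s = 0.0
    if y_t == "V" and y_prev == "N" and x[t] == "flies":
        s += 1.5
    if y_t == "N" and x[t] == "time":
        s += 1.0
    if y_t == y_prev:
        s -= 0.5
    return s


def viterbi(x):
    """Return the highest-scoring tag sequence for x and its score."""
    # best[t][tag] = (score of the best path ending in tag at t, previous tag)
    best = [{tag: (local_score(tag, START, x, 0), None) for tag in TAGS}]
    for t in range(1, len(x)):
        column = {}
        for tag in TAGS:
            column[tag] = max(
                (best[t - 1][prev][0] + local_score(tag, prev, x, t), prev)
                for prev in TAGS
            )
        best.append(column)
    # Follow back-pointers from the best final tag.
    tag = max(TAGS, key=lambda y: best[-1][y][0])
    final_score = best[-1][tag][0]
    path = [tag]
    for t in range(len(x) - 1, 0, -1):
        tag = best[t][tag][1]
        path.append(tag)
    return path[::-1], final_score


def brute_force(x):
    """Exhaustive argmax over all tag sequences, for checking short inputs."""
    def total(y):
        return sum(
            local_score(y[t], y[t - 1] if t else START, x, t) for t in range(len(x))
        )
    return max((list(y) for y in product(TAGS, repeat=len(x))), key=total)


if __name__ == "__main__":
    print(viterbi(["time", "flies"]))
```

Viterbi costs O(n·|Y|²) for a sentence of length n, against |Y|^n for the exhaustive search.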

Skip-chain CRFs

Motivation

Long-distance dependencies:
- Linear-chain CRFs, HMMs, beam search, etc.
  - All make very local Markov assumptions
  - Preceding label; current data given current label
  - Good for some tasks
- However, longer context can be useful
  - e.g. NER: repeated capitalized words should get the same tag

Skip-Chain CRFs

Basic approach:
- Augment linear-chain CRF model with long-distance ‘skip edges’
  - Add evidence from both endpoints
- Which edges?
  - Identical words, words with same stem?
- How many edges?
  - Not too many: more edges increase inference cost

Skip Chain CRF Model

Two clique templates:
- Standard linear chain template
- Skip edge template



Skip Chain NER

Named Entity Recognition:
- Task: start time, end time, speaker, location
  - In corpus of seminar announcement emails
- All approaches:
  - Orthographic, gazetteer, POS features
  - Within preceding, following 4-word window
- Skip chain CRFs:
  - Skip edges between identical capitalized words
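The skip-edge rule is simple enough to sketch directly. The function below links every pair of positions holding the same capitalized token; linking all pairs (rather than, say, only consecutive repeats) is an assumption, and the example sentence is invented.

```python
from collections import defaultdict


def skip_edges(tokens):
    """Return index pairs (i, j), i < j, linking identical capitalized words."""
    positions = defaultdict(list)
    for i, token in enumerate(tokens):
        if token[:1].isupper():
            positions[token].append(i)
    edges = []
    for indices in positions.values():
        for a in range(len(indices)):
            for b in range(a + 1, len(indices)):
                edges.append((indices[a], indices[b]))
    return sorted(edges)


if __name__ == "__main__":
    # Hypothetical seminar-announcement fragment.
    tokens = "Speaker : John Smith . Smith will talk at noon".split()
    print(skip_edges(tokens))
```

With all-pairs linking, a word repeated m times contributes m(m-1)/2 edges, which is one reason to keep the edge rule restrictive.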

NER Features

Skip Chain NER Results
- Skip chain improves substantially on ‘speaker’ recognition
- Slight reduction in accuracy for times

Summary

Conditional random fields (CRFs)
- Undirected graphical model
  - Compare with Bayesian Networks, Markov Random Fields
- Linear-chain models
  - HMM sequence structure + MaxEnt feature models
- Skip-chain models
  - Augment with longer distance dependencies
- Pros: Good performance
- Cons: Compute intensive

HW #5: Beam Search

Apply Beam Search to MaxEnt sequence decoding
- Task: POS tagging
- Given files:
  - test data: usual format
  - boundary file: sentence lengths
  - model file
- Comparisons:
  - Different topN, topK, beam_width

Tag Context

Following Ratnaparkhi ‘96, model uses previous tag (prevT=tag) and previous tag bigram (prevTwoTags=tagi-2+tagi-1)
- These are NOT in the data file; you compute them on the fly.

Notes:
- Due to sparseness, it is possible a bigram may not appear in the model file. Skip it.
- These are feature functions: if you have a different candidate tag for the same word, weights will differ.
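As a rough illustration of the notes above (not a reference solution), the sketch below runs a simple beam search over a toy MaxEnt tagger. The weights, the `curW=` feature name, the `BOS` padding tag, and the single `top_k` pruning parameter are all assumptions; the assignment's own topN, topK, and beam_width settings may be defined differently.

```python
import math

TAGS = ["N", "V"]
BOS = "BOS"  # assumed padding tag for positions before the sentence start

# Hypothetical model file contents: WEIGHTS[tag][feature] = lambda.
WEIGHTS = {
    "N": {"curW=time": 1.0, "prevT=V": 0.5, "prevTwoTags=BOS+BOS": 0.3},
    "V": {"curW=flies": 1.0, "prevT=N": 0.8, "prevTwoTags=BOS+N": 0.4},
}


def tag_log_probs(features):
    """MaxEnt log P(tag | features); features missing from the model are skipped."""
    scores = {tag: sum(WEIGHTS[tag].get(f, 0.0) for f in features) for tag in TAGS}
    log_z = math.log(sum(math.exp(s) for s in scores.values()))
    return {tag: s - log_z for tag, s in scores.items()}


def beam_search(sentence, top_k=2):
    """Keep the top_k highest-scoring partial tag sequences at each word."""
    beam = [(0.0, [])]  # (log probability, tags so far)
    for word_features in sentence:
        candidates = []
        for log_p, tags in beam:
            prev1 = tags[-1] if tags else BOS
            prev2 = tags[-2] if len(tags) > 1 else BOS
            # Tag-context features are computed on the fly, not read from the data.
            context = [f"prevT={prev1}", f"prevTwoTags={prev2}+{prev1}"]
            for tag, lp in tag_log_probs(word_features + context).items():
                candidates.append((log_p + lp, tags + [tag]))
        beam = sorted(candidates, key=lambda c: c[0], reverse=True)[:top_k]
    return beam[0][1], beam[0][0]


if __name__ == "__main__":
    # Each word is a list of its observed features (hypothetical format).
    print(beam_search([["curW=time"], ["curW=flies"]]))
```

The same feature string gets a different weight for each candidate tag, which is why the context features have to be scored once per tag rather than once per word.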

Uncertainty

Real world tasks:
- Partially observable, stochastic, extremely complex
- Probabilities capture “Ignorance & Laziness”
  - Lack relevant facts, conditions
  - Failure to enumerate all conditions, exceptions

Motivation

Uncertainty in medical diagnosis
- Diseases produce symptoms
- In diagnosis, observed symptoms => disease ID
- Uncertainties
  - Symptoms may not occur
  - Symptoms may not be reported
  - Diagnostic tests not perfect
    - False positive, false negative

How do we estimate confidence?