Conditional Random Fields
Advanced Statistical Methods in NLP, Ling 572
February 9, 2012



Roadmap
Graphical models:
- Modeling independence
- Models revisited
- Generative & discriminative models
Conditional random fields:
- Linear-chain models
- Skip-chain models


Preview
Conditional random fields:
- Undirected graphical model, due to Lafferty, McCallum, and Pereira (2001)
- Discriminative model: supports integration of rich feature sets
- Allows a range of dependency structures (linear-chain, skip-chain, general); can encode long-distance dependencies
- Used in diverse NLP sequence labeling tasks: named entity recognition, coreference resolution, etc.


Graphical Models


Graphical Models
A graphical model is a simple, graphical notation for conditional independence:
- A probabilistic model where the graph structure denotes conditional independence between random variables
- Nodes: random variables
- Edges: dependency relations between random variables
Model types: Bayesian networks, Markov random fields


Modeling (In)dependence
Bayesian network:
- Directed acyclic graph (DAG)
- Nodes = random variables
- Arc ~ "directly influences": conditional dependency
- Arcs = child depends on parent(s); no arcs = independent (0 incoming: only a priori)
- Parents of X = π(X); for each X we need P(X | π(X))

Example I
(Figure from Russell & Norvig, AIMA; not reproduced.)


Simple Bayesian Network: MCBN1
Nodes A, B, C, D, E, where:
- A = only a priori
- B depends on A
- C depends on A
- D depends on B, C
- E depends on C
Need: P(A), P(B|A), P(C|A), P(D|B,C), P(E|C)
Truth table sizes: 2, 2*2, 2*2, 2*2*2, 2*2

Holmes Example (Pearl)
Holmes is worried that his house will be burgled. For the time period of interest, there is a 10^-4 a priori chance of this happening, and Holmes has installed a burglar alarm to try to forestall this event. The alarm is 95% reliable in sounding when a burglary happens, but also has a false positive rate of 1%. Holmes' neighbor, Watson, is 90% sure to call Holmes at his office if the alarm sounds, but he is also a bit of a practical joker and, knowing Holmes' concern, might (30%) call even if the alarm is silent. Holmes' other neighbor Mrs. Gibbons is a well-known lush and often befuddled, but Holmes believes that she is four times more likely to call him if there is an alarm than not.


Holmes Example: Model
There are four binary random variables:
- B: whether Holmes' house has been burgled
- A: whether his alarm sounded
- W: whether Watson called
- G: whether Gibbons called
Graph: B → A, A → W, A → G

Holmes Example: Tables

P(B):           B=#t    B=#f
                0.0001  0.9999

P(A|B):   B     A=#t    A=#f
          #t    0.95    0.05
          #f    0.01    0.99

P(W|A):   A     W=#t    W=#f
          #t    0.90    0.10
          #f    0.30    0.70

P(G|A):   A     G=#t    G=#f
          #t    0.40    0.60
          #f    0.10    0.90
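As a sanity check on these tables, here is a minimal sketch (not from the slides) of exact inference by enumeration, computing P(B=true | W=true); the variable names mirror the tables above.

```python
# Exact inference by enumeration on the Holmes network (B -> A -> W).
P_B = {True: 0.0001, False: 0.9999}
P_A_given_B = {True: {True: 0.95, False: 0.05},
               False: {True: 0.01, False: 0.99}}
P_W_given_A = {True: {True: 0.90, False: 0.10},
               False: {True: 0.30, False: 0.70}}

def joint(b, a, w):
    # Chain rule along the graph: P(B) P(A|B) P(W|A)
    return P_B[b] * P_A_given_B[b][a] * P_W_given_A[a][w]

# Marginalize out A, then normalize over B
num = sum(joint(True, a, True) for a in (True, False))
den = sum(joint(b, a, True) for b in (True, False) for a in (True, False))
print(num / den)  # ~0.00028: one call from Watson barely raises the burglary belief
```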


Bayes’ Nets: Markov Property

Bayes’s Nets:Satisfy the local Markov property

Variables: conditionally independent of non-descendents given their parents


Simple Bayesian Network: MCBN1
A = only a priori; B depends on A; C depends on A; D depends on B, C; E depends on C

P(A,B,C,D,E) = P(A) P(B|A) P(C|A) P(D|B,C) P(E|C)

There exist algorithms for training and inference on Bayesian networks.
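In general (a standard result, not shown explicitly on the slide), a Bayesian network factorizes the joint distribution over each variable given its parents π(X_i):

$$P(X_1,\dots,X_n) = \prod_{i=1}^{n} P(X_i \mid \pi(X_i))$$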


Naïve Bayes Model
Bayes' net with conditional independence of the features given the class:

Y → f1, f2, f3, ..., fk
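The factorization this graph encodes (standard form; the equation image on the original slide is not reproduced):

$$P(y, f_1, \dots, f_k) = P(y)\prod_{i=1}^{k} P(f_i \mid y)$$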


Hidden Markov Model
Bayesian network where:
- yt depends on yt-1
- xt depends on yt
(State chain y1 → y2 → y3 → ... → yk; each state yt emits an observation xt.)
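The joint this network encodes (standard HMM factorization; the slide's equation image is not reproduced):

$$p(\mathbf{x}, \mathbf{y}) = \prod_{t=1}^{T} p(y_t \mid y_{t-1})\, p(x_t \mid y_t)$$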



Generative Models
Both Naïve Bayes and HMMs are generative models:

"We use the term generative model to refer to a directed graphical model in which the outputs topologically precede the inputs, that is, no x in X can be a parent of an output y in Y." (Sutton & McCallum, 2006)

A state y generates an observation (instance) x.

Maximum entropy models and linear-chain conditional random fields (CRFs) are, respectively, their discriminative counterparts.


Markov Random Fields
aka Markov networks
- Graphical representation of a probabilistic model as an undirected graph
- Can represent cyclic dependencies (vs. the DAG of a Bayesian network, which can represent induced dependencies)
- Also satisfy the local Markov property: P(X | all other variables) = P(X | ne(X)), where ne(X) are the neighbors of X


Factorizing MRFs
Many MRFs can be analyzed in terms of cliques:
- Clique: in an undirected graph G(V,E), a clique is a subset of vertices V' ⊆ V such that for every pair of vertices vi, vj in V', the edge (vi,vj) is in E
- A maximal clique is one that cannot be extended
- A maximum clique is the largest clique in G

(Worked example over a graph with nodes A, B, C, D, E, due to F. Xia; figure and answers not reproduced.)


MRFs
Given an undirected graph G(V,E) over random variables X, with cliques cl(G), the joint distribution factorizes over the cliques. (Example due to F. Xia; figure not reproduced.)
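The factorization equation on the original slide was an image; the standard clique-potential form is:

$$p(\mathbf{x}) = \frac{1}{Z}\prod_{C \in cl(G)} \psi_C(\mathbf{x}_C), \qquad Z = \sum_{\mathbf{x}'} \prod_{C \in cl(G)} \psi_C(\mathbf{x}'_C)$$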


Conditional Random Fields
Definition due to Lafferty et al., 2001:

"Let G = (V,E) be a graph such that Y = (Y_v), v ∈ V, so that Y is indexed by the vertices of G. Then (X,Y) is a conditional random field in case, when conditioned on X, the random variables Y_v obey the Markov property with respect to the graph: p(Y_v | X, Y_w, w ≠ v) = p(Y_v | X, Y_w, w ∼ v), where w ∼ v means that w and v are neighbors in G."

A CRF is a Markov random field globally conditioned on the observation X, and has the form:
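(The formula on the original slide was an image; the standard form, a conditional version of the MRF factorization above, is:)

$$p(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x})}\prod_{C \in cl(G)} \psi_C(\mathbf{y}_C, \mathbf{x}), \qquad Z(\mathbf{x}) = \sum_{\mathbf{y}'} \prod_{C \in cl(G)} \psi_C(\mathbf{y}'_C, \mathbf{x})$$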


Linear-Chain CRF
CRFs can have arbitrary graphical structure, but the most common form is the linear chain, which supports sequence modeling for many NLP sequence labeling problems: named entity recognition (NER), coreference, etc.

Similar to combining HMM sequence structure with a MaxEnt model:
- Supports sequence structure like an HMM, but HMMs can't handle rich feature structure
- Supports rich, overlapping features like MaxEnt, but MaxEnt doesn't directly support sequence labeling

Discriminative & Generative
Model perspectives (Sutton & McCallum): Naïve Bayes vs. MaxEnt, and HMM vs. linear-chain CRF (figure not reproduced).


Linear-Chain CRFs
Feature functions (sketched in code below):
- In MaxEnt: f: X × Y → {0,1}
  e.g., fj(x,y) = 1 if x = "rifle" and y = talk.politics.guns; 0 otherwise
- In CRFs: f: Y × Y × X × T → R
  e.g., fk(yt, yt-1, x, t) = 1 if yt = V and yt-1 = N and xt = "flies"; 0 otherwise
- Frequently indicator functions, for efficiency
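A minimal sketch of what such feature functions look like in code, assuming Python and hypothetical tag names; only the first function corresponds to the slide's example.

```python
# CRF feature functions as plain Python callables over (y_t, y_prev, x, t).
# Indicator-valued, as on the slide.
def f_noun_verb_flies(y_t, y_prev, x, t):
    # Fires on the N -> V transition when the current word is "flies"
    return 1.0 if (y_t == "V" and y_prev == "N" and x[t] == "flies") else 0.0

def f_capitalized_propn(y_t, y_prev, x, t):
    # Observation feature (hypothetical): capitalized word tagged as proper noun
    return 1.0 if (y_t == "NNP" and x[t][0].isupper()) else 0.0

# The model score at a position is a weighted sum over such functions:
# score(y_t, y_prev, x, t) = sum_k lambda_k * f_k(y_t, y_prev, x, t)
```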

Linear-Chain CRFs
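(The equations on the original slides here were images; the standard linear-chain CRF form, e.g. following Sutton & McCallum (2006), is:)

$$p(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x})} \exp\!\left(\sum_{t=1}^{T}\sum_{k} \lambda_k\, f_k(y_t, y_{t-1}, \mathbf{x}, t)\right)$$

where Z(x) sums the same exponential over all candidate label sequences y'.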


Linear-Chain CRFs: Training & Decoding
Training:
- Learn the weights λj
- Approach similar to MaxEnt, e.g., L-BFGS
Decoding:
- Compute the label sequence that optimizes P(y|x)
- Can use HMM-style approaches, e.g., Viterbi (see the sketch below)
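A sketch of Viterbi decoding for a linear-chain CRF, assuming a precomputed score array; the array layout and function names are hypothetical, not from the course materials.

```python
import numpy as np

# trans[t, i, j] is assumed to hold sum_k lambda_k * f_k(tag_j, tag_i, x, t),
# i.e. the score of moving from tag i at t-1 to tag j at t; start[j] scores
# the first tag. Scores are log-space, so paths add.
def viterbi(start, trans):
    T, n_tags = trans.shape[0], trans.shape[2]
    delta = np.empty((T, n_tags))            # best score of any path ending in tag j at t
    back = np.zeros((T, n_tags), dtype=int)  # backpointers
    delta[0] = start
    for t in range(1, T):
        cand = delta[t - 1][:, None] + trans[t]  # (prev tag, next tag) grid
        back[t] = cand.argmax(axis=0)
        delta[t] = cand.max(axis=0)
    path = [int(delta[-1].argmax())]         # best final tag, then follow backpointers
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```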


Skip-chain CRFs


Motivation
Long-distance dependencies:
- Linear-chain CRFs, HMMs, beam search, etc. all make very local Markov assumptions: preceding label, and current data given current label
- Good for some tasks
- However, longer context can be useful, e.g., in NER: repeated capitalized words should get the same tag


Skip-Chain CRFs
Basic approach: augment the linear-chain CRF model with long-distance 'skip edges', adding evidence from both endpoints.
- Which edges? Identical words, words with the same stem?
- How many edges? Not too many: more edges increase inference cost


Skip Chain CRF Model
Two clique templates (combined in the formula below):
- Standard linear-chain template
- Skip-edge template
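As with the earlier equation slides, the formula here was an image; following Sutton & McCallum (2006), the two templates combine as:

$$p(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x})} \exp\!\left(\sum_{t}\sum_{k} \lambda_k\, f_k(y_t, y_{t-1}, \mathbf{x}, t) \;+\; \sum_{(u,v)\in\mathcal{I}}\sum_{l} \mu_l\, g_l(y_u, y_v, \mathbf{x}, u, v)\right)$$

where $\mathcal{I}$ is the set of skip-edge endpoint pairs.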


Skip Chain NER
Named entity recognition:
- Task: start time, end time, speaker, and location, in a corpus of seminar announcement emails
- All approaches: orthographic, gazetteer, and POS features within a window of the 4 preceding and following words
- Skip-chain CRFs: skip edges between identical capitalized words

NER Features
(Feature table not reproduced.)

Skip Chain NER Results
- Skip chain improves substantially on 'speaker' recognition
- Slight reduction in accuracy for times


Summary
Conditional random fields (CRFs):
- Undirected graphical model; compare with Bayesian networks and Markov random fields
- Linear-chain models: HMM sequence structure + MaxEnt feature models
- Skip-chain models: augment with longer-distance dependencies
- Pros: good performance
- Cons: compute intensive


HW #5: Beam Search
Apply beam search to MaxEnt sequence decoding (a sketch of the loop follows below).

Task: POS tagging

Given files:
- test data: usual format
- boundary file: sentence lengths
- model file

Comparisons: different topN, topK, beam_width

Tag context: following Ratnaparkhi '96, the model uses the previous tag (prevT=tag) and the previous tag bigram (prevTwoTags=tag_{i-2}+tag_{i-1}). These are NOT in the data file; you compute them on the fly.

Notes:
- Due to sparseness, a bigram may not appear in the model file. Skip it.
- These are feature functions: if you have a different candidate tag for the same word, the weights will differ.
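A minimal sketch of the beam search loop, under assumed interfaces: `score` is a hypothetical stand-in for the MaxEnt model evaluation (returning a log-probability given the word's features plus the prevT/prevTwoTags context), and pruning uses topN, topK, and beam_width as described above.

```python
# Beam search over tag sequences for MaxEnt decoding (sketch).
def beam_search(sentence, tags, score, top_n=3, top_k=10, beam_width=5.0):
    beams = [([], 0.0)]  # (tag history, cumulative log-prob)
    for feats in sentence:  # one feature set per word
        expanded = []
        for hist, logp in beams:
            prev = hist[-1] if hist else "BOS"
            prev2 = hist[-2] if len(hist) > 1 else "BOS"
            # Keep only the topN candidate tags for this word under this history
            cand = sorted(tags, key=lambda t: -score(feats, prev, prev2, t))
            for tag in cand[:top_n]:
                expanded.append((hist + [tag], logp + score(feats, prev, prev2, tag)))
        # Prune: keep at most topK hypotheses within beam_width of the best
        expanded.sort(key=lambda h: -h[1])
        best = expanded[0][1]
        beams = [h for h in expanded[:top_k] if h[1] + beam_width >= best]
    return beams[0][0]  # best tag sequence
```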


Uncertainty
Real-world tasks are partially observable, stochastic, and extremely complex.
Probabilities capture "ignorance & laziness":
- Lack of relevant facts, conditions
- Failure to enumerate all conditions, exceptions

Motivation
Uncertainty in medical diagnosis:
- Diseases produce symptoms
- In diagnosis, observed symptoms => disease ID
- Uncertainties: symptoms may not occur; symptoms may not be reported; diagnostic tests are not perfect (false positives, false negatives)
How do we estimate confidence?