Conditional Random Fields
Advanced Statistical Methods in NLP, Ling 572
February 9, 2012



Roadmap
Graphical models:
- Modeling independence
- Models revisited
- Generative & discriminative models
Conditional random fields:
- Linear-chain models
- Skip-chain models


Preview
Conditional random fields:
- Undirected graphical model, due to Lafferty, McCallum, and Pereira (2001)
- Discriminative model: supports integration of rich feature sets
- Allows a range of dependency structures (linear-chain, skip-chain, general); can encode long-distance dependencies
- Used in diverse NLP sequence labeling tasks: named entity recognition, coreference resolution, etc.


Graphical Models


Graphical Models
A graphical model is a simple, graphical notation for conditional independence:
- A probabilistic model where the graph structure denotes conditional independence between random variables
- Nodes: random variables
- Edges: dependency relations between random variables
Model types: Bayesian networks, Markov random fields


Modeling (In)dependence
Bayesian network:
- Directed acyclic graph (DAG)
- Nodes = random variables
- Arc ~ "directly influences": conditional dependency
- Arcs = child depends on parent(s); no arcs = independent (0 incoming: only a priori)
- Parents of X = π(X); for each X we need P(X | π(X))

Example I
(Figure from Russell & Norvig, AIMA; not reproduced.)


Simple Bayesian Network: MCBN1
Nodes A, B, C, D, E, where:
- A = only a priori
- B depends on A
- C depends on A
- D depends on B, C
- E depends on C
Need: P(A), P(B|A), P(C|A), P(D|B,C), P(E|C)
Truth table sizes: 2, 2*2, 2*2, 2*2*2, 2*2

Holmes Example (Pearl)
Holmes is worried that his house will be burgled. For the time period of interest, there is a 10^-4 a priori chance of this happening, and Holmes has installed a burglar alarm to try to forestall this event. The alarm is 95% reliable in sounding when a burglary happens, but also has a false positive rate of 1%. Holmes' neighbor, Watson, is 90% sure to call Holmes at his office if the alarm sounds, but he is also a bit of a practical joker and, knowing Holmes' concern, might (30%) call even if the alarm is silent. Holmes' other neighbor Mrs. Gibbons is a well-known lush and often befuddled, but Holmes believes that she is four times more likely to call him if there is an alarm than not.


Holmes Example: Model
There are four binary random variables:
- B: whether Holmes' house has been burgled
- A: whether his alarm sounded
- W: whether Watson called
- G: whether Gibbons called
Graph: B → A, A → W, A → G

Holmes Example: Tables

P(B):           B=#t    B=#f
                0.0001  0.9999

P(A|B):   B     A=#t    A=#f
          #t    0.95    0.05
          #f    0.01    0.99

P(W|A):   A     W=#t    W=#f
          #t    0.90    0.10
          #f    0.30    0.70

P(G|A):   A     G=#t    G=#f
          #t    0.40    0.60
          #f    0.10    0.90
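As a sanity check on these tables, here is a minimal sketch (not from the slides) of exact inference by enumeration, computing P(B=true | W=true); the variable names mirror the tables above.

```python
# Exact inference by enumeration on the Holmes network (B -> A -> W).
P_B = {True: 0.0001, False: 0.9999}
P_A_given_B = {True: {True: 0.95, False: 0.05},
               False: {True: 0.01, False: 0.99}}
P_W_given_A = {True: {True: 0.90, False: 0.10},
               False: {True: 0.30, False: 0.70}}

def joint(b, a, w):
    # Chain rule along the graph: P(B) P(A|B) P(W|A)
    return P_B[b] * P_A_given_B[b][a] * P_W_given_A[a][w]

# Marginalize out A, then normalize over B
num = sum(joint(True, a, True) for a in (True, False))
den = sum(joint(b, a, True) for b in (True, False) for a in (True, False))
print(num / den)  # ~0.00028: one call from Watson barely raises the burglary belief
```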


Bayes’ Nets: Markov Property

Bayes’s Nets:Satisfy the local Markov property

Variables: conditionally independent of non-descendents given their parents


Simple Bayesian Network: MCBN1
A = only a priori; B depends on A; C depends on A; D depends on B, C; E depends on C

P(A,B,C,D,E) = P(A) P(B|A) P(C|A) P(D|B,C) P(E|C)

There exist algorithms for training and inference on Bayesian networks.
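In general (a standard result, not shown explicitly on the slide), a Bayesian network factorizes the joint distribution over each variable given its parents π(X_i):

$$P(X_1,\dots,X_n) = \prod_{i=1}^{n} P(X_i \mid \pi(X_i))$$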


Naïve Bayes Model
Bayes' net with conditional independence of the features given the class:

Y → f1, f2, f3, ..., fk
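The factorization this graph encodes (standard form; the equation image on the original slide is not reproduced):

$$P(y, f_1, \dots, f_k) = P(y)\prod_{i=1}^{k} P(f_i \mid y)$$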


Hidden Markov Model
Bayesian network where:
- yt depends on yt-1
- xt depends on yt
(State chain y1 → y2 → y3 → ... → yk; each state yt emits an observation xt.)
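The joint this network encodes (standard HMM factorization; the slide's equation image is not reproduced):

$$p(\mathbf{x}, \mathbf{y}) = \prod_{t=1}^{T} p(y_t \mid y_{t-1})\, p(x_t \mid y_t)$$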



Generative Models
Both Naïve Bayes and HMMs are generative models:

"We use the term generative model to refer to a directed graphical model in which the outputs topologically precede the inputs, that is, no x in X can be a parent of an output y in Y." (Sutton & McCallum, 2006)

A state y generates an observation (instance) x.

Maximum entropy models and linear-chain conditional random fields (CRFs) are, respectively, their discriminative counterparts.


Markov Random Fields
aka Markov networks
- Graphical representation of a probabilistic model as an undirected graph
- Can represent cyclic dependencies (vs. the DAG of a Bayesian network, which can represent induced dependencies)
- Also satisfy the local Markov property: P(X | all other variables) = P(X | ne(X)), where ne(X) are the neighbors of X


Factorizing MRFs
Many MRFs can be analyzed in terms of cliques:
- Clique: in an undirected graph G(V,E), a clique is a subset of vertices V' ⊆ V such that for every pair of vertices vi, vj in V', the edge (vi,vj) is in E
- A maximal clique is one that cannot be extended
- A maximum clique is the largest clique in G

(Worked example over a graph with nodes A, B, C, D, E, due to F. Xia; figure and answers not reproduced.)


MRFs
Given an undirected graph G(V,E) over random variables X, with cliques cl(G), the joint distribution factorizes over the cliques. (Example due to F. Xia; figure not reproduced.)
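The factorization equation on the original slide was an image; the standard clique-potential form is:

$$p(\mathbf{x}) = \frac{1}{Z}\prod_{C \in cl(G)} \psi_C(\mathbf{x}_C), \qquad Z = \sum_{\mathbf{x}'} \prod_{C \in cl(G)} \psi_C(\mathbf{x}'_C)$$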


Conditional Random Fields
Definition due to Lafferty et al., 2001:

"Let G = (V,E) be a graph such that Y = (Y_v), v ∈ V, so that Y is indexed by the vertices of G. Then (X,Y) is a conditional random field in case, when conditioned on X, the random variables Y_v obey the Markov property with respect to the graph: p(Y_v | X, Y_w, w ≠ v) = p(Y_v | X, Y_w, w ∼ v), where w ∼ v means that w and v are neighbors in G."

A CRF is a Markov random field globally conditioned on the observation X, and has the form:
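(The formula on the original slide was an image; the standard form, a conditional version of the MRF factorization above, is:)

$$p(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x})}\prod_{C \in cl(G)} \psi_C(\mathbf{y}_C, \mathbf{x}), \qquad Z(\mathbf{x}) = \sum_{\mathbf{y}'} \prod_{C \in cl(G)} \psi_C(\mathbf{y}'_C, \mathbf{x})$$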


Linear-Chain CRF
CRFs can have arbitrary graphical structure, but the most common form is the linear chain, which supports sequence modeling for many NLP sequence labeling problems: named entity recognition (NER), coreference, etc.

Similar to combining HMM sequence structure with a MaxEnt model:
- Supports sequence structure like an HMM, but HMMs can't handle rich feature structure
- Supports rich, overlapping features like MaxEnt, but MaxEnt doesn't directly support sequence labeling

Discriminative & Generative
Model perspectives (Sutton & McCallum): Naïve Bayes vs. MaxEnt, and HMM vs. linear-chain CRF (figure not reproduced).


Linear-Chain CRFs
Feature functions (sketched in code below):
- In MaxEnt: f: X × Y → {0,1}
  e.g., fj(x,y) = 1 if x = "rifle" and y = talk.politics.guns; 0 otherwise
- In CRFs: f: Y × Y × X × T → R
  e.g., fk(yt, yt-1, x, t) = 1 if yt = V and yt-1 = N and xt = "flies"; 0 otherwise
- Frequently indicator functions, for efficiency
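A minimal sketch of what such feature functions look like in code, assuming Python and hypothetical tag names; only the first function corresponds to the slide's example.

```python
# CRF feature functions as plain Python callables over (y_t, y_prev, x, t).
# Indicator-valued, as on the slide.
def f_noun_verb_flies(y_t, y_prev, x, t):
    # Fires on the N -> V transition when the current word is "flies"
    return 1.0 if (y_t == "V" and y_prev == "N" and x[t] == "flies") else 0.0

def f_capitalized_propn(y_t, y_prev, x, t):
    # Observation feature (hypothetical): capitalized word tagged as proper noun
    return 1.0 if (y_t == "NNP" and x[t][0].isupper()) else 0.0

# The model score at a position is a weighted sum over such functions:
# score(y_t, y_prev, x, t) = sum_k lambda_k * f_k(y_t, y_prev, x, t)
```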

Linear-Chain CRFs
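(The equations on the original slides here were images; the standard linear-chain CRF form, e.g. following Sutton & McCallum (2006), is:)

$$p(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x})} \exp\!\left(\sum_{t=1}^{T}\sum_{k} \lambda_k\, f_k(y_t, y_{t-1}, \mathbf{x}, t)\right)$$

where Z(x) sums the same exponential over all candidate label sequences y'.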


Linear-Chain CRFs: Training & Decoding
Training:
- Learn the weights λj
- Approach similar to MaxEnt, e.g., L-BFGS
Decoding:
- Compute the label sequence that optimizes P(y|x)
- Can use HMM-style approaches, e.g., Viterbi (see the sketch below)
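A sketch of Viterbi decoding for a linear-chain CRF, assuming a precomputed score array; the array layout and function names are hypothetical, not from the course materials.

```python
import numpy as np

# trans[t, i, j] is assumed to hold sum_k lambda_k * f_k(tag_j, tag_i, x, t),
# i.e. the score of moving from tag i at t-1 to tag j at t; start[j] scores
# the first tag. Scores are log-space, so paths add.
def viterbi(start, trans):
    T, n_tags = trans.shape[0], trans.shape[2]
    delta = np.empty((T, n_tags))            # best score of any path ending in tag j at t
    back = np.zeros((T, n_tags), dtype=int)  # backpointers
    delta[0] = start
    for t in range(1, T):
        cand = delta[t - 1][:, None] + trans[t]  # (prev tag, next tag) grid
        back[t] = cand.argmax(axis=0)
        delta[t] = cand.max(axis=0)
    path = [int(delta[-1].argmax())]         # best final tag, then follow backpointers
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```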


Skip-chain CRFs


Motivation
Long-distance dependencies:
- Linear-chain CRFs, HMMs, beam search, etc. all make very local Markov assumptions: preceding label, and current data given current label
- Good for some tasks
- However, longer context can be useful, e.g., in NER: repeated capitalized words should get the same tag


Skip-Chain CRFs
Basic approach: augment the linear-chain CRF model with long-distance 'skip edges', adding evidence from both endpoints.
- Which edges? Identical words, words with the same stem?
- How many edges? Not too many: more edges increase inference cost


Skip Chain CRF Model
Two clique templates (combined in the formula below):
- Standard linear-chain template
- Skip-edge template
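As with the earlier equation slides, the formula here was an image; following Sutton & McCallum (2006), the two templates combine as:

$$p(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x})} \exp\!\left(\sum_{t}\sum_{k} \lambda_k\, f_k(y_t, y_{t-1}, \mathbf{x}, t) \;+\; \sum_{(u,v)\in\mathcal{I}}\sum_{l} \mu_l\, g_l(y_u, y_v, \mathbf{x}, u, v)\right)$$

where $\mathcal{I}$ is the set of skip-edge endpoint pairs.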


Skip Chain NER
Named entity recognition:
- Task: start time, end time, speaker, and location, in a corpus of seminar announcement emails
- All approaches: orthographic, gazetteer, and POS features within a window of the 4 preceding and following words
- Skip-chain CRFs: skip edges between identical capitalized words

NER Features
(Feature table not reproduced.)

Skip Chain NER Results
- Skip chain improves substantially on 'speaker' recognition
- Slight reduction in accuracy for times


Summary
Conditional random fields (CRFs):
- Undirected graphical model; compare with Bayesian networks and Markov random fields
- Linear-chain models: HMM sequence structure + MaxEnt feature models
- Skip-chain models: augment with longer-distance dependencies
- Pros: good performance
- Cons: compute intensive


HW #5: Beam Search
Apply beam search to MaxEnt sequence decoding (a sketch of the loop follows below).

Task: POS tagging

Given files:
- test data: usual format
- boundary file: sentence lengths
- model file

Comparisons: different topN, topK, beam_width

Tag context: following Ratnaparkhi '96, the model uses the previous tag (prevT=tag) and the previous tag bigram (prevTwoTags=tag_{i-2}+tag_{i-1}). These are NOT in the data file; you compute them on the fly.

Notes:
- Due to sparseness, a bigram may not appear in the model file. Skip it.
- These are feature functions: if you have a different candidate tag for the same word, the weights will differ.
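A minimal sketch of the beam search loop, under assumed interfaces: `score` is a hypothetical stand-in for the MaxEnt model evaluation (returning a log-probability given the word's features plus the prevT/prevTwoTags context), and pruning uses topN, topK, and beam_width as described above.

```python
# Beam search over tag sequences for MaxEnt decoding (sketch).
def beam_search(sentence, tags, score, top_n=3, top_k=10, beam_width=5.0):
    beams = [([], 0.0)]  # (tag history, cumulative log-prob)
    for feats in sentence:  # one feature set per word
        expanded = []
        for hist, logp in beams:
            prev = hist[-1] if hist else "BOS"
            prev2 = hist[-2] if len(hist) > 1 else "BOS"
            # Keep only the topN candidate tags for this word under this history
            cand = sorted(tags, key=lambda t: -score(feats, prev, prev2, t))
            for tag in cand[:top_n]:
                expanded.append((hist + [tag], logp + score(feats, prev, prev2, tag)))
        # Prune: keep at most topK hypotheses within beam_width of the best
        expanded.sort(key=lambda h: -h[1])
        best = expanded[0][1]
        beams = [h for h in expanded[:top_k] if h[1] + beam_width >= best]
    return beams[0][0]  # best tag sequence
```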


Uncertainty
Real-world tasks are partially observable, stochastic, and extremely complex.
Probabilities capture "ignorance & laziness":
- Lack of relevant facts, conditions
- Failure to enumerate all conditions, exceptions

Motivation
Uncertainty in medical diagnosis:
- Diseases produce symptoms
- In diagnosis, observed symptoms => disease ID
- Uncertainties: symptoms may not occur; symptoms may not be reported; diagnostic tests are not perfect (false positives, false negatives)
How do we estimate confidence?