
Page 1: Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data

J. Lafferty, A. McCallum, F. Pereira

Presentation: Inna Weiner

Learning Seminar, 2004

Page 2: Outline

• Labeling sequence data problem

• Classification with probabilistic models: generative and discriminative
– Why HMMs and MEMMs are not good enough

• Conditional Random Field model

• Experimental Results

Page 3: Labeling Sequence Data Problem

• X is a random variable over data sequences

• Y is a random variable over label sequences

• Yi is assumed to range over a finite label alphabet A

• The problem:
– Learn how to give labels from a closed set Y to a data sequence X

Example:
X: x1 = “Thinking”, x2 = “is”, x3 = “being”
Y: y1 = noun, y2 = verb, y3 = noun

Page 4: Labeling Sequence Data Problem

• The lab setup: Let a monkey do some behavioral task while recording movement and neural activity

• Motor task: reach to target

• Goal: map neural activity to behavior

• In our notation:
– X: neural data
– Y: hand movements

Page 5: Generative Probabilistic Models

• Learning problem: choose Θ to maximize the joint likelihood:

L(Θ) = Σi log pΘ(yi, xi)

• The goal: maximization of the joint likelihood of training examples

y* = argmaxy p*(y|x) = argmaxy p*(y, x) / p(x)

• This requires enumerating all possible observation sequences

Page 6: Markov Model

• A Markov process or model assumes that we can predict the future based just on the present (or on a limited horizon into the past):

• Let {X1, …, XT} be a sequence of random variables taking values in {1, …, N}. Then the Markov properties are:

• Limited horizon: P(Xt+1 | X1, …, Xt) = P(Xt+1 | Xt)

• Time invariant (stationary): P(Xt+1 | Xt) = P(X2 | X1)

Page 7: Describing a Markov Chain

• A Markov Chain can be described by the transition matrix A and the initial probabilities Q:

Aij = P(Xt+1=j|Xt=i)

qi = P(X1=i)
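A tiny Python sketch (not from the slides; the transition and initial probabilities below are made up) showing how such a chain is represented and sampled:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical two-state chain: A[i, j] = P(X_{t+1}=j | X_t=i), q[i] = P(X_1=i)
A = np.array([[0.7, 0.3],
              [0.4, 0.6]])
q = np.array([0.5, 0.5])

def sample_chain(A, q, T):
    """Draw a state sequence x_1 ... x_T from the Markov chain (A, q)."""
    x = [rng.choice(len(q), p=q)]
    for _ in range(T - 1):
        x.append(rng.choice(A.shape[1], p=A[x[-1]]))
    return np.array(x)

print(sample_chain(A, q, T=10))
```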

Page 8: Hidden Markov Model

• In a Hidden Markov Model (HMM) we do not observe the sequence that the model passed through (X) but only some probabilistic function of it (Y). Thus, it is a Markov model with the addition of emission probabilities:

Bik = P(Yt = k|Xt = i)
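Continuing the toy sketch above (A, q and sample_chain come from the previous snippet; the emission matrix B below is likewise hypothetical), the emissions turn the hidden state sequence into the observation sequence that an HMM actually exposes:

```python
# B[i, k] = P(Y_t = k | X_t = i): made-up emissions over a 3-symbol alphabet
B = np.array([[0.8, 0.1, 0.1],
              [0.2, 0.3, 0.5]])

def emit(B, x):
    """Sample one observation per hidden state in x."""
    return np.array([rng.choice(B.shape[1], p=B[state]) for state in x])

x = sample_chain(A, q, T=10)   # hidden states: not observed in practice
y = emit(B, x)                 # observations: all we get to see
print(x, y)
```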

Page 9: The Three Problems of HMM

• Likelihood: Given a series of observations y and a model λ = {A, B, q}, compute the likelihood p(y|λ)

• Inference: Given a series of observations y and a model λ, compute the most likely series of hidden states x

• Learning: Given a series of observations, learn the best model λ

Page 10: Likelihood in HMMs

• Given a model λ = {A, B, q}, we can compute the likelihood by

P(y) = p(y|λ) = Σx p(x) p(y|x) = Σx q(x1) Π A(xt+1|xt) Π B(yt|xt)

• But … the complexity of this computation is O(N^T), where |xi| = N, which is impossible in practice

Page 11: Forward-Backward algorithm

• To compute the likelihood:
– Need to enumerate over all paths in the lattice (all possible instantiations of X1…XT). But … some starting sub-path (blue) is common to many continuing paths (blue + red)

• The idea: using dynamic programming, calculate a path in terms of shorter sub-paths

Page 12: Forward-Backward algorithm (cont’d)

• We build a matrix of the probability of being in state i at time t: αt(i) = P(xt=i, y1y2…yt). Each column is a function of the previous column (forward procedure):
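In the (A, B, q) notation of the earlier slides the recurrence is α1(i) = qi B(y1|i) and αt+1(j) = B(yt+1|j) Σi αt(i) A(i, j). A small numpy sketch, continuing the toy HMM example from the earlier snippets (A, B, q and the sampled y):

```python
def forward(A, B, q, y):
    """alpha[t, i] = P(x_t = i, y_1 ... y_t), with rows indexed from 0."""
    T, N = len(y), len(q)
    alpha = np.zeros((T, N))
    alpha[0] = q * B[:, y[0]]                      # alpha_1(i) = q_i * B(y_1 | i)
    for t in range(1, T):
        # each column depends only on the previous one (sum over predecessor states)
        alpha[t] = (alpha[t - 1] @ A) * B[:, y[t]]
    return alpha

alpha = forward(A, B, q, y)
print(alpha[-1].sum())   # P(y): the likelihood of the whole observation sequence
```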

Page 13: Forward-Backward algorithm (cont’d)

We can similarly define a backwards procedure for filling the matrix βt(i) = P(yt+1…yT|xt=i)
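The matching backward pass, under the same toy setup (βT(i) = 1 by convention, and βt(i) = Σj A(i, j) B(yt+1|j) βt+1(j)):

```python
def backward(A, B, y):
    """beta[t, i] = P(y_{t+1} ... y_T | x_t = i)."""
    T, N = len(y), A.shape[0]
    beta = np.ones((T, N))                         # beta_T(i) = 1 by convention
    for t in range(T - 2, -1, -1):
        # sum over successor states j of A(i -> j) * B(y_{t+1} | j) * beta_{t+1}(j)
        beta[t] = A @ (B[:, y[t + 1]] * beta[t + 1])
    return beta

beta = backward(A, B, y)
```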

Page 14: Forward-Backward algorithm (cont’d)

• And we can easily combine:

P(y, xt=i) = P(xt=i, y1y2…yt) · P(yt+1…yT | xt=i) = αt(i) βt(i)

• And then we get:

P(y) = Σi P(y, xt=i) = Σi αt(i) βt(i)

• Summary: we presented a polynomial algorithm for computing likelihood in HMMs.
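In the toy example above this identity is easy to verify numerically: Σi αt(i) βt(i) gives the same value P(y) at every position t.

```python
alpha, beta = forward(A, B, q, y), backward(A, B, y)
for t in range(len(y)):
    print(t, (alpha[t] * beta[t]).sum())   # the same P(y) for every t
```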

Page 15: HMM – why not?

• Advantages:
– Estimation is very easy
– Closed-form solution
– The parameters can be estimated with relatively high confidence from small samples

• But:
– The model represents all possible (x, y) sequences and defines a joint probability over all possible observation and label sequences, which is needless effort

Page 16: Discriminative Probabilistic Models

“Solve the problem you need to solve”: The traditional approach inappropriately uses a generative joint model in order to solve a conditional problem in which the observations are given. To classify we need p(y|x) – there’s no need to implicitly approximate p(x).

(Figure: generative vs. discriminative model structures)

Page 17: Discriminative Models - Estimation

• Choose Θy to maximize the conditional likelihood:

L(Θy) = Σi log pΘy(yi|xi)

• Estimation usually does not have a closed-form solution

• Example: the MinMI discriminative approach (2nd week lecture)
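As a minimal illustration of conditional-likelihood training without a closed form (logistic regression is used here purely as an example and is not part of the slides), the weights are found by gradient ascent on Σi log pΘ(yi|xi):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.1, steps=500):
    """Maximize the conditional log-likelihood sum_i log p(y_i | x_i) by gradient ascent."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = sigmoid(X @ w)           # p(y=1 | x) under the current weights
        w += lr * X.T @ (y - p)      # gradient of the conditional log-likelihood
    return w

# Tiny synthetic data, just to run the fit
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(float)
print(fit_logistic(X, y))
```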

Page 18: Maximum Entropy Markov Model

• MEMM:
– A conditional model that represents the probability of reaching a state given an observation and the previous state
– These conditional probabilities are specified by exponential models based on arbitrary observation features

Page 19: The Label Bias Problem

• The probability mass that arrives at a state must be distributed among its possible successor states

• Potential victims: Discriminative Models
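A toy numeric illustration of the effect (all numbers are made up and do not reproduce the paper's experiment): with per-state normalization, a state with a single successor must pass on all of its mass, so later observations cannot change the decision made at the first step.

```python
# Two label paths through an MEMM-style FSM for a 3-character word:
#   start -r-> R1 -i-> I -b-> B   (word "rib")
#   start -r-> R2 -o-> O -b-> B   (word "rob")

def memm_branch_scores(word, p_first):
    """P(branch | word) under local (per-state) normalization."""
    scores = {}
    for branch in ("rib", "rob"):
        p = p_first[branch]   # decided on the first character 'r' alone
        # R1 and R2 each have exactly one successor, so the locally normalized
        # transition probability is 1.0 whatever the middle character actually is.
        p *= 1.0              # second step ignores word[1]
        p *= 1.0              # third step ignores word[2]
        scores[branch] = p
    return scores

p_first = {"rib": 0.45, "rob": 0.55}       # say "rob" was slightly more frequent in training
print(memm_branch_scores("rib", p_first))  # {'rib': 0.45, 'rob': 0.55} -> wrong winner
print(memm_branch_scores("rob", p_first))  # identical: the observations never mattered
```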

Page 20: The Label Bias Problem: Solutions

• Determinization of the finite state machine
– Not always possible
– May lead to combinatorial explosion

• Start with a fully connected model and let the training procedure find a good structure
– Prior structural knowledge has proven to be valuable in information extraction tasks

Page 21: Random Field Model: Definition

• Let G = (V, E) be a finite graph, and let A be a finite alphabet.

• The configuration space Ω is the set of all labelings of the vertices in V by letters in A. If C is a subset of V and ω ∈ Ω is a configuration, then ωC denotes the configuration restricted to C.

• A random field on G is a probability distribution on Ω.

Page 22: Random Field Model: The Problem

• Assume that a finite number of features can define a class

• The features fi(ω) are given and fixed.

• The goal: estimate λ to maximize the likelihood of the training examples
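For concreteness, assuming the usual exponential (Gibbs) form for such a feature-based random field, with one weight λk per feature fk:

pλ(ω) = (1/Zλ) exp( Σk λk fk(ω) ),   Zλ = Σω′∈Ω exp( Σk λk fk(ω′) )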

Page 23: Conditional Random Field: Definition

• X – random variable over data sequences

• Y - random variable over label sequences

• Yi is assumed to range over a finite label alphabet A

• Discriminative approach: we construct a conditional model p(y|x) and do not explicitly model marginal p(x)

Page 24: CRF - Definition

• Let G = (V, E) be a finite graph, and let A be a finite alphabet

• Y is indexed by the vertices of G

• Then (X, Y) is a conditional random field if the random variables Yv, conditioned on X, obey the Markov property with respect to the graph:

p(Yv | X, Yw, w ≠ v) = p(Yv | X, Yw, w ~ v),

where w ~ v means that w and v are neighbors in G

Page 25: CRF on Simple Chain Graph

• We will handle the case when G is a simple chain: G = (V = {1, …, m}, E = {(i, i+1)})

(Figure: graphical structures of the HMM (generative), the MEMM (discriminative), and the CRF)

Page 26: Fundamental Theorem of Random Fields (Hammersley & Clifford)

• Assumption: the structure of G is a tree, of which the simple chain is a special case
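For a tree-structured G, the theorem gives the CRF distribution the form used throughout the paper (Lafferty et al., 2001): a sum of weighted features over edges e and vertices v,

p(y|x) ∝ exp( Σe,k λk fk(e, y|e, x) + Σv,k µk gk(v, y|v, x) )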

Page 27: CRF – the Learning Problem

• Assumption: the features fk and gk are given and fixed.
– For example, a boolean feature gk is TRUE if the word Xi is upper case and the label Yi is “noun”.

• The learning problem:
– We need to determine the parameters Θ = (λ1, λ2, …; µ1, µ2, …) from training data D = {(x(i), y(i))} with empirical distribution p~(x, y).

Page 28: CRF – Estimation

• And we return to the log-likelihood maximization problem; this time we need to find Θ that maximizes the conditional log-likelihood:
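O(Θ) = Σi log pΘ(y(i) | x(i)), which is proportional to Σx,y p~(x, y) log pΘ(y | x)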

Page 29: CRF – Estimation

• From now on we assume that the dependencies of Y, conditioned on X, form a chain.

• To simplify some expressions, we add special start and stop states Y0 = start and Yn+1 = stop.

Page 30: CRF – Estimation

• Suppose that p(Y|X) is a CRF. For each position i in the observation sequence X, we define the |Y|×|Y| matrix random variable Mi(x) = [Mi(y', y|x)], where ei is the edge with labels (Yi-1, Yi) and vi is the vertex with label Yi.
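Written out, in the paper's notation, with λk the weights of the edge features fk and µk the weights of the vertex features gk:

Mi(y', y | x) = exp( Σk λk fk(ei, y', y, x) + Σk µk gk(vi, y, x) )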

Page 31: CRF – Estimation

• The normalization function Z(x) and the conditional probability of a label sequence y are written in terms of products of these matrices:
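Z(x) = ( M1(x) M2(x) … Mn+1(x) )start,stop

p(y | x) = ( Πi=1..n+1 Mi(yi-1, yi | x) ) / Z(x)

A small numpy sketch of this bookkeeping (not from the slides: the matrices below are filled with arbitrary positive numbers rather than real feature scores, and label 0 is reused as both start and stop for brevity):

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(0)
n, n_labels = 4, 3            # toy sequence length and |Y|

# Hypothetical positive matrices M_i(y', y | x); in a real CRF each entry would be
# exp(weighted sum of the edge and vertex features at position i).
M = [np.exp(rng.normal(size=(n_labels, n_labels))) for _ in range(n + 1)]

def Z(M):
    """Normalization: (M_1 M_2 ... M_{n+1})_{start, stop}."""
    return np.linalg.multi_dot(M)[0, 0]      # label 0 plays both start and stop here

def p_y_given_x(y, M):
    """Product of matrix entries along the padded path, divided by Z(x)."""
    path = [0] + list(y) + [0]               # pad with the start/stop label
    score = 1.0
    for i, Mi in enumerate(M):
        score *= Mi[path[i], path[i + 1]]
    return score / Z(M)

print(p_y_given_x([1, 2, 0, 1], M))
# Sanity check: the probabilities of all label sequences sum to 1
print(sum(p_y_given_x(list(c), M) for c in product(range(n_labels), repeat=n)))
```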

Page 32: Parameter Estimation for CRFs

• The parameter vector Θ that maximizes the log-likelihood is found using an iterative scaling algorithm.

• We define standard HMM-like forward and backward vectors α and β, which allow the calculations to be done in polynomial time. For example:
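The recurrences, in the paper's matrix notation, are

α0(y|x) = 1 if y = start, 0 otherwise;   αi(x) = αi-1(x) Mi(x)

βn+1(y|x) = 1 if y = stop, 0 otherwise;   βi(x)ᵀ = Mi+1(x) βi+1(x)ᵀ

and Z(x) = αn+1(stop | x) = β0(start | x).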

Page 33: Experimental Results – Set 1

• Set 1: modeling label bias

• Data was generated from a simple HMM which encodes a noisy version of the finite-state network (“rib” / “rob”)

• Each state emits its designated symbol with probability 29/32 and any of the other symbols with probability 1/32

• We train both an MEMM and a CRF

• The observation features are simply the identity of the observation symbols

• 2,000 training and 500 test samples were used

• Results:
– CRF error: 4.6%
– MEMM error: 42%

• Conclusion:
– The MEMM fails to discriminate between the two branches and we get the label bias problem

Page 34: Experimental Results – Set 2

• Set 2: modeling mixed-order sources

• Data was generated from a mixed-order HMM with state transition probabilities given by p(yi | yi-1, yi-2) = α p2(yi | yi-1, yi-2) + (1 - α) p1(yi | yi-1)

• Similarly, emission probabilities are given by p(xi | yi, xi-1) = α p2(xi | yi, xi-1) + (1 - α) p1(xi | yi)

• Thus, for α = 0 we have a standard first-order HMM.

• For each randomly generated model, a sample of 1,000 sequences of length 25 is generated for training and testing.

Page 35: Experimental Results – Set 2

Page 36: Experimental Results – Set 3

• Set 3: Part-Of-Speech tagging experiments

Page 37: Conclusions

• Conditional random fields offer a unique combination of properties:
– discriminatively trained models for sequence segmentation and labeling
– combination of arbitrary and overlapping observation features from both the past and the future
– efficient training and decoding based on dynamic programming for a simple chain graph
– parameter estimation guaranteed to find the global optimum

• The main current limitation of CRFs is the slow convergence of the training algorithm relative to MEMMs, let alone to HMMs, for which training on fully observed data is very efficient.

Page 38: Thank you