Conditional Random Fields - axon.cs.byu.edu

Conditional Random Fields Mike Brodie CS 778


Page 1: Conditional Random Fields - axon.cs.byu.edu

Conditional Random Fields

Mike Brodie CS 778

Page 2: Conditional Random Fields - axon.cs.byu.edu

Motivation

• Part-Of-Speech Tagger

Page 3: Conditional Random Fields - axon.cs.byu.edu

Motivation

object


Page 4: Conditional Random Fields - axon.cs.byu.edu

Motivation

I object!


Page 5: Conditional Random Fields - axon.cs.byu.edu

Motivation

object

‘Do you see that object?’

Page 6: Conditional Random Fields - axon.cs.byu.edu

Motivation

• Part-Of-Speech Tagger - Java CRFTagger

Logan woke up this morning and ate a big bowl of Fruit Loops. On his way to school, a small dog chased after him. Fortunately, Logan's leg had healed and he outran the dog.

INPUT FILE - Logan.txt


Page 7: Conditional Random Fields - axon.cs.byu.edu

Motivation

• Part-Of-Speech Tagger - Java CRFTagger

Logan/NNP woke/VBD up/RP this/DT morning/NN and/CC ate/VB a/DT big/JJ bowl/NN of/IN Fruit/NNP Loops/NNP ./. On/IN his/PRP$ way/NN to/TO school/VB ,/, a/DT small/JJ dog/NN chased/VBN after/IN him/PRP ./. Fortunately/RB ,/, Logan/NNP 's/POS leg/NN had/VBD healed/VBN and/CC he/PRP outran/VB the/DT dog/NN ./.

OUTPUT FILE - Logan.txt.pos
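The output file uses a `word/TAG` token format. As a minimal sketch (the helper name `parse_tagged` is hypothetical and not part of CRFTagger), such a line could be split into (word, tag) pairs like this:

```python
def parse_tagged(text):
    """Split whitespace-separated 'word/TAG' tokens into (word, tag) pairs."""
    pairs = []
    for token in text.split():
        # Split on the LAST slash so a word containing '/' keeps its own slashes.
        word, tag = token.rsplit("/", 1)
        pairs.append((word, tag))
    return pairs


# A short excerpt in the same format as the output file above.
sample = "Logan/NNP woke/VBD up/RP this/DT morning/NN ./."
pairs = parse_tagged(sample)
print(pairs)
```

Splitting on the last slash also handles punctuation tokens such as `./.`, where the word and the tag are the same character.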


Page 8: Conditional Random Fields - axon.cs.byu.edu

Motivation

• Stanford Named Entity Recognition

• Recognizes names, places, organizations

• http://nlp.stanford.edu:8080/ner/process

Page 9: Conditional Random Fields - axon.cs.byu.edu

Motivation

• Image Applications

Segmentation

Labels: Hat, Shirt

Cup, Stanley

Multi-Label Classification

Page 10: Conditional Random Fields - axon.cs.byu.edu

Motivation

• Bioinformatics Applications

RNA Structural Alignment

Protein Structure Prediction

Page 11: Conditional Random Fields - axon.cs.byu.edu

Motivation

Cupcakes with Rainbow Filling


Page 13: Conditional Random Fields - axon.cs.byu.edu

Background

Isn't it about time… …for Graphical Models?

Represent a distribution as a product of local functions that each depend on a smaller subset of variables

Page 14: Conditional Random Fields - axon.cs.byu.edu

Background

Generative vs Discriminative

$P(x \mid y)\,P(y) = P(x, y) = P(y \mid x)\,P(x)$

• Generative models describe how a label vector y can probabilistically “generate” a feature vector x.

• Discriminative models describe how to take a feature vector x and assign it a label y.

• Either type of model can be converted to the other type using Bayes’ rule
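To make the conversion concrete, here is a small sketch with a made-up joint table over one binary feature and one binary label (all numbers are illustrative assumptions). It recovers the discriminative quantity P(y|x) from the generative pieces P(x|y) and P(y):

```python
# Hypothetical joint distribution P(x, y); the four entries sum to 1.
joint = {
    ("x0", "y0"): 0.30, ("x0", "y1"): 0.10,
    ("x1", "y0"): 0.20, ("x1", "y1"): 0.40,
}


def marginal_x(x):
    """P(x): sum the joint over all labels."""
    return sum(p for (xi, _), p in joint.items() if xi == x)


def marginal_y(y):
    """P(y): sum the joint over all feature values."""
    return sum(p for (_, yi), p in joint.items() if yi == y)


def p_x_given_y(x, y):
    """Generative direction: P(x | y) = P(x, y) / P(y)."""
    return joint[(x, y)] / marginal_y(y)


def p_y_given_x(y, x):
    """Discriminative direction via Bayes' rule: P(x | y) P(y) / P(x)."""
    return p_x_given_y(x, y) * marginal_y(y) / marginal_x(x)


posterior = p_y_given_x("y1", "x1")
print(round(posterior, 4))
```

The same posterior can be read directly off the joint table as P(x1, y1) / P(x1), which is the identity on the slide.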



Page 16: Conditional Random Fields - axon.cs.byu.edu

Background

Joint Distributions

$P(x, y)$

Problems:
- Modeling inputs => intractable models
- Ignoring inputs => poor performance

Page 17: Conditional Random Fields - axon.cs.byu.edu

Background

Joint Distributions

$P(x, y)$

Naive Bayes

- Generative model
- Strong assumptions to simplify computation


Page 19: Conditional Random Fields - axon.cs.byu.edu

Background

Generative vs Discriminative

Naive Bayes -> (conditional) -> Logistic Regression

If naive Bayes is trained to maximize conditional likelihood, it yields the same classifier as logistic regression.

Page 20: Conditional Random Fields - axon.cs.byu.edu

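Background

Generative vs Discriminative

Naive Bayes -> (conditional) -> Logistic Regression

Conversely, if logistic regression is interpreted generatively and trained to maximize joint likelihood, it yields the same classifier as naive Bayes.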


Page 22: Conditional Random Fields - axon.cs.byu.edu

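Background

Sequence Labeling

Hidden Markov Model (HMM)

ASSUMPTIONS:

$y_t \perp\!\!\!\perp y_1, \dots, y_{t-2} \mid y_{t-1}$ (each label depends only on the previous label)

$x_t \perp\!\!\!\perp Y \setminus \{y_t\},\; X \setminus \{x_t\} \mid y_t$ (each observation depends only on its own label)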
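Under these two assumptions the HMM joint probability factorizes into one transition term and one emission term per position. The sketch below uses a made-up two-tag model (the probability tables are illustrative assumptions, not values from the talk):

```python
import itertools

# Hypothetical HMM parameters; each row sums to 1.
start = {"NOUN": 0.6, "VERB": 0.4}
trans = {"NOUN": {"NOUN": 0.3, "VERB": 0.7},
         "VERB": {"NOUN": 0.8, "VERB": 0.2}}
emit = {"NOUN": {"I": 0.5, "object": 0.5},
        "VERB": {"I": 0.1, "object": 0.9}}


def hmm_joint(words, labels):
    """p(x, y) = prod_t p(y_t | y_{t-1}) * p(x_t | y_t), with p(y_1) from `start`."""
    prob = 1.0
    prev = None
    for word, label in zip(words, labels):
        # Transition term: depends only on the previous label.
        prob *= start[label] if prev is None else trans[prev][label]
        # Emission term: depends only on the current label.
        prob *= emit[label][word]
        prev = label
    return prob


p = hmm_joint(["I", "object"], ["NOUN", "VERB"])
print(round(p, 4))

# Sanity check: the joint sums to 1 over all word and label sequences of length 2.
total = sum(
    hmm_joint(ws, ys)
    for ws in itertools.product(["I", "object"], repeat=2)
    for ys in itertools.product(["NOUN", "VERB"], repeat=2)
)
```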

Page 23: Conditional Random Fields - axon.cs.byu.edu

Background

Sequence Labeling

Hidden Markov Model (HMM)

PROBLEMS:
- Later labels cannot influence previous labels
- Cannot represent overlapping features


Page 25: Conditional Random Fields - axon.cs.byu.edu

Background

Improvements to HMMs

Maximum Entropy Markov Model (MEMM)

$P(S_1, \dots, S_n \mid O_1, \dots, O_n) = \prod_{t=1}^{n} P(S_t \mid S_{t-1}, O_t)$

[Figure: MEMM graph in which $S_{t-1}$ and $O_t$ both point to $S_t$]

Page 26: Conditional Random Fields - axon.cs.byu.edu

Background

Improvements to HMMs

Conditional Markov Model (CMM)

More flexible form of context dependence

PROS:
- Independence assumptions among labels, not observations

CONS:
- Only models a log-linear distribution over component transition distributions

Page 27: Conditional Random Fields - axon.cs.byu.edu

Background

Improvements to HMMs

Label Bias Problem

Page 28: Conditional Random Fields - axon.cs.byu.edu

Solution

Conditional Random Fields

- Drop the local normalization constant
- Normalize GLOBALLY over the full graph

Avoids the label bias problem

Page 29: Conditional Random Fields - axon.cs.byu.edu

Solution

Conditional Random Fields

Calculate the path probability by normalizing the path score by the sum of the scores of all possible paths

- Poorly matching path -> low score (probability)
- Well-matching path -> high score
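Global normalization can be illustrated by brute force on a tiny example. This is only a sketch: the two-tag label set and the feature weights are made up, and enumerating every path is feasible only for very short sentences.

```python
import itertools
import math

LABELS = ["NOUN", "VERB"]

# Hypothetical weights for transition and emission features.
WEIGHTS = {
    ("emit", "NOUN", "I"): 2.0,
    ("emit", "VERB", "object"): 2.0,
    ("trans", "NOUN", "VERB"): 1.0,
}


def path_score(words, labels):
    """Unnormalized log-score of one labeling: the sum of its active feature weights."""
    score = 0.0
    prev = "START"
    for word, label in zip(words, labels):
        score += WEIGHTS.get(("trans", prev, label), 0.0)
        score += WEIGHTS.get(("emit", label, word), 0.0)
        prev = label
    return score


def path_probability(words, labels):
    """exp(score of this path) divided by the sum of exp(score) over ALL paths."""
    z = sum(math.exp(path_score(words, cand))
            for cand in itertools.product(LABELS, repeat=len(words)))
    return math.exp(path_score(words, labels)) / z


words = ["I", "object"]
good = path_probability(words, ("NOUN", "VERB"))  # well-matching path
bad = path_probability(words, ("VERB", "NOUN"))   # poorly matching path
print(good > bad)
```

Because the normalizer is a single sum over whole paths, no individual state is forced to distribute probability mass among its successors, which is the property that avoids label bias.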



Page 31: Conditional Random Fields - axon.cs.byu.edu

Solution

Conditional Random Fields

Similar to CMM:
- Discriminative
- Model the conditional distribution p(y|x)
- Allow arbitrary, overlapping features

TAKEAWAY:
- Retains advantages of CMM over HMM
- Overcomes label bias problem of CMM and MEMM

Page 32: Conditional Random Fields - axon.cs.byu.edu

Linear-Chain CRFs

Conditional Random Fields

Remember the HMM


Page 34: Conditional Random Fields - axon.cs.byu.edu

Linear-Chain CRFs

Conditional Random Fields

By expanding Z, we have the conditional distribution p(y|x)
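The equation on this slide did not survive extraction. In the usual textbook form of a linear-chain CRF it would read roughly as follows. The notation is an assumption: $\lambda_k$ for the weights, $f_k$ for the feature functions, $T$ for the sequence length.

```latex
p(y \mid x) = \frac{1}{Z(x)} \prod_{t=1}^{T}
  \exp\Big\{ \sum_{k=1}^{K} \lambda_k f_k(y_t, y_{t-1}, x_t) \Big\},
\qquad
Z(x) = \sum_{y'} \prod_{t=1}^{T}
  \exp\Big\{ \sum_{k=1}^{K} \lambda_k f_k(y'_t, y'_{t-1}, x_t) \Big\}
```

Here $Z(x)$ sums over every possible label sequence $y'$, which is what makes the normalization global rather than per-state.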


Page 35: Conditional Random Fields - axon.cs.byu.edu

Feature Functions

students.cs.byu.edu/~mbrodie/778



Page 37: Conditional Random Fields - axon.cs.byu.edu

Linear-Chain CRFs

Conditional Random Fields

Page 38: Conditional Random Fields - axon.cs.byu.edu

Feature Engineering

• Larger feature set benefits:
  • Greater prediction accuracy
  • More flexible decision boundary
• However, may lead to overfitting

Page 39: Conditional Random Fields - axon.cs.byu.edu

Feature Engineering

Label-Observation Features

Where $x_t$ is an observation function rather than a specific word value

For example: “word is capitalized”, “word ends in ing”
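A label-observation feature of this kind pairs one label with one observation test. The sketch below shows one way to build such binary features; the helper names and the tags `NNP` and `VBG` are illustrative choices, not the talk's own code.

```python
def is_capitalized(word):
    """Observation function: does the word start with an uppercase letter?"""
    return word[:1].isupper()


def ends_in_ing(word):
    """Observation function: does the word end in 'ing'?"""
    return word.lower().endswith("ing")


def make_feature(label, observation):
    """Build f(y_t, x_t) = 1 if y_t == label and observation(x_t) holds, else 0."""
    def feature(y_t, x_t):
        return 1 if y_t == label and observation(x_t) else 0
    return feature


f_proper_noun = make_feature("NNP", is_capitalized)
f_gerund = make_feature("VBG", ends_in_ing)

print(f_proper_noun("NNP", "Logan"), f_gerund("VBG", "running"))
```

Because the observation tests look at word shape rather than word identity, a single feature fires for many different words.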



Page 41: Conditional Random Fields - axon.cs.byu.edu

Feature Engineering

Unsupported Features

Features that never occur in the training data.

For example: word $x_t$ is ‘with’ and label $y_t$ is ‘NOUN’

- Can be useful: they can be assigned negative weights to prevent spurious labels from receiving high probability
- Greatly increases the number of parameters; however, usually gives a slight boost in accuracy

Page 42: Conditional Random Fields - axon.cs.byu.edu

Feature Engineering

Boundary Labels

Boundary labels (start/end of sentence, edge of image) can have different characteristics from other labels.

For example: capitalization in the middle of a sentence generally indicates a proper noun (but not necessarily at the beginning of a sentence).

Page 43: Conditional Random Fields - axon.cs.byu.edu

Feature Engineering

Other Methods

• Feature Induction
  • Begin with a number of base features.
  • Gradually add conjunctions to those features
• Normalize Features
• Convert categorical to binary features
• Features from Different Time Steps

Page 44: Conditional Random Fields - axon.cs.byu.edu

Feature Engineering

Other Methods

• Features as Model Combination
• Input Dependent Structure
  • e.g. Skip-Chain CRFs
  • Encourage identical words in a sentence to have the same label
  • Do this by adding an edge in the graph
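The skip-chain idea can be sketched as a function that proposes the extra edges: one edge for every pair of positions holding the same word. The function name and the exact-match rule are assumptions for illustration; real systems often restrict this to, say, capitalized words.

```python
def skip_chain_edges(words):
    """Return (i, j) index pairs with i < j whose words are identical."""
    seen = {}   # word -> list of earlier positions where it appeared
    edges = []
    for j, word in enumerate(words):
        for i in seen.get(word, []):
            edges.append((i, j))
        seen.setdefault(word, []).append(j)
    return edges


sentence = "Logan outran the dog and Logan smiled".split()
edges = skip_chain_edges(sentence)
print(edges)
```

Each returned pair would become an additional factor in the graph on top of the usual neighbor-to-neighbor chain edges, so the graph structure depends on the input sentence.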


Page 45: Conditional Random Fields - axon.cs.byu.edu

Linear-Chain CRFs

Conditional Random Fields

HMM-like CRF

Previous observation affects transition

Page 46: Conditional Random Fields - axon.cs.byu.edu

Feature Effects

Features                           | Accuracy | # Feature Functions
yt-1 and yt                        | 0.0449   | 296
xt and yt                          | 0.6282   | 486
xt and yt AND xt and yt-1 and yt   | 0.6603   | 1176
xt and yt AND yt-1 and yt          | 0.0929   | 782

Pseudo Weighted Functions on the Brown Corpus: 38 Training, 12 Test Sentences


Page 48: Conditional Random Fields - axon.cs.byu.edu

Linear-Chain CRFs

Parameter Estimation

We want to find parameters for feature functions:

Maximize the conditional log likelihood:
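The objective itself was an image and is missing from this transcript. In the standard formulation it would read as below. The notation is an assumption: training pairs $(x^{(i)}, y^{(i)})$ for $i = 1, \dots, N$ and one weight $\lambda_k$ per feature function.

```latex
\theta = \{\lambda_k\}_{k=1}^{K},
\qquad
\ell(\theta) = \sum_{i=1}^{N} \log p\big(y^{(i)} \mid x^{(i)}\big)
```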




Page 51: Conditional Random Fields - axon.cs.byu.edu

Linear-Chain CRFs

Parameter Estimation

Which becomes…

Penalize large weight vectors
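The expanded objective is missing from the transcript. A common form, assuming a linear-chain CRF with weights $\lambda_k$ and a Gaussian (L2) penalty with variance $\sigma^2$ (both the notation and the choice of penalty are assumptions), is:

```latex
\ell(\theta) =
  \sum_{i=1}^{N} \sum_{t=1}^{T} \sum_{k=1}^{K}
    \lambda_k f_k\big(y_t^{(i)}, y_{t-1}^{(i)}, x_t^{(i)}\big)
  - \sum_{i=1}^{N} \log Z\big(x^{(i)}\big)
  - \sum_{k=1}^{K} \frac{\lambda_k^2}{2\sigma^2}
```

The last term is the one that penalizes large weight vectors; a smaller $\sigma^2$ means stronger regularization.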

Page 52: Conditional Random Fields - axon.cs.byu.edu

Linear-Chain CRFs

Parameter Estimation

$\ell(\theta)$ cannot be maximized in closed form

Instead use numerical optimization => find derivatives w.r.t. $\lambda_k$
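The derivative shown on the slide is not recoverable from the transcript. For an L2-regularized linear-chain CRF it takes the following well-known form, using assumed notation: $\lambda_k$ for the weights, $\sigma^2$ for the penalty variance, and $p(y_t = y, y_{t-1} = y' \mid x^{(i)})$ for the edge marginals.

```latex
\frac{\partial \ell}{\partial \lambda_k} =
  \sum_{i=1}^{N} \sum_{t=1}^{T}
    f_k\big(y_t^{(i)}, y_{t-1}^{(i)}, x_t^{(i)}\big)
  - \sum_{i=1}^{N} \sum_{t=1}^{T} \sum_{y, y'}
    f_k\big(y, y', x_t^{(i)}\big)\,
    p\big(y_t = y, y_{t-1} = y' \mid x^{(i)}\big)
  - \frac{\lambda_k}{\sigma^2}
```

The first term is the empirical count of feature $k$, the second is its expected count under the current model, and the third comes from the penalty. Computing the second term is why training needs the edge marginals.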


Page 56: Conditional Random Fields - axon.cs.byu.edu

Linear-Chain CRFs

Parameter Estimation

How to optimize $\ell(\theta)$?

- Steepest ascent along the gradient
- Newton’s method
- (L-)BFGS


Page 60: Conditional Random Fields - axon.cs.byu.edu

Feature Effects

Features                           | Accuracy | # Feature Functions | Trained Accuracy (10 Epochs)
yt-1 and yt                        | 0.0449   | 296                 | 0.1185
xt and yt                          | 0.6282   | 486                 | 0.6538
xt and yt AND xt and yt-1 and yt   | 0.6603   | 1176                | 0.6731
xt and yt AND yt-1 and yt          | 0.0929   | 782                 | 0.2724

Pseudo Weighted Functions on the Brown Corpus: 38 Training, 12 Test Sentences

Page 61: Conditional Random Fields - axon.cs.byu.edu

Linear-Chain CRFs

Inference

During Training:
- Marginal distributions for each edge
- Compute Z(x)

During Testing:
- Viterbi algorithm to compute the most likely labeling
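For the test-time step, a minimal Viterbi sketch over log-potentials might look like the following. The scoring function, tag set, and weights are hypothetical, and the sketch assumes a non-empty sentence.

```python
# Hypothetical log-weights for emission (word, label) and transition (prev, label).
EMIT = {("I", "PRP"): 2.0, ("object", "VB"): 1.0, ("object", "NN"): 1.0}
TRANS = {("PRP", "VB"): 1.5}


def toy_score(prev, label, word):
    """Log-potential for labeling `word` as `label` after `prev` (None at t = 0)."""
    return EMIT.get((word, label), 0.0) + TRANS.get((prev, label), 0.0)


def viterbi(words, labels, score):
    """Return the highest-scoring label sequence by dynamic programming."""
    # best[t][y] = best log-score of any partial path ending in label y at position t
    best = [{y: score(None, y, words[0]) for y in labels}]
    back = []  # back[t-1][y] = best previous label for y at position t
    for word in words[1:]:
        scores, pointers = {}, {}
        for y in labels:
            prev = max(labels, key=lambda p: best[-1][p] + score(p, y, word))
            scores[y] = best[-1][prev] + score(prev, y, word)
            pointers[y] = prev
        best.append(scores)
        back.append(pointers)
    # Trace back from the best final label.
    path = [max(labels, key=lambda y: best[-1][y])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


tags = viterbi(["I", "object"], ["PRP", "VB", "NN"], toy_score)
print(tags)
```

The recursion has the same shape as the sum-product recursion used for Z(x), with the sum over previous labels replaced by a max.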


Page 64: Conditional Random Fields - axon.cs.byu.edu

Linear-Chain CRFs

Inference

Use Forward-Backward Approach in HMM:

Factor Definition:

Rewrite Using Distributive Law:


Page 66: Conditional Random Fields - axon.cs.byu.edu

Linear-Chain CRFs

Inference

Use the forward-backward approach from HMMs: this leads to a set of forward variables $\alpha_t$

Compute $\alpha_t$ by recursion

Page 67: Conditional Random Fields - axon.cs.byu.edu

Linear-Chain CRFs

Inference

Use the forward-backward approach from HMMs: similarly, we compute a set of backward variables $\beta_t$

Compute $\beta_t$ by recursion
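The two recursions can be sketched on a toy chain as follows. The factor function `psi` and its weights are made up for illustration, and the brute-force sum is included only to check that the forward pass yields the same Z(x).

```python
import itertools
import math


def psi(prev, label, word):
    """Toy factor: exp of the summed weights of active (hypothetical) features."""
    weight = 0.0
    if word[:1].isupper() and label == "NNP":
        weight += 1.0
    if prev == "NNP" and label == "VB":
        weight += 0.5
    return math.exp(weight)


def forward(words, labels):
    """alpha[t][y]: total factor mass of all label prefixes ending in y at position t."""
    alpha = [{y: psi(None, y, words[0]) for y in labels}]
    for word in words[1:]:
        prev_alpha = alpha[-1]
        alpha.append({y: sum(prev_alpha[p] * psi(p, y, word) for p in labels)
                      for y in labels})
    return alpha


def backward(words, labels):
    """beta[t][y]: total factor mass of all label suffixes that follow y at position t."""
    beta = [{y: 1.0 for y in labels}]
    for word in reversed(words[1:]):
        next_beta = beta[0]
        beta.insert(0, {p: sum(psi(p, y, word) * next_beta[y] for y in labels)
                        for p in labels})
    return beta


def brute_force_z(words, labels):
    """Z(x) by enumerating every label sequence (exponential; for checking only)."""
    total = 0.0
    for seq in itertools.product(labels, repeat=len(words)):
        prod, prev = 1.0, None
        for word, y in zip(words, seq):
            prod *= psi(prev, y, word)
            prev = y
        total += prod
    return total


words = ["Logan", "outran", "dogs"]
labels = ["NNP", "VB", "NN"]
alpha, beta = forward(words, labels), backward(words, labels)
z = sum(alpha[-1].values())
# Node marginals p(y_t = y | x) = alpha_t(y) * beta_t(y) / Z(x)
marginals = [{y: alpha[t][y] * beta[t][y] / z for y in labels}
             for t in range(len(words))]
print(abs(z - brute_force_z(words, labels)) < 1e-9)
```

The forward pass costs time linear in the sentence length and quadratic in the number of labels, whereas the brute-force sum grows exponentially with sentence length.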


Page 69: Conditional Random Fields - axon.cs.byu.edu

Linear-Chain CRFs

Inference

Compute the marginals needed for the gradient using the forward and backward results

Page 70: Conditional Random Fields - axon.cs.byu.edu

Linear-Chain CRFs

Inference

How does this compare to CRF training/inference?


Page 73: Conditional Random Fields - axon.cs.byu.edu

Breakthrough


Page 74: Conditional Random Fields - axon.cs.byu.edu

My Implementation

• 3 Different Implementations

1. Tensorflow

Page 75: Conditional Random Fields - axon.cs.byu.edu

Tensorflow Tips

Do NOT start with CRFs

Page 76: Conditional Random Fields - axon.cs.byu.edu

Tensorflow Tips

Be prepared for awkward, sometimes unintuitive formats

Page 77: Conditional Random Fields - axon.cs.byu.edu

Tensorflow

• Gaaaaah!!
  • len(X) => X.get_shape().dims[0].value
• Start out with Theano
  • More widely used
  • More intuitive


Page 79: Conditional Random Fields - axon.cs.byu.edu

My Implementation

• 3 Different Implementations

1. Tensorflow

2. Scipy Optimization (fmin_l_bfgs_b)

3. Gradient Descent

A. Trainable Version

B. Simplified, Web-based Version

Page 80: Conditional Random Fields - axon.cs.byu.edu

Experiments

Data Sets

Brown Corpus (News Articles)
• 5,000 POS Sentences
• 27,596 Tags
• 75/25 Train-Test Split

Penn Treebank Corpus (WSJ)
• 3,914 sentence SAMPLE
• $1,700 (or free BYU version)
• 75/25 Train-Test Split

Page 81: Conditional Random Fields - axon.cs.byu.edu

Experiments

Comparison Models

• Hidden Markov Model
• Conditional Random Field
• Averaged-Multilayer Perceptron
  • Average the weight updates -> prevent radical changes due to different training batches

Python Natural Language Toolkit (NLTK)

Page 82: Conditional Random Fields - axon.cs.byu.edu

Experiments

Brown Corpus Results

[Figure: bar chart of accuracy (y-axis from 30 to 100) for HMM, Avg-MLP, NLTK-CRF, and Brodie-CRF]

Page 83: Conditional Random Fields - axon.cs.byu.edu

Experiments

Penn Corpus Results

[Figure: bar chart of accuracy (y-axis from 75 to 100) for HMM, Avg-MLP, NLTK-CRF, and Brodie-CRF]

Page 84: Conditional Random Fields - axon.cs.byu.edu

Pros/Cons of Implementation

Pros

• Can use forward/backward and Viterbi Algorithm

• Learn weights for features

Cons

• Slow

• Limited to gradient descent training


Page 85: Conditional Random Fields - axon.cs.byu.edu

Extensions

• General CRFs

• Skip-chain CRFs

• Hidden Node CRFs


Page 86: Conditional Random Fields - axon.cs.byu.edu

CRF-RNN + CNN

• Problem: Traditional CNNs produce coarse outputs for pixel-level labels.

• Solution: Apply CRF as a post-processing refinement

Page 87: Conditional Random Fields - axon.cs.byu.edu

BI-LSTM-CRF

Page 88: Conditional Random Fields - axon.cs.byu.edu

Future Directions

• New model combinations of CRFs

• Better training approximations (i.e. besides MCMC sampling methods)

• Additional ‘hidden’ architectures


Page 91: Conditional Random Fields - axon.cs.byu.edu

Bibliography

Other Resources

• Tom Mitchell’s Machine Learning. Online draft version of ‘Generative and Discriminative Classifiers: Naive Bayes and Logistic Regression.’ http://www.cs.cmu.edu/~tom/mlbook/NBayesLogReg.pdf. Accessed January 25, 2016.

• Edwin Chen Blog. Introduction to Conditional Random Fields. http://blog.echen.me/2012/01/03/introduction-to-conditional-random-fields/. Accessed February 2, 2016.

• LingPipe Tutorial. http://alias-i.com/lingpipe/demos/tutorial/crf/read-me.html. Accessed February 2, 2016.

Page 92: Conditional Random Fields - axon.cs.byu.edu

Thank You

Questions?