Transcript of CS294-1 Behavioral Data Mining (SP13), Lecture 16: Natural Language Processing I (jfc/DataMining/SP13/lecs/lec16.pdf)

Page 1:

Behavioral Data Mining

Lecture 16

Natural Language Processing I

Page 2:

Outline

• Parts-of-Speech and POS taggers

• HMMs

• Named Entity Recognition

• Conditional Random Fields (CRFs)

Page 3:

Part-Of-Speech Tags

• Part-of-speech tags provide limited but useful information about word type (e.g. local sentiment attachment).

• They are widely used as an alternative to full parsing, which can be very slow.

• While our intuition about tags tends to be semantic (nouns are "people, places, things"), classification and inference are based on morphology and context.

Page 4:

Penn Treebank Tagset

Page 5:

Part-Of-Speech Tags

POS tagging normally follows tokenization, which includes punctuation:

Sheena | leads | , | Sheila | needs | .
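As a rough illustration, a minimal tokenizer that splits punctuation into separate tokens (a sketch, not the tokenizer used in the course) can be written with a single regular expression:

```python
import re

def tokenize(text):
    # Keep runs of word characters intact; emit each punctuation
    # character as its own token.
    return re.findall(r"\w+|[^\w\s]", text)

print(tokenize("Sheena leads, Sheila needs."))
```

Note how the comma and the final period come out as tokens of their own, matching the example above.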

Page 6:

POS tags

• Morphology:

– “liked,” “follows,” “poke”

• Context: “can” can be:

– “trash can,” “can do,” “can it”

• Therefore taggers should look at the neighborhood of a word.

Page 7:

Constraint-Based Tagging

Based on a table of words with morphological and context features.
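A toy sketch of the idea: a lexicon lists candidate tags per word, and simple context rules prune the candidates. The lexicon entries and rules below are invented for illustration, not taken from the slides:

```python
# Candidate tags per word; "can" is ambiguous between modal ("can do"),
# noun ("trash can"), and verb ("can it").
LEXICON = {
    "can":   {"MD", "NN", "VB"},
    "trash": {"NN"},
    "it":    {"PRP"},
}

def tag(tokens):
    tags = []
    for i, w in enumerate(tokens):
        cands = set(LEXICON.get(w, {"NN"}))   # unknown words default to noun
        # Context rule: right after a noun, prefer the noun reading.
        if i > 0 and tags[-1] == "NN" and "NN" in cands:
            cands = {"NN"}
        # Context rule: directly before a pronoun, "can" acts as a verb.
        if i + 1 < len(tokens) and tokens[i + 1] == "it" and "VB" in cands:
            cands = {"VB"}
        tags.append(sorted(cands)[0])          # pick a remaining candidate
    return tags
```

Running `tag(["trash", "can"])` keeps the noun reading, while `tag(["can", "it"])` picks the verb reading, as the context rules intend.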

Page 8:

Outline

• Parts-of-Speech and POS taggers

• HMMs

• Named Entity Recognition

• Conditional Random Fields (CRFs)

Page 9:

Markov Models (notes due to David Hall)

• A Markov model is a chain-structured Bayes net

– Each node is identically distributed (stationarity)

– The value of X at a given time is called the state

– Parameters, called transition probabilities or dynamics, specify how the state evolves over time (plus initial probabilities)

[Diagram: Markov chain X1 → X2 → X3 → X4]

Page 10:

Hidden Markov Models

• Markov chains are not so useful for most applications

– Eventually you don't know anything anymore

– Need observations to update your beliefs

• Hidden Markov models (HMMs)

– Underlying Markov chain over states S

– You observe outputs (effects) at each time step

[Diagram: HMM with hidden chain X1 → X2 → X3 → X4 → X5 and one observation E1…E5 emitted per state]

Page 11:

Example

• An HMM is defined by:

– Initial distribution: P(X1)

– Transitions: P(X_t | X_{t-1})

– Emissions: P(E_t | X_t)
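Concretely, these three components can be written down as plain Python dictionaries. The Rain/Sun "umbrella" numbers below are a standard textbook toy, not values from the lecture:

```python
# P(X1): where the chain starts.
initial = {"Rain": 0.5, "Sun": 0.5}

# P(X_t | X_{t-1}): how the hidden state evolves.
transition = {"Rain": {"Rain": 0.7, "Sun": 0.3},
              "Sun":  {"Rain": 0.3, "Sun": 0.7}}

# P(E_t | X_t): what each hidden state emits.
emission = {"Rain": {"umbrella": 0.9, "none": 0.1},
            "Sun":  {"umbrella": 0.2, "none": 0.8}}

# Sanity check: every conditional distribution sums to 1.
assert abs(sum(initial.values()) - 1.0) < 1e-9
```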

Page 12:

Conditional Independence

• HMMs have two important independence properties:

– Markov hidden process: the future depends on the past only via the present

– The current observation is independent of everything else given the current state

• Does this mean that observations are independent given no evidence?

– [No: they are correlated by the hidden state]

[Diagram: HMM with hidden chain X1…X5 and emissions E1…E5]

Page 13:

Real HMM Examples

POS tagging: observations are words, states are part-of-speech tags

DNA alignment: observations are nucleotides (ACGT); transitions are substitutions, insertions, and deletions

Named entity recognition / field segmentation: observations are words (tens of thousands); states are "Person", "Place", "Other", etc. (usually 3–20)

Ad targeting: observations are click streams; states are interests, or likelihood to buy

Page 14:

Named Entity Recognition

• People, places, (specific) things

– "Chez Panisse": Restaurant

– "Berkeley, CA": Place

– "Churchill-Brenneis Orchards": Company(?)

– Not "Page" or "Medjool": part of a non-rigid designator


Page 15:

Named Entity Recognition

• Why a sequence model?

– "Page" can be "Larry Page" or "Page mandarins" or just a page.

– "Berkeley" can be a person.


Page 16:

Named Entity Recognition

• Chez Panisse, Berkeley, CA - A bowl of Churchill-Brenneis Orchards Page mandarins and Medjool dates

• States:

– Restaurant, Place, Company, GPE, Movie, "Outside"

– Usually use BeginRestaurant and InsideRestaurant states rather than a single Restaurant state. Why? (So that two adjacent entities of the same type stay distinguishable.)

• Emissions: words

– Estimating these probabilities is harder.
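The Begin/Inside distinction can be sketched as a small encode-and-decode example. The tokens, labels, and tag names below are illustrative, not the course's label set:

```python
# A BIO-style tagging of a token sequence: B- opens an entity,
# I- continues it, O means "outside any entity".
tokens = ["A", "bowl", "of", "Churchill-Brenneis", "Orchards", "Page", "mandarins"]
tags   = ["O", "O",    "O",  "B-Company",          "I-Company", "O",   "O"]

def spans(tokens, tags):
    """Recover (label, phrase) spans from a BIO tag sequence."""
    out, cur, label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if cur:
                out.append((label, " ".join(cur)))
            cur, label = [tok], tag[2:]
        elif tag.startswith("I-") and cur:
            cur.append(tok)
        else:
            if cur:
                out.append((label, " ".join(cur)))
            cur, label = [], None
    if cur:
        out.append((label, " ".join(cur)))
    return out
```

Decoding recovers the multi-word entity "Churchill-Brenneis Orchards" as a single Company span, while "Page" stays outside it.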


Page 17:

Filtering / Monitoring

• Filtering, or monitoring, is the task of tracking the distribution B(X) (the belief state) over time

• We start with B(X) in an initial setting, usually uniform

• As time passes, or we get observations, we update B(X)

• The Kalman filter (a continuous HMM) was invented in the 1960s and first implemented as a method of trajectory estimation for the Apollo program

Page 18:

Passage of Time

Assume we have the current belief B(x_t) = P(x_t | e_{1:t}), given the evidence to date.

Then, after one time step passes:

P(x_{t+1} | e_{1:t}) = Σ_{x_t} P(x_{t+1} | x_t) P(x_t | e_{1:t})

Or, compactly: B'(X_{t+1}) = Σ_{x_t} P(X_{t+1} | x_t) B(x_t)

Basic idea: beliefs get "pushed" through the transitions

[Diagram: X1 → X2]
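The "push through the transitions" step can be sketched in a few lines. The Rain/Sun transition matrix below is an invented toy; note how, with no observations, repeated updates mix toward the uniform distribution (the "eventually you don't know anything" effect):

```python
states = ["Rain", "Sun"]
trans = {"Rain": {"Rain": 0.7, "Sun": 0.3},
         "Sun":  {"Rain": 0.3, "Sun": 0.7}}

def time_update(belief):
    # B'(x') = sum over x of P(x' | x) * B(x): push the belief one step.
    return {s: sum(trans[s2][s] * belief[s2] for s2 in states) for s in states}

belief = {"Rain": 1.0, "Sun": 0.0}   # start fully certain it is raining
for _ in range(50):                  # no observations arrive
    belief = time_update(belief)
```

After many unobserved steps the belief is essentially uniform, which is why observations are needed to stay informed.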

Page 19:

Example HMM

Page 20:

The Forward Algorithm

We are given evidence at each time step and want to know the belief P(x_t | e_{1:t}).

We can derive the following update:

P(x_t | e_{1:t}) ∝ P(e_t | x_t) Σ_{x_{t-1}} P(x_t | x_{t-1}) P(x_{t-1} | e_{1:t-1})

We can normalize as we go if we want to have P(x | e) at each time step, or just once at the end.

Page 21:

The Forward Algorithm

• Expressible as repeated matrix multiplication

• Careful: can get numerical underflow very quickly! – Use logs or renormalize after each product
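A minimal sketch of the forward algorithm with per-step renormalization (one of the two fixes suggested above), using an invented Rain/Sun toy HMM:

```python
states = ["Rain", "Sun"]
initial = {"Rain": 0.5, "Sun": 0.5}
trans = {"Rain": {"Rain": 0.7, "Sun": 0.3},
         "Sun":  {"Rain": 0.3, "Sun": 0.7}}
emit = {"Rain": {"umbrella": 0.9, "none": 0.1},
        "Sun":  {"umbrella": 0.2, "none": 0.8}}

def normalize(d):
    z = sum(d.values())
    return {k: v / z for k, v in d.items()}

def forward(obs):
    # Initialize with P(x1) * P(e1 | x1), renormalized.
    belief = normalize({s: initial[s] * emit[s][obs[0]] for s in states})
    for e in obs[1:]:
        # Push through transitions, weight by the emission, renormalize.
        belief = normalize({s: emit[s][e] * sum(trans[s2][s] * belief[s2]
                                                for s2 in states)
                            for s in states})
    return belief    # P(X_t | e_1 .. e_t)
```

Renormalizing after every step keeps the numbers in a safe range, so this version does not underflow even on long observation sequences.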


Page 22:

Most Likely Sequence

The Viterbi Algorithm

Just like the forward algorithm, but with max instead of sum:

V_t(x) = P(e_t | x) max_{x_{t-1}} P(x | x_{t-1}) V_{t-1}(x_{t-1})

Compare the forward algorithm, which sums over x_{t-1} instead of maximizing.

How do we actually extract the max sequence from V?

– Store back pointers to the best predecessor at each step
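A matching sketch of Viterbi with back pointers, on the same invented Rain/Sun toy HMM:

```python
states = ["Rain", "Sun"]
initial = {"Rain": 0.5, "Sun": 0.5}
trans = {"Rain": {"Rain": 0.7, "Sun": 0.3},
         "Sun":  {"Rain": 0.3, "Sun": 0.7}}
emit = {"Rain": {"umbrella": 0.9, "none": 0.1},
        "Sun":  {"umbrella": 0.2, "none": 0.8}}

def viterbi(obs):
    # V[t][s]: probability of the best state sequence ending in s at time t.
    V = [{s: initial[s] * emit[s][obs[0]] for s in states}]
    back = []                                   # back pointers, one dict per step
    for e in obs[1:]:
        row, ptr = {}, {}
        for s in states:
            best = max(states, key=lambda s2: V[-1][s2] * trans[s2][s])
            row[s] = V[-1][best] * trans[best][s] * emit[s][e]
            ptr[s] = best
        V.append(row)
        back.append(ptr)
    # Follow back pointers from the best final state.
    last = max(states, key=lambda s: V[-1][s])
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path))
```

A production version would work in log space for the same underflow reason as the forward algorithm.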

Page 23:

Estimating HMMs: Forward-Backward

• Given observed output (emission) sequences for an HMM, with the states hidden, the forward-backward algorithm (aka Baum-Welch) produces an estimate of the model parameters.

• Baum-Welch is a special case of EM (Expectation-Maximization), and requires iteration to find the best parameters.

• Convergence is generally quite fast (tens of iterations for speech HMMs).


Page 24:

Higher-Order Markov Models

• Sometimes a first-order history isn't enough

– Part-of-speech tagging

• Need to model P(x_t | x_{t-1}, x_{t-2})

• Claim: we can still use a first-order model – How?

• Solution: product states, where each state is a pair of adjacent original states

• Estimating the probabilities is a little more subtle
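A sketch of the product-state construction: pair-states (x_{t-1}, x_t) turn a second-order table into a first-order transition matrix. The uniform toy table below is invented for illustration:

```python
from itertools import product

tags = ["N", "V"]

# A toy second-order table P(x_t = c | x_{t-2} = a, x_{t-1} = b);
# uniform here just to keep the example small.
second_order = {(a, b, c): 0.5 for a, b, c in product(tags, repeat=3)}

def product_transition():
    # First-order transitions over pair-states: (a, b) -> (b, c) carries
    # the second-order probability; pairs whose middle tags disagree
    # ((a, b) -> (b', c) with b' != b) are impossible and get 0.
    trans = {}
    for (a, b), (b2, c) in product(product(tags, repeat=2), repeat=2):
        trans[((a, b), (b2, c))] = second_order[(a, b, c)] if b == b2 else 0.0
    return trans
```

Each row still sums to 1, so the forward and Viterbi machinery above applies unchanged to the pair-state chain.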


Page 25:

Outline

• Parts-of-Speech and POS taggers

• HMMs

• Named Entity Recognition

• Conditional Random Fields (CRFs)

Page 26:

Conditional Random Fields

• HMMs are generative models of sequences.

• Hidden states X

• Observations E

• Two kinds of parameters:

– Transitions: P(X_t | X_{t-1})

– Emissions: P(E_t | X_t)

[Diagram: hidden chain X1 → X2 → X3 → X4 with observations F1, F2, F3, F4]

Page 27:

Generative Models

• Generative models have both a prior term P(X) and a likelihood term P(E|X).

• By Bayes' rule:

– P(X|E) = P(X) P(E|X) / P(E)

• We always observe E...

• Why do we waste time modeling E?


Page 28:

Generative Models

• Forward algorithm


Page 29:

Discriminative Models

• Key idea: model P(X|E) directly!

– Consequence: P(E) and P(E|X) no longer (exactly) make sense.

• Use a linear score function, just like SVMs or regression: score(x, e) = w · f(x, e), for a weight vector w and feature function f.

• (Generic version:)

– If you have a dynamic program in a generative model with P(...|...), just replace the probabilities with score functions.


Page 30:

CRF intuition

Naïve Bayes → HMMs

Logistic Regression → CRFs

(An HMM is the sequence version of Naïve Bayes; a CRF is the sequence version of logistic regression, its discriminative counterpart.)


Page 31:

Conditional Random Fields

• Z(E) = Σ_x exp(score(x, E)) is called the "partition function" or normalizing constant.

– It's like P(E)
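For a tiny chain, the partition function can be computed by brute-force enumeration over all label sequences. The score function below is an invented toy, not the lecture's model:

```python
from itertools import product
import math

labels = ["O", "NAME"]

def score(xs, words):
    # Toy linear score: reward NAME on capitalized words (emission-style
    # feature) and reward adjacent labels agreeing (transition-style).
    s = 0.0
    for i, (x, w) in enumerate(zip(xs, words)):
        if x == "NAME" and w[:1].isupper():
            s += 2.0
        if i > 0 and xs[i - 1] == x:
            s += 0.5
    return s

def partition(words):
    # Z(E): sum of exp(score) over every possible label sequence.
    return sum(math.exp(score(xs, words))
               for xs in product(labels, repeat=len(words)))

def prob(xs, words):
    return math.exp(score(xs, words)) / partition(words)
```

Brute force is exponential in sequence length; in practice Z is computed with a forward-style dynamic program over scores.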

Page 32:

CRFs vs HMMs

• It is easy to map HMMs to CRFs. How?

– One feature per transition pair, with weight log P(X'|X)

– One feature per emission pair, with weight log P(E|X)


Page 33:

Named Entity Recognition

• People, places, (specific) things

– "Chez Panisse": Restaurant

– "Berkeley, CA": Place

– "Churchill-Brenneis Orchards": Company(?)

– Not "Page" or "Medjool": part of a non-rigid designator


Page 34:

Named Entity Recognition

• With an HMM, you have to estimate the probability P(E|X)

– But what's P(Panisse | Restaurant) if you've never seen "Panisse" before?

• Hard to estimate!


Page 35:

CRF features

• NER:

– Features to detect names:

• Word on the left is "Mr." or "Ms."

• Word two back is "Mr." or "Ms."

• Word is capitalized.

• Caution: If you have features over an “arbitrary” window size, complexity guarantees of forward algorithm go out the window!
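The three features listed above can be written directly as a feature-extraction function (a sketch; the function name and dictionary keys are invented):

```python
HONORIFICS = {"Mr.", "Ms."}

def name_features(tokens, i):
    # Binary features for token i, each mirroring one bullet above.
    w = tokens[i]
    return {
        "prev_is_honorific":  i >= 1 and tokens[i - 1] in HONORIFICS,
        "prev2_is_honorific": i >= 2 and tokens[i - 2] in HONORIFICS,
        "is_capitalized":     w[:1].isupper(),
    }
```

The window here spans only two tokens back, so the forward-algorithm complexity caveat above does not bite; features over wider, "arbitrary" windows would.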


Page 36:

CRFs vs HMMs

• Unsupervised learning is not possible – you need a generative model, or an MRF

• Slower to train

• Much easier to "overfit" – use regularization!


Page 37:

Training CRFs

• Needs gradient updates (SGD / L-BFGS)

• Optimize the log-likelihood. Its gradient per example is the observed feature vector minus the expected feature vector under the model:

∂L/∂w = f(x, e) − Σ_{x'} P(x'|e) f(x', e)

where the sum over x' runs over all candidate label sequences, weighting each sequence's features by the model's current distribution.

Page 38:

Training CRFs

• Needs gradient updates (SGD/LBFGS)

• Optimize log-likelihood



Page 39:

Training CRFs

• Interpretation: the derivative tries to minimize the difference between the true ("ground truth") features and the model's guessed (expected) features.
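The "true minus guessed features" interpretation can be checked by brute force on a tiny model. The two features and the label set below are invented for illustration:

```python
from itertools import product
import math

labels = ["O", "NAME"]

def feats(xs, words):
    # Two toy features: count of NAME on capitalized words, and count
    # of adjacent label pairs that agree.
    f1 = sum(1 for x, w in zip(xs, words) if x == "NAME" and w[:1].isupper())
    f2 = sum(1 for a, b in zip(xs, xs[1:]) if a == b)
    return (f1, f2)

def grad_loglik(true_xs, words, w):
    # Gradient of the CRF log-likelihood for one example:
    # ground-truth features minus model-expected features under weights w.
    def sc(xs):
        return sum(wi * fi for wi, fi in zip(w, feats(xs, words)))
    seqs = list(product(labels, repeat=len(words)))
    Z = sum(math.exp(sc(xs)) for xs in seqs)
    expected = [0.0, 0.0]
    for xs in seqs:
        p = math.exp(sc(xs)) / Z
        for j, fj in enumerate(feats(xs, words)):
            expected[j] += p * fj
    true_f = feats(true_xs, words)
    return tuple(t - e for t, e in zip(true_f, expected))
```

When the model's expectations already match the ground-truth counts, the gradient is zero and training has converged.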

Page 40:

Training CRFs

• The CRF negative log-likelihood is convex

• So any local minimum is a global minimum


Page 41:

CRF Viterbi

• HMM Viterbi: maximize a product of transition and emission probabilities

• CRF Viterbi: the same dynamic program with the log-probabilities replaced by feature scores (max over sums of scores)


Page 43:

Summary

• Parts-of-Speech and POS taggers

• HMMs

• Named Entity Recognition

• Conditional Random Fields (CRFs)