CS626-449: Speech, NLP and the Web/Topics in AI


CS626-449: Speech, NLP and the Web/Topics in AI

Pushpak Bhattacharyya, CSE Dept., IIT Bombay

Lecture-14: Probabilistic parsing; sequence labeling, PCFG

Recap: Two Views of NLP

1. Classical View: Layered Processing; Various Ambiguities (already discussed)

2. Statistical/Machine Learning View

What is the output of an ML-NLP System (1/2)

• Option 1: A set of rules, e.g.,
  – If the word to the left of the verb is a noun and has the animacy feature, then it is the likely agent of the action denoted by the verb.

• The child broke the toy (child is the agent)
• The window broke (window is not the agent; inanimate)

What is the output of an ML-NLP System (2/2)

• Option 2: A set of probability values, e.g.,
  – P(agent | word is to the left of verb and has animacy) > P(object | word is to the left of verb and has animacy) > P(instrument | word is to the left of verb and has animacy), etc.

How is this different from classical NLP?

[Diagram] Classical NLP: a linguist writes the rules. Statistical NLP: a computer learns rules/probabilities from a corpus of text data.

Classification appears as sequence labeling

A set of Sequence Labeling Tasks: smaller to larger units

• Words:
  – Part-of-Speech tagging
  – Named Entity tagging
  – Sense marking
• Phrases: Chunking
• Sentences: Parsing
• Paragraphs: Co-reference annotating

Example of word labeling: POS Tagging

<s> Come September, and the UJF campus is abuzz with new and returning students. </s>

<s> Come_VB September_NNP ,_, and_CC the_DT UJF_NNP campus_NN is_VBZ abuzz_JJ with_IN new_JJ and_CC returning_VBG students_NNS ._. </s>
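
The same tagging can be reproduced (approximately) with an off-the-shelf tagger; a minimal sketch using NLTK, assuming the 'averaged_perceptron_tagger' data has been downloaded. Its output may differ slightly from the slide's hand-assigned tags:

    import nltk  # assumes the 'averaged_perceptron_tagger' data is downloaded

    tokens = ("Come September , and the UJF campus is abuzz "
              "with new and returning students .").split()
    print(nltk.pos_tag(tokens))
    # e.g. [('Come', 'VB'), ('September', 'NNP'), (',', ','), ('and', 'CC'), ...]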

Example of word labeling: Named Entity Tagging

<month_name> September </month_name>

<org_name> UJF </org_name>
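
For comparison only (not the lecture's tool): NLTK ships a named-entity chunker, though it uses its own label set (PERSON, ORGANIZATION, GPE, ...) rather than the slide's <month_name>/<org_name> scheme. A sketch assuming the 'maxent_ne_chunker' and 'words' data are downloaded:

    import nltk  # assumes 'maxent_ne_chunker' and 'words' data are downloaded

    tokens = "Come September , and the UJF campus is abuzz .".split()
    print(nltk.ne_chunk(nltk.pos_tag(tokens)))
    # entities appear as labeled subtrees, e.g. (ORGANIZATION UJF/NNP)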

Example of word labeling: Sense Marking

Word     Synset                      WN-synset-no
come     {arrive, get, come}         01947900
...      ...                         ...
abuzz    {abuzz, buzzing, droning}   01859419
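
The synset ids can be looked up programmatically; a minimal sketch using NLTK's WordNet interface (assumes the 'wordnet' data is downloaded; synset offsets differ across WordNet versions, so they may not match the slide's numbers exactly):

    from nltk.corpus import wordnet as wn  # assumes the 'wordnet' data is downloaded

    for syn in wn.synsets("abuzz"):
        # offset() is the numeric synset id within this WordNet version
        print(f"{syn.offset():08d}", syn.lemma_names())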

Example of phrase labeling: Chunking

Come July, and [the UJF campus] is abuzz with [new and returning students]. (Brackets mark the identified chunks.)
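
The slides do not say how the chunks are obtained; as one illustration, a simple rule-based NP chunker can be written with NLTK's RegexpParser (the chunk rule here is an assumption, not the lecture's):

    import nltk

    grammar = "NP: {<DT>?<JJ>*<NN.*>+}"  # optional determiner, adjectives, then nouns
    chunker = nltk.RegexpParser(grammar)
    tagged = [("the", "DT"), ("UJF", "NNP"), ("campus", "NN")]
    print(chunker.parse(tagged))  # (S (NP the/DT UJF/NNP campus/NN))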

Example of Sentence labeling: Parsing

[S1 [S [S [VP [VB Come] [NP [NNP July]]]]
       [, ,]
       [CC and]
       [S [NP [DT the] [JJ UJF] [NN campus]]
          [VP [AUX is]
              [ADJP [JJ abuzz]
                    [PP [IN with]
                        [NP [ADJP [JJ new] [CC and] [VBG returning]]
                            [NNS students]]]]]]
       [. .]]]
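
Such a bracketing can be loaded programmatically; an illustrative sketch (assuming nltk is installed) using NLTK's Tree, which reads square-bracketed parses directly:

    from nltk import Tree

    parse = """
    [S1 [S [S [VP [VB Come] [NP [NNP July]]]]
           [, ,] [CC and]
           [S [NP [DT the] [JJ UJF] [NN campus]]
              [VP [AUX is]
                  [ADJP [JJ abuzz]
                        [PP [IN with]
                            [NP [ADJP [JJ new] [CC and] [VBG returning]]
                                [NNS students]]]]]]
           [. .]]]
    """
    tree = Tree.fromstring(parse, brackets="[]")
    print(tree.leaves())  # ['Come', 'July', ',', 'and', 'the', ...]
    tree.pretty_print()   # ASCII rendering of the tree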

Handling labeling through the Noisy Channel Model

w = (w_n, w_(n-1), …, w_1)  →  [Noisy Channel]  →  t = (t_m, t_(m-1), …, t_1)

The sequence w is transformed into the sequence t.

Bayesian Decision Theory and Noisy Channel Model are close to each other

• Bayes' Theorem: given random variables A and B,

    P(A|B) = P(A) · P(B|A) / P(B)

where P(A|B) is the posterior probability, P(A) the prior probability, and P(B|A) the likelihood.
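
A toy numeric check of the formula (made-up numbers, purely illustrative):

    # made-up numbers, purely to illustrate Bayes' theorem
    p_a = 0.3           # prior P(A)
    p_b_given_a = 0.8   # likelihood P(B|A)
    p_b = 0.5           # evidence P(B)
    print(p_a * p_b_given_a / p_b)  # posterior P(A|B) = 0.48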

Corpus

• A collection of text, called a corpus, is used for collecting various language data.

• With annotation: more information, but manual-labor intensive.

• Practice: label automatically, then correct manually.

• The famous Brown Corpus contains 1 million tagged words.

• Switchboard: a very famous corpus of 2,400 conversations, 543 speakers, many US dialects, annotated with orthography and phonetics.
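
The Brown Corpus ships with NLTK, so the 1-million-word figure can be checked directly; a minimal sketch (assuming the 'brown' corpus data has been downloaded):

    from nltk.corpus import brown  # assumes the 'brown' corpus data is downloaded

    print(len(brown.tagged_words()))  # ~1.16 million (word, tag) pairs
    print(brown.tagged_words()[:3])   # [('The', 'AT'), ('Fulton', 'NP-TL'), ('County', 'NN-TL')]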

Discriminative vs. Generative Model

W* = argmax_W P(W | SS)

• Discriminative model: compute P(W | SS) directly.

• Generative model: compute it as P(W) · P(SS | W), via Bayes' rule.
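
A toy illustration of the generative route, with hypothetical candidate strings and made-up probabilities (SS here is the observed speech signal):

    # made-up numbers: P(W) is the language-model prior, P(SS|W) the acoustic likelihood
    candidates = {
        "recognize speech":   (0.01,   0.50),
        "wreck a nice beach": (0.0001, 0.60),
    }
    w_star = max(candidates, key=lambda w: candidates[w][0] * candidates[w][1])
    print(w_star)  # 'recognize speech' (0.005 vs 0.00006)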

Notion of Language Models

Language Models

• N-grams: a sequence of n consecutive words/characters.

• Probabilistic / Stochastic Context-Free Grammars:
  – Simple probabilistic models capable of handling recursion
  – A CFG with probabilities attached to rules
  – Rule probabilities: how likely is it that a particular rewrite rule is used?
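
As a concrete illustration of the n-gram option, a minimal bigram language model with maximum-likelihood estimates and no smoothing (the training sentences are just the toy examples from earlier in the lecture):

    from collections import Counter, defaultdict

    def train_bigram_lm(sentences):
        """Estimate P(w_i | w_{i-1}) by maximum likelihood from raw bigram counts."""
        counts = defaultdict(Counter)
        for sent in sentences:
            tokens = ["<s>"] + sent.split() + ["</s>"]
            for prev, cur in zip(tokens, tokens[1:]):
                counts[prev][cur] += 1
        return {prev: {w: c / sum(ctr.values()) for w, c in ctr.items()}
                for prev, ctr in counts.items()}

    lm = train_bigram_lm(["the child broke the toy", "the window broke"])
    print(lm["the"])  # {'child': 0.333..., 'toy': 0.333..., 'window': 0.333...}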

PCFGs

• Why PCFGs?
  – Intuitive probabilistic models for tree-structured languages
  – Algorithms are extensions of HMM algorithms
  – Better than the n-gram model for language modeling

Formal Definition of PCFG

• A PCFG consists of:
  – A set of terminals {w_k}, k = 1, …, V
      e.g., {w_k} = {child, teddy, bear, played, …}
  – A set of non-terminals {N_i}, i = 1, …, n
      e.g., {N_i} = {NP, VP, DT, …}
  – A designated start symbol N_1
  – A set of rules {N_i → ζ_j}, where ζ_j is a sequence of terminals and non-terminals
      e.g., NP → DT NN
  – A corresponding set of rule probabilities

Rule Probabilities

Rule probabilities are such that, for every non-terminal N_i,

    ∀i: Σ_j P(N_i → ζ_j) = 1

E.g., P(NP → DT NN) = 0.2
      P(NP → NN)    = 0.5
      P(NP → NP PP) = 0.3

P(NP → DT NN) = 0.2 means that, in the training-data parses, 20% of NP expansions use the rule NP → DT NN.
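
These rule probabilities can be tried out directly; a sketch using NLTK's PCFG and Viterbi parser, where the NP probabilities are the slide's and the remaining rules and the tiny lexicon are made up to complete the grammar:

    from nltk import PCFG
    from nltk.parse import ViterbiParser

    # NP probabilities from the slide; all other rules/words are illustrative
    grammar = PCFG.fromstring("""
        S   -> NP VP   [1.0]
        NP  -> DT NN   [0.2]
        NP  -> NN      [0.5]
        NP  -> NP PP   [0.3]
        VP  -> VBD NP  [1.0]
        PP  -> IN NP   [1.0]
        DT  -> 'the'   [1.0]
        NN  -> 'child' [0.4] | 'toy' [0.4] | 'hammer' [0.2]
        VBD -> 'broke' [1.0]
        IN  -> 'with'  [1.0]
    """)

    parser = ViterbiParser(grammar)
    for tree in parser.parse("the child broke the toy with the hammer".split()):
        print(tree)         # the most probable parse
        print(tree.prob())  # its probability under the PCFG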