Resources - CSE, IIT Bombaycs626-449/cs626-460-2008/...(ambiguous in Marathi and Bengali too; not in...

3/1/2009

1

CS460/626 : Natural Language

Processing/Language Technology for the

Web(Lecture 1 – Introduction)

Pushpak BhattacharyyaCSE Dept.,

IIT Bombay

Persons involved

Faculty instructors: Dr. Pushpak Bhattacharyya (www.cse.iitb.ac.in/~pb) and Dr. Om Damani (www.cse.iitb.ac.in/~damani)

TAs: to be decided

Course home page (to be created)

www.cse.iitb.ac.in/~cs626-460-2009

Perpectivising NLP: Areas of AI and

their inter-dependencies

Search

Vision

PlanningMachine

Learning

Knowledge

RepresentationLogic

Expert

SystemsRoboticsNLP

Web brings new perspectives: QSA Triangle

Search Analystics

Query

Web 2.0 tasks

Business Intelligence on the Internet Platform

Opinion Mining

Reputation Management

Sentiment Analysis (some observations at the end)

NLP is thought to play a key role

Books etc.

Main Text(s):

Natural Language Understanding: James Allan

Speech and NLP: Jurafsky and Martin

Foundations of Statistical NLP: Manning and Schutze

Other References:

NLP a Paninian Perspective: Bharati, Cahitanya and Sangal

Statistical NLP: Charniak

Journals

Computational Linguistics, Natural Language Engineering, AI, AI Magazine, IEEE SMC

Conferences

ACL, EACL, COLING, MT Summit, EMNLP, IJCNLP, HLT, ICON, SIGIR, WWW, ICML, ECML

http://www.cse.iitb.ac.in/~pb

3/1/2009

2

Allied DisciplinesPhilosophy Semantics, Meaning of “meaning”, Logic

(syllogism)

Linguistics Study of Syntax, Lexicon, Lexical Semantics etc.

Probability and Statistics Corpus Linguistics, Testing of Hypotheses,

System Evaluation

Cognitive Science Computational Models of Language Processing,

Language Acquisition

Psychology Behavioristic insights into Language Processing,

Psychological Models

Brain Science Language Processing Areas in Brain

Physics Information Theory, Entropy, Random Fields

Computer Sc. & Engg. Systems for NLP

Topics proposed to be covered

Shallow Processing Part of Speech Tagging and Chunking using HMM, MEMM, CRF, and

Rule Based Systems

EM Algorithm

Language Modeling N-grams

Probabilistic CFGs

Basic Linguistics Morphemes and Morphological Processing

Parse Trees and Syntactic Processing: Constituent Parsing and Dependency Parsing

Deep Parsing Classical Approaches: Top-Down, Bottom-UP and Hybrid Methods

Chart Parsing, Earley Parsing

Statistical Approach: Probabilistic Parsing, Tree Bank Corpora

Topics proposed to be covered (contd.)

Knowledge Representation and NLP

Predicate Calculus, Semantic Net, Frames, Conceptual Dependency, Universal Networking Language (UNL)

Lexical Semantics

Lexicons, Lexical Networks and Ontology

Word Sense Disambiguation

Applications

Machine Translation

IR

Summarization

Question Answering

Grading

Based on

Midsem

Endsem

Assignments

Seminar

Except the first two everything else in groups of 4. Weightages will be revealed soon.

Definitions etc.

What is NLP

Branch of AI

2 Goals

Science Goal: Understand the way language operates

Engineering Goal: Build systems that analyse and generate language; reduce the man machine gap

3/1/2009

3

The famous Turing Test: Language Based Interaction

Machine

Human

Test conductor

Can the test conductor find out which is the machine and which the human

Inspired Eliza

http://www.manifestation.com/neurotoys/eliza.php3

Inspired Eliza (another sample

interaction)

A Sample of Interaction:

“What is it” question: NLP is concerned with Grounding

Ground the language into perceptual, motor and cognitive capacities.

Grounding

Chair

Computer

Two Views of NLP and the Associated Challenges

1. Classical View

2. Statistical/Machine Learning View

3/1/2009

4

Stages of processing

Phonetics and phonology

Morphology

Lexical Analysis

Syntactic Analysis

Semantic Analysis

Pragmatics

Discourse

Phonetics

Processing of speech

Challenges

Homophones: bank (finance) vs. bank (river

bank)

Near Homophones: maatraa vs. maatra (hin)

Word Boundary

aajaayenge (aa jaayenge (will come) or aaj aayenge (will come today)

I got [ua]plate

Phrase boundary

mtech1 students are especially exhorted to attend as such seminars are integral to one's post-graduate education

Disfluency: ah, um, ahem etc.

Morphology

Word formation rules from root words

Nouns: Plural (boy-boys); Gender marking (czar-czarina)

Verbs: Tense (stretch-stretched); Aspect (e.g. perfective sit-had sat); Modality (e.g. request khaanaa khaaiie)

First crucial first step in NLP

Languages rich in morphology: e.g., Dravidian, Hungarian,

Turkish

Languages poor in morphology: Chinese, English

Languages with rich morphology have the advantage of easier

processing at higher stages of processing

A task of interest to computer science: Finite State Machines for Word Morphology

Lexical Analysis

Essentially refers to dictionary access and obtaining the properties of the word

e.g. dognoun (lexical property)take-‟s‟-in-plural (morph property)animate (semantic property)4-legged (-do-)carnivore (-do)

Challenge: Lexical or word sense disambiguation

Lexical Disambiguation

First step: part of Speech Disambiguation Dog as a noun (animal)

Dog as a verb (to pursue)

Sense Disambiguation Dog (as animal)

Dog (as a very detestable person)

Needs word relationships in a context The chair emphasised the need for adult education

Very common in day to day communications

Satellite Channel Ad: Watch what you want, when you want (two senses of watch)

e.g., Ground breaking ceremony/research

Technological developments bring in new terms, additional meanings/nuances for existing terms

Justify as in justify the right margin (word processing context)

Xeroxed: a new verb

Digital Trace: a new expression

Communifaking: pretending to talk on mobile when you are actually not

Discomgooglation: anxiety/discomfort at not being able to access internet

Helicopter Parenting: over parenting

3/1/2009

5

Syntax Processing Stage

Structure Detection

S

NPVP

V NP

Ilike

mangoes

Parsing Strategy

Driven by grammar S-> NP VP

NP-> N | PRON

VP-> V NP | V PP

N-> Mangoes

PRON-> I

V-> like

Challenges in Syntactic Processing: Structural Ambiguity

Scope1.The old men and women were taken to safe locations(old men and women) vs. ((old men) and women)2. No smoking areas will allow Hookas inside

Preposition Phrase Attachment I saw the boy with a telescope

(who has the telescope?) I saw the mountain with a telescope

(world knowledge: mountain cannot be an instrument of seeing)

I saw the boy with the pony-tail(world knowledge: pony-tail cannot be an instrument of seeing)

Very ubiquitous: newspaper headline “20 years later, BMC pays father 20 lakhs for causing son‟s death”

Structural Ambiguity…

Overheard I did not know my PDA had a phone for 3 months

An actual sentence in the newspaper The camera man shot the man with the gun when he was

near Tendulkar

(P.G. Wodehouse, Ring in Jeeves) Jill had rubbed ointment on Mike the Irish Terrier, taken a look at the goldfish belonging to the cook, which had caused anxiety in the kitchen by refusing its ant‟s eggs…

(Times of India, 26/2/08) Aid for kins of cops killed in terrorist attacks

Headache for Parsing: Garden Path sentences

Garden Pathing

The horse raced past the garden fell.

The old man the boat.

Twin Bomb Strike in Baghdad kill 25(Times of India 05/09/07)

Semantic Analysis

Representation in terms of

Predicate calculus/Semantic Nets/Frames/Conceptual Dependencies and Scripts

John gave a book to Mary

Give action: Agent: John, Object: Book, Recipient: Mary

Challenge: ambiguity in semantic role labeling

(Eng) Visiting aunts can be a nuisance

(Hin) aapko mujhe mithaai khilaanii padegii (ambiguous in Marathi and Bengali too; not in Dravidian languages)

3/1/2009

6

Pragmatics

Very hard problem

Model user intention

Tourist (in a hurry, checking out of the hotel, motioning to the service boy): Boy, go upstairs and see if my sandals are under the divan. Do not be late. I just have 15 minutes to catch the train.

Boy (running upstairs and coming back panting): yes sir, they are there.

World knowledge

WHY INDIA NEEDS A SECOND OCTOBER (ToI, 2/10/07)

Discourse

Processing of sequence of sentences Mother to John:

John go to school. It is open today. Should you bunk? Father will be very angry.

Ambiguity of openbunk what?Why will the father be angry?

Complex chain of reasoning and application of world knowledge Ambiguity of father

father as parent or

father as headmaster

Complexity of Connected Text

John was returning from school dejected – today was the math test

He couldn’t control the class

Teacher shouldn’t have made him

responsible

After all he is just a janitor

Giving a flavour of what is done: Structure Disambiguation

Scope, Clause and Preposition/Postpositon

Structure Disambiguation is as critical as Sense Disambiguation

Scope (portion of text in the scope of a modifier) Old men and women will be taken to safe locations No smoking areas allow hookas inside

Clause I told the child that I liked that he came to the

game on time Preposition

I saw the boy with a telescope

Structure Disambiguation is as critical as Sense Disambiguation (contd.)

Semantic role Visiting aunts can be a nuisance Mujhe aapko mithaai khilaani padegii (“I have to give you sweets”

or “You have to give me sweets”)

Postposition unhone teji se bhaaagte hue chor ko pakad liyaa (“he caught the

thief that was running fast” or “he ran fast and caught the thief”)

All these ambiguities lead to the construction multiple parse trees for each sentence and need semantic, pragmatic and discourse cues for disambiguation

3/1/2009

7

Higher level knowledge needed for disambiguation

Semantics I saw the boy with a pony tail (pony tail cannot be

an instrument of seeing) Pragmatics

((old men) and women) as opposed to (old men and women) in “Old men and women were taken to safe location”, since women- both and young and old- were very likely taken to safe locations

Discourse: No smoking areas allow hookas inside, except the

one in Hotel Grand. No smoking areas allow hookas inside, but not

cigars.

Preposition

Problem definition

4-tuples of the form V N1 P N2

saw (V) boys (N1) with (P) telescopes (N2)

Attachment choice is between the matrix verb V and the object noun N

Lexical Association Table (Hindle and Rooth, 1991 and 1993)

From a large corpus of parsed text

first find all noun phrase heads

then record the verb (if any) that precedes

the head

and the preposition (if any) that follows it

as well as some other syntactic information about the sentence.

Extract attachment information from this table of co-occurrences

Example: lexical association

A table entry is considered a definite instance of the prepositional phrase attaching to the verb if:

the verb definitely licenses the prepositional phrase

E.g. from Propbank,

absolve

frames

absolve.XX: NP-ARG0 NP-ARG2-of obj-ARG1 1

absolve.XX NP-ARG0 NP-ARG2-of obj-ARG1

On Friday , the firms filed a suit *ICH*-1 against West Virginia in New York state court asking for [ARG0 a declaratory judgment] [rel absolving] [ARG1

them] of [ARG2-of liability] .

Core steps

Seven different procedures for deciding whether a table entry is an instance of no attachment, sure noun attach, sure verb attach, or ambiguous attach

able to extract frequency information, counting the number of times a particular verb or noun attaches with a particular preposition

http://www.cis.upenn.edu/~cotton/cgi-bin/pblex_fmt.cgi?lemma=absolve

http://www.cs.rochester.edu/~gildea/PropBank/Sort/A/absolve.html









3/1/2009

8

Core steps (contd.)

These frequencies serve as the training data for the statistical model used to predict correct attachment

To disambiguate a sentence, compute the likelihood of the particular preposition given the particular verb and contrast with the likelihood of the preposition given the particular noun i.e., compare P(with|saw) with P(with|telescope)

as in I saw the boy with a telescope

Critique

Limited by the number of relationships in the training corpora

Too large a parameter space

Model acquired during training is represented in a huge table of probabilities, precluding any straightforward analysis of its workings

Approach based on Transformation Based Error Driven Learning, Brill and Resnick, COLING 1994 Example Transformations

Initial attach-ments by default

are to N1 pre-dominantly.

Transformation rules with word classes

Wordnet synsetsandSemantic classes used

Accuracy values of the transformation based approach: 12000 training and 500 test examples

Method Accuracy #of transformation rules

Hindle and Rooth

(baseline)

70.4 to 75.8% NA

Transformations 79.2 418

Transformations

(word classes)

81.8 266

3/1/2009

9

Maximum Entropy Based Approach: (Ratnaparki, Reyner, Roukos, 1994)

Use more features than (V N1) bigram and (N1 P) bigram

Apply Maximum Entropy Principle

Core formulation

We denote

the partially parsed verb phrase, i.e., the verb phrase without the attachment decision, as a history h, and

the conditional probability of an attachment as P(d|h),

where d and corresponds to a noun or verb attachment- 0 or 1- respectively.

Maximize the training data log likelihood

--(1)

--(2)

Equating the model expected parameters and training data

parameters

--(3)

--(4)

Features

Two types of binary-valued questions:

Questions about the presence of any n-gram of the four head words, e.g., a bigram maybe V == „„is‟‟, P == „„of‟‟

Features comprised solely of questions on words are denoted as “word” features

Features (contd.)

Questions that involve the class membership of a head word

Binary hierarchy of classes derived by mutual information

3/1/2009

10

Features (contd.)

Given a binary class hierarchy, we can associate a bit string with every word

in the vocabulary

Then, by querying the value of certain bit positions we can construct

binary questions.

Features comprised solely of questions about class bits are denoted as “class” features, and features containing questions about both class bits and words are denoted as “mixed” features.

Word classes (Brown et. al. 1992)

Experimental data size

Performance of ME Model on Test Events

Examples of Features Chosen for Wall St. Journal Data Average Performance of Human & ME Model on

300 Events of WSJ Data

3/1/2009

11

Human and ME model performance on consensus set for WSJ

Average Performance of Human & ME Model on200 Events of Computer Manuals Data

Back-off model based approach (Collins and Brooks, 1995)

NP-attach: (joined ((the board) (as a non executive director)))

VP-attach: ((joined (the board)) (as a non executive director))

Correspondingly, NP-attach:

1 joined board as director

VP-attach: 0 joined board as director

Quintuple of (attachment: A: 0/1, V, N1, P, N2) 5 random variables

Probabilistic formulation

Or briefly,

If

Then the attachment is to the noun, else to the verb

Maximum Likelihood estimate The Back-off estimateo Inspired by speech recognition

o Prediction of the Nth word from previous (N-1) words

Data sparsity problem

f(w1, w2, w3,…wn) will frequently be 0

for large values on n

3/1/2009

12

Back-off estimate contd.

The cut off frequencies (c1, c2 ....) are thresholds determining whether to back-off or not at each level-

counts lower than ci at stage i are deemed to be too low to give an accurate estimate, so in this case

backing-off continues.

Back off for PPT attachment

Note: the back off tuples always retain the preposition

The backoff algorithm

Lower and upper bounds on performance

Lower bound

(most frequent) Upper bound

(human experts

Looking at 4 word

only)

ResultsComparison with other systems

Maxent,

Ratnaparkhi et. al.

Transformation

Learning,

Brill et. al.

3/1/2009

13

Flexible Unsupervised PP Attachment using WSD and Data Sparsity Reduction: (Medimi Srinivas and Pushpak Bhattacharyya, IJCAI 2007)

Unsupervised approach (some way similar to Ratnaparkhi 1998): The training data is extracted from raw text

The unambiguous training data of the form V-P-N and N1-P-N2 TEACH the system how to resolve PP-attachment in ambiguous test data V-N1-P-N2

Refinement of extracted training data. And use of N2 in PP-attachment resolution process.

Flexible Unsupervised PP Attachment using WSD and Data Sparsity Reduction: (Medimi Srinivas and Pushpak Bhattacharyya, IJCAI 2007)

PP-attachment is determined by the semantic property of lexical items in the context of preposition using WordNet

An Iterative Graph based unsupervised approach is used for Word Sense disambiguation (Similar to Mihalcea 2005)

Use of a Data sparseness Reduction (DSR) Process which uses lemmatization, Synset replacement and a form of inferencing. DSRP uses WordNet.

Flexible use of WSD and DSR processes for PP-Attachment

Graph based disambiguation: page rank based algorithm,

Mihalcea 2005 Experimental setup

Training Data:

Brown corpus (raw text). Corpus size is 6 MB, consists of 51763 sentences, nearly 1 million 27 thousand words.

Most frequent Prepositions in the syntactic context N1-P-N2: of, in, for, to, with, on, at, from, by

Most frequent Prepositions in the syntactic context V-P-N: in, to, by, with, on, for, from, at, of

The Extracted unambiguous N1-P-N2: 54030 and V-P-N: 22362

Test Data:

Penn Treebank Wall Street Journal (WSJ) data extracted by Ratnaparkhi

It consists of V-N1-P-N2 tuples: 20801(training), 4039(development) and 3097(Test)

Experimental setup contd.

BaseLine: The unsupervised approach by Ratnaparkhi, 1998

(Base-RP).

Preprocessing: Upper case to lower case Any four digit number less than 2100 as a year

Any other number or % signs are converted to num

Experiments are performed using DSRP: with different stages of DSRP

Experiments are performed using GuWSD and DSRP: with different senses

The process of extracting training data: Data Sparsity Reduction

Tools/process Output

Raw Text The professional conduct of the doctors is guided by Indian Medical Association.

POS Tagger The_DT professional_JJ conduct_NN of_IN the_DT doctors_NNS is_VBZ guided_VBN by_ IN Indian_NNP Medical_NNP Association_NNP._.

Chunker [The_DT professional_JJ conduct_NN ] of_IN [the_DT doctors_NNS ] (is_VBZ guided_VBN) by_IN [Indian_NNP Medical_NNP Association_NNP].

After replacing each chunk by its head word it results in:

conduct_NN of_IN doctors_NNS guided_VBN by_IN Association_NNP

Extraction Heuristics

N1PN2: conduct of doctors and VPN: guided by Association

Morphing N1PN2: conduct of doctor and VPN: guide by association

DSRP (Synset Replacement)

N1PN2: {conduct, behavior} of {doctor, physician} can result in 4 combination with the same sense and similarly for VPN: {guide, direct} by {association} can result in 2 combinations with the same sense.

3/1/2009

14

Data Sparsity Reduction: Inferencing

If

V1-P-N1 and V2-P-N1 exist as also do V1-P- N2

and V2-P-N2, then

if

V3-P-Ni exist (i=1,2), then

we can infer the existence of V3-P-NJ

(i ≠ j) with a frequency count of V3-P-Ni that can be added to the corpus.

Example of DSR by inferencing

V1-P-N1: play in garden and V2-P-N1: sit in garden

V1-P-N2: play in house and V2-P-N2: sit in house

V3-P-N2: jump in house exists

Infer the existence of V3-P-N1: jump in garden

ResultsEffect of various processes on FlexPPAttach algorithm

Precision vs. various processes

Is NLP Really Needed

3/1/2009

15

Post-1

POST----5 TITLE: "Wants to invest in IPO? Think again" | <br /><br />Hereâ€™s a sobering thought for those who believe in investing in IPOs. Listing gains â€” the return on the IPO scrip at the close of listing day over the allotment price â€” have been falling substantially in the past two years. Average listing gains have fallen from 38% in 2005 to as low as 2% in the first half of 2007.Of the 159 book-built initial public offerings (IPOs) in India between 2000 and 2007, two-thirds saw listing gains. However, these gains have eroded sharply in recent years.Experts say this trend can be attributed to the aggressive pricing strategy that investment bankers adopt before an IPO. â€&oelig;While the drop in average listing gains is not a good sign, it could be due to the fact that IPO issue managers are getting aggressive with pricing of the issues,â€ says Anand Rathi, chief economist, Sujan Hajra.While the listing gain was 38% in 2005 over 34 issues, it fell to 30% in 2006 over 61 issues and to 2% in 2007 till mid-April over 34 issues. The overall listing gain for 159 issues listed since 2000 has been 23%, according to an analysis by Anand Rathi Securities.Aggressive pricing means the scrip has often been priced at the high end of the pricing range, which would restrict the upward movement of the stock, leading to reduced listing gains for the investor. It also tends to suggest investors should not indiscriminately pump in money into IPOs.But some market experts point out that India fares better than other countries. â€&oelig;Internationally, there have been periods of negative returns and low positive returns in India should not be considered a bad thing.

Post-2

POST----7TITLE: "[IIM-Jobs] ***** Bank: International Projects Group -Manager"| <br />Please send your CV & cover letter to anup.abraham@*****bank.com ***** Bank, through its International Banking Group (IBG), is expanding beyond the Indian market with an intent to become a significant player in the global marketplace. The exciting growth in the overseas markets is driven not only by India linked opportunities, but also by opportunities of impact that we see as a local player in these overseas markets and / or as a bank with global footprint. IBG comprises of Retail banking, Corporate banking & Treasury in 17 overseas markets we are present in. Technology is seen as key part of the business strategy, and critical to business innovation & capability scale up. The International Projects Group in IBG takes ownership of defining & delivering business critical IT projects, and directly impact business growth. Role: Manager Â– International Projects Group Purpose of the role: Define IT initiatives and manage IT projects to achieve business goals. The project domain will be retail, corporate & treasury. The incumbent will work with teams across functions (including internal technology teams & IT vendors for development/implementation) and locations to deliver significant & measurable impact to the business. Location: Mumbai (Short travel to overseas locations may be needed) Key Deliverables: Conceptualize IT initiatives, define business requirements

Sentiment Classification

Positive, negative, neutral – 3 class

Sports, economics, literature - multi class

Create a representation for the document

Classify the representation

The most popular way of representing a document is feature vector (indicator sequence).

Established Techniques

Naïve Bayes Classifier (NBC)

Support Vector Machines (SVM)

Neural Networks

K nearest neighbor classifier

Latent Semantic Indexing

Decision Tree ID3

Concept based indexing

Successful Approaches

The following are successful approaches as reported in literature.

NBC – simple to understand and implement

SVM – complex, requires foundations of perceptions

P(C+|D) > P(C-|D)

Mathematical Setting

We have training set

A: Positive Sentiment Docs

B: Negative Sentiment Docs

Let the class of positive and negative documents be C+ and C- , respectively.

Given a new document D label it positive if

Indicator/feature vectors to be formed

3/1/2009

16

Priori Probability

Document

Vector Classification

D1 V1 +

D2 V2 -

D3 V3 +

.. .. ..

D4000 V4000 -

Let T = Total no of documentsAnd let |+| = MSo,|-| = T-M

Priori probability is calculated without considering any features of the new document.

P(D being positive)=M/T

Apply Bayes Theorem

Steps followed for the NBC algorithm:

Calculate Prior Probability of the classes. P(C+ ) and P(C-)

Calculate feature probabilities of new document. P(D| C+ ) and P(D| C-)

Probability of a document D belonging to a class C can be

calculated by Baye‟s Theorem as follows:

P(C|D) = P(C) * P(D|C)P(D)

• Document belongs to C+ , if

P(C+ ) * P(D|C+) > P(C- ) * P(D|C-

)

Calculating P(D|C+)

P(D|C+) is the probability of class C+ given D. This is calculated as

follows:

Identify a set of features/indicators to evaluate a document and

generate a feature vector (VD). VD = <x1 , x2 , x3 … xn >

Hence, P(D|C+) = P(VD|C+)

= P( <x1 , x2 , x3 … xn > | C+)

= |<x1,x2,x3…..xn>, C+ |

| C+ |

Based on the assumption that all features are Independently

Identically Distributed (IID)

= P( <x1 , x2 , x3 … xn > | C+ )

= P(x1 |C+) * P(x2 |C+) * P(x3 |C+) *…. P(xn |C+)

=∏ i=1n P(xi |C+)

P(xi |C+) can now be calculated as |xi |

|C |

Baseline Accuracy

Just on Tokens as features, 80% accuracy

20% probability of a document being misclassified

On large sets this is significant

To improve accuracy…

Clean corpora

POS tag

Concentrate on critical POS tags (e.g. adjective)

Remove „objective‟ sentences ('of' ones)

Do aggregation

Use minimal to sophisticated NLP

Resources - CSE, IIT Bombaycs626-449/cs626-460-2008/...(ambiguous in Marathi and Bengali too; not in...

Documents

Transcript of Resources - CSE, IIT Bombaycs626-449/cs626-460-2008/...(ambiguous in Marathi and Bengali too; not in...