
Page 1: IR and NLP Jimmy Lin College of Information Studies Institute for Advanced Computer Studies University of Maryland Wednesday, March 15, 2006.

IR and NLP

Jimmy Lin
College of Information Studies
Institute for Advanced Computer Studies
University of Maryland

Wednesday, March 15, 2006

Page 2: IR and NLP Jimmy Lin College of Information Studies Institute for Advanced Computer Studies University of Maryland Wednesday, March 15, 2006.

On the Menu

Overview of information retrieval

Evaluation

Three IR models Boolean Vector space Language modeling

NLP for IR

Page 3: IR and NLP Jimmy Lin College of Information Studies Institute for Advanced Computer Studies University of Maryland Wednesday, March 15, 2006.

Types of Information Needs

Retrospective “Searching the past” Different queries posed against a static collection Time invariant

Prospective “Searching the future” Static query posed against a dynamic collection Time dependent

Page 4: IR and NLP Jimmy Lin College of Information Studies Institute for Advanced Computer Studies University of Maryland Wednesday, March 15, 2006.

Retrospective Searches (I)

Ad hoc retrieval: find documents “about this”

Known item search

Directed exploration

Identify positive accomplishments of the Hubble telescope since it was launched in 1991.

Compile a list of mammals that are considered to be endangered, identify their habitat and, if possible, specify what threatens them.

Find Jimmy Lin’s homepage.

What’s the ISBN number of “Modern Information Retrieval”?

Who makes the best chocolates?

What video conferencing systems exist for digital reference desk services?

Page 5: IR and NLP Jimmy Lin College of Information Studies Institute for Advanced Computer Studies University of Maryland Wednesday, March 15, 2006.

Retrospective Searches (II)

Question answering

"Factoid": Who discovered oxygen? When did Hawaii become a state? Where is Ayers Rock located? What team won the World Series in 1992?

"List": What countries export oil? Name U.S. cities that have a "Shubert" theater.

"Definition": Who is Aaron Copland? What is a quasar?

Page 6: IR and NLP Jimmy Lin College of Information Studies Institute for Advanced Computer Studies University of Maryland Wednesday, March 15, 2006.

Prospective “Searches”

Filtering Make a binary decision about each incoming document

Routing Sort incoming documents into different bins?

Spam or not spam?

Categorize news headlines: World? Nation? Metro? Sports?

Page 7: IR and NLP Jimmy Lin College of Information Studies Institute for Advanced Computer Studies University of Maryland Wednesday, March 15, 2006.

The Information Retrieval Cycle

[Diagram: the IR cycle. The user selects a source (resource), formulates a query, and searches; the system returns a ranked list, from which documents are selected, examined, and delivered. Feedback loops support query reformulation, vocabulary learning, relevance feedback, and source reselection.]

Page 8: IR and NLP Jimmy Lin College of Information Studies Institute for Advanced Computer Studies University of Maryland Wednesday, March 15, 2006.

Supporting the Search Process

[Diagram: the same IR cycle, supported behind the scenes by acquisition, which builds the collection, and indexing, which builds the index that the search step runs against.]

Page 9: IR and NLP Jimmy Lin College of Information Studies Institute for Advanced Computer Studies University of Maryland Wednesday, March 15, 2006.

Evaluation

Page 10: IR and NLP Jimmy Lin College of Information Studies Institute for Advanced Computer Studies University of Maryland Wednesday, March 15, 2006.

IR is an experimental science!

Formulate a research question: the hypothesis Questions about the system Questions about the system + user

Design an experiment to answer the question

Perform the experiment Compare with a baseline

Does the experiment answer the question? Are the results significant? Or is it just luck?

Report the results!

Rinse, repeat…

Page 11: IR and NLP Jimmy Lin College of Information Studies Institute for Advanced Computer Studies University of Maryland Wednesday, March 15, 2006.

The Importance of Evaluation

The ability to measure differences underlies experimental science How well do our systems work? Is A better than B? Is it really? Under what conditions?

Evaluation drives what to research Identify techniques that work and don’t work Build on techniques that work

Page 12: IR and NLP Jimmy Lin College of Information Studies Institute for Advanced Computer Studies University of Maryland Wednesday, March 15, 2006.

Evaluating the Black Box

[Diagram: the search engine as a black box that takes a query and returns a ranked list.]

Page 13: IR and NLP Jimmy Lin College of Information Studies Institute for Advanced Computer Studies University of Maryland Wednesday, March 15, 2006.

Automatic Evaluation Model

[Diagram: queries and documents go into the IR black box, which produces a ranked list; an evaluation module compares the ranked list against relevance judgments to produce a measure of effectiveness.]

These are the four things we need: documents, queries, relevance judgments, and a measure of effectiveness!

Page 14: IR and NLP Jimmy Lin College of Information Studies Institute for Advanced Computer Studies University of Maryland Wednesday, March 15, 2006.

Test Collections

Reusable test collections consist of:

Collection of documents
• Should be "representative"
• Things to consider: size, sources, genre, topics, …

Sample of information needs
• Should be "randomized" and "representative"
• Usually formalized topic statements

Known relevance judgments
• Assessed by humans, for each topic-document pair (topic, not query!)
• Binary judgments make evaluation easier

Measure of effectiveness
• Usually a numeric score for quantifying "performance"
• Used to compare different systems

Page 15: IR and NLP Jimmy Lin College of Information Studies Institute for Advanced Computer Studies University of Maryland Wednesday, March 15, 2006.

Which is the Best Rank Order?

[Figure: six candidate ranked lists, A through F, with the relevant documents marked; which ordering of relevant and non-relevant documents is best?]

Page 16: IR and NLP Jimmy Lin College of Information Studies Institute for Advanced Computer Studies University of Maryland Wednesday, March 15, 2006.

Set-Based Measures

Precision = A ÷ (A+B)
Recall = A ÷ (A+C)
Miss = C ÷ (A+C)
False alarm (fallout) = B ÷ (B+D)

                Relevant    Not relevant
Retrieved          A             B
Not retrieved      C             D

Collection size = A+B+C+D
Relevant = A+C
Retrieved = A+B

When is precision important? When is recall important?
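As a concrete illustration of these set-based measures, here is a minimal sketch in Python (the function name and the example sets are illustrative, not from the slides) that computes them from the sets of retrieved and relevant document IDs:

```python
def set_measures(retrieved, relevant, collection_size):
    """Compute set-based effectiveness measures from document-ID sets."""
    retrieved, relevant = set(retrieved), set(relevant)
    a = len(retrieved & relevant)        # relevant and retrieved
    b = len(retrieved - relevant)        # retrieved but not relevant
    c = len(relevant - retrieved)        # relevant but not retrieved
    d = collection_size - a - b - c      # neither retrieved nor relevant
    return {
        "precision": a / (a + b) if (a + b) else 0.0,
        "recall":    a / (a + c) if (a + c) else 0.0,
        "miss":      c / (a + c) if (a + c) else 0.0,
        "fallout":   b / (b + d) if (b + d) else 0.0,
    }

# 3 of the 5 retrieved documents are relevant; 4 documents are relevant overall.
print(set_measures({1, 2, 3, 7, 9}, {2, 3, 9, 15}, collection_size=20))
```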

Page 17: IR and NLP Jimmy Lin College of Information Studies Institute for Advanced Computer Studies University of Maryland Wednesday, March 15, 2006.

Another View

[Diagram: Venn view of the space of all documents, showing the set of relevant documents, the set of retrieved documents, their overlap (relevant and retrieved), and everything outside both (not relevant and not retrieved).]

Page 18: IR and NLP Jimmy Lin College of Information Studies Institute for Advanced Computer Studies University of Maryland Wednesday, March 15, 2006.

ROC Curves

[Plot: precision (y-axis) against recall (x-axis), both ranging from 0 to 1.]

Adapted from a presentation by Ellen Voorhees at the University of Maryland, March 29, 1999

Page 19: IR and NLP Jimmy Lin College of Information Studies Institute for Advanced Computer Studies University of Maryland Wednesday, March 15, 2006.

Building Test Collections

Where do test collections come from? Someone goes out and builds them (expensive) As the byproduct of large scale evaluations

TREC = Text REtrieval Conferences Sponsored by NIST Series of annual evaluations, started in 1992 Organized into “tracks” Larger tracks may draw a few dozen participants

See proceedings online at http://trec.nist.gov/

Page 20: IR and NLP Jimmy Lin College of Information Studies Institute for Advanced Computer Studies University of Maryland Wednesday, March 15, 2006.

Ad Hoc Topics

In TREC, a statement of information need is called a topic

Title: Health and Computer Terminals

Description: Is it hazardous to the health of individuals to work with computer terminals on a daily basis?

Narrative: Relevant documents would contain any information that expands on any physical disorder/problems that may be associated with the daily working with computer terminals. Such things as carpal tunnel, cataracts, and fatigue have been said to be associated, but how widespread are these or other problems and what is being done to alleviate any health problems.

Page 21: IR and NLP Jimmy Lin College of Information Studies Institute for Advanced Computer Studies University of Maryland Wednesday, March 15, 2006.

Obtaining Judgments

Exhaustive assessment is usually impractical TREC has 50 queries Collection has >1 million documents

Random sampling won’t work If relevant docs are rare, none may be found!

IR systems can help focus the sample Each system finds some relevant documents Different systems find different relevant documents Together, enough systems will find most of them Leverages cooperative evaluations

Page 22: IR and NLP Jimmy Lin College of Information Studies Institute for Advanced Computer Studies University of Maryland Wednesday, March 15, 2006.

Pooling Methodology

Systems submit top 1000 documents per topic

Top 100 documents from each are judged Single pool, duplicates removed, arbitrary order Judged by the person who developed the topic

Treat unevaluated documents as not relevant

Evaluate down to 1000 documents

To make pooling work: systems must do reasonably well, and systems must not all "do the same thing"

Gather topics and relevance judgments to create a reusable test collection
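A minimal sketch of the pooling step just described (run names, depth, and the tiny ranked lists are illustrative): take the top documents from each submitted run for a topic, merge them into a single de-duplicated pool, and shuffle so the assessor sees them in arbitrary order. Anything left outside the pool is treated as not relevant when runs are later scored down to rank 1000.

```python
import random

def build_pool(runs, depth=100, seed=0):
    """runs: dict mapping run name -> ranked list of doc IDs for one topic."""
    pool = set()
    for ranking in runs.values():
        pool.update(ranking[:depth])       # top `depth` documents from each run
    pool = sorted(pool)
    random.Random(seed).shuffle(pool)      # arbitrary order for the assessor
    return pool

runs = {
    "run_A": ["d3", "d7", "d1", "d9"],
    "run_B": ["d7", "d2", "d3", "d8"],
}
print(build_pool(runs, depth=3))           # union of the top 3 from each run
```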

Page 23: IR and NLP Jimmy Lin College of Information Studies Institute for Advanced Computer Studies University of Maryland Wednesday, March 15, 2006.

Retrieval Models

Boolean

Vector space

Language Modeling

Page 24: IR and NLP Jimmy Lin College of Information Studies Institute for Advanced Computer Studies University of Maryland Wednesday, March 15, 2006.

What is a model?

A model is a construct designed to help us understand a complex system: a particular way of "looking at things"

Models inevitably make simplifying assumptions What are the limitations of the model?

Different types of models: Conceptual models Physical analog models Mathematical models …

Page 25: IR and NLP Jimmy Lin College of Information Studies Institute for Advanced Computer Studies University of Maryland Wednesday, March 15, 2006.

The Central Problem in IR

[Diagram: the information seeker has concepts that get expressed as query terms; authors have concepts that get expressed as document terms. Do these represent the same concepts?]

Page 26: IR and NLP Jimmy Lin College of Information Studies Institute for Advanced Computer Studies University of Maryland Wednesday, March 15, 2006.

The IR Black Box

[Diagram: the query and the documents each pass through a representation function; the document representations are stored in an index, and a comparison function matches the query representation against the index to produce hits.]

Page 27: IR and NLP Jimmy Lin College of Information Studies Institute for Advanced Computer Studies University of Maryland Wednesday, March 15, 2006.

How do we represent text?

How do we represent the complexities of language, keeping in mind that computers don't "understand" documents or queries?

Simple, yet effective approach: "bag of words." Treat unique words as independent features of the document.

Page 28: IR and NLP Jimmy Lin College of Information Studies Institute for Advanced Computer Studies University of Maryland Wednesday, March 15, 2006.

Sample Document

McDonald's slims down spuds

Fast-food chain to reduce certain types of fat in its french fries with new cooking oil.

NEW YORK (CNN/Money) - McDonald's Corp. is cutting the amount of "bad" fat in its french fries nearly in half, the fast-food chain said Tuesday as it moves to make all its fried menu items healthier.

But does that mean the popular shoestring fries won't taste the same? The company says no. "It's a win-win for our customers because they are getting the same great french-fry taste along with an even healthier nutrition profile," said Mike Roberts, president of McDonald's USA.

But others are not so sure. McDonald's will not specifically discuss the kind of oil it plans to use, but at least one nutrition expert says playing with the formula could mean a different taste.

Shares of Oak Brook, Ill.-based McDonald's (MCD: down $0.54 to $23.22, Research, Estimates) were lower Tuesday afternoon. It was unclear Tuesday whether competitors Burger King and Wendy's International (WEN: down $0.80 to $34.91, Research, Estimates) would follow suit. Neither company could immediately be reached for comment.

16 × said

14 × McDonalds

12 × fat

11 × fries

8 × new

6 × company french nutrition

5 × food oil percent reduce taste Tuesday

“Bag of Words”
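The term counts above are exactly what a bag-of-words representation keeps. A minimal sketch (the tokenizer and stopword list are simplified assumptions, not the exact procedure behind the counts on the slide):

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "of", "in", "to", "is", "its", "it", "for", "and", "but"}

def bag_of_words(text):
    """Lowercase, split on runs of letters, drop stopwords, count what remains."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(t for t in tokens if t not in STOPWORDS)

doc = ("McDonald's Corp. is cutting the amount of bad fat in its french fries "
       "nearly in half, the fast-food chain said Tuesday.")
print(bag_of_words(doc).most_common(5))
```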

Page 29: IR and NLP Jimmy Lin College of Information Studies Institute for Advanced Computer Studies University of Maryland Wednesday, March 15, 2006.

What’s the point?

Retrieving relevant information is hard! Evolving, ambiguous user needs, context, etc.; the complexities of language.

To operationalize information retrieval, we must vastly simplify the picture.

Bag-of-words approach: information retrieval is all (and only) about matching words in documents with words in queries. Obviously not true… but it works pretty well!

Page 30: IR and NLP Jimmy Lin College of Information Studies Institute for Advanced Computer Studies University of Maryland Wednesday, March 15, 2006.

Representing Documents

The quick brown fox jumped over the lazy dog’s back.

Document 1

Document 2

Now is the time for all good men to come to the aid of their party.

[Figure: term-document incidence matrix for the two documents. Stopwords (the, is, for, to, of) are removed via a stopword list; each remaining term (quick, brown, fox, over, lazy, dog, back, now, time, all, good, men, come, jump, aid, their, party) becomes a binary feature indicating whether it appears in Document 1 or Document 2.]

Page 31: IR and NLP Jimmy Lin College of Information Studies Institute for Advanced Computer Studies University of Maryland Wednesday, March 15, 2006.

Boolean Retrieval

Weights assigned to terms are either “0” or “1” “0” represents “absence”: term isn’t in the document “1” represents “presence”: term is in the document

Build queries by combining terms with Boolean operators AND, OR, NOT

The system returns all documents that satisfy the query

Page 32: IR and NLP Jimmy Lin College of Information Studies Institute for Advanced Computer Studies University of Maryland Wednesday, March 15, 2006.

Boolean View of a Collection

[Figure: term-document incidence matrix for eight documents (Doc 1 through Doc 8) over the same vocabulary; each cell is 1 if the term occurs in the document and 0 otherwise.]

Each column represents the view of a particular document: What terms are contained in this document?

Each row represents the view of a particular term: What documents contain this term?

To execute a query, pick out the rows corresponding to the query terms and then apply the truth table of the corresponding Boolean operator
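A minimal sketch of this query-execution idea, using one posting set per term instead of explicit bit vectors; the per-term postings below are reconstructed to be consistent with the query results on the next slide, so treat them as illustrative.

```python
# Inverted view of the toy collection: term -> set of documents containing it.
index = {
    "dog":   {"Doc3", "Doc5"},
    "fox":   {"Doc3", "Doc5", "Doc7"},
    "good":  {"Doc6", "Doc8"},
    "party": {"Doc6", "Doc8"},
    "over":  {"Doc1", "Doc3", "Doc5", "Doc7", "Doc8"},
}

def postings(term):
    return index.get(term, set())

# Boolean operators map directly onto set operations.
print(postings("dog") & postings("fox"))    # dog AND fox
print(postings("dog") | postings("fox"))    # dog OR fox
print(postings("dog") - postings("fox"))    # dog NOT fox
print((postings("good") & postings("party")) - postings("over"))  # good AND party NOT over
```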

Page 33: IR and NLP Jimmy Lin College of Information Studies Institute for Advanced Computer Studies University of Maryland Wednesday, March 15, 2006.

Sample Queries

[Figure: the rows of the incidence matrix for the query terms fox, dog, good, party, and over, combined with Boolean operators.]

dog AND fox: Doc 3, Doc 5
dog OR fox: Doc 3, Doc 5, Doc 7
dog NOT fox: empty
fox NOT dog: Doc 7
good AND party: Doc 6, Doc 8
good AND party NOT over: Doc 6

Page 34: IR and NLP Jimmy Lin College of Information Studies Institute for Advanced Computer Studies University of Maryland Wednesday, March 15, 2006.

Proximity Operators

More "precise" versions of AND: "NEAR n" allows at most n-1 intervening terms; "WITH" requires terms to be adjacent and in order. Other extensions: within n sentences, within n paragraphs, etc.

Relatively easy to implement, but less efficient: store position information for each word in the document vectors, then perform normal Boolean computations but treat WITH and NEAR as extra constraints.

Page 35: IR and NLP Jimmy Lin College of Information Studies Institute for Advanced Computer Studies University of Maryland Wednesday, March 15, 2006.

Why Boolean Retrieval Works

Boolean operators approximate natural language Find documents about a good party that is not over

AND can discover relationships between concepts good party

OR can discover alternate terminology excellent party, wild party, etc.

NOT can discover alternate meanings Democratic party

Page 36: IR and NLP Jimmy Lin College of Information Studies Institute for Advanced Computer Studies University of Maryland Wednesday, March 15, 2006.

Why Boolean Retrieval Fails

Natural language is way more complex

AND “discovers” nonexistent relationships Terms in different sentences, paragraphs, …

Guessing terminology for OR is hard good, nice, excellent, outstanding, awesome, …

Guessing terms to exclude is even harder! Democratic party, party to a lawsuit, …

Page 37: IR and NLP Jimmy Lin College of Information Studies Institute for Advanced Computer Studies University of Maryland Wednesday, March 15, 2006.

Strengths and Weaknesses

Strengths
Precise, if you know the right strategies
Precise, if you have an idea of what you're looking for
Efficient for the computer

Weaknesses
Users must learn Boolean logic
Boolean logic is insufficient to capture the richness of language
No control over the size of the result set: either too many documents or none
When do you stop reading? All documents in the result set are considered "equally good"
What about partial matches? Documents that "don't quite match" the query may be useful also

Page 38: IR and NLP Jimmy Lin College of Information Studies Institute for Advanced Computer Studies University of Maryland Wednesday, March 15, 2006.

The Vector Space Model

Let’s replace relevance with “similarity” Rank documents by their similarity with the query

Treat the query as if it were a document Create a query bag-of-words

Find its similarity to each document

Rank order the documents by similarity

Surprisingly, this works pretty well!

Page 39: IR and NLP Jimmy Lin College of Information Studies Institute for Advanced Computer Studies University of Maryland Wednesday, March 15, 2006.

Vector Space Model

Postulate: Documents that are “close together” in vector space “talk about” the same things

[Diagram: documents d1 through d5 plotted as vectors in a space whose dimensions are the terms t1, t2, t3; the angles θ and φ between vectors measure how close they are.]

Therefore, retrieve documents based on how close the document is to the query (i.e., similarity ~ “closeness”)

Page 40: IR and NLP Jimmy Lin College of Information Studies Institute for Advanced Computer Studies University of Maryland Wednesday, March 15, 2006.

Similarity Metric

How about |d1 – d2|? This is the Euclidean distance between the vectors. Why is this not a good idea?

Instead of distance, use the "angle" between the vectors:

\cos(\theta) = \frac{\vec{d_j} \cdot \vec{d_k}}{|\vec{d_j}|\,|\vec{d_k}|}

sim(d_j, d_k) = \frac{\vec{d_j} \cdot \vec{d_k}}{|\vec{d_j}|\,|\vec{d_k}|} = \frac{\sum_{i=1}^{n} w_{i,j}\, w_{i,k}}{\sqrt{\sum_{i=1}^{n} w_{i,j}^2}\;\sqrt{\sum_{i=1}^{n} w_{i,k}^2}}
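A minimal sketch of the cosine computation for sparse vectors stored as term-to-weight dicts (names are illustrative). Unlike Euclidean distance, the cosine does not penalize documents just for being long, which is why the angle is preferred.

```python
import math

def cosine(vec_a, vec_b):
    """Cosine of the angle between two sparse term-weight vectors (dicts)."""
    dot = sum(w * vec_b.get(t, 0.0) for t, w in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

query = {"fat": 1.0, "fries": 1.0}
doc   = {"fat": 12.0, "fries": 11.0, "new": 8.0}
print(cosine(query, doc))
```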

Page 41: IR and NLP Jimmy Lin College of Information Studies Institute for Advanced Computer Studies University of Maryland Wednesday, March 15, 2006.

How do we weight doc terms?

Here's the intuition:

Terms that appear often in a document should get high weights. The more often a document contains the term "dog", the more likely that the document is "about" dogs.

Terms that appear in many documents should get low weights. Words like "the", "a", "of" appear in (nearly) all documents.

How do we capture this mathematically? Term frequency and inverse document frequency.

Page 42: IR and NLP Jimmy Lin College of Information Studies Institute for Advanced Computer Studies University of Maryland Wednesday, March 15, 2006.

TF.IDF Term Weighting

Simple, yet effective!

w_{i,j} = \mathrm{tf}_{i,j} \cdot \log \frac{N}{n_i}

where:
w_{i,j} = weight assigned to term i in document j
tf_{i,j} = number of occurrences of term i in document j
N = number of documents in the entire collection
n_i = number of documents containing term i
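A minimal sketch of this weighting applied to a toy collection (names are illustrative; the base of the logarithm and other details vary between systems):

```python
import math
from collections import Counter

docs = {
    "d1": "the quick brown fox jumped over the lazy dog".split(),
    "d2": "now is the time for all good men to come to the aid of their party".split(),
    "d3": "the dog barked at the quick fox".split(),
}
N = len(docs)
df = Counter()                      # n_i: number of documents containing term i
for tokens in docs.values():
    df.update(set(tokens))

def tfidf(doc_id):
    tf = Counter(docs[doc_id])      # tf_{i,j}: occurrences of term i in document j
    return {t: tf[t] * math.log(N / df[t]) for t in tf}

print(tfidf("d1")["fox"])           # appears in 2 of 3 documents: modest weight
print(tfidf("d1")["lazy"])          # appears in 1 of 3 documents: higher weight
```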

Page 43: IR and NLP Jimmy Lin College of Information Studies Institute for Advanced Computer Studies University of Maryland Wednesday, March 15, 2006.

What is a Language Model?

Probability distribution over strings of text How likely is a string in a given “language”?

Probabilities depend on what language we’re modeling

p1 = P(“a quick brown dog”)

p2 = P(“dog quick a brown”)

p3 = P(“быстрая brown dog”)

p4 = P(“быстрая собака”)

In a language model for English: p1 > p2 > p3 > p4

In a language model for Russian: p1 < p2 < p3 < p4

Page 44: IR and NLP Jimmy Lin College of Information Studies Institute for Advanced Computer Studies University of Maryland Wednesday, March 15, 2006.

Noisy-Channel Model of IR

The user has an information need, "thinks" of a relevant document… and writes down some queries.

The task of information retrieval: given the query, figure out which document it came from.

[Diagram: the information need gives rise to the query; the candidate sources are the documents d1, d2, …, dn in the collection.]

Page 45: IR and NLP Jimmy Lin College of Information Studies Institute for Advanced Computer Studies University of Maryland Wednesday, March 15, 2006.

Retrieval w/ Language Models

Build a model for every document

Rank document d based on P(MD | q)

Expand using Bayes’ Theorem

Same as ranking by P(q | MD)

P(M_D \mid q) = \frac{P(q \mid M_D)\, P(M_D)}{P(q)}

P(q) is the same for all documents, so it doesn't change the ranking; P(M_D), the prior, is assumed to be the same for all documents.

Page 46: IR and NLP Jimmy Lin College of Information Studies Institute for Advanced Computer Studies University of Maryland Wednesday, March 15, 2006.

What does it mean?

Ranking by P(MD | q), i.e., asking each document model "Hey, what's the probability this query came from you?", is the same as ranking by P(q | MD), i.e., asking each model "Hey, what's the probability that you generated this query?" The same question is put to model 1, model 2, …, model n.

Page 47: IR and NLP Jimmy Lin College of Information Studies Institute for Advanced Computer Studies University of Maryland Wednesday, March 15, 2006.

Ranking Models?

Ranking by P(q | MD), i.e., asking each model "Hey, what's the probability that you generated this query?", is the same as ranking documents, because model 1 is a model of document 1, model 2 is a model of document 2, …, and model n is a model of document n.

Page 48: IR and NLP Jimmy Lin College of Information Studies Institute for Advanced Computer Studies University of Maryland Wednesday, March 15, 2006.

Unigram Language Model

Assume each word is generated independently Obviously, this is not true… But it seems to work well in practice!

The probability of a string, given a model:

P(q_1 \ldots q_k \mid M) = \prod_{i=1}^{k} P(q_i \mid M)

The probability of a sequence of words decomposes into a product of the probabilities of individual words
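A minimal sketch of ranking by query likelihood under this unigram assumption. Pure maximum-likelihood estimates assign zero probability to query terms missing from a document, so the sketch mixes in collection statistics; that smoothing step is an assumption added here, since the slides do not cover it.

```python
import math
from collections import Counter

docs = {
    "d1": "a quick brown dog jumped over a lazy dog".split(),
    "d2": "the dog chased the quick brown fox".split(),
}
collection = Counter()
for tokens in docs.values():
    collection.update(tokens)
coll_len = sum(collection.values())

def log_p_query(query, doc_tokens, lam=0.5):
    """log P(q | M_D), interpolating the document model with the collection model."""
    tf, dl = Counter(doc_tokens), len(doc_tokens)
    score = 0.0
    for term in query:
        p_doc = tf[term] / dl
        p_coll = collection[term] / coll_len
        # A term absent from the whole collection would still give log(0);
        # real systems handle that case as well.
        score += math.log(lam * p_doc + (1 - lam) * p_coll)
    return score

query = "quick dog".split()
print(sorted(docs, key=lambda d: log_p_query(query, docs[d]), reverse=True))
```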

Page 49: IR and NLP Jimmy Lin College of Information Studies Institute for Advanced Computer Studies University of Maryland Wednesday, March 15, 2006.

Modeling

How do we build a language model for a document?

What’s in the urn?

Page 50: IR and NLP Jimmy Lin College of Information Studies Institute for Advanced Computer Studies University of Maryland Wednesday, March 15, 2006.

NLP for IR

Page 51: IR and NLP Jimmy Lin College of Information Studies Institute for Advanced Computer Studies University of Maryland Wednesday, March 15, 2006.

The Central Problem in IR

[Diagram, repeated from earlier: the information seeker's concepts become query terms; the authors' concepts become document terms. Do these represent the same concepts?]

Page 52: IR and NLP Jimmy Lin College of Information Studies Institute for Advanced Computer Studies University of Maryland Wednesday, March 15, 2006.

Why is IR hard?

IR is hard because natural language is so rich (among other reasons)

What are the issues? Tokenization Morphological Variation Synonymy Polysemy Paraphrase Ambiguity Anaphora

Page 53: IR and NLP Jimmy Lin College of Information Studies Institute for Advanced Computer Studies University of Maryland Wednesday, March 15, 2006.

Possible Solutions

Vary the unit of indexing Strings and segments Tokens and words Phrases and entities Senses and concepts

Manipulate queries and results Term expansion Post-processing of results

Page 54: IR and NLP Jimmy Lin College of Information Studies Institute for Advanced Computer Studies University of Maryland Wednesday, March 15, 2006.

Tokenization

What’s a word? First try: words are separated by spaces

What about clitics?

What about languages without spaces?

Same problem with speech!

I’m not saying that I don’t want John’s input on this.

The cat on the mat. → the, cat, on, the, mat

天主教教宗若望保祿二世因感冒再度住進醫院。→ 天主教 教宗 若望保祿二世 因 感冒 再度 住進 醫院。 (The Catholic Pope John Paul II was again hospitalized because of a cold.)

Where are the spaces?

Page 55: IR and NLP Jimmy Lin College of Information Studies Institute for Advanced Computer Studies University of Maryland Wednesday, March 15, 2006.

Word-Level Issues

Morphological variation = different forms of the same concept
Inflectional morphology: same part of speech (break, broke, broken; sing, sang, sung; etc.)
Derivational morphology: different parts of speech (destroy, destruction; invent, invention, reinvention; etc.)

Synonymy = different words, same meaning
{dog, canine, doggy, puppy, etc.} → the concept of dog

Polysemy = same word, different meanings
Bank: financial institution or side of a river? Crane: bird or construction equipment? Is: depends on what the meaning of "is" is!

Page 56: IR and NLP Jimmy Lin College of Information Studies Institute for Advanced Computer Studies University of Maryland Wednesday, March 15, 2006.

Paraphrase

Who killed Abraham Lincoln?

(1) John Wilkes Booth killed Abraham Lincoln.
(2) John Wilkes Booth altered history with a bullet. He will forever be known as the man who ended Abraham Lincoln's life.

When did Wilt Chamberlain score 100 points?

(1) Wilt Chamberlain scored 100 points on March 2, 1962 against the New York Knicks.

(2) On December 8, 1961, Wilt Chamberlain scored 78 points in a triple overtime game. It was a new NBA record, but Warriors coach Frank McGuire didn’t expect it to last long, saying, “He’ll get 100 points someday.” McGuire’s prediction came true just a few months later in a game against the New York Knicks on March 2.

Language provides different ways of saying the same thing

Page 57: IR and NLP Jimmy Lin College of Information Studies Institute for Advanced Computer Studies University of Maryland Wednesday, March 15, 2006.

Ambiguity

What exactly do you mean?

Why don’t we have problems (most of the time)?

I saw the man on the hill with the telescope. Who has the telescope?

Time flies like an arrow. Say what?

Visiting relatives can be annoying. Who's visiting?

Page 58: IR and NLP Jimmy Lin College of Information Studies Institute for Advanced Computer Studies University of Maryland Wednesday, March 15, 2006.

Ambiguity in Action

Different documents with the same keywords may have different meanings…

What do frogs eat?

(1) Adult frogs eat mainly insects and other small animals, including earthworms, minnows, and spiders.

(2) Alligators eat many kinds of small animals that live in or near the water, including fish, snakes, frogs, turtles, small mammals, and birds.

(3) Some bats catch fish with their claws, and a few species eat lizards, rodents, small birds, tree frogs, and other bats.

keywords: frogs, eat

What is the largest volcano in the Solar System?

(1) Mars boasts many extreme geographic features; for example, Olympus Mons, is the largest volcano in the solar system.

(2) The Galileo probe's mission to Jupiter, the largest planet in the Solar system, included amazing photographs of the volcanoes on Io, one of its four most famous moons.

(3) Even the largest volcanoes found on Earth are puny in comparison to others found around our own cosmic backyard, the Solar System.

keywords: largest, volcano, solar, system

Page 59: IR and NLP Jimmy Lin College of Information Studies Institute for Advanced Computer Studies University of Maryland Wednesday, March 15, 2006.

Anaphora

Who killed Abraham Lincoln?

(1) John Wilkes Booth killed Abraham Lincoln.
(2) John Wilkes Booth altered history with a bullet. He will forever be known as the man who ended Abraham Lincoln's life.

When did Wilt Chamberlain score 100 points?

(1) Wilt Chamberlain scored 100 points on March 2, 1962 against the New York Knicks.

(2) On December 8, 1961, Wilt Chamberlain scored 78 points in a triple overtime game. It was a new NBA record, but Warriors coach Frank McGuire didn’t expect it to last long, saying, “He’ll get 100 points someday.” McGuire’s prediction came true just a few months later in a game against the New York Knicks on March 2.

Language provides different ways of referring to the same entity

Page 60: IR and NLP Jimmy Lin College of Information Studies Institute for Advanced Computer Studies University of Maryland Wednesday, March 15, 2006.

More Anaphora

Terminology Anaphor = an expression that refers to another Anaphora = the phenomenon

Other different types of referring expressions:

Anaphora resolution can be hard!

Fujitsu and NEC said they were still investigating, and that knowledge of more such bids could emerge... Other major Japanese computer companies contacted yesterday said they have never made such bids.

The city council denied the demonstrators a permit because…
…they feared violence.
…they advocated violence.

The hotel recently went through a $200 million restoration… original artworks include an impressive collection of Greek statues in the lobby.

Page 61: IR and NLP Jimmy Lin College of Information Studies Institute for Advanced Computer Studies University of Maryland Wednesday, March 15, 2006.

What can we do?

Here are some of the problems: tokenization; morphological variation, synonymy, polysemy; paraphrase, ambiguity; anaphora.

General approaches: Vary the unit of indexing Manipulate queries and results

Page 62: IR and NLP Jimmy Lin College of Information Studies Institute for Advanced Computer Studies University of Maryland Wednesday, March 15, 2006.

What do we index?

In information retrieval, we are after the concepts represented in the documents

… but we can only index strings

So what’s the best unit of indexing?

Page 63: IR and NLP Jimmy Lin College of Information Studies Institute for Advanced Computer Studies University of Maryland Wednesday, March 15, 2006.

The Tokenization Problem

In many languages, words are not separated by spaces…

Tokenization = separating a string into “words”

Simple greedy approach: start with a list of every possible term (e.g., from a dictionary), look for the longest word in the unsegmented string, take the longest matching term as the next word, and repeat.
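A minimal sketch of the greedy longest-match idea (the dictionary is a toy; real systems use large lexicons and, as the next slide suggests, can still segment incorrectly):

```python
def greedy_segment(text, dictionary):
    """Repeatedly take the longest dictionary word starting at the current position."""
    words, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):      # try the longest candidate first
            if text[i:j] in dictionary:
                words.append(text[i:j])
                i = j
                break
        else:                                  # no dictionary word matches: emit one character
            words.append(text[i])
            i += 1
    return words

dictionary = {"them", "the", "theme", "me", "park"}
print(greedy_segment("themepark", dictionary))   # ['theme', 'park']
```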

Page 64: IR and NLP Jimmy Lin College of Information Studies Institute for Advanced Computer Studies University of Maryland Wednesday, March 15, 2006.

Probabilistic Segmentation

For an input string of characters c1 c2 c3 … cn, try all possible partitions into words, and choose the highest-probability partition, e.g., by computing P(c1 c2 c3) for each candidate word using a language model.

Challenges: search, probability estimation.

[The slide shows several example partitions of c1 c2 c3 c4 … cn into candidate words.]

Page 65: IR and NLP Jimmy Lin College of Information Studies Institute for Advanced Computer Studies University of Maryland Wednesday, March 15, 2006.

Indexing N-Grams

Consider a Chinese document: c1 c2 c3 … cn

Don’t segment (you could be wrong!)

Instead, treat every character bigram as a term

Break up queries the same way

Works at least as well as trying to segment correctly!

c1 c2 c3 c4 c5 … cn → c1c2, c2c3, c3c4, c4c5, …, cn-1cn

Page 66: IR and NLP Jimmy Lin College of Information Studies Institute for Advanced Computer Studies University of Maryland Wednesday, March 15, 2006.

Morphological Variation

Handling morphology: related concepts have different forms.
Inflectional morphology: same part of speech (dogs = dog + PLURAL; broke = break + PAST)
Derivational morphology: different parts of speech (destruction = destroy + ion; researcher = research + er)

Different morphological processes: prefixing, suffixing, infixing, reduplication.

Page 67: IR and NLP Jimmy Lin College of Information Studies Institute for Advanced Computer Studies University of Maryland Wednesday, March 15, 2006.

Stemming

Dealing with morphological variation: index stems instead of words. A stem is a word equivalence class that preserves the central concept.

How much to stem? organization → organize → organ? resubmission → resubmit / submission → submit? reconstructionism → ?
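For English, off-the-shelf stemmers are widely used. A minimal sketch with NLTK's Porter stemmer, assuming the nltk package is installed (this is just one stemmer, not necessarily the algorithms evaluated in the studies cited on the next slide):

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["dogs", "organization", "organize", "destruction", "resubmission"]:
    # How aggressively words are conflated varies from word to word.
    print(word, "->", stemmer.stem(word))
```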

Page 68: IR and NLP Jimmy Lin College of Information Studies Institute for Advanced Computer Studies University of Maryland Wednesday, March 15, 2006.

Does Stemming Work?

Generally, yes! (in English) Helps more for longer queries Lots of work done in this area

Donna Harman (1991) How Effective is Suffixing? Journal of the American Society for Information Science, 42(1):7-15.

Robert Krovetz. (1993) Viewing Morphology as an Inference Process. Proceedings of SIGIR 1993.

David A. Hull. (1996) Stemming Algorithms: A Case Study for Detailed Evaluation. Journal of the American Society for Information Science, 47(1):70-84.

And others…

Page 69: IR and NLP Jimmy Lin College of Information Studies Institute for Advanced Computer Studies University of Maryland Wednesday, March 15, 2006.

Stemming in Other Languages

Arabic makes frequent use of infixes: maktab (office), kitaab (book), kutub (books), kataba (he wrote), naktubu (we write), etc. all share the root ktb.

What's the most effective stemming strategy in Arabic? Open research question…

Page 70: IR and NLP Jimmy Lin College of Information Studies Institute for Advanced Computer Studies University of Maryland Wednesday, March 15, 2006.

Words = wrong indexing unit!

Synonymy = different words, same meaning
{dog, canine, doggy, puppy, etc.} → the concept of dog

Polysemy = same word, different meanings
Bank: financial institution or side of a river? Crane: bird or construction equipment?

It'd be nice if we could index concepts! A word sense is a coherent cluster in semantic space; indexing word senses achieves the effect of conceptual indexing.

Page 71: IR and NLP Jimmy Lin College of Information Studies Institute for Advanced Computer Studies University of Maryland Wednesday, March 15, 2006.

Indexing Word Senses

How does indexing word senses solve the synonymy/polysemy problem?

{dog, canine, doggy, puppy, etc.} → concept 112986
I deposited my check in the bank. bank → concept 76529
I saw the sailboat from the bank. bank → concept 53107

Okay, so where do we get the word senses? WordNet; automatically found "clusters" of words that describe the same concepts; other methods have also been tried…

Page 72: IR and NLP Jimmy Lin College of Information Studies Institute for Advanced Computer Studies University of Maryland Wednesday, March 15, 2006.

Word Sense Disambiguation

Given a word in context, automatically determine the sense (concept): this is the Word Sense Disambiguation (WSD) problem.

Context is the key: for each ambiguous word, note the surrounding words.
bank {river, sailboat, water, etc.} → side of a river
bank {check, money, account, etc.} → financial institution

Learn a classifier from a collection of examples, then use the classifier to determine the senses of words in the documents.
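A minimal sketch of the idea: a toy overlap-based disambiguator rather than a trained classifier, with made-up sense names and signature words purely for illustration.

```python
SENSE_SIGNATURES = {
    "bank/river":     {"river", "sailboat", "water", "shore", "fishing"},
    "bank/financial": {"check", "money", "account", "deposit", "loan"},
}

def disambiguate(context_words):
    """Pick the sense whose signature overlaps most with the surrounding words."""
    context = {w.lower() for w in context_words}
    return max(SENSE_SIGNATURES, key=lambda s: len(SENSE_SIGNATURES[s] & context))

print(disambiguate("I deposited my check in the bank".split()))
print(disambiguate("I saw the sailboat from the bank".split()))
```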

Page 73: IR and NLP Jimmy Lin College of Information Studies Institute for Advanced Computer Studies University of Maryland Wednesday, March 15, 2006.

Does it work?

Nope!

Examples of limited success….

Ellen M. Voorhees. (1993) Using WordNet to Disambiguate Word Senses for Text Retrieval. Proceedings of SIGIR 1993.

Mark Sanderson. (1994) Word-Sense Disambiguation and Information Retrieval. Proceedings of SIGIR 1994

And others…

Hinrich Schütze and Jan O. Pedersen. (1995) Information Retrieval Based on Word Senses. Proceedings of the 4th Annual Symposium on Document Analysis and Information Retrieval.

Rada Mihalcea and Dan Moldovan. (2000) Semantic Indexing Using WordNet Senses. Proceedings of ACL 2000 Workshop on Recent Advances in NLP and IR.

Page 74: IR and NLP Jimmy Lin College of Information Studies Institute for Advanced Computer Studies University of Maryland Wednesday, March 15, 2006.

Why Disambiguation Hurts

Bag-of-words techniques already disambiguate: context for each term is established in the query.

WSD is hard! Many words are highly polysemous (e.g., interest), and the granularity of senses is often domain/application specific.

WSD tries to improve precision, but incorrect sense assignments hurt recall, and slight gains in precision do not offset large drops in recall.

Page 75: IR and NLP Jimmy Lin College of Information Studies Institute for Advanced Computer Studies University of Maryland Wednesday, March 15, 2006.

An Alternate Approach

Indexing word senses "freezes" concepts at index time. What if we expanded query terms at query time instead?

Two approaches:
Manual thesaurus, e.g., WordNet, UMLS, etc.
Automatically-derived thesaurus, e.g., from co-occurrence statistics

dog AND cat → ( dog OR canine ) AND ( cat OR feline )
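A minimal sketch of the manual-thesaurus flavor, pulling synonyms from WordNet through NLTK (assumes nltk and its WordNet data are installed). Deciding which senses and synonyms to keep is exactly the "careful" part discussed on the next slide.

```python
from nltk.corpus import wordnet as wn

def expand(term, max_synonyms=3):
    """Collect a few WordNet synonyms for a term and OR them together."""
    synonyms = {term}
    for synset in wn.synsets(term):
        for lemma in synset.lemma_names():
            synonyms.add(lemma.replace("_", " ").lower())
        if len(synonyms) >= max_synonyms:
            break
    return "( " + " OR ".join(sorted(synonyms)) + " )"

# dog AND cat  ->  ( ... OR dog ) AND ( cat OR ... )
print(expand("dog"), "AND", expand("cat"))
```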

Page 76: IR and NLP Jimmy Lin College of Information Studies Institute for Advanced Computer Studies University of Maryland Wednesday, March 15, 2006.

Does it work?

Yes… if done “carefully”

User should be involved in the process Otherwise, poor choice of terms can hurt performance

Page 77: IR and NLP Jimmy Lin College of Information Studies Institute for Advanced Computer Studies University of Maryland Wednesday, March 15, 2006.

Handling Anaphora

Anaphora resolution: finding what the anaphor refers to (i.e., the antecedent)

Most common example: pronominal anaphora resolution. The simplest method works pretty well: find the previous noun phrase matching in gender and number.

John Wilkes Booth altered history with a bullet. He will forever be known as the man who ended Abraham Lincoln’s life.

He = John Wilkes Booth

Page 78: IR and NLP Jimmy Lin College of Information Studies Institute for Advanced Computer Studies University of Maryland Wednesday, March 15, 2006.

Expanding Anaphors

When indexing, replace anaphors with their antecedents

Does it work? Somewhat… but it can be computationally expensive; it helps more if you want to retrieve sub-document segments.

Page 79: IR and NLP Jimmy Lin College of Information Studies Institute for Advanced Computer Studies University of Maryland Wednesday, March 15, 2006.

Beyond Word-Level Indexing

Words are the wrong unit to index…

Many multi-word combinations identify entities Persons: George W. Bush, Dr. Jones Organizations: Red Cross, United Way Corporations: Hewlett Packard, Kraft Foods Locations: Easter Island, New York City

Entities often have finer-grained structures:
Professor Stephen W. Hawking → title, first name, middle initial, last name
Cambridge, Massachusetts → city, state

Page 80: IR and NLP Jimmy Lin College of Information Studies Institute for Advanced Computer Studies University of Maryland Wednesday, March 15, 2006.

Indexing Named Entities

Why would we want to index named entities?

Index named entities as special tokens

And treat special tokens like query terms

Works pretty well for question answering

In reality, at the time of Edison's [PERSON] 1879 [DATE] patent, the light bulb had been in existence for some five decades…

Who patented the light bulb? → patent light bulb PERSON
When was the light bulb patented? → patent light bulb DATE

John Prager, Eric Brown, and Anni Coden. (2000) Question-Answering by Predictive Annotation. Proceedings of SIGIR 2000.
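A minimal sketch of tagging entity types so they can be indexed as special tokens, here using spaCy's pretrained tagger (assumes spacy and the en_core_web_sm model are installed; this is not the system described in the paper above):

```python
import spacy

nlp = spacy.load("en_core_web_sm")

def index_tokens(text):
    """Emit ordinary word tokens plus one special token per named-entity type."""
    doc = nlp(text)
    tokens = [t.lower_ for t in doc if not t.is_punct]
    tokens += [f"<{ent.label_}>" for ent in doc.ents]   # e.g. <PERSON>, <DATE>
    return tokens

print(index_tokens("In reality, at the time of Edison's 1879 patent, "
                   "the light bulb had been in existence for some five decades."))
```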

Page 81: IR and NLP Jimmy Lin College of Information Studies Institute for Advanced Computer Studies University of Maryland Wednesday, March 15, 2006.

Indexing Phrases

Two types of phrases Those that make sense, e.g., “school bus”, “hot dog” Those that don’t, e.g., bigrams in Chinese

Treat multi-word tokens as index terms

Three sources of evidence: Dictionary lookup Linguistic analysis Statistical analysis (e.g., co-occurrence)

Page 82: IR and NLP Jimmy Lin College of Information Studies Institute for Advanced Computer Studies University of Maryland Wednesday, March 15, 2006.

Known Phrases

Compile a term list that includes phrases Technical terminology can be very helpful

Index any phrase that occurs in the list

Most effective in a limited domain Otherwise hard to capture most useful phrases

Page 83: IR and NLP Jimmy Lin College of Information Studies Institute for Advanced Computer Studies University of Maryland Wednesday, March 15, 2006.

Syntactic Phrases

Parsing = automatically assign structure to a sentence

“Walk” the tree and extract phrases Index all noun phrases Index subjects and verbs Index verbs and objects etc.

[Diagram: parse tree for "The quick brown fox jumped over the lazy black dog", with the sentence split into a noun phrase, a verb, and a prepositional phrase containing another noun phrase, and part-of-speech tags (Det, Adj, Noun, Verb, Prep) on the words.]

Page 84: IR and NLP Jimmy Lin College of Information Studies Institute for Advanced Computer Studies University of Maryland Wednesday, March 15, 2006.

Syntactic Variations

What does linguistic analysis buy?

Coordinations: lung and breast cancer → lung cancer, breast cancer
Substitutions: inflammatory sinonasal disease → inflammatory disease, sinonasal disease
Permutations: addition of calcium → calcium addition

Page 85: IR and NLP Jimmy Lin College of Information Studies Institute for Advanced Computer Studies University of Maryland Wednesday, March 15, 2006.

Statistical Analysis

Automatically discover phrases based on co-occurrence probabilities

If terms are not independent, they may form a phrase

Use this method to automatically learn a phrase dictionary

P(“kick the bucket”) = P(“kick”) P(“the”) P(“bucket”) ?
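A minimal sketch of the co-occurrence test, scoring adjacent word pairs by pointwise mutual information over a toy corpus (the corpus, the comparison pairs, and any threshold are illustrative):

```python
import math
from collections import Counter

corpus = ("new york is the biggest city in the state . the dog ran across "
          "new york . a dog barked at the old dog in new york").split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
total = len(corpus)

def pmi(w1, w2):
    """log of how much more often w1 w2 co-occur than independence would predict."""
    p_joint = bigrams[(w1, w2)] / (total - 1)
    p_indep = (unigrams[w1] / total) * (unigrams[w2] / total)
    return math.log(p_joint / p_indep)

print(pmi("new", "york"))   # high: "new york" behaves like a unit
print(pmi("the", "dog"))    # lower: "the" precedes many different words
```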

Page 86: IR and NLP Jimmy Lin College of Information Studies Institute for Advanced Computer Studies University of Maryland Wednesday, March 15, 2006.

Does Phrasal Indexing Work?

Yes…

But the gains are so small they’re not worth the cost

Primary drawback: too slow!

Page 87: IR and NLP Jimmy Lin College of Information Studies Institute for Advanced Computer Studies University of Maryland Wednesday, March 15, 2006.

What about ambiguity?

Different documents with the same keywords may have different meanings…

What do frogs eat?

(1) Adult frogs eat mainly insects and other small animals, including earthworms, minnows, and spiders.

(2) Alligators eat many kinds of small animals that live in or near the water, including fish, snakes, frogs, turtles, small mammals, and birds.

(3) Some bats catch fish with their claws, and a few species eat lizards, rodents, small birds, tree frogs, and other bats.

keywords: frogs, eat

What is the largest volcano in the Solar System?

(1) Mars boasts many extreme geographic features; for example, Olympus Mons, is the largest volcano in the solar system.

(2) The Galileo probe's mission to Jupiter, the largest planet in the Solar system, included amazing photographs of the volcanoes on Io, one of its four most famous moons.

(3) Even the largest volcanoes found on Earth are puny in comparison to others found around our own cosmic backyard, the Solar System.

keywords: largest, volcano, solar, system

Page 88: IR and NLP Jimmy Lin College of Information Studies Institute for Advanced Computer Studies University of Maryland Wednesday, March 15, 2006.

Indexing Relations

Instead of terms, index syntactic relations between entities in the text

Adult frogs eat mainly insects and other small animals, including earthworms, minnows, and spiders.

< frogs subject-of eat >
< insects object-of eat >
< animals object-of eat >
< adult modifies frogs >
< small modifies animals >

Alligators eat many kinds of small animals that live in or near the water, including fish, snakes, frogs, turtles, small mammals, and birds.

< alligators subject-of eat >
< kinds object-of animals >
< small modifies animals >

From the relations, it is clear who’s eating whom!
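A minimal sketch of extracting such relations automatically with a dependency parser, here spaCy (assumes spacy and en_core_web_sm are installed; the relation names are a mapping from spaCy's dependency labels, not the slide's own notation):

```python
import spacy

nlp = spacy.load("en_core_web_sm")
DEP_TO_RELATION = {"nsubj": "subject-of", "dobj": "object-of", "amod": "modifies"}

def relations(text):
    """Extract <word relation head-word> triples from the dependency parse."""
    doc = nlp(text)
    return [(tok.lemma_, DEP_TO_RELATION[tok.dep_], tok.head.lemma_)
            for tok in doc if tok.dep_ in DEP_TO_RELATION]

for triple in relations("Adult frogs eat mainly insects and other small animals."):
    print(triple)
```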

Page 89: IR and NLP Jimmy Lin College of Information Studies Institute for Advanced Computer Studies University of Maryland Wednesday, March 15, 2006.

Are syntactic relations enough?

Consider this example:

John broke the window.
The window broke.

< John subject-of break >
< window subject-of break >

"John" and "window" are both subjects… but John is the person doing the breaking (the "agent"), and the window is the thing being broken (the "theme").

Syntax sometimes isn't enough… we need semantics (or meaning)!

Semantics, for example, allows us to relate the following two fragments:

The barbarians destroyed the city…
The destruction of the city by the barbarians…

event: destroy
agent: barbarians
theme: city

Page 90: IR and NLP Jimmy Lin College of Information Studies Institute for Advanced Computer Studies University of Maryland Wednesday, March 15, 2006.

Semantic Roles

Semantic roles are invariant with respect to syntactic expression

The idea: identify semantic roles, index "frame structures" with filled slots, and retrieve answers based on semantic-level matching.

Mary loaded the truck with hay.
Hay was loaded onto the truck by Mary.

event: load
agent: Mary
material: hay
destination: truck

Page 91: IR and NLP Jimmy Lin College of Information Studies Institute for Advanced Computer Studies University of Maryland Wednesday, March 15, 2006.

Does it work?

No, not really…

Why not? Syntactic and semantic analysis is difficult: errors offset whatever gain is obtained. As with WSD, these techniques are precision-enhancers, and recall usually takes a dive. And it's slow!

Page 92: IR and NLP Jimmy Lin College of Information Studies Institute for Advanced Computer Studies University of Maryland Wednesday, March 15, 2006.

Alternative Approach

Sophisticated linguistic analysis is slow! Unnecessary processing can be avoided by query-time analysis.

Two-stage retrieval:
Use standard document retrieval techniques to fetch a candidate set of documents
Use passage retrieval techniques to choose a few promising passages (e.g., paragraphs)
Apply sophisticated linguistic techniques to pinpoint the answer

Passage retrieval: find "good" passages within documents. Key idea: locate areas where lots of query terms appear close together.

Page 93: IR and NLP Jimmy Lin College of Information Studies Institute for Advanced Computer Studies University of Maryland Wednesday, March 15, 2006.

Key Ideas

IR is hard because language is rich and complex (among other reasons)

Two general approaches to the problem Attempt to find the best unit of indexing Try to fix things at query time

It is hard to predict a priori what techniques work Questions must be answered experimentally

Words are really the wrong thing to index But there isn’t really a better alternative…