The Boolean Retrieval Model LBSC 708A/CMSC 838L Session 2 - September 11, 2001 Philip Resnik.

The Boolean Retrieval Model

LBSC 708A/CMSC 838L

Session 2 - September 11, 2001

Philip Resnik

Agenda
• Questions
• General model for detection
• The “bag of words” representation
• Boolean “free text” retrieval
• Proximity operators
• Controlled vocabulary retrieval
• Automating controlled vocabulary
• Retrieval versus filtering

But First ...

• Rate the textbook reading:

– Was it easy to understand?

– How long did it take you to read?

Retrieval System Model

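[Figure: Retrieval System Model. Process stages: Source Selection, Query Formulation, Search (IR System), Selection, Examination, Delivery. Items passed between stages: Query, Ranked List, Document, Document. Feedback loops: Query Reformulation and Relevance Feedback, Source Reselection. Annotations: Nominate, Choose, Predict.]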

Search Goal

• Choose the same documents a human would
  – Without human intervention (less work)
  – Faster than a human could (less time)
  – As accurately as possible (less accuracy)
• Humans start with an information need
  – Machines start with a query
• Humans match documents to information needs
  – Machines match document & query representations

Search Component Model

[Figure: Search Component Model. Query side: Information Need, Query Formulation, Query, Query Processing (Representation Function), Query Representation. Document side: Document, Document Processing (Representation Function), Document Representation. Both representations feed the Comparison Function, which yields a Retrieval Status Value; Human Judgment yields Utility.]

Detection Component Model

• “Retrieval status value” is an estimate of utility
  – Utility = what the user would pay for the document
• A co-design problem
  – Document representation function
  – Query representation function
  – Comparison function
• Boolean “free text” retrieval is one way of allocating functionality to each function

“Bag of Words” Representation

• Bag = multiset: keeps track of members and counts
  The quick brown fox jumped over the lazy dog’s back
  {back, brown, dog, fox, jumped, lazy, over, quick, ‘s, the, the}
• A “term” is any lexical item that you choose
  – A fixed-length sequence of characters (an “n-gram”)
  – A word (delimited by “white space” or punctuation)
  – “Root form” of each word (destroyed → destroy)
  – “Stem” of each word (destroyed → destr)
  – A phrase (e.g., phrases listed in a dictionary)
• Counts can be recorded in any consistent order
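The choices of “term” listed above can be sketched in a few lines of Python. This is only an illustration: the function names `word_terms` and `ngram_terms` and the tokenizing regular expression are assumptions made for this sketch, not part of the lecture.

```python
import re
from collections import Counter

def word_terms(text):
    """Words delimited by white space or punctuation; a possessive 's is kept as its own term."""
    return re.findall(r"'s|[a-z]+", text.lower())

def ngram_terms(text, n):
    """Overlapping fixed-length character sequences (character n-grams)."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

sentence = "The quick brown fox jumped over the lazy dog's back"

# A bag (multiset) records each member together with its count.
word_bag = Counter(word_terms(sentence))
print(word_bag["the"], word_bag["fox"], word_bag["cat"])

# The same container works for any other choice of term, e.g. character trigrams.
trigram_bag = Counter(ngram_terms("the lazy dog", 3))
print(trigram_bag.most_common(3))
```

A `Counter` is unordered in the sense that matters here: any consistent ordering of the terms gives the same bag.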

Bag of Words Example

Document 1: The quick brown fox jumped over the lazy dog’s back.
Document 2: Now is the time for all good men to come to the aid of their party.

Indexed Term   Doc 1   Doc 2
aid              0       1
all              0       1
back             1       0
brown            1       0
come             0       1
dog              1       0
fox              1       0
good             0       1
jump             1       0
lazy             1       0
men              0       1
now              0       1
over             1       0
party            0       1
quick            1       0
their            0       1
time             0       1

Stopword List: the, is, for, to, of, ‘s

Boolean “Free Text” Retrieval

• Limit the bag of words to “absent” and “present”
  – “Boolean” values, represented as 0 and 1
• Represent terms as a “bag of documents”
  – Same representation, but rows rather than columns
• Combine the rows using “Boolean operators”
  – AND, OR, NOT
• Any document with a 1 remaining is “detected”
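As a rough sketch of the “bag of documents” idea, the snippet below turns each document into a set of present terms and then transposes that into one set of document numbers per term. The helper names (`terms`, `build_index`, `postings`) are invented for this sketch, and no stemming is applied, so “jumped” is indexed as written.

```python
import re

docs = {
    1: "The quick brown fox jumped over the lazy dog's back",
    2: "Now is the time for all good men to come to the aid of their party",
}
# Stopwords from the example; "s" is what remains of the possessive 's.
STOPWORDS = {"the", "is", "for", "to", "of", "s"}

def terms(text):
    """Present/absent only: a set of terms, with stopwords removed."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS}

def build_index(docs):
    """Transpose documents -> terms into terms -> documents."""
    index = {}
    for doc_id, text in docs.items():
        for term in terms(text):
            index.setdefault(term, set()).add(doc_id)
    return index

index = build_index(docs)
all_docs = set(docs)

def postings(term):
    return index.get(term, set())

print(postings("fox") & postings("dog"))    # AND: set intersection
print(postings("fox") | postings("party"))  # OR: set union
print(all_docs - postings("fox"))           # NOT: complement within the collection
```

Sets of document numbers and rows of 0/1 values carry the same information; sets are simply more convenient when most entries are 0.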

Boolean Operators

A  B  |  A OR B  |  A AND B  |  A NOT B  |  NOT B
0  0  |    0     |     0     |     0     |    1
0  1  |    1     |     0     |     0     |    0
1  0  |    1     |     0     |     1     |    1
1  1  |    1     |     1     |     0     |    0

A NOT B = A AND NOT B

Boolean Free Text Example

Term    Doc 1  Doc 2  Doc 3  Doc 4  Doc 5  Doc 6  Doc 7  Doc 8
aid       0      0      0      1      0      0      0      1
all       0      1      0      1      0      1      0      0
back      1      0      1      0      0      0      1      0
brown     1      0      1      0      1      0      1      0
come      0      1      0      1      0      1      0      1
dog       0      0      1      0      1      0      0      0
fox       0      0      1      0      1      0      1      0
good      0      1      0      1      0      1      0      1
jump      0      0      1      0      0      0      0      0
lazy      1      0      1      0      1      0      1      0
men       0      1      0      1      0      0      0      1
now       0      1      0      0      0      1      0      1
over      1      0      1      0      1      0      1      1
party     0      0      0      0      0      1      0      1
quick     1      0      1      0      0      0      0      0
their     1      0      0      0      1      0      1      0
time      0      1      0      1      0      1      0      0

• dog AND fox – Doc 3, Doc 5

• dog NOT fox – Empty

• fox NOT dog – Doc 7

• dog OR fox – Doc 3, Doc 5, Doc 7

• good AND party – Doc 6, Doc 8

• good AND party NOT over – Doc 6
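The queries above can be checked mechanically by treating each term's row as a bit vector and combining rows with bitwise operators. The sketch below copies only the five rows the queries touch (leftmost bit is Doc 1); the names `row`, `NOT`, and `detected` are made up for this illustration.

```python
N_DOCS = 8
MASK = (1 << N_DOCS) - 1  # keeps NOT confined to the 8 documents

# Term rows from the example; leftmost character is Doc 1, rightmost is Doc 8.
ROWS = {
    "dog":   "00101000",
    "fox":   "00101010",
    "good":  "01010101",
    "over":  "10101011",
    "party": "00000101",
}

def row(term):
    return int(ROWS[term], 2)

def NOT(bits):
    return ~bits & MASK

def detected(bits):
    """Document numbers whose bit is still 1 after combining rows."""
    return [i + 1 for i, ch in enumerate(format(bits, f"0{N_DOCS}b")) if ch == "1"]

print(detected(row("dog") & row("fox")))                        # dog AND fox
print(detected(row("dog") & NOT(row("fox"))))                   # dog NOT fox
print(detected(row("fox") & NOT(row("dog"))))                   # fox NOT dog
print(detected(row("dog") | row("fox")))                        # dog OR fox
print(detected(row("good") & row("party")))                     # good AND party
print(detected(row("good") & row("party") & NOT(row("over"))))  # good AND party NOT over
```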

Why Boolean Retrieval Works

• Boolean operators approximate natural language
  – Find documents about a good party that is not over
• AND can discover relationships between concepts
  – good party
• OR can discover alternate terminology
  – excellent party
• NOT can discover alternate meanings
  – Democratic party

The Perfect Query Paradox

• Every information need has a perfect doc set
  – If not, there would be no sense doing retrieval
• Almost every document set has a perfect query
  – AND every word to get a query for document 1
  – Repeat for each document in the set
  – OR every document query to get the set query
• But users find Boolean query formulation hard
  – They get too much, too little, useless stuff, …
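The construction in the second bullet can be written out directly. The sketch below uses a small made-up collection (the `DOCS` dictionary is hypothetical) and two illustrative helpers: `perfect_query` builds the query string, and `matches` evaluates it.

```python
# Hypothetical collection: document number -> set of indexed terms.
DOCS = {
    1: {"quick", "brown", "fox"},
    2: {"good", "party"},
    3: {"lazy", "dog"},
    4: {"good", "party", "over"},
}

def perfect_query(target_ids, doc_terms):
    """AND every word of each target document, then OR the per-document queries."""
    clauses = []
    for doc_id in sorted(target_ids):
        clauses.append("(" + " AND ".join(sorted(doc_terms[doc_id])) + ")")
    return " OR ".join(clauses)

def matches(target_ids, doc_terms):
    """A document satisfies the query if it contains every term of some target document."""
    return {d for d, terms in doc_terms.items()
            if any(doc_terms[t] <= terms for t in target_ids)}

print(perfect_query({1, 3}, DOCS))
print(matches({1, 3}, DOCS))  # exactly the target set
print(matches({2}, DOCS))     # Doc 4 also matches: its terms include all of Doc 2's
```

The last line shows one reason the slide says “almost”: when another document contains every term of a target document, the AND-of-all-words query cannot separate them without adding NOT clauses.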

Why Boolean Retrieval Fails

• Natural language is way more complex
  – She saw the man on the hill with a telescope
• AND “discovers” nonexistent relationships
  – Terms in different paragraphs, chapters, …
• Guessing terminology for OR is hard
  – good, nice, excellent, outstanding, awesome, …
• Guessing terms to exclude is even harder!
  – Democratic party, party to a lawsuit, …

Proximity Operators

• More precise versions of AND
  – “NEAR n” allows at most n-1 intervening terms
  – “WITH” requires terms to be adjacent and in order
• Easy to implement, but less efficient
  – Store a list of positions for each word in each doc
    • Stopwords become very important!
  – Perform normal Boolean computations
    • Treat WITH and NEAR like AND with an extra constraint

Proximity Operator Example

• time AND come – Doc 2
• time (NEAR 2) come – Empty
• quick (NEAR 2) fox – Doc 1
• quick WITH fox – Empty

Term    Doc 1     Doc 2
aid     0         1 (13)
all     0         1 (6)
back    1 (10)    0
brown   1 (3)     0
come    0         1 (10)
dog     1 (9)     0
fox     1 (4)     0
good    0         1 (7)
jump    1 (5)     0
lazy    1 (8)     0
men     0         1 (8)
now     0         1 (1)
over    1 (6)     0
party   0         1 (16)
quick   1 (2)     0
their   0         1 (15)
time    0         1 (4)

(The number in parentheses is the term's word position in the document; stopwords count toward positions.)
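A positional index for the two example documents might look like the sketch below. Several things here are assumptions made for illustration: the function names, dropping the possessive 's before counting positions, and treating NEAR as order-independent (the slides only say WITH is ordered).

```python
import re

docs = {
    1: "The quick brown fox jumped over the lazy dog's back",
    2: "Now is the time for all good men to come to the aid of their party",
}
STOPWORDS = {"the", "is", "for", "to", "of"}

def tokenize(text):
    # Assumption: the possessive 's is dropped so "dog's" occupies one position.
    return re.findall(r"[a-z]+", re.sub(r"'s\b", "", text.lower()))

def build_positional_index(docs):
    """term -> {doc_id -> [positions]}; stopwords are not indexed but still use up a position."""
    index = {}
    for doc_id, text in docs.items():
        for pos, tok in enumerate(tokenize(text), start=1):
            if tok not in STOPWORDS:
                index.setdefault(tok, {}).setdefault(doc_id, []).append(pos)
    return index

def near(index, a, b, n):
    """a (NEAR n) b: AND plus the constraint of at most n-1 intervening terms."""
    pa, pb = index.get(a, {}), index.get(b, {})
    return {d for d in pa.keys() & pb.keys()
            if any(abs(i - j) <= n for i in pa[d] for j in pb[d])}

def with_(index, a, b):
    """a WITH b: AND plus the constraint that b immediately follows a."""
    pa, pb = index.get(a, {}), index.get(b, {})
    return {d for d in pa.keys() & pb.keys()
            if any(j - i == 1 for i in pa[d] for j in pb[d])}

index = build_positional_index(docs)
print(index["time"].keys() & index["come"].keys())  # time AND come
print(near(index, "time", "come", 2))               # time (NEAR 2) come
print(near(index, "quick", "fox", 2))               # quick (NEAR 2) fox
print(with_(index, "quick", "fox"))                 # quick WITH fox
```

Because stopwords still consume positions, removing them from the index does not change the distances between the remaining terms, which is one way to read the remark that stopwords become important.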

Concept Retrieval

• Goal: retrieve using “concepts,” not just words
  – Some words have many meanings (e.g., bank)
    • This is a bigger problem for large diverse collections
  – Some meanings are associated with many words
    • Especially when shades of meaning are unimportant
• This is the holy grail of information retrieval
  – Everyone agrees that it is a good idea
  – But every known approach has some limitations

Controlled Vocabulary Retrieval

• A straightforward concept retrieval approach
  – Works equally well for non-text materials
  – Index terms are a form of meta-data
• Assign a unique “descriptor” to each concept
  – Can be done by hand for collections of limited scope
  – In theory, descriptors are unambiguous
• Assign some descriptors to each document
  – Practical for valuable collections of limited size
• Use Boolean retrieval based on descriptors

Controlled Vocabulary Example

• Canine AND Fox – Doc 1
• Canine AND Political action – Empty
• Canine OR Political action – Doc 1, Doc 2

Document 1: The quick brown fox jumped over the lazy dog’s back. [Canine] [Fox]
Document 2: Now is the time for all good men to come to the aid of their party. [Political action] [Volunteerism]

Descriptor         Doc 1   Doc 2
Canine               1       0
Fox                  1       0
Political action     0       1
Volunteerism         0       1

Thesaurus Design

• Thesauri contain descriptors and relationships
  – Broader term (IS-A), narrower term, used for, …
• Indexers select descriptors for each document
  – Thesaurus must match the document collection
• Searchers select descriptors for each query
  – Thesaurus must match information needs
• Indexers must anticipate searchers’ info needs
  – Or searchers must discern indexers’ perspective
  – Or thesaurus itself must be accessible/browsable
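One possible in-memory shape for such a thesaurus is sketched below. The `Thesaurus` class and its methods are invented for illustration, and the specific relationships (Fox as a narrower term of Canine, “dog” and “hound” as used-for entries) are hypothetical examples rather than entries from a real thesaurus.

```python
class Thesaurus:
    """Descriptors plus broader/narrower links and 'used for' entry terms."""

    def __init__(self):
        self.broader = {}   # descriptor -> set of broader descriptors
        self.narrower = {}  # descriptor -> set of narrower descriptors
        self.use = {}       # entry term (lowercased) -> preferred descriptor

    def add(self, descriptor, broader=(), used_for=()):
        self.broader.setdefault(descriptor, set()).update(broader)
        self.narrower.setdefault(descriptor, set())
        for b in broader:
            self.broader.setdefault(b, set())
            self.narrower.setdefault(b, set()).add(descriptor)
        self.use[descriptor.lower()] = descriptor
        for term in used_for:
            self.use[term.lower()] = descriptor

    def lookup(self, word):
        """Map a searcher's or indexer's word to its descriptor, if any."""
        return self.use.get(word.lower())

    def expand(self, descriptor):
        """The descriptor together with all of its narrower descriptors."""
        seen, stack = set(), [descriptor]
        while stack:
            d = stack.pop()
            if d not in seen:
                seen.add(d)
                stack.extend(self.narrower.get(d, ()))
        return seen

# Hypothetical entries, for illustration only.
t = Thesaurus()
t.add("Canine", used_for=["dog", "hound"])
t.add("Fox", broader=["Canine"])

print(t.lookup("dog"))      # entry term resolved to its descriptor
print(t.expand("Canine"))   # descriptor plus narrower terms
```

A browsable structure like this is one way to support the last bullet: a searcher who types a familiar word can be led to the descriptor the indexers actually used.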

Challenges

• Thesaurus design is expensive
  – Shifting concepts generate continuing expense
• Manual indexing is even more expensive
  – And consistent indexing is very expensive
• User needs are often difficult to anticipate
  – Challenge for thesaurus designers and indexers
• End users find thesauri hard to use
  – Co-design problem with query formulation

Applications

• When implied concepts must be captured
  – Political action, volunteerism, …
• When terminology selection is impractical
  – Searching foreign language materials
• When no words are present
  – Photos w/o captions, videos w/o transcripts, …
• When user needs are easily anticipated
  – Weather reports, yellow pages*, …

*But cf. Bill Woods’ classic example of the paraphrase problem: “car washing” vs. “automobile cleaning”

Machine Assisted Indexing

• Goal: Automatically suggest descriptors
  – Better consistency with lower cost
• Chosen by a rule-based expert system
  – Design thesaurus by hand in the usual way
  – Design an expert system to process text
    • String matching, proximity operators, …
  – Write rules for each thesaurus/collection/language
  – Try it out and fine tune the rules by hand

Machine Assisted Indexing Example

Access Innovations system:

//TEXT: science
IF (all caps)
    USE research policy
    USE community program
ENDIF
IF (near “Technology” AND with “Development”)
    USE community development
    USE development aid
ENDIF

near: within 250 words
with: in the same sentence
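To make the rule concrete, here is a loose Python rendering of it. This is not the Access Innovations rule engine; the function `suggest`, the sentence splitter, and the reading of “all caps” as “the matched word is written in capitals” are all assumptions made for this sketch.

```python
import re

def suggest(text):
    """Suggest descriptors for the text rule 'science' (illustrative semantics only)."""
    suggestions = set()
    # Assumption: sentences end at '.', '!' or '?' followed by white space.
    sentences = re.split(r"(?<=[.!?])\s+", text)
    words = []  # (word, index of the sentence it occurs in)
    for s_idx, sentence in enumerate(sentences):
        for w in re.findall(r"[A-Za-z]+", sentence):
            words.append((w, s_idx))

    for i, (w, s_idx) in enumerate(words):
        if w.lower() != "science":
            continue
        # IF (all caps)
        if w.isupper():
            suggestions.update(["research policy", "community program"])
        # near: within 250 words of the match
        near_tech = any(x == "Technology" for x, _ in words[max(0, i - 250): i + 251])
        # with: in the same sentence as the match
        with_dev = any(x == "Development" and s == s_idx for x, s in words)
        if near_tech and with_dev:
            suggestions.update(["community development", "development aid"])
    return suggestions

print(suggest("SCIENCE funding was debated."))
print(suggest("The Office of Science, Technology and Development met today."))
```

The descriptors are only suggested; in machine assisted indexing a human indexer would still accept or reject them.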

Text Categorization

• Goal: fully automatic descriptor assignment

• Machine learning approach
  – Assign descriptors manually for a “training set”
  – Design a learning algorithm to find and use patterns
    • Bayesian classifier, neural network, genetic algorithm, …
  – Present new documents
    • System assigns descriptors like those in training set

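Supervised Learning

[Figure: Labelled training examples, each a vector of values for features f1 … fN with a class label (v1 … vN with class Cv; w1 … wN with class Cw), are given to a Learner, which produces a Classifier. The Classifier takes a new example x1 … xN with unknown class Cx and outputs a class, Cw in the figure.]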
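As one concrete instance of the learner/classifier split, the sketch below implements a small multinomial naive Bayes classifier with add-one smoothing, one of the Bayesian classifiers mentioned earlier. The training examples are invented for illustration, and the sketch assigns a single descriptor per document, whereas a real indexing task would usually allow several.

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    def fit(self, examples):
        """Learner: examples is a list of (list_of_terms, label) pairs."""
        self.class_counts = Counter()
        self.term_counts = defaultdict(Counter)
        self.vocab = set()
        for terms, label in examples:
            self.class_counts[label] += 1
            self.term_counts[label].update(terms)
            self.vocab.update(terms)
        return self

    def predict(self, terms):
        """Classifier: return the label with the highest log-probability."""
        total = sum(self.class_counts.values())
        best, best_score = None, -math.inf
        for label, n in self.class_counts.items():
            score = math.log(n / total)  # class prior
            denom = sum(self.term_counts[label].values()) + len(self.vocab)
            for t in terms:
                if t in self.vocab:  # terms never seen in training are ignored
                    score += math.log((self.term_counts[label][t] + 1) / denom)
            if score > best_score:
                best, best_score = label, score
        return best

# Hypothetical labelled training examples.
training = [
    (["quick", "brown", "fox", "lazy", "dog"], "Canine"),
    (["dog", "back", "fox"], "Canine"),
    (["good", "men", "aid", "party"], "Political action"),
    (["time", "party", "men"], "Political action"),
]

clf = NaiveBayes().fit(training)
print(clf.predict(["fox", "dog"]))
print(clf.predict(["party", "men"]))
```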

Retrieval vs. Filtering

• Retrospective retrieval: relatively static collection; constant flow of queries

• Information filtering: relatively static profile (query); constant stream of new documents

• Examples:
  – Yahoo categorization of new Web pages (could also be viewed as an ongoing indexing task)
  – Personalized newspaper

Case Study: Individual Inc.

• First of the personalized newspapers (original delivery mechanism: 8am fax)
• Core technology: SMART + extended Boolean
• Key insights:
  – Targeted, industry-specific marketing
  – Large staff of non-technical domain specialists
  – “Building block” Boolean profiles, e.g. (OJ or “orange juice”) and not Simpson
  – Nightly update of profiles based on data stream
  – Inexpensive detection and selection, more costly examination/delivery
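A filtering profile built from reusable Boolean pieces might be sketched as follows. This is not how Individual Inc. implemented its system; the combinators `term`, `any_of`, `all_of`, `not_` and the sample headlines are hypothetical, and only the profile itself comes from the slide.

```python
import re

def term(phrase):
    """Building block: true if the word or phrase occurs in the text."""
    pattern = re.compile(r"\b" + re.escape(phrase) + r"\b", re.IGNORECASE)
    return lambda text: bool(pattern.search(text))

def any_of(*preds):
    return lambda text: any(p(text) for p in preds)

def all_of(*preds):
    return lambda text: all(p(text) for p in preds)

def not_(pred):
    return lambda text: not pred(text)

# (OJ or "orange juice") and not Simpson
orange_juice = any_of(term("OJ"), term("orange juice"))  # reusable block
profile = all_of(orange_juice, not_(term("Simpson")))

def filter_stream(profile, stream):
    """Filtering: the profile stays fixed while new documents keep arriving."""
    for doc in stream:
        if profile(doc):
            yield doc

# Hypothetical incoming headlines.
stream = [
    "Orange juice futures rise after Florida frost",
    "OJ Simpson trial resumes",
    "Coffee prices steady",
]
print(list(filter_stream(profile, stream)))
```

Keeping `orange_juice` as a named block means a domain specialist can reuse it in other profiles or refine it once and have every profile that uses it pick up the change.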

Things to Do This Week

• Homework 1
  – Due next week

• Do the readings

• Note reading list changes

One Minute Paper

• Brief answers, no names, online
  – In your opinion, what is the most important positive and most important negative characteristic of Boolean retrieval? Please provide exactly one of each.
  – What was the muddiest point in today’s lecture?
  – What was the most interesting point in today’s lecture?

• I’ll summarize the answers next class
