The Boolean Retrieval Model LBSC 708A/CMSC 838L Session 2 - September 11, 2001 Philip Resnik.
The Boolean Retrieval Model
LBSC 708A/CMSC 838L
Session 2 - September 11, 2001
Philip Resnik
Agenda
• Questions
• General model for detection
• The “bag of words” representation
• Boolean “free text” retrieval
• Proximity operators
• Controlled vocabulary retrieval
• Automating controlled vocabulary
• Retrieval versus filtering
But First ...
• Rate the textbook reading:
– Was it easy to understand?
– How long did it take you to read?
Retrieval System Model

[Diagram: Source Selection → Query Formulation → Query → Search → Ranked List → Selection → Document → Examination → Document → Delivery, with feedback loops for Query Reformulation and Relevance Feedback and for Source Reselection; Search sits inside the IR System, and the user's actions are annotated as nominate, choose, and predict.]
Search Goal
• Choose the same documents a human would
  – Without human intervention (less work)
  – Faster than a human could (less time)
  – As accurately as possible (less accuracy)
• Humans start with an information need
  – Machines start with a query
• Humans match documents to information needs
  – Machines match document & query representations
Search Component Model

[Diagram: on the query side, an Information Need passes through Query Formulation (guided by Human Judgment) to a Query, and Query Processing applies a Representation Function to produce the Query Representation; on the document side, Document Processing applies a Representation Function to produce the Document Representation; a Comparison Function over the two representations yields the Retrieval Status Value, an estimate of the document's Utility.]
Detection Component Model
• “Retrieval status value” is an estimate of utility
  – Utility: what the user would pay for the document
• A co-design problem
  – Document representation function
  – Query representation function
  – Comparison function
• Boolean “free text” retrieval is one way of allocating functionality to each function
“Bag of Words” Representation
• Bag = multiset: keeps track of members and counts
  – The quick brown fox jumped over the lazy dog’s back
    → {back, brown, dog, fox, jumped, lazy, over, quick, ‘s, the, the}
• A “term” is any lexical item that you choose
  – A fixed-length sequence of characters (an “n-gram”)
  – A word (delimited by “white space” or punctuation)
  – “Root form” of each word (destroyed → destroy)
  – “Stem” of each word (destroyed → destr)
  – A phrase (e.g., phrases listed in a dictionary)
• Counts can be recorded in any consistent order
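In code, the representation is just a multiset of tokens. A minimal Python sketch (the tokenizer, which splits off the possessive clitic as ’s, is an illustrative assumption, not the slides’ exact method):

```python
import re
from collections import Counter

def bag_of_words(text):
    """Multiset (term -> count) of a document; a 'term' here is a
    lowercased word, with the possessive clitic split off as 's."""
    tokens = re.findall(r"'s|[a-z]+", text.lower())
    return Counter(tokens)

bag = bag_of_words("The quick brown fox jumped over the lazy dog's back.")
# Same multiset as the slide: 'the' twice, every other term once.
```

Note that no stopword removal or stemming happens yet; those are choices about what counts as a “term,” made before the bag is built.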
Bag of Words Example

Document 1: The quick brown fox jumped over the lazy dog’s back.
Document 2: Now is the time for all good men to come to the aid of their party.

  Indexed Term   Doc 1   Doc 2
  quick            1       0
  brown            1       0
  fox              1       0
  over             1       0
  lazy             1       0
  dog              1       0
  back             1       0
  now              0       1
  time             0       1
  all              0       1
  good             0       1
  men              0       1
  come             0       1
  jump             1       0
  aid              0       1
  their            0       1
  party            0       1

Stopword List: for, is, of, the, to, ‘s
Boolean “Free Text” Retrieval
• Limit the bag of words to “absent” and “present”
  – “Boolean” values, represented as 0 and 1
• Represent terms as a “bag of documents”
  – Same representation, but rows rather than columns
• Combine the rows using “Boolean operators”
  – AND, OR, NOT
• Any document with a 1 remaining is “detected”
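These rows and operators can be sketched directly with Python sets (the names DOCS, index, and the operator functions are illustrative assumptions; stopword removal and stemming are taken as already done):

```python
# "Bag of documents": each indexed term maps to the set of documents
# containing it; the Boolean operators become set operations on rows.
DOCS = {
    1: "quick brown fox jump over lazy dog back",
    2: "now time all good men come aid their party",
}

index = {}
for doc_id, text in DOCS.items():
    for term in text.split():
        index.setdefault(term, set()).add(doc_id)

def OR(a, b):  return index.get(a, set()) | index.get(b, set())
def AND(a, b): return index.get(a, set()) & index.get(b, set())
def NOT(a, b): return index.get(a, set()) - index.get(b, set())  # a AND NOT b
```

For example, AND("dog", "fox") returns {1}: the set of documents with a 1 remaining, i.e. the “detected” set.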
Boolean Operators
  A  B | A OR B | A AND B | A NOT B (= A AND NOT B)
  0  0 |   0    |    0    |    0
  0  1 |   1    |    0    |    0
  1  0 |   1    |    0    |    1
  1  1 |   1    |    1    |    0

  B | NOT B
  0 |   1
  1 |   0
Boolean Free Text Example
[Term–document incidence matrix for Docs 1–8 over the 17 indexed terms: quick, brown, fox, over, lazy, dog, back, now, time, all, good, men, come, jump, aid, their, party.]

• dog AND fox – Doc 3, Doc 5
• dog NOT fox – Empty
• fox NOT dog – Doc 7
• dog OR fox – Doc 3, Doc 5, Doc 7
• good AND party – Doc 6, Doc 8
• good AND party NOT over – Doc 6
Why Boolean Retrieval Works
• Boolean operators approximate natural language
  – Find documents about a good party that is not over
• AND can discover relationships between concepts
  – good party
• OR can discover alternate terminology
  – excellent party
• NOT can discover alternate meanings
  – Democratic party
The Perfect Query Paradox
• Every information need has a perfect doc set
  – If not, there would be no sense in doing retrieval
• Almost every document set has a perfect query
  – AND every word to get a query for document 1
  – Repeat for each document in the set
  – OR every document query to get the set query
• But users find Boolean query formulation hard
  – They get too much, too little, useless stuff, …
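The recipe above can be written down directly. A sketch (the function name and toy documents are illustrative assumptions):

```python
# The "perfect query" recipe: AND together every distinct word of each
# target document, then OR the per-document conjunctions. (On a real
# collection this breaks whenever some other document happens to
# contain all the words of a target document.)
def perfect_query(doc_ids, docs):
    clauses = []
    for d in doc_ids:
        terms = sorted(set(docs[d].split()))
        clauses.append("(" + " AND ".join(terms) + ")")
    return " OR ".join(clauses)

docs = {1: "good party", 2: "lazy dog"}
query = perfect_query([1, 2], docs)
# -> "(good AND party) OR (dog AND lazy)"
```

The paradox is that although this construction always exists, no user would ever formulate such a query by hand.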
Why Boolean Retrieval Fails
• Natural language is way more complex
  – She saw the man on the hill with a telescope
• AND “discovers” nonexistent relationships
  – Terms in different paragraphs, chapters, …
• Guessing terminology for OR is hard
  – good, nice, excellent, outstanding, awesome, …
• Guessing terms to exclude is even harder!
  – Democratic party, party to a lawsuit, …
Proximity Operators
• More precise versions of AND
  – “NEAR n” allows at most n-1 intervening terms
  – “WITH” requires terms to be adjacent and in order
• Easy to implement, but less efficient
  – Store a list of positions for each word in each doc
  – Perform normal Boolean computations
    • Treat WITH and NEAR like AND with an extra constraint
• Stopwords become very important!
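A minimal positional index along these lines (helper names and the tokenization are illustrative assumptions; positions count every token, so stopwords still occupy position slots):

```python
# Positional index: each term maps to {doc: [positions]}.
# NEAR n allows at most n-1 intervening terms (position gap <= n);
# WITH requires the terms to be adjacent and in order.
def build_index(docs):
    index = {}
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split(), start=1):
            index.setdefault(term, {}).setdefault(doc_id, []).append(pos)
    return index

def near(index, a, b, n):
    both = index.get(a, {}).keys() & index.get(b, {}).keys()
    return {doc for doc in both
            if any(abs(i - j) <= n
                   for i in index[a][doc] for j in index[b][doc])}

def with_(index, a, b):
    both = index.get(a, {}).keys() & index.get(b, {}).keys()
    return {doc for doc in both
            if any(j - i == 1
                   for i in index[a][doc] for j in index[b][doc])}

idx = build_index({
    1: "the quick brown fox jumped over the lazy dog 's back",
    2: "now is the time for all good men to come to the aid of their party",
})
```

With this index, near(idx, "quick", "fox", 2) finds Doc 1 (one intervening term) while with_(idx, "quick", "fox") finds nothing, matching the example on the next slide.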
Proximity Operator Example
• time AND come – Doc 2
• time (NEAR 2) come – Empty
• quick (NEAR 2) fox – Doc 1
• quick WITH fox – Empty

  Term    Doc 1    Doc 2
  quick   1 (2)    0
  brown   1 (3)    0
  fox     1 (4)    0
  over    1 (6)    0
  lazy    1 (8)    0
  dog     1 (9)    0
  back    1 (11)   0
  now     0        1 (1)
  time    0        1 (4)
  all     0        1 (6)
  good    0        1 (7)
  men     0        1 (8)
  come    0        1 (10)
  jump    1 (5)    0
  aid     0        1 (13)
  their   0        1 (15)
  party   0        1 (16)

(Parenthesized numbers are term positions; positions count every token, stopwords included.)
Concept Retrieval
• Goal: retrieve using “concepts,” not just words
  – Some words have many meanings (e.g., bank)
    • This is a bigger problem for large, diverse collections
  – Some meanings are associated with many words
    • Especially when shades of meaning are unimportant
• This is the holy grail of information retrieval
  – Everyone agrees that it is a good idea
  – But every known approach has some limitations
Controlled Vocabulary Retrieval
• A straightforward concept retrieval approach
  – Works equally well for non-text materials
  – Index terms are a form of meta-data
• Assign a unique “descriptor” to each concept
  – Can be done by hand for collections of limited scope
  – In theory, descriptors are unambiguous
• Assign some descriptors to each document
  – Practical for valuable collections of limited size
• Use Boolean retrieval based on descriptors
Controlled Vocabulary Example
• Canine AND Fox – Doc 1
• Canine AND Political action – Empty
• Canine OR Political action – Doc 1, Doc 2

Document 1 [Canine] [Fox]: The quick brown fox jumped over the lazy dog’s back.
Document 2 [Political action] [Volunteerism]: Now is the time for all good men to come to the aid of their party.

  Descriptor         Doc 1   Doc 2
  Canine               1       0
  Fox                  1       0
  Political action     0       1
  Volunteerism         0       1
Thesaurus Design
• Thesauri contain descriptors and relationships
  – Broader term (IS-A), narrower term, used for, …
• Indexers select descriptors for each document
  – Thesaurus must match the document collection
• Searchers select descriptors for each query
  – Thesaurus must match information needs
• Indexers must anticipate searchers’ info needs
  – Or searchers must discern indexers’ perspective
  – Or the thesaurus itself must be accessible/browsable
Challenges
• Thesaurus design is expensive
  – Shifting concepts generate continuing expense
• Manual indexing is even more expensive
  – And consistent indexing is very expensive
• User needs are often difficult to anticipate
  – A challenge for thesaurus designers and indexers
• End users find thesauri hard to use
  – A co-design problem with query formulation
Applications
• When implied concepts must be captured
  – Political action, volunteerism, …
• When terminology selection is impractical
  – Searching foreign-language materials
• When no words are present
  – Photos w/o captions, videos w/o transcripts, …
• When user needs are easily anticipated
  – Weather reports, yellow pages*, …

*But cf. Bill Woods’ classic example of the paraphrase problem: “car washing” vs. “automobile cleaning”
Yahoo

[Slide: screenshot of the Yahoo category hierarchy as an example of browsable controlled vocabulary.]
Machine Assisted Indexing
• Goal: automatically suggest descriptors
  – Better consistency with lower cost
• Chosen by a rule-based expert system
  – Design thesaurus by hand in the usual way
  – Design an expert system to process text
    • String matching, proximity operators, …
  – Write rules for each thesaurus/collection/language
  – Try it out and fine-tune the rules by hand
Machine Assisted Indexing Example
Access Innovations system:

  //TEXT: science
  IF (all caps)
      USE research policy
      USE community program
  ENDIF
  IF (near "Technology" AND with "Development")
      USE community development
      USE development aid
  ENDIF

near: within 250 words
with: in the same sentence
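A toy interpreter in the same spirit (an illustrative sketch, not the actual Access Innovations engine; matching is naive word/substring comparison, and the function names are assumptions):

```python
# Toy rule-based descriptor assignment for the //TEXT: science rules.
# Per the slide: "near" means within 250 words, "with" means in the
# same sentence.
def positions(words, term):
    return [i for i, w in enumerate(words) if w.lower() == term.lower()]

def near(words, a, b, window=250):
    pa, pb = positions(words, a), positions(words, b)
    return any(abs(i - j) <= window for i in pa for j in pb)

def with_(text, a, b):
    # Naive sentence split and substring match.
    return any(a.lower() in s.lower() and b.lower() in s.lower()
               for s in text.split("."))

def index_science(text):
    """Apply the slide's //TEXT: science rules to one document."""
    words = text.split()
    used = []
    if "SCIENCE" in words:                      # IF (all caps)
        used += ["research policy", "community program"]
    if (near(words, "science", "Technology") and
            with_(text, "science", "Development")):
        used += ["community development", "development aid"]
    return used
```

The point of the design is that the rules stay human-readable, so domain specialists can fine-tune them by hand.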
Text Categorization
• Goal: fully automatic descriptor assignment
• Machine learning approach
  – Assign descriptors manually for a “training set”
  – Design a learning algorithm to find and use patterns
    • Bayesian classifier, neural network, genetic algorithm, …
  – Present new documents
    • System assigns descriptors like those in the training set
Supervised Learning

[Diagram: labelled training examples over features f1 f2 f3 f4 … fN (e.g., v1 v2 v3 v4 … vN labelled Cv; w1 w2 w3 w4 … wN labelled Cw) feed a Learner, which produces a Classifier; a new example x1 x2 x3 x4 … xN is assigned a class Cx (here Cw).]
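One of the learning algorithms listed above, a Bayesian classifier, can be sketched in a few lines (the training data, function names, and add-one smoothing are illustrative assumptions):

```python
import math
from collections import Counter, defaultdict

def train(examples):
    """examples: list of (token list, label) pairs."""
    label_counts, word_counts, vocab = Counter(), defaultdict(Counter), set()
    for words, label in examples:
        label_counts[label] += 1
        word_counts[label].update(words)
        vocab.update(words)
    return label_counts, word_counts, vocab

def classify(model, words):
    """Pick the label maximizing log P(label) + sum log P(word|label)."""
    label_counts, word_counts, vocab = model
    total = sum(label_counts.values())
    def score(label):
        n = sum(word_counts[label].values())
        s = math.log(label_counts[label] / total)
        for w in words:  # add-one smoothing over the vocabulary
            s += math.log((word_counts[label][w] + 1) / (n + len(vocab)))
        return s
    return max(label_counts, key=score)

model = train([
    ("quick brown fox".split(), "Canine"),
    ("lazy dog back".split(), "Canine"),
    ("good men party".split(), "Political action"),
    ("come aid party".split(), "Political action"),
])
```

A new document is then assigned whichever descriptor its words most resemble in the training set, e.g. classify(model, ["lazy", "fox"]) yields "Canine".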
Retrieval vs. Filtering
• Retrospective retrieval: relatively static collection; constant flow of queries
• Information filtering: relatively static profile (query); constant stream of new documents
• Examples:
  – Yahoo categorization of new Web pages (could also be viewed as an ongoing indexing task)
  – Personalized newspaper
Case Study: Individual Inc.
• First of the personalized newspapers (original delivery mechanism: 8am fax)
• Core technology: SMART + extended Boolean
• Key insights:
  – Targeted, industry-specific marketing
  – Large staff of non-technical domain specialists
  – “Building block” Boolean profiles
    • e.g. (OJ or “orange juice”) and not Simpson
  – Nightly update of profiles based on data stream
  – Inexpensive detection and selection, more costly examination/delivery
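The building-block profile can be sketched as a standing predicate applied to the incoming document stream (an illustrative sketch with naive keyword/substring matching; the names are assumptions):

```python
# Filtering: the profile stands still while documents stream past.
# Profile from the case study: (OJ or "orange juice") and not Simpson.
def matches_profile(text):
    words = text.lower()
    return (("oj" in words.split() or "orange juice" in words)
            and "simpson" not in words)

stream = [
    "OJ futures rose sharply today",
    "Fresh orange juice sales are up",
    "OJ Simpson trial coverage continues",
]
hits = [doc for doc in stream if matches_profile(doc)]
```

This inverts the retrospective-retrieval picture on the previous slide: detection and selection run cheaply on every incoming document, and only the hits go on to the costly examination/delivery stage.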
Things to Do This Week
• Homework 1
  – Due next week
• Do the readings
• Note reading list changes
One Minute Paper
• Brief answers, no names, online
  – In your opinion, what is the most important positive and most important negative characteristic of Boolean retrieval? Please provide exactly one of each.
  – What was the muddiest point in today’s lecture?
  – What was the most interesting point in today’s lecture?
• I’ll summarize the answers next class