INFM 700: Session 7 Search (Part I) Introduction to Information Retrieval Paul Jacobs The iSchool...
-
Upload
vanessa-robbins -
Category
Documents
-
view
214 -
download
0
description
Transcript of INFM 700: Session 7 Search (Part I) Introduction to Information Retrieval Paul Jacobs The iSchool...
INFM 700: Session 7Search (Part I)Introduction to Information Retrieval
Paul JacobsThe iSchoolUniversity of Maryland
Monday, November 9, 2009
This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United StatesSee http://creativecommons.org/licenses/by-nc-sa/3.0/us/ for details
iSchool
Goals for Search Sessions Understand the basic issues in information
retrieval (searching primarily unstructured text) Know the techniques generally used by modern
search engines
Learn how search engines can be used most effectively in information architecture
iSchool
Today’s Topics Introduction to Information Retrieval Keywords, inverted indices, and Boolean retrieval
The vector space model, ranked retrieval
Major issues
Some additional tricks Examples: web search and site search
IR Intro
Boolean
Vector Space
Issues & Tricks
iSchool
Levels of Structure Different types of data
Structured data Semi-structured data Unstructured data
How do you provide access to unstructured data? Manually develop an organization system (add
structure) Provide search capabilities
IR Intro
Boolean
Vector Space
Issues & Tricks
iSchool
What is search? Search is query-based access
How is this different from browsing?
Things one can search on: Content Metadata Organization systems Labels …
IR Intro
Boolean
Vector Space
Issues & Tricks
iSchool
Some Key Concepts Different search paradigms
Boolean, “keyword” “Natural language” or “free text” (full text) search Current search engines are primarily full text and
statistical
The fundamental challenge: words & concepts
The basic method: weighting and context
Other tricks (there are many!) Structuring Popularity and importance (of pages, documents) Metadata and thesauri User feedback
IR Intro
Boolean
Vector Space
Issues & Tricks
iSchool
The Central Problem in IR
SearcherAuthors
Concepts Concepts
Query Documents
Do these represent the same concepts?
IR Intro
Boolean
Vector Space
Issues & Tricks
iSchool
Architecture of IR SystemsDocumentsQuery
Hits
RepresentationFunction
RepresentationFunction
Query Representation Document Representation
ComparisonFunction Index
offlineonline
IR Intro
Boolean
Vector Space
Issues & Tricks
iSchool
How do we represent text? Remember: computers don’t “understand”
documents or queries Simple, yet effective approach: “bag of words”
Treat all the words in a document as index terms Assign a “weight” to each term based on “importance” Disregard order, structure, meaning, etc. of the words
Assumptions Term occurrence is independent (of other terms) Document relevance is independent (of other
documents) “Words” can be defined
IR Intro
Boolean
Vector Space
Issues & Tricks
iSchool
What’s a word?天主教教宗若望保祿二世因感冒再度住進醫院。這是他今年第二度因同樣的病因住院。 - باسم الناطق ريجيف مارك وقال
قبل - شارون إن اإلسرائيلية الخارجيةبزيارة األولى للمرة وسيقوم الدعوة
المقر طويلة لفترة كانت التي تونس،لبنان من خروجها بعد الفلسطينية التحرير لمنظمة الرسمي
1982عام . Выступая в Мещанском суде Москвы экс-глава ЮКОСа заявил не совершал ничего противозаконного, в чем обвиняет его генпрокуратура России.
भारत सरकार ने आर्थि� क सर्वे�क्षण में विर्वेत्तीय र्वेर्ष� 2005-06 में सात फ़ीसदी विर्वेकास दर हासिसल करने का आकलन विकया है और कर सुधार पर ज़ोर दिदया है
日米連合で台頭中国に対処…アーミテージ前副長官提言 조재영 기자 = 서울시는 25 일 이명박 시장이 ` 행정중심복합도시 '' 건설안에 대해 ` 군대라도 동원해 막고싶은 심정 '' 이라고 말했다는 일부 언론의 보도를 부인했다 .
IR Intro
Boolean
Vector Space
Issues & Tricks
iSchool
Sample DocumentMcDonald's slims down spudsFast-food chain to reduce certain types of fat in its french fries with new cooking oil.NEW YORK (CNN/Money) - McDonald's Corp. is cutting the amount of "bad" fat in its french fries nearly in half, the fast-food chain said Tuesday as it moves to make all its fried menu items healthier.
But does that mean the popular shoestring fries won't taste the same? The company says no. "It's a win-win for our customers because they are getting the same great french-fry taste along with an even healthier nutrition profile," said Mike Roberts, president of McDonald's USA.
But others are not so sure. McDonald's will not specifically discuss the kind of oil it plans to use, but at least one nutrition expert says playing with the formula could mean a different taste.
Shares of Oak Brook, Ill.-based McDonald's (MCD: down $0.54 to $23.22, Research, Estimates) were lower Tuesday afternoon. It was unclear Tuesday whether competitors Burger King and Wendy's International (WEN: down $0.80 to $34.91, Research, Estimates) would follow suit. Neither company could immediately be reached for comment.
…
14 × McDonald’s
12 × fat
11 × fries
8 × new
6 × company, french, nutrition
5 × food, oil, percent, reduce, taste, Tuesday
…
“Bag of Words”
IR Intro
Boolean
Vector Space
Issues & Tricks
iSchool
Why does “bag of words” work (at all)? Words alone tell us a lot about content! Words are our main tool for describing concepts
Words in context are especially powerful
Getting beyond words is hard
Structure usually (but not always) can be guessed from content “355 back correction Dow pulls signaling” “blind Venetian” vs. “Venetian blind”
IR Intro
Boolean
Vector Space
Issues & Tricks
iSchool
Boolean Retrieval Users express queries as a Boolean (logical)
expression “terms” (usually words or phrases) joined by AND, OR,
NOT Can be arbitrarily nested
Difference between “term” and “keyword”?
Retrieval is based on the notion of sets Any given query divides the collection into two sets:
retrieved, not-retrieved (complement) Pure Boolean systems do not define an ordering of the
results (no ranking)
IR Intro
Boolean
Vector Space
Issues & Tricks
iSchool
AND/OR/NOT
A B
All documents
C
IR Intro
Boolean
Vector Space
Issues & Tricks
iSchool
Logic Tables
A OR B
A AND B A NOT B
NOT B
0 1
1 1
0 1
0
1
AB
(= A AND NOT B)
0 0
0 1
0 1
0
1
AB
0 0
1 0
0 1
0
1
AB
1 0
0 1B
IR Intro
Boolean
Vector Space
Issues & Tricks
iSchool
Representing Documents
The quick brown fox jumped over the lazy dog’s back.
Document 1
Document 2
Now is the time for all good men to come to the aid of their party.
the
isfor
to
of
quick
brown
fox
over
lazy
dog
back
now
time
all
good
men
come
jump
aid
their
party
00110110110010100
11001001001101011
Term Doc
umen
t 1
Doc
umen
t 2
Stopword List
IR Intro
Boolean
Vector Space
Issues & Tricks
iSchool
Boolean View of a Collection
quick
brown
fox
over
lazy
dog
back
now
time
all
good
men
come
jump
aid
their
party
00110000010010110
01001001001100001
Term
Doc
1D
oc 2
00110110110010100
11001001001000001
Doc
3D
oc 4
00010110010010010
01001001000101001
Doc
5D
oc 6
00110010010010010
10001001001111000
Doc
7D
oc 8
Each column represents the view of a particular document: What terms are contained in this document?
Each row represents the view of a particular term: What documents contain this term?
To execute a query, pick out rows corresponding to query terms and then apply logic table of corresponding Boolean operator
IR Intro
Boolean
Vector Space
Issues & Tricks
iSchool
Sample Queries
foxdog 0
000
11
00
11
00
01
00
Term
Doc
1D
oc 2
Doc
3D
oc 4
Doc
5D
oc 6
Doc
7D
oc 8
dog fox 0 0 1 0 1 0 0 0
dog fox 0 0 1 0 1 0 1 0
dog fox 0 0 0 0 0 0 0 0
fox dog 0 0 0 0 0 0 1 0
dog AND fox Doc 3, Doc 5
dog OR fox Doc 3, Doc 5, Doc 7
dog NOT fox empty
fox NOT dog Doc 7
goodparty
00
10
00
10
00
11
00
11
g p 0 0 0 0 0 1 0 1
g p o 0 0 0 0 0 1 0 0
good AND party Doc 6, Doc 8over 1 0 1 0 1 0 1 1
good AND party NOT over Doc 6
Term
Doc
1D
oc 2
Doc
3D
oc 4
Doc
5D
oc 6
Doc
7D
oc 8
IR Intro
Boolean
Vector Space
Issues & Tricks
iSchool
Inverted Index
quick
brown
fox
over
lazy
dog
back
now
time
all
good
men
come
jump
aid
their
party
00110000010010110
01001001001100001
Term
Doc
1D
oc 2
00110110110010100
11001001001000001
Doc
3D
oc 4
00010110010010010
01001001000101001
Doc
5D
oc 6
00110010010010010
10001001001111000
Doc
7D
oc 8
quick
brown
fox
over
lazy
dog
back
now
time
all
good
men
come
jump
aid
their
party
4 82 4 61 3 71 3 5 72 4 6 83 53 5 72 4 6 831 3 5 7
1 3 5 7 8
2 4 82 6 8
1 5 72 4 6
1 36 8
Term Postings
IR Intro
Boolean
Vector Space
Issues & Tricks
iSchool
Boolean Retrieval To execute a Boolean query:
Build query syntax tree
For each clause, look up postings
Traverse postings and apply Boolean operator
Efficiency analysis Postings traversal is linear (assuming sorted postings) Start with shortest posting first
( fox or dog ) and quick
fox dog
ORquick
AND
foxdog 3 5
3 5 7
foxdog 3 5
3 5 7OR = union 3 5 7IR Intro
Boolean
Vector Space
Issues & Tricks
iSchool
Why Boolean Retrieval Works Boolean operators approximate concepts How so?
AND can identify relationships between concepts• (e.g., interest rate, web design)
OR can identify alternate terminology• (e.g., interest percentage, HTML layout, etc.)
NOT can filter alternate meanings• (e.g., conflict AND interest AND NOT rate, NOT spider)
IR Intro
Boolean
Vector Space
Issues & Tricks
iSchool
Why Boolean Retrieval Fails It’s really hard to come up with the “right” queries Casual searchers have difficulty with the logic
Some concepts are just hard to express, e.g. “corporate mergers & acquisitions” – IBM acquired Lotus
Relevance is not absolute, some documents are more relevant, or more helpful, than others
IR Intro
Boolean
Vector Space
Issues & Tricks
iSchool
Ranked Retrieval in the Vector Space Model Order documents by how likely they are to be
relevant to the information need Estimate relevance(q, di) Sort documents by relevance Display sorted results, usually one screen at a time
How do we estimate relevance? Assume that document d is relevant to query q if they
share terms in common Replace relevance(q, di) with sim(q, di) (similarity) Compute similarity of vector representations
IR Intro
Boolean
Vector Space
Issues & Tricks
iSchool
Vector Representation “Bags of words” can be represented as vectors
Why? Computational efficiency, ease of manipulation Geometric metaphor: “arrows”
A vector is a set of values recorded in any consistent order
“The quick brown fox jumped over the lazy dog’s back”
[ 1 1 1 1 1 1 1 1 2 ]
1st position corresponds to “back”2nd position corresponds to “brown”3rd position corresponds to “dog”4th position corresponds to “fox”5th position corresponds to “jump”6th position corresponds to “lazy”7th position corresponds to “over”8th position corresponds to “quick”9th position corresponds to “the”
IR Intro
Boolean
Vector Space
Issues & Tricks
iSchool
Vector Space Model
Assumption: Documents that are “close together” in vector space “talk about” the same things
t1
d2
d1
d3
d4
d5
t3
t2
θ
φ
Therefore, retrieve documents based on how close the document is to the query (i.e., similarity ~ “closeness”)
IR Intro
Boolean
Vector Space
Issues & Tricks
iSchool
Similarity Metric How about |d1 – d2|?
Instead of Euclidean distance, use “angle” between the vectors It all boils down to the inner product (dot product) of
vectors
kj
kj
dd
dd
)cos(
n
i kin
i ji
n
i kiji
kj
kjkj
ww
ww
dd
ddddsim
12,1
2,
1 ,,),(
IR Intro
Boolean
Vector Space
Issues & Tricks
iSchool
Components of Similarity The “inner product” (aka dot product) is the key to
the similarity function
The denominator handles document length normalization
n
i kijikj wwdd1 ,,
n
i kij wd1
2,
24.41840941
20321
92200130221
2010220321
Example:
Example:
IR Intro
Boolean
Vector Space
Issues & Tricks
iSchool
Term Weighting Term weights consist of two components
Local: how important is the term in this doc? Global: how important is the term in the collection?
Here’s the intuition: Terms that appear often in a document should get high
weights Terms that appear in many documents should get low
weights
How do we capture this mathematically? Term frequency (local) Inverse document frequency (global)
IR Intro
Boolean
Vector Space
Issues & Tricks
iSchool
TF.IDF Term Weighting
ijiji n
Nw logtf ,,
jiw ,
ji,tf
N
in
weight assigned to term i in document j
number of occurrence of term i in document j
number of documents in entire collection
number of documents with term iIR Intro
Boolean
Vector Space
Issues & Tricks
iSchool
TF.IDF Example
4
5
6
3
1
3
1
6
5
3
4
3
7
1
2
1 2 3
2
3
2
4
4
0.301
0.125
0.125
0.125
0.602
0.301
0.000
0.602
tfidf
complicated
contaminated
fallout
information
interesting
nuclear
retrieval
siberia
1,4
1,5
1,6
1,3
2,1
2,1
2,6
3,5
3,3
3,4
1,2
0.301
0.125
0.125
0.125
0.602
0.301
0.000
0.602
complicated
contaminated
fallout
information
interesting
nuclear
retrieval
siberia
4,2
4,3
2,3 3,3 4,2
3,7
3,1 4,4
IR Intro
Boolean
Vector Space
Issues & Tricks
iSchool
Document Scoring Algorithm Initialize accumulators to hold document scores For each query term t in the user’s query
Fetch t’s postings For each document, scoredoc += wt,d wt,q
Apply length normalization to the scores at end
Return top N documents
IR Intro
Boolean
Vector Space
Issues & Tricks
iSchool
Summary thus far… Represent documents (and queries) as “bags of
words” (terms) Derive term weights based on frequency
Use weighted term vectors for each document, query
Compute a vector-based similarity score
Display sorted, ranked resultsIR Intro
Boolean
Vector Space
Issues & Tricks
iSchool
Issues and Tricks What’s a word/term?
We can ignore words (“stop words”), combine (phrases), split up (“stem”) words
Other special treatment (e.g. names, categories)
Query formulation/suggestion
Type of information need
Popularity Based on link analysis/page rank Based on click through, other
Structuring and tagging (e.g., “best bets”)
IR Intro
Boolean
Vector Space
Issues & Tricks
iSchool
Issues and Tricks (cont’d) Thesaurus/query expansion
Based on meaning, conceptual relationships Based on decomposition/type
User feedback/”More like this”
Clustering/grouping of results
IR Intro
Boolean
Vector Space
Issues & Tricks
iSchool
Morphological Variation Handling morphology: related concepts have
different forms Inflectional morphology: same part of speech
Derivational morphology: different parts of speech
Different morphological processes: Prefixing Suffixing Infixing Reduplication
dogs = dog + PLURALbroke = break + PAST
destruction = destroy + ionresearcher = research + er
IR Intro
Boolean
Vector Space
Issues & Tricks
iSchool
Stemming Dealing with morphological variation: index stems
instead of words Stem: a word equivalence class that preserves the
central concept
How much to stem? organization organize organ? resubmission resubmit/submission submit? reconstructionism?
IR Intro
Boolean
Vector Space
Issues & Tricks
iSchool
Does Stemming Work? Generally, yes! (in English)
Helps more for longer queries, fewer results Lots of work done in this area
But used very sparingly in web search – why?
Donna Harman (1991) How Effective is Suffixing? Journal of the American Society for Information Science, 42(1):7-15.
Robert Krovetz. (1993) Viewing Morphology as an Inference Process. Proceedings of SIGIR 1993.
David A. Hull. (1996) Stemming Algorithms: A Case Study for Detailed Evaluation. Journal of the American Society for Information Science, 47(1):70-84.
And others…
IR Intro
Boolean
Vector Space
Issues & Tricks
iSchool
Beyond Words… Stemming/tokenization = specific instance of a
general problem: what is it? Other units of indexing
Concepts (e.g., from WordNet) Named entities Relations …
IR Intro
Boolean
Vector Space
Issues & Tricks
iSchool
Recap Introduction to Information Retrieval Boolean retrieval
Ranked retrieval – term weighting, the vector space model
Advanced methods, things to think about
Next time: Deploying search enginesIR Intro
Boolean
Vector Space
Issues & Tricks