Information Retrieval Systems
-
Introduction to Information Retrieval
Information Retrieval and Data Mining
(AT71.07)
Comp. Sc. and Inf. Mgmt.
Asian Institute of Technology
Instructor: Dr. Sumanta Guha
Slide sources: Introduction to Information Retrieval book slides from Stanford University, adapted and supplemented
Chapter 1: Boolean retrieval
-
Introduction to Information Retrieval
CS276: Information Retrieval and Web Search
Christopher Manning and Prabhakar Raghavan
Lecture 1: Boolean retrieval
-
Information Retrieval
Information Retrieval (IR) is finding material (usually
documents) of an unstructured nature (usually text)
that satisfies an information need from within large
collections (usually stored on computers).
-
Unstructured (text) vs. structured
(database) data in 1996
-
Unstructured (text) vs. structured
(database) data in 2009
-
Three Scales of the IR problem
Massive: Web search over billions of documents!
Large: Specific to an enterprise, institution or domain
Small: Personal information retrieval, e.g., searching one's email archives
Each level has its own distinctive issues!
-
Unstructured data in 1680
Which plays of Shakespeare contain the words Brutus AND Caesar but NOT Calpurnia?
One could grep all of Shakespeare's plays for Brutus and Caesar, then strip out lines containing Calpurnia. Why is that not the answer?
Slow (for large corpora)
NOT Calpurnia is non-trivial
Other operations (e.g., find the word Romans near
countrymen) not feasible
Ranked retrieval (best documents to return)
Later lectures
Sec. 1.1
E.g., googling institute technology returns docs with institute and technology either adjacent or near (check the first 40 docs!).
-
Term-document incidence matrix
Plays (columns): Antony and Cleopatra, Julius Caesar, The Tempest, Hamlet, Othello, Macbeth
Antony     1 1 0 0 0 1
Brutus     1 1 0 1 0 0
Caesar     1 1 0 1 1 1
Calpurnia  0 1 0 0 0 0
Cleopatra  1 0 0 0 0 0
mercy      1 0 1 1 1 1
worser     1 0 1 1 1 0
Entry is 1 if the play contains the word, 0 otherwise.
Query: Brutus AND Caesar BUT NOT Calpurnia
Sec. 1.1
-
Incidence vectors
So we have a 0/1 vector for each term.
To answer the query: take the vectors for Brutus, Caesar, and Calpurnia (complemented), then bitwise AND them:
110100 AND 110111 AND 101111 = 100100.
Sec. 1.1
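To make the bitwise AND concrete, here is a minimal Python sketch of the same query run against the incidence matrix above (the play names and 0/1 vectors come from that matrix; the answer helper is illustrative, not part of the original slides):

```python
# Minimal sketch: Boolean retrieval with 0/1 term-document incidence vectors.
plays = ["Antony and Cleopatra", "Julius Caesar", "The Tempest",
         "Hamlet", "Othello", "Macbeth"]

# One incidence vector per term; columns follow the play order above.
incidence = {
    "Brutus":    [1, 1, 0, 1, 0, 0],
    "Caesar":    [1, 1, 0, 1, 1, 1],
    "Calpurnia": [0, 1, 0, 0, 0, 0],
}

def answer(query_terms, negated_terms):
    """Bitwise AND of the vectors of query_terms and the complements of negated_terms."""
    result = [1] * len(plays)
    for t in query_terms:
        result = [r & v for r, v in zip(result, incidence[t])]
    for t in negated_terms:
        result = [r & (1 - v) for r, v in zip(result, incidence[t])]
    return [play for play, r in zip(plays, result) if r]

# Brutus AND Caesar AND NOT Calpurnia -> ['Antony and Cleopatra', 'Hamlet']
print(answer(["Brutus", "Caesar"], ["Calpurnia"]))
```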
-
Answers to query
Antony and Cleopatra, Act III, Scene ii
Agrippa [Aside to Domitius Enobarbus]: Why, Enobarbus,
When Antony found Julius Caesar dead,
He cried almost to roaring; and he wept
When at Philippi he found Brutus slain.
Hamlet, Act III, Scene ii
Lord Polonius: I did enact Julius Caesar: I was killed i' the Capitol; Brutus killed me.
Sec. 1.1
-
Basic assumptions of Information Retrieval
Collection: Fixed set of documents
Goal: Retrieve documents with information that is relevant to the user's information need and helps the user complete a task
Sec. 1.1
-
The classic search model
Task → Info need → Query (verbal form) → Search engine (over a corpus) → Results, with query refinement feeding back from results to query.
Example:
  Task: get rid of mice in a politically correct way
  Info need: info about removing mice without killing them
  Verbal form: how do I trap mice alive?
  Query: mouse trap
What can go wrong between the steps: misconception? mistranslation? misformulation?
-
How good are the retrieved docs?
Precision: fraction of retrieved docs that are relevant to the user's information need
= #(retrieved AND relevant) / #retrieved
Recall: fraction of relevant docs in the collection that are retrieved
= #(retrieved AND relevant) / #relevant
Sec. 1.1
[Venn diagram: the set of retrieved docs and the set of relevant docs overlap in the region of retrieved AND relevant docs.]
Docs: Apple, Blackberry, Ford, Nokia, Samsung, Toyota
Query: mobile phone
Returned docs:
Precision: ?
Recall: ?
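As a quick check of the two definitions, a minimal Python sketch; the retrieved and relevant sets below are assumptions chosen for illustration, since the slide leaves the returned docs open as an exercise:

```python
# Minimal sketch: precision and recall for a single query.
def precision(retrieved, relevant):
    return len(retrieved & relevant) / len(retrieved) if retrieved else 0.0

def recall(retrieved, relevant):
    return len(retrieved & relevant) / len(relevant) if relevant else 0.0

retrieved = {"Apple", "Blackberry", "Ford"}               # assumed system output
relevant  = {"Apple", "Blackberry", "Nokia", "Samsung"}   # assumed relevance judgments

print(precision(retrieved, relevant))  # 2/3 ~= 0.67
print(recall(retrieved, relevant))     # 2/4 = 0.5
```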
-
Bigger collections
Consider N = 1 million documents, each with about 1000 words.
Avg 6 bytes/word including spaces/punctuation
=> 10^6 docs x 10^3 words x 6 bytes = 6 GB of data in the documents.
Say there are M = 500K distinct terms among these.
Sec. 1.1
-
Can't build the matrix
A 500K x 1M matrix has half a trillion 0s and 1s.
But it has no more than one billion 1s. Why? The collection has only 10^6 x 10^3 = 10^9 word occurrences, and each occurrence contributes at most one 1.
So the matrix is extremely sparse.
What's a better representation?
We only record the 1 positions.
Sec. 1.1
-
Inverted index
For each term t, we must store a list of all documents
that contain t.
Identify each by a docID, a document serial number
Can we use fixed-size arrays for this?
Brutus    → 1 2 4 11 31 45 173 174
Caesar    → 1 2 4 5 6 16 57 132
Calpurnia → 2 31 54 101
What happens if the word Caesar is added to document 14?
Sec. 1.2
-
Inverted index
We need variable-size postings lists
On disk, a continuous run of postings is normal and best
In memory, can use linked lists or variable length arrays
Some tradeoffs in size/ease of insertion
Dictionary → Postings (each item in a postings list is called a posting)
Postings sorted by docID (more later on why).
Brutus    → 1 2 4 11 31 45 173 174
Caesar    → 1 2 4 5 6 16 57 132
Calpurnia → 2 31 54 101
Sec. 1.2
-
Inverted index construction
Documents to be indexed: Friends, Romans, countrymen.
→ Tokenizer → token stream: Friends Romans Countrymen
→ Linguistic modules → modified tokens: friend roman countryman (more on these later)
→ Indexer → inverted index:
   friend      → 2 4
   roman       → 1 2
   countryman  → 13 16
Sec. 1.2
Apache Lucene Core (http://lucene.apache.org/) has an open-source indexer. Please try it out!
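A minimal Python sketch of this tokenizer → linguistic modules → indexer pipeline (lowercasing stands in for real linguistic modules such as stemming; the two documents are the Julius Caesar snippets used on the following slides):

```python
# Minimal sketch: documents -> tokenizer -> normalization -> inverted index.
import re
from collections import defaultdict

docs = {
    1: "I did enact Julius Caesar: I was killed i' the Capitol; Brutus killed me.",
    2: "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious.",
}

def tokenize(text):
    return re.findall(r"[A-Za-z']+", text)      # token stream

def normalize(token):
    return token.lower()                        # stand-in for the linguistic modules

index = defaultdict(list)                       # term -> postings list of docIDs
for doc_id in sorted(docs):                     # visiting docs in order keeps postings sorted
    for term in sorted({normalize(t) for t in tokenize(docs[doc_id])}):
        index[term].append(doc_id)

print(index["brutus"])   # [1, 2]
print(index["capitol"])  # [1]
```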
-
Indexer steps: Token sequence
Sequence of (Modified token, Document ID) pairs.
Doc 1: I did enact Julius Caesar: I was killed i' the Capitol; Brutus killed me.
Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious.
Sec. 1.2
-
Indexer steps: Sort
Sort by terms, and then by docID
Core indexing step
Sec. 1.2
-
Indexer steps: Dictionary & Postings
Multiple term entries in a single document are merged.
Split into Dictionary and Postings.
Document frequency information is added.
Why frequency?
Will discuss later.
Sec. 1.2
Often the term frequency is also stored, e.g., 2 for I in doc 1, so the posting would be (1, 2).
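A compact Python sketch of these three indexer steps, run on pre-tokenized versions of Doc 1 and Doc 2 (tokenization is reduced to whitespace splitting; the names are illustrative):

```python
# Minimal sketch of the indexer steps: emit (term, docID) pairs, sort them,
# then merge into a dictionary (with document frequency) and postings
# lists of (docID, term frequency) pairs.
from collections import Counter
from itertools import groupby

docs = {
    1: "i did enact julius caesar i was killed i the capitol brutus killed me",
    2: "so let it be with caesar the noble brutus hath told you caesar was ambitious",
}

# Step 1: token sequence of (term, docID) pairs.
pairs = [(term, doc_id) for doc_id, text in docs.items() for term in text.split()]

# Step 2: sort by term, then by docID (the core indexing step).
pairs.sort()

# Step 3: merge duplicates into dictionary and postings.
dictionary, postings = {}, {}
for term, group in groupby(pairs, key=lambda pair: pair[0]):
    tf = Counter(doc_id for _, doc_id in group)   # term frequency per doc
    dictionary[term] = len(tf)                    # document frequency
    postings[term] = sorted(tf.items())           # [(docID, tf), ...]

print(dictionary["caesar"], postings["caesar"])   # 2 [(1, 1), (2, 2)]
print(dictionary["i"], postings["i"])             # 1 [(1, 3)]
```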
-
Where do we pay in storage?
Dictionary: terms and counts. Postings: lists of docIDs, reached via pointers from the dictionary.
Later in the course: How do we index efficiently? How much storage do we need?
Sec. 1.2
-
The index we just built
How do we process a query? (Today's focus.)
Later: what kinds of queries can we process?
Sec. 1.3
-
Processing Boolean queries
Exercise 1.1: Draw the inverted index that would
be built for the following document collection:
Doc 1 new home sales top forecasts
Doc 2 home sales top rise in july
Doc 3 increase in home sales in july
Doc 4 july new home sales rise
Sec. 1.3
-
Processing Boolean queries
Exercise 1.2: Consider these documents:
Doc 1 breakthrough drug for schizophrenia
Doc 2 new schizophrenia drug
Doc 3 new approach for treatment of schizophrenia
Doc 4 new hopes for schizophrenia patients
a. Draw the term-document incidence matrix for this document collection.
b. Draw the inverted index representation for this collection.
Exercise 1.3: For the document collection in Ex. 1.2, what are the returned results for the following queries?
a. schizophrenia AND drug
b. for AND NOT (drug OR approach)
Sec. 1.3
-
Query processing: AND
Consider processing the query:
Brutus AND Caesar
Locate Brutus in the Dictionary; retrieve its postings.
Locate Caesar in the Dictionary; retrieve its postings.
Merge the two postings:
Brutus → 2 4 8 16 32 64 128
Caesar → 1 2 3 5 8 13 21 34
Sec. 1.3
-
The merge
Walk through the two postings simultaneously, in
time linear in the total number of postings entries
Brutus → 2 4 8 16 32 64 128
Caesar → 1 2 3 5 8 13 21 34
Intersection → 2 8
If the list lengths are x and y, the merge takes O(x+y) operations.
Crucial: postings sorted by docID.
Sec. 1.3
-
Intersecting two postings lists
(a merge algorithm)
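The pseudocode figure on this slide did not survive the export; below is a minimal Python sketch of the standard two-pointer merge it describes, run on the example postings above:

```python
# Intersect two postings lists sorted by docID in O(x + y) time.
def intersect(p1, p2):
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])   # docID occurs in both lists
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1                 # advance the pointer at the smaller docID
        else:
            j += 1
    return answer

brutus = [2, 4, 8, 16, 32, 64, 128]
caesar = [1, 2, 3, 5, 8, 13, 21, 34]
print(intersect(brutus, caesar))   # [2, 8]
```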
-
Boolean queries: Exact match
The Boolean retrieval model answers any query that is a Boolean expression:
Boolean queries use AND, OR and NOT to join query terms
Views each document as a set of words
Is precise: document matches condition or not.
Perhaps the simplest model to build an IR system on
Primary commercial retrieval tool for 3 decades.
Many search systems you still use are Boolean:
Email, library catalog, Mac OS X Spotlight
-
Extended Boolean Model
Example: WestLaw http://www.westlaw.com/
Largest commercial (paying subscribers) legal
search service (started 1975; ranking added
1992)
Tens of terabytes of data; 700,000 users
Majority of users still use Boolean queries
Example query:
What is the statute of limitations in cases involving the federal tort claims act?
LIMIT! /3 STATUTE ACTION /S FEDERAL /2 TORT /3 CLAIM
/3 = within 3 words, /S = in same sentence
-
Example: WestLaw http://www.westlaw.com/
Another example query:
Requirements for disabled people to be able to access a
workplace
disabl! /p access! /s work-site work-place (employment /3 place)
Note that SPACE is disjunction, not conjunction!
Long, precise queries; proximity operators;
incrementally developed; not like web search
Many professional searchers still like Boolean search
You know exactly what you are getting
But that doesn't mean it actually works better.
-
Boolean queries:
More general merges
Exercise 1.4: Adapt the merge for the queries:
Brutus OR Caesar
Brutus AND NOT Caesar
Brutus OR NOT Caesar
Write algorithms for queries of the above type, just like
the algorithm for INTERSECTION given earlier. Is the
running time still O(x+y)?
-
Merging
Exercise 1.5: What about an arbitrary Boolean formula?
(Brutus OR Caesar) AND NOT (Antony OR Cleopatra)
Can we always merge in linear time?
Linear in what?
Can we do better?
Consider the case that one
postings list is much shorter
than the one with which it is
to be merged. (Hint: Binary search)
-
Merging
Exercise 1.6: What about an arbitrary Boolean formula?
(Brutus OR Caesar) AND NOT (Antony OR Cleopatra)
Rewrite the query above into disjunctive normal
form.
Would the resulting query be more or less efficiently
evaluated than the original form?
Disjunctive normal form of a Boolean expression, examples:
X
X AND Y
(X AND Y) OR (X AND Z)
X OR (Y AND NOT Z) OR (X AND Z)
-
Query optimization
What is the best order for query processing?
Consider a query that is an AND of n terms.
For each of the n terms, get its postings, then AND them together.
Brutus    → 2 4 8 16 32 64 128
Caesar    → 1 2 3 5 8 16 21 34
Calpurnia → 13 16
Query: Brutus AND Calpurnia AND Caesar
Sec. 1.3
-
Query optimization example
Process in order of increasing freq:
start with smallest set, then keep cutting further.
This is why we kept document freq. in the dictionary.
Execute the query as (Calpurnia AND Brutus) AND Caesar.
Brutus    (freq 7) → 2 4 8 16 32 64 128
Caesar    (freq 8) → 1 2 3 5 8 16 21 34
Calpurnia (freq 2) → 13 16
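A minimal Python sketch of this ordering heuristic, using the postings and document frequencies from the figure above (set intersection stands in for the sorted-list merge shown earlier):

```python
# Minimal sketch: answer an AND query by intersecting postings in order of
# increasing document frequency, so intermediate results stay small.
postings = {
    "Brutus":    [2, 4, 8, 16, 32, 64, 128],
    "Caesar":    [1, 2, 3, 5, 8, 16, 21, 34],
    "Calpurnia": [13, 16],
}

def and_query(terms):
    ordered = sorted(terms, key=lambda t: len(postings[t]))  # smallest doc freq first
    result = set(postings[ordered[0]])
    for t in ordered[1:]:
        result &= set(postings[t])   # a real system would use the sorted-list merge
        if not result:               # stop early once the intersection is empty
            break
    return sorted(result)

print(and_query(["Brutus", "Calpurnia", "Caesar"]))  # [16]
```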
-
More general optimization
e.g., (madding OR crowd) AND (ignoble OR
strife)
Get doc. freq.s for all terms.
Estimate the size of each OR by the sum of its
doc. freq.s (conservative).
Process in increasing order of OR sizes.
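A minimal sketch of this estimate in Python; the document frequencies below are made-up values for illustration only:

```python
# Order the conjuncts of (madding OR crowd) AND (ignoble OR strife) by a
# conservative estimate of each OR's result size: the sum of its terms'
# document frequencies (an upper bound on the size of the union).
df = {"madding": 10, "crowd": 500, "ignoble": 5, "strife": 100}  # assumed values

query = [["madding", "crowd"], ["ignoble", "strife"]]  # an AND of ORs

def estimated_size(or_terms):
    return sum(df[t] for t in or_terms)

plan = sorted(query, key=estimated_size)
print(plan)  # [['ignoble', 'strife'], ['madding', 'crowd']] -> process the smaller OR first
```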
-
Exercise
Exercise 1.7: Recommend a query processing order for
(tangerine OR trees) AND (marmalade OR skies) AND (kaleidoscope OR eyes)
given these frequencies:
Term          Freq
eyes          213312
kaleidoscope   87009
marmalade     107913
skies         271658
tangerine      46653
trees         316812
-
Query processing exercises
Exercise 1.8: If the query is friends AND romans AND (NOT countrymen), how could we use the freq of countrymen in finding the best query evaluation order? In particular, propose a way of handling negation in determining query processing order.
Exercise 1.9: For a conjunctive query, is processing postings lists in order of increasing size guaranteed to be optimal? Explain why it is, or give an example where it is not.
-
Exercise
Try the search feature at
http://www.rhymezone.com/shakespeare/
Write down five search features you think it could do
better
-
What's ahead in IR?
Beyond term search
What about phrases, e.g., Stanford University?
Proximity: Find Gates NEAR Microsoft.
Need index to capture position information in docs.
Zones in documents: Find documents with (author =
Ullman) AND (text contains automata).
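For the proximity queries above, the postings must also record positions, as noted. A minimal sketch of such a positional index (the docIDs and positions are made up for illustration):

```python
# Minimal sketch: positional postings, term -> {docID: [positions]},
# used to check whether two terms occur within k words of each other.
positional_index = {
    "gates":     {1: [3, 47], 7: [12]},
    "microsoft": {1: [45], 9: [2]},
}

def near(term1, term2, k):
    hits = []
    for d in positional_index[term1].keys() & positional_index[term2].keys():
        ps1, ps2 = positional_index[term1][d], positional_index[term2][d]
        if any(abs(p1 - p2) <= k for p1 in ps1 for p2 in ps2):
            hits.append(d)
    return sorted(hits)

print(near("gates", "microsoft", 3))  # [1]  (positions 47 and 45 are 2 apart)
```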
-
Evidence accumulation
1 vs. 0 occurrence of a search term
2 vs. 1 occurrence
3 vs. 2 occurrences, etc.
Usually more seems better.
Need term frequency information in docs.
-
Ranking search results
Boolean queries give inclusion or exclusion of docs.
Often we want to rank/group results
Need to measure proximity from query to each doc.
Need to decide whether docs presented to the user are singletons, or a group of docs covering various aspects of the query.
-
IR vs. databases:
Structured vs unstructured data
Structured data tends to refer to information in
tables
Employee  Manager  Salary
Smith     Jones    50000
Chang     Smith    60000
Ivy       Smith    50000
Typically allows numerical range and exact match
(for text) queries, e.g.,
Salary < 60000 AND Manager = Smith.
-
Unstructured data
Typically refers to free text
Allows
Keyword queries including operators
More sophisticated concept queries, e.g., find all web pages dealing with drug abuse
Classic model for searching text documents
-
Semi-structured data
In fact almost no data is unstructured
E.g., this slide has distinctly identified zones such as
the Title and Bullets
Facilitates semi-structured search such as: Title contains data AND Bullets contain search
... to say nothing of linguistic structure
-
More sophisticated semi-structured
search
Title is about Object Oriented Programming AND
Author something like stro*rup
where * is the wild-card operator
Issues: how do you process "about"?
how do you rank results?
The focus of XML search (IIR chapter 10)
-
Clustering, classification and ranking
Clustering: Given a set of docs, group them into
clusters based on their contents.
Classification: Given a set of topics, plus a new doc D,
decide which topic(s) D belongs to.
Ranking: Can we learn how to best order a set of documents, e.g., a set of search results?
-
The web and its challenges
Unusual and diverse documents
Unusual and diverse users, queries, information
needs
Beyond terms, exploit ideas from social networks
link analysis, clickstreams ...
How do search engines work? And how can we make them better?
-
More sophisticated information retrieval
Cross-language information retrieval
Question answering
Summarization
Text mining