Boolean model and Inverted index Processing Boolean queries Why ranked retrieval?
Introduction to Information Retrieval
http://informationretrieval.org
IIR 1: Boolean Retrieval
Hinrich Schütze
Institute for Natural Language Processing, University of Stuttgart
2011-08-29
Schutze: Boolean retrieval 1 / 30
Models and Methods
1 Boolean model and its limitations (30)
2 Vector space model (30)
3 Probabilistic models (30)
4 Language model-based retrieval (30)
5 Latent semantic indexing (30)
6 Learning to rank (30)
Take-away
Boolean model and Inverted index: The Boolean model and the basic data structure of most IR systems
Processing Boolean queries
Why is Boolean retrieval not enough? or Why do we need ranked retrieval?
Outline
1 Boolean model and Inverted index
2 Processing Boolean queries
3 Why ranked retrieval?
Definition of information retrieval
Information retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers).
The ad hoc retrieval problem: Given a user information need and a collection of documents, the IR system determines how well the documents satisfy the query and returns a subset of relevant documents to the user.
Boolean retrieval
The Boolean model is arguably the simplest model to base an information retrieval system on.
Queries are Boolean expressions, e.g., Caesar AND Brutus
The search engine returns all documents that satisfy the Boolean expression.
Model collection: The works of Shakespeare
Each of Shakespeare’s tragedies, comedies, etc. is a document in this collection.
Term-document incidence matrix
            Anthony and   Julius   The       Hamlet   Othello   Macbeth   . . .
            Cleopatra     Caesar   Tempest
Anthony     1             1        0         0        0         1
Brutus      1             1        0         1        0         0
Caesar      1             1        0         1        1         1
Calpurnia   0             1        0         0        0         0
Cleopatra   1             0        0         0        0         0
mercy       1             0        1         1        1         1
worser      1             0        1         1        1         0
. . .
Entry is 1 if term occurs. Example: Calpurnia occurs in Julius Caesar.
Entry is 0 if term doesn’t occur. Example: Calpurnia doesn’t occur in The Tempest.
We will return to this matrix many times in this class.
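One way to see how the matrix answers a query is to treat each row as a bit vector and combine rows with bitwise operations. The following is a minimal sketch, assuming the six plays shown on the slide are the whole collection. The query Brutus AND Caesar AND NOT Calpurnia and all helper names are illustrative choices, not taken from the slides.

```python
# Columns of the incidence matrix, in the order shown on the slide.
PLAYS = ["Anthony and Cleopatra", "Julius Caesar", "The Tempest",
         "Hamlet", "Othello", "Macbeth"]

# One row per term; entry i is 1 if the term occurs in PLAYS[i].
INCIDENCE = {
    "Anthony":   [1, 1, 0, 0, 0, 1],
    "Brutus":    [1, 1, 0, 1, 0, 0],
    "Caesar":    [1, 1, 0, 1, 1, 1],
    "Calpurnia": [0, 1, 0, 0, 0, 0],
    "Cleopatra": [1, 0, 0, 0, 0, 0],
    "mercy":     [1, 0, 1, 1, 1, 1],
    "worser":    [1, 0, 1, 1, 1, 0],
}

def bit_and(a, b):
    """Element-wise AND of two rows."""
    return [x & y for x, y in zip(a, b)]

def bit_not(a):
    """Element-wise complement of a row."""
    return [1 - x for x in a]

def matching_plays(bits):
    """Names of the plays whose bit is set."""
    return [play for play, bit in zip(PLAYS, bits) if bit]

# Hypothetical query: Brutus AND Caesar AND NOT Calpurnia
result = bit_and(bit_and(INCIDENCE["Brutus"], INCIDENCE["Caesar"]),
                 bit_not(INCIDENCE["Calpurnia"]))
print(matching_plays(result))  # ['Anthony and Cleopatra', 'Hamlet']
```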
We can’t build the incidence matrix for large collections
Size of incidence matrix: number of documents times number of terms → too large for large collections
But the matrix is very sparse – mostly 0s, few 1s.
Inverted index: We only record the 1s.
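To get a feel for the sizes involved, here is a back-of-the-envelope calculation. All three numbers are assumptions chosen only for illustration; the slides do not state collection sizes.

```python
# Hypothetical collection sizes (assumptions, not from the slides).
num_docs = 1_000_000      # documents in the collection
num_terms = 500_000       # distinct terms in the vocabulary
tokens_per_doc = 1_000    # assumed average document length in tokens

# The full incidence matrix has one cell per (term, document) pair.
matrix_cells = num_docs * num_terms

# A document cannot contain more distinct terms than it has tokens,
# so this is an upper bound on the number of 1s in the matrix.
max_ones = num_docs * tokens_per_doc

fraction = max_ones / matrix_cells
print(matrix_cells, max_ones, fraction)
```

Under these assumptions at most 0.2% of the cells can be 1, which is why storing only the 1s pays off.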
Inverted Index
For each term t, we store a list of all documents that contain t.
= For each term t, we store the 1s in its row in the incidence matrix
Brutus    −→ 1 2 4 11 31 45 173 174
Caesar    −→ 1 2 4 5 6 16 57 132 . . .
Calpurnia −→ 2 31 54 101
...
The terms on the left form the dictionary; the lists of document IDs on the right are the postings.
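A minimal sketch of how such an index could be built. The three toy documents, the whitespace tokenization and the lowercasing are simplifying assumptions, and the function name is hypothetical.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the sorted list of IDs of the documents containing it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        # Naive tokenization: lowercase and split on whitespace (an assumption).
        for term in text.lower().split():
            postings[term].add(doc_id)
    # Sorted postings lists allow the linear-time merge used for AND queries.
    return {term: sorted(ids) for term, ids in postings.items()}

# Hypothetical toy collection keyed by document ID.
docs = {
    1: "Brutus killed Caesar",
    2: "Caesar and Calpurnia",
    4: "Brutus and Caesar",
}
index = build_inverted_index(docs)
print(index["brutus"])  # [1, 4]
print(index["caesar"])  # [1, 2, 4]
```

The keys of `index` play the role of the dictionary and the values are the postings lists.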
Simple conjunctive query (two terms)
Consider the query: Brutus AND Calpurnia
To find all matching documents using inverted index:
1 Locate Brutus in the dictionary
2 Retrieve its postings list from the postings file
3 Locate Calpurnia in the dictionary
4 Retrieve its postings list from the postings file
5 Intersect the two postings lists
6 Return intersection to user
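The six steps can be sketched as follows. This assumes the whole index fits in an in-memory Python dict (a hypothetical stand-in for the dictionary and postings file) and uses built-in set intersection for step 5 purely for brevity.

```python
# Hypothetical in-memory index: term -> postings list (sorted docIDs).
# The postings are the ones shown on the slides.
INDEX = {
    "Brutus": [1, 2, 4, 11, 31, 45, 173, 174],
    "Calpurnia": [2, 31, 54, 101],
}

def conjunctive_query(index, term1, term2):
    """Answer the query 'term1 AND term2' against the given index."""
    # Steps 1-2: locate the first term and retrieve its postings list.
    postings1 = index.get(term1, [])
    # Steps 3-4: the same for the second term.
    postings2 = index.get(term2, [])
    # Step 5: intersect the two lists (set intersection, for brevity).
    common = set(postings1) & set(postings2)
    # Step 6: return the result in docID order.
    return sorted(common)

print(conjunctive_query(INDEX, "Brutus", "Calpurnia"))  # [2, 31]
```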
Intersecting two postings lists
Brutus −→ 1 → 2 → 4 → 11 → 31 → 45 → 173 → 174
Calpurnia −→ 2 → 31 → 54 → 101
Intersection =⇒ 2 → 31
This is linear in the length of the postings lists.
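The merge can be sketched like this, assuming both postings lists are sorted by docID. The function and variable names are illustrative.

```python
def intersect(p1, p2):
    """Intersect two postings lists that are sorted by docID."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            # Same docID in both lists: it belongs to the intersection.
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            # Advance the pointer that sits on the smaller docID.
            i += 1
        else:
            j += 1
    return answer

brutus = [1, 2, 4, 11, 31, 45, 173, 174]
calpurnia = [2, 31, 54, 101]
print(intersect(brutus, calpurnia))  # [2, 31]
```

Every loop iteration advances at least one pointer, so the loop runs at most len(p1) + len(p2) times. This only works because the lists are sorted.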
Boolean queries
The example was a simple conjunctive query . . .
. . . the Boolean retrieval model can answer any query that is a Boolean expression.
Boolean queries are queries that use AND, OR and NOT to join query terms.
Views each document as a set of terms.
Is precise: Document matches condition or not.
Primary commercial retrieval tool for 3 decades
Many professional searchers (e.g., lawyers) still like Boolean queries.
You know exactly what you are getting.
Many search systems you use are also Boolean: search system on your laptop, in your email reader, on the intranet etc.
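As a sketch of how AND, OR and NOT map onto set operations over postings: the five-document collection, the postings and the nested-tuple query representation below are all assumptions made for illustration, not a query syntax used on the slides.

```python
ALL_DOCS = {1, 2, 3, 4, 5}  # hypothetical collection of five docIDs

POSTINGS = {  # hypothetical postings, stored as sets of docIDs
    "brutus": {1, 2, 4},
    "caesar": {1, 2, 4, 5},
    "calpurnia": {2},
}

def evaluate(query):
    """Evaluate a nested-tuple Boolean query and return the matching docIDs."""
    if isinstance(query, str):
        # A bare term: look up its postings (empty set if unknown).
        return POSTINGS.get(query, set())
    op, *args = query
    if op == "AND":
        return evaluate(args[0]) & evaluate(args[1])
    if op == "OR":
        return evaluate(args[0]) | evaluate(args[1])
    if op == "NOT":
        # NOT needs the full set of docIDs to take a complement.
        return ALL_DOCS - evaluate(args[0])
    raise ValueError(f"unknown operator: {op}")

print(evaluate(("AND", "caesar", ("NOT", "brutus"))))    # {5}
print(evaluate(("OR", "calpurnia", ("NOT", "caesar"))))  # {2, 3}
```

Each document either is in the returned set or is not, which is the "precise" property listed above.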
So are we done?
The Boolean model: Pros and Cons
Key property: Documents either match or don’t.
Good for expert users with precise understanding of their needs and of the collection.
Also good for applications: Applications can easily consume 1000s of results.
Not good for the majority of users
Most users are not capable of writing Boolean queries . . .
. . . or they are, but they think it’s too much work.
Most users don’t want to wade through 1000s of results.
This is particularly true of web search.
Problem with Boolean search: Feast or famine
Boolean queries often result in either too few (=0) or too many (1000s) results.
Query 1 (boolean conjunction): [standard user dlink 650]
→ 200,000 hits – feast
Query 2 (boolean conjunction): [standard user dlink 650 no card found]
→ 0 hits – famine
In Boolean retrieval, it takes a lot of skill to come up with a query that produces a manageable number of hits.
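The effect can be reproduced on a toy scale. The postings below are made up so that every added conjunct can only shrink the result set; the hit counts are not the web counts quoted above.

```python
# Hypothetical postings (sets of docIDs) over a toy collection of 1000 documents.
POSTINGS = {
    "standard": set(range(0, 1000)),
    "user":     set(range(0, 1000, 2)),
    "dlink":    set(range(0, 1000, 4)),
    "650":      set(range(0, 1000, 8)),
    "no":       set(range(0, 1000, 3)),
    "card":     set(range(1, 1000, 2)),
    "found":    set(range(0, 1000, 5)),
}

def count_hits(terms):
    """Number of documents that contain every term (Boolean conjunction)."""
    return len(set.intersection(*(POSTINGS[t] for t in terms)))

query1 = ["standard", "user", "dlink", "650"]
query2 = query1 + ["no", "card", "found"]

print(count_hits(query1))  # 125
print(count_hits(query2))  # 0
```

A conjunction can never return more documents than its rarest term, so adding terms moves the user from feast towards famine with nothing in between to tune.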
Feast or famine: No problem in ranked retrieval
With ranking, large result sets are not an issue.
Just show the top 10 results and the user won’t be overwhelmed
Premise: the ranking algorithm works: More relevant results are ranked higher than less relevant results.
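A minimal sketch of "show the top 10", assuming some ranking function has already assigned each matching document a score. The scores here are synthetic placeholders, not the output of any real ranking model.

```python
import heapq

def top_k(scores, k=10):
    """Return the k (doc_id, score) pairs with the highest scores, best first."""
    return heapq.nlargest(k, scores.items(), key=lambda item: item[1])

# Hypothetical relevance scores for 1000 matching documents.
scores = {doc_id: (doc_id * 37 % 1000) / 1000 for doc_id in range(1000)}

first_page = top_k(scores)
for doc_id, score in first_page:
    print(doc_id, score)
```

However many documents match, the user only sees `k` of them. Whether those are the right ones depends entirely on the premise that the scores reflect relevance.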
Empirical investigation of the effect of ranking
How can we measure how important ranking is?
Schutze: Boolean retrieval 20 / 30
Boolean model and Inverted index Processing Boolean queries Why ranked retrieval?
Empirical investigation of the effect of ranking
How can we measure how important ranking is?
Observe what searchers do when they are searching in acontrolled setting
Schutze: Boolean retrieval 20 / 30
Boolean model and Inverted index Processing Boolean queries Why ranked retrieval?
Empirical investigation of the effect of ranking
How can we measure how important ranking is?
Observe what searchers do when they are searching in acontrolled setting
Videotape them
Schutze: Boolean retrieval 20 / 30
Boolean model and Inverted index Processing Boolean queries Why ranked retrieval?
Empirical investigation of the effect of ranking
How can we measure how important ranking is?
Observe what searchers do when they are searching in a controlled setting:
Videotape them
Ask them to “think aloud”
Interview them
Eye-track them
Time them
Record and count their clicks
The following slides are from Dan Russell’s 2007 JCDL talk.
Dan Russell was the “Uber Tech Lead for Search Quality & User Happiness” at Google.
Importance of ranking: Summary
Viewing abstracts: Users are a lot more likely to read the abstracts of the top-ranked pages (1, 2, 3, 4) than the abstracts of the lower-ranked pages (7, 8, 9, 10).
Clicking: The distribution is even more skewed for clicking.
There is a very strong bias to click on the top-ranked page.
Even if the top-ranked page is not relevant, 30% of users will click on it.
→ Getting the ranking right is very important.
→ Getting the top-ranked page right is most important.
Take-away
Boolean model and Inverted index: The Boolean model and the basic data structure of most IR systems
Processing Boolean queries
Why is Boolean retrieval not enough? or Why do we need ranked retrieval?
Resources
Chapter 1 of Introduction to Information Retrieval
Resources at http://informationretrieval.org/essir2011
List of useful information retrieval resources
Shakespeare search engine
Daniel Russell’s home page
Exercise
Does Bing/Google use the Boolean model?
Does Spotlight use the Boolean model?
Do web search engines use the Boolean model?
Default interpretation of a query by web search engines: [w1 w2 . . . wn] is w1 AND w2 AND . . . AND wn
Cases where you get hits that do not contain one of the wi:
anchor text
page contains variant of wi (morphology, spelling correction, synonym)
long queries (n large)
conjunctive Boolean query generates very few hits
Simple Boolean vs. ranking of result set
Simple Boolean retrieval returns matching documents in no particular order.
Google (and most well-designed Boolean engines) rank the result set: they rank good hits (according to some estimator of relevance) higher than bad hits.
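The contrast between simple Boolean retrieval and a ranked result set can be illustrated with a minimal Python sketch. It is not how any web search engine works. The toy corpus, the function names, and the use of summed term frequency as the "estimator of relevance" are all assumptions made for illustration.

```python
from collections import defaultdict

# Toy corpus (hypothetical documents, for illustration only).
DOCS = {
    1: "boolean retrieval returns matching documents",
    2: "ranked retrieval orders matching documents by relevance",
    3: "retrieval retrieval retrieval of matching documents",
}

def build_index(docs):
    """Build an inverted index: term -> {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term in text.split():
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return index

def boolean_and(index, query):
    """Interpret [w1 w2 ... wn] as w1 AND w2 AND ... AND wn.

    Returns the set of documents containing every query term.
    A set has no meaningful order, as in simple Boolean retrieval.
    """
    terms = query.split()
    if not terms:
        return set()
    postings = [set(index.get(t, {})) for t in terms]
    return set.intersection(*postings)

def ranked_and(index, query):
    """Return the same result set, ordered by an assumed relevance score.

    The score is the summed term frequency of the query terms, a
    stand-in for whatever estimator a real engine would use.
    Ties are broken by document ID.
    """
    terms = query.split()
    hits = boolean_and(index, query)

    def score(doc_id):
        return sum(index[t][doc_id] for t in terms)

    return sorted(hits, key=lambda d: (-score(d), d))

INDEX = build_index(DOCS)
print(boolean_and(INDEX, "retrieval documents"))  # unordered result set
print(ranked_and(INDEX, "retrieval documents"))   # same hits, best first
```

Both functions return the same documents. Only the second commits to an order, which is the part users are sensitive to according to the click data summarized earlier.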