ADFOCS 2004: Prabhakar Raghavan, Lecture 1
ADFOCS 2004
Prabhakar Raghavan, Lecture 1
Plan: Basic information retrieval
Lecture 1 (~120 minutes): Index structures
Lecture 2 (90 minutes): Index compression and construction
Lecture 3 (~90 minutes): Scoring and evaluation
Query
Which plays of Shakespeare contain the words Brutus AND Caesar but NOT Calpurnia?
One could grep all of Shakespeare's plays for Brutus and Caesar, then strip out lines containing Calpurnia. But:
- it is slow (for large corpora)
- NOT Calpurnia is non-trivial
- other operations (e.g., finding the phrase Romans and countrymen) are not feasible
Term-document incidence
1 if the play contains the word, 0 otherwise:

| | Antony and Cleopatra | Julius Caesar | The Tempest | Hamlet | Othello | Macbeth |
|---|---|---|---|---|---|---|
| Antony | 1 | 1 | 0 | 0 | 0 | 1 |
| Brutus | 1 | 1 | 0 | 1 | 0 | 0 |
| Caesar | 1 | 1 | 0 | 1 | 1 | 1 |
| Calpurnia | 0 | 1 | 0 | 0 | 0 | 0 |
| Cleopatra | 1 | 0 | 0 | 0 | 0 | 0 |
| mercy | 1 | 0 | 1 | 1 | 1 | 1 |
| worser | 1 | 0 | 1 | 1 | 1 | 0 |
Incidence vectors
So we have a 0/1 vector for each term. To answer the query: take the vectors for Brutus, Caesar, and Calpurnia (complemented), and bitwise AND them:

110100 AND 110111 AND 101111 = 100100.
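The bitwise AND above can be sketched in Python, treating each 0/1 incidence vector as an integer bit mask. The bit order is an assumption taken from the table: leftmost bit = Antony and Cleopatra, rightmost = Macbeth.

```python
# Answer "Brutus AND Caesar AND NOT Calpurnia" with 0/1 incidence vectors
# over 6 plays, represented as Python integers (one bit per play).

brutus    = 0b110100
caesar    = 0b110111
calpurnia = 0b010000

mask = (1 << 6) - 1                          # six plays
result = brutus & caesar & (~calpurnia & mask)

print(f"{result:06b}")                       # -> 100100
```

The two set bits correspond to Antony and Cleopatra and Hamlet, the two answers on the next slide.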
Answers to query
Antony and Cleopatra, Act III, Scene ii Agrippa [Aside to DOMITIUS ENOBARBUS]: Why, Enobarbus, When Antony found Julius Caesar dead, He cried almost to roaring; and he wept When at Philippi he found Brutus slain.
Hamlet, Act III, Scene ii Lord Polonius: I did enact Julius Caesar: I was killed i' the Capitol; Brutus killed me.
Bigger corpora
Consider n = 1M documents, each with about 1K terms.
Avg 6 bytes/term (incl. spaces/punctuation) → 6GB of data in the documents.
Say there are m = 500K distinct terms among these.
Can’t build the matrix
A 500K x 1M matrix has half a trillion 0's and 1's.
But it has no more than one billion 1's (1M docs x 1K terms each), so the matrix is extremely sparse.
What's a better representation? We only record the 1 positions.
Why no more than a billion?
Inverted index
For each term T, must store a list of all documents that contain T.
Do we use an array or a list for this?

Brutus → 2 4 8 16 32 64 128
Caesar → 1 2 3 5 8 13 21 34
Calpurnia → 13 16

What happens if the word Caesar is added to document 14?
Inverted index
Linked lists generally preferred to arrays:
- dynamic space allocation
- insertion of terms into documents easy
- space overhead of pointers

Brutus → 2 4 8 16 32 64 128
Caesar → 1 2 3 5 8 13 21 34
Calpurnia → 13 16

Dictionary (terms) on the left, postings on the right. Sorted by docID (more later on why).
Inverted index construction
Documents to be indexed: Friends, Romans, countrymen.
↓ Tokenizer
Token stream: Friends Romans Countrymen
↓ Linguistic modules (more on these later)
Modified tokens: friend roman countryman
↓ Indexer
Inverted index:
friend → 2 4
roman → 1 2
countryman → 13 16
Indexer steps: form the sequence of (Modified token, Document ID) pairs.

Doc 1: I did enact Julius Caesar: I was killed i' the Capitol; Brutus killed me.
Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious.

Pairs: (I, 1), (did, 1), (enact, 1), (julius, 1), (caesar, 1), (I, 1), (was, 1), (killed, 1), (i', 1), (the, 1), (capitol, 1), (brutus, 1), (killed, 1), (me, 1), (so, 2), (let, 2), (it, 2), (be, 2), (with, 2), (caesar, 2), (the, 2), (noble, 2), (brutus, 2), (hath, 2), (told, 2), (you, 2), (caesar, 2), (was, 2), (ambitious, 2)
Sort by terms: the core indexing step. After sorting:

(ambitious, 2), (be, 2), (brutus, 1), (brutus, 2), (caesar, 1), (caesar, 2), (caesar, 2), (capitol, 1), (did, 1), (enact, 1), (hath, 2), (I, 1), (I, 1), (i', 1), (it, 2), (julius, 1), (killed, 1), (killed, 1), (let, 2), (me, 1), (noble, 2), (so, 2), (the, 1), (the, 2), (told, 2), (was, 1), (was, 2), (with, 2), (you, 2)
Multiple term entries in a single document are merged, and frequency information is added (why frequency? will discuss later). Each triple is (term, doc #, freq):

(ambitious, 2, 1), (be, 2, 1), (brutus, 1, 1), (brutus, 2, 1), (caesar, 1, 1), (caesar, 2, 2), (capitol, 1, 1), (did, 1, 1), (enact, 1, 1), (hath, 2, 1), (I, 1, 2), (i', 1, 1), (it, 2, 1), (julius, 1, 1), (killed, 1, 2), (let, 2, 1), (me, 1, 1), (noble, 2, 1), (so, 2, 1), (the, 1, 1), (the, 2, 1), (told, 2, 1), (was, 1, 1), (was, 2, 1), (with, 2, 1), (you, 2, 1)
The result is split into a Dictionary file and a Postings file.

Dictionary (term, n docs, total freq) → Postings (doc #, freq):
ambitious, 1, 1 → (2, 1)
be, 1, 1 → (2, 1)
brutus, 2, 2 → (1, 1), (2, 1)
caesar, 2, 3 → (1, 1), (2, 2)
capitol, 1, 1 → (1, 1)
did, 1, 1 → (1, 1)
enact, 1, 1 → (1, 1)
hath, 1, 1 → (2, 1)
I, 1, 2 → (1, 2)
i', 1, 1 → (1, 1)
it, 1, 1 → (2, 1)
julius, 1, 1 → (1, 1)
killed, 1, 2 → (1, 2)
let, 1, 1 → (2, 1)
me, 1, 1 → (1, 1)
noble, 1, 1 → (2, 1)
so, 1, 1 → (2, 1)
the, 2, 2 → (1, 1), (2, 1)
told, 1, 1 → (2, 1)
was, 2, 2 → (1, 1), (2, 1)
with, 1, 1 → (2, 1)
you, 1, 1 → (2, 1)
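A minimal sketch of these indexer steps in Python. The regex tokenizer and lowercase folding here are simplifications of my own (not the slides' linguistic modules), so the output differs slightly, e.g., "I" is folded to "i".

```python
# Indexer sketch: collect (term, docID) pairs, sort them, then merge
# duplicates into a dictionary (term -> number of docs) and postings
# (term -> per-doc frequencies).

import re
from collections import defaultdict

docs = {
    1: "I did enact Julius Caesar: I was killed i' the Capitol; Brutus killed me.",
    2: "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious.",
}

pairs = []
for doc_id, text in docs.items():
    for token in re.findall(r"[a-z']+", text.lower()):   # crude tokenizer
        pairs.append((token, doc_id))

pairs.sort()                                   # the core indexing step

postings = defaultdict(lambda: defaultdict(int))   # term -> {docID: freq}
for term, doc_id in pairs:
    postings[term][doc_id] += 1

dictionary = {term: len(d) for term, d in postings.items()}   # term -> n docs

print(dictionary['caesar'], dict(postings['caesar']))   # -> 2 {1: 1, 2: 2}
```

This reproduces the split above: caesar appears in 2 docs, once in doc 1 and twice in doc 2.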
Where do we pay in storage? In the terms of the dictionary, and in the pointers to (and entries of) the postings file. Will quantify the storage later.
The index we just built
How do we process a query? What kinds of queries can we process? (This is our initial focus.)
Which terms in a doc do we index? All words, or only "important" ones?
Stopword list: terms that are so common that they're ignored for indexing, e.g., the, a, an, of, to ... (language-specific).
Query processing
Consider processing the query: Brutus AND Caesar
- Locate Brutus in the Dictionary; retrieve its postings.
- Locate Caesar in the Dictionary; retrieve its postings.
- "Merge" the two postings:

Brutus → 2 4 8 16 32 64 128
Caesar → 1 2 3 5 8 13 21 34
The merge

Walk through the two postings simultaneously, in time linear in the total number of postings entries:

Brutus → 2 4 8 16 32 64 128
Caesar → 1 2 3 5 8 13 21 34
Result → 2 8

If the list lengths are x and y, the merge takes O(x+y) operations. Crucial: postings sorted by docID.
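The two-pointer merge can be sketched as:

```python
# Intersect two docID-sorted postings lists in O(x+y) comparisons by
# advancing the pointer that sits on the smaller docID.

def intersect(p1, p2):
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer

brutus = [2, 4, 8, 16, 32, 64, 128]
caesar = [1, 2, 3, 5, 8, 13, 21, 34]
print(intersect(brutus, caesar))   # -> [2, 8]
```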
Boolean queries: Exact match
Queries using AND, OR and NOT together with query terms:
- views each document as a set of words
- is precise: a document matches the condition or it doesn't.
Primary commercial retrieval tool for 3 decades.
Professional searchers (e.g., lawyers) still like Boolean queries: you know exactly what you're getting.
Example: WestLaw http://www.westlaw.com/
Largest commercial (paying subscribers) legal search service (started 1975; ranking added 1992)
About 7 terabytes of data; 700,000 users. Majority of users still use Boolean queries. Example query:
What is the statute of limitations in cases involving the federal tort claims act?
LIMIT! /3 STATUTE ACTION /S FEDERAL /2 TORT /3 CLAIM
Long, precise queries; proximity operators; incrementally developed; not like web search.
More general merges
Exercise: Adapt the merge for the queries:
Brutus AND NOT Caesar
Brutus OR NOT Caesar
Can we still run through the merge in time O(x+y)?
Merging
What about an arbitrary Boolean formula?
(Brutus OR Caesar) AND NOT (Antony OR Cleopatra)
Can we always merge in "linear" time? Linear in what? Can we do better?
Query optimization
What is the best order for query processing?
Consider a query that is an AND of t terms. For each of the t terms, get its postings, then AND them together.

Brutus → 2 4 8 16 32 64 128
Caesar → 1 2 3 5 8 16 21 34
Calpurnia → 13 16

Query: Brutus AND Calpurnia AND Caesar
Query optimization example

Process in order of increasing freq: start with the smallest set, then keep cutting further. (This is why we kept freq in the dictionary.)

Brutus → 2 4 8 16 32 64 128
Caesar → 1 2 3 5 8 16 21 34
Calpurnia → 13 16
Execute the query as (Calpurnia AND Brutus) AND Caesar, i.e., smallest postings list first.
![Page 26: ADFOCS 2004 Prabhakar Raghavan Lecture 1. Plan: Basic information retrieval Lecture 1: ~120 minutes Index structures Lecture 2: 90 minutes Index compression.](https://reader034.fdocuments.us/reader034/viewer/2022051614/551c4bd55503469d6a8b496b/html5/thumbnails/26.jpg)
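A sketch of the heuristic on the postings pictured above; list length stands in for the freq stored in the dictionary, and the set-based intersect is a simplification of the sorted-list merge.

```python
# Conjunctive query optimization: intersect terms in order of increasing
# document frequency, so intermediate results stay small.

postings = {
    'brutus':    [2, 4, 8, 16, 32, 64, 128],
    'caesar':    [1, 2, 3, 5, 8, 16, 21, 34],
    'calpurnia': [13, 16],
}

def intersect(p1, p2):
    s2 = set(p2)
    return [d for d in p1 if d in s2]

terms = sorted(postings, key=lambda t: len(postings[t]))  # smallest first
result = postings[terms[0]]
for t in terms[1:]:
    result = intersect(result, postings[t])

print(terms)    # -> ['calpurnia', 'brutus', 'caesar']
print(result)   # -> [16]
```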
More general optimization
e.g., (madding OR crowd) AND (ignoble OR strife)
- Get freq's for all terms.
- Estimate the size of each OR by the sum of its freq's (conservative).
- Process in increasing order of OR sizes.
Exercise
Recommend a query processing order for:
(tangerine OR trees) AND (marmalade OR skies) AND (kaleidoscope OR eyes)

| Term | Freq |
|---|---|
| eyes | 213312 |
| kaleidoscope | 87009 |
| marmalade | 107913 |
| skies | 271658 |
| tangerine | 46653 |
| trees | 316812 |
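One way to apply the previous slide's heuristic to this table, as a sketch (the sums are conservative estimates, not actual OR sizes):

```python
# Estimate each OR's size by the sum of its terms' freqs, then process
# the disjunctions in increasing order of that estimate.

freq = {
    'eyes': 213312, 'kaleidoscope': 87009, 'marmalade': 107913,
    'skies': 271658, 'tangerine': 46653, 'trees': 316812,
}

disjunctions = [('tangerine', 'trees'), ('marmalade', 'skies'),
                ('kaleidoscope', 'eyes')]

order = sorted(disjunctions, key=lambda d: freq[d[0]] + freq[d[1]])
print(order)
# -> [('kaleidoscope', 'eyes'), ('tangerine', 'trees'), ('marmalade', 'skies')]
```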
Query processing exercises
If the query is friends AND romans AND (NOT countrymen), how could we use the freq of countrymen?
Exercise: Extend the merge to an arbitrary Boolean query. Can we always guarantee execution in time linear in the total postings size?
Hint: Begin with the case of a Boolean formula query, in which each query term appears only once in the query.
Online Exercise
Try the search feature at http://www.rhymezone.com/shakespeare/
Write down five ways in which you think its search could be better.
Recall basic indexing pipeline
Documents to be indexed: Friends, Romans, countrymen.
↓ Tokenizer
Token stream: Friends Romans Countrymen
↓ Linguistic modules
Modified tokens: friend roman countryman
↓ Indexer
Inverted index:
friend → 2 4
roman → 1 2
countryman → 13 16
Tokenization
Tokenization
Input: "Friends, Romans and Countrymen"
Output: tokens Friends, Romans, Countrymen
Each such token is now a candidate for an index entry, after further processing (described below).
But what are valid tokens to emit?
Parsing a document
What format is it in? pdf/word/excel/html?
What language is it in? What character set is in use?
Each of these is a classification problem, which we will study later in the course.
But there are complications …
Format/language stripping
Documents being indexed can include docs from many different languages: a single index may have to contain terms of several languages.
Sometimes a document or its components can contain multiple languages/formats, e.g., a French email with a Portuguese pdf attachment.
What is a unit document? An email? With attachments? An email with a zip containing documents?
Tokenization
Issues in tokenization:
- Finland's capital → Finland? Finlands? Finland's?
- Hewlett-Packard → Hewlett and Packard as two tokens?
- San Francisco: one token or two? How do you decide it is one token?
Language issues
Accents: résumé vs. resume.
L'ensemble: one token or two? L? L'? Le?
How would your users like to write their queries for these words?
Tokenization: language issues
Chinese and Japanese have no spaces between words: a unique tokenization is not always guaranteed.
Further complicated in Japanese, with multiple alphabets (Katakana, Hiragana, Kanji, "Romaji") intermingled, and dates/amounts in multiple formats:
フォーチュン 500 社は情報不足のため時間あた $500K( 約 6,000 万円 )
End-user can express a query entirely in (say) Hiragana!
Normalization
In "right-to-left languages" like Hebrew and Arabic, you can have "left-to-right" text interspersed (e.g., for dollar amounts).
Need to "normalize" indexed text as well as query terms into the same form.
Character-level alphabet detection and conversion: tokenization is not separable from this, and it is sometimes ambiguous:
7 月 30 日 vs. 7/30
Morgen will ich in MIT ... (Is this German "mit"?)
Punctuation
Ne'er: use a language-specific, handcrafted "locale" to normalize. Which language? Most common: detect/apply language at a pre-determined granularity (doc/paragraph).
State-of-the-art: break up the hyphenated sequence. Phrase index?
U.S.A. vs. USA: use the locale.
a.out
Numbers
Examples: 3/12/91; Mar. 12, 1991; 55 B.C.; B-52; My PGP key is 324a3df234cb23e; 100.2.86.144
Generally, don't index these as text. Will often index "meta-data" separately: creation date, format, etc.
Case folding
Reduce all letters to lower case.
Exception: upper case in mid-sentence(?), e.g., General Motors; Fed vs. fed; SAIL vs. sail.
Spell correction
Expand to terms within (say) edit distance 2, where an edit is an insert, delete, or replace.
Expand at query time from the query, e.g., Alanis Morisette.
Spell correction is expensive and slows the query (up to a factor of 100): invoke it only when the index returns (near-)zero matches. (Why not at index time?)
What if docs contain mis-spellings?
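A sketch of the insert/delete/replace edit distance used here (standard dynamic programming, not anything from the slides):

```python
# Levenshtein edit distance with unit-cost insert, delete, and replace,
# computed row by row to keep memory at O(len(b)).

def edit_distance(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # delete ca
                           cur[j - 1] + 1,               # insert cb
                           prev[j - 1] + (ca != cb)))    # replace (or match)
        prev = cur
    return prev[-1]

print(edit_distance('morisette', 'morissette'))   # -> 1
```

A correction candidate would be accepted when `edit_distance(query_term, candidate) <= 2`.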
Thesauri and soundex
Handle synonyms and homonyms:
- Hand-constructed equivalence classes, e.g., car = automobile, your ≈ you're.
- Index such equivalences: when the document contains automobile, index it under car as well (usually, also vice-versa).
- Or expand the query: when the query contains automobile, look under car as well.
More on this later ...
Soundex
Class of heuristics to expand a query into phonetic equivalents. Language-specific; mainly for names. E.g., chebyshev → tchebycheff.
Soundex – typical algorithm
- Turn every token to be indexed into a 4-character reduced form.
- Do the same with query terms.
- Build and search an index on the reduced forms (when the query calls for a soundex match).
http://www.creativyst.com/Doc/Articles/SoundEx1/SoundEx1.htm#Top
Soundex – typical algorithm
1. Retain the first letter of the word.
2. Change all occurrences of the following letters to '0' (zero): 'A', 'E', 'I', 'O', 'U', 'H', 'W', 'Y'.
3. Change letters to digits as follows: B, F, P, V → 1; C, G, J, K, Q, S, X, Z → 2; D, T → 3; L → 4; M, N → 5; R → 6.
Soundex continued
4. Remove one digit from each pair of consecutive identical digits.
5. Remove all zeros from the resulting string.
6. Pad the resulting string with trailing zeros and return the first four positions, which will be of the form <uppercase letter> <digit> <digit> <digit>.
E.g., Herman becomes H655.
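The six steps can be sketched as follows. Soundex variants differ on edge cases (e.g., whether the first letter's own code suppresses an identical following digit), so this follows the slides' wording literally rather than any one standard variant.

```python
# Soundex sketch: keep the first letter, code the rest to digits,
# collapse runs of identical digits, drop zeros, pad to 4 characters.

def soundex(word):
    codes = {}
    for group, digit in [('AEIOUHWY', '0'), ('BFPV', '1'), ('CGJKQSXZ', '2'),
                         ('DT', '3'), ('L', '4'), ('MN', '5'), ('R', '6')]:
        for ch in group:
            codes[ch] = digit

    word = word.upper()
    digits = [codes[ch] for ch in word[1:] if ch in codes]

    deduped = []                       # step 4: collapse identical neighbors
    for d in digits:
        if not deduped or deduped[-1] != d:
            deduped.append(d)

    result = word[0] + ''.join(d for d in deduped if d != '0')   # step 5
    return (result + '000')[:4]                                  # step 6

print(soundex('Herman'))   # -> H655
```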
Query expansion
Usually do query expansion rather than index expansion:
- no index blowup
- query processing is slowed down, since docs frequently contain equivalences
- may retrieve more junk: puma → jaguar retrieves documents on cars instead of on sneakers.
Language detection
Many of the components described require language detection:
- for docs/paragraphs at indexing time
- for query terms at query time (much harder).
For docs/paragraphs, we generally have enough text to apply machine learning methods.
For queries, we generally lack sufficient text: augment with other cues, such as client properties/specification from the application, domain of query origination, etc.
What queries can we process?
We have:
- basic inverted index with skip pointers
- wild-card index
- spell-correction
Queries such as: an*er* AND (moriset /3 toronto) OR SOUNDEX(chaikofski)
Aside – results caching
If 25% of your users are searching for britney AND spears
then you probably do need spelling correction, but you don’t need to keep on intersecting those two postings lists
Web query distribution is extremely skewed, and you can usefully cache results for common queries – more later.
Lemmatization
Reduce inflectional/variant forms to the base form. E.g.:
am, are, is → be
car, cars, car's, cars' → car
the boy's cars are different colors → the boy car be different color
Dictionary entries – first cut
ensemble.french
時間 .japanese
MIT.english
mit.german
guaranteed.english
entries.english
sometimes.english
tokenization.english
These may be grouped by language. More on this in query processing.
Stemming
Reduce terms to their "roots" before indexing (language-dependent), e.g., automate(s), automatic, automation all reduced to automat.
Example: "for example compressed and compression are both accepted as equivalent to compress" becomes "for exampl compres and compres are both accept as equival to compres".
Porter’s algorithm
Commonest algorithm for stemming English.
- Conventions + 5 phases of reductions, applied sequentially.
- Each phase consists of a set of commands.
- Sample convention: of the rules in a compound command, select the one that applies to the longest suffix.
Typical rules in Porter
sses → ss
ies → i
ational → ate
tional → tion
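Under that longest-suffix convention, one phase might be sketched as follows. This uses only the four rules above; the real Porter algorithm has many more rules plus "measure" conditions on the stem that are not modeled here.

```python
# Apply one compound command: among the rules whose suffix matches,
# pick the one with the longest suffix (the sample convention above).

RULES = [('sses', 'ss'), ('ies', 'i'), ('ational', 'ate'), ('tional', 'tion')]

def apply_phase(word):
    matches = [(s, r) for s, r in RULES if word.endswith(s)]
    if not matches:
        return word
    suffix, repl = max(matches, key=lambda m: len(m[0]))
    return word[:len(word) - len(suffix)] + repl

print(apply_phase('caresses'))     # -> caress
print(apply_phase('relational'))   # -> relate
print(apply_phase('conditional'))  # -> condition
```

Note how "relational" matches both ational and tional, and the longer suffix wins.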
Other stemmers
Other stemmers exist, e.g., the Lovins stemmer:
http://www.comp.lancs.ac.uk/computing/research/stemming/general/lovins.htm
- single-pass, longest-suffix removal (about 250 rules)
- motivated by linguistics as well as IR.
Full morphological analysis: modest benefits for retrieval.
Faster postings merges: skip pointers
Recall basic merge
Walk through the two postings simultaneously, in time linear in the total number of postings entries
Brutus: 2 → 4 → 8 → 16 → 32 → 64 → 128
Caesar: 1 → 2 → 3 → 5 → 8 → 17 → 21 → 31

If the list lengths are m and n, the merge takes O(m+n) operations.
Can we do better? Yes, if the index isn't changing too fast.
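The linear merge can be sketched as follows (illustrative Python; the two postings lists are the Brutus and Caesar examples above):

```python
def intersect(p1, p2):
    """Merge two sorted postings lists in O(m+n) time."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:          # docID in both lists
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:         # advance the list with the smaller docID
            i += 1
        else:
            j += 1
    return answer

brutus = [2, 4, 8, 16, 32, 64, 128]
caesar = [1, 2, 3, 5, 8, 17, 21, 31]
print(intersect(brutus, caesar))    # [2, 8]
```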
Augment postings with skip pointers (at indexing time)
Why? To skip postings that will not figure in the search results.
How? Where do we place skip pointers?

Brutus: 2 → 4 → 8 → 16 → 32 → 64 → 128, with skip pointers from 2 to 16 and from 16 to 128
Caesar: 1 → 2 → 3 → 5 → 8 → 17 → 21 → 31, with skip pointers from 1 to 8 and from 8 to 31
Query processing with skip pointers
Brutus: 2 → 4 → 8 → 16 → 32 → 64 → 128, with skip pointers from 2 to 16 and from 16 to 128
Caesar: 1 → 2 → 3 → 5 → 8 → 17 → 21 → 31, with skip pointers from 1 to 8 and from 8 to 31

Suppose we've stepped through the lists until we process 8 on each list.
When we get to 16 on the top list, we see that its successor is 32.
But the skip successor of 8 on the lower list is 31, so we can skip ahead past the intervening postings.
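A sketch of the skip-aware merge (illustrative Python, not the lecture's code; skips are stored as a dict from position to skip target, placed evenly at roughly √L intervals):

```python
import math

def add_skips(postings):
    """Place ~sqrt(L) evenly spaced skip pointers: skips[i] is the
    position reachable from position i in one skip."""
    span = round(math.sqrt(len(postings))) or 1
    return {i: i + span for i in range(0, len(postings) - span, span)}

def intersect_with_skips(p1, p2):
    s1, s2 = add_skips(p1), add_skips(p2)
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i, j = i + 1, j + 1
        elif p1[i] < p2[j]:
            # Take the skip on list 1 only if its target doesn't overshoot.
            if i in s1 and p1[s1[i]] <= p2[j]:
                i = s1[i]
            else:
                i += 1
        else:
            if j in s2 and p2[s2[j]] <= p1[i]:
                j = s2[j]
            else:
                j += 1
    return answer

brutus = [2, 4, 8, 16, 32, 64, 128]
caesar = [1, 2, 3, 5, 8, 17, 21, 31]
print(intersect_with_skips(brutus, caesar))  # [2, 8]
```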
Where do we place skips?
Tradeoff:
- More skips → shorter skip spans → more likely to skip. But lots of comparisons to skip pointers.
- Fewer skips → few pointer comparisons, but then long skip spans → few successful skips.
Placing skips
Simple heuristic: for postings of length L, use √L evenly-spaced skip pointers.
This ignores the distribution of query terms.
Easy if the index is relatively static; harder if L keeps changing because of updates.
Phrase queries
Phrase queries
Want to answer queries such as stanford university – as a phrase
Thus the sentence “I went to university at Stanford” is not a match.
No longer suffices to store only <term : docs> entries
A first attempt: Biword indexes
Index every consecutive pair of terms in the text as a phrase.
For example, the text "Friends, Romans and Countrymen" would generate the biwords:
- friends romans
- romans and
- and countrymen
Each of these is now a dictionary term.
Two-word phrase query-processing is now immediate.
Longer phrase queries
Longer phrases are processed as we did with wild-cards:
stanford university palo alto can be broken into the Boolean query on biwords:
stanford university AND university palo AND palo alto
Without examining the docs, we cannot verify that the docs matching the above Boolean query actually contain the phrase.
Extended biwords
Parse the indexed text and perform part-of-speech tagging (POST).
Bucket the terms into (say) Nouns (N) and articles/prepositions (X).
Now deem any string of terms of the form NX*N to be an extended biword.
Each such extended biword is now made a term in the dictionary.
Example: catcher in the rye → N X X N
Query processing
Given a query, parse it into N's and X's.
Segment the query into enhanced biwords.
Look up the index.
Issues: parsing longer queries into conjunctions. E.g., the query tangerine trees and marmalade skies is parsed into:
tangerine trees AND trees and marmalade AND marmalade skies
Other issues
- False positives, as noted before
- Index blowup due to bigger dictionary
Positional indexes
Store, for each term, entries of the form:
<number of docs containing term;
 doc1: position1, position2 … ;
 doc2: position1, position2 … ;
 etc.>
Positional index example
Can compress position values/offsets. Nevertheless, this expands postings storage substantially.

<be: 993427;
 1: 7, 18, 33, 72, 86, 231;
 2: 3, 149;
 4: 17, 191, 291, 430, 434;
 5: 363, 367, …>

Which of docs 1, 2, 4, 5 could contain "to be or not to be"?
Processing a phrase query
Extract inverted index entries for each distinct term: to, be, or, not.
Merge their doc:position lists to enumerate all positions with "to be or not to be".
to:
 2: 1, 17, 74, 222, 551; 4: 8, 16, 190, 429, 433; 7: 13, 23, 191; ...
be:
 1: 17, 19; 4: 17, 191, 291, 430, 434; 5: 14, 19, 101; ...
Same general method for proximity searches
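The position-list merge for one document can be sketched as follows (illustrative Python for the two-term case: find positions where the second term immediately follows the first):

```python
def phrase_match(pos1, pos2):
    """Given sorted position lists for two terms within one document,
    return the positions of term1 where term2 occurs at position + 1."""
    matches, i, j = [], 0, 0
    while i < len(pos1) and j < len(pos2):
        if pos2[j] == pos1[i] + 1:   # adjacent: phrase hit
            matches.append(pos1[i])
            i, j = i + 1, j + 1
        elif pos2[j] <= pos1[i]:     # term2 occurrence is too early
            j += 1
        else:                        # term2 occurrence is too late
            i += 1
    return matches

# Doc 4 from the example: "to" at 8, 16, 190, 429, 433; "be" at 17, 191, 291, 430, 434.
print(phrase_match([8, 16, 190, 429, 433], [17, 191, 291, 430, 434]))
# [16, 190, 429, 433]
```

For a proximity query, the adjacency test `pos2[j] == pos1[i] + 1` would relax to a window check such as `abs(pos2[j] - pos1[i]) <= k`.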
Proximity queries
Example: LIMIT! /3 STATUTE /3 FEDERAL /2 TORT
Here, /k means "within k words of".
Clearly, positional indexes can be used for such queries; biword indexes cannot.
Exercise: Adapt the linear merge of postings to handle proximity queries. Can you make it work for any value of k?
Positional index size
Can compress position values/offsets as we did with docs in the last lecture.
Nevertheless, this expands postings storage substantially.
Positional index size
Need an entry for each occurrence, not just once per document.
Index size depends on average document size:
- the average web page has < 1000 terms
- SEC filings, books, even some epic poems easily run to 100,000 terms
Consider a term with frequency 0.1%. Why does document size matter?

Document size | Postings | Positional postings
1000          | 1        | 1
100,000       | 1        | 100
Rules of thumb
Positional index size factor of 2-4 over non-positional index
Positional index size 35-50% of volume of original text
Caveat: all of this holds for “English-like” languages
Wild-card queries
Wild-card queries: *
mon*: find all docs containing any word beginning "mon".
Easy with a binary tree (or B-tree) lexicon: retrieve all words in the range mon ≤ w < moo.
*mon: find words ending in "mon": harder.
Maintain an additional B-tree for terms written backwards.
Now retrieve all words in the range nom ≤ w < non.
Exercise: from this, how can we enumerate all terms meeting the wild-card query pro*cent?
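One way to realize the range lookup, using a sorted array in place of the B-tree (an illustrative sketch; the toy lexicon is made up):

```python
import bisect

# A sorted lexicon stands in for the B-tree's leaf-level key sequence.
lexicon = sorted(["moat", "mom", "money", "monkey", "month", "moo", "moon"])

def prefix_range(prefix):
    """All words w with prefix <= w < next-prefix, e.g. mon <= w < moo
    for the query mon*.  The upper bound bumps the last character
    (illustrative; ignores the wrap-around case)."""
    lo = bisect.bisect_left(lexicon, prefix)
    hi = bisect.bisect_left(lexicon, prefix[:-1] + chr(ord(prefix[-1]) + 1))
    return lexicon[lo:hi]

print(prefix_range("mon"))  # ['money', 'monkey', 'month']
```

The *mon query works the same way against a second array holding each term reversed.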
Query processing
At this point, we have an enumeration of all terms in the dictionary that match the wild-card query.
We still have to look up the postings for each enumerated term.
E.g., consider the query: se*ate AND fil*er
This may result in the execution of many Boolean AND queries.
Permuterm index
For the term hello, index under:
hello$, ello$h, llo$he, lo$hel, o$hell, $hello
where $ is a special symbol.
Queries (rotate the query so the * ends up at the end of the lookup key):
- X: lookup on X$
- X*: lookup on $X*
- *X: lookup on X$*
- *X*: lookup on X*
- X*Y: lookup on Y$X*
- X*Y*Z: ??? Exercise!
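The rotation rule and the index lookup can be sketched as follows (illustrative Python with a toy dictionary; a plain dict stands in for the B-tree, so the prefix search is a scan):

```python
def permuterm_keys(term):
    """All rotations of term$; each indexed key maps back to the term."""
    t = term + "$"
    return [t[i:] + t[:i] for i in range(len(t))]

def rotate_query(query):
    """Rotate a single-* query X*Y so the * lands at the end: Y$X*."""
    x, y = query.split("*")
    return y + "$" + x   # prefix to search for in the permuterm index

# Toy dictionary indexed under permuterm keys.
index = {}
for term in ["hello", "help", "hollow"]:
    for key in permuterm_keys(term):
        index.setdefault(key, set()).add(term)

def permuterm_lookup(query):
    prefix = rotate_query(query)
    return sorted({t for k, terms in index.items()
                   if k.startswith(prefix) for t in terms})

print(rotate_query("h*llo"))      # llo$h
print(permuterm_lookup("h*llo"))  # ['hello']
```

Note that the same rule covers X* as a special case with an empty Y: rotate_query("hel*") yields the prefix $hel.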
Bigram indexes
Permuterm problem: ≈ quadruples lexicon size
Another way: index all k-grams occurring in any word (any sequence of k chars).
E.g., from the text "April is the cruelest month" we get the 2-grams (bigrams):
$a, ap, pr, ri, il, l$, $i, is, s$, $t, th, he, e$, $c, cr, ru, ue, el, le, es, st, t$, $m, mo, on, nt, h$
$ is a special word boundary symbol.
The index retrieves matching dictionary terms.
Bigram indexes - example
Each bigram points to the dictionary terms containing it, e.g.:
$m → mace, madden, ...
mo → among, amortize, ...
on → among, around, ...
Processing n-gram wild-cards
Query mon* can now be run as $m AND mo AND on
Fast, space efficient, returns terms that are then run against the dictionary.
But we’d enumerate moon. Must post-filter these terms against query.
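Putting the pieces together, a sketch of the bigram lookup with post-filtering (illustrative Python; the toy dictionary is made up):

```python
import re
from collections import defaultdict

def bigrams(term):
    """Character bigrams of a term, with $ as the word boundary symbol."""
    t = "$" + term + "$"
    return {t[i:i + 2] for i in range(len(t) - 1)}

# Toy bigram index over a small dictionary.
index = defaultdict(set)
for word in ["moon", "month", "mortar", "salmon"]:
    for g in bigrams(word):
        index[g].add(word)

def wildcard(query):
    """Answer a wildcard query such as mon* via the bigram index."""
    t = "$" + query + "$"
    grams = {t[i:i + 2] for i in range(len(t) - 1) if "*" not in t[i:i + 2]}
    candidates = set.intersection(*(index[g] for g in grams))
    # Post-filter: bigram containment is necessary but not sufficient
    # (moon contains $m, mo and on, yet does not match mon*).
    pattern = re.compile(query.replace("*", ".*") + "$")
    return sorted(term for term in candidates if pattern.match(term))

print(wildcard("mon*"))  # ['month']
```

Here moon survives the Boolean step ($m AND mo AND on) but is removed by the post-filter.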
Processing wild-card queries
As before, we must execute a Boolean query for each enumerated, filtered term.
Wild-cards can result in expensive query execution.
Avoid encouraging "laziness" in the UI:

  [ Search ]
  Type your search terms, use '*' if you need to.
  E.g., Alex* will match Alexander.
Resources for this lecture
Managing Gigabytes, Chapters 3.2, 3.6, 4.3
Modern Information Retrieval, Chapter 8.2
Shakespeare: http://www.rhymezone.com/shakespeare/ (try the neat browse-by-keyword-sequence feature!)
Porter's stemmer: http://www.sims.berkeley.edu/~hearst/irbook/porter.html
H.E. Williams, J. Zobel, and D. Bahle, "Fast Phrase Querying with Combined Indexes", ACM Transactions on Information Systems. http://www.seg.rmit.edu.au/research/research.php?author=4