Information Retrieval Techniques MS(CS) Lecture 5 AIR UNIVERSITY MULTAN CAMPUS.


Quick Review

• Inverted Index Construction (Exercise)
• Query Processing Using Inverted Index
• Faster Posting Merges: Skip Pointers
• Phrase Queries
  – Bi-word Index
  – Extended Bi-Word Index
  – Positional Index


Proximity queries

• LIMIT! /3 STATUTE /3 FEDERAL /2 TORT
  – Again, here, /k means “within k words of”.

• Clearly, positional indexes can be used for such queries; biword indexes cannot.

• Exercise: Adapt the linear merge of postings to handle proximity queries. Can you make it work for any value of k?
  – This is a little tricky to do correctly and efficiently
  – See Figure 2.12 of IIR
  – There’s likely to be a problem on it!

Sec. 2.4.2


Proximity search


We just saw how to use a positional index for phrase searches.

We can also use it for proximity search. For example: employment /4 place
Find all documents that contain EMPLOYMENT and PLACE within 4 words of each other.
  “Employment agencies that place healthcare workers are seeing growth” is a hit.
  “Employment agencies that have learned to adapt now place healthcare workers” is not a hit.


Proximity search?


Use the positional index.
Simplest algorithm: look at the cross-product of positions of (i) EMPLOYMENT in the document and (ii) PLACE in the document.

Very inefficient for frequent words, especially stop words

Note that we want to return the actual matching positions, not just a list of documents.

This is important for dynamic summaries etc.
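Below is a minimal Python sketch of a linear merge over the two sorted position lists, in the spirit of Figure 2.12 of IIR rather than the naive cross-product. The dict-of-sorted-position-lists layout and the example position values are assumptions for illustration only.

```python
def proximity_intersect(postings1, postings2, k):
    """Return (docID, pos1, pos2) triples where the two terms occur within
    k words of each other.  postings1/postings2 are assumed to map
    docID -> sorted list of word positions (illustrative layout)."""
    answer = []
    for doc in sorted(postings1.keys() & postings2.keys()):
        pp1, pp2 = postings1[doc], postings2[doc]
        window, j = [], 0
        for pos1 in pp1:
            # pull in positions of the second term up to pos1 + k
            while j < len(pp2) and pp2[j] <= pos1 + k:
                window.append(pp2[j])
                j += 1
            # drop positions that are now too far behind pos1
            window = [p for p in window if p >= pos1 - k]
            for pos2 in window:
                answer.append((doc, pos1, pos2))
    return answer

# employment /4 place, with made-up position lists for two documents
employment = {1: [3, 20], 2: [7]}
place = {1: [6], 2: [50]}
print(proximity_intersect(employment, place, 4))   # [(1, 3, 6)]
```

Note that the function returns the matching positions, not just the docIDs, which is what a dynamic-summary component would need.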


Positional index size

• You can compress position values/offsets.
• Nevertheless, a positional index expands postings storage substantially.
• Nevertheless, a positional index is now standardly used because of the power and usefulness of phrase and proximity queries … whether used explicitly or implicitly in a ranking retrieval system.

Sec. 2.4.2


Positional index size

• Need an entry for each occurrence, not just once per document
• Index size depends on average document size
  – Average web page has <1000 terms
  – SEC filings, books, even some epic poems … easily 100,000 terms
• Consider a term with frequency 0.1%. Why does document size matter? The expected number of positional entries is roughly term frequency × document length, while the non-positional posting is a single entry per document:

  Document size    Postings    Positional postings
  1000             1           1
  100,000          1           100

Sec. 2.4.2


Rules of thumb

• A positional index is 2–4 times as large as a non-positional index

• Positional index size is 35–50% of the volume of the original text

• Caveat: all of this holds for “English-like” languages

Sec. 2.4.2


Combination schemes

• These two approaches can be profitably combined
  – For particular phrases (“Michael Jackson”, “Britney Spears”) it is inefficient to keep on merging positional postings lists
  – Even more so for phrases like “The Who”
• Williams et al. (2004) evaluate a more sophisticated mixed indexing scheme
  – A typical web query mixture was executed in ¼ of the time of using just a positional index
  – It required 26% more space than having a positional index alone

Sec. 2.4.3


Inverted Index Construction

• Positional index size
• Dictionary size
• Hardware issues
• Large collection requirements analysis


Inverted index


Dictionaries

The dictionary is the data structure for storing the term vocabulary.

Term vocabulary: the data
Dictionary: the data structure for storing the term vocabulary


Dictionary as array of fixed-width entries

For each term, we need to store a couple of items:
  document frequency
  pointer to postings list
  . . .

Assume for the time being that we can store this information in a fixed-length entry.

Assume that we store these entries in an array.


Dictionary as array of fixed-width entries

Space needed per entry: 20 bytes (term), 4 bytes (document frequency), 4 bytes (pointer to postings list).
How do we look up a query term qi in this array at query time? That is: which data structure do we use to locate the entry (row) in the array where qi is stored?
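As a rough illustration of the fixed-width layout (20-byte term, 4-byte document frequency, 4-byte postings pointer), here is a small Python sketch using struct; the terms and pointer offsets are made up, and the only point is that row i starts at byte i * entry_size, so it can be addressed in O(1).

```python
import struct

ENTRY = struct.Struct("=20sii")   # term (20 bytes), df (4 bytes), postings pointer (4 bytes)

rows = [("brutus", 2, 0), ("caesar", 2, 16), ("calpurnia", 1, 40)]  # made-up pointer offsets
dictionary = bytearray()
for term, df, ptr in rows:
    dictionary += ENTRY.pack(term.encode("ascii"), df, ptr)

# Because every entry has the same size, row i starts at byte i * ENTRY.size.
term, df, ptr = ENTRY.unpack_from(dictionary, 1 * ENTRY.size)
print(term.rstrip(b"\x00"), df, ptr)   # b'caesar' 2 16
```

The remaining question, how to find the row for a given query term, is what the next slides on hashes and trees address.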


Data structures for looking up term

Two main classes of data structures: hashes and trees.
Some IR systems use hashes, some use trees.
Criteria for when to use hashes vs. trees:
  Is there a fixed number of terms or will it keep growing?
  What are the relative frequencies with which various keys will be accessed?
  How many terms are we likely to have?


Hashes
Each vocabulary term is hashed into an integer.
Try to avoid collisions.
At query time, do the following: hash the query term, resolve collisions, locate the entry in the fixed-width array.
Pros: Lookup in a hash is faster than lookup in a tree. Lookup time is constant.
Cons:
  no way to find minor variants (resume vs. résumé)
  no prefix search (all terms starting with automat)
  need to rehash everything periodically if the vocabulary keeps growing


Trees
Trees solve the prefix problem (find all terms starting with automat).
Simplest tree: binary tree.
Search is slightly slower than in hashes: O(log M), where M is the size of the vocabulary.
  O(log M) only holds for balanced trees.
  Rebalancing binary trees is expensive.
  B-trees mitigate the rebalancing problem.
B-tree definition: every internal node has a number of children in the interval [a, b], where a, b are appropriate positive integers, e.g., [2, 4].


Binary tree


B-tree


Index construction

• How do we construct an index?
• What strategies can we use with limited main memory?

Ch. 4


Hardware basics

• Many design decisions in information retrieval are based on the characteristics of hardware

• We begin by reviewing hardware basics

Sec. 4.1


Hardware basics

• Access to data in memory is much faster than access to data on disk.

• Disk seeks: No data is transferred from disk while the disk head is being positioned.

• Therefore: Transferring one large chunk of data from disk to memory is faster than transferring many small chunks.

• Disk I/O is block-based: Reading and writing of entire blocks (as opposed to smaller chunks).

• Block sizes: 8KB to 256 KB.

Sec. 4.1


Hardware basics

• Servers used in IR systems now typically have several GB of main memory, sometimes tens of GB.

• Available disk space is several (2–3) orders of magnitude larger.

• Fault tolerance is very expensive: It’s much cheaper to use many regular machines rather than one fault tolerant machine.

Sec. 4.1


• Documents are parsed to extract words and these are saved with the Document ID.

Doc 1: I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.

Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious

Recall IIR 1 index construction. The (term, docID) pairs, in order of occurrence:
I 1, did 1, enact 1, julius 1, caesar 1, I 1, was 1, killed 1, i' 1, the 1, capitol 1, brutus 1, killed 1, me 1, so 2, let 2, it 2, be 2, with 2, caesar 2, the 2, noble 2, brutus 2, hath 2, told 2, you 2, caesar 2, was 2, ambitious 2

Sec. 4.2


Before sorting (term, docID):
I 1, did 1, enact 1, julius 1, caesar 1, I 1, was 1, killed 1, i' 1, the 1, capitol 1, brutus 1, killed 1, me 1, so 2, let 2, it 2, be 2, with 2, caesar 2, the 2, noble 2, brutus 2, hath 2, told 2, you 2, caesar 2, was 2, ambitious 2

After sorting by term:
ambitious 2, be 2, brutus 1, brutus 2, capitol 1, caesar 1, caesar 2, caesar 2, did 1, enact 1, hath 2, I 1, I 1, i' 1, it 2, julius 1, killed 1, killed 1, let 2, me 1, noble 2, so 2, the 1, the 2, told 2, was 1, was 2, with 2, you 2

Key step

• After all documents have been parsed, the inverted file is sorted by terms.

We focus on this sort step. We have 100M items to sort.

Sec. 4.2
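A toy Python sketch of this key sort step on the two example documents: collect the (term, docID) pairs, sort them, and group them into postings lists. At 100M items this purely in-memory approach is exactly what breaks down, which motivates the next slides.

```python
from itertools import groupby

docs = {
    1: "I did enact Julius Caesar I was killed i' the Capitol Brutus killed me",
    2: "So let it be with Caesar The noble Brutus hath told you Caesar was ambitious",
}

pairs = [(term.lower(), doc_id)
         for doc_id, text in docs.items()
         for term in text.split()]
pairs.sort()                                   # sort by term, then by docID

index = {term: sorted({d for _, d in group})   # postings list per term
         for term, group in groupby(pairs, key=lambda p: p[0])}

print(index["brutus"], index["caesar"], index["ambitious"])   # [1, 2] [1, 2] [2]
```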


Scaling index construction

• In-memory index construction does not scale
  – Can’t stuff entire collection into memory, sort, then write back
• How can we construct an index for very large collections?
• Taking into account the hardware constraints we just learned about . . .
• Memory, disk, speed, etc.

Sec. 4.2


Sort-based index construction
• As we build the index, we parse docs one at a time.
  – While building the index, we cannot easily exploit compression tricks (you can, but it is much more complex)
• The final postings for any term are incomplete until the end.
• At 12 bytes per non-positional postings entry (term, doc, freq), this demands a lot of space for large collections.
• T = 100,000,000 in the case of RCV1
  – So … we can do this in memory in 2009, but typical collections are much larger. E.g., the New York Times provides an index of >150 years of newswire.
• Thus: we need to store intermediate results on disk.

Sec. 4.2


Sort using disk as “memory”?

• Can we use the same index construction algorithm for larger collections, but by using disk instead of memory?

• No: Sorting T = 100,000,000 records on disk is too slow – too many disk seeks.

• We need an external sorting algorithm.

Sec. 4.2


RCV1 collection

Shakespeare’s collected works are not large enough for demonstrating many of the points in this course.

As an example for applying scalable index construction algorithms, we will use the Reuters RCV1 collection.

English newswire articles sent over the wire in 1995 and 1996 (one year).


“External” sorting algorithm (using few disk seeks)

We must sort T = 100,000,000 non-positional postings.
Each posting has size 12 bytes (4 + 4 + 4: termID, docID, document frequency).
Define a block to consist of 10,000,000 such postings.
  We can easily fit that many postings into memory.
  We will have 10 such blocks for RCV1.
Basic idea of the algorithm:
  For each block: (i) accumulate postings, (ii) sort in memory, (iii) write to disk.
  Then merge the blocks into one long sorted order.
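A condensed Python sketch of this blocked, sort-based approach under simplifying assumptions (postings arrive as (termID, docID) tuples, runs are written as plain text files, and the helper names are illustrative): sort each block in memory, spill it as a sorted run, then do a k-way merge of the runs.

```python
import heapq, os, tempfile

def external_sort(postings_stream, block_size=10_000_000):
    """Yield (termID, docID) postings in globally sorted order, using
    sorted on-disk runs of at most block_size postings each."""
    run_paths, block = [], []
    for posting in postings_stream:            # posting = (termID, docID)
        block.append(posting)
        if len(block) >= block_size:
            run_paths.append(write_run(sorted(block)))
            block = []
    if block:
        run_paths.append(write_run(sorted(block)))
    runs = [read_run(p) for p in run_paths]
    yield from heapq.merge(*runs)              # k-way merge of the sorted runs
    for p in run_paths:
        os.remove(p)

def write_run(sorted_block):
    fd, path = tempfile.mkstemp(suffix=".run")
    with os.fdopen(fd, "w") as f:
        for term_id, doc_id in sorted_block:
            f.write(f"{term_id}\t{doc_id}\n")
    return path

def read_run(path):
    with open(path) as f:                      # streams one line at a time
        for line in f:
            t, d = line.split("\t")
            yield int(t), int(d)
```

With block_size = 10,000,000 this produces the 10 sorted runs mentioned above for RCV1.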


Merging two blocks


Blocked Sort-Based Indexing

Key decision: What is the size of one block?


Problem with sort-based algorithm

Our assumption was: we can keep the dictionary in memory.

We need the dictionary (which grows dynamically) in order to implement a term to termID mapping.

Actually, we could work with term,docID postings instead of termID,docID postings . . .

. . . but then intermediate files become very large. (We would end up with a scalable, but very slow index construction method.)


Single-pass in-memory indexing

Abbreviation: SPIMI
Key idea 1: Generate separate dictionaries for each block – no need to maintain a term–termID mapping across blocks.
Key idea 2: Don’t sort. Accumulate postings in postings lists as they occur.
With these two ideas we can generate a complete inverted index for each block.
These separate indexes can then be merged into one big index.
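A minimal Python sketch of the SPIMI idea, with two simplifying assumptions: the token stream yields (term, docID) pairs document by document, and “memory is full” is approximated by a posting count; the block file format and names are illustrative.

```python
def spimi_invert(token_stream, max_postings=10_000_000):
    """Build one in-memory dictionary per block (term -> postings list),
    writing the block out and starting afresh whenever it gets 'full'."""
    dictionary, n_postings, block_no = {}, 0, 0
    for term, doc_id in token_stream:
        postings = dictionary.setdefault(term, [])
        if not postings or postings[-1] != doc_id:   # docIDs arrive in order
            postings.append(doc_id)
            n_postings += 1
        if n_postings >= max_postings:
            yield write_block(dictionary, block_no)
            dictionary, n_postings, block_no = {}, 0, block_no + 1
    if dictionary:
        yield write_block(dictionary, block_no)

def write_block(dictionary, block_no):
    path = f"spimi_block_{block_no}.txt"             # illustrative file name
    with open(path, "w") as f:
        for term in sorted(dictionary):              # terms are sorted only at write time
            f.write(term + "\t" + ",".join(map(str, dictionary[term])) + "\n")
    return path
```

The block files produced this way are then merged into the final index, as stated above.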


Using wildcards in queries


Wildcard queries

mon*: find all docs containing any term beginning with mon
  Easy with a B-tree dictionary: retrieve all terms t in the range mon ≤ t < moo
*mon: find all docs containing any term ending with mon
  Maintain an additional tree for terms spelled backwards
  Then retrieve all terms t in the range nom ≤ t < non
Result: a set of terms that match the wildcard query
Then retrieve the documents that contain any of these terms
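A small sketch of the range lookup with a sorted Python list standing in for the B-tree (bisect gives the same “all t in [mon, moo)” behaviour); the toy vocabulary is made up.

```python
import bisect

vocabulary = sorted(["moat", "mom", "monday", "money", "monitor", "moo", "moon"])

def prefix_terms(prefix, vocab):
    """All terms t with prefix <= t < the next string after the prefix range."""
    lo = bisect.bisect_left(vocab, prefix)
    upper = prefix[:-1] + chr(ord(prefix[-1]) + 1)   # "mon" -> "moo"
    hi = bisect.bisect_left(vocab, upper)
    return vocab[lo:hi]

print(prefix_terms("mon", vocabulary))   # ['monday', 'money', 'monitor']

# For *mon, keep a second sorted list of reversed terms and look up the prefix "nom".
```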


How to handle * in the middle of a term

Example: m*nchen
We could look up m* and *nchen in the B-tree and intersect the two term sets.
  Expensive
Alternative: the permuterm index
Basic idea: Rotate every wildcard query so that the * occurs at the end.
Store each of these rotations in the dictionary, say, in a B-tree.


Permuterm index

For term HELLO: add hello$, ello$h, llo$he, lo$hel, and o$hell to the B-tree where $ is a special symbol


Permuterm → term mapping


Permuterm index
For HELLO, we’ve stored: hello$, ello$h, llo$he, lo$hel, and o$hell
Queries:
  For X, look up X$
  For X*, look up X*$
  For *X, look up X$*
  For *X*, look up X*
  For X*Y, look up Y$X*
  Example: For hel*o, look up o$hel*
“Permuterm index” would better be called a permuterm tree. But permuterm index is the more common name.
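A short Python sketch of the permuterm mechanics: generating the rotations of term$ (the rotation starting with $ is included here as well), and rotating a query containing a single * so that the * lands at the end, as described on the next slide.

```python
def permuterm_rotations(term):
    """All rotations of term + '$' (for hello: hello$, ello$h, ..., $hello)."""
    t = term + "$"
    return [t[i:] + t[:i] for i in range(len(t))]

def rotate_wildcard_query(query):
    """Rotate a query containing one * so the * lands at the end; the part
    before the * is then a prefix lookup in the permuterm dictionary."""
    q = query + "$"
    star = q.index("*")
    return q[star + 1:] + q[:star] + "*"

print(permuterm_rotations("hello"))
# ['hello$', 'ello$h', 'llo$he', 'lo$hel', 'o$hell', '$hello']
print(rotate_wildcard_query("hel*o"))   # o$hel*
```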


Processing a lookup in the permuterm index

Rotate the query wildcard to the right.
Use B-tree lookup as before.
Problem: The permuterm index more than quadruples the size of the dictionary compared to a regular B-tree (an empirical number).


k-gram indexes

More space-efficient than the permuterm index
Enumerate all character k-grams (sequences of k characters) occurring in a term
2-grams are called bigrams.
Example: from “April is the cruelest month” we get the bigrams:
  $a ap pr ri il l$ $i is s$ $t th he e$ $c cr ru ue el le es st t$ $m mo on nt th h$
$ is a special word boundary symbol, as before.
Maintain an inverted index from bigrams to the terms that contain the bigram.
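A minimal Python sketch of building such a bigram-to-terms index; the small vocabulary is made up for illustration.

```python
from collections import defaultdict

def kgrams(term, k=2):
    """Character k-grams of a term, with $ as the boundary symbol:
    kgrams('april') -> ['$a', 'ap', 'pr', 'ri', 'il', 'l$']."""
    padded = "$" + term + "$"
    return [padded[i:i + k] for i in range(len(padded) - k + 1)]

def build_kgram_index(vocabulary, k=2):
    """Inverted index from k-grams to the vocabulary terms containing them."""
    index = defaultdict(set)
    for term in vocabulary:
        for gram in kgrams(term, k):
            index[gram].add(term)
    return index

vocab = ["april", "is", "the", "cruelest", "month", "moon", "monday"]
kgram_index = build_kgram_index(vocab)
print(sorted(kgram_index["on"]))   # ['monday', 'month', 'moon']
```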


Postings list in a 3-gram inverted index


k-gram (bigram, trigram, . . . ) indexes

Note that we now have two different types of inverted indexes

The term-document inverted index for finding documents based on a query consisting of terms

The k-gram index for finding terms based on a query consisting of k-grams


Processing wildcarded terms in a bigram index

Query mon* can now be run as: $m AND mo AND on
  This gets us all terms with the prefix mon . . .
  . . . but also many “false positives” like MOON.
  We must postfilter these terms against the query.
  Surviving terms are then looked up in the term–document inverted index.
k-gram index vs. permuterm index:
  The k-gram index is more space-efficient.
  The permuterm index doesn’t require postfiltering.
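Continuing the k-gram sketch above (reusing kgrams and kgram_index), the query mon* becomes an intersection of bigram postings followed by the postfilter step:

```python
def wildcard_prefix(prefix, kgram_index):
    """Answer prefix* via the bigram index: intersect the term sets for
    $m, mo, on, ... then postfilter out false positives such as moon."""
    grams = kgrams(prefix)[:-1]             # drop the trailing gram "n$": mon -> $m, mo, on
    candidates = set.intersection(*(kgram_index[g] for g in grams))
    return sorted(t for t in candidates if t.startswith(prefix))

print(wildcard_prefix("mon", kgram_index))  # ['monday', 'month']  (moon is filtered out)
```

The surviving terms would then be looked up in the term–document index, as the slide describes.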


Exercise
Google has very limited support for wildcard queries. For example, this query doesn’t work very well on Google: [gen* universit*]
  Intention: you are looking for the University of Geneva, but don’t know which accents to use for the French words for university and Geneva.
According to Google search basics, 2010-04-29: “Note that the * operator works only on whole words, not parts of words.”
But this is not entirely true. Try [pythag*] and [m*nchen].
Exercise: Why doesn’t Google fully support wildcard queries?


Processing wildcard queries in the term-document index
Problem 1: We must potentially execute a large number of Boolean queries.
  Most straightforward semantics: conjunction of disjunctions
  For [gen* universit*]: geneva university OR geneva université OR genève university OR genève université OR general universities OR . . .
  Very expensive
Problem 2: Users hate to type.
  If abbreviated queries like [pyth* theo*] for [pythagoras’ theorem] are allowed, users will use them a lot.
  This would significantly increase the cost of answering queries.
  Somewhat alleviated by Google Suggest.


Spelling correction
Two principal uses:
  Correcting documents being indexed
  Correcting user queries
Two different methods for spelling correction:
  Isolated word spelling correction
    Check each word on its own for misspelling
    Will not catch typos resulting in correctly spelled words, e.g., an asteroid that fell form the sky
  Context-sensitive spelling correction
    Look at surrounding words
    Can correct the form/from error above


Correcting documents

We’re not interested in interactive spelling correction of documents (e.g., MS Word) in this class.

In IR, we use document correction primarily for OCR’ed documents. (OCR = optical character recognition)

The general philosophy in IR is: don’t change the documents.


Correcting queries

First: isolated word spelling correction
Premise 1: There is a list of “correct words” from which the correct spellings come.
Premise 2: We have a way of computing the distance between a misspelled word and a correct word.
Simple spelling correction algorithm: return the “correct” word that has the smallest distance to the misspelled word.
Example: informaton → information
For the list of correct words, we can use the vocabulary of all words that occur in our collection.
Why is this problematic?
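A compact Python sketch of this simple algorithm using Levenshtein (edit) distance; the tiny vocabulary is illustrative only, standing in for the collection’s term vocabulary.

```python
def edit_distance(a, b):
    """Levenshtein distance by dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # delete ca
                           cur[j - 1] + 1,               # insert cb
                           prev[j - 1] + (ca != cb)))    # substitute
        prev = cur
    return prev[-1]

def correct(word, vocabulary):
    """Return the vocabulary word closest to the (possibly misspelled) word."""
    return min(vocabulary, key=lambda v: edit_distance(word, v))

vocab = ["information", "retrieval", "formation", "informal"]
print(correct("informaton", vocab))   # information
```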


Alternatives to using the term vocabulary

A standard dictionary (Webster’s, OED, etc.)
An industry-specific dictionary (for specialized IR systems)
The term vocabulary of the collection, appropriately weighted


Hash function for strings (example from CENG 213 Data Structures)

key = "ali", KeySize = 3; character codes: 'a' = 97, 'l' = 108, 'i' = 105
hash(key) = (key[0] * 37^2 + key[1] * 37 + key[2]) % TableSize, with TableSize = 10,007
hash("ali") = (105 * 1 + 108 * 37 + 97 * 37^2) % 10,007 = 6803
So "ali" is placed in slot 6803 of the hash table (slots 0 … 10,006).
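The same computation as a short Python sketch (Horner’s rule with multiplier 37, so the first character carries the highest power of 37):

```python
def string_hash(key, table_size=10_007):
    """Polynomial string hash: (sum of key[i] * 37**(KeySize-1-i)) mod TableSize."""
    h = 0
    for ch in key:
        h = h * 37 + ord(ch)     # Horner's rule
    return h % table_size

print(string_hash("ali"))        # 6803, matching the worked example above
```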


QUESTIONS?