Avishek Anand - KBS · Avishek Anand Maintains statistics and information about the indexed unit...

Indexing and Querying

Avishek Anand

✤ Inverted Indexing basics revisited

✤ Indexing Static Collections

✤ Dictionaries

✤ Forward Index

✤ Inverted Index Organisation

✤ Scalable Indexing

✤ Indexing Dynamic Collections

Inverted Index Construction and Maintenance

Avishek Anand

✤ Query Processing over Document-id ordered lists

✤ Document-at-a-Time vs Term-at-a-Time Processing

✤ WAND Processing

✤ Top-K Processing

✤ Fagin’s top-k

✤ TA, NRA and CA

✤ Supporting advanced queries - phrases, proximity-aware, temporal queries

Query Processing over Inverted Indexes

Avishek Anand

✤ Why do we index text collections ?

✤ How do we index documents ?

✤ What are the data structures ?

✤ What are the design decisions for organising the index?

✤ How do we index huge collections ?

✤ How do we index evolving or dynamic collections ?

Text Collections and Indexing

Avishek Anand

✤ Why do we index text collections ?

✤ How do we index documents ?

✤ What are the data structures ?

✤ What are the design decisions for organising the index?

✤ How do we index huge collections ?

✤ How do we index evolving or dynamic collections ?

Text Collections and Indexing

Efficient document retrieval

lexicon, inverted lists

document order, score order

distributed indexing, term/doc partitioning

index maintenance strategies

Avishek Anand

Terminology Recap

information retrieval

Lexicon

queries, results

terms, documents, collection

index, lexicon, posting, posting list

stemming, !stop-word rem

(d1 , 2, <2, 15>)

(d4 , 3, <2, 15, 23, >)

(d34 , 3, <2, 15, 23>)

…….

(d1 , 2, <3, 16>)

(d4 , 2, <23, 16>)

(d31 , 4, <8, 19, 30>)

…….

Avishek Anand

✤ Maintains statistics and information about the indexed unit (word, n-gram etc)

✤ Posting list location - for posting list retrieval

✤ Term identifier - for term lookups, matching and range queries

✤ document frequency and associated statistics - for ranking

✤ Data Structures for Lexicon

✤ Hash-based Lexicon

✤ B+-Tree based Lexicon

Lexicon or Dictionary

< hannover ; location: 82271; tid:12 ; df:23, … >

Avishek Anand

Hash-Based Lexicon

Information Retrieval: Implementing and Evaluating Search Engines · c⃝MIT Press, 2010 · DRAFT108

!"#$%!&%#'%(())*+,*-

'%./%0123%4%!!(,(5*6)6,

##'73#'$5,+8*)*-

9%1":;<=%4>:20)),5,?+,+

&'%0:9=@!12:4)*-,,5+6*

/#47!0#&%!)8-5)+5?)

><A:1%??-*(5*+5

241%'0%!!:'<65-))+-66

.:0@!#=/%6)6(,,855

!"#$%&"'()*"++",-

./((0#0/1%2$"01%*(013)4%(0#&-

Figure 4.2 Dictionary data structure based on a hash table with 210 = 1024 entries (data extractedfrom schema-independent index for TREC45). Terms with the same hash value are arranged in a linkedlist (chaining). Each term descriptor contains the term itself, the position of the term’s postings list,and a pointer to the next entry in the linked list.

in GOV2 is 9.2 bytes. Storing each term in a fixed-size memory region of 20 bytes wastes 10.8bytes per term on average (internal fragmentation).

One way to eliminate the internal fragmentation is to not store the index terms themselves inthe array, but only pointers to them. For example, the search engine could maintain a primarydictionary array, containing 32-bit pointers into a secondary array. The secondary array thencontains the actual dictionary entries, consisting of the terms themselves and the correspondingpointers into the postings file. This way of organizing the search engine’s dictionary data isshown in Figure 4.3. It is sometimes referred to as the dictionary-as-a-string approach, becausethere are no explicit delimiters between two consecutive dictionary entries; the secondary arraycan be thought of as a long, uninterrupted string.

For the GOV2 collection, the dictionary-as-a-string approach, compared to the dictionarylayout shown in Figure 4.1, reduces the dictionary’s storage requirements by 10.8 − 4 = 6.8bytes per entry. Here the term 4 stems from the pointer overhead in the primary array; theterm 10.8 corresponds to the complete elimination of any internal fragmentation.

It is worth pointing out that the term strings stored in the secondary array do not require anexplicit termination symbol (e.g., the “\0” character), because the length of each term in thedictionary is implicitly given by the pointers in the primary array. For example, by looking atthe pointers for “shakespeare” and “shakespearean” in Figure 4.3, we know that the dictionaryentry for “shakespeare” requires 16629970−16629951 = 19 bytes in total: 11 bytes for the termplus 8 bytes for the 64-bit file pointer into the postings file.

Avishek Anand

✤ Constant lookups based on a Hash table

✤ Entire Lexicon loaded to the memory

Hash-Based Lexicon

!"#$%!&%#'%(())*+,*-

'%./%0123%4%!!(,(5*6)6,

##'73#'$5,+8*)*-

9%1":;<=%4>:20)),5,?+,+

&'%0:9=@!12:4)*-,,5+6*

/#47!0#&%!)8-5)+5?)

><A:1%??-*(5*+5

241%'0%!!:'<65-))+-66

.:0@!#=/%6)6(,,855

!"#$%&"'()*"++",-

./((0#0/1%2$"01%*(013)4%(0#&-

Avishek Anand

✤ Constant lookups based on a Hash table

✤ Entire Lexicon loaded to the memory

Hash-Based Lexicon

!"#$%!&%#'%(())*+,*-

'%./%0123%4%!!(,(5*6)6,

##'73#'$5,+8*)*-

9%1":;<=%4>:20)),5,?+,+

&'%0:9=@!12:4)*-,,5+6*

/#47!0#&%!)8-5)+5?)

><A:1%??-*(5*+5

241%'0%!!:'<65-))+-66

.:0@!#=/%6)6(,,855

!"#$%&"'()*"++",-

./((0#0/1%2$"01%*(013)4%(0#&-

✤ Updates difficult

✤ Range Searches, Matching, Substring queries not supported

Avishek Anand

✤ B+-Tree: Leaf nodes additionally linked for efficient range search

✤ Supports lookups in O(log n) and range searches in O(log n + k)

✤ Vocabulary dynamics (i.e., new or removed terms) no problem

✤ Works on secondary storage

B+-Tree or Sort-based Lexicon

[aardvark, tid:3, df:3, …]

[a-i][j-z]

[j-k][l-q][r-z][a-d][e-f][g-i]

[a-b][c][d] [e][f] [g][h][i] … … …

[aalborg, tid:7, df:2, …]

Avishek Anand

• Mapping of doc-ids to term-ids in the same order

• Efficient retrieval of terms from (already parsed) text

• snippet generation

• proximity features for proximity-aware ranking

• per-doc term distribution for query expansions

Forward Index

1: “what does the fox say ?”

124 53 1 49935 1001:

Avishek Anand

✤ Inverted index is a collection of posting lists

✤ Posting contains document identifiers (as integers) along with scores (integers or doubles) and possibly positions (as integers)

✤ Postings list can be organised according to

✤ document identifiers - document ordering

✤ scores - Impact ordering

✤ What are the merits of these orderings ?

Inverted Index

information

(d1 , 2, <2, 15>)

(d4 , 5, <2, 15, 23, >)

(d34 , 3, <2, 15, 23>)

…….

Avishek Anand

✤ Based on faster intersections

✤ High compression of index using gap encoding of dids

✤ Easily updatable

Document Order vs Score Order

✤ Based on processing Top-k results fast

✤ Low compression ratio

✤ Difficult to update

Document Ordering Score/Impact Ordering

Index organisation depends on query processing style.

Avishek Anand

✤ We are given a set of documents D, where each document d is considered as a bag of terms

✤ Inverted Lists are created by a process termed as Inversion

✤ Memory-based Inversion

✤ Takes place entirely in-memory

✤ For small collections, where the index + lexicon fits in memory

✤ Disk-based Inversion

✤ Sort-based inversion vs Merge-based inversion

Inverted Index Construction

Avishek Anand

✤ A dictionary is required that allows efficient single-term lookup and insertion operations

✤ An extensible (i.e., dynamic) list data structure is needed that is used to store the postings for each

Memory-based Inversion

1: “what does the fox say ?”

2: “the fox jumped over the fence”

[the, <3>] [fox, <4>]

[the, <1,5>] [fox, <2>]

dictionary1:[the, <1,5>] “the”: [1, <3>] [2, <1,5>]

[term, positions]

…. ….

doc: [term, positions] [term, posting list]

Avishek Anand

✤ Input Collection D >> memory size M

✤ Inversion can be seen as a sort operation on the term identifiers

✤ This method is based on external sort over data which does not fit into the memory

✤ Read data of size M into memory, sort them and write back to disk

✤ Multiway merge of D/M sorted lists to create index

✤ Shortcomings

✤ Dictionary might not fit in-memory

✤ Large memory requirements due to intermediate data

Sort-based Inversion

Avishek Anand

✤ What is the estimated cost of sort-based Inversion in terms of N,M and c ?

✤ How does the cost compare with in-memory sort-based inversion (assuming we had enough memory or N > M) ?

Exercise 1: Analysis of Sort-based Inversion

Total number of postings = N

Number of postings which fit in memory = M

Cost of disk read/write of a posting = c

Simple Computational Model

Avishek Anand

✤ Generalisation of in-memory indexing

✤ Reads input collection to create an in-memory index of size M and write it to disk to create partial indexes with local lexicons

✤ Compression in posting lists in partial indexes

✤ Multiway Merge of corresponding lists from the partial indexes to create one consolidated index

Merge-based Inversion

partial indexes of size M

Avishek Anand

✤ Programming paradigm for distributed data processing

✤ Improves overall throughput by parallelising loading of data

✤ Data is partitioned into the nodes which process the data in the following phases

✤ Map : Generates (key, value) pairs

✤ Shuffle : Shuffles the pairs over the network to the reducers

✤ Reduce : operates on all values for the same key are

Map-Reduce crash course

Avishek Anand

Map-Reduce Example : Word Count

1: “what does the fox say ?” 2: “the fox jumped over the fence”

what : 1does : 1

the : 1

fox : 1

say : 1

Mapper - 1 jumped : 1over : 1

the : 2

fox : 1

fence : 1

Mapper - 2

Shuffle + Sort

Reducer - 1

what : 1does : 1 the : 1fox : 1 say : 1jumped : 1 over : 1

the : 2fox : 1

fence : 1

Reducer - 2

mappers emit <word, freq>

reducers aggr. freq.

Avishek Anand

✤ How would you build the inverted index using Map-reduce ?

✤ What are the key-value pairs as defined by the Mapper ?

✤ What does the reducer do with the values of the same key ?

Exercise 2: Index Construction using Map-Reduce

Avishek Anand

Indexing Dynamic Collections

How do we deal with dynamically growing collections ?

✤ Real world document collections are often dynamic

✤ Index Maintenance : How do we keep the index consistent with the changes or updates to the document collection ?

✤ Challenge of Time : Not enough time to rebuild indexes

✤ Challenge of query competitiveness: Queries to be served in reasonable time

Avishek Anand

Inverted Index

Avishek Anand

Inverted Index

Avishek Anand

Inverted Index

Avishek Anand

Inverted Index

Jan Feb Mar .....

Avishek Anand

Inverted Index

Jan Feb Mar .....

Mon Tue wed .....

Avishek Anand

✤ Multiple Partial Indexes: No index re-computation. Partial index finalized once memory is full.

✤ Query is processed over each partial index and results are merged hence queries are slower

Index Maintenance

✤ Single Index : Maintain one index for the entire document collection by recomputing whenever there are updates (maybe batch updates)

✤ Efficient query processing but high maintenance cost

Avishek Anand

✤ Multiple Partial Indexes: No index re-computation. Partial index finalized once memory is full.

✤ Query is processed over each partial index and results are merged hence queries are slower

Index Maintenance

✤ Single Index : Maintain one index for the entire document collection by recomputing whenever there are updates (maybe batch updates)

✤ Efficient query processing but high maintenance cost

selectively merge partial indexes

Avishek Anand

✤ Merging of two posting lists

✤ In-place - keep free space after index blocks for updates (more space)

✤ Chained-merge/merge-based— chain updates in a new block (slower access)

✤ Merging is possible since compression techniques are local

Merging Indexes

updates

In-memory Indexdisk-resident Index

Avishek Anand

Merging Indexes

updates

Avishek Anand

Merging Indexes

updates

Avishek Anand

Merging Indexes

updates

Avishek Anand

Merging Indexes

updates

Avishek Anand

Merging Indexes

updates

Avishek Anand

!✤ Each on-disk index has generation number g

✤ In-memory index has g = 0 ✤ When two on-disk indexes have same g = i, they are

merged to form on-disk index having g = i+1. !

✤ Logarithmic merge results in log2 N partitions ✤ N is the number of in-memory blocks generated !

✤ Generalized Logarithmic merge : logk N partitions ✤ Also called as Lazy merge or k-constraint logarithmic merge

Logarithmic Merge

[Büttcher et.al SIGIR ’06]

Avishek Anand

Logarithmic Merge

123 0 210 34

Logarithmic Merge Lazy Merge

In-memory IndexIn-memory Index

Timeline

[Büttcher et.al SIGIR ’06]

Avishek Anand

Geometric Merge

• Each partition contains an inverted index • Index sizes form a geometric series with ratio r

• a partition k has index of size 0 or [rk-1 M , (r-1)rk-1 M] postings

• Increasing the value of r, increases the number of merges thus reducing the number of partitions – Immediate merge is a geometric merge with r = ∞

Size of index at partition K+1

Size of index at partition K r =

[Lester et.al CIKM ’05]

Avishek Anand

Geometric Merge[Lester et.al CIKM ’05]

123 0 0 1 2 3

r = 3 r = 3 4 5

Geometric Partitioning Active MergeIn-memory IndexIn-memory Index

Timeline

Avishek Anand

Open Source Full-text Indexing Software

Avishek Anand

1. What is the cost estimate (in terms of disk operations involved) for immediate merge ?

2. Compare Query processing estimates in terms of c for all merge methods.

3. What is the cost estimate for geometric or logarithmic merge ?***

Exercise -3 : Analysis of Merging Techniques

Total number of postings = N

Number of postings which fit in memory = M

Simple Computational Model

Cost of sequential access = c, random access = 1000.c

Avishek Anand

References

http://www.ir.uwaterloo.ca/book/

http://stefan.buettcher.org/papers/buettcher_2006_hybrid_index_maintenance_2.pdf

http://ww2.cs.mu.oz.au/~jz/fulltext/cikm05lmz.pdf

Avishek Anand

Index Construction - Computational Model

Index Construction: Computational Model

I Hypothetical collection of 5Gb and 5 million docs

I Some nominal performance figures

Parameter Symbol Assumed ValueTotal text size B 5 ⇥ 10

9 bytesNumber of docs N 5 ⇥ 10

Number of distinct words n 1 ⇥ 10

Total number of words F 800 ⇥ 10

Number of index pointers f 400 ⇥ 10

Final size of compressed inv. file I 400 ⇥ 10

6 bytes

Disk seek time ts 10 ⇥ 10

�3 secDisk transfer time per byte tr 0.5 ⇥ 10

�6 secInverted file coding per byte td 5 ⇥ 10

�6 secTime to compare and swap 10-byte records tc 10

�6 secTime to parse, stem and look up one term tp 20 ⇥ 10

�6 secAmount of main memory available M 40 ⇥ 10

6 bytes

Introduction to Information Retrieval, Spring 2002, Week 5 Copyright c� Christof Monz & Maarten de Rijke 21

Avishek Anand - KBS · Avishek Anand Maintains statistics and information about the indexed unit...

Documents

Transcript of Avishek Anand - KBS · Avishek Anand Maintains statistics and information about the indexed unit...

double... · asheesh kumawat vishal anand chaitanya kirti abdul aziz ansari mahendra singh naim akhter avishek sharma suraj kumar patel longjam dadei meitei rajpal singh alfayet hoque

Wikuizzing - Avishek Basu Mallick

ANAND LAW COLLEGE, ANAND - Lawctopus€¦ · ANAND LAW COLLEGE, ANAND (MANAGED BY SHRI RAMKRISHNA SEVA MANDAL) ORGANIZES SECOND NATIONAL MOOT COURT COMPETITION, 2015 ANAND LAW COLLEGE,

anand best

Viswanathan Anand

Indexing - L3Sanand/tir14/lectures/ws14-tir-indexing.pdfAvishek Anand Maintains statistics and information about the indexed unit (word, n-gram etc) Posting list location - for posting

ANAND-388110 Anand Agricultural University, Anand invites ... · PDF fileAnand Agricultural University, Anand invites One anli ... Agricultural Economics Agricultural ... as may be

Craigslist Posting Service | Craigslist Ad Posting Service

Book1 anand

Anand Pathrika12

OFC’10 Summary ---Core Networks Part III Avishek Nag.

Learning to Rank - IIanand/ws15-tir-eval.pdf · Avishek Anand Learning to Rank Approaches 10 • Pointwise Approaches • Input space: single documents d1 — if a doc. is relevant

Avishek Lahiri Store Operation 15

ANAND RATHI

(HOD) on Construction Management, Mechanization ......Dr.Kirti Avishek :Co-Convenor Mr. Mani Mohan :Co-Convenor Dr. Balram Singh :Member Dr. Anand Kumar Sinha :Member Dr. Sudeshna

Payroll Posting Outsourcing Solutions - ADP Posting ...kdssc.com/kdssc/papers/sppos12.pdf · Payroll Posting Outsourcing Solutions - ADP Posting Interface ... Integrates all non SAP

FB01-Document Posting Based on Different Posting Keys

Anand Aurora

Anand mahindra

New ANAND AND ANAND - MyCII · 2016. 8. 6. · ANAND AND ANAND . Title: Advt Author: efforts3 Created Date: 8/5/2016 5:27:50 PM