Index Compression
Ferrol Aderholdt
Motivation
Uncompressed indexes are large.
Compression might let some modern devices support information retrieval techniques that they could not support with uncompressed indexes.
Motivation (cont.)
Disk I/O is slow
Types of Compression
Lossy: compression that involves the removal of data.
Lossless: compression that involves no removal of data.
Overview
A lossy compression scheme: Static Index Pruning
Lossless compression: Elias codes, n-s encoding, Golomb encoding, Variable Byte Encoding (vByte), Fixed Binary Codewords, CPSS-Tree
Static Index Pruning
The goal is to reduce the size of the index without reducing precision, such that a human cannot tell the difference between results from a pruned index and a non-pruned index.
Focuses on the top k or top δ results.
Assumes there is a scoring function.
Assumes the function is based on some table A such that A(t,d) > 0 if term t occurs in document d and A(t,d) = 0 otherwise.
Static Index Pruning (cont.)
Two approaches:
1. Uniform pruning
The removal of “all posting entries whose corresponding table values are bounded above by some fixed cutoff threshold”.
Could result in a term’s entire posting list being pruned.
2. Term-based pruning
An approach that attempts to guarantee that every term will have at least some entries remaining in the index.
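The two approaches can be contrasted with a minimal Python sketch. The dictionary layout, the function names, and the sample scores are hypothetical. The per-term cutoff used here (a fraction `eps` of the term's k-th highest table value) is an assumption chosen for illustration; the slides do not state the exact rule.

```python
def uniform_prune(index, tau):
    """Drop every posting whose table value A(t, d) is at most the fixed cutoff tau."""
    return {
        term: {doc: a for doc, a in postings.items() if a > tau}
        for term, postings in index.items()
    }


def term_based_prune(index, k, eps):
    """Use a per-term cutoff: eps times the term's k-th highest value (assumed rule)."""
    pruned = {}
    for term, postings in index.items():
        ranked = sorted(postings.values(), reverse=True)
        # Terms with fewer than k postings get a cutoff of 0 and keep everything.
        kth_best = ranked[k - 1] if len(ranked) >= k else 0.0
        tau_t = eps * kth_best
        pruned[term] = {doc: a for doc, a in postings.items() if a > tau_t}
    return pruned


# Hypothetical table A(t, d): term -> {document id: score}.
index = {
    "rare": {1: 0.2, 2: 0.05},
    "common": {1: 0.9, 2: 0.8, 3: 0.5, 4: 0.3},
}

uniform = uniform_prune(index, tau=0.25)
term_based = term_based_prune(index, k=1, eps=0.5)
print(uniform["rare"])     # the entire posting list for "rare" is pruned
print(term_based["rare"])  # "rare" keeps its best entry
```

With a single global cutoff the low-scoring term loses its whole list, while the per-term cutoff scales with each term's own scores.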
Static Index Pruning (cont.)
Scoring functions are fuzzy.
Only need to find some scoring function S’ such that S’ is within a factor of epsilon of S.
Carmel et al. proved this mathematically for both the uniform and term-based methods.
Static Index Pruning (cont.)
Static Index Pruning results
Found that the idealized top-k pruning algorithm did not work very well.
The smallest value in a posting list was almost always above the threshold, so little pruning was done.
Modified the algorithm to apply a shift.
Subtracted the smallest value from all positive scores within the list.
Greatly increased the pruning.
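The effect of the shift can be illustrated with a small Python sketch. The cutoff rule (a fraction `eps` of the k-th best score), the function name, and the sample scores are assumptions made for illustration and may differ from the rule used in [1].

```python
def prune_posting_list(postings, k, eps, shift=False):
    """Keep postings scoring above eps * (k-th best score), optionally after a shift."""
    if not postings:
        return {}
    # The shift subtracts the smallest score in the list from every score.
    offset = min(postings.values()) if shift else 0.0
    shifted = {doc: score - offset for doc, score in postings.items()}
    ranked = sorted(shifted.values(), reverse=True)
    kth_best = ranked[k - 1] if len(ranked) >= k else 0.0
    tau = eps * kth_best
    return {doc: postings[doc] for doc, s in shifted.items() if s > tau}


# Hypothetical posting list whose scores are all close together.
postings = {1: 10.0, 2: 9.5, 3: 9.1, 4: 9.0}

plain = prune_posting_list(postings, k=1, eps=0.4)
with_shift = prune_posting_list(postings, k=1, eps=0.4, shift=True)
print(len(plain), len(with_shift))
```

Without the shift every score sits above the cutoff and nothing is removed; after the shift only the entries that stand out from the list minimum survive.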
Static Index Pruning results (cont.)
Static Index Pruning results (cont.)
Static Index Pruning results (cont.)
Overview
Lossless Compression
Elias Codes
Non-parameterized bitwise method of coding integers
Gamma codes
Represent a positive integer k with $1 + \lfloor \log_2 k \rfloor$ stored as a unary code. This is followed by the binary representation of the number without the most significant bit.
Not efficient for numbers larger than 15.
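A minimal Python sketch of gamma coding as described above. The unary convention is an assumption (n is written as n - 1 one-bits followed by a zero; other texts use the mirror image), and bit strings are used instead of packed bits for readability.

```python
def unary(n):
    """Unary code for n >= 1: (n - 1) one-bits, then a terminating zero (assumed convention)."""
    return "1" * (n - 1) + "0"


def gamma_encode(k):
    """Gamma code: unary(1 + floor(log2 k)) followed by k without its leading 1 bit."""
    if k < 1:
        raise ValueError("gamma codes are defined for positive integers only")
    binary = bin(k)[2:]  # len(binary) == 1 + floor(log2 k)
    return unary(len(binary)) + binary[1:]


def gamma_decode(bits):
    """Decode one gamma codeword from the start of a bit string."""
    length = bits.index("0") + 1       # value of the unary prefix
    tail = bits[length:2 * length - 1]  # the (length - 1) remaining binary digits
    return int("1" + tail, 2)


print(gamma_encode(1))  # 0
print(gamma_encode(5))  # 11001
print(gamma_encode(9))  # 1110001
```

A codeword for k takes $2\lfloor \log_2 k \rfloor + 1$ bits, so the unary prefix starts to dominate as k grows.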
Elias Codes (cont.)
Delta codes
Represent a positive integer k with $1 + \lfloor \log_2 k \rfloor$ stored as a gamma code. This is followed by the binary representation of the number without the most significant bit.
Not efficient for small values.
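A minimal Python sketch of delta coding as described above. It carries its own gamma encoder so that it runs on its own; the unary convention inside the gamma code (one-bits terminated by a zero) is an assumption, and bit strings stand in for packed bits.

```python
def gamma_encode(k):
    """Gamma code with an assumed unary prefix of one-bits terminated by a zero."""
    binary = bin(k)[2:]
    return "1" * (len(binary) - 1) + "0" + binary[1:]


def delta_encode(k):
    """Delta code: gamma(1 + floor(log2 k)) followed by k without its leading 1 bit."""
    if k < 1:
        raise ValueError("delta codes are defined for positive integers only")
    binary = bin(k)[2:]
    return gamma_encode(len(binary)) + binary[1:]


def delta_decode(bits):
    """Decode one delta codeword from the start of a bit string."""
    prefix_len = bits.index("0") + 1  # unary part of the inner gamma code
    length = int("1" + bits[prefix_len:2 * prefix_len - 1], 2)  # 1 + floor(log2 k)
    start = 2 * prefix_len - 1
    return int("1" + bits[start:start + length - 1], 2)


print(delta_encode(2), gamma_encode(2))                  # 1000 100
print(len(delta_encode(1000)), len(gamma_encode(1000)))  # 16 19
```

The comparison shows the trade-off from the slide: delta is one bit longer than gamma for k = 2 but shorter for k = 1000.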
n-s coding
Parameterized, bitwise encoding.
Uses one or more blocks of n bits followed by s stop bits.
Also takes a parameter b, the base of the number: the values represented in the n-bit blocks must be less than b.
n-s coding example
Let n = 3, s = 2, and the base be 6.
Valid data blocks are 000, 001, 010, 011, 100, and 101.
101 100 001 11 would have the value $541_6$ (205 in decimal).
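The example can be reproduced with a short Python sketch. It assumes the stop marker is s one-bits and that blocks hold base-b digits with the most significant digit first, as the example suggests; the function names and the guard on the base are illustrative additions, not part of [2].

```python
def ns_encode(value, n=3, s=2, base=6):
    """Encode a non-negative integer as n-bit base-`base` digits plus s stop bits."""
    # No data block may start with s one-bits, or it would look like the stop marker.
    if base > 2**n - 2**(n - s):
        raise ValueError("base too large for an unambiguous stop marker")
    digits = []
    while True:
        value, digit = divmod(value, base)
        digits.append(digit)
        if value == 0:
            break
    blocks = [format(d, f"0{n}b") for d in reversed(digits)]
    return "".join(blocks) + "1" * s


def ns_decode(bits, n=3, s=2, base=6):
    """Read n-bit digits until the next s bits are the stop marker."""
    stop = "1" * s
    value, pos = 0, 0
    while bits[pos:pos + s] != stop:
        value = value * base + int(bits[pos:pos + n], 2)
        pos += n
    return value


print(ns_encode(205))            # 10110000111, i.e. 101 100 001 11
print(ns_decode("10110000111"))  # 205
```

With n = 3 and s = 2 the blocks 110 and 111 are the only ones that begin with the stop marker, which is why the base is limited to 6 here.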
n-s coding (cont.)
[2] used n-s codes with prefix omission and run-length encoding
Ex.
n-s coding (cont.)
Run-length encoding is the process of replacing non-initial elements of a sequence with differences between adjacent elements. E.g.
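The transformation described above (keep the first element, replace each later element with its difference from the previous one) can be sketched in Python as follows. The function names are hypothetical, and the sample values are the offsets from the inverted list example used elsewhere in these slides.

```python
from itertools import accumulate


def to_differences(values):
    """Keep the first element; replace every later one with its gap to the previous element."""
    return values[:1] + [b - a for a, b in zip(values, values[1:])]


def from_differences(diffs):
    """Invert the transformation with a running sum."""
    return list(accumulate(diffs))


offsets = [6, 51, 117]
print(to_differences(offsets))                    # [6, 45, 66]
print(from_differences(to_differences(offsets)))  # [6, 51, 117]
```

For an increasing sequence the differences are smaller than the original values, so they need fewer bits under the codes above.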
n-s coding results
Golomb coding
Better compression and faster retrieval than Elias codes.
Is parameterized.
The parameter is usually stored separately using some other compression scheme.
vByte coding
A very simple bytewise compression scheme.
Uses 7 bits of each byte to code the data portion; the most significant bit is reserved as a flag bit.
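A minimal Python sketch of a vByte coder. The slide does not say what the flag bit means or in which order the 7-bit groups are written, so two assumptions are made here: the flag is set on the last byte of each integer, and the most significant group comes first. Other implementations invert either choice.

```python
def vbyte_encode(n):
    """Encode a non-negative integer in 7-bit groups, flagging the final byte."""
    if n < 0:
        raise ValueError("vByte encodes non-negative integers only")
    groups = [n & 0x7F]
    n >>= 7
    while n:
        groups.append(n & 0x7F)
        n >>= 7
    groups.reverse()      # most significant group first (assumed order)
    groups[-1] |= 0x80    # flag bit marks the last byte of this integer (assumed meaning)
    return bytes(groups)


def vbyte_decode(data):
    """Decode a concatenation of vByte-encoded integers."""
    values, current = [], 0
    for byte in data:
        current = (current << 7) | (byte & 0x7F)
        if byte & 0x80:   # flag set: the integer is complete
            values.append(current)
            current = 0
    return values


stream = vbyte_encode(824) + vbyte_encode(5) + vbyte_encode(214577)
print(list(vbyte_encode(824)))  # [6, 184]
print(vbyte_decode(stream))     # [824, 5, 214577]
```

Because every codeword is a whole number of bytes, decoding needs only shifts and masks rather than bit-level extraction.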
Scholer et al.
Defined an inverted list to be the following:
$\langle f_{d,t},\, d,\, [o_{0,d,t}, \ldots, o_{f_{d,t},d,t}] \rangle$
where the list is <freq,doc,[offsets]>.
Example inverted list for term “Matthew”:
<3,7,[6,51,117]><1,44,[12]><2,117,[14,1077]>
Uses different coding schemes per part.
E.g. Golomb for freq, Gamma for doc, and vByte for offset.
Scholer et al. (cont.)
One optimization is to require the encoding to be byte aligned so that decompression can be faster.
Another optimization, for Boolean or ranked queries, is to ignore the offsets and only take into account the flag bits within the offsets. This is referred to as scanning.
Scholer et al. (cont.)
The third optimization is called signature blocks.
An eight-bit block that stores the flag bits of up to eight blocks that follow.
For example, 11100101 represents 5 integers that are stored in the eight blocks.
Requires more space but allows the data blocks to use all 8 bits instead of 7.
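The example signature can be unpacked with a short Python sketch. It assumes that a set bit marks the last block of an integer (which is what makes 11100101 describe 5 integers), that the leftmost signature bit belongs to the first data block, and that blocks hold the most significant byte first. The data values are made up for illustration.

```python
def decode_signature_group(signature, blocks):
    """Split up to eight full 8-bit data blocks into integers using one signature byte."""
    values, current = [], 0
    for i, block in enumerate(blocks):
        current = (current << 8) | block  # data blocks use all 8 bits
        if (signature >> (7 - i)) & 1:    # set flag: this block ends an integer (assumed)
            values.append(current)
            current = 0
    return values


signature = 0b11100101
blocks = [1, 2, 3, 0, 0, 4, 1, 0]  # hypothetical data blocks
values = decode_signature_group(signature, blocks)
print(values)       # [1, 2, 3, 4, 256]
print(len(values))  # 5
```

The flags are gathered in one place, so the block boundaries can be found without touching the data bytes themselves.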
Scholer et al. results
Scholer et al. results (cont.)
Scholer et al. results (cont.)
Fixed Binary Codes
Often the inverted list will be stored as a series of difference gaps (d-gaps) between documents, like so:
$\langle f_t;\, d_1,\, d_2 - d_1,\, d_3 - d_2,\, \ldots,\, d_{f_t} - d_{f_t - 1} \rangle$
This reduces the number of bits required to represent a document ID on average.
Fixed Binary Codes (cont.)
Take for example the following list of d-gaps:
<12; 38, 17, 13, 34, 6, 4, 1, 3, 1, 2, 3, 1>
If a plain binary code were used to encode this list, 6 bits would be used for each codeword, which is unnecessary for the smaller gaps.
Fixed Binary Codes (cont.)
Instead encode as spans:
<12; (6,4 : 38, 17, 13, 34), (3,1 : 6), (2,7 : 4, 1, 3, 1, 2, 3, 1)>
where the notation $(w, s : d_1, d_2, \ldots, d_s)$ would indicate that w-bit binary codes are to be used to code each of the next s values.
Similar to the approach of Anh and Moffat
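The saving can be checked with a few lines of Python. The sketch counts data bits only (the cost of storing each w and s is left out) and assumes that a w-bit code represents the values 1 to 2^w, which is inferred from the gap 4 being coded in 2 bits above. The function names are hypothetical.

```python
gaps = [38, 17, 13, 34, 6, 4, 1, 3, 1, 2, 3, 1]
spans = [(6, 4), (3, 1), (2, 7)]  # (w, s) pairs from the example above


def fixed_width_bits(gaps):
    """Bits needed when every gap uses the width of the largest gap."""
    width = max((g - 1).bit_length() for g in gaps)
    return width * len(gaps)


def span_data_bits(gaps, spans):
    """Bits needed for the gap values when each span uses its own width w."""
    total, pos = 0, 0
    for w, s in spans:
        chunk = gaps[pos:pos + s]
        if len(chunk) != s or any(not 1 <= g <= 2**w for g in chunk):
            raise ValueError("span does not fit the gaps it covers")
        total += w * s
        pos += s
    if pos != len(gaps):
        raise ValueError("spans do not cover the whole list")
    return total


def gamma_bits(gaps):
    """Length of the list under gamma coding: 2 * floor(log2 k) + 1 bits per gap."""
    return sum(2 * (g.bit_length() - 1) + 1 for g in gaps)


print(fixed_width_bits(gaps))       # 72
print(span_data_bits(gaps, spans))  # 41
print(gamma_bits(gaps))             # 60
```

The spans bring the data portion down from 72 bits to 41, before any per-span overhead is added.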
Anh and Moffat
Uses a selector followed by a data representation for encoding.
A selector can be thought of as the unary portion of gamma codes.
The data representation would be the binary portion of gamma codes.
The selector uses a table of values where each case is determined by the w-value and is relative to the previous case.
Anh and Moffat (cont.)
Anh and Moffat (cont.)
Using this list and assuming s1 = 1, s2 = 2, and s3 = 4
From the table on the previous slide we get the following
With each selector as 4 bits (2 bits for w ± 3, 2 bits to choose s1-s3), it takes 16 bits plus the sum of all of the w × s products. So, 57 bits are used to encode this list. It would take 60 bits with a gamma code.
Anh and Moffat (cont.)
Anh and Moffat (cont.)
Parsing is used to discover the segments.
A graph is used in combination with shortest-path labeling.
Each node is a d-gap and the width needed to code it.
Each outgoing edge is a different way in which a selector might be used to cover some subsequent gaps.
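The shortest-path idea can be sketched in Python as a dynamic program over the gap positions. This is a simplified cost model, not the selector table from [4]: every selector is assumed to cost a flat 4 bits, any width may follow any other, each gap needs at least 1 bit, and a w-bit code is assumed to hold the values 1 to 2^w. The function name and the sample list are hypothetical.

```python
import math


def segment_gaps(gaps, spans=(1, 2, 4), selector_bits=4):
    """Find the cheapest way to cover the gaps with spans, working back from the end."""
    n = len(gaps)
    widths = [max(1, (g - 1).bit_length()) for g in gaps]
    best = [math.inf] * (n + 1)  # best[i]: cheapest cost of coding gaps[i:]
    best[n] = 0
    choice = [0] * n             # choice[i]: span length taken at position i
    for i in range(n - 1, -1, -1):
        for s in spans:          # each span is one outgoing edge of node i
            if i + s > n:
                continue
            cost = selector_bits + s * max(widths[i:i + s]) + best[i + s]
            if cost < best[i]:
                best[i], choice[i] = cost, s
    if n and best[0] == math.inf:
        raise ValueError("the gaps cannot be covered with the given span lengths")
    segments, i = [], 0
    while i < n:
        s = choice[i]
        segments.append((max(widths[i:i + s]), gaps[i:i + s]))
        i += s
    return best[0], segments


cost, segments = segment_gaps([5, 1, 1, 1, 1])
print(cost)      # 15
print(segments)  # [(3, [5]), (1, [1, 1, 1, 1])]
```

Isolating the one wide gap lets the four small gaps share a 1-bit width, which is cheaper than any segmentation that mixes them.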
Anh and Moffat (cont.)
A multiplier m is used since every list can be different but the values for s1, s2, and s3 are fixed.
For example, if m = 2 and s1 = 1, s2 = 2, and s3 = 4 (written 1-2-4), then the effective spans would be 2-4-8.
An escape sequence can also be used on lists whose runs of gaps span further than s3 would allow.
This is the addition of an extra 4 bits stating that up to 15m gaps can be placed under one selector.
Anh and Moffat results (cont.)
Anh and Moffat results (cont.)
Anh and Moffat
Speeding up decoding
Need to exploit the cache and reduce both cache misses and TLB misses.
Use CSS-trees or CPSS-trees.
CSS-trees are cache-sensitive search trees that are a variation on m-ary trees.
Making each node contiguous reduces the need for child pointers.
This allows each node to fit into a cache line (32 or 64 bytes).
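The pointer-free layout can be illustrated with a Python sketch of an implicit m-ary search tree stored in one flat list, where the children of node i are located by arithmetic instead of stored pointers. This is a simplified stand-in for the idea, not the CSS-tree of the cited work: keys live in every node, the node size m is counted in keys rather than bytes, and the function names are hypothetical.

```python
import bisect
import math


def build_implicit_tree(sorted_keys, m):
    """Lay out sorted keys as a pointer-free tree with m keys per node."""
    num_nodes = max(1, -(-len(sorted_keys) // m))  # ceiling division
    tree = [math.inf] * (num_nodes * m)            # unused slots hold +inf padding
    keys = iter(sorted_keys)

    def fill(node):
        # In-order traversal: child j, then key j, ..., and finally the last child.
        if node >= num_nodes:
            return
        for j in range(m):
            fill(node * (m + 1) + 1 + j)
            tree[node * m + j] = next(keys, math.inf)
        fill(node * (m + 1) + 1 + m)

    fill(0)
    return tree


def contains(tree, m, x):
    """Search by computing child positions instead of following pointers."""
    num_nodes = len(tree) // m
    node = 0
    while node < num_nodes:
        base = node * m
        pos = bisect.bisect_left(tree, x, base, base + m)  # search within one node
        if pos < base + m and tree[pos] == x:
            return True
        node = node * (m + 1) + 1 + (pos - base)           # child index by arithmetic
    return False


keys = list(range(0, 100, 3))
tree = build_implicit_tree(keys, m=4)
print(contains(tree, 4, 45), contains(tree, 4, 44))  # True False
```

Each node occupies m adjacent slots of the list, so with m chosen to match the cache line size a node visit touches a single line.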
CSS-Tree vs m-ary Tree
CPSS-trees
The main purpose of cache/page-sensitive search trees is to reduce the number of cache/TLB misses during random searches.
Accomplished by making each node, except the root, 4 KB in size and containing several CSS-trees.
The CSS-trees are the same size as a cache line (either 32 or 64 bytes) and contain the postings.
CPSS-trees results
CPSS-trees results (cont.)
Compressed CPSS-trees results
Compressed CPSS-tree results
Questions??
References
[1] David Carmel, Doron Cohen, Ronald Fagin, Eitan Farchi, Michael Herscovici, Yoelle S. Maarek, and Aya Soffer. Static Index Pruning for Information Retrieval Systems. SIGIR ’01: Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval, pp. 43-50, 2001.
[2] Gordon Linoff and Craig Stanfill. Compression of Indexes with Full Positional Information in Very Large Text Databases. SIGIR ’93: Proceedings of the 16th annual international ACM SIGIR conference on Research and development in information retrieval, pp. 88-95, 1993.
[3] Falk Scholer, Hugh Williams, John Yiannis, and Justin Zobel. Compression of Inverted Indexes for Fast Query Evaluation. SIGIR ’02: Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval, pp. 222-229, 2002.
[4] V. N. Anh and A. Moffat. Index Compression using Fixed Binary Codewords. ADC ’04: Proceedings of the 15th Australasian database conference, pp. 61-67, 2004.
[5] Stefan Büttcher and Charles L. A. Clarke. Index Compression is Good, Especially for Random Access. CIKM ’07: Proceedings of the sixteenth ACM conference on Conference on information and knowledge management, pp. 761-770, 2007.