TABLE COMPRESSION AND RELATED PROBLEMS
Raffaele Giancarlo
Dipartimento di Matematica
Università di Palermo
Improving Table Compression with Combinatorial Optimization, J. ACM 03
A. L. Buchsbaum, G. S. Fowler and R. Giancarlo
Boosting Textual Compression in Optimal Linear Time, J. ACM 05
P. Ferragina, R. Giancarlo, G. Manzini and M. Sciortino
Permutation, Partitions and Combinatorial Compression Boosting, TM 256 Unipa 04
P. Ferragina, R. Giancarlo, G. Manzini and M. Sciortino
Table Compression
gzip
a a b b a
a a b b a
a a b b a
a a b b a
…bbbaa
Feed Table in Row Major Order to gzip
Table Compression
a a b b a
a a b b a
a a b b a
a a b b a
gzip
gzip
gzip
On-Line (no Training): Partition Table and Compress separately
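As a rough illustration of the two strategies seen so far, the sketch below compares feeding the whole table to a compressor in row-major order with compressing groups of columns separately. It assumes `zlib` as a stand-in for gzip, a table stored as a list of equal-length strings, and hypothetical helper names (`row_major`, `column_group`, `partitioned_size`); none of these come from the slides.

```python
import zlib

def row_major(table):
    """Serialize a table (list of equal-length strings) row by row."""
    return "".join(table).encode()

def column_group(table, start, stop):
    """Serialize columns start..stop-1 of the table, row by row."""
    return "".join(row[start:stop] for row in table).encode()

def compressed_size(data):
    """Size in bytes of the zlib-compressed data (stand-in for gzip)."""
    return len(zlib.compress(data, 9))

def partitioned_size(table, cuts):
    """Total size when each column interval is compressed on its own.

    cuts is a sorted list of column boundaries, e.g. [0, 2, 4, 5].
    """
    return sum(compressed_size(column_group(table, a, b))
               for a, b in zip(cuts, cuts[1:]))

table = ["aabba"] * 4
whole = compressed_size(row_major(table))        # one gzip over the whole table
split = partitioned_size(table, [0, 2, 4, 5])    # three separate gzips
```

On a toy table this small the per-stream overhead of the compressor dominates, so the partitioned size is not expected to win here; the gains quoted in the talk refer to large real tables.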
Table Compression
Off-Line (Training): Permute Columns, Partition, Compress
a a b b a
a a b b a
a a b b a
a a b b a
a a a b b
a a a b b
a a a b b
a a a b b
gzip
gzip
Table Compression
On-Line
  Optimal Solution
  Same speed as gzip
  40-60% gain in Compression over gzip and bzip2
Off-Line
  Good Heuristics (Traveling Salesman Problem)
  Tolerably slower than gzip
  Additional 10-20% gain in Compression
Applications
  Data warehousing
  Data Base of Multiple Alignments - PFAM
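The slide does not say how the optimal on-line partition is found. One plausible reading, stated here as an assumption, is a dynamic program over contiguous column intervals, where the cost of an interval is its compressed size. The sketch below uses `zlib` in place of gzip, and the names `cost` and `optimal_partition` are hypothetical.

```python
import zlib

def cost(table, a, b):
    """Compressed size of columns a..b-1, serialized row by row."""
    data = "".join(row[a:b] for row in table).encode()
    return len(zlib.compress(data, 9))

def optimal_partition(table):
    """Best split of the columns into contiguous intervals (assumed model)."""
    n = len(table[0])
    best = [0] + [float("inf")] * n   # best[j]: min cost for columns 0..j-1
    prev = [0] * (n + 1)              # prev[j]: start of the last interval
    for j in range(1, n + 1):
        for i in range(j):
            c = best[i] + cost(table, i, j)
            if c < best[j]:
                best[j], prev[j] = c, i
    cuts = [n]
    while cuts[-1] > 0:               # walk back through the chosen intervals
        cuts.append(prev[cuts[-1]])
    return best[n], cuts[::-1]

table = ["aabba"] * 4
total, cuts = optimal_partition(table)
```

The double loop calls the compressor on every column interval, so this sketch favors clarity over the speed claimed on the slide.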
Table Compression
Column Permutations via TSP
Build a complete directed weighted graph G
  column T[i] is vertex i
  weight of (i,j): min(C(T[i])+C(T[j]), C(T[i]T[j])), where C(x) is the size of x after compression
Find a good tour and therefore a good permutation of the table columns
Permute, Partition, Compress
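A minimal sketch of the column-reordering step, under several assumptions: `zlib` stands in for the compressor C, `T[i]T[j]` is read as the two-column table serialized row by row, and a nearest-neighbour walk replaces whatever TSP heuristic the authors actually used. All function names are hypothetical.

```python
import zlib

def C(data):
    """Compressed size in bytes (zlib as a stand-in compressor)."""
    return len(zlib.compress(data, 9))

def column(table, j):
    """Column j of the table as a string."""
    return "".join(row[j] for row in table)

def pair(table, i, j):
    """Columns i and j side by side, serialized row by row."""
    return "".join(row[i] + row[j] for row in table)

def weight(table, i, j):
    """Edge weight from the slide: min(C(T[i]) + C(T[j]), C(T[i]T[j]))."""
    separate = C(column(table, i).encode()) + C(column(table, j).encode())
    together = C(pair(table, i, j).encode())
    return min(separate, together)

def greedy_order(table, start=0):
    """Nearest-neighbour tour over the columns (a simple TSP heuristic)."""
    n = len(table[0])
    order, left = [start], set(range(n)) - {start}
    while left:
        last = order[-1]
        nxt = min(sorted(left), key=lambda j: weight(table, last, j))
        order.append(nxt)
        left.remove(nxt)
    return order

def permute(table, order):
    """Apply a column order to every row."""
    return ["".join(row[j] for j in order) for row in table]

table = ["aabba"] * 4
order = greedy_order(table)
reordered = permute(table, order)
```

The reordered table can then be partitioned and compressed piece by piece, as in the Permute, Partition, Compress line above.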
The PPC Paradigm
Base Compressor C, e.g., gzip, Huffman, Arithmetic Codes
Objects to be compressed: x1, x2, …, xn
Find a suitable permutation of the objects
Permute the objects and partition
Compress each piece of the partition separately via C
Boosting the performance of Base Compressor C
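The paradigm can be sketched generically. In the illustration below the objects are the individual bytes of a string, `zlib` plays the role of the base compressor C, and the permutation and cut points are supplied by the caller; `ppc_compress` and `ppc_decompress` are hypothetical names, and the permutation must be known to the decoder.

```python
import zlib

def ppc_compress(data, perm, cuts):
    """Permute the bytes of data, cut into pieces, compress each piece."""
    permuted = bytes(data[i] for i in perm)
    return [zlib.compress(permuted[a:b], 9) for a, b in zip(cuts, cuts[1:])]

def ppc_decompress(pieces, perm):
    """Decompress every piece, concatenate, and undo the permutation."""
    permuted = b"".join(zlib.decompress(p) for p in pieces)
    out = bytearray(len(permuted))
    for pos, i in enumerate(perm):
        out[i] = permuted[pos]        # byte at position pos came from index i
    return bytes(out)

data = b"abababab"
perm = [0, 2, 4, 6, 1, 3, 5, 7]      # gather the a's, then the b's
cuts = [0, 4, 8]                     # one piece per symbol
pieces = ppc_compress(data, perm, cuts)
restored = ppc_decompress(pieces, perm)
```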
Back to Table Compression
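Binh Dao Vo and Kiem-Phong Vo, DCC04
Using Column Dependency to Compress Tables
Example table:
9088771 07922
9733360 07932
9084640 07922
9733600 07932
[Figure: the values 908 973 908 973 are rearranged into 908 908 973 973, with the dependent values 2 2 3 3; labels: Lex sort, PPC]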
Back to Table Compression
Column Dependency for Table Compression
Elegant algorithms to infer dependency and rearrange data
Theory: NP-hard
Heuristics: 5-50% improvement in compression over TSP reordering
A Transition
Exercise: Specialize TSP Reordering to strings
String x_1 x_2 … x_n
lcp(i,j) = length of the longest common prefix of x_{i+1} … x_n and x_{j+1} … x_n
Symbols i and j have relation weight n - lcp(i,j)
s = mississippi#
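The relation weight on this slide can be computed directly. The sketch below keeps the slide's 1-based symbol indices, so the context that follows symbol i is the Python slice `s[i:]`; the function names `lcp` and `weight` mirror the slide's notation, while the chosen example pairs are only illustrative.

```python
def lcp(s, i, j):
    """Longest common prefix of x_{i+1}…x_n and x_{j+1}…x_n (1-based i, j)."""
    a, b = s[i:], s[j:]
    k = 0
    while k < len(a) and k < len(b) and a[k] == b[k]:
        k += 1
    return k

def weight(s, i, j):
    """Relation weight n - lcp(i, j)."""
    return len(s) - lcp(s, i, j)

s = "mississippi#"
# Symbols 2 and 5 are both 'i', followed by "ssissippi#" and "ssippi#".
shared = lcp(s, 2, 5)      # the contexts share the prefix "ssi"
w = weight(s, 2, 5)
```

Symbols whose following contexts share a long prefix get a small weight, so a short tour places them next to each other.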
A Transition
Exercise (continued)
Define an undirected graph G, where node i is labeled with x_i and edge (i,j) has the weight given by the relation above
An optimal tour is given by the lex sort of all cyclic shifts of s
All contexts are packed together optimally
PPC
The Burrows and Wheeler Transform (1994)
s = mississippi#
All cyclic shifts of s:
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi
Sort the rows (the last symbol of each row is set apart):
#mississipp i
i#mississip p
ippi#missis s
issippi#mis s
ississippi# m
mississippi #
pi#mississi p
ppi#mississ i
sippi#missi s
sissippi#mi s
ssippi#miss i
ssissippi#m i
bwt(s) = last column = ipssm#pissii
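The construction on this slide can be reproduced with a naive implementation that materializes and sorts all cyclic shifts. This is only a sketch for small inputs (practical implementations use suffix arrays or similar structures); it assumes `#` is a unique end marker that sorts before the letters, which holds for Python's default string ordering. The inverse shown is the textbook quadratic reconstruction, included to make the point that the permutation is invertible.

```python
def bwt(s):
    """Last column of the lexicographically sorted cyclic shifts of s."""
    rows = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(row[-1] for row in rows)

def inverse_bwt(last, end="#"):
    """Rebuild s by repeatedly prepending the last column and re-sorting."""
    table = [""] * len(last)
    for _ in range(len(last)):
        table = sorted(last[i] + table[i] for i in range(len(last)))
    # The original string is the row that ends with the unique end marker.
    return next(row for row in table if row.endswith(end))

s = "mississippi#"
transformed = bwt(s)               # "ipssm#pissii"
recovered = inverse_bwt(transformed)
```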
Boosting Textual Compression in optimal time
Our technique takes a poor compressor A and turns it into a compressor Aboost with a better performance guarantee
[Diagram: s → A → c and s → Booster → c’]
The better A is, the better Aboost is
The more compressible s is, the better Aboost is
Qualitatively, we show that
c’ is shorter than c, if s is compressible
Time(Aboost) = Time(A), i.e. no slowdown
A is used as a black-box
Boosting
Technically, we prove that
if $|c| \le \lambda |s| H_0(s) + \mu |s|$ (“Poor” means H0 bounds for A)
then $|c'| \le \lambda |s| H_k(s) + \log_2 |s| + \mu |s| + g_k$ for any $k \ge 0$, where $g_k$ depends only on k and the alphabet size
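The quantities H_0(s) and H_k(s) in the bound are empirical entropies. The sketch below computes them under the common definition in which H_k averages the order-0 entropy of the symbols that follow each length-k context; that choice of definition, and the names `h0` and `hk`, are assumptions made for illustration rather than something the slides spell out.

```python
from collections import Counter, defaultdict
from math import log2

def h0(s):
    """Empirical order-0 entropy of s, in bits per symbol."""
    n = len(s)
    if n == 0:
        return 0.0
    return -sum(c / n * log2(c / n) for c in Counter(s).values())

def hk(s, k):
    """Empirical order-k entropy: weighted H_0 of the followers of each context."""
    if k == 0:
        return h0(s)
    followers = defaultdict(list)
    for i in range(len(s) - k):
        followers[s[i:i + k]].append(s[i + k])
    return sum(len(f) * h0(f) for f in followers.values()) / len(s)

s = "mississippi"
order0 = h0(s)     # cost per symbol when contexts are ignored
order1 = hk(s, 1)  # cost per symbol when one symbol of context is used
```

A string with strong context dependence has H_k well below H_0, which is the gap the booster turns into shorter output.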
Boosting
Three Key Components: Burrows-Wheeler Transform, Suffix Tree and a Greedy processing of them
Our technique takes a 0th order compressor A and turns it into a compressor Aboost with a better performance guarantee
We achieve the best known compression ratio
Boosting
Outline
BWT
Find optimal partition of permuted string
Greedy processing of suffix tree
Compress each piece of partition separately via base compressor A
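The outline can be illustrated with a simplified stand-in. Instead of the optimal partition found by greedy processing of the suffix tree, the sketch below cuts bwt(s) wherever the length-k context changes, for a fixed k chosen by the caller, and then compresses each piece with `zlib` acting as the base compressor A. The fixed-k rule, the naive BWT, and the names `context_partition` and `boost` are assumptions for illustration, not the algorithm of the paper.

```python
import zlib
from itertools import groupby

def bwt_rows(s):
    """Sorted cyclic shifts of s (naive construction, small inputs only)."""
    return sorted(s[i:] + s[:i] for i in range(len(s)))

def context_partition(s, k):
    """Split bwt(s) into pieces, one per distinct length-k context."""
    rows = bwt_rows(s)
    return ["".join(row[-1] for row in group)
            for _, group in groupby(rows, key=lambda row: row[:k])]

def boost(s, k, A=lambda data: zlib.compress(data, 9)):
    """Compress each piece of the partition separately with A."""
    return [A(piece.encode()) for piece in context_partition(s, k)]

s = "mississippi#"
pieces = context_partition(s, 1)   # ['i', 'pssm', '#', 'pi', 'ssii']
compressed = boost(s, 1)
```

Concatenating the pieces gives back bwt(s), so the partition only decides where the base compressor restarts its statistics.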
Related Work
Foschini, Grossi, Gupta and Vitter, DCC04
Fast Compression with a Static Model in High Order Entropy
It can be seen as a Compression Booster of Run-Length Encoding
Ingredients:
BWT
Wavelet Trees [GGV03]
Efficient encoding of the Integers [E75]
Related Work
Liefke and Suciu
Compression for XML Files
Group Together XML Strings based on similarities
Greatly Improves the performance of Gzip
Related Work
Johnson et al. 2005
Compression of Boolean Matrices
Permute Columns so that Number of Runs is Minimized
NP-hard; actually MAX SNP-hard
TSP + Hamming Distance
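The link between runs and a TSP with Hamming distances can be checked on a small example. The sketch assumes a run is a maximal block of equal bits within a row, so that the total number of runs equals the number of rows plus the sum of Hamming distances between adjacent columns. The brute-force search over all column orders is only feasible for tiny matrices, and the function names are hypothetical.

```python
from itertools import permutations

def runs(matrix, order):
    """Total number of maximal equal-bit runs over all rows, columns in order."""
    total = 0
    for row in matrix:
        bits = [row[j] for j in order]
        total += 1 + sum(a != b for a, b in zip(bits, bits[1:]))
    return total

def hamming(matrix, i, j):
    """Number of rows in which columns i and j differ."""
    return sum(row[i] != row[j] for row in matrix)

def path_cost(matrix, order):
    """Sum of Hamming distances between adjacent columns in the order."""
    return sum(hamming(matrix, a, b) for a, b in zip(order, order[1:]))

def best_order(matrix):
    """Exhaustive search for the cheapest column order (tiny inputs only)."""
    n = len(matrix[0])
    return min(permutations(range(n)), key=lambda o: path_cost(matrix, o))

matrix = [[1, 0, 1, 0],
          [1, 0, 1, 0],
          [0, 1, 0, 1]]
before = runs(matrix, (0, 1, 2, 3))
after = runs(matrix, best_order(matrix))
```

For this matrix the identical columns 0 and 2, and 1 and 3, end up adjacent, which halves the number of runs.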
Related Work
Shortest Common Superstring [G97]
Oldest Instance of Permute, Partition and Compress
Conclusions
Permute Data Before Compression
It is efficient and fun…
In particular, if the chosen permutation is not invertible