TABLE COMPRESSION AND RELATED PROBLEMS
Raffaele Giancarlo
Dipartimento di Matematica
Università di Palermo
Improving Table Compression with Combinatorial Optimization - J. ACM 03
A. L. Buchsbaum, G. S. Fowler and R. Giancarlo
Boosting Textual Compression in Optimal Linear Time - J. ACM 05
P. Ferragina, R. Giancarlo, G. Manzini and M. Sciortino
Permutation, Partitions and Combinatorial Compression Boosting – TM 256 Unipa 04
P. Ferragina, R. Giancarlo, G. Manzini and M. Sciortino
Table Compression
gzip
a a b b a
a a b b a
a a b b a
a a b b a
…bbbaa
Feed Table in Row Major Order to gzip
Table Compression
a a b b a
a a b b a
a a b b a
a a b b a
gzip
gzip
gzip
On-Line (no Training): Partition Table and Compress separately
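A minimal sketch of the two set-ups shown so far, using Python's `zlib` (the DEFLATE library behind gzip) as a stand-in for gzip. The helper names and the three-piece partition are illustrative assumptions, not the partition computed in the cited paper.

```python
import zlib

def compress_row_major(table):
    """Concatenate the rows and compress the result in one go."""
    return zlib.compress("".join(table).encode())

def compress_by_columns(table, groups):
    """Compress each group of adjacent columns separately.

    `groups` is a left-to-right list of (start, end) column intervals
    that together cover the whole table.
    """
    pieces = []
    for start, end in groups:
        block = "".join(row[start:end] for row in table)
        pieces.append(zlib.compress(block.encode()))
    return pieces

def decompress_by_columns(pieces, groups, n_rows):
    """Inverse of compress_by_columns: rebuild the rows from the pieces."""
    rows = [""] * n_rows
    for piece, (start, end) in zip(pieces, groups):
        block = zlib.decompress(piece).decode()
        width = end - start
        for r in range(n_rows):
            rows[r] += block[r * width:(r + 1) * width]
    return rows

table = ["aabba"] * 4                  # the 4 x 5 table from the slides
groups = [(0, 2), (2, 4), (4, 5)]      # hypothetical partition into 3 pieces
pieces = compress_by_columns(table, groups)
restored = decompress_by_columns(pieces, groups, len(table))
```

On a table this small the per-stream overhead of zlib dominates, so the sketch only illustrates the mechanics; any gain shows up on large tables.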
Table Compression
Off-Line (Training): Permute Columns, Partition, Compress
a a b b a
a a b b a
a a b b a
a a b b a
a a a b b
a a a b b
a a a b b
a a a b b
gzip
gzip
Table Compression
On-Line: Optimal Solution; Same speed as gzip; 40-60% gain in Compression over gzip and bzip2
Off-Line: Good Heuristics (Traveling Salesman Problem); Tolerably slower than gzip; Additional 10-20% gain in Compression
Applications: Data warehousing; Data Base of Multiple Alignments - PFAM
Table Compression
Column Permutations via TSP
Build complete directed weighted graph G
column T[i] is vertex i
weight of (i,j): min(C(T[i])+C(T[j]), C(T[i]T[j]))
Find a good tour and therefore a good permutation of the table columns
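The graph construction above can be sketched as follows. This is an illustration under stated assumptions: `zlib` plays the role of the base compressor C, C(T[i]T[j]) is read as the compressed size of the two-column table formed by columns i and j in row-major order, and the tour is found with a plain nearest-neighbour heuristic rather than the TSP heuristics used in the paper. All function names are hypothetical.

```python
import zlib

def C(data: bytes) -> int:
    """Compressed size in bytes under the base compressor (zlib here)."""
    return len(zlib.compress(data))

def column(table, j):
    """Column j of the table, read top to bottom."""
    return "".join(row[j] for row in table).encode()

def pair(table, i, j):
    """The two-column table (T[i], T[j]) in row-major order."""
    return "".join(row[i] + row[j] for row in table).encode()

def weights(table):
    """w[i][j] = min(C(T[i]) + C(T[j]), C(T[i]T[j])) for every ordered pair i != j."""
    n = len(table[0])
    w = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                separate = C(column(table, i)) + C(column(table, j))
                w[i][j] = min(separate, C(pair(table, i, j)))
    return w

def greedy_tour(w, start=0):
    """Nearest-neighbour heuristic: keep moving to the cheapest unvisited vertex."""
    n = len(w)
    tour, seen = [start], {start}
    while len(tour) < n:
        last = tour[-1]
        nxt = min((j for j in range(n) if j not in seen), key=lambda j: w[last][j])
        tour.append(nxt)
        seen.add(nxt)
    return tour

def permute_columns(table, order):
    """Rearrange every row according to the column order given by the tour."""
    return ["".join(row[j] for j in order) for row in table]

table = ["aabba"] * 4
order = greedy_tour(weights(table))
permuted = permute_columns(table, order)
```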
Permute, Partition, Compress
The PPC Paradigm
Base Compressor C, e.g., gzip, Huffman, Arithmetic Codes
Objects to be compressed: x1, x2, …, xn
Find suitable permutation of objects
Permute objects and partition
Compress each piece of the partition separately via C
Boosting the performance of Base Compressor C
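The three PPC steps can be sketched generically. In this sketch the permutation `order` and the cut points `cuts` are simply given as inputs (finding good ones is the hard part the slides discuss), `zlib` stands in for the base compressor C, and the function names are hypothetical.

```python
import zlib

def ppc_compress(objects, order, cuts, C=zlib.compress):
    """Permute `objects` by `order`, cut the sequence at `cuts`, compress each piece."""
    permuted = [objects[i] for i in order]
    bounds = [0, *cuts, len(permuted)]
    return [C(b"".join(permuted[a:b])) for a, b in zip(bounds, bounds[1:])]

def ppc_decompress(pieces, order, sizes, D=zlib.decompress):
    """Undo ppc_compress; sizes[i] is the byte length of original object i."""
    stream = b"".join(D(p) for p in pieces)
    objects, pos = [None] * len(order), 0
    for i in order:
        objects[i] = stream[pos:pos + sizes[i]]
        pos += sizes[i]
    return objects

# The columns of the "a a b b a" table, permuted so that equal columns are adjacent.
columns = [b"aaaa", b"aaaa", b"bbbb", b"bbbb", b"aaaa"]
order = [0, 1, 4, 2, 3]      # a-columns first, then b-columns
cuts = [3]                   # two pieces: three a-columns, two b-columns
pieces = ppc_compress(columns, order, cuts)
restored = ppc_decompress(pieces, order, [len(c) for c in columns])
```

The decoder needs the permutation and the object sizes, so a real format would have to store them alongside the compressed pieces.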
Back to Table Compression
Binh Dao Vo and Kiem-Phong Vo - DCC04
Using Column Dependency to Compress Tables
[Figure: sample table of numeric fields (values such as 908, 973, 0792, 0793); after lex sort equal values become adjacent: 908 908 973 973 and 2 2 3 3]
Lex sort: PPC
Back to Table Compression
Column Dependency for Table Compression
Elegant algorithms to infer dependency and rearrange data
Theory: NP-Hard
Heuristics: 5-50% improvement in compression over TSP reordering
A Transition
Exercise: Specialize TSP Reordering to strings
String x1 x2 … xn
lcp(i,j) = length of longest common prefix of xi+1 … xn and xj+1 … xn
Symbols i and j have relation weight n-lcp(i,j)
s = mississippi#
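A small sketch of the relation weight for the example string, assuming 1-based positions i and j as on the slide. The function names are hypothetical.

```python
def lcp(s, i, j):
    """Length of the longest common prefix of x_{i+1}..x_n and x_{j+1}..x_n (1-based i, j)."""
    a, b = s[i:], s[j:]          # the 0-based slice s[i:] is x_{i+1} ... x_n
    k = 0
    while k < len(a) and k < len(b) and a[k] == b[k]:
        k += 1
    return k

def weight(s, i, j):
    """Relation weight n - lcp(i, j) between symbols i and j."""
    return len(s) - lcp(s, i, j)

s = "mississippi#"
# Symbols 2 and 5 are both 'i'; they are followed by "ssissippi#" and "ssippi#".
print(lcp(s, 2, 5))      # 3, the common prefix is "ssi"
print(weight(s, 2, 5))   # 9
```

Symbols followed by similar text get a small weight, so a cheap tour places them next to each other.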
A Transition
Exercise (continued)
Define undirected graph G, where node i is labeled with xi and (i,j) has weight given by the relation
An optimal tour is given by the lex sort of all cyclic shifts of s
All contexts are packed together optimally
PPC
The Burrows and Wheeler Transform (1994)
All cyclic shifts of s = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi
Sort the rows
#mississipp i
i#mississip p
ippi#missis s
issippi#mis s
ississippi# m
mississippi #
pi#mississi p
ppi#mississ i
sippi#missi s
sissippi#mi s
ssippi#miss i
ssissippi#m i
bwt(s) = ipssm#pissii (the last column)
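The construction can be reproduced with a direct, quadratic-space sketch that sorts all cyclic shifts. It is meant only to illustrate the definition; practical implementations use suffix sorting. The inverse assumes that the end marker `#` occurs exactly once, as the last symbol of s.

```python
def bwt(s):
    """Last column of the lexicographically sorted matrix of cyclic shifts of s."""
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(row[-1] for row in rotations)

def inverse_bwt(last, end="#"):
    """Rebuild s from bwt(s), assuming `end` occurs once, as the final symbol of s."""
    table = [""] * len(last)
    for _ in range(len(last)):
        # Prepend the last column and re-sort: after k rounds the table holds
        # the first k columns of the sorted rotation matrix.
        table = sorted(last[i] + table[i] for i in range(len(last)))
    return next(row for row in table if row.endswith(end))

print(bwt("mississippi#"))          # ipssm#pissii
print(inverse_bwt("ipssm#pissii"))  # mississippi#
```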
Qualitatively, we show that
c’ is shorter than c, if s is compressible
Time(Aboost) = Time(A), i.e. no slowdown
A is used as a black-box
Our technique takes a poor compressor A and turns it into
a compressor Aboost with better performance guarantee
[Figure: A maps s to c; the Booster, built around A, maps s to c’]
The better is A, the better is Aboost
The more compressible is s, the better is Aboost
Boosting Textual Compression in optimal time
Technically, we prove that
if |c| ≤ λ |s| H_0(s) + µ |s|, then |c’| ≤ λ |s| H_k(s) + µ |s| + log2 |s| + g_k
“Poor” means H0 bounds for A
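For reference, the empirical entropies in these bounds can be computed directly. The sketch below uses the standard definition H_k(s) = (1/|s|) Σ_w |w_s| H_0(w_s), where w ranges over the length-k contexts and w_s collects the symbols that follow w in s. Taking the following symbol rather than the preceding one is a convention chosen here, and the function names are hypothetical.

```python
from collections import Counter, defaultdict
from math import log2

def h0(s):
    """Zeroth-order empirical entropy of s, in bits per symbol."""
    n = len(s)
    if n == 0:
        return 0.0
    return -sum(c / n * log2(c / n) for c in Counter(s).values())

def hk(s, k):
    """k-th order empirical entropy: (1/|s|) * sum over contexts w of |w_s| * H_0(w_s)."""
    if k == 0:
        return h0(s)
    followers = defaultdict(list)
    for i in range(len(s) - k):
        followers[s[i:i + k]].append(s[i + k])   # symbol that follows context w
    return sum(len(f) * h0(f) for f in followers.values()) / len(s)

s = "mississippi"
print(h0(s), hk(s, 1))   # the order-1 value is the smaller of the two
```

A string such as "abababab" has H_0 = 1 bit per symbol but H_1 = 0, which is the kind of gap a booster is meant to exploit.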
Boosting
Boosting
Three Key Components: Burrows-Wheeler Transform, Suffix Tree and a Greedy processing of them
Our technique takes a 0th order compressor A and turns it
into a compressor Aboost with better performance guarantee
We achieve the best known compression ratio
Boosting
Outline
BWT
Find optimal partition of permuted string: Greedy processing of suffix tree
Compress each piece of partition separately via base compressor A
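A simplified sketch of the outline, under loudly stated assumptions: instead of the optimal partition found by greedy processing of the suffix tree, it cuts bwt(s) wherever the first k symbols of the sorted rotations change, for a fixed k. `zlib` stands in for the base compressor A although it is not a 0th order compressor, and the function names are hypothetical. This is not the algorithm of the paper, only an illustration of partitioning the permuted string by context.

```python
import zlib

def bwt_pieces(s, k):
    """Split bwt(s) into blocks of rows whose rotations share the same first k symbols."""
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    pieces, current, context = [], [], None
    for row in rotations:
        if row[:k] != context:           # a new length-k context starts here
            if current:
                pieces.append("".join(current))
            current, context = [], row[:k]
        current.append(row[-1])          # last column = bwt symbol of this row
    if current:
        pieces.append("".join(current))
    return pieces

def boost_fixed_k(s, k, A=lambda piece: zlib.compress(piece.encode())):
    """Compress each context block separately with the base compressor A."""
    return [A(piece) for piece in bwt_pieces(s, k)]

print(bwt_pieces("mississippi#", 1))   # ['i', 'pssm', '#', 'pi', 'ssii']
```

Concatenating the pieces gives back bwt(s), so the partition loses nothing; the booster's contribution is choosing the cut points so that the total compressed size is minimal.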
Related Work
Foschini, Grossi, Gupta and Vitter- DCC04
Fast Compression with a Static Model in High Order Entropy
It can be seen as a Compression Booster of Run Length Encoding
Ingredients: BWT; Wavelet Trees [GGV03]; efficient encoding of the Integers [E75]
Related Work
Liefke and Suciu
Compression for XML Files
Group Together XML Strings based on similarities
Greatly Improves the performance of Gzip
Related Work
Johnson et al. 2005
Compression of Boolean Matrices
Permute Columns so that Number of Runs is Minimized
NP-hard; actually MAX-SNP hard
TSP + Hamming Distance
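The link between runs and Hamming distance can be checked on a toy matrix: in any column order, the total number of runs equals the number of rows plus the sum of Hamming distances between adjacent columns, so minimizing runs is a path version of TSP under Hamming distance. The matrix and the function names below are illustrative assumptions.

```python
def count_runs(matrix):
    """Total number of maximal runs of equal bits, summed over the (non-empty) rows."""
    return sum(1 + sum(a != b for a, b in zip(row, row[1:])) for row in matrix)

def hamming(matrix, i, j):
    """Number of rows in which columns i and j differ."""
    return sum(row[i] != row[j] for row in matrix)

def path_cost(matrix, order):
    """Sum of Hamming distances between columns that are adjacent in `order`."""
    return sum(hamming(matrix, a, b) for a, b in zip(order, order[1:]))

def permute(matrix, order):
    """Rearrange the columns of every row according to `order`."""
    return [[row[j] for j in order] for row in matrix]

matrix = [[0, 1, 0, 1],
          [0, 1, 0, 1],
          [1, 0, 1, 0]]
order = [0, 2, 1, 3]                        # puts identical columns side by side
print(count_runs(matrix))                   # 12
print(count_runs(permute(matrix, order)))   # 6 = 3 rows + path cost 3
```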
Related Work
Shortest Common Superstring [G97]
Oldest Instance of Permute, Partition and Compress
Conclusions
Permute Data Before Compression
It is efficient and fun…
In particular, if chosen permutation is not invertible