BWT-Based Compression Algorithms
Haim Kaplan and Elad Verbin, Tel-Aviv University
Presented at CPM '07, July 8, 2007
Results
● Cannot show any constant c < 2 s.t. ∀s: |BW0(s)| ≤ c·nH0(s) + o(n)
● Similarly:
  ● no c < 1.26 for BWRL
  ● no c < 1.3 for BWDC
● Probabilistic technique.
BW0 – The Main Burrows-Wheeler Compression Algorithm:

String S → BWT (Burrows-Wheeler Transform) → MTF (Move-to-front) → Order-0 Encoding → Compressed String S'

● Input: text in English (similar contexts -> similar character)
● After BWT: text with local uniformity
● After MTF: integer string with many small numbers
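A minimal end-to-end sketch of this pipeline in Python. Assumptions not in the slides: a naive O(n² log n) BWT via sorted rotations, a unique smallest sentinel '#', and the empirical order-0 entropy standing in for the final coder (a real implementation would finish with arithmetic or Huffman coding):

```python
import math

def bwt(s):
    # Naive Burrows-Wheeler Transform: sort all cyclic rotations
    # of s and take the last character of each.
    rots = sorted(s[i:] + s[:i] for i in range(len(s)))
    return ''.join(r[-1] for r in rots)

def mtf(s, alphabet):
    # Move-to-front: emit each character's current index in the
    # table, then move that character to the front.
    table, out = list(alphabet), []
    for c in s:
        i = table.index(c)
        out.append(i)
        table.insert(0, table.pop(i))
    return out

def order0_bits(xs):
    # Empirical order-0 entropy in bits: sum over symbols a of
    # n_a * log2(n / n_a); a stand-in for the order-0 coder.
    n = len(xs)
    counts = {a: xs.count(a) for a in set(xs)}
    return sum(c * math.log2(n / c) for c in counts.values())

s = 'mississippi#'                 # '#' is a unique sentinel
m = mtf(bwt(s), sorted(set(s)))    # BWT, then MTF
# The MTF output is dominated by small numbers, so here its
# order-0 entropy is lower than that of the original string.
```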
The BWT
● Invented by Burrows and Wheeler ('94)
● Analogous to the Fourier Transform (smooth!)

s: string with context-regularity → (BWT) → s': string with spikes (close repetitions)

mississippi → ipssmpissii [Fenwick]
The BWT: T = mississippi#

All cyclic rotations of T:
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows:
F              L = BWT(T)
# mississipp   i
i #mississip   p
i ppi#missis   s
i ssippi#mis   s
i ssissippi#   m
m ississippi   #
p i#mississi   p
p pi#mississ   i
s ippi#missi   s
s issippi#mi   s
s sippi#miss   i
s sissippi#m   i
BWT sorts the characters by their post-context
Move To Front
● By Bentley, Sleator, Tarjan and Wei ('86)

s': string with spikes (close repetitions) → (move-to-front) → integer string with small numbers

ipssmpissii → 0,0,0,0,0,2,4,3,0,1,0
Move to Front (example: "abracadabra", initial table a,b,r,c,d)

input char | output | table after
a          | 0      | a,b,r,c,d
b          | 1      | b,a,r,c,d
r          | 2      | r,b,a,c,d
a          | 2      | a,r,b,c,d
c          | 3      | c,a,r,b,d
a          | 1      | a,c,r,b,d
d          | 4      | d,a,c,r,b
a          | 1      | a,d,c,r,b
b          | 4      | b,a,d,c,r
r          | 4      | r,b,a,d,c
a          | 2      | a,r,b,d,c

MTF("abracadabra") = 0,1,2,2,3,1,4,1,4,4,2
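A sketch of move-to-front in Python, reproducing the example above (the initial table order a,b,r,c,d is taken from the slides). The decoder runs the same table updates in reverse, which is what makes MTF usable inside a compressor:

```python
def mtf_encode(s, alphabet):
    table, out = list(alphabet), []
    for c in s:
        i = table.index(c)              # output the current position of c
        out.append(i)
        table.insert(0, table.pop(i))   # move c to the front
    return out

def mtf_decode(codes, alphabet):
    table, out = list(alphabet), []
    for i in codes:
        c = table[i]                    # the symbol at index i is next
        out.append(c)
        table.insert(0, table.pop(i))   # same table update as the encoder
    return ''.join(out)

codes = mtf_encode('abracadabra', 'abrcd')
# codes == [0, 1, 2, 2, 3, 1, 4, 1, 4, 4, 2], matching the slide
```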
After MTF
● Now we have a string with small numbers: lots of 0s, many 1s, …
● Skewed frequencies: run arithmetic coding!

(Figure: character frequencies)
BW0 – The Main Burrows-Wheeler Compression Algorithm (recap):

String S → BWT (Burrows-Wheeler Transform) → MTF (Move-to-front) → Order-0 Encoding → Compressed String S'
BWRL (e.g. bzip)

String S → BWT (Burrows-Wheeler Transform) → MTF (Move-to-front) → RLE (Run-Length Encoding) → Order-0 Encoding → Compressed String S'
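The exact RLE step varies by implementation (bzip2, for instance, encodes zero-runs with a special two-symbol code). A minimal illustrative sketch, which simply replaces each maximal run of zeros in the MTF output with a (0, run-length) pair, might look like:

```python
def rle_zeros(codes):
    # Collapse each maximal run of 0s into a (0, run_length) pair;
    # every other symbol passes through unchanged.
    out, run = [], 0
    for x in codes:
        if x == 0:
            run += 1
        else:
            if run:
                out.append((0, run))
                run = 0
            out.append(x)
    if run:
        out.append((0, run))
    return out

# The long zero-runs typical of MTF output shrink to single pairs:
# rle_zeros([0, 0, 0, 0, 2, 0, 0, 1]) == [(0, 4), 2, (0, 2), 1]
```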
Many more BWT-based algorithms
● BWDC: encodes using distance coding instead of MTF
● BW with inversion frequencies coding
● Booster-based [Ferragina-Giancarlo-Manzini-Sciortino]
● Block-based compressor of Effros et al.
order-0 entropy
● Lower bound for compression without context information:

  nH0(s) = Σa na·log(n/na)

Example: S = "ACABBA"
● 1/2 are 'A's: each represented by log(2) = 1 bit
● 1/3 are 'B's: each represented by log(3) bits
● 1/6 are 'C's: each represented by log(6) bits
● 6·H0(S) = 3·1 + 2·log(3) + 1·log(6)
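The formula above as a short Python sketch (log base 2, so the result is in bits):

```python
import math
from collections import Counter

def n_h0(s):
    # nH0(s) = sum over symbols a of n_a * log2(n / n_a)
    n = len(s)
    return sum(c * math.log2(n / c) for c in Counter(s).values())

# For S = "ACABBA": 6*H0(S) = 3*1 + 2*log2(3) + 1*log2(6)
```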
order-k entropy
mississippi:
● Context for i: "mssp"
● Context for s: "isis"
● Context for p: "ip"

4·H0("mssp") = 6
4·H0("isis") = 4
2·H0("ip") = 2

nH1("mississippi") = 6 + 4 + 2 = 12
nH0("mississippi") ≈ 20.03
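Following the slide's convention (the "context" of a character is the characters after it, since BWT sorts by post-context), order-k entropy can be sketched as:

```python
import math
from collections import Counter, defaultdict

def n_h0(s):
    # nH0(s) = sum over symbols a of n_a * log2(n / n_a)
    n = len(s)
    return sum(c * math.log2(n / c) for c in Counter(s).values())

def n_hk(s, k):
    # Group each character by its post-context (the k characters
    # that follow it), then sum the order-0 entropies of the groups.
    groups = defaultdict(list)
    for i in range(len(s) - k):
        groups[s[i + 1:i + 1 + k]].append(s[i])
    return sum(n_h0(g) for g in groups.values())

# For "mississippi" and k=1 the groups are exactly
# "mssp" (context i), "isis" (context s), "ip" (context p),
# giving nH1 = 6 + 4 + 2 = 12.
```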
Measuring against Hk
● When performing worst-case analysis of lossless text compressors, we usually measure against Hk
● The goal – a bound of the form: |A(s)| ≤ c·nHk(s) + lower-order term
● Optimal: |A(s)| ≤ nHk(s) + lower-order term
Bounds
       lower                    upper
BW0    2    [KaplanVerbin07]   3.33 [ManziniGagie07]
BWDC   1.3  [KaplanVerbin07]   1.7  [KaplanLandauVerbin06]
BWRL   1.26 [KaplanVerbin07]   5    [Manzini99]
gzip   1                       1
PPM    1                       1
∃s, k: |BW0(s)| ≥ (2 − o(1))·nHk(s)
Surprising, since BWT-based compressors work better than gzip in practice!
Possible Explanations
1. Asymptotics:
   |gzip(s)| ≤ nHk(s) + O(n/log n)
   |BWDC(s)| ≤ 1.7·nHk(s) + O(log n)
   and real compressors cut the input into blocks, so n is small
2. English text is not Markovian!
   ● Analyzing on a different model might show BWT's superiority
Lower bound
● Wish to analyze BW0 = BWT + MTF + Order-0
● Need to show s s.t. |BW0(s)| ≥ (2 − o(1))·nH0(s)
● Consider string s: 10^3 'a's, 10^6 'b's
● Entropy of s: about 10^3·log(10^3) bits
● BWT(s) has the same character frequencies
● If MTF(BWT(s)) has 2·10^3 '1's and 10^6 − 10^3 '0's, its compressed size is about 2·10^3·log(10^3/2) bits – roughly twice the entropy
● So we need BWT(s) to have many isolated 'a's
Many isolated 'a's
● Goal: find s such that in BWT(s), most 'a's are isolated
● Solution: probabilistic.
● BWT is a (≤ n+1)-to-1 function
● A random string has ≥ 1/(n+1) chance of being a BWT-image
● A random string has ≥ 1 − 1/n² chance of having "many" isolated 'a's
● Therefore, such a string exists
General Calculation
● s contains pn 'a's, (1−p)n 'b's. Entropy of s:
  nH0(s) = n·[p·log(1/p) + (1−p)·log(1/(1−p))]
● MTF(BWT(s)) contains 2p(1−p)n '1's, rest '0's; compressed size of MTF(BWT(s)):
  ≈ n·2p(1−p)·log(1/(2p(1−p)))
● Ratio:
  [2p(1−p)·log(1/(2p(1−p)))] / [p·log(1/p) + (1−p)·log(1/(1−p))] → 2 as p → 0
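A quick numeric check of the calculation above (a sketch; log base 2, and the limit of 2 is approached only very slowly as p → 0):

```python
import math

def ratio(p):
    # compressed size per symbol over entropy per symbol, as above
    q = 2 * p * (1 - p)                  # fraction of '1's after MTF
    num = q * math.log2(1 / q)
    den = p * math.log2(1 / p) + (1 - p) * math.log2(1 / (1 - p))
    return num / den

# The ratio climbs toward 2 as p shrinks:
# ratio(1e-3), ratio(1e-6), ratio(1e-12) ≈ 1.57, 1.77, 1.88
```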
Lower bounds on BWDC, BWRL
● Similar technique, but taking p infinitesimally small just gives a highly compressible string; so instead, maximize the ratio over p.
● Gives weird constants, but quite strong
Experimental Results
● Sanity check: picking texts from the above Markov models really shows this behavior in practice
● Picking text from "realistic" Markov sources also shows non-optimal behavior ("realistic" = generated from actual texts)
● On long Markov text, gzip works better than BWT
Bottom Line
● BWT compressors are not optimal (vs. order-k entropy)
● We believe that they are good because English text is not Markovian
● Find a theoretical justification!
● Also: improve constants, find BWT algorithms with better ratios, ...
Main Property: LF mapping
● The i-th occurrence of c in L corresponds to the i-th occurrence of c in F.
● This happens because the characters in L are sorted by their post-context, and the occurrences of character c in F are also sorted by their post-context.
F              L
# mississipp   i
i #mississip   p
i ppi#missis   s
i ssippi#mis   s
i ssissippi#   m
m ississippi   #
p i#mississi   p
p pi#mississ   i
s ippi#missi   s
s issippi#mi   s
s sippi#miss   i
s sissippi#m   i
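The LF mapping is exactly what makes the BWT invertible. A sketch, assuming a unique smallest sentinel '#' so that row 0 of the sorted matrix is the one beginning with '#':

```python
def inverse_bwt(L):
    # F is L sorted; sorting L-positions by (character, position)
    # pairs the i-th occurrence of c in L with the i-th in F.
    n = len(L)
    fl = sorted(range(n), key=lambda i: (L[i], i))  # F-rank -> L-position
    lf = [0] * n                                    # L-position -> F-rank
    for r, i in enumerate(fl):
        lf[i] = r
    # Row 0 starts with '#', and L[r] always precedes F[r] in the text,
    # so walking the LF mapping spells the text backwards.
    out, r = [], 0
    for _ in range(n):
        out.append(L[r])
        r = lf[r]
    t = ''.join(reversed(out))       # a rotation starting with '#'
    return t[1:] + t[0]              # rotate so '#' is back at the end

# inverse_bwt('ipssm#pissii') == 'mississippi#'
```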
BW0 vs. Lempel-Ziv
● BW0 dynamically takes advantage of context-regularity
● A robust, smooth alternative to Lempel-Ziv
BW0 vs. Statistical Coding
● Statistical coding (e.g. PPM): builds a model for each context; prediction -> compression

PPM:
● Optimally models each context
● Explicit partitioning – produces a model for each context

BW0:
● Exploits similarities between similar contexts
● No explicit partitioning into contexts
Compressed Text Indexing
● An application of BWT
● A compressed representation of text that supports:
  ● fast pattern matching (without decompression!)
  ● partial decompression
● So, no need to ever decompress! Space usage: |BW0(s)| + o(n)
● See more in [Ferragina-Manzini]
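The pattern-matching idea can be sketched with FM-index-style backward search over BWT(s). This is a toy version: occ is computed by rescanning L in O(n), whereas a real index answers it in O(1) using rank structures within |BW0(s)| + o(n) bits:

```python
def count_occurrences(L, pattern):
    # L = BWT of the text (text has a unique sentinel appended).
    # C[c] = number of characters in L smaller than c.
    C = {c: sum(1 for x in L if x < c) for c in set(L)}
    occ = lambda c, i: L[:i].count(c)   # rank of c in L[0:i] (toy: O(n))
    sp, ep = 0, len(L)                  # current range of matching rows
    for c in reversed(pattern):
        if c not in C:
            return 0
        sp = C[c] + occ(c, sp)          # LF-map the range boundaries
        ep = C[c] + occ(c, ep)
        if sp >= ep:
            return 0                    # pattern does not occur
    return ep - sp                      # number of matches in the text

# count_occurrences('ipssm#pissii', 'ssi') == 2
# ('mississippi' contains "ssi" twice)
```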
Musings
● On one hand: BWT-based algorithms are not optimal, while Lempel-Ziv is.
● On the other hand: BWT compresses much better in practice.
● Reasons:
  1. Results are asymptotic (the EE reason)
  2. English text was not generated by a Markov source (the real reason?)
● Goal: get a more honest way to analyze
● Use a statistic different from Hk?