Compressed Suffix Arrays based on Run-Length Encoding
Veli Mäkinen
Bielefeld University
Gonzalo Navarro
University of Chile
BWT RL FID
20.6.2005 Compressed suffix arrays based on run-length encoding
Abstract
We introduce a new full-text index that occupies O(H_k |T|) bits and supports counting queries in O(|P|) time.
- optimal space / search time on constant alphabet
- works on any alphabet size σ, adding a log σ factor to the space/time bounds.
Introduction
We consider exact string matching on static text.
The task is to construct an index for the text such that the occurrences of a given pattern can be found efficiently.
A well-known optimal solution exists: build a suffix tree over the text.
Introduction...
The suffix-tree-based solution takes O(|T| log |T|) bits of space.
The text itself can be represented in O(|T| log σ) bits,
- or even less space if the text is compressible.
In many applications the space usage is the real bottleneck, not the search efficiency.
Introduction...
During the last 15 years, many practical / theoretical solutions with reduced space complexities have been proposed.
The work can roughly be divided into three categories:
(1) Reducing constant factors
(2) Concrete optimization
(3) Abstract optimization
Reducing constant factors
- Suffix arrays (Manber & Myers 1990)
- Suffix cactuses (Kärkkäinen 1995)
- Sparse suffix trees (Kärkkäinen & Ukkonen 1996)
- Space-efficient suffix trees (Kurtz 1998)
- Enhanced suffix arrays (Abouelhoda & Ohlebusch & Kurtz 2002)
Concrete optimization
“Minimizing automata”
- DAWGs (Blumer & Blumer & Haussler & McConnell & Ehrenfeucht 1983)
- Compact DAWGs (Crochemore & Vérin 1997)
- Compact suffix arrays (Mäkinen 2000)
Abstract optimization
Objective: Use as little space as possible to support the functionality of a given abstract definition of a data structure.
Space is measured in bits and is usually stated in terms of the entropy of the text.
Abstract optimization: Example
A full-text index for a given text T supports the following operations:
- Exists(P): is P a substring of T?
- Count(P): how many times does P occur in T?
- Report(P): list the occurrences of P in T.
Abstract optimization...
Seminal work by Jacobson 1989: rank-select queries on bit-vectors.
Rank-select-type structures for suffix trees (Munro & Raman & Rao & Clark 1996-)
Lempel-Ziv index (Kärkkäinen & Ukkonen 1996)
Abstract optimization...
- Compressed suffix arrays (Grossi & Vitter 2000, Sadakane 2000, 2002)
- FM-index (Ferragina & Manzini 2000)
- LZ-self-index (Navarro 2002)
- Space-optimal full-text indexes (Grossi & Gupta & Vitter 2003, 2004)
- Alphabet friendly FM-index (Ferragina & Manzini & Mäkinen & Navarro)
- See also ISAAC'04, SODA'05,...
This talk
We show that combining the FM-index with the compact suffix array gives a practical full-text index with a good space / search time tradeoff.
Our structure, the Run-Length FM-index, uses O(min(|T|(H_k log σ + 1), |T| log σ)) bits and supports Count(P) in O(|P| log σ) time.
This talk...
H_k = H_k(T) is the order-k empirical entropy of T, i.e., “the average number of bits needed to encode a symbol using a fixed codebook for each possible combination of k previous symbols”.
It holds that 0 ≤ H_k ≤ H_{k-1} ≤ ... ≤ H_0 ≤ log σ.
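The entropy chain can be checked numerically on a small text. The following is a minimal sketch, assuming the usual definition in which every symbol is grouped by its k preceding symbols and each group is charged its zeroth-order entropy; the function names h0 and hk are illustrative and do not come from the talk.

```python
import math
from collections import Counter, defaultdict

def h0(seq):
    """Zeroth-order empirical entropy of seq, in bits per symbol."""
    n = len(seq)
    if n == 0:
        return 0.0
    return sum(cnt / n * math.log2(n / cnt) for cnt in Counter(seq).values())

def hk(text, k):
    """Order-k empirical entropy: group each symbol by its k previous symbols."""
    if k == 0:
        return h0(text)
    followers = defaultdict(list)
    for i in range(k, len(text)):
        followers[text[i - k:i]].append(text[i])
    # Each context pays |group| * H0(group) bits; normalise by |T|.
    return sum(len(f) * h0(f) for f in followers.values()) / len(text)

text = "kalevala#"
sigma = len(set(text))
values = [hk(text, k) for k in range(4)]   # H_0, H_1, H_2, H_3
```

Here values is non-increasing and values[0] stays below log2(sigma), in line with the chain of inequalities.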
FM-index
Let us first describe a simple variant of the FM-index that:
- occupies O(|T| log σ) bits, and
- supports counting queries in O(|P| log σ) time.
Simple FM-index
Construct the Burrows-Wheeler-transformed text bwt(T) [BW94].
From bwt(T) it is possible to construct the suffix array sa(T) of T in linear time.
Instead of constructing the whole sa(T), one can add small data structures besides bwt(T) to simulate a search from sa(T).
Burrows-Wheeler transformation
Construct a matrix M that contains as rows all rotations of T.
Sort the rows in lexicographic order.
Let L be the last column and F be the first column.
bwt(T) = L, associated with the row number of T in the sorted M.
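The construction can be written down directly. This is a quadratic-space sketch for small examples only, assuming T ends with a unique terminator (here '#') that sorts before every other symbol; the function name bwt is illustrative.

```python
def bwt(text):
    """Return (L, F, row): last and first columns of the sorted rotation
    matrix M, and the 1-based row of M that holds text itself."""
    n = len(text)
    rotations = sorted(text[i:] + text[:i] for i in range(n))
    last = "".join(r[-1] for r in rotations)
    first = "".join(r[0] for r in rotations)
    return last, first, rotations.index(text) + 1

L, F, row = bwt("kalevala#")
```

A practical implementation would sort suffixes with a suffix-array construction algorithm rather than materialise all rotations.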
Example
pos 123456789
T = kalevala#

Sorted rotations (row:sa):
1:9 #kalevala
2:8 a#kaleval
3:6 ala#kalev
4:2 alevala#k
5:4 evala#kal
6:1 kalevala#
7:7 la#kaleva
8:3 levala#ka
9:5 vala#kale

==> L = alvkl#aae, row 6
Exercise: Given L and the row number, how to compute T and sa(T)?
[Figure: recovering T and sa(T) from L. Sorting L gives F. Following the LF mapping from row 6 reads T backwards (# a l a v e l a k) and assigns the suffix array values 9, 8, ..., 1.]

i      1 2 3 4 5 6 7 8 9
L      a l v k l # a a e
F      # a a a e k l l v
LF[i]  2 7 9 6 8 1 3 4 5
Implicit LF[i]
Ferragina and Manzini (2000) noticed the following connection:
LF[i] = C_T[L[i]] + rank_{L[i]}(L, i)
Here
- C_T[c]: number of letters 0,1,...,c-1 in L = bwt(T)
- rank_c(L,i): number of occurrences of letter c in the prefix L[1,i]
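The formula also gives one way to recover T and sa(T) from L and the row number. The sketch below is a plain-array illustration, not a succinct implementation; lf_mapping and invert_bwt are hypothetical names, and indices are 1-based as on the slides.

```python
from collections import Counter

def lf_mapping(L):
    """LF[i] = C[L[i]] + rank_{L[i]}(L, i) for i = 1..n (index 0 unused)."""
    counts = Counter(L)
    C, total = {}, 0
    for c in sorted(counts):        # C[c] = number of symbols smaller than c
        C[c] = total
        total += counts[c]
    seen = Counter()
    LF = [0] * (len(L) + 1)
    for i, c in enumerate(L, start=1):
        seen[c] += 1                # seen[c] now equals rank_c(L, i)
        LF[i] = C[c] + seen[c]
    return LF

def invert_bwt(L, row):
    """Recover T and sa(T) from L and the row of T in the sorted matrix M."""
    n = len(L)
    LF = lf_mapping(L)
    text = [""] * n
    sa = [0] * (n + 1)
    i = row
    for pos in range(n, 0, -1):     # L[i] is T[pos]; T is read right to left
        text[pos - 1] = L[i - 1]
        sa[i] = pos % n + 1         # row i starts one position after pos, cyclically
        i = LF[i]
    return "".join(text), sa[1:]

T, sa = invert_bwt("alvkl#aae", 6)
```

On the running example this yields T = kalevala# and the suffix array listed next to the sorted rotations.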
Rank/Select
i              1 2 3 4 5 6 7 8 9 10 11 12
L              0 0 1 0 0 1 0 0 1  1  0  1
rank_1(L,i)    0 0 1 1 1 2 2 2 3  4  4  5

j              1 2 3  4  5
select_1(L,j)  3 6 9 10 12
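The two operations can be illustrated with precomputed tables. This is only a sketch of the interface: it stores full prefix sums, so it uses far more than the n + o(n) bits of Jacobson-style dictionaries. The class name BitVector is an assumption of this example.

```python
class BitVector:
    """Rank/select over a string of '0'/'1' characters, 1-based positions."""

    def __init__(self, bits):
        self.prefix = [0]   # prefix[i] = number of 1s in bits[1..i]
        self.ones = [0]     # ones[j] = position of the j-th 1 (index 0 unused)
        for pos, b in enumerate(bits, start=1):
            self.prefix.append(self.prefix[-1] + (b == "1"))
            if b == "1":
                self.ones.append(pos)

    def rank1(self, i):
        return self.prefix[i]

    def select1(self, j):
        return self.ones[j]

bv = BitVector("001001001101")
ranks = [bv.rank1(i) for i in range(1, 13)]
selects = [bv.select1(j) for j in range(1, 6)]
```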
Example on L = alvkl#aae:

i      1 2 3 4 5 6 7 8 9
LF[i]  2 7 9 6 8 1 3 4 5

LF[7] = C_T[a] + rank_a(L,7) = 1 + 2 = 3
Backward search on bwt(T)
Observation: If [i,j] is the range of rows of M that start with string X, then the range [i',j'] containing cX can be computed as
i' := C_T[c] + rank_c(L,i-1) + 1,
j' := C_T[c] + rank_c(L,j).
Backward search on bwt(T) …
Example on L = alvkl#aae: X = a occupies rows [i,j] = [2,4] of M. Is vX = va in T?
rank_v(L,i-1) = 0
rank_v(L,j) = 1
C_T[v] = 8
i' := 8 + 0 + 1 = 9
j' := 8 + 1 = 9
The range [i',j'] = [9,9] is non-empty, so va occurs once.
Backward search on bwt(T) …
Algorithm Count(P[1,m], L[1,n], C_T[1,σ])
  c = P[m]; k = m;
  i = C_T[c]+1; j = C_T[c+1];
  while (i ≤ j and k > 1) do begin
    c = P[k-1]; k = k-1;
    i = C_T[c]+rank_c(L,i-1)+1;
    j = C_T[c]+rank_c(L,j);
  end;
  if (j < i) then return 0 else return (j-i+1);
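A runnable version of Count might look as follows. It is a sketch under simplifying assumptions: rank is computed naively by scanning L, so each step costs O(n) rather than the constant or O(log σ) time of a real index, and the helper names build_C, rank and count are illustrative.

```python
from collections import Counter

def build_C(L):
    """C[c] = number of symbols in L smaller than c, plus per-symbol counts."""
    counts, C, total = Counter(L), {}, 0
    for c in sorted(counts):
        C[c] = total
        total += counts[c]
    return C, counts

def rank(L, c, i):
    """Naive rank_c(L, i): occurrences of c in L[1..i]."""
    return L[:i].count(c)

def count(P, L):
    """Number of occurrences of P in T, given L = bwt(T)."""
    C, counts = build_C(L)
    c = P[-1]
    if c not in C:
        return 0
    i, j = C[c] + 1, C[c] + counts[c]     # rows of M starting with P[m]
    k = len(P)
    while i <= j and k > 1:
        c = P[k - 2]                      # P[k-1] in the 1-based notation
        k -= 1
        if c not in C:
            return 0
        i = C[c] + rank(L, c, i - 1) + 1
        j = C[c] + rank(L, c, j)
    return 0 if j < i else j - i + 1

L = "alvkl#aae"                           # bwt("kalevala#")
```

Note that the pattern is consumed from its last symbol to its first, which is why the search is called backward.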
Backward search on bwt(T)...
Array C_T[1,σ] takes O(σ log |T|) bits.
L = bwt(T) takes O(|T| log σ) bits.
Assuming rank_c(L,i) can be computed in constant time for each (c,i), the algorithm takes O(|P|) time to count the occurrences of P in T.
Answering rankc(L,i)
Wavelet tree (GGV 2003) is a data structure replacing L = bwt(T):
- supports rank_c(L,i) in O(log σ) time, and
- occupies |T|H_0(T) + o(|T|) bits.
Generalized wavelet tree (FMMN 2004) improves query time to constant when σ = O(polylog(|T|)).
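To make the O(log σ) rank concrete, here is a minimal pointer-based wavelet tree. It is a sketch only: the alphabet is split in half at every node (a balanced shape), and each node keeps plain prefix sums instead of compressed bit vectors, so it does not achieve the |T|H_0(T) space bound. The class name WaveletTree is an assumption of this example.

```python
class WaveletTree:
    """Balanced wavelet tree; rank(c, i) visits one node per level."""

    def __init__(self, seq, alphabet=None):
        self.alphabet = sorted(set(seq)) if alphabet is None else alphabet
        if len(self.alphabet) <= 1:
            return                          # leaf: nothing more to store
        mid = len(self.alphabet) // 2
        self.left_syms = set(self.alphabet[:mid])
        # prefix[i] = how many of the first i symbols go right (bit 1)
        self.prefix = [0]
        for ch in seq:
            self.prefix.append(self.prefix[-1] + (ch not in self.left_syms))
        self.left = WaveletTree(
            [ch for ch in seq if ch in self.left_syms], self.alphabet[:mid])
        self.right = WaveletTree(
            [ch for ch in seq if ch not in self.left_syms], self.alphabet[mid:])

    def rank(self, c, i):
        """Occurrences of c among the first i symbols of the sequence."""
        node = self
        while len(node.alphabet) > 1:
            ones = node.prefix[i]
            if c in node.left_syms:
                i, node = i - ones, node.left    # count of 0-bits so far
            else:
                i, node = ones, node.right       # count of 1-bits so far
        return i if node.alphabet == [c] else 0

L = "alvkl#aae"
wt = WaveletTree(L)
```

For example, wt.rank("a", 7) returns 2, the value of rank_a(L,7) used in the LF[7] computation.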
Simple FM-index...
We obtained a structure that
- occupies O(|T|H_0(T)) bits, and
- supports counting queries in O(|P| log σ) time.
The original FM-index takes O(H_k|T|) bits, but only on a constant alphabet.
Compression boosting can be applied to improve the simple FM-index to take only O(|T|H_k(T)) bits (FMMN 2004).
To partition or not...
All alphabet-friendly solutions obtaining O(|T|H_k(T)) space for compressed suffix arrays use optimal partitioning of the BWT text, and store explicitly the distribution for each piece.
- always an overhead on the order of σ^(k+1).
MTF + zeroth-order coding takes O(|T|H_k(T)) bits plus a term in σ^k, but supporting queries on larger alphabets is non-trivial.
Run-Length FM-index
We make the following changes to the previous FM-index variant:
- L = bwt(T) is replaced by a sequence S[1,n'] and two bit-vectors B[1,|T|] and B'[1,|T|],
- the cumulative array C_T[1,σ] is replaced by C_S[1,σ],
- the wavelet tree is built on S, and
- some formulas are changed.
Run-Length FM-index...
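L  = cccaaggatt
B  = 1001010110   (1 marks the first position of each run of L)
S  = cagat        (one symbol per run of L)
B' = 1011001010   (the same runs, in the order they appear in F)
F  = aaacccggtt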
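The three structures can be derived from L with a single pass over its runs. A minimal sketch, assuming B' lists the runs sorted by symbol with ties kept in their order in L; the function name run_length_structures is illustrative.

```python
from itertools import groupby

def run_length_structures(L):
    """Split L into maximal runs and return (S, B, Bp), where Bp stands for B'."""
    runs = [(c, len(list(g))) for c, g in groupby(L)]
    S = "".join(c for c, _ in runs)
    B = "".join("1" + "0" * (length - 1) for _, length in runs)
    # sorted() is stable, so runs of the same symbol keep their order in L.
    Bp = "".join("1" + "0" * (length - 1)
                 for _, length in sorted(runs, key=lambda r: r[0]))
    return S, B, Bp

S, B, Bp = run_length_structures("cccaaggatt")
```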
Changes to formulas
Recall that we need to compute C_T[c] + rank_c(L,i) in the backward search.
Theorem: C_T[c] + rank_c(L,i) is equivalent to
  select_1(B', C_S[c] + 1 + rank_c(S, rank_1(B,i))) - 1,
when L[i] ≠ c, and otherwise to
  select_1(B', C_S[c] + rank_c(S, rank_1(B,i))) + i - select_1(B, rank_1(B,i)).
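The two cases can be checked exhaustively on small inputs. The sketch below uses naive linear-time rank and select in place of succinct dictionaries, and it makes two assumptions that the slide does not state: select_1 returns one past the end of the bit-vector when the requested 1 does not exist, and i = 0 is treated as the L[i] ≠ c case. The test L[i] = c is done through S, since L[i] = S[rank_1(B,i)].

```python
from itertools import groupby

def rank1(bits, i):
    """Number of 1s in bits[1..i]."""
    return bits[:i].count("1")

def select1(bits, j):
    """Position of the j-th 1, or len(bits) + 1 if it does not exist (assumption)."""
    pos = 0
    for _ in range(j):
        pos = bits.find("1", pos) + 1
        if pos == 0:
            return len(bits) + 1
    return pos

def build(L):
    runs = [(c, len(list(g))) for c, g in groupby(L)]
    S = "".join(c for c, _ in runs)
    B = "".join("1" + "0" * (n - 1) for _, n in runs)
    Bp = "".join("1" + "0" * (n - 1) for _, n in sorted(runs, key=lambda r: r[0]))
    CS = {c: sum(1 for d in S if d < c) for c in set(S)}   # runs of smaller symbols
    return S, B, Bp, CS

def c_plus_rank(c, i, S, B, Bp, CS):
    """C_T[c] + rank_c(L, i), computed from S, B, B' and C_S only."""
    r = rank1(B, i)                      # number of runs that start in L[1..i]
    k = S[:r].count(c)                   # rank_c(S, r)
    if r > 0 and S[r - 1] == c:          # L[i] = c: position i lies inside a run of c
        return select1(Bp, CS[c] + k) + i - select1(B, r)
    return select1(Bp, CS[c] + 1 + k) - 1

def naive(c, i, L):
    return sum(1 for d in L if d < c) + L[:i].count(c)

L = "cccaaggatt"
S, B, Bp, CS = build(L)
```

With these definitions, c_plus_rank("a", 8, S, B, Bp, CS) evaluates to 3 on the example, the same value as the direct count C_T[a] + rank_a(L,8).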
Example, L[i]=c
L  = cccaaggatt
F  = aaacccggtt
B  = 1001010110
S  = cagat
B' = 1011001010

LF[8] = select_1(B', C_S[a] + rank_a(S, rank_1(B,8))) + 8 - select_1(B, rank_1(B,8))
      = select_1(B', 0 + rank_a(S,4)) + 8 - select_1(B,4)
      = select_1(B', 0 + 2) + 8 - 8
      = 3
Space requirement
C_S[1,σ] takes O(σ log |T|) bits.
B and B' with rank/select dictionaries take 2|T| + o(|T|) bits.
S represented using a wavelet tree occupies |S|H_0(S) + o(|S|) bits.
In CPM 2004, we have shown that |S| ≤ H_k|T| + σ^k.