1
A Modified Burrows-Wheeler Transformation for
Case-insensitive Search with Application to
Suffix Array Compression
Kunihiko Sadakane
Department of Information Science
University of Tokyo
2
Promising Techniques
Block Sorting Compression [Burrows, Wheeler 94]
• faster than PPMs
• decoding is much faster
• comparable performance with PPMs
Suffix Array [Manber, Myers 93]
• search data structure
• can find any substring
• more memory-efficient than suffix trees
We unify compression and search by using them.
Key: the Burrows-Wheeler Transformation (BWT)
3
Block Sorting Compression
• The Burrows-Wheeler Transformation (BWT) permutes the text symbols according to the lexicographic order of the suffixes that follow them.
• The permuted text becomes more compressible.
4
Novel Feature of the Block Sorting
• BWT is defined by the suffix array (sorted indexes of suffixes)
• The suffix array is recovered from the compressed text
The suffix array can be compressed by Block Sorting!
But the recovered suffix array cannot be used for case-insensitive search, because it orders capital and small letters as different symbols.
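As a minimal sketch of why the suffix array comes along with the decoded text, the reverse transformation below walks the BWTed text backwards and records which sorted row each text position lands in. It assumes suffixes are ordered as if terminated by a symbol smaller than all others, and that the row holding suffix 0 is stored with the compressed text; the names `inverse_bwt` and `primary` are illustrative and do not come from the slides.

```python
def inverse_bwt(last, primary):
    """Recover (text, suffix_array) from a BWTed string.

    last    -- BWTed text: last[i] is the symbol preceding the i-th smallest suffix
    primary -- row of suffix 0 (assumed to be stored next to the BWTed text)
    """
    n = len(last)
    # A stable sort of the rows by symbol yields the first column. The primary
    # row holds the final text symbol, whose suffix is the shortest one starting
    # with that symbol, so it goes first within its group.
    order = sorted(range(n), key=lambda i: (last[i], -1 if i == primary else i))
    lf = [0] * n
    for pos, row in enumerate(order):
        lf[row] = pos  # row -> row of the suffix starting one symbol earlier

    text = [""] * n
    sa = [0] * n
    row = primary
    for k in range(n - 1, -1, -1):
        text[k] = last[row]  # symbol at text position k
        row = lf[row]
        sa[row] = k          # the suffix starting at k sits in this row
    return "".join(text), sa


text, sa = inverse_bwt("cCABAb", 1)
print(text, sa)  # AbcABC [3, 0, 4, 5, 1, 2]
```

Decoding fills in the suffix array as a by-product of the same backward walk, so only the BWTed text and one row index need to be stored.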
5
Our Contribution
• We propose a Modified Burrows-Wheeler Transformation (MBWT)
  – used for compressing a text and its suffix array
• The decoded suffix array can be used for case-insensitive search.
• Any unification function can be used (not only case-insensitive search).
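To illustrate how a suffix array of the unified text supports case-insensitive search, here is a small sketch using binary search. It assumes the unification function maps each symbol to exactly one symbol (so positions are preserved); `suffix_array`, `_bound` and `find_all` are hypothetical helper names, and the suffix array is rebuilt by naive sorting here, whereas in the setting of the slides it would come from decoding.

```python
def suffix_array(u):
    # Naive construction: sort start positions by the suffix they begin.
    return sorted(range(len(u)), key=lambda i: u[i:])


def _bound(u, sa, p, upper):
    # Binary search over sorted suffixes, comparing only their first len(p) symbols.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        prefix = u[sa[mid]:sa[mid] + len(p)]
        if prefix < p or (upper and prefix == p):
            lo = mid + 1
        else:
            hi = mid
    return lo


def find_all(text, pattern, unify=str.lower):
    """Return all start positions where pattern occurs, ignoring case."""
    u = unify(text)
    p = unify(pattern)
    sa = suffix_array(u)
    start = _bound(u, sa, p, upper=False)
    end = _bound(u, sa, p, upper=True)
    return sorted(sa[start:end])


print(find_all("AbcABC", "BC"))   # [1, 4]
print(find_all("AbcABC", "abc"))  # [0, 3]
```

All occurrences of a pattern form one contiguous range of the suffix array, which is why two binary searches suffice.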
6
An Application
Distributed Web Search Robots
[Figure: each search robot collects text from Web sites (e.g. "Abc ABC", "xyz XYZ"), compresses the collected text by Block Sorting, and transfers it via network.]
7
Search Server
[Figure: data transferred via network (e.g. "ABCAbc", "XYZxyz") is decoded into text and suffix array (e.g. 3 10 8 5 2 7 ...), which are merged into the database; the database's suffix array is kept on disk.]
8
The original BWT
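Input text: AbcABC
Rotations of the input text:
0 AbcABC
1 bcABCA
2 cABCAb
3 ABCAbc
4 BCAbcA
5 CAbcAB
After sorting (index, rotation, last symbol):
3 ABCAb c
0 AbcAB C
4 BCAbc A
5 CAbcA B
1 bcABC A
2 cABCA b
Suffix array: 3 0 4 5 1 2
First symbols of the sorted rotations: AABCbc
BWTed text (last symbols of the sorted rotations): cCABAb
Arrows in the figure: sorting, BWT, reverse BWT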
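The example on this slide can be reproduced with a short sketch that sorts the rotations of the input text directly. This is a naive illustration rather than an efficient construction, and the function name `bwt` is only illustrative.

```python
def bwt(text):
    """Return (sorted rotation indexes, BWTed text) by naive rotation sorting."""
    n = len(text)
    # Sort rotation start positions by the rotation they begin
    # (symbols compare by code point, so capital letters sort before small ones).
    order = sorted(range(n), key=lambda i: text[i:] + text[:i])
    # The BWTed text is the last symbol of each sorted rotation,
    # i.e. the symbol just before the rotation's start position.
    last = "".join(text[i - 1] for i in order)
    return order, last


order, last = bwt("AbcABC")
print(order)                      # [3, 0, 4, 5, 1, 2]
print(last)                       # cCABAb
print("".join(sorted("AbcABC")))  # AABCbc (first symbols of the sorted rotations)
```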
9
Unification
• unify capital/small letters (tolower)
  DCC = dcc
• unify double-byte codes and single-byte codes in Japanese EUC code
  ＡＢＣ (a3c1 a3c2 a3c3) = ABC (41 42 43)
• unify Japanese Hiragana and Katakana
  あいうえお = アイウエオ
We identify character equivalence.
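The three unifications listed above could be sketched as follows. The slides work on EUC byte codes; this sketch instead assumes Unicode strings and uses code point offsets, and the function names are hypothetical.

```python
def unify_case(s):
    # Capital and small letters are treated as the same symbol.
    return s.lower()


def unify_width(s):
    # Full-width forms U+FF01..U+FF5E correspond to ASCII U+0021..U+007E.
    return "".join(
        chr(ord(c) - 0xFEE0) if 0xFF01 <= ord(c) <= 0xFF5E else c for c in s
    )


def unify_kana(s):
    # Katakana U+30A1..U+30F6 correspond to Hiragana U+3041..U+3096.
    return "".join(
        chr(ord(c) - 0x60) if 0x30A1 <= ord(c) <= 0x30F6 else c for c in s
    )


print(unify_case("DCC"))         # dcc
print(unify_width("ＡＢＣ"))      # ABC
print(unify_kana("アイウエオ"))   # あいうえお
```

Each function maps one symbol to one symbol, so positions in the unified text line up with positions in the original text.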
10
Modified BWT
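Input text: AbcABC
Unified text (unify): abcabc
Suffixes of the unified text:
0 abcabc$
1 bcabc$
2 cabc$
3 abc$
4 bc$
5 c$
After sorting (index, suffix, preceding symbol of the input text):
3 abc$ c
0 abcabc$ C
4 bc$ A
1 bcabc$ A
5 c$ B
2 cabc$ b
Suffix array: 3 0 4 1 5 2
MBWTed text: cCAABb (unified: ccaabb; sorted: aabbcc)
MBWT permutes symbols by the suffix array of the unified text.
Arrows in the figure: unify, sorting, MBWT, reverse BWT, reverse MBWT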
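A minimal sketch of the transformation on this slide and one possible inverse is given below. It assumes the unification function maps each symbol to one symbol, that "$" sorts before every symbol, and that the row of suffix 0 (here called `primary`, a hypothetical name) is kept with the output; the slides do not spell out the inverse, so `inverse_mbwt` is one way to do it, not necessarily the author's.

```python
def mbwt(text, unify=str.lower):
    """Permute the original symbols by the suffix array of the unified text."""
    if not text:
        return "", 0
    u = unify(text)
    # Python orders a string before any longer string it is a prefix of,
    # which matches a terminator "$" smaller than every symbol.
    sa = sorted(range(len(u)), key=lambda i: u[i:])
    out = "".join(text[i - 1] for i in sa)  # original symbol preceding each suffix
    return out, sa.index(0)


def inverse_mbwt(last, primary, unify=str.lower):
    """Recover the original text and the suffix array of the unified text."""
    n = len(last)
    # Same backward walk as a reverse BWT, but rows are grouped by their
    # unified symbol; the primary row goes first within its group.
    order = sorted(
        range(n), key=lambda i: (unify(last[i]), -1 if i == primary else i)
    )
    lf = [0] * n
    for pos, row in enumerate(order):
        lf[row] = pos
    text, sa, row = [""] * n, [0] * n, primary
    for k in range(n - 1, -1, -1):
        text[k] = last[row]  # the original (non-unified) symbol is restored
        row = lf[row]
        sa[row] = k
    return "".join(text), sa


out, primary = mbwt("AbcABC")
print(out, primary)                # cCAABb 1
print(inverse_mbwt(out, primary))  # ('AbcABC', [3, 0, 4, 1, 5, 2])
```

The output keeps the original capital and small letters, while the recovered suffix array is ordered by the unified text and so can serve case-insensitive search.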
11
Compression Ratio and Speed
unification func.    comp. ratio    comp. time (s)
identical (BWT)      1.743          363.58
normal (MBWT)        1.764          363.41
LSB4                 2.523          443.89
MSB4                 2.707          438.04
zero (no BWT)        5.772          411.74
HTML files (total 90Mbytes)
Block size: 9Mbytes
• small difference between BWT and MBWT
• MBWT provides case-insensitive searches.