Compressed Index for a Dynamic Collection of Texts


Page 1: Compressed Index for a Dynamic Collection of Texts

Compressed Index for a Dynamic Collection of Texts

H.W. Chan, W.K. Hon, T.W. Lam

The University of Hong Kong

Page 2: Compressed Index for a Dynamic Collection of Texts

Problem Definition

Given a collection L = { T_1, T_2, …, T_k } of texts of total length n over an alphabet Σ

We want to create an index for L such that, given any pattern P, the occurrences of P in each T_i can be found quickly

Also, the index should support fast insertion of a text T_i into L and fast deletion of T_i from L

Page 3: Compressed Index for a Dynamic Collection of Texts

Previous Work & Our Result

                                 Space (bits)   Matching                      Insertion / deletion
[McCreight, JACM’76]             O(n log n)     O(|P| + occ)                  O(|T_i|)
[Ferragina & Manzini, FOCS’00]   O(n)           O(|P| log^3 n + occ log n)    amortized O(|T_i| log n) / amortized O(|T_i| log^2 n)
[This paper]                     O(n)           O(|P| log n + occ log^2 n)    worst case O(|T_i| log n)

Page 4: Compressed Index for a Dynamic Collection of Texts

Two Basic Tools: CSA, FM-index

Definition 1: The main component of the CSA for a text T is a function Ψ such that

Ψ[i] = SA^{-1}[SA[i] + 1],

where SA[i] is the i-th entry of the suffix array of T, and SA^{-1} is the inverse of SA
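
Definition 1 can be checked on a small example. The following is a minimal sketch, not the paper's construction: it builds the suffix array naively by sorting, uses 0-based indices (the slides use 1-based), appends a sentinel "$" to the text, and lets the last suffix wrap around to position 0. The sentinel, the wrap-around, and the function names are assumptions made for illustration.

```python
def suffix_array(text):
    # Naive construction: sort suffix start positions by the suffix they begin.
    return sorted(range(len(text)), key=lambda i: text[i:])


def psi(text):
    sa = suffix_array(text)
    n = len(sa)
    # inv is SA^{-1}: inv[p] is the rank of the suffix starting at position p.
    inv = [0] * n
    for rank, pos in enumerate(sa):
        inv[pos] = rank
    # Psi[i] = SA^{-1}[SA[i] + 1]; the "% n" wrap-around is an assumption.
    return [inv[(sa[i] + 1) % n] for i in range(n)]


text = "banana$"
print(suffix_array(text))  # [6, 5, 3, 1, 0, 4, 2]
print(psi(text))           # [4, 0, 5, 6, 3, 1, 2]
```

Within the ranks of suffixes that start with the same character (ranks 1 to 3 for "a", ranks 5 to 6 for "n"), the Ψ values 0, 5, 6 and 1, 2 are increasing.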

Page 5: Compressed Index for a Dynamic Collection of Texts

Two Basic Tools: CSA, FM-index

Definition 2: The FM-index of T is based on the Burrows-Wheeler array of T, an array of characters denoted by BWT, such that

BWT[i] = T[SA[i] - 1].

The main component of the FM-index is a set of |Σ| functions count_c, one for every c ∈ Σ, such that

count_c[i] = number of occurrences of c in BWT[1…i]
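
Definition 2 can be illustrated the same way. This is a sketch under stated assumptions: indices are 0-based, the text ends with a sentinel "$", the entry with SA[i] = 0 wraps around to the last character, and the count tables are stored as plain lists (the real index stores them compactly). The helper names are hypothetical.

```python
def suffix_array(text):
    # Naive construction: sort suffix start positions by the suffix they begin.
    return sorted(range(len(text)), key=lambda i: text[i:])


def bwt(text):
    # BWT[i] = T[SA[i] - 1]; text[-1] gives the wrap-around when SA[i] == 0.
    return "".join(text[pos - 1] for pos in suffix_array(text))


def count_tables(bw):
    # counts[c][i] = number of occurrences of c among the first i characters
    # of bw, which matches count_c[i] over BWT[1..i] in 1-based notation.
    counts = {c: [0] for c in sorted(set(bw))}
    for ch in bw:
        for c, column in counts.items():
            column.append(column[-1] + (1 if c == ch else 0))
    return counts


bw = bwt("banana$")
print(bw)                     # annb$aa
print(count_tables(bw)["a"])  # [0, 1, 1, 1, 1, 1, 2, 3]
```

Each count_c column is a non-decreasing sequence, which is the property the index exploits.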

Page 6: Compressed Index for a Dynamic Collection of Texts

Our Index

Our index is a dynamic version of CSA + FM-index for the concatenated text T_1 T_2 … T_k

We exploit the property that both Ψ and count essentially consist of a few sequences of increasing values.

Page 7: Compressed Index for a Dynamic Collection of Texts

Our Index

Maintaining a dynamic CSA and FM-index reduces to maintaining dynamic sequences of increasing values

Observation 3: A balanced search tree is well suited to a dynamic sequence

Observation 4: Difference encoding for increasing values can save space
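
Observation 4 can be made concrete with a small sketch. It stores an increasing sequence as its first value followed by successive gaps, and compares the size against fixed-width storage. The choice of Elias-gamma code lengths is an assumption for illustration; the slides do not say which code the index uses, and the sample values are made up.

```python
from itertools import accumulate


def to_gaps(values):
    # Keep the first value, then store each successive difference.
    if not values:
        return []
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]


def from_gaps(gaps):
    # Prefix sums recover the original increasing sequence.
    return list(accumulate(gaps))


def gamma_bits(x):
    # Length of the Elias-gamma code of an integer x >= 1.
    return 2 * (x.bit_length() - 1) + 1


values = [1000, 1003, 1004, 1010, 1025]  # strictly increasing sample
gaps = to_gaps(values)
print(gaps)                                      # [1000, 3, 1, 6, 15]
print(len(values) * max(values).bit_length())    # 55 bits at fixed width
print(sum(gamma_bits(g) for g in gaps))          # 35 bits as gamma-coded gaps
```

Gamma codes need gaps of at least 1, so a non-decreasing sequence such as count_c would need a shifted or different encoding.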

Page 8: Compressed Index for a Dynamic Collection of Texts

Our Index

Combining Observations 3 and 4: a Differential Balanced Search Tree handles the values in the dynamic CSA and FM-index

Drawback: computing Ψ and count is slowed down by an O(log n) factor

Pattern matching: O(|P| log n + occ log^2 n) time
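
The |P| factor in the matching bound comes from processing the pattern one character at a time. The slides do not show the matching procedure itself, so the sketch below uses the standard FM-index backward search on a single static text as a stand-in. The count lookup here is a naive scan; in the dynamic index each lookup would instead cost O(log n). The sentinel "$" and all function names are assumptions.

```python
def suffix_array(text):
    # Naive construction: sort suffix start positions by the suffix they begin.
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, pattern):
    sa = suffix_array(text)
    bw = [text[pos - 1] for pos in sa]  # Burrows-Wheeler array
    # first[c] = rank of the first suffix that starts with character c.
    first = {}
    for rank, pos in enumerate(sa):
        first.setdefault(text[pos], rank)

    def count(c, i):
        # Naive stand-in for count_c: occurrences of c in bw[0:i].
        return bw[:i].count(c)

    lo, hi = 0, len(text)  # half-open rank range of matching suffixes
    for c in reversed(pattern):
        if c not in first:
            return []
        lo = first[c] + count(c, lo)
        hi = first[c] + count(c, hi)
        if lo >= hi:
            return []
    return sorted(sa[lo:hi])


print(find_occurrences("banana$", "ana"))  # [1, 3]
```

Each pattern character costs two count lookups, and each reported occurrence costs one suffix array lookup, which in a compressed index is itself computed from Ψ or count rather than stored.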

Page 9: Compressed Index for a Dynamic Collection of Texts

Insertion & Deletion (sketch idea)

Insertion corresponds to finding update points in the increasing sequences of Ψ and count

To insert a text T into L, there are O(|T|) such update points

Update points can be found by simulating a pattern matching query of T against L

Total time: O(|T| log n)
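
To see what the update points are, the following naive sketch computes, for each suffix of a new text, the rank at which it would enter the sorted list of suffixes already in the collection. It uses binary search over explicitly stored suffixes, not the simulated pattern matching the slide refers to, and it assumes every text is terminated by the same sentinel "$". The function name is hypothetical.

```python
from bisect import bisect_left


def update_points(collection, new_text):
    # All suffixes of the texts already in the collection, each text
    # terminated by "$" (an assumption), kept in sorted order.
    existing = sorted(
        (t + "$")[i:] for t in collection for i in range(len(t) + 1)
    )
    terminated = new_text + "$"
    # One update point per suffix of the new text: the number of existing
    # suffixes that are strictly smaller than it.
    return [
        bisect_left(existing, terminated[i:]) for i in range(len(terminated))
    ]


points = update_points(["banana"], "ana")
print(points)  # [2, 5, 1, 0]
```

The result has |T| + 1 entries (one per suffix, including the sentinel alone), which matches the O(|T|) count of update points.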

Page 10: Compressed Index for a Dynamic Collection of Texts

Insertion & Deletion (sketch idea)

Deletion reverses the insertion process

Update points can be found by querying Ψ iteratively, instead of simulating a pattern matching query

Total time: O(|T| log n)
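
Why iterating Ψ suffices can be seen on a single text: starting from the rank of the suffix at position 0, each application of Ψ moves to the rank of the next suffix in text order, so |T| steps visit every rank belonging to the text. The sketch below assumes 0-based indices, a sentinel "$", and wrap-around for the last suffix. How the index locates the starting rank of a text inside a collection is not covered by the slides and is not modelled here.

```python
def suffix_array(text):
    # Naive construction: sort suffix start positions by the suffix they begin.
    return sorted(range(len(text)), key=lambda i: text[i:])


def psi_and_inverse(text):
    sa = suffix_array(text)
    n = len(sa)
    inv = [0] * n
    for rank, pos in enumerate(sa):
        inv[pos] = rank
    # Psi[i] = SA^{-1}[SA[i] + 1], with wrap-around as an assumption.
    return [inv[(sa[i] + 1) % n] for i in range(n)], inv


def psi_walk(psi, start, steps):
    # Follow Psi from `start`, recording each rank visited.
    ranks, r = [], start
    for _ in range(steps):
        ranks.append(r)
        r = psi[r]
    return ranks


text = "banana$"
psi, inv = psi_and_inverse(text)
walk = psi_walk(psi, inv[0], len(text))
print(walk)  # [4, 3, 6, 2, 5, 1, 0]
```

The walk equals SA^{-1} read in text order, so it lists exactly the ranks that a deletion of this text would have to remove.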

Page 11: Compressed Index for a Dynamic Collection of Texts

Conclusion, Progress & Future Work

In the literature, there is a dual problem called Dictionary Management, which maintains a collection of patterns such that, when a text T is given later, all occurrences of each pattern in T are reported in one query. Fast insertion/deletion of patterns is also required.

O(n) bits: some progress …

Page 12: Compressed Index for a Dynamic Collection of Texts

Conclusion, Progress & Future Work

There is another problem called Dynamic Text, which maintains a single text T and, when a pattern P is given later, supports finding all occurrences of P in T. The text T is subject to insertion/deletion of substrings.

O(n log n) bits: Sahinalp & Vishkin, FOCS’96

O(n) bits: ??