New Algorithms for Position Heaps - UCRstelo/cpm/cpm13/08_ku.pdf · New Algorithms for Position...

49
New Algorithms for Position Heaps Travis Gagie, Wing-Kai Hon, Tsung-Han Ku CPM2013 1

Transcript of New Algorithms for Position Heaps - UCRstelo/cpm/cpm13/08_ku.pdf · New Algorithms for Position...

Page 1: New Algorithms for Position Heaps - UCRstelo/cpm/cpm13/08_ku.pdf · New Algorithms for Position Heaps ... then we can maintain a position heap that works as an index for S such that

New Algorithms for Position Heaps Travis Gagie, Wing-Kai Hon, Tsung-Han Ku

CPM2013

1

Page 2: New Algorithms for Position Heaps - UCRstelo/cpm/cpm13/08_ku.pdf · New Algorithms for Position Heaps ... then we can maintain a position heap that works as an index for S such that

Outline

Position Heap

Limiting Length and Height

Turing a Suffix Tree into a Position Heap

Using a Position Heap as a Suffix Array

Using a Compressed Suffix Array as a Position Heap

2

Page 3: New Algorithms for Position Heaps - UCRstelo/cpm/cpm13/08_ku.pdf · New Algorithms for Position Heaps ... then we can maintain a position heap that works as an index for S such that

Position Heap

The root is labeled 0 and the other nodes are labeled 1 to

n such that parent’s labels are smaller than their

children’s labels

For 1≤ i ≤ n, the path label of the node labeled i is a

prefix of S[i..n].

For 1≤ i ≤ n, the node labeled i stores a pointer (called

maximal-reach pointer) to the deepest node whose path

label is a prefix of S[i..n]

3

Page 4: New Algorithms for Position Heaps - UCRstelo/cpm/cpm13/08_ku.pdf · New Algorithms for Position Heaps ... then we can maintain a position heap that works as an index for S such that

Position Heap

0

1 a

A position heap for S = abaababbabbab$ 4

Page 5: New Algorithms for Position Heaps - UCRstelo/cpm/cpm13/08_ku.pdf · New Algorithms for Position Heaps ... then we can maintain a position heap that works as an index for S such that

Position Heap

0

1 2 a

b

A position heap for S = abaababbabbab$ 5

Page 6: New Algorithms for Position Heaps - UCRstelo/cpm/cpm13/08_ku.pdf · New Algorithms for Position Heaps ... then we can maintain a position heap that works as an index for S such that

Position Heap

0

1 2

3

a

a

b

A position heap for S = abaababbabbab$ 6

Page 7: New Algorithms for Position Heaps - UCRstelo/cpm/cpm13/08_ku.pdf · New Algorithms for Position Heaps ... then we can maintain a position heap that works as an index for S such that

Position Heap

0

1 2

3 4

a

a b

b

A position heap for S = abaababbabbab$ 7

Page 8: New Algorithms for Position Heaps - UCRstelo/cpm/cpm13/08_ku.pdf · New Algorithms for Position Heaps ... then we can maintain a position heap that works as an index for S such that

Position Heap

0

1 2

3 4 5

a

a a b

b

A position heap for S = abaababbabbab$ 8

Page 9: New Algorithms for Position Heaps - UCRstelo/cpm/cpm13/08_ku.pdf · New Algorithms for Position Heaps ... then we can maintain a position heap that works as an index for S such that

Position Heap

0

1 2

3 4 5

6

a

a a b

b

b

A position heap for S = abaababbabbab$ 9

Page 10: New Algorithms for Position Heaps - UCRstelo/cpm/cpm13/08_ku.pdf · New Algorithms for Position Heaps ... then we can maintain a position heap that works as an index for S such that

Position Heap

0

1 2

3 4 5 7

6

a

a a b

b

b

b

A position heap for S = abaababbabbab$ 10

Page 11: New Algorithms for Position Heaps - UCRstelo/cpm/cpm13/08_ku.pdf · New Algorithms for Position Heaps ... then we can maintain a position heap that works as an index for S such that

Position Heap

0

1 2

3 4 5 7

6 8

a

a a b

b

b

b

b

A position heap for S = abaababbabbab$ 11

Page 12: New Algorithms for Position Heaps - UCRstelo/cpm/cpm13/08_ku.pdf · New Algorithms for Position Heaps ... then we can maintain a position heap that works as an index for S such that

Position Heap

0

1 2

3 4 5 7

6

9

8

a

a a

a

b

b

b

b

b

A position heap for S = abaababbabbab$ 12

Page 13: New Algorithms for Position Heaps - UCRstelo/cpm/cpm13/08_ku.pdf · New Algorithms for Position Heaps ... then we can maintain a position heap that works as an index for S such that

Position Heap

0

1 2

3 4 5 7

6

9

8

a

a a

a

b

b

b

b

b

A position heap for S = abaababbabbab$

10

a

13

Page 14: New Algorithms for Position Heaps - UCRstelo/cpm/cpm13/08_ku.pdf · New Algorithms for Position Heaps ... then we can maintain a position heap that works as an index for S such that

Position Heap

0

1 2

3 4 5 7

6

9

8

11

a

a a

a

b

b

b

b

$

b

A position heap for S = abaababbabbab$

10

a

14

Page 15: New Algorithms for Position Heaps - UCRstelo/cpm/cpm13/08_ku.pdf · New Algorithms for Position Heaps ... then we can maintain a position heap that works as an index for S such that

Position Heap

0

1 2

3 4 5 7

12 6

9

8

11

a

a a

a

b

b

b

b $

$

b

A position heap for S = abaababbabbab$

10

a

15

Page 16: New Algorithms for Position Heaps - UCRstelo/cpm/cpm13/08_ku.pdf · New Algorithms for Position Heaps ... then we can maintain a position heap that works as an index for S such that

Position Heap

0

1 2

3 4 13 5 7

12 6

9

8

11

a

a a

a

b

b

b

b $

$

$

b

A position heap for S = abaababbabbab$

10

a

16

Page 17: New Algorithms for Position Heaps - UCRstelo/cpm/cpm13/08_ku.pdf · New Algorithms for Position Heaps ... then we can maintain a position heap that works as an index for S such that

Position Heap

14

0

1 2

3 4 13 5 7

12 6

9

8

11

$ a

a a

a

b

b

b

b $

$

$

b

A position heap for S = abaababbabbab$

10

a

17

Page 18: New Algorithms for Position Heaps - UCRstelo/cpm/cpm13/08_ku.pdf · New Algorithms for Position Heaps ... then we can maintain a position heap that works as an index for S such that

Position Heap

14

0

1 2

3 4 13 5 7

12 6

9

8

11

$ a

a a

a

b

b

b

b $

$

$

b

A position heap for S = abaababbabbab$

10

a

18

Page 19: New Algorithms for Position Heaps - UCRstelo/cpm/cpm13/08_ku.pdf · New Algorithms for Position Heaps ... then we can maintain a position heap that works as an index for S such that

Position Heap

With the following auxiliary data structures, the pattern

matching problem can be solved in optimal time (i.e.

O(|P| + occ)).

(1) An array A of pointers such that, given i, in O(1) time

we can find the node labeled i.

(2) A data structure B such that, given i and j, in O(1) time

we can determine whether the node labeled i is an

ancestor of the node labeled j.

19

Page 20: New Algorithms for Position Heaps - UCRstelo/cpm/cpm13/08_ku.pdf · New Algorithms for Position Heaps ... then we can maintain a position heap that works as an index for S such that

Query on Position Heap

14

0

1 2

3 4 13 5 7

12 6

9

8

11

$ a

a a

a

b

b

b

b $

$

$

b

10

a

Query pattern P = aabab, candidate are 1 and 3.

A position heap for S = abaababbabbab$

20

Page 21: New Algorithms for Position Heaps - UCRstelo/cpm/cpm13/08_ku.pdf · New Algorithms for Position Heaps ... then we can maintain a position heap that works as an index for S such that

Query on Position Heap

14

0

1 2

3 4 13 5 7

12 6

9

8

11

$ a

a a

a

b

b

b

b $

$

$

b

10

a

Query pattern P = aabab, candidate is 3.

Check if 3+2= 5 is on the path or in the subtree of 8. 21

Page 22: New Algorithms for Position Heaps - UCRstelo/cpm/cpm13/08_ku.pdf · New Algorithms for Position Heaps ... then we can maintain a position heap that works as an index for S such that

Limiting Length and Height

Suppose the length of the query pattern P is always

less than or equal to a variable M.

Then we can build a position heap whose height

would be bounded in O(M).

22

Page 23: New Algorithms for Position Heaps - UCRstelo/cpm/cpm13/08_ku.pdf · New Algorithms for Position Heaps ... then we can maintain a position heap that works as an index for S such that

Limiting Length and Height

Suppose we want to build the position heap of a string

Instead of building the position heap on S, we build the

position heap of S’!S’’.

2M 2M 2M

23

….. S’

2M 2M 2M

….. S’’

Page 24: New Algorithms for Position Heaps - UCRstelo/cpm/cpm13/08_ku.pdf · New Algorithms for Position Heaps ... then we can maintain a position heap that works as an index for S such that

Limiting Length and Height

According to the nature of the position heap, the height

of the position heap of S’!S’’ would be O(M).

By adopting Ehrenfeucht et al.’s approaches with an

AVL tree, we have the following theorem.

Theorem 1: If we will never search for a pattern of length greater than

M in a dynamic string S, then we can maintain a position heap that

works as an index for S such that

• Searching for a pattern of length m≦M takes O(mlog|S|+occ) time,

• Inserting a substring of length l takes O((M+l)Mlog(|S|+l)) time,

• Deleting a substring of length l takes O((M+l)Mlog|S|) time.

24

Page 25: New Algorithms for Position Heaps - UCRstelo/cpm/cpm13/08_ku.pdf · New Algorithms for Position Heaps ... then we can maintain a position heap that works as an index for S such that

Suffix Tree Into Position Heap

A position heap of a string S is a subtree of the suffix

trie of S.

Suffix tree is a compact suffix trie.

Here, we show how to turn a suffix tree into a

position heap.

25

Page 26: New Algorithms for Position Heaps - UCRstelo/cpm/cpm13/08_ku.pdf · New Algorithms for Position Heaps ... then we can maintain a position heap that works as an index for S such that

$ a b

b

$ a

a

b

$

$

$

The suffix tree of S = abaababbabbab$.

1

4

12

3

14

9 6

13

2 8

11

5

10

7

26

Page 27: New Algorithms for Position Heaps - UCRstelo/cpm/cpm13/08_ku.pdf · New Algorithms for Position Heaps ... then we can maintain a position heap that works as an index for S such that

$ a b

b

$ a

a

b

$

$

$

The suffix tree of S = abaababbabbab$.

1

4

12

3

14

9 6

13

2 8

11

5

10

7

1

0

27

Page 28: New Algorithms for Position Heaps - UCRstelo/cpm/cpm13/08_ku.pdf · New Algorithms for Position Heaps ... then we can maintain a position heap that works as an index for S such that

$ a b

b

$ a

a

b

$

$

$

The suffix tree of S = abaababbabbab$.

1

4

12

3

14

9 6

13

2 8

11

5

10

7

1

0

2

28

Page 29: New Algorithms for Position Heaps - UCRstelo/cpm/cpm13/08_ku.pdf · New Algorithms for Position Heaps ... then we can maintain a position heap that works as an index for S such that

$ b

b

$ a

a

b

$

$

$

The suffix tree of S = abaababbabbab$.

1

4

12

3

14

9 6

13

2 8

11

5

10

7

1

0

2

29

Page 30: New Algorithms for Position Heaps - UCRstelo/cpm/cpm13/08_ku.pdf · New Algorithms for Position Heaps ... then we can maintain a position heap that works as an index for S such that

$ b

b

$ a

a

b

$

$

$

The suffix tree of S = abaababbabbab$.

1

4

12

14

9 6

13

2 8

11

5

10

7

1

0

2

4

30

3

Page 31: New Algorithms for Position Heaps - UCRstelo/cpm/cpm13/08_ku.pdf · New Algorithms for Position Heaps ... then we can maintain a position heap that works as an index for S such that

$ b

b

$ a

a

b

$

$

$

The suffix tree of S = abaababbabbab$.

1

4

12

14

9 6

13

2 8

11

5

10

7

1

0

2

4

5

31

3

Page 32: New Algorithms for Position Heaps - UCRstelo/cpm/cpm13/08_ku.pdf · New Algorithms for Position Heaps ... then we can maintain a position heap that works as an index for S such that

$ b

b

$ a

a

b

$

$

$

The suffix tree of S = abaababbabbab$.

1

4

12

14

9 6

13

2 8

11

5

10

7

1

0

2

4

5

6

32

3

Page 33: New Algorithms for Position Heaps - UCRstelo/cpm/cpm13/08_ku.pdf · New Algorithms for Position Heaps ... then we can maintain a position heap that works as an index for S such that

$ b

b

$ a

a

b

$

$

$

The suffix tree of S = abaababbabbab$.

1

4

12

14

9 6

13

2 8

11

5

10

7

1

0

2

4

5

6

7

33

3

Page 34: New Algorithms for Position Heaps - UCRstelo/cpm/cpm13/08_ku.pdf · New Algorithms for Position Heaps ... then we can maintain a position heap that works as an index for S such that

$ b

b

$ a

a

b

$

$

$

The suffix tree of S = abaababbabbab$.

1

4

12

14

9 6

13

2 8

11

5

10

7

1

0

2

4

5

6

7

8

34

3

Page 35: New Algorithms for Position Heaps - UCRstelo/cpm/cpm13/08_ku.pdf · New Algorithms for Position Heaps ... then we can maintain a position heap that works as an index for S such that

$ b

b

$ a

a

b

$

$

$

The suffix tree of S = abaababbabbab$.

1

4

12

14

9 6

13

2 8

11

5

10

7

1

0

2

4

5

6

7

8

9

35

3

Page 36: New Algorithms for Position Heaps - UCRstelo/cpm/cpm13/08_ku.pdf · New Algorithms for Position Heaps ... then we can maintain a position heap that works as an index for S such that

$ b

b

$ a

a

b

$ $

$

The suffix tree of S = abaababbabbab$.

1

4

12

14

9 6

13

2 8

11

5

10

7

1

0

2

4

5

6

7

8

9

10

36

3

Page 37: New Algorithms for Position Heaps - UCRstelo/cpm/cpm13/08_ku.pdf · New Algorithms for Position Heaps ... then we can maintain a position heap that works as an index for S such that

$ b

b

$ a

a

b

$ $

$

The suffix tree of S = abaababbabbab$.

1

4

12

14

9 6

13

2 8

11

5

10

7

1

0

2

4

5

6

7

8

9

10

37

3

Page 38: New Algorithms for Position Heaps - UCRstelo/cpm/cpm13/08_ku.pdf · New Algorithms for Position Heaps ... then we can maintain a position heap that works as an index for S such that

$ b

b

$ a

a

b

$ $

$

The suffix tree of S = abaababbabbab$.

1

4

12

14

9 6

13

2 8

11

5

10

7

1

0

2

4

5

6

7

8

9

10

38

3

Page 39: New Algorithms for Position Heaps - UCRstelo/cpm/cpm13/08_ku.pdf · New Algorithms for Position Heaps ... then we can maintain a position heap that works as an index for S such that

$ b

b

$ a

a

b

$ $

$

The suffix tree of S = abaababbabbab$.

1

4

12

14

9 6

13

2 8

11

5

10

7

1

0

2

4

5

6

7

8

9

10

39

3

Page 40: New Algorithms for Position Heaps - UCRstelo/cpm/cpm13/08_ku.pdf · New Algorithms for Position Heaps ... then we can maintain a position heap that works as an index for S such that

$ b

b

$ a

a

b

$ $

$

The suffix tree of S = abaababbabbab$.

1

4

12

14

9 6

13

2 8

11

5

10

7

1

0

2

4

5

6

7

8

9

10

40

3

Page 41: New Algorithms for Position Heaps - UCRstelo/cpm/cpm13/08_ku.pdf · New Algorithms for Position Heaps ... then we can maintain a position heap that works as an index for S such that

$ b

b

$ a

a

b

$ $

$

1

4

12

14

9 6

13

2 8

11

5

10

7

1

0

2

4

5

6

7

8

9

10

41

The suffix tree of S = abaababbabbab$.

3

Page 42: New Algorithms for Position Heaps - UCRstelo/cpm/cpm13/08_ku.pdf · New Algorithms for Position Heaps ... then we can maintain a position heap that works as an index for S such that

Suffix Tree Into Position Heap

Given a suffix tree, we can build the position heap in

linear time by using Westbrook’s approaches, and

Berkman et al.’s approaches.

By using Farach’s linear time suffix tree construction

algorithm independent of the alphabet size, we can build

a position heap of a string S in linear time independent

of the alphabet size.

42

Page 43: New Algorithms for Position Heaps - UCRstelo/cpm/cpm13/08_ku.pdf · New Algorithms for Position Heaps ... then we can maintain a position heap that works as an index for S such that

Using Position Heap as Suffix Array

Define D[i] = the depth of the node labeled SA[i]

Define E[i] = the depth of the node with rank i in

preorder

43

Page 44: New Algorithms for Position Heaps - UCRstelo/cpm/cpm13/08_ku.pdf · New Algorithms for Position Heaps ... then we can maintain a position heap that works as an index for S such that

14

0

1 2

3 4 13 5 7

12 6

9

8

11

$ a

a a

a

b

b

b

b $

$

$

b

10

a

D = [ 1, 2, 3, 1, 2, 4, 3, 2, 1, 4, 3, 2, 3, 2]

44

+

= E = [ 1, 1, 2, 2, 3, 3, 4, 1, 2, 2, 3, 4, 2, 3]

SA = [14, 3, 12, 1, 4, 9, 6, 13, 2, 11, 8, 5, 10, 7]

Page 45: New Algorithms for Position Heaps - UCRstelo/cpm/cpm13/08_ku.pdf · New Algorithms for Position Heaps ... then we can maintain a position heap that works as an index for S such that

Using Position Heap as Suffix Array

By using succinct partial rank and select query structures

which support constant time query on arrays D and E,

we can compute the SA[i] value in constant time.

Theorem: We can add O(n log h) bits to a position heap,

where h is the height of the heap, such that it supports

access to the corresponding suffix array and inverse

suffix array in O(1) time.

45

Page 46: New Algorithms for Position Heaps - UCRstelo/cpm/cpm13/08_ku.pdf · New Algorithms for Position Heaps ... then we can maintain a position heap that works as an index for S such that

Using CSA as Position Heap

To represent a position heap, we need

Its structure as a tree; balanced parentheses representation

The nodes’ labels; by using E, D, and SA

The edges’ labels; by using SA-1 and a bit-vector

The maximal-reach pointers; next slide, we will show

The array A of pointers; similar to previous one described in the

previous slides

The data structure B; similar to previous one described in the

previous slides

46

Page 47: New Algorithms for Position Heaps - UCRstelo/cpm/cpm13/08_ku.pdf · New Algorithms for Position Heaps ... then we can maintain a position heap that works as an index for S such that

( (*) ( (*) ( (*)**( (**) ) ) ) ( (*) (*( (*) ** ) ) ( (**) ) ) ) 47

$ b

b

$ a

a

b

$ $

$

1

4

12

14

9 6

13

2 8

11

5

10

7

1

0

2

4

5

6

7

8

9

10

3

Page 48: New Algorithms for Position Heaps - UCRstelo/cpm/cpm13/08_ku.pdf · New Algorithms for Position Heaps ... then we can maintain a position heap that works as an index for S such that

Using CSA as Position Heap

Suppose we want to compute the maximal-reach pointer of

the node labeled 6.

First we compute SA-1[6] = 7. ; it is the 7-th star we visited

Then we compute which parentheses pair encloses the 7-th

star.

( (*) ( (*) ( (*)**( (**) ) ) ) ( (*) (*( (*) ** ) ) ( (**) ) ) )

This pair represents the node labeled 9.

48

Page 49: New Algorithms for Position Heaps - UCRstelo/cpm/cpm13/08_ku.pdf · New Algorithms for Position Heaps ... then we can maintain a position heap that works as an index for S such that

Thank you

49