Presented By Dr. Shazzad Hosain Asst. Prof. EECS, NSU Linear Time Construction of Suffix Tree.

Post on 18-Dec-2015

Presented By Dr. Shazzad Hosain

Asst. Prof. EECS, NSU

Linear Time Construction of Suffix Tree

Suffix tree

S = xabxac

Suffixes: 1 = xabxac, 2 = abxac, 3 = bxac, 4 = xac, 5 = ac, 6 = c

[Figure: suffix tree of xabxac with leaves numbered 1 to 6]

Suffix tree

S = xabxa

Suffixes: 1 = xabxa, 2 = abxa, 3 = bxa, 4 = xa, 5 = a

[Figure: tree for xabxa with edge labels xa, bxa and a]

Suffix tree (Example)

Let s = abab. A suffix tree of s contains all the suffixes of s$ = abab$:

{ $, b$, ab$, bab$, abab$ }

[Figure: suffix tree of abab$, with edges ab and b leading to internal nodes and $-terminated leaf edges]

Trivial algorithm to build a Suffix tree

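s = abab$

Put the longest suffix, abab$, in.

Put the suffix bab$ in.

[Figure: the tree with the two leaf edges abab$ and bab$]

Put the suffix ab$ in. Already in the tree: { abab$, bab$ }

[Figure: the tree before and after inserting ab$; the edge abab$ is split into ab followed by ab$ and $]

Put the suffix b$ in. Already in the tree: { abab$, bab$, ab$ }

[Figure: the tree before and after inserting b$; the edge bab$ is split into b followed by ab$ and $]

Put the suffix $ in. Already in the tree: { abab$, bab$, ab$, b$ }

[Figure: the tree before and after inserting $ as a new leaf edge from the root]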

We will also label each leaf with the starting point of the corresponding suffix.

root
  ab
    ab$   (leaf 1)
    $     (leaf 3)
  b
    ab$   (leaf 2)
    $     (leaf 4)
  $       (leaf 5)

{ abab$, bab$, ab$, b$, $ }
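The insertion steps above can be sketched in a few lines of Python. This is a minimal illustration, not the presenter's code: the names `Node`, `build_naive` and `leaves` are made up for this sketch, edge labels are stored as plain strings, and the input is assumed to end with a unique terminal symbol such as $.

```python
class Node:
    def __init__(self):
        self.children = {}  # first character -> (edge label, child node)
        self.leaf = None    # 1-based start of the suffix, set only on leaves


def build_naive(s):
    """Insert the suffixes of s one by one, longest first.

    Assumes s ends with a terminal symbol that occurs nowhere else,
    so no suffix is a prefix of another suffix.
    """
    root = Node()
    for start in range(len(s)):
        node, rest = root, s[start:]
        while True:
            first = rest[0]
            if first not in node.children:
                # No edge starts with this character: add a new leaf edge.
                leaf = Node()
                leaf.leaf = start + 1
                node.children[first] = (rest, leaf)
                break
            label, child = node.children[first]
            k = 0
            while k < len(label) and k < len(rest) and label[k] == rest[k]:
                k += 1
            if k == len(label):
                # The whole edge label matches: continue below the child.
                node, rest = child, rest[k:]
            else:
                # Mismatch inside the edge: split it at offset k.
                mid = Node()
                mid.children[label[k]] = (label[k:], child)
                node.children[first] = (label[:k], mid)
                node, rest = mid, rest[k:]
    return root


def leaves(node, prefix=""):
    """Map each root-to-leaf path label to its leaf number."""
    if node.leaf is not None:
        return {prefix: node.leaf}
    found = {}
    for label, child in node.children.values():
        found.update(leaves(child, prefix + label))
    return found


tree = build_naive("abab$")
print(leaves(tree))
```

For abab$ the root ends up with the three edges ab, b and $, matching the labeled tree above.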

Naive Construction – More Example

S = abbcbab#

[Figure: suffix tree of abbcbab#, with internal edges ab and b, leaf edges such as #, bcbab#, cbab# and ab#, and numbered leaves]

Analysis

The trivial algorithm takes O(n^2) time to build the tree.

We will see how to do it in O(n) time.
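A rough count behind the quadratic bound, assuming (as a simplification not stated on the slide) that inserting the suffix that starts at position i costs time proportional to its length n - i + 1:

```latex
\sum_{i=1}^{n} (n - i + 1) \;=\; \sum_{k=1}^{n} k \;=\; \frac{n(n+1)}{2} \;=\; O(n^2)
```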

Ukkonen’s linear-time Suffix Tree Algorithm

• Implicit Suffix Tree

1. Remove the terminal symbol $ from the edge labels of the tree
2. Then remove any edge that has no label
3. Then remove any node that does not have at least two children

Implicit Suffix Tree – More Example

[Figure: the suffix tree of abab$ with leaves 1 to 5, from which the implicit suffix tree is obtained]

{ abab$, bab$, ab$, b$, $ }

1. Even though an implicit suffix tree may not have a leaf for each suffix, it does encode all the suffixes of S

2. Let I_i denote the implicit suffix tree of the string S[1…i]
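Note 1 can be checked by brute force, without building any tree. The sketch below relies on one observation: a suffix loses its leaf in the implicit suffix tree exactly when it is a prefix of a longer suffix. The function name `leaf_suffixes` is made up for this illustration.

```python
def leaf_suffixes(s):
    """Return the 1-based start positions of the suffixes of s that
    end at a leaf in the implicit suffix tree of s."""
    result = []
    for j in range(len(s)):
        suffix = s[j:]
        # Longer suffixes start at positions k < j.
        hidden = any(s[k:].startswith(suffix) for k in range(j))
        if not hidden:
            result.append(j + 1)
    return result


print(leaf_suffixes("abab"))   # suffixes ab and b are hidden inside abab and bab
print(leaf_suffixes("abab$"))  # with the terminal symbol every suffix has a leaf
```

The hidden suffixes are still spelled out along some path from the root, which is why the implicit tree still encodes all suffixes.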

Ukkonen’s Algorithm at a High Level

• Construct an implicit suffix tree I_i for each prefix S[1..i] of S, starting from I_1 and incrementing i by one until I_m is built, where m is the length of the string S.

• The true suffix tree for S is constructed from I_m, and the time for the entire algorithm is O(m).

High-level Description of Ukkonen’s Algorithm

• Ukkonen’s algorithm is divided into m phases. In phase i+1, tree I_{i+1} is constructed from I_i.

• Each phase i+1 is further divided into i+1 extensions, one for each of the i+1 suffixes of S[1… i+1].
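The phase and extension structure can be listed directly from the string. This sketch only enumerates which suffix each extension handles and builds no tree; the generator name `phases` is made up for this illustration.

```python
def phases(s):
    """Yield (i, suffixes of S[1..i]) for each phase i = 1..m.

    Extension j of phase i handles the suffix S[j..i] (1-based).
    """
    for i in range(1, len(s) + 1):
        yield i, [s[j - 1:i] for j in range(1, i + 1)]


for i, extensions in phases("aba"):
    print(i, extensions)
```

Phase i has i extensions, so a string of length m needs m(m+1)/2 extensions in total.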

Naïve Algorithm of Suffix Tree

{ abab$, bab$, ab$, b$, $ }

[Figure: the suffix tree of abab$ built by the naive algorithm, with leaves 1 to 5]

I_1 : S[1…1] {a}

I_2 : S[1…2] {ab, b}

I_3 : S[1…3] {aba, ba, a}

Each row is a phase; the suffixes within a row are its extensions.

[Figure: the implicit suffix trees I_1, I_2 and I_3 for S = aba]

Done naively, this takes O(m^3) time.
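A rough count behind the cubic bound, assuming each extension finds the end of its suffix by walking down from the root, which costs O(m) character comparisons. Phase i has i extensions, so:

```latex
\sum_{i=1}^{m} i \cdot O(m) \;=\; O(m) \cdot \frac{m(m+1)}{2} \;=\; O(m^3)
```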

Suffix Extension Rules

I_4 : S[1…4] {abab, bab, ab, b}

Assume I_i has already been built and we want to extend it to I_{i+1}. Let β = S[j … i] be a suffix of S[1 … i]. The extension must make sure that the string β S(i+1) is in the tree.

Rule 1: If path β ends at a leaf, character S(i+1) is added to the end of the label of that leaf edge.

Rule 2: If some path from the end of string β starts with character S(i+1), the string β S(i+1) is already in the tree, so do nothing.

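Suffix Extension Rules

Assume I_i has already been built and we want to extend it to I_{i+1}.

Rule 3: No path from the end of string β starts with character S(i+1), but at least one labeled path continues from the end of β. Add a new leaf edge labeled S(i+1), together with a new internal node if β ends inside an edge.

Example: suppose I_5 has been drawn for the first five characters of S = axabxb (positions 123456). Now extend for I_6, where S(6) = b:

axabxb, xabxb, abxb, bxb: Rule 1

xb: Rule 3

b: Rule 2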
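The rule that applies in each extension can be worked out by plain string search, which is handy for checking a hand-drawn example. The sketch uses the rule numbering of these slides and depends on one observation: path β ends at a leaf exactly when β occurs only once in S[1…i]. The names `occurrences` and `extension_rule` are made up for this illustration, and no tree is built.

```python
def occurrences(text, pattern):
    """Count possibly overlapping occurrences of a non-empty pattern."""
    count, start = 0, 0
    while True:
        pos = text.find(pattern, start)
        if pos == -1:
            return count
        count += 1
        start = pos + 1


def extension_rule(s, j, i):
    """Rule used when extending beta = S[j..i] with S(i+1), 1-based.

    j may be i+1, in which case beta is the empty string.
    """
    prefix = s[:i]       # S[1..i]
    beta = s[j - 1:i]    # S[j..i]
    c = s[i]             # S(i+1)
    if beta and occurrences(prefix, beta) == 1:
        return 1         # beta ends at a leaf: extend the leaf edge
    if beta + c in prefix:
        return 2         # beta S(i+1) is already in the tree: do nothing
    return 3             # a new leaf edge is needed


s = "axabxb"
for j in range(1, 7):
    print(s[j - 1:6], "-> Rule", extension_rule(s, j, 5))
```

For axabxb this reproduces the assignment in the example: Rule 1 for the four longest suffixes, Rule 3 for xb and Rule 2 for b.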

Implementation and Speedup, Suffix Links

Definition: Let xα denote an arbitrary string, where x is a single character and α is a (possibly empty) substring. For an internal node v with path-label xα, if there is another node s(v) with path-label α, then a pointer from v to s(v) is called a suffix link.

Does the root have a suffix link? No, because the root is not considered an internal node.

Every internal node has a suffix link.

Suffix Links – More Example

S = abbcbab#

[Figure: suffix tree of abbcbab# with a suffix link drawn from node v to node s(v)]

Corollary 6.1.1: In Ukkonen’s algorithm, any newly created internal node will have a suffix link from it by the end of the next extension.

Corollary 6.1.2: In any implicit suffix tree I_i, if internal node v has path-label xα, then there is a node s(v) of I_i with path-label α.
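The existence of s(v) can be checked by brute force on the example string abbcbab#. The sketch assumes the string ends with a unique terminal symbol and uses the fact that the non-root internal nodes of its suffix tree are exactly the substrings followed by at least two different characters. The function names are made up for this illustration, and the approach is quadratic, so it is only for small examples.

```python
from collections import defaultdict


def internal_node_labels(s):
    """Path-labels of the non-root internal nodes of the suffix tree of s.

    Assumes s ends with a terminal symbol that occurs nowhere else.
    """
    followers = defaultdict(set)
    n = len(s)
    for start in range(n):
        for end in range(start + 1, n):
            # s[start:end] is followed by the character s[end].
            followers[s[start:end]].add(s[end])
    return {label for label, chars in followers.items() if len(chars) >= 2}


def suffix_links(s):
    """Map each internal path-label x-alpha to alpha ("" means the root)."""
    labels = internal_node_labels(s)
    links = {}
    for label in labels:
        target = label[1:]
        # The target must itself be an internal node or the root.
        assert target == "" or target in labels
        links[label] = target
    return links


print(suffix_links("abbcbab#"))
```

For abbcbab# the internal nodes are ab and b: the suffix link of ab points to b, and the suffix link of b points to the root.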

MISSISSIPI (positions 1234567890)

1 : M
2 : MI
3 : MIS
4 : MISS
5 : MISSI
6 : MISSIS
7 : MISSISS
8 : MISSISSI
9 : MISSISSIP
10 : MISSISSIPI

[Figure: implicit suffix trees built phase by phase for MISSISSIPI, with numbered leaves]

How do suffix links help?

What is achieved so far?

Not so much. The worst-case running time is still O(m^2) for a phase.

Trick1: Skip/Count Trick

Let γ be the string between node v and the end of S[j … i], where v is the node reached by the up-walk. There must be a path labeled γ from s(v).

Walking down along γ takes time proportional to |γ|

Skip/count trick reduces the traversal time to something proportional to the number of nodes on the path.

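[Figure: a path labeled zabcdefghy made of four edges of lengths 2, 2, 3 and 3; the skip/count trick jumps from node to node along it]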
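A minimal sketch of the skip/count walk, assuming edges are stored as index pairs into the text and that the path being followed is known to exist in the tree. The names `Node`, `build_chain` and `skip_count` are made up for this illustration, and the test tree is a hand-built chain with edge lengths 2, 2, 3 and 3 rather than a real suffix tree.

```python
class Node:
    def __init__(self):
        # first character -> (start, end, child); edge label is text[start:end]
        self.children = {}


def build_chain(text, lengths):
    """Build a single path whose consecutive edges have the given lengths."""
    root = Node()
    node, pos = root, 0
    for length in lengths:
        child = Node()
        node.children[text[pos]] = (pos, pos + length, child)
        node, pos = child, pos + length
    return root


def skip_count(text, node, lo, hi):
    """Walk down from node along text[lo:hi], which must exist in the tree.

    Only the first character of each edge is inspected; the rest of the
    edge is skipped by comparing lengths. Returns (node, offset, hops):
    the last node reached, how many characters of the path extend past
    it into the next edge, and the number of edges jumped over.
    """
    hops = 0
    while lo < hi:
        start, end, child = node.children[text[lo]]
        length = end - start
        if hi - lo < length:
            return node, hi - lo, hops  # the path ends inside this edge
        lo += length                    # skip the whole edge in one step
        node = child
        hops += 1
    return node, 0, hops


text = "zabcdefghy"
root = build_chain(text, [2, 2, 3, 3])
print(skip_count(text, root, 0, 10)[1:])
print(skip_count(text, root, 0, 6)[1:])
```

The full ten-character path is covered in four jumps, one per edge, instead of ten character comparisons.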

But what does it buy in terms of worst-case bounds?

Lemma 6.1.2: Let (v, s(v)) be any suffix link traversed during Ukkonen’s algorithm. At that moment, the node-depth of v is at most one greater than the node-depth of s(v).

[Figure: example node-depths annotated on v and s(v): v = 2 with s(v) = 1, v = 3 with s(v) = 3, v = 4 with s(v) = 5]

Theorem 6.1.1: Using the skip/count trick, any phase of Ukkonen’s algorithm takes O(m) time.

In a single extension, the algorithm
– walks up at most one edge
– finds a suffix link and traverses it
– walks down some number of nodes
– applies the suffix extension rules
– and may add a suffix link

All operations except the down-walk take constant time, so only the down-walk time needs to be analyzed.

– The up-walk decreases the current node-depth by at most one
– The suffix link traversal decreases the node-depth by at most another one (Lemma 6.1.2)
– Each edge traversed in a down-walk moves to a node of greater node-depth
– Over the entire phase, the current node-depth is decremented at most 2m times
– Since no node can have depth greater than m, the total possible increment to the current node-depth is bounded by 3m over the entire phase
– So the total number of edge traversals is bounded by 3m
– Since each edge traversal takes constant time with the skip/count trick, all the down-walking in a phase is O(m).
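The counting argument can be condensed into one inequality. The symbols are introduced only for this sketch: U is the number of edges traversed by down-walks in the phase, D is the total decrease in current node-depth, and d_start and d_end are the node-depths at the start and end of the phase, both between 0 and m.

```latex
d_{\text{end}} = d_{\text{start}} + U - D
\quad\Longrightarrow\quad
U = d_{\text{end}} - d_{\text{start}} + D \;\le\; m + 2m \;=\; 3m
```

Here D is at most 2m because each of the at most m extensions in a phase lowers the node-depth by at most 2.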

Complexity
• There are m phases
• Each phase takes O(m)
• So the running time is O(m^2)

Two more tricks and we are done

Reference

• Chapter 6: Algorithms on Strings, Trees and Sequences