Presented By Dr. Shazzad Hosain Asst. Prof. EECS, NSU Linear Time Construction of Suffix Tree.
Presented by Dr. Shazzad Hosain
Asst. Prof. EECS, NSU
Linear Time Construction of Suffix Tree
Suffix tree
S = xabxac
Suffixes: xabxac, abxac, bxac, xac, ac, c
[Figure: suffix tree of xabxac with leaves 1 to 6]
Suffix tree
S = xabxa
Suffixes: xabxa, abxa, bxa, xa, a
[Figure: tree for xabxa, where the suffixes xa and a are prefixes of longer suffixes and so do not end at leaves]
Suffix tree (Example)
Let s = abab. A suffix tree of s contains all the suffixes of s$ = abab$:
{ $ b$ ab$ bab$ abab$ }
[Figure: suffix tree of abab$ with edge labels ab, b and $]
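As a small illustration of the suffix set above, the following Python sketch lists the suffixes of abab$ together with their 1-based starting positions. The function name `suffixes_with_positions` and the choice of `$` as the default terminal are assumptions made for this example only.

```python
def suffixes_with_positions(s, terminal="$"):
    """Return (start_position, suffix) pairs of s + terminal, 1-indexed."""
    text = s + terminal
    return [(i + 1, text[i:]) for i in range(len(text))]


# List the suffixes of abab$, longest first.
for position, suffix in suffixes_with_positions("abab"):
    print(position, suffix)
```

The starting positions are the numbers that later label the leaves of the tree.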
Trivial algorithm to build a Suffix tree
s = abab$
Put the longest suffix, abab$, in.
Put the suffix bab$ in.
[Figure: the tree after inserting abab$ and bab$]
Put the suffix ab$ in
[Figure: the tree before and after inserting ab$]
Suffixes already in the tree: { abab$ bab$ }
Put the suffix b$ in
[Figure: the tree before and after inserting b$]
Suffixes already in the tree: { abab$ bab$ ab$ }
Put the suffix $ in
[Figure: the tree before and after inserting $]
Suffixes already in the tree: { abab$ bab$ ab$ b$ }
We will also label each leaf with the starting position of the corresponding suffix.
[Figure: the complete suffix tree of abab$ with leaves labeled 1 to 5]
Suffixes in the tree: { abab$ bab$ ab$ b$ $ }
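The insertion steps above can be sketched in Python as follows. This is a minimal illustration of the trivial algorithm, not the slides' own code: the `Node` class, the dictionary-of-children layout, and the helper `leaves` are assumptions chosen for readability, and the input is assumed to end in a terminal symbol that occurs nowhere else.

```python
class Node:
    def __init__(self):
        self.children = {}  # first character of edge label -> (label, child Node)
        self.leaf = None    # 1-based start position of the suffix (leaves only)


def build_naive_suffix_tree(text):
    """Insert every suffix of text, longest first.

    Assumes text ends in a unique terminal symbol such as '$'.
    """
    root = Node()
    for j in range(len(text)):
        node, rest = root, text[j:]
        while True:
            first = rest[0]
            if first not in node.children:
                # No edge starts with this character: hang a new leaf here.
                leaf = Node()
                leaf.leaf = j + 1
                node.children[first] = (rest, leaf)
                break
            label, child = node.children[first]
            # Length of the common prefix of the edge label and the rest.
            k = 0
            while k < len(label) and k < len(rest) and label[k] == rest[k]:
                k += 1
            if k == len(label):
                # The whole edge matches: continue below the child.
                node, rest = child, rest[k:]
                continue
            # Mismatch inside the edge: split it at offset k.
            mid = Node()
            mid.children[label[k]] = (label[k:], child)
            node.children[first] = (label[:k], mid)
            node, rest = mid, rest[k:]
    return root


def leaves(node, prefix=""):
    """Map each leaf number to the suffix spelled out on its root path."""
    if node.leaf is not None:
        return {node.leaf: prefix}
    found = {}
    for label, child in node.children.values():
        found.update(leaves(child, prefix + label))
    return found


tree = build_naive_suffix_tree("abab$")
print(leaves(tree))
```

Each insertion walks the tree character by character, which is where the quadratic running time comes from.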
Naive Construction – More Example
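S = abbcbab#
[Figure: suffix tree of abbcbab#. Edge ab leads to leaves 1 (bcbab#) and 6 (#); edge b leads to leaves 2 (bcbab#), 3 (cbab#), 5 (ab#) and 7 (#); edge cbab# leads to leaf 4.]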
Analysis
The naive construction takes O(n^2) time.
We will see how to do it in O(n) time.
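One way to see the quadratic bound, assuming that inserting the suffix starting at position j costs time proportional to its length n - j + 1 (and treating the alphabet size as constant):

```latex
\sum_{j=1}^{n} (n - j + 1) \;=\; \sum_{k=1}^{n} k \;=\; \frac{n(n+1)}{2} \;=\; O(n^2)
```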
Ukkonen’s linear-time Suffix Tree Algorithm
• Implicit Suffix Tree
1. Remove the terminal symbols $ from the edge labels of the tree
2. Then remove any edge that has no label
Implicit Suffix Tree – More Example
[Figure: suffix tree of abab$ with leaves 1 to 5, and the implicit suffix tree obtained from it]
{ abab$ bab$ ab$ b$ $ }
1. Even though an implicit suffix tree may not have a leaf for each suffix, it does encode all the suffixes of S
2. Let I_i denote the implicit suffix tree of the string S[1…i]
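To make the first point concrete, the sketch below finds the suffixes that lose their leaf in an implicit suffix tree: without a terminal symbol, a suffix has no leaf of its own exactly when it is a prefix of another suffix. The function name `suffixes_without_leaf` is hypothetical, and the check is a brute-force string comparison rather than a tree operation.

```python
def suffixes_without_leaf(s):
    """1-based start positions of suffixes of s that are prefixes of another suffix."""
    n = len(s)
    hidden = []
    for j in range(n):
        suffix = s[j:]
        # A suffix that is a prefix of a different suffix ends inside the tree.
        if any(s[k:].startswith(suffix) for k in range(n) if k != j):
            hidden.append(j + 1)
    return hidden


print(suffixes_without_leaf("xabxa"))
print(suffixes_without_leaf("xabxac"))
```

For xabxa the suffixes xa and a are reported, while xabxac yields none because its last character is unique. The hidden suffixes are still encoded in the tree as paths from the root.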
Ukkonen’s Algorithm at a High Level
• Construct an implicit suffix tree I_i for each prefix S[1..i] of S, starting from I_1 and incrementing i by one until I_m is built, where m is the length of the string S.
• The true suffix tree for S is constructed from I_m, and the time for the entire algorithm is O(m).
High-level Description of Ukkonen’s Algorithm
• Ukkonen’s algorithm is divided into m phases. In phase i+1, tree I_{i+1} is constructed from I_i.
• Each phase i+1 is further divided into i+1 extensions, one for each of the i+1 suffixes of S[1… i+1].
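The phase and extension structure can be listed explicitly. The following sketch only enumerates which suffix each extension handles; it does not build any tree, and the generator name `phases_and_extensions` is an assumption for this example.

```python
def phases_and_extensions(s):
    """Yield (phase number, suffixes of S[1..phase]), one suffix per extension."""
    for i in range(1, len(s) + 1):
        prefix = s[:i]
        yield i, [prefix[j:] for j in range(i)]


for phase, suffixes in phases_and_extensions("aba"):
    print(phase, suffixes)
```

Phase i has i extensions, so a string of length m goes through m(m+1)/2 extensions in total.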
Naïve Algorithm of Suffix Tree
{ abab$ bab$ ab$ b$ $ }
[Figure: suffix tree of abab$ built by the naive algorithm, leaves labeled 1 to 5]
High-level Ukkonen’s Algorithm: phases and extensions
[Figure: implicit suffix trees built phase by phase for a string beginning aba]
I_1 : S[1…1] {a}
I_2 : S[1…2] {ab, b}
I_3 : S[1…3] {aba, ba, a}
Each row is one phase; each suffix listed in a row is one extension.
Implemented naively, this takes O(m^3) time.
Suffix Extension Rules
Let I_i already be built; we want to extend it to I_{i+1}.
I_4 : S[1…4] {abab, bab, ab, b}
Rule 1: Let β = S[j…i] be a suffix of S[1…i]. If path β ends at a leaf, character S(i+1) is added to the end of the label of that leaf edge.
Rule 2: Some path from the end of string β starts with character S(i+1). In this case the string βS(i+1) is already in the tree, so do nothing.
Rule 3: No path from the end of string β starts with character S(i+1), but at least one labeled path continues from the end of β. Add a new leaf edge labeled S(i+1) at the end of β, creating a new internal node there if β ends inside an edge.
Example: let I_5 be drawn for the first five characters of S = axabxb (positions 1 to 6). Now extend to I_6:
axabxb, xabxb, abxb, bxb: Rule 1
xb: Rule 3
b: Rule 2
Applying the rules by walking from the root in every extension takes O(m^3) time overall.
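The rule that applies in each extension can be checked without a tree, using only string properties. The sketch below relies on an assumption worth stating: in the implicit suffix tree of S[1…i], the path β ends at a leaf exactly when β occurs only once in S[1…i]. It uses the rule numbering of these slides, and the names `count_occurrences` and `extension_rules` are hypothetical.

```python
def count_occurrences(text, pattern):
    """Number of (possibly overlapping) occurrences of a non-empty pattern."""
    return sum(text.startswith(pattern, k)
               for k in range(len(text) - len(pattern) + 1))


def extension_rules(s, i):
    """Classify the extensions of phase i+1 (i is 1-based).

    Returns (extended suffix, rule number) pairs, numbered as in the slides:
    1 = extend a leaf edge, 2 = already present, 3 = add a new leaf.
    """
    prefix, c = s[:i], s[i]          # S[1..i] and the new character S(i+1)
    result = []
    for j in range(i + 1):           # j == i is the empty beta (last extension)
        beta = prefix[j:]
        if beta and count_occurrences(prefix, beta) == 1:
            rule = 1                 # beta ends at a leaf
        elif (beta + c) in prefix:
            rule = 2                 # beta followed by c is already in the tree
        else:
            rule = 3                 # a path continues after beta, but not with c
        result.append((beta + c, rule))
    return result


for suffix, rule in extension_rules("axabxb", 5):
    print(suffix, "Rule", rule)
```

This brute-force check is only for illustration; the algorithm itself decides the rule by inspecting the tree at the end of β.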
Implementation and Speedup, Suffix Links
Definition: Let xα denote an arbitrary string, where x is a single character and α a substring (possibly empty). For an internal node v with path-label xα, if there is another node s(v) with path-label α, then a pointer from v to s(v) is called a suffix link.
Does the root have a suffix link? No, because it is not considered an internal node.
Every internal node has a suffix link.
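The definition can be explored with a brute-force sketch. It assumes the input ends in a unique terminal symbol, so that the internal nodes of the suffix tree correspond to the non-empty substrings followed by at least two different characters. The function names are hypothetical, and an empty target string stands for the root.

```python
def internal_node_labels(s):
    """Path-labels of the internal nodes of the suffix tree of s.

    Assumes s ends in a unique terminal symbol.
    """
    followers = {}
    n = len(s)
    for i in range(n):
        for j in range(i + 1, n):
            # s[i:j] is non-empty and is followed by the character s[j].
            followers.setdefault(s[i:j], set()).add(s[j])
    return {w for w, chars in followers.items() if len(chars) >= 2}


def suffix_links(s):
    """Map each internal node label x-alpha to alpha ('' means the root)."""
    labels = internal_node_labels(s)
    links = {}
    for label in labels:
        target = label[1:]
        # The target of a suffix link must itself be a node.
        assert target == "" or target in labels
        links[label] = target
    return links


print(suffix_links("abbcbab#"))
```

For abbcbab# the internal nodes are ab and b, with a suffix link from ab to b and one from b to the root.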
Suffix Links – More Example
S = abbcbab#
[Figure: suffix tree of abbcbab# with a suffix link from node v (path-label ab) to node s(v) (path-label b)]
Corollary 6.1.1: In Ukkonen’s algorithm, any newly created internal node will have a suffix link from it by the end of the next extension.
Corollary 6.1.2: In any implicit suffix tree I_i, if internal node v has path-label xα, then there is a node s(v) of I_i with path-label α.
MISSISSIPI
I_1 : M
I_2 : MI
I_3 : MIS
I_4 : MISS
I_5 : MISSI
I_6 : MISSIS
I_7 : MISSISS
I_8 : MISSISSI
I_9 : MISSISSIP
I_10 : MISSISSIPI
[Figure: implicit suffix trees for the successive prefixes of MISSISSIPI, with character positions 1234567890]
How do suffix links help?
What is achieved so far?
Not so much. The worst-case running time is still O(m^2) for a phase.
Trick 1: Skip/Count Trick
Here v is the first node at or above the end of S[j…i] that has a suffix link (or is the root), and γ is the string between v and that end.
There must be a γ path from s(v).
Walking down along γ takes time proportional to |γ|.
The skip/count trick reduces the traversal time to something proportional to the number of nodes on the path.
[Figure: the path zabcdefghy crosses four edges of lengths 2, 2, 3 and 3]
But what does it buy in terms of worst-case bounds?
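A minimal sketch of the skip/count walk is shown below. Because the path is known to exist, each step looks only at the first character of the remaining string and at the edge length, then jumps over the whole edge. The `Node` layout (children keyed by first character, edges stored as index pairs into the text) and the hand-built chain with edge lengths 2, 2, 3, 3 are assumptions made for illustration.

```python
class Node:
    def __init__(self):
        # first character -> (start, end, child); the edge label is text[start:end]
        self.children = {}


def skip_count(text, node, lo, hi):
    """Walk down from node along text[lo:hi], which is assumed to be in the tree.

    Returns (last node reached, position inside the next edge or None, hops).
    """
    hops = 0
    while lo < hi:
        start, end, child = node.children[text[lo]]
        edge_len = end - start
        if hi - lo < edge_len:
            # The walk ends inside this edge, (hi - lo) characters down.
            return node, (text[lo], hi - lo), hops
        lo += edge_len               # skip the whole edge without comparing it
        node = child
        hops += 1
    return node, None, hops


# Build a single chain root -za-> -bc-> -def-> -ghy-> for the demonstration.
text = "zabcdefghy"
root = Node()
node, pos = root, 0
for length in (2, 2, 3, 3):
    child = Node()
    node.children[text[pos]] = (pos, pos + length, child)
    node, pos = child, pos + length
last = node

end_node, partial, hops = skip_count(text, root, 0, len(text))
print(hops, partial)
```

The full ten-character path is covered in four hops, one per edge, instead of ten character comparisons.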
Lemma 6.1.2: Let (v, s(v)) be any suffix link traversed during Ukkonen’s algorithm. At that moment, the node-depth of v is at most one greater than the node-depth of s(v).
[Figure: example node-depths: v = 2 and s(v) = 1; v = 3 and s(v) = 3; v = 4 and s(v) = 5]
Theorem 6.1.1: Using the skip/count trick, any phase of Ukkonen’s algorithm takes O(m) time.
In a single extension, the algorithm
– walks up at most one edge (decreases the current node-depth by at most one)
– finds a suffix link and traverses it (decreases the node-depth by at most another one)
– walks down some number of nodes (each down-walk step moves to a greater node-depth)
– applies the suffix extension rules
– and may add a suffix link
All operations except the down-walk take constant time, so only the down-walk time needs to be analyzed.
– Over the entire phase, the current node-depth is decremented at most 2m times.
– Since no node can have depth greater than m, the total possible increment to the current node-depth is bounded by 3m over the entire phase.
– So the total number of edge traversals is bounded by 3m.
– Since each edge traversal takes constant time, all the down-walking in a phase is O(m).
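The counting argument can be written out as follows. The symbols are introduced here for illustration only: d_0 and d_end are the current node-depth at the start and end of the phase, U is the total number of depth decrements (up-walks and suffix-link traversals), and D is the total number of down-walk edge traversals.

```latex
d_{\mathrm{end}} = d_0 - U + D
\quad\Longrightarrow\quad
D = (d_{\mathrm{end}} - d_0) + U \;\le\; m + 2m \;=\; 3m = O(m)
```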
Complexity
• There are m phases
• Each phase takes O(m)
• So the running time is O(m^2)
Two more tricks and we are done
Reference
• Dan Gusfield, Algorithms on Strings, Trees and Sequences, Chapter 6