Information Retrieval and Organisation
Suffix Trees
Dell Zhang Birkbeck, University of London
adapted from
http://www.math.tau.ac.il/~haimk/seminar02/suffixtrees.ppt
Trie
A tree representing a set of strings
Example set: {aeef, ad, bbfe, bbfg, c}
[Figure: the trie for this set, with one letter on each edge.]
Assume no string is a prefix of another.
1) Each edge is labeled by a letter.
2) No two edges outgoing from the same node are labeled the same.
3) Each string corresponds to a leaf.
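The three properties above can be illustrated with a minimal sketch. It assumes a trie stored as nested Python dictionaries; the names `build_trie` and `leaves` are hypothetical helpers chosen for this example, not part of the slides.

```python
def build_trie(strings):
    """Build a trie as nested dicts: each key is the letter on an outgoing edge."""
    root = {}
    for s in strings:
        node = root
        for ch in s:
            # Dict keys are unique, so no two edges leaving a node share a letter.
            node = node.setdefault(ch, {})
    return root


def leaves(node, prefix=""):
    """Return the strings spelled by root-to-leaf paths, in lexicographic order."""
    if not node:
        return [prefix]
    found = []
    for ch in sorted(node):
        found.extend(leaves(node[ch], prefix + ch))
    return found


trie = build_trie(["aeef", "ad", "bbfe", "bbfg", "c"])
print(sorted(trie))  # letters on the edges leaving the root
print(leaves(trie))  # one leaf per string, because the set is prefix-free
```

If one string were a prefix of another, its end would fall on an internal node and `leaves` would miss it, which is why the prefix-free assumption matters.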
Compressed Trie
Compress unary nodes, label edges by strings
[Figure: the trie above and its compressed form; the compressed edges are labeled a, bbf, c, eef, d, e and g.]
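Compressing unary nodes might be sketched as follows, again assuming a nested-dictionary trie over a prefix-free set of strings. The function names are hypothetical.

```python
def build_trie(strings):
    """Plain trie as nested dicts, one letter per edge."""
    root = {}
    for s in strings:
        node = root
        for ch in s:
            node = node.setdefault(ch, {})
    return root


def compress(node):
    """Merge every chain of unary nodes into one edge labeled by a string."""
    compressed = {}
    for label, child in node.items():
        # Follow the chain while the child has exactly one outgoing edge.
        while len(child) == 1:
            (next_label, next_child), = child.items()
            label += next_label
            child = next_child
        compressed[label] = compress(child)
    return compressed


trie = build_trie(["aeef", "ad", "bbfe", "bbfg", "c"])
print(compress(trie))
```

For the example set this yields the edges a, eef, d, bbf, e, g and c, matching the labels in the compressed figure.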
Suffix Tree
Given a string s, a suffix tree of s is a compressed trie of all suffixes of s.
To make these suffixes prefix-free we add a special character, say $, at the end of s.
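For example, let s = abab. A suffix tree of s is a compressed trie of all suffixes of abab$:
{ $, b$, ab$, bab$, abab$ }
[Figure: the suffix tree of abab$. The root has three outgoing edges, labeled ab, b and $; the node below ab and the node below b each have two outgoing edges, labeled ab$ and $.]
Note that a suffix tree has O(n) nodes, where n = |s|. Why? There is one leaf per suffix, so n + 1 leaves, and every internal node has at least two children, so there are fewer internal nodes than leaves.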
Suffix Tree Construction
The trivial algorithm: insert the suffixes of abab$ one at a time, from the longest to the shortest.
1) Put the longest suffix, abab$, in.
2) Put the suffix bab$ in.
3) Put the suffix ab$ in. The edge abab$ is split into ab followed by ab$, and a new edge $ is added below ab.
4) Put the suffix b$ in. The edge bab$ is split into b followed by ab$, and a new edge $ is added below b.
5) Put the suffix $ in.
[Figure: the tree before and after each insertion.]
We will also label each leaf with the starting point of the corresponding suffix: 0 for abab$, 1 for bab$, 2 for ab$, 3 for b$ and 4 for $.
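The insertion steps can be sketched in code. This is a minimal illustration of the trivial algorithm, assuming that $ does not occur in s and that edge labels are stored as plain strings; the `Node` class and the function names are hypothetical.

```python
class Node:
    def __init__(self):
        self.children = {}  # first letter of an edge label -> (label, child)
        self.leaf = None    # starting position of the suffix, for leaves only


def insert(root, suffix, start):
    """Insert one suffix, splitting an edge where the suffix diverges from it."""
    node = root
    while True:
        first = suffix[0]
        if first not in node.children:
            leaf = Node()
            leaf.leaf = start
            node.children[first] = (suffix, leaf)
            return
        label, child = node.children[first]
        # Length of the common prefix of the edge label and the remaining suffix.
        k = 0
        while k < len(label) and k < len(suffix) and label[k] == suffix[k]:
            k += 1
        if k < len(label):
            # The suffix leaves the edge in the middle: split the edge there.
            mid = Node()
            mid.children[label[k]] = (label[k:], child)
            node.children[first] = (label[:k], mid)
            child = mid
        node = child
        suffix = suffix[k:]


def build_suffix_tree(s):
    """Trivial construction: insert every suffix of s + '$', longest first."""
    text = s + "$"
    root = Node()
    for start in range(len(text)):
        insert(root, text[start:], start)
    return root


def as_dict(node):
    """Nested-dict view of the tree: edge label -> subtree or leaf label."""
    if node.leaf is not None:
        return node.leaf
    return {label: as_dict(child) for label, child in node.children.values()}


print(as_dict(build_suffix_tree("abab")))
```

Each insertion walks down at most one root-to-leaf path, comparing up to n characters, so n + 1 insertions give quadratic time overall.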
The trivial algorithm takes O(n^2) time.
It is possible to build a suffix tree in O(n) time using Ukkonen’s algorithm.
But how can that be? Does the tree even fit in O(n) space? Written out as strings, the edge labels can have total length Θ(n^2).
To use only O(n) space, encode each edge label as a pair (beginning-position, end-position) of positions in s.
Consider the string aaaaaabbbbbb$
[Figure: the suffix tree of aaaaaabbbbbb$, drawn first with the edge labels written out as strings and then with every label replaced by a (beginning-position, end-position) pair; for example, the label bbbbbb$ becomes (6,12).]
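A small sketch of the encoding, assuming 0-based positions and an inclusive end position as in the pairs shown for this string. The helper name `decode` is hypothetical.

```python
def decode(text, begin, end):
    """Recover the edge label stored as an inclusive (begin, end) pair."""
    return text[begin:end + 1]


text = "aaaaaabbbbbb$"
print(decode(text, 6, 12))  # bbbbbb$
print(decode(text, 5, 12))  # abbbbbb$
print(decode(text, 0, 0))   # a
```

Each edge then costs two integers regardless of how long its label is, so the whole tree fits in O(n) space.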
Suffix Tree Applications
What Can We Do with It?
Exact String Matching
Exact Set Matching
The Substring Problem for a Database of Patterns
Longest Common Substring of Two Strings
Recognising DNA Contamination
Common Substring of More Than Two Strings
……
Exact String Matching
Given text T (|T| = n), pre-process it such that when a pattern P (|P| = m) arrives you can quickly decide whether it occurs in T.
We may also want to find all occurrences of P in T.
In pre-processing, we just build a suffix tree in O(n) time.
[Figure: the suffix tree of abab$, with its leaves labeled 0 to 4.]
Given a pattern P = ab we traverse the tree according to the pattern.
If we do not get stuck traversing the pattern then the pattern occurs in the text, otherwise it does not.
Each leaf in the subtree below the node we reach corresponds to an occurrence.
By traversing this subtree we get all k occurrences in O(m+k) time.
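The matching procedure might look like the sketch below. To stay short it uses a naive, uncompressed suffix trie (quadratic size) as a stand-in for a real suffix tree, so it illustrates the traversal but not the stated bounds. The function names and the `"leaf"` key are assumptions of this example.

```python
def build_suffix_trie(text):
    """Naive suffix trie of text + '$', stored as nested dicts."""
    text += "$"
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
        # The multi-character key "leaf" cannot clash with a one-letter edge.
        node["leaf"] = start
    return root


def occurrences(root, pattern):
    """Walk down along the pattern, then collect every leaf label below."""
    node = root
    for ch in pattern:
        if ch not in node:
            return []  # stuck: the pattern does not occur in the text
        node = node[ch]
    found, stack = [], [node]
    while stack:
        current = stack.pop()
        for key, value in current.items():
            if key == "leaf":
                found.append(value)
            else:
                stack.append(value)
    return sorted(found)


trie = build_suffix_trie("abab")
print(occurrences(trie, "ab"))  # [0, 2]
print(occurrences(trie, "aa"))  # []
```

In a compressed suffix tree the subtree below the reached node has O(k) nodes for k leaves, which is where the k term in the running time comes from.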
How to match a pattern (query) against a
database of strings (documents)?
Generalized Suffix Tree
Given a set of strings S, the generalized suffix tree of S is a compressed trie of all suffixes of each s ∈ S.
To make these suffixes prefix-free, we add a special character at the end of each s.
To associate each suffix with a unique string in S, we use a different special character for each s, say $ for the first string and # for the second.
Each leaf node needs to be labelled with the document id together with the suffix position.
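For example, let s1 = abab and s2 = aab. Here is a generalized suffix tree for s1 and s2.
Suffixes of abab$: { $, b$, ab$, bab$, abab$ }
Suffixes of aab#: { #, b#, ab#, aab# }
[Figure: the generalized suffix tree of abab$ and aab#, with each leaf labeled by the starting position of its suffix.]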
Longest Common Substring
Given two strings s1 and s2, we build their generalized suffix tree.
Every node with a leaf descendant from string s1 and a leaf descendant from string s2 represents a maximal common substring, and vice versa.
Find such a node with the largest “string depth”.
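The idea can be sketched with a naive generalized suffix trie. As an assumption of this example, each node records the set of strings whose suffixes pass through it, which plays the role of "has a leaf descendant from s1 and from s2" without needing the $ and # terminators. The trie is uncompressed, so this sketch is quadratic rather than linear.

```python
def longest_common_substring(s1, s2):
    """Deepest trie node reached by suffixes of both strings."""
    root = {"docs": set(), "next": {}}
    for doc, text in ((1, s1), (2, s2)):
        for start in range(len(text)):
            node = root
            for ch in text[start:]:
                node = node["next"].setdefault(ch, {"docs": set(), "next": {}})
                node["docs"].add(doc)

    best = ""
    stack = [(root, "")]
    while stack:
        node, path = stack.pop()
        for ch, child in node["next"].items():
            # Only descend while both strings still share the path.
            if child["docs"] == {1, 2}:
                candidate = path + ch
                if len(candidate) > len(best):
                    best = candidate
                stack.append((child, candidate))
    return best


print(longest_common_substring("abab", "aab"))  # ab
```

When several common substrings share the maximum length, this sketch returns whichever one the traversal meets first.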
Lowest Common Ancestor
A lot more can be gained from the suffix tree, if
we pre-process it so that we can answer LCA
queries on it in constant time.
Why? The LCA of two leaves represents the longest common prefix (LCP) of the two corresponding suffixes.
[Figure: the generalized suffix tree of abab$ and aab#, with each leaf labeled by the starting position of its suffix.]
Finding Maximal Palindromes
A palindrome: cbaabc, caabaac, …
To find all palindromes in a string s (of length m), we build a generalized suffix tree for the string s and the reversed string sr.
The maximal even-length palindrome with centre between i-1 and i extends as far on each side as the LCP of the suffix at position i of s and the suffix at position m-i of sr.
For example, consider the string cbaaba.
Prepare a generalized suffix tree for s = cbaaba$ and sr = abaabc#.
For every i, find the LCA of the leaves for suffix i of s and suffix m-i of sr.
With constant-time LCA queries, all maximal palindromes can be identified in linear time.
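The index arithmetic can be checked with a short sketch. Here a direct character-by-character LCP computation stands in for the constant-time LCA query, so the sketch is quadratic in the worst case; the function names are hypothetical, and only palindromes centred between two positions (even length) are reported.

```python
def lcp(a, b):
    """Length of the longest common prefix of a and b."""
    k = 0
    while k < len(a) and k < len(b) and a[k] == b[k]:
        k += 1
    return k


def even_palindromes(s):
    """Map each centre i (between i-1 and i) to its maximal even palindrome."""
    m = len(s)
    sr = s[::-1]
    found = {}
    for i in range(1, m):
        # Suffix i of s reads rightwards from the centre,
        # suffix m-i of sr reads leftwards from the centre.
        k = lcp(s[i:], sr[m - i:])
        if k > 0:
            found[i] = s[i - k:i + k]
    return found


print(even_palindromes("cbaaba"))  # {3: 'baab'}
```

For cbaaba and i = 3 the two suffixes are aba and abc, their LCP is ab, and the palindrome is baab.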
[Figure: the generalized suffix tree of s = cbaaba$ and sr = abaabc#, with the leaves of each string labeled by suffix positions 0 to 6.]
Suffix Tree Drawbacks
It is O(n), but the constant is quite big.
It consumes a lot of space.
Notice that if we indeed want to traverse an edge in O(1) time, then we need an array (of pointers) of size |Σ| in each node, where Σ is the alphabet.
Suffix Array
It is much simpler and easier to implement.
Compared with suffix trees, we lose some
functionality, but we save space.
For example, let s = abab.
Sort the suffixes lexicographically: ab, abab, b, bab.
The suffix array gives the indices of the suffixes in sorted order: 2 0 3 1
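The definition translates almost directly into code. This sketch sorts the starting positions by the suffix that begins there, using Python's built-in `sorted`; the function name `suffix_array` is an assumption of this example.

```python
def suffix_array(s):
    """Starting positions of the suffixes of s, in lexicographic order."""
    return sorted(range(len(s)), key=lambda i: s[i:])


print(suffix_array("abab"))  # [2, 0, 3, 1]
```

Each comparison may inspect up to n characters, so this is a simple construction rather than an efficient one.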
Suffix Array Construction
The trivial algorithm: sort the suffixes with a comparison sort such as Quicksort.
The linear time algorithm: build a suffix tree in O(n) time first, and then traverse the tree depth-first, picking the edges outgoing from each node in lexicographic order, and fill the suffix array with the leaf labels in the order they are visited.
It can also be built in O(n) time directly.
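The traversal-based construction can be sketched as follows. A naive, uncompressed suffix trie stands in for the linear-time suffix tree, so the sketch shows the lexicographic depth-first traversal but is not itself linear. It relies on $ sorting before letters in ASCII; the function names and the `"leaf"` key are assumptions of this example.

```python
def build_suffix_trie(text):
    """Naive suffix trie of text + '$', stored as nested dicts."""
    text += "$"
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
        node["leaf"] = start  # leaves carry the suffix starting position
    return root


def suffix_array_from_trie(text):
    """Visit edges in lexicographic order and record the leaf labels."""
    order = []

    def visit(node):
        if "leaf" in node:
            order.append(node["leaf"])
            return
        for ch in sorted(node):  # '$' comes before letters in ASCII
            visit(node[ch])

    visit(build_suffix_trie(text))
    # Drop the suffix consisting of '$' alone, which is not a suffix of text.
    return [i for i in order if i < len(text)]


print(suffix_array_from_trie("abab"))  # [2, 0, 3, 1]
```

The recursion depth grows with the text length, so this version is only suitable for short strings.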
Exact String Matching
How do we search for a pattern P in the text T,
using the suffix array of T?
If P occurs in T, then all its occurrences are
consecutive in the suffix array.
So we can do two binary searches on the suffix
array: the first search locates the starting position
of the interval, and the second one determines
the end position.
It takes O(m log(n)) time, as a single suffix
comparison needs to compare up to m
characters.
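The two binary searches might be written as in this sketch. It compares the pattern with the first m characters of each suffix, which is what gives the O(m log(n)) bound; the suffix array itself is built by plain sorting to keep the example self-contained, and the function names are hypothetical.

```python
def suffix_array(text):
    """Simple construction by sorting, used here only to obtain the array."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, sa, pattern):
    """Return the sorted starting positions of pattern in text."""
    m = len(pattern)
    # First search: leftmost suffix whose first m characters are >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    # Second search: leftmost suffix whose first m characters are > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    # Every suffix in sa[start:lo] begins with the pattern.
    return sorted(sa[start:lo])


text = "mississippi"
sa = suffix_array(text)
print(find_occurrences(text, sa, "issi"))  # [1, 4]
print(find_occurrences(text, sa, "issa"))  # []
```

The search is valid because truncating sorted suffixes to their first m characters keeps them in non-decreasing order.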
It is also possible to do it in O(m+log(n)) with an
additional array of LCP.
Manber & Myers (1990)
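T = mississippi
P = issa
Suffix array of T, one suffix per row (starting position, then suffix):
10  i
7   ippi
4   issippi
1   ississippi
0   mississippi
9   pi
8   ppi
6   sippi
3   sissippi
5   ssippi
2   ssissippi
[Figure: the binary search pointers L, M and R marked against these rows.]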