1 Suffix Arrays: A new method for on-line string searches Udi Manber Gene Myers May 1989 Presented...
1
Suffix Arrays: A new method for on-line string searches
Udi Manber, Gene Myers
May 1989
Presented by: Oren Weimann
2
Introduction - Problem definition
“Is W a substring of A?”
$|A| = N$ and $|W| = P$, where $A = a_0 a_1 \dots a_{N-1}$.
$A_i$ = the suffix beginning at index i = $a_i a_{i+1} \dots a_{N-1}$.
Example: A = abccbbadgfbbcahgjf, W = badgfbb. W occurs in A starting at index 5.
3
Introduction – what is a suffix array? Example:
A = assassin (indices 0 1 2 3 4 5 6 7)
Sorted suffixes and their start indices:
k  Pos[k]  suffix
0  0       assassin
1  3       assin
2  6       in
3  7       n
4  2       sassin
5  5       sin
6  1       ssassin
7  4       ssin
Pos = 0 3 6 7 2 5 1 4
Pos[2] = 6 ($A_6$ = in)
4
Introduction – what is a suffix array?
A lexicographically sorted array, Pos[N], of all the suffixes of A:
Pos[k] = i $\iff$ $A_i$ is the k-th smallest suffix in the set $\{A_0, A_1, A_2, \dots, A_{N-1}\}$
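The definition above can be checked directly with a minimal Python sketch that sorts the suffix start indices by the suffixes they point to. The function name `naive_suffix_array` is hypothetical, and this direct sort only illustrates the definition; it is not the O(NlogN) construction discussed later in the slides.

```python
def naive_suffix_array(A):
    """Return Pos, where Pos[k] is the start index of the k-th smallest suffix of A."""
    # Comparing whole suffixes makes this far slower than O(N log N) in the
    # worst case; it is meant only to mirror the definition of Pos.
    return sorted(range(len(A)), key=lambda i: A[i:])


pos = naive_suffix_array("assassin")
print(pos)       # the Pos array for the running example
print(pos[2])    # start index of the third smallest suffix
```

For the running example this reproduces the Pos array shown on the previous slide.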
5
Introduction – what is a suffix tree? Example:
A trie that contains all suffixes of A:
[Figure: suffix tree of A = assassin (indices 0 1 2 3 4 5 6 7), with edges labeled by substrings and leaves labeled by the suffix start indices 0 to 7.]
6
The Article Overview
1. A search algorithm In O(P+logN) (assuming we already computed Pos[ ] and the longest common prefix (lcp) information).
2. How to construct Pos[ ] in O(NlogN) time and O(N) space. (assuming lcp info is known)
3. An Algorithm for computing the lcp information in O(NlogN).
4. Algorithms for Expected-time improvement.
7
The Search algorithm - Definitions
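For any string u, $u^p = u_1 u_2 u_3 \dots u_p$ (or u if $|u| \le p$).
Let “<” denote lexicographical order. We say $u <_p v \iff u^p < v^p$ (and similarly for $\le_p$, $=_p$, $\ge_p$, $>_p$).
Note that for any choice of p: $A_{Pos[0]} \le_p A_{Pos[1]} \le_p A_{Pos[2]} \le_p \dots \le_p A_{Pos[N-1]}$
Note that W is a substring of A $\iff$ there is an i such that $W =_P A_{Pos[i]}$.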
8
The Search algorithm – how does the array help us know if W is a substring of A?
We define a search interval:
LW = $\min\{k \mid W \le_P A_{Pos[k]} \text{ or } k = N\}$
RW = $\max\{k \mid W \ge_P A_{Pos[k]} \text{ or } k = -1\}$
W matches $a_i a_{i+1} \dots a_{i+P-1}$ $\iff$ i = Pos[k] for some $k \in [LW, RW]$.
9
Example:
A = assassin (indices 0 1 2 3 4 5 6 7)
k  suffix $A_{Pos[k]}$
0  assassin
1  assin
2  in
3  n
4  sassin
5  sin
6  ssassin
7  ssin

W     LW  RW  # matches
s     4   7   4
as    0   1   2
assa  0   0   1
ast   2   1   0
Option 1
Option 2
Option 3
10
Why finding LW, RW == Finding the matches:
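If LW > RW then W is not a substring of A.
Else: there are (RW-LW+1) matches: $A_{Pos[LW]}, \dots, A_{Pos[RW]}$.
[Figure: the Pos array with the interval [LW, RW] marked; $W >_P A_{Pos[k]}$ for k < LW and $W <_P A_{Pos[k]}$ for k > RW.]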
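To make the interval definition concrete, here is a small Python sketch that computes LW and RW by a plain linear scan, straight from the two definitions, and lists the match positions. The name `search_interval` is hypothetical and the scan is deliberately naive; the faster searches come on the following slides.

```python
def search_interval(A, pos, W):
    """Return (LW, RW) by scanning Pos, following the two definitions literally."""
    N, P = len(pos), len(W)

    def head(k):
        # P-symbol prefix of the k-th smallest suffix (shorter if the suffix is short).
        return A[pos[k]:pos[k] + P]

    LW = next((k for k in range(N) if W <= head(k)), N)
    RW = next((k for k in range(N - 1, -1, -1) if W >= head(k)), -1)
    return LW, RW


A = "assassin"
pos = [0, 3, 6, 7, 2, 5, 1, 4]
for W in ["s", "as", "assa", "ast"]:
    LW, RW = search_interval(A, pos, W)
    starts = [pos[k] for k in range(LW, RW + 1)]  # empty when LW > RW
    print(W, LW, RW, len(starts), starts)
```

Python's built-in string comparison is lexicographic with a shorter prefix ordered first, which is the ordering the definitions assume here.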
11
The Search algorithm –The easy way - O(PlogN)
[Figure: Pos entries L, M, R pointing at the suffixes abcde..., abcdf..., abd..., with W = “abcx”.]
O(logN) iterations; each iteration sets new L, R bounds (initially L=0, R=N-1) according to a comparison of W with $A_{Pos[M]}$, where M=(L+R)/2.
In the end LW = R.
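The binary search described above can be sketched in Python as follows. The function names `find_lw` and `find_rw` are hypothetical, the input is assumed to be non-empty, and each comparison looks at up to P symbols, which gives the O(PlogN) bound.

```python
def find_lw(A, pos, W):
    """LW = min{k | W <=_P A_Pos[k] or k = N}, by binary search."""
    N, P = len(pos), len(W)

    def head(k):
        return A[pos[k]:pos[k] + P]  # first P symbols of the k-th suffix

    if W <= head(0):
        return 0
    if W > head(N - 1):
        return N
    L, R = 0, N - 1            # invariant: head(L) < W <= head(R)
    while R - L > 1:
        M = (L + R) // 2
        if W <= head(M):
            R = M
        else:
            L = M
    return R


def find_rw(A, pos, W):
    """RW = max{k | W >=_P A_Pos[k] or k = -1}, by the mirrored search."""
    N, P = len(pos), len(W)

    def head(k):
        return A[pos[k]:pos[k] + P]

    if W >= head(N - 1):
        return N - 1
    if W < head(0):
        return -1
    L, R = 0, N - 1            # invariant: head(L) <= W < head(R)
    while R - L > 1:
        M = (L + R) // 2
        if W >= head(M):
            L = M
        else:
            R = M
    return L


A = "assassin"
pos = [0, 3, 6, 7, 2, 5, 1, 4]
print(find_lw(A, pos, "as"), find_rw(A, pos, "as"))
```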
12
The Search algorithm using lcp values in O(P+logN) – Definitions:
Speedup using precomputed lcp values; for now we assume lcp is known.
In each iteration we define:
– l = lcp($A_{Pos[L]}$, W)
– r = lcp(W, $A_{Pos[R]}$)
– Llcp[M] = lcp($A_{Pos[L]}$, $A_{Pos[M]}$)
– Rlcp[M] = lcp($A_{Pos[M]}$, $A_{Pos[R]}$)
13
The Search algorithm using lcp values in O(P+logN). Example: W=“abcx”
[Figure: Pos entries L, M, R pointing at the suffixes abcde..., abcdf..., abd...]
l = 3, r = 2
Llcp[M] = 4, Rlcp[M] = 2
Note that Llcp[M] and Rlcp[M] are well defined, because every midpoint M is reached from exactly one interval ($L_M$, $R_M$).
14
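So how do we use l, r, Llcp[M]? Example: W=abcx
[Figure: sorted suffixes abcde..., abc..., abc..., abcdf…, abd… with pointers L, M, R; l=3, r=2, Llcp[M]=4.]
Case 1: Llcp[M] > l (Llcp[M]=4 and l=3)
$A_{Pos[M]}$ agrees with $A_{Pos[L]}$ beyond the first l symbols, and W already differs from $A_{Pos[L]}$ after l symbols, so $W > A_{Pos[L]} \Rightarrow W > A_{Pos[M]}$.
Go right; l is unchanged (l = 3).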
15
Example: W=abcx (cont.)
Case 2: Llcp[M] < l (Llcp[M]=2 and l=3)
$A_{Pos[L]} < A_{Pos[M]} \Rightarrow W < A_{Pos[M]}$
Go left; r = Llcp[M] = 2.
[Figure: sorted suffixes abcde..., abdf…, abd… with pointers L, M, R; l=3, r=2, Llcp[M]=2.]
16
Example: W=abcx (cont.)
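[Figure: sorted suffixes abcde..., abc..., abc..., abcp…, abd… with pointers L, M, R; l=3, r=2, Llcp[M]=3.]
Case 3: Llcp[M] = l (Llcp[M]=3 and l=3)
Compare $W_l$ with $(A_{Pos[M]})_l$, then the following symbols, until $W_{l+j} \ne (A_{Pos[M]})_{l+j}$.
Go right or left according to $W_{l+j}$ and $(A_{Pos[M]})_{l+j}$.
New l or r = l+j. Number of comparisons = j+1.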
17
The Search algorithm using lcp values-complexity
In each iteration there are at most j+1 comparisons, where in total $\sum_{\text{iterations}} j \le P$.
Total comparisons $\le$ (P + #iterations) $\Rightarrow$ O(P+logN) running time.
Requires only three arrays of size N.
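The three cases can be put together in a Python sketch of the lcp-accelerated search for LW. All names (`lcp`, `build_lcp_tables`, `find_lw_lcp`) are hypothetical; the tables are filled here by naive symbol comparison purely for illustration, and the input is assumed to be non-empty. The branch on l >= r covers the symmetric situation where Rlcp[M] and r play the roles of Llcp[M] and l.

```python
def lcp(u, v):
    """Length of the longest common prefix of strings u and v."""
    n = 0
    while n < len(u) and n < len(v) and u[n] == v[n]:
        n += 1
    return n


def build_lcp_tables(A, pos):
    """Naively fill Llcp[M], Rlcp[M] for every midpoint M the search can visit."""
    N = len(pos)
    Llcp, Rlcp = [0] * N, [0] * N
    stack = [(0, N - 1)]
    while stack:
        L, R = stack.pop()
        if R - L > 1:
            M = (L + R) // 2
            Llcp[M] = lcp(A[pos[L]:], A[pos[M]:])
            Rlcp[M] = lcp(A[pos[M]:], A[pos[R]:])
            stack += [(L, M), (M, R)]
    return Llcp, Rlcp


def find_lw_lcp(A, pos, W, Llcp, Rlcp):
    """LW computed with O(P + log N) symbol comparisons."""
    N, P = len(pos), len(W)

    def extend(k, m):
        # Extend a known match of length m between W and the k-th suffix.
        i = pos[k]
        while m < P and i + m < len(A) and W[m] == A[i + m]:
            m += 1
        return m

    def w_le(k, m):
        # Given m = lcp(W, k-th suffix), decide whether W <=_P that suffix.
        i = pos[k]
        return m == P or (i + m < len(A) and W[m] < A[i + m])

    l, r = extend(0, 0), extend(N - 1, 0)
    if w_le(0, l):
        return 0
    if not w_le(N - 1, r):
        return N
    L, R = 0, N - 1
    while R - L > 1:
        M = (L + R) // 2
        if l >= r:
            # Cases 1 and 3 extend from l (case 1 stops at once); case 2 reuses Llcp[M].
            m = extend(M, l) if Llcp[M] >= l else Llcp[M]
        else:
            m = extend(M, r) if Rlcp[M] >= r else Rlcp[M]
        if w_le(M, m):
            R, r = M, m
        else:
            L, l = M, m
    return R


A = "assassin"
pos = [0, 3, 6, 7, 2, 5, 1, 4]
Llcp, Rlcp = build_lcp_tables(A, pos)
print(find_lw_lcp(A, pos, "ast", Llcp, Rlcp))
```

RW can be found with the mirrored procedure; it is omitted here to keep the sketch short.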
18
The Article Overview
1. A search algorithm In O(P+logN) (assuming we already computed Pos[ ] and the longest common prefix (lcp) information).
2. How to construct Pos[ ] in O(NlogN) time and O(N) space. (assuming lcp info is known)
3. An Algorithm for computing the lcp information in O(NlogN).
4. Algorithms for Expected-time improvement.
19
Construction of suffix array in O(NlogN)
Sorting the suffixes with a special radix sort. We will have O(logN) stages (numbered 1, 2, 4, 8, 16, …).
In stage H the suffixes are sorted into buckets, called H-buckets, according to their first H characters (the next stage is 2H). Thus, in stage H the suffixes are sorted by $\le_H$.
20
Construction of suffix array –The general idea
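If $A_i$ and $A_j$ are in the same H-bucket, we sort them by the next H symbols. But their next H symbols are the first H symbols of $A_{i+H}$ and $A_{j+H}$, which are already sorted in stage H.
[Figure, H=2: suffixes abef…, abcd…, ab… (first bucket), bb..., bb… (second bucket), cd…, cd… (third bucket), ef… (fourth bucket); $A_i$ = abef…, $A_j$ = abcd…, $A_{j+H}$ = cd…, $A_{i+H}$ = ef….]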
21
Construction of suffix array –The general idea (cont.)
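Let $A_i$ be in the first H-bucket after stage H
$\Rightarrow$ $A_i$ starts with the smallest H-symbol string
$\Rightarrow$ $A_{i-H}$ should be first in its H-bucket.
[Figure, H=2: suffixes abef…, abcd…, ab…, bb..., bb…, cdef…, cdab…, ef…; $A_i$ lies in the first bucket (ab…) and $A_{i-H}$ = cdab….]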
22
Construction of suffix array –The algorithm
Go over the suffix array. For each $A_i$: move $A_{i-H}$ to the next available place in its H-bucket.
The suffixes are now sorted according to $\le_{2H}$-order.
Go over the array again and decide which suffixes open a new 2H-bucket, using lcp knowledge (described later).
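The doubling idea can be illustrated with a short Python sketch. Note that this is a simplified prefix-doubling variant, not the bucket-moving procedure described above: at each stage it re-sorts the suffixes by the pair (bucket of $A_i$, bucket of $A_{i+H}$) with a general-purpose sort, which costs O(N log² N) overall instead of O(NlogN). The names `suffix_array_doubling` and `bucket` are hypothetical.

```python
def suffix_array_doubling(A):
    """Build Pos by prefix doubling (simplified: comparison sort at each stage)."""
    N = len(A)
    pos = sorted(range(N), key=lambda i: A[i])   # stage H = 1: sort by first symbol
    bucket = [0] * N                             # bucket[i] = H-bucket number of A_i
    for k in range(1, N):
        bucket[pos[k]] = bucket[pos[k - 1]] + (A[pos[k]] != A[pos[k - 1]])
    H = 1
    # Stop once every suffix sits in its own bucket.
    while H < N and bucket[pos[N - 1]] < N - 1:
        def key(i):
            # First H symbols are summarised by bucket[i], the next H by bucket[i + H];
            # -1 stands for "the suffix ends here", which sorts first.
            return (bucket[i], bucket[i + H] if i + H < N else -1)

        pos.sort(key=key)
        new_bucket = [0] * N
        for k in range(1, N):
            # A suffix opens a new 2H-bucket when its key differs from its predecessor's.
            new_bucket[pos[k]] = new_bucket[pos[k - 1]] + (key(pos[k]) != key(pos[k - 1]))
        bucket = new_bucket
        H *= 2
    return pos


print(suffix_array_doubling("assassin"))
```

The stage structure (H = 1, 2, 4, …) and the use of the already known order of $A_{i+H}$ are the same as in the slides; only the way suffixes are rearranged within a stage differs.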
23
Construction of suffix array – The algorithm. Example:
A = assassin (indices 0 1 2 3 4 5 6 7), H=1, $A_i$ sets $A_{i-1}$
Array: assin assassin in n sin ssin sassin ssassin
Step: $A_3$ sets $A_2$
24
Construction of suffix array – The algorithm. Example:
A = assassin (indices 0 1 2 3 4 5 6 7), H=1, $A_i$ sets $A_{i-1}$
Array: assin assassin in n sassin ssin sin ssassin
Step: $A_0$ (there is no $A_{-1}$ to set)
25
Construction of suffix array – The algorithm. Example:
A = assassin (indices 0 1 2 3 4 5 6 7), H=1, $A_i$ sets $A_{i-1}$
Array: assin assassin in n sassin ssin sin ssassin
Step: $A_6$ sets $A_5$
26
Construction of suffix array – The algorithm. Example:
A = assassin (indices 0 1 2 3 4 5 6 7), H=1, $A_i$ sets $A_{i-1}$
Array: assin assassin in n sassin sin ssin ssassin
Step: $A_7$ sets $A_6$
27
Construction of suffix array – The algorithm. Example:
A = assassin (indices 0 1 2 3 4 5 6 7), H=1, $A_i$ sets $A_{i-1}$
Array: assin assassin in n sassin sin ssin ssassin
Step: $A_2$ sets $A_1$
28
Construction of suffix array – The algorithm. Example:
A = assassin (indices 0 1 2 3 4 5 6 7), H=1, $A_i$ sets $A_{i-1}$
Array: assin assassin in n sassin sin ssassin ssin
Step: $A_5$ sets $A_4$
29
Construction of suffix array – The algorithm. Example:
A = assassin (indices 0 1 2 3 4 5 6 7), H=1, $A_i$ sets $A_{i-1}$
Array: assin assassin in n sassin sin ssassin ssin
Step: $A_1$ sets $A_0$
30
Construction of suffix array – The algorithm. Example:
A = assassin (indices 0 1 2 3 4 5 6 7), H=1, $A_i$ sets $A_{i-1}$
Array: assassin assin in n sassin sin ssassin ssin
Step: $A_4$ sets $A_3$
31
Construction of suffix array – The algorithm. Example:
A = assassin (indices 0 1 2 3 4 5 6 7), H=1
Array: assassin assin in n sassin sin ssassin ssin
Go over the array to get the new 2-buckets.
lcp(sassin, sin) = 1 + lcp(assin, in) = 1 + 0 = 1, so “sin” opens a new 2-bucket.
32
Construction of suffix array – The algorithm. Example:
A = assassin (indices 0 1 2 3 4 5 6 7), H=2, $A_i$ sets $A_{i-2}$
Array: assassin assin in n sassin sin ssassin ssin
Step: $A_0$ (there is no $A_{-2}$ to set)
33
Construction of suffix array – The algorithm. Example:
A = assassin (indices 0 1 2 3 4 5 6 7), H=2, $A_i$ sets $A_{i-2}$
Array: assassin assin in n sassin sin ssassin ssin
Step: $A_3$ sets $A_1$
34
Construction of suffix array – The algorithm. Example:
A = assassin (indices 0 1 2 3 4 5 6 7), H=2, $A_i$ sets $A_{i-2}$
Array: assassin assin in n sassin sin ssassin ssin
Step: $A_6$ sets $A_4$
35
Construction of suffix array – The algorithm. Example:
A = assassin (indices 0 1 2 3 4 5 6 7), H=2, $A_i$ sets $A_{i-2}$
Array: assassin assin in n sassin sin ssassin ssin
Step: $A_7$ sets $A_5$
36
Construction of suffix array – The algorithm. Example:
A = assassin (indices 0 1 2 3 4 5 6 7), H=2, $A_i$ sets $A_{i-2}$
Array: assassin assin in n sassin sin ssassin ssin
Step: $A_2$ sets $A_0$
37
Construction of suffix array – The algorithm. Example:
A = assassin (indices 0 1 2 3 4 5 6 7), H=2, $A_i$ sets $A_{i-2}$
Array: assassin assin in n sassin sin ssassin ssin
Step: $A_5$ sets $A_3$
38
Construction of suffix array – The algorithm. Example:
A = assassin (indices 0 1 2 3 4 5 6 7), H=2, $A_i$ sets $A_{i-2}$
Array: assassin assin in n sassin sin ssassin ssin
Step: $A_1$ (there is no $A_{-1}$ to set)
39
Construction of suffix array – The algorithm. Example:
A = assassin (indices 0 1 2 3 4 5 6 7), H=2, $A_i$ sets $A_{i-2}$
Array: assassin assin in n sassin sin ssassin ssin
Step: $A_4$ sets $A_2$
40
Construction of suffix array – The algorithm. Example:
A = assassin (indices 0 1 2 3 4 5 6 7), H=2
Array: assassin assin in n sassin sin ssassin ssin
Go over the array to get the new 4-buckets.
41
Construction of suffix array – The algorithm. Example:
A = assassin (indices 0 1 2 3 4 5 6 7), H=4
Array: assassin assin in n sassin sin ssassin ssin
That’s it, we are sorted!
42
Construction of suffix array –Complexity Summary
Sorting by first char: O(N).
O(logN) stages of O(N) operations each: O(NlogN).
Total time: O(NlogN). Space: 2 integer arrays of size N.
43
The Article Overview
1. A search algorithm In O(P+logN) (assuming we already computed Pos[ ] and the longest common prefix (lcp) information).
2. How to construct Pos[ ] in O(NlogN) time and O(N) space.
3. An Algorithm for computing the lcp information in O(NlogN).
4. Algorithms for Expected-time improvement.
44
How to find Longest Common Prefixes – the general idea
We don’t care what the lcp is between suffixes in the same H-bucket.
For $A_p$, $A_q$ in the same H-bucket but in different 2H-buckets:
– $H \le lcp(A_p, A_q) < 2H$
– $lcp(A_p, A_q) = H + lcp(A_{p+H}, A_{q+H})$
– $lcp(A_{p+H}, A_{q+H}) < H$, which is why $A_{p+H}$ and $A_{q+H}$ are in different H-buckets. But which ones?
45
How to find Longest Common Prefixes – the general idea
If $A_{p+H}$ and $A_{q+H}$ were in adjacent H-buckets then the lcp is known. How?
If not, then for i < j:
$lcp(A_{Pos[i]}, A_{Pos[j]}) = \min_{k \in [i, j-1]} \{ lcp(A_{Pos[k]}, A_{Pos[k+1]}) \}$
46
How to find Longest Common Prefixes – the general idea
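Example, H=2:
Array: assassin assin in n sassin sin ssassin ssin
$A_{p+H}$ and $A_{q+H}$ mark the suffixes sassin and ssin; the lcps of the adjacent pairs between them are 1, 1, 2.
$lcp(A_{p+H}, A_{q+H}) = \min\{1, 1, 2\} = 1$
Notice that if two neighbors are in the same H-bucket we can take their lcp to be H, since $lcp(A_{p+H}, A_{q+H}) < H$.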
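The minimum rule can be checked on the running example with a small Python sketch. The helper names (`lcp`, `adjacent_lcps`, `lcp_by_min`) are hypothetical, and the adjacent lcps are computed here by direct comparison rather than during construction.

```python
def lcp(u, v):
    """Length of the longest common prefix of strings u and v."""
    n = 0
    while n < len(u) and n < len(v) and u[n] == v[n]:
        n += 1
    return n


def adjacent_lcps(A, pos):
    """adj[i] = lcp(A_Pos[i-1], A_Pos[i]) for i >= 1; adj[0] is an unused 0."""
    return [0] + [lcp(A[pos[i - 1]:], A[pos[i]:]) for i in range(1, len(pos))]


def lcp_by_min(adj, i, j):
    """lcp(A_Pos[i], A_Pos[j]) for i < j, as the minimum over adjacent pairs."""
    return min(adj[i + 1:j + 1])


A = "assassin"
pos = [0, 3, 6, 7, 2, 5, 1, 4]
adj = adjacent_lcps(A, pos)
print(adj)
print(lcp_by_min(adj, 4, 7))   # sassin (rank 4) versus ssin (rank 7)
```

Scanning the slice costs time proportional to j - i per query; the interval tree introduced later avoids that.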
47
How to find lcp – algorithm and data structures – Hgt[]
During the construction stage, we build an array called Hgt[N], where Hgt[i] = lcp($A_{Pos[i-1]}$, $A_{Pos[i]}$), initialized so that Hgt[i] = N+1 for every i.
In stage H=1: Hgt[i] = 0 for every $A_{Pos[i]}$ that is first in its bucket.
In stage 2H: we update every Hgt[i] for which $A_{Pos[i]}$ is the first in a newly created 2H-bucket.
48
How to find lcp – Hgt[] example:
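(The value N+1 = 9 marks entries that are not yet set; Hgt[0] is not shown.)
H=1:
i       0      1         2   3  4    5     6       7
Pos[i]  3      0         6   7  5    4     2       1
suffix  assin  assassin  in  n  sin  ssin  sassin  ssassin
Hgt[i]  -      9         0   0  0    9     9       9
H=2:
i       0      1         2   3  4       5    6     7
Pos[i]  3      0         6   7  2       5    4     1
suffix  assin  assassin  in  n  sassin  sin  ssin  ssassin
Hgt[i]  -      9         0   0  0       1    1     9
lcp(ssin, sin) = 1 + lcp(sin, in) = 1 + min{lcp(in, n), lcp(n, sin)} = 1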
49
How to find lcp – Hgt[] example (cont.)
H=4:
i       0         1      2   3  4       5    6        7
Pos[i]  0         3      6   7  2       5    1        4
suffix  assassin  assin  in  n  sassin  sin  ssassin  ssin
Hgt[i]  -         3      0   0  0       1    1        2
lcp(assassin, assin) = 2 + lcp(sassin, sin) = 2 + 1 = 3
lcp(ssin, ssassin) = 2 + lcp(in, assin) = 2 + 0 = 2
50
How to find lcp –data structures
We need a data structure that will contain lcp($A_{Pos[j]}$, $A_{Pos[i]}$) for any i and j (not just i and i+1, which Hgt[] supplies).
Hgt[] will become the leaves of a balanced binary tree called the interval tree.
51
How to find lcp –example of Interval tree
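[Figure: interval tree over the leaves (0,1), (1,2), (2,3), (3,4), (4,5), (5,6), (6,7); each leaf holds the corresponding Hgt value, and each internal node holds the minimum of its children.]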
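As a rough illustration of how the tree ties Hgt[] to the search, the Python sketch below fills Llcp[] and Rlcp[] from a finished Hgt[] array by taking minima over the same (L, M, R) intervals the binary search visits. This is a static, after-the-fact build under that assumption, not the incremental update performed during construction, and the name `build_interval_lcps` is hypothetical.

```python
def build_interval_lcps(hgt):
    """Fill Llcp[M] and Rlcp[M] from Hgt, where hgt[i] = lcp(A_Pos[i-1], A_Pos[i])."""
    N = len(hgt)
    Llcp, Rlcp = [0] * N, [0] * N

    def visit(L, R):
        # Returns lcp(A_Pos[L], A_Pos[R]) = min of hgt[L+1 .. R].
        if R - L == 1:
            return hgt[R]                 # leaf (L, L+1) of the interval tree
        M = (L + R) // 2
        Llcp[M] = visit(L, M)             # left child covers (L, M)
        Rlcp[M] = visit(M, R)             # right child covers (M, R)
        return min(Llcp[M], Rlcp[M])

    if N > 1:
        visit(0, N - 1)
    return Llcp, Rlcp


# Hgt for A = assassin after the final stage (Hgt[0] is unused and set to 0 here).
hgt = [0, 3, 0, 0, 0, 1, 1, 2]
Llcp, Rlcp = build_interval_lcps(hgt)
print(Llcp)
print(Rlcp)
```

The recursion depth is O(logN) because each interval is halved, and every midpoint M is written exactly once.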
52
How to find lcp –Complexity
Each time a leaf opens a new bucket we change Hgt[i] for that leaf.
That change requires O(logN) updates in the interval tree.
There are O(N) leaves that open a new bucket.
In total we get O(NlogN) time to compute all lcp values.
53
The Article Overview
1. A search algorithm In O(P+logN) (assuming we already computed Pos[ ] and the longest common prefix (lcp) information).
2. How to construct Pos[ ] in O(NlogN) time and O(N) space.
3. An Algorithm for computing the lcp information in O(NlogN).
4. Algorithms for Expected-time improvement.
54
Time Expected-case Improvement of the construction of pos[]
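Assumption: all N-symbol strings are equally likely.
– Under this assumption, the expected length of the longest repeated substring is $O(\log_{|\Sigma|} N)$.
This immediately implies that the construction of Pos[] is reduced to O(N loglogN) expected time. Why? (H doubles at every stage, so the sort is finished once H exceeds the length of the longest repeated substring, that is, after O(loglogN) stages.)
Next is a way to reduce it to O(N).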
55
Time Expected-case Improvement of the construction of pos[]
Let $T = \lfloor \log_{|\Sigma|} N \rfloor$. We encode each possible T-length string as an integer with the isomorphism $Int_T(u)$.
Map each $A_p$ to $Int_T(A_p) \in [0, |\Sigma|^T - 1]$:
– $Int_T(A_p) = a_p |\Sigma|^{T-1} + \lfloor Int_T(A_{p+1}) / |\Sigma| \rfloor$
56
Example of the mapping
$Int_T(A_p) = a_p |\Sigma|^{T-1} + \lfloor Int_T(A_{p+1}) / |\Sigma| \rfloor$
|Σ| = 4, with a=0, i=1, n=2, s=3; N=8; $T = \lfloor \log_4 8 \rfloor = 1$
p  suffix    computation  Int_T(A_p)
0  assassin  0*4^0 + 0    0
1  ssassin   3*4^0 + 0    3
2  sassin    3*4^0 + 0    3
3  assin     0*4^0 + 0    0
4  ssin      3*4^0 + 0    3
5  sin       3*4^0 + 0    3
6  in        1*4^0 + 0    1
7  n         2*4^0 + 0    2
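The recurrence can be evaluated right to left in one pass, as in the Python sketch below. The function name `int_t_codes` and the `alphabet` parameter are hypothetical, and treating the value past the end of A as 0 (so that short suffixes are padded with the smallest digit) is an assumption of this sketch.

```python
def int_t_codes(A, alphabet, T):
    """Return [Int_T(A_0), ..., Int_T(A_{N-1})] using the right-to-left recurrence."""
    digit = {c: d for d, c in enumerate(alphabet)}   # e.g. a=0, i=1, n=2, s=3
    base = len(alphabet)                             # |Sigma|
    top = base ** (T - 1)                            # weight of the leading symbol
    codes = [0] * (len(A) + 1)                       # codes[N] = 0 by assumption
    for p in range(len(A) - 1, -1, -1):
        # Int_T(A_p) = a_p * |Sigma|^(T-1) + floor(Int_T(A_{p+1}) / |Sigma|)
        codes[p] = digit[A[p]] * top + codes[p + 1] // base
    return codes[:len(A)]


print(int_t_codes("assassin", "ains", 1))   # T = 1, as in the example
print(int_t_codes("assassin", "ains", 2))   # T = 2 encodes pairs of symbols
```

Each value is obtained from its right neighbour with O(1) arithmetic, which is where the O(N) total for all suffixes comes from.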
57
Time Expected-case Improvement of the construction of pos[]
By the definition of $Int_T(A_p)$, it takes O(N) time to compute the $Int_T(A_p)$ values of all suffixes.
So now, instead of starting with H=1, we start with $H = \lfloor \log_{|\Sigma|} N \rfloor$.
But since the expected length of the longest repeated substring is $O(\log_{|\Sigma|} N)$, we will have O(1) expected stages of the radix sort.
Thus, the total expected time for constructing Pos[] is O(N).
58
So is a suffix array better than a suffix tree?
                     Suffix array                                       Suffix tree
Construction time    O(NlogN); O(N) (needs additional space)            O(N) (for small |Σ|)
Time complexity      O(P+logN) (good for large alphabets)               O(Plog|Σ|)
Space complexity     requires 2N integers (this is the main advantage)  O(N)
Dependent on |Σ|?    No                                                 Yes