CS 6293 Advanced Topics: Current Bioinformatics Lecture 5 Exact String Matching Algorithms.
![Page 1: CS 6293 Advanced Topics: Current Bioinformatics Lecture 5 Exact String Matching Algorithms.](https://reader030.fdocuments.us/reader030/viewer/2022032800/56649d445503460f94a216e6/html5/thumbnails/1.jpg)
CS 6293 Advanced Topics: Current Bioinformatics
Lecture 5
Exact String Matching Algorithms
![Page 2: CS 6293 Advanced Topics: Current Bioinformatics Lecture 5 Exact String Matching Algorithms.](https://reader030.fdocuments.us/reader030/viewer/2022032800/56649d445503460f94a216e6/html5/thumbnails/2.jpg)
Overview
• Sequence alignment: two sub-problems:
  – How to score an alignment with errors
  – How to find an alignment with the best score
• Today: exact string matching
  – Does not allow any errors
  – Efficiency becomes the sole consideration
    • Time and space
![Page 3: CS 6293 Advanced Topics: Current Bioinformatics Lecture 5 Exact String Matching Algorithms.](https://reader030.fdocuments.us/reader030/viewer/2022032800/56649d445503460f94a216e6/html5/thumbnails/3.jpg)
Why exact string matching?
• The most fundamental string comparison problem
• Often the core of more complex string comparison algorithms
  – E.g., BLAST
• Often repeatedly called by other methods
  – Usually the most time-consuming part
  – A small improvement could improve overall efficiency considerably
![Page 4: CS 6293 Advanced Topics: Current Bioinformatics Lecture 5 Exact String Matching Algorithms.](https://reader030.fdocuments.us/reader030/viewer/2022032800/56649d445503460f94a216e6/html5/thumbnails/4.jpg)
Definitions
• Text: a longer string T (length m)
• Pattern: a shorter string P (length n)
• Exact matching: find all occurrences of P in T
T: abayababaxababb (length m)
P: aba (length n)
![Page 5: CS 6293 Advanced Topics: Current Bioinformatics Lecture 5 Exact String Matching Algorithms.](https://reader030.fdocuments.us/reader030/viewer/2022032800/56649d445503460f94a216e6/html5/thumbnails/5.jpg)
The naïve algorithm
[Figure: P = aba is slid along T = abayababaxababb one position at a time and compared with T at every alignment.]
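The sliding comparison can be sketched in a few lines of Python. This is only an illustration: the function name `naive_search` and the 0-based positions are choices made here, not part of the slides.

```python
def naive_search(text, pattern):
    """Return the 0-based start of every occurrence of pattern in text."""
    m, n = len(text), len(pattern)
    hits = []
    for start in range(m - n + 1):          # try every alignment of P under T
        k = 0
        while k < n and text[start + k] == pattern[k]:
            k += 1                          # extend the match char by char
        if k == n:                          # all n chars matched
            hits.append(start)
    return hits


print(naive_search("abayababaxababb", "aba"))
```

On the slide's example, the four alignments where aba matches start at 1-based positions 1, 5, 7, and 11.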
![Page 6: CS 6293 Advanced Topics: Current Bioinformatics Lecture 5 Exact String Matching Algorithms.](https://reader030.fdocuments.us/reader030/viewer/2022032800/56649d445503460f94a216e6/html5/thumbnails/6.jpg)
Time complexity
• Worst case: O(mn)
• Best case: O(m)
  e.g. aaaaaaaaaaaaaa vs baaaaaaa
• Average case?
  – Alphabet A, C, G, T
  – Assume both P and T are random
  – Equal probability
  – On average, how many chars do you need to compare before giving up?
![Page 7: CS 6293 Advanced Topics: Current Bioinformatics Lecture 5 Exact String Matching Algorithms.](https://reader030.fdocuments.us/reader030/viewer/2022032800/56649d445503460f94a216e6/html5/thumbnails/7.jpg)
Average case time complexity
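P(mismatch at 1st position): $\frac{3}{4}$
P(mismatch at 2nd position): $\frac{1}{4} \cdot \frac{3}{4}$
P(mismatch at 3rd position): $\left(\frac{1}{4}\right)^2 \cdot \frac{3}{4}$
P(mismatch at kth position): $\left(\frac{1}{4}\right)^{k-1} \cdot \frac{3}{4}$
Expected number of comparisons per position, with $p = 1/4$ and using $\sum_{k \ge 1} k p^k = \frac{p}{(1-p)^2}$:
$$\sum_{k \ge 1} k (1-p) p^{k-1} = \frac{1-p}{p} \sum_{k \ge 1} k p^k = \frac{1}{1-p} = \frac{4}{3}$$
Average complexity: 4m/3
Not as bad as you thought it might be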
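As a quick numerical sanity check of the series above, the partial sum can be evaluated directly. The helper name `expected_comparisons` and the cutoff of 200 terms are arbitrary choices for this sketch.

```python
def expected_comparisons(p, terms=200):
    """Partial sum of k * (1 - p) * p**(k - 1) for k = 1..terms."""
    return sum(k * (1 - p) * p ** (k - 1) for k in range(1, terms + 1))


# For the DNA alphabet, p = 1/4; the value should be very close to 4/3.
print(expected_comparisons(0.25))
```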
![Page 8: CS 6293 Advanced Topics: Current Bioinformatics Lecture 5 Exact String Matching Algorithms.](https://reader030.fdocuments.us/reader030/viewer/2022032800/56649d445503460f94a216e6/html5/thumbnails/8.jpg)
Biological sequences are not random
T: aaaaaaaaaaaaaaaaaaaaaaaaa
P: aaaab
Plus: 4m/3 average case is still bad for long genomic sequences!
Especially if this has to be done again and again
Smarter algorithms:
• O(m + n) in the worst case
• Sub-linear in practice
![Page 9: CS 6293 Advanced Topics: Current Bioinformatics Lecture 5 Exact String Matching Algorithms.](https://reader030.fdocuments.us/reader030/viewer/2022032800/56649d445503460f94a216e6/html5/thumbnails/9.jpg)
How to speedup?
• Pre-processing T or P
• Why can pre-processing save us time?
  – Uncovers the structure of T or P
  – Determines when we can skip ahead without missing anything
  – Determines when we can infer the result of character comparisons without doing them
ACGTAXACXTAXACGXAX
ACGTACA
![Page 10: CS 6293 Advanced Topics: Current Bioinformatics Lecture 5 Exact String Matching Algorithms.](https://reader030.fdocuments.us/reader030/viewer/2022032800/56649d445503460f94a216e6/html5/thumbnails/10.jpg)
Cost for exact string matching
Total cost = cost(preprocessing)   (overhead)
           + cost(comparison)      (minimize)
           + cost(output)          (constant)
Hope: gain > overhead
![Page 11: CS 6293 Advanced Topics: Current Bioinformatics Lecture 5 Exact String Matching Algorithms.](https://reader030.fdocuments.us/reader030/viewer/2022032800/56649d445503460f94a216e6/html5/thumbnails/11.jpg)
String matching scenarios
• One T and one P
  – Search a word in a document
• One T and many P all at once
  – Search a set of words in a document
  – Spell checking (fixed P)
• One fixed T, many P
  – Search a completed genome for short sequences
• Two (or many) T’s for common patterns
• Q: Which one to pre-process?
• A: Always pre-process the shorter seq, or the one that is repeatedly used
![Page 12: CS 6293 Advanced Topics: Current Bioinformatics Lecture 5 Exact String Matching Algorithms.](https://reader030.fdocuments.us/reader030/viewer/2022032800/56649d445503460f94a216e6/html5/thumbnails/12.jpg)
Pre-processing algs
• Pattern preprocessing
  – Knuth-Morris-Pratt algorithm (KMP)
  – Aho-Corasick algorithm
    • Multiple patterns
  – Boyer-Moore algorithm (discussed only if we have time)
    • The choice for most cases
    • Typically sub-linear time
• Text preprocessing
  – Suffix tree
    • Very useful for many purposes
![Page 13: CS 6293 Advanced Topics: Current Bioinformatics Lecture 5 Exact String Matching Algorithms.](https://reader030.fdocuments.us/reader030/viewer/2022032800/56649d445503460f94a216e6/html5/thumbnails/13.jpg)
Algorithm KMP: Intuitive example 1
• Observation: by reasoning on the pattern alone, we can determine that if a mismatch happened when comparing P[8] with T[i], we can shift P by four chars, and compare P[4] with T[i], without missing any possible matches.
• Number of comparisons saved: 6
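[Figure: P = abcxabcde aligned under a stretch of T reading abcxabc, with the mismatch at the next char. KMP then shifts P right by four chars; the naïve approach would try the shifts by one, two, and three chars in between.]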
![Page 14: CS 6293 Advanced Topics: Current Bioinformatics Lecture 5 Exact String Matching Algorithms.](https://reader030.fdocuments.us/reader030/viewer/2022032800/56649d445503460f94a216e6/html5/thumbnails/14.jpg)
Intuitive example 2
• Observation: by reasoning on the pattern alone, we can determine that if a mismatch happened between P[7] and T[j], we can shift P by six chars and compare T[j] with P[1] without missing any possible matches
• Number of comparisons saved: 7
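[Figure: P = abcxabcde aligned under T with the mismatch at P[7] = c, so T[j] should not be a c. A shift by four chars would put P[3] = c under T[j] and fail again, so P is shifted by six chars; the naïve approach would try every shift in between.]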
![Page 15: CS 6293 Advanced Topics: Current Bioinformatics Lecture 5 Exact String Matching Algorithms.](https://reader030.fdocuments.us/reader030/viewer/2022032800/56649d445503460f94a216e6/html5/thumbnails/15.jpg)
KMP algorithm: pre-processing
• Key: the reasoning is done without even knowing what string T is.
• Only the location of the mismatch in P must be known.
[Figure: P has a prefix t’ followed by char z, and a suffix t = P[j..i] of P[1..i] followed by char y. T contains t followed by char x, which mismatches y.]
Pre-processing: for any position i in P, find the longest proper suffix of P[1..i], t = P[j..i], such that t matches a prefix of P, t’, and the char following t differs from the char following t’ (i.e., y ≠ z).
For each i, let sp(i) = length(t).
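One possible way to compute sp(i) as defined above is sketched below. It is not taken from the slides: it first computes the ordinary border lengths (the usual KMP prefix function) and then enforces the y ≠ z condition. The name `strong_sp` is chosen here, and the returned list is 0-based, so `sp[i - 1]` holds the slide's sp(i).

```python
def strong_sp(pattern):
    """sp values with the 'next chars differ' condition; sp[i-1] is the slide's sp(i)."""
    n = len(pattern)
    # Step 1: ordinary border lengths, without the y != z condition.
    border = [0] * n
    k = 0
    for i in range(1, n):
        while k > 0 and pattern[i] != pattern[k]:
            k = border[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        border[i] = k
    # Step 2: reject borders whose following char equals the char after the suffix.
    sp = [0] * n
    for i in range(n):
        k = border[i]
        if i == n - 1 or k == 0:
            sp[i] = k                 # no following char to compare, or no border
        elif pattern[k] != pattern[i + 1]:
            sp[i] = k                 # y != z holds for the longest border
        else:
            sp[i] = sp[k - 1]         # fall back to a shorter, already valid border
    return sp


print(strong_sp("aataac"))
print(strong_sp("abababc"))
```

For the two patterns used on the following slides, this reproduces the tables 0 1 0 0 2 0 and 0 0 0 0 0 4 0.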
![Page 16: CS 6293 Advanced Topics: Current Bioinformatics Lecture 5 Exact String Matching Algorithms.](https://reader030.fdocuments.us/reader030/viewer/2022032800/56649d445503460f94a216e6/html5/thumbnails/16.jpg)
KMP algorithm: shift rule
[Figure: P with prefix t’ followed by z and suffix t = P[j..i] followed by y, aligned against t x in T. After the shift, t’ lies under t and z is compared with x.]
Shift rule: when a mismatch occurs between P[i+1] and T[k], shift P to the right by i – sp(i) chars and compare x with z.
This shift rule can be represented implicitly by creating a failure link between y and z. Meaning: when a mismatch occurs between x in T and P[i+1], resume the comparison between x and P[sp(i)+1].
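A minimal sketch of a search loop that follows these failure links is given below. The helper `strong_sp` is one assumed way to obtain the sp values, the function names are chosen here, and positions are 0-based (the slides count from 1).

```python
def strong_sp(pattern):
    """sp values with the y != z condition; sp[i-1] is the slide's sp(i)."""
    n = len(pattern)
    border = [0] * n
    k = 0
    for i in range(1, n):
        while k > 0 and pattern[i] != pattern[k]:
            k = border[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        border[i] = k
    sp = [0] * n
    for i in range(n):
        k = border[i]
        if i == n - 1 or k == 0 or pattern[k] != pattern[i + 1]:
            sp[i] = k
        else:
            sp[i] = sp[k - 1]
    return sp


def kmp_search(text, pattern):
    """Return the 0-based start of every occurrence of a non-empty pattern."""
    sp = strong_sp(pattern)
    n = len(pattern)
    hits = []
    q = 0                                   # number of pattern chars matched so far
    for i, x in enumerate(text):
        while q > 0 and pattern[q] != x:
            q = sp[q - 1]                   # follow the failure link, re-compare x
        if pattern[q] == x:
            q += 1
        if q == n:                          # full match ending at text[i]
            hits.append(i - n + 1)
            q = sp[n - 1]
    return hits


print(kmp_search("aacaataaaaataaccttacta", "aataac"))
```

The text pointer `i` never moves backwards; only `q` is reset through the failure links, which is what keeps the number of comparisons linear in m.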
![Page 17: CS 6293 Advanced Topics: Current Bioinformatics Lecture 5 Exact String Matching Algorithms.](https://reader030.fdocuments.us/reader030/viewer/2022032800/56649d445503460f94a216e6/html5/thumbnails/17.jpg)
Failure Link Example
P: aataac
i:     1 2 3 4 5 6
P[i]:  a a t a a c
sp(i): 0 1 0 0 2 0
aaat
aataac
If a char in T fails to match at pos 6, re-compare it with the char at pos 3 (= 2 + 1)
![Page 18: CS 6293 Advanced Topics: Current Bioinformatics Lecture 5 Exact String Matching Algorithms.](https://reader030.fdocuments.us/reader030/viewer/2022032800/56649d445503460f94a216e6/html5/thumbnails/18.jpg)
Another example
P: abababc
i:     1 2 3 4 5 6 7
P[i]:  a b a b a b c
sp(i): 0 0 0 0 0 4 0
ababaababc
If a char in T fails to match at pos 7, re-compare it with the char at pos 5 (= 4 + 1)
abab
abababab
![Page 19: CS 6293 Advanced Topics: Current Bioinformatics Lecture 5 Exact String Matching Algorithms.](https://reader030.fdocuments.us/reader030/viewer/2022032800/56649d445503460f94a216e6/html5/thumbnails/19.jpg)
KMP Example using Failure Link
T: aacaataaaaataaccttacta
P: aataac
[Figure: P is slid along T using the failure links. In the trace, ^ marks a match, * a mismatch, and . an implicit comparison that is skipped.]
Time complexity analysis:
• Each char in T may be compared up to n times. A lousy analysis gives O(mn) time.
• More careful analysis: the comparisons can be broken into two phases:
  – Comparison phase: the first time a char in T is compared to P. Total is exactly m.
  – Shift phase: first comparisons made after a shift. Total is at most m.
• Time complexity: at most 2m comparisons, i.e. O(m)
![Page 20: CS 6293 Advanced Topics: Current Bioinformatics Lecture 5 Exact String Matching Algorithms.](https://reader030.fdocuments.us/reader030/viewer/2022032800/56649d445503460f94a216e6/html5/thumbnails/20.jpg)
KMP algorithm using DFA (Deterministic Finite Automata)
P: aataac
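[Figure, DFA: states 0 to 6 connected by forward edges labeled a, a, t, a, a, c, plus extra back edges labeled a and t. All other inputs go to state 0.]
If the next char in T is t after matching 5 chars, go to state 3.
[Figure, failure link: a a t a a c with a failure link from pos 6 to pos 3.]
If a char in T fails to match at pos 6, re-compare it with the char at pos 3.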
![Page 21: CS 6293 Advanced Topics: Current Bioinformatics Lecture 5 Exact String Matching Algorithms.](https://reader030.fdocuments.us/reader030/viewer/2022032800/56649d445503460f94a216e6/html5/thumbnails/21.jpg)
DFA Example
T:      aacaataataataaccttacta
States: 1201234534534560001001
Each char in T will be examined exactly once.
Therefore, exactly m comparisons are made.
But it takes longer to do the pre-processing, and more space is needed to store the DFA.
[Figure: the DFA for P = aataac, with states 0 to 6.]
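The state sequence in this example can be reproduced with a small table-driven automaton. The construction below is the common textbook one (each state copies the transitions of a "restart" state); it is offered as a sketch rather than as the exact procedure used in the lecture, and the names `build_dfa` and `run_dfa` as well as the alphabet "acgt" are assumptions.

```python
def build_dfa(pattern, alphabet):
    """delta[state][char] -> next state; state = number of pattern chars matched."""
    n = len(pattern)
    delta = [dict.fromkeys(alphabet, 0) for _ in range(n + 1)]
    delta[0][pattern[0]] = 1
    restart = 0                              # state reached after reading pattern[1:j]
    for j in range(1, n + 1):
        for c in alphabet:
            delta[j][c] = delta[restart][c]  # on a mismatch, behave like the restart state
        if j < n:
            delta[j][pattern[j]] = j + 1     # on a match, move forward
            restart = delta[restart][pattern[j]]
    return delta


def run_dfa(delta, text):
    """Return the state reached after each char of text."""
    state, states = 0, []
    for c in text:
        state = delta[state][c]              # exactly one table lookup per char
        states.append(state)
    return states


dfa = build_dfa("aataac", "acgt")
print("".join(str(s) for s in run_dfa(dfa, "aacaataataataaccttacta")))
```

The table has (n + 1) rows and one column per alphabet symbol, which is where the dependence on alphabet size comes from.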
![Page 22: CS 6293 Advanced Topics: Current Bioinformatics Lecture 5 Exact String Matching Algorithms.](https://reader030.fdocuments.us/reader030/viewer/2022032800/56649d445503460f94a216e6/html5/thumbnails/22.jpg)
Difference between Failure Link and DFA
• Failure link
  – Preprocessing time and space are O(n), regardless of alphabet size
  – Comparison time is at most 2m (at least m)
• DFA
  – Preprocessing time and space are O(n |Σ|), where Σ is the alphabet
    • May be a problem for very large alphabet sizes
    • For example, when each “char” is a big integer
    • Chinese characters
  – Comparison time is always m.
![Page 23: CS 6293 Advanced Topics: Current Bioinformatics Lecture 5 Exact String Matching Algorithms.](https://reader030.fdocuments.us/reader030/viewer/2022032800/56649d445503460f94a216e6/html5/thumbnails/23.jpg)
Boyer – Moore algorithm
• Often the algorithm of choice
  – One T and one P
  – We will talk about it later if we have time
  – In practice sub-linear
![Page 24: CS 6293 Advanced Topics: Current Bioinformatics Lecture 5 Exact String Matching Algorithms.](https://reader030.fdocuments.us/reader030/viewer/2022032800/56649d445503460f94a216e6/html5/thumbnails/24.jpg)
The set matching problem
• Find all occurrences of a set of patterns in T
• First idea: run KMP or BM for each P
  – O(km + n)
    • k: number of patterns
    • m: length of text
    • n: total length of patterns
• Better idea: combine all patterns together and search in one run
![Page 25: CS 6293 Advanced Topics: Current Bioinformatics Lecture 5 Exact String Matching Algorithms.](https://reader030.fdocuments.us/reader030/viewer/2022032800/56649d445503460f94a216e6/html5/thumbnails/25.jpg)
A simpler problem: spell-checking
• A dictionary contains five words:
  – potato
  – poetry
  – pottery
  – science
  – school
• Given a document, check if any word is (not) in the dictionary
  – Words in the document are separated by special chars.
  – Relatively easy.
![Page 26: CS 6293 Advanced Topics: Current Bioinformatics Lecture 5 Exact String Matching Algorithms.](https://reader030.fdocuments.us/reader030/viewer/2022032800/56649d445503460f94a216e6/html5/thumbnails/26.jpg)
Keyword tree for spell checking
• O(n) time to construct. n: total length of patterns.
• Search time: O(m). m: length of text.
• A common prefix only needs to be compared once.
• What if there is no space between words?
[Figure: keyword tree for potato, poetry, pottery, science, school. Words with a common prefix share a path from the root; the word ends are numbered 1 to 5.]
Example document: This version of the potato gun was inspired by the Weird Science team out of Illinois
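A keyword tree of this kind can be sketched with nested dictionaries. The representation (a `"$"` key marking the end of a word), the function names, and the lower-casing of the document are all choices made for this illustration.

```python
import re


def build_keyword_tree(words):
    """Nested-dict keyword tree; the key "$" stores the word's number."""
    root = {}
    for number, word in enumerate(words, start=1):
        node = root
        for ch in word:
            node = node.setdefault(ch, {})   # shared prefixes reuse existing nodes
        node["$"] = number
    return root


def lookup(tree, word):
    """Return the dictionary number of word, or None if it is not in the tree."""
    node = tree
    for ch in word:
        node = node.get(ch)
        if node is None:
            return None
    return node.get("$")


def check_document(tree, document):
    """Map each word of the document to its dictionary number (or None)."""
    return {w: lookup(tree, w) for w in re.findall(r"[a-z]+", document.lower())}


tree = build_keyword_tree(["potato", "poetry", "pottery", "science", "school"])
doc = "This version of the potato gun was inspired by the Weird Science team out of Illinois"
print({w: n for w, n in check_document(tree, doc).items() if n is not None})
```

Because the words are split on separators first, every lookup starts at the root; without separators, a match could start anywhere in the text, which is the case the failure links of Aho-Corasick address.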
![Page 27: CS 6293 Advanced Topics: Current Bioinformatics Lecture 5 Exact String Matching Algorithms.](https://reader030.fdocuments.us/reader030/viewer/2022032800/56649d445503460f94a216e6/html5/thumbnails/27.jpg)
Aho-Corasick algorithm
• Basis of the fgrep algorithm
• Generalizing KMP
  – Using failure links
• Example: given the following 4 patterns:
  – potato
  – tattoo
  – theater
  – other
![Page 28: CS 6293 Advanced Topics: Current Bioinformatics Lecture 5 Exact String Matching Algorithms.](https://reader030.fdocuments.us/reader030/viewer/2022032800/56649d445503460f94a216e6/html5/thumbnails/28.jpg)
Keyword tree
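[Figure: keyword tree for potato, tattoo, theater, other, with root 0 and the four pattern ends numbered 1 to 4; tattoo and theater share their first edge t.]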
![Page 29: CS 6293 Advanced Topics: Current Bioinformatics Lecture 5 Exact String Matching Algorithms.](https://reader030.fdocuments.us/reader030/viewer/2022032800/56649d445503460f94a216e6/html5/thumbnails/29.jpg)
Keyword tree
[Figure: the keyword tree for potato, tattoo, theater, other, shown with the text to be searched.]
T: potherotathxythopotattooattoo
![Page 30: CS 6293 Advanced Topics: Current Bioinformatics Lecture 5 Exact String Matching Algorithms.](https://reader030.fdocuments.us/reader030/viewer/2022032800/56649d445503460f94a216e6/html5/thumbnails/30.jpg)
Keyword tree
[Figure: the keyword tree for potato, tattoo, theater, other, searched from successive positions of the text.]
O(mn). m: length of text. n: length of longest pattern
T: potherotathxythopotattooattoo
![Page 31: CS 6293 Advanced Topics: Current Bioinformatics Lecture 5 Exact String Matching Algorithms.](https://reader030.fdocuments.us/reader030/viewer/2022032800/56649d445503460f94a216e6/html5/thumbnails/31.jpg)
Keyword Tree with a failure link
[Figure: the keyword tree for potato, tattoo, theater, other, with one failure link drawn.]
T: potherotathxythopotattooattoo
![Page 33: CS 6293 Advanced Topics: Current Bioinformatics Lecture 5 Exact String Matching Algorithms.](https://reader030.fdocuments.us/reader030/viewer/2022032800/56649d445503460f94a216e6/html5/thumbnails/33.jpg)
Keyword Tree with all failure links
[Figure: the keyword tree for potato, tattoo, theater, other, with all failure links drawn.]
![Page 34: CS 6293 Advanced Topics: Current Bioinformatics Lecture 5 Exact String Matching Algorithms.](https://reader030.fdocuments.us/reader030/viewer/2022032800/56649d445503460f94a216e6/html5/thumbnails/34.jpg)
Example
[Figure: step-by-step search of the text through the keyword tree for potato, tattoo, theater, other, following failure links on mismatches.]
T: potherotathxythopotattooattoo
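The search in this example can be sketched as follows. This is a compact, commonly used formulation of Aho-Corasick (keyword tree, failure links built breadth-first, and output lists merged along failure links); the function names and the list-of-dicts layout are choices made here, and reported positions are 0-based.

```python
from collections import deque


def build_automaton(patterns):
    goto = [{}]        # goto[node][char] -> child node; node 0 is the root
    fail = [0]         # failure link of each node
    out = [[]]         # patterns recognized when a node is reached
    for pat in patterns:
        node = 0
        for ch in pat:
            if ch not in goto[node]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[node][ch] = len(goto) - 1
            node = goto[node][ch]
        out[node].append(pat)
    queue = deque(goto[0].values())          # depth-1 nodes keep fail = root
    while queue:                             # breadth-first over the tree
        node = queue.popleft()
        for ch, child in goto[node].items():
            f = fail[node]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[child] = goto[f].get(ch, 0)
            out[child] = out[child] + out[fail[child]]
            queue.append(child)
    return goto, fail, out


def search(text, patterns):
    """Return (0-based start, pattern) for every occurrence, in order of match end."""
    goto, fail, out = build_automaton(patterns)
    node, hits = 0, []
    for i, ch in enumerate(text):
        while node and ch not in goto[node]:
            node = fail[node]                # follow failure links on a mismatch
        node = goto[node].get(ch, 0)
        for pat in out[node]:
            hits.append((i - len(pat) + 1, pat))
    return hits


print(search("potherotathxythopotattooattoo", ["potato", "tattoo", "theater", "other"]))
```

In the example text, only other and tattoo occur; the partial matches of potato and theater are abandoned through failure links without re-reading any text char.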
![Page 39: CS 6293 Advanced Topics: Current Bioinformatics Lecture 5 Exact String Matching Algorithms.](https://reader030.fdocuments.us/reader030/viewer/2022032800/56649d445503460f94a216e6/html5/thumbnails/39.jpg)
Aho-Corasick algorithm
• O(n) preprocessing, and O(m+k) searching.
  – n: total length of patterns
  – m: length of text
  – k: number of occurrences
• Can create a DFA similar to the one in KMP.
  – Requires more space
  – Preprocessing time depends on alphabet size
  – Search time per text char is constant
• Q: Where can this algorithm be used in previous topics?
• A: BLAST
  – Given a query sequence, we generate many seed sequences (k-mers)
  – Search for exact matches to these seed sequences
  – Extend exact matches into longer inexact matches
![Page 40: CS 6293 Advanced Topics: Current Bioinformatics Lecture 5 Exact String Matching Algorithms.](https://reader030.fdocuments.us/reader030/viewer/2022032800/56649d445503460f94a216e6/html5/thumbnails/40.jpg)
Suffix Tree
• All algorithms we talked about so far preprocess the pattern(s)
  – Boyer-Moore: fastest in practice. O(m) worst case.
  – KMP: O(m)
  – Aho-Corasick: O(m)
• In some cases we may prefer to pre-process T
  – Fixed T, varying P
• Suffix tree: basically a keyword tree of all suffixes
![Page 41: CS 6293 Advanced Topics: Current Bioinformatics Lecture 5 Exact String Matching Algorithms.](https://reader030.fdocuments.us/reader030/viewer/2022032800/56649d445503460f94a216e6/html5/thumbnails/41.jpg)
Suffix tree
• T: xabxac
• Suffixes:
  1. xabxac
  2. abxac
  3. bxac
  4. xac
  5. ac
  6. c
[Figure: suffix tree of xabxac, with leaves numbered 1 to 6 by the start position of their suffix.]
Naïve construction: O(m^2) using Aho-Corasick.
Smarter: O(m). Very technical. Big constant factor.
Difference from a keyword tree: create an internal node only when there is a branch
![Page 42: CS 6293 Advanced Topics: Current Bioinformatics Lecture 5 Exact String Matching Algorithms.](https://reader030.fdocuments.us/reader030/viewer/2022032800/56649d445503460f94a216e6/html5/thumbnails/42.jpg)
Suffix tree implementation
• Explicitly labeling sequence end
• T: xabxa$
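[Figure: suffix tree of xabxa drawn without and with the terminator $. Without $, only suffixes 1, 2, and 3 end at leaves; with $, suffixes 4 and 5 get leaves of their own.]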
• One-to-one correspondence of leaves and suffixes
• |T| leaves, hence < |T| internal nodes
![Page 43: CS 6293 Advanced Topics: Current Bioinformatics Lecture 5 Exact String Matching Algorithms.](https://reader030.fdocuments.us/reader030/viewer/2022032800/56649d445503460f94a216e6/html5/thumbnails/43.jpg)
Suffix tree implementation
• Implicitly labeling edges
• T: xabxa$
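[Figure: suffix tree of xabxa$ drawn twice: once with edge labels spelled out, and once with each label replaced by a pair of positions in T, such as 1:2, 2:2, and 3:$.]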
• |Tree(T)| = O(|T| + size(edge labels))
![Page 44: CS 6293 Advanced Topics: Current Bioinformatics Lecture 5 Exact String Matching Algorithms.](https://reader030.fdocuments.us/reader030/viewer/2022032800/56649d445503460f94a216e6/html5/thumbnails/44.jpg)
Suffix links
• Similar to failure link in a keyword tree
• Only link internal nodes having branches
[Figure: suffix links drawn between two paths of a tree, one labeled x a b c d e f g h i j and one labeled a b c d e f g h i j.]
P: xabcff
![Page 45: CS 6293 Advanced Topics: Current Bioinformatics Lecture 5 Exact String Matching Algorithms.](https://reader030.fdocuments.us/reader030/viewer/2022032800/56649d445503460f94a216e6/html5/thumbnails/45.jpg)
ST Application 1: pattern matching
• Find all occurrences of P = xa in T
  – Find the node v in the ST that matches P
  – Traverse the subtree rooted at v to get the locations
[Figure: suffix tree of xabxac with leaves numbered 1 to 6.]
T: xabxac
• O(m) to construct ST (large constant factor)
• O(n) to find v – linear in the length of P instead of T!
• O(k) to get all leaves, where k is the number of occurrences.
• Asymptotic time is the same as KMP. ST wins if T is fixed. KMP wins otherwise.
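The two steps (walk down to v, then collect the leaves below v) can be illustrated with an uncompressed keyword tree of all suffixes. Note the substitution: this suffix trie takes O(m^2) time and space to build, unlike the O(m) suffix tree discussed here, so it is a sketch for small inputs only. The names and the `"leaf"` key are assumptions of this example.

```python
def build_suffix_trie(text):
    """Uncompressed keyword tree of all suffixes of text + "$" (quadratic size)."""
    text += "$"
    root = {}
    for start in range(len(text) - 1):       # skip the suffix consisting of "$" only
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
        node["leaf"] = start + 1             # 1-based suffix number, as on the slides
    return root


def occurrences(trie, pattern):
    """1-based start positions of pattern, found by walking down and collecting leaves."""
    node = trie
    for ch in pattern:                       # step 1: find the node v that matches P
        if ch not in node:
            return []
        node = node[ch]
    found, stack = [], [node]
    while stack:                             # step 2: traverse the subtree rooted at v
        current = stack.pop()
        for key, child in current.items():
            if key == "leaf":
                found.append(child)
            else:
                stack.append(child)
    return sorted(found)


trie = build_suffix_trie("xabxac")
print(occurrences(trie, "xa"))
```

Once the structure is built, each query costs time proportional to the pattern length plus the size of the subtree below v, independent of where in T the pattern occurs.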
![Page 46: CS 6293 Advanced Topics: Current Bioinformatics Lecture 5 Exact String Matching Algorithms.](https://reader030.fdocuments.us/reader030/viewer/2022032800/56649d445503460f94a216e6/html5/thumbnails/46.jpg)
ST Application 2: set matching
• Find all occurrences of a set of patterns in T
  – Build a ST from T
  – Match each P to the ST
[Figure: suffix tree of xabxac with leaves numbered 1 to 6.]
T: xabxac
P: xab
• O(m) to construct ST (large constant factor)
• O(n) to find v – linear in the total length of the P’s
• O(k) to get all leaves, where k is the number of occurrences.
• Asymptotic time is the same as Aho-Corasick. ST wins if T is fixed. AC wins if the P’s are fixed. Otherwise it depends on their relative sizes.
![Page 47: CS 6293 Advanced Topics: Current Bioinformatics Lecture 5 Exact String Matching Algorithms.](https://reader030.fdocuments.us/reader030/viewer/2022032800/56649d445503460f94a216e6/html5/thumbnails/47.jpg)
ST application 3: repeats finding
• Genome contains many repeated DNA sequences
• Repeat sequence length: varies from 1 nucleotide to millions
  – Genes may have multiple copies (50 to 10,000)
  – Highly repetitive DNA in some non-coding regions
    • 6 to 10 bp × 100,000 to 1,000,000 times
• Problem: find all repeats that are at least k-residues long and appear at least p times in the genome
![Page 48: CS 6293 Advanced Topics: Current Bioinformatics Lecture 5 Exact String Matching Algorithms.](https://reader030.fdocuments.us/reader030/viewer/2022032800/56649d445503460f94a216e6/html5/thumbnails/48.jpg)
Repeats finding
• At least k residues long and appearing at least p times in the seq
  – Phase 1: top-down, count label lengths (L) from the root to each node
  – Phase 2: bottom-up, count the # of leaves (N) descended from each internal node
[Figure: each internal node annotated with (L, N).]
For each node with L >= k and N >= p, print all leaves
O(m) to traverse tree
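The (L, N) bookkeeping can be sketched on an uncompressed suffix trie, reporting only nodes with at least two children so that the result corresponds to internal nodes of a suffix tree. This is an illustration under stated assumptions: the trie is quadratic in size, the traversal is recursive (so suitable for short strings only), and `find_repeats` is a name chosen here.

```python
def build_suffix_trie(text):
    """Uncompressed keyword tree of all suffixes of text + "$"."""
    text += "$"
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
        node["leaf"] = start + 1                 # 1-based suffix number
    return root


def find_repeats(text, k, p):
    """Map each branching substring with L >= k and N >= p to its start positions."""
    found = {}

    def visit(node, label):
        leaves, children = [], 0
        for key, child in node.items():
            if key == "leaf":
                leaves.append(child)
            else:
                children += 1
                leaves.extend(visit(child, label + key))   # bottom-up leaf collection
        # L is len(label); N is len(leaves)
        if children >= 2 and len(label) >= k and len(leaves) >= p:
            found[label] = sorted(leaves)
        return leaves

    visit(build_suffix_trie(text), "")
    return found


print(find_repeats("acatgacatt", 3, 2))
```

On the string acatgacatt with k = 3 and p = 2, the branching nodes that qualify are cat and acat.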
![Page 49: CS 6293 Advanced Topics: Current Bioinformatics Lecture 5 Exact String Matching Algorithms.](https://reader030.fdocuments.us/reader030/viewer/2022032800/56649d445503460f94a216e6/html5/thumbnails/49.jpg)
Maximal repeats finding
1. Right-maximal repeat
  – S[i+1..i+k] = S[j+1..j+k],
  – but S[i+k+1] != S[j+k+1]
2. Left-maximal repeat
  – S[i+1..i+k] = S[j+1..j+k]
  – but S[i] != S[j]
3. Maximal repeat
  – S[i+1..i+k] = S[j+1..j+k]
  – but S[i] != S[j], and S[i+k+1] != S[j+k+1]
Example: acatgacatt
1. cat
2. aca
3. acat
![Page 50: CS 6293 Advanced Topics: Current Bioinformatics Lecture 5 Exact String Matching Algorithms.](https://reader030.fdocuments.us/reader030/viewer/2022032800/56649d445503460f94a216e6/html5/thumbnails/50.jpg)
Maximal repeats finding
• Find repeats with at least 3 bases and 2 occurrences
  – Right-maximal: cat
  – Maximal: acat
  – Left-maximal: aca
[Figure: suffix tree of acatgacatt$ (positions 1234567890) with leaves numbered 1 to 10; several leaf edges carry the label 5:e.]
![Page 51: CS 6293 Advanced Topics: Current Bioinformatics Lecture 5 Exact String Matching Algorithms.](https://reader030.fdocuments.us/reader030/viewer/2022032800/56649d445503460f94a216e6/html5/thumbnails/51.jpg)
Maximal repeats finding
• How to find a maximal repeat?
  – A right-maximal repeat with different left chars
[Figure: the suffix tree of acatgacatt$ (positions 1234567890), with the left char of each suffix written under its leaf.]
Left char = [] g c c a a
![Page 52: CS 6293 Advanced Topics: Current Bioinformatics Lecture 5 Exact String Matching Algorithms.](https://reader030.fdocuments.us/reader030/viewer/2022032800/56649d445503460f94a216e6/html5/thumbnails/52.jpg)
ST application 4: word enumeration
• Find all k-mers that occur at least p times
  – Compute (L, N) for each node
    • L: total label length from root to node
    • N: # leaves
  – Find nodes v with L >= k, L(parent) < k, and N >= p
  – Traverse the sub-tree rooted at v to get the locations
[Figure: a path from the root crossing depth k; the parent of v has L < k, and v has L >= k and N >= p.]
This can be used in many applications. For example, to find words that appeared frequently in a genome or a document
![Page 53: CS 6293 Advanced Topics: Current Bioinformatics Lecture 5 Exact String Matching Algorithms.](https://reader030.fdocuments.us/reader030/viewer/2022032800/56649d445503460f94a216e6/html5/thumbnails/53.jpg)
Joint Suffix Tree (JST)
• Build a ST for two or more strings
• Two strings S1 and S2
• S* = S1 & S2
• Build a suffix tree for S* in time O(|S1| + |S2|)
• The separator will only appear in the edge ending in a leaf (why?)
![Page 54: CS 6293 Advanced Topics: Current Bioinformatics Lecture 5 Exact String Matching Algorithms.](https://reader030.fdocuments.us/reader030/viewer/2022032800/56649d445503460f94a216e6/html5/thumbnails/54.jpg)
Joint suffix tree example
• S1 = abcd
• S2 = abca
• S* = abcd&abca$
[Figure: suffix tree of S*. Each leaf is labeled (Seq ID, Suffix ID): 1,1; 1,2; 1,3; 1,4; 2,1; 2,2; 2,3; 2,4. The leaf (2, 0), for the suffix that starts at the separator &, is useless.]
![Page 55: CS 6293 Advanced Topics: Current Bioinformatics Lecture 5 Exact String Matching Algorithms.](https://reader030.fdocuments.us/reader030/viewer/2022032800/56649d445503460f94a216e6/html5/thumbnails/55.jpg)
To Simplify
• We don’t really need to do anything, since all edge labels were implicit.
• The right hand side is more convenient to look at
[Figure: left, the full joint suffix tree of abcd&abca$; right, the same tree with every leaf edge cut at the separator & and the useless leaf removed. Leaves are labeled 1,1 to 1,4 and 2,1 to 2,4.]
![Page 56: CS 6293 Advanced Topics: Current Bioinformatics Lecture 5 Exact String Matching Algorithms.](https://reader030.fdocuments.us/reader030/viewer/2022032800/56649d445503460f94a216e6/html5/thumbnails/56.jpg)
Application 1 of JST
• Longest common substring between two sequences
• Using Smith-Waterman
  – Gap = mismatch = -infinity
  – Quadratic time
• Using JST
  – Linear time
  – For each internal node v, keep a bit vector B
  – B[i] = 1 if some leaf below v is a suffix of Si
  – Bottom-up: find all internal nodes with B[1] = B[2] = 1 (green nodes)
  – Report a green node with the longest label
  – Can be extended to k sequences. Just use a bit vector of size k.
[Figure: simplified JST of abcd and abca, with leaves 1,1 to 1,4 and 2,1 to 2,4.]
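The idea of marking nodes by the sequences that reach them can be sketched with a joint suffix trie. Two simplifications are assumed here: the trie is uncompressed (quadratic, not linear, in the input size), and each node's set of sequence IDs is filled in while the suffixes are inserted rather than in a separate bottom-up pass; the resulting marks are the same. The function name and the `"ids"` key are choices made for this sketch.

```python
def longest_common_substring(s1, s2):
    """Return one longest substring shared by s1 and s2 (empty string if none)."""
    root = {"ids": set()}
    for seq_id, s in ((1, s1), (2, s2)):
        for start in range(len(s)):              # insert every suffix of each string
            node = root
            for ch in s[start:]:
                node = node.setdefault(ch, {"ids": set()})
                node["ids"].add(seq_id)          # plays the role of the bit vector B
    best = ""
    stack = [(root, "")]
    while stack:
        node, label = stack.pop()
        for key, child in node.items():
            if key == "ids":
                continue
            if child["ids"] == {1, 2}:           # a "green" node: reached by both strings
                if len(label) + 1 > len(best):
                    best = label + key
                stack.append((child, label + key))
    return best


print(longest_common_substring("abcd", "abca"))
```

Only green nodes are explored further, because any node below a non-green node is also non-green.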
![Page 57: CS 6293 Advanced Topics: Current Bioinformatics Lecture 5 Exact String Matching Algorithms.](https://reader030.fdocuments.us/reader030/viewer/2022032800/56649d445503460f94a216e6/html5/thumbnails/57.jpg)
Application 2 of JST
• Given K strings, find all k-mers that appear in at least (or at most) d strings
• Exact motif finding problem
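[Figure: a node with L >= k whose parent has L < k. Its two children carry B = 1010 (leaves 1,x and 3,x) and B = 0011 (leaves 3,x and 4,x), so at the node B = BitOR(1010, 0011) = 1011 and cardinal(B) = 3. The node is reported when cardinal(B) >= 3.]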
![Page 58: CS 6293 Advanced Topics: Current Bioinformatics Lecture 5 Exact String Matching Algorithms.](https://reader030.fdocuments.us/reader030/viewer/2022032800/56649d445503460f94a216e6/html5/thumbnails/58.jpg)
Application 3 of JST
• Substring problem for sequence databases
  – Given: a fixed database of sequences (e.g., individual genomes)
  – Given: a short pattern (e.g., DNA signature)
  – Q: Does this DNA signature belong to any individual in the database?
    • i.e. the pattern is a substring of some sequences in the database
    • Aho-Corasick doesn’t work
  – This can also be used to design signatures for individuals
• Build a JST for the database seqs
• Match P to the JST
• Find seq IDs from descendants
[Figure: simplified JST of abcd and abca, with leaves 1,1 to 1,4 and 2,1 to 2,4.]
Seqs: abcd, abca
P1: cd
P2: bc
![Page 59: CS 6293 Advanced Topics: Current Bioinformatics Lecture 5 Exact String Matching Algorithms.](https://reader030.fdocuments.us/reader030/viewer/2022032800/56649d445503460f94a216e6/html5/thumbnails/59.jpg)
Application 4 of JST
• Detect DNA contamination
  – For some reason, when we try to clone and sequence a genome, some DNA from other sources may contaminate our sample; it should be detected and removed
  – Given: a fixed database of sequences (e.g., possible contamination sources)
  – Given: a DNA just sequenced (e.g., DNA signature)
  – Q: Does this DNA contain a long enough substring from the seqs in the database?
• Build a JST for the database seqs
• Scan T using the JST
[Figure: simplified JST of abcd and abca, with leaves 1,1 to 1,4 and 2,1 to 2,4.]
Contamination sources: abcd, abca
Sequence: dbcgaabctacgtctagt
![Page 60: CS 6293 Advanced Topics: Current Bioinformatics Lecture 5 Exact String Matching Algorithms.](https://reader030.fdocuments.us/reader030/viewer/2022032800/56649d445503460f94a216e6/html5/thumbnails/60.jpg)
Suffix Tree Memory Footprint
• The space requirements of suffix trees can become prohibitive
  – |Tree(T)| is about 20|T| in practice
• Suffix arrays provide one solution.
![Page 61: CS 6293 Advanced Topics: Current Bioinformatics Lecture 5 Exact String Matching Algorithms.](https://reader030.fdocuments.us/reader030/viewer/2022032800/56649d445503460f94a216e6/html5/thumbnails/61.jpg)
Suffix Arrays
• Very space efficient (m integers)
• Pattern lookup is nearly O(n) in practice
  – O(n + log_2 m) worst case with 2m additional integers
  – Independent of alphabet size!
• Easiest to describe (and construct) using suffix trees
  – Other (slower) methods exist
Suffixes of xabxa:
  1. xabxa
  2. abxa
  3. bxa
  4. xa
  5. a
[Figure: suffix tree of xabxa$ with its leaves in lexical order 5, 2, 3, 4, 1.]
Pos:    5    2      3     4    1
Suffix: a$   abxa$  bxa$  xa$  xabxa$
![Page 62: CS 6293 Advanced Topics: Current Bioinformatics Lecture 5 Exact String Matching Algorithms.](https://reader030.fdocuments.us/reader030/viewer/2022032800/56649d445503460f94a216e6/html5/thumbnails/62.jpg)
Suffix array construction
• Build suffix tree for T$
• Perform “lexical” depth-first search of suffix tree
  – Output the suffix label of each leaf encountered
• Therefore suffix array can be constructed in O(m) time.
![Page 63: CS 6293 Advanced Topics: Current Bioinformatics Lecture 5 Exact String Matching Algorithms.](https://reader030.fdocuments.us/reader030/viewer/2022032800/56649d445503460f94a216e6/html5/thumbnails/63.jpg)
Suffix array pattern search
• If P is in T, then all the locations of P are consecutive suffixes in Pos.
• Do binary search in Pos to find P!
  – Compare P with suffix Pos(m/2)
  – If lexicographically less, P is in the first half of Pos
  – If lexicographically more, P is in the second half of Pos
  – Iterate!
![Page 64: CS 6293 Advanced Topics: Current Bioinformatics Lecture 5 Exact String Matching Algorithms.](https://reader030.fdocuments.us/reader030/viewer/2022032800/56649d445503460f94a216e6/html5/thumbnails/64.jpg)
Suffix array pattern search
• T: xabxa$
• P: abx
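[Figure: suffix tree of xabxa$ with its leaves in lexical order.]
Pos:    5    2      3     4    1
Suffix: a$   abxa$  bxa$  xa$  xabxa$
[Figure: binary search for P = abx over Pos, with pointers L, M, and R.]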
![Page 65: CS 6293 Advanced Topics: Current Bioinformatics Lecture 5 Exact String Matching Algorithms.](https://reader030.fdocuments.us/reader030/viewer/2022032800/56649d445503460f94a216e6/html5/thumbnails/65.jpg)
Suffix array binary search
• How long does it take to compare P with a suffix of T?
  – O(n) worst case!
• Binary search on Pos takes O(n log m) time
• Worst case will be rare
  – Occurs if many long prefixes of P appear in T
• In random or large-alphabet strings
  – Expect to do less than log m comparisons
• O(n + log m) running time when combined with an LCP table
  – Suffix tree = suffix array + LCP table
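The plain O(n log m) binary search can be sketched as below. Two assumptions are made for brevity: the array is built by sorting the suffixes directly (simple, but slower than the O(m) construction through a suffix tree), and the text is expected to end with the terminator $. Function names are chosen here; positions are converted to the 1-based numbering of the slides on output.

```python
def build_suffix_array(text):
    """0-based suffix starts in lexical order; text is assumed to end with "$"."""
    # The suffix consisting of "$" alone is left out, as in the slide's Pos.
    return sorted(range(len(text) - 1), key=lambda i: text[i:])


def find_occurrences(text, pos, pattern):
    """1-based start positions of pattern, via two binary searches over pos."""
    n = len(pattern)
    lo, hi = 0, len(pos)
    while lo < hi:                           # first suffix whose n-char prefix >= pattern
        mid = (lo + hi) // 2
        if text[pos[mid]:pos[mid] + n] < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    hi = len(pos)
    while lo < hi:                           # first suffix whose n-char prefix > pattern
        mid = (lo + hi) // 2
        if text[pos[mid]:pos[mid] + n] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(p + 1 for p in pos[first:lo])


text = "xabxa$"
pos = build_suffix_array(text)
print([p + 1 for p in pos])
print(find_occurrences(text, pos, "abx"))
```

All suffixes that start with P sit next to each other in Pos, so the two searches delimit exactly the block of occurrences.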
![Page 66: CS 6293 Advanced Topics: Current Bioinformatics Lecture 5 Exact String Matching Algorithms.](https://reader030.fdocuments.us/reader030/viewer/2022032800/56649d445503460f94a216e6/html5/thumbnails/66.jpg)
Summary
• One T, one P
  – Boyer-Moore is the choice
  – KMP works but not the best
• One T, many P
  – Aho-Corasick
  – Suffix tree (array)
• One fixed T, many varying P
  – Suffix tree (array)
• Two or more T’s
  – Suffix tree, joint suffix tree
Alphabet independent
Alphabet dependent
![Page 67: CS 6293 Advanced Topics: Current Bioinformatics Lecture 5 Exact String Matching Algorithms.](https://reader030.fdocuments.us/reader030/viewer/2022032800/56649d445503460f94a216e6/html5/thumbnails/67.jpg)
Boyer – Moore algorithm
• Three ideas:
  – Right-to-left comparison
  – Bad character rule
  – Good suffix rule
![Page 68: CS 6293 Advanced Topics: Current Bioinformatics Lecture 5 Exact String Matching Algorithms.](https://reader030.fdocuments.us/reader030/viewer/2022032800/56649d445503460f94a216e6/html5/thumbnails/68.jpg)
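Boyer-Moore algorithm
• Right to left comparison
[Figure: P aligned under T at two successive positions, with character x in T over character y in P. Labels: "Skip some chars without missing any occurrence." and "Resume comparison here".]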
![Page 69: CS 6293 Advanced Topics: Current Bioinformatics Lecture 5 Exact String Matching Algorithms.](https://reader030.fdocuments.us/reader030/viewer/2022032800/56649d445503460f94a216e6/html5/thumbnails/69.jpg)
Bad character rule
   0        1
   12345678901234567
T: xpbctbxabpqqaabpq
P:   tpabxab
       *^^^^
What would you do now?
![Page 70: CS 6293 Advanced Topics: Current Bioinformatics Lecture 5 Exact String Matching Algorithms.](https://reader030.fdocuments.us/reader030/viewer/2022032800/56649d445503460f94a216e6/html5/thumbnails/70.jpg)
Bad character rule
   0        1
   12345678901234567
T: xpbctbxabpqqaabpq
P:   tpabxab
       *^^^^
P:     tpabxab
![Page 71: CS 6293 Advanced Topics: Current Bioinformatics Lecture 5 Exact String Matching Algorithms.](https://reader030.fdocuments.us/reader030/viewer/2022032800/56649d445503460f94a216e6/html5/thumbnails/71.jpg)
Bad character rule
   0        1
   123456789012345678
T: xpbctbxabpqqaabpqz
P:   tpabxab
       *^^^^
P:     tpabxab
             *
P:            tpabxab
![Page 72: CS 6293 Advanced Topics: Current Bioinformatics Lecture 5 Exact String Matching Algorithms.](https://reader030.fdocuments.us/reader030/viewer/2022032800/56649d445503460f94a216e6/html5/thumbnails/72.jpg)
Basic bad character rule
char Right-most-position in P
a 6
b 7
p 2
t 1
x 5
P: tpabxab
Pre-processing: O(n)
![Page 73: CS 6293 Advanced Topics: Current Bioinformatics Lecture 5 Exact String Matching Algorithms.](https://reader030.fdocuments.us/reader030/viewer/2022032800/56649d445503460f94a216e6/html5/thumbnails/73.jpg)
Basic bad character rule
char Right-most-position in P
a 6
b 7
p 2
t 1
x 5
T: xpbctbxabpqqaabpqz
P:   tpabxab
       *^^^^
P:     tpabxab
When the rightmost occurrence of T(k) in P is to the left of i, shift P to align T(k) with that rightmost occurrence.
k marks the mismatched position in T (the *); i = 3. Shift 3 – 1 = 2
![Page 74: CS 6293 Advanced Topics: Current Bioinformatics Lecture 5 Exact String Matching Algorithms.](https://reader030.fdocuments.us/reader030/viewer/2022032800/56649d445503460f94a216e6/html5/thumbnails/74.jpg)
Basic bad character rule
char Right-most-position in P
a 6
b 7
p 2
t 1
x 5
T: xpbctbxabpqqaabpqz
P:     tpabxab
             *
P:            tpabxab
When T(k) is not in P, shift the left end of P to align with T(k+1).
k marks the mismatched position in T (the *); i = 7. Shift 7 – 0 = 7
![Page 75: CS 6293 Advanced Topics: Current Bioinformatics Lecture 5 Exact String Matching Algorithms.](https://reader030.fdocuments.us/reader030/viewer/2022032800/56649d445503460f94a216e6/html5/thumbnails/75.jpg)
Basic bad character rule
char Right-most-position in P
a 6
b 7
p 2
t 1
x 5
T: xpbctbxabpqqaabpqz
P:         tpabxab
               *^^
P:          tpabxab
When the rightmost occurrence of T(k) in P is to the right of i, shift P by 1.
k marks the mismatched position in T (the *); i = 5. 5 – 6 < 0, so shift 1
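The three cases of the basic bad character rule reduce to one formula, shift = max(1, i – R(T(k))), where R gives the right-most position of a character in P and is 0 for characters not in P. A small sketch, with the hypothetical helper names `rightmost_positions` and `basic_bad_char_shift` and 1-based positions as on the slides:

```python
def rightmost_positions(pattern):
    """Map each character of pattern to its right-most 1-based position."""
    # Later occurrences overwrite earlier ones, so the right-most one wins.
    return {ch: idx for idx, ch in enumerate(pattern, start=1)}


def basic_bad_char_shift(rightmost, i, bad_char):
    """Shift for a mismatch at pattern position i against text char bad_char."""
    # Characters absent from the pattern count as position 0.
    return max(1, i - rightmost.get(bad_char, 0))


R = rightmost_positions("tpabxab")
print(R)                                # {'t': 1, 'p': 2, 'a': 6, 'b': 7, 'x': 5}
print(basic_bad_char_shift(R, 3, "t"))  # 2
print(basic_bad_char_shift(R, 7, "q"))  # 7
print(basic_bad_char_shift(R, 5, "a"))  # 1
```

The three calls correspond to the three slide examples: T(k) to the left of i, T(k) not in P, and T(k) to the right of i.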
![Page 76: CS 6293 Advanced Topics: Current Bioinformatics Lecture 5 Exact String Matching Algorithms.](https://reader030.fdocuments.us/reader030/viewer/2022032800/56649d445503460f94a216e6/html5/thumbnails/76.jpg)
Extended bad character rule
char Position in P
a 6, 3
b 7, 4
p 2
t 1
x 5
T: xpbctbxabpqqaabpqz
P:         tpabxab
               *^^
P:           tpabxab
Find the occurrence of T(k) in P that is closest to the left of i, and shift P to align T(k) with that position.
k marks the mismatched position in T (the *); i = 5. 5 – 3 = 2, so shift 2
Preprocessing still O(n)
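The extended rule needs, for each character, all of its positions in P rather than only the right-most one. A minimal sketch, assuming per-character lists kept in decreasing order and scanned linearly; the names `position_lists` and `extended_bad_char_shift` are illustrative only:

```python
def position_lists(pattern):
    """Map each character to its 1-based positions in pattern, right to left."""
    positions = {}
    for idx in range(len(pattern), 0, -1):
        positions.setdefault(pattern[idx - 1], []).append(idx)
    return positions


def extended_bad_char_shift(positions, i, bad_char):
    """Shift for a mismatch at pattern position i against text char bad_char."""
    # The list is in decreasing order, so the first entry below i is the
    # occurrence closest to the left of the mismatch.
    for p in positions.get(bad_char, []):
        if p < i:
            return i - p
    # No occurrence to the left of i: move P completely past T(k).
    return i


lists = position_lists("tpabxab")
print(lists["a"], lists["b"])                   # [6, 3] [7, 4]
print(extended_bad_char_shift(lists, 5, "a"))   # 2
```

For the same mismatch (i = 5, T(k) = a) the basic rule only shifts by 1.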
![Page 77: CS 6293 Advanced Topics: Current Bioinformatics Lecture 5 Exact String Matching Algorithms.](https://reader030.fdocuments.us/reader030/viewer/2022032800/56649d445503460f94a216e6/html5/thumbnails/77.jpg)
Extended bad character rule
• Best possible: m / n comparisons
• Works better for large alphabet size
• In some cases the extended bad character rule is sufficiently good
• Worst-case: O(mn)
  – Expected time is sublinear
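To make the best-case and worst-case figures concrete, here is a sketch of a search loop that uses only right-to-left comparison and the extended bad character rule, and counts character comparisons. It is an illustration under those assumptions (no good suffix rule, non-empty pattern); `bm_bad_char_search` is a made-up name, and match positions are 1-based.

```python
def bm_bad_char_search(text, pattern):
    """Return (1-based match positions, number of character comparisons)."""
    n, m = len(pattern), len(text)

    # Positions of each character in the pattern, right to left.
    positions = {}
    for idx in range(n, 0, -1):
        positions.setdefault(pattern[idx - 1], []).append(idx)

    matches, comparisons = [], 0
    start = 0  # 0-based text index aligned with P(1)
    while start + n <= m:
        i = n  # compare right to left
        while i >= 1:
            comparisons += 1
            if pattern[i - 1] != text[start + i - 1]:
                break
            i -= 1
        if i == 0:
            matches.append(start + 1)
            start += 1  # no good suffix rule in this sketch: shift by one
        else:
            bad = text[start + i - 1]
            shift = i  # default: bad char does not occur left of i
            for p in positions.get(bad, []):
                if p < i:
                    shift = i - p
                    break
            start += shift
    return matches, comparisons


print(bm_bad_char_search("b" * 20, "a" * 5))     # ([], 4): m / n comparisons
print(bm_bad_char_search("a" * 20, "a" * 5)[1])  # 80: on the order of m * n
```

With T = b^20 and P = a^5 every alignment fails on its first comparison and P jumps by its full length, which gives the m / n best case; with T = a^20 every alignment is a full match, which gives the O(mn) behavior.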
![Page 78: CS 6293 Advanced Topics: Current Bioinformatics Lecture 5 Exact String Matching Algorithms.](https://reader030.fdocuments.us/reader030/viewer/2022032800/56649d445503460f94a216e6/html5/thumbnails/78.jpg)
   0        1
   123456789012345678
T: prstabstubabvqxrst
P:   qcabdabdab
            *^^
P:    qcabdabdab
According to extended bad character rule
![Page 79: CS 6293 Advanced Topics: Current Bioinformatics Lecture 5 Exact String Matching Algorithms.](https://reader030.fdocuments.us/reader030/viewer/2022032800/56649d445503460f94a216e6/html5/thumbnails/79.jpg)
(weak) good suffix rule
   0        1
   123456789012345678
T: prstabstubabvqxrst
P:   qcabdabdab
            *^^
P:      qcabdabdab
![Page 80: CS 6293 Advanced Topics: Current Bioinformatics Lecture 5 Exact String Matching Algorithms.](https://reader030.fdocuments.us/reader030/viewer/2022032800/56649d445503460f94a216e6/html5/thumbnails/80.jpg)
(Weak) good suffix rule
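[Figure: P aligned under T at two successive positions. A suffix t of P matches T; to its left, x in T mismatches y in P. After the shift, the copy t’ in P is aligned with t in T.]
Preprocessing: For any suffix t of P, find the rightmost other copy of t in P, denoted by t’. How to find t’ efficiently?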
![Page 81: CS 6293 Advanced Topics: Current Bioinformatics Lecture 5 Exact String Matching Algorithms.](https://reader030.fdocuments.us/reader030/viewer/2022032800/56649d445503460f94a216e6/html5/thumbnails/81.jpg)
(Strong) good suffix rule
   0        1
   123456789012345678
T: prstabstubabvqxrst
P:   qcabdabdab
            *^^
![Page 82: CS 6293 Advanced Topics: Current Bioinformatics Lecture 5 Exact String Matching Algorithms.](https://reader030.fdocuments.us/reader030/viewer/2022032800/56649d445503460f94a216e6/html5/thumbnails/82.jpg)
(Strong) good suffix rule
   0        1
   123456789012345678
T: prstabstubabvqxrst
P:   qcabdabdab
            *^^
P:         qcabdabdab
![Page 83: CS 6293 Advanced Topics: Current Bioinformatics Lecture 5 Exact String Matching Algorithms.](https://reader030.fdocuments.us/reader030/viewer/2022032800/56649d445503460f94a216e6/html5/thumbnails/83.jpg)
(Strong) good suffix rule
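[Figure: P aligned under T at two successive positions. A suffix t of P matches T; to its left, x in T mismatches y in P. After the shift, the copy t’ in P is aligned with t in T, and the character z to the left of t’ satisfies z ≠ y.]
In preprocessing: For any suffix t of P, find the rightmost other copy of t in P, t’, such that the char to the left of t’ differs from the char to the left of t.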
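The difference between the weak and strong rules can be checked with a naive preprocessing sketch. It scans P directly in quadratic time, so it is not the linear-time method, and it covers only the case described on the slides: when no suitable copy t’ exists it returns None instead of applying a fallback. The name `good_suffix_shift` and its `strong` flag are assumptions of this sketch; positions are 1-based.

```python
def good_suffix_shift(pattern, i, strong=True):
    """Shift after a mismatch at pattern position i (suffix t = P[i+1..n] matched).

    Returns None when t is empty or no suitable copy t' exists.
    """
    t = pattern[i:]      # the matched suffix P[i+1..n]
    if not t:
        return None
    y = pattern[i - 1]   # the mismatched pattern character P(i)

    # Try candidate start positions j of t' from right to left.
    for j in range(i, 0, -1):
        if pattern[j - 1:j - 1 + len(t)] != t:
            continue
        # Strong rule: the char left of t' must differ from y
        # (a copy at the very start of P has no left char and is accepted).
        if not strong or j == 1 or pattern[j - 2] != y:
            return (i + 1) - j   # aligns t' with the text that matched t
    return None


P = "qcabdabdab"
print(good_suffix_shift(P, 8, strong=False))  # 3
print(good_suffix_shift(P, 8, strong=True))   # 6
```

For the slide example (mismatch at position 8, t = ab) the weak rule moves P by 3 and the strong rule by 6, because the copy of ab at positions 6 to 7 is preceded by d, the same character that just mismatched.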
![Page 84: CS 6293 Advanced Topics: Current Bioinformatics Lecture 5 Exact String Matching Algorithms.](https://reader030.fdocuments.us/reader030/viewer/2022032800/56649d445503460f94a216e6/html5/thumbnails/84.jpg)
Example preprocessing
P: qcabdabdab
Bad char rule (where to shift depends on T)
char Positions in P
a 9, 6, 3
b 10, 7, 4
c 2
d 8, 5
q 1
Good suffix rule (does not depend on T)
P:        q c a b d a b d a b
position: 1 2 3 4 5 6 7 8 9 10
value:    0 0 0 0 2 0 0 2 0 0
Nonzero entries: dab / cab (position 8) and dabdab / cabdab (position 5)
Largest shift given by either the (extended) bad char rule or the (strong) good suffix rule is used.
![Page 85: CS 6293 Advanced Topics: Current Bioinformatics Lecture 5 Exact String Matching Algorithms.](https://reader030.fdocuments.us/reader030/viewer/2022032800/56649d445503460f94a216e6/html5/thumbnails/85.jpg)
Time complexity of BM algorithm
• Pre-processing can be done in linear time
• With strong good suffix rule, worst-case is O(m) if P is not in T
  – If P is in T, worst-case could be O(mn)
  – E.g. T = m^100, P = m^10 (the character m repeated 100 and 10 times)
  – unless a modification was used (Galil’s rule)
• Proofs are technical. Skip.
![Page 86: CS 6293 Advanced Topics: Current Bioinformatics Lecture 5 Exact String Matching Algorithms.](https://reader030.fdocuments.us/reader030/viewer/2022032800/56649d445503460f94a216e6/html5/thumbnails/86.jpg)
How to actually do pre-processing?
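• Similar pre-processing for KMP and B-M
  – Find matches between a suffix and a prefix
  – Both can be done in linear time
  – P is usually short, even a more expensive pre-processing may result in a gain overall
[Figure: for KMP, P with matching substrings t’ and t followed by different characters y and x; for B-M, P with matching substrings t’ and t preceded by different characters; positions i and j are marked in both.]
For each i, find a j. Similar to DP. Start from i = 2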
![Page 87: CS 6293 Advanced Topics: Current Bioinformatics Lecture 5 Exact String Matching Algorithms.](https://reader030.fdocuments.us/reader030/viewer/2022032800/56649d445503460f94a216e6/html5/thumbnails/87.jpg)
Fundamental pre-processing
• Zi: length of longest substring starting at i that matches a prefix of P
  – i.e. t = t’, x ≠ y, Zi = |t|
  – With the Z-values computed, we can get the preprocessing for both KMP and B-M in linear time.
P = aabcaabxaaz
Z = 0 1 0 0 3 1 0 0 2 1 0
• How to compute Z-values in linear time?
[Figure: P with prefix t’ = P[1..Zi] followed by x, and substring t = P[i..i+Zi-1] followed by y.]
![Page 88: CS 6293 Advanced Topics: Current Bioinformatics Lecture 5 Exact String Matching Algorithms.](https://reader030.fdocuments.us/reader030/viewer/2022032800/56649d445503460f94a216e6/html5/thumbnails/88.jpg)
Computing Z in Linear time
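[Figure: P with prefix t’ = P[1..r-l+1] followed by x, and the matching box t = P[l..r] followed by y; position k lies inside the box and corresponds to position k-l+1 inside t’.]
We have already computed all Z-values up to k-1 and need to compute Zk. We also know the starting and ending points, l and r, of the previous match that extends furthest to the right.
We know that t = t’, therefore the Z-value at k-l+1 may be helpful to us.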
![Page 89: CS 6293 Advanced Topics: Current Bioinformatics Lecture 5 Exact String Matching Algorithms.](https://reader030.fdocuments.us/reader030/viewer/2022032800/56649d445503460f94a216e6/html5/thumbnails/89.jpg)
Computing Z in Linear time
• No char inside the box is compared twice. At most one mismatch per iteration.
• Therefore, O(n).
Case 1: r < k, i.e., no previous match reaches k. Do explicit comparison starting at k.
Case 2: k ≤ r and Zk-l+1 < r-k+1. Then Zk = Zk-l+1. No comparison is needed.
Case 3: k ≤ r and Zk-l+1 ≥ r-k+1. Then Zk ≥ r-k+1. Comparison starts from r+1.
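The three cases translate into the following sketch of the linear-time Z computation. It uses 0-based indices internally, so the box bounds `l` and `r` and the mirrored index `k - l` are each one less than on the slides, and it keeps the slide convention that the first Z-value is 0. The function name `z_values` is an assumption of this sketch.

```python
def z_values(p):
    """Return the list of Z-values of p (first entry 0 by convention)."""
    n = len(p)
    z = [0] * n
    l = r = 0  # 0-based inclusive bounds of the right-most match box so far
    for k in range(1, n):
        if k > r:
            # Case 1: no box covers k, compare explicitly from k.
            length = 0
            while k + length < n and p[length] == p[k + length]:
                length += 1
            z[k] = length
            if length > 0:
                l, r = k, k + length - 1
        else:
            mirrored = k - l       # position of k inside the prefix copy
            remaining = r - k + 1  # characters of the box from k to r
            if z[mirrored] < remaining:
                # Case 2: the answer lies strictly inside the box.
                z[k] = z[mirrored]
            else:
                # Case 3: match reaches the end of the box, extend from r + 1.
                q = r + 1
                while q < n and p[q] == p[q - k]:
                    q += 1
                z[k] = q - k
                l, r = k, q - 1
    return z


print(z_values("aabcaabxaaz"))  # [0, 1, 0, 0, 3, 1, 0, 0, 2, 1, 0]
```

Explicit comparisons only ever start at positions to the right of `r`, which is why the total work is linear in the length of P.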
![Page 90: CS 6293 Advanced Topics: Current Bioinformatics Lecture 5 Exact String Matching Algorithms.](https://reader030.fdocuments.us/reader030/viewer/2022032800/56649d445503460f94a216e6/html5/thumbnails/90.jpg)
Z-preprocessing for B-M and KMP
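• Both KMP and B-M preprocessing can be done in O(n)
[Figure: P with prefix t’ = P[1..Zi] followed by x, and substring t = P[i..j] followed by y, where j = i+Zi-1.]
KMP: for each j (from n down to 2), sp’(j+Zj-1) = Zj
B-M: use Z backwards (Z-values of the reversed pattern)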
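A sketch of how the Z-values feed both preprocessing steps. The Z-values are computed naively here to keep the block short; a linear-time routine could be substituted. Reading "use Z backwards" as computing Z on the reversed pattern is an assumption of this sketch, following the usual textbook N-values, and the names `z_naive`, `sp_prime` and `n_values` are illustrative only.

```python
def z_naive(p):
    """Z-values by direct comparison (quadratic, for illustration only)."""
    n = len(p)
    z = [0] * n
    for k in range(1, n):
        while k + z[k] < n and p[z[k]] == p[k + z[k]]:
            z[k] += 1
    return z


def sp_prime(p):
    """KMP failure values: entry i-1 holds sp'(i) for 1-based i."""
    n = len(p)
    z = z_naive(p)
    sp = [0] * n
    # Go from j = n down to 2 so that, for a shared end position i,
    # the smallest j (the longest match) is written last and wins.
    for j in range(n, 1, -1):
        zj = z[j - 1]
        if zj > 0:
            i = j + zj - 1
            sp[i - 1] = zj
    return sp


def n_values(p):
    """Entry j-1 holds the length of the longest suffix of P[1..j] that is
    also a suffix of P: the Z-values of the reversed pattern, reversed."""
    return z_naive(p[::-1])[::-1]


print(sp_prime("aabcaabxaaz"))  # [0, 1, 0, 0, 0, 1, 3, 0, 0, 2, 0]
print(n_values("qcabdabdab"))   # [0, 0, 0, 2, 0, 0, 5, 0, 0, 0]
```

In the second output, the value 2 at position 4 and the value 5 at position 7 locate the copies ab (positions 3 to 4) and abdab (positions 3 to 7) that the strong good suffix rule shifts to.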