Cache-Aware Parallel Approximate String
Search and Join Algorithms Using BWT
Jiaying Wang, Xiaochun Yang, and Bin Wang
Mar 22, 2013
Northeastern University, China
EDBT/ICDT 2013 Scalable String Similarity Search/Join workshop
Outline
Motivation
Problem statement
Approximate search method
Cache aware parallel framework
Pruning technique
Multi query optimization
Look ahead verification
Approximate join method
Experiment
Conclusion
Motivation
Searching for DNA sequences similar to
"ACGTACAATATTAG" in a genome database.
Results:
ACGTACAATATTAG is similar to
...ACGTACATTATTAG...
...ACGTAAAATATTAG...
...ACGTACAAATTTAG...
Problem Statement
Search: given a string set C, a pattern p, and a threshold τ,
return all strings Ti ∈ C with ed(p, Ti) ≤ τ.
Join: find all pairs (T1, T2) ∈ C × C with ed(T1, T2) ≤ τ as fast as
possible.
Search example: p = “Majaura”, τ = 1, over T1:“Majaura”, T2:“Deghli”, T3:“Madhura”.
Join example: τ = 1 over the same T1, T2, T3; the slide shows the result pair <s1,s3>.
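The search definition above can be made concrete with a brute-force baseline. This is only a sketch of the problem statement, not the indexed method of the talk; the function names `edit_distance` and `search` are illustrative, and unit-cost Levenshtein distance is assumed for ed.

```python
def edit_distance(a: str, b: str) -> int:
    """Unit-cost Levenshtein distance, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # match or substitute
        prev = cur
    return prev[-1]


def search(p: str, C: list[str], tau: int) -> list[str]:
    """Return every string in C within edit distance tau of p (no index, no filters)."""
    return [t for t in C if edit_distance(p, t) <= tau]


C = ["Majaura", "Deghli", "Madhura"]
print(search("Majaura", C, 1))
print(search("Majaura", C, 2))
```

In this sketch `search("Majaura", C, 1)` returns only T1, because the unit-cost distance between "Majaura" and "Madhura" is 2; T3 appears once τ is raised to 2.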
Current well-known solutions
Metric-space-based methods
Signature-based methods (q-gram, q-chunk + q-gram)
Idea:
Filter (full filter, prefix filter) + Verify
If s is similar to q, at least a lower bound (LB) of the signatures
of s and q must be identical.
#gram = |p| − q + 1
LB = #gram − q×τ (prefix length PF = q×τ + 1)
Tree/Trie-based methods
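The count bound LB = #gram − q×τ can be illustrated with a small filter. This is a sketch of the generic q-gram count filter under the formulas on the slide; the helper names `qgrams` and `count_filter_passes` are assumptions, not part of any cited system.

```python
from collections import Counter


def qgrams(s: str, q: int) -> list[str]:
    """All overlapping substrings of length q (there are |s| - q + 1 of them)."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def count_filter_passes(p: str, s: str, q: int, tau: int) -> bool:
    """True if s shares at least LB = #gram - q*tau q-grams with p."""
    lb = (len(p) - q + 1) - q * tau
    if lb <= 0:
        return True  # the bound is vacuous, so nothing can be pruned
    # Multiset intersection counts shared q-grams with multiplicity.
    common = sum((Counter(qgrams(p, q)) & Counter(qgrams(s, q))).values())
    return common >= lb


print(count_filter_passes("Majaura", "Majaula", q=2, tau=1))
print(count_filter_passes("Majaura", "Madhura", q=2, tau=1))
```

Strings that pass the filter are only candidates; they still need verification with an exact edit distance computation.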
BWTPA Index
T1:“Majaura”
T2:“Deghli”
T3:“Madhura”
Concatenated text: `Majaura$Deghli$Madhura#`

| Row | Sorted rotation (first 22 characters) | L | SA | PA |
|---|---|---|---|---|
| 1 | `#Majaura$Deghli$Madhur` | `a` | 22 | 3 |
| 2 | `$Deghli$Madhura#Majaur` | `a` | 7 | 1 |
| 3 | `$Madhura#Majaura$Deghl` | `i` | 14 | 2 |
| 4 | `Deghli$Madhura#Majaura` | `$` | 8 | 2 |
| 5 | `Madhura#Majaura$Deghli` | `$` | 15 | 3 |
| 6 | `Majaura$Deghli$Madhura` | `#` | 0 | 1 |
| 7 | `a#Majaura$Deghli$Madhu` | `r` | 21 | 3 |
| 8 | `a$Deghli$Madhura#Majau` | `r` | 6 | 1 |
| 9 | `adhura#Majaura$Deghli$` | `M` | 16 | 3 |
| 10 | `ajaura$Deghli$Madhura#` | `M` | 1 | 1 |
| 11 | `aura$Deghli$Madhura#Ma` | `j` | 3 | 1 |
| 12 | `dhura#Majaura$Deghli$M` | `a` | 17 | 3 |
| 13 | `eghli$Madhura#Majaura$` | `D` | 9 | 2 |
| 14 | `ghli$Madhura#Majaura$D` | `e` | 10 | 2 |
| 15 | `hli$Madhura#Majaura$De` | `g` | 11 | 2 |
| 16 | `hura#Majaura$Deghli$Ma` | `d` | 18 | 3 |
| 17 | `i$Madhura#Majaura$Degh` | `l` | 13 | 2 |
| 18 | `jaura$Deghli$Madhura#M` | `a` | 2 | 1 |
| 19 | `li$Madhura#Majaura$Deg` | `h` | 12 | 2 |
| 20 | `ra#Majaura$Deghli$Madh` | `u` | 20 | 3 |
| 21 | `ra$Deghli$Madhura#Maja` | `u` | 5 | 1 |
| 22 | `ura#Majaura$Deghli$Mad` | `h` | 19 | 3 |
| 23 | `ura$Deghli$Madhura#Maj` | `a` | 4 | 1 |
The BWTPA index contains:
BWT: simulates the suffix array (SA)
PA: records the string ID of each SA entry
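The columns of the example can be reproduced with a few lines of code. This is a naive sketch that sorts all rotations explicitly, which is fine for a 23-character text but is not how a production index would be built; the function name `build_bwtpa` and the 0-based SA convention are assumptions chosen to match the numbers on the slide.

```python
def build_bwtpa(strings: list[str]):
    """Build the text, suffix array (SA), BWT column (L), and string-ID array (PA)."""
    # Strings are joined with '$' and the whole text ends with '#', as on the slide.
    text = "$".join(strings) + "#"
    n = len(text)
    # Naive construction: sort rotation start positions by the rotation itself.
    sa = sorted(range(n), key=lambda i: text[i:] + text[:i])
    # L[k] is the character preceding the k-th smallest rotation (cyclically).
    bwt = "".join(text[i - 1] for i in sa)
    # owner[pos] is the 1-based ID of the string that position pos belongs to;
    # each terminator is counted with the string it ends.
    owner = [sid for sid, s in enumerate(strings, start=1) for _ in range(len(s) + 1)]
    pa = [owner[i] for i in sa]
    return text, sa, bwt, pa


text, sa, bwt, pa = build_bwtpa(["Majaura", "Deghli", "Madhura"])
print(text)
print(bwt)
print(sa)
print(pa)
```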
Approximate String Search
If ed(x, y) ≤ τ and x is split into τ+1 segments, then at least one segment of x matches a substring of y exactly.
Partition p into τ+1 segments of (almost) equal length:
r = |p| mod (τ+1)
The first r segments have length ⌊|p|/(τ+1)⌋ + 1
The remaining τ+1−r segments have length ⌊|p|/(τ+1)⌋
Example (τ = 2): X = Madaura, Y = Majaula
X is partitioned into Mad | au | ra
Steps: partition the string, find a match (a segment of X that occurs in Y), then verification
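The partition rule is short enough to write out. This is a minimal sketch of the rule as stated; the function name `partition` is illustrative.

```python
def partition(p: str, tau: int) -> list[str]:
    """Split p into tau + 1 segments; the first |p| mod (tau + 1) get one extra character."""
    k = tau + 1
    base, r = divmod(len(p), k)
    segments, pos = [], 0
    for i in range(k):
        length = base + 1 if i < r else base
        segments.append(p[pos:pos + length])
        pos += length
    return segments


segments = partition("Madaura", 2)
print(segments)
# At least one segment must occur in any string within distance tau of the pattern.
print([seg for seg in segments if seg in "Majaula"])
```

Each exact segment hit only yields a candidate; the verification step still computes the actual edit distance.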
Find Exact Substring
[Figure: the same BWTPA index as on the BWTPA Index slide (23 sorted rotations with their L, SA, and PA columns), used here to locate an exact substring.]
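One standard way to find an exact substring with a BWT is backward search, sketched below on the three example strings. The slide does not spell out the authors' procedure, so treat this as an assumption; the helper names and the naive `str.count`-based rank computation are for illustration only.

```python
def build_index(strings: list[str]):
    """Naive BWT and PA construction for a small string set ('$' separators, '#' end)."""
    text = "$".join(strings) + "#"
    sa = sorted(range(len(text)), key=lambda i: text[i:] + text[:i])
    bwt = "".join(text[i - 1] for i in sa)
    owner = [sid for sid, s in enumerate(strings, start=1) for _ in range(len(s) + 1)]
    return bwt, [owner[i] for i in sa]


def backward_search(bwt: str, pattern: str) -> tuple[int, int]:
    """Return the half-open row range [lo, hi) of rotations that start with pattern."""
    # first[c] = number of characters in the text that are strictly smaller than c
    first, total = {}, 0
    for c in sorted(set(bwt)):
        first[c] = total
        total += bwt.count(c)
    lo, hi = 0, len(bwt)
    for c in reversed(pattern):          # extend the match one character to the left
        if c not in first:
            return 0, 0
        lo = first[c] + bwt.count(c, 0, lo)
        hi = first[c] + bwt.count(c, 0, hi)
        if lo >= hi:
            return 0, 0
    return lo, hi


def find_ids(strings: list[str], pattern: str) -> list[int]:
    """IDs of the strings that contain pattern, read from PA over the matching rows."""
    bwt, pa = build_index(strings)
    lo, hi = backward_search(bwt, pattern)
    return sorted(set(pa[lo:hi]))


C = ["Majaura", "Deghli", "Madhura"]
print(find_ids(C, "au"))
print(find_ids(C, "ura"))
```

The PA lookup is what turns a row range into string IDs without first mapping suffix positions back to string boundaries.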
Pruning Techniques
Length Filtering:
For a pattern P and threshold τ, every answer string has a length in [|P| − τ, |P| + τ].
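[Figure: strings are grouped by length into blocks B1[5,6], B2[7,8], …, Bi[lmin,lmax], …, Bi[32,33], organized under a hierarchy of length ranges ([5,33], then [5,16] and [16,32], then narrower ranges). Example block contents: Dehri…Deghli; Majaura$M…; Mautern in$….]
Example: searching "Dehli" with τ = 1 gives the length range [4,6], so among the blocks shown only B1[5,6] needs to be searched.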
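Length filtering can be sketched with a simple length-keyed index. For brevity this uses one bucket per exact length rather than the range blocks Bi[lmin,lmax] of the slide, which is a simplification; the function names and the sample strings are assumptions.

```python
from collections import defaultdict


def build_length_index(strings: list[str]) -> dict[int, list[str]]:
    """Group strings by length so that whole groups can be skipped at query time."""
    index = defaultdict(list)
    for s in strings:
        index[len(s)].append(s)
    return index


def length_filter(index: dict[int, list[str]], p: str, tau: int) -> list[str]:
    """Candidates whose length lies in [|p| - tau, |p| + tau]."""
    candidates = []
    for length in range(max(0, len(p) - tau), len(p) + tau + 1):
        candidates.extend(index.get(length, []))
    return candidates


index = build_length_index(["Dehri", "Deghli", "Majaura", "Madhura", "Mautern"])
print(length_filter(index, "Dehli", 1))
```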
Pruning Techniques
Position Filtering
Each candidate is split into three parts: prefix, matched segment, and suffix.
Example (τ = 2): X = Mad | au | ra, Y = F | au | lty (prefix | match | suffix)
τp ≥ 2, τs ≥ 1
τp + τs ≥ 3 > τ, so the candidate is pruned.
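The position filter can be sketched by bounding the prefix and suffix costs with their length differences. This is one simple lower bound consistent with the slide's numbers, not necessarily the exact bound the authors use; the function name and its parameters are illustrative.

```python
def position_lower_bound(p_len: int, p_start: int,
                         t_len: int, t_start: int, seg_len: int) -> int:
    """Lower bound on the edits needed when a segment of the pattern (at p_start)
    is matched exactly against the text at t_start."""
    # Prefixes of different lengths need at least that many insertions or deletions.
    tau_p = abs(p_start - t_start)
    # The same argument applies to what remains after the matched segment.
    tau_s = abs((p_len - p_start - seg_len) - (t_len - t_start - seg_len))
    return tau_p + tau_s


def survives_position_filter(p: str, p_start: int, t: str, t_start: int,
                             seg_len: int, tau: int) -> bool:
    return position_lower_bound(len(p), p_start, len(t), t_start, seg_len) <= tau


# "au" starts at offset 3 in "Madaura" and at offset 1 in "Faulty".
print(position_lower_bound(7, 3, 6, 1, 2))
print(survives_position_filter("Madaura", 3, "Faulty", 1, 2, tau=2))
print(survives_position_filter("Madaura", 3, "Majaula", 3, 2, tau=2))
```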
Multiple Queries
| ID | String | τ | Segments |
|---|---|---|---|
| 1 | lodna | 1 | lod, na |
| 2 | lodnna | 2 | lo, dn, na |
| 3 | Dehri | 1 | Deh, ri |
| 4 | dehri | 1 | deh, ri |
| 5 | loddna | 2 | lo, dd, na |
[Figure, built up over three slides: the segments of the five queries are inserted into a shared trie one query at a time (first lod | na, then lo | dn | na, then the remaining queries), so that repeated segments such as "lo", "na", and "ri" are stored only once. The final trie has 16 character nodes.]
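Sharing work across queries can be sketched by inserting every query segment into one trie. The sketch assumes the segments are inserted reversed, as in the "reversed segment trie" named on the join slide, and uses nested dictionaries with a `None` key to hold the (query ID, segment index) pairs; these representation choices are assumptions.

```python
def partition(p: str, tau: int) -> list[str]:
    """Split p into tau + 1 segments of (almost) equal length."""
    base, r = divmod(len(p), tau + 1)
    segments, pos = [], 0
    for i in range(tau + 1):
        length = base + 1 if i < r else base
        segments.append(p[pos:pos + length])
        pos += length
    return segments


def build_reversed_segment_trie(queries: dict[int, tuple[str, int]]) -> dict:
    """Insert each reversed segment; identical segments end at the same node."""
    root: dict = {}
    for qid, (s, tau) in queries.items():
        for k, seg in enumerate(partition(s, tau)):
            node = root
            for ch in reversed(seg):
                node = node.setdefault(ch, {})
            node.setdefault(None, []).append((qid, k))  # owners of this segment
    return root


def count_nodes(node: dict) -> int:
    """Number of character nodes in the trie."""
    return sum(1 + count_nodes(child) for ch, child in node.items() if ch is not None)


def count_segments(node: dict) -> int:
    """Number of distinct segments, i.e. nodes that carry an owner list."""
    here = 1 if None in node else 0
    return here + sum(count_segments(c) for ch, c in node.items() if ch is not None)


queries = {1: ("lodna", 1), 2: ("lodnna", 2), 3: ("Dehri", 1),
           4: ("dehri", 1), 5: ("loddna", 2)}
trie = build_reversed_segment_trie(queries)
print(count_nodes(trie), count_segments(trie))
```

The five queries produce 12 segments but only 8 distinct ones, so each distinct segment needs to be searched in the index only once.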
Look Ahead Verification
[Figure: two edit-distance verification matrices. Case 1: "similarly" vs. "similalry". Case 2: "reference" vs. "different". The cell values are not recoverable from the transcript.]
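The details of the look-ahead method are not recoverable from the slide, so the sketch below shows a different, standard technique with a similar goal: threshold-bounded verification that abandons the dynamic program as soon as no cell in the current row can still lead to a distance ≤ τ. The function name is an assumption, and this should not be read as the authors' algorithm.

```python
from typing import Optional


def bounded_edit_distance(a: str, b: str, tau: int) -> Optional[int]:
    """Return ed(a, b) if it is at most tau, otherwise None (with early exit)."""
    if abs(len(a) - len(b)) > tau:
        return None  # the length difference alone already exceeds tau
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # match or substitute
        # Row minima never decrease, so once they exceed tau the answer is fixed.
        if min(cur) > tau:
            return None
        prev = cur
    return prev[-1] if prev[-1] <= tau else None


print(bounded_edit_distance("similarly", "similalry", 2))
print(bounded_edit_distance("reference", "different", 3))
```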
Cache-aware Parallel Optimization
Registers: 0.3~0.5 ns
Cache: 1~10 ns
Memory: 80~200 ns
Disk: 10,000,000 ns
Cache-aware Parallel Optimization
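[Figure: the indexed text is split into blocks (Dehri$Majaura..., Deghli$lodna..., Madhura$...Mautern ...). Four workers (work1 to work4) each process one block at a time, for example B2, B8, and B7 while one worker is idle; pending blocks (B8, B9, …, Bn) are handed to whichever worker becomes idle.]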
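The scheduling idea in the figure, where idle workers pick up the next pending block, can be sketched with a thread pool. This only illustrates the work distribution: the block contents, the substring test standing in for an index lookup, and the worker count of four are assumptions, and Python threads do not reproduce the cache behavior or multi-core speedup of the C++ implementation.

```python
from concurrent.futures import ThreadPoolExecutor


def search_block(block: list[str], segment: str) -> list[str]:
    """Stand-in for searching one block: keep strings that contain the segment."""
    return [s for s in block if segment in s]


def parallel_search(blocks: list[list[str]], segment: str, workers: int = 4) -> list[str]:
    """Hand blocks to a fixed pool; a worker that finishes takes the next pending block."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() preserves block order in its results regardless of completion order.
        parts = pool.map(lambda block: search_block(block, segment), blocks)
        return [s for part in parts for s in part]


blocks = [["Dehri", "Majaura"], ["Deghli", "lodna"], ["Madhura", "Mautern"]]
print(parallel_search(blocks, "ura"))
```

Keeping each block small enough to fit in cache is the part of the design that a sketch like this cannot show; it only fixes the unit of work that gets scheduled.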
Approximate String Join
Incremental Approximate String Join
ed(S1, S2) ≤ τ implies ed(S2, S1) ≤ τ
so the symmetric case can be skipped and each pair is reported once
stop the search on reaching an ID ≥ the current ID
Trie-based Approximate Join
build a reversed segment trie first
avoid repeating the search for duplicated segments
Pruning techniques
count filter
stop the search for the current segment when only one candidate remains, since that candidate is the string itself
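The incremental join can be sketched as a loop that compares each string only against strings with a smaller ID, which removes the symmetric case. Brute-force verification stands in for the index lookup here, and the function names are assumptions.

```python
def edit_distance(a: str, b: str) -> int:
    """Unit-cost Levenshtein distance, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def incremental_join(strings: list[str], tau: int) -> list[tuple[int, int]]:
    """Report each similar pair once as (smaller ID, larger ID); IDs are 1-based."""
    pairs = []
    for cur_id in range(1, len(strings) + 1):
        # Only IDs below the current one are probed, so (i, j) and (j, i)
        # are never both produced and a string is never paired with itself.
        for other_id in range(1, cur_id):
            if edit_distance(strings[other_id - 1], strings[cur_id - 1]) <= tau:
                pairs.append((other_id, cur_id))
    return pairs


C = ["Majaura", "Deghli", "Madhura"]
print(incremental_join(C, 2))
```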
Experiment
Environment
C++ language
PC with 2.93 GHz Intel Core CPU
4 GB main memory
Ubuntu operating system (Linux distribution).
Data sets
Geographical names
DBLP authors
Human genome reads
Search Performance
[Figure: search performance plots; panels: Geo, DBLP, reads]
Join Performance on DBLP
[Figure: join performance plots; panels: Geo, DBLP, reads]
Conclusions
A new index BWTPA
Cache-aware multi core framework
Efficient pruning techniques
Length filter
Position filter
Look-ahead algorithm to speed up edit distance verification
Approximate string join
Incremental
Trie-based
Thank you!