Bouma2 talk
![Page 1: Bouma2 talk](https://reader034.fdocuments.us/reader034/viewer/2022052620/55763e20d8b42ac31b8b46f1/html5/thumbnails/1.jpg)
A High-Performance Input-Aware Multiple String-Match Algorithm
Erez Buchnik
![Page 2: Bouma2 talk](https://reader034.fdocuments.us/reader034/viewer/2022052620/55763e20d8b42ac31b8b46f1/html5/thumbnails/2.jpg)
Page 2
Agenda
• Problem
• Existing Solutions
• Bouma2 – Model
• Comparisons
• Preprocessing in Detail
• Future Work
![Page 3: Bouma2 talk](https://reader034.fdocuments.us/reader034/viewer/2022052620/55763e20d8b42ac31b8b46f1/html5/thumbnails/3.jpg)
![Page 4: Bouma2 talk](https://reader034.fdocuments.us/reader034/viewer/2022052620/55763e20d8b42ac31b8b46f1/html5/thumbnails/4.jpg)
Page 4
The Multiple String-Match Problem
• Goal: Given a set of strings and input
text, find all occurrences of any of the
strings in the text
• Input: Set of strings L and input text M
• Output: Offsets 1 ≤ i ≤ |M| where a
substring of M matches any of the
strings in L
• Uses: antivirus (AV), intrusion prevention (IPS), deep packet inspection (DPI), DNA search, etc.
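As a reference point for the problem statement above, here is a minimal brute-force sketch. It is not part of the talk; the function name `naive_multi_match` is hypothetical, and offsets are reported 1-based to follow the slide's convention 1 ≤ i ≤ |M|.

```python
def naive_multi_match(strings, text):
    """Return (offset, string) for every occurrence of any string in text.

    Offsets are 1-based, matching the slide's 1 <= i <= |M| convention.
    """
    hits = []
    for i in range(len(text)):
        for s in strings:
            # str.startswith with a start index compares in place, no slicing.
            if text.startswith(s, i):
                hits.append((i + 1, s))
    return hits


if __name__ == "__main__":
    print(naive_multi_match(["bits", "cooks"], "rabbits hate cooks"))
```

This baseline tries every string at every offset, which is the cost the algorithms in the rest of the talk try to avoid.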
![Page 5: Bouma2 talk](https://reader034.fdocuments.us/reader034/viewer/2022052620/55763e20d8b42ac31b8b46f1/html5/thumbnails/5.jpg)
Page 5
The Multiple String-Match Problem - References
• Aho-Corasick ’75
• Commentz-Walter ’79
• Rabin-Karp ’87
• Wu-Manber ’94
• Muth-Manber ’96
• Hopcroft-Motwani-Ullman ’00
• Dori-Landau ’06
![Page 6: Bouma2 talk](https://reader034.fdocuments.us/reader034/viewer/2022052620/55763e20d8b42ac31b8b46f1/html5/thumbnails/6.jpg)
Page 6
Agenda
• Problem
• Existing Solutions
• Bouma2 – Model
• Comparisons
• Preprocessing in Detail
• Future Work
![Page 7: Bouma2 talk](https://reader034.fdocuments.us/reader034/viewer/2022052620/55763e20d8b42ac31b8b46f1/html5/thumbnails/7.jpg)
Page 7
Stateful Approach (e.g. Aho-Corasick)
• Linear in the length of the input
• Large automata cause cache-misses and degrade performance
• One state transition per symbol
![Page 8: Bouma2 talk](https://reader034.fdocuments.us/reader034/viewer/2022052620/55763e20d8b42ac31b8b46f1/html5/thumbnails/8.jpg)
Page 8
Agenda
• Problem
• Existing Solutions
• Bouma2 – Model
• Comparisons
• Preprocessing in Detail
• Future Work
![Page 9: Bouma2 talk](https://reader034.fdocuments.us/reader034/viewer/2022052620/55763e20d8b42ac31b8b46f1/html5/thumbnails/9.jpg)
Page 9
Guidelines
• INTUITIVE: Search for ‘Hints’ of
a Match Before the Full Match
• REALISTIC: Use Prior
Knowledge of Expected Input
• SIMPLE: Trivial Match Process
![Page 10: Bouma2 talk](https://reader034.fdocuments.us/reader034/viewer/2022052620/55763e20d8b42ac31b8b46f1/html5/thumbnails/10.jpg)
Page 10
Bouma2: Motif-Based String Match
• Preprocessing: Map every string to its own substring: Motif
Set of strings: bore, core, bits, corridor, boat, book, cooks, trek
Set of selected 2-symbol-long substrings: bi, re, or, at, ok, ek
Q1: How to select motifs?
![Page 11: Bouma2 talk](https://reader034.fdocuments.us/reader034/viewer/2022052620/55763e20d8b42ac31b8b46f1/html5/thumbnails/11.jpg)
Page 11
Bouma2: Motif-Based String Match (cont.)
Input: “rabbits hate cooks”
[Diagram: full-match attempts around motif occurrences: “bits” Match, “boat” No match, “book” No match, “cooks” Match]
• Match: Examine symbols 2-by-2 (STATELESS); attempt full match around motif occurrences
Q2: How to resolve collisions?
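The motif idea on this slide can be sketched as follows. This is an illustration under assumptions, not the talk's implementation: the string-to-motif assignment in `MOTIFS` is a guess consistent with the listed motif set (the slides do not give the exact mapping), the name `motif_search` is hypothetical, and the scan steps one symbol at a time for simplicity.

```python
# Hypothetical mapping: motif -> [(string, offset of the motif inside the string)].
MOTIFS = {
    "bi": [("bits", 0)],
    "re": [("bore", 2), ("core", 2)],
    "or": [("corridor", 1)],
    "at": [("boat", 2)],
    "ok": [("book", 2), ("cooks", 2)],
    "ek": [("trek", 2)],
}


def motif_search(text, motifs=MOTIFS):
    """Look for motif hits, then attempt a full match around each hit."""
    matches = []
    for i in range(len(text) - 1):
        # No state is carried from one position to the next.
        for word, k in motifs.get(text[i:i + 2], ()):
            start = i - k  # where the full string would have to begin
            if start >= 0 and text.startswith(word, start):
                matches.append((start, word))
    return matches


if __name__ == "__main__":
    print(motif_search("rabbits hate cooks"))
```

On the sample text, read here as “rabbits hate cooks”, the ‘at’ hit leads to a failed attempt on “boat” and the ‘ok’ hit to a failed attempt on “book”, while “bits” and “cooks” are confirmed.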
![Page 12: Bouma2 talk](https://reader034.fdocuments.us/reader034/viewer/2022052620/55763e20d8b42ac31b8b46f1/html5/thumbnails/12.jpg)
Page 12
Capturing all Occurrences
Input: “habits of rabbits”
[Diagram: “bits” matches twice, once inside “habits” and once inside “rabbits”]
• Even-offset occurrences and odd-offset occurrences require separate passes, but instead…
![Page 13: Bouma2 talk](https://reader034.fdocuments.us/reader034/viewer/2022052620/55763e20d8b42ac31b8b46f1/html5/thumbnails/13.jpg)
Page 13
Upgrade #1: 2-Symbol Strides
Input: “habits of rabbits”
[Diagram: both occurrences of “bits” are marked Match]
• We map each string TWICE: once to
an even-offset motif, and once to an
odd-offset motif
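A minimal sketch of the 2-symbol stride follows, assuming each string is assigned one motif at an even offset and one at an odd offset inside the string. The names `build_motif_map` and `stride2_search` and the choice of offsets for “bits” are illustrative assumptions. Because only even text positions are examined, an occurrence starting at an even position is caught through its even-offset motif and one starting at an odd position through its odd-offset motif.

```python
def build_motif_map(choices):
    """choices: {string: (even_offset, odd_offset)} -> {motif: [(string, offset)]}."""
    motif_map = {}
    for word, (even, odd) in choices.items():
        if even % 2 != 0 or odd % 2 != 1:
            raise ValueError("need one even and one odd motif offset per string")
        for k in (even, odd):
            motif_map.setdefault(word[k:k + 2], []).append((word, k))
    return motif_map


def stride2_search(text, motif_map):
    """Examine the text two symbols at a time (even positions only)."""
    matches = []
    for i in range(0, len(text) - 1, 2):
        for word, k in motif_map.get(text[i:i + 2], ()):
            start = i - k
            if start >= 0 and text.startswith(word, start):
                matches.append((start, word))
    return matches


if __name__ == "__main__":
    # Assumed choice for "bits": 'ts' (offset 2) and 'it' (offset 1).
    motif_map = build_motif_map({"bits": (2, 1)})
    print(stride2_search("habits of rabbits", motif_map))
```

Note that strings shorter than three symbols have no odd-offset 2-symbol substring, so this sketch only applies to strings of length 3 or more.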
![Page 14: Bouma2 talk](https://reader034.fdocuments.us/reader034/viewer/2022052620/55763e20d8b42ac31b8b46f1/html5/thumbnails/14.jpg)
Page 14
Upgrade #2: Fast-Path / Slow-Path
Input: “habits of rabbits”
[Diagram: the fast-path flags offsets 4 and 14]
• Fast-Path:
- Stateless
- “Monolithic” (zero branches)
- Cache-Aware (small direct-table)
- SIMPLE…
![Page 15: Bouma2 talk](https://reader034.fdocuments.us/reader034/viewer/2022052620/55763e20d8b42ac31b8b46f1/html5/thumbnails/15.jpg)
Page 15
Upgrade #2: Fast-Path / Slow-Path
Input: “habits of rabbits”
[Diagram: offsets 4 and 14 flagged by the fast-path each lead to a Match of “bits” in the slow-path]
• Slow-Path:
- Memory-Efficient (pointers to
original strings for comparison)
- “Localized” (separate structure for
every motif)
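The fast-path / slow-path split might look roughly like the sketch below. It assumes byte input, a 65,536-entry direct table indexed by two bytes, and a per-motif list of (string, offset) entries standing in for the "pointers to original strings"; all function names are hypothetical. Python cannot express a truly branch-free loop, so this only mirrors the structure, not the performance characteristics.

```python
def build_fast_table(mapping):
    """Direct table: one flag per possible 2-byte sequence."""
    table = bytearray(65536)
    for motif in mapping:
        table[(motif[0] << 8) | motif[1]] = 1
    return table


def fast_path(data, table):
    """Stateless stride-2 scan: one table lookup per 2 bytes, returns hit offsets."""
    return [i for i in range(0, len(data) - 1, 2)
            if table[(data[i] << 8) | data[i + 1]]]


def slow_path(data, hits, mapping):
    """For each flagged offset, compare against the strings of that motif only."""
    matches = []
    for i in hits:
        for word, k in mapping[data[i:i + 2]]:
            start = i - k
            if start >= 0 and data.startswith(word, start):
                matches.append((start, word))
    return matches


if __name__ == "__main__":
    # Assumed motifs for "bits": 'ts' at offset 2 (even), 'it' at offset 1 (odd).
    mapping = {b"ts": [(b"bits", 2)], b"it": [(b"bits", 1)]}
    data = b"habits of rabbits"
    hits = fast_path(data, build_fast_table(mapping))
    print(hits, slow_path(data, hits, mapping))
```

With these assumed motifs the fast-path flags offsets 4 and 14 (0-based), which agrees with the numbers shown on the slide.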
![Page 16: Bouma2 talk](https://reader034.fdocuments.us/reader034/viewer/2022052620/55763e20d8b42ac31b8b46f1/html5/thumbnails/16.jpg)
Page 16
Agenda
• Problem
• Existing Solutions
• Bouma2 – Model
• Comparisons
• Preprocessing in Detail
• Future Work
![Page 17: Bouma2 talk](https://reader034.fdocuments.us/reader034/viewer/2022052620/55763e20d8b42ac31b8b46f1/html5/thumbnails/17.jpg)
Page 17
Bouma2 vs. Aho-Corasick
• n – length of input
• S – no. of string-matches in n
• m – no. of motif-matches in n
• l – length of the longest string
• Match Complexities:
- Aho-Corasick: $O(n + S)$
- Bouma2: $O\left(\frac{n}{2} + l \cdot m\right)$
![Page 18: Bouma2 talk](https://reader034.fdocuments.us/reader034/viewer/2022052620/55763e20d8b42ac31b8b46f1/html5/thumbnails/18.jpg)
Page 18
Bouma2 vs. Aho-Corasick (Speed)
• In practice, Bouma2 is usually at least twice as fast as Aho-Corasick
• Fast-path alone is 10 times faster
[Chart series: Bouma2 Fast-Path, Bouma2 Slow-Path (Sub-Optimal), Aho-Corasick]
Q3: How to optimize slow-path?
![Page 19: Bouma2 talk](https://reader034.fdocuments.us/reader034/viewer/2022052620/55763e20d8b42ac31b8b46f1/html5/thumbnails/19.jpg)
Page 19
Bouma2 vs. Aho-Corasick (Cache)
• Bouma2 exhibits 8.5 times fewer cache-misses than Aho-Corasick (fast-path + slow-path)
[Chart series: Bouma2 Cache-Misses, Aho-Corasick Cache-Misses]
![Page 20: Bouma2 talk](https://reader034.fdocuments.us/reader034/viewer/2022052620/55763e20d8b42ac31b8b46f1/html5/thumbnails/20.jpg)
Page 20
Bouma2 vs. Aho-Corasick (Memory)
• Bouma2 footprint is less than 70% of Aho-Corasick for textual search (down to 35% in other cases)
[Chart series: Bouma2 Fast-Path, Bouma2 Slow-Path, Aho-Corasick, Original Strings]
![Page 21: Bouma2 talk](https://reader034.fdocuments.us/reader034/viewer/2022052620/55763e20d8b42ac31b8b46f1/html5/thumbnails/21.jpg)
Page 21
Agenda
• Problem
• Existing Solutions
• Bouma2 – Model
• Comparisons
• Preprocessing in Detail
• Future Work
![Page 22: Bouma2 talk](https://reader034.fdocuments.us/reader034/viewer/2022052620/55763e20d8b42ac31b8b46f1/html5/thumbnails/22.jpg)
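Page 22
Q1: How to select motifs?
• A1: Out of all 2-symbol substrings, find a minimum subset that covers all given strings (even & odd offsets)

| Offset | String (split) | bo | co | do | id | or | re | ri | rr |
|---|---|---|---|---|---|---|---|---|---|
| Even | bo re | • | | | | | • | | |
| Even | co re | | • | | | | • | | |
| Even | co rr id or | | • | | • | • | | | • |
| Odd | b or e | | | | | • | | | |
| Odd | c or e | | | | | • | | | |
| Odd | c or ri do r | | | • | | • | | • | |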
![Page 23: Bouma2 talk](https://reader034.fdocuments.us/reader034/viewer/2022052620/55763e20d8b42ac31b8b46f1/html5/thumbnails/23.jpg)
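Page 23
Q1: How to select motifs?
• But… maybe the minimum subset is not the optimal subset?

| Offset | String (split) | bo | co | do | id | or | re | ri | rr |
|---|---|---|---|---|---|---|---|---|---|
| Even | bo re | Χ | | | | | √ | | |
| Even | co re | | Χ | | | | √ | | |
| Even | co rr id or | | Χ | | Χ | √ | | | Χ |
| Odd | b or e | | | | | √ | | | |
| Odd | c or e | | | | | √ | | | |
| Odd | c or ri do r | | | Χ | | √ | | Χ | |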
![Page 24: Bouma2 talk](https://reader034.fdocuments.us/reader034/viewer/2022052620/55763e20d8b42ac31b8b46f1/html5/thumbnails/24.jpg)
Page 24
Q1: How to select motifs?
• Bad selection of motifs for English text searches: substrings of ‘the’, the most common word in English
Input: “The good, the bad and the ugly“ in theaters nearby
[Diagram: attempted matches of “theater” around each motif occurrence in the input, labeled Match / No match]

| Offset | String (split) | at | ea | er | he | te | th |
|---|---|---|---|---|---|---|---|
| Even | th ea te r | | Χ | | | Χ | √ |
| Odd | t he at er | Χ | | Χ | √ | | |
![Page 25: Bouma2 talk](https://reader034.fdocuments.us/reader034/viewer/2022052620/55763e20d8b42ac31b8b46f1/html5/thumbnails/25.jpg)
Page 25
Q1: How to select motifs?
• Use input-specific occurrence statistics to optimize motif-sets
• REALISTIC…

| 2-Symbol Sequence | Occurrence Probability |
|---|---|
| bo | 0.0002 |
| re | 0.001861 |
| co | 0.001028 |
| rr | 0.000031 |
| id | 0.001756 |
| or | 0.000444 |
| ri | 0.000284 |
| do | 0.000151 |
![Page 26: Bouma2 talk](https://reader034.fdocuments.us/reader034/viewer/2022052620/55763e20d8b42ac31b8b46f1/html5/thumbnails/26.jpg)
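Page 26
Q1: How to select motifs?
• NOTE: After selecting the motif-set, remove redundant mappings from the final String-to-Motif mapping

| Offset | String (split) | bo | co | do | id | or | re | ri | rr |
|---|---|---|---|---|---|---|---|---|---|
| Even | bo re | √ | | | | | Χ | | |
| Even | co re | | √ | | | | Χ | | |
| Even | co rr id or | | √ | | Χ | √ | | | Χ |
| Odd | b or e | | | | | √ | | | |
| Odd | c or e | | | | | √ | | | |
| Odd | c or ri do r | | | Χ | | √ | | Χ | |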
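The difference between the minimum subset and the statistics-driven subset can be reproduced with a small exhaustive search over the eight candidate motifs of bore, core and corridor. This is only a sketch: the talk formulates selection as an ILP, whereas the code below simply enumerates subsets, and the names `two_grams` and `select_motifs` are hypothetical. The probabilities are the ones listed on page 25.

```python
from itertools import combinations

WORDS = ["bore", "core", "corridor"]
PROB = {"bo": 0.0002, "re": 0.001861, "co": 0.001028, "rr": 0.000031,
        "id": 0.001756, "or": 0.000444, "ri": 0.000284, "do": 0.000151}


def two_grams(word, parity):
    """2-symbol substrings of word that start at even (0) or odd (1) offsets."""
    return {word[k:k + 2] for k in range(parity, len(word) - 1, 2)}


def select_motifs(words, cost):
    """Cheapest motif set covering every word at both an even and an odd offset."""
    candidates = sorted({g for w in words for p in (0, 1) for g in two_grams(w, p)})
    best, best_cost = None, float("inf")
    for r in range(1, len(candidates) + 1):
        for subset in combinations(candidates, r):
            chosen = set(subset)
            covered = all(two_grams(w, p) & chosen for w in words for p in (0, 1))
            total = sum(cost(g) for g in subset)
            if covered and total < best_cost:
                best, best_cost = chosen, total
    return best, best_cost


if __name__ == "__main__":
    print(select_motifs(WORDS, lambda g: 1))        # minimum-size subset
    print(select_motifs(WORDS, PROB.__getitem__))   # probability-weighted subset
```

With unit costs the search returns {or, re}; with the page 25 probabilities it returns {bo, co, or}, whose combined occurrence probability (about 0.001672) is lower than that of {or, re} (about 0.002305). Exhaustive enumeration is only workable for toy inputs like this one.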
![Page 27: Bouma2 talk](https://reader034.fdocuments.us/reader034/viewer/2022052620/55763e20d8b42ac31b8b46f1/html5/thumbnails/27.jpg)
Page 27
Statistics for Motif Selection
[Figure: two histograms of occurrence counts per 2-symbol sequence (x-axis ticks 0 to 70000; y-axis: occurrences). Top panel: sequences with more than 100,000 occurrences, with labeled peaks “\r\n”, 00 00 and FF FF. Bottom panel: sequences with more than 40,000 occurrences, with labeled peaks 00 00, “??” and FF FF.]
• 2-symbol sequence statistics: IP traffic (top) vs. OS files (bottom)
![Page 28: Bouma2 talk](https://reader034.fdocuments.us/reader034/viewer/2022052620/55763e20d8b42ac31b8b46f1/html5/thumbnails/28.jpg)
Page 28
Motif Selection as an ILP Problem
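• $L$: a given string-set
• $T_L$: all 2-symbol substrings of strings in $L$
• $c(t)$: cost-function for every $t$ in $T_L$
Minimize $\sum_{t \in T_L} c(t)\, x_t$, where $x_t \in \{0, 1\}$ for every $t \in T_L$
Subject To: for every $w \in L$,
$\sum_{t \in T_L} x_t \cdot \mathrm{assoc}_0(w, t) \ge 1$, and
$\sum_{t \in T_L} x_t \cdot \mathrm{assoc}_1(w, t) \ge 1$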
![Page 29: Bouma2 talk](https://reader034.fdocuments.us/reader034/viewer/2022052620/55763e20d8b42ac31b8b46f1/html5/thumbnails/29.jpg)
Page 29
Q2: How to resolve collisions?
• A2:
- Examine adjacent symbols at relative offsets to eliminate strings
- New structure: The Mangled-Trie
[Diagram: the strings sharing the ‘or’ motif (bore, core, and corridor once for each of its two ‘or’ occurrences) aligned on the motif, over a relative-offset axis from -6 to 6]
![Page 30: Bouma2 talk](https://reader034.fdocuments.us/reader034/viewer/2022052620/55763e20d8b42ac31b8b46f1/html5/thumbnails/30.jpg)
Page 30
The Mangled-Trie
[Diagram: Mangled-Trie for the ‘or’ Motif at Offset 0, drawn over the aligned strings bore, core, corridor and a relative-offset axis from -6 to 6]
Resolve: Offset -1
- ‘b’: ‘e’ in Offset 2?
  - YES: “bore” in Offset -1
  - NO: NO MATCH
- ‘c’: Resolve: Offset 2
  - ‘e’: “core” in Offset -1
  - ‘r’: “idor” in Offset 3?
    - YES: “corridor” in Offset -1
    - NO: NO MATCH
  - OTHER: NO MATCH
- ‘d’: “corri” in Offset -6?
  - YES: “corridor” in Offset -6
  - NO: NO MATCH
- OTHER: NO MATCH
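The resolution steps for the ‘or’ motif can be written out as plain conditionals. The sketch below hard-codes the decisions read off this slide for the strings bore, core and corridor; a real Mangled-Trie would presumably be a generated data structure rather than hand-written code, and the function name `resolve_or` is hypothetical.

```python
def resolve_or(text, i):
    """Resolve an 'or' motif found at text[i:i+2].

    Returns (string, start offset in text) or None for NO MATCH.
    Offsets in the comments are relative to the motif, as on the slide.
    """
    def symbol(rel):
        j = i + rel
        return text[j] if 0 <= j < len(text) else ""

    def has(fragment, rel):
        j = i + rel
        return j >= 0 and text.startswith(fragment, j)

    prev = symbol(-1)                      # Resolve: Offset -1
    if prev == "b":
        return ("bore", i - 1) if symbol(2) == "e" else None
    if prev == "c":
        nxt = symbol(2)                    # Resolve: Offset 2
        if nxt == "e":
            return ("core", i - 1)
        if nxt == "r":
            return ("corridor", i - 1) if has("idor", 3) else None
        return None
    if prev == "d":
        return ("corridor", i - 6) if has("corri", -6) else None
    return None


if __name__ == "__main__":
    text = "a corridor"
    print(resolve_or(text, 3), resolve_or(text, 8))
```

Each branch inspects a single symbol first and only then compares the remaining fragment, so most candidate strings are eliminated without a full comparison.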
![Page 31: Bouma2 talk](https://reader034.fdocuments.us/reader034/viewer/2022052620/55763e20d8b42ac31b8b46f1/html5/thumbnails/31.jpg)
Page 31
Agenda
• Problem
• Existing Solutions
• Bouma2 – Model
• Comparisons
• Preprocessing in Detail
• Future Work
![Page 32: Bouma2 talk](https://reader034.fdocuments.us/reader034/viewer/2022052620/55763e20d8b42ac31b8b46f1/html5/thumbnails/32.jpg)
Page 32
Q3: How to optimize slow-path?
• A3:
- Optimize Frequent Scenarios: Apply statistics to Mangled-Trie construction
- Improve Motif-Set Quality: Avoid slow-path altogether when possible
![Page 33: Bouma2 talk](https://reader034.fdocuments.us/reader034/viewer/2022052620/55763e20d8b42ac31b8b46f1/html5/thumbnails/33.jpg)
Page 33
More Future Work…
• Adaptive System: Collect statistics “on-the-go” and improve motif-set
• Faster Preprocessing: Custom Branch-and-Cut (Margot ’10)
• Regular Expressions
• Hardware Implementation
• Bouma3?…
![Page 34: Bouma2 talk](https://reader034.fdocuments.us/reader034/viewer/2022052620/55763e20d8b42ac31b8b46f1/html5/thumbnails/34.jpg)
Page 34
“ Search has always been about
people. It's not an abstract thing.
It's not a formula. It's about getting
people what they need... It depends
on the type of search you do—and
how to take all those signals and
put them together.”
- Udi Manber, Google, 2008
![Page 35: Bouma2 talk](https://reader034.fdocuments.us/reader034/viewer/2022052620/55763e20d8b42ac31b8b46f1/html5/thumbnails/35.jpg)
Thank You