![Page 1: Efficient Approximate Search on String Collections Part I Marios Hadjieleftheriou Chen Li 1.](https://reader036.fdocuments.us/reader036/viewer/2022062318/551c4bba5503469d6a8b495d/html5/thumbnails/1.jpg)
Efficient Approximate Search on String Collections: Part I
Marios Hadjieleftheriou Chen Li
1
![Page 2: Efficient Approximate Search on String Collections Part I Marios Hadjieleftheriou Chen Li 1.](https://reader036.fdocuments.us/reader036/viewer/2022062318/551c4bba5503469d6a8b495d/html5/thumbnails/2.jpg)
DBLP Author Search
http://www.informatik.uni-trier.de/~ley/db/indices/a-tree/index.html 2
![Page 3: Efficient Approximate Search on String Collections Part I Marios Hadjieleftheriou Chen Li 1.](https://reader036.fdocuments.us/reader036/viewer/2022062318/551c4bba5503469d6a8b495d/html5/thumbnails/3.jpg)
Try their names (good luck!)
http://www.informatik.uni-trier.de/~ley/db/indices/a-tree/index.html 3
- Yannis Papakonstantinou (UCSD)
- Meral Ozsoyoglu (Case Western)
- Marios Hadjieleftheriou (AT&T Research)
![Page 4: Efficient Approximate Search on String Collections Part I Marios Hadjieleftheriou Chen Li 1.](https://reader036.fdocuments.us/reader036/viewer/2022062318/551c4bba5503469d6a8b495d/html5/thumbnails/4.jpg)
4
![Page 5: Efficient Approximate Search on String Collections Part I Marios Hadjieleftheriou Chen Li 1.](https://reader036.fdocuments.us/reader036/viewer/2022062318/551c4bba5503469d6a8b495d/html5/thumbnails/5.jpg)
Better system?
5
http://dblp.ics.uci.edu/authors/
![Page 6: Efficient Approximate Search on String Collections Part I Marios Hadjieleftheriou Chen Li 1.](https://reader036.fdocuments.us/reader036/viewer/2022062318/551c4bba5503469d6a8b495d/html5/thumbnails/6.jpg)
People Search at UC Irvine
6
http://psearch.ics.uci.edu/
![Page 7: Efficient Approximate Search on String Collections Part I Marios Hadjieleftheriou Chen Li 1.](https://reader036.fdocuments.us/reader036/viewer/2022062318/551c4bba5503469d6a8b495d/html5/thumbnails/7.jpg)
Web Search
http://www.google.com/jobs/britney.html
Errors in queries
Errors in data
Bring query and meaningful results closer together
Actual queries gathered by Google
7
![Page 8: Efficient Approximate Search on String Collections Part I Marios Hadjieleftheriou Chen Li 1.](https://reader036.fdocuments.us/reader036/viewer/2022062318/551c4bba5503469d6a8b495d/html5/thumbnails/8.jpg)
Data Cleaning
| R | S |
| --- | --- |
| informix | infromix |
| microsoft | mcrosoft |
| … | … |
8
![Page 9: Efficient Approximate Search on String Collections Part I Marios Hadjieleftheriou Chen Li 1.](https://reader036.fdocuments.us/reader036/viewer/2022062318/551c4bba5503469d6a8b495d/html5/thumbnails/9.jpg)
Problem Formulation
Find strings similar to a given string: dist(Q,D) <= δ
Example: find strings similar to “hadjeleftheriou”
Performance is important!
- 10 ms: 100 queries per second (QPS)
- 5 ms: 200 QPS
9
![Page 10: Efficient Approximate Search on String Collections Part I Marios Hadjieleftheriou Chen Li 1.](https://reader036.fdocuments.us/reader036/viewer/2022062318/551c4bba5503469d6a8b495d/html5/thumbnails/10.jpg)
Outline
Part I
- Motivation
- Preliminaries
- Trie-based approach
- Gram-based algorithms
Part II
- Sketch-based algorithms
- Compression
- Selectivity estimation
- Transformations/Synonyms
- Conclusion
10
![Page 11: Efficient Approximate Search on String Collections Part I Marios Hadjieleftheriou Chen Li 1.](https://reader036.fdocuments.us/reader036/viewer/2022062318/551c4bba5503469d6a8b495d/html5/thumbnails/11.jpg)
Preliminaries
11
Next…
![Page 12: Efficient Approximate Search on String Collections Part I Marios Hadjieleftheriou Chen Li 1.](https://reader036.fdocuments.us/reader036/viewer/2022062318/551c4bba5503469d6a8b495d/html5/thumbnails/12.jpg)
Similarity Functions
"Similar to": a domain-specific function that returns a similarity value between two strings
Examples:
- Edit distance
- Hamming distance
- Jaccard similarity
- Soundex
- TF/IDF, BM25, DICE
See [KSS06] for an excellent survey
12
![Page 13: Efficient Approximate Search on String Collections Part I Marios Hadjieleftheriou Chen Li 1.](https://reader036.fdocuments.us/reader036/viewer/2022062318/551c4bba5503469d6a8b495d/html5/thumbnails/13.jpg)
Edit Distance
A widely used metric to define string similarity
ed(s1,s2) = minimum # of operations (insertion, deletion, substitution) to change s1 to s2
Example:
s1: Tom Hanks
s2: Ton Hank
ed(s1,s2) = 2
13
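The definition above can be sketched with the standard dynamic-programming recurrence. This is a minimal illustration, not code from the tutorial; the function name `edit_distance` is an assumption.

```python
def edit_distance(s1: str, s2: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning s1 into s2."""
    # prev[j] holds the distance between the processed prefix of s1 and s2[:j].
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        curr = [i]  # distance from s1[:i] to the empty string
        for j, c2 in enumerate(s2, start=1):
            curr.append(min(
                prev[j] + 1,               # delete c1
                curr[j - 1] + 1,           # insert c2
                prev[j - 1] + (c1 != c2),  # match or substitute
            ))
        prev = curr
    return prev[-1]


print(edit_distance("Tom Hanks", "Ton Hank"))  # 2, as in the slide's example
```

Keeping only two rows of the table makes the memory cost proportional to the length of s2, while the running time stays proportional to |s1| * |s2|.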
![Page 14: Efficient Approximate Search on String Collections Part I Marios Hadjieleftheriou Chen Li 1.](https://reader036.fdocuments.us/reader036/viewer/2022062318/551c4bba5503469d6a8b495d/html5/thumbnails/14.jpg)
State-of-the-art: Oracle 10g and older versions
Supported by Oracle Text
CREATE TABLE engdict(word VARCHAR(20), len INT);
Create preferences for text indexing:
begin
  ctx_ddl.create_preference('STEM_FUZZY_PREF', 'BASIC_WORDLIST');
  ctx_ddl.set_attribute('STEM_FUZZY_PREF','FUZZY_MATCH','ENGLISH');
  ctx_ddl.set_attribute('STEM_FUZZY_PREF','FUZZY_SCORE','0');
  ctx_ddl.set_attribute('STEM_FUZZY_PREF','FUZZY_NUMRESULTS','5000');
  ctx_ddl.set_attribute('STEM_FUZZY_PREF','SUBSTRING_INDEX','TRUE');
  ctx_ddl.set_attribute('STEM_FUZZY_PREF','STEMMER','ENGLISH');
end;
/
CREATE INDEX fuzzy_stem_subst_idx ON engdict ( word )
  INDEXTYPE IS ctxsys.context
  PARAMETERS ('Wordlist STEM_FUZZY_PREF');
Usage:
SELECT * FROM engdict
WHERE CONTAINS(word, 'fuzzy(universisty, 70, 6, weight)', 1) > 0;
Limitation: cannot handle errors in the first letters:
Katherine versus Catherine
14
![Page 15: Efficient Approximate Search on String Collections Part I Marios Hadjieleftheriou Chen Li 1.](https://reader036.fdocuments.us/reader036/viewer/2022062318/551c4bba5503469d6a8b495d/html5/thumbnails/15.jpg)
Microsoft SQL Server [CGG+05]
- Data cleaning tools available in SQL Server 2005
- Part of Integration Services
- Supports fuzzy lookups
- Uses data flow pipeline of transformations
- Similarity function: tokens with TF/IDF scores
15
![Page 16: Efficient Approximate Search on String Collections Part I Marios Hadjieleftheriou Chen Li 1.](https://reader036.fdocuments.us/reader036/viewer/2022062318/551c4bba5503469d6a8b495d/html5/thumbnails/16.jpg)
Lucene
- Uses Levenshtein distance (edit distance)
- Example: roam~0.8
- Prefix pruning followed by a scan (Efficiency?)
16
![Page 17: Efficient Approximate Search on String Collections Part I Marios Hadjieleftheriou Chen Li 1.](https://reader036.fdocuments.us/reader036/viewer/2022062318/551c4bba5503469d6a8b495d/html5/thumbnails/17.jpg)
Trie-based approach [JLL+09]
17
Next…
![Page 18: Efficient Approximate Search on String Collections Part I Marios Hadjieleftheriou Chen Li 1.](https://reader036.fdocuments.us/reader036/viewer/2022062318/551c4bba5503469d6a8b495d/html5/thumbnails/18.jpg)
Trie Indexing
[Figure: trie over the strings below; $ marks the end of a string]
Strings:
- exam
- example
- exemplar
- exempt
- sample
18
![Page 19: Efficient Approximate Search on String Collections Part I Marios Hadjieleftheriou Chen Li 1.](https://reader036.fdocuments.us/reader036/viewer/2022062318/551c4bba5503469d6a8b495d/html5/thumbnails/19.jpg)
Active nodes on Trie
Query: “example”
Edit-distance threshold = 2

| Prefix | Distance |
| --- | --- |
| examp | 2 |
| exampl | 1 |
| example | 0 |
| exempl | 2 |
| exempla | 2 |
| sample | 2 |
19
![Page 20: Efficient Approximate Search on String Collections Part I Marios Hadjieleftheriou Chen Li 1.](https://reader036.fdocuments.us/reader036/viewer/2022062318/551c4bba5503469d6a8b495d/html5/thumbnails/20.jpg)
Initialization
Q = ε
Initial active nodes: all nodes within depth δ

| Prefix | Distance |
| --- | --- |
| ε | 0 |
| e | 1 |
| ex | 2 |
| s | 1 |
| sa | 2 |
20
![Page 21: Efficient Approximate Search on String Collections Part I Marios Hadjieleftheriou Chen Li 1.](https://reader036.fdocuments.us/reader036/viewer/2022062318/551c4bba5503469d6a8b495d/html5/thumbnails/21.jpg)
Incremental Algorithm
Return leaf nodes as answers.
21
![Page 22: Efficient Approximate Search on String Collections Part I Marios Hadjieleftheriou Chen Li 1.](https://reader036.fdocuments.us/reader036/viewer/2022062318/551c4bba5503469d6a8b495d/html5/thumbnails/22.jpg)
Q = e x a m p l e (current prefix: e)

Active nodes for Q = ε

| Prefix | Distance |
| --- | --- |
| ε | 0 |
| e | 1 |
| ex | 2 |
| s | 1 |
| sa | 2 |

Active nodes for Q = e, derived from each active node of Q = ε

| Prefix | # Op | Base | Op |
| --- | --- | --- | --- |
| ε | 1 | ε | del e |
| s | 1 | ε | sub e/s |
| e | 0 | ε | mat e |
| ex | 1 | ε | ins x |
| exa | 2 | ε | ins xa |
| exe | 2 | ε | ins xe |
| e | 2 | e | del e |
| ex | 2 | e | sub e/x |
| ex | 3 | ex | del e |
| exa | 3 | ex | sub e/a |
| exe | 2 | ex | mat e |
| s | 2 | s | del e |
| sa | 2 | s | sub e/a |
| sa | 3 | sa | del e |

For each prefix, keep the smallest # Op; prefixes whose smallest value exceeds the threshold 2 are dropped.
22
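The incremental computation of active nodes can be sketched as follows. This is a simplified illustration rather than the exact procedure of [JLL+09]: each keystroke derives the new active set from the old one through deletion, match and substitution, and a separate pass then propagates insertions to descendants. All names (`TrieNode`, `build_trie`, `step`, `answers`) are assumptions made for this example.

```python
class TrieNode:
    def __init__(self, prefix=""):
        self.prefix = prefix   # string spelled by the path from the root
        self.children = {}     # character -> TrieNode
        self.is_end = False    # True if a data string ends here (the "$" marker)


def build_trie(strings):
    root = TrieNode()
    for s in strings:
        node = root
        for ch in s:
            if ch not in node.children:
                node.children[ch] = TrieNode(node.prefix + ch)
            node = node.children[ch]
        node.is_end = True
    return root


def _expand_insertions(active, delta):
    """Propagate 'insert a trie character' moves from active nodes to descendants."""
    work = list(active)
    while work:
        node = work.pop()
        d = active[node] + 1
        if d > delta:
            continue
        for child in node.children.values():
            if d < active.get(child, delta + 1):
                active[child] = d
                work.append(child)


def initial_active_nodes(root, delta):
    """Active nodes for the empty query: every node within depth delta."""
    active = {root: 0}
    _expand_insertions(active, delta)
    return active


def step(active, ch, delta):
    """Active nodes after the query is extended by one character ch."""
    new = {}

    def relax(node, d):
        if d <= delta and d < new.get(node, delta + 1):
            new[node] = d

    for node, d in active.items():
        relax(node, d + 1)                         # delete ch from the query
        for c, child in node.children.items():
            relax(child, d if c == ch else d + 1)  # match or substitute
    _expand_insertions(new, delta)
    return new


def answers(active):
    """Data strings stored at or below any active node."""
    found, stack = set(), list(active)
    while stack:
        node = stack.pop()
        if node.is_end:
            found.add(node.prefix)
        stack.extend(node.children.values())
    return found


root = build_trie(["exam", "example", "exemplar", "exempt", "sample"])
active = initial_active_nodes(root, 2)
for ch in "example":
    active = step(active, ch, 2)
print(sorted((n.prefix, d) for n, d in active.items()))
print(sorted(answers(active)))
```

Because each step only touches the previous active set and its children, the work per keystroke depends on the number of active nodes rather than on the size of the whole trie, which is what makes search-as-you-type practical.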
![Page 23: Efficient Approximate Search on String Collections Part I Marios Hadjieleftheriou Chen Li 1.](https://reader036.fdocuments.us/reader036/viewer/2022062318/551c4bba5503469d6a8b495d/html5/thumbnails/23.jpg)
Good and bad
Advantages:
- Trie size is small
- Can do search as the user types
Disadvantages:
- Works for edit distance only
23
![Page 24: Efficient Approximate Search on String Collections Part I Marios Hadjieleftheriou Chen Li 1.](https://reader036.fdocuments.us/reader036/viewer/2022062318/551c4bba5503469d6a8b495d/html5/thumbnails/24.jpg)
Gram-based algorithms
- List-merging algorithms [LLL08]
- Variable-length grams (VGRAM) [LWY07, YWL08]
24
Next…
![Page 25: Efficient Approximate Search on String Collections Part I Marios Hadjieleftheriou Chen Li 1.](https://reader036.fdocuments.us/reader036/viewer/2022062318/551c4bba5503469d6a8b495d/html5/thumbnails/25.jpg)
“q-grams” of strings
u n i v e r s a l
2-grams
25
![Page 26: Efficient Approximate Search on String Collections Part I Marios Hadjieleftheriou Chen Li 1.](https://reader036.fdocuments.us/reader036/viewer/2022062318/551c4bba5503469d6a8b495d/html5/thumbnails/26.jpg)
Edit operation’s effect on grams
k operations could affect k * q grams
u n i v e r s a l (fixed length: q)
If ed(s1,s2) <= k, then their # of common grams >= (|s1| - q + 1) - k * q
26
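The count bound can be checked on small examples with a short sketch. The helper names (`qgrams`, `common_grams`, `count_filter_bound`) are assumptions for illustration; common grams are counted as a multiset intersection.

```python
from collections import Counter


def qgrams(s, q):
    """All substrings of length q, in order of position."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def common_grams(s1, s2, q):
    """Size of the multiset intersection of the two gram sets."""
    c1, c2 = Counter(qgrams(s1, q)), Counter(qgrams(s2, q))
    return sum((c1 & c2).values())


def count_filter_bound(s1, q, k):
    """Lower bound on common grams when ed(s1, s2) <= k: (|s1| - q + 1) - k * q."""
    return (len(s1) - q + 1) - k * q


print(qgrams("universal", 2))
print(count_filter_bound("shtick", 2, 1))  # 3
print(common_grams("shtick", "stick", 2))  # 3, so "stick" passes the filter
```

The bound can drop to zero or below for short strings or large k, in which case it prunes nothing and every string has to be verified.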
![Page 27: Efficient Approximate Search on String Collections Part I Marios Hadjieleftheriou Chen Li 1.](https://reader036.fdocuments.us/reader036/viewer/2022062318/551c4bba5503469d6a8b495d/html5/thumbnails/27.jpg)
q-gram inverted lists
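| id | string |
| --- | --- |
| 0 | rich |
| 1 | stick |
| 2 | stich |
| 3 | stuck |
| 4 | static |

2-gram inverted lists:

| 2-gram | string ids |
| --- | --- |
| at | 4 |
| ch | 0, 2 |
| ck | 1, 3 |
| ic | 0, 1, 2, 4 |
| ri | 0 |
| st | 1, 2, 3, 4 |
| ta | 4 |
| ti | 1, 2, 4 |
| tu | 3 |
| uc | 3 |

27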
![Page 28: Efficient Approximate Search on String Collections Part I Marios Hadjieleftheriou Chen Li 1.](https://reader036.fdocuments.us/reader036/viewer/2022062318/551c4bba5503469d6a8b495d/html5/thumbnails/28.jpg)
Searching using inverted lists
Query: “shtick”, ED(shtick, ?) ≤ 1
Query 2-grams: sh ht ti ic ck
# of common grams >= 3
Matching inverted lists (strings: 0 rich, 1 stick, 2 stich, 3 stuck, 4 static):
- ti: 1, 2, 4
- ic: 0, 1, 2, 4
- ck: 1, 3
28
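A minimal sketch of this search, assuming a simple in-memory index: build the 2-gram inverted lists, count how often each string id appears on the query's lists, keep ids that reach the bound, and verify the survivors with the real edit distance. The function names are hypothetical.

```python
from collections import Counter, defaultdict


def qgrams(s, q):
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def edit_distance(s1, s2):
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        curr = [i]
        for j, c2 in enumerate(s2, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (c1 != c2)))
        prev = curr
    return prev[-1]


def build_index(strings, q):
    """Map each gram to the ascending list of ids of strings containing it."""
    index = defaultdict(list)
    for sid, s in enumerate(strings):
        for g in sorted(set(qgrams(s, q))):
            index[g].append(sid)
    return index


def search(query, strings, index, q, k):
    threshold = (len(query) - q + 1) - k * q
    if threshold <= 0:
        candidates = range(len(strings))  # the bound prunes nothing: verify all
    else:
        counts = Counter()
        # Counting per query-gram occurrence can only overestimate the number
        # of common grams, so no true answer is lost before verification.
        for g in qgrams(query, q):
            for sid in index.get(g, ()):
                counts[sid] += 1
        candidates = [sid for sid, c in counts.items() if c >= threshold]
    return sorted(sid for sid in candidates if edit_distance(query, strings[sid]) <= k)


strings = ["rich", "stick", "stich", "stuck", "static"]
index = build_index(strings, 2)
print(search("shtick", strings, index, 2, 1))  # [1], i.e. "stick"
```

The counting loop here is the simplest possible merge; the list-merging algorithms discussed next are faster ways to find the ids that reach the threshold.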
![Page 29: Efficient Approximate Search on String Collections Part I Marios Hadjieleftheriou Chen Li 1.](https://reader036.fdocuments.us/reader036/viewer/2022062318/551c4bba5503469d6a8b495d/html5/thumbnails/29.jpg)
T-occurrence Problem
Find elements whose occurrences ≥ T
[Figure: inverted lists, each sorted in ascending order, are merged to count occurrences]
29
![Page 30: Efficient Approximate Search on String Collections Part I Marios Hadjieleftheriou Chen Li 1.](https://reader036.fdocuments.us/reader036/viewer/2022062318/551c4bba5503469d6a8b495d/html5/thumbnails/30.jpg)
Example T = 4
Result: 13
[Figure: sorted inverted lists of string ids between 1 and 15; only id 13 occurs at least T = 4 times]
30
![Page 31: Efficient Approximate Search on String Collections Part I Marios Hadjieleftheriou Chen Li 1.](https://reader036.fdocuments.us/reader036/viewer/2022062318/551c4bba5503469d6a8b495d/html5/thumbnails/31.jpg)
List-Merging Algorithms
- [SK04]: HeapMerger, MergeOpt
- [LLL08, BK02]: ScanCount, MergeSkip, DivideSkip
31
![Page 32: Efficient Approximate Search on String Collections Part I Marios Hadjieleftheriou Chen Li 1.](https://reader036.fdocuments.us/reader036/viewer/2022062318/551c4bba5503469d6a8b495d/html5/thumbnails/32.jpg)
Heap-based Algorithm
Min-heap
Count # of occurrences of each element using a heap
Push to heap ……
32
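A sketch of the heap-based merge, assuming each list is sorted in ascending order: the heap holds the current front of every list, so equal ids are popped consecutively and can be counted. The example lists are an assumption chosen so that id 13 occurs four times; the exact lists of the original figure are not recoverable here.

```python
import heapq


def heap_merge(lists, T):
    """Return ids that occur on at least T of the sorted lists."""
    heap = [(lst[0], i, 0) for i, lst in enumerate(lists) if lst]
    heapq.heapify(heap)
    result = []
    current, count = None, 0
    while heap:
        value, i, pos = heapq.heappop(heap)
        if value == current:
            count += 1
        else:
            if current is not None and count >= T:
                result.append(current)
            current, count = value, 1
        if pos + 1 < len(lists[i]):
            heapq.heappush(heap, (lists[i][pos + 1], i, pos + 1))
    if current is not None and count >= T:
        result.append(current)
    return result


# Hypothetical lists for illustration.
lists = [[1, 3], [5, 10, 13], [10, 13, 15], [5, 7, 13], [13, 15]]
print(heap_merge(lists, 4))  # [13]
```

Every element of every list passes through the heap once, so the cost grows with the total length of all lists, which is what the later algorithms try to avoid.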
![Page 33: Efficient Approximate Search on String Collections Part I Marios Hadjieleftheriou Chen Li 1.](https://reader036.fdocuments.us/reader036/viewer/2022062318/551c4bba5503469d6a8b495d/html5/thumbnails/33.jpg)
MergeOpt Algorithm [SK04]
Long lists: T-1
Short lists: all remaining lists
Binary search on the long lists
33
![Page 34: Efficient Approximate Search on String Collections Part I Marios Hadjieleftheriou Chen Li 1.](https://reader036.fdocuments.us/reader036/viewer/2022062318/551c4bba5503469d6a8b495d/html5/thumbnails/34.jpg)
Example of MergeOpt
Count threshold T ≥ 4
Long lists: 3
Short lists: 2
[Figure: the sorted inverted lists of the running example, split into long and short lists]
34
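The long/short split can be sketched as below. An id needs T occurrences and can appear on at most T-1 long lists, so every answer must appear on some short list; only those candidates are probed in the long lists with binary search. For brevity this sketch counts the short lists with a `Counter` instead of the heap merge used in [SK04], and the example lists are hypothetical.

```python
from bisect import bisect_left
from collections import Counter


def contains(sorted_list, x):
    """Binary search for x in an ascending list."""
    i = bisect_left(sorted_list, x)
    return i < len(sorted_list) and sorted_list[i] == x


def merge_opt(lists, T):
    """Return ids that occur on at least T lists (T >= 1)."""
    by_length = sorted(lists, key=len, reverse=True)
    long_lists, short_lists = by_length[:T - 1], by_length[T - 1:]
    counts = Counter()
    for lst in short_lists:
        counts.update(lst)  # occurrences on the short lists only
    result = []
    for x in sorted(counts):
        total = counts[x] + sum(contains(lst, x) for lst in long_lists)
        if total >= T:
            result.append(x)
    return result


# Hypothetical lists for illustration: 3 long lists and 2 short lists for T = 4.
lists = [[1, 3], [5, 10, 13], [10, 13, 15], [5, 7, 13], [13, 15]]
print(merge_opt(lists, 4))  # [13]
```

The long lists are never scanned in full; each candidate costs one binary search per long list.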
![Page 35: Efficient Approximate Search on String Collections Part I Marios Hadjieleftheriou Chen Li 1.](https://reader036.fdocuments.us/reader036/viewer/2022062318/551c4bba5503469d6a8b495d/html5/thumbnails/35.jpg)
ScanCount
Count threshold T ≥ 4
[Figure: an array of occurrence counters indexed by string id; every id read from an inverted list increments its counter by 1]
String id 13 reaches 4 occurrences: result!
35
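A minimal sketch of ScanCount, assuming string ids are integers in the range 0 to num_strings - 1. The example lists are hypothetical and chosen so that id 13 is the only id reaching T = 4.

```python
def scan_count(lists, T, num_strings):
    """Scan every list once, incrementing one counter per string id."""
    counts = [0] * num_strings
    result = []
    for lst in lists:
        for sid in lst:
            counts[sid] += 1
            if counts[sid] == T:  # report each id once, when it reaches T
                result.append(sid)
    return sorted(result)


# Hypothetical lists for illustration.
lists = [[1, 3], [5, 10, 13], [10, 13, 15], [5, 7, 13], [13, 15]]
print(scan_count(lists, 4, 16))  # [13]
```

No heap and no sorted order are required, at the price of a counter array as large as the collection.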
![Page 36: Efficient Approximate Search on String Collections Part I Marios Hadjieleftheriou Chen Li 1.](https://reader036.fdocuments.us/reader036/viewer/2022062318/551c4bba5503469d6a8b495d/html5/thumbnails/36.jpg)
List-Merging Algorithms
- [SK04]: HeapMerger, MergeOpt
- [LLL08, BK02]: ScanCount, MergeSkip, DivideSkip
36
![Page 37: Efficient Approximate Search on String Collections Part I Marios Hadjieleftheriou Chen Li 1.](https://reader036.fdocuments.us/reader036/viewer/2022062318/551c4bba5503469d6a8b495d/html5/thumbnails/37.jpg)
MergeSkip algorithm [BK02, LLL08]
Min-heap
Pop T-1 elements
Jump: move each popped list to its first element greater than or equal to the new top of the heap
37
![Page 38: Efficient Approximate Search on String Collections Part I Marios Hadjieleftheriou Chen Li 1.](https://reader036.fdocuments.us/reader036/viewer/2022062318/551c4bba5503469d6a8b495d/html5/thumbnails/38.jpg)
Example of MergeSkip
Count threshold T ≥ 4
[Figure: min-heap over the fronts of the sorted lists; popped lists jump ahead to the next id that can still reach the threshold]
38
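The skipping idea can be sketched as follows, assuming ascending lists. If the smallest id appears on fewer than T lists, T-1 lists in total are popped; any id smaller than the new heap top can then occur on at most those T-1 lists, so they can safely jump ahead with binary search. The function name and the example lists are assumptions.

```python
import heapq
from bisect import bisect_left


def merge_skip(lists, T):
    """Return ids that occur on at least T of the sorted lists."""
    pos = [0] * len(lists)  # cursor into each list
    heap = [(lst[0], i) for i, lst in enumerate(lists) if lst]
    heapq.heapify(heap)
    result = []

    def push(i):
        if pos[i] < len(lists[i]):
            heapq.heappush(heap, (lists[i][pos[i]], i))

    # With fewer than T non-exhausted lists no further id can reach T.
    while len(heap) >= T:
        top = heap[0][0]
        popped = []
        while heap and heap[0][0] == top:
            popped.append(heapq.heappop(heap)[1])
        if len(popped) >= T:
            result.append(top)
            for i in popped:
                pos[i] += 1
                push(i)
        else:
            # Pop further lists until T-1 lists are out of the heap.
            for _ in range(T - 1 - len(popped)):
                popped.append(heapq.heappop(heap)[1])
            target = heap[0][0]  # smallest id that can still reach T
            for i in popped:
                pos[i] = bisect_left(lists[i], target, pos[i])  # jump
                push(i)
    return result


# Hypothetical lists for illustration.
lists = [[1, 3], [5, 10, 13], [10, 13, 15], [5, 7, 13], [13, 15]]
print(merge_skip(lists, 4))  # [13]
```

On these lists the ids 3 and 7 are never pushed onto the heap at all, which is the saving compared with the plain heap merge.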
![Page 39: Efficient Approximate Search on String Collections Part I Marios Hadjieleftheriou Chen Li 1.](https://reader036.fdocuments.us/reader036/viewer/2022062318/551c4bba5503469d6a8b495d/html5/thumbnails/39.jpg)
DivideSkip Algorithm [LLL08]
Long lists: binary search
Short lists: MergeSkip
39
![Page 40: Efficient Approximate Search on String Collections Part I Marios Hadjieleftheriou Chen Li 1.](https://reader036.fdocuments.us/reader036/viewer/2022062318/551c4bba5503469d6a8b495d/html5/thumbnails/40.jpg)
How many lists are treated as long lists?
40
![Page 41: Efficient Approximate Search on String Collections Part I Marios Hadjieleftheriou Chen Li 1.](https://reader036.fdocuments.us/reader036/viewer/2022062318/551c4bba5503469d6a8b495d/html5/thumbnails/41.jpg)
Length Filtering
Ed(s,t) ≤ 2
s: Length: 19
t: Length: 10
By length only!
41
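Length filtering needs only the two lengths: each edit operation changes the length by at most one, so strings whose lengths differ by more than the threshold cannot match. A minimal sketch, with a hypothetical function name:

```python
def passes_length_filter(s, t, k):
    """False means ed(s, t) > k is certain; True means the pair needs further checks."""
    return abs(len(s) - len(t)) <= k


# Lengths 19 and 10 with threshold 2, as on the slide: pruned by length only.
print(passes_length_filter("a" * 19, "b" * 10, 2))  # False
```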
![Page 42: Efficient Approximate Search on String Collections Part I Marios Hadjieleftheriou Chen Li 1.](https://reader036.fdocuments.us/reader036/viewer/2022062318/551c4bba5503469d6a8b495d/html5/thumbnails/42.jpg)
Positional Filtering
Ed(s,t) ≤ 2
s: (ab,1)
t: (ab,12)
42
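Positional filtering can be sketched by attaching a position to every gram and counting a gram as shared only when the positions differ by at most k. The count below is a simple overestimate of the true positional matching, so it is safe to use as a filter; all names are assumptions.

```python
def positional_qgrams(s, q):
    """(gram, position) pairs, with 0-based positions."""
    return [(s[i:i + q], i) for i in range(len(s) - q + 1)]


def compatible(g1, g2, k):
    """Same gram and positions at most k apart."""
    (gram1, pos1), (gram2, pos2) = g1, g2
    return gram1 == gram2 and abs(pos1 - pos2) <= k


def positional_common(s, t, q, k):
    """Number of grams of s that have a compatible partner in t."""
    t_grams = positional_qgrams(t, q)
    return sum(any(compatible(g, h, k) for h in t_grams)
               for g in positional_qgrams(s, q))


print(compatible(("ab", 1), ("ab", 12), 2))           # False: positions too far apart
print(positional_common("abcdefgh", "efghabcd", 2, 2))  # 0, although 6 grams coincide as text
```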
![Page 43: Efficient Approximate Search on String Collections Part I Marios Hadjieleftheriou Chen Li 1.](https://reader036.fdocuments.us/reader036/viewer/2022062318/551c4bba5503469d6a8b495d/html5/thumbnails/43.jpg)
Normalized weights [HKS08]
- Compute a weight for each string
  - L0: length of the string
  - L1, L2: depend on q-gram frequencies
- Similar strings have similar weights
- A very strong pruning condition
43
![Page 44: Efficient Approximate Search on String Collections Part I Marios Hadjieleftheriou Chen Li 1.](https://reader036.fdocuments.us/reader036/viewer/2022062318/551c4bba5503469d6a8b495d/html5/thumbnails/44.jpg)
Pruning using normalized weights
- Sort inverted lists based on string weights
- Search within a small weight range
- Shown to be effective (> 90% candidates pruned)
44
![Page 45: Efficient Approximate Search on String Collections Part I Marios Hadjieleftheriou Chen Li 1.](https://reader036.fdocuments.us/reader036/viewer/2022062318/551c4bba5503469d6a8b495d/html5/thumbnails/45.jpg)
Variable-length grams (VGRAM) [LWY07,YWL08]
45
Next…
![Page 46: Efficient Approximate Search on String Collections Part I Marios Hadjieleftheriou Chen Li 1.](https://reader036.fdocuments.us/reader036/viewer/2022062318/551c4bba5503469d6a8b495d/html5/thumbnails/46.jpg)
2-grams -> 3-grams?
Query: “shtick”, ED(shtick, ?) ≤ 1
Query 3-grams: sht hti tic ick
# of common grams >= 1

| id | string |
| --- | --- |
| 0 | rich |
| 1 | stick |
| 2 | stich |
| 3 | stuck |
| 4 | static |

3-gram inverted lists:

| 3-gram | string ids |
| --- | --- |
| ati | 4 |
| ich | 0, 2 |
| ick | 1 |
| ric | 0 |
| sta | 4 |
| sti | 1, 2 |
| stu | 3 |
| tat | 4 |
| tic | 1, 2, 4 |
| tuc | 3 |
| uck | 3 |
46
![Page 47: Efficient Approximate Search on String Collections Part I Marios Hadjieleftheriou Chen Li 1.](https://reader036.fdocuments.us/reader036/viewer/2022062318/551c4bba5503469d6a8b495d/html5/thumbnails/47.jpg)
Observation 1: dilemma of choosing “q”
Increasing “q” causes:
- Longer grams
- Shorter lists
- Smaller # of common grams of similar strings
[Figure: the strings rich, stick, stich, stuck, static and their 2-gram inverted lists]
47
![Page 48: Efficient Approximate Search on String Collections Part I Marios Hadjieleftheriou Chen Li 1.](https://reader036.fdocuments.us/reader036/viewer/2022062318/551c4bba5503469d6a8b495d/html5/thumbnails/48.jpg)
Observation 2: skew distributions of gram frequencies
- DBLP: 276,699 article titles
- Popular 5-grams: ation (>114K times), tions, ystem, catio
48
![Page 49: Efficient Approximate Search on String Collections Part I Marios Hadjieleftheriou Chen Li 1.](https://reader036.fdocuments.us/reader036/viewer/2022062318/551c4bba5503469d6a8b495d/html5/thumbnails/49.jpg)
VGRAM: Main idea
Grams with variable lengths (between qmin and qmax)
- zebra: ze(123)
- corrasion: co(5213), cor(859), corr(171)
Advantages
- Reduce index size
- Reduce running time
- Adoptable by many algorithms
49
![Page 50: Efficient Approximate Search on String Collections Part I Marios Hadjieleftheriou Chen Li 1.](https://reader036.fdocuments.us/reader036/viewer/2022062318/551c4bba5503469d6a8b495d/html5/thumbnails/50.jpg)
Challenges
- Generating variable-length grams?
- Constructing a high-quality gram dictionary?
- Relationship between string similarity and their gram-set similarity?
- Adopting VGRAM in existing algorithms?
50
![Page 51: Efficient Approximate Search on String Collections Part I Marios Hadjieleftheriou Chen Li 1.](https://reader036.fdocuments.us/reader036/viewer/2022062318/551c4bba5503469d6a8b495d/html5/thumbnails/51.jpg)
Challenge 1: String -> Variable-length grams?
Fixed-length 2-grams: u n i v e r s a l
Variable-length grams: u n i v e r s a l
[2,4]-gram dictionary: ni, ivr, sal, uni, vers
51
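One way to decompose a string with a gram dictionary is sketched below; it is an illustration under stated assumptions, not necessarily the exact procedure of [LWY07]. At each position the longest dictionary gram starting there is taken, falling back to the qmin-gram when none matches, and a gram is kept only if it is not fully covered by a gram kept earlier. The dictionary {ni, ivr, sal, uni, vers} is read from the slide text, and positions are 0-based.

```python
def vgen(s, dictionary, qmin, qmax):
    """Decompose s into positional variable-length grams (position, gram)."""
    grams = []
    max_end = 0  # furthest end position covered by a kept gram
    for p in range(len(s) - qmin + 1):
        gram = None
        # Longest dictionary gram starting at position p, if any.
        for q in range(min(qmax, len(s) - p), qmin - 1, -1):
            if s[p:p + q] in dictionary:
                gram = s[p:p + q]
                break
        if gram is None:
            gram = s[p:p + qmin]  # fall back to the shortest allowed gram
        end = p + len(gram)
        # Earlier grams start at or before p, so the new gram is contained
        # in one of them exactly when it does not extend past max_end.
        if end > max_end:
            grams.append((p, gram))
            max_end = end
    return grams


dictionary = {"ni", "ivr", "sal", "uni", "vers"}
print(vgen("universal", dictionary, 2, 4))
```

Under these assumptions "universal" yields four grams (uni, iv, vers, sal) instead of the eight fixed-length 2-grams, which is where the index-size reduction comes from.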
![Page 52: Efficient Approximate Search on String Collections Part I Marios Hadjieleftheriou Chen Li 1.](https://reader036.fdocuments.us/reader036/viewer/2022062318/551c4bba5503469d6a8b495d/html5/thumbnails/52.jpg)
Representing gram dictionary as a trie
[Figure: trie over the gram dictionary {ni, ivr, sal, uni, vers}]
52
![Page 53: Efficient Approximate Search on String Collections Part I Marios Hadjieleftheriou Chen Li 1.](https://reader036.fdocuments.us/reader036/viewer/2022062318/551c4bba5503469d6a8b495d/html5/thumbnails/53.jpg)
53
Challenge 2: Constructing a gram dictionary
qmin = 2, qmax = 4
- Frequency-based [LWY07]
- Cost-based [YWL08]
![Page 54: Efficient Approximate Search on String Collections Part I Marios Hadjieleftheriou Chen Li 1.](https://reader036.fdocuments.us/reader036/viewer/2022062318/551c4bba5503469d6a8b495d/html5/thumbnails/54.jpg)
Challenge 3: Edit operation’s effect on grams
k operations could affect k * q grams
u n i v e r s a l (fixed length: q)
54
![Page 55: Efficient Approximate Search on String Collections Part I Marios Hadjieleftheriou Chen Li 1.](https://reader036.fdocuments.us/reader036/viewer/2022062318/551c4bba5503469d6a8b495d/html5/thumbnails/55.jpg)
Deletion affects variable-length grams
Deletion at position i
Affected: grams between positions i-qmax+1 and i+qmax-1
Not affected: grams outside this range
55
![Page 56: Efficient Approximate Search on String Collections Part I Marios Hadjieleftheriou Chen Li 1.](https://reader036.fdocuments.us/reader036/viewer/2022062318/551c4bba5503469d6a8b495d/html5/thumbnails/56.jpg)
Main idea
- For a string, for each position, compute the number of grams that could be destroyed by an operation at this position
- Compute number of grams possibly destroyed by k operations
- Store these numbers (for all data strings) as part of the index
Vector of s = <2,4,6,8,9>
With 2 edit operations, at most 4 grams can be affected
Use this number to do count filtering
56
![Page 57: Efficient Approximate Search on String Collections Part I Marios Hadjieleftheriou Chen Li 1.](https://reader036.fdocuments.us/reader036/viewer/2022062318/551c4bba5503469d6a8b495d/html5/thumbnails/57.jpg)
Summary of VGRAM index
57
![Page 58: Efficient Approximate Search on String Collections Part I Marios Hadjieleftheriou Chen Li 1.](https://reader036.fdocuments.us/reader036/viewer/2022062318/551c4bba5503469d6a8b495d/html5/thumbnails/58.jpg)
Challenge 4: adopting VGRAM
Easily adoptable by many algorithms
Basic interfaces:
- String s -> grams
- Strings s1, s2 such that ed(s1,s2) <= k -> min # of their common grams
58
![Page 59: Efficient Approximate Search on String Collections Part I Marios Hadjieleftheriou Chen Li 1.](https://reader036.fdocuments.us/reader036/viewer/2022062318/551c4bba5503469d6a8b495d/html5/thumbnails/59.jpg)
Lower bound on # of common grams
If ed(s1,s2) <= k, then their # of common grams >=:
- Fixed length (q): (|s1| - q + 1) - k * q
- Variable lengths: # of grams of s1 - NAG(s1,k)
59
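The two bounds can be put side by side in a small sketch. The NAG vector is assumed to be precomputed and stored with the string, with entry k-1 holding the maximum number of grams that k edit operations can affect; the function names and the gram count of 9 in the second example are hypothetical.

```python
def fixed_length_bound(length, q, k):
    """Lower bound on common grams for fixed-length q-grams."""
    return (length - q + 1) - k * q


def vgram_bound(num_grams, nag_vector, k):
    """Lower bound for variable-length grams: # of grams minus NAG(s, k)."""
    return num_grams - nag_vector[k - 1]


# "shtick" with 2-grams and k = 1: 5 grams, at most 2 affected.
print(fixed_length_bound(len("shtick"), 2, 1))  # 3

# NAG vector <2,4,6,8,9> from the slide, for a hypothetical string with 9 grams:
# 2 edit operations affect at most 4 grams, leaving at least 5 in common.
print(vgram_bound(9, [2, 4, 6, 8, 9], 2))  # 5
```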
![Page 60: Efficient Approximate Search on String Collections Part I Marios Hadjieleftheriou Chen Li 1.](https://reader036.fdocuments.us/reader036/viewer/2022062318/551c4bba5503469d6a8b495d/html5/thumbnails/60.jpg)
Example: algorithm using inverted lists
Query: “shtick”, ED(shtick, ?)≤1
2-grams: lower bound = 3
2-4 grams: query grams sh ht tick, lower bound = 1
[Figure: inverted lists of the 2-gram index (…, ck, ic, …, ti, …) and of the 2-4 gram index (…, ck, ic, ich, …, tic, tick, …) over the strings 0 rich, 1 stick, 2 stich, 3 stuck, 4 static]
60
![Page 61: Efficient Approximate Search on String Collections Part I Marios Hadjieleftheriou Chen Li 1.](https://reader036.fdocuments.us/reader036/viewer/2022062318/551c4bba5503469d6a8b495d/html5/thumbnails/61.jpg)
End of part I
Part I
- Motivation
- Preliminaries
- Trie-based approach
- Gram-based algorithms
Part II
- Sketch-based algorithms
- Compression
- Selectivity estimation
- Transformations/Synonyms
- Conclusion
61