Notes 06: Efficient Fuzzy Search
Professor Chen Li
Department of Computer Science
UC Irvine
CS122B: Projects in Databases and Web Applications Spring 2015

Example: a movie database

  Star             Title            Year  Genre
  Keanu Reeves     The Matrix       1999  Sci-Fi
  Samuel Jackson   Iron Man         2008  Sci-Fi
  Schwarzenegger   The Terminator   1984  Sci-Fi
  Samuel Jackson   The Man          2006  Crime

Find movies starring "Schwarrzenger".

Problem definition: approximate string searches

Query q: Schwarrzenger
Collection of strings s (Star):
  Keanu Reeves
  Samuel Jackson
  Schwarzenegger
  …

Search output: strings s that satisfy Sim(q,s) ≤ δ
  (≤ applies when Sim is a distance such as edit distance; for a similarity such as Jaccard or cosine the condition is Sim(q,s) ≥ δ)
Sim functions: edit distance, Jaccard coefficient, and cosine similarity

Similarity Functions
"Similar to": a domain-specific function that returns a similarity value between two strings
Examples:
  Edit distance
  Hamming distance
  Jaccard similarity
  Soundex
  TF/IDF, BM25, DICE

Edit Distance
A widely used metric to define string similarity
ed(s1,s2) = minimum # of operations (insertion, deletion, substitution) to change s1 to s2
Example:
  s1: Tom Hanks
  s2: Ton Hank
  ed(s1,s2) = 2
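
The definition above can be computed with the standard dynamic-programming recurrence. The following is a minimal sketch, not taken from the slides; the function name edit_distance is chosen here for illustration.

```python
def edit_distance(s1, s2):
    """Minimum number of insertions, deletions, and substitutions
    needed to change s1 into s2 (row-by-row dynamic programming)."""
    prev = list(range(len(s2) + 1))  # distances from "" to each prefix of s2
    for i, a in enumerate(s1, 1):
        cur = [i]  # distance from s1[:i] to ""
        for j, b in enumerate(s2, 1):
            cur.append(min(
                prev[j] + 1,             # delete a from s1
                cur[j - 1] + 1,          # insert b into s1
                prev[j - 1] + (a != b),  # substitute a with b (free if equal)
            ))
        prev = cur
    return prev[-1]


print(edit_distance("Tom Hanks", "Ton Hank"))  # 2, as on the slide
```

The sketch keeps only two rows of the table, so memory is proportional to the length of s2.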

State-of-the-art: Oracle 10g and older versions
Supported by Oracle Text

    CREATE TABLE engdict(word VARCHAR(20), len INT);

Create preferences for text indexing:

    begin
      ctx_ddl.create_preference('STEM_FUZZY_PREF', 'BASIC_WORDLIST');
      ctx_ddl.set_attribute('STEM_FUZZY_PREF','FUZZY_MATCH','ENGLISH');
      ctx_ddl.set_attribute('STEM_FUZZY_PREF','FUZZY_SCORE','0');
      ctx_ddl.set_attribute('STEM_FUZZY_PREF','FUZZY_NUMRESULTS','5000');
      ctx_ddl.set_attribute('STEM_FUZZY_PREF','SUBSTRING_INDEX','TRUE');
      ctx_ddl.set_attribute('STEM_FUZZY_PREF','STEMMER','ENGLISH');
    end;
    /

    CREATE INDEX fuzzy_stem_subst_idx ON engdict ( word )
      INDEXTYPE IS ctxsys.context PARAMETERS ('Wordlist STEM_FUZZY_PREF');

Usage:

    SELECT * FROM engdict
    WHERE CONTAINS(word, 'fuzzy(universisty, 70, 6, weight)', 1) > 0;

Limitation: cannot handle errors in the first letters:
  Katherine versus Catherine

Microsoft SQL Server
Data cleaning tools available in SQL Server 2005
Part of Integration Services
Supports fuzzy lookups
Uses data flow pipeline of transformations
Similarity function: tokens with TF/IDF scores

Lucene
Using Levenshtein Distance (Edit Distance).
Example: roam~0.8
Prefix pruning followed by a scan (Efficiency?)

Outline
  Gram-based approaches
  Trie-based approaches

String Grams
q-grams
For example: 2-grams of "universal"
  u n i v e r s a l
  (un),(ni),(iv),(ve),(er),(rs),(sa),(al)
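
As a small illustration of the slide above, the q-grams of a string can be produced with a sliding window. This is a sketch; the name qgrams is an assumption made here, and no padding characters are added at the ends of the string, which matches the slide's example.

```python
def qgrams(s, q=2):
    """All substrings of length q, in order of position (sliding window)."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


print(qgrams("universal"))
# ['un', 'ni', 'iv', 've', 'er', 'rs', 'sa', 'al']
```

A string of length n has n - q + 1 grams, so "universal" (9 letters) has 8 2-grams.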

Inverted lists
Convert strings to gram inverted lists

  id  string
  0   rich
  1   stick
  2   stich
  3   stuck
  4   static

  2-gram  ids
  at      4
  ch      0, 2
  ck      1, 3
  ic      0, 1, 2, 4
  ri      0
  st      1, 2, 3, 4
  ta      4
  ti      1, 2, 4
  tu      3
  uc      3
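
The conversion from strings to gram inverted lists might be sketched as follows. The function name build_inverted_lists is hypothetical, and the sketch assumes each string id is recorded at most once per gram, even if the gram repeats inside the string.

```python
from collections import defaultdict


def build_inverted_lists(strings, q=2):
    """Map each q-gram to the ascending list of ids of strings containing it."""
    lists = defaultdict(list)
    for sid, s in enumerate(strings):
        grams = {s[i:i + q] for i in range(len(s) - q + 1)}  # distinct grams
        for g in grams:
            lists[g].append(sid)  # ids arrive in ascending order
    return dict(lists)


index = build_inverted_lists(["rich", "stick", "stich", "stuck", "static"])
for gram in sorted(index):
    print(gram, index[gram])
```

Because ids are assigned in increasing order while scanning the collection, every list comes out sorted, which the merge algorithms later in these notes rely on.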

Main Example

Data
  id  string
  0   rich
  1   stick
  2   stich
  3   stuck
  4   static

Query: stick
Grams: (st,ti,ic,ck)
Condition: ed(s,q) ≤ 1

Gram inverted lists
  ck  1, 3
  ic  0, 1, 2, 4
  st  1, 2, 3, 4
  ta  4
  ti  1, 2, 4
  …

Merge the lists of the query's grams
Candidates: ids with count ≥ 2
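
The example can be traced end to end with a short sketch: count how many of the query's gram lists each id appears on, keep ids that reach the threshold, then verify the candidates with the real edit distance. The threshold formula used below, (number of query grams) - q * k, is the commonly used count bound (one edit operation destroys at most q grams); it is not stated on the slide, although it yields the slide's count ≥ 2 for this query. All function names are assumptions, and the sketch ignores repeated grams.

```python
from collections import Counter, defaultdict


def qgrams(s, q=2):
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def edit_distance(s1, s2):
    prev = list(range(len(s2) + 1))
    for i, a in enumerate(s1, 1):
        cur = [i]
        for j, b in enumerate(s2, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (a != b)))
        prev = cur
    return prev[-1]


def fuzzy_search(strings, query, k=1, q=2):
    """Return (candidate ids after the count filter, verified answer ids)."""
    lists = defaultdict(list)
    for sid, s in enumerate(strings):
        for g in set(qgrams(s, q)):
            lists[g].append(sid)

    query_grams = qgrams(query, q)
    threshold = len(query_grams) - q * k  # assumed count bound, see lead-in
    if threshold <= 0:
        candidates = list(range(len(strings)))  # filter cannot prune anything
    else:
        counts = Counter()
        for g in query_grams:
            counts.update(lists.get(g, []))
        candidates = sorted(sid for sid, c in counts.items() if c >= threshold)

    answers = [sid for sid in candidates
               if edit_distance(strings[sid], query) <= k]
    return candidates, answers


data = ["rich", "stick", "stich", "stuck", "static"]
print(fuzzy_search(data, "stick", k=1))
```

The count filter only produces candidates; a candidate such as "static" shares enough grams with the query but is removed by the verification step because its edit distance to "stick" exceeds 1.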

Problem definition:
Given lists of ids, each sorted in ascending order, merge the lists to find elements whose occurrences ≥ T

Example T = 4

  List 1: 1, 3, 5, 10, 13
  List 2: 10, 13, 15
  List 3: 5, 7, 13
  List 4: 13
  List 5: 15

Result: 13

Five Merge Algorithms
  HeapMerger [Sarawagi, SIGMOD 2004]
  MergeOpt [Sarawagi, SIGMOD 2004]
  ScanCount
  MergeSkip
  DivideSkip

Heap-based Algorithm
Push the current (frontier) element of each list to a min-heap
Count the # of occurrences of each element as equal elements are popped from the heap
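
A minimal sketch of the heap-based merge, assuming every list is sorted in strictly ascending order (no id repeats within a list). The name heap_merge is chosen here for illustration.

```python
import heapq


def heap_merge(lists, T):
    """Elements that occur on at least T of the ascending lists."""
    # Each heap entry is (element, list index, position in that list).
    heap = [(lst[0], i, 0) for i, lst in enumerate(lists) if lst]
    heapq.heapify(heap)
    result = []
    while heap:
        top = heap[0][0]
        count = 0
        # Pop every copy of the smallest element and advance those lists.
        while heap and heap[0][0] == top:
            _, i, pos = heapq.heappop(heap)
            count += 1
            if pos + 1 < len(lists[i]):
                heapq.heappush(heap, (lists[i][pos + 1], i, pos + 1))
        if count >= T:
            result.append(top)
    return result


lists = [[1, 3, 5, 10, 13], [10, 13, 15], [5, 7, 13], [13], [15]]
print(heap_merge(lists, 4))  # [13]
```

Every element of every list passes through the heap once, which is the cost the later algorithms try to avoid.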
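
MergeOpt Algorithm
Long lists: the T-1 longest lists
Short lists: the remaining lists
Merge the short lists, then look up each candidate in the long lists by binary search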

Example of MergeOpt [Sarawagi et al 2004]
Count threshold T ≥ 4

Long Lists: 3
  1, 3, 5, 10, 13
  10, 13, 15
  5, 7, 13

Short Lists: 2
  13
  15
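
The MergeOpt idea can be sketched as below. This is an illustration under stated assumptions rather than the published implementation: the short lists are merged with heapq.merge from the standard library, and the names merge_opt and contains are hypothetical. Lists are assumed to be sorted in strictly ascending order.

```python
from bisect import bisect_left
from heapq import merge
from itertools import groupby


def contains(sorted_list, x):
    """Binary search for x in an ascending list."""
    i = bisect_left(sorted_list, x)
    return i < len(sorted_list) and sorted_list[i] == x


def merge_opt(lists, T):
    """Elements on at least T lists; only the short lists are scanned."""
    by_len = sorted(lists, key=len, reverse=True)
    long_lists, short_lists = by_len[:T - 1], by_len[T - 1:]
    result = []
    # An answer must appear on at least one short list, because there are
    # only T-1 long lists. So candidates come from merging the short lists.
    for x, group in groupby(merge(*short_lists)):
        total = sum(1 for _ in group)  # occurrences on short lists
        total += sum(contains(lst, x) for lst in long_lists)
        if total >= T:
            result.append(x)
    return result


lists = [[1, 3, 5, 10, 13], [10, 13, 15], [5, 7, 13], [13], [15]]
print(merge_opt(lists, 4))  # [13]
```

With T = 4 the three longest lists are only probed by binary search; the candidates 13 and 15 come from the two short lists, and only 13 reaches the threshold.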

ScanCount Example
Count threshold T ≥ 4

  List 1: 1, 3, 5, 10, 13
  List 2: 10, 13, 15
  List 3: 5, 7, 13
  List 4: 13
  List 5: 15

Scan each list; for every string id on it, increment that id's counter by 1

  String ids         1   2   3   …   13  14  15
  # of occurrences   1   0   1   …   4   0   2

Result! String id 13 (count 4)
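
ScanCount needs only an array of counters indexed by string id. A minimal sketch follows; the parameter num_ids (one more than the largest string id) is an assumption introduced here, since the slide does not say how the array is sized.

```python
def scan_count(lists, T, num_ids):
    """Scan every list once, incrementing a per-id counter."""
    counts = [0] * num_ids
    result = []
    for lst in lists:
        for sid in lst:
            counts[sid] += 1
            if counts[sid] == T:  # report an id the moment it reaches T
                result.append(sid)
    return sorted(result)


lists = [[1, 3, 5, 10, 13], [10, 13, 15], [5, 7, 13], [13], [15]]
print(scan_count(lists, 4, num_ids=16))  # [13]
```

No heap and no sorted order are required, at the price of a counter array as large as the string collection.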
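
MergeSkip algorithm
Keep the current element of each list in a min-heap
If the top element occurs fewer than T times, pop T-1 elements from the heap
Jump: in each popped list, move to the smallest element greater than or equal to the new top of the heap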

Example of MergeSkip
Count threshold T ≥ 4
[Figure: MergeSkip run on example lists, showing the min-heap and the jump step]

Skip is safe
A skipped element can occur only on the T-1 popped lists
So the # of occurrences of skipped elements ≤ T-1, and none of them can be an answer
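
The pop-and-jump step can be sketched as follows, assuming lists sorted in strictly ascending order and T ≥ 1. This is an illustrative reading of the slides, and the name merge_skip is chosen here.

```python
import heapq
from bisect import bisect_left


def merge_skip(lists, T):
    """Elements on at least T lists, skipping elements that cannot qualify."""
    heap = [(lst[0], i, 0) for i, lst in enumerate(lists) if lst]
    heapq.heapify(heap)
    result = []
    # With fewer than T unfinished lists, nothing else can reach count T.
    while len(heap) >= T:
        top = heap[0][0]
        popped = []
        while heap and heap[0][0] == top:
            popped.append(heapq.heappop(heap))
        if len(popped) >= T:
            result.append(top)
            for _, i, pos in popped:  # advance each popped list by one
                if pos + 1 < len(lists[i]):
                    heapq.heappush(heap, (lists[i][pos + 1], i, pos + 1))
        else:
            # Pop more entries until T-1 lists are popped in total.
            while heap and len(popped) < T - 1:
                popped.append(heapq.heappop(heap))
            if not heap:
                break
            target = heap[0][0]  # new top of the heap
            for _, i, pos in popped:
                # Jump to the first element >= target in each popped list.
                new_pos = bisect_left(lists[i], target, pos)
                if new_pos < len(lists[i]):
                    heapq.heappush(heap, (lists[i][new_pos], i, new_pos))
    return result


lists = [[1, 3, 5, 10, 13], [10, 13, 15], [5, 7, 13], [13], [15]]
print(merge_skip(lists, 4))  # [13]
```

On this input the first round pops 1, 5, and 10, sees 13 on top of the heap, and jumps all three popped lists straight to 13, so elements 3 and 7 are never pushed to the heap.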

DivideSkip Algorithm
Short lists: merge using MergeSkip
Long lists: look up each candidate using binary search

How many lists are treated as long lists?
  Short lists: merge
  Long lists: lookup
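
A sketch of DivideSkip that leaves the open question above as a parameter: num_long is a hypothetical argument for the number of long lists and must be smaller than T. The short lists are merged by MergeSkip with the reduced threshold T - num_long, and each surviving candidate is then probed in the long lists. Function names are assumptions, and lists are assumed strictly ascending.

```python
import heapq
from bisect import bisect_left


def merge_skip_counts(lists, T):
    """MergeSkip that returns (element, count) for elements on >= T lists."""
    heap = [(lst[0], i, 0) for i, lst in enumerate(lists) if lst]
    heapq.heapify(heap)
    result = []
    while len(heap) >= T:
        top = heap[0][0]
        popped = []
        while heap and heap[0][0] == top:
            popped.append(heapq.heappop(heap))
        if len(popped) >= T:
            result.append((top, len(popped)))
            target = top + 1  # advance past the reported element
        else:
            while heap and len(popped) < T - 1:
                popped.append(heapq.heappop(heap))
            if not heap:
                break
            target = heap[0][0]  # jump target: new top of the heap
        for _, i, pos in popped:
            new_pos = bisect_left(lists[i], target, pos)
            if new_pos < len(lists[i]):
                heapq.heappush(heap, (lists[i][new_pos], i, new_pos))
    return result


def divide_skip(lists, T, num_long):
    """MergeSkip on short lists, binary-search lookup on long lists."""
    assert 0 <= num_long < T, "num_long is assumed to be smaller than T"
    by_len = sorted(lists, key=len, reverse=True)
    long_lists, short_lists = by_len[:num_long], by_len[num_long:]
    result = []
    for x, count in merge_skip_counts(short_lists, T - num_long):
        for lst in long_lists:
            i = bisect_left(lst, x)
            if i < len(lst) and lst[i] == x:
                count += 1
        if count >= T:
            result.append(x)
    return result


lists = [[1, 3, 5, 10, 13], [10, 13, 15], [5, 7, 13], [13], [15]]
print(divide_skip(lists, 4, num_long=2))  # [13]
```

Setting num_long to 0 reduces the sketch to plain MergeSkip, and setting it to T-1 gives the MergeOpt split; values in between trade merge work on short lists against lookups in long lists.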

Performance (DBLP)
DivideSkip performs best among the five merge algorithms

Trie-Based Approach
Trie Indexing

Strings
  exam
  example
  exemplar
  exempt
  sample

[Figure: trie of the strings above; each string ends at a leaf node marked $]
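
A trie over these strings can be sketched with nested dictionaries, one dictionary per node, using "$" as the end-of-string leaf as in the figure. The function names build_trie and words_under are assumptions made for this illustration.

```python
def build_trie(words):
    """Nested-dict trie; the key "$" marks the end of a stored string."""
    root = {}
    for w in words:
        node = root
        for ch in w:
            node = node.setdefault(ch, {})  # follow or create the child
        node["$"] = {}
    return root


def words_under(node, prefix=""):
    """All stored strings below a node, in lexicographic order."""
    found = []
    for ch, child in sorted(node.items()):
        if ch == "$":
            found.append(prefix)
        else:
            found.extend(words_under(child, prefix + ch))
    return found


trie = build_trie(["exam", "example", "exemplar", "exempt", "sample"])
print(sorted(trie))       # ['e', 's']
print(words_under(trie))  # ['exam', 'example', 'exemplar', 'exempt', 'sample']
```

Strings that share a prefix share nodes: "exam" and "example" differ only below the node for "exam", and "exemplar" and "exempt" split after "exemp".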

Active nodes on Trie
Query: “example”
Edit-distance threshold = 2

  Prefix    Distance
  examp     2
  exampl    1
  example   0
  exempl    2
  exempla   2
  sample    2

[Figure: the trie with each active node labeled by its distance]

Initialization
Q = ε
Initial active nodes: all nodes within depth δ

  Prefix  Distance
  ε       0
  e       1
  ex      2
  s       1
  sa      2

[Figure: the trie with the initial active nodes labeled by their depth]

Incremental Algorithm
Return leaf nodes as answers.
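
The initialization and the per-keystroke update might be sketched as below. This is one reading of the incremental idea, not the authors' code: trie nodes are identified by their prefix strings, the new active set is derived only from the previous one, and answers are taken to be the stored strings (leaf nodes) below any active node. All names are hypothetical.

```python
def build_prefix_trie(words):
    """Map every trie node (identified by its prefix) to its child letters."""
    children = {"": set()}
    for w in words:
        for i, ch in enumerate(w):
            children[w[:i]].add(ch)
            children.setdefault(w[:i + 1], set())
    return children


def initial_active_nodes(children, delta):
    """For the empty query, every node within depth delta is active."""
    return {p: len(p) for p in children if len(p) <= delta}


def extend_active_nodes(children, active, c, delta):
    """Active nodes after the user types letter c."""
    new = {}

    def offer(prefix, dist):
        # Keep the smallest distance seen for each node, if within delta.
        if dist <= delta and dist < new.get(prefix, delta + 1):
            new[prefix] = dist

    def offer_descendants(prefix, dist):
        # After a match, deeper nodes cost one extra edit per letter.
        if dist + 1 > delta:
            return
        for ch in children[prefix]:
            offer(prefix + ch, dist + 1)
            offer_descendants(prefix + ch, dist + 1)

    for prefix, dist in active.items():
        offer(prefix, dist + 1)                # c unmatched in the query
        for ch in children[prefix]:
            if ch == c:
                offer(prefix + ch, dist)       # match
                offer_descendants(prefix + ch, dist)
            else:
                offer(prefix + ch, dist + 1)   # substitution
    return new


def fuzzy_prefix_search(words, query, delta):
    children = build_prefix_trie(words)
    active = initial_active_nodes(children, delta)
    for c in query:  # one incremental step per keystroke
        active = extend_active_nodes(children, active, c, delta)
    answers = sorted(w for w in words if any(w.startswith(p) for p in active))
    return active, answers


words = ["exam", "example", "exemplar", "exempt", "sample"]
active, answers = fuzzy_prefix_search(words, "example", delta=2)
print(sorted(active.items()))
print(answers)
```

For the query "example" with threshold 2, the sketch ends with the six active nodes listed on the "Active nodes on Trie" slide, and the empty query yields the five initial active nodes from the "Initialization" slide.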

Good and bad
Advantages:
  Trie size is small
  Can do search as the user types
Disadvantages:
  Works for edit distance only

References
1. Efficient Merging and Filtering Algorithms for Approximate String Searches, Chen Li, Jiaheng Lu, and Yiming Lu. ICDE 2008
2. Efficient Interactive Fuzzy Keyword Search, Shengyue Ji, Guoliang Li, Chen Li, and Jianhua Feng. WWW 2009