Large-scale Similarity Join with Edit-distance Constraints
By Yu Haiyang
23/4/21 http://datamining.xmu.edu.cn
Outline
Background
The introduction of Pass-Join-K
Combining Pass-Join-K with Hadoop
Background
Similarity join: find all similar pairs from two sets.
Applications:
Data cleaning
Query relaxation
Spellchecking
Background
How to define similarity?
Jaccard distance (bag-of-words model)
Cosine distance
Edit distance
Background
Edit distance
The minimum number of edit
operations (insertion, deletion, and
substitution) to transform one string to
another.
Baby -> Body (substitution)
Bod -> Body (insertion)
Background
How does edit distance compare with the other two?
Accuracy: {“abcdefg”, “gfedcba”} (same characters, different order)
Verification time: rises from O(m+n) to O(mn)
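The accuracy example can be checked with a small sketch. It assumes single characters are the tokens of the bag model (the slide does not say which tokens are used), and `jaccard` and `cosine` are illustrative names. Both bag-based measures report the pair as identical, although turning one string into the other takes 6 edit operations.

```python
from collections import Counter
from math import sqrt

def jaccard(a: str, b: str) -> float:
    """Jaccard similarity over the sets of characters in a and b."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)

def cosine(a: str, b: str) -> float:
    """Cosine similarity over character-count vectors."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[ch] * cb[ch] for ch in ca)
    norm = sqrt(sum(v * v for v in ca.values())) * sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0

pair = ("abcdefg", "gfedcba")
print(jaccard(*pair))  # the two character sets are equal
print(cosine(*pair))   # the two count vectors are equal
```

Order information is lost as soon as a string is reduced to a bag, which is the price of the cheaper O(m+n) check.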
Background
Find similar pairs
We have two string sets: one is {vldb, sigmod, …}, the other is {pvldb, icde, …}.
Find some candidate pairs, and then verify these pairs.
{<vldb,pvldb>, <vldb,icde>, <vldb,..>, <sigmod,pvldb>, <sigmod,icde>, …}
<vldb,pvldb> Yes
<vldb,icde> No
Background
So we have to:
Find candidate pairs. There are O(N^2) of them if we do not prune any.
Verify these pairs. Each verification costs O(mn).
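As a baseline, the two steps can be sketched without any pruning: every pair is a candidate and every candidate is verified. The threshold `tau=1` and the function names are assumptions for illustration; the slides do not give a threshold for this example.

```python
from itertools import product

def edit_distance(a: str, b: str) -> int:
    """O(mn) dynamic program that keeps only two rows of the table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]

def naive_join(R, S, tau):
    """Enumerate all |R| * |S| pairs and verify each one."""
    return [(r, s) for r, s in product(R, S) if edit_distance(r, s) <= tau]

R = ["vldb", "sigmod"]
S = ["pvldb", "icde"]
print(naive_join(R, S, tau=1))  # [('vldb', 'pvldb')]
```

Four verifications are needed for one result here; the pruning techniques that follow aim to skip most of them.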
Outline
Background
The introduction of Pass-Join-K
Combining Pass-Join-K with Hadoop
Introduction of Pass-Join-K
Partition-based pruning technique
Suppose the threshold tau = 2 and K = 1, and we have the pair <“abcde”, “ace”>.
Introduction of Pass-Join-K
Partition-based pruning technique
Suppose the threshold tau = 2 and K = 2, and we have the pair
<“abcdefghijk”, “abdefghk”>.
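The idea behind both examples is a pigeonhole argument: if one string is cut into tau + K segments, tau edit operations can damage at most tau of them, so a pair within the threshold must still share at least K intact segments. The sketch below counts shared segments. The segment splits are written by hand as an assumption (the figures with the actual splits are not in the transcript), and `matching_segments` and `survives` are illustrative names.

```python
def matching_segments(segments, s):
    """Return the segments that occur somewhere in s as a substring."""
    return [seg for seg in segments if seg in s]

def survives(segments, s, k):
    """A pair stays a candidate only if at least k segments match."""
    return len(matching_segments(segments, s)) >= k

# tau = 2, K = 1: "abcde" cut into tau + K = 3 segments.
print(survives(["ab", "cd", "e"], "ace", k=1))                  # True ("e" matches)

# tau = 2, K = 2: "abcdefghijk" cut into tau + K = 4 segments.
print(survives(["abc", "def", "ghi", "jk"], "abdefghk", k=2))   # False (only "def" matches)
```

The second pair is pruned correctly: the strings differ in length by 3, so their edit distance already exceeds tau = 2.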
Introduction of Pass-Join-K
Some obvious pruning techniques
Length-based: threshold = 2, <“ab”, “abcee”>
Shift-based: <“abcd”, “cdef”>
a b c d
c d e f
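Both rules can be sketched as lower bounds on the edit distance. The length rule follows directly from the example. For the shift rule the slide gives only the picture, so the bound used here (length gap of the parts to the left of the shared substring plus length gap of the parts from the shared substring onward) is one common reading, not necessarily the exact rule of Pass-Join-K. The function names are illustrative.

```python
def length_prune(r: str, s: str, tau: int) -> bool:
    """True when the length gap alone already exceeds tau."""
    return abs(len(r) - len(s)) > tau

def shift_prune(r: str, s: str, pos_r: int, pos_s: int, tau: int) -> bool:
    """True when aligning a shared substring found at pos_r in r and
    pos_s in s would force more than tau edits around it."""
    left_gap = abs(pos_r - pos_s)                             # prefixes differ in length
    right_gap = abs((len(r) - pos_r) - (len(s) - pos_s))      # so do the remainders
    return left_gap + right_gap > tau

print(length_prune("ab", "abcee", tau=2))   # True: lengths differ by 3

r, s = "abcd", "cdef"
# "cd" starts at index 2 in r and index 0 in s.
print(shift_prune(r, s, r.find("cd"), s.find("cd"), tau=2))   # True: 2 + 2 > 2
```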
Introduction of Pass-Join-K
Partition Scheme
We have seen that the longer the substrings are, the harder they are to match.
So we break the string into tau+k parts of nearly equal length: each part has
length floor(length/(tau+k)) or floor(length/(tau+k))+1.
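A minimal sketch of this even partition follows. It assumes the longer parts come first, which reproduces the split "abc def ghi jk" used on the later slides; the name `partition` is illustrative.

```python
def partition(s: str, n_parts: int) -> list:
    """Cut s into n_parts consecutive segments whose lengths differ by at most 1."""
    base, extra = divmod(len(s), n_parts)
    segments, start = [], 0
    for i in range(n_parts):
        size = base + (1 if i < extra else 0)   # the first `extra` parts get one more char
        segments.append(s[start:start + size])
        start += size
    return segments

tau, k = 3, 1
print(partition("abcdefghijk", tau + k))  # ['abc', 'def', 'ghi', 'jk']
```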
Introduction of Pass-Join-K
Partition Scheme
Introduction of Pass-Join-K
Substring Selection
Here we suppose tau = 3 and k = 1;
abc def ghi jk
a b d e f g h k
Introduction of Pass-Join-K
Substring Selection
Here we suppose tau = 3 and k = 1;
abc def ghi jk
a b d e f g h k
Introduction of Pass-Join-K
Substring Selection
Here we suppose tau = 3 and k = 1;
abc def ghi jk
a b d e f gh k
Introduction of Pass-Join-K
Substring Selection
Here we suppose tau = 3 and k = 1;
abc def ghi jk
abd efg hk
Introduction of Pass-Join-K
Substring Selection
Here we suppose tau = 3 and k = 1;
abc def ghi jk
a b d e f g h k
Introduction of Pass-Join-K
Substring Selection
So our goal is to reduce the number of substrings. For more pruning techniques,
please read our paper: 《Pass-Join-K多分段匹配的相似性连接算法》 (Pass-Join-K: A Similarity Join Algorithm Based on Multi-Segment Matching)
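One simple way to cut down the substrings is a position window: a segment that starts at position p in the indexed string can only match a substring of s that starts within tau positions of p. The sketch below uses that basic window as an assumption; it is looser than the selection rules in the paper, and `select_substrings` is an illustrative name. Positions are 0-based.

```python
def select_substrings(s: str, seg_start: int, seg_len: int, tau: int):
    """Substrings of s with length seg_len whose start lies in
    [seg_start - tau, seg_start + tau], returned as (start, substring)."""
    lo = max(0, seg_start - tau)
    hi = min(len(s) - seg_len, seg_start + tau)
    return [(start, s[start:start + seg_len]) for start in range(lo, hi + 1)]

s, tau = "abdefghk", 3
# Segments of "abcdefghijk": abc@0, def@3, ghi@6, jk@9
print(select_substrings(s, 0, 3, tau))  # 4 of the 6 length-3 substrings
print(select_substrings(s, 9, 2, tau))  # [(6, 'hk')], 1 of the 7 length-2 substrings
```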
Introduction of Pass-Join-K
Verification
DP (dynamic programming)
D(m,n) = min(D(m,n-1)+1, D(m-1,n)+1, D(m-1,n-1)+flag), where flag = 0 when
s_m = r_n and flag = 1 otherwise; s and r are the two strings, with
D(i,0) = i and D(0,j) = j.
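The recurrence can be written out as a full table, as a sketch of the verification step. The function names are illustrative, and this version fills the whole table rather than using any threshold-based shortcut.

```python
def edit_distance_table(s: str, r: str):
    """Fill D so that D[i][j] is the edit distance between s[:i] and r[:j]."""
    m, n = len(s), len(r)
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        D[i][0] = i                      # delete all i characters
    for j in range(n + 1):
        D[0][j] = j                      # insert all j characters
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            flag = 0 if s[i - 1] == r[j - 1] else 1
            D[i][j] = min(D[i][j - 1] + 1,         # insertion
                          D[i - 1][j] + 1,         # deletion
                          D[i - 1][j - 1] + flag)  # substitution or match
    return D

def verify(s: str, r: str, tau: int) -> bool:
    """True when the pair is within the edit-distance threshold."""
    return edit_distance_table(s, r)[len(s)][len(r)] <= tau

print(verify("abcde", "ace", tau=2))              # True: distance is 2
print(verify("abcdefghijk", "abdefghk", tau=2))   # False: distance is 3
```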
Introduction of Pass-Join-K
Verification
Here we suppose tau = 3 and k = 1;
abc def ghi jk
def e f g h k
Tau_left = 3
Tau_right = 3 - 3 = 0
Outline
Background
The introduction of Pass-Join-K
Combining Pass-Join-K with Hadoop
Combining Pass-Join-K with Hadoop
Big data
Big file
Large number of files
Combining Pass-Join-K with Hadoop
Inverted index tree in Hadoop
(abc,1,11,r,IFlag) (def,2,11,r,IFlag)
(ghi,3,11,r,IFlag) (jk,4,11,r,IFlag)
[Figure: inverted index tree. Root L11 (length 11); segments 1 to 4 hold abc, def, ghi, jk; each leaf points to string r.]
Combining Pass-Join-K with Hadoop
Substrings in Hadoop
Suppose tau = 3, k = 1, and s = “abdefghk”, length(s) = 8. We have to
generate records such as (a,1,5,s,SFlag), (a,2,6,s,SFlag), (a,3,7,s,SFlag),
(ab,1,8,s,SFlag), …, (ab,1,11,s,SFlag), …
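The way index records and substring records meet can be sketched in plain Python, with a dictionary standing in for the shuffle phase. This is not Hadoop API code. The key layout (string, segment number, indexed length), the longer-parts-first partition, and the names `map_index`, `map_probe` and `reduce_candidates` are all assumptions for illustration, and the probe side emits every substring of the right size rather than a pruned selection.

```python
from collections import defaultdict

def segment_sizes(length: int, n_parts: int):
    """Segment lengths for a string of the given length (longer parts first)."""
    base, extra = divmod(length, n_parts)
    return [base + (1 if i < extra else 0) for i in range(n_parts)]

def map_index(r: str, tau: int, k: int):
    """Emit one IFlag record per segment of the indexed string r."""
    start = 0
    for number, size in enumerate(segment_sizes(len(r), tau + k), start=1):
        yield (r[start:start + size], number, len(r)), ("IFlag", r)
        start += size

def map_probe(s: str, tau: int, k: int):
    """Emit SFlag records for every indexed length within tau of len(s)."""
    n_parts = tau + k
    for length in range(max(n_parts, len(s) - tau), len(s) + tau + 1):
        for number, size in enumerate(segment_sizes(length, n_parts), start=1):
            for start in range(len(s) - size + 1):
                yield (s[start:start + size], number, length), ("SFlag", s)

def reduce_candidates(records):
    """Group records by key, then pair IFlag strings with SFlag strings."""
    groups = defaultdict(lambda: {"IFlag": set(), "SFlag": set()})
    for key, (flag, string) in records:
        groups[key][flag].add(string)
    matches = defaultdict(set)           # (r, s) -> matched segment numbers
    for key, group in groups.items():
        for r in group["IFlag"]:
            for s in group["SFlag"]:
                matches[(r, s)].add(key[1])
    return matches

tau, k = 3, 1
r, s = "abcdefghijk", "abdefghk"
records = list(map_index(r, tau, k)) + list(map_probe(s, tau, k))
matches = reduce_candidates(records)
print(matches[(r, s)])                   # {2}: segment "def" matched
print(len(matches[(r, s)]) >= k)         # True: the pair goes on to verification
```

Grouping on the full key means a reducer only sees strings that share the same segment text, segment number and indexed length, so candidate generation needs no cross-reducer communication.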
Combining Pass-Join-K with Hadoop
Data flows in Hadoop
Combining Pass-Join-K with Hadoop
Big data
Big file
Large number of files
Combining Pass-Join-K with Hadoop
[segmentString, segmentNumber,
stringLength, FLAG], [DirNumber,
ID]