Large-scale Similarity Join with Edit-distance Constraints

By Yu Haiyang

23/4/21 http://datamining.xmu.edu.cn

Outline

Background

The introduction of Pass-Join-K

Combining Pass-Join-K with Hadoop


Background

Similarity join: find all similar pairs from two sets.

Data cleaning

Query relaxation

Spell checking


Background

How to define similarity?

Jaccard distance (bag-of-words model)

Cosine distance

Edit distance


Background

Edit distance

The minimum number of edit operations (insertion, deletion, and substitution) to transform one string to another.

Baby → Body (substitution)

Bod → Body (insertion)


Background

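How does the edit distance compare with the other two?

Accuracy: {“abcdefg”, “gfedcba”} contain exactly the same characters, so a bag-of-characters measure treats them as identical, while their edit distance is 6.

Verification time: O(m+n) for the other two measures, O(mn) for edit distance.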
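The accuracy point can be checked with a small sketch. It assumes the set-based measures are computed over single characters (the slides do not say which tokens are used), and the function names are illustrative only.

```python
import math
from collections import Counter


def jaccard_distance(a: str, b: str) -> float:
    """1 - |A and B| / |A or B| over the sets of characters (assumed tokens)."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 0.0
    return 1.0 - len(sa & sb) / len(sa | sb)


def cosine_distance(a: str, b: str) -> float:
    """1 - cosine similarity of the character-count vectors."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[ch] * cb[ch] for ch in ca)
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return 1.0 - dot / (norm_a * norm_b)


# Both measures ignore character order, so the reversed string looks identical.
print(jaccard_distance("abcdefg", "gfedcba"))
print(cosine_distance("abcdefg", "gfedcba"))
```

Both distances come out as zero (up to floating-point rounding for the cosine), although turning one string into the other needs six substitutions around the shared middle character.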


Background

Find similar pairs

We have two string sets: one is {vldb, sigmod, …}, the other is {pvldb, icde, …}.

Find some candidate pairs, and then verify these pairs.

{<vldb,pvldb>, <vldb,icde>, <vldb,..>, <sigmod,pvldb>, <sigmod,icde>, …}

<vldb,pvldb> Yes

<vldb,icde> No


Background

So we have to:

Find candidate pairs. There are O(N^2) of them if we do not prune any pairs.

Verify these pairs. Each verification costs O(mn) for strings of lengths m and n.
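As a baseline, the two steps can be sketched without any pruning: every pair is a candidate and every candidate is verified. The threshold tau = 1 and the function names are assumptions made for this illustration, not values taken from the slides.

```python
from itertools import product


def edit_distance(s: str, r: str) -> int:
    """O(mn) verification, keeping only two rows of the DP table."""
    prev = list(range(len(r) + 1))
    for i, cs in enumerate(s, start=1):
        cur = [i]
        for j, cr in enumerate(r, start=1):
            cur.append(min(cur[j - 1] + 1,            # insertion
                           prev[j] + 1,               # deletion
                           prev[j - 1] + (cs != cr)))  # substitution or match
        prev = cur
    return prev[-1]


def naive_join(R, S, tau):
    """Enumerate all |R| * |S| pairs and verify each one."""
    return [(r, s) for r, s in product(R, S) if edit_distance(r, s) <= tau]


R = ["vldb", "sigmod"]
S = ["pvldb", "icde"]
print(naive_join(R, S, tau=1))  # [('vldb', 'pvldb')]
```

The quadratic number of calls to the verifier is what the pruning techniques that follow try to avoid.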


Outline

Background

The introduction of Pass-Join-K

Combining Pass-Join-K with Hadoop


Introduction of Pass-Join-K

Partition-based pruning technique

Suppose the threshold tau = 2 and K = 1, and we have the pair <“abcde”, “ace”>.



Partition-based pruning technique

Suppose the threshold tau = 2 and K = 2, and we have the pair <“abcdefghijk”, “abdefghk”>.
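The idea behind the two examples can be sketched as a pigeonhole check: tau edit operations can damage at most tau of the tau + K segments, so at least K segments must still occur in the other string. In the sketch below the split of “abcde” into three segments is an assumption (the slide does not show it), and the function names are hypothetical.

```python
def count_matching_segments(segments, s):
    """Number of segments that occur somewhere in s as a substring."""
    return sum(1 for seg in segments if seg in s)


def survives_partition_filter(segments, s, k):
    """Keep the pair only if at least k segments occur in s."""
    return count_matching_segments(segments, s) >= k


# tau = 2, K = 1: 3 segments; only "e" occurs in "ace", which is enough.
print(survives_partition_filter(["ab", "cd", "e"], "ace", k=1))  # True

# tau = 2, K = 2: 4 segments; only "def" occurs in "abdefghk", so the pair is pruned.
print(survives_partition_filter(["abc", "def", "ghi", "jk"], "abdefghk", k=2))  # False
```

The second pair is pruned correctly: its edit distance is 3 (delete c, i and j), which exceeds the threshold of 2.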



Some obvious pruning techniques

Length-based: threshold = 2, <“ab”, “abcee”>. The lengths differ by 3 > 2, so the pair is pruned.

Shift-based: <“abcd”, “cdef”>

a b c d

c d e f
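Both filters can be sketched in a few lines. The shift check below is one possible reading of the slide, not necessarily the authors' exact rule: if a segment starting at position p in r is matched to a substring starting at position q in s, the text to the left needs at least |p - q| edits and the text to the right needs at least the difference of the remaining lengths. The threshold 2 is reused for the shift example as an assumption.

```python
def length_filter(r: str, s: str, tau: int) -> bool:
    """A pair can only be within tau if the lengths differ by at most tau."""
    return abs(len(r) - len(s)) <= tau


def shift_lower_bound(len_r: int, len_s: int, p: int, q: int) -> int:
    """Lower bound on the edits needed when r[p:] is aligned with s[q:]."""
    left = abs(p - q)                        # prefixes of length p and q
    right = abs((len_r - p) - (len_s - q))   # what remains from the match onward
    return left + right


def shift_filter(r: str, s: str, p: int, q: int, tau: int) -> bool:
    """Keep the match only if the lower bound does not exceed tau."""
    return shift_lower_bound(len(r), len(s), p, q) <= tau


print(length_filter("ab", "abcee", tau=2))            # False: lengths differ by 3
print(shift_filter("abcd", "cdef", p=2, q=0, tau=2))  # False: bound is 2 + 2 = 4
```

Positions are 0-based here; "cd" starts at index 2 in “abcd” and at index 0 in “cdef”.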



Partition Scheme

We have seen that the longer the substrings are, the harder they are to match.

So we break the string into tau+k parts, each of length ⌊length/(tau+k)⌋ or ⌊length/(tau+k)⌋+1.
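A minimal sketch of such an even partition follows. It assumes the longer segments come first, which reproduces the split “abc def ghi jk” used in the later examples; the function name is hypothetical.

```python
def even_partition(text: str, parts: int):
    """Split text into `parts` segments whose lengths differ by at most one.

    Returns (start, segment) pairs with 0-based start positions.
    Assumption: the first len(text) % parts segments get the extra character.
    """
    if parts <= 0 or len(text) < parts:
        raise ValueError("need 0 < parts <= len(text)")
    base, extra = divmod(len(text), parts)
    segments = []
    start = 0
    for i in range(parts):
        length = base + 1 if i < extra else base
        segments.append((start, text[start:start + length]))
        start += length
    return segments


tau, k = 3, 1
print(even_partition("abcdefghijk", tau + k))
# [(0, 'abc'), (3, 'def'), (6, 'ghi'), (9, 'jk')]
```

Strings shorter than tau+k cannot be split into non-empty segments, so the sketch rejects them; how such strings are handled is not covered here.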



Partition Scheme



Substring Selection

Here we suppose tau = 3 and k = 1.

abc def ghi jk

a b d e f g h k



Substring Selection


abc def ghi jk

a b d e f g h k



Substring Selection


abc def ghi jk

a b d e f gh k



Substring Selection


abc def ghi jk

abd efg hk



Substring Selection


abc def ghi jk

a b d e f g h k



Substring Selection

So what we do is reduce the number of substrings. For more pruning techniques, please read our paper: 《Pass-Join-K多分段匹配的相似性连接算法》 (Pass-Join-K: a similarity join algorithm based on multi-segment matching).



Verification

DP (dynamic programming)

$D(m,n) = \min\bigl(D(m,n-1)+1,\ D(m-1,n)+1,\ D(m-1,n-1)+\mathrm{flag}\bigr)$, where flag = 0 when $s_m = r_n$ and flag = 1 otherwise; s and r are the two strings.
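A minimal sketch of this verification step, using the standard edit-distance recurrence: each cell takes the minimum of the three cases, and flag is 0 when the two characters match. The full table is kept so that the code mirrors the notation D(m, n); the function name is illustrative.

```python
def edit_distance(s: str, r: str) -> int:
    """Fill the (m+1) x (n+1) table D, where D[i][j] compares s[:i] with r[:j]."""
    m, n = len(s), len(r)
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        D[i][0] = i  # delete all i characters of s[:i]
    for j in range(n + 1):
        D[0][j] = j  # insert all j characters of r[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            flag = 0 if s[i - 1] == r[j - 1] else 1
            D[i][j] = min(D[i][j - 1] + 1,         # insertion
                          D[i - 1][j] + 1,         # deletion
                          D[i - 1][j - 1] + flag)  # match or substitution
    return D[m][n]


print(edit_distance("Baby", "Body"))  # 2
print(edit_distance("Bod", "Body"))   # 1
```

A pair is accepted when the returned value is at most tau.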



Verification


abc def ghi jk

def e f g h k

Tauleft = 3

Tauright = 3 - 3 = 0


Outline

Background

The introduction of Pass-Join-K

Combining Pass-Join-K with Hadoop



Big data

Big file

Large number of files



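Inverted index tree in Hadoop

(abc, 1, 11, r, IFlag) (def, 2, 11, r, IFlag)

(ghi, 3, 11, r, IFlag) (jk, 4, 11, r, IFlag)

[Figure: inverted index tree. The root L11 (string length 11) branches to the segments abc, def, ghi and jk, numbered 1 to 4; each leaf points to the string r.]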
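The index records above could be produced by a map step along the following lines. This is plain Python rather than Hadoop API code; the function name and the longer-segments-first split are assumptions chosen so that the output matches the four records shown.

```python
def index_records(string_id: str, text: str, tau: int, k: int):
    """Emit one (segment, segmentNumber, stringLength, id, 'IFlag') record per segment."""
    parts = tau + k
    base, extra = divmod(len(text), parts)
    records = []
    start = 0
    for number in range(1, parts + 1):
        # Assumption: the first `extra` segments are one character longer.
        length = base + 1 if number <= extra else base
        records.append((text[start:start + length], number, len(text), string_id, "IFlag"))
        start += length
    return records


for record in index_records("r", "abcdefghijk", tau=3, k=1):
    print(record)
```

Each record carries the segment number and the string length so that only segments in the same slot of equally long strings end up under the same key.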



Substrings in Hadoop

Suppose tau = 3, k = 1, and s = “abdefghk”, so length(s) = 8. We have to generate records such as (a, 1, 5, s, SFlag), (a, 2, 6, s, SFlag), (a, 3, 7, s, SFlag), (ab, 1, 8, s, SFlag), …, (ab, 1, 11, s, SFlag), …



Data flows in Hadoop



Big data

Big file

Large number of files



[segmentString, segmentNumber, stringLength, FLAG], [DirNumber, ID]
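How records with this key could meet on the reduce side can be simulated in plain Python. The sketch rests on several assumptions: records are grouped on (segmentString, segmentNumber, stringLength), FLAG only tells index records from substring records, the value is reduced to the string ID (DirNumber is left out), and K = 1 so that one shared key is enough. The record marked as hypothetical is not listed on the slides.

```python
from collections import defaultdict


def candidate_pairs(records):
    """Group records by key and pair every indexed string with every probing string."""
    groups = defaultdict(lambda: {"IFlag": [], "SFlag": []})
    for segment, number, length, string_id, flag in records:
        groups[(segment, number, length)][flag].append(string_id)

    pairs = set()
    for ids in groups.values():
        for r_id in ids["IFlag"]:
            for s_id in ids["SFlag"]:
                pairs.add((r_id, s_id))
    return sorted(pairs)


records = [
    ("abc", 1, 11, "r", "IFlag"),
    ("def", 2, 11, "r", "IFlag"),
    ("ab", 1, 11, "s", "SFlag"),
    ("def", 2, 11, "s", "SFlag"),  # hypothetical record, added for illustration
]
print(candidate_pairs(records))  # [('r', 's')]
```

The resulting candidate pairs would still have to be verified with the edit-distance computation; for K > 1 a pair would need at least K shared keys before verification.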


Email: [email protected]
