Large-scale Similarity Join with Edit-distance Constraints

By Yu Haiyang


23/4/21 http://datamining.xmu.edu.cn

Outline

Background

The introduction of Pass-Join-K

Combining Pass-Join-K with Hadoop


Background

Similarity join: find all similar pairs from two sets.

Data cleaning

Query relaxation

Spellchecking


Background

How to define similarity?

Jaccard distance (bag-of-words model)

Cosine distance

Edit distance


Background

Edit distance

The minimum number of edit operations (insertion, deletion, and substitution) needed to transform one string into another.

Baby → Body (substitution)

Bod → Body (insertion)


Background

How does edit distance compare with the other two?

Accuracy: {"abcdefg", "gfedcba"} (same characters, different order)

Verification time: rises from O(m+n) to O(mn)
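To make the accuracy point concrete, here is a minimal Python sketch. It assumes the Jaccard distance is taken over the set of characters in each string (one possible bag-of-words reading; the slides do not fix the tokenization), and the function names are illustrative only.

```python
def jaccard_distance(a: str, b: str) -> float:
    """1 - |A ∩ B| / |A ∪ B| over the character sets of a and b (assumed tokenization)."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 0.0
    return 1.0 - len(set_a & set_b) / len(set_a | set_b)


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance with a two-row dynamic program, O(len(a) * len(b)) time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost))
        prev = cur
    return prev[-1]


pair = ("abcdefg", "gfedcba")
print(jaccard_distance(*pair))  # set-based measure: character order is ignored
print(edit_distance(*pair))     # order-sensitive measure
```

Under this tokenization the set-based distance is 0, because both strings contain the same characters, while the edit distance is 6 (every character except the middle d has to change).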


Background

Find similar pairs

We have two string sets: one is {vldb, sigmod, …}, the other is {pvldb, icde, …}.

Find some candidate pairs, and then verify these pairs.

{<vldb,pvldb>, <vldb,icde>, <vldb,…>, <sigmod,pvldb>, <sigmod,icde>, …}

<vldb,pvldb> Yes

<vldb,icde> No


Background

So we have to:

Find candidate pairs. There are O(N^2) of them if we do not prune some pairs.

Verify these pairs. Each verification costs O(mn).
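The unpruned baseline can be sketched as a nested loop over both sets with a full verification per pair. This is only a reference point for the cost discussed above, not Pass-Join-K; the threshold tau = 1 and the function names are assumptions made for the example.

```python
from itertools import product


def edit_distance(a: str, b: str) -> int:
    """Plain O(mn) Levenshtein distance (two-row dynamic program)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def naive_join(left, right, tau):
    """Enumerate all |left| * |right| pairs (no pruning) and verify each one."""
    return [(r, s) for r, s in product(left, right) if edit_distance(r, s) <= tau]


left = ["vldb", "sigmod"]
right = ["pvldb", "icde"]
print(naive_join(left, right, tau=1))
```

With these four strings the join checks all 4 pairs and keeps only ('vldb', 'pvldb').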


Introduction of Pass-Join-K

Partition-based pruning technique

Suppose the threshold tau = 2 and K = 1, and we have the pair <"abcde", "ace">.


Introduction of Pass-Join-K

Partition-based pruning technique

Suppose the threshold tau = 2 and K = 2, and we have the pair <"abcdefghijk", "abdefghk">.
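The idea behind both examples, as read here: tau edits can damage at most tau of the tau+K segments, so a pair within the threshold must still share at least K segments. A small sketch with the segments written out by hand (the splits shown are one possible even split, which is an assumption) and hypothetical helper names. Segment positions are ignored.

```python
def shared_segments(segments, s):
    """Count how many segments of r occur somewhere in s (positions ignored)."""
    return sum(1 for seg in segments if seg in s)


def is_candidate(segments, s, k):
    """Keep the pair only if at least k segments of r appear in s."""
    return shared_segments(segments, s) >= k


# tau = 2, K = 1: "abcde" split into tau + K = 3 segments
print(is_candidate(["ab", "cd", "e"], "ace", k=1))
# tau = 2, K = 2: "abcdefghijk" split into tau + K = 4 segments
print(is_candidate(["abc", "def", "ghi", "jk"], "abdefghk", k=2))
```

The first pair stays a candidate (its edit distance is 2). The second pair shares only the segment def, so it is pruned, which is safe because its edit distance is 3 and exceeds tau = 2.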


Introduction of Pass-Join-K

Some obvious pruning techniques

Length-based: threshold = 2, <"ab", "abcee">

Shift-based: <"abcd", "cdef">

a b c d

c d e f
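Both checks can be written in a few lines. The sketch below is one interpretation of the slide: the length check compares the length difference with the threshold, and the shift check adds up the length differences of the text left and right of a matched segment, which is a lower bound on the edits needed when that segment is aligned. Function names are hypothetical, and the threshold of 2 is assumed to carry over to the shift example.

```python
def length_pruned(r: str, s: str, tau: int) -> bool:
    """A pair whose lengths differ by more than tau cannot be within tau edits."""
    return abs(len(r) - len(s)) > tau


def shift_lower_bound(r: str, s: str, seg_len: int, pos_r: int, pos_s: int) -> int:
    """Lower bound on edits if r[pos_r:pos_r+seg_len] is aligned with s[pos_s:pos_s+seg_len].

    The left parts differ in length by |pos_r - pos_s| and the right parts by the
    difference of the remaining suffix lengths; each difference costs at least
    that many insertions or deletions.
    """
    left = abs(pos_r - pos_s)
    right = abs((len(r) - pos_r - seg_len) - (len(s) - pos_s - seg_len))
    return left + right


print(length_pruned("ab", "abcee", tau=2))
print(shift_lower_bound("abcd", "cdef", seg_len=2, pos_r=2, pos_s=0))
```

The first call reports True (the lengths differ by 3). The second returns 4: matching the shared segment cd forces at least 2 edits on each side, so with a threshold of 2 this match cannot produce a result.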


Introduction of Pass-Join-K

Partition Scheme

We have seen that the longer the substrings are, the harder they are to match.

So we break the string into tau+k parts, and the length of each part equals ⌊length/(tau+k)⌋ or ⌊length/(tau+k)⌋+1.
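A minimal sketch of such an even partition. Which segments receive the extra character is an assumption: here the leading segments are the longer ones, chosen because that reproduces the split abc def ghi jk used in the later examples. The function name is illustrative.

```python
def partition(text: str, parts: int) -> list[str]:
    """Split text into `parts` segments of length floor(n/parts) or floor(n/parts) + 1.

    Assumption: the first (n mod parts) segments get the extra character.
    """
    base, extra = divmod(len(text), parts)
    segments, start = [], 0
    for i in range(parts):
        length = base + 1 if i < extra else base
        segments.append(text[start:start + length])
        start += length
    return segments


tau, k = 3, 1
print(partition("abcdefghijk", tau + k))
```

For the 11-character string and tau + k = 4 this yields ['abc', 'def', 'ghi', 'jk'].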


Introduction of Pass-Join-K

Substring Selection

Here we suppose tau = 3 and k = 1.

abc def ghi jk

a b d e f g h k

a b d e f gh k

abd efg hk


Introduction of Pass-Join-K

Substring Selection

So what we do is reduce the number of substrings. For more pruning techniques, please read our paper: 《Pass-Join-K多分段匹配的相似性连接算法》 (Pass-Join-K: a similarity join algorithm based on multi-segment matching).
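One way to cut the number of substrings, sketched under assumptions: for a segment of r starting at a known position, only substrings of s whose shift is compatible with the threshold are probed. The rule used here (left length difference plus right length difference must not exceed tau) is a simple position filter and not necessarily the exact rule of Pass-Join-K. Segment start positions are written by hand, and the names are hypothetical.

```python
def select_substrings(s: str, seg_start: int, seg_len: int, r_len: int, tau: int) -> list[str]:
    """Substrings of s worth comparing with the segment r[seg_start:seg_start+seg_len]."""
    delta = len(s) - r_len  # overall length difference between s and r
    selected = []
    for p in range(len(s) - seg_len + 1):
        shift = p - seg_start
        # |shift| edits are needed on the left, |delta - shift| on the right
        if abs(shift) + abs(delta - shift) <= tau:
            selected.append(s[p:p + seg_len])
    return selected


r, s, tau = "abcdefghijk", "abdefghk", 3
segments = [(0, "abc"), (3, "def"), (6, "ghi"), (9, "jk")]
for start, seg in segments:
    picked = select_substrings(s, start, len(seg), len(r), tau)
    print(seg, picked, seg in picked)
```

For this pair the filter leaves 9 substrings to probe instead of the 25 obtained by taking every substring of the right length for every segment, and the segment def is still found.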


Introduction of Pass-Join-K

Verification

DP (dynamic programming)

D(m,n) = min(D(m,n-1)+1, D(m-1,n)+1, D(m-1,n-1)+flag), where flag = 0 when s_m = r_n and flag = 1 otherwise; s and r are the two strings being compared.
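Since verification only needs to know whether the distance stays within tau, the recurrence can be restricted to cells with |i - j| <= tau and stopped early. The sketch below is a standard banded variant of the dynamic program, offered as an illustration and not as the verification routine of Pass-Join-K; the function name is an assumption.

```python
def edit_distance_within(r: str, s: str, tau: int):
    """Return the edit distance of r and s if it is <= tau, otherwise None."""
    m, n = len(r), len(s)
    if abs(m - n) > tau:
        return None
    inf = tau + 1  # any value above tau is treated as "too far"
    prev = [j if j <= tau else inf for j in range(n + 1)]
    for i in range(1, m + 1):
        cur = [inf] * (n + 1)
        if i <= tau:
            cur[0] = i
        # only cells inside the diagonal band can hold a value <= tau
        for j in range(max(1, i - tau), min(n, i + tau) + 1):
            flag = 0 if r[i - 1] == s[j - 1] else 1
            cur[j] = min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + flag, inf)
        if min(cur) >= inf:  # the whole row exceeds tau: stop early
            return None
        prev = cur
    return prev[n] if prev[n] <= tau else None


print(edit_distance_within("abcdefghijk", "abdefghk", tau=3))
print(edit_distance_within("vldb", "icde", tau=2))
```

Each row touches at most 2*tau + 1 cells, so the cost drops from O(mn) to O(tau * m) for pairs that survive the length check.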


Introduction of Pass-Join-K

Verification

Here we suppose tau = 3 and k = 1.

abc def ghi jk

def e f g h k

Tau_left = 3

Tau_right = 3 - 3 = 0


Combining Pass-Join-K with Hadoop

Big data

Big file

Large number of files


Combining Pass-Join-K with Hadoop

Inverted index tree in Hadoop

(abc,1,11,r,IFlag) (def,2,11,r,IFlag)

(ghi,3,11,r,IFlag) (jk,4,11,r,IFlag)

[Figure: inverted index tree with nodes L11; 1, 2, 3, 4; abc, def, ghi, jk; r]


Combining Pass-Join-K with Hadoop

Substrings in Hadoop

Suppose tau = 3, k = 1, and s = "abdefghk", length(s) = 8. We have to generate records such as (a,1,5,s,SFlag), (a,2,6,s,SFlag), (a,3,7,s,SFlag), (ab,1,8,s,SFlag), …, (ab,1,11,s,SFlag), …
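The two kinds of records, index records flagged IFlag and substring records flagged SFlag, could be produced by map-side functions like the ones below. This is plain Python, not Hadoop API code, and it rests on assumptions: the record layout is read as (text, segment number, indexed string length, string id, flag), longer segments come first in the partition, and a substring is emitted when its start is within tau of the segment start. The exact records on the slide may differ because they depend on these choices.

```python
def layout(length: int, parts: int) -> list[tuple[int, int]]:
    """(start, length) of each segment; longer segments first (assumed convention)."""
    base, extra = divmod(length, parts)
    spans, start = [], 0
    for i in range(parts):
        seg_len = base + 1 if i < extra else base
        spans.append((start, seg_len))
        start += seg_len
    return spans


def index_records(r: str, r_id: str, tau: int, k: int):
    """One IFlag record per segment of the indexed string r."""
    spans = layout(len(r), tau + k)
    return [(r[st:st + ln], i, len(r), r_id, "IFlag")
            for i, (st, ln) in enumerate(spans, start=1)]


def probe_records(s: str, s_id: str, tau: int, k: int):
    """SFlag records: substrings of s for every indexed length in [|s| - tau, |s| + tau]."""
    parts = tau + k
    records = []
    for target_len in range(max(parts, len(s) - tau), len(s) + tau + 1):
        for i, (st, ln) in enumerate(layout(target_len, parts), start=1):
            for p in range(max(0, st - tau), min(len(s) - ln, st + tau) + 1):
                records.append((s[p:p + ln], i, target_len, s_id, "SFlag"))
    return records


print(index_records("abcdefghijk", "r", tau=3, k=1))
probes = probe_records("abdefghk", "s", tau=3, k=1)
print(("def", 2, 11, "s", "SFlag") in probes)
```

The index records match the four shown for r, and the substring records cover the indexed lengths 5 through 11, including the record that will meet (def,2,11,r,IFlag).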


Combining Pass-Join-K with Hadoop

Data flows in Hadoop


Combining Pass-Join-K with Hadoop

Big data

Big file

Large number of files


Combining Pass-Join-K with Hadoop

[segmentString, segmentNumber, stringLength, FLAG], [DirNumber, ID]
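Reading the first bracket as the key and the second as the value, the shuffle and reduce step can be simulated in memory as follows. This is a sketch, not Hadoop code: it assumes DirNumber identifies the input directory or file and ID the string within it, and that records meet when segmentString, segmentNumber and stringLength agree while FLAG separates indexed segments from probing substrings. The sample records are made up for the illustration.

```python
from collections import defaultdict

# (key, value) pairs as they might leave the map phase; values are (DirNumber, ID)
records = [
    (("abc", 1, 11, "IFlag"), (0, "r")),
    (("def", 2, 11, "IFlag"), (0, "r")),
    (("abd", 1, 11, "SFlag"), (0, "s")),
    (("def", 2, 11, "SFlag"), (0, "s")),
]


def candidate_pairs(records):
    """Group by (segment, number, length) and pair IFlag values with SFlag values."""
    groups = defaultdict(lambda: {"IFlag": [], "SFlag": []})
    for (segment, number, length, flag), value in records:
        groups[(segment, number, length)][flag].append(value)
    pairs = set()
    for bucket in groups.values():
        for indexed in bucket["IFlag"]:
            for probe in bucket["SFlag"]:
                pairs.add((indexed, probe))
    return pairs


print(candidate_pairs(records))
```

Only the group for (def, 2, 11) holds both flags, so the single candidate pair is r from directory 0 with s from directory 0; such pairs would then go on to verification.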


Email: [email protected]