Duplicate Detection

Exercise 1.

Use an extended key to do entity identification [1]

• Tables R and S are shown below:

Table R

Name             City      ZIP    PersonNr
Eva Aadde        INGARÖ    13469  840126-1223
Eva Aalto        Norsborg  14564  851201-1225
Eva Abrahamsson  INGARÖ    13463  861227-1227

Table S

Name             HomeAddress      Telephone
Eva Aadde        Myskviksvägen 8  08-571 480 27
Eva Abrahamsson  Myrvägen 2       08-570 290 91
Eva Abrahamsson  Pilgatan 9       08-642 61 79
Eva Abrahamsson  Nyängsvägen 39A  08-530 356 44

• Suppose the extended key is {Name, City, HomeAddress} and the following instance-level functional dependencies (ILFDs) hold:

– (E.HomeAddress = "Myskviksvägen 8") -> (E.City = "INGARÖ")
– (E.HomeAddress = "Myrvägen 2") -> (E.City = "INGARÖ")
– (E.HomeAddress = "Pilgatan 9") -> (E.City = "STOCKHOLM")
– (E.HomeAddress = "Nyängsvägen 39A") -> (E.City = "TULLINGE")

• Construct the integrated table.

-----------------------------------------------------
[1] Lim, Jaideep Srivastava, Satya Prabhakar, James Richardson, Entity Identification in Database Integration, Proceedings of the Ninth International Conference on Data Engineering, p. 294-301, April 19-23, 1993

Answer to Exercise 1

• Integrated table

Name             City       ZIP    PersonNr     HomeAddress      Telephone
Eva Aadde        INGARÖ     13469  840126-1223  Myskviksvägen 8  08-571 480 27
Eva Abrahamsson  INGARÖ     13463  861227-1227  Myrvägen 2       08-570 290 91
Eva Abrahamsson  STOCKHOLM  NULL   NULL         Pilgatan 9       08-642 61 79
Eva Abrahamsson  TULLINGE   NULL   NULL         Nyängsvägen 39A  08-530 356 44
Eva Aalto        Norsborg   14564  851201-1225  NULL             NULL
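The integration step can be sketched in Python. This is only an illustration under stated assumptions: the ILFDs are stored as a lookup from HomeAddress to City, two tuples are taken to match when they agree on the extended-key attributes that both carry after the ILFDs have been applied (here Name and City), and tuples without a partner are kept and padded with NULL (`None`). The names `ILFDS`, `COLUMNS` and `integrate` are hypothetical.

```python
# Tables R and S from the exercise, as lists of dicts.
R = [
    {"Name": "Eva Aadde", "City": "INGARÖ", "ZIP": "13469", "PersonNr": "840126-1223"},
    {"Name": "Eva Aalto", "City": "Norsborg", "ZIP": "14564", "PersonNr": "851201-1225"},
    {"Name": "Eva Abrahamsson", "City": "INGARÖ", "ZIP": "13463", "PersonNr": "861227-1227"},
]
S = [
    {"Name": "Eva Aadde", "HomeAddress": "Myskviksvägen 8", "Telephone": "08-571 480 27"},
    {"Name": "Eva Abrahamsson", "HomeAddress": "Myrvägen 2", "Telephone": "08-570 290 91"},
    {"Name": "Eva Abrahamsson", "HomeAddress": "Pilgatan 9", "Telephone": "08-642 61 79"},
    {"Name": "Eva Abrahamsson", "HomeAddress": "Nyängsvägen 39A", "Telephone": "08-530 356 44"},
]
# ILFDs from the exercise: HomeAddress value -> implied City value.
ILFDS = {
    "Myskviksvägen 8": "INGARÖ",
    "Myrvägen 2": "INGARÖ",
    "Pilgatan 9": "STOCKHOLM",
    "Nyängsvägen 39A": "TULLINGE",
}
COLUMNS = ["Name", "City", "ZIP", "PersonNr", "HomeAddress", "Telephone"]


def integrate(r_rows, s_rows, ilfds):
    """Merge R and S on (Name, City), deriving City for S via the ILFDs."""
    result, matched_r = [], set()
    for s in s_rows:
        city = ilfds.get(s["HomeAddress"])  # None if no ILFD applies
        row = dict.fromkeys(COLUMNS)        # every column starts as NULL
        for i, r in enumerate(r_rows):
            if city is not None and (r["Name"], r["City"]) == (s["Name"], city):
                row.update(r)               # copy ZIP and PersonNr from R
                matched_r.add(i)
                break
        row.update(s)
        row["City"] = city
        result.append(row)
    # Assumption: R tuples without a partner in S are kept as well.
    for i, r in enumerate(r_rows):
        if i not in matched_r:
            row = dict.fromkeys(COLUMNS)
            row.update(r)
            result.append(row)
    return result


for row in integrate(R, S, ILFDS):
    print([row[c] for c in COLUMNS])
```

Without the ILFDs, Name alone could not separate the three Eva Abrahamsson tuples in S; the derived City is what lets exactly one of them match the tuple in R.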

Exercise 2.

Use a priority queue to do duplicate detection [2]

1. Table R, which is already sorted according to an application-specific key:

Tuple
T1
T2
T3
T4
T5
T6
T7

2. Similarities between tuples:

     T1   T2   T3   T4   T5   T6   T7
T1   -    0.6  0.1  0.3  0.5  0.1  0.2
T2   0.6  -    0.2  0.4  0.4  0.4  0.2
T3   0.1  0.2  -    0.9  0.4  0.6  0.5
T4   0.3  0.4  0.9  -    0.4  0.6  0.6
T5   0.5  0.4  0.4  0.4  -    0.4  0.8
T6   0.1  0.4  0.6  0.6  0.4  -    0.4
T7   0.2  0.2  0.5  0.6  0.8  0.4  -

• Given the conditions below, use the priority queue algorithm to find the duplicate clusters in R.

3. Method to compute the matching score: given one cluster, the matching score of a tuple is the average of the tuple's similarity with all of the cluster's representatives.

4. The condition to declare a new cluster: matching score < 0.5

5. The condition to declare a representative: 0.5 < matching score < 0.8

6. The size of the priority queue: 2
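Conditions 3 to 5 can be written as two small Python helpers. This is a sketch, not part of the exercise: the names `matching_score` and `classify` are hypothetical, and it assumes that a tuple whose score is not below 0.5 but falls outside the open interval (0.5, 0.8) joins the cluster without becoming a representative.

```python
def matching_score(sim, tuple_id, representatives):
    """Average similarity between one tuple and a cluster's representatives."""
    return sum(sim[tuple_id][r] for r in representatives) / len(representatives)


def classify(score, low=0.5, high=0.8):
    """Apply conditions 4 and 5 to a matching score."""
    if score < low:
        return "no match"        # candidate for a new cluster
    if low < score < high:
        return "representative"  # joins the cluster and represents it
    return "member"              # joins the cluster only (assumption)


# Similarities of T3 to T1 and T2, taken from the matrix in item 2.
sim = {"T3": {"T1": 0.1, "T2": 0.2}}
score = matching_score(sim, "T3", ["T1", "T2"])
print(round(score, 2), classify(score))
```

A tuple starts a new cluster only when it gets "no match" against every cluster currently in the queue.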

-----------------------------------------------------
[2] A.E. Monge and C.P. Elkan, "An Efficient Domain-Independent Algorithm for Detecting Approximately Duplicate Database Records," Proc. ACM-SIGMOD Workshop Research Issues on Knowledge Discovery and Data Mining, 1997

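Answer to Exercise 2

Here "a:b" is the similarity between tuples a and b, and each queue is listed from highest to lowest priority.

Record 1
Queue: {1}

Record 2
2:1 = 0.6, which is > 0.5 and < 0.8, so 2 joins the cluster and becomes a representative.
Queue: {1, 2}

Record 3
3:1 = 0.1, 3:2 = 0.2, matching score = (0.1 + 0.2) / 2 = 0.15 < 0.5, so 3 starts a new cluster.
Queue: {3} {1, 2}

Record 4
4:1 = 0.3, 4:2 = 0.4, matching score = (0.3 + 0.4) / 2 = 0.35 < 0.5
4:3 = 0.9, which is > 0.5 and > 0.8, so 4 joins the cluster of 3 but is not a representative.
Queue: {3, 4} {1, 2}

Record 5
5:1 = 0.5, 5:2 = 0.4, matching score = (0.5 + 0.4) / 2 = 0.45 < 0.5
5:3 = 0.4, matching score = 0.4 < 0.5, so 5 starts a new cluster.
Queue: {5} {3, 4}
The queue holds only 2 clusters, so {1, 2} leaves the queue and is no longer compared; it remains a finished cluster.

Record 6
6:3 = 0.6, matching score = 0.6, which is > 0.5 and < 0.8, so 6 joins the cluster of 3 and becomes a representative.
6:5 = 0.4 < 0.5
Queue: {3, 4, 6} {5}

Record 7
7:3 = 0.5, 7:6 = 0.4, matching score = (0.5 + 0.4) / 2 = 0.45 < 0.5
7:5 = 0.8 > 0.5, so 7 joins the cluster of 5; 0.8 is not < 0.8, so 7 is not a representative.
Queue: {5, 7} {3, 4, 6}

Duplicate clusters: {1, 2}, {3, 4, 6}, {5, 7}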
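The trace can be checked with a short Python simulation. It is a simplified sketch rather than the algorithm of [2] in full: it assumes the tuple joins the first cluster in queue order whose matching score reaches 0.5, that this cluster then moves to the front of the queue, and that the lowest-priority cluster is dropped from the queue (but kept in the result) when the queue grows beyond its size. `SIM` and `priority_queue_clusters` are hypothetical names; unused diagonal entries are `None`.

```python
# Similarity matrix from item 2; row and column i correspond to tuple T(i+1).
SIM = [
    [None, 0.6, 0.1, 0.3, 0.5, 0.1, 0.2],
    [0.6, None, 0.2, 0.4, 0.4, 0.4, 0.2],
    [0.1, 0.2, None, 0.9, 0.4, 0.6, 0.5],
    [0.3, 0.4, 0.9, None, 0.4, 0.6, 0.6],
    [0.5, 0.4, 0.4, 0.4, None, 0.4, 0.8],
    [0.1, 0.4, 0.6, 0.6, 0.4, None, 0.4],
    [0.2, 0.2, 0.5, 0.6, 0.8, 0.4, None],
]


def priority_queue_clusters(sim, queue_size=2, low=0.5, high=0.8):
    """Scan the sorted tuples once and return clusters of 1-based tuple numbers."""
    queue = []    # front of the list = highest priority
    evicted = []  # clusters pushed out of the queue, kept as results
    for t in range(len(sim)):
        placed = False
        for pos, cluster in enumerate(queue):
            reps = cluster["reps"]
            score = sum(sim[t][r] for r in reps) / len(reps)
            if score >= low:
                cluster["members"].append(t)
                if low < score < high:
                    cluster["reps"].append(t)    # also becomes a representative
                queue.insert(0, queue.pop(pos))  # matched cluster moves to front
                placed = True
                break
        if not placed:
            queue.insert(0, {"members": [t], "reps": [t]})
            if len(queue) > queue_size:
                evicted.append(queue.pop())      # drop the lowest priority
    clusters = [c["members"] for c in evicted + queue]
    return sorted([m + 1 for m in members] for members in clusters)


print(priority_queue_clusters(SIM))
```

With a larger `queue_size`, {1, 2} would stay in the queue and records 6 and 7 would also be compared against it, at the cost of more similarity lookups per record.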
