Duplicate Detection
Exercise 1.
Use an extended key to perform entity identification [1].
• Tables R and S are shown below.

Table R
Name             City      ZIP    PersonNr
Eva Aadde        INGARÖ    13469  840126-1223
Eva Aalto        Norsborg  14564  851201-1225
Eva Abrahamsson  INGARÖ    13463  861227-1227

Table S
Name             HomeAddress      Telephone
Eva Aadde        Myskviksvägen 8  08-571 480 27
Eva Abrahamsson  Myrvägen 2       08-570 290 91
Eva Abrahamsson  Pilgatan 9       08-642 61 79
Eva Abrahamsson  Nyängsvägen 39A  08-530 356 44

• Suppose the extended key is {Name, City, HomeAddress} and the following ILFDs (instance-level functional dependencies) hold:
– (E.HomeAddress = "Myskviksvägen 8") -> (E.City = "INGARÖ")
– (E.HomeAddress = "Myrvägen 2") -> (E.City = "INGARÖ")
– (E.HomeAddress = "Pilgatan 9") -> (E.City = "STOCKHOLM")
– (E.HomeAddress = "Nyängsvägen 39A") -> (E.City = "TULLINGE")
• Construct the integrated table.
-----------------------------------------------------
[1] Ee-Peng Lim, Jaideep Srivastava, Satya Prabhakar, and James Richardson, "Entity Identification in Database Integration," Proceedings of the Ninth International Conference on Data Engineering, pp. 294-301, April 19-23, 1993.
Answer
• Integrated table

Name             City       ZIP    PersonNr     HomeAddress      Telephone
Eva Aadde        INGARÖ     13469  840126-1223  Myskviksvägen 8  08-571 480 27
Eva Abrahamsson  INGARÖ     13463  861227-1227  Myrvägen 2       08-570 290 91
Eva Abrahamsson  STOCKHOLM  NULL   NULL         Pilgatan 9       08-642 61 79
Eva Abrahamsson  TULLINGE   NULL   NULL         Nyängsvägen 39A  08-530 356 44
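
The integration step can be sketched in Python. This is a minimal illustration under stated assumptions, not the procedure from [1]: the function name `integrate` and the dictionary `ILFD_CITY` are hypothetical, the ILFDs are modeled as a plain lookup from HomeAddress to City, and (Name, City) is assumed to be unique in Table R. Like the answer above, it emits one row per tuple of Table S and leaves out R tuples that have no partner in S.

```python
# Tables R and S from the exercise, as lists of dicts.
R = [
    {"Name": "Eva Aadde", "City": "INGARÖ", "ZIP": "13469", "PersonNr": "840126-1223"},
    {"Name": "Eva Aalto", "City": "Norsborg", "ZIP": "14564", "PersonNr": "851201-1225"},
    {"Name": "Eva Abrahamsson", "City": "INGARÖ", "ZIP": "13463", "PersonNr": "861227-1227"},
]
S = [
    {"Name": "Eva Aadde", "HomeAddress": "Myskviksvägen 8", "Telephone": "08-571 480 27"},
    {"Name": "Eva Abrahamsson", "HomeAddress": "Myrvägen 2", "Telephone": "08-570 290 91"},
    {"Name": "Eva Abrahamsson", "HomeAddress": "Pilgatan 9", "Telephone": "08-642 61 79"},
    {"Name": "Eva Abrahamsson", "HomeAddress": "Nyängsvägen 39A", "Telephone": "08-530 356 44"},
]

# The four ILFDs, modeled as HomeAddress -> City (an assumption of this sketch).
ILFD_CITY = {
    "Myskviksvägen 8": "INGARÖ",
    "Myrvägen 2": "INGARÖ",
    "Pilgatan 9": "STOCKHOLM",
    "Nyängsvägen 39A": "TULLINGE",
}


def integrate(r_rows, s_rows, ilfd_city):
    """Match S tuples to R tuples on the extended key {Name, City, HomeAddress}."""
    # R can only supply the (Name, City) part of the extended key.
    r_index = {(r["Name"], r["City"]): r for r in r_rows}
    integrated = []
    for s in s_rows:
        # Derive the City that S lacks from the ILFD for this HomeAddress.
        city = ilfd_city.get(s["HomeAddress"])
        match = r_index.get((s["Name"], city))
        integrated.append({
            "Name": s["Name"],
            "City": city,
            "ZIP": match["ZIP"] if match else None,            # None plays the role of NULL
            "PersonNr": match["PersonNr"] if match else None,
            "HomeAddress": s["HomeAddress"],
            "Telephone": s["Telephone"],
        })
    return integrated


rows = integrate(R, S, ILFD_CITY)
for row in rows:
    print(row)
```

The two Eva Abrahamsson tuples whose derived City is STOCKHOLM or TULLINGE find no partner in R, which is why their ZIP and PersonNr stay NULL.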
Exercise 2.
Use a priority queue to perform duplicate detection [2].
1. Table R, which is already sorted according to an application-specific key:

Tuple
T1
T2
T3
T4
T5
T6
T7

2. Similarities between tuples:

     T1   T2   T3   T4   T5   T6   T7
T1   -    0.6  0.1  0.3  0.5  0.1  0.2
T2   0.6  -    0.2  0.4  0.4  0.4  0.2
T3   0.1  0.2  -    0.9  0.4  0.6  0.5
T4   0.3  0.4  0.9  -    0.4  0.6  0.6
T5   0.5  0.4  0.4  0.4  -    0.4  0.8
T6   0.1  0.4  0.6  0.6  0.4  -    0.4
T7   0.2  0.2  0.5  0.6  0.8  0.4  -

• Given the table and similarities above and the conditions below, use the priority queue algorithm to find the duplicate clusters.
3. Method for computing the matching score: given a cluster, the matching score of a tuple is the average of the tuple's similarities with all of the cluster's representatives.
4. Condition for declaring a new cluster: matching score < 0.5 (against every cluster in the queue)
5. Condition for declaring a representative: 0.5 < matching score < 0.8
6. Size of the priority queue: 2
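
Conditions 3 to 5 can be written as two small Python helpers. This is only a sketch: the names `matching_score` and `decide` and the returned labels are hypothetical, and the handling of a score of exactly 0.5 or exactly 0.8 (the tuple joins the cluster but is not a representative) is an assumption read off the strict inequalities above.

```python
NEW_CLUSTER_BELOW = 0.5    # condition 4: below this, the tuple does not match the cluster
REPRESENTATIVE_BELOW = 0.8  # condition 5: upper bound for becoming a representative


def matching_score(similarities_to_representatives):
    """Condition 3: average similarity to all representatives of one cluster."""
    sims = list(similarities_to_representatives)
    return sum(sims) / len(sims)


def decide(score):
    """Classify a tuple against one cluster, given its matching score."""
    if score < NEW_CLUSTER_BELOW:
        return "no match"
    if NEW_CLUSTER_BELOW < score < REPRESENTATIVE_BELOW:
        return "member and representative"
    return "member only"


# T3 against a cluster whose representatives are T1 and T2 (similarities 0.1 and 0.2).
print(decide(matching_score([0.1, 0.2])))
# T2 against a cluster whose only representative is T1 (similarity 0.6).
print(decide(matching_score([0.6])))
```

A tuple starts a new cluster only when `decide` returns "no match" for every cluster currently in the queue.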
-----------------------------------------------------
[2] A.E. Monge and C.P. Elkan, "An Efficient Domain-Independent Algorithm for Detecting Approximately Duplicate Database Records," Proc. ACM-SIGMOD Workshop Research Issues on Knowledge Discovery and Data Mining, 1997.
Answer
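Record 1
Queue {1}
Record 2
2:1 = 0.6 > 0.5 and < 0.8 (joins {1} as a representative)
Queue {1, 2}
Record 3
3:1 = 0.1, 3:2 = 0.2, matching score = (0.1 + 0.2) / 2 = 0.15 < 0.5 (new cluster)
Queue {3} {1, 2}
Record 4
4:1 = 0.3, 4:2 = 0.4, matching score = (0.3 + 0.4) / 2 = 0.35 < 0.5
4:3 = 0.9 > 0.5 and > 0.8 (joins {3}, but not as a representative)
Queue {3, 4} {1, 2}
Record 5
5:1 = 0.5, 5:2 = 0.4, matching score = (0.5 + 0.4) / 2 = 0.45 < 0.5
5:3 = 0.4, matching score = 0.4 < 0.5 (new cluster; with a queue size of 2, {1, 2} is no longer compared against)
Queue {5} {3, 4} {1, 2}
Record 6
6:3 = 0.6, matching score = 0.6 > 0.5 and < 0.8 (joins {3, 4} as a representative)
6:5 = 0.4 < 0.5
Queue {3, 4, 6} {5} {1, 2}
Record 7
7:3 = 0.5, 7:6 = 0.4, matching score = (0.5 + 0.4) / 2 = 0.45 < 0.5
7:5 = 0.8 > 0.5 (joins {5}, but not as a representative because 0.8 is not < 0.8)
Queue {5, 7} {3, 4, 6} {1, 2}
Duplicate clusters: {1, 2}, {3, 4, 6}, {5, 7}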
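
The trace can be checked with a short Python simulation. This is a simplified sketch of the exercise's rules, not the full algorithm of [2]: the function name `priority_queue_clusters` is hypothetical, clusters are scanned from most to least recently used and the first match wins, a matched cluster moves to the front, and the cluster at the back is evicted (but kept as a result) when the queue exceeds its size. The scan order differs from the trace above, but no tuple in this data matches more than one cluster, so the outcome is the same.

```python
# Upper triangle of the similarity matrix from the exercise, keyed by (i, j) with i < j.
SIM = {
    (1, 2): 0.6, (1, 3): 0.1, (1, 4): 0.3, (1, 5): 0.5, (1, 6): 0.1, (1, 7): 0.2,
    (2, 3): 0.2, (2, 4): 0.4, (2, 5): 0.4, (2, 6): 0.4, (2, 7): 0.2,
    (3, 4): 0.9, (3, 5): 0.4, (3, 6): 0.6, (3, 7): 0.5,
    (4, 5): 0.4, (4, 6): 0.6, (4, 7): 0.6,
    (5, 6): 0.4, (5, 7): 0.8,
    (6, 7): 0.4,
}


def sim(a, b):
    """Symmetric lookup into the similarity matrix."""
    return SIM[(min(a, b), max(a, b))]


def priority_queue_clusters(n, queue_size=2, low=0.5, high=0.8):
    queue = []    # front = most recently used cluster
    evicted = []  # clusters pushed out of the queue; they remain valid clusters
    for t in range(1, n + 1):
        for i, cluster in enumerate(queue):
            reps = cluster["reps"]
            score = sum(sim(t, r) for r in reps) / len(reps)
            if score >= low:                 # not a new cluster: join this one
                cluster["members"].append(t)
                if low < score < high:       # representative condition
                    reps.append(t)
                queue.insert(0, queue.pop(i))  # move the matched cluster to the front
                break
        else:
            # No cluster in the queue matched: start a new singleton cluster.
            queue.insert(0, {"members": [t], "reps": [t]})
            if len(queue) > queue_size:
                evicted.append(queue.pop())
    return [c["members"] for c in queue + evicted]


clusters = priority_queue_clusters(7)
print(clusters)  # [[5, 7], [3, 4, 6], [1, 2]]
```

The printed clusters agree with the final queue line of the answer.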