Linking Records with Erroneous Values Songtao Guo, Xin Luna Dong, Divesh Srivastava, and Remi Zajac...
1
Linking Records with Erroneous Values
Songtao Guo, Xin Luna Dong, Divesh Srivastava, and Remi Zajac
AT&T Labs
2
Motivation
Src | Name | Phone | Address | City
V | A-Link Wireless | 8185491449 | 2148 GLENDALE GALLERIA | GLENDALE
V | Abercrombie | 8185020728 | 2229 GLENDALE GALLERIA | GLENDALE
V | Abercrombie & Fitch | 8185507492 | 2151 GLENDALE GALLERIA | GLENDALE
V | Aeropostale | 8185458972 | 2187 GLENDALE GALLERIA | GLENDALE
V | Aerosoles | 8182462455 | 1163 GLENDALE GALLERIA | GLENDALE
V | Newtown Pizza Palace | 2034266114 | 65 Church hill Rd | NEWTOWN
V | Pizza Palace Of Newtown | 2034266114 | 65 Church hill Rd | NEWTOWN
[Figure: multiple sources (s) pass through integration to produce the cleaned data behind a search box]
Src | Name | Phone | Address | City
D | Aerosoles | 8182462455 | 1163 GLENDALE GALLERIA | GLENDALE
D | Aldo Shoes | 8184090612 | 1157 GLENDALE GALLERIA | GLENDALE
D | Newtown Pizza Palace | 2034266114 | 65 Church hill Rd | Newtown
D | Pizza Palace of Newtown | 2034266114 | Church Hill Rd | Newtown
Src | Name | Phone | Address | City
A | A 24 Hour 1 A 1 Locksmith | 8182404644 | 3210 GLENDALE GALLERIA | GLENDALE
A | A Link Wireless | 8185491449 | 2148 GLENDALE GALLERIA | GLENDALE
A | Abercrombie | 8185020728 | 2229 GLENDALE GALLERIA | GLENDALE
A | Abercrombie & Fitch | 8185507492 | 2151 GLENDALE GALLERIA | GLENDALE
A | Newtown Pizza Palace | 2034266114 | 65 Church hill Rd | Newtown
A | Aldo Shoes | 8185482540 | 2154 GLENDALE GALLERIA | GLENDALE
A | Alert Cellular | 8182404779 | 2148 GLENDALE GALLERIA | GLENDALE
Src | Name | Phone | Address | City
T | Newtown Pizza Palace | 2034266114 | 65 Church hill Rd | Newtown
T | Aldo Shoes | 8185482540 | 2154 GLENDALE GALLERIA | GLENDALE
T | American Eagle Outfitters | 8189561893 | 2182 GLENDALE GALLERIA | GLENDALE
T | ANN TAYLOR | 8182460350 | 2178 GLENDALE GALLERIA | GLENDALE
T | Ann Taylor Stores | 8182460350 | 1108 GLENDALE GALLERIA | GLENDALE
3
Motivation
Which type of listing are they?
• A: the same business
• B: different businesses sharing the same phone#
• C: different businesses, only one correctly associated with the given phone#
4
Current Solution
• Uniqueness constraint
– Each real-world entity has a unique value. E.g., phone, address
• The data may not satisfy the constraint
– Erroneous values
– Small number of exceptions
• Current two-step solution
– Step 1: Record Linkage
  • Link records that are likely to refer to the same real-world entity [A.K. Elmagarmid, TKDE’07], [W. Winkler, Tech Report’06]
– Step 2: Data Fusion
  • Decide the correct values in the presence of conflicts [J. Bleiholder et al., ACM Computing Surveys]
5
Limitations of Current Solution
SOURCE | NAME | PHONE | ADDRESS
s1 | Microsofe Corp. | xxx-1255 | 1 Microsoft Way
s1 | Microsofe Corp. | xxx-9400 | 1 Microsoft Way
s1 | Macrosoft Inc. | xxx-0500 | 2 Sylvan W.
s2 | Microsoft Corp. | xxx-1255 | 1 Microsoft Way
s2 | Microsofe Corp. | xxx-9400 | 1 Microsoft Way
s2 | Macrosoft Inc. | xxx-0500 | 2 Sylvan Way
s3 | Microsoft Corp. | xxx-1255 | 1 Microsoft Way
s3 | Microsoft Corp. | xxx-9400 | 1 Microsoft Way
s3 | Macrosoft Inc. | xxx-0500 | 2 Sylvan Way
s4 | Microsoft Corp. | xxx-1255 | 1 Microsoft Way
s4 | Microsoft Corp. | xxx-9400 | 1 Microsoft Way
s4 | Macrosoft Inc. | xxx-0500 | 2 Sylvan Way
s5 | Microsoft Corp. | xxx-1255 | 1 Microsoft Way
s5 | Microsoft Corp. | xxx-9400 | 1 Microsoft Way
s5 | Macrosoft Inc. | xxx-0500 | 2 Sylvan Way
s6 | Microsoft Corp. | xxx-2255 | 1 Microsoft Way
s6 | Macrosoft Inc. | xxx-0500 | 2 Sylvan Way
s7 | MS Corp. | xxx-1255 | 1 Microsoft Way
s7 | Macrosoft Inc. | xxx-0500 | 2 Sylvan Way
s8 | MS Corp. | xxx-1255 | 1 Microsoft Way
s8 | Macrosoft Inc. | xxx-0500 | 2 Sylvan Way
s9 | Macrosoft Inc. | xxx-0500 | 2 Sylvan Way
s10 | MS Corp. | xxx-0500 | 2 Sylvan Way
Locally resolving conflicts for linked records may overlook important global evidence
Erroneous values may prevent correct matching
Traditional techniques may fall short when exceptions to the uniqueness constraints exist
(Microsoft Corp., Microsofe Corp., MS Corp.) (xxx-1255, xxx-9400) (1 Microsoft Way)
(Macrosoft Inc.) (xxx-0500) (2 Sylvan Way, 2 Sylvan W.)
6
Our Solution
• Perform linkage and fusion simultaneously
– Able to identify incorrect values from the beginning, so can improve linkage
• Make global decisions
– Consider sources that associate a pair of values in the same record, so can improve fusion
• Allow a small number of violations to capture possible exceptions in the real world
7
Road Map
• Motivation and overview
• Problem definition
• Solution
• Evaluations on YP data
• Conclusions
8
Problem Input
• A set of independent data sources, each providing a set of records
• A set of (soft) uniqueness constraints
– Uniqueness constraint (hard constraint):
  • Business Name, Business Phone, Business Address
– Soft uniqueness constraint (soft constraint):
  • Business Phone (figure labels: 1-p1, 1-p2)
9
Problem Output
• Real-world entities
• For each (soft) uniqueness attribute of each entity
– True value (if any)
– Various representations of each true value
(Macrosoft Inc.) (xxx-0500) (2 Sylvan Way, 2 Sylvan W.)
(Microsoft Corp., Microsofe Corp., MS Corp.) (xxx-1255, xxx-9400) (1 Microsoft Way)
10
K-Partite Graph Encoding
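[Figure: k-partite graph built from the records of sources s1-s10. Each node is a distinct attribute value; each edge is labeled with the sources whose records pair the two values.]
Name nodes: N1 = Microsofe Corp., N2 = Microsoft Corp., N3 = MS Corp., N4 = Macrosoft Inc.
Phone nodes: P1 = xxx-1255, P2 = xxx-2255, P3 = xxx-9400, P4 = xxx-0500
Address nodes: A1 = 1 Microsoft Way, A2 = 2 Sylvan Way, A3 = 2 Sylvan W.
Edge labels: s(1), s(1-2), s(1-5), s(1-5,7,8), s(1-9), s(2-5), s(2-6), s(2-9), s(2-10), s(3-5), s(6), s(7-8), s(10)
Example record: S1 (Microsofe Corp., xxx-1255, 1 Microsoft Way)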
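As a rough illustration of the encoding, the sketch below builds the edge labels of the k-partite graph from a handful of the example records. The function name `build_graph`, the tuple layout, and the use of a subset of the records are assumptions made for this sketch, not part of the original slides.

```python
from collections import defaultdict

# A subset of the running example: (source, name, phone, address).
RECORDS = [
    ("s1", "Microsofe Corp.", "xxx-1255", "1 Microsoft Way"),
    ("s1", "Microsofe Corp.", "xxx-9400", "1 Microsoft Way"),
    ("s1", "Macrosoft Inc.", "xxx-0500", "2 Sylvan W."),
    ("s2", "Microsoft Corp.", "xxx-1255", "1 Microsoft Way"),
    ("s2", "Microsofe Corp.", "xxx-9400", "1 Microsoft Way"),
    ("s2", "Macrosoft Inc.", "xxx-0500", "2 Sylvan Way"),
    ("s6", "Microsoft Corp.", "xxx-2255", "1 Microsoft Way"),
    ("s10", "MS Corp.", "xxx-0500", "2 Sylvan Way"),
]

ATTRS = ("name", "phone", "address")


def build_graph(records):
    """Map each edge (attr_i, value_i, attr_j, value_j) to its supporting sources."""
    edges = defaultdict(set)
    for source, *values in records:
        # Every pair of attribute values in one record yields one edge.
        for i in range(len(ATTRS)):
            for j in range(i + 1, len(ATTRS)):
                edges[(ATTRS[i], values[i], ATTRS[j], values[j])].add(source)
    return edges


edges = build_graph(RECORDS)
```

Values of the same attribute never share an edge, which is what makes the graph k-partite (here k = 3).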
11
Solution Encoding
[Figure: the k-partite graph over name nodes N1-N4 (Microsofe Corp., Microsoft Corp., MS Corp., Macrosoft Inc.), phone nodes P1-P4 (xxx-1255, xxx-2255, xxx-9400, xxx-0500), and address nodes A1-A3 (1 Microsoft Way, 2 Sylvan Way, 2 Sylvan W.)]
Clustering problem & Matching problem
12
Solution Encoding with Hard Constraint
[Figure: the same graph (name nodes N1-N4, phone nodes P1-P4, address nodes A1-A3) partitioned into clusters C1, C2, C3, C4]
Clustering problem
13
Road Map
• Motivation and overview
• Problem definition
• Solution
– Clustering w.r.t. hard constraint
– Matching w.r.t. soft constraint
• Evaluations on YP data
• Conclusions
Clustering w.r.t. Hard Constraints
[Figure: graph with name nodes N1-N4, phone nodes P1 and P4, and address nodes A1-A3, grouped into clusters C1 and C4]
• Ideal clustering:
– High cohesion within each cluster
– Low correlation between different clusters
• Objective function
– Davies-Bouldin index (minimization)
• Average distance of
– Similarity distance
– Association distance
Similarity Distance
15
[Figure: clusters C1 and C4 with pairwise similarity labels: 0.95, 0.65, 0.65 among the names in C1; 0.7, 0.7, 0.4 between the names in C1 and N4; other labels 0.9, 0, 0, 0]
• Similarity of values
• Defined for each attribute
Within C1:
d_S^1(C1,C1) = 1 - (0.95+0.65+0.65)/3 = 0.25 (name)
d_S^2(C1,C1) = 0 (phone)
d_S^3(C1,C1) = 0 (address)
d_S(C1,C1) = (0.25+0+0)/3 = 0.083
Between C1 and C4:
d_S^1(C1,C4) = 1 - (0.7+0.7+0.4)/3 = 0.4
d_S^2(C1,C4) = 1 - 0 = 1
d_S^3(C1,C4) = 1 - 0 = 1
d_S(C1,C4) = (0.4+1+1)/3 = 0.8
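The arithmetic on this slide can be reproduced with a short sketch. It assumes that the per-attribute distance is one minus the average pairwise similarity, and that a single-valued attribute has distance 0 within its own cluster, which is how the slide's numbers work out. Which name pair carries which similarity score is also an assumption; only the averages affect the result. All function and variable names are hypothetical.

```python
from itertools import combinations, product

# Pairwise name similarities from the slide; the pair-to-score mapping is assumed.
SIM = {
    frozenset({"Microsofe Corp.", "Microsoft Corp."}): 0.95,
    frozenset({"Microsofe Corp.", "MS Corp."}): 0.65,
    frozenset({"Microsoft Corp.", "MS Corp."}): 0.65,
    frozenset({"Microsofe Corp.", "Macrosoft Inc."}): 0.7,
    frozenset({"Microsoft Corp.", "Macrosoft Inc."}): 0.7,
    frozenset({"MS Corp.", "Macrosoft Inc."}): 0.4,
}
ATTRS = ("name", "phone", "address")

C1 = {"name": ["Microsofe Corp.", "Microsoft Corp.", "MS Corp."],
      "phone": ["xxx-1255"], "address": ["1 Microsoft Way"]}
C4 = {"name": ["Macrosoft Inc."], "phone": ["xxx-0500"],
      "address": ["2 Sylvan Way", "2 Sylvan W."]}


def sim(a, b):
    """Similarity of two values; unlisted pairs count as 0, as on the slide."""
    return 1.0 if a == b else SIM.get(frozenset({a, b}), 0.0)


def attribute_distance(vals_a, vals_b, same_cluster):
    """1 minus the average pairwise similarity for one attribute."""
    if same_cluster:
        pairs = list(combinations(vals_a, 2))
    else:
        pairs = list(product(vals_a, vals_b))
    if not pairs:
        # A single value is perfectly cohesive; nothing to compare otherwise.
        return 0.0 if same_cluster else 1.0
    return 1.0 - sum(sim(a, b) for a, b in pairs) / len(pairs)


def similarity_distance(c_a, c_b):
    """Average of the per-attribute distances."""
    same = c_a is c_b
    per_attr = [attribute_distance(c_a[k], c_b[k], same) for k in ATTRS]
    return sum(per_attr) / len(per_attr)


d_within = similarity_distance(C1, C1)    # about 0.083
d_between = similarity_distance(C1, C4)   # about 0.8
```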
Association Distance
16
[Figure: clusters C1 and C4 over name nodes N1-N4, phone nodes P1 and P4, and address nodes A1-A3, with each edge labeled by its supporting sources]
• Association by edges
• Defined for each pair of attributes
Within C1:
d_A^{1,2}(C1,C1) = 1 - 7/9 = 0.22
– 9 sources (S1-S8, S10) mention (N1,N2,N3,P1)
– 7 sources (S1-S5, S7, S8) support (N1,N2,N3)-P1
d_A^{1,3}(C1,C1) = 1 - 8/9 = 0.11
d_A^{2,3}(C1,C1) = 1 - 7/8 = 0.125
d_A(C1,C1) = (0.22+0.11+0.125)/3 = 0.153
Between C1 and C4:
d_A^{1,2}(C1,C4) = 1 - max(1/10, 0/10) = 0.9
– 10 sources (S1-S10) mention (N1,N2,N3,N4) (P1,P4)
– 1 source (S10) supports (N1,N2,N3)-P4
– No connection between (N4,P1)
d_A^{1,3}(C1,C4) = 0.9
d_A^{2,3}(C1,C4) = 1
d_A(C1,C4) = (0.9+0.9+1)/3 = 0.93
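The counts behind these distances can be checked with a small sketch. It follows one reading of the slide: for a pair of attributes, the distance is one minus the fraction of mentioning sources that support an edge, taking the better of the two cross-cluster directions. The record list is reconstructed from the example table, and all helper names are hypothetical.

```python
from itertools import combinations

MS_WAY, SYLVAN, SYLVAN_W = "1 Microsoft Way", "2 Sylvan Way", "2 Sylvan W."

# (source, name, phone, address) records of the running example, sources 1-10.
RECORDS = (
    [(1, "Microsofe Corp.", "xxx-1255", MS_WAY),
     (1, "Microsofe Corp.", "xxx-9400", MS_WAY),
     (1, "Macrosoft Inc.", "xxx-0500", SYLVAN_W),
     (2, "Microsoft Corp.", "xxx-1255", MS_WAY),
     (2, "Microsofe Corp.", "xxx-9400", MS_WAY),
     (6, "Microsoft Corp.", "xxx-2255", MS_WAY),
     (10, "MS Corp.", "xxx-0500", SYLVAN)]
    + [(s, "Microsoft Corp.", p, MS_WAY)
       for s in (3, 4, 5) for p in ("xxx-1255", "xxx-9400")]
    + [(s, "MS Corp.", "xxx-1255", MS_WAY) for s in (7, 8)]
    + [(s, "Macrosoft Inc.", "xxx-0500", SYLVAN) for s in range(2, 10)]
)

# Clusters keyed by tuple position: 1 = name, 2 = phone, 3 = address.
C1 = {1: {"Microsofe Corp.", "Microsoft Corp.", "MS Corp."},
      2: {"xxx-1255"}, 3: {MS_WAY}}
C4 = {1: {"Macrosoft Inc."}, 2: {"xxx-0500"}, 3: {SYLVAN, SYLVAN_W}}


def support(vals_i, i, vals_j, j):
    """Sources with a record pairing a value in vals_i with a value in vals_j."""
    return {r[0] for r in RECORDS if r[i] in vals_i and r[j] in vals_j}


def mention(vals_i, i, vals_j, j):
    """Sources with a record containing any of the given values."""
    return {r[0] for r in RECORDS if r[i] in vals_i or r[j] in vals_j}


def pair_distance(c_a, c_b, i, j):
    """Association distance for one attribute pair (i, j)."""
    total = mention(c_a[i] | c_b[i], i, c_a[j] | c_b[j], j)
    if not total:
        return 1.0
    best = max(len(support(c_a[i], i, c_b[j], j)),
               len(support(c_b[i], i, c_a[j], j)))
    return 1.0 - best / len(total)


def association_distance(c_a, c_b):
    """Average over the attribute pairs (1,2), (1,3), (2,3)."""
    pairs = list(combinations((1, 2, 3), 2))
    return sum(pair_distance(c_a, c_b, i, j) for i, j in pairs) / len(pairs)


d_within = association_distance(C1, C1)    # about 0.153
d_between = association_distance(C1, C4)   # about 0.93
```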
17
Greedy Algorithm
• Obtaining optimal clustering is intractable
– [T.F. Gonzales, 82], [J. Simal et al., 06]
• Hill-climbing approximation: CLUSTER
– Step 1: Initialization
  • Cluster value representations by their similarity. Use majority voting to associate clusters
– Step 2: Adjustment
  • Move each node to the cluster that minimizes the DB index
– Step 3: Convergence checking
  • Terminate if Step 2 does not change the clustering result. Otherwise, repeat Step 2
• The algorithm converges
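The adjustment and convergence steps can be sketched as a generic hill-climbing loop. This is not the CLUSTER implementation: the initial assignment is taken as given rather than built by similarity and majority voting, and a toy within-cluster squared-deviation cost stands in for the Davies-Bouldin index. The names `hill_climb`, `toy_cost`, and `POINTS` are hypothetical.

```python
# Toy data: four nodes on a line, standing in for graph nodes.
POINTS = {"n1": 1.0, "n2": 1.2, "n3": 5.0, "n4": 5.1}


def toy_cost(assignment):
    """Stand-in objective (lower is better): within-cluster squared deviation."""
    groups = {}
    for node, cluster in assignment.items():
        groups.setdefault(cluster, []).append(POINTS[node])
    total = 0.0
    for values in groups.values():
        mean = sum(values) / len(values)
        total += sum((v - mean) ** 2 for v in values)
    return total


def hill_climb(assignment, clusters, objective):
    """Repeat the adjustment step until no node changes cluster."""
    assignment = dict(assignment)
    changed = True
    while changed:
        changed = False
        for node in list(assignment):
            current = assignment[node]
            best, best_score = current, objective(assignment)
            for cluster in clusters:
                if cluster == current:
                    continue
                score = objective({**assignment, node: cluster})
                # Only strict improvements are accepted, so the loop terminates.
                if score < best_score:
                    best, best_score = cluster, score
            if best != current:
                assignment[node] = best
                changed = True
    return assignment


start = {"n1": "a", "n2": "a", "n3": "a", "n4": "b"}
result = hill_climb(start, ["a", "b"], toy_cost)
```

Because every accepted move strictly lowers the objective and there are finitely many assignments, the loop must stop, which mirrors the convergence claim on the slide.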
18
[Figure: run of the greedy algorithm on the example graph (name nodes N1-N4, phone nodes P1-P4, address nodes A1-A3, clusters C1-C4), annotated with objective values Φ=0.94, Φ=1.16, Φ=0.93, Φ=0.89, Φ=0.71, Φ=0.45]
19
Road Map
• Motivation and overview
• Problem definition
• Solution
– Clustering w.r.t. hard constraint
– Matching w.r.t. soft constraint
• Evaluations on YP data
• Conclusions
20
Matching w.r.t. Soft Constraints
• Next? Matching problem
• How to match?
[Figure: GRAPH TRANSFORM. The value-level graph (N1-N4, P1-P4, A1-A3) is collapsed into a cluster-level graph whose edge weights count the supporting sources.]
Name clusters: NC1 = {Microsofe Corp., Microsoft Corp., MS Corp.}, NC4 = {Macrosoft Inc.}
Phone clusters: PC1 = {xxx-1255}, PC2 = {xxx-2255}, PC3 = {xxx-9400}, PC4 = {xxx-0500}
Address clusters: AC1 = {1 Microsoft Way}, AC4 = {2 Sylvan Way, 2 Sylvan W.}
Edge weights: 7 s(1-5,7,8); 1 s(6); 5 s(1-5); 1 s(10); 9 s(1-9); 9 s(1-9); 1 s(10); 8 s(1-8)
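A minimal sketch of the graph transform for the name and phone attributes: values are replaced by their cluster labels, and each cluster-level edge is weighted by the number of distinct sources supporting it. The (source, name, phone) records are reconstructed from the example table; the dictionaries and the function name `cluster_edge_weights` are hypothetical.

```python
from collections import defaultdict

NAME_CLUSTER = {"Microsofe Corp.": "NC1", "Microsoft Corp.": "NC1",
                "MS Corp.": "NC1", "Macrosoft Inc.": "NC4"}
PHONE_CLUSTER = {"xxx-1255": "PC1", "xxx-2255": "PC2",
                 "xxx-9400": "PC3", "xxx-0500": "PC4"}

# (source, name, phone) pairs of the running example, sources 1-10.
RECORDS = (
    [(1, "Microsofe Corp.", "xxx-1255"), (1, "Microsofe Corp.", "xxx-9400"),
     (2, "Microsoft Corp.", "xxx-1255"), (2, "Microsofe Corp.", "xxx-9400"),
     (6, "Microsoft Corp.", "xxx-2255"), (10, "MS Corp.", "xxx-0500")]
    + [(s, "Microsoft Corp.", p)
       for s in (3, 4, 5) for p in ("xxx-1255", "xxx-9400")]
    + [(s, "MS Corp.", "xxx-1255") for s in (7, 8)]
    + [(s, "Macrosoft Inc.", "xxx-0500") for s in range(1, 10)]
)


def cluster_edge_weights(records):
    """Weight of each (name cluster, phone cluster) edge = number of sources."""
    sources = defaultdict(set)
    for source, name, phone in records:
        sources[(NAME_CLUSTER[name], PHONE_CLUSTER[phone])].add(source)
    return {edge: len(srcs) for edge, srcs in sources.items()}


weights = cluster_edge_weights(RECORDS)
```

A source that lists two name variants of the same cluster with the same phone is counted once, since the weight is a count of sources rather than of records.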
21
Matching w.r.t. Soft Constraint
• Intuitions
– Largest sum of weights
– Smallest gap
– How to balance these two goals?
• Optimization problem
– Maximize: \sum_{(u,v) \in M} \frac{w(u,v)}{Gap(u) + Gap(v)}
– Subject to: 0 \le \frac{|\hat{A}_K|}{|A_K|} \le p_1, \quad 0 \le \frac{|\hat{A}|}{|A|} \le p_2
• Two-phase greedy algorithm: MATCH
[Figure: node N connected to P1, P2, P3 by edges of weight 1 (s1), 9 (s2-s10), and 10 (s1-s10), shown under three candidate matchings]
Solution 1: Gap(N) = 1
Solution 2: Gap(N) = 9
Solution 3: Gap(N) = 0
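The three gap values are consistent with reading Gap(N) as the largest weight among N's edges minus the smallest weight among the edges selected for N. The sketch below uses that reading to show the trade-off between total weight and gap. Both the definition and the assignment of weights to P1, P2, P3 are assumptions made for illustration.

```python
# Edge weights at node N; which phone node carries which weight is assumed.
WEIGHTS = {"P1": 1, "P2": 9, "P3": 10}

# Candidate matchings for N, chosen to reproduce the gaps on the slide.
SOLUTIONS = {
    "Solution 1": {"P2", "P3"},
    "Solution 2": {"P1", "P2", "P3"},
    "Solution 3": {"P3"},
}


def gap(selected):
    """Largest incident weight minus smallest selected weight (assumed reading)."""
    return max(WEIGHTS.values()) - min(WEIGHTS[p] for p in selected)


def total_weight(selected):
    """Sum of the weights of the selected edges."""
    return sum(WEIGHTS[p] for p in selected)


summary = {name: (total_weight(sel), gap(sel)) for name, sel in SOLUTIONS.items()}
```

Under these assumptions, Solution 2 has the largest total weight but also the largest gap, Solution 3 has no gap but discards a strongly supported edge, and Solution 1 sits in between.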
22
Road Map
• Motivation and overview
• Problem definition
• Solution
• Evaluations on YP data
• Conclusions
23
Experiment Settings
• Dataset I
– Business listings for two zip codes (07035: Lincoln Park, NJ; 07715: Belmar, NJ) from multiple sources

Zip | #Businesses | #Sources | #Srcs/business
07035 | 662 | 15 | 1-7
07715 | 149 | 6 | 1-3

Records:
Zip | #Recs | #Names | #Phones | #Addresses | #(Err Ps)
07035 | 1629 | 1154 | 839 | 735 | 72
07715 | 266 | 243 | 184 | 55 | 12

Constraint violation:
Zip | NP | PN | NA | AN
07035 | 8% (2.6) | .8% (2.7) | 2% (2.3) | 12.6% (5.1)
07715 | 4% (2) | 1% (3) | 4% (2) | 4% (8.5)
24
Experiment Settings
• Implementation
– MATCH (invoking CLUSTER first)
– LINK: record linkage only
– FUSE: data fusion only
– LINKFUSE: first LINK, then FUSE
• Gold standard: by manual checking
• Measures: Precision/Recall/F-measure

Matching of values of different attributes:
P = |G_M ∩ R_M| / |R_M|, R = |G_M ∩ R_M| / |G_M|, F = 2PR / (P + R)

Clustering of values of the same attribute:
P = |G_A ∩ R_A| / |R_A|, R = |G_A ∩ R_A| / |G_A|, F = 2PR / (P + R)

Notation | Description
G_M | Matched pairs for the gold standard
R_M | Matched pairs for our results
G_A | Clustered pairs for the gold standard
R_A | Clustered pairs for our results
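The pair-based measures can be computed with a few lines of set arithmetic. The sketch below applies the standard precision, recall, and F-measure definitions to a gold-standard set and a result set of pairs. The example pairs are made up for illustration and are not results from the experiments.

```python
def precision_recall_f(gold, result):
    """Pair-based precision, recall, and F-measure for two sets of pairs."""
    gold, result = set(gold), set(result)
    hits = len(gold & result)
    precision = hits / len(result) if result else 0.0
    recall = hits / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Hypothetical matched (name cluster, phone cluster) pairs.
gold_pairs = {("NC1", "PC1"), ("NC1", "PC3"), ("NC4", "PC4")}
result_pairs = {("NC1", "PC1"), ("NC4", "PC4"), ("NC1", "PC2"), ("NC1", "PC4")}

p, r, f = precision_recall_f(gold_pairs, result_pairs)
```

The same function serves for clustering if the sets hold pairs of values placed in the same cluster instead of matched pairs.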
25
Accuracy
[Figure: six chart panels: 07035 Matching (NAME-PHONE), 07035 Matching (NAME-ADDRESS), 07035 Clustering (NAME), 07715 Matching (NAME-PHONE), 07715 Matching (NAME-ADDRESS), 07715 Clustering (NAME)]
• MATCH achieves the highest F-measure in most cases
– Improves on LINK by 11% on name-phone matching and by 20% on name clustering
• LINK vs. FUSE vs. LINKFUSE
– LINK: high recall in matching
– FUSE: high precision in matching, high precision in name clustering
– LINKFUSE: only slightly better than FUSE in matching and similar to LINK in clustering
26
Efficiency and Scalability
• Data set II
– Entire listing: 40+M records
• Hadoop-based linkage framework
– Fuzzy self-join using Hadoop
– Partition records into strongly connected components
• Efficiency
– Linear growth
– Execution time

Module | Execution time (hour)
Record extraction | 0.002
Fuzzy self join | 0.89
Connected component | 0.89
Linkage | 1.36
Overall | 3.26

median | 95th percentile | 99th percentile | max
2 | 5 | 7 | 2103
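The partitioning step can be pictured with a single-machine union-find sketch: record pairs produced by a fuzzy join are merged into connected components, and each component can then be linked independently. This is not the Hadoop implementation from the slides. The record ids and candidate pairs are hypothetical.

```python
def connected_components(n, pairs):
    """Group record ids 0..n-1 into components given similar-record pairs."""
    parent = list(range(n))

    def find(x):
        # Follow parent links to the root, halving the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = {}
    for record in range(n):
        groups.setdefault(find(record), []).append(record)
    return sorted(groups.values())


# Hypothetical output of a fuzzy self-join over six records.
similar_pairs = [(0, 1), (1, 2), (4, 5)]
components = connected_components(6, similar_pairs)
```

Records that share no similar pair with any other record form singleton components and need no further linkage work.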
27
Conclusions
• In the real world, we need to resolve duplicates and conflicts at the same time.
• We reduce the problem to a k-partite graph clustering and matching problem
– Combine linkage and fusion
– Apply them in a global fashion
• Experiments show high accuracy and scalability
28
Thank You!