Large-Scale Collective Entity Matching
description
Transcript of Large-Scale Collective Entity Matching
![Page 1: Large-Scale Collective Entity Matching](https://reader035.fdocuments.us/reader035/viewer/2022062218/5681635b550346895dd41ec9/html5/thumbnails/1.jpg)
Large-Scale Collective Entity Matching
Vibhor Rastogi (Yahoo! Research) Nilesh Dalvi (Yahoo! Research)
Minos Garofalakis (Univ. of Crete )
![Page 2: Large-Scale Collective Entity Matching](https://reader035.fdocuments.us/reader035/viewer/2022062218/5681635b550346895dd41ec9/html5/thumbnails/2.jpg)
Problem Description
Input: Database containing references to entitiesId Author-1 Author-2 PaperA1 John Smith Richard Johnson Indices and ViewsA2 J Smith R Johnson SQL QueriesA3 Dr. Smyth R Johnson Indices and Views
Goal: Automatically match references to the same entity
![Page 3: Large-Scale Collective Entity Matching](https://reader035.fdocuments.us/reader035/viewer/2022062218/5681635b550346895dd41ec9/html5/thumbnails/3.jpg)
Two kinds of Approaches
Pair-wise Entity Matching[FS69,BG04]
Label pairs as match/non-match independently
Collective Entity Matching[BG06,SD06,ARS09]
Label all pairs collectively
Id Author-1 Author-2 PaperA1 John Smith Richard Johnson Indices and Views
A2 J Smith R Johnson SQL Queries
A3 Dr. Smyth R Johnson Indices and Views
![Page 4: Large-Scale Collective Entity Matching](https://reader035.fdocuments.us/reader035/viewer/2022062218/5681635b550346895dd41ec9/html5/thumbnails/4.jpg)
One Slide SummaryCurrent state-of-the-art: Collective Entity Matching
(+) High accuracy(-) Often scale only to a few 1000 entities[SD06]
How can we scale Collective Entity Matching
to millions of entities?
![Page 5: Large-Scale Collective Entity Matching](https://reader035.fdocuments.us/reader035/viewer/2022062218/5681635b550346895dd41ec9/html5/thumbnails/5.jpg)
One Slide SummaryCurrent state-of-the-art: Collective Entity Matching
Our Approach
(+) High accuracy(-) Often scale only to a few 1000 entities[SD06]
Id Author-1 Author-2 PaperA1 John Smith Richard Johnson Indices and Views
A2 J Smith R Johnson SQL Queries
A3 Dr. Smyth R Johnson Indices and Views
![Page 6: Large-Scale Collective Entity Matching](https://reader035.fdocuments.us/reader035/viewer/2022062218/5681635b550346895dd41ec9/html5/thumbnails/6.jpg)
One Slide SummaryCurrent state-of-the-art: Collective Entity Matching
Our Approach
(+) High accuracy(-) Often scale only to a few 1000 entities[SD06]
P1 Indices and Views John Smith Richard Johnson
P2 Indices & Views J. Smith R. Johnson
P2 Indices & Views J. Smith R. Johnson
P3 Political Views Jane Smith R. Johnson
Collective Entity Matcher
![Page 7: Large-Scale Collective Entity Matching](https://reader035.fdocuments.us/reader035/viewer/2022062218/5681635b550346895dd41ec9/html5/thumbnails/7.jpg)
One Slide SummaryCurrent state-of-the-art: Collective Entity Matching
Our Approach
(+) High accuracy(-) Often scale only to a few 1000 entities[SD06]
Collective Entity Matcher
Collective Entity Matcher
Messages
P1 Indices and Views John Smith Richard Johnson
P2 Indices & Views J. Smith R. Johnson
P2 Indices & Views J. Smith R. Johnson
P3 Political Views Jane Smith R. Johnson
![Page 8: Large-Scale Collective Entity Matching](https://reader035.fdocuments.us/reader035/viewer/2022062218/5681635b550346895dd41ec9/html5/thumbnails/8.jpg)
One Slide SummaryCurrent state-of-the-art: Collective Entity Matching
Our Approach
(+) High accuracy(-) Often scale only to a few 1000 entities[SD06]
Collective Entity Matcher
Collective Entity Matcher
Messages
P1 Indices and Views John Smith Richard Johnson
P2 Indices & Views J. Smith R. Johnson
P2 Indices & Views J. Smith R. Johnson
P3 Political Views Jane Smith R. Johnson
![Page 9: Large-Scale Collective Entity Matching](https://reader035.fdocuments.us/reader035/viewer/2022062218/5681635b550346895dd41ec9/html5/thumbnails/9.jpg)
One Slide SummaryCurrent state-of-the-art: Collective Entity Matching
Our Approach
(+) High accuracy(-) Scale only to roughly 1000 entities[SD06]
Collective Entity Matcher
Collective Entity Matcher
Messages
P1 Indices and Views John Smith Richard Johnson
P2 Indices & Views J. Smith R. Johnson
P2 Indices & Views J. Smith R. Johnson
P3 Political Views Jane Smith R. Johnson
(+) Formal accuracy guarantees if entity matcher is well-behaved(+) Scales to datasets with millions of entities
![Page 10: Large-Scale Collective Entity Matching](https://reader035.fdocuments.us/reader035/viewer/2022062218/5681635b550346895dd41ec9/html5/thumbnails/10.jpg)
Overview
• Model for Collective EM• Our Algorithms• Experimental Results• Conclusion
![Page 11: Large-Scale Collective Entity Matching](https://reader035.fdocuments.us/reader035/viewer/2022062218/5681635b550346895dd41ec9/html5/thumbnails/11.jpg)
Example: Collective EM
Match Paper1 & Paper3 (Same Title) Match Richard Johnson & R Johnson (Since Papers matched)Match John Smith & J Smith (Since CoAuthors matched)
CoAuthor(A1,B1) ∧ CoAuthor(A2,B2) ∧ match(B1,B2) match(A1,A2)
Use rules to express correlation in matches
Id Author-1 CoAuthor PaperA1 John Smith Richard Johnson Indices and Views
A2 J Smith R Johnson SQL Queries
A3 Dr. Smyth R Johnson Indices and Views
Paper(A1,P1) ∧ Paper(A2,P2) ∧ match(P1,P2) match(A1,A2)[CM05]
[BG06, SD06]
![Page 12: Large-Scale Collective Entity Matching](https://reader035.fdocuments.us/reader035/viewer/2022062218/5681635b550346895dd41ec9/html5/thumbnails/12.jpg)
Model: Collective EMWe assume a black-box entity matcher
Deterministic Matcher M[BG06]
Input: Set of references R & Set of evidence matches EOutput: Set of matches M(R,E) C⊆
Probabilistic Matcher M[SD06,ARS09]
Input: Set of references ROutput: forall S R x R⊆ , probability that set of matches = S
Theorem: A probabilistic matcher is also a deterministic matcher
![Page 13: Large-Scale Collective Entity Matching](https://reader035.fdocuments.us/reader035/viewer/2022062218/5681635b550346895dd41ec9/html5/thumbnails/13.jpg)
Super-Modularity Requirement
Collective matching with only positive correlations
Super-modularity for deterministic matcherIf E ⊆ E’, then M(R,E) ⊆ M(R,E’)
Super-modularity for probabilistic matcherM has a super-modular probability
Theorem: A super-modular probabilistic matcher is also a super-modular deterministic matcher
![Page 14: Large-Scale Collective Entity Matching](https://reader035.fdocuments.us/reader035/viewer/2022062218/5681635b550346895dd41ec9/html5/thumbnails/14.jpg)
Some Examples
Examples of super-modular correlationsPaper(A1,P1) ∧ Paper(A2,P2) ∧ match(P1,P2) match(A1,A2)
[CM05]
CoAuthor(A1,B1) ∧ CoAuthor(A2,B2) ∧ match(B1,B2) match(A1,A2)[BG06, SD06]
Counter example: Transitivity Constraintmatch(A1,A2) ^ match(A2,A3) match(A1,A3)
[SD06]
![Page 15: Large-Scale Collective Entity Matching](https://reader035.fdocuments.us/reader035/viewer/2022062218/5681635b550346895dd41ec9/html5/thumbnails/15.jpg)
Overview
• Model for Collective EM• Our Algorithms• Experimental Results• Conclusion
![Page 16: Large-Scale Collective Entity Matching](https://reader035.fdocuments.us/reader035/viewer/2022062218/5681635b550346895dd41ec9/html5/thumbnails/16.jpg)
Divide references into overlapping canopies• Compare pairs only within canopies
Efficiency: Use Canopies[McCallum et. al.]
John SmithRichard
Smith
J. Smith
Richard M.Johnson
R. Smith
John S.
John Jacob
Canopy for
Richard
Canopy for Smith
Canopy for
John
Richard Johnson
Ω(|References|2) complexity for entity matching• All pairs need to be compared
![Page 17: Large-Scale Collective Entity Matching](https://reader035.fdocuments.us/reader035/viewer/2022062218/5681635b550346895dd41ec9/html5/thumbnails/17.jpg)
Efficiency: Use Canopies[McCallum et. al.]
Reduces # of candidate pairs from: O(|References|2 ) to |Candidates|
Pair-wise approach becomes efficient: O(|Candidates|)
John SmithRichard
Smith
J. Smith
Richard M.Johnson
R. Smith
John S.
John Jacob
Canopy for
Richard
Canopy for Smith
Canopy for
John
Richard Johnson
![Page 18: Large-Scale Collective Entity Matching](https://reader035.fdocuments.us/reader035/viewer/2022062218/5681635b550346895dd41ec9/html5/thumbnails/18.jpg)
Efficiency of Collective approach
Example for Collective methods[SD06]
• |References|= 1000,|Candidates| = 15,000, – Time ~ 5 minutes
• |References| = 50,000, |Candidates| = 10 million– Time required = 2,500 hours ~ 3 months
Collective methods still not efficient: Ω(|Candidates|2)
![Page 19: Large-Scale Collective Entity Matching](https://reader035.fdocuments.us/reader035/viewer/2022062218/5681635b550346895dd41ec9/html5/thumbnails/19.jpg)
Main Idea
Partitioning into smaller chunks helps!
Run collective entity-matching over canopies separately
Example for Collective methods[SD06]
• |References|= 1000,|Candidates| = 15,000, – Time = 5 minutes
• One canopy: |References| = 100, |Candidates| ~ 1000,−Time ~ 10 Seconds
• |References| = 50,000, # of canopies ~ 13k− Time ~ 20 hours << 3 months!
![Page 20: Large-Scale Collective Entity Matching](https://reader035.fdocuments.us/reader035/viewer/2022062218/5681635b550346895dd41ec9/html5/thumbnails/20.jpg)
One Problem
Example: CoAuthor rule grounds to the correlation match(Richard Johnson, R Johnson) => match(J. Smith, John Smith)
Correlations across canopies will be lost!
John Smith
J. Smith
John S.
John JacobSteve
JohnsonR. Smith
Canopy for
Johnson
Canopy for Smith
Canopy for
John
RJohnson
Richard Johnson
![Page 21: Large-Scale Collective Entity Matching](https://reader035.fdocuments.us/reader035/viewer/2022062218/5681635b550346895dd41ec9/html5/thumbnails/21.jpg)
Our Algorithm
Simple Message Passing (SMP)1. Run entity matcher M locally in each canopy2. If M finds a match(r1,r2) in some canopy, pass it
as evidence to all canopies 3. Rerun M within each canopy using new evidence4. Repeat until no new matches found in each
canopy
![Page 22: Large-Scale Collective Entity Matching](https://reader035.fdocuments.us/reader035/viewer/2022062218/5681635b550346895dd41ec9/html5/thumbnails/22.jpg)
Formal Properties
Convergence: No. of steps ≤ no. of matches
Soundness: Each output match is actually a global match
Consistency: Output independent of the canopy order
Completeness: Each global match is also a output match
John Smith
J. Smith
John S.
John JacobRichardSmith
R. SmithRichard M.Johnson
Richard Johnson
![Page 23: Large-Scale Collective Entity Matching](https://reader035.fdocuments.us/reader035/viewer/2022062218/5681635b550346895dd41ec9/html5/thumbnails/23.jpg)
Maximal Message Passing (MMP)
• A set of matches S is maximal if– One globally correct match in S => all matches in S
correct• We give a message passing algorithm using
maximal messages– It is provably sound– It gives better completeness than sound messages
![Page 24: Large-Scale Collective Entity Matching](https://reader035.fdocuments.us/reader035/viewer/2022062218/5681635b550346895dd41ec9/html5/thumbnails/24.jpg)
Overview
• Model for Collective EM• Our Algorithms• Experimental Results• Conclusion
![Page 25: Large-Scale Collective Entity Matching](https://reader035.fdocuments.us/reader035/viewer/2022062218/5681635b550346895dd41ec9/html5/thumbnails/25.jpg)
Data Sets
Experimental Evaluation
Name # of Entities # of Canopies # of Pairs
HEPTH 57K 13K 1.3M
DBLP-Sample 51K 30K 0.5M
DBLP 4.6M 1.7M 41M
Methodology Use Canopies[Mccallum et. al.] algorithm to partition data
Run MLN[Singla et. al.] as black-box collective entity matcher
Our message-passing algorithms
![Page 26: Large-Scale Collective Entity Matching](https://reader035.fdocuments.us/reader035/viewer/2022062218/5681635b550346895dd41ec9/html5/thumbnails/26.jpg)
Accuracy Results
RecallPrecision F1
HEPTH Dataset
Goal: Compare precision, recall, F1 of message-passing algorithms Compare against what?
Global run of MLN UB: precision = 1, recall = MLN +
perfect evidence
![Page 27: Large-Scale Collective Entity Matching](https://reader035.fdocuments.us/reader035/viewer/2022062218/5681635b550346895dd41ec9/html5/thumbnails/27.jpg)
Accuracy Results (Contd.)
RecallPrecision F1
HEPTH Dataset
Goal: Compare precision, recall, F1 of message-passing algorithms Compare against what?
Global run of MLN UB: precision = 1, recall = MLN +
perfect-messages
RecallPrecision F1
DBLP Dataset
Run Dedupalog[Arasu et. al.] instead of MLN[Singla et. al.]
as black-box collective entity matcher
![Page 28: Large-Scale Collective Entity Matching](https://reader035.fdocuments.us/reader035/viewer/2022062218/5681635b550346895dd41ec9/html5/thumbnails/28.jpg)
Scalability Results: HEPTH
![Page 29: Large-Scale Collective Entity Matching](https://reader035.fdocuments.us/reader035/viewer/2022062218/5681635b550346895dd41ec9/html5/thumbnails/29.jpg)
Scalability Results: DBLPThe Ultimate Challenge 4.6M entities and 41M candidate pairs Use map-reduce for distributed processing Observed linear speed-up
NOMP SMP MMP
Single machine
208 329 285
Map-reduce 18 30 27Running Times (minutes)
![Page 30: Large-Scale Collective Entity Matching](https://reader035.fdocuments.us/reader035/viewer/2022062218/5681635b550346895dd41ec9/html5/thumbnails/30.jpg)
Conclusion
• Collective approaches often do not scale• Naïve canopy-based approaches lose evidence
across canopies• We give a distributed message-passing
framework for collective entity matching– convergence, soundness guaranteed for rules with
positive correlations