Adaptive Graphical Approach to Entity Resolution
Dmitri V. Kalashnikov
Stella Chen, Dmitri V. Kalashnikov, Sharad Mehrotra
Computer Science Department, University of California, Irvine
Additional information is available at http://www.ics.uci.edu/~dvk
Copyright © by Dmitri V. Kalashnikov, 2007
ACM IEEE Joint Conference on Digital Libraries 2007
2
Structure of the Talk
• Motivation
• Generic Disambiguation Framework – High-level
• Entity Resolution Approach – Part of the Framework
• Experiments
3
Entity Resolution & Data Cleaning
Raw Dataset(s)
...J. Smith ...
.. John Smith ...
.. Jane Smith ...
MIT
Intel Inc. ?
A "nice" regular Database
Analysis on bad data leads to wrong conclusions!
• Uncertainty
• Errors
• Missing data
4
Why do we need “Entity Resolution”?
“Hi, I’m Jane Smith. I’d like to apply for a faculty position.”
“Wow! I am sure we will accept a strong candidate like that!”
Jane Smith – Fresh Ph.D.; Tom – Recruiter
“OK, let me check something quickly …”
???
Publications: 1. …… 2. …… 3. ……
Publications: 1. …… 2. …… 3. ……
CiteSeer Rank
5
• Suspicious entries
– Let’s go to the DBLP website
– which stores bibliographic entries of many CS authors
• Let’s check two people
– “A. Gupta”
– “L. Zhang”
What is the problem?
Figure panels: CiteSeer (the top-k most cited authors), DBLP, DBLP
6
Comparing raw and cleaned CiteSeer
Rank            Author               Location
1 (100.00%)     douglas schmidt      cs@wustl
2 (100.00%)     rakesh agrawal       almaden@ibm
3 (100.00%)     hector garciamolina  @
4 (100.00%)     sally floyd          @aciri
5 (100.00%)     jennifer widom       @stanford
6 (100.00%)     david culler         cs@berkeley
6 (100.00%)     thomas henzinger     eecs@berkeley
7 (100.00%)     rajeev motwani       @stanford
8 (100.00%)     willy zwaenepoel     cs@rice
9 (100.00%)     van jacobson         lbl@gov
10 (100.00%)    rajeev alur          cis@upenn
11 (100.00%)    john ousterhout      @pacbell
12 (100.00%)    joseph halpern       cs@cornell
13 (100.00%)    andrew kahng         @ucsd
14 (100.00%)    peter stadler        tbi@univie
15 (100.00%)    serge abiteboul      @inria
Raw CiteSeer’s Top-K Most Cited Authors
Cleaned CiteSeer’s Top-K Most Cited Authors
7
What is the lesson?
• Data should be cleaned first
– E.g., determine the (unique) real authors of publications
• Solving such challenges is not always “easy”
– This explains a large body of work on Entity Resolution
“Garbage in, garbage out” principle: making decisions based on bad data can lead to wrong results.
8
Typical Data Processing Flow
Raw Data | Representation | Data Cleaning | Extraction | Analysis
9
Two most common types of Entity Resolution
...J. Smith ...
.. John Smith ...
.. Jane Smith ...
MIT
Intel Inc.
Fuzzy lookup
– match references to objects
– list of all objects is given
– [SDM’05], [TODS’06]
Fuzzy grouping
– group references that co-refer
– [IQIS’05], [JCDL’07]
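The contrast between the two problem types can be sketched in a few lines of Python. This is only an illustration: the string measure (difflib's ratio), the 0.7 threshold, the greedy grouping strategy, and all function names are assumptions made here, not the methods of the cited papers.

```python
from difflib import SequenceMatcher

def sim(a: str, b: str) -> float:
    """String similarity in [0, 1]; a stand-in for any feature-based measure."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def fuzzy_lookup(reference: str, objects: list[str]) -> str:
    """Lookup: the list of all objects is given; return the best-matching one."""
    return max(objects, key=lambda obj: sim(reference, obj))

def fuzzy_grouping(references: list[str], threshold: float = 0.7) -> list[list[str]]:
    """Grouping: no object list; greedily group references that look alike."""
    groups: list[list[str]] = []
    for ref in references:
        for group in groups:
            # Compare against the first member of each existing group.
            if sim(ref, group[0]) >= threshold:
                group.append(ref)
                break
        else:
            groups.append([ref])  # no group was similar enough: start a new one
    return groups

lookup_result = fuzzy_lookup("Intel Inc.", ["MIT", "Intel Corporation"])
grouping_result = fuzzy_grouping(["John Smith", "Jon Smith", "MIT"])
```

In lookup the answer is always one of the given objects; in grouping the number of groups is itself unknown and has to be discovered.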
10
Structure of the Talk
• Motivation
• Generic Framework – High-level
• Approach – Part of the Framework
• Experiments
11
Traditional Approach to Entity Resolution
[Figure: references X and Y ("J. Smith", "Jane Smith") with their features (f2, f3, …) compared pairwise]
Traditional Methods: Features and Context
s(X,Y) = f(X,Y)
Similarity = Similarity of Features
12
Key Observation: More Info is Available
A "nice" regular Database
Jane Smith
John Smith
J. Smith
=
13
Solution: Main Idea
[Figure: Traditional Methods (features and context: f1–f4 of X compared with f1–f4 of Y) + Relationship Analysis (X and Y connected through nodes A–F in the ARG)]
s(X,Y) = c(X,Y) + γ f(X,Y)
Similarity = Similarity of Features + “Connection Strength”
New Paradigm
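A minimal sketch of the combined score s(X,Y) = c(X,Y) + γ f(X,Y) follows. It assumes f is a plain string similarity (difflib's ratio) and that the connection strength c has already been computed elsewhere; the values 0.4 and 0.1 and γ = 0.5 are made up for illustration.

```python
from difflib import SequenceMatcher

def feature_similarity(x: str, y: str) -> float:
    """f(X,Y): a stand-in feature-based similarity in [0, 1]."""
    return SequenceMatcher(None, x.lower(), y.lower()).ratio()

def combined_similarity(x: str, y: str, connection_strength: float,
                        gamma: float = 0.5) -> float:
    """s(X,Y) = c(X,Y) + gamma * f(X,Y)."""
    return connection_strength + gamma * feature_similarity(x, y)

# Features alone cannot separate the two candidates for "J. Smith" ...
f_john = feature_similarity("J. Smith", "John Smith")
f_jane = feature_similarity("J. Smith", "Jane Smith")

# ... but hypothetical connection strengths (0.1 vs 0.4) break the tie.
s_john = combined_similarity("J. Smith", "John Smith", connection_strength=0.1)
s_jane = combined_similarity("J. Smith", "Jane Smith", connection_strength=0.4)
```

With this string measure the two feature scores are identical, so the ranking of the candidates is decided entirely by the relationship term.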
14
Illustrative Example
“Indirect connections”
– Suppose your co-worker’s name is “John White”
– Suppose you see on the Web, on my homepage
– My name: “Dmitri …”
– Somebody named: “John White”
– Who is the “John White”?
– From data you might establish a connection:
– “Dmitri” might be connected to more “John White”’s…
[Figure: Dmitri --Visited--> JCDL’07 <--Visited-- <you> --WorksAT--> <your ORG> <--WorksAT-- John White]
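The indirect connection in this example can be found mechanically by searching the graph. The sketch below encodes the edges of the figure, adds one hypothetical unrelated "John White" (`jw_other`) for contrast, and uses a toy connection strength of 1 / (shortest path length). That measure is an assumption for illustration only; the slide does not define how connection strength is computed.

```python
from collections import deque

# Undirected edges from the figure, plus one hypothetical unrelated person.
EDGES = [
    ("Dmitri", "JCDL'07"),        # Visited
    ("you", "JCDL'07"),           # Visited
    ("you", "your ORG"),          # WorksAT
    ("jw_coworker", "your ORG"),  # WorksAT
    ("jw_other", "other ORG"),    # hypothetical, not on the slide
]

def build_graph(edges):
    graph = {}
    for u, v in edges:
        graph.setdefault(u, set()).add(v)
        graph.setdefault(v, set()).add(u)
    return graph

def shortest_path(graph, source, target):
    """Breadth-first search; returns the nodes of a shortest path, or None."""
    parents = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for nxt in sorted(graph.get(node, ())):
            if nxt not in parents:
                parents[nxt] = node
                queue.append(nxt)
    return None

def connection_strength(graph, source, target):
    """Toy measure (assumed): 1 / number of edges on a shortest path, else 0."""
    path = shortest_path(graph, source, target)
    if path is None or len(path) < 2:
        return 0.0
    return 1.0 / (len(path) - 1)

graph = build_graph(EDGES)
path_to_coworker = shortest_path(graph, "Dmitri", "jw_coworker")
c_coworker = connection_strength(graph, "Dmitri", "jw_coworker")  # 0.25
c_other = connection_strength(graph, "Dmitri", "jw_other")        # 0.0
```

Only the co-worker is reachable from "Dmitri", which is the evidence used to prefer that "John White" over the other one.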
15
Key Features of the Framework
Our goal was to create a framework with the following properties:
– solid theoretic foundation
– lookup
– domain-independent framework
– self-tuning
– scales to large datasets
– robust under uncertainty
– high disambiguation quality
16
Structure of the Talk
• Motivation
• Generic Framework – High-level
• Approach – Part of the Framework
• Experiments
17
Approach
• Graph Creation
– Entity-Relationship Graph
• Consolidation Algorithm
– Bottom-up clustering
• Adaptiveness to data
– That is, self-tuning
– Supervised learning
• External Data
– To improve the quality further
– A theoretic possibility
– Not tested yet
18
ER Graph Creation
19
Virtual Connected Subgraph (VCS)
[Figure legend. Nodes: person, publication, department, organization. Edges: similarity, regular. Highlighted: VCS]
• VCS
– Similarity edges form VCSs
– Subgraphs in the ER graph
1. “Virtual”
– Contains only similarity edges
2. “Connected”
– A path between any 2 nodes
3. Completeness
– Adding more nodes/edges would violate (1) and (2)
• Logically, the Goal is
– Partition each VCS properly
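Under the three properties above, a VCS is a connected component of the graph restricted to its similarity edges. A minimal sketch using union-find is shown below; the function name, the input representation, and the choice to skip nodes that have no similarity edge are assumptions of this sketch.

```python
def find_vcs(nodes, similarity_edges):
    """Return the VCSs: maximal sets of nodes connected by similarity edges.

    Regular edges are deliberately not passed in: a VCS is "virtual",
    i.e. it contains only similarity edges.
    """
    parent = {n: n for n in nodes}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path halving
            n = parent[n]
        return n

    for u, v in similarity_edges:
        parent[find(u)] = find(v)  # union the two components

    components = {}
    for n in nodes:
        components.setdefault(find(n), []).append(n)

    # Nodes without any similarity edge need no disambiguation; skip them.
    return sorted(sorted(c) for c in components.values() if len(c) > 1)

nodes = ["J. Smith", "John Smith", "Jane Smith", "MIT", "Intel Inc."]
similarity_edges = [("J. Smith", "John Smith"), ("J. Smith", "Jane Smith")]
vcs_list = find_vcs(nodes, similarity_edges)
```

Each returned set is then partitioned independently, which is why the goal can be stated per VCS.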
20
Consolidation Algorithm: Merging
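Bottom-up clustering of a single VCS can be sketched as repeated merging of the most similar pair of clusters until no pair is similar enough. The average-linkage cluster score, the stopping threshold, and the example scores are assumptions made for this sketch; the slides only name the merging step.

```python
from itertools import combinations

def consolidate(vcs_nodes, pair_score, threshold):
    """Greedy bottom-up clustering of one VCS.

    pair_score(a, b) is any reference-to-reference similarity, for example
    a combination of connection strength and feature similarity.
    """
    clusters = [frozenset([n]) for n in vcs_nodes]

    def cluster_score(c1, c2):
        # Average linkage over all cross-cluster pairs (assumed here).
        scores = [pair_score(a, b) for a in c1 for b in c2]
        return sum(scores) / len(scores)

    while len(clusters) > 1:
        best = max(combinations(clusters, 2), key=lambda p: cluster_score(*p))
        if cluster_score(*best) < threshold:
            break  # no remaining pair is similar enough to merge
        clusters = [c for c in clusters if c not in best] + [best[0] | best[1]]
    return sorted(sorted(c) for c in clusters)

# Hypothetical pairwise scores for the references in one VCS.
SCORES = {
    ("J. Smith", "Jane Smith"): 0.9,
    ("J. Smith", "John Smith"): 0.2,
    ("Jane Smith", "John Smith"): 0.1,
}

def pair_score(a, b):
    return SCORES[tuple(sorted((a, b)))]

partition = consolidate(["J. Smith", "John Smith", "Jane Smith"], pair_score, 0.5)
```

With these scores "J. Smith" is merged with "Jane Smith", and "John Smith" stays in a cluster of its own.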
21
Self-tuning via Supervised Learning
22
Self-tuning (2)
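One simple way to picture self-tuning with supervised learning is to fit a weight on labeled pairs. The sketch below assumes that the tuned parameter is γ from s(X,Y) = c(X,Y) + γ f(X,Y), that a pair is declared co-referent when its score reaches a fixed threshold, and that γ is chosen by grid search. All three choices, and the training data, are hypothetical; the slides do not specify what is learned or how.

```python
def accuracy(gamma, threshold, labeled_pairs):
    """Fraction of labeled pairs classified correctly by c + gamma * f >= threshold."""
    correct = sum(
        ((c + gamma * f) >= threshold) == label
        for c, f, label in labeled_pairs
    )
    return correct / len(labeled_pairs)

def tune_gamma(labeled_pairs, candidates, threshold=0.5):
    """Grid search: return the candidate gamma with the best training accuracy."""
    return max(candidates, key=lambda g: accuracy(g, threshold, labeled_pairs))

# Hypothetical training data: (connection strength, feature similarity, co-refer?)
TRAINING_PAIRS = [
    (0.4, 0.8, True),
    (0.0, 1.0, False),  # identical names, but no connection at all
    (0.3, 0.9, True),
    (0.1, 0.6, False),
]

best_gamma = tune_gamma(TRAINING_PAIRS, candidates=[0.0, 0.25, 0.5, 1.0])
```

On this toy data a moderate weight wins: with γ too small the features are ignored, and with γ too large identical names are merged even when nothing connects them.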
23
External Knowledge to Improve Quality
24
Structure of the Talk
• Motivation
• Generic Framework – High-level
• Approach – Part of the Framework
• Experiments
25
Quality
“Context” is proposed in [Bhattacharya et al., DMKD’04]
The two algos are proposed in [Dong et al., SIGMOD’05]
26
Scalability & Efficiency
27
Impact of Random Relationships
28
Contact Information
• Info about our disambiguation project
– http://www.ics.uci.edu/~dvk
• Overall design
– Dmitri V. Kalashnikov
– dvk [at] domain
• Implementation details in JCDL’07
– Zhaoqi (Stella) Chen
– chenz [at] domain
– domain = ics.uci.edu