SCALABLE AND DISTRIBUTED METHODS FOR ENTITY MATCHING, CONSOLIDATION AND DISAMBIGUATION OVER LINKED DATA CORPORA
Aidan Hogan, Antoine Zimmermann, Jürgen Umbrich, Axel Polleres, Stefan Decker
Presented by Joseph Park
INTRODUCTION
- Linked Data best practices:
  - Use URIs as names for things (not just documents)
  - Make those URIs dereferenceable via HTTP
  - Return useful and relevant RDF content upon lookup of those URIs
  - Include links to other datasets
- Linked Open Data project
  - Goal of providing dereferenceable, machine-readable data in RDF
  - Emphasis on reuse of URIs and inter-linkage between remote datasets
- Web of Data
  - 30 billion published RDF triples
AIMS & GOALS
- Focus on finding equivalent entities
  - E.g. people, places, musicians, proteins
  - Two entities are equivalent if they are coreferent
- Interest in identifying coreferences and merging the knowledge contributions provided by distinct parties (consolidation)
OWL:SAMEAS
- owl:sameAs
  - A core OWL property that defines equivalences between individuals
  - Two individuals related by owl:sameAs are coreferent
- Inferring new owl:sameAs relations:
  - Inverse-functional properties (e.g. :biologicalMotherOf)
  - Functional properties (e.g. :hasBiologicalMother)
  - Cardinality and max-cardinality restrictions
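A minimal Python sketch of how functional and inverse-functional properties can yield new owl:sameAs pairs. The triples, the entity names (`:mum1`, `:m_a`, and so on) and the function `infer_same_as` are hypothetical illustrations and are not taken from the slides. Cardinality restrictions are left out.

```python
from collections import defaultdict
from itertools import combinations


def infer_same_as(triples, inverse_functional, functional):
    """Return unordered pairs of terms inferred to be coreferent.

    Inverse-functional property p: (s1 p o) and (s2 p o) imply s1 sameAs s2.
    Functional property p:         (s p o1) and (s p o2) imply o1 sameAs o2.
    Assumption: all objects are resources (literal values are not handled).
    """
    subjects_by_po = defaultdict(set)  # (property, object) -> subjects
    objects_by_sp = defaultdict(set)   # (subject, property) -> objects
    for s, p, o in triples:
        if p in inverse_functional:
            subjects_by_po[(p, o)].add(s)
        if p in functional:
            objects_by_sp[(s, p)].add(o)

    pairs = set()
    for group in list(subjects_by_po.values()) + list(objects_by_sp.values()):
        # Every two members of a group are inferred to be the same individual.
        for a, b in combinations(sorted(group), 2):
            pairs.add((a, b))
    return pairs


# Hypothetical toy data.
triples = [
    (":mum1", ":biologicalMotherOf", ":bob"),
    (":mum2", ":biologicalMotherOf", ":bob"),
    (":carol", ":hasBiologicalMother", ":m_a"),
    (":carol", ":hasBiologicalMother", ":m_b"),
    (":dave", ":knows", ":bob"),
]
pairs = infer_same_as(
    triples,
    inverse_functional={":biologicalMotherOf"},
    functional={":hasBiologicalMother"},
)
print(sorted(pairs))
```

A full consolidation step would also close these pairs under symmetry and transitivity; this sketch only produces the directly inferred pairs.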
CONSTRAINTS TO OWL:SAMEAS
EXPERIMENT
- 1.118 billion quadruples
  - Crawled from 3.985 million web documents
  - 1.106 billion are unique
  - 947 million are unique triples
- 9 machines linked by Gigabit Ethernet
BASELINE – OWL:SAMEAS
- Extracted 11.93 million raw owl:sameAs quadruples
  - Only 3.77 million unique triples
- 1000 randomly chosen pairs hand-checked
  - Trivially same (661 times)
  - Same (301 times)
  - Different (28 times)
  - Unclear (10 times)
CONSTRAINT COUNTS
- No documents used owl:maxQualifiedCardinality
- 434 functional properties
- 57 inverse-functional properties
- 109 cardinality restrictions with a value of 1
- 52.93 million memberships of inverse-functional properties
  - 22.14 million asserted
- 11.09 million memberships of functional properties
  - 1.17 million asserted
- 2.56 million cardinality triples
  - 533 thousand asserted
REASONING USING CONSTRAINTS
- Zero owl:sameAs inferences through cardinality rules
- 106.8 thousand owl:sameAs through functional-property reasoning
- 8.7 million owl:sameAs through inverse-functional-property reasoning
- Resulted in a total of 12.03 million owl:sameAs statements
RESULTS FROM CONSTRAINTS
- From the 12.03 million owl:sameAs quadruples, 1000 randomly chosen and hand-checked:
  - Trivially same (145 times)
  - Same (823 times)
  - Different (23 times)
  - Unclear (9 times)
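The two hand-checked samples (the baseline and the one taken after constraint-based reasoning) can be put side by side with a little arithmetic. This sketch assumes that "trivially same" and "same" both count as correct and that "unclear" pairs stay in the denominator. That grouping is an assumption made here and is not stated on the slides.

```python
# Counts copied from the two hand-checked samples of 1000 pairs each.
baseline = {"trivially same": 661, "same": 301, "different": 28, "unclear": 10}
with_constraints = {"trivially same": 145, "same": 823, "different": 23, "unclear": 9}


def share(sample, *labels):
    """Fraction of the sample that falls under the given labels."""
    total = sum(sample.values())
    return sum(sample[label] for label in labels) / total


for name, sample in (("baseline", baseline), ("with constraints", with_constraints)):
    correct = share(sample, "trivially same", "same")
    wrong = share(sample, "different")
    print(f"{name}: correct={correct:.3f} different={wrong:.3f}")
```

Under that grouping the share of correct pairs is similar in both samples, while the share of non-trivial "same" pairs is much larger after reasoning.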
STATISTICAL CONCURRENCE
- Entity concurrence: sharing of outlinks, inlinks, and attribute values
- Higher score means more discriminating shared characteristics
RUNNING EXAMPLE
QUANTIFYING CONCURRENCE
- Observed cardinality (e.g. $\mathrm{Card}_{G_{ex}}(\text{foaf:maker}, \text{dblp:AliceB10}) = 2$)
- Observed inverse-cardinality (e.g. $\mathrm{ICard}_{G_{ex}}(\text{foaf:gender}, \text{“female”}) = 2$)
- Average inverse-cardinality (e.g. $\mathrm{AIC}_{G_{ex}}(\text{foaf:gender}) = 1.5$)
  - Can also be viewed as the average of the non-zero cardinalities
  - For example, for foaf:gender: 1 for “male”, 2 for “female”, so $(1 + 2) / 2 = 1.5$
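The three measures can be sketched in Python over a small set of triples. The graph `toy_graph` below is a hypothetical stand-in chosen only so that it reproduces the three example values quoted on this slide; it is not the running-example graph from the presentation, and the `ex:` names are invented.

```python
from collections import defaultdict


def card(graph, p, s):
    """Observed cardinality: number of distinct objects for subject s and property p."""
    return len({o for s2, p2, o in graph if s2 == s and p2 == p})


def icard(graph, p, o):
    """Observed inverse-cardinality: number of distinct subjects for property p and object o."""
    return len({s for s, p2, o2 in graph if p2 == p and o2 == o})


def aic(graph, p):
    """Average inverse-cardinality of p over the objects that actually occur with p."""
    subjects_by_object = defaultdict(set)
    for s, p2, o in graph:
        if p2 == p:
            subjects_by_object[o].add(s)
    if not subjects_by_object:
        return 0.0
    return sum(len(v) for v in subjects_by_object.values()) / len(subjects_by_object)


# Hypothetical triples (subject, property, object).
toy_graph = {
    ("dblp:AliceB10", "foaf:maker", "ex:Alice"),
    ("dblp:AliceB10", "foaf:maker", "ex:Bob"),
    ("ex:Alice", "foaf:gender", "female"),
    ("ex:Claire", "foaf:gender", "female"),
    ("ex:Bob", "foaf:gender", "male"),
}

print(card(toy_graph, "foaf:maker", "dblp:AliceB10"))
print(icard(toy_graph, "foaf:gender", "female"))
print(aic(toy_graph, "foaf:gender"))
```

A property with a low average inverse-cardinality is shared by few entities per value, so sharing a value for it is more discriminating than sharing a value for something like foaf:gender.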
ADJUSTED AVERAGE INVERSE-CARDINALITY
CONCURRENCE COEFFICIENTS
COEFFICIENT EXAMPLE
AGGREGATED CONCURRENCE SCORE
- Same process as determining the probability that at least one of two independent events occurs (given the same outcome event)
  - $P(A \lor B) = P(A) + P(B) - P(A) \cdot P(B)$
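A short Python sketch of combining scores with the rule for two independent events given on this slide. Folding the rule over a whole list of coefficients is an assumption made here for illustration, and the function names and input values are hypothetical.

```python
from functools import reduce


def probabilistic_or(a, b):
    """Combine two scores in [0, 1] as if they were independent probabilities."""
    return a + b - a * b


def aggregate(scores):
    """Fold the pairwise rule over any number of scores; an empty list gives 0.0."""
    return reduce(probabilistic_or, scores, 0.0)


print(aggregate([0.5, 0.5]))
print(aggregate([0.2, 0.1, 0.05]))
```

The result never exceeds 1, grows with every additional shared characteristic, and does not depend on the order in which the scores are combined.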
RESULTS FROM CONCURRENCE
- Average cardinality of about 1.5
- Average inverse-cardinality of about 2.64
- Total of 636.9 million weighted concurrence pairs
  - Mean concurrence weight of about 0.0159
- Highly concurring entities were in many cases not coreferent
EXAMPLE OF CONCURRENCE